Since my last post, and as a result of the community stepping in to point out the error of my ways (see comments on my previous post), I’ve made a big step forward, though problems remain.
As Owen Stephens helpfully pointed out, what I needed was not better grep, nor to introduce Python, but a better SPARQL query. After a little work, entering this query in the British Museum SPARQL query interface returned most (if not all) of the records I need, with the descriptions in the output, so that I (and you!) can just use the download options in the GUI to grab the data.
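For anyone who wants to follow along, the shape of the query is roughly the one below. This is a minimal sketch rather than the exact query I ran, and the `bmo:` prefix URI is my assumption about the namespace the endpoint uses:

```sparql
# Minimal sketch: return each object and its display wraps.
# The bmo: prefix URI is an assumption; check the endpoint's
# documentation for the namespace it actually expects.
PREFIX bmo: <http://collection.britishmuseum.org/id/ontology/>

SELECT ?object ?wrap
WHERE {
  ?object bmo:PX_display_wrap ?wrap .
}
LIMIT 100
```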
A challenge remains in getting SPARQL to give me the metadata I need in a usable form. As Owen mentioned, and as I had observed when first scoping out the research, useful information about publishers, authors, titles, dates of publication, and even the Catalogue of Political and Personal Satires Preserved in the Department of Prints and Drawings in the British Museum reference numbers themselves is embedded within indistinguishable display wrappers such as:
```xml
<rdf:Description rdf:about="http://collection.britishmuseum.org/id/object/P_1868-0808-8317">
  <bmo:PX_display_wrap>Production date :: 1816 ::</bmo:PX_display_wrap>
</rdf:Description>
```
The net result is that extracting an ID to replace the meaningless file names (meaningful, I’m told, for internal BM purposes, but lacking chronological logic and hence meaningless to me) will be tricky. My hope is that, by using the lines of data that carry a museum object number such as P_1868-0808-8317 (these numbers are unique; they relate to when the object was acquired by the BM), I’ll be able, somehow, to cross-reference them with the BM Satires numbers (which are listed in largely chronological order) and begin sorting the files from there.
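If the BM Satires reference is itself exposed as a display wrap (an assumption on my part; I don’t yet know how, or whether, the endpoint labels it), a query along these lines might pair the object-number URIs with the catalogue references in one pass:

```sparql
# Hedged sketch: pair each object URI (which embeds the museum
# object number, e.g. P_1868-0808-8317) with any display wrap
# mentioning the Satires catalogue. Both the "Satires" label and
# the bmo: prefix URI are assumptions, not confirmed against the data.
PREFIX bmo: <http://collection.britishmuseum.org/id/ontology/>

SELECT ?object ?satiresRef
WHERE {
  ?object bmo:PX_display_wrap ?satiresRef .
  FILTER(CONTAINS(STR(?satiresRef), "Satires"))
}
```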
Anyhow, rather than sit on my hands and wait for this magic solution to appear, I’ve pressed on with an alternative, lossy approach, so as at least to get a better sense of the data (Rule 1 of doing research: know your sources/data!).
Once out of the BM SPARQL interface, I pushed the data into OpenRefine. From there I:
- converted the column of descriptions into lower case (GREL: `value.toLowercase()`);
- removed newlines in the descriptions (GREL: `value.replace("\n", " ")` – note that I did this after much pain failing with the next step and not knowing why; the stray newlines were the culprit…);
- created a new column based on a four-digit match in the descriptions column (GREL: `value.match(/.*(\d{4}).*/)[0]`) to pluck out some publication year dates (more on that ‘some’ later…);
- filtered the column containing object URLs on `P_` to remove the duplicates the RDF imposes on me and keep only the lines attached to the museum object number;
- filtered the new year-date column in regular expression mode with `1[6-8][0-9][0-9]` to catch only dates between 1600 and 1899;
- exported the data out of Refine as a .tsv file.
Step 3 (the four-digit match) is the problem here. The character match cannot know which four-digit strings are year dates and which are not, so the filtering step required as a result takes us from 23,363 records to 21,213. Worse still, some descriptions contain more than one year date, often dates for reprints or for recent pertinent events, either added by the curator in the description of the print or included in the transcription of the text the satire contains. Looking at the frequency of prints for each year below, I’d suggest that the strong peaks around 1784 and 1818-1822 (many returned as 1818 were in fact, if we look at the descriptions, published in 1809) likely have as much to do with the presence of reprints as with anything else, and, interesting as that is, it represents rogue data for my purposes.
All of which means my data is okay but lacking in precision, and hence of use only for some purposes (data analysis combined with plenty of close reading). The next step is to establish how to pluck out all the year-date info from the descriptions, and then either to filter for those with one apparent year date (thus losing some data, gaining some quality) or to QA all the duplicate year-date suggestions by hand. The latter is preferable (presuming circa 15% or fewer records have duplicate four-digit strings in their descriptions…), for though I know the data pretty well (especially the 1783-1812 period, from time spent extracting information from the physical catalogues back in the day; I’ve come a long way…), it wouldn’t hurt to get to know the data a little better through this enforced qualitative sampling.
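Alternatively, since the ‘Production date’ wrap shown earlier already carries the year on its own line, the cleaner fix may be to go back to SPARQL and pull that wrap directly rather than regexing the descriptions. A hedged sketch, again assuming the `bmo:` prefix URI and generalising the ‘Production date ::’ labelling from the single sample record above:

```sparql
# Sketch: extract the year from "Production date :: 1816 ::" wraps
# at the endpoint itself, sidestepping the description regex.
# Prefix URI and wrap labelling are assumed from the sample record.
PREFIX bmo: <http://collection.britishmuseum.org/id/ontology/>

SELECT ?object ?year
WHERE {
  ?object bmo:PX_display_wrap ?wrap .
  FILTER(REGEX(STR(?wrap), "^Production date :: \\d{4}"))
  BIND(REPLACE(STR(?wrap), "^Production date :: (\\d{4}).*$", "$1") AS ?year)
}
```

Whether this dodges the multiple-date problem depends on whether the BM records one production-date wrap per object or several; I won’t know until I try.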
It certainly would help with the research question I have in mind: does reading across this data illuminate characteristics of single-sheet satires published between circa 1730 and 1830 (the period I’ll likely pick, as the coverage is best), or the circumstances under which the catalogue was compiled at the British Museum between 1877 and 1954 (the period during which satires published 1730-1830 were catalogued)?