Metadata for all the British Museum Satires: part three

Since my last post, and as a result of the community stepping in to point out the error of my ways (see comments on my previous post), I’ve made a big step forward; though problems remain.

As Owen Stephens helpfully pointed out, what I needed was not better GREP, nor to introduce Python, but a better SPARQL query. After a little work, entering this query in the British Museum SPARQL query interface returned most (if not all) of the records I need, with the descriptions in the output, so that I (and you!) can just use the download options in the GUI to grab the data.

A challenge remains around getting SPARQL to give me the metadata I need in a form that is useful. As Owen mentioned and as I had observed when first scoping out the research, useful information about publishers, authors, titles, dates of publication, and even the Catalogue of Political and Personal Satires Preserved in the Department of Prints and Drawings in the British Museum reference numbers themselves are embedded within indistinguishable display wrappers such as:

<rdf:Description rdf:about="http://collection.britishmuseum.org/id/object/P_1868-0808-8317">
    <bmo:PX_display_wrap>Production date :: 1816 ::</bmo:PX_display_wrap>
</rdf:Description>

The net result is that extracting an ID to replace the meaningless file names (meaningful, I’m told, for internal BM purposes, but without chronological logic meaningless to me) will be tricky. My hope is that by using those lines of the data with a museum object number, such as P_1868-0808-8317 (numbers which are unique, and which relate to when the object was acquired by the BM), I’ll be able – somehow – to cross-reference them with the BM Satires numbers (which are listed in largely chronological order) and begin sorting the files from there.
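To make the shape of the problem concrete, here is a minimal Python sketch that pulls the museum object number off the end of an object URI and splits a display wrapper into a label and a value. It assumes every wrapper follows the `Label :: value ::` pattern of the example above, which may well not hold across the whole dataset; the function names are hypothetical.

```python
# Sample row taken from the RDF example above: (object URI, display-wrap text).
rows = [
    ("http://collection.britishmuseum.org/id/object/P_1868-0808-8317",
     "Production date :: 1816 ::"),
]


def object_number(uri):
    """Return the last path segment of the URI, e.g. 'P_1868-0808-8317'."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def split_wrap(text):
    """Split 'Label :: value ::' into (label, value).

    Assumption: fields are separated by '::'. Anything after the second
    field is ignored, and a missing value comes back as an empty string.
    """
    parts = [part.strip() for part in text.split("::")]
    parts = [part for part in parts if part]
    label = parts[0] if parts else ""
    value = parts[1] if len(parts) > 1 else ""
    return label, value


for uri, wrap in rows:
    print(object_number(uri), split_wrap(wrap))
```

Keyed on the object number, label/value pairs like these could then be lined up against the BM Satires numbers once those have been extracted.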

Anyhow, rather than sit on my hands and wait for this magic solution to appear, I’ve pressed on with an alternative, lossy approach so as at least to get a better sense of the data (Rule 1 of doing research: know your sources/data!).

Once out of the BM SPARQL interface, I pushed the data into OpenRefine. From there I:

  1. converted the column of descriptions into lower case (GREL: `value.toLowercase()`);
  2. removed newlines in the descriptions (GREL: `value.replace("\n", " ")` – note that I did this after much pain failing with the next step and not knowing why; the stray newlines were the culprit…);
  3. created a new column based on a four number match in the descriptions column (GREL: `value.match(/.*(\d{4}).*/)[0]`) to pluck out some publication year dates (more on that ‘some’ later…);
  4. filtered the columns containing object URLs with ‘P_’ to remove the duplicates RDF imposes on me and keep only the lines attached to the museum object number;
  5. filtered the new year-date column in regular expression mode with ‘1[6-8][0-9][0-9]’ to catch only dates between 1600 and 1899;
  6. exported the data out of OpenRefine as a .tsv file.
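For anyone who would rather script this than click through OpenRefine, steps 1 to 5 above can be sketched in Python. This is a minimal sketch and not the workflow actually used: the sample rows are invented, and Python's regex engine is standing in for GREL's. In Python, at least, `.` does not match a newline by default, which is consistent with the trouble described in step 2.

```python
import re

# Mirrors the GREL pattern in step 3: the whole string must match, and the
# greedy leading ".*" means the LAST run of four digits is the one captured.
FOUR_DIGITS = re.compile(r".*(\d{4}).*")
# Step 5: accept only 1600-1899.
YEAR_RANGE = re.compile(r"1[6-8][0-9][0-9]")


def clean(description):
    """Steps 1 and 2: lower-case, then turn newlines into spaces."""
    return description.lower().replace("\n", " ")


def guess_year(description):
    """Step 3: return one four-digit string, or None if nothing matches."""
    match = FOUR_DIGITS.fullmatch(description)
    return match.group(1) if match else None


def keep(uri, year):
    """Steps 4 and 5: object-number rows only, with a year in range."""
    return "P_" in uri and year is not None and YEAR_RANGE.fullmatch(year) is not None


# Hypothetical rows; the URIs and descriptions are invented for illustration.
rows = [
    ("http://example.org/id/object/P_0000-0000-1", "A satire.\nPublished in 1816"),
    ("http://example.org/id/object/P_0000-0000-2", "Published 1809; reissued 1818"),
    ("http://example.org/id/object/P_0000-0000-3", "Lot 2041 in a later sale"),
    ("http://example.org/id/thing/0000", "Published in 1790"),
]

kept = []
for uri, description in rows:
    year = guess_year(clean(description))
    if keep(uri, year):
        kept.append((uri, year))

for uri, year in kept:
    print(year, uri)
```

Note that the second row comes back with the reissue date rather than the publication date, because only one four-digit string per description is ever captured.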

Step 3 is the problem here. The character match does not know which four number strings are a year date and which are not, and the filtering step required as a result takes us from 23363 records to 21213 records. Worse still, some descriptions contain more than one year date, often year dates for reprints or for recent pertinent events – either added by the curator in the description of the print or included in the transcription of the text the satire contains. Looking at the frequency of prints for each year below, I’d suggest that the strong peaks around 1784 and 1818-1822 (many returned as 1818 were in fact, if we look at the descriptions, published in 1809) have as much to do with the presence of reprints as with anything else; interesting as that is, it represents rogue data for my purposes.

All of which means my data is okay but lacking precision and is hence of use only for some purposes (data analysis combined with plenty of close reading). The next step is to establish how to pluck out all the year date info from the descriptions, and then either to filter for those with one apparent year date (thus losing some data, gaining some quality) or to QA by hand all the duplicate suggestions for year dates. The latter is preferable (presuming circa 15% or fewer records have duplicate four number strings in their descriptions…), for though I know the data pretty well (especially the 1783-1812 period, from time spent extracting information from the physical catalogues back in the day; I’ve come a long way…), it wouldn’t hurt to get to know the data a little better through this – enforced – qualitative sampling.
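One way that next step might look, as a minimal Python sketch under stated assumptions: collect every four-digit string between 1600 and 1899 in a description, then separate records with exactly one candidate year from those that need a look by hand. The records and function names here are hypothetical, and the 1600-1899 window is simply carried over from the filter used earlier.

```python
import re

# Four-digit runs in 1600-1899 that are not part of a longer number.
YEAR = re.compile(r"(?<!\d)1[6-8]\d{2}(?!\d)")


def year_candidates(description):
    """Return the distinct candidate years, in order of first appearance."""
    found = []
    for year in YEAR.findall(description):
        if year not in found:
            found.append(year)
    return found


def triage(records):
    """Split records into those with one candidate year and those to review."""
    single, review = {}, {}
    for object_id, description in records.items():
        years = year_candidates(description)
        if len(years) == 1:
            single[object_id] = years[0]
        else:
            review[object_id] = years  # no candidates, or more than one
    return single, review


# Hypothetical records keyed by invented object numbers.
records = {
    "P_0000-0000-1": "published in 1816",
    "P_0000-0000-2": "published 1809; reissued 1818",
    "P_0000-0000-3": "no date given; lot 20415",
}

single, review = triage(records)
print(single)
print(review)

# Share of records with more than one candidate year, to test the 15% hunch.
multiple = sum(1 for years in review.values() if len(years) > 1)
print(f"{multiple / len(records):.0%} of records have more than one candidate year")
```

Run over the real export, the final figure would show whether hand QA of the ambiguous records is a realistic amount of work.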

It certainly would help with the research question I have in mind: does reading across this data illuminate characteristics of single sheet satires published between circa 1730 and 1830 (the period I’ll likely pick as the coverage is best) or the circumstances by which the catalogue was compiled at the British Museum between 1877 and 1954 (the period during which satires published 1730-1830 were catalogued)?


5 thoughts on “Metadata for all the British Museum Satires: part three”

  1. Hi James,

    The British Museum data (as you probably know) uses the CIDOC-CRM model (http://www.cidoc-crm.org). This is a rich but complex way of modelling cultural heritage data, leading to some challenges for querying the data with SPARQL.

    While some (or possibly all) of the data you are interested in is surfaced via the ‘bmo:PX_display_wrap’, the majority (if not all) of the data is also more richly modelled in the CIDOC-CRM model.

    CIDOC-CRM tends to abstract each possible aspect of an object into a concept or event, and then uses this to surface details about that aspect. To try to be a bit more concrete: rather than having a ‘date of production’, each object has a ‘production’ entity, which in turn has its own properties including things like production techniques used, people involved in the production, and date of production (see http://collection.britishmuseum.org/resource/ecrm/E12_Production for a more detailed description of this class).

    This means you have to ‘follow’ various URIs through the SPARQL query until you get to the part of the graph where the literal value you want is stored.

    I find that using the British Museum web interface to their linked data, and following links to explore it, is a good way of getting to know the constructs I’m interested in.

    Here is a ‘production’ for an object:

    http://collection.britishmuseum.org/resource?uri=http%3A%2F%2Fcollection.britishmuseum.org%2Fid%2Fobject%2FPPA42612%2Fproduction

    A timespan for that production:
    http://collection.britishmuseum.org/resource?uri=http%3A%2F%2Fcollection.britishmuseum.org%2Fid%2Fobject%2FPPA42612%2Fproduction%2F1

    The dates associated with that timespan:
    http://collection.britishmuseum.org/resource?uri=http%3A%2F%2Fcollection.britishmuseum.org%2Fid%2Fobject%2FPPA42612%2Fproduction%2F1%2Fdate

    I’ve tried to write a SPARQL query that gets all the information you originally mentioned as desirable. The SPARQL query is at:

    There are a number of compromises/decisions I’ve had to make in making this query. For example, to avoid complications of multiple titles, I’ve used the objects rdfs:label – this gives me a single ‘title’ per object, but misses alternative titles/different language titles.

    I’ve tried to note major compromises or issues in the comments at the URL above.

    Finally, this query results in multiple lines per object where there are multiple descriptions or multiple creators (multiplying up if there are multiples of both). I used OpenRefine to sort on the ID, then de-duplicate by joining together descriptions/creators etc. in OpenRefine – this eventually gave me 23365 unique objects – I think this is right, but didn’t spend enough time checking to be sure I haven’t made a mistake in the de-duplication process.

    My SPARQL knowledge isn’t good enough to do a better job with the query, but I’d be really surprised if there isn’t a way of avoiding at least some of the duplication I get in my results.

  2. Oh and a final thought / comment (for now). Once you’ve got the URIs in OpenRefine, you can use the ‘Add column by fetching URLs’ option to get JSON representations of the RDF for each URI you have. You can then use the parseJson function to extract further information. This can be a bit slow, but it works and means you can always extend the available information by fetching more from the BM site when you need it.

  3. Realised I could use GROUP_CONCAT to achieve in the SPARQL query some of the work I was doing in OpenRefine – so have an updated SPARQL query that only gets a single line per item. Where there are duplicate titles, creators, creation dates, or descriptions these are concatenated into a single string in the query result using the pipe ‘|’ character to join together separate values.

