Last week I received from the lovely people at the British Museum a subset of metadata for the Catalogue of Political and Personal Satires Preserved in the Department of Prints and Drawings in the British Museum, the definitive – if far from comprehensive – collection of British satirical prints circa 1700-1900 (see ‘Metadata for all the British Museum Satires in One Query’).
The subset data dump I received can be downloaded at http://collection.britishmuseum.org/dumps/satires.tgz (if this link goes down, do let me know).
If I’m honest, I’ve struggled to work with this dump. For although well described, the bits I need to do what I want to do – titles, dates, authors, descriptions – are scattered across verbose .xml. This isn’t helped by the file names not mapping to the British Museum Catalogue of Political and Personal Satires references well known in the field of satirical prints – an unreasonable expectation, I know, but one important to any researcher working with these objects. In PPA85715.xml this catalogue reference is covered by:
<rdf:Description rdf:about="http://collection.britishmuseum.org/id/object/PPA85715">
  <bmo:PX_display_wrap>Bibliograpic reference :: BM Satires 12759 ::</bmo:PX_display_wrap>
</rdf:Description>
(note the spelling error, a legacy of the hand-keyed, free text nature of the BM collections pages)
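As a stopgap, pulling the catalogue reference out of the raw text is at least scriptable; a minimal Python sketch, assuming the dump files sit in a folder called satires and that every reference takes the ‘BM Satires NNNN’ form:

```
import re
from pathlib import Path

# Map each dump file to its 'BM Satires' catalogue reference by
# searching the raw XML for the PX_display_wrap value. The folder
# name and the digits-only pattern are assumptions.
pattern = re.compile(r"BM Satires (\d+)")

for path in Path("satires").glob("*.xml"):
    match = pattern.search(path.read_text(encoding="utf-8"))
    if match:
        print(f"{path.name}\tBM Satires {match.group(1)}")
```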
Anyway, in spite of these initial struggles, what I aim to figure out is a means of compiling from this data (once I get the full dump) all the ‘physical descriptions’ for the prints – the curatorial descriptions assigned by Frederick George Stephens and Mary Dorothy George, the two authors of the Catalogue of Political and Personal Satires, between 1870 and 1954 (when the printed catalogue was produced), as well as any descriptions of subsequent additions to the collection.
In the .xml these descriptions take the following form (the example here is from PPA85715.xml):
<rdf:Description rdf:about="http://collection.britishmuseum.org/id/object/PPA85715">
  <bmo:PX_physical_description>The Regent sits in an arm-chair, one gouty leg supported on a stool, and holding a crutch, between Princess Charlotte (left) and Prince Leopold, who stand facing each other. The Prince wears hussar uniform with a large busby and sabre, and holds out a big German sausage, saying, "Dere mine Frow, dere is de best part of a Yarmany Man, dot is vat de Yarmany Ladies love so veil!!" She bends forward eagerly, arms outstretched, saying, "O dear me it is the longest and the thickest I ever saw, do let me taste it." The Regent looks up at her with a pained expression, saying, "There's for you! —I told him you liked a good thing as well as your Father, its all scented, perfumed, curry'd & spiced, but you must not take too much of it at a time, you'll find it very hot." She wears a white high-waisted décolletée dress, slightly trained, and is scarcely caricatured. Behind her is the end of a cloth-covered dinner-table, with decanters and a bird. Over this hangs a whole length picture of a Chinese sage, inscribed 'Confucius'. The chimney-piece, partly visible on the extreme right, is supported by the carved figure of a standing mandarin; on it are a Chinese vase of flowers and a squatting mandarin. April 1816 Hand-coloured etching</bmo:PX_physical_description>
</rdf:Description>
Though I’m sure a Python script could pluck these out, I’ve been seeing how far I can get with unix commands (for I know them best; see my lessons with Ian Milligan on the Programming Historian, such as ‘Counting and Mining Research Data with Unix’), and despite some sage advice on Twitter (many thanks to Martin Eve, Brett Lempereur, Andy Jackson, Adam Crymble, Owen Stephens, and Sharon Howard for helping me out!) I’ve yet to crack `grep` across line breaks in the unix terminal (or – perhaps more importantly – to figure out if it is even possible in this environment…)
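(For the record, the line-break problem largely goes away once you read each file in as a single string rather than line by line – a minimal Python sketch, with the filename pattern an assumption based on PPA85715.xml:)

```
import re
from pathlib import Path

# re.DOTALL makes '.' match newlines, so the pattern can span line
# breaks; the non-greedy '.*?' stops at the first closing tag.
pattern = re.compile(
    r"<bmo:PX_physical_description>(.*?)</bmo:PX_physical_description>",
    re.DOTALL,
)

for path in Path(".").glob("PPA*.xml"):
    for desc in pattern.findall(path.read_text(encoding="utf-8")):
        print(path.name, desc.replace("\n", " "))
```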
Rather than just flounder about mourning my lack of talent, I’ve pressed ahead and used what I know to try and get a sense of the data – for though aware of the data I’m losing by doing this, I’m still retaining more than my brain could ever have and starting to get a useful sense of what is there. The steps I’ve taken to do this are as follows (a rough Python equivalent is sketched after the list):
- Step 1: `grep PX_physical_description *.xml > d.tsv` [this misses those lines in the description not preceded or followed by ‘PX_physical_description’, such as ‘April 1816’ in PPA85715.xml above. Download: 2014-11-16_BMsatires-grep-output.tsv]
- Step 2: `grep -v /bmo:PX_physical_description d.tsv > d2.tsv` [this results in the loss of some data, though usually only descriptions of the production method such as ‘Hand-coloured etching’. I’m not too bothered about these right now.]
- Step 3: `grep -v 'For description see other impression' d2.tsv > d2-deduplicated1.tsv` [this takes out redundancy in the data created by the BM holding multiple copies of the same print]
- Step 4: `sort d2-deduplicated1.tsv > d2-sorted.tsv` [sorts the data]
- Step 5: `uniq d2-sorted.tsv > d2-deduplicated2.tsv` [de-duplicates the data, important with this linked data .rdf stuff]
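For anyone who prefers it, a rough Python equivalent of steps 1–5 – same logic, same losses, with file locations assumed:

```
from pathlib import Path

# Mirror the grep/sort/uniq pipeline: keep single lines mentioning the
# description element, drop closing-tag lines and 'For description see
# other impression' entries, then sort and de-duplicate.
kept = []
for path in Path(".").glob("*.xml"):
    for line in path.read_text(encoding="utf-8").splitlines():
        if "PX_physical_description" not in line:
            continue  # step 1
        if "/bmo:PX_physical_description" in line:
            continue  # step 2
        if "For description see other impression" in line:
            continue  # step 3
        kept.append(line.strip())

# steps 4 and 5: sort, then de-duplicate
Path("d2-deduplicated2.tsv").write_text("\n".join(sorted(set(kept))), encoding="utf-8")
```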
From here, I’ve tidied up the output in OpenRefine: in particular, I used `value.match(/.*(\d{4}).*/)[0]` to find a year-date match in each description and assign it to a new column, filtered out those entries without a year-date match (quite a few, again losing some data…), and ended up with a .txt file (download: 2014-11-16_BMsatires-lossy-clean-sort.txt) with all the descriptions that matched the criteria, in order and ready for easy processing in something like Voyant.
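(The year-date step is easy enough to replicate outside OpenRefine too; a minimal Python equivalent of that GREL expression:)

```
import re

# Equivalent of value.match(/.*(\d{4}).*/)[0]: the greedy leading '.*'
# means the *last* four-digit run in the description is captured.
def extract_year(description):
    match = re.search(r".*(\d{4})", description)
    return match.group(1) if match else None

print(extract_year("... April 1816 Hand-coloured etching"))  # 1816
```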
Obviously this process has been massively lossy, but it is better than nothing (I’m left with 1637 descriptions from 4485 files, though some 800 of these can be accounted for by the ‘For description see other impression’ entries). The next step is to work out how to capture all of the descriptions irrespective of line breaks (for which I think I’m going to need some clever Python…) and either compile the data (dumb) or keep the files separate (clever) – ideally with the titles, dates, authors et al intact – for much smarter processing, analysis et al. And what I’d really like is to get them into a form that Zotero can ingest, both so they are in my research archive and so that I can throw them at Paper Machines (for another angle), but – as you may have noticed – I fear I’ll run out of talent before then…
Thoughts, comments, help, support, collaboration most (most) welcome.
Ahh, I see now what you’re trying to do. My first advice: put away the grep and get some proper XML parsing tools. It might seem harder at first but there’s so much XML data out there I think you’ll find it a worthwhile investment.
If you bite the bullet and try Python, the intro to Beautiful Soup in PH looks like it might be a good starting point. But you would need to go beyond that because it covers HTML only and you’d need to add an XML parsing library – see the BS documentation. (In PHP it’d be quite straightforward with DOMDocument and Xpath.)
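To give a flavour, here’s a minimal sketch of the XPath route in Python using lxml directly (the bmo namespace URI is my assumption – check it against the rdf:RDF header in the dump):

```
from lxml import etree

# Parse one dump file and pull the physical description out via XPath,
# instead of grepping the serialised text.
ns = {"bmo": "http://collection.britishmuseum.org/id/ontology/"}

tree = etree.parse("PPA85715.xml")
for elem in tree.xpath("//bmo:PX_physical_description", namespaces=ns):
    print(elem.text)
```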
There are some possible tools that stay within the unix terminal, though – I’ve never tried them but you should look into XSLT tools like Saxon or xsltproc. I’ve also seen recommendations for a toolkit called XMLStarlet, which looks easier to use (but may not have been recently updated).
I realised quite quickly that I’d need to look somewhere else to get this done, so thanks so much for the advice. I shall dig in next time I get the chance!
XMLStarlet is great, but has quite a steep learning curve. An alternative might be xmllint, which now has an XPath-based ‘grep’ mode, apparently: http://stackoverflow.com/a/14492020/6689
Although it’s been delivered as XML, this is actually RDF, and to be honest the serialisation of RDF to XML is ugly and not very nice to work with.
You *can* parse this with XML parsers, but it might be easier to use the RDF with a decent RDF parser.
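For instance, a minimal sketch with rdflib in Python – again, the exact bmo namespace URI is an assumption to verify against the dump:

```
from rdflib import Graph, URIRef

# Load the RDF/XML as a graph of triples, then ask for the
# PX_physical_description predicate directly - the XML serialisation
# stops mattering.
PX_DESC = URIRef("http://collection.britishmuseum.org/id/ontology/PX_physical_description")

g = Graph()
g.parse("PPA85715.xml", format="xml")

for subject, description in g.subject_objects(PX_DESC):
    print(subject, description)
```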
To be honest, getting the descriptions is only a small tweak to the original SPARQL query, so my initial reaction is to go back to the SPARQL query and use that to grab the descriptions directly.
You should be able to add this statement at the bottom of the existing query to get the descriptions:
`?object bmo:PX_physical_description ?desc`
and add `?desc` to your SELECT statement.
There’s an issue with this because it looks like there are several identifiers for each object. You can work around this by using the ‘preferred identifier’, which is the same across all representations:
`?object crm:P48_has_preferred_identifier ?id`
Extracting the correct reference (e.g. Bibliograpic reference :: BM Satires 12759 ::) looks like a pain in the neck – I can’t see this in the structured data at all, which means you are reduced to extracting it from the display data (bmo:PX_display_wrap). But this has the problem that there is all kinds of data in PX_display_wrap statements, so you get a whole load of information you didn’t want/need. It’s not difficult to parse out the stuff you really want, e.g. in OpenRefine afterwards, but it bugs me that you can’t do this neatly in a SPARQL statement – I may be missing something in the data model, but I don’t think so.
Owen, thanks for this. I have limited experience with SPARQL (the folks at the BM built my query), but having played around for a bit I got this query https://gist.github.com/drjwbaker/e4a48b16a001d92002d3 to work: http://bit.ly/1EQdqA6. Not sure it is capturing the preferred identifier bit, but at least it isn’t failing!
Without wishing to impose for much longer, do you have any recommendations for a starting point for grabbing the data post-query? Again, I’ve been relying thus far on the BM folks sending me a dump, but I’m sure there is a better way of going about it.
Yes – you’ve got the query right as far as I can see.
You can get the results of the SPARQL query as JSON or XML – look at the top right of the query results in the BM web interface and you should see links to each format. OpenRefine happily reads the JSON output of this query.
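If you want to skip the web interface altogether, you can request the JSON programmatically; a minimal Python sketch, assuming the endpoint lives at collection.britishmuseum.org/sparql and speaks the standard SPARQL protocol:

```
import requests

# Send a query to the endpoint and ask for SPARQL JSON results via
# the standard Accept header. The endpoint URL is an assumption.
ENDPOINT = "http://collection.britishmuseum.org/sparql"

query = """
PREFIX bmo: <http://collection.britishmuseum.org/id/ontology/>
SELECT ?object ?desc
WHERE { ?object bmo:PX_physical_description ?desc }
LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["object"]["value"], row["desc"]["value"])
```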
The ‘preferred identifier’ bit is slightly odd. When I look at the results in the BM web interface it seems to ignore these duplicate representations and only gives you one result per object. However, when I run the query remotely, or when I export the results from the BM as JSON/XML, the duplication appears.
Getting the preferred identifier is just a matter of adding one further statement to the query so you end up with:
PREFIX crm: <http://erlangen-crm.org/current/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX thes: <http://collection.britishmuseum.org/id/thesauri/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bmo: <http://collection.britishmuseum.org/id/ontology/>
SELECT DISTINCT ?pid ?desc
{?object crm:P70i_is_documented_in ;
  crm:P128_carries / crm:P129_is_about ?satire ;
  bmo:PX_physical_description ?desc ;
  crm:P48_has_preferred_identifier ?pid .
 ?satire skos:inScheme thes:subject ;
  rdfs:label "satire"}
Having a poke around the data, it looks like in a few cases a single object has multiple descriptions, so you just need to look out for that when processing the output (in OpenRefine you can permanently sort based on ID, then blank down, swap to records mode and, if you want, join the description fields together).
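Outside OpenRefine the same join is straightforward too; a minimal Python sketch, with the rows standing in for (preferred ID, description) pairs parsed from the JSON:

```
from collections import defaultdict

# Collapse multiple descriptions per preferred identifier into one
# record, mirroring the sort / blank down / records-mode join.
rows = [
    ("PRN123", "First description..."),   # placeholder data
    ("PRN123", "Second description..."),
]

by_pid = defaultdict(list)
for pid, desc in rows:
    by_pid[pid].append(desc)

for pid, descriptions in sorted(by_pid.items()):
    print(pid, " | ".join(descriptions))
```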
Apologies, had a dim moment there – I didn’t realise that, as the ‘desc’ bit surfaces the description in the results, it would appear in the JSON/XML… Silly me.
And thanks for the query edit. Looks like I’ve now got something to throw at Refine (and yeah, that records swap trick is invaluable!)
btw, getting the author and date takes a bit more complex SPARQLing, because the data model in the BM data puts all this stuff at one step removed. I think you could get all the fields you list (titles, dates, authors, descriptions) in a single SPARQL query, but it would have to de-reference URLs at one or two steps removed from the object.
Oh – also, I think the current query will omit any items that don’t have a description/preferred ID. It would need further tweaking to return those where these are blank, if items with blank descriptions are of interest.