
Metadata for all the British Museum Satires: part two

Last week I received from the lovely people at the British Museum a subset of metadata for the Catalogue of Political and Personal Satires Preserved in the Department of Prints and Drawings in the British Museum, the definitive – if far from comprehensive – collection of British satirical prints circa 1700-1900 (see ‘Metadata for all the British Museum Satires in One Query’).

The subset data dump I received can be downloaded at http://collection.britishmuseum.org/dumps/satires.tgz (if this link goes down, do let me know).

If I’m honest, I’ve struggled to work with this dump. Although it is well described, the bits I need to do what I want to do – titles, dates, authors, descriptions – are scattered around verbose .xml. This isn’t helped by the file names not mapping to the British Museum Catalogue of Political and Personal Satires references that are well known in the field of satirical prints. That is an unreasonable expectation, I know, but these references are important to any researcher working with these objects. In PPA85715.xml this catalogue reference is covered by:

  <rdf:Description rdf:about="http://collection.britishmuseum.org/id/object/PPA85715">
      <bmo:PX_display_wrap>Bibliograpic reference :: BM Satires 12759 ::</bmo:PX_display_wrap>
  </rdf:Description>

(note the spelling error, a legacy of the hand-keyed, free text nature of the BM collections pages)
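A minimal Python sketch of how that catalogue reference might be pulled out of a file, matching on the ‘BM Satires’ text so that the misspelt label doesn’t matter. The sample string mirrors the snippet above; the `bmo` namespace URI in it is a placeholder assumption (the real files declare their own), which is why the code matches elements by local name only. The function names are illustrative, not part of any BM tooling.

```python
import re
import xml.etree.ElementTree as ET

# Sample mirroring the snippet above. The bmo namespace URI is a
# placeholder assumption; the real dump declares its own.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:bmo="http://example.org/bmo-placeholder/">
  <rdf:Description rdf:about="http://collection.britishmuseum.org/id/object/PPA85715">
    <bmo:PX_display_wrap>Bibliograpic reference :: BM Satires 12759 ::</bmo:PX_display_wrap>
  </rdf:Description>
</rdf:RDF>"""

# Capture whatever sits between "BM Satires" and the closing "::".
SATIRES_RE = re.compile(r"BM Satires\s+([^:]+?)\s*::")


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def satires_references(xml_text):
    """Return every BM Satires reference found in PX_display_wrap elements."""
    root = ET.fromstring(xml_text)
    refs = []
    for element in root.iter():
        if local_name(element.tag) == "PX_display_wrap" and element.text:
            match = SATIRES_RE.search(element.text)
            if match:
                refs.append(match.group(1))
    return refs


print(satires_references(SAMPLE))  # ['12759']
```

Keyed on this reference rather than on the file name, each record could then be lined up with the printed catalogue.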

Anyway, in spite of these initial struggles, what I aim to figure out is a means of compiling from this data (once I get the full dump) all the ‘physical descriptions’ for the prints: the curatorial descriptions assigned by Frederick George Stephens and Mary Dorothy George, the two authors of the Catalogue of Political and Personal Satires, between 1870 and 1954 (when the printed catalogue was produced), as well as any descriptions of subsequent additions to the collection.

In the .xml these descriptions take the following form (example here PPA85715.xml):

  <rdf:Description rdf:about="http://collection.britishmuseum.org/id/object/PPA85715">
      <bmo:PX_physical_description>The Regent sits in an arm-chair, one gouty leg supported on a stool, and holding a crutch, between Princess Charlotte (left) and Prince Leopold, who stand facing each other. The Prince wears hussar uniform with a large busby and sabre, and holds out a big German sausage, saying, "Dere mine Frow, dere is de best part of a Yarmany Man, dot is vat de Yarmany Ladies love so veil!!" She bends forward eagerly, arms outstretched, saying, "O dear me it is the longest and the thickest I ever saw, do let me taste it." The Regent looks up at her with a pained expression, saying, "There's for you! —I told him you liked a good thing as well as your Father, its all scented, perfumed, curry'd &amp; spiced, but you must not take too much of it at a time, you'll find it very hot." She wears a white high-waisted décolletée dress, slightly trained, and is scarcely caricatured. Behind her is the end of a cloth-covered dinner-table, with decanters and a bird. Over this hangs a whole length picture of a Chinese sage, inscribed 'Confucius'. The chimney-piece, partly visible on the extreme right, is supported by the carved figure of a standing mandarin; on it are a Chinese vase of flowers and a squatting mandarin.

  April 1816

  Hand-coloured etching</bmo:PX_physical_description>
  </rdf:Description>

Though I’m sure a Python script could pluck these out, I’ve been seeing how far I can get with unix commands (for I know them best, see my lessons with Ian Milligan on the Programming Historian, such as ‘Counting and Mining Research Data with Unix’). Despite some sage advice on Twitter (many thanks to Martin Eve, Brett Lempereur, Andy Jackson, Adam Crymble, Owen Stephens, and Sharon Howard for helping me out!), I’ve yet to crack `grep` across line breaks in the unix terminal (or to – perhaps more importantly – figure out if it is even possible in this environment…).

Rather than just flounder about mourning my lack of talent, I’ve pressed ahead and used what I know to try and get a sense of the data. Though aware of the data I’m losing by doing this, I’m still retaining more than my brain could ever hold and starting to get a useful sense of what is there. The steps I’ve taken to do this are as follows:

  • Step 1: `grep PX_physical_description *.xml > d.tsv` [this misses those lines in the description not preceded or followed by ‘PX_physical_description’, such as ‘April 1816’ in PPA85715.xml above. Download: 2014-11-16_BMsatires-grep-output.tsv]
  • Step 2: `grep -v /bmo:PX_physical_description d.tsv > d2.tsv` [this results in the loss of some data, though usually only descriptions of the production method such as ‘Hand-coloured etching’. I’m not too bothered about these right now.]
  • Step 3: `grep -v 'For description see other impression' d2.tsv > d2-deduplicated1.tsv` [this takes out redundancy in the data created by the BM holding multiple copies of the same print]
  • Step 4: `sort d2-deduplicated1.tsv > d2-sorted.tsv` [sorts the data]
  • Step 5: `uniq d2-sorted.tsv > d2-deduplicated2.tsv` [de-duplicates the data, important with this linked data .rdf stuff]
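To make the losses easier to see, here is a rough Python re-statement of the five steps above, run over a handful of made-up lines (the `PPA1.xml` to `PPA3.xml` names and their contents are hypothetical, shaped like `grep` output with a file-name prefix). It is a sketch of the same line-by-line logic, not a replacement for the commands. One thing it suggests: Step 2 drops any line containing the closing tag, so a description that opens and closes on a single line would disappear entirely, not just its production method.

```python
def lossy_pipeline(lines):
    """Approximate Steps 1-5: a line-based filter, then sort and de-duplicate."""
    # Step 1: keep only lines that mention the element name at all.
    kept = [ln for ln in lines if "PX_physical_description" in ln]
    # Step 2: drop lines that contain the closing tag.
    kept = [ln for ln in kept if "/bmo:PX_physical_description" not in ln]
    # Step 3: drop cross-references to another impression.
    kept = [ln for ln in kept if "For description see other impression" not in ln]
    # Steps 4-5: sort, then collapse identical lines (roughly sort + uniq).
    return sorted(set(kept))


# Hypothetical input lines, one per line of XML, prefixed with a file name.
LINES = [
    "PPA1.xml:<bmo:PX_physical_description>The Regent sits in an arm-chair.",
    "PPA1.xml:April 1816",
    "PPA1.xml:Hand-coloured etching</bmo:PX_physical_description>",
    "PPA2.xml:<bmo:PX_physical_description>For description see other impression.",
    "PPA3.xml:<bmo:PX_physical_description>A one-line description.</bmo:PX_physical_description>",
    "PPA1.xml:<bmo:PX_physical_description>The Regent sits in an arm-chair.",
]

for line in lossy_pipeline(LINES):
    print(line)
```

Of the six input lines only the first survives: the date and production method go at Steps 1 and 2, the cross-reference at Step 3, the one-line description at Step 2, and the repeat at Step 5.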

From here, I’ve tidied up the output in OpenRefine. In particular I used `value.match(/.*(\d{4}).*/)[0]` to find a year-date match in the description and assign that to a new column, then filtered out those entries without a year-date match in the description (quite a few, again losing some data…). I ended up with a .txt file (download: 2014-11-16_BMsatires-lossy-clean-sort.txt) with all the descriptions that matched the criteria, in order and ready for easy processing in something like Voyant.
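For anyone wanting to try the year-date match outside OpenRefine, a small Python sketch of the same pattern follows. Because the leading `.*` is greedy, the pattern captures the last run of four digits in the text, whether or not it is a year. The second function is an alternative I am assuming would suit this material, restricted to whole numbers from 1600 to 1999; that range is my guess at what counts as a plausible year here, not something taken from the data.

```python
import re

# Same shape as the OpenRefine expression: greedy, so the LAST four
# consecutive digits in the text are captured.
GREEDY = re.compile(r".*(\d{4}).*")

# Assumption: a "year" is a standalone number between 1600 and 1999.
YEAR = re.compile(r"\b(1[6-9]\d{2})\b")


def last_four_digits(text):
    """Return the last four-digit run in text, or None if there is none."""
    match = GREEDY.match(text)
    return match.group(1) if match else None


def plausible_years(text):
    """Return every standalone 1600-1999 number in text, in order."""
    return YEAR.findall(text)


dated = "A copy of a print of 1760. April 1816"
numbered = "No. 12345 in the series"

print(last_four_digits(dated), plausible_years(dated))
print(last_four_digits(numbered), plausible_years(numbered))
```

On the second string the greedy pattern returns part of a longer number, while the stricter pattern returns nothing, so the choice of pattern changes which entries get filtered out.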

Obviously this process has been massively lossy, but it is better than nothing (I’m left with 1637 descriptions from 4485 files, though some 800 of these can be accounted for by the ‘For description see other impression’ entries). The next step is to work out how to capture all of the descriptions irrespective of line breaks (for which I think I’m going to need some clever Python…) and either compile the data (dumb) or keep the files separate (clever) – ideally with the titles, dates, authors et al intact – for much smarter processing, analysis et al. And what I’d really like is to get them into a form that Zotero can ingest, both so they are in my research archive and so that I can throw them at Paper Machines (for another angle), but – as you may have noticed – I fear I’ll run out of talent before then…
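As a starting point for that next step, here is a minimal sketch using Python’s standard-library XML parser, which reads each element’s text whole and so sidesteps the line-break problem. The sample is an abbreviated version of the PPA85715.xml excerpt above (the `[...]` marks my cuts), the `bmo` namespace URI is again a placeholder assumption, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET

# Abbreviated from the excerpt above; "[...]" marks omitted text.
# The bmo namespace URI is a placeholder assumption.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:bmo="http://example.org/bmo-placeholder/">
  <rdf:Description rdf:about="http://collection.britishmuseum.org/id/object/PPA85715">
    <bmo:PX_physical_description>The Regent sits in an arm-chair [...] curry'd &amp; spiced [...]

  April 1816

  Hand-coloured etching</bmo:PX_physical_description>
  </rdf:Description>
</rdf:RDF>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def physical_descriptions(xml_text):
    """Return each description as a list of its non-empty lines."""
    root = ET.fromstring(xml_text)
    descriptions = []
    for element in root.iter():
        if local_name(element.tag) == "PX_physical_description" and element.text:
            # The parser hands back the full text, line breaks included,
            # with entities such as &amp; already decoded.
            parts = [ln.strip() for ln in element.text.splitlines() if ln.strip()]
            descriptions.append(parts)
    return descriptions


for parts in physical_descriptions(SAMPLE):
    print(parts)
```

Run per file (for instance by looping over `pathlib.Path(".").glob("*.xml")` and reading each one), this would keep the description, date and production method together and still tied to the file they came from, which leaves open the choice between compiling the data and keeping the files separate.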

Thoughts, comments, help, support, collaboration most (most) welcome.