All posts by jwbaker

James Baker is a Lecturer in Digital History at the University of Sussex (and the awesome Sussex Humanities Lab). He is a historian of long eighteenth century Britain and a Software Sustainability Institute Fellow. He holds degrees from the University of Southampton and latterly the University of Kent, where in 2010 he completed his doctoral research on the late-Georgian satirical artist-engraver Isaac Cruikshank. As an eighteenth centuryist, his research interests include satirical art, the making and selling of printed objects, urban protest, and corpus analysis. His contemporary historian interests include the curation of personal digital archives, the critical examination of forensic software and captures, the use of born-digital archives in historical research, and scribing and archiving in the age of the hard disk. Prior to joining Sussex, James held positions as Digital Curator at the British Library and Postdoctoral Fellow with the Paul Mellon Centre for Studies of British Art. He is a convenor of the Institute of Historical Research Digital History seminar and a member of the History Lab Plus Advisory Board.

Printed images and computational image recognition

This weekend I bought a print at my local antiques and vintage market. It is a George Cruikshank satire, etched in 1849 and published as part of The Comic Almanack 1850. It is called As it Ought to Be Or The Ladies Trying a Contemptible Scoundrel for a “Breach of Promise” and it satirises the women’s rights movement by constructing a fantasy: a “Queen’s” iteration of the Court of the King’s Bench, a fixture of the English legal system. Despite studying the Cruikshank family for some years, this is the first Cruikshank print that I’ve purchased (they can be rather expensive). What I like about this one is a) that it came from a book and so is a nice example of the multi-modal print publishing that grew in England after the 1820s, and b) that it has very simple colouring that differs from other copies, such as one at the Victoria & Albert Museum, and so reminds us (and my students!) that prints in this period were not ‘reproductions’ as we think of them: they were unique objects produced through craft-like processes rather than exact replicas.

Today I put a digital copy of that print on Wikimedia Commons. Also today, I figured out how to run Pastec, a tool that can be used to search for duplicate and near duplicate images in a corpus of digital images held locally. This was a tricky one to get running: it required me to configure a virtual machine (mostly because I don’t have a native Ubuntu machine at the moment), normalise some data, adapt some scripts good people (Ryan Baumann, Shawn Graham, Matthew Lincoln) had generously shared, and work out enough Ruby to figure out the syntax for running a .rb script. I’d started all this last week, since when I’d mostly failed repeatedly, but with each fail I learnt a little more about shell scripting, programming languages, and tracebacks. I’ve distilled this all into a guide called ‘Getting Pastec up and running’ because documenting a process like this is the best hope I have of remembering it and building on it. I hope you find it useful as well.

As part of the Pastec guide I’ve posted an output from running the tool over some test data. Examination of this output shows that Pastec can indeed find not only exact replicas of images (as in, the same file with a different file name) but also reuses of the same woodblock on a different page, printed at a slightly different angle, with subtly different inking, and even more subtle degradation in the printing block. This brings me back to Cruikshank and As It Ought To Be. For some time I’ve been interested in the versions of printed images, specifically satirical images, that entered the London marketplace circa 1750 to 1850 either as single sheets or as book illustrations. I’ve spent many hours in archives and print rooms studying different versions of prints and comparing them with notes on versions in other archives and print rooms. As a result of these activities, I’ve built what I believe to be strong evidence for seeing the processes of making reproducible images as important factors in shaping and constraining artistic agency. Now, having wrangled Pastec into my historian’s toolkit, I can do something I’ve wanted to be able to do for some time: take a large corpus of digital images, more than I could reasonably work through by hand, and find near duplicates, thus building evidence that can support, complicate, or outright reject my hypothesis.
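To make the idea of a ‘near duplicate’ a little more concrete, here is a minimal sketch in Python. To be clear, this is not how Pastec works, and it is not taken from the guide: it swaps in a much simpler technique (an ‘average hash’ compared by Hamming distance) purely for illustration. The function names, the toy 4×4 greyscale grids standing in for scanned impressions, and the `max_distance` threshold are all my own assumptions.

```python
from itertools import combinations


def average_hash(pixels):
    """Reduce a greyscale grid to bits: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def find_near_duplicates(images, max_distance=2):
    """Return (name, name, distance) for every pair within max_distance bits."""
    hashes = {name: average_hash(pixels) for name, pixels in images.items()}
    pairs = []
    for name_a, name_b in combinations(sorted(hashes), 2):
        distance = hamming_distance(hashes[name_a], hashes[name_b])
        if distance <= max_distance:
            pairs.append((name_a, name_b, distance))
    return pairs


# Hypothetical toy data: two impressions from the same block with different
# inking (and one patch printed darker or lighter), plus an unrelated image.
images = {
    "impression_a": [[10, 200, 10, 200], [200, 10, 200, 10],
                     [10, 200, 10, 200], [200, 10, 200, 10]],
    "impression_b": [[30, 180, 25, 210], [190, 20, 220, 15],
                     [12, 170, 40, 205], [210, 35, 185, 150]],
    "unrelated": [[200, 200, 200, 200], [200, 200, 200, 200],
                  [10, 10, 10, 10], [10, 10, 10, 10]],
}

near_duplicates = find_near_duplicates(images)
print(near_duplicates)
```

With this toy data the two impressions are paired at a distance of one bit, while the unrelated image is left out. A real corpus would need actual image loading and resizing, and a hash this crude would not cope with the rotation and block degradation described above, which is part of why a dedicated tool is worth the effort.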

Of course, before making any claims based on this work, I would need to find out more about what Pastec is doing. It is, after all, taking impressions from inked copper plates subsequently hand-coloured and turning them into data, into numbers that are abstractions of reality, abstractions that, as Pat Hudson’s evergreen History by Numbers reminds us, the historian must ensure are representative of that reality. Nevertheless, I’m pleased – for the second time this summer – to have been able to computationally process a multimedia source for itself, rather than through textual surrogates, and to have figured out how to semi-automate and scale my fascination with near duplicates of printed images.