All posts by jwbaker

James Baker is Director of Digital Humanities at the University of Southampton. James is a Software Sustainability Institute Fellow, a Fellow of the Royal Historical Society, and holds degrees from the University of Southampton and latterly the University of Kent, where in 2010 he completed his doctoral research on the late-Georgian artist-engraver Isaac Cruikshank. James works at the intersection of history, cultural heritage, and digital technologies. He is currently working on a history of knowledge organisation in twentieth-century Britain. In 2021, he begins a major new Arts and Humanities Research Council funded project 'Beyond Notability: Re-evaluating Women’s Work in Archaeology, History and Heritage, 1870 – 1950'. Previous externally funded research projects have focused on legacy descriptions of art objects ('Legacies of Catalogue Descriptions and Curatorial Voice: Opportunities for Digital Scholarship', Arts and Humanities Research Council), the preservation of intangible cultural heritage ('Coptic Culture Conservation Collective', British Council, and 'Heritage Repertoires for inclusive and sustainable development', British Academy), the born digital archival record ('Digital Forensics in the Historical Humanities', European Commission), and decolonial futures for museum collections ('Making African Connections: Decolonial Futures for Colonial Collections', Arts and Humanities Research Council). Prior to joining Southampton, James held positions of Senior Lecturer in Digital History and Archives at the University of Sussex and Director of the Sussex Humanities Lab, Digital Curator at the British Library, and Postdoctoral Fellow with the Paul Mellon Centre for Studies in British Art.
He is a member of the Arts and Humanities Research Council Peer Review College, a convenor of the Institute of Historical Research Digital History seminar, a member of The Programming Historian Editorial Board and a Director of ProgHist Ltd (Company Number 12192946), and an International Advisory Board Member of British Art Studies.

Printed images and computational image recognition

This weekend I bought a print at my local antiques and vintage market. It is a George Cruikshank satire, etched in 1849 and published as part of The Comic Almanack 1850. It is called As it Ought to Be Or The Ladies Trying a Contemptible Scoundrel for a “Breach of Promise” and it satirises the women’s rights movement by constructing a fantasy: a “Queen’s” iteration of the Court of the King’s Bench, a fixture of the English legal system. Despite studying the Cruikshank family for some years, this is the first Cruikshank print that I’ve purchased (they can be rather expensive). What I like about this one is a) that it came from a book and so is a nice example of the multi-modal print publishing that grew in England after the 1820s, and b) that it has very simple colouring that differs from other copies, such as one at the Victoria & Albert Museum, and so reminds us (and my students!) that prints in this period were not ‘reproductions’ as we think of them: they were unique objects produced through craft-like processes rather than exact replicas.

Today I put a digital copy of that print on Wikimedia Commons. Also today, I figured out how to run Pastec, a tool that can be used to search for duplicate and near duplicate images in a corpus of digital images held locally. This was a tricky one to get running: it required me to configure a virtual machine (mostly because I don’t have a native Ubuntu machine at the moment), normalise some data, adapt some scripts that good people (Ryan Baumann, Shawn Graham, Matthew Lincoln) had generously shared, and work out enough Ruby to figure out the syntax for running a .rb script. I’d started all this last week, since when I’d mostly failed repeatedly, but with each fail I learnt a little more about shell scripting, programming languages, and tracebacks. I’ve distilled this all into a guide called ‘Getting Pastec up and running’ because documenting a process like this is the best hope I have of remembering it and building on it. I hope you find it useful as well.

As part of the Pastec guide I’ve posted an output from running the tool over some test data. Examination of this output shows that Pastec can indeed find not only exact replicas of images (as in, the same file with a different file name) but also reuses of the same woodblock on a different page, printed at a slightly different angle, with subtly different inking, and even more subtle degradation in the printing block. This brings me back to Cruikshank and As It Ought To Be. For some time I’ve been interested in the versions of printed images, specifically satirical images, that entered the London marketplace circa 1750 to 1850 either as single sheets or as book illustrations. I’ve spent many hours in archives and print rooms studying different versions of prints and comparing them with notes on versions in other archives and print rooms. As a result of these activities, I’ve built what I believe to be strong evidence for seeing the processes of making reproducible images as important factors in shaping and constraining artistic agency. Now, having wrangled Pastec into my historian’s toolkit, I can do something I’ve wanted to be able to do for some time: take a large corpus of digital images, more than I could reasonably work through by hand, and find near duplicates, thus building evidence that can support, complicate, or outright reject my hypothesis.

Of course, before making any claims based on this work, I would need to find out more about what Pastec is doing. It is, after all, taking impressions from inked copper plates, subsequently hand-coloured, and turning them into data, into numbers that are abstractions of reality, abstractions that, as Pat Hudson’s evergreen History by Numbers reminds us, the historian must ensure are representative of that reality. Nevertheless, I’m pleased – for the second time this summer – to have been able to computationally process a multimedia source for itself, rather than through textual surrogates, and to have figured out how to semi-automate and scale my fascination with near duplicates of printed images.
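To make the idea of turning a printed image into numbers a little more concrete, here is a minimal Python sketch of one very simple way to flag near duplicates: an "average hash" compared by Hamming distance. This is not what Pastec does internally (its matching is more sophisticated and I have not described it here); it is a swapped-in, deliberately crude technique for illustration only. The function names, the toy 4×4 greyscale grids standing in for scanned impressions, and the `max_distance` threshold are all assumptions invented for this example.

```python
from itertools import combinations


def average_hash(pixels):
    """Reduce a greyscale grid to a tuple of bits: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def find_near_duplicates(images, max_distance=2):
    """Return (name_a, name_b, distance) for every pair within max_distance bits."""
    hashes = {name: average_hash(pixels) for name, pixels in images.items()}
    pairs = []
    for (name_a, hash_a), (name_b, hash_b) in combinations(sorted(hashes.items()), 2):
        distance = hamming_distance(hash_a, hash_b)
        if distance <= max_distance:
            pairs.append((name_a, name_b, distance))
    return pairs


# Hypothetical toy data: tiny greyscale grids (0 = black, 255 = white).
impression_a = [
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
]
# The same pattern with uniformly lighter "inking".
impression_b = [[min(255, value + 20) for value in row] for row in impression_a]
# A different design altogether.
unrelated = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]

images = {
    "impression_a": impression_a,
    "impression_b": impression_b,
    "unrelated": unrelated,
}

for name_a, name_b, distance in find_near_duplicates(images):
    print(f"{name_a} ~ {name_b} (differs in {distance} bits)")
```

In this toy case the two impressions hash identically despite the difference in inking, while the unrelated design does not match. A hash this crude would struggle with the rotations and block degradation described above, which is exactly why the question of what the numbers represent matters before drawing historical conclusions from them.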