The extraction of these data from the archive is beset with problems that will be familiar to anyone who has explored ECCO. As is now well known, the optical character recognition (OCR) software used by Gale, the publisher, compromises the reliability of the data extracted. Although this is regrettable, the following study is intended to be exemplar of a new kind of conceptual history. When in the not-too-distant future the glitches in the software no longer cause these problems, the compilation of more secure data will be possible. But since I doubt that there will be significant changes to the profiles I have created for the concepts studied here, the revision of precise numerical values will be unlikely to lead to different conclusions. I am, nevertheless, confident that at the time of carrying out the searches (for the most part in 2009-2010) all of the data are presented as accurate.
Peter de Bolla, The Architecture of Concepts: The Historical Formation of Human Rights (2013), 8.
In February Andrew Prescott visited the Sussex Humanities Lab to speak on the subject ‘Searching for Dr Johnson’ (my notes from the talk are on GitHub). His concern was that historians do not sufficiently consider the complex origins of digitisation projects when they use digitised material in their research. The British Library’s efforts to digitise newspapers, for example, were not motivated – in Prescott’s account – by a desire to enable full-text search but by a need to resolve issues with access to microfilm of their newspaper collections: there were, in short, not enough reels to go around.
One example of published historical research that Prescott singled out for comment was Peter de Bolla’s monograph The Architecture of Concepts: The Historical Formation of Human Rights. Although the book was, apparently, well received among historians of ideas, Prescott found problematic de Bolla’s lack of interest in how the data he was searching (that is, Eighteenth Century Collections Online, queried for words and phrases in texts) had been produced.
Having tracked down a copy of The Architecture of Concepts I can see what Andrew is getting at. For although de Bolla’s use of word counts as a jumping-off point for close reading is precisely the kind of computational method I like (e.g. Bob Nicholson, ‘Counting Culture; or, How to Read Victorian Newspapers from a Distance’, Journal of Victorian Culture 17:2 (2012) doi: 10.1080/13555502.2012.683331), the approach to the data – described in the passage above – feels to me a little odd.
What I’m going to do is go through this passage line-by-line to illustrate what I – and, I suspect, Prescott – mean. I should say that I’m not making any claims about de Bolla’s thesis, nor do I intend to undermine his claims: I haven’t read the book beyond the introduction and if the counts are just a jumping-off point as he suggests (p. 9) I’m sure all is well. Rather, I want to unpick the assumptions behind the passage because I think they speak to the complex relationship between historians and digital sources that many of us with ‘Digital History’ in our job titles are keen to better understand.
So, from page 9:
“The extraction of these data from the archive is beset with problems that will be familiar to anyone who has explored ECCO..”
Agreed. Any use of ECCO that is more than superficial should alert the user to many ‘problems’.
“..As is now well known, the optical character recognition (OCR) software used by Gale, the publisher, compromises the reliability of the data extracted..”
Okay. But as the OCR software used to create the digital text is knowable (presumably an old version of ABBYY FineReader or similar), why not track it down, find out what it is good at, and what later versions sought to improve? Projects such as IMPACT and SUCCEED can help a researcher to better understand the likely impact of software choices on the ‘reliability’ of digitisation. And these software choices will, in all likelihood, take Gale off the pedestal they are given by de Bolla and reveal a complex array of commercial and institutional pressures spread across and between a number of organisations beyond the ‘publisher’ of ECCO.

I’m also not exactly sure what ‘compromises the reliability of the data extracted’ means here. The data extracted is the data extracted: it is the best that human-directed software could do to understand the human-readable contents of a series of digital images. And the quality of those digital images isn’t the only determinant of the ‘reliability of the data extracted’; also of vital importance is the condition of the thing from which the images are made, be they original objects or microfilm: digitisation and conservation are part of the same workflow.
“..Although this is regrettable, the following study is intended to be exemplar of a new kind of conceptual history..”
So, history first. Fine with me. But can we do history and be curious about the processes by which the sources we use to construct that history are made?
“..When in the not-too-distant future the glitches in the software no longer cause these problems, the compilation of more secure data will be possible..”
Will it? I have little confidence that memory institutions or commercial providers have the capital required to re-OCR vast swathes of heritage: given how Eurocentric and bookish mass digitisation projects have been, the impetus to digitise collections that represent the diverse range of historical sources available to us will (and probably should) take precedence (see, for example, the wonderful Endangered Archives Programme). The phrase ‘glitches in the software’ also jars. An OCR process that outputs textual data that differs from the text a human can read on the same page isn’t a glitch, it is the software doing what it does. One piece of software may come along and do it ‘better’ than another, but there is no glitch: run the same software again and you’ll likely get the same results. These errors are not ‘glitches’ but patterns, and patterns we can work with…
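To make that last point concrete: because OCR errors are systematic, a search term can be expanded into its likely misrecognised variants and those variants searched for too. The sketch below is illustrative only – the substitution pairs are common early-modern examples I have assumed (the long ‘s’ read as ‘f’, ‘e’ blurring into ‘c’), not a documented list of what Gale’s software actually does.

```python
# Illustrative sketch: expand a search term into plausible OCR variants.
# The substitution table is an assumption for demonstration, not a record
# of ECCO's actual error profile.
from itertools import product

SUBSTITUTIONS = {
    "s": ["s", "f"],  # the long 's' is frequently read as 'f'
    "e": ["e", "c"],  # worn type can blur 'e' into 'c'
}

def ocr_variants(term):
    """Return every spelling of `term` under the substitution table."""
    options = [SUBSTITUTIONS.get(ch, [ch]) for ch in term]
    return {"".join(combo) for combo in product(*options)}

print(sorted(ocr_variants("rights")))  # → ['rightf', 'rights']
```

Searching for all variants of a term will over-retrieve (some variants are real words), but it turns the ‘pattern’ of errors into something a historian can interrogate rather than simply lament.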
“..But since I doubt that there will be significant changes to the profiles I have created for the concepts studied here, the revision of precise numerical values will be unlikely to lead to different conclusions..”
Given that de Bolla is counting documents that contain a word or phrase (rather than counting how many instances of a word or phrase occur in each and every document), I’m tempted to agree here. But of course you can never know how many ‘correct’ entries your search of OCRd text failed to return without reading all the texts manually (which, given the volume of texts we are talking about here, is likely to be impossible).
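The distinction at issue – documents containing a term at least once versus total occurrences of the term – can be sketched in a few lines. The three ‘documents’ below are invented for illustration; this is not de Bolla’s method, just the counting difference made explicit.

```python
# Document frequency vs term frequency, on invented example texts.
docs = [
    "the rights of man and the rights of woman",
    "a treatise on government",
    "natural rights considered",
]
term = "rights"

# Document frequency: how many documents contain the term at all.
doc_freq = sum(1 for d in docs if term in d.split())

# Term frequency: how many times the term occurs across all documents.
term_freq = sum(d.split().count(term) for d in docs)

print(doc_freq, term_freq)  # 2 documents, 3 occurrences
```

Document frequency is the more robust of the two under patchy OCR: a document miscounted from two hits to one still registers, whereas total occurrence counts absorb every individual recognition failure.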
“..I am, nevertheless, confident that at the time of carrying out the searches (for the most part in 2009-2010) all of the data are presented as accurate.”
de Bolla notes elsewhere that he was unable to get hold of the OCR text to work with locally, so his searches used the interface provided by Gale. I sympathise with him here: the paywall does stall our access to the archive (on which, see another Prescott). It is also unlikely that the interface and the search technology underlying it changed substantially during this period: providers tend towards conservatism, knowing that we historians don’t like change. But without their code being out in the open there is no sure way to know this. Accuracy, therefore, can in a strict sense only come from the historian having the data at a file level and querying that data through tools and software that they can control. Somehow, historians need to take control of the digital sources they use. This starts, in the UK at least, with us making better use of the recent(ish) text and data mining copyright exception (for more, see the Jisc advice on this). It continues with us historians taking seriously the possibility of working on the code that is the interface between us and our digitised sources.