
James Baker is a Lecturer in Digital History at the University of Sussex (and the awesome Sussex Humanities Lab). He is a historian of long eighteenth century Britain and a Software Sustainability Institute Fellow. He holds degrees from the University of Southampton and latterly the University of Kent, where in 2010 he completed his doctoral research on the late-Georgian satirical artist-engraver Isaac Cruikshank. As an eighteenth centuryist, his research interests include satirical art, the making and selling of printed objects, urban protest, and corpus analysis. His interests as a contemporary historian include the curation of personal digital archives, the critical examination of forensic software and captures, the use of born-digital archives in historical research, and scribing and archiving in the age of the hard disk. Prior to joining Sussex, James held positions as Digital Curator at the British Library and as Postdoctoral Fellow with the Paul Mellon Centre for Studies in British Art. He is a convenor of the Institute of Historical Research Digital History seminar and a member of the History Lab Plus Advisory Board.

OCRing history in the cloud: first impressions, next steps

Last week I decided to try Optical Character Recognition (OCR) for the first time. The context was the publication of a dataset as a book..

.. and the suggestion that the 2014 UK Text and Data Mining Exception might allow us to mine it, if only we can scan it effectively (that is, with the visible structure of a dictionary converted into machine-readable form).

At the same time, I decided this would be a good opportunity to revisit cloud processing. I’d looked at this some time back during some work with Microsoft Research, and had marvelled at Ian Milligan’s innovative use of the cloud to power historical research, but hadn’t had a practical application for it since. Then someone sent me a tutorial on ‘Setting Up a Simple OCR Server’ and the two things collided. This post is about how I got on.

  1. Cloud was easier

The sell of cloud computing is usually hooked around the benefits of how much stuff you can shove in the cloud and how much compute you can bring to bear on that stuff. But the benefit for me in this instance was that it was easier to configure. Using the tutorial, I tried to get Tesseract (the OCR engine) and various dependencies working on my Mac, but I ran into a traceback (that is, an error) I didn’t understand. I then tried to get it working on my Ubuntu 14.04 machine, which is configured to do a bunch of other useful stuff – like computational image analysis with Pastec – and I ran (again) into a traceback I didn’t understand. With DigitalOcean (my chosen cloud provider for this) I could for very little money ($5 a month) stand up a clean Ubuntu 14.04 install in minutes and get the whole thing configured. $5 a month only gets me a 512MB machine, but given that I’m not running a full desktop environment on it, it proved more than fast enough for the OCR testing I needed to do.

  2. Using the Tutorial

In terms of the tutorial itself, it worked a treat. But getting it configured still required me to draw on my knowledge of things like shell commands. The reason for this is that although the tutorial was only published in 2014, it suffers from the fact that the tech world moves quickly – in this case, the closure of Google Code in 2015. When I encountered problems relating to this I could read the tracebacks, and with a few tweaks I was able to point the wget command (a utility for downloading files) at the GitHub repository for Tesseract. So the setup went smoothly, but needed a little terminal configuration experience/expertise.

  3. The OCR Engine

Having configured all this, it is fair to say that the outputs weren’t great. The below image..

[image: sample3]

..outputs this:

Good font for the OCR\nD\ufb01\u00e9zufrfanr\ufb01zv r\ufb01e am:\nThe sml|l {mu far 001\n\nGood font size for OCR

This image..

[image: vertical_table]

..outputs this:

Plant pH 01 Average P\\anl\nGroup Soil Growth (cm)\n1 6.0 25.4\n2 6.2 33.0\n3 6.4 50.8\n4 6.6 533\n5 6.9 53.3\n6 7.0 30.5\n7 7.2 22.9

And this image..

[image: data-table_-_example]

..outputs this:

Name of cereal\n\nAmount of elemental iron\nfrom least to greatest\n\nCoco Puffs 3\nTotal 5\nCorn Pops 1\nCheerios 4\nFruit Loops 2

And this image..

[image: img-12-small580]

..outputs this:

Fnglond .ve:m11m.n ommm )’mm*c S\u2018p(Am mm\n1500 1 00 1 00 1 00 1 00 1 00 1 00\n1000 095 110 094 090 099 031\n1700 133 134 009 10x 090 0212;\n1750 H1 141 102 111 090 094\n1800 142 120 102 105 090 0:31\n\nFrvglemd NeIhrrI1mdx Gnmmmx Franco S[)(A1n nan\n1500 1.420 1.000 1220 1.310 1.450 1.000\n1000 13% 1 no 1150 1.300 1.440 1.300\n1700 1,390 2,150 1210 1.440 1.430 1.400\nmo 2,50 zzm 1,250 1500 1.100 1500\n1x00 2010 2.040 1250 1.400 1.300 1.300
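These outputs are raw Python strings, with newlines and non-ASCII glyphs left as escape sequences (`\ufb01`, for instance, is the ‘fi’ ligature Tesseract often emits). If you want to eyeball them as readable text, the standard library will decode them – a minimal sketch, using a fragment of the first output above:

```python
import codecs

# A fragment of the raw ocr.py output above, escape sequences and all.
raw = "D\\ufb01\\u00e9zufrfanr\\ufb01zv r\\ufb01e am:"

# "unicode_escape" turns the literal \n and \uXXXX sequences back into
# real newlines and characters, so the OCR output reads as text.
readable = codecs.decode(raw, "unicode_escape")
print(readable)
```

The mangled characters are still mangled, of course – decoding just makes the mangling easier to see.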

So not bad, but not good enough (and it gets worse the more complex the table gets..). This suggests to me that either Tesseract is a poor OCR engine for the job or that I need to do more preprocessing. On the latter, I should say that these examples are of outputs after I adapted ocr.py to include additional ImageFilter modules from Pillow, the friendly fork of the Python Imaging Library. These did make a difference, and as I’m no Python expert I suspect I need to learn more Python to get the best out of Pillow. For example, having tested the OCR engine with a variety of images, I suspect making the images bigger is an easy win. Alternatively, I could integrate a separate pre-processing step, for example this script recommended by Mario Klingemann. Either way, having got the engine working, it is clear that I now need to optimise it for the task at hand.
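To make the ‘bigger images’ idea concrete, here is a minimal Pillow sketch of the kind of pre-processing step I have in mind. The `preprocess` helper, its scale factor, and its threshold are illustrative choices of mine, not settled code or part of the tutorial’s ocr.py:

```python
from PIL import Image, ImageFilter

def preprocess(image, scale=2, threshold=140):
    """Return a cleaned-up copy of `image` for OCR (illustrative only)."""
    # Upscaling gives Tesseract more pixels per glyph to work with.
    image = image.resize(
        (image.width * scale, image.height * scale), Image.LANCZOS
    )
    # Grayscale, then sharpen the edges that upscaling may have softened.
    image = image.convert("L").filter(ImageFilter.SHARPEN)
    # Binarise: anything darker than `threshold` becomes black text.
    return image.point(lambda px: 0 if px < threshold else 255, mode="1")
```

The right scale and threshold will depend on the source scans, which is exactly the sort of tuning I mean by optimising the engine for the task at hand.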

  4. What Next

So far I have tested the software and infrastructure. I’m pleased with the results. The next steps are to a) keep testing the software; b) test hardware for digitising the texts I care about, that is any structured data locked away in books written by historians; and c) test the text and data mining exception, for this kind of work is best achieved by a team whose members might not all have the same research goals – something that perhaps doesn’t fit the spirit of the law.

The Sussex Humanities Lab has plans to purchase the hardware I need. I have the will to keep playing with the software side. And the legal aspect is ripe for testing. With this in mind, I’m hoping to put together a super practical event in the near future where we can gather some historians to test all this together. If you are interested in joining us, let me know.