OCRing history in the cloud: first impressions, next steps

Last week I decided to try Optical Character Recognition (OCR) for the first time. The context was the publication of a dataset as a book..

.. and the suggestion that the 2014 UK Text and Data Mining Exception might allow us to mine it, if only we could scan it effectively (that is, with the visible structure of a dictionary converted into machine-readable form).

At the same time, I decided this would be a good opportunity to revisit cloud processing. I’d looked at this a while back during some work with Microsoft Research, and had marvelled at Ian Milligan’s innovative use of the cloud to power historical research, but I hadn’t had a practical application for it since. Then someone sent me a tutorial on ‘Setting Up a Simple OCR Server‘ and the two things collided. This post is about how I got on.

  1. Cloud was easier

The sell of cloud computing is usually hooked around how much stuff you can shove in the cloud and how much compute you can bring to bear on that stuff. But the benefit for me in this instance was that it was easier to configure. Using the tutorial, I tried to get Tesseract (the OCR engine) and its various dependencies working on my Mac, but I ran into a traceback (that is, an error) I didn’t understand. I then tried to get it working on my Ubuntu 14.04 machine, which is configured to do a bunch of other useful stuff – like computational image analysis with Pastec – and again I ran into a traceback I didn’t understand. With DigitalOcean (my chosen cloud provider for this) I could, for very little money ($5 a month), stand up a clean Ubuntu 14.04 install in minutes and get the whole thing configured. $5 a month only gets me a 512MB machine, but given that it runs nothing beyond the OCR stack, it proved more than fast enough for the testing I needed to do.
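
For anyone retracing my steps, the build boiled down to something like the following (package names are from memory, so treat this as a sketch rather than a recipe):

    # Rough shape of the build on a fresh Ubuntu 14.04 droplet.
    # Package names may need adjusting for your release.
    sudo apt-get update
    sudo apt-get install -y autoconf automake libtool g++ \
        libpng-dev libjpeg-dev libtiff-dev zlib1g-dev \
        python-dev python-pip

    # Leptonica and then Tesseract are built from source, roughly
    # ./configure && make && sudo make install for each, after which
    # the Python pieces the OCR server itself needs go on with pip:
    sudo pip install Flask Pillow pytesseract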

  2. Using the Tutorial

In terms of the tutorial itself, it worked a treat. But getting it configured still required me to draw on my knowledge of things like shell commands. The reason is that although the tutorial was only published in 2014, it already suffers from the fact that the tech world moves quickly, in this case the closure of Google Code in 2015. When I encountered problems relating to this I could read the tracebacks, and with a few tweaks I was able to point the wget command (a utility for downloading files) at the GitHub repository for Tesseract, along the lines shown below. So the setup went smoothly, but it needed a little terminal configuration experience/expertise.
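
For anyone who hits the same dead links, the tweak was to swap the tutorial's Google Code URL for a GitHub one. The exact URL below is illustrative (check the tesseract-ocr organisation on GitHub for the current home of the language data), as is the target directory, which is where a source-built Tesseract looks by default:

    # The tutorial's wget pointed at the now-closed Google Code downloads;
    # the English language data has moved to the tesseract-ocr GitHub account.
    wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
    sudo mv eng.traineddata /usr/local/share/tessdata/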

  3. The OCR Engine

Having configured all this, it is fair to say that the outputs weren’t great. The image below..

[image: sample3]

..outputs this:

Good font for the OCR\nD\ufb01\u00e9zufrfanr\ufb01zv r\ufb01e am:\nThe sml|l {mu far 001\n\nGood font size for OCR

This image..

[image: vertical_table]

..outputs this:

Plant pH 01 Average P\\anl\nGroup Soil Growth (cm)\n1 6.0 25.4\n2 6.2 33.0\n3 6.4 50.8\n4 6.6 533\n5 6.9 53.3\n6 7.0 30.5\n7 7.2 22.9

And this image..

[image: data-table_-_example]

..outputs this:

Name of cereal\n\nAmount of elemental iron\nfrom least to greatest\n\nCoco Puffs 3\nTotal 5\nCorn Pops 1\nCheerios 4\nFruit Loops 2

And this image..

[image: img-12-small580]

..outputs this:

Fnglond .ve:m11m.n ommm )’mm*c S\u2018p(Am mm\n1500 1 00 1 00 1 00 1 00 1 00 1 00\n1000 095 110 094 090 099 031\n1700 133 134 009 10x 090 0212;\n1750 H1 141 102 111 090 094\n1800 142 120 102 105 090 0:31\n\nFrvglemd NeIhrrI1mdx Gnmmmx Franco S[)(A1n nan\n1500 1.420 1.000 1220 1.310 1.450 1.000\n1000 13% 1 no 1150 1.300 1.440 1.300\n1700 1,390 2,150 1210 1.440 1.430 1.400\nmo 2,50 zzm 1,250 1500 1.100 1500\n1x00 2010 2.040 1250 1.400 1.300 1.300

So not bad, but not good enough (and it gets worse the more complex the tables get..). This suggests to me that perhaps Tesseract is a poor OCR engine for the job, or that I need to do more preprocessing. On the latter, I should say that these examples are outputs produced after I adapted ocr.py to include additional filters from the ImageFilter module of Pillow, the fork of the Python Imaging Library. These did make a difference, and as I’m no Python expert I suspect I need to learn more to get the best out of Pillow. For example, having tested the OCR engine with a variety of images, I suspect making the images bigger is an easy win (a sketch of the kind of preprocessing I mean follows below). Alternatively, I could integrate a separate pre-processing step, for example this script recommended by Mario Klingemann. Either way, having got the engine working, it is clear that I now need to optimise it for the task at hand.
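
To give a flavour of that preprocessing, here is a minimal sketch: greyscale, upscale, and sharpen an image before handing it to Tesseract. It assumes the pytesseract/Pillow stack the tutorial uses; the function name, scale factor, and filter choice are mine for illustration, not settled values:

    # A minimal preprocessing sketch: enlarge and sharpen before OCR.
    # Assumes Pillow and pytesseract are installed and a tesseract
    # binary is on the PATH.
    from PIL import Image, ImageFilter
    import pytesseract

    def ocr_with_preprocessing(path, scale=2):
        image = Image.open(path).convert('L')  # greyscale: colour adds nothing here

        # The suspected easy win: bigger glyphs give Tesseract more to work with
        width, height = image.size
        image = image.resize((width * scale, height * scale), Image.LANCZOS)

        # One of the ImageFilter passes of the sort I added to ocr.py
        image = image.filter(ImageFilter.SHARPEN)

        return pytesseract.image_to_string(image)

    print(ocr_with_preprocessing('sample3.jpg'))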

  4. What Next

So far I have tested the software and infrastructure, and I’m pleased with the results. The next steps are to: a) keep testing the software; b) test hardware for digitising the texts I care about, that is, any structured data locked away in books written by historians; and c) test the text and data mining exception, for this kind of work is best achieved by a team whose members might not all share the same research goals, something that perhaps doesn’t fit the spirit of the law.

The Sussex Humanities Lab has plans to purchase the hardware I need. I have the will to keep playing with the software side. And the legal aspect is ripe for testing. With this in mind, I’m hoping to put together a super practical event in the near future where we can gather some historians to test all this together. If you are interested in joining us, let me know.

5 thoughts on “OCRing history in the cloud: first impressions, next steps”

    1. I did. It failed somewhere around building tesseract with `make` (I think). I’ve mostly given up on homebrew for complex installs and generally prefer to just use an Ubuntu machine, physical or cloud.

      1. Have to say that I’ve had really good luck with homebrew, including tesseract.

        That said, doesn’t homebrew install binaries, so you don’t have to ‘make’ anything?

      2. Huh. No idea to be honest. Lots of other stuff works well with homebrew, I just tend to find Linux easier.
