All posts by jwbaker

James Baker is a Lecturer in Digital History at the University of Sussex (and the awesome Sussex Humanities Lab). He is a historian of long eighteenth century Britain and a Software Sustainability Institute Fellow. He holds degrees from the University of Southampton and latterly the University of Kent, where in 2010 he completed his doctoral research on the late-Georgian satirical artist-engraver Isaac Cruikshank. As an eighteenth centuryist, his research interests include satirical art, the making and selling of printed objects, urban protest, and corpus analysis. His contemporary historian interests include the curation of personal digital archives, the critical examination of forensic software and captures, the use of born-digital archives in historical research, and scribing and archiving in the age of the hard disk. Prior to joining Sussex, James held positions as Digital Curator at the British Library and Postdoctoral Fellow with the Paul Mellon Centre for Studies of British Art. He is a convenor of the Institute of Historical Research Digital History seminar and a member of the History Lab Plus Advisory Board.

Dewizardification

Yesterday I tweeted..

and..

I was asked by Thomas Padilla to write a blog about it because..

and..

So here it is. It isn’t very interesting. You’ve been warned.

On Thursday (22 June) I revived a backup of an old cloud instance (I use DigitalOcean) that I had spun up in November to test using Tesseract (an Optical Character Recognition engine) to extract text/data from tables. I ran through the steps in my notes – I keep dated .md files of notes and irregular bash history backups..

history > DATE_bash-log.txt

..because otherwise I just forget! – I hit go, and it didn’t work. I looked at the traceback and didn’t understand it. I hit go again. It didn’t work. I looked up the traceback (something about the OCR Server I’d set up not working), didn’t understand what I found, ran it again, it didn’t work again.
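A minimal sketch of that note-keeping habit, assuming bash. The function names `log_name` and `backup_history` are made up for illustration, and the date format is a guess at what `DATE` stands for above:

```shell
# Build a file name such as YYYY-MM-DD_bash-log.txt from today's date.
log_name() {
  printf '%s_bash-log.txt' "$(date +%Y-%m-%d)"
}

# Dump the current session's command history into the dated file.
# (bash only; meant to be called from an interactive session, since a
# non-interactive script normally has no history to write.)
backup_history() {
  history > "$(log_name)"
}
```

With these two functions defined in `~/.bashrc`, typing `backup_history` at the end of a session would leave a dated log next to the notes.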

I then killed the whole cloud instance. Revived it from scratch. Failed to get into the instance because of a traceback around clearing keys. As I didn’t know what that was I searched around, figured it out, cleared the keys, got in, installed all the software again from scratch. Ran it again. It didn’t work again. Same traceback. I then faffed around, searched online, periodically ran it again, failed, gave up, did something else.

On Monday (26 June) I woke full of optimism. It was a new week, I thought. Maybe I’d just done something wrong on Thursday. So, I killed the whole cloud instance again. Revived it from scratch. Installed all the software again from scratch. Ran it again. It failed again. Same traceback.

I then did some other stuff: I had some reading to do, a review to write.

Late-afternoon I returned to my OCR task. I stumbled across a way of installing Tesseract using Docker. I didn’t really know what Docker was, nor did I have a use case that made it worth spending time figuring it out, but I knew it was a way of installing software without worrying too much about dependencies (which are a pain) and without having to use a virtual machine (which is processor heavy). It also seemed fashionable, buzzy, the sort of software about which people say “you should try it” when it comes up. Plus Nora and Ben were going to a conference about Docker the next day and were tweeting about it a bit. So I decided to have a go.

As it happens it was really easy (the documentation for Docker was great). I had an OCR engine in the cloud in no time. And it came with some scripts I understood well enough to amend them to point at different files. I tested it for a bit. It worked. This was great.

And then I decided I wanted to OCR more than one file at a time. This shouldn’t, I thought, be too hard to implement. I wasn’t super confident about integrating loops into a shell script, but I’d written simple loops before and thought it worth learning by trying. So I played around a bit and failed. Then I remembered that The Sourcecaster – something I’d worked on (with Thomas as it happens) – had a script for OCRing multiple files using Tesseract. I tried to build that syntax into my script. It failed. So I asked on Twitter. I got some feedback (thanks Ben and Ben!). I tried to implement that feedback the next day (27 June). It failed. I had something else to do. I posted the scripts. I gave up for now.

Yesterday (28 June) I asked a colleague (another Ben) if they could help point me in the right direction. They were very helpful. They prodded my Tesseract/Docker install, failed a bit, then came up with an alternative solution. “Why not”, they said, “just install Tesseract and run a simple three line script in the folder that contains the images you want to OCR?” Something like:

for file in *.tif;
do tesseract "$file" "${file%.*}" -l eng -psm 1 --oem 2 txt;
done

So I tried that. And it failed with a ‘permission denied’ error. I tried prepending sudo. That didn’t work. I got confused. Searched around a bit. Found a solution that reminded me that you need to make new .sh files executable before you run them (I’d had this problem before and forgotten all about it). I fixed this and ran the script again. This time it failed because Tesseract itself wasn’t installed on the instance, only inside Docker (at least that is what I think was going on). Then I quickly searched around and found an install guide for Tesseract on Ubuntu that looked very similar to something I’d seen before, so decided to use it, but then it failed near the end when attempting to make Tesseract. Then I asked my colleague again and they pointed out that all I needed to do was something like:

sudo apt-get update
sudo apt-get install tesseract-ocr

Which worked. I then ran the simple script. And it worked. Which meant that after all that effort all I actually needed was two lines of code to install something and a three line shell script to process the files.
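For anyone retracing these steps, here is a slightly more defensive sketch of the same loop. It is not the script that was actually run: the function name `ocr_folder` is hypothetical, the page-segmentation and engine flags are left out to keep the call to the basics, and it assumes a POSIX shell. It guards against the two failures described above that a script can catch itself: Tesseract not being installed, and a folder with no .tif files in it.

```shell
# Run Tesseract over every .tif in a folder (defaults to the current one).
ocr_folder() {
  dir="${1:-.}"

  # Stop early with a readable message if Tesseract is missing.
  if ! command -v tesseract >/dev/null 2>&1; then
    echo "tesseract not found; try: sudo apt-get install tesseract-ocr" >&2
    return 1
  fi

  for file in "$dir"/*.tif; do
    # With no matches the glob stays literal, so skip names that do not exist.
    [ -e "$file" ] || continue
    # Output base is the image path minus its extension; Tesseract adds .txt.
    tesseract "$file" "${file%.*}" -l eng
  done
}
```

Saved in a file of its own (say `ocr-all.sh`, again a made-up name) with a final line calling `ocr_folder`, the script still needs `chmod +x ocr-all.sh` before `./ocr-all.sh` will run; that missing execute bit is the usual cause of ‘permission denied’ on a new script. Running it as `bash ocr-all.sh` sidesteps the execute bit altogether.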

Now, I could kick myself for this. But I won’t. I learnt a huge amount along the way. I now get Docker and can see where it could have value in future work. I am more proficient than ever at setting up cloud instances (and indeed, use them for lots of tasks). And I have extended the limits of what I can do with shell scripting. None of the time was wasted (and I got everything else done I had intended to get done on those days, so nothing was put back).

If we don’t – as Thomas implies – tell these sorts of stories enough, it is, first of all, because they are boring and mundane. But there is probably also some pride involved. In the safe space of a hack day, these kinds of failures are normal, they are expected: indeed, in many respects, the point of going to a hack day is to learn through collaboration, to fail in a supportive environment.

But Thomas is also right to say that we do need to tell these stories. I’ve worked on some shiny and complex-looking technical projects with well-respected DH people. I run things, convene things, and am a fellow of things that confer on me some kind of technical authority. I am a historian who is part of a Lab. And I write about my successful use of data and computers to inform my research. But I am not a wizard. These ‘powers’ are not innate. I had to learn them through hard work. And more often than not even hard work isn’t enough. I fail. I ask other people for help. I fail again. And then I compromise on something that does the job imperfectly, not quite as I’d originally hoped. This is the default mode of my ‘digital’ history/archives/curatorial/library/humanities work.

At the 2016 Software Sustainability Institute Collaborations Workshop, Robert Davey coined (for me at least) the term “Dewizardification”.

He suggested that researchers who work with tech know they aren’t wizards and that they need to tell people that more often because it makes what we do more approachable, it makes it seem more achievable.

And he is right. Let’s all try to make #dewizardification a thing. Let’s spend more time telling our stories of failing with software, code, tech. Let’s ensure fellow researchers looking to do ‘digital’ things know we aren’t wizards, just people willing to fail, persist, fail, ask for help, fail, compromise, and fail before we eventually – partially – succeed.