All posts by jwbaker

James Baker is Director of Digital Humanities at the University of Southampton. James is a Software Sustainability Institute Fellow, a Fellow of the Royal Historical Society, and holds degrees from the University of Southampton and latterly the University of Kent, where in 2010 he completed his doctoral research on the late-Georgian artist-engraver Isaac Cruikshank. James works at the intersection of history, cultural heritage, and digital technologies. He is currently working on a history of knowledge organisation in twentieth-century Britain. In 2021, he begins a major new Arts and Humanities Research Council funded project, 'Beyond Notability: Re-evaluating Women’s Work in Archaeology, History and Heritage, 1870 – 1950'. Previous externally funded research projects have focused on legacy descriptions of art objects ('Legacies of Catalogue Descriptions and Curatorial Voice: Opportunities for Digital Scholarship', Arts and Humanities Research Council), the preservation of intangible cultural heritage ('Coptic Culture Conservation Collective', British Council, and 'Heritage Repertoires for inclusive and sustainable development', British Academy), the born digital archival record ('Digital Forensics in the Historical Humanities', European Commission), and decolonial futures for museum collections ('Making African Connections: Decolonial Futures for Colonial Collections', Arts and Humanities Research Council). Prior to joining Southampton, James held positions of Senior Lecturer in Digital History and Archives at the University of Sussex and Director of the Sussex Humanities Lab, Digital Curator at the British Library, and Postdoctoral Fellow with the Paul Mellon Centre for Studies in British Art. He is a member of the Arts and Humanities Research Council Peer Review College, a convenor of the Institute of Historical Research Digital History seminar, a member of The Programming Historian Editorial Board and a Director of ProgHist Ltd (Company Number 12192946), and an International Advisory Board Member of British Art Studies.

Dewizardification

Yesterday I tweeted..

and..

I was asked by Thomas Padilla to write a blog about it because..

and..

So here it is. It isn’t very interesting. You’ve been warned.

On Thursday (22 June) I revived a backup of an old cloud instance (I use DigitalOcean) that I had spun up in November to test using Tesseract (an Optical Character Recognition engine) to extract text/data from tables. I ran through the steps in my notes – I keep dated .md files of notes and irregular bash history backups..

history > DATE_bash-log.txt

..because otherwise I just forget! – I hit go, and it didn’t work. I looked at the traceback and didn’t understand it. I hit go again. It didn’t work. I looked up the traceback (something about the OCR Server I’d set up not working), didn’t understand what I found, ran it again, it didn’t work again.
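Incidentally, DATE in that one-liner is a placeholder. If you want the shell to fill it in for you, something like the line below should do it – run it at an interactive prompt, since a script starts with its own, empty history:

# same backup, with today's date stamped automatically
history > "$(date +%F)_bash-log.txt"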

I then killed the whole cloud instance. Revived it from scratch. Failed to get into the instance because of a traceback around clearing keys. As I didn’t know what that was I searched around, figured it out, cleared the keys, got in, installed all the software again from scratch. Ran it again. It didn’t work again. Same traceback. I then faffed around, searched online, periodically ran it again, failed, gave up, did something else.
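For what it’s worth, I suspect the keys business was the familiar warning you get when a rebuilt instance comes back with new SSH host keys and your machine still remembers the old ones. If that is what it was, the fix is a one-liner – the hostname here is a placeholder:

# forget the stale host key for the rebuilt instance
ssh-keygen -R YOUR_INSTANCE_IP

which removes the old entry from ~/.ssh/known_hosts.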

On Monday (26 June) I woke full of optimism. It was a new week, I thought. Maybe I’d just done something wrong on Thursday. So, I killed the whole cloud instance again. Revived it from scratch. Installed all the software again from scratch. Ran it again. It failed again. Same traceback.

I then did some other stuff: I had some reading to do, a review to write.

Late-afternoon I returned to my OCR task. I stumbled across a way of installing Tesseract using Docker. I didn’t really know what Docker was, nor did I have a use case that made it worth spending time to figure out, but I knew it was a way of installing software without worrying too much about dependencies (which are a pain) and without having to use a virtual machine (which is processor heavy). It also seemed fashionable, buzzy, the sort of software that people say “you should try it” about when it comes up. Plus Nora and Ben were going to a conference about Docker the next day and were tweeting about it a bit. So I decided to have a go.

As it happens it was really easy (the documentation for Docker was great). I had an OCR engine in the cloud in no time. And it came with some scripts I could understand enough of so that I could amend them to point at different files. I tested it for a bit. It worked. This was great.
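I won’t reproduce the exact image and scripts here, but the general shape of running Tesseract through Docker is something like the below – IMAGE is a placeholder for whichever Tesseract image you pull from Docker Hub:

# fetch the image, then mount the current folder into the container and OCR one page
docker pull IMAGE
docker run --rm -v "$PWD":/data IMAGE tesseract /data/page.tif /data/page -l eng

The -v flag is the important bit: it mounts your working folder into the container at /data, so the container can see your images and you can see its output.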

And then I decided I wanted to OCR more than one file at a time. This shouldn’t, I thought, be too hard to implement. I wasn’t super confident about integrating loops into a shell script, but I’d written simple loops before and thought it worth learning by trying. So I played around a bit and failed. Then I remembered that The Sourcecaster – something I’d worked on (with Thomas as it happens) – had a script for OCRing multiple files using Tesseract. I tried to build that syntax into my script. It failed. So I asked on Twitter. I got some feedback (thanks Ben and Ben!). I tried to implement that feedback the next day (27 June). It failed. I had something else to do. I posted the scripts. I gave up for now.

Yesterday (28 June) I asked a colleague (another Ben) if they could help point me in the right direction. They were very helpful. They prodded my Tesseract/Docker install, failed a bit, then came up with an alternative solution. “Why not”, they said, “just install Tesseract and run a simple three-line script in the folder that contains the images you want to OCR”. Something like:

# OCR every .tif in the current folder, writing a matching .txt file alongside each image
for file in *.tif;
do tesseract "$file" "${file%.*}" -l eng --psm 1 --oem 2 txt;
done
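For the curious – and assuming Tesseract 4 or later, since --oem only arrived with that release – -l eng sets the language, --psm 1 asks for automatic page segmentation with orientation detection, --oem 2 runs both the legacy and the newer LSTM recognition engines, and the trailing txt config writes plain-text output alongside each image.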

So I tried that. And it failed with an error ‘permission denied’. I tried prepending sudo. That didn’t work. I got confused. Searched around a bit. Found a solution that reminded me that you need to make new .sh files executable before you run them (I’d had this problem before and forgotten all about it). I fixed this and ran the script again. This time it failed because Tesseract wasn’t installed locally, only through Docker (at least that is what I think was going on). Then I quickly searched around and found an install guide for Tesseract on Ubuntu that looked very similar to something I’d seen before, so decided to use it, but then it failed near the end when attempting to build Tesseract from source. Then I asked my colleague again and they pointed out that all I needed to do was something like:

sudo apt-get update
sudo apt-get install tesseract-ocr

Which worked. I then ran the simple script. And it worked. Which meant that, after all that effort, all I actually needed was two lines to install the software and a three-line shell script to process the files.
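And in case it helps anyone else, the ‘permission denied’ fix from earlier also boils down to two lines – the script name here is made up:

# mark the script as executable, then run it
chmod +x ocr-batch.sh
./ocr-batch.sh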

Now, I could kick myself for this. But I won’t. I learnt a huge amount along the way. I now get Docker and can see where it could have value in future work. I am more proficient than ever at setting up cloud instances (and indeed, use them for lots of tasks). And I have extended the limits of what I can do with shell scripting. None of the time was wasted (and I got everything else done I had intended to get done on those days, so nothing was put back).

If we don’t – as Thomas implies – tell these sorts of stories enough it is, first of all, because they are boring and mundane. But there is probably also some pride involved. In the safe space of a hack day, these kinds of failures are normal, they are expected: indeed, in many respects, the point of going to a hack day is to learn through collaboration, to fail in a supportive environment.

But Thomas is also right to say that we do need to tell these stories. I’ve worked on some shiny and complex-looking technical projects with well-respected DH people. I run things, convene things, and am a fellow of things that confer on me some kind of technical authority. I am a historian who is part of a Lab. And I write about my successful use of data and computers to inform my research. But I am not a wizard. These ‘powers’ are not innate. I had to learn them through hard work. And more often than not even hard work isn’t enough. I fail. I ask other people for help. I fail again. And then I compromise on something that does the job imperfectly, not quite as I’d originally hoped. This is the default mode of my ‘digital’ history/archives/curatorial/library/humanities work.

At the 2016 Software Sustainability Institute Collaborations Workshop, Robert Davey coined (for me at least) the term “Dewizardification”.

He suggested that researchers who work with tech know they aren’t wizards and that they need to tell people that more often because it makes what we do more approachable, it makes it seem more achievable.

And he is right. Let’s all try to make #dewizardification a thing. Let’s spend more time telling our stories of failing with software, code, tech. Let’s ensure fellow researchers looking to do ‘digital’ things know we aren’t wizards, just people willing to fail, persist, fail, ask for help, fail, compromise, and fail before we eventually – partially – succeed.