20 January 2013

Ode to OCR

While preparing an article on text analysis for the upcoming issue of Scientific Computing World (watch this space), I've been revisiting the various ways of getting physical text into digital form.
Optical Character Recognition (OCR) is the workhorse of text transcription and, while we all grumble about its very real shortcomings, it really does a remarkably good job, most of the time, of rendering graphic images of printed fonts into digitised text for analysis.
Even at the lowliest manual level, and with all its admitted faults, OCR is a useful tool. A colleague and I recently had to add a 650-page eighteenth-century text to a digitised textbase for analysis. The rare and valuable paper original, which we located in a library, could not be removed, and filing a request for the digitisation to be carried out would have taken weeks. With the library's consent, we used a smartphone, a ten-year-old copy of ABBYY FineReader 5 (now in release 11, and correspondingly more developed, as part of a software range for different text tasks) and a netbook. Even allowing for manual error correction, we had our validated data within three hours, and the library added a copy to its own digital records. A similarly sized text already available as a graphic-only PDF was transferred more quickly still.
