Preparing
an article on text analysis for the upcoming issue of Scientific
Computing World (watch this space), I've been revisiting the various
ways of getting physical text into digital form.
Optical
Character Recognition (OCR) is the workhorse of text transcription
and, while we all grumble about its very real shortcomings, it really does a
remarkably good job, most of the time, of rendering graphic images of printed fonts into
digitised text for analysis.
Even at
the lowliest manual level, and with all its admitted faults, OCR is a
useful tool. A colleague and I recently had to add a six-hundred-and-fifty-page
eighteenth-century text to a digitised textbase for
analysis. The rare and valuable paper original which we located was
in a library, and could not be removed. Filing a request for the
digitisation to be carried out would take weeks. With the consent of
the library, we used a smartphone, a ten-year-old copy of ABBYY
FineReader 5 (now in release 11, and correspondingly more developed,
as part of a software range for different text tasks) and a netbook.
Even allowing for manual error correction, we had our validated data
within three hours and the library added a copy to its own digital
records. A similarly sized text already available as a graphic-only PDF was transferred more quickly still.