20 February 2013

Making sense of the census

In preparing an article on any topic, there is often more good stuff than makes it into the final edit.
In the case of my recent “Joy of text” piece, about text analysis for Scientific Computing World, one of the outtakes was a prototype content-based image retrieval system framework developed for census searches by Kenton McHenry and his ISDA group at NCSA. The problem to be solved is computerised searching of large volumes of handwritten census returns.
A user inputs a handwritten query – I might, for instance, write "Grant". The system derives a numerical feature vector which describes that input, then seeks occurrences of similar vectors within the image database.
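To make the idea of a feature vector concrete, here is a minimal Python sketch. It is not the NCSA group's method: I am assuming a toy descriptor (the proportion of ink in each row and column of a small black-and-white image) and cosine similarity as the comparison, and the function names are my own inventions for illustration.

```python
import math

def feature_vector(image):
    """Describe a binary image (rows of 0/1) by its row and column ink profiles.

    Assumed toy descriptor, chosen only for illustration.
    """
    rows = [sum(r) for r in image]
    cols = [sum(c) for c in zip(*image)]
    total = sum(rows) or 1  # avoid dividing by zero on a blank image
    return [v / total for v in rows + cols]

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query_image, database, top_k=3):
    """Rank stored images (name -> image) by similarity to the query."""
    q = feature_vector(query_image)
    scored = [(cosine_similarity(q, feature_vector(img)), name)
              for name, img in database.items()]
    scored.sort(reverse=True)  # most similar first
    return scored[:top_k]
```

Calling `search(my_scribble, stored_images)` returns (similarity, name) pairs with the closest match first. A real system would need a descriptor far more tolerant of differences in handwriting than this one.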
The system is designed to self-validate, by recording which returned entities are selected by the user. I will not, for example, select false hits such as "Grand" or "Ghent", and the system will note which results I do or do not follow up. Over time, other Grants will make similar decisions and increase the system's confidence in selecting some hits for return and not others; gradually, those writing "Grant" as their query will see fewer and fewer offers of documents containing similar-looking words.
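One way such feedback might be put to work is sketched below in Python. This is my own assumption about the mechanics, not a description of the NCSA prototype: the class name, the smoothed selection rate used as a "confidence" and the cut-off of 0.2 are all hypothetical.

```python
class FeedbackRanker:
    """Hypothetical sketch: learn from which hits users follow up."""

    def __init__(self):
        self.shown = {}   # (query, result) -> times offered
        self.chosen = {}  # (query, result) -> times selected

    def record(self, query, results, selected):
        """Note which of the offered results the user selected."""
        for r in results:
            key = (query, r)
            self.shown[key] = self.shown.get(key, 0) + 1
            if r in selected:
                self.chosen[key] = self.chosen.get(key, 0) + 1

    def confidence(self, query, result):
        """Smoothed selection rate; 0.5 when there is no evidence yet."""
        key = (query, result)
        return (self.chosen.get(key, 0) + 1) / (self.shown.get(key, 0) + 2)

    def rerank(self, query, scored_results, min_confidence=0.2):
        """Weight similarity by confidence and drop hits users keep ignoring."""
        kept = []
        for result, similarity in scored_results:
            conf = self.confidence(query, result)
            if conf >= min_confidence:
                kept.append((similarity * conf, result))
        kept.sort(reverse=True)
        return [result for _, result in kept]
```

With no history, every hit has a confidence of 0.5 and results simply come back in order of similarity. After several users have chosen "Grant" and passed over "Grand" and "Ghent", the latter fall below the cut-off and stop being offered.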
The computational analysis behind all this proceeds in stages.
First, the lines and boxes on the census forms are used to carve up the content into image segments (for instance, surname will be in a box at the same location on each form and will become a data entity). Each segment is then converted into a numerical feature vector representing its appearance, and similar feature vectors are grouped hierarchically. Two million XSEDE (Extreme Science and Engineering Discovery Environment) CPU hours have been requested for initial record processing.
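The first of those steps, carving a form into segments, can be sketched in a few lines of Python. I am assuming here that the box positions are already known and identical on every form; the layout table and its coordinates are hypothetical, and the real system has to locate the printed lines and boxes itself.

```python
# Hypothetical layout: field name -> (top, left, bottom, right),
# measured in pixel rows and columns, with bottom and right exclusive.
FORM_LAYOUT = {
    "surname": (0, 0, 2, 3),
    "age": (0, 3, 2, 5),
}

def segment_form(form, layout=FORM_LAYOUT):
    """Cut a scanned form (a list of pixel rows) into one image per field."""
    segments = {}
    for field, (top, left, bottom, right) in layout.items():
        segments[field] = [row[left:right] for row in form[top:bottom]]
    return segments
```

Each returned segment is a small image in its own right, ready to be turned into a feature vector and stored.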
When the search query is entered, word spotting is used to compare its vector with those stored in the database, seeking matches within statistically defined limits of similarity. The search is not a blind one from the beginning of the database through seventy billion image segments to the end; the hierarchical grouping greatly reduces the number of entities which need to be compared.
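Why grouping saves work can be shown with a small Python sketch. I am substituting a simple two-level "leader" clustering for whatever hierarchy the NCSA group actually builds, and the function names, radius and limit are assumptions for illustration only.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_groups(vectors, radius):
    """Leader clustering: join the first group whose leader is within radius."""
    groups = []  # list of (leader, members)
    for v in vectors:
        for leader, members in groups:
            if distance(v, leader) <= radius:
                members.append(v)
                break
        else:
            groups.append((v, [v]))  # v starts a new group as its leader
    return groups

def grouped_search(query, groups, limit):
    """Compare with group leaders first, then only with the nearest group."""
    comparisons = 0
    best_members, best_distance = [], math.inf
    for leader, members in groups:
        comparisons += 1
        d = distance(query, leader)
        if d < best_distance:
            best_members, best_distance = members, d
    matches = []
    for v in best_members:
        comparisons += 1
        if distance(query, v) <= limit:
            matches.append(v)
    return matches, comparisons
```

With nine stored vectors in three groups of three, a query costs six comparisons (three leaders, then three members) rather than nine, and the saving grows as the database does. The price of this particular shortcut is that a match sitting just inside a neighbouring group can be missed.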
