In preparing an article on any topic, there is often
more good stuff than makes it into the final edit.
In the case of my recent “Joy of text” piece on text
analysis for Scientific Computing World, one of the outtakes
was a prototype framework for content-based image retrieval,
developed for census searches by Kenton
McHenry and his ISDA group at NCSA. The problem to be solved is
computerised searching of large volumes of handwritten census returns.
A user inputs a handwritten query – I might, for
instance, write "Grant". The system derives a numerical
feature vector which describes that input, then seeks occurrences of
similar vectors within the image database.
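As a rough illustration of that matching step, here is a minimal sketch in Python. It assumes each stored image segment has already been reduced to a short list of numbers and that similarity is measured by plain Euclidean distance; the vectors, the segment labels and the `find_similar` helper are all invented for illustration and are not taken from the NCSA system.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_similar(query_vec, database, max_distance):
    """Return (label, distance) pairs within max_distance, closest first."""
    hits = [(label, euclidean(query_vec, vec)) for label, vec in database.items()]
    hits = [(label, d) for label, d in hits if d <= max_distance]
    return sorted(hits, key=lambda hit: hit[1])

# Hypothetical three-number vectors standing in for real image features.
database = {
    "seg_001": [0.90, 0.10, 0.40],
    "seg_002": [0.20, 0.80, 0.50],
    "seg_003": [0.85, 0.15, 0.45],
}
query = [0.88, 0.12, 0.42]

results = find_similar(query, database, max_distance=0.1)
for label, d in results:
    print(label, round(d, 3))
```

With these made-up numbers, `seg_001` and `seg_003` fall inside the distance limit and `seg_002` does not. A real system would use far longer vectors and a more carefully chosen similarity measure.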
The system is designed to self-validate, by recording
which returned entities are selected by the user. I will not, for
example, select false hits such as "Grand" or "Ghent"
and the system will note which results I do or do not follow up. Over
time, other Grants will make similar decisions and increase the
system's confidence in selecting some hits for return and not others;
gradually, those writing “Grant” as their query will be offered fewer
and fewer documents containing words that merely look similar.
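One simple way to picture this feedback loop is a tally of how often each hit is shown and how often it is followed up. The sketch below is an assumption about how such a tally might work, not a description of the actual system: the `FeedbackLog` class and its smoothed selection rate are hypothetical.

```python
from collections import defaultdict

class FeedbackLog:
    """Counts how often each (query, hit) pair is shown and selected."""

    def __init__(self):
        self.shown = defaultdict(int)
        self.selected = defaultdict(int)

    def record(self, query, hit, was_selected):
        key = (query, hit)
        self.shown[key] += 1
        if was_selected:
            self.selected[key] += 1

    def confidence(self, query, hit):
        # Smoothed selection rate: a hit with no history scores 0.5,
        # and the score moves towards 0 or 1 as evidence accumulates.
        key = (query, hit)
        return (self.selected.get(key, 0) + 1) / (self.shown.get(key, 0) + 2)

log = FeedbackLog()
for _ in range(3):
    log.record("Grant", "Grant", was_selected=True)   # followed up
    log.record("Grant", "Grand", was_selected=False)  # ignored

print(log.confidence("Grant", "Grant"))  # 4/5
print(log.confidence("Grant", "Grand"))  # 1/5
print(log.confidence("Grant", "Ghent"))  # no history yet: 1/2
```

Hits whose score falls below some chosen cut-off could then be dropped from future result lists for the same query.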
The analytic process behind all this proceeds in
stages.
First, the lines and boxes on the census forms are used
to carve up the content into image segments (for instance, surname
will be in a box at the same location on each form and will become a
data entity). Each segment is then converted into a numerical feature
vector representing its appearance, and similar feature vectors are
grouped hierarchically. Two million XSEDE (Extreme Science and
Engineering Discovery Environment) CPU hours have been requested for
initial record processing.
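The first two stages can be sketched on a toy scale. The example below assumes a form is a small grid of 0s (paper) and 1s (ink), that the surname box sits at a fixed, known position, and that a segment's appearance can be summarised by the fraction of inked pixels in each column. The box coordinates and the `profile_features` descriptor are hypothetical stand-ins; the features used by the real system are not described here.

```python
def crop(image, top, left, height, width):
    """Cut a rectangular box out of a 2-D grid of pixel values."""
    return [row[left:left + width] for row in image[top:top + height]]

def profile_features(segment):
    """Crude appearance descriptor: fraction of inked pixels per column."""
    height = len(segment)
    width = len(segment[0])
    return [sum(row[c] for row in segment) / height for c in range(width)]

# A tiny made-up "form": 1 = ink, 0 = blank paper.
form = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

# Hypothetical box location (top, left, height, width), same on every form.
SURNAME_BOX = (1, 1, 2, 4)

segment = crop(form, *SURNAME_BOX)
features = profile_features(segment)
print(segment)
print(features)
```

Because the box location is the same on every form, the same crop can be applied to each scanned page, giving one feature vector per surname entry ready for grouping.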
When the search query is entered, word spotting is used
to compare its vector with those stored in the database, seeking
matches within statistically defined limits of similarity. The search
is not a blind one from the beginning of the database through seventy
billion image segments to the end; the hierarchical grouping guides
the search and greatly reduces the number of entities which need to be
compared.
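To see why grouping helps, consider a two-level sketch: compare the query with one representative vector per group, then look only inside the closest group. The code assumes groups are stored as a centroid plus member vectors and uses Euclidean distance; the data and the `hierarchical_search` function are illustrative assumptions, and a real hierarchy would have many more levels.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hierarchical_search(query, clusters):
    """Find the closest member of the closest cluster.

    clusters is a list of (centroid, members) pairs, where members is a
    list of (label, vector) pairs. Returns (best_label, comparisons).
    """
    comparisons = 0

    # Level 1: compare the query with each cluster centroid only.
    best_members, best_d = None, math.inf
    for centroid, members in clusters:
        comparisons += 1
        d = distance(query, centroid)
        if d < best_d:
            best_members, best_d = members, d

    # Level 2: compare the query with members of the chosen cluster.
    best_label, best_d = None, math.inf
    for label, vec in best_members:
        comparisons += 1
        d = distance(query, vec)
        if d < best_d:
            best_label, best_d = label, d

    return best_label, comparisons

# Nine made-up segments in three groups.
clusters = [
    ([0.0, 0.0], [("a1", [0.1, 0.0]), ("a2", [0.0, 0.1]), ("a3", [-0.1, 0.0])]),
    ([5.0, 5.0], [("b1", [5.1, 5.0]), ("b2", [5.0, 4.9]), ("b3", [4.8, 5.2])]),
    ([10.0, 0.0], [("c1", [10.1, 0.0]), ("c2", [9.9, 0.1]), ("c3", [10.0, -0.2])]),
]

label, comparisons = hierarchical_search([5.0, 4.95], clusters)
print(label, comparisons)
```

Here six comparisons replace the nine a blind scan would need; the saving grows rapidly as the database and the depth of the hierarchy increase. The trade-off is that a match sitting just over a group boundary can be missed, so this kind of search is approximate.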