21 February 2013

Omnilingual word spotting

In yesterday's outtake from my recent “Joy of text” piece in Scientific Computing World, I mentioned in passing the term “word spotting”.
Word spotting looks for visually similar discrete components within a text, and classifies them using statistical comparisons. In approximate human terms, it is treating words (or combinations of words) as ideograms rather than as phonetic constructs.
Word spotting is not limited to written or printed material, though that is the context with which I'm concerned here: it also applies, for example, to speech recognition. Nor, intriguingly, is it necessarily limited to words of known meaning; it can equally well be applied to semantic units of entirely unknown signification. It could, as an extreme example, be used to analyse the manuscripts from a lost extraterrestrial civilisation in H Beam Piper's classic science fiction story Omnilingual.
Reverse the Omnilingual example to imagine a hypothetical extraterrestrial archaeologist trying to study post-apocalyptic remains of our own cultures. It will be obvious from context that certain signification units are associated with the physical sciences: "volts" on electrical signs and appliances, just to pick one.
Our xenoarchaeologist (who may not have alphabetic scription systems, or even vocalisation, and certainly cannot assume that each letter represents a sound) looks at a mass of textual matter whose content, subject matter, purpose and reliability are unknowable. Where should attention be concentrated? The only certain knowledge is that words are identifiable visual entities found in isolation and occurring with spatial separation in books.
Word spotting, with no semantic assumptions, quickly shows that some books contain many instances of the visually related signifiers "volts", "volt", "voltage", "voltmeter" and so on, suggesting that those sections contain material related to electricity; others do not. There will, of course, be false hits such as "Voltaire" and "revolt", but as one discriminator amongst others in a multiple sieving process it would nevertheless be invaluable.
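As a rough illustration of that sieve, here is a minimal Python sketch. It is not real word spotting, which compares the shapes of word images: it assumes the text has already been transcribed and uses a shared run of characters as a crude stand-in for visual similarity. The function names spot and hit_density, and the stem parameter, are my own inventions for the purpose.

```python
import re
from collections import Counter


def tokens_of(text):
    """Split text into lower-case alphabetic tokens (digits and punctuation dropped)."""
    return re.findall(r"[A-Za-z]+", text.lower())


def spot(text, stem="volt"):
    """Count tokens containing the stem.

    Assumption: sharing a character run stands in for 'looks similar';
    a real word spotter would compare image features instead.
    """
    return Counter(t for t in tokens_of(text) if stem in t)


def hit_density(text, stem="volt"):
    """Fraction of tokens that contain the stem: one discriminator among several."""
    tokens = tokens_of(text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if stem in t) / len(tokens)


sample = "The voltmeter read 12 volts; Voltaire would revolt."
hits = spot(sample)
density = hit_density(sample)
```

On the sample sentence the sieve catches "voltmeter" and "volts" but also the false hits "voltaire" and "revolt", which is why a density figure like this would only ever be one discriminator to combine with others.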
Handwritten notes and journals would be less amenable than printed books (in my own handwriting, for instance, computer transcription systems have trouble separating "volt" from "bolt" and sometimes even "void"), but could still be sieved using multiple discriminators in the same way.
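The "volt", "bolt" and "void" confusion can be mimicked with a simple character edit distance. This is a sketch under a loud assumption: the number of letters that differ is treated as a proxy for how alike two handwritten words look, whereas a real system would be comparing pen strokes or image features. The names edit_distance and confusable, and the max_distance threshold, are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes or swaps."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def confusable(word, vocabulary, max_distance=2):
    """Words within max_distance edits of word (threshold is an arbitrary assumption)."""
    return sorted(w for w in vocabulary
                  if w != word and edit_distance(word, w) <= max_distance)


near_misses = confusable("volt", ["bolt", "void", "volts", "electricity"])
```

Here "bolt" is one edit from "volt" and "void" is two, so both fall inside the threshold along with the genuine relative "volts". Separating them would need the other discriminators in the sieve.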

  • Piper, H.B., Omnilingual, in Astounding Science Fiction. 1957, Dell Magazines: Norwalk CT.

