In yesterday's outtake from my recent “Joy of text” piece in
Scientific Computing World, I mentioned in passing the term “word spotting”.
Word spotting looks for
visually similar discrete components within a text, and classifies
them using statistical comparisons. In approximate human terms, it is
treating words (or combinations of words) as ideograms rather than as
phonetic constructs.
Word spotting is not limited
to written or printed material, though that is the context with which
I'm concerned here: it also applies, for example, to speech
recognition. Nor, intriguingly, is it necessarily limited to words of
known meaning; it can equally well be applied to semantic units of
entirely unknown signification. It could, as an extreme example, be
used to analyse the manuscripts from a lost extraterrestrial
civilisation in H. Beam Piper's classic science fiction story
Omnilingual.
Reverse the Omnilingual
example to imagine a hypothetical extraterrestrial archaeologist
trying to study post-apocalyptic remains of our own cultures. It will
be obvious from context that certain signification units are
associated with the physical sciences: "volts" on
electrical signs and appliances, just to pick one.
Our xenoarchaeologist (who
may not have alphabetic scription systems, or even vocalisation, and
certainly cannot assume that each letter represents a sound) looks at
a mass of textual matter whose content, subject matter, purpose and
reliability are unknowable. Where should attention be concentrated?
The only certain knowledge is that words are identifiable visual
entities found in isolation and occurring with spatial separation in
books.
Word spotting, with no
semantic assumptions, quickly shows that some books contain many
instances of the visually related signifiers "volts",
"volt", "voltage", "voltmeter" and so
on, suggesting that those sections contain material related to
electricity; others do not. There will, of course, be false hits such
as "Voltaire" and "revolt", but as one
discriminator amongst others in a multiple sieving process it would
nevertheless be invaluable.
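
As a rough illustration of that sieving step, here is a minimal sketch in Python. It makes a large simplifying assumption: real word spotting compares the shapes of word images, whereas this stand-in compares character strings, treating a shared run of letters as a proxy for visual similarity. The function names (spot, hit_density) and the three tiny "books" are hypothetical, invented for the example.

```python
import re

def spot(text, pattern="volt"):
    """Return every token containing `pattern`, ignoring case.

    Assumption: a shared substring stands in for visual similarity;
    a real word spotter would compare image features instead.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if pattern in t.lower()]

def hit_density(text, pattern="volt"):
    """Fraction of tokens in `text` that match `pattern`."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    return len(spot(text, pattern)) / len(tokens)

# Hypothetical miniature "books", invented for illustration only.
books = {
    "electrical_manual": "Check the voltage with a voltmeter; the supply "
                         "is 240 volts, one volt more trips it.",
    "history": "Voltaire wrote while the peasants were in revolt.",
    "cookery": "Whisk the eggs and fold in the flour.",
}

if __name__ == "__main__":
    # Rank the books by how densely the pattern occurs in each.
    for title in sorted(books, key=lambda t: hit_density(books[t]), reverse=True):
        print(title, round(hit_density(books[title]), 3), spot(books[title]))
```

The history text scores almost as highly as the electrical manual purely on the strength of "Voltaire" and "revolt", which is the false hit problem in miniature, and the reason a single discriminator like this is only useful as one sieve among several.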
Handwritten notes and
journals would be less amenable than printed books (in my own
handwriting, for instance, computer transcription systems have
trouble separating "volt" from "bolt" and
sometimes even "void"), but could still be sieved using
multiple discriminators in the same way.
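
The handwriting confusion can be put in rough numerical terms. The sketch below again assumes a character-level stand-in for what is really a visual comparison: it uses edit distance (the number of single-character insertions, deletions or substitutions needed to turn one string into another) as a crude measure of how easily two words might be mistaken for each other. The function name edit_distance is my own, not taken from any word spotting system.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b.

    Assumption: character edits approximate visual confusability;
    a real system would measure distance between image features.
    """
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    for other in ("bolt", "void", "voltage"):
        print("volt", other, edit_distance("volt", other))
```

On this measure "bolt" sits one edit away from "volt" and "void" two, so a sieve that tolerates small distances in order to cope with handwriting will let both through, while a strict one risks missing genuine instances.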
- Piper, H. B., Omnilingual, in Astounding Science Fiction. 1957, Dell Magazines: Norwalk, CT.