15 February 2013

The joy of text

It's a mundane truism, not normally worth mentioning, that words and phrases as signification units in natural language have only the fuzziest of relations to that which they signify. It is, nevertheless, a live issue for the many researchers attempting to computerise data analytic activity using text as raw material. It's also a truism of which I have been reminded afresh as I discussed the topic with practitioners and consumers of textual analysis, no two of whom used the term in exactly the same way.
Strictly speaking, textual analysis describes a social sciences methodology for examining and categorising communication content. In practice, though, the term is widely used to cover a range of activities in which unstructured or partially structured textual material is submitted to rigorous analytic treatment. What these activities all have in common is a desire to wrestle the petabytes of potentially valuable information locked up in an ever-inflating text reservoir (blogs, books, chat rooms, clinical notes, departmental minutes, emails, field journals, lab notebooks, patents, reports, specification sheets, web sites and a million other sources) into a form which is susceptible to useful, objective data analytic treatment. Temis, of whom more later, have on their website a headline which sums it up neatly: "Big data issue #1: a lot of content and no insights". Text mining, the consequent knowledge bases, and analysis of the results have become a major component of biomedical and pharmaceutical research.
For our purposes here, I have taken it to mean analysis whose purpose is to extract scientific value from texts, to examine those texts scientifically, or some combination of the two.
