05 February 2011

Slaving over half a trillion words...

A number of emailed queries suggest that I may, in my "Picking over half a trillion words" post of 19 December last year, have inadvertently made light of the task facing anyone who wants to do their own analysis of Google's raw data sets. Here is a quick rundown of what to expect; I think it worth the effort myself, but not everyone shares my view of what constitutes light entertainment.

The first issue, and the one behind most of the emails, is sheer file size. You won't be able to just load a gigabyte text file into Excel (or even into something more useful, like statistical or mathematical software) and start analysing ... never mind the full nine gigabytes of the whole data set. Excel will just choke; OpenOffice Calc will load as much as it can and abandon the rest without ever even reaching "a". You're going to have to extract the subsets you want, first.

How do you do that?

  1. If you are old enough (like me) or otherwise happen to be familiar with MSDOS command line programming, you can segment files into smaller chunks from there. If not, then I don't suggest you start learning now, just for this job.
  2. There are text editors which will handle any file for which you have disk space. (There are also many which claim to handle unlimited file sizes, but fall over when asked to load these files.) You can use one of those to copy the sections which interest you into a new file, or to cut the data down into smaller files of perhaps fifty thousand lines each (a scripted version of that splitting is sketched after this list).
  3. My own preference, after trying various options, is to import the data into a database table (Microsoft Access, for example, or OpenOffice Base) and subset it from there. The advantages are many, but include access to all the sorting, filtering and other tools which a DBMS (database management system) provides – not just for subsetting but for cleaning (coming up next) and other tasks. A DBMS also eases the problem of having the words which interest you scattered across ten separate, alphabetically sorted lists with no clue as to which word is in which file. (A loading sketch, in miniature, also follows this list.)
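
Taking the splitting route first: if you do go down road 1 or 2, a short script saves a lot of manual copying and pasting. This is only a minimal sketch; the filename is a placeholder for whichever raw file you have actually downloaded, and fifty thousand lines per chunk simply echoes the figure suggested above.

```python
# Split one huge raw n-gram file into chunks of ~50,000 lines each.
# "raw-1gram-0.csv" is a placeholder, not the real download name.
CHUNK = 50000

def split_file(path, chunk=CHUNK):
    part, out = 0, None
    with open(path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % chunk == 0:          # time to start a new chunk file
                if out:
                    out.close()
                part += 1
                out = open(f"{path}.part{part:03d}", "w", encoding="utf-8")
            out.write(line)
    if out:
        out.close()

split_file("raw-1gram-0.csv")
```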
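
And option 3, my own route, might look something like this in miniature – here using SQLite (Python's standard sqlite3 module) purely as a stand-in for Access or Base. It assumes each line of the raw one-gram files carries five tab-separated fields (word, year, occurrences, pages, books); check your own download against that before trusting it, and the database and table names are my own inventions.

```python
import csv
import sqlite3

con = sqlite3.connect("ngrams.db")
con.execute("""CREATE TABLE IF NOT EXISTS onegrams (
                   word TEXT,
                   year INTEGER,
                   occurrences INTEGER,
                   pages INTEGER,
                   books INTEGER)""")

def load_file(path):
    # Pour one raw tab-separated file into the onegrams table.
    with open(path, encoding="utf-8", newline="") as src:
        rows = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        con.executemany("INSERT INTO onegrams VALUES (?, ?, ?, ?, ?)",
                        (r for r in rows if len(r) == 5))
    con.commit()

load_file("raw-1gram-0.csv")   # repeat, or loop, for each of the ten raw files
```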

Next comes the fact that these datasets are actually very noisy – and, as a separate but related problem, probably contain a lot of information in which you have no interest. DBMS filtering is the quickest and most efficient way to deal with both.

"Noise" is mostly down to OCR (machine text reading) errors. The many lines referring to the word "A&AINST", for instance, are probably misreadings of "AGAINST". That would be a fairly quick and easy correction to make ... but it is only the first of many similar examples, and suppose you are mistaken in your interpretation? Others are harder. A lot of those entries which begin with "99" ("99For" for example) may well be misreadings of quotation marks ... but without going back to the original text there is no way top be sure ... and what about "99mTclabelled"? Is "9ANTA" meant to be "SANTA", or am I just responding to the present season? What am I to make of "A+H"? The only realistic path, I reluctantly decided, was to delete (using DBMS filters) all lines containing nonalphanumeric characters (except the apostrophe – elisions such as "I'm" or "hasn't" clearly have a place).

Then there are hundreds of lines for symbols alone ... "#" for instance, or "?". I deleted those, too.

Money ... I decided that thousands of lines referring to specific amounts of money ("$100.00" ... "$9.21" ... even "$0.00") were also dead weight from my point of view. What about other numbers? There is no doubt much fascination to be derived from an enquiry into why the number "5589" was mentioned in three books in 1932, then dropped out of sight and did not reappear until 1947 ... but it is a different enquiry from the one into words.

So, for simplicity, you could (as I have) delete all lines containing a digit in the word field as well, leaving only alphabetics (with or without apostrophes). A side benefit of all this mass deletion is that the file gets smaller and more manageable, too ... not immensely smaller, but every little helps.
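
In the same sketch, that last rule is a one-line filter – SQLite's GLOB character class standing in for the equivalent Access or Base filter – and it sweeps up the "99For"s, the "5589"s and anything else carrying a digit that the earlier pass let through.

```python
import sqlite3

con = sqlite3.connect("ngrams.db")
# Delete every remaining row whose word contains at least one digit,
# leaving only alphabetic words (with or without apostrophes).
con.execute("DELETE FROM onegrams WHERE word GLOB '*[0-9]*'")
con.commit()
```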

All the above needs to be done for each of the ten files.

That's the bulk of the heavy lifting done. There are plenty of other issues cluttering the path ahead ... capitalisation, for example: "Case", "CASE" and "case" are three separate entries in the Google datasets ... and misspellings (again probably OCR artefacts) ... but you get the idea by now, and can deal with them as you encounter them.

Combining the files into one database has both advantages and a drawback. The advantages have to do with being able to seek out every word in a single place; the downside is, once again, file size. My own decision (not necessarily yours) was to split the first cleaned file down into a set of twenty-six tables – one for each letter of the alphabet – then add the results of each subsequently cleaned file into those same tables. End result: all words beginning with "A" in one table, and so on (sketched below).
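
A rough sketch of that arrangement, again in the SQLite stand-in: create one table per initial letter and fill each from the cleaned data. The words_a, words_b ... names are my own; nothing in the datasets dictates them.

```python
import string
import sqlite3

con = sqlite3.connect("ngrams.db")

for letter in string.ascii_lowercase:
    # One table per initial letter: words_a, words_b, ..., words_z.
    con.execute(f"CREATE TABLE IF NOT EXISTS words_{letter} (word TEXT, "
                "year INTEGER, occurrences INTEGER, pages INTEGER, "
                "books INTEGER)")
    con.execute(f"INSERT INTO words_{letter} SELECT * FROM onegrams "
                "WHERE lower(substr(word, 1, 1)) = ?", (letter,))
con.commit()
```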

However you arrange things, you then have a source from which you can easily copy and paste subsets (line by line or block by block) into your chosen spreadsheet or analysis application. Or import it directly using a filter. Here, for example, is an extracted table showing the number of books (published in English) which mentioned the town of Abingdon in the decade 1957-1966 (a query sketch follows the table).

Year Books
1957 31
1958 31
1959 32
1960 28
1961 34
1962 40
1963 39
1964 32
1965 38
1966 30
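
The table above came out of exactly this kind of query. A sketch of roughly that shape, run against the hypothetical per-letter tables from the previous sketch, summing across capitalisation variants in case "ABINGDON" and friends are wanted too:

```python
import sqlite3

con = sqlite3.connect("ngrams.db")
rows = con.execute(
    "SELECT year, SUM(books) FROM words_a "
    "WHERE lower(word) = 'abingdon' AND year BETWEEN 1957 AND 1966 "
    "GROUP BY year ORDER BY year")
for year, books in rows:
    print(year, books)
```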
