19 December 2010

Picking over half a trillion words

Following JSBlog's enthusiasm yesterday (“Google just blew my bibliographic socks off”) for Google's new Ngram viewer, I've been busily catching up.

First stop was the viewer itself. Then a start on downloading the raw data sets which lie behind it, for more detailed analysis than the online viewer can deliver. Finally, while the data downloaded in the background (almost two gigabytes of it just for single words in English, even in ZIP form ... nearer to ten when expanded), reading the associated Science article by Michel et al.

It's going to be a good while before anything significant comes of the downloads, but I've done a couple of test drives. They can be intuitively checked with a quick visit to the viewer.

First experiment, resulting from a recent off-the-cuff discussion amongst a group of students: correlating uses of the words "twat", "twit" and "twerp". It's interesting to find positive correlation between the first two from 1935 to 1980, but negative between them and "twerp" over the same period – which then reverses so that all three positively correlate over the past thirty years.

Second: the tendency to concatenate "bigrams" into single words. This train of thought was started by Google's example comparison of "child care" with "nursery school" and "kindergarten" ... I tried it out, and then added "childcare" to see if it made a difference. To cut a long story short, two examples: "child care" declines markedly as "childcare" slightly increases (a negative correlation) from 1996 to 2008; "brood mare" and "broodmare" show a similar negative correlation from 1960 to 2000, but then "brood mare" recovers and the correlation becomes positive through to the present.
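For anyone wanting to repeat this sort of comparison from the raw data rather than by eyeballing the viewer, here is a minimal sketch of what I mean by "correlation" over a span of years. It assumes each word's yearly relative frequency has already been extracted into a dictionary keyed by year, and it uses a plain Pearson coefficient; the function names are my own, hypothetical ones, not anything supplied by Google.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def windowed_correlation(freq_a, freq_b, start, end):
    """Correlate two {year: frequency} dicts over start..end inclusive.

    Years missing from either series are skipped (an assumption of this sketch).
    """
    years = [y for y in range(start, end + 1) if y in freq_a and y in freq_b]
    return pearson([freq_a[y] for y in years], [freq_b[y] for y in years])
```

Calling windowed_correlation once for 1996 to 2008 and again for a later window is enough to see a sign change of the kind described above; a negative result means one form rises as the other falls.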

Those are, of course, trivial investigations and show nothing ... I mention them only to show the sort of five-finger exercises that I've been playing with since yesterday. Much more interesting are some of the investigations mentioned by the authors of the Science article.

For example:

Suppression – of a person, or an idea – leaves quantifiable fingerprints... ... ... Such examples are found in many countries, including Russia (e.g. Trotsky), China (Tiananmen Square) and the US (the Hollywood Ten, blacklisted in 1947)...

We probed the impact of censorship on a person’s cultural influence in Nazi Germany. Led by such figures as the librarian Wolfgang Hermann, the Nazis created lists of authors and artists whose “undesirable”, “degenerate” work was banned from libraries and museums and publicly burned... We plotted median usage in German for five such lists ... ... ... The five suppressed groups exhibited a decline. This decline was modest for writers of history (9%) and literature (27%), but pronounced in politics (60%), philosophy (76%), and art (56%). The only group whose signal increased during the Third Reich was the Nazi party members [a 500% increase...].

Given such strong signals, we tested whether one could identify victims of Nazi repression de novo. We computed a “suppression index” s for each person by dividing their frequency from 1933 – 1945 by the mean frequency in 1925-1933 and in 1955-1965... In English, the distribution of suppression indices is tightly centered around unity. Fewer than 1% of individuals lie at the extremes... In German, the distribution is much wider, and skewed leftward: suppression in Nazi Germany was not the exception, but the rule... At the far left, 9.8% of individuals showed strong suppression... This population is highly enriched for documented victims of repression, such as Pablo Picasso..., the Bauhaus architect Walter Gropius, and Hermann Maas... ... ... At the other extreme, 1.5% of the population exhibited a dramatic rise... This subpopulation is highly enriched for Nazis and Nazi-supporters, who benefited immensely from government propaganda...

These results provide a strategy for rapidly identifying likely victims of censorship from a large pool of possibilities, and highlight how culturomic methods might complement existing historical approaches.
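The "suppression index" in that extract is simple enough to sketch. What follows is my own reading of the quoted description, not the authors' code: I take the mean yearly frequency of a name over 1933 to 1945 and divide it by the mean over the two reference windows (1925 to 1933 and 1955 to 1965) pooled together. The pooling, the handling of missing years and the function names are all assumptions on my part.

```python
def mean_over_years(freq_by_year, windows):
    """Mean frequency over every year in the given (start, end) windows.

    Windows are inclusive; years missing from the data are skipped.
    """
    values = [
        freq_by_year[year]
        for start, end in windows
        for year in range(start, end + 1)
        if year in freq_by_year
    ]
    if not values:
        raise ValueError("no data in the requested windows")
    return sum(values) / len(values)


def suppression_index(
    freq_by_year,
    during=(1933, 1945),
    reference=((1925, 1933), (1955, 1965)),
):
    """Ratio of mean frequency in the 'during' window to the reference mean.

    Values near 1 suggest no change; values well below 1 suggest suppression.
    """
    reference_mean = mean_over_years(freq_by_year, reference)
    if reference_mean == 0:
        raise ValueError("reference frequency is zero; index is undefined")
    return mean_over_years(freq_by_year, [during]) / reference_mean
```

On this reading, a name mentioned at a steady rate throughout scores 1 (the "unity" around which the English distribution clusters), while one that all but vanishes from print between 1933 and 1945 scores close to 0.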


  • Jean-Baptiste Michel, et al., "Quantitative Analysis of Culture Using Millions of Digitized Books" in Science, 2010. DOI 10.1126/science.1199644

1 comment:

Julie Heyward said...

[looking at watch; tapping toes impatiently]

How long can it take to run "chicken boogers" through that thing? Surely that was your first and most urgent task, on realizing the power of such linguistic machinery?