08 November 2013

Too big for its boots

In a delightful way, from a writing perspective, my last few data analysis topics have first synergised with one another and then led naturally on to consideration of Big Data.
What exactly is "big data"? The answer, it should come as no particular surprise to hear, is "it depends". As a broad, rough-and-ready definition, it means data in sufficient volume, complexity and velocity to present practical problems in storage, management, curation and analysis within a reasonable time scale. In other words, data which becomes, or at least threatens to become in a specific context, too dense, too rapidly acquired and too various to handle. Clearly, specific contexts will vary from case to case and over time (technology continuously upgrades our ability to manage data, even as it generates data in greater volume) but broadly speaking the gap remains – and seems likely to remain in the immediate future. The poster boys and girls of big data, in this respect, are the likes of genomics, social research, astronomy and the Large Hadron Collider (LHC), whose unmanaged gross sensor output would be around fifty zettabytes per day.
There are other thorny issues besides the technicalities of computing. Some of them concern research ethics: to what extent, for example, is it justifiable to use big data gathered for other purposes (from health, telecommunications, credit card usage or social networking, say) in ways to which the subjects did not give consent? Janet Currie (to mention only one recent example amongst many) highlights a stark tightrope in her "Big data vs big brother" consideration of large-scale pædiatric studies. Others are more of concern to statisticians like me: there is a tendency for the sheer density of data available to obscure the idea of a representative sample – and a billion unbalanced data points can actually give much less reliable results than thirty well-selected ones.
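To make that last point concrete, here is a minimal simulation sketch in Python. It is purely illustrative and not drawn from any of the studies mentioned above: the population, the self-selection mechanism and every number in it are assumptions of mine.

# A sketch only: all numbers here are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population with a true mean of 50.
population = rng.normal(loc=50.0, scale=10.0, size=1_000_000)

# "Big but unbalanced": inclusion probability rises with the value itself,
# a crude stand-in for self-selection or convenience sampling.
weights = np.exp(population / 20.0)
weights /= weights.sum()
big_biased = rng.choice(population, size=500_000, replace=True, p=weights)

# "Well selected": a small simple random sample.
small_random = rng.choice(population, size=30, replace=False)

print(f"True mean:              {population.mean():6.2f}")
print(f"500,000 biased points:  {big_biased.mean():6.2f}")
print(f"30 random points:       {small_random.mean():6.2f}")

On this invented setup, the half-million biased points consistently overestimate the true mean by around five units, while the thirty random points typically land within a unit or two of it. Sheer volume does nothing to correct a biased collection mechanism.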
Conversely, however, big data can also be defined in terms not of problems but of opportunity. Big data approaches open up the opportunity to explore very small but crucial effects. They can be used to validate (or otherwise) smaller and more focussed data collection, as for instance in Ansolabehere and Hersh’s study [1] of survey misreporting. As technology gives us expanding data capture capabilities at ever finer levels of resolution, all areas of scientific endeavour are becoming increasingly data intensive. That means (in principle, at least) knowing the nature of our studies in greater detail than statisticians of my generation could ever have dreamed of. A while back, to look at the smaller end of the scale, I mentioned [2] the example of an automated entomological field study régime simultaneously sampling two thousand variables at a resolution of several hundred cases per second. That’s not, by any stretch of the imagination, in LHC territory, but it’s big enough to make a significant call on a one-terabyte portable hard drive. It’s also a goldmine opportunity for small-team or even individual study of phenomena which, not long ago, would have been beyond the reach of even the largest government-funded programme: big data has revolutionised small science.
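For anyone who likes to check the arithmetic on that entomological example, here is a rough back-of-envelope sketch, again in Python. The study specifies only two thousand variables at "several hundred cases per second"; the three hundred cases per second and eight bytes per stored value are my own assumptions for illustration.

# Back-of-envelope only: rate and storage format are assumed, not reported.
variables = 2_000          # channels sampled simultaneously
cases_per_second = 300     # "several hundred" - an assumed value
bytes_per_value = 8        # double-precision floats, stored uncompressed

bytes_per_second = variables * cases_per_second * bytes_per_value
gigabytes_per_hour = bytes_per_second * 3600 / 1e9
hours_to_fill_1tb = 1e12 / (bytes_per_second * 3600)

print(f"Raw capture rate:  {bytes_per_second / 1e6:.1f} MB/s")
print(f"Per hour:          {gigabytes_per_hour:.1f} GB")
print(f"A 1 TB drive fills in about {hours_to_fill_1tb:.0f} hours")

On those assumptions the capture runs at a little under five megabytes per second (roughly seventeen gigabytes an hour), and a one-terabyte drive fills in about fifty-eight hours of continuous recording.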
There is, in any case, no going back; big data is here to stay – and to grow ever bigger, because it can. Like all progress, it’s a double-edged sword, and the trick, as always, is to manage the obstacles in ways which deliver the prize.

[1] Ansolabehere, S. and E. Hersh, "Validation: What Big Data Reveal About Survey Misreporting and the Real Electorate". Political Analysis, 2012. 20(4): pp. 437-459.

[2] Grant, F., "Retrieving data day queries", in Scientific Computing World. 2013, Europa Science: Cambridge. pp. 10-12.
