In a delightful way, from a writing perspective, my last few data analysis topics have first synergised with one another and then led
naturally on to the consideration of Big Data.
What exactly is "big data"? The answer, it should come as no particular
surprise to hear, is "it depends". As a broad, rough and ready definition, it
means data in sufficient volume, complexity and velocity to present practical
problems in storage, management, curation and analysis within a reasonable time
scale. In other words, data which becomes, or at least threatens to become in a
specific context, too dense, too rapidly acquired and too various to handle.
Clearly, specific contexts will vary from case to case and over time (technology
continuously upgrades our ability to manage data as well as generating it in
greater volume) but broadly speaking the gap remains – and seems likely to
remain in the immediate future. The poster boys and girls of big data, in this
respect, are the likes of genomics, social research, astronomy and the Large
Hadron Collider (LHC), whose unmanaged gross sensor output would be around fifty
zettabytes per day.
There are other thorny issues besides the technicalities of computing. Some
of them concern research ethics: to what extent, for example, is it justifiable
to use big data gathered for other purposes (say, from health, telecommunications,
credit card usage or social networking) in ways to which the subjects did not give
consent? Janet Currie (to mention only one recent example amongst many) captures
the dilemma starkly in her "Big data vs big brother" consideration of large-scale
pædiatric studies. Others are more of concern to statisticians like me: there is a
tendency for the sheer density of data available to obscure the idea of a
representative sample – and a billion unbalanced data points can actually give
much less reliable results than thirty well-selected ones.
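The point is easy to illustrate with a small, entirely hypothetical simulation; the population, the sample sizes and the crude "upper half only" selection mechanism below are all invented for the sake of the example, but the pattern is general:

import numpy as np

rng = np.random.default_rng(42)

# A made-up population of one million individuals with a true mean of 50.
population = rng.normal(loc=50, scale=10, size=1_000_000)

# A large but unbalanced sample: 100,000 records drawn only from the upper
# half of the distribution (self-selection, platform bias, or whatever the
# mechanism happens to be in practice).
big_unbalanced = rng.choice(population[population > 50], size=100_000, replace=False)

# Thirty cases drawn as a simple random sample of the whole population.
small_random = rng.choice(population, size=30, replace=False)

print(f"True mean:            {population.mean():.2f}")
print(f"100,000 unbalanced:   {big_unbalanced.mean():.2f}")  # biased upwards
print(f"Thirty random cases:  {small_random.mean():.2f}")    # noisy but unbiased

No amount of extra volume corrects the first estimate; the thirty random cases are noisier, but their error is honest and quantifiable.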
Conversely, however, big data can also be defined in terms not of problems
but of opportunity. Big data approaches open up the opportunity to explore very
small but crucial effects. They can be used to validate (or otherwise) smaller
and more focussed data collection, as for instance in Ansolabehere and Hersh’s
study [1] of survey misreporting. As technology gives us expanding
data capture capabilities at ever finer levels of resolution, all areas of
scientific endeavour are becoming increasingly data intensive. That means (in
principle, at least) knowing the nature of our studies in greater detail than
statisticians of my generation could ever have dreamed. A while back,
to look at the smaller end of the scale, I mentioned [2] the example
of an automated entomological field study régime simultaneously sampling two
thousand variables at a resolution of several hundred cases per second. That’s
not, by any stretch of the imagination, in LHC territory, but it's big enough
data to make a significant call on a one terabyte portable hard drive (a rough
calculation below shows how quickly). It's also a goldmine opportunity for
small-team or even individual study of phenomena which, not long ago, would have
been beyond the reach of even the largest government-funded programme: big data
has revolutionised small science.
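How quickly a data stream like that fills a drive is simple arithmetic. The sketch below assumes, purely for illustration, three hundred samples per second and eight bytes per stored value – neither figure is stated precisely in the original example:

# Back-of-envelope data volume for the field study regime described above.
# The sampling rate and bytes-per-value figures are illustrative assumptions,
# not numbers taken from the study itself.
variables = 2_000            # simultaneously sampled variables
samples_per_second = 300     # "several hundred cases per second"
bytes_per_value = 8          # e.g. a double-precision float
seconds_per_day = 86_400

bytes_per_day = variables * samples_per_second * bytes_per_value * seconds_per_day
print(f"{bytes_per_day / 1e9:.0f} GB per day")   # roughly 415 GB per day

On those assumptions, a one terabyte portable drive fills in two to three days of continuous recording.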
There is, in any case, no going back; big data is here to stay – and to grow
ever bigger, because it can. Like all progress, it's a double-edged sword, and
the trick, as always, is to manage the obstacles in ways which deliver the prize.
[1] Ansolabehere, S. and E. Hersh, "Validation: What Big Data Reveal About Survey Misreporting and the Real Electorate", Political Analysis, 2012, 20(4): pp. 437-459.
[2] Grant, F., "Retrieving data day queries", Scientific Computing World, 2013, Europa Science: Cambridge, pp. 10-12.