Saturday, March 10, 2012

Big Data and the End of Theory?

The Guardian just published a short post that I wrote which looks at the discourses surrounding 'big data.'

In it I argue that:

Gender, geography, race, income, and a range of other social and economic factors all play a role in how information is produced and reproduced. People from different places and different backgrounds tend to produce different sorts of information. And so we risk ignoring a lot of important nuance if we rely on big data as a social/economic/political mirror.

We can of course account for such bias by segmenting our data. Take the case of using Twitter to gain insights into last summer's London riots. About a third of all UK Internet users have a Twitter profile; a subset of that group are the active tweeters who produce the bulk of content; and then a tiny subset of that group (about 1%) geocode their tweets (essential information if you want to know where your information is coming from).

Despite the fact that we have a database of tens of millions of data points, we are necessarily working with subsets of subsets of subsets. Big data no longer seems so big. Such data thus serves to amplify the information produced by a small minority (a point repeatedly made by UCL's Muki Haklay), and skew, or even render invisible, ideas, trends, people, and patterns that aren't mirrored or represented in the datasets that we work with.
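To make the "subsets of subsets of subsets" point concrete, here is a rough sketch of the funnel arithmetic. Only the "about a third" and "about 1%" shares come from the text above; the population size and the share of active tweeters are placeholder assumptions chosen purely for illustration, and the `funnel` helper is a hypothetical name.

```python
def funnel(population, stages):
    """Return (label, remaining_count, share_of_original) after each filter."""
    rows = []
    remaining = population
    for label, fraction in stages:
        remaining *= fraction
        rows.append((label, round(remaining), remaining / population))
    return rows


# Assumed round number for illustration only, not a measured figure.
UK_INTERNET_USERS = 40_000_000

STAGES = [
    ("has a Twitter profile", 1 / 3),  # "about a third" (from the text)
    ("actively tweets", 0.25),         # assumed; the text gives no figure
    ("geocodes tweets", 0.01),         # "about 1%" (from the text)
]

for label, count, share in funnel(UK_INTERNET_USERS, STAGES):
    print(f"{label:<24}{count:>12,}{share:>10.4%}")
```

Under these placeholder figures, fewer than one in a thousand Internet users would be left in the geocoded group. The exact numbers matter less than the shape: each filter multiplies the last, so the sample shrinks much faster than any single percentage suggests.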

Big data is undoubtedly useful for addressing and overcoming many important issues faced by society. But we need to ensure that we aren't seduced by the promises of big data to render theory unnecessary.

We may one day get to the point where sufficient quantities of big data can be harvested to answer all of the social questions that most concern us. I doubt it though. There will always be digital divides; always be uneven data shadows; and always be biases in how information and technology are used and produced.

And so we shouldn't forget the important role of specialists to contextualise and offer insights into what our data do and, maybe more importantly, don't tell us.

You can check out the full piece here.


Dale Nicholson said...

I liked your post. You are correct that a lot of people don't consider the fact that a huge portion of the people out there still don't even use Twitter to begin with, much less actively tweet about everything that's going on around them.

Rich Farmbrough said...

Also, a lot of the information that is hidden in big data needs sophisticated linguistic analysis, at near-human level, to reveal. Even when a billion items are boiled down to a "mere ten thousand", doing this type of analysis manually is prohibitive, and doing it automatically is at the cutting edge of NLP.