The recent kerfuffle about Google Flu Trends brought all kinds of critics of big data out of the woodwork. This is normal for anything new. I am sure there was an outpouring of hostility to the motor car when the first accident occurred. What I found useful was the cooling of the hype associated with big data. I have no doubt it is big and will, in fact, lead to a data economy. But it will not end world hunger or do any other wondrous thing.
One thing we have been pointing out is that the data must in some way be “representative” of the population being studied. The quotation marks are important. This is not representative in the sense we use the term in statistics. But there must be some connection with the population we are reaching conclusions about. That is why we have insisted that, to talk about the poor in developing countries where datafication levels are low, the only relevant dataset is data generated by mobile transactions. Such data are broad in coverage, especially if multiple operators contribute, and they exhibit variation.
In a country like Sri Lanka, LKR 13 billion or so is spent on the Samurdhi welfare scheme, with disbursements to over 30 percent of households. One could say this covers the poor. But even if these data are in datafied form, there is not much variability (the same amounts are deposited in accounts every month). Withdrawals will not be regular and similar, but unless one can supplement them with some other datasets, analysis of withdrawal patterns may not lead to very interesting conclusions. Still, one has to muck around with the data to actually tell.
Another thing we have tried to emphasize is data quality. Data that are produced as a by-product of a transaction are less likely to be distorted. Agricultural prices that are captured from the electronic platforms the transactions occur on are likely to be of higher quality than prices that are reported by informants. Someone trying to rig the market will find it easier to manipulate the latter than the former.
One misconception is that new data collection schemes are needed for big data. It would be nice if data were collected with downstream analysis in mind (this would reduce the need for data cleaning), but it is not a necessary condition. Mobile “meta data” have been in existence since the start of mobile telephony in the 1980s. They were not subject to data analytics then, but are now. Some of the reasons are given in this 2012 post.