Big Data, Little Data
Matthew Martin 5/26/2013 10:14:00 PM
I've been busy with travel lately, so I now have a large backlog of rapidly-aging posts I wanted to write. Here's a point I wanted to make about so-called "big data," which recently entered the spotlight when Medicare decided to make its database of hospital prices publicly available. The event was heralded as the dawn of a new era of transparency made possible by "big data," of which the Medicare database was a prime example.
Not so fast. We are glossing over the big problem with "big data": while we get lots of individual-level data, it is generally of very low quality. Consider one of the prime examples of "big data": the information that firms like Dunhumby USA collect on your grocery purchases every time you swipe that card for your local grocery chain. When I'm in Cincinnati, the local chain is Kroger, which was one of the pioneers of the whole "rewards card" concept. At one level, having a record of nearly every single purchase at Kroger tied to individual-level characteristics sounds like a remarkable feat: millions of highly detailed data points that let you construct detailed statistical profiles of Kroger shoppers. But at another level, this is very poor data, because there are all kinds of ways in which the rewards-card system skews and biases the information. Just consider my example: I've been using one of my college roommate's Kroger cards since, well, college. We are no longer roommates and don't even live in the same metropolitan area anymore. But he likes the rewards points and all that; I don't. At an individual level there is no incentive for me to get my own card. Moreover, he and I don't have similar tastes at all--as far as the statistics go, any data collected from either of us this way is complete garbage.
While this may be a somewhat uncommon case, it points to a very real problem for big data: there is simply no quality control in the data-collection process. People lend their cards to each other, forget their cards, open multiple accounts, and so on. Dunhumby's strategy is to use the sheer volume of data collected to compensate for the remarkably low quality of the recording process--hopefully yielding the same quality of results at a lower cost than actually enrolling individuals in a smaller-scale, quality-controlled study. The point is, we are only fooling ourselves if we think there is no trade-off between "big data" and "little data."
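The limits of that strategy are easy to see in a toy simulation (invented numbers, not anything from Dunhumby's actual data): piling on more observations washes out zero-mean recording noise, but it does nothing about a systematic skew--say, shared cards mixing two households' purchases.

```python
import random

random.seed(42)

TRUE_MEAN = 100.0  # hypothetical true average weekly grocery spend


def sample_mean(n, bias):
    """Average n noisy observations; `bias` shifts every observation."""
    return sum(TRUE_MEAN + bias + random.gauss(0, 30) for _ in range(n)) / n


# Classical (zero-mean) noise: more data pulls the estimate toward the truth.
print(sample_mean(100, bias=0.0))
print(sample_mean(1_000_000, bias=0.0))

# Systematic bias (e.g., card-sharing): a million observations stay just as wrong.
print(sample_mean(1_000_000, bias=-15.0))
```

With a million observations the zero-mean case lands within a fraction of a dollar of the truth, while the biased case sits a full 15 dollars off no matter how large the sample grows.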
It's much the same story with the Medicare hospital data. Don't get me wrong: having the data is much better than not having it. But we must remember that the data is exceedingly low quality--it tells us, essentially, a list price that differs quite substantially from what Medicare and insurers actually pay. So when we see large disparities in the "price" of the same procedure across hospitals, it could mean that hospitals vary widely in what they charge for that procedure, or it could mean that they all charge the same amount and vary widely in bargaining strategies--or, more likely, something in between. The point is, we don't know, because the data is very low quality.
And finally, for a bit of perspective: my graduate advisor at Cornell, a labor economist, was the king of big data. I don't really know how he managed to construct a data set with several billion observations--more than there are people in the world--but I do know that with such a large dataset, it is impossible to quality-check each observation. Even with his team of graduate assistants on the job, if they started counting data points now, they'd all be dead before they finished. So ultimately he is relying on individual-level variation, some computational tricks, and the sheer volume of data to overcome what is hopefully fairly classical measurement error. There are applications where this data set will let us do things not otherwise possible. There are also plenty of applications where it would not be the best dataset to use.
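"Classical" measurement error matters here because its consequences are well understood: zero-mean noise in an explanatory variable attenuates a regression slope toward zero by a predictable factor (the reliability ratio, var(x)/(var(x)+var(noise))), rather than inventing effects that aren't there. A small illustration with made-up numbers and a bare-bones OLS slope:

```python
import random

random.seed(1)


def ols_slope(xs, ys):
    """Simple OLS slope estimate: cov(x, y) / var(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


N, TRUE_SLOPE = 100_000, 2.0
x = [random.gauss(0, 1) for _ in range(N)]            # true regressor, variance 1
y = [TRUE_SLOPE * xi + random.gauss(0, 1) for xi in x]

# Observe x with classical (zero-mean) noise, also variance 1.
x_noisy = [xi + random.gauss(0, 1) for xi in x]

print(ols_slope(x, y))        # close to 2.0 with clean data
print(ols_slope(x_noisy, y))  # close to 1.0: attenuated by 1/(1+1)
```

The noisy-regressor estimate is biased toward zero, but in a known direction and by a known amount--which is why "hopefully fairly classical" is doing so much work in the sentence above: card-sharing-style errors are not classical, and they don't behave this tamely.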