Put A Fork in Big Data

Posted by dave on October 10, 2013 Data, Opinion

This article was originally written for VentureBeat and ran on August 13, 2013.

Big data is a much-exploited term. It is now little more than jargon, and it is often misunderstood and misapplied. It is time the data industry stopped using the words “big data” to market software, because at the end of the day, the question is not “how big is your data?” but “what are you doing with your data?” I am not alone in this battle - many well-known data vets are on my side or, at the very least, asking for clarification.

Stephen Few, one of the great voices in data, believes that the term “big data” has essentially become a marketing campaign. “… Big data is more hype than substance,” Few writes, “And it [big data] thrives on remaining ill defined.”

That nebulous definition is a problem for Edd Dumbill, the program chair for the O’Reilly Strata Conference, too. Dumbill, who forged early meanings for the words big data, takes issue with the big data industry. He says it fails to supply its customers with products that solve business problems. “In that sense,” Dumbill writes, “The big data marketing push is pernicious - the same old things just rebrand themselves into the new trend.”

Even the comic strip Dilbert has taken sides in the big data argument. “Yes, the data shows that my productivity plunges whenever you learn new jargon,” Dilbert says in response to his Pointy-Haired Boss’ inquiry about “analytics from our big data in the cloud.”

Once a term is mocked in a mainstream comic strip like Dilbert, often a bellwether for shifting attitudes toward business lexicon, it’s no wonder that experts and companies are questioning big data’s utility more and more.

But the crux of Dilbert’s jab is twofold. Not only is big data jargon, but in the real world, companies searching for data solutions are often confused by all the big data marketing hype and sometimes end up wasting resources trying to prove that they, too, have big data. I wrote about this last year, when the term “big data” was at the peak of its hype cycle.

In the article, I referenced a 2012 Microsoft research paper that revealed startling misuse of big data applications under its own roof, and at Yahoo. The researchers discovered several ill-conceived Hadoop installations - big data software - that were processing less than 14 gigabytes of data each. Big data software for that little data is overkill, which is exactly what the researchers concluded.

“In many cases,” the research team writes, “It may be easier and cheaper to scale up using a single server and adding more memory.” The researchers go on to explain that with component prices dropping - RAM, storage, CPUs - there are diminishing returns to using big data software packages for data sets under a terabyte.

Moreover, both Microsoft and Yahoo are sophisticated technology companies. If mislabeling and handling data as big data can occur at those companies, it’s certainly going on elsewhere too. In any case, the Microsoft researchers were headed in the right direction in terms of defining big data. I think the true definition is this - any data set that can’t fit onto a single, local hard drive. At the moment that’s about four terabytes.

Four terabytes is an enormous amount of data. It’s more than enough, for example, to crunch all of the 2012 election data compiled by Nate Silver at the New York Times and by various other pollsters and news organizations.
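To make the rule of thumb concrete, here is a minimal Python sketch of the single-drive test. The function name `is_big_data` and the constant `SINGLE_DRIVE_BYTES` are hypothetical names made up for illustration, and the threshold assumes decimal terabytes and the roughly four-terabyte drive size cited above.

```python
# Hypothetical helper illustrating the single-drive rule of thumb.
# Assumption: "about four terabytes" is taken as 4 * 10**12 bytes (decimal TB).
SINGLE_DRIVE_BYTES = 4 * 10**12


def is_big_data(dataset_bytes: int, drive_bytes: int = SINGLE_DRIVE_BYTES) -> bool:
    """Return True only if the data set cannot fit on one local drive."""
    return dataset_bytes > drive_bytes


# A 14-gigabyte job, like the Hadoop installations described in the
# Microsoft paper, falls far below the threshold.
small_job_bytes = 14 * 10**9
print(is_big_data(small_job_bytes))
```

Keeping the drive size as a parameter reflects the point that the threshold moves over time: a data set that fails the test today may pass it once larger drives are common.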

What’s really interesting about big data is that, in reality, we’ve always had it. But what counted as big data in, say, 2001 likely no longer fits the definition, because that same data set can now be stored on a single drive, which was impossible then.

The truth of the matter is that only the Facebooks and Googles of the world have or need big data. For the rest of us, big data is just marketing hype, sales talk, and jargon. And through all the big data noise it’s easy to lose sight of the simple truth - it’s not the size of the data that counts, it’s how you use it. So let’s all agree to put a fork in big data.