Don't Blame Excel for Your Data Mistakes

Posted by Roi Avinoam on February 8, 2021 Chartio

Earlier this year, the British government announced that almost 16,000 people in the UK were infected with the coronavirus but weren’t being counted—and it was all because of a spreadsheet.

The problem happened because British government was tracking infections using an older version of Excel (.xls), which is pretty limited in how many rows of data it can handle. As labs sent back test results in .csv files, eventually the Excel spreadsheet hit its limit. At that point, additional cases weren’t counted.

Whether you’re a data analyst or CTO, when you saw that story you probably wanted to go around and (virtually) shake some of your users, saying, “This is why you should stop storing our data in your spreadsheets!”

I get it. But that understandable reaction overlooks an important question: why do so many people use spreadsheets in situations where everybody in tech knows it’s a terrible idea?

I don’t think the problem is spreadsheets. I think the real problem is our data platforms. If we’re going to get people to stop using spreadsheets in ways that don’t make sense, tech companies need to start building data solutions that are both as powerful and as easy to use as spreadsheets are.

What’s so bad about spreadsheets?

Spreadsheets are an awesome solution for some problems. But it’s easy to outstrip their limits. Here are just a few of the problems that crop up when spreadsheets act as your data repository, source of record, and analytical tool.

Spreadsheets can’t handle large data sets

As the UK government learned the hard way, throw enough data at a spreadsheet and at some point it will buckle. Go much over 1 million rows of data and you’re definitely in trouble, but power users know that smaller data sets can cause serious loading and processing delays.

And when a spreadsheet hits its limit, it isn’t designed to tell you that it can’t handle the load. The only way you’ll find out your spreadsheet is in trouble is when something goes horribly wrong.

It’s hard or impossible to find out how data has been changed

If you’re working with a data platform, you can usually find out how the data ended up in the version you are using. Most major transformations are the result either of scripts that were run on the data or an input process built into the data platform. Scripting bakes in transparency: you can examine the script to find out exactly when and how data was transformed (e.g., changing an entry’s formatting from “sunday” to “Sunday” or applying critical business logic to a core metric).

With a spreadsheet, however, it’s nearly impossible to figure out how the data has been manipulated. Sure, you can keep a data change sheet to document the process, but that’s fraught with potential pitfalls. Along with requiring users to keep multiple files updated, applying transformations that way is more permanent which makes them difficult to tweak or undo.

And what about users that manipulate their spreadsheet data by hand? That’s a road to ruin. Any time you’re repeating a process, it’s easy to make—or perpetuate—a mistake. The end result can lead to costly errors, in one case almost $6 billion in damages.

With spreadsheets, there is no single source of truth

One of the biggest problems with spreadsheets is that they practically encourage siloed data—they are quite possibly the greatest enemy of your single source of truth (SSOT).

It’s not hard to understand why: Marketing gets a slice of data from Product for a particular project they’re working on and they put it in a spreadsheet. The data turns out to be really useful, so other people start using it. Meanwhile, someone in Finance managed to wrangle similar-but-not-the-same data out of the inventory system. And other people start using that piece of data.

Both Marketing and Finance believe their spreadsheet is a single source of truth. But in reality, they’re working from different data, which means your team is using multiple resources to answer questions, none of them give the same answer, and, most importantly, it’s not immediately clear which answer is the right one.

A proper SSOT can’t be a file trapped on a user’s hard drive. They have to be accessible by others so the whole company can benefit (and work) from the same data.

So why are we all still using spreadsheets?

If spreadsheets are such a problem, why are they still so overused? I’ve learned the hard way as a CTO and cofounder of multiple startups—including one dedicated to simplifying data infrastructure—that the biggest reason people turn to spreadsheets is that today’s tech for managing data is an endless rabbit hole.

gif1

gif1

This problem is especially acute for startups because they have to be fast, decisive, and data-driven. They have to be able to strategize, execute, learn what works and what doesn’t, and quickly adjust. Startups need data solutions they can set up quickly, share with anyone on the team, and that will scale—in both data size and use-case complexity—without engineering effort.

One answer to the problem is building a custom solution that pulls in data, applies your business logic, and outputs key reports on demand. I’ve done that more times than I like to think about, and in my experience they not only cost more money than they should but were also a drain on our company’s most critical resource: time.

  • The time engineers spent maintaining data infrastructure instead of building their core products.
  • The time analysts and business owners spent waiting for the data they needed, instead of asking and answering business questions.
  • The time the business spent taking uninformed, indecisive action due to lack of readily accessible data.

Instead of focusing on our core products, we became bogged down with data maintenance tasks. At the end of the day, building data infrastructure became an opportunity cost, and again and again, we were forced to make an impossible tradeoff between dedicating development time to our core product or our data-driven culture.

And that’s why spreadsheets are so pervasive…and so powerful.

Anyone from the CEO to a new marketing hire can grab data and immediately work with it in Excel or Google Sheets. There’s zero learning curve, plus, users don’t have to wait for engineers to modify your business’ complex data infrastructure to get the data they need.

Is it any wonder that startups end up relying on spreadsheets more than they should?

Data platforms need to learn from Henry Ford

Why is Henry Ford famous? It’s not because he invented the combustion engine, or the wheels of a car, or a car itself. It’s because in a moment when everyone else was selling parts that buyers were supposed to DIY into a drivable car, Ford sold the final product, making it possible to start driving immediately.

For all of its advancements, data tech still hasn’t had its Henry Ford moment.

As a result, companies waste a ton of time assembling and maintaining all of the different parts of a data stack—pipelines, storage, governance, BI tools—instead of actually using the data. They’re no longer stuck creating DIY components for the stack, but they’re not getting ready-to-drive data, either.

  • Data tech needs a paradigm shift. To do that, though, data platforms will need to provide:
  • Easy syncs from all of your data sources, such as internal production DBs and external APIs and services
  • Seamless, trustworthy, real-time, and scalable collaboration like you have with documents in Google Drive, tasks in Jira, or code on Github.
  • Simple to use transformations and aggregations which let you quickly produce a dataset that makes analysis easy to do so your entire team will want to pull data from your startup’s SSOT.
  • Flexibility in the analytics tools you can use. That could mean Excel or Google Sheets, but it should be equally simple to use SQL, Visual SQL, Python notebooks, or even ML.

There’s no reason data management should be more complicated than that. And there’s no reason companies should have to spend time messing around with data infrastructure instead of what they really care about: figuring out what the data is saying and strategizing about the action to take.

That’s why Panoply and Chartio are such a good match. We get why Excel is so alluring, but we’re also committed to creating a better way. We don’t just want to build better pipelines or prettier charts; we want to put users in the driver’s seat of their data.

This guest post is written by Roi Avinoam,CTO and Co-Founder at Panoply, a company that’s dedicated to making it easy to sync and store your data. Roi believes that data is essential to every organization and has been working for years to make it as accessible and easy to use as Slack.