Interview

Off the Charts: An Interview with Kevin Novak, Data Scientist at Uber

Posted by dave on October 24, 2013 in Off The Charts

I interviewed Kevin Novak, data scientist at Uber, a while ago, but we’re happy to finally publish our talk as part of our ongoing ‘Off the Charts’ interview series. If you’d like to hear more from Kevin, you can follow him on Twitter, and be sure to keep an eye on our blog for more interviews.

You recently gave a talk at General Assembly about bridging the gap from academia to industry. What do you consider the biggest takeaway from your talk?

I did! Uber Data has been organizing a series of talks about data, tech, and entrepreneurship. We recently partnered with General Assembly, and I was asked to personally give our most recent one. We’re going to be doing several more with General Assembly going forward, so keep your eyes and ears open.

As for the biggest takeaway from my talk: I’d say that with the right attitude and framing, making the jump from academia to industry is pretty easy. Most academics I’ve spoken with get hung up on either their programming skills or get intimidated by the pace or lingo of the industry. However, for me personally, the programming skills I look for in a potential candidate are pretty low, mainly because a) the average programming skill of somebody in the Bay Area is higher than average to begin with, b) programming is an incredibly teachable skill, especially for somebody with a quantitative, logical mindset, and c) it’s a candidate’s market; their other skills are in incredible demand.

Once you (as somebody who wants to become a data scientist) realize that I’m interested in literally everything about your background except how well you program, it changes the game for how you think about the employment market. Now, don’t mistake me: programming is and will be a large part of your job, and any successful data scientist will have at least some familiarity with it. I’m simply stating that compared to abstract mathematics, statistics, and analysis skills, programming is low on the list.

How did you get into data science?

Even from the time I was considering what to study in college, I knew that I was interested in the intersection of math, science, and computers. I ended up studying physics, but frankly, I was a weird physicist; instead of running experiments on lab equipment, I was the guy sitting in the corner trying to figure out how to make computer-generated holograms on self-optimizing clusters. I took so many other quantitative classes that interested me while in school that I found I had earned math and computer science degrees without realizing it.

I went on to graduate school at Michigan State in physics, and that’s really where data entered the picture. I was looking for a lab that would let me continue to play around with computers and programming, and I discovered that the nuclear physicists, especially the theorists, were doing really interesting work with data. They were building these awesome, incredibly complex models that tried to confirm ideas theorists had about phenomena the physics community was seeing in the massive amounts of data coming out of new particle accelerators like the LHC. I spent most of my research time building a statistical software package that helps physicists tune their models’ parameters to the data they’re observing, getting a crash course in nuclear physics, statistical physics, and data analysis along the way.

Unfortunately, career prospects in academia aren’t all that bright these days, and even less so in the nuclear physics field. I was contacted by Uber in July 2011 with the opportunity to join the team as their second full-time data scientist, and jumped into the world of data science, entrepreneurship, and transportation with both feet. It’s been a great transition; I get to use data and mathematics to solve interesting and complex problems every day.

Do you think your physics background helped? Do you recommend others interested in data science to major in physics? If not, what do you think those interested in data science should study?

I do, but less for the subject matter and more for the mindset it encourages. Becoming successful in physics means you become adept at taking complex, abstract ideas and translating them into mathematical form. It helps you build mathematical intuition, which is especially valuable when you’re employed in a field that’s so young.

Another way in which it helps is in teaching you the science and art of approximation. Generally speaking, you spend your high school and first year of undergrad learning a greatly simplified version of physics (F = ma, gravitational acceleration is 9.8 m/s^2, V = IR, etc.), and then the next several years learning that everything you just learned isn’t strictly correct, but rather a good approximation of another, more complex system. Then you go to grad school, and the process repeats itself. You become very good at understanding how to approximate, as well as when and why it’s appropriate. As a professional, especially in the startup field, approximation is agility; an approximate answer you can ship now is usually preferable to an exact answer next week.
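To make the approximation point concrete, here’s a quick sketch (my example, not Kevin’s) of the classic small-angle approximation sin θ ≈ θ from pendulum problems. Quantifying the relative error tells you exactly when the shortcut is safe to use:

```python
import math

def small_angle_error(theta):
    """Relative error of the approximation sin(theta) ~= theta."""
    return abs(theta - math.sin(theta)) / math.sin(theta)

for theta in (0.05, 0.2, 0.5, 1.0):
    print(f"theta = {theta:.2f} rad -> relative error {small_angle_error(theta):.2%}")
```

At 0.05 radians the approximation is off by well under 0.1%, while at 1 radian the error is closer to 20%; knowing that boundary is what lets you ship the approximate answer with confidence.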

Generally speaking, these sorts of skills are common to a lot of quantitative disciplines, not just physics. Mathematics, Statistics, Economics, or Computer Science are all excellent choices. Even theoretical chemistry or biology provide useful skillsets, especially if you learn computer programming and theory on the side.

Do you think the hype around data science is warranted? If so, why?

Yes and no. I definitely think the hype is timely: we’re finally at a place where the basic ingredients of good data analysis are all available to a huge number of people. People can store information incredibly cheaply (so there’s more of it), computational horsepower is available on demand in the cloud, and our ever-more-connected world presents opportunities for data resources that simply didn’t exist before. Add all this together, and you’ve got amazing opportunities for data.

However, the flip side is that many people get carried away and think that data science will solve all our problems, which simply isn’t true. We’re getting better at solving data problems faster and expanding what we can do every day, but data science still has plenty of limitations. If you don’t have the right data or enough data, there’s very little I can do for you, and usually the only solution I can offer is to wait, or pursue a non-data solution.

How should companies interact with data scientists? What type of work do data scientists really love to dig into?

In my experience, data science is probably best thought of as a more advanced form of research and development. I’ve found that initially approaching an issue as “What can data tell me about ?” tends to get the best results; if or when you discover something of interest or use, you can dial in project requirements and operational considerations.

Additionally, building a data-friendly engineering culture is a major requirement for success as a data scientist. First, get greedy about what data you collect. Storage is cheap relative to what you can gain from historical data, and rebuilding a dataset you failed to capture can be quite costly in time. For example, at Uber, we’ve got every GPS point for every trip ever taken at Uber, going back to Trip #1. Additionally, keeping historical information about your system and capabilities will make analysis doable down the road. Versioning your database schema, keeping change logs, etc., are all small engineering requirements that have a huge impact on data science later on. At Uber, the goal is to have everything on hand to answer the question “What did the Uber system look like at this point in time?”, from customer and supply behavior to inter-server communication to the state of the database. Even if something doesn’t have analytical value currently, there will almost certainly be a use for it down the road; many of Uber’s demand models use behavior data that was originally stored as a debugging tool.
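As a minimal sketch of the change-log idea (a hypothetical schema for illustration, not Uber’s actual system), an append-only table of configuration changes lets you replay what the system looked like at any point in time:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE config_changelog (
        id INTEGER PRIMARY KEY,
        changed_at TEXT NOT NULL,   -- UTC timestamp of the change
        key TEXT NOT NULL,          -- which setting changed
        old_value TEXT,
        new_value TEXT
    )
""")

def log_change(key, old, new):
    """Append a change record; rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO config_changelog (changed_at, key, old_value, new_value)"
        " VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), key, old, new),
    )

# Hypothetical setting name, purely for illustration
log_change("surge_multiplier_cap", "3.0", "2.5")

# Answering "what did the system look like at time T?" is a matter of
# replaying the changelog up to T.
rows = conn.execute("SELECT key, new_value FROM config_changelog").fetchall()
print(rows)
```

The design choice that matters is append-only: because nothing is ever overwritten, every historical state is recoverable, which is exactly what later analysis needs.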

If you’re a company interested in attracting data scientists, start capturing data now and start storing it somewhere (it doesn’t even have to be all that fancy or involved; as long as the data is there, it can always be cleaned up later). Scale your engineering with historical analysis in mind, and start isolating “hard math” areas of the company’s goals that you think data can contribute to.

How does Uber use data science?

Data science is a fundamental part of Uber’s philosophy and product. Organizationally, Uber does an outstanding job of hiring data-oriented people throughout the company, so I get to spend most of my time doing predictive modeling, research and development, and data evangelism. The data team (“Team Science”) is embedded with Uber Engineering, but we tend to operate outside the normal development cycle, which lets us really dive deep into some of Uber’s fundamental math problems. On the product side, the data team has been responsible for all of the models powering our ETA algorithms (“Your driver will be here in 5 minutes”, etc.), our dynamic pricing algorithms, the fare estimator, and the heat maps that show our drivers where to position themselves in the city, as well as many more.

Is there any work you can share with us, from either an Uber perspective or a personal perspective?

Unfortunately, I don’t have a lot right now. I’ve been asked to keep most of the data-related libraries at Uber private at this point, and while I’ve got several personal side-projects brewing, none are quite ready for the light of day yet. However, I did write and contribute to several of Uber’s non-data related open-sourced projects, including Minos and Clay.

What are your go-to tools for doing data science?

I tend to write most of my own modules myself, and I’ve found Python to be an excellent, general-purpose programming language for data science projects. Pandas, Numpy/Scipy, and Matplotlib are some of my more common third-party modules. I use R, Matlab, or Octave occasionally; they work well for one-off projects or prototypes, but are not well-suited to putting code into a production stack. For data visualization, I’m working on becoming more D3-savvy; most of that work is in JavaScript or CoffeeScript. Most of my work isn’t in the “Big Data” regime, so I’ve found most SQL frameworks adequate, but I prefer Postgres.
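For readers unfamiliar with that stack, here’s a minimal sketch of a Pandas/Numpy exploratory workflow (synthetic data invented for illustration, not anything from Uber):

```python
import numpy as np
import pandas as pd

# Simulate a small table of trips with a per-trip ETA in minutes
rng = np.random.default_rng(42)
trips = pd.DataFrame({
    "city": rng.choice(["SF", "NYC", "Chicago"], size=1000),
    "eta_minutes": rng.gamma(shape=2.0, scale=3.0, size=1000),
})

# Median ETA per city: a typical first-pass exploratory summary
summary = trips.groupby("city")["eta_minutes"].median()
print(summary)
```

The same DataFrame could be handed to Matplotlib for plotting or written to Postgres, which is part of why Python works well end-to-end where R or Matlab prototypes often need rewriting for production.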