Off the Charts: Interview with William Chen, Data Scientist at Quora

Posted by aj on March 8, 2016 in Off The Charts

We’re excited to interview William Chen, Data Scientist at Quora. William also contributes heavily to Quora’s community by writing extensively about Data Science, Statistics, Data Analysis, Machine Learning, and Probability.

Some of his most viewed answers include “How can I become a data scientist?”, “How do I prepare for a data scientist interview?”, and “What is a data scientist’s career path?”


You’ve written extensively on Quora about your path to becoming a data scientist. For our readers, can you briefly share your background?

My main academic interest in college shifted from Mathematics to Statistics sometime during my Sophomore year - a lot of that was due to the influence of a professor I had, named Joe Blitzstein, who taught the introduction to probability class that spurred my interest in the field.

I was looking for various opportunities the following summer where I could apply my statistical interest, and the awesome folks over at the Etsy data team decided to take a chance on me by giving me the opportunity to intern there. This was back in 2012 when “data science” was getting more and more traction. I really enjoyed my summer there and realized that it was a path that I wanted to continue on.

The following summer, I wanted to try something back on the West Coast around the San Francisco area, since that was where I was from and where all the buzz and action seemed to be. I applied for an internship at Quora in January 2013 (which is actually what got me to start writing), got an offer, and worked there for the summer. Afterwards, I accepted the return offer, and I have now been working at Quora as a data scientist for a bit over a year.

I feel very fortunate to have been able to jump straight from undergrad into a data scientist career - especially since I hear of many people taking more circuitous routes to start in the same career as me. Not all of this was premeditated - in a separate universe I might have chosen to work in quantitative finance instead, but I’m very happy looking back at the choices I made to end up where I am today.

You contributed to The Data Science Handbook which is an amazing resource for anyone in the data science field. Can you tell us a little bit about the book and what the motivation behind it was?

The book is a collection of 25 interviews with data scientists from all around the US and from a variety of backgrounds. Some of them are extremely well-known in the field - especially DJ Patil, Hilary Mason, and Peter Skomoroch. Contrary to its name, the book is not meant to be a technical guide to data science - it’s meant to be a series of conversations in which these data scientists share their stories, advice, and lessons from throughout their careers. Kind of like a Founders at Work for data scientists.

Interestingly enough, I wasn’t part of the original founding team for the book - I actually joined a few months in after two of the team members decided to interview me for inclusion in the book (and then later extended an invitation to join them and their team!). The original idea came because one of my co-authors (Carl Shan) and our designer (Brittany Cheng) had created a similar book for product managers and they got together a group that was interested in creating a version for data scientists.

The primary goal of the book for us was not to make money - especially since we’re splitting the revenue four ways, the money that we get isn’t really enough to solely justify all the time we put into the project. We approached this project more as an opportunity to learn and share.

On the learning part - it was tremendously enjoyable for us to reach out to and speak to so many data scientists, especially since they were all people we admired in different ways. I really enjoyed the interviews that I was able to conduct, and also reading through the interviews over and over in the editing process to try to make their message as clear and as meaningful as possible. It was also interesting and motivating for me to learn more about the book publishing industry and the process of creating a self-published book.

On the sharing part - I really enjoy collecting together resources and helping people learn more about data science. It’s something that makes the experience valuable to me, and something that I continue to do by writing answers and curating resources on Quora. I enjoy these conversations and want other people to enjoy them as well.

Each person in the book added a different perspective on the question: “How can I become a data scientist?”. What was your biggest takeaway from the book on that question?

One of my favorite pieces of advice for aspiring data scientists came from Joe Blitzstein (the professor who taught me statistics, and one of the interviewees in the book). He said the following:

I noticed that’s a trap that people fall into, thinking, ‘I’m perpetually feeling unprepared.’ It’s a dangerous way of thinking: that until you know X, Y, Z and W, you’re not going to be able to do data science.

What I loved about this quote was that it so aptly summarized the main barrier some people have when trying to transition to data science. I occasionally get messages from people in this position asking for advice.

People read all these articles about how data scientists have these huge ranges of skills - and they feel pressured to learn all of them and be competent at all of them at the same time before being a data scientist. As Professor Blitzstein mentioned, these people decide that they have to finish reading these X books and complete these Y MOOCs before they can become a data scientist. This mode of thought is paralyzing, not practical at all, and sets up an insurmountable obstacle before you even start.

While data science does cover a whole bunch of skills - you don’t have to learn them all at once. Professor Blitzstein suggested a better approach that facilitates more understanding - motivate what you need to learn by real, applied problems that you’re trying to tackle. This way, the task of learning data science from scratch will become more manageable.

This (getting data science knowledge motivated by solving real, applied problems) was a recurring theme in the Data Science Handbook. For example, Diane Wu took machine learning courses to help with her research, George Roumeliotis developed software engineering skills by building the prototypes at his startups, and Michelangelo D’Agostino was introduced to machine learning after working with a neutrino physics experiment at the South Pole.

The other way around (starting with knowledge of quantitative techniques and then discovering data-science-like problems) was common too, with an even greater number of examples in the Data Science Handbook.

The main takeaway here is that you can’t just learn data science or fall in love with it by reading books or taking online classes. You have to immerse yourself in real, applied problems.

With 25 people included in the book, it must have been a lot of work. Did you learn anything interesting about putting together a successful book?

One of the most exciting opportunities that arose for us after we chose the self-publishing route was the chance to try a Pay What You Want (PWYW) model, where we let people literally pay what they want for a digital copy of The Data Science Handbook. The minimum was $0 and the recommended amount was $19.

This model was extremely effective because we were selling a product that has zero marginal cost. The book exploded in popularity in the first week through various channels, notably Twitter, and people seemed much more willing to retweet and share the book since it was something they could download for free. The resulting publicity from virality and word of mouth was staggering, and would have cost us a lot of money had we decided to buy that kind of distribution instead. We got 2.5k downloads of the product in the first day alone, which netted us $5k of revenue.

We actually ran an experiment on our subscriber base to test out whether a PWYW model could work well before we actually launched the book. You can see the results of that test in our blog post here.
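
As a rough back-of-the-envelope check on the first-day figures quoted above, the short Python sketch below works out the average amount paid per download. The last step is only an illustration: it assumes, hypothetically, that no one paid more than the $19 recommended amount, which the interview does not state, and uses that to bound how many downloaders must have paid something.

```python
from math import ceil

# Rounded first-day figures quoted in the interview.
downloads = 2500           # "2.5k downloads"
revenue_usd = 5000         # "$5k of revenue"
recommended_price = 19     # recommended PWYW amount in USD

# Average revenue per download, counting the people who paid $0.
avg_per_download = revenue_usd / downloads
print(f"Average paid per download: ${avg_per_download:.2f}")

# ASSUMPTION (not from the interview): nobody paid more than the
# recommended price. Under that assumption, at least this many
# downloaders must have paid something to reach the quoted revenue.
min_payers = ceil(revenue_usd / recommended_price)
min_paying_share = min_payers / downloads
print(f"At least {min_payers} payers ({min_paying_share:.1%} of downloads)")
```

If some buyers paid more than the recommended amount, the true number of payers could be lower than this bound, so it is best read as an intuition aid rather than a measured result.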

In addition to The Data Science Handbook, you’re responsible for another popular piece of content, The Only Probability Cheatsheet You’ll Ever Need. Can you tell us a bit about this project and why it has been so successful?

In Harvard’s Introduction to Probability class (Stat 110), students were allowed an 8-page cheat sheet on their final exam to stuff with all of the formulas, concepts, and example problems that they needed. I had already compiled a bunch of notes from the class in a series of LaTeX documents, so it wasn’t too hard for me to condense them into a compact 8-page cheatsheet for students to use during the exam (I was a Teaching Fellow at the time, so my objective was to help the students).

As you can imagine, the cheatsheet was very popular within the class, so with permission from the instructor, Joe Blitzstein, I decided to launch a public version of the cheatsheet in July 2014. Professor Blitzstein (whom I know very well) made some significant contributions to the cheatsheet in August 2015, which led to the colorful and visual version that is available today.

The cheatsheet remains very popular and gets picked up on Twitter every month or so, cumulatively drawing hundreds of retweets and likes. I’m pretty sure this is a combination of two things - demand and ease of use. On the demand side, you have plenty of students taking probability courses or courses with probability as a prerequisite, and you have plenty of people in the workforce who need to reference probability occasionally. On the ease of use side - people love a good TL;DR, something that can get them the information that they need in an easily digestible, efficient, and compact form.

The material on the cheatsheet is pretty comprehensive - it covers the whole of Harvard’s Introduction to Probability class, which in turn covers most of the probability that anyone might ever need to know. It’s kind of a personal mission for me to help synthesize and condense knowledge into forms that are easy to digest and understand. I want to keep making helpful resources like this, and I continue to do so by answering questions on Quora.

Given Quora employs talented data science folks such as yourself, it’s probably a very data driven company. What do you think it takes for companies, especially less technical companies or departments, to become more data driven?

Companies need to be convinced that data can give them a competitive advantage in order for the top-level executives to decide to invest a significant amount of resources in building a data team or investing in a data-driven culture.

Sometimes the competitive advantage is immensely obvious - for example at Netflix, 75% of viewer activity is estimated to come from recommendations, so investing in data and machine learning as a core competency can clearly help the bottom line of a company.

Other times, the competitive advantage is less obvious - for example, Dropbox’s file-syncing is the best of its kind because of its reliability and ease of use, not because of data science. Even in cases where data science is not what distinguishes the product, there are still avenues for companies to gain advantages using data.

For example - when your company makes revenue by selling premium versions of free enterprise software, data science can help target which free software users are most likely to convert to paid users, and thus increase the efficiency of the whole sales and marketing organization.

If your company makes a consumer product, the value of becoming data driven can also be clear - there will be various key funnels in the usage of the product, involving actions such as conversions on a shopping cart or engagement on posts by others. Developing an experimentally-driven culture in these companies helps product teams test and launch new features to optimize these key metrics quickly and with autonomy. This can happen with a culture of listening to the data to make product decisions, instead of requiring every feature to go through a manual decision process.
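
To make the idea of testing a feature against a funnel metric a little more concrete, here is a minimal sketch of one common way to compare conversion rates between a control group and a feature group: a pooled two-proportion z-test. This is not a description of how Quora or any other company mentioned here runs experiments; the function name and all of the counts are hypothetical and chosen only for illustration.

```python
from math import sqrt, erfc

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p) for the difference in conversion rate, B minus A.

    Uses the pooled-variance normal approximation, which is reasonable
    only when both groups are large and conversions are not extremely rare.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability under the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Hypothetical checkout funnel: A is the control, B has the new feature.
z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

A team with this kind of check built into its launch process can decide whether a change moved the metric without a separate manual review, although in practice the sample size and the stopping rule should be decided before the experiment starts.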

Are you heavily involved in data science/analytics, and interested in being interviewed for our upcoming “Off the Charts” series? Email us!
