As a MailChimp customer and fan, it's wonderful to have John Foreman, chief data scientist at MailChimp, as part of our "Off the Charts" interview series. If you are going to Strata next week, check out his talk, Dissecting Data Science Algorithms using Spreadsheets, and/or follow him on Twitter for more insights.
Before I was a data scientist at MailChimp, I was for many years an analytics consultant to large enterprises. I got to work on some really awesome problems. Now, this was prior to the heady days of big data and data science. So once I entered the data science world, I quickly realized that those I'd left behind in the enterprise space didn't speak the same language as those in the data science world.
Specifically, Excel, SAS, Teradata, and Oracle seem to rule the enterprise where terms like "business intelligence" are bandied about, whereas R, Python, and Hadoop have a better hold in companies talking data science.
So I asked myself, how would I teach the important techniques of the data science world to folks in the enterprise world? And by teach, I mean really dig into the algorithms. It's hard to teach machine learning and data mining in R where most tutorials seem to start with "now load the relevant library that's done all the work for you."
I decided to teach data science in spreadsheets. There are no libraries to help you. There is no code. There are just numbers and formulas, which is awesome for learning and terrible for production. So in the book, I teach readers the algorithms in great detail. Then at the end of the book, I gently usher the reader into R once I know that everyone has a rock-solid understanding of the techniques.
So I asked myself, how would I teach the important techniques of the data science world to folks in the enterprise world? ... I decided to teach data science in spreadsheets.
Couldn't agree more. Everyone should be able to write spaghetti code, and everyone should be able to pull and analyze data. And I'm not just talking about business-folk here.
Everyone should feel the power of writing a terrible data-munging prototype script at some point.
Look at what's going on in the digital humanities. Now, even literature, history, and religious scholars can use data to shed new insight on old texts. How awesome is that? But you have to be able to actually analyze the data. That means being able to query and scrub; that means knowing a bit of probability and statistics. The difference between a median and mean would be a start.
So yes, it's no longer acceptable to say, "I suck at math!" and then ignore that part of the world.
I suck at physical exercise, but that doesn't mean it's OK for me to melt into a chair all day. We all need to work at the important stuff in life, and understanding data has become terribly important.
I've been practicing data science for years; it was just called analytics or operations research.
I started as a pure math guy. I loved abstract algebra. But I decided to try something a little more applied, so I worked for a professor studying the mathematics of knot-tying, which has applications in protein folding and physical cosmology. This was my first taste of writing awful C code and hunting for memory leaks with Valgrind.
I ended up at the NSA and then Booz Allen doing large analytics projects for the government. Awesome math, not so awesome clients, so I left the government to consult for large enterprises like Coca-Cola, building supply and revenue optimization models. I believe that the revenue optimization models that companies like Intercontinental Hotels and Royal Caribbean have been building in the enterprise world are great examples of data science before there was data science.
From there, MailChimp seems like a strange jump. Fortune 500s to a start-up culture.
But for however much data most old-guard Fortune 500s have, MailChimp has more. MailChimp sends four hundred million emails a day for customers all across the globe, and we track engagement (opens, clicks, unsubscribes, abuse reports, Google analytics data, e-commerce data, custom triggered events, etc.) on those newsletters to the tune of another hundred million events a day. And that's just the beginning.
That type of activity puts MailChimp on the Alexa 500. Twice.
So I traded in my Oracle and SAS chops for Postgres, R, and Redis, and the rest has been a blast.
On the MailChimp data science team, we endeavor to move beyond the aggregate. Too often companies stop at pulling aggregate stats via Hadoop and call it a day. Or maybe they'll release an infographic of their aggregate data to prove to their investors that their big data investment was worth it. That's great, but it's not enough to give a company a competitive edge.
Yes we provide summary data in reports and blog posts to our internal and external customers. But our real passion lies in building data products. We use machine learning, optimization modeling, forecasting, etc. to build tools that improve MailChimp as an application and help other teams and our customers do their work better. That should always be the goal of a good data science team -- leading from the back, giving people the data-driven insight they need to do their work better.
That should always be the goal of a good data science team -- leading from the back, giving people the data-driven insight they need to do their work better.
MailChimp sends email, so to start at the top, we save all the content our users send through us. We also have all of the user's account meta-data. But let's move to the individual address level, because that's where things get interesting. MailChimp sends email to billions of unique email addresses all across the world. So we have all of an email address's subscriptions, which is a great vector when trying to understand interest and demographic data. The typical fantasy football newsletter subscriber is very different from the typical quilting newsletter subscriber.
Then there's email engagement data -- emails sent through us generally have open tracking and click tracking turned on. So we get sends, opens, and clicks at an individual level. We also get unsubscribes and abuse reports. With engagement comes geolocation and device preference, so we know whether the reader is on a mobile device for instance.
But the engagement data goes beyond clicking. Those MailChimp users who choose to can use MailChimp's Goal feature to track subscribers once they go to the sender's site from an email campaign. You can track abandoned shopping carts, purchases, etc. once the readers click through. These interactions power better segmentation and reporting for the user.
All of this data goes into building products that allow our users to better understand and speak to their audiences. This is handy for publishers, nonprofits, and small businesses alike. The more you can understand what your readership is interested in, the better you can engage them rather than just "blasting" at them as email marketers did years ago.
We track a lot of data to verify our models are working. For instance, we have a model called Omnivore that shuts down bad users while letting good users sail through. We track metrics around abuse to make sure it's working and not in need of retraining.
But at a high level, we try to avoid tracking a lot of ugly metrics like ARPU (average revenue per user). We've seen from our competitors that when you track these company-level revenue metrics, you start trying to do good by the metric rather than by your customers. And that's when things get perverse. That's when you get internal politics and people playing games with the numbers to further their own careers. So when it comes to things like the ROI of a billboard campaign, you can count on me to fight against my need to measure everything.
We've discovered all sorts of cool stuff! Regarding the optimal send time, we discovered that that's a myth. Humans are complex and their schedules are complex, so the best time to engage them depends on the sender and their content as well as on the reader. That's why we built Send Time Optimization. Rather than assume that all senders and subscribers are the same, the model uses data (what a novel idea!) to figure out when your readership will most likely engage with your content.
Regarding the optimal send time, we discovered that that's a myth.
I recently discussed in a post on the MailChimp blog the age and browser preference of email addresses from the big free email providers. I'm sure it comes as no surprise that Gmail users are about a decade younger than AOL users. For both email providers though, the number one way the readers view email is on the iPhone (for AOL, the AOL Explorer browser still plays a role which is frightening). Also, Gmail email addresses are disproportionately interested in software/apps newsletters, while AOL addresses seem to prefer reading about politics. Go figure.
R and Python are big tools for us. We also do a lot of data pulling and manipulation via SQL and some in-house, custom map-reduce tools. But more than any of these, oddly enough, I'd say we spend a huge chunk of our time pushing around data at the command line. I've been putting in a lot more time this year using awk and sed than I ever would have thought! That's one set of skills every data scientist should have -- a great knowledge of how to work with files from the command line. Why write a Python script to do something when a few pipes in bash will do the job in 10 seconds?
I'm a big fan of reading books. Seriously. They're often better at preparing individuals for a role than a bunch of bookmarked blog posts. And I say that as a guy who writes a blog.
My book would be a great start! But there are actually a number of excellent books out there. Max Kuhn's new Springer book is excellent for those who want a little more depth. Or go "full Hastie" if you like.
The data science community is really active online and generally friendly. Engage with others on sites like DataTau and Cross Validated, not to mention Twitter. Data science folks seem to like Twitter...must be all that streaming, unstructured text.
We're really excited about all of the Chartio news today. Not only did we get to announce some funding (more on that later) and a new services offering, but we've also launched two new product features that we feel are major game changers for cloud Business Intelligence.
There's a lot to talk about, so let's get started with the new features.
Chartio now enables queries from different data sources to be combined (blended) together for visualization or further analysis. Data is fetched from different sources using multiple layers and then blended together with one of three different methods, controlled by a simple drop-down.
The blending process, we think, will have a huge impact on the standard Business Intelligence workflow.
Traditionally, organizations needing to do analytics across multiple data sets had to undergo the enormous process called "Extract, Transform, Load," or ETL for short. The process typically takes several months, as a data team needs to design a data warehouse and all of the data-loading processes involved while predicting any analytic questions that may arise in the future.
ETL is an incredibly difficult and costly task and blending removes much of the need for it.
With blending, an engineer doesn't have to predict what may be interesting to chart or what all of the data sources will be months in advance. A user can just get going and worry about combining the data later, when the chart is actually being made. All queries are pulled from the live sources and very little extra work is needed to determine how the data is combined.
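To make the idea concrete, here is a minimal sketch, in pandas rather than Chartio, of what blending two query result sets amounts to. The data sources, column names, and numbers below are hypothetical, and the "blend method" drop-down corresponds roughly to choosing a join type.

import pandas as pd

# Layer 1: daily visitors, e.g. pulled from Google Analytics (hypothetical values)
visits = pd.DataFrame({
    "date": pd.date_range("2014-01-01", periods=4, freq="D"),
    "ga_visitors": [1200, 950, 1100, 1300],
})

# Layer 2: daily subscriptions, e.g. pulled from a Postgres database (hypothetical values)
subs = pd.DataFrame({
    "date": pd.date_range("2014-01-01", periods=4, freq="D"),
    "subscriptions": [24, 18, 30, 29],
})

# Blend the two layers on their shared column; "outer", "inner", and "left"
# mirror the kinds of combination methods a blend drop-down might offer.
blended = visits.merge(subs, on="date", how="outer")
print(blended)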
For more information see the data blending documentation or watch the following video tutorial.
The new Chartio Formulas enable you to create new calculated columns based on other columns in the query result set. It is quite similar to how you might create new formula-based columns in Excel and is built right in as an optional step in the chart creation process.
The following example shows a formula generating a new column that is the four-week moving average of another column in the data set.
A formula creating a 3rd column that is the Moving Average of the 2nd using the moving_avg function.
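For readers who want to see the arithmetic spelled out, here is a small pandas sketch of the same idea: a new column computed as the four-week moving average of an existing one. The data and column names are made up, and Chartio's built-in moving_avg function handles this for you.

import pandas as pd

weekly = pd.DataFrame({
    "week": range(1, 9),
    "signups": [120, 135, 128, 150, 160, 145, 170, 180],  # hypothetical values
})

# New calculated column: the four-week moving average of the signups column,
# analogous to a calculated column defined with a formula in Chartio.
weekly["signups_4wk_avg"] = weekly["signups"].rolling(window=4).mean()
print(weekly)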
Formulas are especially exciting when mixed with Blending. The following two images show a formula for mixing Google Analytics data with Subscription data from a database to calculate a conversion percentage for page visits to paid customers.
A Formula combining two blended columns into a conversion metric
The formula is written simply using the names of the blended columns, dividing Subscriptions by ga:visitors and multiplying by 100.0 to express the result as a percentage.
100.0 * Subscriptions / ga:visitors
The resulting calculated column of the conversion rate from visitors to paying customers.
Once the formula is defined and the Transform button is clicked, it is applied to every row of the new column, and the values can be visualized in the final visualization step just like any other data in Chartio.
Again, no ETL or data warehousing was needed to merge these two data sources. The query sets from both were simply blended together and the conversion was calculated using a formula!
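If it helps to see the computation written out, here is a small pandas approximation of applying that formula to a blended result set. The column names and values are hypothetical, and this is an illustration of the arithmetic rather than Chartio's formula engine.

import pandas as pd

# A blended result set like the one described above (hypothetical values)
blended = pd.DataFrame({
    "date": pd.date_range("2014-01-01", periods=4, freq="D"),
    "ga_visitors": [1200, 950, 1100, 1300],
    "subscriptions": [24, 18, 30, 29],
})

# Row-wise equivalent of the formula 100.0 * Subscriptions / ga:visitors
blended["conversion_pct"] = 100.0 * blended["subscriptions"] / blended["ga_visitors"]
print(blended)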
For a further explanation see the formulas documentation.
We've been focused for a long time on providing an easy-to-use, self-service Business Intelligence solution. We frequently run into customers, however, who want more than a product: they also want the time and expertise of experienced data scientists and database administrators.
Lastly, in order to support our new offerings and further accelerate our growth, we've raised another $2.2M in funding from Avalon Ventures. The round also includes a new investor, Jeff Fluhr, a founder of StubHub and Spreecast. We are excited to include Jeff as the second investor to come to us first as a happy customer!
We'd love to hear what you think of the new announcements! If you have any thoughts, questions, or requests, as always send us a note at firstname.lastname@example.org.
Now that the holidays are over, we thought that it would be a good time to take a closer look at those toys that our children, younger siblings, and other little ones in our lives received during the holidays.
We worked with our friends over at Enigma.io to visualize a Consumer Product Safety Commission (CPSC) database of toy recalls since 1973 in Chartio. It turns out that some of the most beloved toy brands are the biggest offenders.
Would you believe it if we told you that Fisher Price is the most high-risk manufacturer in terms of recalls? Or that the bicycle is the most dangerous toy?
Let's dig into the data to see what it tells us.
Total Recalls By Year
2007 certainly goes down in history as the year with the most toy recalls so far. 56 types of toys were recalled that year, nearly double the previous peak of 29 recalls in 2001. You may remember the heavy media coverage in 2007 about toy brands coming under fire because their manufacturers in China were using paint containing toxic levels of lead; this is precisely why we see a massive peak of recalls during that time. According to the New York Times, Mattel recalled 967,000 toys in total that year. Although many toy manufacturers purportedly took precautions to double-check the conditions of their production chains, further investigation revealed that Chinese manufacturers may have been trying to cut costs by using less expensive paint containing high levels of lead. Mattel was ultimately fined $2.3 million for violating the federal lead paint ban.
Interestingly, the graph above further illustrates that toxic levels of lead in toy paint didn't pose as much of a threat before 2007, and that despite heightened awareness and increased regulation, it has taken years for the issue to be contained.
Top Recalls By Manufacturer
After taking the lead paint debacle of 2007 into consideration, it's no surprise that Fisher-Price and Mattel (Fisher-Price's parent company) are at the top of the list for having the most recalls by manufacturer.
While lead content may have been the leading reason for recalls from 2007 to 2008, it's actually choking that has been the predominant reason for toy recalls for over four decades. Over half (52.5%) of the toys recalled since 1973 were due to choking hazards, followed by lead (16.5%), lacerations (9.9%), and burns (7.2%).
Recalls By Country
Throughout the years, Taiwan has had the most issues manufacturing toys. It has also had the biggest impact on total U.S. toy imports despite representing a relatively small percentage overall. Vietnam is the only other country that has surpassed Taiwan, in 2004 and 2006. These unexpected peaks are presumably due to a massive recall on Cordless Push Button Toy Telephones made in Vietnam and sold by major school supply distributors from January 2004 through February 2006. Once again, the leading reason for recalls on toys manufactured in both Taiwan and Vietnam is choking.
As you can see, bicycles are the most recalled toy, for various reasons. You can find out more about why they top the list on Chartio's dashboard. Following bikes are toy animals (28.7%) and toy vehicles (18.1%).
The chart above provides a deeper look at the toy types recalled, minus the bicycles. In 2013 only 16 toys were recalled. Some of the toys in question include the Bearmerzz Stuffed Animals with Flashlight, Toys-R-Us Imaginarium Activity Walker, Toys-R-Us 3-Channel Helicopters, and Build-A-Bear Stuffed Monster. The full list of toys recalled this past year, their hazards, and their manufacturers can also be viewed on the Chartio dashboard.
Toy companies are required to report any hazards to the CPSC as soon as they become aware of them. According to this MSN Living article, "The CPSC then analyzes the company's findings and considers the proposed remedy (usually a replacement or refund). This process can take from a few days to several months. If the CPSC concurs, the toymaker proceeds with "corrective action" - a voluntary or mandatory recall."
That being said, it is clear that some toys tend to slip through the cracks. Safekids.org said that in 2010 an estimated 181,500 children were treated in an emergency room for toy-related injuries. In other words, always exercise caution when buying new toys for your young ones!
Destine Ozuygur of Enigma.io and I collaborated on this piece together. Enigma.io is a web service that allows its users to dig into a vast amount of publicly available (but hard-to-obtain) data. We encourage you to check them out.
Today we are excited to add Bullet Graphs to our family of supported chart types. Bullet graphs are ideal for viewing a single value in some extra quantitative context and are an improved replacement for the gauge and meter charts cluttering many other Business Intelligence products today.
Bullet Graph specification from Stephen Few's Bullet Graph Design Spec
Perhaps the easiest way to explain bullet graphs is with an example. Let's say that you and your team want to keep an eye on your year-to-date (YTD) revenue.
Simply putting that value on a dashboard is helpful, but the standalone value offers no information about whether it is considered good or not.
If you choose a bullet graph to plot the revenue, you can define and plot the value against a set of quantitative ranges.
In the figure above, for example, a bad range is set from $0 to $20, good is from $20 to $90, and great is between $90 and $200. The ranges are indicated by different shades of grey in the background.
Now those viewing the chart will instantly know that the revenue displayed above is in the range considered great by the expectations set.
Bullet graphs are made with the same querying and chart building interface as the other charts. On the chart creator you'll now notice an additional chart type option for bullets.
Like Single Value charts, Bullet graphs require just a single measure to be returned from your data query. If you feed the chart more than a single value, you will be notified of the error.
Up to three quantitative ranges can be plotted; they are configured manually in the Chart Settings. The fields to adjust are Minimum, Level 1, Level 2, and Maximum.
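For those curious how the pieces fit together, here is a rough, illustrative sketch of a bullet graph drawn with Python and matplotlib, using the example ranges above. Chartio renders bullet graphs for you; this only shows how the qualitative ranges, the value bar, and an optional target marker combine, and the specific numbers are hypothetical.

import matplotlib.pyplot as plt

# Hypothetical numbers, loosely following the example above
value = 120                                # the single measure returned by the query
target = 150                               # an optional comparative marker
ranges = [(0, 20, "0.55"),                 # bad   (darkest band)
          (20, 90, "0.70"),                # good
          (90, 200, "0.85")]               # great (lightest band)

fig, ax = plt.subplots(figsize=(6, 1.2))

# Background bands for the qualitative ranges, as shades of grey
for start, end, shade in ranges:
    ax.barh(0, end - start, left=start, height=0.8, color=shade)

# The featured measure, drawn as a narrower dark bar
ax.barh(0, value, height=0.3, color="black")

# The target marker, drawn as a short vertical line
ax.plot([target, target], [-0.35, 0.35], color="black", linewidth=2)

ax.set_xlim(0, 200)
ax.set_yticks([])
ax.set_title("YTD Revenue")
plt.tight_layout()
plt.show()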