Performing Cohort Analysis Using MySQL

Data Tutorial

Overview

What is a Cohort Analysis?

A Cohort Analysis is a method of analyzing a metric by comparing its behavior between different groups (cohorts) of users. The groups, or cohorts can be defined in any number of ways but are most commonly based upon the date that a user started a service.

Why is it important?

It’s essential to gaining relevant, actionable information from your database. Let’s take a look at the example demonstrated in the following two charts.

The first chart, “User Acquisitions by Marketing Campaign”, shows that our marketing campaign “TV” is out-performing all others in generating new users. However, it tells us nothing about how active these users actually are after registration. Because there is a different cost associated with each campaign, all user acquisitions are not created equal.

Our second chart “Orders By Campaign / Months After Registration” visualizes a cohort analysis of how customers are using our web site. We compare each of the marketing campaigns by how active the users remain over time. In this case, we are looking at the number of orders placed per month after registration, grouped by marketing campaign.

This chart paints a completely different picture. Even though our “TV” campaign is the most effective at acquiring new users, those users are the least likely to become recurring customers. Using this data, we can evaluate if the acquisition cost per user for this marketing campaign is justified.

Definitely great information for us to have, but how did we get there? Let’s go through the steps of performing cohort analysis in MySQL.

Example Walkthrough

The first thing we want to do is take a look at the structure of our MySQL database by running the SHOW TABLES statement. sql SHOW TABLES;

Looking at the list of tables, we are probably most interested in the Users and Orders tables. Let’s use the DESCRIBE statement to dig deeper into them. sql DESCRIBE Users;

DESCRIBE Orders;

It looks like these tables contain all of the info we need:

  • Users.Created_Date - The user’s registration date.
  • Users.Campaign_ID - The marketing campaign that brought them to our web site.
  • Orders.Order_ID - The orders placed by the user.
  • Orders.Created_Date - The date the order was placed.
  • User_ID - The foreign key that will connect our Users and Orders table.

Now that we have identified the pertinent columns we can run the following SELECT statement to verify the data. sql SELECT Orders.Order_ID AS Order ID, Orders.Created_Date AS Order Date, Users.User_ID AS User ID, Users.Created_Date AS Join Date, Users.Campaign_ID AS Campaign ID FROM Orders JOIN Users ON Orders.User_ID = Users.User_ID

We use a basic MySQL join so we can view data from both tables together. When using joins we need to specify the table and column name as TABLE.COLUMN to account for ambiguous column names. Without this, MySQL wouldn’t be able to distinguish between the Created_Date and User_ID fields that exist in both the Orders and Users tables.

To make our results more readable we use an AS statement to create meaningful alias for our columns. For example, creating the alias ‘Join Date’ for Users.Created_Date.

The date column displays the date down to the millisecond but all we are concerned with is the month. We will use the DATE_FORMAT function to give us only the date of the order. sql DATE_FORMAT(Orders.Created_Date, '%Y%m') AS Month of Order Date, DATE_FORMAT(Users.Created_Date, '%Y%m') AS Month of Join Date, We are also going to drop Users.User_ID from our statement because this information is not relevant, we are only using this column to join our tables.

It is not necessary to include a column in your select statement when using it for a join. You will notice below that we are joining on Orders.User_ID = Users.User_ID but selecting neither of these columns. sql SELECT Orders.Order_ID AS Order ID, DATE_FORMAT(Orders.Created_Date, '%Y%m') AS Month of Order Date, DATE_FORMAT(Users.Created_Date, '%Y%m') AS Month of Join Date, Users.Campaign_ID AS Campaign ID FROM Orders AS Orders JOIN Users AS Users ON Orders.User_ID = Users.User_ID

We are getting closer now. But what we really want to know is how long a user remains active after they register. This can be determined by looking at the difference between the order and join date. To find this value we use the PERIOD_DIFF function. sql PERIOD_DIFF(DATE_FORMAT(Orders.Created_Date, '%Y%m'), <br> DATE_FORMAT(Users.Created_Date, '%Y%m')) AS Difference In Months

Here’s our final query: sql SELECT Orders.Order_ID AS Order ID, DATE_FORMAT(Orders.Created_Date, '%Y%m') AS Month of Order Date, DATE_FORMAT(Users.Created_Date, '%Y%m') AS Month of Join Date, PERIOD_DIFF(DATE_FORMAT(Orders.Created_Date, '%Y%m'), DATE_FORMAT(Users.Created_Date, '%Y%m')) AS Difference In Months, Users.Campaign_ID AS Campaign ID FROM Orders AS Orders JOIN Users AS Users ON Orders.User_ID = Users.User_ID

That looks pretty good, but we don’t need all of the information in this table. We are going to drop the ‘Month of Order Date’ and ‘Month of Join Date’ columns.

We are also going to use a COUNT aggregation to count the number of orders instead of listing them. sql COUNT(DISTINCT Orders.Order_ID) AS Count of Orders

Now we have the following query. sql SELECT COUNT(DISTINCT Orders.Order_ID) AS Count of Orders, PERIOD_DIFF(DATE_FORMAT(Orders.Created_Date, '%Y%m'), DATE_FORMAT(Users.Created_Date, '%Y%m')) AS Difference In Months, Users.Campaign_ID AS Campaign ID FROM Orders AS Orders JOIN Users AS Users ON Orders.User_ID = Users.User_ID

Where did all of our data go? Well, we performed a count of Order_ID, but we didn’t specify how we wanted to group the data, so our query counted all of the orders in our database and returned it in a single row.

Let’s add GROUP BY and ORDER BY statements to group our data into meaningful buckets. sql GROUP BY Difference In Months, Campaign ID ORDER BY Difference In Months ASC

Our final query looks like this. sql SELECT PERIOD_DIFF(DATE_FORMAT(Orders.Created_Date, '%Y%m'), DATE_FORMAT(Users.Created_Date, '%Y%m')) AS Difference In Months, Users.Campaign_ID AS Campaign ID, COUNT(DISTINCT Orders.Order_ID) AS Count of Orders FROM Orders AS Orders JOIN Users AS Users ON Orders.User_ID = Users.User_ID GROUP BY Difference In Months, Campaign ID ORDER BY Difference In Months ASC;

Our cohort analysis is successful and we can chart our data in a variety of ways to get meaningful visual data.