Close
Data TutorialsMarketing Analytics

Performing Cohort Analysis Using MySQL

Posted by AJ Welch, Mike Yi

Overview

What is a Cohort Analysis?

A cohort analysis is a method of analyzing a metric by comparing its behavior between different groups of users. The groups, or cohorts, can be defined in any number of ways but are most commonly based upon the date that a user started a service.

Why is it important?

Cohort analysis essential to gaining relevant, actionable information from your database. Let’s take a look at an example demonstrated in the following two charts.

The first chart, “User Acquisitions by Marketing Campaign”, shows that our marketing campaign “TV” is out-performing all others in generating new users. However, it tells us nothing about how active these users actually are after registration. Because there is a different cost associated with each campaign, all user acquisitions are not created equal. Figure 1: Bar chart of user acquisitions by campaign type

Our second chart “Orders By Campaign / Months After Registration” visualizes a cohort analysis of how customers are using our web site. We compare each of the marketing campaigns by how active the users remain over time. In this case, we are looking at the number of orders placed per month after registration, grouped by marketing campaign. Figure 2: Cohort-based bar chart of user retention by campaign type

This chart paints a completely different picture. Even though our “TV” campaign is the most effective at acquiring new users, those users are the least likely to become recurring customers. Using this data, we can evaluate if the acquisition cost per user for this marketing campaign is justified.

This is definitely great information for us to have, but how did we get there? Let’s go through the steps of performing cohort analysis in MySQL.

Example Walkthrough

The first thing we want to do is take a look at the structure of our MySQL database by running the SHOW TABLES statement.

SHOW TABLES;

Figure 3: Show tables result

Looking at the list of tables, we are probably most interested in the Users and Orders tables. Let’s use the DESCRIBE statement to dig deeper into them.

DESCRIBE Users;

Figure 4: Users table description

DESCRIBE Orders;

Figure 5: Orders table description

It looks like these tables contain all of the info we need:

  • Users.Created_Date - The user’s registration date.
  • Users.Campaign_ID - The marketing campaign that brought them to our web site.
  • Orders.Order_ID - The orders placed by the user.
  • Orders.Created_Date - The date the order was placed.
  • User_ID - The foreign key that will connect our Users and Orders table.

Now that we have identified the pertinent columns we can run the following SELECT statement to verify the data.

SELECT
Orders.Order_ID AS 'Order ID',
Orders.Created_Date AS 'Order Date',
Users.User_ID AS 'User ID',
Users.Created_Date AS 'Join Date',
Users.Campaign_ID AS 'Campaign ID'
FROM Orders
JOIN Users ON Orders.User_ID = Users.User_ID

We use a basic MySQL join so we can view data from both tables together. When using joins we need to specify the table and column name as TABLE.COLUMN to account for ambiguous column names. Without this, MySQL wouldn’t be able to distinguish between the Created_Date and User_ID fields that exist in both the Orders and Users tables.

To make our results more readable we use an AS statement to create meaningful alias for our columns. For example, creating the alias ‘Join Date’ for Users.Created_Date.

Figure 6: Joined users and orders table

The date column displays the date down to the millisecond but all we are concerned with is the month. We will use the DATE_FORMAT function to give us only the date of the order.

DATE_FORMAT(Orders.Created_Date, '%Y%m') AS 'Month of Order Date',
DATE_FORMAT(Users.Created_Date, '%Y%m') AS 'Month of Join Date',

We are also going to drop Users.User_ID from our statement because this information is not relevant, we are only using this column to join our tables.

It is not necessary to include a column in your select statement when using it for a join. You will notice below that we are joining on Orders.User_ID = Users.User_ID but selecting neither of these columns.

SELECT
Orders.Order_ID AS 'Order ID',
DATE_FORMAT(Orders.Created_Date, '%Y%m') AS 'Month of Order Date',
DATE_FORMAT(Users.Created_Date, '%Y%m') AS 'Month of Join Date',
Users.Campaign_ID AS 'Campaign ID'
FROM Orders AS 'Orders'
JOIN Users AS Users ON Orders.User_ID = Users.User_ID

Figure 7: Revised joined table

We are getting closer now. But what we really want to know is how long a user remains active after they register. This can be determined by looking at the difference between the order and join date. To find this value we use the PERIOD_DIFF function.

PERIOD_DIFF(DATE_FORMAT(Orders.Created_Date, '%Y%m'),
DATE_FORMAT(Users.Created_Date, '%Y%m')) AS 'Difference In Months'

Here’s our final query:

SELECT
Orders.Order_ID AS 'Order ID',
DATE_FORMAT(Orders.Created_Date, '%Y%m') AS 'Month of Order Date',
DATE_FORMAT(Users.Created_Date, '%Y%m') AS 'Month of Join Date',
PERIOD_DIFF(DATE_FORMAT(Orders.Created_Date, '%Y%m'),
DATE_FORMAT(Users.Created_Date, '%Y%m')) AS 'Difference In Months',
Users.Campaign_ID AS 'Campaign ID'
FROM Orders AS Orders
JOIN Users AS Users ON Orders.User_ID = Users.User_ID

Figure 8: Joined table with time between order and join date

That looks pretty good, but we don’t need all of the information in this table. We are going to drop the ‘Month of Order Date’ and ‘Month of Join Date’ columns.

We are also going to use a COUNT aggregation to count the number of orders instead of listing them.

COUNT(DISTINCT Orders.Order_ID) AS 'Count of Orders'

Now we have the following query:

SELECT
COUNT(DISTINCT Orders.Order_ID) AS 'Count of Orders',
PERIOD_DIFF(DATE_FORMAT(Orders.Created_Date, '%Y%m'),
DATE_FORMAT(Users.Created_Date, '%Y%m')) AS 'Difference In Months',
Users.Campaign_ID AS 'Campaign ID'
FROM Orders AS Orders
JOIN Users AS Users ON Orders.User_ID = Users.User_ID

Figure 9: Highly-summarized table - where is all the data?

Where did all of our data go? Well, we performed a count of Order_ID, but we didn’t specify how we wanted to group the data, so our query counted all of the orders in our database and returned it in a single row.

Let’s add GROUP BY and ORDER BY statements to group our data into meaningful buckets.

GROUP BY 'Difference In Months', 'Campaign ID'
ORDER BY 'Difference In Months' ASC

Our final query looks like this:

SELECT
PERIOD_DIFF(DATE_FORMAT(Orders.Created_Date, '%Y%m'), DATE_FORMAT(Users.Created_Date, '%Y%m')) AS 'Difference In Months',
Users.Campaign_ID AS 'Campaign ID',
COUNT(DISTINCT Orders.Order_ID) AS 'Count of Orders'
FROM Orders AS Orders
JOIN Users AS Users ON Orders.User_ID = Users.User_ID
GROUP BY 'Difference In Months', 'Campaign ID'
ORDER BY 'Difference In Months' ASC;

Figure 10: Correctly-summarized table of orders by campaign and time since signup

Our cohort analysis is successful and we can chart our data in a variety of ways to get meaningful visual data.

Figure 11: User retention by campaign type as a scatter plot

Figure 12L User retention by campaign type as a stacked bar chart