One key step towards tuning your Amazon Redshift database is carefully selecting sort keys to optimize your queries. This tutorial will explain how to select appropriate sort keys.
You can read about more techniques for optimizing Amazon Redshift performance here.
What are Sort Keys?
A sort key is a field in your table that determines the order in which the data is physically stored in the database. If you have a table of sales and you select the purchase time as the sort key, the data will be ordered from oldest to newest purchase.
Amazon Redshift stores your data in 1MB blocks, and for each block it keeps metadata (known as a zone map) recording the minimum and maximum values in that block. So if your data is sorted by purchase time and your query asks for the revenue made in the last week, Amazon Redshift can quickly review the metadata of each block and see that all the records it needs to process are together in the most recent block (or blocks), saving it from having to process all the blocks to find the sales of interest. If you have five years of sales data and only need one week, that is roughly 1 week out of 260, so Amazon Redshift will only have to scan less than half a percent of your table - a huge savings!
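As a minimal sketch of the sales example, the table below is sorted on the purchase time and then queried for the last seven days. The column names (sale_id, region, revenue) and their types are assumptions made for illustration; only the sales table and its purchase time come from the text.

```sql
-- Hypothetical sales table, physically ordered by purchase_time.
CREATE TABLE sales (
    sale_id       BIGINT,
    purchase_time TIMESTAMP,
    region        VARCHAR(32),
    revenue       DECIMAL(12,2)
)
SORTKEY (purchase_time);

-- The filter compares the bare sort key column to a computed value,
-- so blocks whose min/max range falls outside the last week can be skipped.
SELECT SUM(revenue) AS revenue_last_week
FROM sales
WHERE purchase_time >= DATEADD(day, -7, GETDATE());
```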
A sort key is analogous to an index in a traditional database. While the mechanics are different, conceptually they serve the same purpose - to let the database know where certain data exists so it doesn’t have to scan the entire table for it. Read more about indexes here.
How to Select a Sort Key
A common case is a table that you typically query for specific date ranges. Say a large number of queries against it are interested in just this week’s (or this month’s / this year’s) revenue numbers. In those cases, the field with the date is a great candidate for a sort key. With this sort key in place, the database will know which blocks of data to process to get your result and not have to scan the entire table for the entries.
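Another good candidate for a sort key is a field that is frequently joined on. If you make this field both the sort key and the distribution key of the tables being joined, the matching rows are already sorted and stored on the same node, so Amazon Redshift can choose a merge join instead of a slower hash join for queries that have that join.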
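A sketch of the join case, assuming two hypothetical tables (customers and orders) that are usually joined on customer_id. Both declare that column as distribution key and sort key; the table and column names are illustrative only.

```sql
-- Both tables are distributed and sorted on the join column.
CREATE TABLE customers (
    customer_id BIGINT,
    name        VARCHAR(128)
)
DISTKEY (customer_id)
SORTKEY (customer_id);

CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    revenue     DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (customer_id);

-- A join on customer_id can now be served from co-located, pre-sorted rows.
SELECT c.name, SUM(o.revenue) AS total_revenue
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.name;
```

Running EXPLAIN on the query is one way to check which join strategy the planner actually picks for your data.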
If you have a table where there is a pattern of more than one field that is typically filtered on, you should select a compound sort key with those fields. Returning to our revenue table example - say your organization is divided into geographical regions and most of your queries filter on the date and the region. You would then select a compound sort key on date and region. Note that the order of the fields in the sort key matters. If you select date then region, a query that only filters on region will see little or no benefit from the compound sort key because region isn’t the leading field.
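The date-then-region case might be declared as follows. This is a sketch with assumed column names and types; the point is the column order inside COMPOUND SORTKEY.

```sql
-- Hypothetical sales table sorted by date first, then by region within each date.
CREATE TABLE sales (
    sale_id       BIGINT,
    purchase_time TIMESTAMP,
    region        VARCHAR(32),
    revenue       DECIMAL(12,2)
)
COMPOUND SORTKEY (purchase_time, region);

-- Benefits from the sort key: the leading column is filtered.
SELECT SUM(revenue)
FROM sales
WHERE purchase_time >= '2018-01-01'
  AND purchase_time <  '2018-02-01'
  AND region = 'EMEA';

-- Benefits little or not at all: only the second column is filtered.
SELECT SUM(revenue)
FROM sales
WHERE region = 'EMEA';
```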
If, however, there is no clear pattern for the fields that are filtered on, but there are several that are used in different cases, consider an interleaved sort key. These keys give an equal weight to all the fields in the sort key. So while a query against just the date won’t be as fast as the compound sort key with date and region, a query filtering only on the region will perform much better. One thing to note when using interleaved sort keys, however, is that they are costlier to maintain - loading and vacuuming these tables will be much slower (more on vacuuming in the next section). AWS also advises against interleaving on fields whose values only ever increase, such as dates and timestamps, so weigh that before including a date field in an interleaved key.
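A minimal sketch of an interleaved sort key, assuming a hypothetical table whose queries filter on region, product_id, or customer_id in no predictable combination. The table and its columns are invented for illustration.

```sql
-- Each listed column gets equal weight in the sort order.
CREATE TABLE sales_by_dimension (
    sale_id     BIGINT,
    region      VARCHAR(32),
    product_id  BIGINT,
    customer_id BIGINT,
    revenue     DECIMAL(12,2)
)
INTERLEAVED SORTKEY (region, product_id, customer_id);

-- Any single-column filter can use the sort order to some degree.
SELECT SUM(revenue)
FROM sales_by_dimension
WHERE product_id = 1234;

-- Interleaved tables are re-sorted with a dedicated, slower vacuum mode.
VACUUM REINDEX sales_by_dimension;
```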
You can read a more thorough comparison of sort styles in this AWS article.
As you can see, sort keys only help the queries that use those fields in filters or joins. There isn’t a way to optimize a table for all possible queries. So you want to be sure that when selecting the sort keys you are considering all the users of the table and which types of queries should take priority when planning your optimization. In some extreme cases, if the query use cases are very different and performance is critical, you may consider duplicating the table and having different sort keys for each application. But be very thoughtful before making this decision since you will be doubling the storage requirement and will now have to manage data across two tables.
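If you do decide to duplicate a table, one possible approach is a CREATE TABLE AS statement that copies the data under a different sort key. The sketch below assumes a sales table with purchase_time and region columns already exists; sales_by_region is a made-up name.

```sql
-- Copy of the sales data, sorted for region-first queries.
CREATE TABLE sales_by_region
COMPOUND SORTKEY (region, purchase_time)
AS
SELECT *
FROM sales;
```

The copy does not stay in sync on its own, so every load into sales would also need to be applied to sales_by_region.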
How to Get Value From Your Sort Keys
Once you select your sort keys, your job isn’t done. To reap the benefits of the sort keys, keep these three points in mind:
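Don’t encode (compress) sort keys, particularly the first field of the sort key, since compressing it can lead Amazon Redshift to read more rows than a range filter actually needs. More on data compression (encoding) here.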
In your queries, don’t apply a SQL function to the sort key. In table sales with a sort key on purchase_time, the following query won’t use the sort key because the field is wrapped in a function:
SELECT SUM(revenue) FROM sales WHERE DATE(purchase_time) BETWEEN '2018-01-01' AND '2018-01-31';
To take advantage of the sort key, leave the field untouched and express the range through the comparison values instead. A half-open range also picks up rows with fractional seconds after 23:59:59 on January 31:
SELECT SUM(revenue) FROM sales WHERE purchase_time >= '2018-01-01' AND purchase_time < '2018-02-01';
Vacuum and analyze your tables as needed - when data is added, updated or deleted from your table, the data won’t be automatically re-sorted and the statistics about the table’s contents will become stale. To re-sort your data and refresh its statistics, you need to vacuum and analyze your tables. Note that vacuuming and analyzing can be costly in terms of processing time, so only do so when there have been significant changes to your data. More on vacuuming and analyzing here.
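The first and third points might look like this in practice. It is a sketch under assumptions: the column names, types, and the encodings chosen for the non-sort-key columns are illustrative, and what counts as "too unsorted" or "too stale" is a judgment call for your workload.

```sql
-- Hypothetical sales table; the sort key column is left uncompressed (RAW).
CREATE TABLE sales (
    sale_id       BIGINT        ENCODE AZ64,
    purchase_time TIMESTAMP     ENCODE RAW,
    region        VARCHAR(32)   ENCODE ZSTD,
    revenue       DECIMAL(12,2) ENCODE AZ64
)
SORTKEY (purchase_time);

-- Check how much of the table is unsorted and how stale its statistics are
-- before paying for maintenance.
SELECT "table", unsorted, stats_off
FROM svv_table_info
WHERE "table" = 'sales';

-- Re-sort the rows and reclaim space, then refresh the planner statistics.
VACUUM sales;
ANALYZE sales;
```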
Selecting and maintaining sort keys can be time-consuming, but both are key to improving the performance of your Amazon Redshift database.