Amazon Redshift is an incredibly powerful data warehouse solution, but it requires thoughtful setup to get the best performance. This tutorial will explain some tuning techniques to help speed up your queries and reduce your storage costs.
We will discuss three techniques:
- Making your big data as small as possible
- Organizing your data
- Keeping your data neat
Make Your Data Smaller: Data Type and Compression
Data types define the kind and size of data that can be stored in a given column. You can read more about data types in the AWS documentation. In general, you want to select the smallest data type that fits your data to avoid wasting space.
Compressing your data (called encoding in Amazon Redshift) is critical. Depending on your data, compression can reduce your storage footprint by 50%-75%. You can read about selecting the best compression in the AWS documentation.
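Putting both ideas together, here is a minimal sketch (the table name, columns, and encoding choices are hypothetical, just to illustrate the pattern): pick the smallest type that fits each column, and declare an encoding per column.

```sql
-- Hypothetical page-view table: smallest types that fit, plus explicit encodings.
CREATE TABLE page_views (
    view_id    BIGINT       ENCODE az64,     -- AZ64 works well for numeric/date types
    user_id    INTEGER      ENCODE az64,     -- INTEGER, not BIGINT, if ids fit in 4 bytes
    country    CHAR(2)      ENCODE bytedict, -- byte-dictionary suits low-cardinality columns
    page_url   VARCHAR(300) ENCODE zstd,     -- size VARCHARs to real data, not VARCHAR(MAX)
    viewed_at  TIMESTAMP    ENCODE az64
);

-- Ask Redshift to recommend encodings based on a sample of existing rows:
ANALYZE COMPRESSION page_views;
```

If you omit ENCODE, Redshift assigns a default encoding for most column types, but running ANALYZE COMPRESSION on a loaded table lets you compare its recommendations against your choices.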
The size of your data doesn’t just impact storage size and costs, it also affects query performance. The smaller the data, the less that has to be processed during expensive disk I/O (input/output, or write/read) operations.
Organize Your Data: Distribution Style and Sort Key
One design feature that makes Amazon Redshift so powerful is that it distributes your data across nodes, which allows for parallel processing that can greatly speed up a query. But to get the most out of this feature, your data needs to be properly distributed. If your data is skewed, some nodes will have to work harder than others - and your query is only as fast as the slowest node. Amazon Redshift's documentation provides several resources to help you select the best distribution style.
While distribution style refers to how data is organized across nodes, sort keys define how the data is organized within each node. If your query only needs a subset of data defined by a column that is in sorted order, Amazon Redshift can home in on just the relevant blocks instead of scanning the entire table for the records it needs. The AWS documentation covers selecting sort keys in more detail.
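A minimal sketch of both choices, using a hypothetical fact/dimension pair: distribute the large fact table on its join key so matching rows land on the same node, and sort it on the column your queries filter by.

```sql
-- Hypothetical fact table: distribute on the join key, sort on the filter column.
CREATE TABLE orders (
    order_id     BIGINT,
    customer_id  INTEGER,
    order_date   DATE,
    total        DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (order_date);

-- Small dimension tables can instead be replicated in full to every node:
CREATE TABLE regions (
    region_id  INTEGER,
    name       VARCHAR(50)
)
DISTSTYLE ALL;
```

With this layout, a join on customer_id avoids redistributing rows between nodes, and a filter like WHERE order_date >= '2024-01-01' can skip blocks whose min/max dates fall outside the range.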
Keep Your Data Neat: Vacuuming and Analyzing
Once you have specified where your data should go with distribution styles and sort keys, you want to make sure it actually goes in its place. While some Amazon Redshift commands (like COPY) will automatically store data as defined by your distribution styles and sort keys, other commands simply append new rows to an unsorted region at the end of the table. Vacuuming puts your data back in sorted order and reclaims the space left by updated or deleted rows.
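As a quick sketch (the table name is hypothetical), the VACUUM command supports a full pass as well as cheaper variants when only one kind of cleanup is needed:

```sql
-- Re-sort rows and reclaim space from deleted/updated rows in one pass:
VACUUM FULL orders;

-- Cheaper variants:
VACUUM SORT ONLY orders;    -- only restore sort order
VACUUM DELETE ONLY orders;  -- only reclaim space from deleted rows
```

Vacuuming is I/O-intensive, so it is commonly scheduled during low-traffic windows rather than run after every load.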
And what good is all this housekeeping if Amazon Redshift doesn't know what data is where? Redshift stores data in 1 MB blocks and keeps statistics (metadata) about each block, including the minimum and maximum values stored within it. The query planner uses these statistics to choose an efficient execution plan. As data gets added, updated, or deleted, the statistics go stale and the optimizer's decisions degrade. Analyzing your table refreshes these statistics so the optimizer can make the best decisions to speed up your queries.
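Refreshing statistics is a single command (again, the table name is hypothetical):

```sql
-- Recompute statistics for every column in the table:
ANALYZE orders;

-- Or limit the work to columns actually used in filters, joins, and sorts:
ANALYZE orders PREDICATE COLUMNS;
```

Redshift also runs an automatic analyze in the background for many workloads, but an explicit ANALYZE after a large load ensures the planner sees current statistics right away.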
You can read more about vacuuming and analyzing in the AWS documentation.
While Amazon Redshift is a very powerful tool on its own, to get the most out of it you need to carefully consider and maintain your table architecture.