As a typical company’s amount of data has grown exponentially it’s become even more critical to optimize data storage. The size of your data doesn’t just impact storage size and costs, it also affects query performance. The smaller the data, the less that has to be processed during expensive disk I/O (input/output, or write/read) operations. This tutorial will explain how to select the best compression (or encoding) in Amazon Redshift.
You can read about more techniques for optimizing Amazon Redshift performance here.
What is Compression?
Compression, called encoding in Amazon Redshift, reduces the size of your data by converting it into different information that exactly describes your data using much less storage.
To visualize how compression works, consider this example of run-length encoding of an image described in Wikipedia’s Data Compression article. While the raw data of an image has an entry for each pixel, it’s likely that the image has several blocks of color that do not change over several pixels. For example, if 279 pixels have the same red color, the raw data stores 279 entries of that color. But instead, this could be stored as “279 red pixels”. This encoding just compressed 279 data entries into a much smaller statement that perfectly describes the same amount of pixels.
While your data may not be image-based, the same concepts apply to other forms of data. Data compression works because in general, data is highly redundant. For example, you may have millions of purchases in your database, but only sell thousands of different items - so each item is repeated thousands of times in a typical sales table.
Different encoding types apply different sophisticated statistical algorithms to take advantage of the redundancy.
How to Select the Best Compression Type in Amazon Redshift
Luckily, you don’t need to understand all the different algorithms to select the best one for your data in Amazon Redshift. Amazon Redshift provides a very useful tool to determine the best encoding for each column in your table. Simply load your data to a test table
test_table (or use the existing table) and execute the command:
ANALYZE COMPRESSION test_table;
The output will tell you the recommended compression for each column. Note that the recommendation is highly dependent on the data you’ve loaded. Use at least 100,000 rows of data and make sure the possible range of values for your columns are represented in the data. More on
ANALYZE COMPRESSION tool here.
In January 2017, Amazon Redshift introduced Zstandard (zstd) compression, developed and released in open source by compression experts at Facebook. You can read more about the algorithm here. This very powerful compression algorithm is the new standard and works across all Amazon Redshift data types. It will almost always be the type recommended by the
ANALYZE COMPRESSION command.
So if you want to skip the
ANALYZE COMPRESSION step, you can confidently choose zstd for all of your columns, except sort keys, which should not be encoded. More on sort keys here.
Making sure your data is properly encoded can have a huge impact on your database. Data compression reduces the size of your data, which directly reduces storage costs and improves query performance.