Amazon Redshift presents a very powerful storage solution for companies of all sizes to store data, perform analytic queries, and gain valuable insights. With Redshift, it is easy to expand the resources used to fit computational needs. Amazon manages all of the hardware necessary, and resources are reliably available. There are also many tools built to work with Redshift on both sides of Redshift’s place in the data pipeline. There are tools for extracting relevant data from other data sources and load them from Redshift, and tools to query the data stored in Redshift to create visualizations and dashboards.
While there are many good reasons to go with Redshift, it is not a magic bullet. Redshift cannot be used for all data purposes, and there are certain maintenance tasks expected on the customer’s side in order to bring out the most of Redshift’s potential. This article will outline some of the major points you should be aware of when you start to work with Redshift for your own purposes.
Despite the fact that a Redshift cluster can store and work with lots of data, it is important to keep in mind the expected use case for the data warehousing solution. Redshift is built to perform analytic queries, where summaries and aggregation are performed on previously collected data. In an analytic query, we will not be modifying our data, only combining them into a form that makes it easier to draw conclusions.
On the other hand, Redshift will struggle with transactional queries, where new data points are added or removed from the available data. Due to the way that Redshift stores data, it is much more suited to the reading of data than the writing or modification of data. In addition, transactional events will tend to be much more frequent than analytic queries. These two facts together mean that Redshift should not be relied upon as the first port of call for your data.
Instead, traditional data storage methods should continue to be used for these use cases where there are many small events involving the addition or removal of individual rows to data tables. If you have a lot of historic data to maintain, then keeping data in cheaper or distributed storage like Amazon S3 or Hadoop will be much more efficient than trying to keep it all in a massive Redshift cluster. That said, it is still worth recalling that Redshift can handle much more data than a conventional relational database. Data stored in the warehouse can be much more expansive and lake-like in nature, and so offer a greater flexibility in analyses that can be performed.
Another major consideration when setting up a Redshift cluster is how much computational resources you’ll need. A Redshift cluster consists of one or more compute nodes that handle the main query operations, and one leader node that mediates task assignment to the compute nodes. Amazon charges based on the number of compute nodes used and the amount of time that a cluster is running.
There are two major types of compute nodes that Redshift employs, called dense compute and dense storage. Dense compute nodes will process data faster, but have a lower limit on the amount of data that each one can handle. Dense storage nodes, on the other hand, have a higher data capacity, but will be slower on the same amount of data. Amazon themselves recommend starting with dense compute nodes up to 500 GB of data, but above that, the type of node you choose depends on the kinds of analyses that you need to perform. If data is to be queried frequently and fast performance is necessary, then expanding to more dense compute nodes makes sense. On the other hand, if performance isn’t quite a concern, and there will be fewer queries on large datasets, then moving to dense storage nodes may be a good choice. It is worth noting that different node types cannot be mixed in the same cluster. If you have different types of data with different use case profiles, then you may require multiple cluster instances in order to meet data needs.
Fortunately, you do not need to commit to a specific cluster configuration from the outset. As your company grows alongside your data needs, it is fairly easy to make changes to the resources you rent from Amazon. However, it can take some time to load data into a new configuration or to just start up a fresh instance. In general, it is standard to simply have a cluster running constantly to avoid delays when analyses need to be performed. Once you’ve got a configuration that you are happy with, you can also save money by prepaying for a reserved instance. A reserved instance is simply a commitment to utilize a certain amount of compute resources for a year or longer. You will pay for the reserved resources whether they are all active or not. Be sure to plan ahead for growth in your data, since any additional nodes will be priced at the standard, on-demand rate until they become part of a new reservation. Don’t just use the maximum storage capacity as a benchmark: past about 75% capacity, a compute node will start to struggle to have enough empty space to facilitate the moving data and performing queries.
Amazon will do a lot to maintain the health of your Redshift cluster, including the hosting of hardware, maintenance of backups, and some disk space cleanup after table updates or row deletions. However, this doesn’t absolve the need for a data engineer to keep an eye on the cluster to make sure that it runs smoothly and quickly. While one solution for improving performance might be to add more compute nodes to a cluster, this also introduces a higher cost to running the cluster. It would be better to make sure that resources are used smartly so that the cluster is actually sized properly to the company’s needs.
While we won’t go into technical details for maintaining data in a Redshift cluster, we’ll still highlight some major ideas here. First of all, data stored on Redshift still follows a tabular structure, and so works best on highly structured data. With structured data, each column has clear expectations about what kind of information should be observed, like a number, or state abbreviation. Unstructured data like general text or nested data types will not be easy for Redshift to work through. When working with this type of data, some re-encoding may be necessary, or simply some alternative solution necessary.
On the other hand, the column-based nature of Redshift’s storage allows for some clever optimizations not just in querying, but also in storage. With data stored by columns, compression is much more effective at reducing disk space necessary to store data. Since each column should have a predictable data type, patterns are much easier to perform compression on than if the types are mixed, as it is when reading the data by rows. Redshift also includes a couple of field specifications to improve the division of data to compute nodes, reducing the work necessary to get query results. Sort keys help reduce the amount of data that needs to be scanned when only a subset of the data is of interest, while distribution keys provide a plan for distributing data to the compute nodes.
As a final note, while Redshift will consider standard table properties like unique and foreign keys when creating query plans, it will not actually check those constraints against data that is inserted into tables. Thus, there is the risk of issues with data duplication when working with Redshift.
Although highly efficient use of Amazon Redshift will require some maintenance to be performed in both size and storage settings, this ability to dig into the details also allows for a higher level of tuning that you may not receive from other data storage solutions. If you have the right use cases for Redshift, it can be a versatile and powerful central data store from which insights can be drawn. This can range from fairly comprehensive high level data covering all of an organization’s operations, or tailored databases built for the interests of specific teams.
Business intelligence tools like Chartio can provide the ability to gain insights from Amazon Redshift and other data sources quickly and on demand. Click here to learn more about how you can start drawing insights from Amazon Redshift.