For many companies that work with data, the standard storage component of the data stack is a relational database. Relational databases like Postgres and MySQL make it possible to identify patterns and explore trends that can help a company grow and thrive.
As a company grows, however, it might start to outgrow the capabilities of a traditional relational database. Amazon Web Services’ Redshift is a cloud-based data warehouse that allows companies to work with data at scale. In this article, you will see some of the ways in which Redshift can be a good next step up from a traditional relational database, as well as some considerations to keep in mind when managing Redshift as part of an analytics data stack.
The most immediate advantage of Redshift is that it is capable of scaling to a much larger amount of data than a traditional database system, while still being responsive for regular use. Performance of systems like Postgres and MySQL can start to degrade as the size of a database approaches a terabyte (TB) of data. Redshift is able to scale up to a thousand times that amount, into the petabyte (1 PB = 1,000 TB) range.
The way that Redshift handles larger amounts of data also helps it resolve queries faster. This is because Redshift employs distributed computing, where the load of each query is split among multiple compute nodes. Each node can work in parallel on a smaller amount of data, improving resolution speeds. In addition, Redshift stores data in a columnar, or C-store, fashion, in contrast to the row-based storage employed in traditional relational databases. This means that when a query is made, Redshift only needs to read the columns relevant to the query. A traditional row-based store needs to pull up every part of every row to get the required data, regardless of how many columns are relevant. And since a column tends to be consistent in the type of information it stores, the data in a columnar store can be smartly compressed to take up less space.
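As a rough illustration of why the columnar layout matters, here is a toy Python sketch (nothing like Redshift’s actual engine) that counts how many values each layout must touch in order to aggregate a single column of a four-column table:

```python
# Toy comparison of row-oriented vs. column-oriented scans.
# A hypothetical 1,000-row table with four columns.
rows = [{"user_id": i, "region": "us", "amount": i * 2, "ts": i}
        for i in range(1000)]

# Row store: to sum one column, every field of every row is read.
row_cells_read = sum(len(r) for r in rows)   # 1000 rows * 4 fields

# Column store: each column lives contiguously, so only the
# 'amount' column needs to be read for the same aggregate.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_cells_read = len(columns["amount"])      # 1000 values

total = sum(columns["amount"])
print(row_cells_read, col_cells_read)        # 4000 vs 1000
```

Four columns means the row store touches four times as much data for this query; real tables with dozens of columns widen the gap further.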
Amazon automatically takes care of a number of administration tasks around Redshift. There is no need to manually set up computational or storage resources, and Amazon manages the health of the data to make sure it is fault free and readily available. It is also straightforward to increase the amount of resources on call as a company’s needs grow. While the computational resources don’t change automatically – Redshift pricing is based on the number and type of nodes provisioned – it is easy to change the number of nodes in a company’s cluster.
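As a sketch of how a resize might be scripted with the AWS SDK for Python (boto3), the snippet below builds the arguments for a resize call; the cluster name, node type, and node count are placeholders, and the actual API call is left commented out so the snippet runs without AWS credentials:

```python
def resize_request(cluster_id: str, node_type: str, num_nodes: int) -> dict:
    """Build keyword arguments for a Redshift cluster resize."""
    return {
        "ClusterIdentifier": cluster_id,
        "NodeType": node_type,
        "NumberOfNodes": num_nodes,
    }

# Hypothetical cluster name and sizing.
params = resize_request("analytics-cluster", "ra3.xlplus", 4)
print(params)

# With credentials configured, the real call would look like:
# import boto3
# client = boto3.client("redshift")
# client.resize_cluster(**params)  # kicks off the resize operation
```

Note that kicking off a resize is only the start; as discussed later in this article, the cluster takes time to become fully available afterward.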
Redshift uses a SQL dialect based on Postgres 8.0, making it easy to get users who are already familiar with SQL up to speed with working with data through Redshift. While some aspects of the query language differ (see this part of the documentation for details), general users shouldn’t have much trouble staying effective when working in Redshift.
As a final point, Redshift also integrates with many other tools in the data pipeline, even outside of other products in the Amazon ecosystem. These include applications on both the data wrangling side of the stack and the data analysis side. If you already have solutions for processing raw data and preparing it into an analyzable form, or applications for analyzing the processed data, then it is very likely that those solutions can be adapted to store into or query from Redshift without too much fuss.
While using Redshift provides a number of benefits, it has certain costs and management points that should also be considered when choosing it as a data storage solution.
In addition to Redshift’s SQL dialect having some differences from Postgres, it’s also worth mentioning that Redshift’s tables support a more limited set of data types. It may not be safe to simply do a 1:1 transfer of data from Postgres to Redshift. It is a good idea to audit how data is currently cleaned and typed, to anticipate any changes to the data wrangling plan when moving to Redshift.
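One way to get ahead of those changes is a small audit script over the current schema. The sketch below is purely illustrative: the table definition and the set of types flagged for review are hypothetical, and the authoritative list of supported types is in the Redshift documentation:

```python
# Hypothetical set of Postgres column types that may need rework before
# loading into Redshift -- illustrative, not exhaustive.
NEEDS_REVIEW = {"json", "jsonb", "uuid", "text[]", "integer[]"}

# A made-up table definition as it might be pulled from Postgres.
schema = {
    "user_id": "bigint",
    "email": "varchar(256)",
    "tags": "text[]",
    "profile": "jsonb",
}

# Flag columns whose types should be revisited in the wrangling plan.
flagged = {col: typ for col, typ in schema.items() if typ in NEEDS_REVIEW}
print(flagged)
```

Running a check like this over every table before migration turns “it may not be safe” into a concrete to-do list.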
While it was mentioned above that Amazon takes a number of management steps to automatically maintain the health of a data cluster, this doesn’t mean that everything about the system is optimized. It might be fine at the start to just hook in some data sources and let things run, but at some point you will need to perform manual tuning of the data warehouse’s structures. For example, sort keys order the data blocks so that queries filtering on those keys scan less data. Vacuuming tables reclaims space left behind when data is deleted, and re-sorts data to improve query performance. These and other administrative tasks keep the cluster running quickly and efficiently, but they require some technical expertise to carry out.
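To see why sort keys help, here is a toy Python model of min/max block pruning, the general mechanism behind how sorted data lets a query skip blocks; the block size and data are made up:

```python
import random

random.seed(0)
values = list(range(10000))
random.shuffle(values)

BLOCK = 1000  # toy block size: 10 blocks of 1,000 values each

def blocks_scanned(data, lo, hi):
    """Count blocks whose min/max range overlaps the query predicate,
    i.e. blocks that cannot be skipped."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    return sum(1 for b in blocks if min(b) <= hi and max(b) >= lo)

# Query: WHERE key BETWEEN 2500 AND 3500
unsorted_cost = blocks_scanned(values, 2500, 3500)
sorted_cost = blocks_scanned(sorted(values), 2500, 3500)
print(unsorted_cost, sorted_cost)  # unsorted scans all 10 blocks; sorted scans 2
```

With unsorted data, every block’s min/max range spans nearly the whole key space, so nothing can be skipped; once the data is sorted on the key, only the two blocks covering the predicate’s range need to be read.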
To elaborate on an earlier point: while changing the configuration of a Redshift cluster is not difficult, there is a significant time cost – on the order of minutes to hours – to start up the newly-resized cluster and make all previous data available again. Columnar storage is well-suited to reading data and performing analyses, but less well-suited to data writes and revisions. When a Redshift cluster needs to be expanded, make sure the period of data unavailability is accounted for.
On a similar note, it’s valuable to keep the purpose of the data store in mind when deciding whether Redshift is a good choice. Redshift’s storage structure makes it ideal for analytic queries, where data is aggregated over selected columns. It performs less well when other transactions need to be performed, like row additions and deletions or individual row lookups. Scenarios involving those smaller transactions also tend to come in high volume, such as adding user purchase information to an activity log. Make sure the workloads you assign to Redshift suit it; otherwise, you’ll be better off with a different type of data storage solution.
It is also worth emphasizing the pricing model for Redshift. Redshift charges based on the number and type of computational nodes that comprise a cluster and the amount of time that the cluster is running. Being up and running isn’t the same as being in use: if no queries are being run, the cluster sits in an idle, but ready, state. However, as noted above, there is a significant time cost to bring a cluster online. In practice, it is easiest to leave a cluster running continuously, even if it won’t constantly be in use. As long as the cluster is utilized a moderate amount of the time, constant availability will still be worth it.
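A quick back-of-the-envelope comparison makes that trade-off concrete; the hourly rate below is a placeholder, not an actual Redshift price, so substitute current on-demand pricing for your node type and region:

```python
# Hypothetical pricing inputs -- check AWS for real per-node rates.
HOURLY_RATE = 1.086   # placeholder USD per node-hour
NUM_NODES = 4
HOURS_PER_MONTH = 24 * 30

# Cost of leaving the cluster up continuously...
always_on = NUM_NODES * HOURLY_RATE * HOURS_PER_MONTH

# ...versus only running it 8 hours a day, 22 workdays a month
# (ignoring the startup delay incurred on every cold start).
business_hours = NUM_NODES * HOURLY_RATE * (8 * 22)

print(f"always on: ${always_on:,.2f}/mo  "
      f"business hours only: ${business_hours:,.2f}/mo")
```

The gap between the two numbers is the price of constant availability; whether it’s worth paying depends on how often the cluster is actually queried and how disruptive the startup delay would be.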
Moving from a traditional relational database system to a larger, distributed, cloud-based data warehouse system is no small step to take. Redshift’s major advantages lie in its ability to scale to vast amounts of data and to perform analytic operations on that data quickly. It is easy to specify resources for a Redshift cluster, and Redshift integrates well with other applications in the data pipeline. However, in order to take advantage of Redshift’s speed potential, some manual maintenance work needs to be performed to keep tables compact and organized. Using Redshift is not a fully automated, fire-and-forget kind of operation, so if you aren’t approaching the limits of a traditional database, it may not make sense to make an immediate switch. But if you need a larger scale of data operations, or anticipate such needs in the future, Redshift can be a powerful way of keeping analytics data that you can tune carefully to high efficiency.
Chartio is a business intelligence solution that enables companies to easily analyze and visualize data from many data sources. Click here to learn more about how you can start drawing insights from Amazon Redshift.