Warehouse Maintenance: Amazon Redshift and Google BigQuery

Posted by tina on December 6, 2016 Data Governance

[Editor’s Note: This is our fifth installment in our “Data Warehouse Blog Series.” In our previous installment, we analyzed how Amazon Redshift and Google BigQuery handle data loading. Click here to read our previous post How to Load Data Into Amazon Redshift and Google BigQuery.]

With a cloud-based data warehouse, there’s no physical infrastructure to manage, allowing for a streamlined focus on analytics and insights, rather than hours of manual maintenance. But, like any system, every data warehouse needs to undergo maintenance for a tune up from time to time.

In this blog post, we’ll cover the crucial differences in how Amazon Redshift and Google BigQuery perform maintenance. For many companies, maintenance is a point of contention as it’s a leading indicator of overall data warehouse performance.

Amazon Redshift

As a data warehouse built with MPP concepts, Amazon Redshift requires periodic maintenance which makes the system run faster. However, this maintenance is fully taken on by Amazon Redshift and includes all facets of database management. From a performance perspective the ability to query, load, export, backup, restore and resize is parallelized for users.

Once in maintenance mode, Amazon Redshift monitors the health of a variety of components and failure conditions within an AZ and recovers from them automatically.

Another way Amazon Redshift performs maintenance is through the VACUUM feature, which is a command to remove rows that are no longer needed within the database and then sorts the data.

Running VACUUM is an optimal operation because it reclaims space and resort rows. Amazon Redshift allows its users to DELETE or UPDATE a table, this coupled with Amazon Redshift’s I/O minimization (only relevant data blocks are fetched)—this leads to optimal query performance. It’s important to note that running VACUUM is not required, particularly if Amazon Redshift is used in an append-only fashion.

The VACUUM command is a significant distinction between Amazon Redshift and Google BigQuery. As Amazon Redshift sorts data to fetch only relevant data blocks, it forgoes Google BigQuery’s method of reading an entire table, which could potentially lead to degraded query performance.

For more information on how Amazon Redshift utilizes the VACUUM command, reference Amazon Redshift’s documentation.

Google BigQuery

As aforementioned, Google has managed to solve a lot of common data warehouse concerns by throwing order of magnitude of hardware at the existing problems and thus eliminating them altogether. Unlike Amazon Redshift, running VACUUM in Google BigQuery is not an option. Google BigQuery is specifically architected without the need of the resource intensive VACUUM operation that is recommended for Amazon Redshift.

Since Google BigQuery does not require data provisioning, maintenance is much less of an issue because it’s not a requirement for the system to be performant. The advantages to not requiring maintenance is the flexibility of having your data available at all all times without periodic maintenance. However, a potential downside to not requiring maintenance is that users are unable to remove or resort irrelevant data, which can lead to a higher cost since Google BigQuery charges by data processed.

In a Google blog post titled BigQuery Under the Hood, it states, “The ultimate value of BigQuery is not in the fact that it gives you incredible computing scale, it’s that you are able to leverage this scale for your everyday SQL queries, without ever so much as thinking about software, virtual machines, networks or disks.”

Keep in mind that by design, Google BigQuery is append-only. Meaning, when planning to update or delete data, you’ll need to truncate the entire table and recreate the table with new data.

However, Google has implemented ways in which users can reduce the amount of data processed:

Partition their tables by specifying partition date in their queries
Use wildcard tables to shard their data by an attribute

Conclusion

While Google BigQuery may have instances where it is arguably faster than Amazon Redshift—it sacrifices organizational structure in favor of no maintenance. However, Amazon Redshift customers have appreciated its ability to optimize cost and performance for their use case.

In terms of maintenance, the key difference between Amazon Redshift and Google BigQuery is running the VACUUM command. For Amazon Redshift, this is possible and optimizes the ability to free up deleted space, sort data blocks and retrieve only relevant data.

In Google BigQuery, VACUUM is not required. It’s important to note that even though there is no VACUUM option, Google has architected ways in which users can reduce the amount of data processed without performing a full table read of the data. For many users, this is an advantage because they don’t have to proactively work maintenance into their operations, even in times of low activity.

In our next installment of our series, we’ll analyze how Amazon Redshift and Google BigQuery handle security. To learn how companies like Everlane and Reddit are leveraging data warehouses for hyper-growth, download our white paper What to Consider When Choosing Between Amazon Redshift and Google BigQuery now.

Amazon Redshift

Google BigQuery

Conclusion

More posts like this

Data Democratization – Chartio and Talend Podcast Chat

How to Use Data Warehouses in Business Intelligence

Best Practices for Developing an ELT Pipeline