Impact of Throughput and Concurrency on Your Data Warehouse

Posted by tina on November 22, 2016 | Data Analytics, Data Governance

[Editor’s Note: This is the second installment in our “Data Warehouse Blog Series.” In our previous installment, we analyzed when it’s time to invest in a data warehouse and its high-level benefits. Click here to read our previous post, What to Consider When Choosing Between Amazon Redshift and Google BigQuery.]

Interactive reporting is crucial for businesses. By giving everyone access to data, it lets people engage with information in new ways and, ultimately, helps improve job performance.

For a data analyst, empowering business users with interactive reporting frees up time to perform more sophisticated analyses. However, every additional user querying data adds performance pressure on the data warehouse.

Throughput

Throughput is the speed at which a data warehouse can execute queries. For anyone evaluating a data warehouse, throughput is a key consideration: customers want to ensure that they’re investing time, resources and money in a system that will adapt and perform to their needs. Amazon Redshift and Google BigQuery handle throughput in quite different ways.

Amazon Redshift

Amazon Redshift uses a distributed columnar architecture to minimize and parallelize the I/O bottlenecks that many traditional data warehouses run into. Within the Amazon Redshift system, each column of a table is stored in data blocks, with the goal of reducing I/O so that only relevant data is retrieved from disk. Amazon Redshift is able to reduce I/O through:

  • Column storage: data blocks are fetched only from the specific columns a query requires

  • Compression: customers typically see 3 - 5X compression, which means less I/O and more effective storage

  • Zone maps: each column is divided into 1MB data blocks; Redshift stores the min/max values of each block in memory and can identify exactly which blocks a query requires (see the sketch after this list)

  • Direct-attached storage and large block sizes (1MB) that enable fast I/O
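To make the zone-map idea concrete, here is a toy Python sketch of min/max block pruning. This is not Redshift’s actual implementation, just the core idea: the scanner keeps each block’s min and max value in memory and skips any block whose range cannot contain a match.

    # Toy illustration of zone-map (min/max) block pruning -- not
    # Redshift's actual implementation, just the core idea.

    # Each dict stands in for a 1MB column block; Redshift keeps the
    # min/max values of every block in memory.
    blocks = [
        {"min": "2016-01-01", "max": "2016-03-31", "rows": [...]},  # Q1 data
        {"min": "2016-04-01", "max": "2016-06-30", "rows": [...]},  # Q2 data
        {"min": "2016-07-01", "max": "2016-09-30", "rows": [...]},  # Q3 data
    ]

    def blocks_to_scan(blocks, lo, hi):
        """Return only the blocks whose [min, max] range can overlap the
        predicate BETWEEN lo AND hi; every other block is skipped without
        ever being read from disk, which is where the I/O savings come from."""
        return [b for b in blocks if b["max"] >= lo and b["min"] <= hi]

    # WHERE order_date BETWEEN '2016-05-01' AND '2016-05-31' touches only
    # the Q2 block; the other two are pruned using the in-memory zone map.
    print(len(blocks_to_scan(blocks, "2016-05-01", "2016-05-31")))  # -> 1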

The release of Dense Storage (DS2) nodes in June 2015 brought twice the memory and compute power of the predecessor (DS1) at the same storage capacity and the same cost, paving the way for overall improvements to Amazon Redshift. Since this release, Amazon has worked to improve Amazon Redshift’s throughput by 2X every six months, and has improved vacuuming performance by 10X.
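Vacuuming itself is a command you run against the cluster. A minimal sketch using psycopg2 over Redshift’s standard PostgreSQL interface (the endpoint, credentials and table name below are placeholders); note that VACUUM cannot run inside a transaction block, so the connection must be in autocommit mode:

    import psycopg2

    # Placeholder connection details -- substitute your own cluster
    # endpoint, database and credentials.
    conn = psycopg2.connect(
        host="my-cluster.example.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="...",
    )
    conn.autocommit = True  # VACUUM cannot run inside a transaction block

    with conn.cursor() as cur:
        # Reclaim space and re-sort rows after heavy deletes and updates.
        cur.execute("VACUUM FULL events;")
        # Refresh planner statistics afterward so query plans stay good.
        cur.execute("ANALYZE events;")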

For those working with a wide variety of workloads and a spectrum of query complexities, these are welcome improvements to the Amazon Redshift system.

Google BigQuery

Google BigQuery likewise uses a distributed architecture, relying on Dremel for compute and Colossus (the successor to the Google File System) for storage. These systems are tied together by Google’s petabit-scale Jupiter network, which allows every node to communicate with any other node at 10 Gbps. One of the major benefits of this architecture is bottleneck-free scaling and concurrency.

In March 2016, Google released Capacitor, an updated, state-of-the-art storage format within the data warehouse. Capacitor takes an opinionated approach to storage: background processes constantly evaluate customers’ usage patterns and automatically optimize datasets to improve performance. The process is entirely automated and transparent to end users, and because storage adapts to each user’s Google BigQuery environment, queries run faster. For further information on Google BigQuery’s throughput, you can read Google’s Medium post.

Concurrency

Every data warehouse has concurrency limitations: a maximum number of queries you can run simultaneously before interactive reports start to slow down. You are most likely to hit these limits when democratizing data access, for example when business users explore data in pre-existing dashboards via a Business Intelligence tool. That, coupled with the warehouse simultaneously ingesting new data streams for reporting, can tax the overall system.

While each of these may seem like a simple maneuver, together they can easily generate a large number of queries with varied resource requirements, running up against the concurrency limits and causing a lag in generating reports.

Ideally, all queries would run without any contention for resources. Under the hood, however, every data warehouse has resource constraints, and thus practical limits on its concurrent workload capabilities.
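To get a feel for why this matters, here is a toy Python model (not how any warehouse actually schedules work internally) in which the concurrency limit is a fixed pool of query slots: once the slots fill up, every additional query waits, and that wait is what users experience as a sluggish dashboard.

    import threading
    import time

    QUERY_SLOTS = threading.Semaphore(5)  # pretend the warehouse runs 5 queries at once
    QUERY_RUNTIME = 1.0                   # pretend every query takes one second

    def run_query(query_id):
        submitted = time.monotonic()
        with QUERY_SLOTS:                 # blocks here if all slots are busy
            waited = time.monotonic() - submitted
            time.sleep(QUERY_RUNTIME)     # "execute" the query
        print(f"query {query_id}: waited {waited:.1f}s before running")

    # Twenty dashboard users hit refresh at the same moment: the first five
    # queries start immediately, the other fifteen queue up behind them.
    threads = [threading.Thread(target=run_query, args=(i,)) for i in range(20)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()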

Amazon Redshift

As an Amazon Redshift administrator, you can set the concurrency limit for your cluster in the Management Console. The limit is 50 parallel queries at any given time, but it applies per cluster, so you can launch as many clusters as fit your business. Each cluster also supports a maximum of 500 concurrent connections, meaning queries from up to 500 users can be accepted, with up to 50 executing at any given time.

Having said that, Amazon’s documentation recommends running only 15 - 25 queries at a time for optimal throughput. Additionally, Amazon Redshift lets users adjust Workload Management, or WLM (the number of slots/queries, queues, memory allocation and timeouts), to best address their needs.
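WLM is configured through the cluster’s parameter group rather than per query. A minimal sketch using boto3 (the parameter group name and queue settings below are illustrative assumptions, not recommendations):

    import json
    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Two hypothetical queues: 5 slots reserved for short dashboard
    # queries and 15 for everything else, within the 50-query cap.
    wlm_config = [
        {"query_group": ["dashboards"], "query_concurrency": 5,
         "memory_percent_to_use": 30, "max_execution_time": 60000},  # 60s timeout
        {"query_concurrency": 15, "memory_percent_to_use": 70},      # default queue
    ]

    redshift.modify_cluster_parameter_group(
        ParameterGroupName="my-wlm-parameter-group",  # placeholder name
        Parameters=[{
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }],
    )
    # Note: WLM changes generally require a cluster reboot to take effect.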

This is an important aspect of Amazon Redshift’s ability to provide interactive reports to its customers. Whether your company enables data exploration by letting business users run their own queries through an interactive BI tool, or you’re fielding ad hoc query requests, it’s crucial to understand Amazon Redshift’s concurrency limitations, as they directly impact how quickly you receive results.

Google BigQuery

Google BigQuery, on the other hand, isn’t immune to concurrency limits either: it enforces its own default limit of 50 concurrent queries. Although not a traditional MPP database, Google BigQuery still uses MPP concepts, only at a much larger scale, and it introduces a few innovations that work around typical MPP limits, in particular concurrency limits:

  • Separating compute and storage: Historically, MPP systems couple compute and storage in order to avoid the performance hit of fetching data across the network. However, this limits the system’s ability to scale compute up and down based on the number of queries being processed. Google BigQuery separates compute and storage so that the system as a whole can scale compute to match the current query load at the cost of potentially increased network traffic for any particular query.

  • Using an order of magnitude more hardware: Unlike a traditional MPP system, where administrators have to plan the number of compute nodes in a cluster ahead of time, Google dynamically allocates a varying number of servers. Where Amazon Redshift has a node limit of 128, Google is potentially using thousands of servers to process a query.

The downside of all that hardware is that Google provides it to all of its users at once: BigQuery is a shared, multi-tenant system, so you’re sharing capacity with other Google BigQuery customers.
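One practical lever here: queries submitted at batch priority queue up and run as shared capacity frees, rather than counting against the interactive concurrency limit. A minimal sketch with the google-cloud-bigquery client library (the query itself is just an example against a public dataset):

    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default GCP project and credentials

    # Batch-priority jobs wait for shared capacity instead of counting
    # against the interactive concurrent-query limit.
    job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)

    job = client.query(
        "SELECT COUNT(*) FROM `bigquery-public-data.samples.shakespeare`",
        job_config=job_config,
    )
    rows = job.result()  # blocks until the batch job is scheduled and finishes
    print(list(rows))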

Conclusion

At the end of the day, throughput and concurrency affect one another, and both impact the overall performance of a data warehouse. Users all want the same thing: a highly performant data warehouse that meets their needs and lets everyone access data.

Enabling everyone to access data generates questions, surfaces crucial insights and drives better decision-making that wouldn’t be possible without a data warehouse.

Amazon Redshift is currently used successfully by companies such as NASDAQ, SoundCloud, theSkimm and Clever to provide interactive analytics. It is an extremely agile, heavily documented and well-understood system with a large community base, and knowing its strengths and weaknesses allows architects and engineers to maintain it appropriately within their data infrastructure.

Google BigQuery, although used by enterprise-sized companies such as The New York Times, Spotify and Zulily to provide flexible analytics at scale, lacks the robust documentation and community that follow Amazon Redshift, which can make it difficult to resolve issues when they appear. In the past few months, though, Google has been working to improve Google BigQuery, particularly around cost transparency and the move toward Standard SQL.

Stay tuned as we continue to analyze the fundamental differences between Amazon Redshift and Google BigQuery on our blog. If you’re ready to choose a data warehouse now, download our white paper, What to Consider When Choosing Between Amazon Redshift and Google BigQuery.