How to Optimize Amazon Redshift Performance

Posted: September 15, 2017

#Data #Education #Technology

Amazon Redshift is a fully managed, petabyte-scale data warehouse that enables companies to quickly consolidate and analyze their data. Running complex queries against large amounts of data within your Amazon Redshift data warehouse, however, can tax the overall system.

It’s important to apply best practices and resolve optimization issues fast, because the longer you wait, the fewer real-time insights you’ll have access to and the more deep debugging awaits you in the future. This post will take you through the most common issues Amazon Redshift users come across, and will give you advice on how to address each of those issues.

Remember, even one inefficient query can cause performance issues, so let’s tackle them early and often. And once you’ve resolved your inefficient queries and reinstated optimal Amazon Redshift performance, you can continue real-time data analytics and drive your business forward.

  1. Time Consuming Queries

When monitoring the performance of the database, one of the most important things to track is basic statistics about execution time. With the following query, you can monitor the most time-consuming queries along with their average, minimum, and maximum execution times. Knowing which queries are most problematic is the first step in debugging; from there, you can dive deeper to determine why these queries are slow and how you can speed them up.

Let’s examine the most time-consuming queries:


Top 50 time-consuming queries:

SELECT max(query) AS max_query_id,
       min(run_minutes) AS "min",
       max(run_minutes) AS "max",
       avg(run_minutes) AS "avg",
       sum(run_minutes) AS total
FROM (SELECT userid, label, stl_query.query,
             trim(database) AS database,
             trim(querytxt) AS qrytext,
             md5(trim(querytxt)) AS qry_md5,
             starttime, endtime,
             (datediff(seconds, starttime, endtime)::numeric(12,2)) / 60 AS run_minutes,
             alrt.num_events AS alerts,
             aborted
      FROM stl_query
      LEFT OUTER JOIN (SELECT query, 1 AS num_events
                       FROM stl_alert_event_log
                       GROUP BY query) AS alrt ON alrt.query = stl_query.query
      WHERE userid <> 1
        AND starttime >= dateadd(day, -7, CURRENT_DATE)) q
GROUP BY database, label, qry_md5, aborted
ORDER BY total DESC
LIMIT 50;

  2. Column Encoding

As you know, Amazon Redshift is a column-oriented database. When creating a table in Amazon Redshift, you can choose the type of compression encoding you want from the available encodings, which you can find here.

The chosen compression encoding determines the amount of disk space used when storing the columnar values, and in general lower storage utilization leads to higher query performance. If no compression is selected, the data will be stored uncompressed (RAW), resulting in a significant penalty in query performance.
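To make this concrete, here is a minimal sketch of declaring encodings at table creation time. The table name and the encoding choices below are hypothetical; ANALYZE COMPRESSION can recommend encodings based on a sample of your actual data.

```sql
-- Hypothetical table: pick a compression encoding per column explicitly.
CREATE TABLE events (
    event_id   BIGINT       ENCODE delta,
    user_id    INTEGER      ENCODE lzo,
    event_type VARCHAR(64)  ENCODE bytedict,
    created_at TIMESTAMP    ENCODE lzo
);

-- Ask Redshift to recommend encodings from a sample of existing data.
ANALYZE COMPRESSION events;
```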

Using the following query you can check which tables have column encoding:


Column Encoding:

SELECT "table" AS tablename,
       CASE WHEN encoded = 'Y' THEN 1 ELSE 0 END AS has_col_encoding
FROM svv_table_info ti
JOIN (SELECT tbl,
             MIN(c) AS min_blocks_per_slice,
             MAX(c) AS max_blocks_per_slice,
             COUNT(DISTINCT slice) AS dist_slice
      FROM (SELECT b.tbl, b.slice, COUNT(*) AS c
            FROM stv_blocklist b
            GROUP BY b.tbl, b.slice) bl
      WHERE tbl IN (SELECT table_id FROM svv_table_info)
      GROUP BY tbl) iq ON iq.tbl = ti.table_id
ORDER BY "schema", "table";

  3. Skew Ratio

As a distributed database, Amazon Redshift divides its data across nodes and slices, with each one storing a subset of the data.

To ensure optimal performance, the key factor is uniform distribution of data across these nodes and slices. Otherwise, you will end up with skewed tables, resulting in uneven node utilization in terms of CPU load or memory and creating a bottleneck in database performance.

Therefore, it is important to keep the skew ratio of your tables as close to zero as possible, and the following query can help you monitor exactly that:


Skew Ratio:

SELECT "table" AS tablename,
       ROUND(100 * CAST(max_blocks_per_slice - min_blocks_per_slice AS FLOAT) /
             GREATEST(NVL(min_blocks_per_slice, 0)::int, 1), 2) AS ratio_skew_across_slices
FROM svv_table_info ti
JOIN (SELECT tbl,
             MIN(c) AS min_blocks_per_slice,
             MAX(c) AS max_blocks_per_slice,
             COUNT(DISTINCT slice) AS dist_slice
      FROM (SELECT b.tbl, b.slice, COUNT(*) AS c
            FROM stv_blocklist b
            GROUP BY b.tbl, b.slice) bl
      WHERE tbl IN (SELECT table_id FROM svv_table_info)
      GROUP BY tbl) iq ON iq.tbl = ti.table_id
ORDER BY "schema", "table";
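If a table turns out to be heavily skewed, revisiting its distribution style usually helps. A sketch of the idea, using hypothetical table and column names:

```sql
-- Distribute on a high-cardinality column that is also used in joins,
-- so rows spread evenly across slices.
CREATE TABLE page_views (
    view_id   BIGINT,
    user_id   INTEGER,
    viewed_at TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (user_id);

-- Small, frequently joined dimension tables can instead be replicated
-- to every node with DISTSTYLE ALL, which avoids skew entirely.
```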

You can also keep track of the CPU and memory utilization of each node with the following queries.

CPU Load:

SELECT slice, maxvalue FROM svv_diskusage WHERE name = 'real_time_data' AND col = 0 ORDER BY slice;

Available Space:

SELECT sum(used)::float / sum(capacity) AS pct_full FROM stv_partitions;

  4. Sort Keys

When using Amazon Redshift you can specify a column as the sort key, meaning that data will be stored on disk sorted by this key. The Amazon Redshift optimizer takes the sort key into consideration when evaluating different execution plans, ultimately determining the optimal one.

When it comes to deciding the best key for your table, you need to consider how the table data is being used. For example, if two tables are joined together very often, it makes sense to declare the join column as the sort key, while for tables with temporal locality the date column is the natural choice. More information about how to choose the best sort key can be found here.
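As a brief illustration (the table is hypothetical), a fact table that is usually filtered by date benefits from a timestamp sort key, since Redshift can then skip blocks that fall outside the queried range:

```sql
-- Queries typically filter on a date range, so sort by the timestamp.
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id INTEGER,
    ordered_at  TIMESTAMP
)
SORTKEY (ordered_at);
```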

The following query can help you determine which tables have a sort key declared.


Sort Keys:

SELECT "table" AS tablename,
       CASE WHEN sortkey1 IS NOT NULL THEN 1 ELSE 0 END AS has_sort_key
FROM svv_table_info ti
JOIN (SELECT tbl,
             MIN(c) AS min_blocks_per_slice,
             MAX(c) AS max_blocks_per_slice,
             COUNT(DISTINCT slice) AS dist_slice
      FROM (SELECT b.tbl, b.slice, COUNT(*) AS c
            FROM stv_blocklist b
            GROUP BY b.tbl, b.slice) bl
      WHERE tbl IN (SELECT table_id FROM svv_table_info)
      GROUP BY tbl) iq ON iq.tbl = ti.table_id
ORDER BY "schema", "table";

  5. Table Size

Monitoring your table sizes on a regular basis can save you a lot of pain. Knowing the rate at which your database is growing is important so that you don't run out of space out of the blue.

For this reason, the following query will help you monitor the top space-consuming tables in your Amazon Redshift cluster.


Table Size:

SELECT "table" AS tablename,
       size AS size_in_mb
FROM svv_table_info ti
JOIN (SELECT tbl,
             MIN(c) AS min_blocks_per_slice,
             MAX(c) AS max_blocks_per_slice,
             COUNT(DISTINCT slice) AS dist_slice
      FROM (SELECT b.tbl, b.slice, COUNT(*) AS c
            FROM stv_blocklist b
            GROUP BY b.tbl, b.slice) bl
      WHERE tbl IN (SELECT table_id FROM svv_table_info)
      GROUP BY tbl) iq ON iq.tbl = ti.table_id
ORDER BY "schema", "table";

  6. Queries Waiting on Queue Slots

In an Amazon Redshift cluster, each query is assigned to one of the queues defined via workload management (WLM). This means a query may wait some time before it is executed if the assigned queue is busy. To decrease the waiting time, you can increase concurrency by allowing more queries to execute in parallel. However, increased concurrency comes with a significant penalty in the memory share allocated to each query. When the memory available for a query's execution is not sufficient, disk storage will be used, leading to poor performance, as accessing the disk is much slower than accessing memory.
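Before tuning concurrency, it is worth checking the current WLM setup. One way to do that (a sketch; service classes above 4 correspond to the user-defined queues) is to query the service class configuration:

```sql
-- Current WLM queues: slot count (concurrency) and working memory per slot.
SELECT service_class,
       name,
       num_query_tasks   AS slots,
       query_working_mem AS working_mem_mb
FROM stv_wlm_service_class_config
WHERE service_class > 4
ORDER BY service_class;
```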

With the following queries, you can monitor the total execution time of your queries, how that time is divided between waiting and actual execution, and the total number of disk-based queries being executed:


Execution Time on WLM Slot:

SELECT w.query,
       w.total_queue_time / 1000000 AS queue_seconds,
       w.total_exec_time / 1000000 AS exec_seconds,
       (w.total_queue_time + w.total_exec_time) / 1000000 AS total_seconds
FROM stl_wlm_query w
LEFT JOIN stl_query q ON q.query = w.query AND q.userid = w.userid
WHERE w.queue_start_time >= dateadd(day, -7, CURRENT_DATE)
  AND w.total_queue_time > 0
ORDER BY w.total_queue_time DESC, w.queue_start_time DESC
LIMIT 35;


Number of disk based queries:

SELECT count(*)
FROM (SELECT q.query, trim(q.cat_text)
      FROM (SELECT query,
                   replace(listagg(text, ' ') WITHIN GROUP (ORDER BY sequence), '\\n', ' ') AS cat_text
            FROM stl_querytext
            WHERE userid > 1
            GROUP BY query) q
      JOIN (SELECT DISTINCT query
            FROM svl_query_summary
            WHERE is_diskbased = 't'
              AND (label LIKE 'hash%' OR label LIKE 'sort%' OR label LIKE 'aggr%')
              AND userid > 1) qs ON qs.query = q.query) tmp;

  7. Commit Queue Waits

With the commit_stats.sql script provided by the AWS team, you can monitor the wait time on your commit queue. As an Amazon Redshift cluster is primarily designed for the execution of analytical queries, the cost of frequent commits in terms of execution time is quite high.
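One practical mitigation is to batch related writes into a single transaction so they share one commit instead of queueing separately. A minimal sketch, using hypothetical tables:

```sql
-- One commit for the whole batch instead of one per statement.
BEGIN;
DELETE FROM daily_metrics WHERE metric_date = '2017-09-14';
INSERT INTO daily_metrics (metric_date, event_count)
SELECT trunc(created_at), count(*)
FROM events
WHERE trunc(created_at) = '2017-09-14'
GROUP BY trunc(created_at);
COMMIT;
```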


Commit queue waits:

SELECT startqueue,
       node,
       datediff(ms, startqueue, startwork) AS queue_time,
       datediff(ms, startwork, endtime) AS commit_time,
       queuelen
FROM stl_commit_stats
WHERE startqueue >= dateadd(day, -2, CURRENT_DATE)
ORDER BY queuelen DESC, queue_time DESC;

  8. Finding Common Alerts

Keeping an eye on the most common alerts your queries produce makes it easy to decide what to investigate further. The following query does the trick for you.


Common Alerts:

SELECT max(l.query) AS "query id",
       trim(split_part(l.event, ':', 1)) AS event,
       trim(l.solution) AS solution,
       count(*) AS "times occurred"
FROM stl_alert_event_log AS l
LEFT JOIN stl_scan AS s ON s.query = l.query
                       AND s.slice = l.slice
                       AND s.segment = l.segment
                       AND s.step = l.step
WHERE l.event_time >= dateadd(day, -7, CURRENT_DATE)
GROUP BY 2, 3;

  9. Nested Loop Join Queries

Investigating the most common alerts with the previously mentioned query, you may end up with a nested loop join warning.

In query execution, nested loop joins are typically the result of cross-joins: when two tables are joined without any join condition, the Cartesian product of the two tables is calculated. Although nested loop joins can be reasonably effective in cases where the outer input is small and the inner one is large and pre-indexed, in general they are suboptimal, as their execution is computationally demanding and the performance penalty significant.
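The fix is usually to add the missing join condition. A before-and-after sketch with hypothetical tables:

```sql
-- Cross-join (no join condition): computes the Cartesian product
-- and typically shows up as a nested loop in the plan.
SELECT u.name, o.order_id
FROM users u, orders o;

-- Equi-join with an explicit condition: lets the planner choose
-- a hash or merge join instead.
SELECT u.name, o.order_id
FROM users u
JOIN orders o ON o.user_id = u.user_id;
```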

With the following query you can monitor the number of nested loop join queries executed.


Number of nested loop join queries:

SELECT date_trunc('hour', starttime) AS start,
       count(query)
FROM stl_query
WHERE query IN (SELECT DISTINCT query
                FROM stl_alert_event_log
                WHERE event LIKE 'Nested Loop Join in the query plan%')
GROUP BY start
ORDER BY start ASC;

  10. Stale or Missing Statistics

Another common alert is raised when tables with missing plan statistics are detected. During query optimization and execution planning, the Amazon Redshift optimizer refers to the statistics of the involved tables in order to make the best possible decisions. Tables with stale or missing statistics may therefore lead the optimizer to choose a suboptimal plan. Identifying the problematic tables with the following queries will help you proceed with the necessary ANALYZE and VACUUM actions.

Stale Statistics:

SELECT "schema" || '.' || "table" AS "table", stats_off FROM svv_table_info WHERE stats_off > 5 ORDER BY 2;

Missing Statistics:

SELECT count(tmp.cnt) AS "Table Count",
       tmp.cnt AS "Missing Statistics"
FROM (SELECT substring(trim(plannode), 34, 110) AS plannode,
             COUNT(*) AS cnt
      FROM stl_explain
      WHERE plannode LIKE '%missing statistics%'
        AND plannode NOT LIKE '%redshift_auto_health_check_%'
      GROUP BY plannode
      ORDER BY 2 DESC) tmp
GROUP BY tmp.cnt;
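Once the problematic tables are identified, refreshing their statistics is straightforward (the table name below is hypothetical):

```sql
-- Recompute planner statistics for the table...
ANALYZE orders;
-- ...and, if many rows were updated or deleted, reclaim space and re-sort.
VACUUM FULL orders;
```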

  11. Data Load Statistics

Regarding data loading, there are best practices that the Amazon Redshift team advises users to implement. These include compressing files and loading many smaller files instead of a single huge one. Furthermore, ensuring that the number of files to load is a multiple of the number of slices results in even utilization of the cluster's nodes.
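A typical load following these practices might look like the sketch below; the bucket, prefix, and IAM role are hypothetical:

```sql
-- Load many gzip-compressed files under one prefix in parallel;
-- Redshift assigns files to slices, so a file count that is a
-- multiple of the slice count keeps all slices busy.
COPY events
FROM 's3://my-bucket/events/2017-09-15/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP
DELIMITER '|';
```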

Some queries that help you ensure all the above are shown below.

Rows insert rate:

SELECT trim(b.relname) AS "tablename",
       sum(a.rows_inserted) * 1000000 / sum(a.insert_micro) AS insert_rate_rows_ps
FROM (SELECT query, tbl, sum(rows) AS rows_inserted, max(endtime) AS endtime,
             datediff('microsecond', min(starttime), max(endtime)) AS insert_micro
      FROM stl_insert GROUP BY query, tbl) a,
     pg_class b, pg_namespace c,
     (SELECT b.query, count(DISTINCT b.bucket || b.key) AS distinct_files,
             sum(b.transfer_size) / 1024 / 1024 AS mb_scanned,
             sum(b.transfer_time) AS load_micro
      FROM stl_s3client b WHERE b.http_method = 'GET' GROUP BY b.query) d
WHERE a.tbl = b.oid AND b.relnamespace = c.oid AND d.query = a.query
GROUP BY 1;

Scanned data (MB):

SELECT trim(b.relname) AS "tablename",
       sum(d.mb_scanned) AS mb_scanned
FROM (SELECT query, tbl, sum(rows) AS rows_inserted, max(endtime) AS endtime,
             datediff('microsecond', min(starttime), max(endtime)) AS insert_micro
      FROM stl_insert GROUP BY query, tbl) a,
     pg_class b, pg_namespace c,
     (SELECT b.query, count(DISTINCT b.bucket || b.key) AS distinct_files,
             sum(b.transfer_size) / 1024 / 1024 AS mb_scanned,
             sum(b.transfer_time) AS load_micro
      FROM stl_s3client b WHERE b.http_method = 'GET' GROUP BY b.query) d
WHERE a.tbl = b.oid AND b.relnamespace = c.oid AND d.query = a.query
GROUP BY 1;

Scan rate (kbps):

SELECT trim(b.relname) AS "tablename",
       sum(d.mb_scanned) * 1024 * 1000000 / sum(d.load_micro) AS scan_rate_kbps
FROM (SELECT query, tbl, sum(rows) AS rows_inserted, max(endtime) AS endtime,
             datediff('microsecond', min(starttime), max(endtime)) AS insert_micro
      FROM stl_insert GROUP BY query, tbl) a,
     pg_class b, pg_namespace c,
     (SELECT b.query, count(DISTINCT b.bucket || b.key) AS distinct_files,
             sum(b.transfer_size) / 1024 / 1024 AS mb_scanned,
             sum(b.transfer_time) AS load_micro
      FROM stl_s3client b WHERE b.http_method = 'GET' GROUP BY b.query) d
WHERE a.tbl = b.oid AND b.relnamespace = c.oid AND d.query = a.query
GROUP BY 1;

Files Scanned:

SELECT trim(b.relname) AS "tablename",
       sum(d.distinct_files) AS files_scanned
FROM (SELECT query, tbl, sum(rows) AS rows_inserted, max(endtime) AS endtime,
             datediff('microsecond', min(starttime), max(endtime)) AS insert_micro
      FROM stl_insert GROUP BY query, tbl) a,
     pg_class b, pg_namespace c,
     (SELECT b.query, count(DISTINCT b.bucket || b.key) AS distinct_files,
             sum(b.transfer_size) / 1024 / 1024 AS mb_scanned,
             sum(b.transfer_time) AS load_micro
      FROM stl_s3client b WHERE b.http_method = 'GET' GROUP BY b.query) d
WHERE a.tbl = b.oid AND b.relnamespace = c.oid AND d.query = a.query
GROUP BY 1;

Average file size (mb):

SELECT trim(b.relname) AS "tablename",
       (sum(d.mb_scanned) / sum(d.distinct_files)::numeric(19,3))::numeric(19,3) AS avg_file_size_mb
FROM (SELECT query, tbl, sum(rows) AS rows_inserted, max(endtime) AS endtime,
             datediff('microsecond', min(starttime), max(endtime)) AS insert_micro
      FROM stl_insert GROUP BY query, tbl) a,
     pg_class b, pg_namespace c,
     (SELECT b.query, count(DISTINCT b.bucket || b.key) AS distinct_files,
             sum(b.transfer_size) / 1024 / 1024 AS mb_scanned,
             sum(b.transfer_time) AS load_micro
      FROM stl_s3client b WHERE b.http_method = 'GET' GROUP BY b.query) d
WHERE a.tbl = b.oid AND b.relnamespace = c.oid AND d.query = a.query
GROUP BY 1;

Rows inserted:

SELECT trim(b.relname) AS "tablename",
       sum(a.rows_inserted) AS rows_inserted
FROM (SELECT query, tbl, sum(rows) AS rows_inserted, max(endtime) AS endtime,
             datediff('microsecond', min(starttime), max(endtime)) AS insert_micro
      FROM stl_insert GROUP BY query, tbl) a,
     pg_class b, pg_namespace c,
     (SELECT b.query, count(DISTINCT b.bucket || b.key) AS distinct_files,
             sum(b.transfer_size) / 1024 / 1024 AS mb_scanned,
             sum(b.transfer_time) AS load_micro
      FROM stl_s3client b WHERE b.http_method = 'GET' GROUP BY b.query) d
WHERE a.tbl = b.oid AND b.relnamespace = c.oid AND d.query = a.query
GROUP BY 1;

For more expert tips on how to optimize your Amazon Redshift performance, download Blendo's white paper, Amazon Redshift Guide for Data Analysts, here.

— 

This guest blog post was written by Kostas Pardalis, co-founder of Blendo. Blendo is an integration-as-a-service platform that enables companies to extract data from their cloud-based data sources, integrate it, and load it into a data warehouse for analysis.