Identifying Slow, Frequently Running Queries in Amazon Redshift

Data Tutorial Amazon Redshift

Detecting queries that are taking unusually long or are run on a higher frequency interval are good candidates for query tuning. In this tutorial we will look at a diagnostic query designed to help you do just that.

Why is the amount of time a query takes to run important?

During its entire time spent querying against the database that particular query is using up one of your cluster’s concurrent connections which are limited by Amazon Redshift.

If a query is sent to the Amazon Redshift instance while all concurrent connections are currently being used it will wait in the queue until there is an available connection. This sort of traffic jam will increase exponentially over time as more and more users are querying this connection. Long running queries are the rubberneckers of the database world. They use up resources that could be made available for follow other queries and can adversely impact your datasource performance.

Using the query below, you will be able to analyze your Amazon Redshift Instance’s STL tables to provide you with information regarding a specific table and expose the performance information:

  • Average Run Time
  • Quickest Run Time
  • Slowest Run Time
  • Last time it was run
  • Times aborted

Run times are important because, as we discussed earlier, queries with long run times are using up concurrent connections which is a resource drain. A large maximum run time isn’t all that telling but a large average run time is. An average and minimum run time at or near the max query duration is a big issue and those queries are leading candidates for query tuning. Queries with high average or minimum run times along with a number of aborted attempts should be the first queries you investigate to tune.

The Query

SELECT

	trim(database) as db, count(query) AS n_qry,

	max(substring (qrytext,1,80)) AS qrytext,

	min(run_minutes) AS "min",

	max(run_minutes) AS "max",

	avg(run_minutes) AS "avg",

	sum(run_minutes) AS total,  

	max(query) AS max_query_id,

	max(starttime)::DATE AS last_run,

	sum(alerts) AS alerts,

	aborted

FROM

	(SELECT

	 	userid,

	  	label,

	  	stl_query.query,

		trim(database) AS database,

		trim(querytxt) AS qrytext,

		md5(trim(querytxt)) AS qry_md5,

		starttime,

		endtime,

		(datediff(seconds, starttime,endtime)::numeric(12,2))/60 AS run_minutes,     

		alrt.num_events AS alerts,

		aborted

	FROM

	  	stl_query

		LEFT OUTER JOIN

			(SELECT

			 	query,

			 	1 as num_events

			FROM

			  	stl_alert_event_log

			GROUP BY

				query

			) AS alrt

		on alrt.query = stl_query.query

	WHERE userid <> 1 and starttime >=  dateadd(day, -7, current_date)

	)

GROUP BY

	database,

	label,

	qry_md5,

	aborted

ORDER BY total desc limit 50;

Some very useful columns that are going to come out of this query are:

  • avg(run_minutes) as “avg” - the average amount of time this query took to run in the last 7 days
  • aborted - The count of times this query was aborted in the last 7 days
  • max(substring (qrytext,1,80)) as qrytext - this column will give you the first 80 characters of the queries actual syntax to help identify it in your logs.

This query is sourced from the Amazon Docs website found here.