From 9006a3389bcc83d4807daf370119417d3d10758d Mon Sep 17 00:00:00 2001 From: Dillon Morrison Date: Tue, 23 Jan 2018 17:26:33 +0000 Subject: [PATCH] cleanup --- README.md => 1_README.md | 0 2_optimization_strategies.md | 150 +++++++++++++++++++++ redshift_query_inspection.dashboard.lookml | 14 +- 3 files changed, 157 insertions(+), 7 deletions(-) rename README.md => 1_README.md (100%) create mode 100644 2_optimization_strategies.md diff --git a/README.md b/1_README.md similarity index 100% rename from README.md rename to 1_README.md diff --git a/2_optimization_strategies.md b/2_optimization_strategies.md new file mode 100644 index 0000000..901a661 --- /dev/null +++ b/2_optimization_strategies.md @@ -0,0 +1,150 @@ +### Optimization Guide + +The first point of investigation or periodic review should typically be the performance dashboard /dashboards/redshift_model::redshift_performance + +![image](https://user-images.githubusercontent.com/9888083/35290094-05510524-001e-11e8-8fc2-e88d9f43cd06.png) + +By starting at the dashboard, you can focus your performance optimization efforts in areas where they will have the most impact: + + +Note: All data relating to query history is limited to the past 1 day directly in the model. If desired, this can be adjusted in the redshift_queries view definition. +
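+
+If you do widen that window, the change belongs in the derived table behind the `redshift_queries` view. Below is a minimal, hypothetical sketch assuming the view is built as a derived table over `stl_query`; the actual view definition in this block selects more columns and joins more system tables, so adapt it rather than copying:
+
+```lookml
+# Sketch only -- illustrative subset of the redshift_queries view file
+view: redshift_queries {
+  derived_table: {
+    sql:
+      SELECT query, starttime, endtime
+      FROM stl_query
+      -- widen the lookback here, e.g. from 1 day to 7 days
+      WHERE starttime >= DATEADD(day, -7, GETDATE()) ;;
+  }
+}
+```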
+ +

+#### Identifying opportunities from individual queries

+The top section of the dashboard gives an overview of all queries run yesterday, with a histogram by run time and a list of the 10 longest-running queries. You can drill into a list of queries by clicking on any bar in the histogram, and from either that list or the top-10 list you can inspect queries you think might be problematic.
+
+![image](https://user-images.githubusercontent.com/9888083/35290134-1f215bde-001e-11e8-9459-7b905280ba91.png)
+

+#### Identifying opportunities from network activity patterns

+The next section of the dashboard deals with the network activity caused by joins. Because network activity has a large impact on execution time, and because it often follows consistent patterns, it is ripe for optimization; yet most Redshift implementations neglect to analyze and optimize it properly.
+
+![image](https://user-images.githubusercontent.com/9888083/35290151-31938314-001e-11e8-971e-cb3283545e9b.png)
+
+The pie chart on the left gives the approximate share of each type of network activity. Although some suboptimal redistribution of data is inevitable for a share of queries in any system, when the red and yellow types account for more than 50% of the activity, that is a good indication that a different configuration would yield better system-wide average performance.
+
+The list on the right shows individual opportunities: all queries performing a particular join pattern are grouped into a row, and the rows are sorted by the aggregate running time of those queries, so that you can focus on adjusting the join patterns that will have the most impact on your end users.
+
+Once you have identified a candidate join pattern for optimization from this table, click the query count to drill into all the matching queries, then select ones that appear representative, or that are particularly slow, to investigate further.
+
+Note: Nested loops are another problem sometimes caused by joins. Not only do they always result in DS_BCAST_INNER, but they can also cause excessive CPU load and disk-based operations.
+
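+As a concrete, hypothetical example of acting on one of these rows: if a table is broadcast on most of its joins because its dist key does not match the join key, a PDT can redistribute it on that key so the join becomes local. The table and column names below are illustrative, and `distribution` / `sortkeys` apply only to persisted derived tables on Redshift:
+
+```lookml
+# Sketch: redistribute a frequently broadcast table on its most common join key
+view: users_by_user_id {
+  derived_table: {
+    sql: SELECT * FROM users ;;
+    sql_trigger_value: SELECT DATE(GETDATE()) ;;  # rebuild once per day
+    distribution: "user_id"                       # match the join key used by the fact table
+    sortkeys: ["created_at"]
+  }
+}
+```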

+#### Identifying capacity issues

+In addition to slow-running queries, you might be experiencing slow response times simply because Redshift is queueing queries as a result of excessive demand or insufficient capacity. The line graph at the bottom of the dashboard will quickly reveal whether, and during which times of day, queries were queued. The blue line represents all queries received each hour, and the red line represents queries queued each hour. You can also click on the “minutes queued” disabled series to get an estimate of the aggregate delay the queued queries incurred.
+
+![image](https://user-images.githubusercontent.com/9888083/35290181-42763a50-001e-11e8-8be8-e399182a81db.png)
+
+If you do find an issue here, you can of course increase capacity, or you can manage demand by adjusting or cleaning out your PDT build schedules and scheduled Looks/dashboards.
+
+PDTs: /admin/pdts
+
+Scheduled content by hour: /explore/i__looker/history?fields=history.created_hour,history.query_run_count&f[history.source]=Scheduled+Task&sorts=history.query_run_count+desc&limit=50&dynamic_fields=%5B%5D
+
+In addition to this capacity issue, which directly affects query response time, you can also run into disk capacity issues. If your Redshift connection is a superuser connection, you can use the admin elements of the block to check this.
+
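+One lightweight way to manage that demand is to consolidate PDT rebuilds onto a single off-peak trigger instead of letting each PDT fire independently. A sketch, with an assumed timezone and datagroup name (adjust both to your environment):
+
+```lookml
+# Sketch: one shared daily trigger for PDT rebuilds and cache expiry
+datagroup: nightly_rebuild {
+  # changes value once per day, so dependent PDTs rebuild shortly after midnight
+  sql_trigger: SELECT DATE(CONVERT_TIMEZONE('America/Los_Angeles', GETDATE())) ;;
+  max_cache_age: "24 hours"
+}
+
+# then, on PDTs that do not need fresher data:
+#   datagroup_trigger: nightly_rebuild
+```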

+#### How to interpret diagnostic query information

+When you click “Inspect” from any query ID, you’ll be taken to the Query Inspection Dashboard: + +![image](https://user-images.githubusercontent.com/9888083/35290206-565a700e-001e-11e8-89e1-66e33d318f9f.png) + +The dashboard components can be interpreted as follows: + + + +

+#### Common problems and corrective actions

+
+_Update: I presented on this at our JOIN 2017 conference, and you can find the presentation [here](https://discourse.looker.com/t/join-2017-deep-dive-redshift-optimization-with-lookers-redshift-block/5837)_
+
+| Situation | Possible Corrective Actions | Considerations |
+| --- | --- | --- |
+| A join pattern causes a nested loop that is unintentional or on large tables | Refactor the join into an equijoin (or an equijoin plus a small, fixed-size nested loop) | |
+| | Build a relationship table as a PDT so the nested loop only needs to be done once per ETL | |
+| Overall join patterns result in frequent broadcasts of inner tables, distribution of large outer tables, or distribution of both tables | Adjust the dist style and dist key of the broadcast table, or of the receiving table, based on the overall join patterns in your system | |
+| | Add denormalized column(s) in your ETL to enable better dist keys. E.g., in events -> users -> accounts, you could add account_id to the events table | Don’t forget to add account_id as an additional condition in the events -> users join |
+| | Build a PDT to pre-join or redistribute the table | Not usually needed, though it may be worth the higher disk usage, and it can be more efficient than distribution style "all" |
+| Queries result in large amounts of scanned data | Set your first sort key to the most frequently filtered-on or joined-on column | |
+| | Check whether any distribution style "all" tables should be distributed instead (and possibly duplicated and re-distributed) | With distribution style "all", each node must scan the entire table rather than just its own slice |
+| | Adjust table compression | |
+| | Check for unsorted data in tables, and schedule vacuums or leverage sorted loading for append-only datasets | |
+| | For large tables, set an always_filter declaration on your sort key to guide users | |
+| Queries have large steps with high skew and/or disk-based operations | Check table skew and the skew of scan operations, and adjust the relevant distribution keys to better distribute the query processing | For small queries, higher skew can be acceptable |
+| The query planner underestimates the rows resulting from a filter, leading to a broadcast of a large number of rows | Check how far off the statistics are for the relevant table, and schedule ANALYZE runs | |
+| | Adjust your filter condition | |
+| Users frequently run full historical queries when recent data would do just as well | Use always_filter so users are required to specify a filter value (see the sketch after this table) | The filtered field is ideally the sort key of a significant table, e.g., the created-date column in an event table |
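+
+For the always_filter suggestions above, a minimal sketch of what that looks like on an explore (the view and field names are hypothetical; users can change the default value, but they cannot remove the filter):
+
+```lookml
+# Sketch: require a filter on the event table's sort key column
+explore: events {
+  always_filter: {
+    filters: [events.created_date: "7 days"]
+  }
+}
+```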
+ +In case the above changes require changing content in your LookML model, you can use regex search within your project to find the relevant model code. For example `\${[a-zA-Z0-9_]+\.field_id}\s*=|=\s*${[a-zA-Z0-9_]+\.field_id}` would let you search for where a given field is involved in an equality/join, if you are using the same field name as the underlying database column name. diff --git a/redshift_query_inspection.dashboard.lookml b/redshift_query_inspection.dashboard.lookml index 0aea0a8..cd17487 100644 --- a/redshift_query_inspection.dashboard.lookml +++ b/redshift_query_inspection.dashboard.lookml @@ -9,7 +9,7 @@ type: field_filter explore: redshift_queries field: redshift_queries.query - default: 0 + # default: 0 elements: - name: time_executing @@ -157,7 +157,7 @@ type: single_value height: 3 width: 8 - title: + title: model: redshift_model explore: redshift_query_execution measures: [redshift_query_execution.total_bytes_broadcast, redshift_query_execution.total_bytes_distributed, @@ -205,7 +205,7 @@ type: single_value height: 3 width: 8 - title: + title: model: redshift_model explore: redshift_query_execution measures: [redshift_query_execution.total_bytes_broadcast, redshift_query_execution.total_bytes_distributed, @@ -253,7 +253,7 @@ type: single_value height: 3 width: 8 - title: + title: model: redshift_model explore: redshift_query_execution measures: [redshift_query_execution.total_bytes_broadcast, redshift_query_execution.total_bytes_distributed, @@ -298,12 +298,12 @@ single_value_title: Rows sorted hidden_fields: [redshift_query_execution.total_bytes_broadcast, redshift_query_execution.total_bytes_distributed, redshift_query_execution.total_bytes_scanned] - + - name: was_disk_based type: single_value height: 3 width: 8 - title: + title: model: redshift_model explore: redshift_query_execution measures: [redshift_query_execution.total_bytes_broadcast, redshift_query_execution.total_bytes_distributed, @@ -409,7 +409,7 @@ redshift_plan_steps.network_distribution_type, redshift_plan_steps.operation_argument, redshift_plan_steps.table, redshift_plan_steps.rows, redshift_plan_steps.bytes] listen: - query: redshift_plan_steps.query + query: redshift_plan_steps.query sorts: [redshift_plan_steps.step] limit: '2000' column_limit: '50'