Avoid nested loop in query planner execution #612
- Move relation name not equals from outer SELECT `where` clause to dependency CTE
- Seems to more reliably push query planner to avoid a nested loop
- Should preserve original intent, i.e. to exclude self-references from final result
- Execution plan suggests this reduces the row estimate for the outer SELECT
- Real-world testing on internal company cluster under conditions where nested loop is used for original, shows that the new approach continues to perform and behave as expected (no nested loop)
@slin30 this is looking awesome! 🤩 Could you add a changelog entry via |
Thanks, @dbeatty10 -- let me go through the instructions in more detail, but at first glance it seems straightforward. Just need to find some focus time. Let me target before this coming Monday.
@dbeatty10 hopefully ab3b41f does the trick. First time using |
* Adjust not equals application order
  - Move relation name not equals from outer SELECT `where` clause to dependency CTE
  - Seems to more reliably push query planner to avoid a nested loop
  - Should preserve original intent, i.e. to exclude self-references from final result
  - Execution plan suggests this reduces the row estimate for the outer SELECT
  - Real-world testing on internal company cluster under conditions where nested loop is used for original, shows that the new approach continues to perform and behave as expected (no nested loop)
* Add changie entry
Move relation name `not equals` from outer SELECT `where` clause to dependency CTE
resolves #609
Problem
When running the statement in `relations.sql`, our Redshift cluster query planner will, seemingly at random, expect to use a nested loop join. This shows up as extended execution times, adding anywhere from 1 to 6 minutes to the start of a run/step.
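One way to spot the problem described above is to run `EXPLAIN` on the statement and scan the plan text for a nested loop operator. The helper below is only a sketch of that check: the function name `uses_nested_loop` is hypothetical, and the two plan fragments are made up for illustration, not taken from the cluster discussed in this PR.

```python
def uses_nested_loop(plan_text: str) -> bool:
    """Return True if any line of an EXPLAIN plan mentions a nested loop join."""
    return any("nested loop" in line.lower() for line in plan_text.splitlines())


# Made-up plan fragments for illustration only; NOT real output from this PR.
example_with_loop = "XN Nested Loop DS_BCAST_INNER\n  -> XN Seq Scan on some_table"
example_without_loop = "XN Hash Join DS_DIST_NONE\n  -> XN Seq Scan on some_table"

print(uses_nested_loop(example_with_loop))
print(uses_nested_loop(example_without_loop))
```

Since the planner's choice is intermittent, a check like this would need to be repeated across sessions to catch the anomaly.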
Solution
The original logic that threw the planner for a loop (no pun intended) was the `WHERE !=` in the outer statement. Removing this or setting it to `=`, or pushing it into the `select` as an `=` flag, works fine. For the latter, if I subsequently attempt to filter on the flag (with `NOT`), this predictably ends up with the same nested loop execution plan -- it's the attempt to filter on not equals in the outer `select` that triggers the nested loop (when the anomaly manifests).

My adjustment simply pushes the evaluation up to the `dependency` CTE. My thinking was that in doing so, the planner should choose a more efficient path under more conditions.

Checklist
`relations.sql` file

Supplemental
Please see local benchmark comparing the original versus new: `10` runs of each with `cache=off`, a new session for each run, and a one-second pause between each run. This comparison is under normal conditions, i.e. the query planner does not expect to use a nested loop for the original. It should be a fair comparison (for our environment) of the performance delta, if any, for the two versions (my conclusion is that there is no performance difference).

Also, a row count of the results for each run, across the two groups. This is a light QC check on result equality across runs and between groups. Note y-axes are zoomed in to highlight any diffs and are relative to each metric.
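The filter move described under Solution, together with the row-count QC check, can be sketched on a toy dataset. This is a minimal illustration under loud assumptions: it uses SQLite rather than Redshift, and the table `dep` and columns `ref_name` and `dep_name` are hypothetical stand-ins, not the actual contents of `relations.sql`. It shows only that the two query shapes return the same rows; it says nothing about planner behavior or timing.

```python
import sqlite3

# Hypothetical toy schema; the real relations.sql reads Redshift catalog tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dep (ref_name TEXT, dep_name TEXT);
    INSERT INTO dep VALUES
        ('view_a', 'table_x'),
        ('view_a', 'view_a'),
        ('view_b', 'view_a'),
        ('view_b', 'view_b');
""")

# Original shape: the not-equals filter sits in the outer SELECT.
OUTER_FILTER = """
    WITH dependency AS (
        SELECT ref_name, dep_name FROM dep
    )
    SELECT ref_name, dep_name
    FROM dependency
    WHERE ref_name != dep_name
    ORDER BY ref_name, dep_name
"""

# New shape: the same filter is evaluated inside the dependency CTE.
CTE_FILTER = """
    WITH dependency AS (
        SELECT ref_name, dep_name FROM dep
        WHERE ref_name != dep_name
    )
    SELECT ref_name, dep_name
    FROM dependency
    ORDER BY ref_name, dep_name
"""


def run_many(sql, runs=10):
    """Run the query `runs` times and return the row count of each run."""
    return [len(conn.execute(sql).fetchall()) for _ in range(runs)]


outer_rows = conn.execute(OUTER_FILTER).fetchall()
cte_rows = conn.execute(CTE_FILTER).fetchall()
outer_counts = run_many(OUTER_FILTER)
cte_counts = run_many(CTE_FILTER)

# Both shapes should exclude the self-references and agree on every run.
print(outer_rows == cte_rows, outer_counts == cte_counts)
```

In both shapes the self-referencing rows (`view_a` to `view_a`, `view_b` to `view_b`) are dropped, which mirrors the stated intent of excluding self-references from the final result.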