Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid nested loop in query planner execution #612

Merged
merged 2 commits into from
Sep 25, 2023
Merged

Avoid nested loop in query planner execution #612

merged 2 commits into from
Sep 25, 2023

Conversation

slin30
Copy link
Contributor

@slin30 slin30 commented Sep 18, 2023

  • Move relation name not equals from outer SELECT where clause to dependency CTE
    • Seems to more reliably push query planner to avoid a nested loop
    • Should preserve original intent, i.e. to exclude self-references from final result
    • Execution plan suggests this reduces the row estimate for the outer SELECT
    • Real-world testing on internal company cluster under conditions where nested loop is used for original, shows that the new approach continues to perform and behave as expected (no nested loop)

resolves #609

Problem

copied from issue Current Behavior

When running the statement in relations.sql, our Redshift cluster query planner will, seemingly at random, expect to use a nested loop join. This reflects in extended execution times, adding anywhere from 1-6 minutes to the start of a run/step.

Solution

copied from issue Additional Context

The original logic that threw the planner for a loop (no pun intended) was the WHERE != in the outer statement. Removing this or setting it to =, or pushing it into the select as an = flag, works fine. For the latter, if I subsequently attempt to filter on the flag (with NOT), this predictably ends up with the same nested loop execution plan -- it's the attempt to filter on not equals in the outer select that triggers the nested loop (when the anomaly manifests).

My adjustment simply pushes the evaluation up to the dependency CTE. My thinking was that in doing so, the planner should choose a more efficient path under more conditions.

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
    • NOT run on my end -- simply updated the sole relations.sql file
    • Update: while this is not a general test, the updated version has been running in our internal dbt implementation (via local macro override) as of Sept 22, 2023. This (a) resolved the recurrence of the nested loop plan and (b) has been running without issue since. Please see comment in issue.
  • This PR includes tests, or tests are not required/relevant for this PR
    • Not aware if unit tests are supplied for this specific macro, but if they are, I've not run them.
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX
    • This PR is for one specific macro, and it definitely needs to be reviewed and tested before merging.

Supplemental

Please see local benchmark comparing the original versus new, 10 runs of each with cache=off, a new session for each run, and a one-second pause between each run. This comparison is under normal conditions, i.e. the query planner does not expect to use a nested loop for the original. It should be a fair comparison (for our environment) of the performance delta, if any, for the two versions (my conclusion is that there is no performance difference).

Also, a row count of the results for each run, across the two groups. This is a light QC check on result equality across runs and between groups. Note y-axes are zoomed in to highlight any diffs and are relative to each metric.

image

  - Move relation name not equals from outer SELECT `where` clause
    to dependency CTE
      - Seems to more reliably push query planner to avoid a nested
        loop
      - Should preserve original intent, i.e. to exclude
        self-references from final result
      - Execution plan suggests this reduces the row estimate for
        the outer SELECT
      - Real-world testing on internal company cluster under
        conditions where nested loop is used for original, shows
	that the new approach continues to perform and behave as
	expected (no nested loop)
@slin30 slin30 requested a review from a team as a code owner September 18, 2023 22:16
@slin30 slin30 requested a review from Fleid September 18, 2023 22:16
@cla-bot cla-bot bot added the cla:yes label Sep 18, 2023
@dbeatty10 dbeatty10 added the ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering label Sep 22, 2023
@dbeatty10
Copy link
Contributor

@slin30 this is looking awesome! 🤩

Could you add a changelog entry via changie new?

@slin30
Copy link
Contributor Author

slin30 commented Sep 22, 2023

@slin30 this is looking awesome! 🤩

Could you add a changelog entry via changie new?

Thanks, @dbeatty10 -- let me go through the instructions in more detail, but at first glance it seems straightforward. Just need to find some focus time. Let me target before this coming Monday.

@dbeatty10 dbeatty10 changed the title Adjust not equals application order Avoid nested loop in query planner execution Sep 22, 2023
@slin30
Copy link
Contributor Author

slin30 commented Sep 23, 2023

@dbeatty10 hopefully ab3b41f does the trick. First time using changie and it's quite neat. Might incorporate into our internal workflow.

@colin-rogers-dbt colin-rogers-dbt merged commit 9385b49 into dbt-labs:main Sep 25, 2023
55 checks passed
abbywh pushed a commit to abbywh/dbt-redshift that referenced this pull request Oct 11, 2023
* Adjust not equals application order

  - Move relation name not equals from outer SELECT `where` clause
    to dependency CTE
      - Seems to more reliably push query planner to avoid a nested
        loop
      - Should preserve original intent, i.e. to exclude
        self-references from final result
      - Execution plan suggests this reduces the row estimate for
        the outer SELECT
      - Real-world testing on internal company cluster under
        conditions where nested loop is used for original, shows
	that the new approach continues to perform and behave as
	expected (no nested loop)

* Add changie entry
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes ok to test ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ADAP-900] [Bug] Nested loop in query planner execution of relations.sql
3 participants