Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Performance: prune potentially redundant edges, or document why they exist #10842

Open
3 tasks done
ttusing opened this issue Oct 11, 2024 · 1 comment
Open
3 tasks done
Labels
enhancement New feature or request triage

Comments

@ttusing
Copy link
Contributor

ttusing commented Oct 11, 2024

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Following up on this issue: #10434 (comment)

def add_test_edges(self, manifest: Manifest) -> None:
"""This method adds additional edges to the DAG. For a given non-test
executable node, add an edge from an upstream test to the given node if
the set of nodes the test depends on is a subset of the upstream nodes
for the given node."""
# Given a graph:
# model1 --> model2 --> model3
# | |
# | \/
# \/ test 2
# test1
#
# Produce the following graph:
# model1 --> model2 --> model3
# | /\ | /\ /\
# | | \/ | |
# \/ | test2 ----| |
# test1 ----|---------------|

I am unsure why the graph needs to be built in this way, It seems like at most, a single edge going from a test to the direct 1-depth children should be sufficient if the goal is to maintain build order. The current implementation means that tests are the direct parents of ALL non-test downstream nodes, meaning that a project with 5,000 models and 15,000 tests might have (5k*15k/2) = 37.5 million edges, where limiting to a depth of 1 might keep that in the hundreds of thousands.

This has large implications for memory usage, build times, etc. for projects with lots of tests and/or lots of nodes generally.

If this construction is needed, I would like to understand why and add some comments or documentation for future readers of this code exploring performance issues. Otherwise, I would like to consider changing the construction to use a depth of 1.

Describe alternatives you've considered

No response

Who will this benefit?

All users of DBT, especially those with large projects.

Are you interested in contributing this feature?

No response

Anything else?

No response

@ttusing ttusing added enhancement New feature or request triage labels Oct 11, 2024
@ttusing
Copy link
Contributor Author

ttusing commented Oct 11, 2024

@dbeatty10 @gshank

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request triage
Projects
None yet
Development

No branches or pull requests

1 participant