Add option to skip relation cache population #7307

Merged · 8 commits · Apr 11, 2023
Conversation

Contributor

@stu-k stu-k commented Apr 10, 2023

resolves #6526

Description

Add a --populate-cache flag to optionally skip relation cache population, defaults to True.
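Since the flag is a boolean that defaults to True, the natural CLI shape includes a negated form for skipping the cache. As a stdlib-only sketch of that flag shape (using argparse's `BooleanOptionalAction` purely for illustration; dbt's actual CLI is built on Click, and this wiring is hypothetical):

```python
import argparse

# Illustrative sketch: a boolean flag defaulting to True automatically
# gains a "--no-populate-cache" negation for opting out.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--populate-cache",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Populate the relation cache at the start of the run.",
)

print(parser.parse_args([]).populate_cache)                       # True (default)
print(parser.parse_args(["--no-populate-cache"]).populate_cache)  # False (skip)
```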

Checklist

@stu-k stu-k requested review from jtcohen6, a team and aranke April 10, 2023 16:57
@cla-bot cla-bot bot added the cla:yes label Apr 10, 2023
@stu-k stu-k requested a review from ChenyuLInx April 10, 2023 16:58
Contributor

@jtcohen6 jtcohen6 left a comment


Works well in local testing! This makes a difference of several seconds for interactive compile on non-local DWHs.

import time
from dbt.cli.main import dbtRunner

dbt = dbtRunner()
for populate_cache in [True, False]:
    start = time.perf_counter()
    results, success = dbt.invoke(['compile', '--select', 'my_model'], populate_cache=populate_cache)
    end = time.perf_counter() - start
    print(f"With populate_cache: {populate_cache}, elapsed: {end}")
    print(results[0].node.compiled_code)

e.g. on Snowflake:

With populate_cache: True, elapsed: 2.3116801250000094
select 1::text as id
With populate_cache: False, elapsed: 0.7475685420000104
select 1::text as id

core/dbt/cli/main.py (review thread resolved)
core/dbt/contracts/project.py (review thread resolved)
Comment on lines +374 to +375
if not self.args.populate_cache:
    return
Contributor

makes sense to me!

out of curiosity - is there no real difference between self.args.populate_cache and get_flags().POPULATE_CACHE?

Contributor Author

There is no real difference, no. self.args is Flags, which we should be using much more where we can. I think the places where we use get_flags instead of self.args were just to get the click feature branch over the line to be merged.

Comment on lines 732 to 737
# Jeremy: what was the intent behind this inner loop?
# cache_update: Set[Tuple[Optional[str], Optional[str]]] = set()
# for relation in cache_schemas:
#     cache_update.add((database, schema))

self.cache.update_schemas({(database, schema)})
Contributor Author

@jtcohen6 While pairing with @ChenyuLInx we weren't sure what this inner loop was doing, especially since cache_schemas isn't defined anywhere.

Contributor

@stu-k The cache is keyed on database + schema. When dbt does a cache lookup, it asks: Have I already cached this database + schema combo? If the combo isn't present as a key in the cache, it's a cache miss, and dbt needs to go run a query. If those keys are present, and no relations are found, then the assumption is that the schema is empty (= missing from the database), rather than just missing from the cache.

If there are no relations returned by the query, we still want to record that, by inserting an empty set for each database + schema pair. That way, the next time dbt wants to know if any relations are in that database.schema, it doesn't need to run the same query over again — the cache says, I already know there aren't any relations there.
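The miss-versus-empty distinction described above can be sketched with a toy cache (a minimal illustration only; the class and method names here are hypothetical, not dbt's actual RelationsCache):

```python
# Toy cache keyed on (database, schema), distinguishing "never cached"
# (miss: must query the warehouse) from "cached and empty" (hit: known absent).
class SchemaKeyedCache:
    def __init__(self):
        self._schemas = {}  # (database, schema) -> set of relation names

    def update_schemas(self, pairs):
        # Record each pair even when no relations were returned, so an
        # empty schema is remembered rather than re-queried next time.
        for pair in pairs:
            self._schemas.setdefault(pair, set())

    def add_relation(self, database, schema, name):
        self._schemas.setdefault((database, schema), set()).add(name)

    def lookup(self, database, schema, name):
        key = (database, schema)
        if key not in self._schemas:
            return "miss: run a metadata query"  # never cached this schema
        if name in self._schemas[key]:
            return "hit: relation exists"
        return "hit: relation absent"  # schema cached, relation known absent

cache = SchemaKeyedCache()
cache.update_schemas({("analytics", "staging")})  # query returned no relations
print(cache.lookup("analytics", "staging", "stg_orders"))  # hit: relation absent
print(cache.lookup("analytics", "marts", "dim_users"))     # miss: run a metadata query
```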

Contributor Author

I understand that, but I do not understand the intent of the inner loop, which I've commented out above.

cache_update: Set[Tuple[Optional[str], Optional[str]]] = set()
for relation in cache_schemas:
    cache_update.add((database, schema))
self.cache.update_schemas(cache_update)

What is this code trying to accomplish? It is difficult to determine, because cache_schemas isn't defined in the possible implementation you sketched on the original bug ticket.

Contributor

Ah, I'm not sure :) can't remember what I was thinking when I wrote that code several months ago.

Given that this list_relations method is only ever being called for one database + schema pair at a time, it makes sense to just add that [(database, schema)] pair once!

Contributor Author

Okay cool, I think I have it in a good state right now.

@stu-k stu-k force-pushed the CT-1751/skip-relation-cache branch from f17bbac to b621230 Compare April 11, 2023 14:57
@@ -718,6 +718,18 @@ def list_relations(self, database: Optional[str], schema: str) -> List[BaseRelat
# we can't build the relations cache because we don't have a
# manifest so we can't run any operations.
relations = self.list_relations_without_caching(schema_relation)
Contributor

@jtcohen6 this actually means we are still retrieving all relations under a schema even if we are only running one model, just at a later time.

It is better than before, since compile now doesn't cache all relations. I'm happy for this part not to change in this PR, but it is probably something we should revisit with a different approach at some point.

Contributor

@ChenyuLInx Good callout, and worth rethinking in the future. For now, this schema-level behavior is baked into how the cache works:

  • We run one caching query per database.schema; up to a certain size, the slow part is running that query in the DWH, rather than loading more information than strictly necessary into memory
  • We organize the cache on the basis of database.schema, and do all our lookups on that basis. If we trimmed down the relations we were looking for, we'd risk a false negative, where we think a relation isn't present in the schema but it actually is.
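The false-negative risk in the second point can be sketched as follows (hypothetical names and data; not dbt code):

```python
# If we cached only the relations we were looking for, the schema would be
# marked "cached" while other relations that really exist in the DWH are
# missing from the cache, so a lookup would wrongly report them as absent.
warehouse = {("db", "sch"): {"model_a", "model_b"}}  # what actually exists

cached = {("db", "sch"): {"model_a"}}  # trimmed cache: only model_a recorded

def relation_exists(cache, db, sch, name):
    # The schema is present in the cache, so we trust it and skip the query.
    return name in cache.get((db, sch), set())

print(relation_exists(cached, "db", "sch", "model_b"))  # False, yet it exists
```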

I think something like this is definitely worth doing as a shorter-term win for "catalog" queries run during docs generate.

@stu-k stu-k force-pushed the CT-1751/skip-relation-cache branch from cf15815 to eb53272 Compare April 11, 2023 16:51
@stu-k stu-k marked this pull request as ready for review April 11, 2023 16:51
@stu-k stu-k requested review from a team as code owners April 11, 2023 16:51
Contributor

@ChenyuLInx ChenyuLInx left a comment


LGTM, also tested with dbt-server compile endpoint


Successfully merging this pull request may close these issues.

[CT-1751] Config to optionally skip population of relation cache
3 participants