Add option to skip relation cache population #7307

Merged · 8 commits · Apr 11, 2023
Conversation

Contributor

@stu-k stu-k commented Apr 10, 2023

resolves #6526

Description

Add a --populate-cache flag to optionally skip relation cache population, defaults to True.
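Since the flag is a boolean that defaults to True, the natural CLI shape includes a negated form for skipping the cache. As a stdlib-only sketch of that flag shape (using argparse's `BooleanOptionalAction` purely for illustration; dbt's actual CLI is built on Click, and this wiring is hypothetical):

```python
import argparse

# Illustrative sketch: a boolean flag defaulting to True automatically
# gains a "--no-populate-cache" negation for opting out.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--populate-cache",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Populate the relation cache at the start of the run.",
)

print(parser.parse_args([]).populate_cache)                       # True (default)
print(parser.parse_args(["--no-populate-cache"]).populate_cache)  # False (skip)
```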

Checklist

@stu-k stu-k requested review from jtcohen6, a team and aranke April 10, 2023 16:57
@cla-bot cla-bot bot added the cla:yes label Apr 10, 2023
@stu-k stu-k requested a review from ChenyuLInx April 10, 2023 16:58
Contributor

@jtcohen6 jtcohen6 left a comment


Works well in local testing! This makes a difference of several seconds for interactive compile on non-local DWHs.

import time
from dbt.cli.main import dbtRunner

dbt = dbtRunner()
for populate_cache in [True, False]:
    start = time.perf_counter()
    results, success = dbt.invoke(['compile', '--select', 'my_model'], populate_cache=populate_cache)
    end = time.perf_counter() - start
    print(f"With populate_cache: {populate_cache}, elapsed: {end}")
    print(results[0].node.compiled_code)

e.g. on Snowflake:

With populate_cache: True, elapsed: 2.3116801250000094
select 1::text as id
With populate_cache: False, elapsed: 0.7475685420000104
select 1::text as id

core/dbt/cli/main.py (review thread resolved)
core/dbt/contracts/project.py (review thread resolved)
Comment on lines +374 to +375
if not self.args.populate_cache:
    return
Contributor

makes sense to me!

out of curiosity - is there no real difference between self.args.populate_cache and get_flags().POPULATE_CACHE?

Contributor Author

There is no real difference, no. self.args is Flags, which we should be using much more where we can. I think the places where we use get_flags instead of self.args were just to get the click feature branch over the line to be merged.

Comment on lines 732 to 737
# Jeremy: what was the intent behind this inner loop?
# cache_update: Set[Tuple[Optional[str], Optional[str]]] = set()
# for relation in cache_schemas:
#     cache_update.add((database, schema))

self.cache.update_schemas({(database, schema)})
Contributor Author

@jtcohen6 While pairing with @ChenyuLInx we weren't sure what this inner loop was doing, especially since cache_schemas isn't defined anywhere.

Contributor

@stu-k The cache is keyed on database + schema. When dbt does a cache lookup, it asks: Have I already cached this database + schema combo? If the combo isn't present as a key in the cache, it's a cache miss, and dbt needs to go run a query. If those keys are present, and no relations are found, then the assumption is that the schema is empty (= missing from the database), rather than just missing from the cache.

If there are no relations returned by the query, we still want to record that, by inserting an empty set for each database + schema pair. That way, the next time dbt wants to know if any relations are in that database.schema, it doesn't need to run the same query over again — the cache says, I already know there aren't any relations there.
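The miss-versus-empty distinction described above can be sketched with a toy cache (a minimal illustration only; the class and method names here are hypothetical, not dbt's actual RelationsCache):

```python
# Toy cache keyed on (database, schema), distinguishing "never cached"
# (miss: must query the warehouse) from "cached and empty" (hit: known absent).
class SchemaKeyedCache:
    def __init__(self):
        self._schemas = {}  # (database, schema) -> set of relation names

    def update_schemas(self, pairs):
        # Record each pair even when no relations were returned, so an
        # empty schema is remembered rather than re-queried next time.
        for pair in pairs:
            self._schemas.setdefault(pair, set())

    def add_relation(self, database, schema, name):
        self._schemas.setdefault((database, schema), set()).add(name)

    def lookup(self, database, schema, name):
        key = (database, schema)
        if key not in self._schemas:
            return "miss: run a metadata query"  # never cached this schema
        if name in self._schemas[key]:
            return "hit: relation exists"
        return "hit: relation absent"  # schema cached, relation known absent

cache = SchemaKeyedCache()
cache.update_schemas({("analytics", "staging")})  # query returned no relations
print(cache.lookup("analytics", "staging", "stg_orders"))  # hit: relation absent
print(cache.lookup("analytics", "marts", "dim_users"))     # miss: run a metadata query
```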

Contributor Author

I understand that, but I do not understand the intent of the inner loop, which I've commented out above.

cache_update: Set[Tuple[Optional[str], Optional[str]]] = set()
for relation in cache_schemas:
    cache_update.add((database, schema))
self.cache.update_schemas(cache_update)

What is this code trying to accomplish? It is difficult to determine, because cache_schemas isn't defined in the possible implementation you sketched on the original bug ticket.

Contributor

Ah, I'm not sure :) can't remember what I was thinking when I wrote that code several months ago.

Given that this list_relations method is only ever being called for one database + schema pair at a time, it makes sense to just add that [(database, schema)] pair once!

Contributor Author

Okay cool, I think I have it in a good state right now.

@stu-k stu-k force-pushed the CT-1751/skip-relation-cache branch from f17bbac to b621230 Compare April 11, 2023 14:57
@@ -718,6 +718,18 @@ def list_relations(self, database: Optional[str], schema: str) -> List[BaseRelat
# we can't build the relations cache because we don't have a
# manifest so we can't run any operations.
relations = self.list_relations_without_caching(schema_relation)
Contributor

@jtcohen6 this actually means we are still retrieving all relations under a schema even if we are only running one model, just at a later time.

It is better than before, since compile now doesn't cache all relations. I'm happy for this part not to change in this PR, but it is probably something we should revisit with a different approach at some point.

Contributor

@ChenyuLInx Good callout, and worth rethinking in the future. For now, this schema-level behavior is baked into how the cache works:

  • We run one caching query per database.schema; up to a certain size, the slow part is running that query in the DWH, rather than loading more information than strictly necessary into memory
  • We organize the cache on the basis of database.schema, and do all our lookups on that basis. If we trimmed down the relations we were looking for, we'd risk a false negative, where we think a relation isn't present in the schema but it actually is.
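The false-negative risk in the second point can be sketched as follows (hypothetical names and data; not dbt code):

```python
# If we cached only the relations we were looking for, the schema would be
# marked "cached" while other relations that really exist in the DWH are
# missing from the cache, so a lookup would wrongly report them as absent.
warehouse = {("db", "sch"): {"model_a", "model_b"}}  # what actually exists

cached = {("db", "sch"): {"model_a"}}  # trimmed cache: only model_a recorded

def relation_exists(cache, db, sch, name):
    # The schema is present in the cache, so we trust it and skip the query.
    return name in cache.get((db, sch), set())

print(relation_exists(cached, "db", "sch", "model_b"))  # False, yet it exists
```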

I think something like this is definitely worth doing as a shorter-term win for "catalog" queries run during docs generate.

@stu-k stu-k force-pushed the CT-1751/skip-relation-cache branch from cf15815 to eb53272 Compare April 11, 2023 16:51
@stu-k stu-k marked this pull request as ready for review April 11, 2023 16:51
@stu-k stu-k requested review from a team as code owners April 11, 2023 16:51
Contributor

@ChenyuLInx ChenyuLInx left a comment


LGTM, also tested with dbt-server compile endpoint


Successfully merging this pull request may close these issues.

[CT-1751] Config to optionally skip population of relation cache
3 participants