Implement relation filtering on get_catalog macro #964

mikealfare · 2023-12-16T00:13:33Z

resolves #900

Problem

When get_catalog runs, it has to return every relation in a schema, so the query cannot be parallelized and does not scale. It also returns all relations in that schema, not just those managed by dbt.

Solution

Allow for a set of relations to be passed in to limit the query against the database.

add _get_one_catalog_by_relations method
point _get_one_catalog to this method since the second half is the same
register that this capability is available
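
As a rough illustration of why registering the capability matters, the sketch below shows a caller that uses the relation-filtered path only when the adapter advertises it. The names `Capability`, `ToyAdapter`, `supports`, and `build_catalog` are hypothetical stand-ins, not dbt's actual capability API, and the warehouse is faked with an in-memory set.

```python
from enum import Enum
from typing import List, Set, Tuple

Row = Tuple[str, str]  # (schema, identifier)


class Capability(Enum):
    """Hypothetical stand-in; not dbt's actual capability enum."""

    SCHEMA_METADATA_BY_RELATIONS = "schema_metadata_by_relations"


class ToyAdapter:
    """Toy adapter whose 'warehouse' is an in-memory set of relations."""

    def __init__(self, warehouse: Set[Row], capabilities: Set[Capability]) -> None:
        self._warehouse = warehouse
        self._capabilities = capabilities

    def supports(self, capability: Capability) -> bool:
        return capability in self._capabilities

    def get_catalog(self, schema: str) -> List[Row]:
        # Schema-wide path: returns everything in the schema, managed or not.
        return sorted(r for r in self._warehouse if r[0] == schema)

    def get_catalog_by_relations(self, relations: List[Row]) -> List[Row]:
        # Filtered path: only the relations that were asked for.
        if not relations:
            raise ValueError("expected at least one relation")
        wanted = set(relations)
        return sorted(r for r in self._warehouse if r in wanted)


def build_catalog(adapter: ToyAdapter, schema: str, managed: List[Row]) -> List[Row]:
    """Prefer the filtered query when the adapter registers the capability."""
    if adapter.supports(Capability.SCHEMA_METADATA_BY_RELATIONS):
        return adapter.get_catalog_by_relations(managed)
    return adapter.get_catalog(schema)
```

With the capability registered, only the relations passed in come back; without it, the caller falls back to the schema-wide query and also receives relations that dbt does not manage.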

Checklist

I have read the contributing guide and understand what's expected of me
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

mikealfare · 2023-12-20T00:57:20Z

dbt/adapters/spark/impl.py


+    def _get_relation_metadata_at_column_level(self, relations: List[BaseRelation]) -> agate.Table:


This was the second half of _get_one_catalog above. It is broken out as its own method now so it can be reused by the new method _get_one_catalog_by_relations below.
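
The shape of that refactor might look roughly like the sketch below. It is not the adapter's code: `Relation` and `ToySparkAdapter` are hypothetical stand-ins, the column metadata comes from an in-memory dict, and rows are plain tuples where the real method returns an agate.Table.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Relation:
    """Hypothetical stand-in for BaseRelation."""

    schema: str
    identifier: str


ColumnRow = Tuple[str, str, str]  # (schema, identifier, column name)


class ToySparkAdapter:
    def __init__(self, columns: Dict[Relation, List[str]]) -> None:
        # Fake metadata store: relation -> column names.
        self._columns = columns

    def list_relations(self, schema: str) -> List[Relation]:
        found = [r for r in self._columns if r.schema == schema]
        return sorted(found, key=lambda r: r.identifier)

    def _get_one_catalog(self, schema: str) -> List[ColumnRow]:
        # First half: discover every relation in the schema.
        relations = self.list_relations(schema)
        # Second half: shared column-level lookup.
        return self._get_relation_metadata_at_column_level(relations)

    def _get_one_catalog_by_relations(self, relations: List[Relation]) -> List[ColumnRow]:
        # The caller supplies the relations, so no schema-wide discovery is needed.
        if not relations:
            raise ValueError("expected at least one relation")
        return self._get_relation_metadata_at_column_level(relations)

    def _get_relation_metadata_at_column_level(
        self, relations: List[Relation]
    ) -> List[ColumnRow]:
        # One row per column of each requested relation.
        return [
            (rel.schema, rel.identifier, col)
            for rel in relations
            for col in self._columns[rel]
        ]
```

Both entry points differ only in how they obtain the list of relations; everything after that is the shared helper.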

mikealfare · 2023-12-20T01:00:11Z

dbt/adapters/spark/cache.py

+
+
+class SparkRelationsCache(RelationsCache):
+    def get_relation_from_stub(self, relation_stub: BaseRelation) -> BaseRelation:


There is no method on RelationsCache that returns a specific BaseRelation. The next best option is to fetch the set of relations in a schema (RelationsCache.get_relations) and then look for the target relation in that set. This method duplicates the logic in get_relations and adds a third check on identifier.
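
A minimal sketch of that lookup follows, using hypothetical stand-ins (`Relation`, `ToyRelationsCache`, `ToySparkRelationsCache`) rather than dbt's classes. The case-insensitive comparison and the fallback to the stub when nothing matches are assumptions made for illustration; the PR does not state either.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Relation:
    """Hypothetical stand-in for BaseRelation; a stub has no columns."""

    database: Optional[str]
    schema: Optional[str]
    identifier: Optional[str]
    columns: Tuple[str, ...] = ()


def _lower(value: Optional[str]) -> Optional[str]:
    return value.lower() if value is not None else None


class ToyRelationsCache:
    def __init__(self, relations: List[Relation]) -> None:
        self._relations = list(relations)

    def get_relations(self, database: Optional[str], schema: Optional[str]) -> List[Relation]:
        # Two checks: database and schema.
        return [
            r
            for r in self._relations
            if _lower(r.database) == _lower(database)
            and _lower(r.schema) == _lower(schema)
        ]


class ToySparkRelationsCache(ToyRelationsCache):
    def get_relation_from_stub(self, relation_stub: Relation) -> Relation:
        # Same two checks as get_relations, plus a third check on identifier.
        for cached in self._relations:
            if (
                _lower(cached.database) == _lower(relation_stub.database)
                and _lower(cached.schema) == _lower(relation_stub.schema)
                and _lower(cached.identifier) == _lower(relation_stub.identifier)
            ):
                return cached
        # Assumption: fall back to the stub itself when the cache has no match.
        return relation_stub
```

Returning the cached relation rather than the stub is what gives the caller a relation that carries column metadata.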

changelog

bbac669

mikealfare self-assigned this Dec 16, 2023

cla-bot bot added the cla:yes label Dec 16, 2023

mikealfare added the backport 1.7.latest Tag for PR to be backported to the 1.7.latest branch label Dec 16, 2023

add _get_one_catalog_by_relations method, redirect _get_one_catalog to this method to reduce code duplication

28dd554

mikealfare marked this pull request as ready for review December 16, 2023 00:42

mikealfare requested a review from a team as a code owner December 16, 2023 00:42

mikealfare added 11 commits December 15, 2023 20:53

override get_catalog_by_relations to align with how get_catalog is overridden

b9cfd81

turn off get_catalog_by_relations to test

eab3067

call get_catalog by relation

73979ee

guard against multiple info schemas in get_catalog_by_relations

17c245f

reuse get_catalog logic, add ability to pass relations into new method

9b0bd5f

redirect get_catalog_relations to get_catalog to check plumbing

5236d40

manually create schema_map so that it's limited to the relations provided

bf6e910

add check to guarantee relations is populated

18d30f7

catch exception when info_schemas is empty

844cfbf

catch exception when info_schemas is empty

2d4d3c2

update the connection name

c3e608d

mikealfare marked this pull request as draft December 18, 2023 18:16

mikealfare added 10 commits December 19, 2023 16:23

mimic get_catalog behavior

5221389

mimic get_catalog behavior

ccfeebd

redirect to new method

d609da9

error on empty list of relations

3afd659

use the cached version of the relation to ensure we have column metadata

f28c181

move cache logic into the cache

52b9dc5

fix typo

642bd12

remove whitespace fixes to reduce PR confusion

534cd0f

remove whitespace fixes to reduce PR confusion

0efc5a4

remove whitespace fixes to reduce PR confusion

0e5c614

mikealfare commented Dec 20, 2023

mikealfare marked this pull request as ready for review December 20, 2023 01:01

mikealfare marked this pull request as draft February 13, 2024 23:53
