
feat(ingestion/snowflake): adds streams as a new dataset with lineage and properties. #12318

Open · wants to merge 24 commits into base: master
Conversation


@brock-acryl brock-acryl commented Jan 10, 2025

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Properly picks up:
  • Streams and columns in a stream (including metadata columns)
  • Stream upstream lineage from show streams statement
  • Stream downstream lineage by parsing queries (tested inserts, CTAS, inserts with unions)
  • Stream properties
Screenshot 2025-01-10 at 11 11 23 AM
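The "stream upstream lineage from show streams" step above can be sketched as follows. This is an illustrative sketch, not the connector's actual code; the row keys mirror columns Snowflake documents for SHOW STREAMS, and the values are hypothetical:

```python
# Hypothetical SHOW STREAMS result row (illustrative values).
row = {
    "name": "ORDERS_STREAM",
    "database_name": "ANALYTICS",
    "schema_name": "PUBLIC",
    "table_name": "ANALYTICS.PUBLIC.ORDERS",  # the stream's source object
}

# The stream becomes a dataset; its source table is its upstream lineage.
stream_id = f"{row['database_name']}.{row['schema_name']}.{row['name']}".lower()
upstream_id = row["table_name"].lower()
print(f"{stream_id} <- {upstream_id}")
```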

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Jan 10, 2025
@datahub-cyborg datahub-cyborg bot added the pending-submitter-response Issue/request has been reviewed but requires a response from the submitter label Jan 10, 2025
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Jan 11, 2025
@brock-acryl brock-acryl reopened this Jan 13, 2025

codecov bot commented Jan 13, 2025

Codecov Report

Attention: Patch coverage is 89.01734% with 19 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ingestion/source/snowflake/snowflake_schema_gen.py 90.00% 10 Missing ⚠️
...ub/ingestion/source/snowflake/snowflake_queries.py 0.00% 6 Missing ⚠️
...hub/ingestion/source/snowflake/snowflake_schema.py 95.45% 2 Missing ⚠️
...ahub/ingestion/source/snowflake/snowflake_utils.py 80.00% 1 Missing ⚠️
Files with missing lines Coverage Δ
...on/src/datahub/ingestion/source/common/subtypes.py 100.00% <100.00%> (ø)
...rc/datahub/ingestion/source/snowflake/constants.py 100.00% <100.00%> (ø)
...hub/ingestion/source/snowflake/snowflake_config.py 98.01% <100.00%> (+0.69%) ⬆️
...ahub/ingestion/source/snowflake/snowflake_query.py 93.95% <100.00%> (+0.20%) ⬆️
...hub/ingestion/source/snowflake/snowflake_report.py 99.09% <100.00%> (+0.04%) ⬆️
...datahub/ingestion/source/snowflake/snowflake_v2.py 89.12% <100.00%> (+0.39%) ⬆️
...ahub/ingestion/source/snowflake/snowflake_utils.py 89.20% <80.00%> (+0.15%) ⬆️
...hub/ingestion/source/snowflake/snowflake_schema.py 88.85% <95.45%> (+1.07%) ⬆️
...ub/ingestion/source/snowflake/snowflake_queries.py 43.23% <0.00%> (-0.83%) ⬇️
...ingestion/source/snowflake/snowflake_schema_gen.py 84.61% <90.00%> (+2.52%) ⬆️

... and 44 files with indirect coverage changes



Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aff3fae...ade6503. Read the comment docs.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Jan 14, 2025
- moved stream lineage from snowflake_v2.py & snowflake_lineage_v2.py to snowflake_schema_gen.py
- updated snowflake_schema_gen.py to use snowflake_utils.py
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Jan 14, 2025
- changed from clone table logic to manually mapping columns since metadata columns were attempting to be mapped from the stream source
Collaborator @hsheth2 left a comment:
Left a couple quick comments from my skim over this

@mayurinehate should be back Monday and can do the final reviews + merge this

stream_pattern: AllowDenyPattern = Field(
default=AllowDenyPattern.allow_all(),
description="Regex patterns for streams to filter in ingestion.",
)
Collaborator:
I believe this is redundant, since it inherits from SnowflakeFilterConfig

Author:
removed the code

custom_properties["BASE_TABLES"] = table.base_tables

if table.stale_after:
custom_properties["STALE_AFTER"] = table.stale_after.isoformat()
Collaborator:
might be easier to do something like this - and avoid all the if statements

custom_properties = {
	k: v
	for k, v in {
		"TABLE_NAME": table.table_name,
		...
	}.items()
	if v
}
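A runnable version of the suggested pattern, with hypothetical stand-in values (note the `.items()` call the comprehension needs):

```python
# Hypothetical stream metadata standing in for the real `table` object.
table_name = "PUBLIC.ORDERS_STREAM"
stale_after = None  # not set for this stream

# Keep only the properties that actually have a value.
custom_properties = {
    k: v
    for k, v in {
        "TABLE_NAME": table_name,
        "STALE_AFTER": stale_after.isoformat() if stale_after else None,
    }.items()
    if v
}
print(custom_properties)  # {'TABLE_NAME': 'PUBLIC.ORDERS_STREAM'}
```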

Author:
cleaned up code

"""
Populate Streams upstream tables excluding the metadata columns
"""
if self.aggregator:
Collaborator:
when would the aggregator be null?

Author @brock-acryl (Jan 17, 2025):
Without this, lint throws the error:
src/datahub/ingestion/source/snowflake/snowflake_schema_gen.py:1475: error: Item "None" of "SqlParsingAggregator | None" has no attribute "add_known_query_lineage"
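The error goes away once mypy can narrow the Optional attribute. A minimal sketch of that pattern, using a stand-in class rather than the real SqlParsingAggregator:

```python
from typing import List, Optional


class FakeAggregator:
    """Stand-in for SqlParsingAggregator, recording calls for illustration."""

    def __init__(self) -> None:
        self.lineage: List[str] = []

    def add_known_query_lineage(self, known_query_lineage: str) -> None:
        self.lineage.append(known_query_lineage)


class SchemaGen:
    def __init__(self, aggregator: Optional[FakeAggregator]) -> None:
        self.aggregator = aggregator

    def populate_stream_upstreams(self, lineage: str) -> None:
        # Without this guard, mypy reports: Item "None" of
        # "FakeAggregator | None" has no attribute "add_known_query_lineage".
        if self.aggregator:
            self.aggregator.add_known_query_lineage(lineage)


gen = SchemaGen(FakeAggregator())
gen.populate_stream_upstreams("stream_lineage_my_stream")
SchemaGen(None).populate_stream_upstreams("ignored")  # safely a no-op
```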

@@ -58,6 +59,7 @@
SnowflakeIdentifierBuilder,
SnowflakeStructuredReportMixin,
SnowsightUrlBuilder,
_split_qualified_name,
Collaborator:
should we just make this method "public" by removing the _ prefix?

Author:
Made this public

obj.get("objectDomain") == "Stream" for obj in direct_objects_accessed
)

# If a stream is used, default to query parsing.
Collaborator:
Can you add a comment explaining why this was required: Snowflake's objects_modified does not include correct stream references, while direct_objects_accessed does.

Author:
Added comment in code

# If a stream is used, default to query parsing.
if has_stream_objects:
logger.debug("Found matching stream object")
self.aggregator.add_observed_query(
Collaborator:
Could you modify this to yield ObservedQuery and update the signature of _parse_audit_log_row so that it can return Optional[Union[TableRename, TableSwap, PreparsedQuery, ObservedQuery]], plus any other typing changes required for this to work.

It would mean that we would add to aggregator only at one place and it would be easier to debug audit log.
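The suggested refactor might look roughly like this. All types here are simplified stand-ins for the real audit-log classes named in the comment, not DataHub's actual definitions:

```python
from dataclasses import dataclass
from typing import Optional, Union


# Simplified stand-ins for the real audit-log result types.
@dataclass
class TableRename:
    old_name: str
    new_name: str


@dataclass
class TableSwap:
    name_a: str
    name_b: str


@dataclass
class PreparsedQuery:
    query_text: str


@dataclass
class ObservedQuery:
    query_text: str


AuditLogEntry = Union[TableRename, TableSwap, PreparsedQuery, ObservedQuery]


def parse_audit_log_row(row: dict, has_stream_objects: bool) -> Optional[AuditLogEntry]:
    # Streams: objects_modified lacks correct stream references, so return an
    # ObservedQuery and let the single call site hand it to the aggregator,
    # keeping "add to aggregator" in one place for easier debugging.
    if has_stream_objects:
        return ObservedQuery(query_text=row["query_text"])
    if "query_text" in row:
        return PreparsedQuery(query_text=row["query_text"])
    return None
```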

@@ -64,6 +64,7 @@ class SnowflakeReport(SQLSourceReport, BaseTimeWindowReport):
num_table_to_view_edges_scanned: int = 0
num_view_to_table_edges_scanned: int = 0
num_external_table_edges_scanned: int = 0
num_stream_edges_scanned: int = 0
Collaborator:
Is this used anywhere?

Author:
No, removed

@@ -112,6 +114,7 @@ class SnowflakeV2Report(
table_lineage_query_secs: float = -1
external_lineage_queries_secs: float = -1
num_tables_with_known_upstreams: int = 0
num_streams_with_known_upstreams: int = 0
Collaborator:
Is this used anywhere?

Author:
No, removed

name: str
created: datetime
owner: str
comment: str
Collaborator:
Is comment always present? Can this be None? If so, better to mark it with type Optional[str].

Author:
made Optional[str]

source_db, source_schema, source_name = source_parts

# Get columns from source object
source_columns = self.get_columns_for_table(
Collaborator:
I'm slightly concerned about this call, which would trigger a fetch of the columns of the entire database containing this table, but I need to think through better alternatives with fewer duplicates.

Author:
When calling columns_for_schema, a list is passed containing the table name. Shouldn't this then filter the tables?

if column_lineage:
self.aggregator.add_known_query_lineage(
known_query_lineage=KnownQueryLineageInfo(
query_id=f"stream_lineage_{stream.name}",
Collaborator:
Is stream_name fully qualified? @hsheth2 is there a better way to add this to the aggregator? Using add_known_lineage_mapping will generate a different query id every time.

Collaborator:
@brock-acryl is there no way to get the stream definition query?

Author:
Streams do not have a definition like views and tables do; they only have the output of SHOW STREAMS, which is the same as DESCRIBE STREAM.
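One way to keep the synthetic query id stable across runs is to derive it from the fully qualified stream name. A sketch under that assumption, not the connector's actual behavior:

```python
import hashlib


def stream_lineage_query_id(database: str, schema: str, stream: str) -> str:
    # Deterministic id from the fully qualified stream name, so repeated
    # ingestion runs reuse one synthetic query instead of minting new ids.
    fqn = f"{database}.{schema}.{stream}".lower()
    return "stream_lineage_" + hashlib.sha256(fqn.encode()).hexdigest()[:16]


print(stream_lineage_query_id("ANALYTICS", "PUBLIC", "ORDERS_STREAM"))
```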

Comment on lines 555 to 560
# TODO: This is slightly suboptimal because we create two SqlParsingAggregator instances with different configs
# but a shared schema resolver. That's fine for now though - once we remove the old lineage/usage extractors,
# it should be pretty straightforward to refactor this and only initialize the aggregator once.
self.report.queries_extractor = queries_extractor.report
yield from queries_extractor.get_workunits_internal()
queries_extractor.close()
Collaborator @mayurinehate (Jan 21, 2025):
Is the change in indentation accidental?

Author:
Fixed


pipeline = Pipeline(snowflake_pipeline_config)
pipeline.run()
-assert "permission-error" in [
+assert [] == [
Collaborator:
Any thoughts on why this got removed?

Author:
Snowflake SHOW commands return empty lists rather than permission errors. Without this change, the test fails.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Jan 21, 2025
- added comments
- removed unused num_stream reports
- made SnowflakeStream comments optional
- defined tables and view argument datatypes
- updated allowed pattern
- fixed indentation
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Jan 22, 2025
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata needs-review Label for PRs that need review from a maintainer.
4 participants