-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingestion/clickhouse): add sql aggregator entity registration for sqlglot lineage computation and adding view definition to views #12390
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
... and 31 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
metadata-ingestion/src/datahub/ingestion/source/sql/clickhouse.py
Outdated
Show resolved
Hide resolved
metadata-ingestion/src/datahub/ingestion/source/sql/clickhouse.py
Outdated
Show resolved
Hide resolved
metadata-ingestion/tests/integration/clickhouse/test_clickhouse.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this feels like a lot of somewhat messy code. I think we'd benefit from
- integrating my refactor fix(ingest/clickhouse): remove unused lineage_properties code path #12442 first
- splitting into two separate PRs - one that adds the bulk fetch for view definitions, and one that improves the view -> lineage extraction
We have a lot of the infrastructure in sql_common already - so I do also suspect that this PR is too complex and possibly doing some duplicative work.
There's also a lingering open question - if we're now using sqlglot / sql aggregator to generate lineage for views, how much is the _populate_lineage
method really doing? Should it be removed / simplified?
# For regular views, we still want everything after AS but need to handle differently | ||
# as the CREATE VIEW statement is simpler | ||
if " AS " in create_query: | ||
view_definition = create_query.split(" AS ", 1)[1].strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's wrong with having the create view name as ...
prefix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not as simple as that with Clickhouse SQL - https://clickhouse.com/docs/en/sql-reference/statements/create/view
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not exactly sure what you're saying - imo having create view name ...
in the view definition is harmless.
Does having that in the sql statements cause issues in some way? If so, how does removing it help? Or am I missing some other details here
return None | ||
return wu.metadata.proposedSnapshot | ||
|
||
def _register_schema(self, dataset_snapshot: DatasetSnapshotClass) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this registration process already automatically happens in _process_table
and _process_view
- why do we need to do it again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, I wasn't aware that this already occurred. I will remove.
self, dataset_urn: str | ||
) -> Optional[Tuple[str, str, Optional[str]]]: | ||
"""Extract database name, table name and schema name from dataset URN.""" | ||
dataset_key = dataset_urn_to_key(dataset_urn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will not work if a platform_instance
is set
is_view, is_materialized_view, view_definition = view_info | ||
|
||
if view_definition: | ||
dataset_snapshot = self._add_view_properties( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since the inspector is already being overridden, the _get_view_definition
method from sql common should do the right thing automatically?
dataset_snapshot: DatasetSnapshotClass = wu.metadata.proposedSnapshot | ||
assert dataset_snapshot | ||
|
||
lineage_mcp, lineage_properties_aspect = self.get_lineage_mcp( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like the original code was complex for no reason. I'm going to simplify that here #12442, unless you'd like to integrate that into this PR
I'd be cautious about removing a lot of the code as there seem to be a number of things brought in from the system tables in Clickhouse, including the partition_key, total_rows, total_bytes, primary_key from my reading of the existing code and looking at the system tables. These could be repurposed to be in the relevant areas, whereas at the moment they look to be being added purely as properties. It's also worth noting that we should also be cautious about wholesale changes in the connector without first understanding the impact. For example, the extra SQL queries in the original connector do give table to table lineage, rather than purely table to view. There is some messy pieces from this PR that can be cleaned up, but in general a lot of the messy code is pre-existing. |
Checklist