Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingestion/clickhouse): add sql aggregator entity registration for sqlglot lineage computation and adding view definition to views #12390

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

acrylJonny
Copy link
Contributor

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Copy link

codecov bot commented Jan 19, 2025

Codecov Report

Attention: Patch coverage is 77.77778% with 28 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ion/src/datahub/ingestion/source/sql/clickhouse.py 77.77% 28 Missing ⚠️
Files with missing lines Coverage Δ
...ion/src/datahub/ingestion/source/sql/clickhouse.py 89.84% <77.77%> (-1.55%) ⬇️

... and 31 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d3ff189...4696698. Read the comment docs.

@acrylJonny acrylJonny changed the title implement functionality and add testing feat(ingestion/clickhouse): add reporter for sqlglot lineage computation and adding view definition to views Jan 21, 2025
@acrylJonny acrylJonny changed the title feat(ingestion/clickhouse): add reporter for sqlglot lineage computation and adding view definition to views feat(ingestion/clickhouse): add sql aggregator for sqlglot lineage computation and adding view definition to views Jan 21, 2025
@acrylJonny acrylJonny changed the title feat(ingestion/clickhouse): add sql aggregator for sqlglot lineage computation and adding view definition to views feat(ingestion/clickhouse): add sql aggregator entity registration for sqlglot lineage computation and adding view definition to views Jan 22, 2025
Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this feels like a lot of somewhat messy code. I think we'd benefit from

  1. integrating my refactor fix(ingest/clickhouse): remove unused lineage_properties code path #12442 first
  2. splitting into two separate PRs - one that adds the bulk fetch for view definitions, and one that improves the view -> lineage extraction

We have a lot of the infrastructure in sql_common already - so I do also suspect that this PR is too complex and possibly doing some duplicative work.

There's also a lingering open question - if we're now using sqlglot / sql aggregator to generate lineage for views, how much is the _populate_lineage method really doing? Should it be removed / simplified?

# For regular views, we still want everything after AS but need to handle differently
# as the CREATE VIEW statement is simpler
if " AS " in create_query:
view_definition = create_query.split(" AS ", 1)[1].strip()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's wrong with having the create view name as ... prefix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not as simple as that with Clickhouse SQL - https://clickhouse.com/docs/en/sql-reference/statements/create/view

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not exactly sure what you're saying - imo having create view name ... in the view definition is harmless.

Does having that in the sql statements cause issues in some way? If so, how does removing it help? Or am I missing some other details here

return None
return wu.metadata.proposedSnapshot

def _register_schema(self, dataset_snapshot: DatasetSnapshotClass) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this registration process already automatically happens in _process_table and _process_view - why do we need to do it again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, I wasn't aware that this already occurred. I will remove.

self, dataset_urn: str
) -> Optional[Tuple[str, str, Optional[str]]]:
"""Extract database name, table name and schema name from dataset URN."""
dataset_key = dataset_urn_to_key(dataset_urn)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will not work if a platform_instance is set

is_view, is_materialized_view, view_definition = view_info

if view_definition:
dataset_snapshot = self._add_view_properties(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the inspector is already being overridden, the _get_view_definition method from sql common should do the right thing automatically?

dataset_snapshot: DatasetSnapshotClass = wu.metadata.proposedSnapshot
assert dataset_snapshot

lineage_mcp, lineage_properties_aspect = self.get_lineage_mcp(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like the original code was complex for no reason. I'm going to simplify that here #12442, unless you'd like to integrate that into this PR

@acrylJonny
Copy link
Contributor Author

Overall this feels like a lot of somewhat messy code. I think we'd benefit from

  1. integrating my refactor fix(ingest/clickhouse): remove unused lineage_properties code path #12442 first
  2. splitting into two separate PRs - one that adds the bulk fetch for view definitions, and one that improves the view -> lineage extraction

We have a lot of the infrastructure in sql_common already - so I do also suspect that this PR is too complex and possibly doing some duplicative work.

There's also a lingering open question - if we're now using sqlglot / sql aggregator to generate lineage for views, how much is the _populate_lineage method really doing? Should it be removed / simplified?

I'd be cautious about removing a lot of the code as there seem to be a number of things brought in from the system tables in Clickhouse, including the partition_key, total_rows, total_bytes, primary_key from my reading of the existing code and looking at the system tables. These could be repurposed to be in the relevant areas, whereas at the moment they look to be being added purely as properties. It's also worth noting that we should also be cautious about wholesale changes in the connector without first understanding the impact. For example, the extra SQL queries in the original connector do give table to table lineage, rather than purely table to view. There is some messy pieces from this PR that can be cleaned up, but in general a lot of the messy code is pre-existing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants