add support for delta lake table_changes table valued function as DBT source #12512

Open
wants to merge 5 commits into
base: master

Conversation

haon85
Contributor

@haon85 haon85 commented Jan 31, 2025

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion and community-contribution labels Jan 31, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review label Jan 31, 2025

codecov bot commented Jan 31, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...adata-ingestion/src/datahub/sql_parsing/_models.py 66.66% 2 Missing ⚠️

❌ Your patch check has failed because the patch coverage (87.50%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Files with missing lines Coverage Δ
...tion/src/datahub/sql_parsing/sql_parsing_common.py 100.00% <100.00%> (ø)
...gestion/src/datahub/sql_parsing/sqlglot_lineage.py 92.41% <100.00%> (-1.57%) ⬇️
...adata-ingestion/src/datahub/sql_parsing/_models.py 77.63% <66.66%> (+1.54%) ⬆️

... and 72 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update aaaa655...9507176.

Collaborator

@hsheth2 hsheth2 left a comment

I'm not super familiar with table-valued functions in Delta Lake. Could you (1) link to the docs on them and (2) add some unit tests for this code?

Overall, modifying the from_sqlglot_table method feels pretty hacky, and it doesn't seem like the right place to make these changes.

@datahub-cyborg datahub-cyborg bot added the pending-submitter-response label and removed the needs-review label Jan 31, 2025
@haon85
Contributor Author

haon85 commented Feb 2, 2025

Thanks for the review @hsheth2,
Here is the link to the TVF docs: https://docs.databricks.com/en/sql/language-manual/functions/table_changes.html

As for where to make this change, I thought about it for a while and did not find a good place. Our scenario: we use Delta Lake tables for storage, register them in the Hive metastore, and do ETL in DBT SQL code. To support this use case, we can either change sqlglot so that a TVF-style table source is represented correctly in the parsed node, or handle it when we read the table source from the sqlglot node. The latter is what I am doing now.

I am quite new to the DataHub code and could definitely use some help. If you have any suggestions for better supporting this use case, I am all ears.

I'll add unit tests soon.
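
For illustration, the second option described above (resolving the TVF to its underlying table when reading table sources) can be sketched independently of sqlglot. This is a minimal sketch, not the code in this PR: it assumes the first argument of table_changes is a quoted, optionally qualified table name, as described in the linked Databricks docs, and the names TableRef, table_ref_from_name, and find_table_changes_sources are hypothetical. A regex scan is used only to keep the example self-contained; a real implementation would more likely inspect the parsed expression tree.

```python
import re
from typing import List, NamedTuple, Optional


class TableRef(NamedTuple):
    """Hypothetical stand-in for a parsed table reference (not DataHub's class)."""

    catalog: Optional[str]
    schema: Optional[str]
    table: str


# Matches table_changes('<name>', ...) and captures the quoted first argument.
_TABLE_CHANGES_RE = re.compile(
    r"\btable_changes\s*\(\s*(['\"])(?P<name>[^'\"]+)\1",
    re.IGNORECASE,
)


def table_ref_from_name(name: str) -> TableRef:
    """Split an optionally qualified name into (catalog, schema, table).

    Assumption: identifiers do not themselves contain dots, so a plain
    split on "." is enough for this sketch.
    """
    parts = [part.strip("` ") for part in name.split(".")]
    if not 1 <= len(parts) <= 3 or not all(parts):
        raise ValueError(f"unexpected table name: {name!r}")
    # Left-pad with None so the last element is always the table name.
    padded = [None] * (3 - len(parts)) + parts
    return TableRef(*padded)


def find_table_changes_sources(sql: str) -> List[TableRef]:
    """Return the underlying tables referenced through table_changes(...)."""
    return [
        table_ref_from_name(match.group("name"))
        for match in _TABLE_CHANGES_RE.finditer(sql)
    ]
```

With a helper like this, a query such as SELECT * FROM table_changes('sales.orders', 2) would report sales.orders as its upstream table rather than a table named table_changes.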


@datahub-cyborg datahub-cyborg bot added the needs-review label and removed the pending-submitter-response label Feb 2, 2025
Labels: community-contribution, ingestion, needs-review