Feature: Add support for Data Profiling Scan #1392
dbt/adapters/bigquery/impl.py
Outdated
@@ -999,3 +1022,142 @@ def validate_sql(self, sql: str) -> AdapterResponse:
        :param str sql: The sql to validate
        """
        return self.connections.dry_run(sql)

    # If the label `dataplex-dp-published-*` is not assigned, we cannot view the results of the Data Profile Scan from BigQuery
    def _update_labels_with_data_profile_scan_labels(
Data Profile Scans are sometimes used for purposes other than dbt. When updating or deleting them mechanically with the CLI or SDK, it is important to be able to tell whether a given Data Profile Scan was created via dbt. You could use `scan_id` for this, but I added the `managed_by` label because structured labels are easier to handle.
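As a rough illustration of why a structured label is convenient here, the sketch below tags a label dict with a managed-by marker and then filters a scan listing on it. This is not the code in this pull request: the helper names, the label value `dbt`, and the plain-dict shape of a scan are all assumptions made for the example, and no Dataplex client is called.

```python
from typing import Dict, Iterable, List, Mapping, Optional

# Assumption: the PR names a `managed_by` label, but its value is not shown
# in this conversation. "dbt" is used here purely for illustration.
MANAGED_BY_KEY = "managed_by"
MANAGED_BY_VALUE = "dbt"


def with_managed_by_label(labels: Optional[Mapping[str, str]]) -> Dict[str, str]:
    """Return a copy of `labels` with the managed-by marker added (hypothetical helper)."""
    merged = dict(labels or {})
    merged[MANAGED_BY_KEY] = MANAGED_BY_VALUE
    return merged


def is_managed_by_dbt(labels: Optional[Mapping[str, str]]) -> bool:
    """True when the labels carry the managed-by marker."""
    return (labels or {}).get(MANAGED_BY_KEY) == MANAGED_BY_VALUE


def select_dbt_managed(scans: Iterable[Mapping[str, object]]) -> List[str]:
    """Pick the names of scans created via dbt.

    `scans` is a list of plain dicts with "name" and "labels" keys, standing in
    for whatever a CLI or SDK listing would return.
    """
    return [str(scan["name"]) for scan in scans if is_managed_by_dbt(scan.get("labels"))]


if __name__ == "__main__":
    scans = [
        {"name": "scan_a", "labels": with_managed_by_label({"team": "analytics"})},
        {"name": "scan_b", "labels": {"team": "ops"}},
    ]
    print(select_dbt_managed(scans))
```

Filtering on a label avoids having to encode ownership into `scan_id` and parse it back out, which is the trade-off the comment above describes.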
@colin-rogers-dbt @VersusFacit Could you review this pull request? I can also open a pull request to update the documentation for BigQuery configurations 👍. If you would like that, please let me know whether you need it before this pull request is merged or whether it would be sufficient after the merge.
resolves #1330
Problem
Dataplex data profiling lets you identify common statistical characteristics of the columns in your BigQuery tables. This information helps you to understand and analyze your data more effectively.
If you are managing tables with dbt, it is natural to want to configure Data Profile Scan in a yaml file. If data profiling could be set within dbt after the table is created, it would make it easier for dbt users to use the data profiling function.
Solution
I created this pull request to add support for Data Profiling Scan. If you write the following in `dbt_project.yml` and then run `dbt run`, the Data Profile Scan settings will be configured automatically.
You can also specify Data Profile Scan settings in individual model files, rather than in `dbt_project.yml`.
Checklist