Feature: Add support for Data Profiling Scan #1392

Open
wants to merge 12 commits into
base: main

Conversation

syou6162
Contributor

@syou6162 syou6162 commented Nov 3, 2024

resolves #1330

Problem

Dataplex data profiling lets you identify common statistical characteristics of the columns in your BigQuery tables. This information helps you to understand and analyze your data more effectively.

If you manage tables with dbt, it is natural to want to configure a Data Profile Scan in a YAML file. If data profiling could be configured from dbt once the table is created, dbt users could adopt the feature more easily.

Solution

This pull request adds support for Data Profile Scans. If you add the following to dbt_project.yml and then run dbt run, the Data Profile Scan is configured automatically.

models:
  +on_schema_change: "sync_all_columns"
  my_project:
    +persist_docs:
      relation: true
      columns: true
    sandbox:
      +schema: sandbox
      +materialized: table
      +data_profile_scan:
        location: us-central1
        sampling_percent: 10
        enabled: "{{ target.name == 'prod'}}"
[Screenshot 2024-11-04 9 04 13]

You can also specify Data Profile Scan settings for an individual model, rather than in dbt_project.yml.

version: 2
models:
  - name: my_table
    config:
      data_profile_scan:
        location: us-central1
        scan_id: my_profile_scan
        sampling_percent: 10
        row_filter: "TRUE"
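
As a rough illustration of how the config keys above might be read on the Python side, here is a minimal sketch. It is not the code in this pull request: the `DataProfileScanConfig` class and the `parse_data_profile_scan` and `_as_bool` helpers are hypothetical names, and the choices to require `location`, to default `enabled` to true, and to accept a `sampling_percent` between 0 and 100 are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class DataProfileScanConfig:
    """Hypothetical container for the `data_profile_scan` keys shown above."""

    location: str
    scan_id: Optional[str] = None
    sampling_percent: Optional[float] = None
    row_filter: Optional[str] = None
    enabled: bool = True  # assumption: the scan is enabled unless stated otherwise


def _as_bool(value: Any) -> bool:
    # A rendered Jinja expression such as "{{ target.name == 'prod' }}"
    # may arrive as the string "True" or "False" rather than a bool.
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"true", "1", "yes"}


def parse_data_profile_scan(raw: Mapping[str, Any]) -> DataProfileScanConfig:
    """Validate a raw config mapping and return a typed config object."""
    if "location" not in raw:
        # assumption: a location is mandatory for this sketch
        raise ValueError("data_profile_scan requires a 'location'")

    percent = raw.get("sampling_percent")
    if percent is not None:
        percent = float(percent)
        if not 0 < percent <= 100:
            raise ValueError("sampling_percent must be in the range (0, 100]")

    return DataProfileScanConfig(
        location=str(raw["location"]),
        scan_id=raw.get("scan_id"),
        sampling_percent=percent,
        row_filter=raw.get("row_filter"),
        enabled=_as_bool(raw.get("enabled", True)),
    )
```

Called with the model-level example, `parse_data_profile_scan({"location": "us-central1", "scan_id": "my_profile_scan", "sampling_percent": 10, "row_filter": "TRUE"})` returns a config whose `enabled` field falls back to the assumed default.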

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

@cla-bot cla-bot bot added the cla:yes label Nov 3, 2024
@syou6162 syou6162 force-pushed the feature/introduce_data_profile_scan branch 3 times, most recently from fd42a67 to 524a19a Compare November 3, 2024 23:25
@syou6162 syou6162 force-pushed the feature/introduce_data_profile_scan branch from 524a19a to e191796 Compare November 3, 2024 23:27
@syou6162 syou6162 force-pushed the feature/introduce_data_profile_scan branch 2 times, most recently from 88f64a4 to 38f1e8c Compare November 3, 2024 23:49
@@ -999,3 +1022,142 @@ def validate_sql(self, sql: str) -> AdapterResponse:
        :param str sql: The sql to validate
        """
        return self.connections.dry_run(sql)

    # If the label `dataplex-dp-published-*` is not assigned, we cannot view the results of the Data Profile Scan from BigQuery
    def _update_labels_with_data_profile_scan_labels(
Contributor Author

Data Profile Scans are sometimes created outside of dbt. When updating or deleting scans programmatically with the CLI or an SDK, it is important to be able to tell whether a given scan was created by dbt. The scan_id could serve that purpose, but I added a managed_by label because structured labels are easier to work with.
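
To make the two kinds of labels concrete, here is a minimal sketch that is separate from the code in this diff. The function names and the "dbt" value for managed_by are hypothetical, and the three `dataplex-dp-published-*` suffixes (project, location, scan) are an assumption based on my reading of the Dataplex documentation, so please verify them before relying on this.

```python
from typing import Dict, Mapping

# Hypothetical value for the managed_by label; the real value may differ.
MANAGED_BY_VALUE = "dbt"


def with_published_scan_labels(
    table_labels: Mapping[str, str],
    project: str,
    location: str,
    scan_id: str,
) -> Dict[str, str]:
    """Return a copy of the table labels with the published-scan labels merged in.

    Existing labels are preserved; only the three dataplex-dp-published-* keys
    (assumed names) are added or overwritten.
    """
    merged = dict(table_labels)
    merged.update(
        {
            "dataplex-dp-published-project": project,
            "dataplex-dp-published-location": location,
            "dataplex-dp-published-scan": scan_id,
        }
    )
    return merged


def is_managed_by_dbt(scan_labels: Mapping[str, str]) -> bool:
    """Check whether a scan's labels mark it as created via dbt."""
    return scan_labels.get("managed_by") == MANAGED_BY_VALUE
```

A cleanup script could then filter scans with `is_managed_by_dbt` and leave alone any scan that was created by other tooling.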

@syou6162 syou6162 force-pushed the feature/introduce_data_profile_scan branch 4 times, most recently from 03f68e3 to b59a087 Compare November 4, 2024 02:35
@syou6162 syou6162 force-pushed the feature/introduce_data_profile_scan branch 2 times, most recently from 7d9e7c5 to 9fa2586 Compare November 4, 2024 03:30
@syou6162 syou6162 force-pushed the feature/introduce_data_profile_scan branch from 9fa2586 to 8a99bfe Compare November 4, 2024 03:35
@syou6162 syou6162 changed the title Feature/introduce data profile scan Feature: Add support for Data Profiling Scan Nov 4, 2024
@syou6162 syou6162 marked this pull request as ready for review November 4, 2024 03:52
@syou6162 syou6162 requested a review from a team as a code owner November 4, 2024 03:52
@syou6162
Contributor Author

syou6162 commented Nov 4, 2024

@colin-rogers-dbt @VersusFacit Could you review this pull request?

I can also open a pull request to update the documentation for BigQuery configurations, so please let me know if you would like that 👍. If so, it would help to know whether you need it before this pull request is merged or whether after the merge is sufficient.

Successfully merging this pull request may close these issues.

[Feature] Support Data Profiling in dbt