diff --git a/docs-website/sidebars.js b/docs-website/sidebars.js
index 2eb600eff74e8..9d6d2a59978f5 100644
--- a/docs-website/sidebars.js
+++ b/docs-website/sidebars.js
@@ -79,6 +79,18 @@ module.exports = {
           id: "docs/managed-datahub/observe/volume-assertions",
           className: "saasOnly",
         },
+        {
+          label: "Open Assertions Specification",
+          type: "category",
+          link: { type: "doc", id: "docs/assertions/open-assertions-spec" },
+          items: [
+            {
+              label: "Snowflake",
+              type: "doc",
+              id: "docs/assertions/snowflake/snowflake_dmfs",
+            },
+          ],
+        },
       ],
     },
     {
diff --git a/docs/assertions/open-assertions-spec.md b/docs/assertions/open-assertions-spec.md
new file mode 100644
index 0000000000000..519e917c30587
--- /dev/null
+++ b/docs/assertions/open-assertions-spec.md
@@ -0,0 +1,486 @@
+# DataHub Open Data Quality Assertions Specification
+
+DataHub is developing an open-source Data Quality Assertions Specification & Compiler that will allow you to declare data quality checks / expectations / assertions using a simple, universal
+YAML-based format, and then compile this into artifacts that can be registered or directly executed by 3rd-party Data Quality tools like [Snowflake DMFs](https://docs.snowflake.com/en/user-guide/data-quality-intro),
+dbt tests, Great Expectations, or Acryl Cloud natively.
+
+Ultimately, our goal is to provide a framework-agnostic, highly portable format for defining Data Quality checks, making it seamless to swap out the underlying
+assertion engine without service disruption for end consumers of the results of these data quality checks in cataloging tools like DataHub.
+
+## Integrations
+
+Currently, the DataHub Open Assertions Specification supports the following integrations:
+
+- [Snowflake DMF Assertions](snowflake/snowflake_dmfs.md)
+
+We are looking for contributions to build out support for the following integrations:
+
+- [Looking for Contributions] dbt tests
+- [Looking for Contributions] Great Expectations checks
+
+Below, we'll look at how to define assertions in YAML, and then provide a usage overview for each supported integration.
+
+## The Specification: Declaring Data Quality Assertions in YAML
+
+The following assertion types are currently supported by the DataHub YAML Assertion spec:
+
+- [Freshness](/docs/managed-datahub/observe/freshness-assertions.md)
+- [Volume](/docs/managed-datahub/observe/volume-assertions.md)
+- [Column](/docs/managed-datahub/observe/column-assertions.md)
+- [Custom SQL](/docs/managed-datahub/observe/custom-sql-assertions.md)
+- [Schema](/docs/managed-datahub/observe/schema-assertions.md)
+
+Each assertion type aims to validate a different aspect of a structured table (e.g. on a data warehouse or data lake), from
+structure to size to column integrity to custom metrics.
+
+In this section, we'll go over examples of defining each.
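+
+Every assertion file shares the same top-level structure: a `version` and a list of `assertions`, each bound to a dataset URN. As a purely illustrative example, a single (hypothetical) file might combine a freshness check and a volume check for the same table; the individual fields are explained in the sections that follow:
+
+```yaml
+version: 1
+assertions:
+  # Freshness: the table should be updated at least every 6 hours
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: freshness
+    lookback_interval: '6 hours'
+    last_modified_field: updated_at
+    schedule:
+      type: interval
+      interval: '6 hours'
+  # Volume: the table should never shrink below 1000 rows
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: volume
+    metric: 'row_count'
+    condition:
+      type: greater_than_or_equal_to
+      value: 1000
+    schedule:
+      type: on_table_change
+```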
+
+### Freshness Assertions
+
+Freshness Assertions allow you to verify that your data was updated within the expected timeframe.
+Below you'll find examples of defining different types of freshness assertions via YAML.
+
+#### Validating that Table is Updated Every 6 Hours
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: freshness
+    lookback_interval: '6 hours'
+    last_modified_field: updated_at
+    schedule:
+      type: interval
+      interval: '6 hours' # Run every 6 hours
+```
+
+This assertion checks that the `purchase_events` table in the `test_db.public` schema was updated within the last 6 hours,
+by issuing a query to the table that determines whether an update was made in the past 6 hours using the `updated_at` column.
+To use this check, we must specify the field that contains the last-modified timestamp of a given row.
+
+The `lookback_interval` field is used to specify the "lookback window" for the assertion, whereas the `schedule` field is used to specify how often the assertion should be run.
+This allows you to schedule the assertion to run at a different frequency than the lookback window, for example
+to detect stale data as soon as it becomes "stale" by inspecting it more frequently.
+
+#### Supported Source Types
+
+Currently, the only supported `sourceType` for Freshness Assertions is `LAST_MODIFIED_FIELD`. In the future,
+we may support additional source types, such as `HIGH_WATERMARK`, along with data source-specific types such as
+`AUDIT_LOG` and `INFORMATION_SCHEMA`.
+
+
+### Volume Assertions
+
+Volume Assertions allow you to verify that the number of records in your dataset meets your expectations.
+Below you'll find examples of defining different types of volume assertions via YAML.
+
+#### Validating that Table Row Count is in Expected Range
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: volume
+    metric: 'row_count'
+    condition:
+      type: between
+      min: 1000
+      max: 10000
+    # filters: "event_type = 'purchase'" Optionally add filters.
+    schedule:
+      type: on_table_change # Run when new data is added to the table.
+```
+
+This assertion checks that the `purchase_events` table in the `test_db.public` schema has between 1000 and 10000 records.
+Using the `condition` field, you can specify the type of comparison to be made, and the `min` and `max` fields to specify the range of values to compare against.
+Using the `filters` field, you can optionally specify a SQL WHERE clause to filter the records being counted.
+Using the `schedule` field, you can specify when the assertion should be run, either on a fixed schedule or when new data is added to the table.
+The only metric currently supported is `row_count`.
+
+#### Validating that Table Row Count is Less Than Value
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: volume
+    metric: 'row_count'
+    condition:
+      type: less_than_or_equal_to
+      value: 1000
+    # filters: "event_type = 'purchase'" Optionally add filters.
+    schedule:
+      type: on_table_change # Run when new data is added to the table.
+```
+
+#### Validating that Table Row Count is Greater Than Value
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: volume
+    metric: 'row_count'
+    condition:
+      type: greater_than_or_equal_to
+      value: 1000
+    # filters: "event_type = 'purchase'" Optionally add filters.
+    schedule:
+      type: on_table_change # Run when new data is added to the table.
+```
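+
+As a further illustration, the optional `filters` clause and a fixed `interval` schedule can be combined. The following is a hypothetical sketch (the filter expression and threshold are made up) that counts only purchase rows and re-evaluates every 6 hours:
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: volume
+    metric: 'row_count'
+    condition:
+      type: greater_than_or_equal_to
+      value: 100
+    filters: "event_type = 'purchase'" # Only count purchase rows
+    schedule:
+      type: interval
+      interval: '6 hours'
+```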
+
+
+#### Supported Conditions
+
+The full set of supported volume assertion conditions includes:
+
+- `equal_to`
+- `not_equal_to`
+- `greater_than`
+- `greater_than_or_equal_to`
+- `less_than`
+- `less_than_or_equal_to`
+- `between`
+
+
+### Column Assertions
+
+Column Assertions allow you to verify that the values in a column meet your expectations.
+Below you'll find examples of defining different types of column assertions via YAML.
+
+The specification currently supports 2 types of Column Assertions:
+
+- **Field Value**: Asserts that the values in a column meet a specific condition.
+- **Field Metric**: Asserts that a specific metric aggregated across the values in a column meets a specific condition.
+
+We'll go over examples of each below.
+
+#### Field Values Assertion: Validating that All Column Values are In Expected Range
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: field
+    field: amount
+    condition:
+      type: between
+      min: 0
+      max: 10
+    exclude_nulls: True
+    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
+    # failure_threshold:
+    #   type: count
+    #   value: 10
+    schedule:
+      type: on_table_change
+```
+
+This assertion checks that all values for the `amount` column in the `purchase_events` table in the `test_db.public` schema are between 0 and 10.
+Using the `field` field, you can specify the column to be asserted on, and using the `condition` field, you can specify the type of comparison to be made,
+along with the `min` and `max` fields to specify the range of values to compare against.
+Using the `schedule` field, you can specify when the assertion should be run, either on a fixed schedule or when new data is added to the table.
+Using the `filters` field, you can optionally specify a SQL WHERE clause to filter the rows being evaluated.
+Using the `exclude_nulls` field, you can specify whether to exclude NULL values from the assertion, meaning that
+NULL will simply be ignored if encountered, as opposed to failing the check.
+Using the `failure_threshold` field, you can set a threshold for the number of rows that can fail the assertion before the assertion is considered failed.
+
+#### Field Values Assertion: Validating that All Column Values are In Expected Set
+
+To validate that a VARCHAR / STRING column contains one of an expected set of values:
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: field
+    field: product_id
+    condition:
+      type: in
+      value:
+        - 'product_1'
+        - 'product_2'
+        - 'product_3'
+    exclude_nulls: False
+    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
+    # failure_threshold:
+    #   type: count
+    #   value: 10
+    schedule:
+      type: on_table_change
+```
+
+#### Field Values Assertion: Validating that All Column Values are Email Addresses
+
+To validate that a string column contains valid email addresses:
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: field
+    field: email_address
+    condition:
+      type: matches_regex
+      value: '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
+    exclude_nulls: False
+    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
+    # failure_threshold:
+    #   type: count
+    #   value: 10
+    schedule:
+      type: on_table_change
+```
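+
+The commented-out `failure_threshold` block shown above can be enabled to tolerate a bounded number of failing rows. The following is an illustrative sketch (the column and threshold are hypothetical) that requires `amount` to be non-negative, but only fails the assertion once more than 10 rows violate the condition:
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: field
+    field: amount
+    condition:
+      type: greater_than_or_equal_to
+      value: 0
+    exclude_nulls: True
+    failure_threshold:
+      type: count
+      value: 10 # Allow up to 10 offending rows before the assertion fails
+    schedule:
+      type: on_table_change
+```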
+
+#### Field Values Assertion: Supported Conditions
+
+The full set of supported field value conditions includes:
+
+- `in`
+- `not_in`
+- `is_null`
+- `is_not_null`
+- `equal_to`
+- `not_equal_to`
+- `greater_than` # Numeric Only
+- `greater_than_or_equal_to` # Numeric Only
+- `less_than` # Numeric Only
+- `less_than_or_equal_to` # Numeric Only
+- `between` # Numeric Only
+- `matches_regex` # String Only
+- `not_empty` # String Only
+- `length_greater_than` # String Only
+- `length_less_than` # String Only
+- `length_between` # String Only
+
+
+#### Field Metric Assertion: Validating No Missing Values in Column
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: field
+    field: col_date
+    metric: null_count
+    condition:
+      type: equal_to
+      value: 0
+    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
+    schedule:
+      type: on_table_change
+```
+
+This assertion ensures that the `col_date` column in the `purchase_events` table in the `test_db.public` schema has no NULL values.
+
+#### Field Metric Assertion: Validating No Duplicates in Column
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: field
+    field: id
+    metric: unique_percentage
+    condition:
+      type: equal_to
+      value: 100
+    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
+    schedule:
+      type: on_table_change
+```
+
+This assertion ensures that the `id` column in the `purchase_events` table in the `test_db.public` schema
+has no duplicates, by checking that the unique percentage is 100%.
+
+#### Field Metric Assertion: Validating String Column is Never Empty String
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: field
+    field: name
+    metric: empty_percentage
+    condition:
+      type: equal_to
+      value: 0
+    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
+    schedule:
+      type: on_table_change
+```
+
+This assertion ensures that the `name` column in the `purchase_events` table in the `test_db.public` schema is never empty, by checking that the empty percentage is 0%.
+
+#### Field Metric Assertion: Supported Metrics
+
+The full set of supported field metrics includes:
+
+- `null_count`
+- `null_percentage`
+- `unique_count`
+- `unique_percentage`
+- `empty_count`
+- `empty_percentage`
+- `min`
+- `max`
+- `mean`
+- `median`
+- `stddev`
+- `negative_count`
+- `negative_percentage`
+- `zero_count`
+- `zero_percentage`
+
+#### Field Metric Assertion: Supported Conditions
+
+The full set of supported field metric conditions includes:
+
+- `equal_to`
+- `not_equal_to`
+- `greater_than`
+- `greater_than_or_equal_to`
+- `less_than`
+- `less_than_or_equal_to`
+- `between`
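+
+Numeric metrics from the list above can also be combined with range conditions. For example, a hypothetical check that the average purchase `amount` stays within an expected band (the bounds here are made up):
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: field
+    field: amount
+    metric: mean
+    condition:
+      type: between
+      min: 5
+      max: 50
+    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
+    schedule:
+      type: on_table_change
+```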
+
+### Custom SQL Assertions
+
+Custom SQL Assertions allow you to define custom SQL queries to verify your data meets your expectations.
+The only requirement is that the SQL query must return a single value, which will be compared against the expected value.
+Below you'll find examples of defining different types of custom SQL assertions via YAML.
+
+SQL Assertions are useful for more complex data quality checks that can't be easily expressed using the other assertion types,
+and can be used to assert on custom metrics, complex aggregations, cross-table integrity checks (JOINs), or any other SQL-based data quality check.
+
+#### Validating Foreign Key Integrity
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: sql
+    statement: |
+      SELECT COUNT(*)
+      FROM test_db.public.purchase_events AS pe
+      LEFT JOIN test_db.public.products AS p
+      ON pe.product_id = p.id
+      WHERE p.id IS NULL
+    condition:
+      type: equal_to
+      value: 0
+    schedule:
+      type: interval
+      interval: '6 hours' # Run every 6 hours
+```
+
+This assertion checks that the `purchase_events` table in the `test_db.public` schema has no rows where the `product_id` column does not have a corresponding `id` in the `products` table.
+
+#### Comparing Row Counts Across Multiple Tables
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: sql
+    statement: |
+      SELECT COUNT(*) FROM test_db.public.purchase_events
+      - (SELECT COUNT(*) FROM test_db.public.purchase_events_raw) AS row_count_difference
+    condition:
+      type: equal_to
+      value: 0
+    schedule:
+      type: interval
+      interval: '6 hours' # Run every 6 hours
+```
+
+This assertion checks that the number of rows in the `purchase_events` table exactly matches the number of rows in an upstream `purchase_events_raw` table,
+by subtracting the row count of the raw table from the row count of the processed table.
+
+#### Supported Conditions
+
+The full set of supported custom SQL assertion conditions includes:
+
+- `equal_to`
+- `not_equal_to`
+- `greater_than`
+- `greater_than_or_equal_to`
+- `less_than`
+- `less_than_or_equal_to`
+- `between`
+
+
+### Schema Assertions (Coming Soon)
+
+Schema Assertions allow you to verify that the schema of a table - its column names and their data types - matches your expectations.
+Below you'll find examples of defining different types of schema assertions via YAML.
+
+The specification currently supports 2 types of Schema Assertions:
+
+- **Exact Match**: Asserts that the schema of a table - column names and their data types - exactly matches an expected schema
+- **Contains Match** (Subset): Asserts that the schema of a table - column names and their data types - contains an expected schema, i.e. the expected columns are a subset of the actual schema
+
+#### Validating Actual Schema Exactly Equals Expected Schema
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: schema
+    condition:
+      type: exact_match
+      columns:
+        - name: id
+          type: INTEGER
+        - name: product_id
+          type: STRING
+        - name: amount
+          type: DECIMAL
+        - name: updated_at
+          type: TIMESTAMP
+    schedule:
+      type: interval
+      interval: '6 hours' # Run every 6 hours
+```
+
+This assertion checks that the `purchase_events` table in the `test_db.public` schema has exactly the schema specified, with the exact column names and data types.
+
+#### Validating Actual Schema Contains All of Expected Schema
+
+```yaml
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: schema
+    condition:
+      type: contains
+      columns:
+        - name: id
+          type: integer
+        - name: product_id
+          type: string
+        - name: amount
+          type: number
+    schedule:
+      type: interval
+      interval: '6 hours' # Run every 6 hours
+```
+
+This assertion checks that the `purchase_events` table in the `test_db.public` schema contains all of the columns specified in the expected schema, with the exact column names and data types.
+The actual schema can also contain additional columns not specified in the expected schema.
+
+#### Supported Data Types
+
+The following high-level data types are currently supported by the Schema Assertion spec:
+
+- string
+- number
+- boolean
+- date
+- timestamp
+- struct
+- array
+- map
+- union
+- bytes
+- enum
diff --git a/docs/assertions/snowflake/snowflake_dmfs.md b/docs/assertions/snowflake/snowflake_dmfs.md
new file mode 100644
index 0000000000000..e7801a5cbb71b
--- /dev/null
+++ b/docs/assertions/snowflake/snowflake_dmfs.md
@@ -0,0 +1,224 @@
+# Snowflake DMF Assertions [BETA]
+
+The DataHub Open Assertion Compiler allows you to define your Data Quality assertions in a simple YAML format, and then compile them to be executed by Snowflake Data Metric Functions (DMFs).
+Once compiled, you'll be able to register the compiled DMFs in your Snowflake environment, and extract their results as part of your normal ingestion process for DataHub.
+Results of Snowflake DMF assertions will be reported as normal Assertion Results, viewable on a historical timeline in the context
+of the table with which they are associated.
+
+## Prerequisites
+
+- You must have a Snowflake Enterprise account, where the DMFs feature is enabled.
+- You must have the necessary permissions to provision DMFs in your Snowflake environment (see below).
+- You must have the necessary permissions to query the DMF results in your Snowflake environment (see below).
+- You must have a DataHub instance with Snowflake metadata ingested. If you do not have an existing Snowflake ingestion, refer to the [Snowflake Quickstart Guide](https://datahubproject.io/docs/quick-ingestion-guides/snowflake/overview) to get started.
+- You must have the DataHub CLI installed and have run [`datahub init`](https://datahubproject.io/docs/cli/#init).
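+
+If you have not set up the CLI yet, a typical setup looks like the following (assuming a pip-based installation; adjust to your environment):
+
+```bash
+# Install (or upgrade) the DataHub CLI
+pip install --upgrade acryl-datahub
+
+# Point the CLI at your DataHub instance (prompts for the server URL and token)
+datahub init
+```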
+
+### Permissions
+
+*Permissions required for registering DMFs*
+
+According to the latest Snowflake docs, here are the permissions the service account performing the
+DMF registration and ingestion must have:
+
+| Privilege | Object | Notes |
+|------------------------------|------------------|-------------------------------------------------------------------------------------------------------------------------|
+| USAGE | Database, schema | Database and schema where the Snowflake DMFs will be created. This is configured in the compile command described below. |
+| CREATE FUNCTION | Schema | This privilege enables creating new DMFs in the schema configured in the compile command. |
+| EXECUTE DATA METRIC FUNCTION | Account | This privilege enables you to control which roles have access to server-agnostic compute resources to call the system DMF. |
+| USAGE | Database, schema | These objects are the database and schema that contain the referenced table in the query. |
+| OWNERSHIP | Table | This privilege enables you to associate a DMF with a referenced table. |
+| USAGE | DMF | This privilege enables calling the DMF in the schema configured in the compile command. |
+
+and the roles that must be granted:
+
+| Role | Notes |
+|----------------------------|--------------------|
+| SNOWFLAKE.DATA_METRIC_USER | To use System DMFs |
+
+*Permissions required for running DMFs (scheduled DMFs run with the table owner's role)*
+
+Because scheduled DMFs run with the role of the table owner, the table owner must have the following privileges:
+
+| Privilege | Object | Notes |
+|------------------------------|------------------|-------------------------------------------------------------------------------------------------------------------------|
+| USAGE | Database, schema | Database and schema where the Snowflake DMFs will be created. This is configured in the compile command described below. |
+| USAGE | DMF | This privilege enables calling the DMF in the schema configured in the compile command. |
+| EXECUTE DATA METRIC FUNCTION | Account | This privilege enables you to control which roles have access to server-agnostic compute resources to call the system DMF. |
+
+and the roles that must be granted:
+
+| Role | Notes |
+|----------------------------|--------------------|
+| SNOWFLAKE.DATA_METRIC_USER | To use System DMFs |
+
+*Permissions required for querying DMF results*
+
+In addition, the service account that will be executing DataHub ingestion, and querying the DMF results, must have been granted the following system application role:
+
+| Role | Notes |
+|---------------------------------|-----------------------------|
+| DATA_QUALITY_MONITORING_VIEWER  | Query the DMF results table |
+
+To learn more about Snowflake DMFs and the privileges required to provision and query them, see the [Snowflake documentation](https://docs.snowflake.com/en/user-guide/data-quality-intro).
+
+*Example: Granting Permissions*
+
+```sql
+-- setup permissions for <dmf-creation-role> to create DMFs and associate DMFs with the table
+grant usage on database "<dmf-database>" to role "<dmf-creation-role>";
+grant usage on schema "<dmf-database>"."<dmf-schema>" to role "<dmf-creation-role>";
+grant create function on schema "<dmf-database>"."<dmf-schema>" to role "<dmf-creation-role>";
+-- grant ownership of the table + the rest of the permissions to <dmf-creation-role>
+grant role "<table-owner-role>" to role "<dmf-creation-role>";
+
+-- setup permissions for <table-owner-role> to run DMFs on schedule
+grant usage on database "<dmf-database>" to role "<table-owner-role>";
+grant usage on schema "<dmf-database>"."<dmf-schema>" to role "<table-owner-role>";
+grant usage on all functions in schema "<dmf-database>"."<dmf-schema>" to role "<table-owner-role>";
+grant usage on future functions in schema "<dmf-database>"."<dmf-schema>" to role "<table-owner-role>";
+grant database role SNOWFLAKE.DATA_METRIC_USER to role "<table-owner-role>";
+grant execute data metric function on account to role "<table-owner-role>";
+
+-- setup permissions for <datahub-ingestion-role> to query DMF results
+grant application role SNOWFLAKE.DATA_QUALITY_MONITORING_VIEWER to role "<datahub-ingestion-role>";
+```
+
+## Supported Assertion Types
+
+The following assertion types are currently supported by the DataHub Snowflake DMF Assertion Compiler:
+
+- [Freshness](/docs/managed-datahub/observe/freshness-assertions.md)
+- [Volume](/docs/managed-datahub/observe/volume-assertions.md)
+- [Column](/docs/managed-datahub/observe/column-assertions.md)
+- [Custom SQL](/docs/managed-datahub/observe/custom-sql-assertions.md)
+
+Note that Schema Assertions are not currently supported.
+
+## Creating Snowflake DMF Assertions
+
+The process for declaring and running assertions backed by Snowflake DMFs consists of a few steps, which are outlined
+in the following sections.
+
+
+### Step 1. Define your Data Quality assertions using Assertion YAML files
+
+See [Declaring Data Quality Assertions in YAML](../open-assertions-spec.md) for examples of how to define assertions in YAML.
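+
+For instance, a minimal assertions file might look like the following (illustrative only; the dataset URN and threshold are placeholders):
+
+```yaml
+# examples/library/assertions_configuration.yml (hypothetical contents)
+version: 1
+assertions:
+  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
+    type: volume
+    metric: 'row_count'
+    condition:
+      type: greater_than_or_equal_to
+      value: 1000
+    schedule:
+      type: on_table_change
+```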
+
+
+### Step 2. Register your assertions with DataHub
+
+Use the DataHub CLI to register your assertions with DataHub, so they become visible in the DataHub UI:
+
+```bash
+datahub assertions upsert -f examples/library/assertions_configuration.yml
+```
+
+
+### Step 3. Compile the assertions into Snowflake DMFs using the DataHub CLI
+
+Next, we'll use the `assertions compile` command to generate the SQL code for the Snowflake DMFs,
+which can then be registered in Snowflake.
+
+```bash
+datahub assertions compile -f examples/library/assertions_configuration.yml -p snowflake -x DMF_SCHEMA=<dmf-database>.<dmf-schema>
+```
+
+Two files will be generated as output of running this command:
+
+- `dmf_definitions.sql`: This file contains the SQL code for the DMFs that will be registered in Snowflake.
+- `dmf_associations.sql`: This file contains the SQL code for associating the DMFs with the target tables in Snowflake.
+
+By default, these files are written to a folder called `target`. You can use the `-o <output-folder>` option of the `compile` command to write these compile artifacts to another folder.
+
+Each of these artifacts will be important for the next steps in the process.
+
+_dmf_definitions.sql_
+
+This file stores the SQL code for the DMFs that will be registered in Snowflake, generated
+from your YAML assertion definitions during the compile step.
+
+```sql
+-- Example dmf_definitions.sql
+
+-- Start of Assertion 5c32eef47bd763fece7d21c7cbf6c659
+
+    CREATE or REPLACE DATA METRIC FUNCTION
+    test_db.datahub_dmfs.datahub__5c32eef47bd763fece7d21c7cbf6c659 (ARGT TABLE(col_date DATE))
+    RETURNS NUMBER
+    COMMENT = 'Created via DataHub for assertion urn:li:assertion:5c32eef47bd763fece7d21c7cbf6c659 of type volume'
+    AS
+    $$
+    select case when metric <= 1000 then 1 else 0 end from (select count(*) as metric from TEST_DB.PUBLIC.TEST_ASSERTIONS_ALL_TIMES )
+    $$;
+
+-- End of Assertion 5c32eef47bd763fece7d21c7cbf6c659
+....
+```
+
+_dmf_associations.sql_
+
+This file stores the SQL code for associating the generated DMFs with the target tables,
+along with scheduling them to run at particular times.
+
+```sql
+-- Example dmf_associations.sql
+
+-- Start of Assertion 5c32eef47bd763fece7d21c7cbf6c659
+
+    ALTER TABLE TEST_DB.PUBLIC.TEST_ASSERTIONS_ALL_TIMES SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';
+    ALTER TABLE TEST_DB.PUBLIC.TEST_ASSERTIONS_ALL_TIMES ADD DATA METRIC FUNCTION test_db.datahub_dmfs.datahub__5c32eef47bd763fece7d21c7cbf6c659 ON (col_date);
+
+-- End of Assertion 5c32eef47bd763fece7d21c7cbf6c659
+....
+```
+
+
+### Step 4. Register the compiled DMFs in your Snowflake environment
+
+Next, you'll need to run the generated SQL from the files output in Step 3 in Snowflake.
+
+You can achieve this either by running the SQL files directly in the Snowflake UI, or by using the SnowSQL CLI tool:
+
+```bash
+snowsql -f dmf_definitions.sql
+snowsql -f dmf_associations.sql
+```
+
+:::note
+Scheduling a Data Metric Function on a table incurs Serverless Credit Usage in Snowflake. Refer to [Billing and Pricing](https://docs.snowflake.com/en/user-guide/data-quality-intro#billing-and-pricing) for more details.
+Please ensure you DROP the Data Metric Functions created via `dmf_associations.sql` if the assertions are no longer in use.
+:::
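+
+Optionally, you can spot-check in Snowflake that the DMFs were attached and are producing results before moving on. This is an illustrative sketch based on the Snowflake data quality documentation; the exact view and column names may differ in your account, so consult the Snowflake docs if it does not match:
+
+```sql
+-- List the DMFs currently associated with the target table
+SELECT *
+FROM TABLE(INFORMATION_SCHEMA.DATA_METRIC_FUNCTION_REFERENCES(
+  REF_ENTITY_NAME => 'TEST_DB.PUBLIC.TEST_ASSERTIONS_ALL_TIMES',
+  REF_ENTITY_DOMAIN => 'TABLE'
+));
+
+-- Inspect recent DMF evaluation results (requires the DATA_QUALITY_MONITORING_VIEWER role)
+SELECT *
+FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
+LIMIT 50;
+```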
+
+### Step 5. Run ingestion to report the results back into DataHub
+
+Once you've registered the DMFs, they will be automatically executed, either when the target table is updated or on a fixed
+schedule.
+
+To report the results of the generated Data Quality assertions back into DataHub, you'll need to run the DataHub ingestion process with a special configuration
+flag: `include_assertion_results: true`:
+
+```yaml
+# Your DataHub Snowflake Recipe
+source:
+  type: snowflake
+  config:
+    # ...
+    include_assertion_results: True
+    # ...
+```
+
+During ingestion, whether it is run via the CLI or the UI, we will query for the latest DMF results stored in Snowflake, convert them into DataHub Assertion Results,
+and report them back into DataHub, where they are visible as normal assertions.
+
+`datahub ingest -c snowflake.yml`
+
+## Caveats
+
+- Currently, Snowflake supports at most 1000 DMF-table associations, so you cannot define more than 1000 Snowflake assertions.
+- Currently, Snowflake does not allow JOIN queries or non-deterministic functions in DMF definitions, so you cannot use these in the SQL of a Custom SQL assertion or in the `filters` section.
+- Currently, all DMFs scheduled on a table must follow the exact same schedule, so you cannot set assertions on the same table to run on different schedules.
+- Currently, DMFs are only supported for regular tables, not dynamic or external tables.
+
+## FAQ
+
+Coming soon!
\ No newline at end of file
diff --git a/metadata-service/configuration/src/main/resources/application.yaml b/metadata-service/configuration/src/main/resources/application.yaml
index 4d188bd5c6183..9125bb046d7c8 100644
--- a/metadata-service/configuration/src/main/resources/application.yaml
+++ b/metadata-service/configuration/src/main/resources/application.yaml
@@ -485,4 +485,4 @@ metadataChangeProposal:
       maxAttempts: ${MCP_TIMESERIES_MAX_ATTEMPTS:1000}
       initialIntervalMs: ${MCP_TIMESERIES_INITIAL_INTERVAL_MS:100}
       multiplier: ${MCP_TIMESERIES_MULTIPLIER:10}
-      maxIntervalMs: ${MCP_TIMESERIES_MAX_INTERVAL_MS:30000}
\ No newline at end of file
+      maxIntervalMs: ${MCP_TIMESERIES_MAX_INTERVAL_MS:30000}