Background
#579 introduces email notifications for failing tests, with a system to detect the state of test failures and notify stakeholders only for newly failing rows. This system works, but it introduces complexity that might make it difficult to scale, namely:
Configuring a new test for notifications requires adding a new test-specific seed representing the universe of known failures
Any generics that are used by tests that are configured to be part of the system must be updated to accept an anti_join argument with a complicated schema
If other teams fix rows that are part of a test's known universe of failures, a Data team member will have to manually intervene to update its seed, or else the system will fail to alert if those rows ever start to fail again
For tests that are part of the notification system, the failure data saved in qc.test_run_failing_row will no longer record rows that are part of the known universe of failing rows, so our analytics dashboards will not capture any work that we do to reduce the known universe of failing rows
Adding a new SNS topic requires updating the big chunk of YAML that we pass to the --vars flag when running run_iasworld_data_tests in the test-dbt-models workflow
If the system is successful and the Data team ends up being responsible for configuring lots of tests to use it, these issues will make it much harder to maintain the system over time.
Proposed refactor
Here is a sketch of a new iteration of the notification system that will be easier to scale:
Instead of a seed for every test, the known universe of test failures should be stored in one Parquet dataset that lives remotely on S3. This dataset should have a name like qc.test_known_failure and should be defined as a source in the dbt DAG.
The dataset should be partitioned by tablename, test name, and snapshot date (or version), and each individual Parquet file should record the set of primary keys representing the universe of known failures for each test at a moment in time.
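As a rough illustration of that layout, here is what writing one snapshot partition might look like with pyarrow. The bucket path, the primary_key column name, and the hive-style partition flavor are assumptions for illustration, not decisions:

```python
# Sketch: write one snapshot of known failures for a single test to a
# partitioned qc.test_known_failure dataset on S3. The bucket path, the
# primary_key column, and the hive-style partition flavor are all
# hypothetical details for illustration.
import datetime

import pyarrow as pa
import pyarrow.dataset as ds


def write_known_failure_snapshot(
    tablename: str, test_name: str, primary_keys: list[str]
) -> None:
    snapshot_date = datetime.date.today().isoformat()
    table = pa.table(
        {
            "tablename": [tablename] * len(primary_keys),
            "test_name": [test_name] * len(primary_keys),
            "snapshot_date": [snapshot_date] * len(primary_keys),
            "primary_key": primary_keys,
        }
    )
    ds.write_dataset(
        table,
        "s3://example-warehouse-bucket/qc/test_known_failure",  # hypothetical path
        format="parquet",
        partitioning=["tablename", "test_name", "snapshot_date"],
        partitioning_flavor="hive",
        existing_data_behavior="overwrite_or_ignore",
    )
```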
A script like dbt/scripts/update_test_known_failures.py should expose an interface for managing this state file (see the CLI sketch after this list), including:
Managing different state files depending on the dbt environment (prod or CI)
Copying prod state files to CI environments
Adding a new snapshot of data for a specific test on the current date
Excluding specific rows from a snapshot by primary key
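One possible shape for that interface, sketched with argparse below; every subcommand and flag name here is hypothetical:

```python
# Hypothetical CLI skeleton for dbt/scripts/update_test_known_failures.py.
# Subcommand and flag names are assumptions about what the interface above
# could look like, not a finished design.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Manage qc.test_known_failure snapshots"
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    snapshot = subparsers.add_parser(
        "snapshot", help="Add a snapshot for a test on the current date"
    )
    snapshot.add_argument("--test-name", required=True)
    snapshot.add_argument(
        "--target",
        choices=["prod", "ci"],
        default="prod",
        help="dbt environment whose state files to manage",
    )
    snapshot.add_argument(
        "--exclude-pk",
        action="append",
        default=[],
        help="Exclude specific rows from the snapshot by primary key",
    )

    copy = subparsers.add_parser(
        "copy-to-ci", help="Copy prod state files to a CI environment"
    )
    copy.add_argument("--ci-prefix", required=True, help="S3 prefix for the CI copy")

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
```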
The dbt/scripts/run_iasworld_data_tests.py script should check whether each failing test has a universe of known failures in qc.test_known_failure, and use the most recent snapshot to exclude known failures from notifications.
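A minimal sketch of that exclusion step, assuming the test's snapshots have already been read from qc.test_known_failure into sets of primary keys keyed by snapshot date (reading from S3/Athena is omitted, and the names are hypothetical):

```python
# Sketch of the exclusion step in run_iasworld_data_tests.py, assuming the
# test's snapshots have already been loaded into memory. Names and row
# shapes are hypothetical.
def exclude_known_failures(
    failing_pks: set[str],
    snapshots: dict[str, set[str]],  # snapshot_date -> known failing primary keys
) -> set[str]:
    """Return only the primary keys that are newly failing."""
    if not snapshots:
        # No universe of known failures for this test: notify on everything
        return failing_pks
    latest = max(snapshots)  # ISO-formatted dates sort lexicographically
    return failing_pks - snapshots[latest]
```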
The dbt/scripts/run_iasworld_data_tests.py script should check whether the state of each failing test has changed since the last snapshot, and should determine whether to update the snapshot using the same logic as the update_test_known_failures script. In particular, the script needs to handle the cases in the following matrix (a sketch of the decision logic follows it):
|                       | No new failures     | New failures                             |
| --------------------- | ------------------- | ---------------------------------------- |
| No old failures fixed | Do nothing          | Do nothing                               |
| Old failures fixed    | Create new snapshot | Create new snapshot without new failures |
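Expressed as code, the matrix reduces to a small set-difference function. This is a sketch under the assumption that both the current failures and the latest snapshot are available as primary-key sets; the function name is hypothetical:

```python
# Sketch of the snapshot matrix as a set-difference function. The name is
# hypothetical; the real logic would be shared with
# update_test_known_failures.py.
from typing import Optional


def plan_snapshot_update(
    failing_pks: set[str], known_pks: set[str]
) -> Optional[set[str]]:
    """Return the primary keys for a new snapshot, or None to do nothing."""
    fixed_failures = known_pks - failing_pks
    if not fixed_failures:
        # Top row of the matrix: with or without new failures, do nothing
        return None
    # Bottom row: snapshot only the known rows that are still failing, which
    # drops fixed rows and keeps new failures out of the known universe
    return known_pks & failing_pks
```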
The test-dbt-models workflow should automatically search for SNS topic overrides based on variable names when running the run_iasworld_data_tests script (see here for original suggestion)
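Since the original suggestion lives behind the link above, the sketch below is only one guess at what "search for SNS topic overrides based on variable names" could mean: scanning the workflow's environment for a conventional prefix. The SNS_TOPIC_ prefix and the name-to-topic mapping are assumptions, not an agreed convention:

```python
# Sketch of SNS topic override discovery, assuming overrides are exposed to
# the workflow as environment variables with a conventional prefix. The
# SNS_TOPIC_ prefix is an assumption for illustration.
import os


def discover_sns_topic_overrides(prefix: str = "SNS_TOPIC_") -> dict[str, str]:
    """Map variable names (minus the prefix, lowercased) to SNS topic ARNs."""
    return {
        name.removeprefix(prefix).lower(): value
        for name, value in os.environ.items()
        if name.startswith(prefix)
    }
```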
Open questions
Should rows in qc.test_known_failure be unique by snapshot date, or snapshot version?
Snapshot date would be more intuitive for human readers, but precludes the possibility of multiple snapshots on a single date
If we go with version, we should probably preserve the date as well, to make the versions more understandable
Should the run_iasworld_data_tests script compare failing tests to the most recent snapshot in qc.test_known_failure, or should tests have to be configured with the snapshot date/version that corresponds to their universe of known failures?
Using the most recent snapshot will simplify the design, but reduce configurability
Next steps
Since this refactor will require substantial engineering time, we should only undertake it if and when the notification system demonstrates traction with stakeholders. Until then, we should focus on validating the usefulness of the system as it exists today.
I did a little bit of research today to determine if we could leverage dbt snapshots for the snapshotting system; my determination is that we currently can't, because they can't handle the "Old failures fixed / new failures" condition in the snapshot matrix listed above.
More specifically, we could theoretically write a check snapshot for each test, using the test's generic macro to select failures. This would share one problem with the seed-based solution, namely the requirement to create a config file (in this case a snapshot rather than a seed) for each test, but it would fix the seed's problem of requiring manual management every time the universe of known failures changes. Then, to update the snapshots, we could just run dbt snapshot with a --select clause selecting any tests that need an updated snapshot. However, this solution can't produce the behavior "Create new snapshot without new failures," meaning we can't handle the "Old failures fixed / New failures" condition in the snapshot matrix.
It's possible that the updates to snapshot configs in dbt Core 1.9 and 1.10 will make this easier, but I doubt it. Still, I'll be keeping an eye out in case that changes.