Background
#579 introduces email notifications for failing tests, with a system to detect the state of test failures and notify stakeholders only for newly failing rows. This system works, but it introduces complexity that might make it difficult to scale, namely:
Configuring a new test for notifications requires adding a new test-specific seed representing the universe of known failures
Any generics that are used by tests that are configured to be part of the system must be updated to accept an anti_join argument with a complicated schema
If other teams fix rows that are part of a test's known universe of failures, a Data team member will have to manually intervene to update its seed, or else the system will fail to alert if those rows ever start to fail again
For tests that are part of the notification system, the failure data saved in qc.test_run_failing_row will no longer record rows that are part of the known universe of failing rows, so our analytics dashboards will not capture any work that we do to reduce the known universe of failing rows
Adding a new SNS topic requires updating the big chunk of YAML that we pass to the --vars flag when running run_iasworld_data_tests in the test-dbt-models workflow
If the system is successful and the Data team ends up being responsible for configuring lots of tests to use it, these issues will make it much harder to maintain the system over time.
Proposed refactor
Here is a sketch of a new iteration of the notification system that will be easier to scale:
Instead of a seed for every test, the known universe of test failures should be stored in one Parquet dataset that lives remotely on S3. This dataset should have a name like qc.test_known_failure and should be defined as a source in the dbt DAG.
The dataset should be partitioned by tablename, test name, and snapshot date (or version), and each individual Parquet file should record the set of primary keys representing the universe of known failures for each test at a moment in time.
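As a rough illustration of that layout, here is what writing one snapshot partition might look like with pyarrow. The bucket path, the primary_key column name, and the hive-style partition flavor are assumptions for illustration, not decisions:

```python
# Sketch: write one snapshot of known failures for a single test to a
# partitioned qc.test_known_failure dataset on S3. The bucket path, the
# primary_key column, and the hive-style partition flavor are all
# hypothetical details for illustration.
import datetime

import pyarrow as pa
import pyarrow.dataset as ds


def write_known_failure_snapshot(
    tablename: str, test_name: str, primary_keys: list[str]
) -> None:
    snapshot_date = datetime.date.today().isoformat()
    table = pa.table(
        {
            "tablename": [tablename] * len(primary_keys),
            "test_name": [test_name] * len(primary_keys),
            "snapshot_date": [snapshot_date] * len(primary_keys),
            "primary_key": primary_keys,
        }
    )
    ds.write_dataset(
        table,
        "s3://example-warehouse-bucket/qc/test_known_failure",  # hypothetical path
        format="parquet",
        partitioning=["tablename", "test_name", "snapshot_date"],
        partitioning_flavor="hive",
        existing_data_behavior="overwrite_or_ignore",
    )
```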
A script like dbt/scripts/update_test_known_failures.py should expose an interface for managing this state file (see the CLI sketch after this list), including:
Managing different state files depending on the dbt environment (prod or CI)
Copying prod state files to CI environments
Adding a new snapshot of data for a specific test on the current date
Excluding specific rows from a snapshot by primary key
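One possible shape for that interface, sketched with argparse below; every subcommand and flag name here is hypothetical:

```python
# Hypothetical CLI skeleton for dbt/scripts/update_test_known_failures.py.
# Subcommand and flag names are assumptions about what the interface above
# could look like, not a finished design.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Manage qc.test_known_failure snapshots"
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    snapshot = subparsers.add_parser(
        "snapshot", help="Add a snapshot for a test on the current date"
    )
    snapshot.add_argument("--test-name", required=True)
    snapshot.add_argument(
        "--target",
        choices=["prod", "ci"],
        default="prod",
        help="dbt environment whose state files to manage",
    )
    snapshot.add_argument(
        "--exclude-pk",
        action="append",
        default=[],
        help="Exclude specific rows from the snapshot by primary key",
    )

    copy = subparsers.add_parser(
        "copy-to-ci", help="Copy prod state files to a CI environment"
    )
    copy.add_argument("--ci-prefix", required=True, help="S3 prefix for the CI copy")

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
```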
The dbt/scripts/run_iasworld_data_tests.py script should check whether each failing test has a universe of known failures in qc.test_known_failure, and use the most recent snapshot to exclude known failures from notifications.
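A minimal sketch of that exclusion step, assuming the test's snapshots have already been read from qc.test_known_failure into sets of primary keys keyed by snapshot date (reading from S3/Athena is omitted, and the names are hypothetical):

```python
# Sketch of the exclusion step in run_iasworld_data_tests.py, assuming the
# test's snapshots have already been loaded into memory. Names and row
# shapes are hypothetical.
def exclude_known_failures(
    failing_pks: set[str],
    snapshots: dict[str, set[str]],  # snapshot_date -> known failing primary keys
) -> set[str]:
    """Return only the primary keys that are newly failing."""
    if not snapshots:
        # No universe of known failures for this test: notify on everything
        return failing_pks
    latest = max(snapshots)  # ISO-formatted dates sort lexicographically
    return failing_pks - snapshots[latest]
```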
The dbt/scripts/run_iasworld_data_tests.py script should check whether the state of each failing test has changed since the last snapshot, and should determine whether to update the snapshot using the same logic as the update_test_known_failures script. In particular, the script needs to handle the cases in the following matrix (a sketch of the decision logic follows it):
|                       | No new failures     | New failures                             |
| --------------------- | ------------------- | ---------------------------------------- |
| No old failures fixed | Do nothing          | Do nothing                               |
| Old failures fixed    | Create new snapshot | Create new snapshot without new failures |
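Expressed as code, the matrix reduces to a small set-difference function. This is a sketch under the assumption that both the current failures and the latest snapshot are available as primary-key sets; the function name is hypothetical:

```python
# Sketch of the snapshot matrix as a set-difference function. The name is
# hypothetical; the real logic would be shared with
# update_test_known_failures.py.
from typing import Optional


def plan_snapshot_update(
    failing_pks: set[str], known_pks: set[str]
) -> Optional[set[str]]:
    """Return the primary keys for a new snapshot, or None to do nothing."""
    fixed_failures = known_pks - failing_pks
    if not fixed_failures:
        # Top row of the matrix: with or without new failures, do nothing
        return None
    # Bottom row: snapshot only the known rows that are still failing, which
    # drops fixed rows and keeps new failures out of the known universe
    return known_pks & failing_pks
```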
The test-dbt-models workflow should automatically search for SNS topic overrides based on variable names when running the run_iasworld_data_tests script (see here for original suggestion)
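Since the original suggestion lives behind the link above, the sketch below is only one guess at what "search for SNS topic overrides based on variable names" could mean: scanning the workflow's environment for a conventional prefix. The SNS_TOPIC_ prefix and the name-to-topic mapping are assumptions, not an agreed convention:

```python
# Sketch of SNS topic override discovery, assuming overrides are exposed to
# the workflow as environment variables with a conventional prefix. The
# SNS_TOPIC_ prefix is an assumption for illustration.
import os


def discover_sns_topic_overrides(prefix: str = "SNS_TOPIC_") -> dict[str, str]:
    """Map variable names (minus the prefix, lowercased) to SNS topic ARNs."""
    return {
        name.removeprefix(prefix).lower(): value
        for name, value in os.environ.items()
        if name.startswith(prefix)
    }
```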
Open questions
Should rows in qc.test_known_failure be unique by snapshot date, or snapshot version?
Snapshot date would be more intuitive for human readers, but precludes the possibility of multiple snapshots on a single date
If we go with version, we should probably preserve the date as well, to make the versions more understandable
Should the run_iasworld_data_tests script compare failing tests to the most recent snapshot in qc.test_known_failure, or should tests have to be configured with the snapshot date/version that corresponds to their universe of known failures?
Using the most recent snapshot will simplify the design, but reduce configurability
Next steps
Since this refactor will require substantial engineering time, we should only undertake it if and when the notification system demonstrates traction with stakeholders. Until then, we should focus on validating the usefulness of the system as it exists today.
I did a little bit of research today to determine if we could leverage dbt snapshots for the snapshotting system; my determination is that we currently can't, because they can't handle the "Old failures fixed / new failures" condition in the snapshot matrix listed above.
More specifically, we could theoretically write a check snapshot for each test, using the test's generic macro to select failures. This would share one problem with the seed-based solution, namely the requirement to create a config file (in this case a snapshot rather than a seed) for each test, but it would fix the seed's problem of requiring manual management every time the universe of known failures changes. Then, to update the snapshots, we could just run dbt snapshot with a --select clause selecting any tests that need an updated snapshot. However, this solution can't produce the behavior "Create new snapshot without new failures," meaning we can't handle the "Old failures fixed / New failures" condition in the snapshot matrix.
It's possible that the updates to snapshot configs in dbt Core 1.9 and 1.10 will make this easier, but I doubt it. Still, I'll be keeping an eye out in case that changes.