This idea was brought up in discussions #4 "Data quality issues should be easy to validate/verify" and #7 "Have a means to validate data to identify correlations / mistakes in the data".
Motivation:
Data issues that should, in principle, be easy to detect occur all the time. For example, Google's data panel for COVID-19 deaths (which was in turn sourced from Wikipedia) was off by a factor of 10 for Australia, and the incorrect figure even found its way into some news articles. It should have been obvious that something was wrong from the sudden jump, and from the fact that the number of deaths at the country level did not add up to the sum of deaths in the states and territories.
We think these issues are common because every company that uses a dataset has to reimplement its own checks, which few have time for. What we need is a portable format for data integrity/invariant checks, so that sharing validation checks is as easy as sharing the data itself. For example, if one system implements a check that the number of cases in a country should equal the sum of the cases in its states/territories, there needs to be a portable way to share that check with other systems, as sketched below.
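As a rough sketch of what "portable" could mean, the check might be expressed as a small declarative rule that travels with the data and is interpreted by whatever system loads it. Everything in the example below is hypothetical (the rule name, column names, and runner are assumptions, not part of any existing standard); it uses pandas only for brevity.

```python
# Minimal sketch: a shareable, declarative integrity check plus one possible runner.
# Assumes a hypothetical tabular dataset with "country", "region", and "deaths" columns.
import pandas as pd

# The check itself is plain data, so it could be shipped as JSON/YAML alongside the dataset.
check = {
    "name": "country_total_equals_sum_of_regions",
    "group_by": "country",
    "total_column": "deaths",
}

def run_check(country_totals: pd.DataFrame,
              region_totals: pd.DataFrame,
              check: dict) -> pd.DataFrame:
    """Return the rows where the country-level total disagrees with the sum of its regions."""
    key = check["group_by"]
    col = check["total_column"]
    summed = region_totals.groupby(key, as_index=False)[col].sum()
    merged = country_totals.merge(summed, on=key, suffixes=("_reported", "_from_regions"))
    return merged[merged[f"{col}_reported"] != merged[f"{col}_from_regions"]]

# Made-up numbers: the inflated country-level figure is flagged because it does not
# equal the sum of the regional figures.
countries = pd.DataFrame({"country": ["AU"], "deaths": [9090]})
regions = pd.DataFrame({"country": ["AU", "AU"],
                        "region": ["NSW", "VIC"],
                        "deaths": [540, 369]})
print(run_check(countries, regions, check))
```

The point of the sketch is the split between the declarative rule (shareable) and the runner (system-specific); the hard research question is agreeing on the former.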
Specific problem:
While standards for representing data integrity checks already exist (e.g. the SQL CHECK constraint), we need to better understand the practical barriers to reusing such checks and to propose solutions. If successful, this research could have widespread practical impact by improving data quality and preventing misinformation.