This idea was brought up in discussions #4 "Data quality issues should be easy to validate/verify" and #7 "Have a means to validate data to identify correlations / mistakes in the data".
Motivation:
Data issues that should, in principle, be easy to detect occur all the time. For example, Google's data panel for COVID-19 deaths (which was in turn sourced from Wikipedia) was off by a factor of 10 for Australia, and the incorrect figure even found its way into some news articles. It should have been obvious that something was wrong from the sudden jump, and from the fact that the number of deaths at the country level did not add up to the sum of deaths in the states and territories.
We think these issues are common because every company that uses a dataset has to reimplement its own checks, which few have time for. What we need is a portable format for data integrity/invariant checks, so that sharing validation checks is as easy as sharing the data itself. For example, if one system implements a check that the number of cases in a country should equal the sum of the cases in its states/territories, there needs to be a portable way to share that check with other systems, as sketched below.
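As a rough sketch of what "portable" could mean, the check might be expressed as a small declarative rule that travels with the data and is interpreted by whatever system loads it. Everything in the example below is hypothetical (the rule name, column names, and runner are assumptions, not part of any existing standard); it uses pandas only for brevity.

```python
# Minimal sketch: a shareable, declarative integrity check plus one possible runner.
# Assumes a hypothetical tabular dataset with "country", "region", and "deaths" columns.
import pandas as pd

# The check itself is plain data, so it could be shipped as JSON/YAML alongside the dataset.
check = {
    "name": "country_total_equals_sum_of_regions",
    "group_by": "country",
    "total_column": "deaths",
}

def run_check(country_totals: pd.DataFrame,
              region_totals: pd.DataFrame,
              check: dict) -> pd.DataFrame:
    """Return the rows where the country-level total disagrees with the sum of its regions."""
    key = check["group_by"]
    col = check["total_column"]
    summed = region_totals.groupby(key, as_index=False)[col].sum()
    merged = country_totals.merge(summed, on=key, suffixes=("_reported", "_from_regions"))
    return merged[merged[f"{col}_reported"] != merged[f"{col}_from_regions"]]

# Made-up numbers: the inflated country-level figure is flagged because it does not
# equal the sum of the regional figures.
countries = pd.DataFrame({"country": ["AU"], "deaths": [9090]})
regions = pd.DataFrame({"country": ["AU", "AU"],
                        "region": ["NSW", "VIC"],
                        "deaths": [540, 369]})
print(run_check(countries, regions, check))
```

The point of the sketch is the split between the declarative rule (shareable) and the runner (system-specific); the hard research question is agreeing on the former.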
Specific problem:
While standards for representing data integrity checks already exist (e.g. the SQL CHECK constraint), we need to better understand the practical barriers to reusing such checks and to propose solutions. If successful, this research could have widespread practical impact by improving data quality and preventing misinformation.