It is widely cited that data analysts and data scientists spend a large fraction (up to 80%) of their time preparing and cleaning data.
At Microsoft Research, we are looking at ways to automate common data preparation tasks. The goal is to empower enterprise workers as well as less-technical end-users (e.g., in Excel and Power BI) to solve their data preparation challenges and improve their productivity.
Technologies developed in this project have shipped as features in Microsoft products, such as Power Query (natively integrated in Excel under the “Data” tab and also available in Power BI) and Azure Machine Learning Data Prep.
From time to time, we receive requests from researchers for the benchmark data sets used in our projects. We have compiled a list here on GitHub to facilitate future research.
- SEMA-Join: Join tables using semantic correlation. [data]
- Auto-Join: Join tables using learnt-programs. [data]
- Auto-Detect: Automatic detection of errors in tables. [data]
- Auto-Type: Type-detection for semantic data types. [data]
- Auto-Suggest: Recommend contextualized data-prep steps/operations. [data]
- Auto-Fuzzy-Join: Fuzzy similarity-joins without labeled examples. [data]
- Auto-Pipeline: Synthesize data pipelines by-target. [data]
- Auto-Validate: Unsupervised data validation using patterns inferred from data lakes. [data]