"Self Service" Data Preparation

Overview

A widely cited estimate is that data analysts and data scientists spend a large fraction of their time (up to 80%) preparing and cleaning data.

At Microsoft Research, we are looking at ways to automate common data preparation tasks. The goal is to empower enterprise workers and less-technical end users (e.g., in Excel and Power BI) to solve their data preparation challenges and improve their productivity.

Technologies developed in this project have shipped as features in Microsoft products, such as Power Query (natively integrated in Excel under the “Data” tab and also available in Power BI) and Azure Machine Learning Data Prep.

List of benchmark data sets used in published work

From time to time we receive requests from researchers for the benchmark data sets used in our projects. We have compiled the list below on GitHub to facilitate future research.

TEGRA: Automatic table segmentation. [data]
SEMA-Join: Join tables using semantic correlation. [data]
Auto-Join: Join tables using learnt-programs. [data]
Mappings: Synthesize mappings using tables. [data]
Auto-Detect: Automatic detection of errors in tables. [data]
Auto-Type: Type-detection for semantic data types. [data]
TDE: Transform-Data-by-Example. [data]
Auto-EM: Pre-trained entity-matching models. [code]
Auto-Suggest: Recommend contextualized data-prep steps/operations. [data]
Auto-Fuzzy-Join: Fuzzy similarity-joins without labeled examples. [data]
Auto-Pipeline: Synthesize data pipelines by-target. [data]
Auto-Validate: Unsupervised data validation using patterns inferred from data lakes. [data]
