DataDome: AI Data Quality Enhancer

Yantra VIT Central Hackathon '25 Finalist and Domain Best Project

What LLMs are to chatbots, DataDome is to datasets

DataDome is an automated, end-to-end, modular solution that takes the data cleaning and pre-processing work for your AI/ML applications off your hands.

It applies tuned algorithms to detect and resolve duplicates, missing values, outliers, and type inconsistencies. It then goes a step further: it scales the data, encodes categorical values, and, if needed, augments sparse or uniform datasets with distribution-aware synthetic samples.
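As a rough illustration of the scaling and encoding steps mentioned above, here is a minimal standard-library sketch. The function names `min_max_scale` and `one_hot_encode` are hypothetical and are not taken from the DataDome codebase; min-max scaling and one-hot encoding are assumed here as representative choices, and the project may well use different techniques or library implementations.

```python
from typing import Dict, List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(values: Sequence[str]) -> Dict[str, List[int]]:
    """Turn one categorical column into one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# Illustrative data only (not from sample_data.csv).
ages = [20.0, 30.0, 40.0]
cities = ["Pune", "Delhi", "Pune"]

scaled_ages = min_max_scale(ages)
encoded_cities = one_hot_encode(cities)
print(scaled_ages)
print(encoded_cities)
```

Scaling keeps features with large numeric ranges from dominating distance-based steps, and one-hot encoding gives categorical columns a numeric form that most models accept.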

The Final Output:

A clean, consistent dataset optimized for high-performance analytics and model training.

Why DataDome?

Compare the efficacy of a dataset cleaned with DataDome against the same dataset cleaned conventionally, using final output metrics.

Key Features

Duplicate Detection & Removal using MD5 hashing
Null Value Imputation using KNN for social-network type validation and diversity
Outlier Detection & Removal using DBSCAN on PCA-reduced data for genuine anomaly detection
Intelligent Type Inference & Correction (e.g., proper datetime parsing, sanitizing numeric strings in categorical columns)
CTGAN Synthetic Data Generation to enhance dataset diversity
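The first three features above can be sketched roughly as follows. This is an illustration under stated assumptions, not DataDome's actual implementation: it assumes `numpy` and `scikit-learn` are installed, the helper name `drop_duplicate_rows` is hypothetical, and the parameter values (`n_neighbors=2`, `eps=0.5`, `min_samples=3`, two PCA components) are chosen only to suit the toy data.

```python
import hashlib

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler


def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, comparing rows by MD5 digest."""
    seen, unique = set(), []
    for row in rows:
        # Join the fields with a separator unlikely to occur in the data, then hash.
        key = "\x1f".join(map(str, row)).encode("utf-8")
        digest = hashlib.md5(key).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


# Toy data: one exact duplicate, one missing value, one obvious outlier.
rows = [
    (1.0, 2.0, 3.0),
    (1.1, 2.1, 3.1),
    (1.0, 2.0, 3.0),      # duplicate of the first row
    (0.9, 1.9, 2.9),
    (1.2, np.nan, 3.2),   # missing middle value
    (1.05, 2.05, 3.05),
    (0.95, 1.95, 2.95),
    (9.0, -7.0, 20.0),    # outlier
]

# 1. Duplicate removal via MD5 hashing.
unique_rows = drop_duplicate_rows(rows)
X = np.array(unique_rows, dtype=float)

# 2. KNN imputation: fill each gap with the mean of the nearest rows' values.
imputed = KNNImputer(n_neighbors=2).fit_transform(X)

# 3. Outlier detection: standardize, project with PCA, then run DBSCAN.
#    DBSCAN labels points that belong to no dense region as -1 (noise).
projected = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(imputed))
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(projected)

cleaned = imputed[labels != -1]
print(f"{len(rows)} rows in, {len(cleaned)} rows out")
```

Hashing each row lets duplicates be found with a set lookup instead of pairwise comparison. Running DBSCAN on the PCA projection means density is judged in a lower-dimensional space; `eps` and `min_samples` would need tuning for any real dataset.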

Impact & SDG Contribution

This project aligns with SDG 9: Industry, Innovation, and Infrastructure by enhancing data quality and infrastructure across sectors such as healthcare, agriculture, and finance. By ensuring reliable, clean datasets, DataDome facilitates smarter decision-making and more impactful AI models.

With today's evolving AI paradigm, this tool ensures that a mere dataset will never be a roadblock to your innovation!

Installation

# Clone the repository
git clone https://github.com/prem-savla/DataDome.git
cd DataDome

# Install dependencies
pip install -r requirements.txt

# Start the application
python run.py

Access the prototype at: http://localhost:5000

Future Directions

Graph Neural Networks (GNNs) to capture complex relationships for more precise cleaning
Domain-Specific Optimizations for industry-specific data structures
Convolutional Neural Networks (CNNs) for image dataset cleaning (in progress)
