data-diff: compare datasets fast, within or across SQL databases

How it works

When comparing the data, data-diff utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:

joindiff

Recommended for comparing data within the same database
Uses the outer join operation to diff the rows as efficiently as possible within the same database
Fully relies on the underlying database engine for computation
Requires both datasets to be queryable with a single SQL query
Time complexity approximates that of a JOIN operation and is largely independent of the number of differences between the datasets
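
The outer-join idea behind joindiff can be illustrated with a small, self-contained sketch. This is not data-diff's actual query: an in-memory SQLite database stands in for "the same database", the table and column names (orders_prod, orders_dev, id, amount) are hypothetical, and the full outer join is emulated with two LEFT JOINs so the example also runs on older SQLite versions.

```python
import sqlite3

# In-memory SQLite stands in for a single shared database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_prod (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE orders_dev  (id INTEGER PRIMARY KEY, amount INTEGER);
    INSERT INTO orders_prod VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO orders_dev  VALUES (1, 10), (2, 25), (4, 40);
""")

# Emulated FULL OUTER JOIN on the key column:
#   part 1 finds rows missing from orders_dev and rows whose values differ,
#   part 2 finds rows missing from orders_prod.
# "IS NOT" is SQLite's NULL-safe inequality operator.
DIFF_SQL = """
SELECT a.id AS id, a.amount AS amount_a, b.amount AS amount_b
FROM orders_prod a LEFT JOIN orders_dev b ON a.id = b.id
WHERE b.id IS NULL OR a.amount IS NOT b.amount
UNION ALL
SELECT b.id, NULL, b.amount
FROM orders_dev b LEFT JOIN orders_prod a ON a.id = b.id
WHERE a.id IS NULL
ORDER BY id
"""

# The database does all the work; only the differing rows come back.
diff_rows = conn.execute(DIFF_SQL).fetchall()
for row in diff_rows:
    print(row)
```

Because the whole comparison is a single query, the cost is dominated by the join itself rather than by how many rows differ.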

hashdiff

Recommended for comparing datasets across different databases
Can also be helpful in diffing very large tables with few expected differences within the same database
Employs a divide-and-conquer algorithm based on hashing and binary search
Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
Time complexity approximates that of a COUNT(*) operation when there are few differences
Performance degrades when datasets have a large number of differences
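
The divide-and-conquer idea behind hashdiff can be sketched as follows. This is a simplified illustration, not data-diff's implementation: Python dicts keyed by an integer primary key stand in for the two tables, the checksum is computed in Python rather than inside each database, each segment is split into just two halves, and the function names and the threshold parameter are hypothetical.

```python
import hashlib

def checksum(rows, lo, hi):
    """Return (row count, digest) for all rows whose key k satisfies lo <= k < hi."""
    digest = hashlib.md5()
    count = 0
    for key in sorted(k for k in rows if lo <= k < hi):
        digest.update(repr((key, rows[key])).encode())
        count += 1
    return count, digest.hexdigest()

def hashdiff(a, b, lo, hi, threshold=4):
    """Yield (sign, key, value) for rows that differ in the key range [lo, hi)."""
    count_a, sum_a = checksum(a, lo, hi)
    count_b, sum_b = checksum(b, lo, hi)
    if (count_a, sum_a) == (count_b, sum_b):
        return  # segment is identical on both sides: nothing to compare row by row
    if max(count_a, count_b) <= threshold or hi - lo <= 1:
        # Small segment: compare the actual rows.
        for key in sorted(k for k in a.keys() | b.keys() if lo <= k < hi):
            if key in a and key in b and a[key] == b[key]:
                continue
            if key in a:
                yield ("-", key, a[key])
            if key in b:
                yield ("+", key, b[key])
        return
    # Checksums differ on a large segment: split it and recurse into each half.
    mid = (lo + hi) // 2
    yield from hashdiff(a, b, lo, mid, threshold)
    yield from hashdiff(a, b, mid, hi, threshold)

# Two hypothetical 100-row tables that differ in three places.
table_a = {i: f"row-{i}" for i in range(1, 101)}
table_b = dict(table_a)
table_b[42] = "changed"
del table_b[77]
table_b[101] = "new"

differences = list(hashdiff(table_a, table_b, 1, 102))
for change in differences:
    print(change)
```

With few differences, most segments are dismissed after a single checksum comparison, which is why the cost stays close to a scan of each table. With many differences, nearly every segment has to be split down to individual rows, which matches the degradation noted above.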

More information about the algorithm and performance considerations can be found in the documentation.

Get started

Install data-diff with specific database adapters, e.g.:

pip install data-diff 'data-diff[postgresql,snowflake]' -U

Run data-diff with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:

data-diff \
  postgresql://<username>:'<password>'@localhost:5432/<database> \
  <table> \
  "snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
  <TABLE> \
  -k <primary key column> \
  -c <columns to compare> \
  -w <filter condition>
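
When the command is launched from a script, assembling it as an argument list avoids shell-quoting problems with the characters that appear in connection URIs (&, ?, quotes). The sketch below uses only the flags shown above. The helper name and the example values are hypothetical, and passing -c once per column is an assumption to check against the command reference.

```python
import shlex

def build_data_diff_command(uri1, table1, uri2, table2, key, columns=(), where=None):
    """Assemble a data-diff invocation as an argument list (hypothetical helper)."""
    cmd = ["data-diff", uri1, table1, uri2, table2, "-k", key]
    for column in columns:
        cmd += ["-c", column]  # assumption: -c is repeated once per column
    if where:
        cmd += ["-w", where]
    return cmd

# Placeholder credentials and object names, for illustration only.
cmd = build_data_diff_command(
    "postgresql://user:secret@localhost:5432/shop",
    "orders",
    "snowflake://user:secret@acct/SHOP/PUBLIC?warehouse=WH&role=ANALYST",
    "ORDERS",
    key="id",
    columns=["amount", "status"],
    where="created_at < '2024-01-01'",
)

# shlex.join produces a correctly quoted command line for logging or copy-paste.
print(shlex.join(cmd))
```

The list can be passed directly to subprocess.run(cmd, check=True), which bypasses the shell so no further quoting is needed.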

Check out the documentation for the full command reference.

Use cases

Data Migration & Replication Testing

Compare source to target and check for discrepancies when moving data between systems:

Migrating to a new data warehouse (e.g., Oracle > Snowflake)
Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)

Data Development Testing

Test SQL code and preview changes by comparing development/staging environment data to production:

Make a change to some SQL code
Run the SQL code to create a new dataset
Compare the dataset with its production version or another iteration

data-diff integrates with dbt Core and dbt Cloud to seamlessly compare local development to production datasets.

👀 Watch 4-min demo video

Get started with data-diff & dbt

Also available in a VS Code Extension

Reach out on the dbt Slack in #tools-datafold for advice and support

Supported databases

Database	Status	Connection string
PostgreSQL >=10	💚	`postgresql://<user>:<password>@<host>:5432/<database>`
MySQL	💚	`mysql://<user>:<password>@<hostname>:3306/<database>`
Snowflake	💚	`"snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"`
BigQuery	💚	`bigquery://<project>/<dataset>`
Redshift	💚	`redshift://<username>:<password>@<hostname>:5439/<database>`
Oracle	💛	`oracle://<username>:<password>@<hostname>/<database>`
Presto	💛	`presto://<username>:<password>@<hostname>:8080/<database>`
Databricks	💛	`databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>`
Trino	💛	`trino://<username>:<password>@<hostname>:8080/<database>`
Clickhouse	💛	`clickhouse://<username>:<password>@<hostname>:9000/<database>`
Vertica	💛	`vertica://<username>:<password>@<hostname>:5433/<database>`
DuckDB	💛
ElasticSearch	📝
Planetscale	📝
Pinot	📝
Druid	📝
Kafka	📝
SQLite	📝

💚: Implemented and thoroughly tested.
💛: Implemented, but not thoroughly tested yet.
⏳: Implementation in progress.
📝: Implementation planned. Contributions welcome.

Your database not listed here?

Contribute a new database adapter – we accept pull requests!
Get in touch about enterprise support and adding new adapters and features

Contributors

We thank everyone who contributed so far!

Analytics

Usage Analytics & Data Privacy

License

This project is licensed under the terms of the MIT License.

