When comparing the data, data-diff utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:
joindiff:
- Recommended for comparing data within the same database
- Uses an outer join to diff the rows as efficiently as possible within the same database
- Relies fully on the underlying database engine for computation
- Requires both datasets to be queryable with a single SQL query
- Time complexity approximates that of a JOIN operation and is largely independent of the number of differences in the dataset

hashdiff:
- Recommended for comparing datasets across different databases
- Can also be helpful for diffing very large tables with few expected differences within the same database
- Employs a divide-and-conquer algorithm based on hashing and binary search
- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
- Time complexity approximates that of a COUNT(*) operation when there are few differences
- Performance degrades when datasets have a large number of differences
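
The two modes described above can be illustrated with a toy sketch. This is not data-diff's implementation: the function names, the `id`/`value` schema, the use of SQLite, and the in-memory dictionaries standing in for remote tables are all assumptions made for illustration. The join-based function emulates a full outer join with two left joins, and the hash-based function compares segment checksums and bisects only the segments that differ.

```python
import hashlib
import sqlite3


def join_diff(conn, table_a, table_b):
    """Join-based diff: emulate a FULL OUTER JOIN on `id` with two LEFT JOINs.

    Returns (id, value_in_a, value_in_b) for rows that are missing or different.
    """
    query = f"""
        SELECT a.id, a.value, b.value
        FROM {table_a} AS a LEFT JOIN {table_b} AS b ON a.id = b.id
        WHERE b.id IS NULL OR a.value IS NOT b.value
        UNION ALL
        SELECT b.id, NULL, b.value
        FROM {table_b} AS b LEFT JOIN {table_a} AS a ON a.id = b.id
        WHERE a.id IS NULL
        ORDER BY 1
    """
    return conn.execute(query).fetchall()


def checksum(rows, lo, hi):
    """Hash every (key, value) pair with lo <= key < hi.

    Stands in for an aggregate query that each database would run on its side.
    """
    digest = hashlib.md5()
    for key in sorted(k for k in rows if lo <= k < hi):
        digest.update(f"{key}:{rows[key]}|".encode())
    return digest.hexdigest()


def hash_diff(a, b, lo, hi, threshold=2):
    """Hash-based diff over integer keys in [lo, hi) using divide and conquer."""
    if checksum(a, lo, hi) == checksum(b, lo, hi):
        return []  # identical segment: no rows need to be transferred
    if hi - lo <= threshold:
        # Segment is small enough: compare its rows directly.
        keys = sorted(k for k in set(a) | set(b) if lo <= k < hi)
        return [(k, a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)]
    mid = (lo + hi) // 2  # bisect and recurse into both halves
    return hash_diff(a, b, lo, mid, threshold) + hash_diff(a, b, mid, hi, threshold)
```

In this sketch the join-based version does one pass regardless of how many rows differ, while the hash-based version does more work for each differing segment, which mirrors the trade-off listed above.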
More information about the algorithm and performance considerations is available in the project documentation.
Install data-diff with specific database adapters, e.g.:
pip install data-diff 'data-diff[postgresql,snowflake]' -U
Run data-diff with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:
data-diff \
postgresql://<username>:'<password>'@localhost:5432/<database> \
<table> \
"snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
<TABLE> \
-k <primary key column> \
-c <columns to compare> \
-w <filter condition>
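
When the command is assembled from a script, shell quoting is easy to get wrong, because the Snowflake URI contains `&` and the filter condition usually contains spaces. The following is a minimal sketch of one way to build the argument list in Python; the helper name `build_data_diff_command` is hypothetical, and passing `-c` once per column is an assumption to check against the command reference.

```python
import shlex


def build_data_diff_command(uri1, table1, uri2, table2, key, columns=(), where=None):
    """Build the data-diff argument list using the flags from the example above."""
    args = ["data-diff", uri1, table1, uri2, table2, "-k", key]
    for column in columns:
        # Assumption: one -c flag per extra column to compare.
        args += ["-c", column]
    if where:
        args += ["-w", where]
    return args


command = build_data_diff_command(
    "postgresql://user:pw@localhost:5432/db",
    "orders",
    "snowflake://user:pw@acct/DB/PUBLIC?warehouse=WH&role=R",
    "ORDERS",
    key="id",
    columns=["status", "total"],
    where="id < 100",
)
# shlex.join quotes any argument that the shell would otherwise split or interpret.
print(shlex.join(command))
```

The list form can be passed directly to `subprocess.run(command)`, which bypasses the shell, so no quoting is needed in that case.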
Check out the documentation for the full command reference.
Compare source to target and check for discrepancies when moving data between systems:
- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
Test SQL code and preview changes by comparing development/staging environment data to production:
- Make a change to some SQL code
- Run the SQL code to create a new dataset
- Compare the dataset with its production version or another iteration
data-diff integrates with dbt Core and dbt Cloud to seamlessly compare local development to production datasets.
- 👀 Watch the 4-min demo video
- Get started with data-diff & dbt
- Also available as a VS Code extension
- Reach out on the dbt Slack in #tools-datafold for advice and support
Database | Status | Connection string |
---|---|---|
PostgreSQL >=10 | 💚 | postgresql://<user>:<password>@<host>:5432/<database> |
MySQL | 💚 | mysql://<user>:<password>@<hostname>:3306/<database> |
Snowflake | 💚 | "snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]" |
BigQuery | 💚 | bigquery://<project>/<dataset> |
Redshift | 💚 | redshift://<username>:<password>@<hostname>:5439/<database> |
Oracle | 💛 | oracle://<username>:<password>@<hostname>/<database> |
Presto | 💛 | presto://<username>:<password>@<hostname>:8080/<database> |
Databricks | 💛 | databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema> |
Trino | 💛 | trino://<username>:<password>@<hostname>:8080/<database> |
Clickhouse | 💛 | clickhouse://<username>:<password>@<hostname>:9000/<database> |
Vertica | 💛 | vertica://<username>:<password>@<hostname>:5433/<database> |
DuckDB | 💛 | |
ElasticSearch | 📝 | |
Planetscale | 📝 | |
Pinot | 📝 | |
Druid | 📝 | |
Kafka | 📝 | |
SQLite | 📝 | |
- 💚: Implemented and thoroughly tested.
- 💛: Implemented, but not thoroughly tested yet.
- ⏳: Implementation in progress.
- 📝: Implementation planned. Contributions welcome.
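
The connection strings in the table follow the usual URI layout: driver, credentials, host, port, path, and optional query parameters. As a quick way to sanity-check a string before passing it to the tool, the sketch below splits one into its parts with Python's standard library. The function name `describe_connection_string` is hypothetical, and this is generic URI parsing, not necessarily how data-diff itself interprets the string.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def describe_connection_string(uri):
    """Split a connection URI into its components using generic URI rules."""
    parts = urlsplit(uri)
    return {
        "driver": parts.scheme,
        "user": unquote(parts.username) if parts.username else None,
        "password": unquote(parts.password) if parts.password else None,
        "host": parts.hostname,
        "port": parts.port,
        # Path segments, e.g. database and schema for Snowflake.
        "path": [segment for segment in parts.path.split("/") if segment],
        # Query parameters such as warehouse and role (first value of each).
        "options": {key: values[0] for key, values in parse_qs(parts.query).items()},
    }


info = describe_connection_string(
    "snowflake://user:pw@acct/DB/PUBLIC?warehouse=WH&role=R"
)
print(info["driver"], info["host"], info["path"], info["options"])
```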
Your database not listed here?
- Contribute a new database adapter – we accept pull requests!
- Get in touch about enterprise support and adding new adapters and features
We thank everyone who contributed so far!
This project is licensed under the terms of the MIT License.