This repository has been archived by the owner on May 17, 2024. It is now read-only.

Stop data-diff when maximum time or # different records is exceeded #402

Closed
JCZuurmond opened this issue Feb 20, 2023 · 7 comments
Labels
--dbt Issues/features related to the dbt integration enhancement New feature or request non-dbt Use cases outside of dbt stale_immune Immunity to stale bot

Comments

@JCZuurmond
Contributor

JCZuurmond commented Feb 20, 2023

Is your feature request related to a problem? Please describe.

We run data-diff for many tables. Sometimes the two tables being diffed differ in a large number of records, and the diff for that table pair can then take a very long time (multiple hours). I would prefer to abort such a diff at a certain point, e.g., when a maximum diff time or a maximum number of differing records is exceeded. For such a diff, I do not care precisely which records differ; it is enough to know that the table is very far off.

Describe the solution you'd like

Define a:

  • maximum diff time
  • OR, a maximum # different records
  • OR, a maximum % different records

If this threshold is exceeded, the diff is aborted, with a WARNING or ERROR message, and maybe an Exception.
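
For illustration, the three thresholds above could be sketched as a wrapper around any iterable of diff rows. This is a minimal sketch and not part of data-diff: `limit_diff`, `DiffLimitExceeded`, and every parameter name are hypothetical, the rows are treated as opaque values, and the percentage check assumes the caller already knows the total row count.

```python
import time
from typing import Any, Callable, Iterable, Iterator, Optional


class DiffLimitExceeded(Exception):
    """Raised when a diff passes a configured threshold (hypothetical name)."""


def limit_diff(
    diff_rows: Iterable[Any],
    max_seconds: Optional[float] = None,
    max_rows: Optional[int] = None,
    max_fraction: Optional[float] = None,
    total_rows: Optional[int] = None,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[Any]:
    """Yield rows from diff_rows until one of the thresholds is exceeded."""
    if max_fraction is not None and not total_rows:
        raise ValueError("max_fraction requires a positive total_rows")
    started = clock()
    seen = 0
    for row in diff_rows:
        seen += 1
        # Maximum number of differing records.
        if max_rows is not None and seen > max_rows:
            raise DiffLimitExceeded(f"more than {max_rows} differing rows")
        # Maximum share of differing records, relative to a known total.
        if max_fraction is not None and seen / total_rows > max_fraction:
            raise DiffLimitExceeded(f"more than {max_fraction:.0%} of rows differ")
        # Maximum diff time; only checked when a new row arrives.
        if max_seconds is not None and clock() - started > max_seconds:
            raise DiffLimitExceeded(f"diff ran longer than {max_seconds} seconds")
        yield row
```

A caller would wrap the diff iterable in `limit_diff(...)` and catch `DiffLimitExceeded` to log a WARNING or ERROR. Two limitations of this sketch: the time limit cannot fire while the underlying iterable is blocked waiting for the next row, and the wrapper only stops consuming rows, so it does not by itself stop any work running in the background.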

Describe alternatives you've considered

I run data-diff programmatically and built this feature myself in the Python script that calls data-diff. This did not work as I hoped because data-diff uses a ThreadPool that continued with the diff after I broke out of the diff_tables iterable.
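
The behaviour described above can be reproduced with the standard library alone. The sketch below does not use data-diff or reflect its internals; `diff_segment` and `results` are hypothetical stand-ins. It only shows the general pattern: when a generator submits work to a `ThreadPoolExecutor` up front and yields results lazily, breaking out of the loop does not cancel the tasks that are already queued.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def demo_break_does_not_stop_workers(n_segments=5):
    """Return (results consumed by the caller, segments the pool processed)."""
    processed = []
    lock = threading.Lock()

    def diff_segment(i):
        # Stand-in for one unit of background diff work (hypothetical).
        with lock:
            processed.append(i)
        return i

    def results():
        # All work is scheduled immediately; results are yielded one by one.
        with ThreadPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(diff_segment, i) for i in range(n_segments)]
            for future in futures:
                yield future.result()

    consumed = 0
    gen = results()
    for _ in gen:
        consumed += 1
        break  # the caller gives up after the first result
    # Closing the generator leaves the `with` block, which calls
    # shutdown(wait=True): the pool finishes every queued task first.
    gen.close()
    return consumed, len(processed)
```

Here the caller consumes one result, yet all five segments are still processed. In code that owns the pool, `shutdown(wait=False, cancel_futures=True)` (Python 3.9+) drops tasks that have not started; whether data-diff offers an equivalent hook is not stated in this thread.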

Additional context

@JCZuurmond JCZuurmond added the enhancement New feature or request label Feb 20, 2023
@github-actions
Contributor

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues/PRs that have gone stale label May 10, 2023
@JCZuurmond
Contributor Author

I am still interested in this issue

@dlawin dlawin removed the stale Issues/PRs that have gone stale label May 10, 2023
@dlawin
Contributor

dlawin commented May 10, 2023

Agreed this would be very useful.

We added some automation here yesterday to help us wrangle the open issues list. Thanks for bearing with us.

@dlawin dlawin added --dbt Issues/features related to the dbt integration non-dbt Use cases outside of dbt labels May 10, 2023
@github-actions
Contributor

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues/PRs that have gone stale label Jul 10, 2023
@JCZuurmond
Contributor Author

☝️

@github-actions github-actions bot added triage and removed stale Issues/PRs that have gone stale labels Jul 10, 2023
@dlawin dlawin added stale_immune Immunity to stale bot and removed triage labels Jul 10, 2023
@dungkhuc

This would be very useful. As it is, data-diff just stays stuck for many hours in our automation script when there is a large number of changes.

@glebmezh
Contributor

@JCZuurmond @dungkhuc ,

I'm sorry for the delay in following up on this. Thank you for taking the time to raise this issue!

We made a hard decision to sunset the data-diff package and won't provide further development or support.

If that's of interest, over the past few months, we have rewritten the diffing engine in Datafold Cloud and solved many issues that existed in this package. In particular, we implemented sampling, per-column diff limits, and real-time result updates to help with the problem you are describing.

-Gleb

4 participants