This repository has been archived by the owner on May 17, 2024. It is now read-only.

Let the differ choose a shared hashing algorithm #103

Closed
erezsh opened this issue Jun 23, 2022 · 9 comments
Labels
stale Issues/PRs that have gone stale

Comments

@erezsh
Contributor

erezsh commented Jun 23, 2022

Right now we only support md5 for hashing columns.

However, for some databases that might not be feasible. For example, in mssql it's too slow, and it sounds like Spanner only supports sha1.

I think the solution should be to allow several implementations in subclasses of Database, and have the differ choose the best one that both databases support, or throw an error if there isn't one.

See issues #51 and #99.
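To make the proposal concrete, here is a minimal sketch of the negotiation. Everything in it is an assumption for illustration: the `SUPPORTED_HASHES` attribute, the `choose_shared_hash` function, the `NoSharedHashError` exception, and the per-database algorithm lists are hypothetical and do not reflect the project's actual classes or verified database capabilities.

```python
from typing import Tuple


class NoSharedHashError(Exception):
    """Raised when two databases have no hashing algorithm in common."""


class Database:
    # Hypothetical attribute: algorithms this database can compute,
    # ordered from most preferred to least preferred.
    SUPPORTED_HASHES: Tuple[str, ...] = ("md5",)


class MsSQL(Database):
    # Illustrative values only, not verified capabilities.
    SUPPORTED_HASHES = ("sha1", "md5")


class Spanner(Database):
    # Illustrative values only, not verified capabilities.
    SUPPORTED_HASHES = ("sha1",)


def choose_shared_hash(db1: Database, db2: Database) -> str:
    """Return the first algorithm in db1's preference order that db2 also supports."""
    shared = set(db1.SUPPORTED_HASHES) & set(db2.SUPPORTED_HASHES)
    for algorithm in db1.SUPPORTED_HASHES:
        if algorithm in shared:
            return algorithm
    raise NoSharedHashError(
        f"No shared hashing algorithm between {type(db1).__name__} "
        f"and {type(db2).__name__}"
    )
```

In this sketch the first database's preference order decides ties. A differ-level ranking would work just as well and might be easier to reason about when the two databases disagree on what is fastest.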

@erezsh erezsh changed the title Negotiate hashing algorithm Let the differ choose a common hashing algorithm Jun 23, 2022
@erezsh erezsh changed the title Let the differ choose a common hashing algorithm Let the differ choose a shared hashing algorithm Jun 23, 2022
@sirupsen
Contributor

Agree. sha1 is also sometimes faster than md5, because some instruction-sets have specific instructions for it, but not for md5. crc32 is also generally faster than both of those.
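A rough way to check the relative speed claim on a given machine is a local micro-benchmark. The sketch below uses only Python's standard library; the payload size and repeat counts are arbitrary assumptions, and it measures the local CPU, not how fast any database engine computes the same hashes.

```python
import hashlib
import timeit
import zlib

# Arbitrary 1 MiB payload; real column data will behave differently.
PAYLOAD = b"x" * (1024 * 1024)

CANDIDATES = {
    "md5": lambda data: hashlib.md5(data).digest(),
    "sha1": lambda data: hashlib.sha1(data).digest(),
    "crc32": zlib.crc32,
}


def time_hash(fn, repeat: int = 5, number: int = 3) -> float:
    """Return the best observed time in seconds for `number` calls of fn(PAYLOAD)."""
    return min(timeit.repeat(lambda: fn(PAYLOAD), number=number, repeat=repeat))


def benchmark() -> dict:
    """Time every candidate algorithm against the same payload."""
    return {name: time_hash(fn) for name, fn in CANDIDATES.items()}


if __name__ == "__main__":
    for name, seconds in sorted(benchmark().items(), key=lambda item: item[1]):
        print(f"{name}: {seconds:.4f}s")
```

Results depend heavily on the CPU and on how the local crypto library was built, so they are only a hint about what a database server might do.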

@abhinav1912

I think this should be broken down into 2 tasks:

  1. Adding a list of supported algorithms to the base/child classes, and refactoring md5_to_string into something generic (maybe something like hash_to_string?)
  2. Adding these changes to the interface/CLI where the DB type determines the intersection of hashing algorithms and offers a choice (and/or a default)

I'd be happy to start working on this, if it hasn't been picked up already.
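A minimal sketch of what task 1 could look like, assuming a generic `hash_to_string` that builds a SQL expression from a per-database template table. The `HASH_TEMPLATES` attribute, the `supported_hashes` method, and the class layout are hypothetical; only the MySQL functions `md5()` and `sha1()`, which return hex strings, are real.

```python
from typing import Dict, Tuple


class Database:
    # Hypothetical mapping: algorithm name -> SQL template producing a hex string.
    HASH_TEMPLATES: Dict[str, str] = {}

    def supported_hashes(self) -> Tuple[str, ...]:
        """List the algorithms this database declares, in declaration order."""
        return tuple(self.HASH_TEMPLATES)

    def hash_to_string(self, expr: str, algorithm: str) -> str:
        """Build a SQL expression that hashes `expr` with the given algorithm."""
        try:
            template = self.HASH_TEMPLATES[algorithm]
        except KeyError:
            raise NotImplementedError(
                f"{type(self).__name__} does not support {algorithm!r}"
            ) from None
        return template.format(expr=expr)


class MySQL(Database):
    # MySQL's md5() and sha1() both return lowercase hex strings.
    HASH_TEMPLATES = {
        "md5": "md5({expr})",
        "sha1": "sha1({expr})",
    }
```

For example, `MySQL().hash_to_string("name", "sha1")` yields the SQL fragment `sha1(name)`, and the interface/CLI in task 2 could intersect the `supported_hashes()` results of the two databases to offer a choice or a default.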

@sirupsen
Contributor

@abhinav1912 it'd be awesome if you could pick this up!

If you can do some research for (1) about what kinds of "alternate" negotiations we can do, and how you'd implement it, that'd be a great start before you start implementing it 👀 For example, for MSSQL, how can we do strings if we can't do md5 fast?

@abhinav1912

@sirupsen good point, I'll start looking into the alternatives and probably create a table of algorithms for each database. Will reach out on the Slack channel to clarify a few questions though.
As for MSSQL, here's a good article highlighting the performance of each hashing algorithm. CHECKSUM has good performance but poor hashing; the MD/SHA algorithms have worse performance but better hashing. Doing better than MD5 does seem a little difficult imo.

@sirupsen
Contributor

sirupsen commented Jun 28, 2022

That MD5 performance doesn't look that bad... I wonder why it came up so slow in our tests. Maybe the instance just didn't have enough CPU? Might be worth confirming that result first by writing a basic driver!

@abhinav1912

The performance point raises another question: is there some script/utility that can calculate perf metrics for the current build? If not, then this could be a good addition.

@sirupsen
Contributor

sirupsen commented Jun 28, 2022

@abhinav1912 working on it, will tag you in the PR — it's what generated what's in the README. Basically re-uses the type suite (otherwise usually benchmark suites get out of date)

@github-actions
Contributor

github-actions bot commented Jun 1, 2023

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

@github-actions github-actions bot added the stale label Jun 1, 2023
@github-actions
Contributor

github-actions bot commented Jun 9, 2023

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment and it will be reopened for triage.

@github-actions github-actions bot closed this as not planned Jun 9, 2023

3 participants