This repository has been archived by the owner on May 17, 2024. It is now read-only.

doc: list of hashing algorithms supported by DBs #128

Closed

Conversation

abhinav1912

Added table containing all hashing algorithm choices. Related to issue #103
cc: @sirupsen

@erezsh
Contributor

erezsh commented Jun 29, 2022

Thanks for collecting this information. Please put it in Discussions.

We will not add to the README a list of algorithms that we don't support.

@erezsh erezsh closed this Jun 29, 2022
@sirupsen
Contributor

@abhinav1912 this is an excellent collection, thank you! 😍

As Erez pointed out, I think we should keep it in the issue at hand, as it won't be too useful for the general user.

@abhinav1912
Author

@sirupsen agreed, I'll create a discussion for this. Now that there is a list of algorithms at hand, what should be done for issue #103?

@erezsh
Contributor

erezsh commented Jun 29, 2022

I think that since the only issue here is performance, and the only hash function in mssql that performs well is not shared by any other database, the only path forward I can see is to implement some hash function using simple arithmetic, and try to prove that it's good enough to be used for comparing data.
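To make the "simple arithmetic" idea concrete, here is a minimal Python sketch. Everything in it (the constants, `row_hash`, `segment_checksum`) is a hypothetical illustration rather than anything data-diff implements, and whether a function like this is good enough for comparing data is exactly what would still have to be shown.

```python
# Hypothetical constants chosen for illustration only; they do not come from data-diff.
MODULUS = 2**32
A = 2654435761  # an odd multiplier
B = 40503       # a second odd multiplier


def row_hash(value: int) -> int:
    """Mix one integer using only multiplication, addition and modulo."""
    # The squared term makes the mix non-linear, so the aggregate below
    # is not just a scaled sum of the raw values.
    return (A * value * value + B * value) % MODULUS


def segment_checksum(values) -> int:
    """Combine per-row hashes with a modular sum, so row order does not matter."""
    total = 0
    for v in values:
        total = (total + row_hash(v)) % MODULUS
    return total
```

Python integers never overflow, so a SQL translation of this sketch would additionally have to keep every intermediate product inside the database's integer width. The order-independent sum is a deliberate choice here: two databases can then return rows in different orders and still agree on the checksum.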

@sirupsen
Contributor

Yeah, I think the next step is to target MSSQL directly (which I assume you have experience with based on your existing posts?); we can look at others (e.g. Spanner) later.

I think we should investigate a sum-number as the first alternative to md5. Spanner is the only one I've seen so far that supports hashing, just not MD5.

  1. What is the MD5 performance in MSSQL compared to Postgres for 1M records for e.g. an int column?

  2. If it's really bad, what alternative scheme can we come up with for MSSQL?

For int, float, and datetime we could do a sum() (which is also what we'd negotiate with e.g. ElasticSearch). I think negotiating a sum-number algorithm could be a good start, since for every database this will also be much faster than MD5, so it's useful regardless.
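As a rough illustration of the sum() idea for int and datetime columns, the Python sketch below reduces each value to an integer and adds them up. The helper names are hypothetical, and the choice of microseconds since the Unix epoch as the datetime representation is an assumption made for the example, not something agreed in this thread.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def datetime_to_micros(ts: datetime) -> int:
    """Whole microseconds since the Unix epoch; ts must be timezone-aware."""
    delta = ts - EPOCH
    # Work in integers throughout to avoid floating-point rounding.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def sum_checksum(values) -> int:
    """Order-independent checksum of an integer column: a plain sum."""
    return sum(values)


def datetime_sum_checksum(timestamps) -> int:
    """The same sum, applied to datetimes after converting them to integers."""
    return sum(datetime_to_micros(ts) for ts in timestamps)
```

Float columns are deliberately left out of the sketch: floating-point addition is sensitive to summation order and rounding, so two databases could disagree on the total even for identical data unless the values are first rounded or scaled to integers.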

  3. Can we do sum on a string? Is there a standard way to get e.g. a sum of all the UTF-8 characters?
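For the string question, one naive reading of "a sum of all the UTF-8 characters" is to add up the bytes of each value's UTF-8 encoding. The Python sketch below shows that interpretation; the function names are hypothetical, and skipping NULLs is an assumption made for the example.

```python
def utf8_byte_sum(text: str) -> int:
    """Sum of the bytes in the UTF-8 encoding of a single string."""
    return sum(text.encode("utf-8"))


def column_byte_sum(values) -> int:
    """Sum of utf8_byte_sum over a column, skipping None (NULL) values."""
    return sum(utf8_byte_sum(v) for v in values if v is not None)
```

A plain byte sum ignores the position of each byte, so "ab" and "ba" produce the same total. Any real scheme along these lines would probably need to weight bytes by position, and each database would have to expose the same byte-level view of its strings.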

@abhinav1912
Author

@sirupsen of the hashing algorithms, MD5 is the best option. We could try using either CHECKSUM or BINARY_CHECKSUM (with some key points highlighted here), since both of these also work with strings.
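To sketch how CHECKSUM or BINARY_CHECKSUM might be used per segment, the hypothetical Python helper below only builds a T-SQL string; it does not connect to a database. Wrapping the per-row value in `CHECKSUM_AGG` is an assumption made for this example and was not proposed in the thread, and the helper name and bracket quoting are illustrative.

```python
def mssql_checksum_query(table: str, column: str, binary: bool = True) -> str:
    """Build a T-SQL query that aggregates per-row checksums for one column.

    Identifiers are assumed to be trusted; the bracket quoting here is naive
    and is not a defense against SQL injection.
    """
    # BINARY_CHECKSUM compares the binary representation of the value, while
    # CHECKSUM follows the column's collation rules for strings.
    func = "BINARY_CHECKSUM" if binary else "CHECKSUM"
    # CHECKSUM_AGG combines the per-row int values into one int for the group.
    return f"SELECT CHECKSUM_AGG({func}([{column}])) FROM [{table}]"


# Example with hypothetical table and column names.
query = mssql_checksum_query("orders", "customer_name")
```

Both functions return a 32-bit int, so collisions are far more likely than with MD5, and the result is MSSQL-specific, which means the other side of a cross-database comparison could not reproduce it.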
