Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bit pattern issue #102

Open
polyminer1 opened this issue Sep 7, 2023 · 1 comment
Open

Bit pattern issue #102

polyminer1 opened this issue Sep 7, 2023 · 1 comment

Comments

@polyminer1
Copy link

void MurmurHash3_x64_128 ( const void * key, const int len,

At this line you initialize a 64 bit hash from a 32 bit seed. Effectively forcing the high 32bit to 0 for every call. The mixing will be more patterned down the line. The input seed should be 64bit to ensure a good mix overall.

@sebres
Copy link

sebres commented Sep 7, 2023

Effectively forcing the high 32bit to 0 for every call

Nope. Firstly it has not really an "initialization" (if one carefully follows the mixing algorithm inside the blocks). Secondly it is not the case here, since 2 x 64 bits buffers get the seed in the low 32bits in:

uint64_t h1 = seed;
uint64_t h2 = seed;

But they get complexly mixed hereafter during hashing and finalization phase, so almost all 32bits of seed significantly distributed over the whole hash.

The input seed should be 64bit to ensure a good mix overall.

Then it will be not MM3 anymore.

Additionally note:

  1. MM3 is a non-cryptographic hash function, so its main goals are totally different and role of seed is secondary (see below for one of the purposes).
  2. Also don't confuse the seed with salt, which should be also mixed with the input using some cryptographically strong methods like HMAC and co.
  3. Therefore the seed is rather to select the specific computed value of hash function, for instance to distribute different types of objects with theoretically equal block representation across some hash table.
    (if two hashes use different seeds, they are very likely to compute distinct hash values for any also the same given input).
  4. Unsigned 32-bit seed provides 4 billions "unique" values (if hash function shows good distribution and few collisions), what is fully enough for purposes like described above typization (since 4 billions types are fully enough) etc.

And last but not least, you can have the "seed" as long as you need it - just prefix or suffix the input with some buffer (64 bits or larger) and calculate the hash for whole buffer (with increased length), or calculate it using chain mode (repeated calculation for data in parts), starting or ending with a buffer containing the seed of your arbitrary length.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants