Unicode & Python implementation #12

Open
eoghanmurray opened this issue Sep 11, 2019 · 1 comment

eoghanmurray commented Sep 11, 2019

I'm wondering how to get the same hash for a unicode string in my browser using this function as I get with the Python implementation https://pypi.org/project/murmurhash/ (which can operate on either unicode strings or bytes).

> murmurhash3_32_gc('a', 1)
< 1485495528
> murmurhash3_32_gc('Җ', 1)
< 4216681732

Python:

>>> murmurhash.hash('a', 1)
1485495528
>>> murmurhash.hash_bytes('a'.encode('utf8'), 1)
1485495528
>>> murmurhash.hash_bytes('a'.encode('utf16'), 1)
-1631321036
>>> murmurhash.hash('Җ', 1)
-1621654532
>>> murmurhash.hash_bytes('Җ'.encode('utf8'), 1)
-1621654532
>>> murmurhash.hash_bytes('Җ'.encode('utf16'), 1)
1377606248

I'm wondering which implementation is correct, or if there's anything I need to do to either to get the output to match up?


polarathene commented Apr 30, 2021

Someone addressed that issue and published a revised version to NPM here. (EDIT: My bad, the linked NPM package is not based on this project; it also expects you to convert strings to bytes first, but provides examples of how to do that.) This package should work and handles the string conversion internally; it is a fork of this project's repo but seems intended for Node.js.
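For reference, a minimal sketch of that kind of conversion in a modern browser, using TextEncoder to get UTF-8 bytes (hashBytes below is just a placeholder name for whichever byte-oriented hash you end up using, not an API of either package):

// Convert the UTF-16 JS string to UTF-8 bytes up front, then hash the bytes.
const bytes = new TextEncoder().encode("some unicode string Җ");  // Uint8Array of UTF-8 bytes
// hashBytes(bytes, seed);  // placeholder: pass the Uint8Array to a byte-oriented hash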


This can be a common issue when porting hash functions, since some languages default to UTF-8 strings while others (like JS) use UTF-16. More precisely: iterating over a string in Python 3, which treats it as a unicode string, walks through its individual code points (32-bit values). In Rust you would iterate through the individual UTF-8 bytes, or through the 32-bit code points via chars().

JS, however, uses UTF-16 encoded strings. You can iterate through the string's sequence of 16-bit (2-byte) code units, and some glyphs/characters occupy a pair of them, known as a high and a low surrogate. Iterating with for...of yields whole code points (surrogate pairs combined), whereas charCodeAt() indexes individual 16-bit code units, so at the position of a surrogate pair it returns only the high surrogate and ignores the low surrogate.
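A quick illustration of the difference (my own example, using an emoji that needs a surrogate pair in UTF-16):

"😀".length;          // 2  -- two 16-bit code units (a surrogate pair)
"😀".charCodeAt(0);   // 55357 (0xD83D) -- the high surrogate only
"😀".charCodeAt(1);   // 56832 (0xDE00) -- the low surrogate
"😀".codePointAt(0);  // 128512 (0x1F600) -- the full code point
[..."😀"].length;     // 1  -- for...of / spread iterate by code point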

For your example glyph that's a non-issue, as it is only 2 bytes in size; the problem it still has is that a 16-bit value is returned while the JS hash code expects and treats it as an 8-bit value, like regular ASCII text. It needs to be converted to UTF-8, which here is still 2 bytes (sometimes UTF-8 needs more bytes than UTF-16), and then it will work as expected.
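Concretely, for 'Җ' (checked in a browser console; TextEncoder does the UTF-8 conversion):

"Җ".charCodeAt(0);               // 1174 (0x0496) -- a single 16-bit UTF-16 code unit
new TextEncoder().encode("Җ");   // Uint8Array [210, 150] -- two UTF-8 bytes (0xD2, 0x96)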

Fun fact: for added confusion on this topic, some glyphs (or rather graphemes) are composed of multiple unicode characters. This is not uncommon with emoji, e.g. 🕵🏼‍♀️ can render as a single glyph/emoji but is actually composed of 5 different characters, and takes 14 bytes in UTF-16 but 17 bytes in UTF-8. The main difference for us is that it looks like a single value because it's an emoji, yet the same JS behaviour applies to it: it is treated as a sequence of UTF-16 code units, and Rust and Python would likewise see 5 code points to iterate through.
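If you want to check those numbers yourself in a browser console:

const spy = "🕵🏼‍♀️";
[...spy].length;                       // 5  -- five unicode code points
spy.length * 2;                        // 14 -- bytes as UTF-16 (7 code units x 2 bytes)
new TextEncoder().encode(spy).length;  // 17 -- bytes as UTF-8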


I'm wondering which implementation is correct

Python is correct. hash_bytes operates on the bytes of the UTF-8 encoded string (4 bytes at a time) and outputs a 32-bit value as the hash.
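Roughly, that byte-oriented processing looks like this (a simplified sketch of the general murmur3_32 input handling, not the Python library's actual code):

const data = new TextEncoder().encode("some input");  // UTF-8 bytes
for (let i = 0; i + 4 <= data.length; i += 4) {
  // Each group of 4 bytes is read as one little-endian 32-bit block...
  const block = data[i] | (data[i + 1] << 8) | (data[i + 2] << 16) | (data[i + 3] << 24);
  // ...and mixed into the hash state here.
}
// Any remaining 1-3 "tail" bytes are mixed in separately, and the total length is
// folded in during finalization, producing a 32-bit result.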

The lib in Python you used outputs a signed integer (half the range can be a negative number); if output as an unsigned integer instead it would always be a positive number (which is what this project does). There is no difference in the actual bytes (the hash result), just in how the number is interpreted; you can convert between the two views directly, as the snippet after this list shows. Thus both of the following are correct:

  • Signed Integer 32-bit: -1621654532
  • Unsigned Integer 32-bit: 2673312764
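In JS, the bit operators flip between those two interpretations of the same 32 bits:

(-1621654532 >>> 0);  // 2673312764  -- reinterpret the 32 bits as unsigned
(2673312764 | 0);     // -1621654532 -- reinterpret the 32 bits as signed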

The reason the JS version fails is that, as mentioned, JS is working with a UTF-16 string but doesn't iterate through bytes as intended, so it gets the wrong values to calculate with:

case 1: k1 ^= (key.charCodeAt(i) & 0xff);

Like the other lines there, the values from the string are taken with charCodeAt(), which can return a 16-bit value instead of the 8-bit value that is expected. Earlier in the code it takes 4 values this way to get 32 bits (4 bytes) to operate on (since this is 32-bit murmur3), and that works fine as long as each character in the string is an 8-bit value, as a-z, A-Z, 0-9 are. Since your character is 16 bits, the code only processed half of it, assuming a single byte represented it.
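A condensed view of that block-assembly step (paraphrased, not the verbatim source) and why the masking loses data for non-ASCII input:

// The main loop builds each 32-bit block from four charCodeAt() values:
// k1 = (key.charCodeAt(i)      & 0xff)        |
//      ((key.charCodeAt(i + 1) & 0xff) << 8)  |
//      ((key.charCodeAt(i + 2) & 0xff) << 16) |
//      ((key.charCodeAt(i + 3) & 0xff) << 24);
"a".charCodeAt(0) & 0xff;  // 97   -- ASCII fits in 8 bits, nothing is lost
"Җ".charCodeAt(0) & 0xff;  // 150  -- the 16-bit value 1174 (0x0496) loses its 0x04 byte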


or if there's anything I need to do to either to get the output to match up?

You can correctly convert the input string for processing this way: unescape(encodeURIComponent("Җ")).

Doing so, I was able to get the value 2673312764, which is correct.
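For example (murmurhash3_32_gc being the function from the question; the comments describe why the trick works):

// encodeURIComponent() percent-encodes the string's UTF-8 bytes ("%D2%96"), and
// unescape() turns each %XX back into one character in the 0-255 range, so every
// charCodeAt() call in the hash now yields exactly one UTF-8 byte.
const utf8Str = unescape(encodeURIComponent("Җ"));  // 2 characters: "\u00d2\u0096"
murmurhash3_32_gc(utf8Str, 1);                      // 2673312764 -- matches the Python value (as unsigned 32-bit)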

I could also reproduce the incorrect hash you provided in a Rust implementation of murmur3 by using the 16-bit encoding [1174, 0] (in this case there is only the 1174 value, so it is only 2 bytes; charCodeAt() always returns that first 16-bit value, even if there is a second part). This project then bitmasks it to an 8-bit value, 1174 & 0xFF = 150, and that's all it would operate on. If we encode to UTF-8, we have two bytes that can be used to produce the correct result: [210, 150].
