Specify buzhash #24

zenhack · 2020-10-14T18:32:19Z

This specifies a concrete version of buzhash32 for us to use as a recommended hash function. I decided to call the concrete instance cp32 to distinguish it from the general concept of cyclic polynomials/buzhash.

closes #3

@bobg, @cole-miller, want to give this a look?

cole-miller · 2020-10-14T19:26:08Z

I think these changes (not mutually exclusive) would improve the presentation of the sequence G:

- moving it to an appendix
- formatting it as a table (unruled)
- using hexadecimal notation and monospace type for the entries

It would also be good to put the entries in a text file hosted in the same place as the spec, for easy downloading and copy/paste. That can be a follow-up PR, though.

My only substantive concern is that if we're fixing a particular G we should make sure that the corresponding hash function has good behavior, even if (as I presume is the case) the probability that a randomly-chosen G defines a bad hash function is low.


zenhack · 2020-10-14T20:18:28Z

I generally agree with your suggestions. Note though that it's easy enough to copy & paste from the HTML version of the spec -- but if we reformat as a table then depending on the markup that may cease to be the case. Hopefully we can do it without breaking that; I think monospace + hex should do a good enough job.

zenhack · 2020-10-14T20:32:45Z

Incorporated most of your suggestions. The HTML version still copies & pastes well.

I don't expect a randomly generated G to behave badly, but once we have an implementation we can test it.
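As one possible sanity check on a candidate G (not necessarily the test used for this PR), the sketch below counts how often each of the 32 bit positions is set across the table; a heavily lopsided column would bias the corresponding output bit. The helper names `bit_column_counts` and `max_bit_imbalance` are hypothetical, and the table here is a random stand-in, not the G from the spec.

```python
import random


def bit_column_counts(table):
    """For each of the 32 bit positions, count the entries that have that bit set."""
    return [sum((entry >> bit) & 1 for entry in table) for bit in range(32)]


def max_bit_imbalance(table):
    """Largest deviation of any bit column from the ideal 'half set' count."""
    half = len(table) / 2
    return max(abs(count - half) for count in bit_column_counts(table))


# Random stand-in for G (assumption: 256 entries of 32 bits each).
rng = random.Random(0)
candidate = [rng.getrandbits(32) for _ in range(256)]
print("worst column imbalance:", max_bit_imbalance(candidate))
```

A perfectly balanced table gives an imbalance of 0; how much deviation is acceptable is a judgment call that this sketch does not settle.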

zenhack · 2020-10-15T01:03:14Z

Interesting observation about this hash function, discovered when testing the reference implementation I just wrote: if all of the bytes in the window are the same, the hash is always zero. This happens because the window size is twice the bit length of the hash, so every rotation amount occurs exactly twice. With identical bytes you therefore get two copies of each rotated value (after translation through g) xor'd together, and they cancel.
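The observation above can be reproduced with a small sketch. It assumes a 64-byte window, rotation amounts reduced modulo 32, and a randomly generated 256-entry table standing in for G (not the table from the spec). The names `rotl32`, `cyclic_poly_hash`, and `WINDOW` are illustrative and are not taken from the reference implementation.

```python
import random

MASK32 = 0xFFFFFFFF
WINDOW = 64  # assumption: twice the 32-bit hash width


def rotl32(x, n):
    """Rotate the 32-bit value x left by n bits (n is taken modulo 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def cyclic_poly_hash(window, table):
    """Non-rolling cyclic-polynomial hash: xor of rotated table lookups.

    The rotation for byte i is (len(window) - 1 - i); a different constant
    offset would not change the cancellation shown below.
    """
    h = 0
    size = len(window)
    for i, byte in enumerate(window):
        h ^= rotl32(table[byte], size - 1 - i)
    return h


# Random stand-in for G; any 256-entry table shows the same effect.
rng = random.Random(0)
G = [rng.getrandbits(32) for _ in range(256)]

# Hash every possible single-byte run of WINDOW bytes.
mono_hashes = {cyclic_poly_hash(bytes([b]) * WINDOW, G) for b in range(256)}
print(mono_hashes)
```

Each rotation amount modulo 32 appears exactly twice across the 64 positions, so for a window of identical bytes every rotated table entry is xor'd with itself and the set contains only zero, regardless of the table.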

bobg · 2020-10-18T18:29:39Z

> if all of the bytes in the window are the same, the hash is always zero

I'm not sure how big a problem this is, if at all. Long stretches of any single byte will produce long sequences of identical chunks, of course, and a lopsided hashsplit tree. The fact that this hash has the same value for all long monosequences, and that that value is zero, doesn't seem like it should matter much in practice. How many different long monosequences is one file likely to contain?

bobg · 2020-10-18T18:38:38Z

spec.md

+$\operatorname{CP32}(X) = \bigoplus_{i = 0}^{|X| - 1}
+\operatorname{ROT}_L(g(X_i), |X| - i + 1)$
+
+Where $g(n) = G_n$ and the sequence $G \in V_{32}$ is defined in the


Why define g(n) at all? Why not simply use G_n wherever you're using g(n)?

(Is it because you're trying to avoid a subscript of a subscript? If so, that has implications for my PR.)

Yeah, it was just to avoid the double subscript, though I don't feel incredibly strongly.
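As a reading aid only (this is not text from the spec), here is how the formula expands for a three-byte input $X = (X_0, X_1, X_2)$, taking the rotation amounts exactly as they appear in the hunk quoted above. The second line shows the same expansion with the subscript-of-a-subscript form that $g$ was introduced to avoid.

```latex
\operatorname{CP32}(X)
  = \operatorname{ROT}_L(g(X_0), 4)
    \oplus \operatorname{ROT}_L(g(X_1), 3)
    \oplus \operatorname{ROT}_L(g(X_2), 2)

\operatorname{CP32}(X)
  = \operatorname{ROT}_L(G_{X_0}, 4)
    \oplus \operatorname{ROT}_L(G_{X_1}, 3)
    \oplus \operatorname{ROT}_L(G_{X_2}, 2)
```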


bobg · 2020-10-18T18:46:04Z

spec.md

+related functions is sometimes also called "buzhash." `cp32` is the
+recommended hash function for use with hashsplit; use it unless you have
+clear reasons for doing otherwise.


It would be good to add some rationale here. (Perhaps in a subsequent PR.) Why is cp32 recommended? Why not rrs, which from the description below sounds much more common and therefore a likelier standard?

(These are rhetorical questions - I know the answer.)


zenhack · 2020-10-29T03:12:19Z

Going to go ahead and merge this.

Specify buzhash

3ffe3a9

cole-miller reviewed Oct 14, 2020


zenhack added 3 commits October 14, 2020 16:19

Explicitly state |X| \mod 32.

3c04951

Move G to an appendix, use a monospace font.

d6506f6

Display G in hexidecimal

12810b2

bobg reviewed Oct 18, 2020


Fix typo.

0bb9d97

Thanks to @bobg for spotting this. Co-authored-by: Bob Glickstein <[email protected]>

zenhack closed this Oct 29, 2020

zenhack reopened this Oct 29, 2020

zenhack added 2 commits October 28, 2020 23:14

Merge remote-tracking branch 'origin/master' into cp32

9f1390f

Clarify "summation style" xor notation with an example.

9e0af82

zenhack merged commit 73a56da into master Oct 29, 2020

zenhack deleted the cp32 branch October 29, 2020 03:18

zenhack mentioned this pull request Oct 29, 2020

Spell out rationale for recommending cp32 over rrs1 #26

Open
