Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify buzhash #24

Merged
merged 7 commits into from
Oct 29, 2020
Merged

Specify buzhash #24

merged 7 commits into from
Oct 29, 2020

Conversation

zenhack
Copy link
Collaborator

@zenhack zenhack commented Oct 14, 2020

This specifies a concrete version of buzhash32 for us to use as a recommended hash function. I decided to call the concrete instance cp32 to distinguish it from the general concept of cyclic polynomials/buzhash.

closes #3

@bobg, @cole-miller, want to give this a look?

@cole-miller
Copy link
Contributor

cole-miller commented Oct 14, 2020

I think these changes (not mutually exclusive) would improve the presentation of the sequence G:

  • moving it to an appendix
  • formatting it as a table (unruled)
  • using hexadecimal notation and monospace type for the entries

It would also be good to put the entries in a text file hosted in the same place as the spec, for easy downloading and copy/paste. That can be a follow-up PR, though.

My only substantive concern is that if we're fixing a particular G we should make sure that the corresponding hash function has good behavior, even if (as I presume is the case) the probability that a randomly-chosen G defines a bad hash function is low.

spec.md Show resolved Hide resolved
@zenhack
Copy link
Collaborator Author

zenhack commented Oct 14, 2020

I generally agree with your suggestions. Note though that it's easy enough to copy & paste from the HTML version of the spec -- but if we reformat as a table then depending on the markup that may cease to be the case. Hopefully we can do it without breaking that; I think monospace + hex should do a good enough job.

@zenhack
Copy link
Collaborator Author

zenhack commented Oct 14, 2020

Incorporated most of your suggestions. The HTML version still copies & pastes well.

I don't expect bad behavior on a randomly generated G is likely, but once we have an implementation we can test.

@zenhack
Copy link
Collaborator Author

zenhack commented Oct 15, 2020

Interesting observation about this hash function, discovered when testing the reference implementation I just wrote: if all of the bytes in the window are the same, the hash is always zero. This is due to the fact that the window size is twice the bit length of the hash, so you get two copies of each (after translation through g) xor'd together.

@bobg
Copy link
Contributor

bobg commented Oct 18, 2020

if all of the bytes in the window are the same, the hash is always zero

I'm not sure how big a problem this is, if at all. Long stretches of any single byte will produce long sequences of identical chunks, of course, and a lopsided hashsplit tree. The fact that this hash has the same value for all long monosequences, and that that value is zero, doesn't seem like it should matter much in practice. How many different long monosequences is one file likely to contain?

$\operatorname{CP32}(X) = \bigoplus_{i = 0}^{|X| - 1}
\operatorname{ROT}_L(g(X_i), |X| - i + 1)$

Where $g(n) = G_n$ and the sequence $G \in V_{32}$ is defined in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why define g(n) at all? Why not simply use G_n wherever you're using g(n)?

(Is it because you're trying to avoid a subscript of a subscript? If so, that has implications for my PR.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it was just to avoid the double subscript, though I don't feel incredibly strongly.

spec.md Outdated Show resolved Hide resolved
Comment on lines +220 to +222
related functions is sometimes also called "buzhash." `cp32` is the
recommended hash function for use with hashsplit; use it unless you have
clear reasons for doing otherwise.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add some rationale here. (Perhaps in a subsequent PR.) Why is cp32 recommended? Why not rrs, which from the description below sounds much more common and therefore a likelier standard?

(These are rhetorical questions - I know the answer.)

Thanks to @bobg for spotting this.

Co-authored-by: Bob Glickstein <[email protected]>
@zenhack
Copy link
Collaborator Author

zenhack commented Oct 29, 2020

Going to go ahead and merge this.

@zenhack zenhack closed this Oct 29, 2020
@zenhack zenhack reopened this Oct 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Choose recommended hash function
3 participants