-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specify buzhash #24
Specify buzhash #24
Conversation
I think these changes (not mutually exclusive) would improve the presentation of the sequence
It would also be good to put the entries in a text file hosted in the same place as the spec, for easy downloading and copy/paste. That can be a follow-up PR, though. My only substantive concern is that if we're fixing a particular |
I generally agree with your suggestions. Note though that it's easy enough to copy & paste from the HTML version of the spec -- but if we reformat as a table then depending on the markup that may cease to be the case. Hopefully we can do it without breaking that; I think monospace + hex should do a good enough job. |
Incorporated most of your suggestions. The HTML version still copies & pastes well. I don't expect bad behavior on a randomly generated G is likely, but once we have an implementation we can test. |
Interesting observation about this hash function, discovered when testing the reference implementation I just wrote: if all of the bytes in the window are the same, the hash is always zero. This is due to the fact that the window size is twice the bit length of the hash, so you get two copies of each (after translation through g) xor'd together. |
I'm not sure how big a problem this is, if at all. Long stretches of any single byte will produce long sequences of identical chunks, of course, and a lopsided hashsplit tree. The fact that this hash has the same value for all long monosequences, and that that value is zero, doesn't seem like it should matter much in practice. How many different long monosequences is one file likely to contain? |
$\operatorname{CP32}(X) = \bigoplus_{i = 0}^{|X| - 1} | ||
\operatorname{ROT}_L(g(X_i), |X| - i + 1)$ | ||
|
||
Where $g(n) = G_n$ and the sequence $G \in V_{32}$ is defined in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why define g(n) at all? Why not simply use G_n wherever you're using g(n)?
(Is it because you're trying to avoid a subscript of a subscript? If so, that has implications for my PR.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it was just to avoid the double subscript, though I don't feel incredibly strongly.
related functions is sometimes also called "buzhash." `cp32` is the | ||
recommended hash function for use with hashsplit; use it unless you have | ||
clear reasons for doing otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to add some rationale here. (Perhaps in a subsequent PR.) Why is cp32 recommended? Why not rrs, which from the description below sounds much more common and therefore a likelier standard?
(These are rhetorical questions - I know the answer.)
Thanks to @bobg for spotting this. Co-authored-by: Bob Glickstein <[email protected]>
Going to go ahead and merge this. |
This specifies a concrete version of buzhash32 for us to use as a recommended hash function. I decided to call the concrete instance
cp32
to distinguish it from the general concept of cyclic polynomials/buzhash.closes #3
@bobg, @cole-miller, want to give this a look?