Specify buzhash #24

Merged
merged 7 commits into from
Oct 29, 2020
Changes from 1 commit
116 changes: 113 additions & 3 deletions spec.md
@@ -35,12 +35,13 @@ This section discusses notation used in this specification.

We define the following sets:

- $U_{32}$, The set of integers in the range $[0, 2^{32})$.
- $U_8$, The set of integers in the range $[0, 2^8)$, aka bytes.
- $V_8$, The set of *sequences* of bytes, i.e. sequences of
$U_8$.
- $V_v$, The set of *sequences* of *sequences* of bytes, i.e.
sequences of elements of $V_8$.
- $V_{32}$, The set of sequences of elements of $U_{32}$.

All arithmetic operations in this document are implicitly performed
modulo $2^{32}$. We use standard mathematical notation for addition,
@@ -56,16 +57,33 @@ elements it contains.
We also use the following operators and functions:

- $x \wedge y$ denotes the bitwise AND of $x$ and $y$
- $x \vee y$ denotes the bitwise *inclusive* OR of $x$ and $y$
- $x \oplus y$ denotes the bitwise *exclusive* OR of $x$ and $y$
- $x \ll n$ denotes shifting $x$ to the left $n$ bits, i.e.
$x \ll n = x2^{n}$
- $x \gg n$ denotes a *logical* right shift -- it shifts $x$ to the
right by $n$ bits, discarding the low bits, i.e.
$x \gg n = \lfloor x / 2^n \rfloor$
- $X \mathbin{\|} Y$ denotes the concatenation of two sequences $X$ and
$Y$,
i.e. if $X = \langle X_0, \dots, X_N \rangle$ and $Y = \langle Y_0,
\dots, Y_M \rangle$ then $X \mathbin{\|} Y = \langle X_0, \dots, X_N, Y_0, \dots, Y_M
\rangle$
- $\operatorname{min}(x, y)$ denotes the minimum of $x$ and $y$.
- $\operatorname{ROT}_L(x, n)$ denotes the rotation of $x$ to the left
by $n$ bits, i.e. $\operatorname{ROT}_L(x, n) = (x \ll n) \vee (x \gg
(32 - n))$
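
As an illustration of the shift and rotate operators above, here is a minimal Python sketch. The names `MASK32`, `shl`, `shr` and `rotl` are hypothetical helpers, not part of the specification. The sketch also assumes that the rotation amount is reduced modulo 32, which keeps $n = 0$ and $n \ge 32$ well defined; the text above does not state this explicitly.

```python
MASK32 = 0xFFFFFFFF  # all arithmetic is performed modulo 2**32


def shl(x: int, n: int) -> int:
    """x << n, truncated to 32 bits."""
    return (x << n) & MASK32


def shr(x: int, n: int) -> int:
    """Logical right shift of a 32-bit value: floor(x / 2**n)."""
    return (x & MASK32) >> n


def rotl(x: int, n: int) -> int:
    """ROT_L(x, n): rotate the 32-bit value x left by n bits.

    Assumption: n is reduced modulo 32 before rotating.
    """
    n %= 32
    return shl(x, n) | shr(x, 32 - n)


# The high byte wraps around to the low end.
assert rotl(0x12345678, 8) == 0x34567812
```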

We use standard mathematical notation for summation. For example:

$\sum_{i = 0}^{n} i$

denotes the sum of integers in the range $[0, n]$.

We define a similar notation for exclusive or:

$\bigoplus_{i = 0}^{n} i$

denotes the bitwise exclusive or of the integers in $[0, n]$.
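
To make the $\bigoplus$ notation concrete, the following Python sketch folds exclusive or over a sequence, in the same way that $\sum$ folds addition. The helper name `xor_fold` is hypothetical.

```python
from functools import reduce
from operator import xor


def xor_fold(values):
    """Bitwise exclusive or of all values; the empty fold is 0."""
    return reduce(xor, values, 0)


# The big-XOR over [0, n] with n = 5; note that the upper bound is inclusive.
n = 5
result = xor_fold(range(0, n + 1))  # 0 ^ 1 ^ 2 ^ 3 ^ 4 ^ 5
```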

# Splitting

@@ -193,6 +211,98 @@ To “close” a node $N_i$:

# Rolling Hash Functions

## CP32

The `cp32` hash function is based on cyclic polynomials. The family of
related functions is sometimes also called "buzhash." `cp32` is the
recommended hash function for use with hashsplit; use it unless you have
clear reasons for doing otherwise.
Comment on lines +306 to +308
Contributor

It would be good to add some rationale here. (Perhaps in a subsequent PR.) Why is cp32 recommended? Why not rrs, which from the description below sounds much more common and therefore a likelier standard?

(These are rhetorical questions - I know the answer.)


### Definition

We define the function $\operatorname{CP32} \in V_8 \rightarrow U_{32}$
as:

$\operatorname{CP32}(X) = \bigoplus_{i = 0}^{|X| - 1}
\operatorname{ROT}_L(g(X_i), |X| - i - 1)$

Where $g(n) = G_n$ and $G \in V_{32}$ is the sequence:

$\langle$
1798466244, 335077821, 492897903, 2272477127, 924078993,
470900288, 3538556835, 2253448369, 3901930249, 1082788922, 1010874156,
714641328, 4136751352, 2988471001, 1873327877, 1925014406, 3717225155,
2144051386, 1334093084, 500713342, 1872178988, 2899116430, 3801501459,
2055021448, 3958647975, 2833424546, 2394367699, 3436276150, 1413805598,
3782782330, 3579224461, 3133673204, 885754503, 2000724393, 1650724833,
2984658034, 3739408072, 662999316, 3056811851, 3138083869, 1474639442,
4259649945, 4017566483, 465238337, 781885572, 928545464, 2787015764,
3078209121, 3631832061, 386384374, 1358863444, 855586437, 2499107874,
707972634, 194016939, 339095673, 2929281836, 1250797697, 1198569924,
4107355101, 1890126859, 2694076458, 1260735760, 609497694, 388343177,
2587066586, 2492394206, 2046329380, 2072888184, 3255373238, 1106749356,
3571012236, 2131471591, 3541399572, 3614800836, 3022576390, 774577410,
77184245, 823105086, 3857914499, 3771555855, 2336796436, 1452192314,
30479627, 871710755, 2699403231, 2367144669, 4219196231, 1301074994,
1716369630, 3566152300, 381894957, 2268738300, 276392481, 3980456184,
1554573746, 259052121, 234173122, 23950250, 1165367973, 412829095,
1254938419, 1679790307, 1496242670, 1260221101, 1124019412, 106214921,
2039485120, 1499412132, 137092054, 58056147, 3245088693, 1464413688,
2731895448, 3753028136, 925430623, 3695831665, 2599322487, 3593331371,
742039893, 1081974509, 3160094770, 1117133092, 345511722, 3339872022,
2504608598, 3557049083, 2989113041, 181657774, 2007372650, 4212900848,
443636792, 3434861085, 102756407, 2245460171, 2324673430, 2506866248,
4065685208, 857327755, 3772175337, 1199813398, 61289795, 624682477,
1093806826, 3753905274, 2536215571, 3435807477, 169664309, 2732339640,
4102264811, 3191810878, 3532983199, 3436341711, 1369853097, 3930511726,
3404499246, 1068382818, 4046345179, 928204364, 558874001, 1620894455,
1633203276, 2806743079, 2045995282, 382386530, 1584848719, 3158410491,
2435624061, 4185806596, 1716376159, 2105767941, 1401343968, 1314413943,
2700759678, 3948708865, 2709965905, 99565343, 3253716852, 947063852,
2912703801, 667765048, 1930933490, 2234259567, 2764681460, 4120047632,
4143875827, 2974525548, 118528551, 218125965, 3643729055, 3697557171,
2374091571, 3441261501, 2281675994, 2163884677, 623218011, 3655599706,
2023054360, 2223459370, 4256043883, 1223603155, 3261217112, 1517615469,
197872117, 809619196, 2779816360, 757709542, 2439696019, 242243149,
2722665646, 2560033869, 3416882218, 1386419121, 831440001, 3295846081,
3641366410, 1441505168, 1326817242, 2836380996, 3502924296, 1549365865,
3764012830, 648405540, 2534689092, 3974284422, 4133264030, 2191784891,
2455575138, 3603219099, 1639524653, 370485705, 930165312, 1760334651,
3984631013, 45275969, 1204803190, 3219286849, 1585735867, 1846351268,
1077427956, 2179343099, 424359820, 206584565, 2382377895, 3844702,
2642110243, 117631095, 884429158, 3518666968, 2610894740, 3902151812,
2641836823, 4185407209, 3757119320, 1237230287, 2699069567
$\rangle$

(The sequence $G$ was chosen at random.) Note that $|G| = 256$, so
$g(n)$ is defined for every byte $n \in U_8$.
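
The definition can be sketched directly in Python. This is an illustration under stated assumptions, not a reference implementation: `cp32`, `rotl` and `DEMO_TABLE` are hypothetical names, and `DEMO_TABLE` is a stand-in and *not* the sequence $G$ listed above (a real implementation would pass in the 256 values of $G$). The sketch rotates $g(X_i)$ by $|X| - i - 1$ bits, the exponent that agrees with the rolling update described in this document, and reduces rotation amounts modulo 32.

```python
MASK32 = 0xFFFFFFFF


def rotl(x: int, n: int) -> int:
    """Rotate the 32-bit value x left by n bits (n taken modulo 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def cp32(data: bytes, table) -> int:
    """Non-rolling CP32 over `data`; `table` plays the role of G.

    Assumption: the byte at index i is rotated by len(data) - i - 1 bits,
    so the last byte of the window is not rotated at all.
    """
    size = len(data)
    h = 0
    for i, byte in enumerate(data):
        h ^= rotl(table[byte], size - i - 1)
    return h


# Stand-in table for demonstration only; NOT the sequence G from the spec.
DEMO_TABLE = [((n + 1) * 2654435761) & MASK32 for n in range(256)]

digest = cp32(b"hello, world", DEMO_TABLE)
```

Computing the hash this way costs one table lookup and one rotation per byte of the window, which is why the rolling formulation matters in practice.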

### Implementation

#### Rolling

$\operatorname{CP32}$ can be computed in a rolling fashion; for
sequences

$X = \langle X_0, \dots, X_N \rangle$

and

$Y = \langle X_1, \dots, X_N, y \rangle$

given $\operatorname{CP32}(X)$, $X_0$ and $y$, we can compute
$\operatorname{CP32}(Y)$ as:

$\operatorname{CP32}(Y) = \operatorname{ROT}_L(\operatorname{CP32}(X),
1) \oplus \operatorname{ROT}_L(g(X_0), |X|) \oplus g(y)$.
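
The rolling update can be checked against the direct definition with a short Python sketch. The names `roll`, `cp32`, `rotl` and `DEMO_TABLE` are hypothetical, and `DEMO_TABLE` is a stand-in rather than the sequence $G$. As an assumption, the direct computation rotates $g(X_i)$ by $|X| - i - 1$ bits, which is the exponent under which the update formula above holds, and rotation amounts are reduced modulo 32.

```python
MASK32 = 0xFFFFFFFF


def rotl(x: int, n: int) -> int:
    """Rotate the 32-bit value x left by n bits (n taken modulo 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def cp32(data: bytes, table) -> int:
    """Direct (non-rolling) CP32, assuming exponent len(data) - i - 1."""
    size = len(data)
    h = 0
    for i, byte in enumerate(data):
        h ^= rotl(table[byte], size - i - 1)
    return h


def roll(h: int, out_byte: int, in_byte: int, size: int, table) -> int:
    """Slide a window of `size` bytes forward by one byte.

    h is CP32 of the old window, out_byte is its first byte (X_0),
    and in_byte is the byte entering the window (y).
    """
    return rotl(h, 1) ^ rotl(table[out_byte], size) ^ table[in_byte]


# Stand-in table for demonstration only; NOT the sequence G from the spec.
DEMO_TABLE = [((n + 1) * 2654435761) & MASK32 for n in range(256)]

# Roll a 7-byte window across some data, comparing with the direct hash.
data = bytes(range(40))
size = 7
h = cp32(data[:size], DEMO_TABLE)
for i in range(size, len(data)):
    h = roll(h, data[i - size], data[i], size, DEMO_TABLE)
    assert h == cp32(data[i - size + 1 : i + 1], DEMO_TABLE)
```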

Note that the splitting algorithm only computes hashes on sequences of
size $W = 64$. Since 64 is a multiple of 32, rotating a 32-bit value by
$|X| = 64$ bits leaves it unchanged, so for the purposes of splitting the
above can be simplified to:

$\operatorname{CP32}(Y) = \operatorname{ROT}_L(\operatorname{CP32}(X),
1) \oplus g(X_0) \oplus g(y)$.
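
A small Python sketch of the simplification, assuming rotation amounts are reduced modulo 32. The names `roll_general`, `roll_w64`, `rotl` and `DEMO_TABLE` are hypothetical, and `DEMO_TABLE` is a stand-in rather than the sequence $G$.

```python
MASK32 = 0xFFFFFFFF
W = 64  # window size used by the splitting algorithm


def rotl(x: int, n: int) -> int:
    """Rotate the 32-bit value x left by n bits (n taken modulo 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def roll_general(h: int, out_byte: int, in_byte: int, size: int, table) -> int:
    """General rolling update for a window of `size` bytes."""
    return rotl(h, 1) ^ rotl(table[out_byte], size) ^ table[in_byte]


def roll_w64(h: int, out_byte: int, in_byte: int, table) -> int:
    """Rolling update specialised to W = 64: the outgoing byte's table
    entry needs no rotation, because 64 is a multiple of 32."""
    return rotl(h, 1) ^ table[out_byte] ^ table[in_byte]


# Stand-in table for demonstration only; NOT the sequence G from the spec.
DEMO_TABLE = [((n + 1) * 2654435761) & MASK32 for n in range(256)]

# Rotating by 64 bits is the identity on 32-bit values ...
assert rotl(0xDEADBEEF, W) == 0xDEADBEEF
# ... so both forms of the update agree when the window is 64 bytes.
assert roll_w64(0x12345678, 3, 200, DEMO_TABLE) == roll_general(
    0x12345678, 3, 200, W, DEMO_TABLE
)
```

The specialised form saves one rotation per input byte, at the cost of hard-coding the window size.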

## The RRS Rolling Checksums

The `rrs` family of checksums is based on an algorithm first used