Commit 9f1390f

Merge remote-tracking branch 'origin/master' into cp32
2 parents 0bb9d97 + a9d62bd commit 9f1390f

1 file changed

+139
-55
lines changed

spec.md

Lines changed: 139 additions & 55 deletions
@@ -71,10 +71,12 @@ We also use the following operators and functions:
7171
i.e. if $X = \langle X_0, \dots, X_N \rangle$ and $Y = \langle Y_0,
7272
\dots, Y_M \rangle$ then $X \mathbin{\|} Y = \langle X_0, \dots, X_N, Y_0, \dots, Y_M
7373
\rangle$
74-
- $\operatorname{min}(x, y)$ denotes the minimum of $x$ and $y$.
74+
- $\min(x, y)$ denotes the minimum of $x$ and $y$ and $\max(x, y)$
75+
denotes the maximum.
7576
- $\operatorname{ROT}_L(x, n)$ denotes the rotation of $x$ to the left
7677
by $n$ bits, i.e. $\operatorname{ROT}_L(x, n) = (x \ll n) \vee (x \gg
7778
(32 - n))$
79+
- $\operatorname{Type}(x)$ denotes the type of $x$.
7880

7981
We use standard mathematical notation for summation. For example:
8082

@@ -88,14 +90,43 @@ $\bigoplus_{i = 0}^{n} i$
8890

8991
denotes the bitwise exclusive or of the integers in $[0, n]$.
9092

93+
Finally, we define the “prefix” $\mathbb{P}_q(X)$
94+
of a non-empty sequence $X$
95+
with respect to a given predicate $q$
96+
to be the initial subsequence $X^\prime$ of $X$
97+
up to and including the first member that makes $q(X^\prime)$ true.
98+
And we define the “remainder” $\mathbb{R}_q(X)$
99+
to be everything left after removing the prefix.
100+
101+
Formally,
102+
given a sequence $X = \langle X_0, \dots, X_{|X|-1} \rangle$
103+
and a predicate $q \in \operatorname{Type}(X) \rightarrow \{\text{true},\text{false}\}$,
104+
105+
$\mathbb{P}_q(X) = \langle X_0, \dots, X_e \rangle$
106+
107+
for the smallest integer $e$ such that:
108+
109+
- $0 \le e < |X|$ and
110+
- $q(\langle X_0, \dots, X_e \rangle) = \text{true}$
111+
112+
or $|X|-1$ if no such integer exists.
113+
(I.e., if nothing satisfies $q$, the prefix is the whole sequence.)
114+
And:
115+
116+
$\mathbb{R}_q(X) = \langle X_b, \dots, X_{|X|-1} \rangle$
117+
118+
where $b = |\mathbb{P}_q(\langle X_0, \dots, X_{|X|-1} \rangle)|$.
119+
120+
Note that when $\mathbb{P}_q(X) = X$, $\mathbb{R}_q(X) = \langle \rangle$.
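The prefix and remainder definitions above can be sketched in Python as follows. This is an illustrative reading, not part of the spec: the names `prefix` and `remainder` are hypothetical, and sequences are modelled as Python lists (or any sliceable sequence).

```python
from typing import Callable, Sequence, TypeVar

S = TypeVar("S", bound=Sequence)


def prefix(q: Callable[[S], bool], xs: S) -> S:
    """P_q(X): the shortest initial subsequence X' with q(X') true.

    If no initial subsequence satisfies q, the prefix is all of X.
    """
    for e in range(len(xs)):
        candidate = xs[: e + 1]  # <X_0, ..., X_e>
        if q(candidate):
            return candidate
    return xs


def remainder(q: Callable[[S], bool], xs: S) -> S:
    """R_q(X): everything left after removing P_q(X)."""
    b = len(prefix(q, xs))
    return xs[b:]
```

For example, with a predicate that is true when the last member is divisible by 3, `prefix` applied to `[1, 2, 3, 4, 5]` gives `[1, 2, 3]` and `remainder` gives `[4, 5]`.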
121+
91122
# Splitting
92123

93124
The primary result of this specification is to define a family of
94125
functions:
95126

96127
$\operatorname{SPLIT}_C \in V_8 \rightarrow V_v$
97128

98-
...which is parameterized by a configuration $C$, consisting of:
129+
...which is parameterized by a _configuration_ $C$, consisting of:
99130

100131
- $S_{\text{min}} \in U_{32}$, the minimum split size
101132
- $S_{\text{max}} \in U_{32}$, the maximum split size
@@ -108,24 +139,21 @@ The configuration must satisfy $S_{\text{max}} \ge S_{\text{min}} > 0$.
108139

109140
We define the constant $W$, which we call the "window size," to be 64.
110141

111-
The "split index" $I(X)$ of a sequence $X$ is either the smallest
112-
non-negative integer $i$ satisfying:
142+
We define the predicate $q_C(X)$
143+
on a non-empty byte sequence $X$
144+
with respect to a configuration $C$
145+
to be:
113146

114-
- $i \le |X|$ and
115-
- $S_{\text{max}} \ge i \ge S_{\text{min}}$ and
116-
- $H(\langle X_{i-W}, \dots, X_{i-1} \rangle) \mod 2^T = 0$
117-
118-
...or $\operatorname{min}(|X|, S_{\text{max}})$, if no such $i$ exists. For the
119-
purposes of this definition we set $X_i = 0$ for $i < 0$.
120-
121-
The “prefix” $P(X)$ of a non-empty sequence $X$ is $\langle X_0, \dots, X_{I(X)-1} \rangle$.
122-
123-
The “remainder” $R(X)$ of a non-empty sequence $X$ is $\langle X_{I(X)}, \dots, X_{|X|-1} \rangle$.
147+
- $\text{true}$ if $|X| = S_{\text{max}}$; otherwise
148+
- $\text{true}$ if $|X| \ge S_{\text{min}}$ and $H(\langle X_{\max(0,|X|-W)}, \dots, X_{|X|-1} \rangle) \mod 2^T = 0$
149+
(i.e., the last $W$ bytes of $X$ hash to a value with at least $T$ trailing zeroes);
150+
otherwise
151+
- $\text{false}$.
124152

125153
We define $\operatorname{SPLIT}_C(X)$ recursively, as follows:
126154

127155
- If $|X| = 0$, $\operatorname{SPLIT}_C(X) = \langle \rangle$
128-
- Otherwise, $\operatorname{SPLIT}_C(X) = \langle P(X) \rangle \mathbin{\|} \operatorname{SPLIT}_C(R(X))$
156+
- Otherwise, $\operatorname{SPLIT}_C(X) = \langle \mathbb{P}_{q_C}(X) \rangle \mathbin{\|} \operatorname{SPLIT}_C(\mathbb{R}_{q_C}(X))$
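A minimal Python sketch of $q_C$ and $\operatorname{SPLIT}_C$ as defined above, written iteratively rather than recursively. It makes one loud assumption: the hash $H$ is defined elsewhere in the spec, so `zlib.crc32` is used here purely as a placeholder and will not reproduce the spec's split points. The function names and parameter names are hypothetical. The inner loop re-slices the input for clarity and is not efficient.

```python
import zlib
from typing import Callable, List

W = 64  # window size fixed by the spec


def make_q(s_min: int, s_max: int, t: int,
           h: Callable[[bytes], int] = zlib.crc32) -> Callable[[bytes], bool]:
    """Build the predicate q_C. `h` is a PLACEHOLDER for H, not the spec's hash."""
    def q(x: bytes) -> bool:
        if len(x) == s_max:
            return True
        window = x[max(0, len(x) - W):]  # the last W bytes of x
        return len(x) >= s_min and h(window) % (1 << t) == 0
    return q


def split(x: bytes, s_min: int, s_max: int, t: int,
          h: Callable[[bytes], int] = zlib.crc32) -> List[bytes]:
    """SPLIT_C(X): repeatedly peel off the prefix under q_C."""
    q = make_q(s_min, s_max, t, h)
    chunks = []
    while x:
        end = len(x)  # if no prefix satisfies q, the chunk is all of x
        for e in range(1, len(x) + 1):
            if q(x[:e]):
                end = e
                break
        chunks.append(x[:end])
        x = x[end:]
    return chunks
```

Whatever hash is plugged in, concatenating the chunks yields the input, no chunk is longer than $S_{\text{max}}$, and only the final chunk can be shorter than $S_{\text{min}}$.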
129157

130158
# Tree Construction
131159

@@ -154,63 +182,119 @@ will differ only in the subtrees in the vicinity of the differences.
154182

155183
## Definitions
156184

157-
The “hashval” $V(X)$ of a sequence $X$ is:
185+
A “chunk” is a member of the sequence produced by $\operatorname{SPLIT}_C$.
186+
187+
The “hashval” $V_C(X)$ of a byte sequence $X$ is:
158188

159-
$H(\langle X_{\operatorname{max}(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$
189+
$H(\langle X_{\max(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$
160190

161191
(i.e., the hash of the last $W$ bytes of $X$).
162192

163-
The “level” $L(X)$ of a sequence $X$ is $Q - T$,
193+
A “node” $N_{h,i}$ in a hashsplit tree
194+
at non-negative “height” $h$
195+
is a sequence of children.
196+
The children of a node at height 0 are chunks.
197+
The children of a node at height $h+1$ are nodes at height $h$.
198+
199+
A “tier” of a hashsplit tree is a sequence of nodes
200+
$N_h = \langle N_{h,0}, \dots, N_{h,k} \rangle$
201+
at a given height $h$.
202+
203+
The function $\operatorname{Rightmost}(N_{h,i})$
204+
on a node $N_{h,i} = \langle S_0, \dots, S_e \rangle$
205+
produces the “rightmost leaf chunk”
206+
defined recursively as follows:
207+
208+
- If $h = 0$, $\operatorname{Rightmost}(N_{h,i}) = S_e$
209+
- If $h > 0$, $\operatorname{Rightmost}(N_{h,i}) = \operatorname{Rightmost}(S_e)$
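As an illustration, $\operatorname{Rightmost}$ can be sketched in Python with nodes modelled as nested lists. The representation and the explicit `height` argument are assumptions of this sketch; the spec carries the height in the subscript of $N_{h,i}$.

```python
from typing import Any, List


def rightmost(node: List[Any], height: int) -> Any:
    """Return the rightmost leaf chunk beneath a non-empty node.

    At height 0 the children are chunks; above that they are nodes,
    so we follow the last child down one tier at a time.
    """
    while height > 0:
        node = node[-1]
        height -= 1
    return node[-1]
```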
210+
211+
The “level” $L_C(X)$ of a given chunk $X$
212+
is $\max(0, Q - T)$,
164213
where $Q$ is the largest integer such that
165214

166215
- $Q \le 32$ and
167-
- $V(P(X)) \mod 2^Q = 0$
168-
169-
(i.e., the level is the number of trailing zeroes in the rolling checksum in excess of the threshold needed to produce the prefix chunk $P(X)$).
170-
171-
(Note:
172-
When $|R(X)| > 0$,
173-
$L(X)$ is non-negative,
174-
because $P(X)$ is defined in terms of a hash with $T$ trailing zeroes.
175-
But when $|R(X)| = 0$,
176-
that hash may have fewer than $T$ trailing zeroes,
177-
and so $L(X)$ may be negative.
178-
This makes no difference to the algorithm below, however.)
179-
180-
A “node” in a hashsplit tree
181-
is a pair $(D, C)$
182-
where $D$ is the node’s “depth”
183-
and $C$ is a sequence of children.
184-
The children of a node at depth 0 are chunks
185-
(i.e., subsequences of the input).
186-
The children of a node at depth $D > 0$ are nodes at depth $D - 1$.
187-
188-
The function $\operatorname{Children}(N)$ on a node $N = (D, C)$ produces $C$
189-
(the sequence of children).
216+
- $V_C(\mathbb{P}_{q_C}(X)) \mod 2^Q = 0$
217+
218+
(i.e., the level is the number of trailing zeroes in the hashval
219+
in excess of the threshold needed
220+
to produce the prefix chunk $\mathbb{P}_{q_C}(X)$).
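The chunk level can be sketched in Python as below. Two assumptions apply. First, `zlib.crc32` again stands in for $H$, which the spec defines elsewhere. Second, the argument is assumed to be a chunk produced by $\operatorname{SPLIT}_C$ with the same hash, so that $\mathbb{P}_{q_C}(X) = X$ and the hashval is taken over the chunk's own last $W$ bytes. The names `trailing_zeroes` and `level` are hypothetical.

```python
import zlib
from typing import Callable

W = 64  # window size fixed by the spec


def trailing_zeroes(v: int, cap: int = 32) -> int:
    """Largest Q <= cap such that v mod 2^Q == 0."""
    if v == 0:
        return cap
    return min(cap, (v & -v).bit_length() - 1)


def level(chunk: bytes, t: int,
          h: Callable[[bytes], int] = zlib.crc32) -> int:
    """L_C(X) = max(0, Q - T). `h` is a PLACEHOLDER for the spec's H."""
    hashval = h(chunk[max(0, len(chunk) - W):])  # V_C: hash of the last W bytes
    return max(0, trailing_zeroes(hashval) - t)
```

The clamp to zero covers chunks that were cut by $S_{\text{max}}$ or by the end of the input, whose hashval may have fewer than $T$ trailing zeroes.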
221+
222+
The level $L_C(N)$ of a given _node_ $N$
223+
is the level of its rightmost leaf chunk:
224+
$L_C(N) = L_C(\operatorname{Rightmost}(N))$
225+
226+
The predicate $z_{C,h}(K)$
227+
on a sequence $K = \langle K_0, \dots, K_e \rangle$
228+
of chunks or of nodes
229+
with respect to a height $h$
230+
is defined as:
231+
232+
- $\text{true}$ if $L_C(K_e) > h$; otherwise
233+
- $\text{false}$.
234+
235+
For conciseness, define
236+
237+
- $P_C(X) = \mathbb{P}_{z_{C,0}}(\operatorname{SPLIT}_C(X))$ and
238+
- $R_C(X) = \mathbb{R}_{z_{C,0}}(\operatorname{SPLIT}_C(X))$
190239

191240
## Algorithm
192241

193-
To compute a hashsplit tree from sequence $X$,
242+
This section contains two descriptions of hashsplit trees:
243+
an algebraic description for formal reasoning,
244+
and a procedural description for practical construction.
245+
246+
### Algebraic description
247+
248+
The tier $N_0$
249+
of hashsplit tree nodes
250+
for a given byte sequence $X$
251+
is equal to
252+
253+
$\langle P_C(X) \rangle \mathbin{\|} R_C(X)$
254+
255+
The tier $N_{h+1}$
256+
of hashsplit tree nodes
257+
for a given byte sequence $X$
258+
is equal to
259+
260+
$\langle \mathbb{P}_{z_{C,h+1}}(N_h) \rangle \mathbin{\|} \mathbb{R}_{z_{C,h+1}}(N_h)$
261+
262+
(I.e., each node in the tree has as its children
263+
a sequence of chunks or lower-tier nodes,
264+
as appropriate,
265+
up to and including the first one
266+
whose “level” is greater than the node’s height.)
267+
268+
The root of the hashsplit tree is $N_{h^\prime,0}$
269+
for the smallest value of $h^\prime$
270+
such that $|N_{h^\prime}| = 1$.
271+
272+
### Procedural description
273+
274+
For this description we use $N_h$ to denote a single node at height $h$.
275+
The algorithm must keep track of the “rightmost” such node for each tier in the tree.
276+
277+
To compute a hashsplit tree from a byte sequence $X$,
194278
compute its “root node” as follows.
195279

196-
1. Let $N_0$ be $(0, \langle\rangle)$ (i.e., a node at depth 0 with no children).
280+
1. Let $N_0$ be $\langle\rangle$ (i.e., a node at height 0 with no children).
197281
2. If $|X| = 0$, then:
198-
a. Let $d$ be the largest depth such that $N_d$ exists.
199-
b. If $|\operatorname{Children}(N_0)| > 0$, then:
200-
i. For each integer $i$ in $[0 .. d]$, “close” $N_i$.
201-
ii. Set $d \leftarrow d+1$.
202-
c. [pruning] While $d > 0$ and $|\operatorname{Children}(N_d)| = 1$, set $d \leftarrow d-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child).
203-
d. **Terminate** with $N_d$ as the root node.
204-
3. Otherwise, set $N_0 \leftarrow (0, \operatorname{Children}(N_0) \mathbin{\|} \langle P(X) \rangle)$ (i.e., add $P(X)$ to the list of children in $N_0$).
205-
4. For each integer $i$ in $[0 .. L(X))$, “close” the node $N_i$ (see below).
206-
5. Set $X \leftarrow R(X)$.
282+
a. Let $h$ be the largest height such that $N_h$ exists.
283+
b. If $|N_0| > 0$, then:
284+
i. For each integer $i$ in $[0 .. h]$, “close” $N_i$ (see below).
285+
ii. Set $h \leftarrow h+1$.
286+
c. [pruning] While $h > 0$ and $|N_h| = 1$, set $h \leftarrow h-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child).
287+
d. **Terminate** with $N_h$ as the root node.
288+
3. Otherwise, set $N_0 \leftarrow N_0 \mathbin{\|} \langle P_C(X) \rangle$ (i.e., add $P_C(X)$ to the list of children in $N_0$).
289+
4. For each integer $i$ in $[0 .. L_C(X))$, “close” the node $N_i$ (see below).
290+
5. Set $X \leftarrow R_C(X)$.
207291
6. Go to step 2.
208292

209293
To “close” a node $N_i$:
210294

211-
1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $(i+1, \langle\rangle)$ (i.e., a node at depth ${i + 1}$ with no children).
212-
2. Set $N_{i+1} \leftarrow (i+1, \operatorname{Children}(N_{i+1}) \mathbin{\|} \langle N_i \rangle)$ (i.e., add $N_i$ as a child to $N_{i+1}$).
213-
3. Let $N_i$ be $(i, \langle\rangle)$ (i.e., new node at depth $i$ with no children).
295+
1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $\langle\rangle$ (i.e., a node at height ${i + 1}$ with no children).
296+
2. Set $N_{i+1} \leftarrow N_{i+1} \mathbin{\|} \langle N_i \rangle$ (i.e., add $N_i$ as a child to $N_{i+1}$).
297+
3. Let $N_i$ be $\langle\rangle$ (i.e., new node at height $i$ with no children).
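The procedure above can be sketched in Python as follows. This is one reading of the steps, under stated assumptions, not a definitive implementation. The input is a list of `(chunk, level)` pairs, assumed to have been computed already. Nodes are plain nested lists. End-of-input handling is an interpretation: every non-empty open node is attached to its parent, and pruning descends into the only child of a single-child root. The name `build_tree` is hypothetical.

```python
from typing import Any, Iterable, List, Tuple


def build_tree(chunks_with_levels: Iterable[Tuple[bytes, int]]) -> List[Any]:
    """Return the root node of a hashsplit tree as nested lists."""
    tiers: List[List[Any]] = [[]]  # tiers[i] is the open (rightmost) node N_i

    def close(i: int) -> None:
        if i + 1 == len(tiers):
            tiers.append([])          # create N_{i+1} on demand
        tiers[i + 1].append(tiers[i])  # add N_i as a child of N_{i+1}
        tiers[i] = []                  # start a fresh node at height i

    for chunk, level in chunks_with_levels:
        tiers[0].append(chunk)
        for i in range(level):         # close N_0 .. N_{level-1}
            close(i)

    # End of input (interpretation): attach each non-empty open node upward.
    for i in range(len(tiers) - 1):
        if tiers[i]:
            tiers[i + 1].append(tiers[i])

    # Pruning (interpretation): skip would-be roots that have a single child.
    root, height = tiers[-1], len(tiers) - 1
    while height > 0 and len(root) == 1:
        root, height = root[0], height - 1
    return root
```

For instance, the input `[(b"a", 0), (b"b", 1), (b"c", 0), (b"d", 0)]` yields a root with two height-0 children, `[b"a", b"b"]` and `[b"c", b"d"]`.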
214298

215299
# Rolling Hash Functions
216300
