
Commit a9d62bd

Merge pull request #23 from bobg/algebraic-tree
Algebraic tree description
2 parents c3c03d6 + 40da42b commit a9d62bd

File tree

1 file changed

+138
-55
lines changed

spec.md

+138-55
Original file line number | Diff line number | Diff line change
@@ -65,7 +65,37 @@ We also use the following operators and functions:
6565
i.e. if $X = \langle X_0, \dots, X_N \rangle$ and $Y = \langle Y_0,
6666
\dots, Y_M \rangle$ then $X \mathbin{\|} Y = \langle X_0, \dots, X_N, Y_0, \dots, Y_M
6767
\rangle$
68-
- $\operatorname{min}(x, y)$ denotes the minimum of $x$ and $y$.
68+
- $\min(x, y)$ denotes the minimum of $x$ and $y$, and $\max(x, y)$ denotes their maximum.
69+
- $\operatorname{Type}(x)$ denotes the type of $x$.
70+
71+
Finally, we define the “prefix” $\mathbb{P}_q(X)$
72+
of a non-empty sequence $X$
73+
with respect to a given predicate $q$
74+
to be the initial subsequence $X^\prime$ of $X$
75+
up to and including the first member that makes $q(X^\prime)$ true.
76+
And we define the “remainder” $\mathbb{R}_q(X)$
77+
to be everything left after removing the prefix.
78+
79+
Formally,
80+
given a sequence $X = \langle X_0, \dots, X_{|X|-1} \rangle$
81+
and a predicate $q \in \operatorname{Type}(X) \rightarrow \{\text{true},\text{false}\}$,
82+
83+
$\mathbb{P}_q(X) = \langle X_0, \dots, X_e \rangle$
84+
85+
for the smallest integer $e$ such that:
86+
87+
- $0 \le e < |X|$ and
88+
- $q(\langle X_0, \dots, X_e \rangle) = \text{true}$
89+
90+
or $e = |X|-1$ if no such integer exists.
91+
(I.e., if nothing satisfies $q$, the prefix is the whole sequence.)
92+
And:
93+
94+
$\mathbb{R}_q(X) = \langle X_b, \dots, X_{|X|-1} \rangle$
95+
96+
where $b = |\mathbb{P}_q(\langle X_0, \dots, X_{|X|-1} \rangle)|$.
97+
98+
Note that when $\mathbb{P}_q(X) = X$, $\mathbb{R}_q(X) = \langle \rangle$.
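
As an illustration only, the prefix and remainder definitions can be sketched in Python. The names `prefix` and `remainder` are hypothetical and not part of the spec; sequences are modeled as Python lists (or any sliceable sequence) and the predicate $q$ as a callable that receives the candidate initial subsequence.

```python
def prefix(q, xs):
    """Sketch of P_q(X): the shortest initial subsequence of xs that
    satisfies q, or all of xs if no initial subsequence does."""
    for e in range(len(xs)):
        candidate = xs[: e + 1]  # <X_0, ..., X_e>
        if q(candidate):
            return candidate
    return xs


def remainder(q, xs):
    """Sketch of R_q(X): everything left after removing the prefix."""
    return xs[len(prefix(q, xs)):]


# Example predicate: true once the last member of the candidate equals 3.
ends_in_three = lambda s: s[-1] == 3

print(prefix(ends_in_three, [1, 2, 3, 4, 5]))     # [1, 2, 3]
print(remainder(ends_in_three, [1, 2, 3, 4, 5]))  # [4, 5]
```

When no initial subsequence satisfies the predicate, `prefix` returns the whole sequence and `remainder` returns an empty one, matching the note above.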
6999

70100
# Splitting
71101

@@ -74,7 +104,7 @@ functions:
74104

75105
$\operatorname{SPLIT}_C \in V_8 \rightarrow V_v$
76106

77-
...which is parameterized by a configuration $C$, consisting of:
107+
...which is parameterized by a _configuration_ $C$, consisting of:
78108

79109
- $S_{\text{min}} \in U_{32}$, the minimum split size
80110
- $S_{\text{max}} \in U_{32}$, the maximum split size
@@ -87,24 +117,21 @@ The configuration must satisfy $S_{\text{max}} \ge S_{\text{min}} > 0$.
87117

88118
We define the constant $W$, which we call the "window size," to be 64.
89119

90-
The "split index" $I(X)$ of a sequence $X$ is either the smallest
91-
non-negative integer $i$ satisfying:
120+
We define the predicate $q_C(X)$
121+
on a non-empty byte sequence $X$
122+
with respect to a configuration $C$
123+
to be:
92124

93-
- $i \le |X|$ and
94-
- $S_{\text{max}} \ge i \ge S_{\text{min}}$ and
95-
- $H(\langle X_{i-W}, \dots, X_{i-1} \rangle) \mod 2^T = 0$
96-
97-
...or $\operatorname{min}(|X|, S_{\text{max}})$, if no such $i$ exists. For the
98-
purposes of this definition we set $X_i = 0$ for $i < 0$.
99-
100-
The “prefix” $P(X)$ of a non-empty sequence $X$ is $\langle X_0, \dots, X_{I(X)-1} \rangle$.
101-
102-
The “remainder” $R(X)$ of a non-empty sequence $X$ is $\langle X_{I(X)}, \dots, X_{|X|-1} \rangle$.
125+
- $\text{true}$ if $|X| = S_{\text{max}}$; otherwise
126+
- $\text{true}$ if $|X| \ge S_{\text{min}}$ and $H(\langle X_{\max(0,|X|-W)}, \dots, X_{|X|-1} \rangle) \mod 2^T = 0$
127+
(i.e., the last $W$ bytes of $X$ hash to a value with at least $T$ trailing zeroes);
128+
otherwise
129+
- $\text{false}$.
103130

104131
We define $\operatorname{SPLIT}_C(X)$ recursively, as follows:
105132

106133
- If $|X| = 0$, $\operatorname{SPLIT}_C(X) = \langle \rangle$
107-
- Otherwise, $\operatorname{SPLIT}_C(X) = \langle P(X) \rangle \mathbin{\|} \operatorname{SPLIT}_C(R(X))$
134+
- Otherwise, $\operatorname{SPLIT}_C(X) = \langle \mathbb{P}_{q_C}(X) \rangle \mathbin{\|} \operatorname{SPLIT}_C(\mathbb{R}_{q_C}(X))$
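
The predicate $q_C$ and the recursive definition of $\operatorname{SPLIT}_C$ can be sketched as follows. This is a rough sketch under stated assumptions: `zlib.crc32` stands in for the hash function $H$ (it is not a rolling hash and is not the function this spec prescribes), the names `make_q` and `split` are hypothetical, and the recursion is unrolled into a loop.

```python
import zlib

W = 64  # window size, as defined above


def make_q(s_min, s_max, t, h=zlib.crc32):
    """Build q_C for a configuration (S_min, S_max, T).
    ASSUMPTION: h is a stand-in for H, not the spec's hash function."""
    def q(x):
        if len(x) == s_max:
            return True
        window = x[max(0, len(x) - W):]  # the last W bytes of x
        return len(x) >= s_min and h(window) % (1 << t) == 0
    return q


def split(x, q):
    """Sketch of SPLIT_C(X): repeatedly peel off the prefix P_q(X)."""
    chunks = []
    while x:
        # Smallest length whose initial subsequence satisfies q,
        # or the whole remaining input if none does.
        end = next((i for i in range(1, len(x) + 1) if q(x[:i])), len(x))
        chunks.append(x[:end])
        x = x[end:]
    return chunks


data = bytes(range(256)) * 4
chunks = split(data, make_q(s_min=8, s_max=64, t=3))
assert b"".join(chunks) == data  # the chunks partition the input
```

This sketch rehashes the window for every candidate length, which is quadratic in the chunk size; a rolling hash avoids that cost.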
108135

109136
# Tree Construction
110137

@@ -133,63 +160,119 @@ will differ only in the subtrees in the vicinity of the differences.
133160

134161
## Definitions
135162

136-
The “hashval” $V(X)$ of a sequence $X$ is:
163+
A “chunk” is a member of the sequence produced by $\operatorname{SPLIT}_C$.
164+
165+
The “hashval” $V_C(X)$ of a byte sequence $X$ is:
137166

138-
$H(\langle X_{\operatorname{max}(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$
167+
$H(\langle X_{\max(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$
139168

140169
(i.e., the hash of the last $W$ bytes of $X$).
141170

142-
The “level” $L(X)$ of a sequence $X$ is $Q - T$,
171+
A “node” $N_{h,i}$ in a hashsplit tree
172+
at non-negative “height” $h$
173+
is a sequence of children.
174+
The children of a node at height 0 are chunks.
175+
The children of a node at height $h+1$ are nodes at height $h$.
176+
177+
A “tier” of a hashsplit tree is a sequence of nodes
178+
$N_h = \langle N_{h,0}, \dots, N_{h,k} \rangle$
179+
at a given height $h$.
180+
181+
The function $\operatorname{Rightmost}(N_{h,i})$
182+
on a node $N_{h,i} = \langle S_0, \dots, S_e \rangle$
183+
produces the “rightmost leaf chunk”
184+
defined recursively as follows:
185+
186+
- If $h = 0$, $\operatorname{Rightmost}(N_{h,i}) = S_e$
187+
- If $h > 0$, $\operatorname{Rightmost}(N_{h,i}) = \operatorname{Rightmost}(S_e)$
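
A minimal sketch of $\operatorname{Rightmost}$, assuming nodes are modeled as non-empty Python lists and chunks as `bytes`. The function name `rightmost` and the explicit `height` argument are illustrative choices, not part of the spec.

```python
def rightmost(node, height):
    """Return the rightmost leaf chunk beneath a node at the given height."""
    last = node[-1]  # S_e, the last child
    if height == 0:
        return last  # children of a height-0 node are chunks
    return rightmost(last, height - 1)


leaf_a = [b"ab", b"cd"]   # a node at height 0
leaf_b = [b"ef"]          # another node at height 0
parent = [leaf_a, leaf_b]  # a node at height 1

print(rightmost(leaf_a, 0))  # b'cd'
print(rightmost(parent, 1))  # b'ef'
```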
188+
189+
The “level” $L_C(X)$ of a given chunk $X$
190+
is $\max(0, Q - T)$,
143191
where $Q$ is the largest integer such that
144192

145193
- $Q \le 32$ and
146-
- $V(P(X)) \mod 2^Q = 0$
147-
148-
(i.e., the level is the number of trailing zeroes in the rolling checksum in excess of the threshold needed to produce the prefix chunk $P(X)$).
149-
150-
(Note:
151-
When $|R(X)| > 0$,
152-
$L(X)$ is non-negative,
153-
because $P(X)$ is defined in terms of a hash with $T$ trailing zeroes.
154-
But when $|R(X)| = 0$,
155-
that hash may have fewer than $T$ trailing zeroes,
156-
and so $L(X)$ may be negative.
157-
This makes no difference to the algorithm below, however.)
158-
159-
A “node” in a hashsplit tree
160-
is a pair $(D, C)$
161-
where $D$ is the node’s “depth”
162-
and $C$ is a sequence of children.
163-
The children of a node at depth 0 are chunks
164-
(i.e., subsequences of the input).
165-
The children of a node at depth $D > 0$ are nodes at depth $D - 1$.
166-
167-
The function $\operatorname{Children}(N)$ on a node $N = (D, C)$ produces $C$
168-
(the sequence of children).
194+
- $V_C(\mathbb{P}_{q_C}(X)) \mod 2^Q = 0$
195+
196+
(i.e., the level is the number of trailing zeroes in the hashval
197+
in excess of the threshold needed
198+
to produce the prefix chunk $\mathbb{P}_{q_C}(X)$).
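
The level computation can be sketched as follows, assuming the chunk's hashval has already been computed and is passed in as a non-negative integer. The names `trailing_zeros` and `level` are hypothetical.

```python
def trailing_zeros(value, cap=32):
    """Largest Q <= cap such that value mod 2^Q == 0."""
    q = 0
    while q < cap and value % (1 << (q + 1)) == 0:
        q += 1
    return q


def level(hashval, t):
    """Sketch of L_C: trailing zeroes in excess of the threshold T,
    clamped at zero."""
    return max(0, trailing_zeros(hashval) - t)


print(level(0b101000, 2))  # 1: three trailing zeroes, threshold 2
print(level(0b1, 2))       # 0: clamped, fewer zeroes than the threshold
```

A hashval of zero is divisible by every power of two, so the cap $Q \le 32$ is what bounds its level.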
199+
200+
The level $L_C(N)$ of a given _node_ $N$
201+
is the level of its rightmost leaf chunk:
202+
$L_C(N) = L_C(\operatorname{Rightmost}(N))$
203+
204+
The predicate $z_{C,h}(K)$
205+
on a sequence $K = \langle K_0, \dots, K_e \rangle$
206+
of chunks or of nodes
207+
with respect to a height $h$
208+
is defined as:
209+
210+
- $\text{true}$ if $L_C(K_e) > h$; otherwise
211+
- $\text{false}$.
212+
213+
For conciseness, define
214+
215+
- $P_C(X) = \mathbb{P}_{z_{C,0}}(\operatorname{SPLIT}_C(X))$ and
216+
- $R_C(X) = \mathbb{R}_{z_{C,0}}(\operatorname{SPLIT}_C(X))$
169217

170218
## Algorithm
171219

172-
To compute a hashsplit tree from sequence $X$,
220+
This section contains two descriptions of hashsplit trees:
221+
an algebraic description for formal reasoning,
222+
and a procedural description for practical construction.
223+
224+
### Algebraic description
225+
226+
The tier $N_0$
227+
of hashsplit tree nodes
228+
for a given byte sequence $X$
229+
is equal to
230+
231+
$\langle P_C(X) \rangle \mathbin{\|} R_C(X)$
232+
233+
The tier $N_{h+1}$
234+
of hashsplit tree nodes
235+
for a given byte sequence $X$
236+
is equal to
237+
238+
$\langle \mathbb{P}_{z_{C,h+1}}(N_h) \rangle \mathbin{\|} \mathbb{R}_{z_{C,h+1}}(N_h)$
239+
240+
(I.e., each node in the tree has as its children
241+
a sequence of chunks or lower-tier nodes,
242+
as appropriate,
243+
up to and including the first one
244+
whose “level” is greater than the node’s height.)
245+
246+
The root of the hashsplit tree is $N_{h^\prime,0}$
247+
for the smallest value of $h^\prime$
248+
such that $|N_{h^\prime}| = 1$.
249+
250+
### Procedural description
251+
252+
For this description we use $N_h$ to denote a single node at height $h$.
253+
The algorithm must keep track of the “rightmost” such node for each tier in the tree.
254+
255+
To compute a hashsplit tree from a byte sequence $X$,
173256
compute its “root node” as follows.
174257

175-
1. Let $N_0$ be $(0, \langle\rangle)$ (i.e., a node at depth 0 with no children).
258+
1. Let $N_0$ be $\langle\rangle$ (i.e., a node at height 0 with no children).
176259
2. If $|X| = 0$, then:
177-
a. Let $d$ be the largest depth such that $N_d$ exists.
178-
b. If $|\operatorname{Children}(N_0)| > 0$, then:
179-
i. For each integer $i$ in $[0 .. d]$, “close” $N_i$.
180-
ii. Set $d \leftarrow d+1$.
181-
c. [pruning] While $d > 0$ and $|\operatorname{Children}(N_d)| = 1$, set $d \leftarrow d-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child).
182-
d. **Terminate** with $N_d$ as the root node.
183-
3. Otherwise, set $N_0 \leftarrow (0, \operatorname{Children}(N_0) \mathbin{\|} \langle P(X) \rangle)$ (i.e., add $P(X)$ to the list of children in $N_0$).
184-
4. For each integer $i$ in $[0 .. L(X))$, “close” the node $N_i$ (see below).
185-
5. Set $X \leftarrow R(X)$.
260+
a. Let $h$ be the largest height such that $N_h$ exists.
261+
b. If $|N_0| > 0$, then:
262+
i. For each integer $i$ in $[0 .. h]$, “close” $N_i$ (see below).
263+
ii. Set $h \leftarrow h+1$.
264+
c. [pruning] While $h > 0$ and $|N_h| = 1$, set $h \leftarrow h-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child).
265+
d. **Terminate** with $N_h$ as the root node.
266+
3. Otherwise, set $N_0 \leftarrow N_0 \mathbin{\|} \langle P_C(X) \rangle$ (i.e., add $P_C(X)$ to the list of children in $N_0$).
267+
4. For each integer $i$ in $[0 .. L_C(X))$, “close” the node $N_i$ (see below).
268+
5. Set $X \leftarrow R_C(X)$.
186269
6. Go to step 2.
187270

188271
To “close” a node $N_i$:
189272

190-
1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $(i+1, \langle\rangle)$ (i.e., a node at depth ${i + 1}$ with no children).
191-
2. Set $N_{i+1} \leftarrow (i+1, \operatorname{Children}(N_{i+1}) \mathbin{\|} \langle N_i \rangle)$ (i.e., add $N_i$ as a child to $N_{i+1}$).
192-
3. Let $N_i$ be $(i, \langle\rangle)$ (i.e., new node at depth $i$ with no children).
273+
1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $\langle\rangle$ (i.e., a node at height ${i + 1}$ with no children).
274+
2. Set $N_{i+1} \leftarrow N_{i+1} \mathbin{\|} \langle N_i \rangle$ (i.e., add $N_i$ as a child to $N_{i+1}$).
275+
3. Let $N_i$ be $\langle\rangle$ (i.e., new node at height $i$ with no children).
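
The procedure above can be sketched in Python under several assumptions: the input has already been split, so the function consumes one `(chunk, level)` pair per iteration rather than a byte sequence; nodes are plain lists; and the pruning step is read as descending into a node's only child. The name `build_tree` is hypothetical.

```python
def build_tree(chunks_with_levels):
    """Return (root, height) for an iterable of (chunk, level) pairs."""
    nodes = [[]]  # nodes[h] plays the role of the rightmost node N_h

    def close(i):
        if i + 1 == len(nodes):
            nodes.append([])          # create N_{i+1} on demand
        nodes[i + 1].append(nodes[i])  # add N_i as a child of N_{i+1}
        nodes[i] = []                 # start a fresh node at height i

    for chunk, lvl in chunks_with_levels:
        nodes[0].append(chunk)
        for i in range(lvl):
            close(i)

    # Input exhausted: close any open nodes, then prune.
    h = len(nodes) - 1
    if nodes[0]:
        for i in range(h + 1):
            close(i)
        h += 1
    root = nodes[h]
    # ASSUMPTION: pruning descends into a lone child.
    while h > 0 and len(root) == 1:
        root, h = root[0], h - 1
    return root, h


pairs = [("a", 0), ("b", 0), ("c", 1), ("d", 0), ("e", 0)]
print(build_tree(pairs))  # ([['a', 'b', 'c'], ['d', 'e']], 1)
```

In the example, chunk `c` has level 1, so it closes the height-0 node holding `a`, `b`, `c`; the remaining chunks form a second height-0 node, and the root is the height-1 node containing both.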
193276

194277
# Rolling Hash Functions
195278
