From 8feac1b10b6fb6d61c66dbc0b030f87b6467699c Mon Sep 17 00:00:00 2001
From: Bob Glickstein
Date: Fri, 2 Oct 2020 08:25:54 -0700
Subject: [PATCH 1/9] WIP: algebraic description of hashsplit-tree construction.

---
 spec.md | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/spec.md b/spec.md
index 7416cb6..639e071 100644
--- a/spec.md
+++ b/spec.md
@@ -65,7 +65,7 @@ We also use the following operators and functions:
   i.e. if $X = \langle X_0, \dots, X_N \rangle$
   and $Y = \langle Y_0, \dots, Y_M \rangle$
   then $X \mathbin{\|} Y = \langle X_0, \dots, X_N, Y_0, \dots, Y_M \rangle$
-- $\operatorname{min}(x, y)$ denotes the minimum of $x$ and $y$.
+- $\min(x, y)$ denotes the minimum of $x$ and $y$.

 # Splitting

@@ -92,7 +92,7 @@ The "split index" $I(X)$ of a sequence $X$ is either the smallest integer $i$ sa
 - $S_{\text{max}} \ge i \ge S_{\text{min}}$ and
 - $H(\langle X_{i-W}, \dots, X_{i-1} \rangle) \mod 2^T = 0$

-...or $\operatorname{min}(|X|, S_{\text{max}})$, if no such $i$ exists.
+...or $\min(|X|, S_{\text{max}})$, if no such $i$ exists.

 The “prefix” $P(X)$ of a non-empty sequence $X$ is $\langle X_0, \dots, X_{I(X)-1} \rangle$.

@@ -130,9 +130,12 @@ will differ only in the subtrees in the vicinity of the differences.

 ## Definitions

+A “chunk” $K_{C,i}(X)$ is a member of the sequence produced by $\operatorname{SPLIT}_C(X)$.
+The index $i$ lies in the range $[0 .. |\operatorname{SPLIT}_C(X)|)$.
+
 The “hashval” $V(X)$ of a sequence $X$ is:

-$H(\langle X_{\operatorname{max}(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$
+$H(\langle X_{\max(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$

 (i.e., the hash of the last $W$ bytes of $X$).

@@ -166,6 +169,29 @@ The function $\operatorname{Children}(N)$ on a node $N = (D, C)$ produces $C$

 ## Algorithm

+### Algebraic description
+
+Let function $F_C(X, D, i)$ on a sequence $X$, a depth $D$, and an index $i$ produce the $i^\text{th}$ node at level $D$.
+
+$F_C(X, 0, 0) = (0, \langle K_{C,0}(X), \dots, K_{C,b}(X) \rangle)$
+where $b$ is the smallest integer such that $L(K_{C,b}(X)) > 0$,
+or $|SPLIT_C(X)|-1$ if no such integer exists.
+
+$F_C(X, 0, n+1) = (0, \langle K_{C,\sigma}, \dots, K_{C,b}(X) \rangle)$
+where:
+
+- $\sigma = \sum_{i=0}^{n}|Children(F_C(X, 0, i))|$ and
+- $b$ is the smallest integer such that $L(K_{C,b}(X)) > 0$, or $|SPLIT_C(X)|-1$ if no such integer exists (as before).
+
+$F_C(X, D+1, 0) = (D, \langle F_C(X, D, 0), \dots, F_C(X, D, b) \rangle)$
+where $b$ is the smallest integer such that $L(xxxrightmostleaf(F_C(X, D, b)))$ > D$,
+or xxx if no such integer exists.
+
+The root of the tree is $F_C(X, D, 0)$ where $D$ is the largest integer such that $|Children(F_C(X, D, 0))| > 1$,
+or 0 if no such integer exists.
+
+### Procedural description
+
 To compute a hashsplit tree from sequence $X$,
 compute its “root node” as follows.

From 3f14f6ae82bb02940807e0c2a23a551f12b7c563 Mon Sep 17 00:00:00 2001
From: Bob Glickstein
Date: Thu, 8 Oct 2020 08:28:38 -0700
Subject: [PATCH 2/9] Flesh out the rest of the algebraic tree description. Also don't overload C; use S for a node's children.

---
 spec.md | 78 +++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 62 insertions(+), 16 deletions(-)

diff --git a/spec.md b/spec.md
index 639e071..d6b3815 100644
--- a/spec.md
+++ b/spec.md
@@ -157,38 +157,84 @@ and so $L(X)$ may be negative.
 This makes no difference to the algorithm below, however.)

 A “node” in a hashsplit tree
-is a pair $(D, C)$
+is a pair $(D, S)$
 where $D$ is the node’s “depth”
-and $C$ is a sequence of children.
+and $S$ is a sequence of children.
 The children of a node at depth 0 are chunks
 (i.e., subsequences of the input).
 The children of a node at depth $D > 0$ are nodes at depth $D - 1$.

-The function $\operatorname{Children}(N)$ on a node $N = (D, C)$ produces $C$
+The function $\operatorname{Children}(N)$
+on a node $N = (D, S)$
+produces $S$
 (the sequence of children).

+The function $\operatorname{Rightmost}(N)$
+on a node $N = (D, \langle S_0, \dots, S_b \rangle)$
+produces the “rightmost leaf chunk”
+defined recursively as follows:
+
+- If $D = 0$,
+  $\operatorname{Rightmost}(N) = K_{C,b}$
+- Otherwise, $\operatorname{Rightmost}(N) = \operatorname{Rightmost}(S_b)$
+
+
 ## Algorithm

 ### Algebraic description

 Let function $F_C(X, D, i)$ on a sequence $X$, a depth $D$, and an index $i$ produce the $i^\text{th}$ node at level $D$.

-$F_C(X, 0, 0) = (0, \langle K_{C,0}(X), \dots, K_{C,b}(X) \rangle)$
-where $b$ is the smallest integer such that $L(K_{C,b}(X)) > 0$,
-or $|SPLIT_C(X)|-1$ if no such integer exists.
+Then the root of the tree is $F_C(X, D, 0)$
+where $D$ is the largest integer such that $|Children(F_C(X, D, 0))| > 1$,
+or 0 if no such integer exists.

-$F_C(X, 0, n+1) = (0, \langle K_{C,\sigma}, \dots, K_{C,b}(X) \rangle)$
-where:
+$F_C$ is defined recursively as follows:

-- $\sigma = \sum_{i=0}^{n}|Children(F_C(X, 0, i))|$ and
-- $b$ is the smallest integer such that $L(K_{C,b}(X)) > 0$, or $|SPLIT_C(X)|-1$ if no such integer exists (as before).
+- $F_C(X, 0, 0) = (0, \langle K_{C,0}(X), \dots, K_{C,b}(X) \rangle)$
+  where $b$ is:

-$F_C(X, D+1, 0) = (D, \langle F_C(X, D, 0), \dots, F_C(X, D, b) \rangle)$
-where $b$ is the smallest integer such that $L(xxxrightmostleaf(F_C(X, D, b)))$ > D$,
-or xxx if no such integer exists.
+  - the smallest integer such that
+    $L(K_{C,b}(X)) > 0$, if one exists;
+    otherwise
+  - $|\operatorname{SPLIT}_C(X)|-1$

-The root of the tree is $F_C(X, D, 0)$ where $D$ is the largest integer such that $|Children(F_C(X, D, 0))| > 1$,
-or 0 if no such integer exists.
+- $F_C(X, 0, n+1)$ is:
+
+  - **undefined** when $\sigma \ge |\operatorname{SPLIT}_C(X)|-1$; otherwise
+  - $(0, \langle K_{C,\sigma}, \dots, K_{C,b}(X) \rangle)$
+
+  where:
+
+  - $\sigma = \sum_{i=0}^{n}|Children(F_C(X, 0, i))|$ and
+  - $b$ is:
+    - the smallest integer such that
+      $L(K_{C,b}(X)) > 0$, if one exists;
+      otherwise
+    - $|\operatorname{SPLIT}_C(X)|-1$.
+
+- $F_C(X, D+1, 0) = (D+1, \langle F_C(X, D, 0), \dots, F_C(X, D, b) \rangle)$
+  where $b$ is:
+
+  - the smallest integer such that
+    $L(\operatorname{Rightmost}(F_C(X, D, b))) > D$, if one exists;
+    otherwise
+  - the integer such that
+    $\operatorname{Rightmost}(F_C(X, D, b)) = K_{C,|\operatorname{SPLIT}_C(X)|-1}$
+
+- $F_C(X, D+1, n+1)$ is:
+
+  - **undefined** when $F_C(X, D, \sigma)$ is undefined; otherwise
+  - $(D+1, \langle F_C(X, D, \sigma), \dots, F_C(X, D, b) \rangle)$
+
+  where:
+
+  - $\sigma = \sum_{i=0}^{n}|Children(F_C(X, D+1, i))|$ and
+  - $b$ is the smallest integer such that $b \ge \sigma$ and either:
+    - $L(\operatorname{Rightmost}(F_C(X, D, b))) > D$, if one exists;
+      otherwise
+    - the integer such that
+      $\operatorname{Rightmost}(F_C(X, D, b)) = K_{C,|\operatorname{SPLIT}_C(X)|-1}$

 ### Procedural description

@@ -199,7 +245,7 @@ compute its “root node” as follows.
 2. If $|X| = 0$, then:
     a. Let $d$ be the largest depth such that $N_d$ exists.
     b. If $|\operatorname{Children}(N_0)| > 0$, then:
-        i. For each integer $i$ in $[0 .. d]$, “close” $N_i$.
+        i. For each integer $i$ in $[0 .. d]$, “close” $N_i$ (see below).
         ii. Set $d \leftarrow d+1$.
     c. [pruning] While $d > 0$ and $|\operatorname{Children}(N_d)| = 1$, set $d \leftarrow d-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child).
     d. **Terminate** with $N_d$ as the root node.
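
Reviewer's sketch (not part of the patch): the height-0 cases of $F_C$ in patch 2 amount to grouping consecutive chunks into nodes, where each node ends at the first chunk whose level exceeds 0. The Python below is a minimal illustration under stated assumptions: the function name `group_height0` and the precomputed `levels` list are hypothetical, the search for $b$ is assumed to start at $\sigma$, and any trailing chunks with level 0 are assumed to form a final node.

```python
from typing import List, Sequence


def group_height0(levels: Sequence[int]) -> List[List[int]]:
    """Group chunk indices into height-0 nodes.

    levels[i] stands in for L(K_{C,i}(X)), assumed to be precomputed.
    Each node collects consecutive chunk indices up to and including
    the first chunk whose level is greater than 0. Leftover chunks
    (all with level 0) become the last node.
    """
    nodes: List[List[int]] = []
    current: List[int] = []
    for index, level in enumerate(levels):
        current.append(index)
        if level > 0:
            # This chunk closes the current height-0 node.
            nodes.append(current)
            current = []
    if current:
        # Trailing chunks that never reached a level above 0.
        nodes.append(current)
    return nodes


if __name__ == "__main__":
    print(group_height0([0, 0, 1, 0, 2, 0]))
```

The sketch works on chunk indices rather than chunk contents so that the grouping rule is visible on its own; computing real levels would additionally need the rolling hash $H$ and threshold $T$ from the configuration.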
From 8b82dc22ce596788a65dede772ec7eba1669ac57 Mon Sep 17 00:00:00 2001
From: Bob Glickstein
Date: Thu, 8 Oct 2020 17:14:10 -0700
Subject: [PATCH 3/9] Various fixes (including an important correctness fix) plus some explanatory text.

---
 spec.md | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/spec.md b/spec.md
index b4d1972..e67d796 100644
--- a/spec.md
+++ b/spec.md
@@ -153,8 +153,10 @@ where $Q$ is the largest integer such that
 (Note:
 When $|R(X)| > 0$,
 $L(X)$ is non-negative,
-because $P(X)$ is defined in terms of a hash with $T$ trailing zeroes.
-But when $|R(X)| = 0$,
+because $P(X)$ is defined in terms of a hash with $T$ trailing zeroes
+(except when the split is triggered by hitting $S_\text{max}$).
+But when $|R(X)| = 0$
+(or in the $S_\text{max}$ case),
 that hash may have fewer than $T$ trailing zeroes,
 and so $L(X)$ may be negative.
 This makes no difference to the algorithm below, however.)
@@ -184,12 +186,17 @@ defined recursively as follows:

 ## Algorithm

+This section contains two descriptions of hashsplit trees:
+an algebraic description for formal reasoning,
+and a procedural description for practical construction.
+
 ### Algebraic description

 Let function $F_C(X, D, i)$ on a sequence $X$, a depth $D$, and an index $i$ produce the $i^\text{th}$ node at level $D$.

 Then the root of the tree is $F_C(X, D, 0)$
-where $D$ is the largest integer such that $|Children(F_C(X, D, 0))| > 1$,
+where $D$ is the largest integer such that $|\operatorname{Children}(F_C(X, D, 0))| > 1$
+(i.e., the highest node that has multiple children),
 or 0 if no such integer exists.

 $F_C$ is defined recursively as follows:
@@ -202,6 +209,8 @@ $F_C$ is defined recursively as follows:
     otherwise
   - $|\operatorname{SPLIT}_C(X)|-1$

+  (I.e., its children are the chunks from 0 up to and including the first one with a “level” higher than 0.)
+
 - $F_C(X, 0, n+1)$ is:

   - **undefined** when $\sigma \ge |\operatorname{SPLIT}_C(X)|-1$; otherwise
@@ -209,22 +218,28 @@ $F_C$ is defined recursively as follows:

   where:

-  - $\sigma = \sum_{i=0}^{n}|Children(F_C(X, 0, i))|$ and
+  - $\sigma = \sum_{i=0}^{n}|\operatorname{Children}(F_C(X, 0, i))|$ and
   - $b$ is:
     - the smallest integer such that
       $L(K_{C,b}(X)) > 0$, if one exists;
       otherwise
     - $|\operatorname{SPLIT}_C(X)|-1$.

+  (I.e., its children are the next chunks after the ones in $F_C(X, 0, 0)$ through $F_C(X, 0, n)$,
+  up until the next one whose level is higher than 0.)
+
 - $F_C(X, D+1, 0) = (D+1, \langle F_C(X, D, 0), \dots, F_C(X, D, b) \rangle)$
   where $b$ is:

   - the smallest integer such that
-    $L(\operatorname{Rightmost}(F_C(X, D, b))) > D$, if one exists;
+    $L(\operatorname{Rightmost}(F_C(X, D, b))) > D+1$, if one exists;
     otherwise
   - the integer such that
     $\operatorname{Rightmost}(F_C(X, D, b)) = K_{C,|\operatorname{SPLIT}_C(X)|-1}$

+  (I.e., its children are the nodes from $F_C(X, D, 0)$ up to and including the first one
+  whose “rightmost leaf chunk” has a level higher than $D+1$.)
+
 - $F_C(X, D+1, n+1)$ is:

   - **undefined** when $F_C(X, D, \sigma)$ is undefined; otherwise
@@ -232,13 +247,16 @@ $F_C$ is defined recursively as follows:

   where:

-  - $\sigma = \sum_{i=0}^{n}|Children(F_C(X, D+1, i))|$ and
+  - $\sigma = \sum_{i=0}^{n}|\operatorname{Children}(F_C(X, D+1, i))|$ and
   - $b$ is the smallest integer such that $b \ge \sigma$ and either:
-    - $L(\operatorname{Rightmost}(F_C(X, D, b))) > D$, if one exists;
+    - $L(\operatorname{Rightmost}(F_C(X, D, b))) > D+1$, if one exists;
       otherwise
     - the integer such that
       $\operatorname{Rightmost}(F_C(X, D, b)) = K_{C,|\operatorname{SPLIT}_C(X)|-1}$

+  (I.e., its children are the next nodes at level D after the ones in $F_C(X, D+1, 0)$ through $F_C(X, D+1, n)$,
+  up until the next one whose rightmost leaf chunk has a level higher than $D+1$.)
+
 ### Procedural description

 To compute a hashsplit tree from sequence $X$,
From 94de072c4bcae3858ed8f7f8b17584e39609ae20 Mon Sep 17 00:00:00 2001
From: Bob Glickstein
Date: Sat, 10 Oct 2020 09:27:37 -0700
Subject: [PATCH 4/9] Use "height" and "h" instead of "depth" and "D."

---
 spec.md | 62 ++++++++++++++++++++++++++++-----------------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/spec.md b/spec.md
index e67d796..cf137b2 100644
--- a/spec.md
+++ b/spec.md
@@ -162,24 +162,24 @@ and so $L(X)$ may be negative.
 This makes no difference to the algorithm below, however.)

 A “node” in a hashsplit tree
-is a pair $(D, S)$
-where $D$ is the node’s “depth”
+is a pair $(h, S)$
+where $h$ is the node’s “height”
 and $S$ is a sequence of children.
-The children of a node at depth 0 are chunks
+The children of a node at height 0 are chunks
 (i.e., subsequences of the input).
-The children of a node at depth $D > 0$ are nodes at depth $D - 1$.
+The children of a node at height $h > 0$ are nodes at height $h - 1$.

 The function $\operatorname{Children}(N)$
-on a node $N = (D, S)$
+on a node $N = (h, S)$
 produces $S$
 (the sequence of children).

 The function $\operatorname{Rightmost}(N)$
-on a node $N = (D, \langle S_0, \dots, S_b \rangle)$
+on a node $N = (h, \langle S_0, \dots, S_b \rangle)$
 produces the “rightmost leaf chunk”
 defined recursively as follows:

-- If $D = 0$,
+- If $h = 0$,
   $\operatorname{Rightmost}(N) = K_{C,b}$
 - Otherwise, $\operatorname{Rightmost}(N) = \operatorname{Rightmost}(S_b)$

@@ -192,10 +192,10 @@ and a procedural description for practical construction.

 ### Algebraic description

-Let function $F_C(X, D, i)$ on a sequence $X$, a depth $D$, and an index $i$ produce the $i^\text{th}$ node at level $D$.
+Let function $F_C(X, h, i)$ on a sequence $X$, a height $h$, and an index $i$ produce the $i^\text{th}$ node at level $h$.

-Then the root of the tree is $F_C(X, D, 0)$
-where $D$ is the largest integer such that $|\operatorname{Children}(F_C(X, D, 0))| > 1$
+Then the root of the tree is $F_C(X, h, 0)$
+where $h$ is the largest integer such that $|\operatorname{Children}(F_C(X, h, 0))| > 1$
 (i.e., the highest node that has multiple children),
 or 0 if no such integer exists.

@@ -228,48 +228,48 @@ $F_C$ is defined recursively as follows:
   (I.e., its children are the next chunks after the ones in $F_C(X, 0, 0)$ through $F_C(X, 0, n)$,
   up until the next one whose level is higher than 0.)

-- $F_C(X, D+1, 0) = (D+1, \langle F_C(X, D, 0), \dots, F_C(X, D, b) \rangle)$
+- $F_C(X, h+1, 0) = (h+1, \langle F_C(X, h, 0), \dots, F_C(X, h, b) \rangle)$
   where $b$ is:

   - the smallest integer such that
-    $L(\operatorname{Rightmost}(F_C(X, D, b))) > D+1$, if one exists;
+    $L(\operatorname{Rightmost}(F_C(X, h, b))) > h+1$, if one exists;
     otherwise
   - the integer such that
-    $\operatorname{Rightmost}(F_C(X, D, b)) = K_{C,|\operatorname{SPLIT}_C(X)|-1}$
+    $\operatorname{Rightmost}(F_C(X, h, b)) = K_{C,|\operatorname{SPLIT}_C(X)|-1}$

-  (I.e., its children are the nodes from $F_C(X, D, 0)$ up to and including the first one
-  whose “rightmost leaf chunk” has a level higher than $D+1$.)
+  (I.e., its children are the nodes from $F_C(X, h, 0)$ up to and including the first one
+  whose “rightmost leaf chunk” has a level higher than $h+1$.)

-- $F_C(X, D+1, n+1)$ is:
+- $F_C(X, h+1, n+1)$ is:

-  - **undefined** when $F_C(X, D, \sigma)$ is undefined; otherwise
-  - $(D+1, \langle F_C(X, D, \sigma), \dots, F_C(X, D, b) \rangle)$
+  - **undefined** when $F_C(X, h, \sigma)$ is undefined; otherwise
+  - $(h+1, \langle F_C(X, h, \sigma), \dots, F_C(X, h, b) \rangle)$

   where:

-  - $\sigma = \sum_{i=0}^{n}|\operatorname{Children}(F_C(X, D+1, i))|$ and
+  - $\sigma = \sum_{i=0}^{n}|\operatorname{Children}(F_C(X, h+1, i))|$ and
   - $b$ is the smallest integer such that $b \ge \sigma$ and either:
-    - $L(\operatorname{Rightmost}(F_C(X, D, b))) > D+1$, if one exists;
+    - $L(\operatorname{Rightmost}(F_C(X, h, b))) > h+1$, if one exists;
       otherwise
     - the integer such that
-      $\operatorname{Rightmost}(F_C(X, D, b)) = K_{C,|\operatorname{SPLIT}_C(X)|-1}$
+      $\operatorname{Rightmost}(F_C(X, h, b)) = K_{C,|\operatorname{SPLIT}_C(X)|-1}$

-  (I.e., its children are the next nodes at level D after the ones in $F_C(X, D+1, 0)$ through $F_C(X, D+1, n)$,
-  up until the next one whose rightmost leaf chunk has a level higher than $D+1$.)
+  (I.e., its children are the next nodes at level h after the ones in $F_C(X, h+1, 0)$ through $F_C(X, h+1, n)$,
+  up until the next one whose rightmost leaf chunk has a level higher than $h+1$.)

 ### Procedural description

 To compute a hashsplit tree from sequence $X$,
 compute its “root node” as follows.

-1. Let $N_0$ be $(0, \langle\rangle)$ (i.e., a node at depth 0 with no children).
+1. Let $N_0$ be $(0, \langle\rangle)$ (i.e., a node at height 0 with no children).
 2. If $|X| = 0$, then:
-    a. Let $d$ be the largest depth such that $N_d$ exists.
+    a. Let $h$ be the largest height such that $N_h$ exists.
     b. If $|\operatorname{Children}(N_0)| > 0$, then:
-        i. For each integer $i$ in $[0 .. d]$, “close” $N_i$ (see below).
-        ii. Set $d \leftarrow d+1$.
-    c. [pruning] While $d > 0$ and $|\operatorname{Children}(N_d)| = 1$, set $d \leftarrow d-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child).
-    d. **Terminate** with $N_d$ as the root node.
+        i. For each integer $i$ in $[0 .. h]$, “close” $N_i$ (see below).
+        ii. Set $h \leftarrow h+1$.
+    c. [pruning] While $h > 0$ and $|\operatorname{Children}(N_h)| = 1$, set $h \leftarrow h-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child).
+    d. **Terminate** with $N_h$ as the root node.
 3. Otherwise, set $N_0 \leftarrow (0, \operatorname{Children}(N_0) \mathbin{\|} \langle P(X) \rangle)$ (i.e., add $P(X)$ to the list of children in $N_0$).
 4. For each integer $i$ in $[0 .. L(X))$, “close” the node $N_i$ (see below).
 5. Set $X \leftarrow R(X)$.
@@ -277,9 +277,9 @@ compute its “root node” as follows.

 To “close” a node $N_i$:

-1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $(i+1, \langle\rangle)$ (i.e., a node at depth ${i + 1}$ with no children).
+1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $(i+1, \langle\rangle)$ (i.e., a node at height ${i + 1}$ with no children).
 2. Set $N_{i+1} \leftarrow (i+1, \operatorname{Children}(N_{i+1}) \mathbin{\|} \langle N_i \rangle)$ (i.e., add $N_i$ as a child to $N_{i+1}$).
-3. Let $N_i$ be $(i, \langle\rangle)$ (i.e., new node at depth $i$ with no children).
+3. Let $N_i$ be $(i, \langle\rangle)$ (i.e., new node at height $i$ with no children).

 # Rolling Hash Functions

From d9a8fe5790e5655b77e7458ee605772995e83278 Mon Sep 17 00:00:00 2001
From: Bob Glickstein
Date: Sat, 10 Oct 2020 09:37:00 -0700
Subject: [PATCH 5/9] Fix "level" to always be non-negative.

---
 spec.md | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/spec.md b/spec.md
index cf137b2..3cd6419 100644
--- a/spec.md
+++ b/spec.md
@@ -142,7 +142,7 @@ $H(\langle X_{\max(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$

 (i.e., the hash of the last $W$ bytes of $X$).

-The “level” $L(X)$ of a sequence $X$ is $Q - T$,
+The “level” $L(X)$ of a sequence $X$ is $\max(0, Q - T)$,
 where $Q$ is the largest integer such that

 - $Q \le 32$ and
@@ -150,17 +150,6 @@ where $Q$ is the largest integer such that

 (i.e., the level is the number of trailing zeroes in the rolling checksum in excess of the threshold needed to produce the prefix chunk $P(X)$).

-(Note:
-When $|R(X)| > 0$,
-$L(X)$ is non-negative,
-because $P(X)$ is defined in terms of a hash with $T$ trailing zeroes
-(except when the split is triggered by hitting $S_\text{max}$).
-But when $|R(X)| = 0$
-(or in the $S_\text{max}$ case),
-that hash may have fewer than $T$ trailing zeroes,
-and so $L(X)$ may be negative.
-This makes no difference to the algorithm below, however.)
-
 A “node” in a hashsplit tree
 is a pair $(h, S)$
 where $h$ is the node’s “height”
@@ -183,7 +172,6 @@ defined recursively as follows:
   $\operatorname{Rightmost}(N) = K_{C,b}$
 - Otherwise, $\operatorname{Rightmost}(N) = \operatorname{Rightmost}(S_b)$

-
 ## Algorithm

 This section contains two descriptions of hashsplit trees:
From e8a543a97e464f4c67abcb19926c83d740d2a652 Mon Sep 17 00:00:00 2001
From: Bob Glickstein
Date: Sat, 10 Oct 2020 10:28:24 -0700
Subject: [PATCH 6/9] Very rough draft of prefix/remainder rewrite of algebraic tree description.

---
 spec.md | 80 +++++++++++++++++++++------------------------------------
 1 file changed, 29 insertions(+), 51 deletions(-)

diff --git a/spec.md b/spec.md
index 3cd6419..6d1422f 100644
--- a/spec.md
+++ b/spec.md
@@ -180,70 +180,48 @@ and a procedural description for practical construction.

 ### Algebraic description

-Let function $F_C(X, h, i)$ on a sequence $X$, a height $h$, and an index $i$ produce the $i^\text{th}$ node at level $h$.
+Let $\mathbb{K}_C$ be a sequence of $n$ chunks $\langle K_{C,0}, \dots, K_{C,n-1} \rangle$.

-Then the root of the tree is $F_C(X, h, 0)$
-where $h$ is the largest integer such that $|\operatorname{Children}(F_C(X, h, 0))| > 1$
-(i.e., the highest node that has multiple children),
-or 0 if no such integer exists.
+Let the “prefix” $\mathbb{P}_{C,0}(\mathbb{K}_C)$
+of a sequence of chunks be defined as:

-$F_C$ is defined recursively as follows:
+- $\langle \rangle$ if $|\mathbb{K}_C| = 0$; otherwise
+- $\langle K_{C,0}, \dots, K_{C,b} \rangle$
+  where $b$ is the smallest integer such that $L(K_{C,b}) > 0$,
+  or $n-1$ if no such integer exists.

-- $F_C(X, 0, 0) = (0, \langle K_{C,0}(X), \dots, K_{C,b}(X) \rangle)$
-  where $b$ is:
+Let the “prefix” $\mathbb{P}_{C,h+1}(\mathbb{N}_{C,h})$
+of a sequence of $m$ nodes $\langle N_{C,h,0}, \dots, N_{C,h,m-1} \rangle$
+at height $h$ be defined as:

-  - the smallest integer such that
-    $L(K_{C,b}(X)) > 0$, if one exists;
-    otherwise
-  - $|\operatorname{SPLIT}_C(X)|-1$
+- $\langle \rangle$ if $|\mathbb{N}_{C,h}| = 0$; otherwise
+- $\langle N_{C,h,0}, \dots, N_{C,h,b} \rangle$
+  where $b$ is the smallest integer such that $L(Rightmost(N_{C,h,b})) > h+1$,
+  or $m-1$ if no such integer exists.

-  (I.e., its children are the chunks from 0 up to and including the first one with a “level” higher than 0.)
+Let the “remainder” $\mathbb{R}_{C,0}(\mathbb{K}_C)$
+of a sequence of $n$ chunks be defined as:

-- $F_C(X, 0, n+1)$ is:
+$\langle K_{C,|\mathbb{P}_{C,0}(\mathbb{K}_C)|}, \dots, K_{C,n-1} \rangle$

-  - **undefined** when $\sigma \ge |\operatorname{SPLIT}_C(X)|-1$; otherwise
-  - $(0, \langle K_{C,\sigma}, \dots, K_{C,b}(X) \rangle)$
+(i.e., the chunks remaining after removing the “prefix.”)

-  where:
+Let the “remainder” $\mathbb{R}_{C,h+1}(\mathbb{N}_{C,h})$
+of a sequence of $m$ nodes at height $h$ be defined as:

-  - $\sigma = \sum_{i=0}^{n}|\operatorname{Children}(F_C(X, 0, i))|$ and
-  - $b$ is:
-    - the smallest integer such that
-      $L(K_{C,b}(X)) > 0$, if one exists;
-      otherwise
-    - $|\operatorname{SPLIT}_C(X)|-1$.
+$\langle N_{C,h,|\mathbb{P}_{C,h+1}(\mathbb{N}_{C,h})|}, \dots, N_{C,h,m-1} \rangle$

-  (I.e., its children are the next chunks after the ones in $F_C(X, 0, 0)$ through $F_C(X, 0, n)$,
-  up until the next one whose level is higher than 0.)
+(i.e., the nodes remaining after removing the “prefix.”)

-- $F_C(X, h+1, 0) = (h+1, \langle F_C(X, h, 0), \dots, F_C(X, h, b) \rangle)$
-  where $b$ is:
+Then:

-  - the smallest integer such that
-    $L(\operatorname{Rightmost}(F_C(X, h, b))) > h+1$, if one exists;
-    otherwise
-  - the integer such that
-    $\operatorname{Rightmost}(F_C(X, h, b)) = K_{C,|\operatorname{SPLIT}_C(X)|-1}$
+- $N_{C,0,0} = (0, \mathbb{P}_{C,0}(\operatorname{SPLIT}_C(X)))$
+- $\mathbb{N}_{C,0} = \langle N_{C,0,0} \rangle \mathbin{\|} \mathbb{R}_{C,0}(\operatorname{SPLIT}_C(X))$
+- $N_{C,h+1,0} = (h+1, \mathbb{P}_{C,h+1}(\mathbb{N}_{C,h}))$
+- $\mathbb{N}_{C,h+1} = \langle N_{C,h+1,0} \rangle \mathbin{\|} \mathbb{R}_{C,h+1}(\mathbb{N}_{C,h})$

-  (I.e., its children are the nodes from $F_C(X, h, 0)$ up to and including the first one
-  whose “rightmost leaf chunk” has a level higher than $h+1$.)
-
-- $F_C(X, h+1, n+1)$ is:
-
-  - **undefined** when $F_C(X, h, \sigma)$ is undefined; otherwise
-  - $(h+1, \langle F_C(X, h, \sigma), \dots, F_C(X, h, b) \rangle)$
-
-  where:
-
-  - $\sigma = \sum_{i=0}^{n}|\operatorname{Children}(F_C(X, h+1, i))|$ and
-  - $b$ is the smallest integer such that $b \ge \sigma$ and either:
-    - $L(\operatorname{Rightmost}(F_C(X, h, b))) > h+1$, if one exists;
-      otherwise
-    - the integer such that
-      $\operatorname{Rightmost}(F_C(X, h, b)) = K_{C,|\operatorname{SPLIT}_C(X)|-1}$
-
-  (I.e., its children are the next nodes at level h after the ones in $F_C(X, h+1, 0)$ through $F_C(X, h+1, n)$,
-  up until the next one whose rightmost leaf chunk has a level higher than $h+1$.)
+and the root of the tree is $N_{C,h,0}$
+for the lowest value of $h$ where $|\mathbb{N}_{C,h}| = 1$.

 ### Procedural description

From 31786ccd9e5bd4bc6befa616a5b822c3e4054bc9 Mon Sep 17 00:00:00 2001
From: Bob Glickstein
Date: Sun, 18 Oct 2020 10:11:03 -0700
Subject: [PATCH 7/9] Factor out a parameterized prefix/remainder notation and use it to make both SPLIT_C and the tree algorithm more concise.

---
 spec.md | 232 ++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 149 insertions(+), 83 deletions(-)

diff --git a/spec.md b/spec.md
index 6d1422f..3069a56 100644
--- a/spec.md
+++ b/spec.md
@@ -65,7 +65,37 @@ We also use the following operators and functions:
   i.e. if $X = \langle X_0, \dots, X_N \rangle$
   and $Y = \langle Y_0, \dots, Y_M \rangle$
   then $X \mathbin{\|} Y = \langle X_0, \dots, X_N, Y_0, \dots, Y_M \rangle$
-- $\min(x, y)$ denotes the minimum of $x$ and $y$.
+- $\min(x, y)$ denotes the minimum of $x$ and $y$ and $\max(x, y)$ denotes the maximum
+- $\operatorname{Type}(x)$ denotes the type of $x$.
+
+Finally, we define the “prefix” $\mathbb{P}_q(X)$
+of a non-empty sequence $X$
+with respect to a given predicate $q$
+to be the initial subsequence $X^\prime$ of $X$
+up to and including the first member that makes $q(X^\prime)$ true.
+And we define the “remainder” $\mathbb{R}_q(X)$
+to be everything left after removing the prefix.
+
+Formally,
+given a sequence $X = \langle X_0, \dots, X_{|X|-1} \rangle$
+and a predicate $q \in \operatorname{Type}(X) \rightarrow \{\text{true},\text{false}\}$,
+
+$\mathbb{P}_q(X) = \langle X_0, \dots, X_e \rangle$
+
+for the smallest integer $e$ such that:
+
+- $0 \le e < |X|$ and
+- $q(\langle X_0, \dots, X_e \rangle) = \text{true}$
+
+or $|X|-1$ if no such integer exists.
+(I.e., if nothing satisfies $q$, the prefix is the whole sequence.)
+And:
+
+$\mathbb{R}_q(X) = \langle X_b, \dots, X_{|X|-1} \rangle$
+
+where $b = |\mathbb{P}_q(\langle X_0, \dots, X_{|X|-1} \rangle)|$.
+
+Note that when $\mathbb{P}_q(X) = X$, $\mathbb{R}_q(X) = \langle \rangle$.

 # Splitting

@@ -74,7 +104,7 @@ functions:

 $\operatorname{SPLIT}_C \in V_8 \rightarrow V_v$

-...which is parameterized by a configuration $C$, consisting of:
+...which is parameterized by a _configuration_ $C$, consisting of:

 - $S_{\text{min}} \in U_{32}$, the minimum split size
 - $S_{\text{max}} \in U_{32}$, the maximum split size
@@ -83,28 +113,55 @@ $\operatorname{SPLIT}_C \in V_8 \rightarrow V_v$

 The configuration must satisfy $S_{\text{max}} \ge S_{\text{min}} > 0$.

+
+
 ## Definitions

 We define the constant $W$, which we call the "window size," to be 64.

-The "split index" $I(X)$ of a sequence $X$ is either the smallest
-non-negative integer $i$ satisfying:
+

-- $i \le |X|$ and
-- $S_{\text{max}} \ge i \ge S_{\text{min}}$ and
-- $H(\langle X_{i-W}, \dots, X_{i-1} \rangle) \mod 2^T = 0$
+We define the predicate $q_C(X)$
+on a non-empty byte sequence $X$
+with respect to a configuration $C$
+to be:

-...or $\min(|X|, S_{\text{max}})$, if no such $i$ exists. For the
-purposes of this definition we set $X_i = 0$ for $i < 0$.
+- $\text{true}$ if $|X| = S_{\text{max}}$; otherwise
+- $\text{true}$ if $|X| \ge S_{\text{min}}$ and $H(\langle X_{\max(0,|X|-W)}, \dots, X_{|X|-1} \rangle) \mod 2^T = 0$
+  (i.e., the last $W$ bytes of $X$ hash to a value with at least $T$ trailing zeroes);
+  otherwise
+- $\text{false}$.

-The “prefix” $P(X)$ of a non-empty sequence $X$ is $\langle X_0, \dots, X_{I(X)-1} \rangle$.
+

 We define $\operatorname{SPLIT}_C(X)$ recursively, as follows:

 - If $|X| = 0$, $\operatorname{SPLIT}_C(X) = \langle \rangle$
-- Otherwise, $\operatorname{SPLIT}_C(X) = \langle P(X) \rangle \mathbin{\|} \operatorname{SPLIT}_C(R(X))$
+- Otherwise, $\operatorname{SPLIT}_C(X) = \langle \mathbb{P}_{q_C}(X) \rangle \mathbin{\|} \operatorname{SPLIT}_C(\mathbb{R}_{q_C}(X))$

 # Tree Construction

@@ -133,119 +190,128 @@ will differ only in the subtrees in the vicinity of the differences.

 ## Definitions

-A “chunk” $K_{C,i}(X)$ is a member of the sequence produced by $\operatorname{SPLIT}_C(X)$.
-The index $i$ lies in the range $[0 .. |\operatorname{SPLIT}_C(X)|)$.
+A “chunk” is a member of the sequence produced by $\operatorname{SPLIT}_C$.

-The “hashval” $V(X)$ of a sequence $X$ is:
+The “hashval” $V_C(X)$ of a byte sequence $X$ is:

 $H(\langle X_{\max(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$

 (i.e., the hash of the last $W$ bytes of $X$).

-The “level” $L(X)$ of a sequence $X$ is $\max(0, Q - T)$,
-where $Q$ is the largest integer such that
-
-- $Q \le 32$ and
-- $V(P(X)) \mod 2^Q = 0$
-
-(i.e., the level is the number of trailing zeroes in the rolling checksum in excess of the threshold needed to produce the prefix chunk $P(X)$).
+A “node” $N_{h,i}$ in a hashsplit tree
+at non-negative “height” $h$
+is a sequence of children.
+The children of a node at height 0 are chunks.
+The children of a node at height $h+1$ are nodes at height $h$.

-A “node” in a hashsplit tree
-is a pair $(h, S)$
-where $h$ is the node’s “height”
-and $S$ is a sequence of children.
-The children of a node at height 0 are chunks
-(i.e., subsequences of the input).
-The children of a node at height $h > 0$ are nodes at height $h - 1$.
+A “tier” of a hashsplit tree is a sequence of nodes
+$N_h = \langle N_{h,0}, \dots, N_{h,k} \rangle$
+at a given height $h$.

-The function $\operatorname{Children}(N)$
-on a node $N = (h, S)$
-produces $S$
-(the sequence of children).
-
-The function $\operatorname{Rightmost}(N)$
-on a node $N = (h, \langle S_0, \dots, S_b \rangle)$
+The function $\operatorname{Rightmost}(N_{h,i})$
+on a node $N_{h,i} = \langle S_0, \dots, S_e \rangle$
 produces the “rightmost leaf chunk”
 defined recursively as follows:

-- If $h = 0$,
-  $\operatorname{Rightmost}(N) = K_{C,b}$
-- Otherwise, $\operatorname{Rightmost}(N) = \operatorname{Rightmost}(S_b)$
+- If $h = 0$, $\operatorname{Rightmost}(N_{h,i}) = S_e$
+- If $h > 0$, $\operatorname{Rightmost}(N_{h,i}) = \operatorname{Rightmost}(S_e)$

-## Algorithm
+The “level” $L_C(X)$ of a given chunk $X$
+is $\max(0, Q - T)$,
+where $Q$ is the largest integer such that

-This section contains two descriptions of hashsplit trees:
-an algebraic description for formal reasoning,
-and a procedural description for practical construction.
+- $Q \le 32$ and
+- $V_C(\mathbb{P}_{q_C}(X)) \mod 2^Q = 0$

-### Algebraic description
+(i.e., the level is the number of trailing zeroes in the hashval
+in excess of the threshold needed
+to produce the prefix chunk $\mathbb{P}_{q_C}(X)$).

-Let $\mathbb{K}_C$ be a sequence of $n$ chunks $\langle K_{C,0}, \dots, K_{C,n-1} \rangle$.
+The level $L_C(N)$ of a given _node_ $N$
+is the level of its rightmost leaf chunk:
+$L_C(N) = L_C(\operatorname{Rightmost}(N))$

-Let the “prefix” $\mathbb{P}_{C,0}(\mathbb{K}_C)$
-of a sequence of chunks be defined as:
+The predicate $z_{C,h}(K)$
+on a sequence $K = \langle K_0, \dots, K_e \rangle$
+of chunks or of nodes
+with respect to a height $h$
+is defined as:

-- $\langle \rangle$ if $|\mathbb{K}_C| = 0$; otherwise
-- $\langle K_{C,0}, \dots, K_{C,b} \rangle$
-  where $b$ is the smallest integer such that $L(K_{C,b}) > 0$,
-  or $n-1$ if no such integer exists.
+- $\text{true}$ if $L(K_e) > h$; otherwise
+- $\text{false}$.

-Let the “prefix” $\mathbb{P}_{C,h+1}(\mathbb{N}_{C,h})$
-of a sequence of $m$ nodes $\langle N_{C,h,0}, \dots, N_{C,h,m-1} \rangle$
-at height $h$ be defined as:
+

-Let the “remainder” $\mathbb{R}_{C,0}(\mathbb{K}_C)$
-of a sequence of $n$ chunks be defined as:
+For conciseness, define

-$\langle K_{C,|\mathbb{P}_{C,0}(\mathbb{K}_C)|}, \dots, K_{C,n-1} \rangle$
+- $P_C(X) = \mathbb{P}_{z_{C,0}}(\operatorname{SPLIT}_C(X))$ and
+- $R_C(X) = \mathbb{R}_{z_{C,0}}(\operatorname{SPLIT}_C(X))$

-(i.e., the chunks remaining after removing the “prefix.”)
+## Algorithm
+
+This section contains two descriptions of hashsplit trees:
+an algebraic description for formal reasoning,
+and a procedural description for practical construction.
+ +### Algebraic description -Let the “remainder” $\mathbb{R}_{C,h+1}(\mathbb{N}_{C,h})$ -of a sequence of $m$ nodes at height $h$ be defined as: +The tier $N_0$ +of hashsplit tree nodes +for a given byte sequence $X$ +is equal to -$\langle N_{C,h,|\mathbb{P}_{C,h+1}(\mathbb{N}_{C,h})|}, \dots, N_{C,h,m-1} \rangle$ +$\langle P_C(X) \rangle \mathbin{\|} R_C(X)$ -(i.e., the nodes remaining after removing the “prefix.”) +The tier $N_{h+1}$ +of hashsplit tree nodes +for a given byte sequence $X$ +is equal to -Then: +$\langle \mathbb{P}_{z_{C,h+1}}(N_h) \rangle \mathbin{\|} \mathbb{R}_{z_{C,h+1}}(N_h)$ -- $N_{C,0,0} = (0, \mathbb{P}_{C,0}(\operatorname{SPLIT}_C(X)))$ -- $\mathbb{N}_{C,0} = \langle N_{C,0,0} \rangle \mathbin{\|} \mathbb{R}_{C,0}(\operatorname{SPLIT}_C(X))$ -- $N_{C,h+1,0} = (h+1, \mathbb{P}_{C,h+1}(\mathbb{N}_{C,h}))$ -- $\mathbb{N}_{C,h+1} = \langle N_{C,h+1,0} \rangle \mathbin{\|} \mathbb{R}_{C,h+1}(\mathbb{N}_{C,h})$ +(I.e., each node in the tree has as its children +a sequence of chunks or lower-tier nodes, +as appropriate, +up to and including the first one +whose “level” is greater than the node’s height.) -and the root of the tree is $N_{C,h,0}$ -for the lowest value of $h$ where $|\mathbb{N}_{C,h}| = 1$. +The root of the hashsplit tree is $N_{h^\prime,0}$ +for the smallest value of $h^\prime$ +such that $|N_{h^\prime}| = 1$. ### Procedural description -To compute a hashsplit tree from sequence $X$, +For this description we use $N_h$ to denote a single node at height $h$. +The algorithm must keep track of the “rightmost” such node for each tier in the tree. + +To compute a hashsplit tree from a byte sequence $X$, compute its “root node” as follows. -1. Let $N_0$ be $(0, \langle\rangle)$ (i.e., a node at height 0 with no children). +1. Let $N_0$ be $\langle\rangle$ (i.e., a node at height 0 with no children). 2. If $|X| = 0$, then: a. Let $h$ be the largest height such that $N_h$ exists. - b. If $|\operatorname{Children}(N_0)| > 0$, then: + b.
If $|N_0| > 0$, then: i. For each integer $i$ in $[0 .. h]$, “close” $N_i$ (see below). ii. Set $h \leftarrow h+1$. - c. [pruning] While $h > 0$ and $|\operatorname{Children}(N_h)| = 1$, set $h \leftarrow h-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child). + c. [pruning] While $h > 0$ and $|N_h| = 1$, set $h \leftarrow h-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child). d. **Terminate** with $N_h$ as the root node. -3. Otherwise, set $N_0 \leftarrow (0, \operatorname{Children}(N_0) \mathbin{\|} \langle P(X) \rangle)$ (i.e., add $P(X)$ to the list of children in $N_0$). -4. For each integer $i$ in $[0 .. L(X))$, “close” the node $N_i$ (see below). -5. Set $X \leftarrow R(X)$. +3. Otherwise, set $N_0 \leftarrow N_0 \mathbin{\|} \langle P_C(X) \rangle$ (i.e., add $P_C(X)$ to the list of children in $N_0$). +4. For each integer $i$ in $[0 .. L_C(X))$, “close” the node $N_i$ (see below). +5. Set $X \leftarrow R_C(X)$. 6. Go to step 2. To “close” a node $N_i$: -1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $(i+1, \langle\rangle)$ (i.e., a node at height ${i + 1}$ with no children). -2. Set $N_{i+1} \leftarrow (i+1, \operatorname{Children}(N_{i+1}) \mathbin{\|} \langle N_i \rangle)$ (i.e., add $N_i$ as a child to $N_{i+1}$). -3. Let $N_i$ be $(i, \langle\rangle)$ (i.e., new node at height $i$ with no children). +1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $\langle\rangle$ (i.e., a node at height ${i + 1}$ with no children). +2. Set $N_{i+1} \leftarrow N_{i+1} \mathbin{\|} \langle N_i \rangle$ (i.e., add $N_i$ as a child to $N_{i+1}$). +3. Let $N_i$ be $\langle\rangle$ (i.e., new node at height $i$ with no children). # Rolling Hash Functions From eb1b5b667c147bdb6a181f6116b850c8c381c1af Mon Sep 17 00:00:00 2001 From: Bob Glickstein Date: Mon, 26 Oct 2020 17:51:15 -0700 Subject: [PATCH 8/9] Add a missing C subscript. 
--- spec.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/spec.md b/spec.md index 3069a56..c9d7719 100644 --- a/spec.md +++ b/spec.md @@ -237,7 +237,7 @@ of chunks or of nodes with respect to a height $h$ is defined as: -- $\text{true}$ if $L(K_e) > h$; otherwise +- $\text{true}$ if $L_C(K_e) > h$; otherwise - $\text{false}$. - ## Definitions We define the constant $W$, which we call the "window size," to be 64. - - We define the predicate $q_C(X)$ on a non-empty byte sequence $X$ with respect to a configuration $C$ @@ -147,17 +128,6 @@ to be: otherwise - $\text{false}$. - - We define $\operatorname{SPLIT}_C(X)$ recursively, as follows: - If $|X| = 0$, $\operatorname{SPLIT}_C(X) = \langle \rangle$ @@ -240,15 +210,6 @@ is defined as: - $\text{true}$ if $L_C(K_e) > h$; otherwise - $\text{false}$. - - For conciseness, define - $P_C(X) = \mathbb{P}_{z_{C,0}}(\operatorname{SPLIT}_C(X))$ and