-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Algebraic tree description #23
Changes from 8 commits
8feac1b
3f14f6a
d4f1cfd
8b82dc2
94de072
d9a8fe5
e8a543a
31786cc
eb1b5b6
40da42b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -65,7 +65,37 @@ We also use the following operators and functions: | |
i.e. if $X = \langle X_0, \dots, X_N \rangle$ and $Y = \langle Y_0, | ||
\dots, Y_M \rangle$ then $X \mathbin{\|} Y = \langle X_0, \dots, X_N, Y_0, \dots, Y_M | ||
\rangle$ | ||
- $\operatorname{min}(x, y)$ denotes the minimum of $x$ and $y$. | ||
- $\min(x, y)$ denotes the minimum of $x$ and $y$ and $\max(x, y)$ denotes the maximum | ||
- $\operatorname{Type}(x)$ denotes the type of $x$. | ||
|
||
Finally, we define the “prefix” $\mathbb{P}_q(X)$ | ||
of a non-empty sequence $X$ | ||
with respect to a given predicate $q$ | ||
to be the initial subsequence $X^\prime$ of $X$ | ||
up to and including the first member that makes $q(X^\prime)$ true. | ||
And we define the “remainder” $\mathbb{R}_q(X)$ | ||
to be everything left after removing the prefix. | ||
|
||
Formally, | ||
given a sequence $X = \langle X_0, \dots, X_{|X|-1} \rangle$ | ||
and a predicate $q \in \operatorname{Type}(X) \rightarrow \{\text{true},\text{false}\}$, | ||
|
||
$\mathbb{P}_q(X) = \langle X_0, \dots, X_e \rangle$ | ||
|
||
for the smallest integer $e$ such that: | ||
|
||
- $0 \le e < |X|$ and | ||
- $q(\langle X_0, \dots, X_e \rangle) = \text{true}$ | ||
|
||
or $|X|-1$ if no such integer exists. | ||
(I.e., if nothing satisfies $q$, the prefix is the whole sequence.) | ||
And: | ||
|
||
$\mathbb{R}_q(X) = \langle X_b, \dots, X_{|X|-1} \rangle$ | ||
|
||
where $b = |\mathbb{P}_q(\langle X_0, \dots, X_{|X|-1} \rangle)|$. | ||
|
||
Note that when $\mathbb{P}_q(X) = X$, $\mathbb{R}_q(X) = \langle \rangle$. | ||
|
||
# Splitting | ||
|
||
|
@@ -74,7 +104,7 @@ functions: | |
|
||
$\operatorname{SPLIT}_C \in V_8 \rightarrow V_v$ | ||
|
||
...which is parameterized by a configuration $C$, consisting of: | ||
...which is parameterized by a _configuration_ $C$, consisting of: | ||
|
||
- $S_{\text{min}} \in U_{32}$, the minimum split size | ||
- $S_{\text{max}} \in U_{32}$, the maximum split size | ||
|
@@ -83,28 +113,55 @@ $\operatorname{SPLIT}_C \in V_8 \rightarrow V_v$ | |
|
||
The configuration must satisfy $S_{\text{max}} \ge S_{\text{min}} > 0$. | ||
|
||
<!--- | ||
NOTE | ||
|
||
It might help the clarity of what follows | ||
if we make parameterization by C implicit | ||
rather than repeating C in subscripts everywhere. | ||
--> | ||
|
||
## Definitions | ||
|
||
We define the constant $W$, which we call the "window size," to be 64. | ||
|
||
The "split index" $I(X)$ of a sequence $X$ is either the smallest | ||
non-negative integer $i$ satisfying: | ||
<!--- | ||
NOTE | ||
|
||
Fixing W at 64 | ||
arbitrarily rules out using this section | ||
to describe and reason about hashsplit algorithms | ||
that it otherwise could. | ||
|
||
- $i \le |X|$ and | ||
- $S_{\text{max}} \ge i \ge S_{\text{min}}$ and | ||
- $H(\langle X_{i-W}, \dots, X_{i-1} \rangle) \mod 2^T = 0$ | ||
I think fixing it at 64 is more properly a part of the recommendation section. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I really do want to be prescriptive about this, because fixing it as a constant allows implementations that aren't possible if this is configurable. In particular, this allows using a fixed size array for the ring buffer rather than a dynamically allocated slice (or your language's equivalent). For e.g. @cole-miller's Rust implementation this is a big deal, because it means rolling hashes can be computed in an environment that doesn't supply a heap allocator, and generally it will be more efficient. I guess in my mind if something is "recommended" then general purpose libraries should still allow a user to configure it, as they may need to to interoperate with other systems. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just to clarify -- I'm already using Rust's nascent support for const generics in my implementation, and it would be pretty simple to make A compromise that seems reasonable to me is to explicitly acknowledge the possibility of using other window sizes, while making clear that this is a "bonus feature" that spec-compliant implementations are not required to support. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hi! Thanks for the comments and sorry for the delay. In the end I will defer to your preference on this. But before I do I want to push back a bit more. What is the purpose of the formalism in this part of the document? I thought the whole point was to describe essentially the whole space of possible hashsplit algorithms (or at least, all those belonging to a broad family) simply by the selection of appropriate configuration parameters. This allows formal reasoning about a wide assortment of different hashsplit algorithms, and to that end I see no use in artificially limiting the size of that assortment by removing a parameter. Keeping On the other hand, if the only purpose of this document is to prescribe a specific hashsplit algorithm (and perhaps to describe one or two others, like rrs), why bother with the formalism at all? Why not describe exactly and only those algorithms and no others? (These same argument applies to the input-shorter-than-W business below, unless we're re-imposing the S_min>=W requirement, in which case it's moot.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't have a problem with making W an opaque parameter from the perspective of the formal description. I suppose part of the trouble is we don't really have a section on recommendations for implementors, and so I'd been thinking of the configuration as much as a set of parameters that would be exposed by a library as part of the formalism. I think it makes sense to expose S_min, S_max and T to library users, I'm waffling on H, and I think W does not, so maybe the right answer is to actually separate these two ideas, and add a separate section that is prescriptive about what conforming library implementations should or should not make configurable. I'll open a new issue about this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I've been assuming all along that that was the plan. Sorry for the miscommunication, but I do think that's the right way to go. |
||
--> | ||
|
||
...or $\operatorname{min}(|X|, S_{\text{max}})$, if no such $i$ exists. For the | ||
purposes of this definition we set $X_i = 0$ for $i < 0$. | ||
We define the predicate $q_C(X)$ | ||
on a non-empty byte sequence $X$ | ||
with respect to a configuration $C$ | ||
to be: | ||
|
||
The “prefix” $P(X)$ of a non-empty sequence $X$ is $\langle X_0, \dots, X_{I(X)-1} \rangle$. | ||
- $\text{true}$ if $|X| = S_{\text{max}}$; otherwise | ||
- $\text{true}$ if $|X| \ge S_{\text{min}}$ and $H(\langle X_{\max(0,|X|-W)}, \dots, X_{|X|-1} \rangle) \mod 2^T = 0$ | ||
(i.e., the last $W$ bytes of $X$ hash to a value with at least $T$ trailing zeroes); | ||
otherwise | ||
- $\text{false}$. | ||
|
||
The “remainder” $R(X)$ of a non-empty sequence $X$ is $\langle X_{I(X)}, \dots, X_{|X|-1} \rangle$. | ||
<!--- | ||
NOTE | ||
|
||
This previously used H(<X_(|X|-W) ... X_(|X|-1)>) and defined X_i = 0 for i<0. | ||
However, this unnecessarily constrains the choice of hash function. | ||
If the hash function wants to treat input shorter than W as being prefixed by zeroes, | ||
it can specify that; | ||
but if it wants to handle input shorter than W differently, | ||
it should be allowed to do that too. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This gets a little weird because the way the hash functions are currently specified is over a variable length buffer; it's actually well defined what RRS1 and CP32 are when I'm still somewhat inclined to just insist that @cole-miller, how strongly do you feel about this? I'm still not sure I grok your use case... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I think I understand better now how |
||
--> | ||
|
||
We define $\operatorname{SPLIT}_C(X)$ recursively, as follows: | ||
|
||
- If $|X| = 0$, $\operatorname{SPLIT}_C(X) = \langle \rangle$ | ||
- Otherwise, $\operatorname{SPLIT}_C(X) = \langle P(X) \rangle \mathbin{\|} \operatorname{SPLIT}_C(R(X))$ | ||
- Otherwise, $\operatorname{SPLIT}_C(X) = \langle \mathbb{P}_{q_C}(X) \rangle \mathbin{\|} \operatorname{SPLIT}_C(\mathbb{R}_{q_C}(X))$ | ||
|
||
# Tree Construction | ||
|
||
|
@@ -133,63 +190,128 @@ will differ only in the subtrees in the vicinity of the differences. | |
|
||
## Definitions | ||
|
||
The “hashval” $V(X)$ of a sequence $X$ is: | ||
A “chunk” is a member of the sequence produced by $\operatorname{SPLIT}_C$. | ||
|
||
The “hashval” $V_C(X)$ of a byte sequence $X$ is: | ||
|
||
$H(\langle X_{\operatorname{max}(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$ | ||
$H(\langle X_{\max(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$ | ||
|
||
(i.e., the hash of the last $W$ bytes of $X$). | ||
|
||
The “level” $L(X)$ of a sequence $X$ is $Q - T$, | ||
A “node” $N_{h,i}$ in a hashsplit tree | ||
at non-negative “height” $h$ | ||
is a sequence of children. | ||
The children of a node at height 0 are chunks. | ||
The children of a node at height $h+1$ are nodes at height $h$. | ||
|
||
A “tier” of a hashsplit tree is a sequence of nodes | ||
$N_h = \langle N_{h,0}, \dots, N_{h,k} \rangle$ | ||
at a given height $h$. | ||
|
||
The function $\operatorname{Rightmost}(N_{h,i})$ | ||
on a node $N_{h,i} = \langle S_0, \dots, S_e \rangle$ | ||
produces the “rightmost leaf chunk” | ||
defined recursively as follows: | ||
|
||
- If $h = 0$, $\operatorname{Rightmost}(N_{h,i}) = S_e$ | ||
- If $h > 0$, $\operatorname{Rightmost}(N_{h,i}) = \operatorname{Rightmost}(S_e)$ | ||
|
||
The “level” $L_C(X)$ of a given chunk $X$ | ||
is $\max(0, Q - T)$, | ||
where $Q$ is the largest integer such that | ||
|
||
- $Q \le 32$ and | ||
- $V(P(X)) \mod 2^Q = 0$ | ||
|
||
(i.e., the level is the number of trailing zeroes in the rolling checksum in excess of the threshold needed to produce the prefix chunk $P(X)$). | ||
|
||
(Note: | ||
When $|R(X)| > 0$, | ||
$L(X)$ is non-negative, | ||
because $P(X)$ is defined in terms of a hash with $T$ trailing zeroes. | ||
But when $|R(X)| = 0$, | ||
that hash may have fewer than $T$ trailing zeroes, | ||
and so $L(X)$ may be negative. | ||
This makes no difference to the algorithm below, however.) | ||
|
||
A “node” in a hashsplit tree | ||
is a pair $(D, C)$ | ||
where $D$ is the node’s “depth” | ||
and $C$ is a sequence of children. | ||
The children of a node at depth 0 are chunks | ||
(i.e., subsequences of the input). | ||
The children of a node at depth $D > 0$ are nodes at depth $D - 1$. | ||
|
||
The function $\operatorname{Children}(N)$ on a node $N = (D, C)$ produces $C$ | ||
(the sequence of children). | ||
- $V_C(\mathbb{P}_{q_C}(X)) \mod 2^Q = 0$ | ||
|
||
(i.e., the level is the number of trailing zeroes in the hashval | ||
in excess of the threshold needed | ||
to produce the prefix chunk $\mathbb{P}_{q_C}(X)$). | ||
|
||
The level $L_C(N)$ of a given _node_ $N$ | ||
is the level of its rightmost leaf chunk: | ||
$L_C(N) = L_C(\operatorname{Rightmost}(N))$ | ||
|
||
The predicate $z_{C,h}(K)$ | ||
on a sequence $K = \langle K_0, \dots, K_e \rangle$ | ||
of chunks or of nodes | ||
with respect to a height $h$ | ||
is defined as: | ||
|
||
- $\text{true}$ if $L(K_e) > h$; otherwise | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You're missing a subscript |
||
- $\text{false}$. | ||
|
||
<!--- | ||
NOTE | ||
|
||
Still needed: | ||
a way to specify | ||
the minimum and maximum branching factor | ||
(akin to S_min and S_max for SPLIT_C). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we can deal with this as a separate issue. |
||
--> | ||
|
||
For conciseness, define | ||
|
||
- $P_C(X) = \mathbb{P}_{z_{C,0}}(\operatorname{SPLIT}_C(X))$ and | ||
- $R_C(X) = \mathbb{R}_{z_{C,0}}(\operatorname{SPLIT}_C(X))$ | ||
|
||
## Algorithm | ||
|
||
To compute a hashsplit tree from sequence $X$, | ||
This section contains two descriptions of hashsplit trees: | ||
an algebraic description for formal reasoning, | ||
and a procedural description for practical construction. | ||
|
||
### Algebraic description | ||
|
||
The tier $N_0$ | ||
of hashsplit tree nodes | ||
for a given byte sequence $X$ | ||
is equal to | ||
|
||
$\langle P_C(X) \rangle \mathbb{\|} R_C(X)$ | ||
|
||
The tier $N_{h+1}$ | ||
of hashsplit tree nodes | ||
for a given byte sequence $X$ | ||
is equal to | ||
|
||
$\langle \mathbb{P}_{z_{C,h+1}}(N_h) \rangle \mathbb{\|} \mathbb{R}_{z_{C,h+1}}(N_h)$ | ||
|
||
(I.e., each node in the tree has as its children | ||
a sequence of chunks or lower-tier nodes, | ||
as appropriate, | ||
up to and including the first one | ||
whose “level” is greater than the node’s height.) | ||
|
||
The root of the hashsplit tree is $N_{h^\prime,0}$ | ||
for the smallest value of $h^\prime$ | ||
such that $|N_{h^\prime}| = 1$ | ||
|
||
### Procedural description | ||
|
||
For this description we use $N_h$ to denote a single node at height $h$. | ||
The algorithm must keep track of the “rightmost” such node for each tier in the tree. | ||
|
||
To compute a hashsplit tree from a byte sequence $X$, | ||
compute its “root node” as follows. | ||
|
||
1. Let $N_0$ be $(0, \langle\rangle)$ (i.e., a node at depth 0 with no children). | ||
1. Let $N_0$ be $\langle\rangle$ (i.e., a node at height 0 with no children). | ||
2. If $|X| = 0$, then: | ||
a. Let $d$ be the largest depth such that $N_d$ exists. | ||
b. If $|\operatorname{Children}(N_0)| > 0$, then: | ||
i. For each integer $i$ in $[0 .. d]$, “close” $N_i$. | ||
ii. Set $d \leftarrow d+1$. | ||
c. [pruning] While $d > 0$ and $|\operatorname{Children}(N_d)| = 1$, set $d \leftarrow d-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child). | ||
d. **Terminate** with $N_d$ as the root node. | ||
3. Otherwise, set $N_0 \leftarrow (0, \operatorname{Children}(N_0) \mathbin{\|} \langle P(X) \rangle)$ (i.e., add $P(X)$ to the list of children in $N_0$). | ||
4. For each integer $i$ in $[0 .. L(X))$, “close” the node $N_i$ (see below). | ||
5. Set $X \leftarrow R(X)$. | ||
a. Let $h$ be the largest height such that $N_h$ exists. | ||
b. If $|N_0| > 0$, then: | ||
i. For each integer $i$ in $[0 .. h]$, “close” $N_i$ (see below). | ||
ii. Set $h \leftarrow h+1$. | ||
c. [pruning] While $h > 0$ and $|N_h| = 1$, set $h \leftarrow h-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child). | ||
d. **Terminate** with $N_h$ as the root node. | ||
3. Otherwise, set $N_0 \leftarrow N_0 \mathbin{\|} \langle P_C(X) \rangle$ (i.e., add $P_C(X)$ to the list of children in $N_0$). | ||
4. For each integer $i$ in $[0 .. L_C(X))$, “close” the node $N_i$ (see below). | ||
5. Set $X \leftarrow R_C(X)$. | ||
6. Go to step 2. | ||
|
||
To “close” a node $N_i$: | ||
|
||
1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $(i+1, \langle\rangle)$ (i.e., a node at depth ${i + 1}$ with no children). | ||
2. Set $N_{i+1} \leftarrow (i+1, \operatorname{Children}(N_{i+1}) \mathbin{\|} \langle N_i \rangle)$ (i.e., add $N_i$ as a child to $N_{i+1}$). | ||
3. Let $N_i$ be $(i, \langle\rangle)$ (i.e., new node at depth $i$ with no children). | ||
1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $\langle\rangle$ (i.e., a node at height ${i + 1}$ with no children). | ||
2. Set $N_{i+1} \leftarrow N_{i+1} \mathbin{\|} \langle N_i \rangle$ (i.e., add $N_i$ as a child to $N_{i+1}$). | ||
3. Let $N_i$ be $\langle\rangle$ (i.e., new node at height $i$ with no children). | ||
|
||
# Rolling Hash Functions | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree; feel free to add some verbiage to this affect and change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to take a whack at this, or should I go ahead and merge and we'll treat this as a follow-on task?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A follow-up task sounds good. I made one pass at it and didn't like how it looked.
For the record, this should involve more than just saying that
C
is implicit. We should also be clearer about the components ofC
, likeH
andS_min
. Those are used throughout the document without explicitly connecting them back toC
. Perhaps we need something likeH(C)
,S_min(C)
,S_max(C)
, andT(C)
(and maybe alsoW(C)
as discussed) to select the members ofC
(like I did withChildren(N)
in an earlier draft of this PR), but that sure seems cumbersome, and is why I'm happy to punt for now.