Algebraic tree description #23

Merged
merged 10 commits into from
Oct 27, 2020
226 changes: 174 additions & 52 deletions spec.md
@@ -65,7 +65,37 @@ We also use the following operators and functions:
i.e. if $X = \langle X_0, \dots, X_N \rangle$ and $Y = \langle Y_0,
\dots, Y_M \rangle$ then $X \mathbin{\|} Y = \langle X_0, \dots, X_N, Y_0, \dots, Y_M
\rangle$
- $\operatorname{min}(x, y)$ denotes the minimum of $x$ and $y$.
- $\min(x, y)$ denotes the minimum of $x$ and $y$, and $\max(x, y)$ denotes the maximum.
- $\operatorname{Type}(x)$ denotes the type of $x$.

Finally, we define the “prefix” $\mathbb{P}_q(X)$
of a non-empty sequence $X$
with respect to a given predicate $q$
to be the initial subsequence $X^\prime$ of $X$
up to and including the first member that makes $q(X^\prime)$ true.
And we define the “remainder” $\mathbb{R}_q(X)$
to be everything left after removing the prefix.

Formally,
given a sequence $X = \langle X_0, \dots, X_{|X|-1} \rangle$
and a predicate $q \in \operatorname{Type}(X) \rightarrow \{\text{true},\text{false}\}$,

$\mathbb{P}_q(X) = \langle X_0, \dots, X_e \rangle$

for the smallest integer $e$ such that:

- $0 \le e < |X|$ and
- $q(\langle X_0, \dots, X_e \rangle) = \text{true}$

or $|X|-1$ if no such integer exists.
(I.e., if nothing satisfies $q$, the prefix is the whole sequence.)
And:

$\mathbb{R}_q(X) = \langle X_b, \dots, X_{|X|-1} \rangle$

where $b = |\mathbb{P}_q(\langle X_0, \dots, X_{|X|-1} \rangle)|$.

Note that when $\mathbb{P}_q(X) = X$, $\mathbb{R}_q(X) = \langle \rangle$.
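
As an illustration, the prefix and remainder can be sketched in Python as follows. The names `prefix` and `remainder` and the example predicate are assumptions made for this sketch; they are not part of the specification.

```python
from typing import Callable, Sequence


def prefix(q: Callable[[Sequence], bool], xs: Sequence) -> Sequence:
    """P_q(X): the shortest initial subsequence of xs that satisfies q,
    or all of xs if no initial subsequence does."""
    for e in range(len(xs)):
        if q(xs[: e + 1]):
            return xs[: e + 1]
    return xs


def remainder(q: Callable[[Sequence], bool], xs: Sequence) -> Sequence:
    """R_q(X): everything left after removing the prefix."""
    return xs[len(prefix(q, xs)):]


# Hypothetical predicate: true once the last member seen is zero.
ends_in_zero = lambda s: s[-1] == 0

print(prefix(ends_in_zero, [3, 1, 0, 4, 0]))     # [3, 1, 0]
print(remainder(ends_in_zero, [3, 1, 0, 4, 0]))  # [4, 0]
```

When nothing satisfies the predicate, `prefix` returns the whole sequence and `remainder` returns an empty one, matching the note above.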

# Splitting

@@ -74,7 +104,7 @@ functions:

$\operatorname{SPLIT}_C \in V_8 \rightarrow V_v$

...which is parameterized by a configuration $C$, consisting of:
...which is parameterized by a _configuration_ $C$, consisting of:

- $S_{\text{min}} \in U_{32}$, the minimum split size
- $S_{\text{max}} \in U_{32}$, the maximum split size
@@ -83,28 +113,55 @@ $\operatorname{SPLIT}_C \in V_8 \rightarrow V_v$

The configuration must satisfy $S_{\text{max}} \ge S_{\text{min}} > 0$.

<!---
NOTE

It might help the clarity of what follows
if we make parameterization by C implicit
rather than repeating C in subscripts everywhere.

Collaborator

I agree; feel free to add some verbiage to this effect and change it.

Collaborator

Do you want to take a whack at this, or should I go ahead and merge and we'll treat this as a follow-on task?

Contributor Author

A follow-up task sounds good. I made one pass at it and didn't like how it looked.

For the record, this should involve more than just saying that C is implicit. We should also be clearer about the components of C, like H and S_min. Those are used throughout the document without explicitly connecting them back to C. Perhaps we need something like H(C), S_min(C), S_max(C), and T(C) (and maybe also W(C) as discussed) to select the members of C (like I did with Children(N) in an earlier draft of this PR), but that sure seems cumbersome, and is why I'm happy to punt for now.

-->

## Definitions

We define the constant $W$, which we call the "window size," to be 64.

The "split index" $I(X)$ of a sequence $X$ is either the smallest
non-negative integer $i$ satisfying:
<!---
NOTE

Fixing W at 64
arbitrarily rules out using this section
to describe and reason about hashsplit algorithms
that it otherwise could.

- $i \le |X|$ and
- $S_{\text{max}} \ge i \ge S_{\text{min}}$ and
- $H(\langle X_{i-W}, \dots, X_{i-1} \rangle) \mod 2^T = 0$
I think fixing it at 64 is more properly a part of the recommendation section.

Collaborator

I really do want to be prescriptive about this, because fixing it as a constant allows implementations that aren't possible if this is configurable. In particular, this allows using a fixed size array for the ring buffer rather than a dynamically allocated slice (or your language's equivalent). For e.g. @cole-miller's Rust implementation this is a big deal, because it means rolling hashes can be computed in an environment that doesn't supply a heap allocator, and generally it will be more efficient.

I guess in my mind if something is "recommended" then general purpose libraries should still allow a user to configure it, as they may need to to interoperate with other systems.

Contributor

@cole-miller, Oct 21, 2020

Just to clarify -- I'm already using Rust's nascent support for const generics in my implementation, and it would be pretty simple to make WINDOW_SIZE a generic parameter in the appropriate place. That would allow users to choose a custom window size, as long as it's specified at compile time (which is a pretty mild restriction), without sacrificing no-alloc support. The general point about a fixed window size making things easier for implementors is well-taken, though.

A compromise that seems reasonable to me is to explicitly acknowledge the possibility of using other window sizes, while making clear that this is a "bonus feature" that spec-compliant implementations are not required to support.

Contributor Author

Hi! Thanks for the comments and sorry for the delay.

In the end I will defer to your preference on this. But before I do I want to push back a bit more. What is the purpose of the formalism in this part of the document? I thought the whole point was to describe essentially the whole space of possible hashsplit algorithms (or at least, all those belonging to a broad family) simply by the selection of appropriate configuration parameters.

This allows formal reasoning about a wide assortment of different hashsplit algorithms, and to that end I see no use in artificially limiting the size of that assortment by removing a parameter. Keeping W as a configurable value neither complicates the description nor limits our ability to prescribe setting it to 64.

On the other hand, if the only purpose of this document is to prescribe a specific hashsplit algorithm (and perhaps to describe one or two others, like rrs), why bother with the formalism at all? Why not describe exactly and only those algorithms and no others?

(This same argument applies to the input-shorter-than-W business below, unless we're re-imposing the S_min>=W requirement, in which case it's moot.)

Collaborator

I don't have a problem with making W an opaque parameter from the perspective of the formal description. I suppose part of the trouble is we don't really have a section on recommendations for implementors, and so I'd been thinking of the configuration as much as a set of parameters that would be exposed by a library as part of the formalism. I think it makes sense to expose S_min, S_max and T to library users, I'm waffling on H, and I think W does not, so maybe the right answer is to actually separate these two ideas, and add a separate section that is prescriptive about what conforming library implementations should or should not make configurable. I'll open a new issue about this.

Contributor Author

> maybe the right answer is to actually separate these two ideas, and add a separate section that is prescriptive about what conforming library implementations should or should not make configurable

I've been assuming all along that that was the plan. Sorry for the miscommunication, but I do think that's the right way to go.

-->

...or $\operatorname{min}(|X|, S_{\text{max}})$, if no such $i$ exists. For the
purposes of this definition we set $X_i = 0$ for $i < 0$.
We define the predicate $q_C(X)$
on a non-empty byte sequence $X$
with respect to a configuration $C$
to be:

The “prefix” $P(X)$ of a non-empty sequence $X$ is $\langle X_0, \dots, X_{I(X)-1} \rangle$.
- $\text{true}$ if $|X| = S_{\text{max}}$; otherwise
- $\text{true}$ if $|X| \ge S_{\text{min}}$ and $H(\langle X_{\max(0,|X|-W)}, \dots, X_{|X|-1} \rangle) \mod 2^T = 0$
(i.e., the last $W$ bytes of $X$ hash to a value with at least $T$ trailing zeroes);
otherwise
- $\text{false}$.

The “remainder” $R(X)$ of a non-empty sequence $X$ is $\langle X_{I(X)}, \dots, X_{|X|-1} \rangle$.
<!---
NOTE

This previously used H(<X_(|X|-W) ... X_(|X|-1)>) and defined X_i = 0 for i<0.
However, this unnecessarily constrains the choice of hash function.
If the hash function wants to treat input shorter than W as being prefixed by zeroes,
it can specify that;
but if it wants to handle input shorter than W differently,
it should be allowed to do that too.

Collaborator

This gets a little weird because the way the hash functions are currently specified is over a variable length buffer; it's actually well defined what RRS1 and CP32 are when |X| != W, but a rolling implementation won't be amenable to computing that...

I'm still somewhat inclined to just insist that W <= S_min, as it would obviate this entirely, and it doesn't really affect the obvious implementation, but @cole-miller wanted this defined so he could do something with iterators of hashes...

@cole-miller, how strongly do you feel about this? I'm still not sure I grok your use case...

Contributor

Yeah, I think I understand better now how S_min >= W solves this kind of problem. And since it really applies specifically to the splitting procedure, not to the hash computation step, the implementation strategy I had in mind isn't affected. So I'm fine with re-imposing the restriction.

-->

We define $\operatorname{SPLIT}_C(X)$ recursively, as follows:

- If $|X| = 0$, $\operatorname{SPLIT}_C(X) = \langle \rangle$
- Otherwise, $\operatorname{SPLIT}_C(X) = \langle P(X) \rangle \mathbin{\|} \operatorname{SPLIT}_C(R(X))$
- Otherwise, $\operatorname{SPLIT}_C(X) = \langle \mathbb{P}_{q_C}(X) \rangle \mathbin{\|} \operatorname{SPLIT}_C(\mathbb{R}_{q_C}(X))$
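
A minimal Python sketch of $q_C$ and $\operatorname{SPLIT}_C$ may help make the definition concrete. The `Config` class, its field names, and `toy_hash` are assumptions made for illustration. In particular, `toy_hash` is a plain byte sum standing in for $H$, not one of the rolling hash functions this document specifies. The sketch uses a loop where the definition uses recursion, and it re-hashes the window at every position, which a real implementation would avoid by rolling the hash.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Config:
    s_min: int                       # S_min, minimum split size
    s_max: int                       # S_max, maximum split size
    threshold: int                   # T, required trailing zero bits
    hash_fn: Callable[[bytes], int]  # H, hash over the window
    window: int = 64                 # W, window size


def toy_hash(window: bytes) -> int:
    """Placeholder for H (assumption): byte sum truncated to 32 bits."""
    return sum(window) & 0xFFFFFFFF


def q(c: Config, x: bytes) -> bool:
    """The predicate q_C on a non-empty byte sequence x."""
    if len(x) == c.s_max:
        return True
    if len(x) >= c.s_min:
        tail = x[max(0, len(x) - c.window):]  # last W bytes of x
        return c.hash_fn(tail) % (1 << c.threshold) == 0
    return False


def split(c: Config, x: bytes) -> List[bytes]:
    """SPLIT_C, written as a loop instead of a recursion."""
    chunks = []
    while x:
        end = len(x)  # the whole remainder if q_C is never satisfied
        for e in range(1, len(x) + 1):
            if q(c, x[:e]):
                end = e
                break
        chunks.append(x[:end])  # the prefix
        x = x[end:]             # continue with the remainder
    return chunks


cfg = Config(s_min=4, s_max=16, threshold=2, hash_fn=toy_hash)
data = bytes(range(100))
chunks = split(cfg, data)
```

Concatenating `chunks` reproduces `data`, no chunk is longer than `s_max`, and every chunk except possibly the last is at least `s_min` bytes long.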

# Tree Construction

@@ -133,63 +190,128 @@ will differ only in the subtrees in the vicinity of the differences.

## Definitions

The “hashval” $V(X)$ of a sequence $X$ is:
A “chunk” is a member of the sequence produced by $\operatorname{SPLIT}_C$.

The “hashval” $V_C(X)$ of a byte sequence $X$ is:

$H(\langle X_{\operatorname{max}(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$
$H(\langle X_{\max(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$

(i.e., the hash of the last $W$ bytes of $X$).

The “level” $L(X)$ of a sequence $X$ is $Q - T$,
A “node” $N_{h,i}$ in a hashsplit tree
at non-negative “height” $h$
is a sequence of children.
The children of a node at height 0 are chunks.
The children of a node at height $h+1$ are nodes at height $h$.

A “tier” of a hashsplit tree is a sequence of nodes
$N_h = \langle N_{h,0}, \dots, N_{h,k} \rangle$
at a given height $h$.

The function $\operatorname{Rightmost}(N_{h,i})$
on a node $N_{h,i} = \langle S_0, \dots, S_e \rangle$
produces the “rightmost leaf chunk”
defined recursively as follows:

- If $h = 0$, $\operatorname{Rightmost}(N_{h,i}) = S_e$
- If $h > 0$, $\operatorname{Rightmost}(N_{h,i}) = \operatorname{Rightmost}(S_e)$

The “level” $L_C(X)$ of a given chunk $X$
is $\max(0, Q - T)$,
where $Q$ is the largest integer such that

- $Q \le 32$ and
- $V(P(X)) \mod 2^Q = 0$

(i.e., the level is the number of trailing zeroes in the rolling checksum in excess of the threshold needed to produce the prefix chunk $P(X)$).

(Note:
When $|R(X)| > 0$,
$L(X)$ is non-negative,
because $P(X)$ is defined in terms of a hash with $T$ trailing zeroes.
But when $|R(X)| = 0$,
that hash may have fewer than $T$ trailing zeroes,
and so $L(X)$ may be negative.
This makes no difference to the algorithm below, however.)

A “node” in a hashsplit tree
is a pair $(D, C)$
where $D$ is the node’s “depth”
and $C$ is a sequence of children.
The children of a node at depth 0 are chunks
(i.e., subsequences of the input).
The children of a node at depth $D > 0$ are nodes at depth $D - 1$.

The function $\operatorname{Children}(N)$ on a node $N = (D, C)$ produces $C$
(the sequence of children).
- $V_C(\mathbb{P}_{q_C}(X)) \mod 2^Q = 0$

(i.e., the level is the number of trailing zeroes in the hashval
in excess of the threshold needed
to produce the prefix chunk $\mathbb{P}_{q_C}(X)$).

The level $L_C(N)$ of a given _node_ $N$
is the level of its rightmost leaf chunk:
$L_C(N) = L_C(\operatorname{Rightmost}(N))$
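
The hashval, rightmost-leaf, and level definitions can be sketched in Python as below. This is an illustration under several assumptions: `toy_hash` is a placeholder byte sum rather than a real rolling hash, nodes are modeled as nested Python lists with chunks as `bytes`, the height is passed explicitly, and each chunk is taken to be a prefix already produced by $\operatorname{SPLIT}_C$, so that $\mathbb{P}_{q_C}(X) = X$.

```python
W = 64  # window size


def toy_hash(window: bytes) -> int:
    """Placeholder for H (assumption): byte sum truncated to 32 bits."""
    return sum(window) & 0xFFFFFFFF


def hashval(x: bytes) -> int:
    """V_C(X): the hash of the last W bytes of x."""
    return toy_hash(x[max(0, len(x) - W):])


def trailing_zeros(v: int, cap: int = 32) -> int:
    """Largest Q <= cap such that v mod 2^Q == 0."""
    q = 0
    while q < cap and v % (1 << (q + 1)) == 0:
        q += 1
    return q


def chunk_level(chunk: bytes, t: int) -> int:
    """L_C(X) for a chunk: trailing zeroes in excess of T, floored at 0."""
    return max(0, trailing_zeros(hashval(chunk)) - t)


def rightmost(node: list, height: int) -> bytes:
    """Rightmost leaf chunk of a node at the given height."""
    while height > 0:
        node = node[-1]
        height -= 1
    return node[-1]


def node_level(node: list, height: int, t: int) -> int:
    """L_C(N) for a node: the level of its rightmost leaf chunk."""
    return chunk_level(rightmost(node, height), t)


print(chunk_level(b"\x08", 2))                               # 1
print(node_level([[b"\x01"], [b"\x03", b"\x10"]], 1, 2))     # 2
```

With the toy hash, the chunk `b"\x08"` hashes to 8, which has three trailing zeroes, so with $T = 2$ its level is 1.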

The predicate $z_{C,h}(K)$
on a sequence $K = \langle K_0, \dots, K_e \rangle$
of chunks or of nodes
with respect to a height $h$
is defined as:

- $\text{true}$ if $L(K_e) > h$; otherwise

Collaborator

You're missing a subscript C here.

- $\text{false}$.

<!---
NOTE

Still needed:
a way to specify
the minimum and maximum branching factor
(akin to S_min and S_max for SPLIT_C).

Collaborator

I think we can deal with this as a separate issue.

-->

For conciseness, define

- $P_C(X) = \mathbb{P}_{z_{C,0}}(\operatorname{SPLIT}_C(X))$ and
- $R_C(X) = \mathbb{R}_{z_{C,0}}(\operatorname{SPLIT}_C(X))$

## Algorithm

To compute a hashsplit tree from sequence $X$,
This section contains two descriptions of hashsplit trees:
an algebraic description for formal reasoning,
and a procedural description for practical construction.

### Algebraic description

The tier $N_0$
of hashsplit tree nodes
for a given byte sequence $X$
is equal to

$\langle P_C(X) \rangle \mathbin{\|} R_C(X)$

The tier $N_{h+1}$
of hashsplit tree nodes
for a given byte sequence $X$
is equal to

$\langle \mathbb{P}_{z_{C,h+1}}(N_h) \rangle \mathbin{\|} \mathbb{R}_{z_{C,h+1}}(N_h)$

(I.e., each node in the tree has as its children
a sequence of chunks or lower-tier nodes,
as appropriate,
up to and including the first one
whose “level” is greater than the node’s height.)

The root of the hashsplit tree is $N_{h^\prime,0}$
for the smallest value of $h^\prime$
such that $|N_{h^\prime}| = 1$
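
The tier construction can be sketched in Python under one interpretive assumption: the prefix/remainder step is applied repeatedly along each tier, so that every node ends at the first child whose level exceeds the node's height, as the parenthetical above describes. The function names are hypothetical, and chunk levels are supplied directly as integers rather than computed from a hash.

```python
from typing import List, Tuple


def group(items: list, levels: List[int], h: int) -> Tuple[list, List[int]]:
    """Group items into nodes at height h. Each node ends at the first
    item whose level is greater than h; a node's own level is the level
    of its last item (i.e., of its rightmost leaf chunk)."""
    nodes, node_levels, current = [], [], []
    for item, level in zip(items, levels):
        current.append(item)
        if level > h:
            nodes.append(current)
            node_levels.append(level)
            current = []
    if current:  # trailing items that never satisfied the predicate
        nodes.append(current)
        node_levels.append(levels[-1])
    return nodes, node_levels


def build_tiers(chunks: list, levels: List[int]) -> List[list]:
    """Return the tiers N_0, N_1, ... up to the first tier with one node."""
    tiers = []
    items, item_levels, h = chunks, levels, 0
    while True:
        items, item_levels = group(items, item_levels, h)
        tiers.append(items)
        if len(items) <= 1:
            return tiers
        h += 1


# Five chunks with hypothetical levels 0, 1, 0, 2, 0.
tiers = build_tiers(["a", "b", "c", "d", "e"], [0, 1, 0, 2, 0])
print(tiers[0])      # [['a', 'b'], ['c', 'd'], ['e']]
print(tiers[-1][0])  # the root: [[['a', 'b'], ['c', 'd']], [['e']]]
```

The loop terminates because levels are bounded: once $h$ reaches the largest level present, no item satisfies the predicate and the whole tier collapses into a single node.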

### Procedural description

For this description we use $N_h$ to denote a single node at height $h$.
The algorithm must keep track of the “rightmost” such node for each tier in the tree.

To compute a hashsplit tree from a byte sequence $X$,
compute its “root node” as follows.

1. Let $N_0$ be $(0, \langle\rangle)$ (i.e., a node at depth 0 with no children).
1. Let $N_0$ be $\langle\rangle$ (i.e., a node at height 0 with no children).
2. If $|X| = 0$, then:
a. Let $d$ be the largest depth such that $N_d$ exists.
b. If $|\operatorname{Children}(N_0)| > 0$, then:
i. For each integer $i$ in $[0 .. d]$, “close” $N_i$.
ii. Set $d \leftarrow d+1$.
c. [pruning] While $d > 0$ and $|\operatorname{Children}(N_d)| = 1$, set $d \leftarrow d-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child).
d. **Terminate** with $N_d$ as the root node.
3. Otherwise, set $N_0 \leftarrow (0, \operatorname{Children}(N_0) \mathbin{\|} \langle P(X) \rangle)$ (i.e., add $P(X)$ to the list of children in $N_0$).
4. For each integer $i$ in $[0 .. L(X))$, “close” the node $N_i$ (see below).
5. Set $X \leftarrow R(X)$.
a. Let $h$ be the largest height such that $N_h$ exists.
b. If $|N_0| > 0$, then:
i. For each integer $i$ in $[0 .. h]$, “close” $N_i$ (see below).
ii. Set $h \leftarrow h+1$.
c. [pruning] While $h > 0$ and $|N_h| = 1$, set $h \leftarrow h-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child).
d. **Terminate** with $N_h$ as the root node.
3. Otherwise, set $N_0 \leftarrow N_0 \mathbin{\|} \langle P_C(X) \rangle$ (i.e., add $P_C(X)$ to the list of children in $N_0$).
4. For each integer $i$ in $[0 .. L_C(X))$, “close” the node $N_i$ (see below).
5. Set $X \leftarrow R_C(X)$.
6. Go to step 2.

To “close” a node $N_i$:

1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $(i+1, \langle\rangle)$ (i.e., a node at depth ${i + 1}$ with no children).
2. Set $N_{i+1} \leftarrow (i+1, \operatorname{Children}(N_{i+1}) \mathbin{\|} \langle N_i \rangle)$ (i.e., add $N_i$ as a child to $N_{i+1}$).
3. Let $N_i$ be $(i, \langle\rangle)$ (i.e., new node at depth $i$ with no children).
1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $\langle\rangle$ (i.e., a node at height ${i + 1}$ with no children).
2. Set $N_{i+1} \leftarrow N_{i+1} \mathbin{\|} \langle N_i \rangle$ (i.e., add $N_i$ as a child to $N_{i+1}$).
3. Let $N_i$ be $\langle\rangle$ (i.e., new node at height $i$ with no children).
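
A Python sketch of the procedural construction follows. It is not a literal transcription of the steps, and it makes several assumptions, each marked in the comments: chunks and their levels are supplied up front instead of being derived from $X$, nodes are modeled as nested lists, every non-empty node below the top is closed at end of input, and pruning descends into a node's sole child. The function name `build_root` is hypothetical.

```python
from typing import List


def build_root(chunks: list, levels: List[int]) -> list:
    """Build the root node from chunks and their precomputed levels."""
    nodes = [[]]  # nodes[i] is the rightmost open node at height i

    def close(i: int) -> None:
        if i + 1 == len(nodes):
            nodes.append([])           # no N_{i+1} yet: create it
        nodes[i + 1].append(nodes[i])  # add N_i as a child of N_{i+1}
        nodes[i] = []                  # start a new node at height i

    for chunk, level in zip(chunks, levels):
        nodes[0].append(chunk)  # step 3: add the chunk to N_0
        for i in range(level):  # step 4: close N_0 .. N_{level-1}
            close(i)

    # End of input. Assumption: flush every non-empty node below the
    # top, so that no partially filled node is dropped.
    for i in range(len(nodes) - 1):
        if nodes[i]:
            close(i)

    # Pruning. Assumption: descend into the sole child while the
    # prospective root has exactly one child and is above height 0.
    root, h = nodes[-1], len(nodes) - 1
    while h > 0 and len(root) == 1:
        root, h = root[0], h - 1
    return root


print(build_root(["a", "b", "c", "d", "e"], [0, 1, 0, 2, 0]))
# [[['a', 'b'], ['c', 'd']], [['e']]]
```

For the five-chunk example the result is a root at height 2 with two children, which is the same tree that grouping each tier by level would give.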

# Rolling Hash Functions
