-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Algebraic tree description #23
Conversation
Not sure how coherent this is, feedback is most welcome. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mostly looks reasonable to me (aside from the clerical issue @cole-miller points out). I'm going to try to sit down tomorrow and add this to the reference implementation to make sure I understand it correctly.
Another observation as I work on adding this to the reference implementation: |
Ok, staring at how to implement this I keep coming back to the idea that we can make the symmetry between this and Maybe if we define a I came to all this after having written this for computing the first two cases of fc0 :: Config -> [[Word8]] -> [[[Word8]]]
fc0 cfg cs =
let p = takeWhile (\c -> levelChunk cfg c <= 0) cs
r = dropWhile (\c -> levelChunk cfg c <= 0) cs
cs' = case r of
(x:xs) -> p ++ [x]
[] -> []
in
[p] ++ fc0 cfg cs' Which, if you squint at it, looks similar to the implementation of split :: Config -> [Word8] -> [[Word8]]
split _cfg [] = []
split cfg bytes =
let
i = splitIndex cfg bytes
prefix = take i bytes
remainder = drop i bytes
in
[prefix] ++ split cfg remainder Does that make sense? Do you want to take a whack at a prefix/remainder formulation, or should I? |
I wanted to too, but we were already using the letter H to mean "hash." Maybe we can use a lower-case h and it won't be confusing. I can give that a try.
That does sound better. I'll take a stab at that this weekend. Thanks! |
e8a543a is a rough draft of the prefix/remainder rewrite of the algebraic tree description. It needs a little more rigor - e.g., some of it simply assumes that |
I still feel like there's a lot of duplicated logic, both between this presentation and the one for SPLIT, and just internally in this presentation. e.g. the definitions for the prefix of a sequence of chunks and of a sequence of nodes are almost identical. I think we could make SPLIT itself generic over the element type if we parametrize it over an arbitrary predicate, rather than hard-coding the hash in the description of I(X). Then we could almost just re-use split at every level, applying it iteratively until it gives you only one chunk. The only hitch being that split expects a min/max size, but it might be a good idea to impose min/max bounds for the internal branching factor anyway, to enforce balance (we'd be defining it to be a B+ tree then). Thoughts? |
I see what you mean (finally!). But it'll be another week probably before I can take another whack at it. In the meantime please feel free to iterate on this if you are so inclined. I'll be happy to review. |
…ke both SPLIT_C and the tree algorithm more concise.
In 31786cc I've rewritten SPLIT and the tree description in terms of parameterized prefix/remainder functions. This is still a draft and not ready for merging, but please let me know what you think. Also, please let me know your thoughts on a few NOTEs that I added to the source (in HTML comments to keep them from rendering). |
Cool, I'll do another pass either later tonight or tomorrow. |
I'll do a detailed review tomorrow, but I like the basic shape of this; this is much better. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments in-line, mostly responses to your notes.
I think once all this settles I'm going to want to go over the whole spec to make the presentation a bit more consistent, but modulo the comments I left I think this is good enough to merge for now.
spec.md
Outdated
with respect to a height $h$ | ||
is defined as: | ||
|
||
- $\text{true}$ if $L(K_e) > h$; otherwise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're missing a subscript C
here.
spec.md
Outdated
|
||
It might help the clarity of what follows | ||
if we make parameterization by C implicit | ||
rather than repeating C in subscripts everywhere. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree; feel free to add some verbiage to this affect and change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to take a whack at this, or should I go ahead and merge and we'll treat this as a follow-on task?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A follow-up task sounds good. I made one pass at it and didn't like how it looked.
For the record, this should involve more than just saying that C
is implicit. We should also be clearer about the components of C
, like H
and S_min
. Those are used throughout the document without explicitly connecting them back to C
. Perhaps we need something like H(C)
, S_min(C)
, S_max(C)
, and T(C)
(and maybe also W(C)
as discussed) to select the members of C
(like I did with Children(N)
in an earlier draft of this PR), but that sure seems cumbersome, and is why I'm happy to punt for now.
spec.md
Outdated
- $i \le |X|$ and | ||
- $S_{\text{max}} \ge i \ge S_{\text{min}}$ and | ||
- $H(\langle X_{i-W}, \dots, X_{i-1} \rangle) \mod 2^T = 0$ | ||
I think fixing it at 64 is more properly a part of the recommendation section. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really do want to be prescriptive about this, because fixing it as a constant allows implementations that aren't possible if this is configurable. In particular, this allows using a fixed size array for the ring buffer rather than a dynamically allocated slice (or your language's equivalent). For e.g. @cole-miller's Rust implementation this is a big deal, because it means rolling hashes can be computed in an environment that doesn't supply a heap allocator, and generally it will be more efficient.
I guess in my mind if something is "recommended" then general purpose libraries should still allow a user to configure it, as they may need to to interoperate with other systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify -- I'm already using Rust's nascent support for const generics in my implementation, and it would be pretty simple to make WINDOW_SIZE
a generic parameter in the appropriate place. That would allow users to choose a custom window size, as long as it's specified at compile time (which is a pretty mild restriction), without sacrificing no-alloc
support. The general point about a fixed window size making things easier for implementors is well-taken, though.
A compromise that seems reasonable to me is to explicitly acknowledge the possibility of using other window sizes, while making clear that this is a "bonus feature" that spec-compliant implementations are not required to support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi! Thanks for the comments and sorry for the delay.
In the end I will defer to your preference on this. But before I do I want to push back a bit more. What is the purpose of the formalism in this part of the document? I thought the whole point was to describe essentially the whole space of possible hashsplit algorithms (or at least, all those belonging to a broad family) simply by the selection of appropriate configuration parameters.
This allows formal reasoning about a wide assortment of different hashsplit algorithms, and to that end I see no use in artificially limiting the size of that assortment by removing a parameter. Keeping W
as a configurable value neither complicates the description nor limits our ability to prescribe setting it to 64.
On the other hand, if the only purpose of this document is to prescribe a specific hashsplit algorithm (and perhaps to describe one or two others, like rrs), why bother with the formalism at all? Why not describe exactly and only those algorithms and no others?
(These same argument applies to the input-shorter-than-W business below, unless we're re-imposing the S_min>=W requirement, in which case it's moot.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a problem with making W an opaque parameter from the perspective of the formal description. I suppose part of the trouble is we don't really have a section on recommendations for implementors, and so I'd been thinking of the configuration as much as a set of parameters that would be exposed by a library as part of the formalism. I think it makes sense to expose S_min, S_max and T to library users, I'm waffling on H, and I think W does not, so maybe the right answer is to actually separate these two ideas, and add a separate section that is prescriptive about what conforming library implementations should or should not make configurable. I'll open a new issue about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe the right answer is to actually separate these two ideas, and add a separate section that is prescriptive about what conforming library implementations should or should not make configurable
I've been assuming all along that that was the plan. Sorry for the miscommunication, but I do think that's the right way to go.
spec.md
Outdated
If the hash function wants to treat input shorter than W as being prefixed by zeroes, | ||
it can specify that; | ||
but if it wants to handle input shorter than W differently, | ||
it should be allowed to do that too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gets a little weird because the way the hash functions are currently specified is over a variable length buffer; it's actually well defined what RRS1 and CP32 are when |X| != W
, but a rolling implementation won't be amenable to computing that...
I'm still somewhat inclined to just insist that W <= S_min
, as it would obviate this entirely, and it doesn't really affect the obvious implementation, but @cole-miller wanted this defined so he could do something with iterators of hashes...
@cole-miller, how strongly do you feel about this? I'm still not sure I grok your use case...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think I understand better now how S_min >= W
solves this kind of problem. And since it really applies specifically to the splitting procedure, not to the hash computation step, the implementation strategy I had in mind isn't affected. So I'm fine with re-imposing the restriction.
spec.md
Outdated
Still needed: | ||
a way to specify | ||
the minimum and maximum branching factor | ||
(akin to S_min and S_max for SPLIT_C). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can deal with this as a separate issue.
Merged, thanks! |
An algebraic description of a hashsplit tree, to supplement the procedural description for building one.