Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Algebraic tree description #23

Merged
merged 10 commits into from
Oct 27, 2020
Merged

Algebraic tree description #23

merged 10 commits into from
Oct 27, 2020

Conversation

bobg
Copy link
Contributor

@bobg bobg commented Oct 8, 2020

An algebraic description of a hashsplit tree, to supplement the procedural description for building one.

@bobg bobg changed the title [WIP] Algebraic tree description Algebraic tree description Oct 8, 2020
@bobg
Copy link
Contributor Author

bobg commented Oct 8, 2020

Not sure how coherent this is, feedback is most welcome.

spec.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@zenhack zenhack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly looks reasonable to me (aside from the clerical issue @cole-miller points out). I'm going to try to sit down tomorrow and add this to the reference implementation to make sure I understand it correctly.

spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
@zenhack
Copy link
Collaborator

zenhack commented Oct 9, 2020

Another observation as I work on adding this to the reference implementation: L(X) is not defined if there is no integer satisfying the conditions (i.e. the lowest bit is 1). Probably we should just say it's zero in that case. (and we could then throw the in the word "non-negative" into the constraints and be able to skip the verbiage about why negative values don't matter).

@zenhack
Copy link
Collaborator

zenhack commented Oct 9, 2020

Ok, staring at how to implement this I keep coming back to the idea that we can make the symmetry between this and SPLIT_C more obvious, by taking a prefix/remainder approach, rather than providing an index to F_C.

Maybe if we define a GROUP_C(X) which groups a series of nodes at depth D into a series of nodes at depth D+1 (as an aside: maybe we should call it height rather than depth?), and the root is the first depth which results in a 1 element sequence.

I came to all this after having written this for computing the first two cases of F_C as:

fc0 :: Config -> [[Word8]] -> [[[Word8]]]
fc0 cfg cs =
    let p = takeWhile (\c -> levelChunk cfg c <= 0) cs
        r = dropWhile (\c -> levelChunk cfg c <= 0) cs
        cs' = case r of
                (x:xs) -> p ++ [x]
                []     -> []
    in
    [p] ++ fc0 cfg cs'

Which, if you squint at it, looks similar to the implementation of SPLIT_C:

split :: Config -> [Word8] -> [[Word8]]
split _cfg [] = []
split cfg bytes =
    let
        i = splitIndex cfg bytes
        prefix = take i bytes
        remainder = drop i bytes
    in
    [prefix] ++ split cfg remainder

Does that make sense? Do you want to take a whack at a prefix/remainder formulation, or should I?

@bobg
Copy link
Contributor Author

bobg commented Oct 9, 2020

maybe we should call it height rather than depth?

I wanted to too, but we were already using the letter H to mean "hash." Maybe we can use a lower-case h and it won't be confusing. I can give that a try.

we can make the symmetry between this and SPLIT_C more obvious, by taking a prefix/remainder approach

That does sound better. I'll take a stab at that this weekend. Thanks!

@bobg
Copy link
Contributor Author

bobg commented Oct 10, 2020

e8a543a is a rough draft of the prefix/remainder rewrite of the algebraic tree description. It needs a little more rigor - e.g., some of it simply assumes that X is what's being split without explicitly saying so, and there are some notational inconsistencies - but I wanted to get this draft pushed up to get some 👀 on it sooner rather than later. Does this look like it's on the right track?

@zenhack
Copy link
Collaborator

zenhack commented Oct 10, 2020

I still feel like there's a lot of duplicated logic, both between this presentation and the one for SPLIT, and just internally in this presentation. e.g. the definitions for the prefix of a sequence of chunks and of a sequence of nodes are almost identical.

I think we could make SPLIT itself generic over the element type if we parametrize it over an arbitrary predicate, rather than hard-coding the hash in the description of I(X). Then we could almost just re-use split at every level, applying it iteratively until it gives you only one chunk. The only hitch being that split expects a min/max size, but it might be a good idea to impose min/max bounds for the internal branching factor anyway, to enforce balance (we'd be defining it to be a B+ tree then). Thoughts?

@bobg
Copy link
Contributor Author

bobg commented Oct 12, 2020

I see what you mean (finally!). But it'll be another week probably before I can take another whack at it. In the meantime please feel free to iterate on this if you are so inclined. I'll be happy to review.

…ke both SPLIT_C and the tree algorithm more concise.
@bobg
Copy link
Contributor Author

bobg commented Oct 18, 2020

In 31786cc I've rewritten SPLIT and the tree description in terms of parameterized prefix/remainder functions.

This is still a draft and not ready for merging, but please let me know what you think.

Also, please let me know your thoughts on a few NOTEs that I added to the source (in HTML comments to keep them from rendering).

@bobg bobg mentioned this pull request Oct 18, 2020
@zenhack
Copy link
Collaborator

zenhack commented Oct 18, 2020

Cool, I'll do another pass either later tonight or tomorrow.

@zenhack
Copy link
Collaborator

zenhack commented Oct 20, 2020

I'll do a detailed review tomorrow, but I like the basic shape of this; this is much better.

Copy link
Collaborator

@zenhack zenhack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments in-line, mostly responses to your notes.

I think once all this settles I'm going to want to go over the whole spec to make the presentation a bit more consistent, but modulo the comments I left I think this is good enough to merge for now.

spec.md Outdated
with respect to a height $h$
is defined as:

- $\text{true}$ if $L(K_e) > h$; otherwise
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're missing a subscript C here.

spec.md Outdated

It might help the clarity of what follows
if we make parameterization by C implicit
rather than repeating C in subscripts everywhere.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree; feel free to add some verbiage to this affect and change it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to take a whack at this, or should I go ahead and merge and we'll treat this as a follow-on task?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A follow-up task sounds good. I made one pass at it and didn't like how it looked.

For the record, this should involve more than just saying that C is implicit. We should also be clearer about the components of C, like H and S_min. Those are used throughout the document without explicitly connecting them back to C. Perhaps we need something like H(C), S_min(C), S_max(C), and T(C) (and maybe also W(C) as discussed) to select the members of C (like I did with Children(N) in an earlier draft of this PR), but that sure seems cumbersome, and is why I'm happy to punt for now.

spec.md Outdated
- $i \le |X|$ and
- $S_{\text{max}} \ge i \ge S_{\text{min}}$ and
- $H(\langle X_{i-W}, \dots, X_{i-1} \rangle) \mod 2^T = 0$
I think fixing it at 64 is more properly a part of the recommendation section.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really do want to be prescriptive about this, because fixing it as a constant allows implementations that aren't possible if this is configurable. In particular, this allows using a fixed size array for the ring buffer rather than a dynamically allocated slice (or your language's equivalent). For e.g. @cole-miller's Rust implementation this is a big deal, because it means rolling hashes can be computed in an environment that doesn't supply a heap allocator, and generally it will be more efficient.

I guess in my mind if something is "recommended" then general purpose libraries should still allow a user to configure it, as they may need to to interoperate with other systems.

Copy link
Contributor

@cole-miller cole-miller Oct 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify -- I'm already using Rust's nascent support for const generics in my implementation, and it would be pretty simple to make WINDOW_SIZE a generic parameter in the appropriate place. That would allow users to choose a custom window size, as long as it's specified at compile time (which is a pretty mild restriction), without sacrificing no-alloc support. The general point about a fixed window size making things easier for implementors is well-taken, though.

A compromise that seems reasonable to me is to explicitly acknowledge the possibility of using other window sizes, while making clear that this is a "bonus feature" that spec-compliant implementations are not required to support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! Thanks for the comments and sorry for the delay.

In the end I will defer to your preference on this. But before I do I want to push back a bit more. What is the purpose of the formalism in this part of the document? I thought the whole point was to describe essentially the whole space of possible hashsplit algorithms (or at least, all those belonging to a broad family) simply by the selection of appropriate configuration parameters.

This allows formal reasoning about a wide assortment of different hashsplit algorithms, and to that end I see no use in artificially limiting the size of that assortment by removing a parameter. Keeping W as a configurable value neither complicates the description nor limits our ability to prescribe setting it to 64.

On the other hand, if the only purpose of this document is to prescribe a specific hashsplit algorithm (and perhaps to describe one or two others, like rrs), why bother with the formalism at all? Why not describe exactly and only those algorithms and no others?

(These same argument applies to the input-shorter-than-W business below, unless we're re-imposing the S_min>=W requirement, in which case it's moot.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a problem with making W an opaque parameter from the perspective of the formal description. I suppose part of the trouble is we don't really have a section on recommendations for implementors, and so I'd been thinking of the configuration as much as a set of parameters that would be exposed by a library as part of the formalism. I think it makes sense to expose S_min, S_max and T to library users, I'm waffling on H, and I think W does not, so maybe the right answer is to actually separate these two ideas, and add a separate section that is prescriptive about what conforming library implementations should or should not make configurable. I'll open a new issue about this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the right answer is to actually separate these two ideas, and add a separate section that is prescriptive about what conforming library implementations should or should not make configurable

I've been assuming all along that that was the plan. Sorry for the miscommunication, but I do think that's the right way to go.

spec.md Outdated
If the hash function wants to treat input shorter than W as being prefixed by zeroes,
it can specify that;
but if it wants to handle input shorter than W differently,
it should be allowed to do that too.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets a little weird because the way the hash functions are currently specified is over a variable length buffer; it's actually well defined what RRS1 and CP32 are when |X| != W, but a rolling implementation won't be amenable to computing that...

I'm still somewhat inclined to just insist that W <= S_min, as it would obviate this entirely, and it doesn't really affect the obvious implementation, but @cole-miller wanted this defined so he could do something with iterators of hashes...

@cole-miller, how strongly do you feel about this? I'm still not sure I grok your use case...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think I understand better now how S_min >= W solves this kind of problem. And since it really applies specifically to the splitting procedure, not to the hash computation step, the implementation strategy I had in mind isn't affected. So I'm fine with re-imposing the restriction.

spec.md Outdated
Still needed:
a way to specify
the minimum and maximum branching factor
(akin to S_min and S_max for SPLIT_C).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can deal with this as a separate issue.

@zenhack zenhack merged commit a9d62bd into hashsplit:master Oct 27, 2020
@zenhack
Copy link
Collaborator

zenhack commented Oct 27, 2020

Merged, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants