Algebraic tree description #23

bobg · 2020-10-08T15:30:50Z

An algebraic description of a hashsplit tree, to supplement the procedural description for building one.

…load C; use S for a node's children.

bobg · 2020-10-08T15:34:54Z

Not sure how coherent this is, feedback is most welcome.

spec.md

zenhack

This mostly looks reasonable to me (aside from the clerical issue @cole-miller points out). I'm going to try to sit down tomorrow and add this to the reference implementation to make sure I understand it correctly.

spec.md

…anatory text.

spec.md

zenhack · 2020-10-09T21:03:51Z

Another observation as I work on adding this to the reference implementation: L(X) is not defined if there is no integer satisfying the conditions (i.e. the lowest bit is 1). Probably we should just say it's zero in that case. (and we could then throw the in the word "non-negative" into the constraints and be able to skip the verbiage about why negative values don't matter).

zenhack · 2020-10-09T21:36:58Z

Ok, staring at how to implement this I keep coming back to the idea that we can make the symmetry between this and SPLIT_C more obvious, by taking a prefix/remainder approach, rather than providing an index to F_C.

Maybe if we define a GROUP_C(X) which groups a series of nodes at depth D into a series of nodes at depth D+1 (as an aside: maybe we should call it height rather than depth?), and the root is the first depth which results in a 1 element sequence.

I came to all this after having written this for computing the first two cases of F_C as:

fc0 :: Config -> [[Word8]] -> [[[Word8]]]
fc0 cfg cs =
    let p = takeWhile (\c -> levelChunk cfg c <= 0) cs
        r = dropWhile (\c -> levelChunk cfg c <= 0) cs
        cs' = case r of
                (x:xs) -> p ++ [x]
                []     -> []
    in
    [p] ++ fc0 cfg cs'

Which, if you squint at it, looks similar to the implementation of SPLIT_C:

split :: Config -> [Word8] -> [[Word8]]
split _cfg [] = []
split cfg bytes =
    let
        i = splitIndex cfg bytes
        prefix = take i bytes
        remainder = drop i bytes
    in
    [prefix] ++ split cfg remainder

Does that make sense? Do you want to take a whack at a prefix/remainder formulation, or should I?

bobg · 2020-10-09T23:10:17Z

maybe we should call it height rather than depth?

I wanted to too, but we were already using the letter H to mean "hash." Maybe we can use a lower-case h and it won't be confusing. I can give that a try.

we can make the symmetry between this and SPLIT_C more obvious, by taking a prefix/remainder approach

That does sound better. I'll take a stab at that this weekend. Thanks!

…ption.

bobg · 2020-10-10T17:31:32Z

e8a543a is a rough draft of the prefix/remainder rewrite of the algebraic tree description. It needs a little more rigor - e.g., some of it simply assumes that X is what's being split without explicitly saying so, and there are some notational inconsistencies - but I wanted to get this draft pushed up to get some 👀 on it sooner rather than later. Does this look like it's on the right track?

zenhack · 2020-10-10T22:52:37Z

I still feel like there's a lot of duplicated logic, both between this presentation and the one for SPLIT, and just internally in this presentation. e.g. the definitions for the prefix of a sequence of chunks and of a sequence of nodes are almost identical.

I think we could make SPLIT itself generic over the element type if we parametrize it over an arbitrary predicate, rather than hard-coding the hash in the description of I(X). Then we could almost just re-use split at every level, applying it iteratively until it gives you only one chunk. The only hitch being that split expects a min/max size, but it might be a good idea to impose min/max bounds for the internal branching factor anyway, to enforce balance (we'd be defining it to be a B+ tree then). Thoughts?

bobg · 2020-10-12T00:15:37Z

I see what you mean (finally!). But it'll be another week probably before I can take another whack at it. In the meantime please feel free to iterate on this if you are so inclined. I'll be happy to review.

…ke both SPLIT_C and the tree algorithm more concise.

bobg · 2020-10-18T17:14:23Z

In 31786cc I've rewritten SPLIT and the tree description in terms of parameterized prefix/remainder functions.

This is still a draft and not ready for merging, but please let me know what you think.

Also, please let me know your thoughts on a few NOTEs that I added to the source (in HTML comments to keep them from rendering).

zenhack · 2020-10-18T19:22:13Z

Cool, I'll do another pass either later tonight or tomorrow.

zenhack · 2020-10-20T05:41:27Z

I'll do a detailed review tomorrow, but I like the basic shape of this; this is much better.

zenhack

Some comments in-line, mostly responses to your notes.

I think once all this settles I'm going to want to go over the whole spec to make the presentation a bit more consistent, but modulo the comments I left I think this is good enough to merge for now.

zenhack · 2020-10-21T21:49:32Z

spec.md

+with respect to a height $h$
+is defined as:
+
+- $\text{true}$ if $L(K_e) > h$; otherwise


You're missing a subscript C here.

zenhack · 2020-10-21T21:57:54Z

spec.md

+
+It might help the clarity of what follows
+if we make parameterization by C implicit
+rather than repeating C in subscripts everywhere.


I agree; feel free to add some verbiage to this affect and change it.

Do you want to take a whack at this, or should I go ahead and merge and we'll treat this as a follow-on task?

A follow-up task sounds good. I made one pass at it and didn't like how it looked.

For the record, this should involve more than just saying that C is implicit. We should also be clearer about the components of C, like H and S_min. Those are used throughout the document without explicitly connecting them back to C. Perhaps we need something like H(C), S_min(C), S_max(C), and T(C) (and maybe also W(C) as discussed) to select the members of C (like I did with Children(N) in an earlier draft of this PR), but that sure seems cumbersome, and is why I'm happy to punt for now.

zenhack · 2020-10-21T22:02:44Z

spec.md

- $i \le |X|$ and
- $S_{\text{max}} \ge i \ge S_{\text{min}}$ and
- $H(\langle X_{i-W}, \dots, X_{i-1} \rangle) \mod 2^T = 0$
+I think fixing it at 64 is more properly a part of the recommendation section.


I really do want to be prescriptive about this, because fixing it as a constant allows implementations that aren't possible if this is configurable. In particular, this allows using a fixed size array for the ring buffer rather than a dynamically allocated slice (or your language's equivalent). For e.g. @cole-miller's Rust implementation this is a big deal, because it means rolling hashes can be computed in an environment that doesn't supply a heap allocator, and generally it will be more efficient.

I guess in my mind if something is "recommended" then general purpose libraries should still allow a user to configure it, as they may need to to interoperate with other systems.

Just to clarify -- I'm already using Rust's nascent support for const generics in my implementation, and it would be pretty simple to make WINDOW_SIZE a generic parameter in the appropriate place. That would allow users to choose a custom window size, as long as it's specified at compile time (which is a pretty mild restriction), without sacrificing no-alloc support. The general point about a fixed window size making things easier for implementors is well-taken, though.

A compromise that seems reasonable to me is to explicitly acknowledge the possibility of using other window sizes, while making clear that this is a "bonus feature" that spec-compliant implementations are not required to support.

Hi! Thanks for the comments and sorry for the delay.

In the end I will defer to your preference on this. But before I do I want to push back a bit more. What is the purpose of the formalism in this part of the document? I thought the whole point was to describe essentially the whole space of possible hashsplit algorithms (or at least, all those belonging to a broad family) simply by the selection of appropriate configuration parameters.

This allows formal reasoning about a wide assortment of different hashsplit algorithms, and to that end I see no use in artificially limiting the size of that assortment by removing a parameter. Keeping W as a configurable value neither complicates the description nor limits our ability to prescribe setting it to 64.

On the other hand, if the only purpose of this document is to prescribe a specific hashsplit algorithm (and perhaps to describe one or two others, like rrs), why bother with the formalism at all? Why not describe exactly and only those algorithms and no others?

(These same argument applies to the input-shorter-than-W business below, unless we're re-imposing the S_min>=W requirement, in which case it's moot.)

I don't have a problem with making W an opaque parameter from the perspective of the formal description. I suppose part of the trouble is we don't really have a section on recommendations for implementors, and so I'd been thinking of the configuration as much as a set of parameters that would be exposed by a library as part of the formalism. I think it makes sense to expose S_min, S_max and T to library users, I'm waffling on H, and I think W does not, so maybe the right answer is to actually separate these two ideas, and add a separate section that is prescriptive about what conforming library implementations should or should not make configurable. I'll open a new issue about this.

maybe the right answer is to actually separate these two ideas, and add a separate section that is prescriptive about what conforming library implementations should or should not make configurable

I've been assuming all along that that was the plan. Sorry for the miscommunication, but I do think that's the right way to go.

zenhack · 2020-10-21T22:07:28Z

spec.md

+If the hash function wants to treat input shorter than W as being prefixed by zeroes,
+it can specify that;
+but if it wants to handle input shorter than W differently,
+it should be allowed to do that too.


This gets a little weird because the way the hash functions are currently specified is over a variable length buffer; it's actually well defined what RRS1 and CP32 are when |X| != W, but a rolling implementation won't be amenable to computing that...

I'm still somewhat inclined to just insist that W <= S_min, as it would obviate this entirely, and it doesn't really affect the obvious implementation, but @cole-miller wanted this defined so he could do something with iterators of hashes...

@cole-miller, how strongly do you feel about this? I'm still not sure I grok your use case...

Yeah, I think I understand better now how S_min >= W solves this kind of problem. And since it really applies specifically to the splitting procedure, not to the hash computation step, the implementation strategy I had in mind isn't affected. So I'm fine with re-imposing the restriction.

zenhack · 2020-10-21T22:08:03Z

spec.md

+Still needed:
+a way to specify
+the minimum and maximum branching factor
+(akin to S_min and S_max for SPLIT_C).


I think we can deal with this as a separate issue.

zenhack · 2020-10-27T17:40:10Z

Merged, thanks!

bobg added 3 commits October 2, 2020 08:25

WIP: algebraic description of hashsplit-tree construction.

8feac1b

Flesh out the rest of the algebraic tree description. Also don't over…

3f14f6a

…load C; use S for a node's children.

Merge branch 'master' into algebraic-tree

d4f1cfd

bobg changed the title ~~[WIP] Algebraic tree description~~ Algebraic tree description Oct 8, 2020

cole-miller reviewed Oct 8, 2020

View reviewed changes

spec.md Outdated Show resolved Hide resolved

zenhack reviewed Oct 8, 2020

View reviewed changes

spec.md Outdated Show resolved Hide resolved

Various fixes (including an important correctness fix) plus some expl…

8b82dc2

…anatory text.

bobg commented Oct 9, 2020

View reviewed changes

spec.md Outdated Show resolved Hide resolved

bobg added 3 commits October 10, 2020 09:27

Use "height" and "h" instead of "depth" and "D."

94de072

Fix "level" to always be non-negative.

d9a8fe5

Very rough draft of prefix/remainder rewrite of algebraic tree descri…

e8a543a

…ption.

Factor out a parameterized prefix/remainder notation and use it to ma…

31786cc

…ke both SPLIT_C and the tree algorithm more concise.

bobg mentioned this pull request Oct 18, 2020

Specify buzhash #24

Merged

zenhack reviewed Oct 21, 2020

View reviewed changes

This was referenced Oct 24, 2020

Add rationale re: requiring W <= Smin #11

Open

Tests for S_min < W hashsplit/test-suite#3

Closed

Add a missing C subscript.

eb1b5b6

zenhack mentioned this pull request Oct 27, 2020

Clearer distinction between recommendations to library authors vs. parameters for reasoning purposes in the formalism. #25

Open

Remove inline notes to reviewers.

40da42b

zenhack merged commit a9d62bd into hashsplit:master Oct 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Algebraic tree description #23

Algebraic tree description #23

bobg commented Oct 8, 2020

bobg commented Oct 8, 2020

zenhack left a comment

zenhack commented Oct 9, 2020

zenhack commented Oct 9, 2020

bobg commented Oct 9, 2020

bobg commented Oct 10, 2020 •

edited

Loading

zenhack commented Oct 10, 2020

bobg commented Oct 12, 2020

bobg commented Oct 18, 2020

zenhack commented Oct 18, 2020

zenhack commented Oct 20, 2020

zenhack left a comment

zenhack Oct 21, 2020

zenhack Oct 21, 2020

zenhack Oct 27, 2020

bobg Oct 27, 2020

zenhack Oct 21, 2020

cole-miller Oct 21, 2020 •

edited

Loading

bobg Oct 27, 2020

zenhack Oct 27, 2020

bobg Oct 27, 2020

zenhack Oct 21, 2020

cole-miller Oct 22, 2020

zenhack Oct 21, 2020

zenhack commented Oct 27, 2020

Algebraic tree description #23

Algebraic tree description #23

Conversation

bobg commented Oct 8, 2020

bobg commented Oct 8, 2020

zenhack left a comment

Choose a reason for hiding this comment

zenhack commented Oct 9, 2020

zenhack commented Oct 9, 2020

bobg commented Oct 9, 2020

bobg commented Oct 10, 2020 • edited Loading

zenhack commented Oct 10, 2020

bobg commented Oct 12, 2020

bobg commented Oct 18, 2020

zenhack commented Oct 18, 2020

zenhack commented Oct 20, 2020

zenhack left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cole-miller Oct 21, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zenhack commented Oct 27, 2020

bobg commented Oct 10, 2020 •

edited

Loading

cole-miller Oct 21, 2020 •

edited

Loading