Description
Since we intend to favor streaming parsing, we need to consider a format suited for streaming.
Strings + Lazy parsing
One of the problems we are going to encounter is the combination of strings and lazy parsing:
- consider two independent lazy functions `foo` and `bar`, where `bar` is somewhere further down the stream from `foo`;
- assume that `foo` defines a literal string `s` that does not show up in our AOT dictionary;
- how should `bar` refer to `s` in such a way that we do not first need to parse `foo`?
One way to do this is the following:
- divide the stream in packets;
- each packet starts with a table of strings, which may be used by that packet and by every packet further down the line.
If we do so, the packet containing `foo` will define literal string `s`. The packet containing `bar` will either be the same packet or a packet further down the line, and will be able to access `s`.
As a bonus, this will let us compress these string tables using a well-known algorithm, such as brotli.
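The per-packet string tables described above can be sketched as follows. This is only an illustration under assumptions: the names `StringPacket` and `StringResolver`, the flat index space shared with the AOT dictionary, and the representation of a packet body as a list of string indices are all hypothetical and not part of any actual format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StringPacket:
    # Strings introduced by this packet (hypothetical layout).
    strings: List[str]
    # Stand-in for the encoded body: indices into the accumulated string table.
    body: List[int] = field(default_factory=list)


class StringResolver:
    """Accumulates string tables in stream order, starting from the AOT dictionary."""

    def __init__(self, aot_dictionary: List[str]) -> None:
        self._table = list(aot_dictionary)

    def begin_packet(self, packet: StringPacket) -> None:
        # Only the packet's table is read here; its body does not need to be parsed.
        self._table.extend(packet.strings)

    def resolve(self, index: int) -> str:
        return self._table[index]


# Usage: the first packet holds `foo` and defines "s"; the second holds `bar`.
resolver = StringResolver(["use strict"])
first = StringPacket(strings=["s"], body=[1])
second = StringPacket(strings=["t"], body=[1, 2])

resolved = []
for packet in (first, second):
    resolver.begin_packet(packet)
    resolved.append([resolver.resolve(i) for i in packet.body])
```

In this sketch, the body of the second packet refers to `s` by index even though the body of the first packet is never interpreted: only the table at the start of each packet is read.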
Model State + Lazy Parsing
We will need to adapt our models to restart from a well-specified state whenever parsing a lazy function.
(TBD)
Offsets + Entropy + Streaming
We need the ability to tell the decoder where to fetch a lazy function. In non-entropy-coding versions, we could reference the actual offset at which a lazy function was encoded. With entropy coding, offsets make no sense.
A partial solution would be the following:
- each packet may contain a number of (aligned) lazy declarations;
- each packet's header declares the lazy declarations included in this packet (as keys; the actual value of a key is an arbitrary string), with their starting-offset-in-packet;
- when encoding a `[lazy]` field, we specify the key at which to find the content of the field;
- note that a lazy declaration could span over several packets.
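A minimal sketch of the lookup this implies, assuming each packet header is already available as a mapping from key to starting-offset-in-packet. The names `LazyPacket` and `find_lazy` are hypothetical, and the sketch only locates where a declaration starts; it does not handle a declaration that continues into later packets.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class LazyPacket:
    # Header: key of each lazy declaration -> its starting offset in `payload`.
    lazy_offsets: Dict[str, int]
    # Opaque packet content (placeholder bytes in this sketch).
    payload: bytes


def find_lazy(packets: List[LazyPacket], key: str) -> Optional[Tuple[int, int]]:
    """Return (packet index, offset in packet) where `key` starts, or None."""
    for index, packet in enumerate(packets):
        if key in packet.lazy_offsets:
            return index, packet.lazy_offsets[key]
    return None


# Usage: a [lazy] field stores only the key; the headers tell us where to look.
packets = [
    LazyPacket(lazy_offsets={"foo": 0}, payload=b"FOOBODY"),
    LazyPacket(lazy_offsets={"bar": 4}, payload=b"....BARBODY"),
]
location = find_lazy(packets, "bar")
missing = find_lazy(packets, "baz")
```

Only the headers are consulted during the search, so a decoder following this sketch would not need to interpret the content of any packet other than the one holding the requested key.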