
wrsturgeon/inator


inator: An evil parsing library.

You supply the evil plan; we build the inator.

Portrait of the eminent Dr. Heinz Doofenshmirtz

🚧 Development still ongoing! 🚧

TL;DR

We ask for a specification and turn it into a graph that knows exactly what can happen at each step. We then ruthlessly cut things that won't happen, combine identical ones, and output the result as a Rust file. Compile this with the rest of your code and, voilà, you've got a hand-rolled zero-copy parser.

The long explanation

Given a language like "AAAAAAAAB or AAAAAAAAC", we'd like to write a parser that takes the common-sense route: blow through all the As as a group, then match on the B or C once. This has several advantages:

  • We never rewind the tape, so we don't need to store any previous input.
  • In the limit, it's twice as fast as trying one then the other. With n alternatives for the last character, it's n times faster.

This general idea goes by the name zero-copy streaming parsers, and the pleasure of constructing them usually ranks similarly to being repeatedly stabbed.
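Written out by hand, that common-sense route is a single forward pass over the input. Here is a minimal Rust sketch for the example language above (eight As followed by B or C), with no buffering and no rewinding; this is an illustration of the idea, not this library's output:

```rust
/// Hand-rolled, zero-rewind recognizer for "AAAAAAAAB or AAAAAAAAC":
/// consume the shared prefix of 'A's as a group, then branch exactly once.
fn accepts(input: &str) -> bool {
    let mut chars = input.chars();
    // Blow through the eight 'A's without ever storing or rewinding them.
    for _ in 0..8 {
        if chars.next() != Some('A') {
            return false;
        }
    }
    // A single match on the last character covers every alternative.
    match chars.next() {
        Some('B') | Some('C') => chars.next().is_none(),
        _ => false,
    }
}
```

With n alternatives for the last character, only the final match grows; the shared prefix is still consumed exactly once.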

Yet, as far as I know, no existing parsing library tries to optimize these cases at compile time, and for good reason: it's hard.

The problem is that the more we can do with a parser, the less we can say about it before it runs. I've tried to strike a balance between the two with a model of computation similar to pushdown automata that maps nicely onto stack-based bare-metal code. Here's the definition, with overloaded terms italicized:

  • A parser is a graph.
  • A graph is a set of states with an initial index.
  • An index is implementation-defined (literally, by a Rust trait), but it's usually either a natural number or a set thereof, corresponding to deterministic or nondeterministic evaluation, respectively.
  • A state is a set of curried transitions and a (possibly empty) set of error messages; if the input stream ends on a state with no error messages, we accept the input.
  • A curried transition can take two forms:
    • Accept any input token and trigger one transition for them all, or
    • Map a set of disjoint input ranges to potentially different transitions for each, and have an optional fallback transition if no ranges match. Note that we cannot inspect the stack before choosing a transition.
  • A transition is one of three things:
    • Lateral: Don't touch the stack; just move to a new state index.
    • Return: Pop from the stack (reject if the stack is empty) and move to the state index we just popped. Note that this is exactly how Rust's actual call stack works in assembly. Also note that we can't move to a specific state with a return/pop statement; this would entail returning a function pointer at runtime, which is pointlessly (get it?) slow.
    • Call: Push a specified destination index onto the stack and move to a specified detour index. Note that the detour index needs to have a return/pop statement, at which it will move to the specified destination index, but (as of yet) we don't check that a return statement is reachable from any given detour index.

You may have noticed that the structure of states and transitions maps remarkably well onto functions and calls:

  • States are functions;
  • Lateral transitions are tail calls;
  • Return transitions are actual return statements; and
  • Calls are function calls that are not at the end of a block of execution (if they were, they would be lateral transitions).

And this is exactly how we transcompile it.
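That mapping is easiest to see in ordinary Rust. Below is a hand-written, library-free sketch (the generated code differs) recognizing balanced parentheses: the recursive call is a call transition whose continuation sits on Rust's own call stack, the loop back to the top is a lateral transition, and the `return` statements are return transitions:

```rust
use std::iter::Peekable;

// State: zero or more "( ... )" groups, then hand control back to the caller.
fn groups<I: Iterator<Item = char>>(input: &mut Peekable<I>) -> bool {
    while input.peek() == Some(&'(') {
        input.next(); // consume '('
        // Call transition: push the rest of this function (the continuation)
        // on Rust's call stack and detour into the nested region.
        if !groups(input) {
            return false;
        }
        // Back from the detour: the region must close before we continue.
        if input.next() != Some(')') {
            return false;
        }
    }
    true // return transition: pop back to whoever called us
}

fn accepts(s: &str) -> bool {
    let mut it = s.chars().peekable();
    groups(&mut it) && it.next().is_none() // accept only at end of input
}
```

Note how the return transition cannot choose where it lands: it goes wherever the caller pushed, exactly as described above.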

Lastly, on top of this graph, parsers output data. We can't prove anything about the values you compute along the way (it's essentially one model of computation, Rust, riding on top of another, the automata), but, in practice, being able to have your parsers output something (e.g., an abstract syntax tree) without having to run a lexer first is invaluable.

The automata directory in this repository contains both an interpreter and a compiler for this language, and I remain extremely confident that their operation is always equivalent, but property-testing involving compiling Rust code is extremely difficult. In the future, I would like to prove their equivalence in Coq and extract the proof into Haskell and OCaml versions of this library. Way in the future! 🔮
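For illustration only, here is a toy interpreter over stand-in types for the model defined above. The crate's real automata types are richer, and the index here is hard-coded to a natural number (the deterministic case):

```rust
// Stand-in types, not the crate's API: a transition per the definition above.
#[derive(Clone, Copy)]
enum Transition {
    Lateral(usize),                     // move to a new state index
    Return,                             // pop the stack, jump to the popped index
    Call { detour: usize, dst: usize }, // push `dst`, jump to `detour`
}

struct State {
    on: Vec<(char, char, Transition)>, // disjoint inclusive input ranges
    fallback: Option<Transition>,      // optional transition if no range matches
    accepting: bool,                   // stands in for "no error messages"
}

fn run(states: &[State], initial: usize, input: &str) -> bool {
    let mut at = initial;
    let mut stack: Vec<usize> = Vec::new();
    for c in input.chars() {
        let state = &states[at];
        let chosen = state
            .on
            .iter()
            .find(|&&(lo, hi, _)| lo <= c && c <= hi)
            .map(|&(_, _, t)| t)
            .or(state.fallback);
        match chosen {
            None => return false, // no transition: reject
            Some(Transition::Lateral(dst)) => at = dst,
            Some(Transition::Return) => match stack.pop() {
                Some(dst) => at = dst,
                None => return false, // reject: return on an empty stack
            },
            Some(Transition::Call { detour, dst }) => {
                stack.push(dst);
                at = detour;
            }
        }
    }
    states[at].accepting
}

// A two-state machine for balanced parentheses: '(' is a call, ')' a return.
fn paren_machine() -> Vec<State> {
    vec![
        State { // state 0: top level (accepting)
            on: vec![('(', '(', Transition::Call { detour: 1, dst: 0 })],
            fallback: None,
            accepting: true,
        },
        State { // state 1: inside at least one open paren
            on: vec![
                ('(', '(', Transition::Call { detour: 1, dst: 1 }),
                (')', ')', Transition::Return),
            ],
            fallback: None,
            accepting: false,
        },
    ]
}
```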

What does this whole process look like?

Surprisingly, it looks a lot like just writing down what you want. The key idea is that parsers are data, and you can pass them around, modify them, and combine them just like anything else.

Here's how we parse either "abc" or "azc":

use inator::toss; // Combinator that filters for a certain character, then forgets it.
let a = toss('a');
let b = toss('b');
let c = toss('c');
let z = toss('z');
let abc_azc = a >> (b | z) >> c;

The above gets compiled down to a function that takes an iterator,

  • checks that the first item is a;
  • runs a match statement on the second item without rewinding, sending both b and z to the same third state;
  • checks that the third item is c; then
  • checks that the next item is None.

It's not remarkable when spelled out, but most simple parsers would allocate a buffer, read input into the buffer, rewind to the top, try abc, rewind, try azc, then reject. By the time a similar parser has even allocated its buffer, let alone read the input, our parser is almost surely already done. Plus, this approach to parsing requires zero allocations, so even microcontrollers running bare-metal code couldn't get any faster if you tried.
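The emitted function is plain Rust along these lines (a sketch of the shape, not the actual generated code):

```rust
/// One forward pass over the iterator: each match consumes one token and
/// moves to the next state; nothing is buffered, so nothing is rewound.
fn abc_azc(mut input: impl Iterator<Item = char>) -> Result<(), &'static str> {
    // State 0: expect 'a'.
    match input.next() {
        Some('a') => {}
        _ => return Err("expected 'a'"),
    }
    // State 1: 'b' and 'z' both go to the same third state.
    match input.next() {
        Some('b') | Some('z') => {}
        _ => return Err("expected 'b' or 'z'"),
    }
    // State 2: expect 'c'.
    match input.next() {
        Some('c') => {}
        _ => return Err("expected 'c'"),
    }
    // Accepting state: the input must end here.
    match input.next() {
        None => Ok(()),
        Some(_) => Err("trailing input"),
    }
}
```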

Then we can take that whole above parser and pass it around, e.g. to put it in parentheses:

// Copied from above:
use inator::toss;
let a = toss('a');
let b = toss('b');
let c = toss('c');
let z = toss('z');
let abc_azc = a >> (b | z) >> c;

// Function from a parser to another parser!
let parenthesized = |p| toss('(') >> p >> toss(')');

let paren_abc_azc = parenthesized(
    abc_azc, // <-- Parser we just made above, passed around as data
);

Above, if p is a parser that accepts ABC, then parenthesized(p) will accept (ABC), and likewise for any other language in place of ABC. Simple as that.
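To see why passing parsers around works, here is a library-free sketch of the same idea, modeling a parser as a boxed function from input to the unconsumed remainder. The names `token` and `then` are stand-ins, not the crate's API:

```rust
// A parser is data: a boxed function from input to what's left of it.
type Parser = Box<dyn Fn(&str) -> Option<&str>>;

// Match a single character and forget it (roughly what `toss` does).
fn token(c: char) -> Parser {
    Box::new(move |s: &str| s.strip_prefix(c))
}

// Sequence two parsers (roughly what `>>` does).
fn then(p: Parser, q: Parser) -> Parser {
    Box::new(move |s| p(s).and_then(|rest| q(rest)))
}

// A function from a parser to another parser, just like in the README.
fn parenthesized(p: Parser) -> Parser {
    then(token('('), then(p, token(')')))
}
```

Because parsers are ordinary values, `parenthesized` composes with anything: nothing about it depends on what `p` accepts.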

If you need to nest parentheses (or any other delimiters) and verify that everything matches up, there's a built-in function for that. region takes five arguments:

  • name, a &'static str describing the region (e.g. in error messages);
  • open, a parser that opens the region (here, it would be toss('('));
  • contents, the parser that runs inside the region;
  • close, a parser that closes the region (here, it would be toss(')')); and
  • combine, which is a bit more complicated.
    • Every parser returns a value, but after a call, we have two: what we had before, and the return value from the call. You can combine these two values in any way you'd like, including by throwing one or the other out.
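As a concrete illustration of combine's role only (the names and types here are hypothetical, not the library's real signature): after the call we hold the value built before the region and the value the region returned, and combine folds them into one.

```rust
// Hypothetical value type for the sketch; the real output type is yours.
#[derive(Debug, PartialEq)]
enum Ast {
    Word(char),
    Group(Vec<Ast>),
}

// One possible `combine` for a parenthesized region: keep both values,
// wrapping the region's contents as a nested group in the surrounding list.
fn combine(before: Vec<Ast>, returned: Vec<Ast>) -> Vec<Ast> {
    let mut out = before;
    out.push(Ast::Group(returned));
    out
}
```

Throwing one value away is just as valid a combine; that's exactly what toss does with the characters it matches.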

Anything else cool you can do?

Yes! Since we're really just riding on top of a decision-problem automaton, I'm working on (but confident about) taking a specification and inverting it to fuzz with an infinite stream of strings that are all guaranteed to be parsed correctly. If you're writing a language, this means automatically generating all possible valid source files, which would be huge. After the redesign of automata above, this is back in the to-do pile.

Why not other parsing libraries?

Please try other parsing libraries! This began as a personal tool, but it turned out much better than I expected, so please put it to good use and send feedback!

Acknowledgments

Haskell's parsing libraries (and UPenn's Haskell course), for showing me that parsers can even work this way.

Rajeev Alur (and UPenn's CIS 262), for formally introducing me to nondeterministic finite automata.

Rust, for making this possible.
