design.txt


Introduction

I am going to describe some of the decisions behind the
ParserCombinator library, which, I hope, will show how useful Julia's
method-centred, multiple dispatch can be.

OK, so I wanted to write a parser combinator library in Julia.

If you don't know what a parser combinator library is, here's a quick
sketch:

  There's a pretty neat, easy to understand, way of writing parsers by
  hand, which is common in functional languages, where you construct a
  "tree" (or even a DAG) of functions that take a stream of characters
  as input and return some structured output as a result (along with
  the remaining text).

  The stream of characters is (of course) what you want to parse, and
  the structured output becomes your AST.

  Here's a simple (untested!) example (in Julia):

      function equals(text)
          return function _(string)
              n = length(text)
              if string[1:n] == text
                  return string[n:end], text
              else
                  throw(Backtrack())
              end
          end
      end

  By itself that doesn't seem very useful, but consider

      function follows(m1, m2)
          return function _(string)
              string1, result1 = m1(string)
              string2, result2 = m2(string1)
              return string2, [result1, result2]
          end
      end

  which can be used like:

     grammar = follows(equals("hello"), equals("world"))
     grammar("helloworld")
  
  and returns ["hello", "world"].

  And, of course, you can go crazy, with functions that test
  alternatives, or repeat a given number of times, or as often as
  possible, etc etc.

So that's the kind of thing I wanted to write.  But when you start
digging into the details, it's not quite so simple.


Complications

First, Julia isn't intended to be used as a functional language.  It's
not designed to support deep, recursive calls of functions (in fact,
the stack limit in a simple test is a little greater than 100,000 on
my machine, so providing you write combinators for multiple matches
that work with iteration rather than recursion, that's probably not
such a serious issue).

Second, there are different variations on the general idea of "parser
combinators", which work in slightly different ways.  For example, you
might want to cache calls to matchers so that when the "same" call is
made a second time, the cached value is used.  Or you might want to
write the parser so that all possible parses (of an ambiguous grammar)
are returned.  Or you might want to restrict backtracking so that
error messages are more reliable.  Or you might want to display a
record of what is happening to help with debugging.  Or...

These variations, on a little inspection, tend to be connected more
with how the parser "executes", rather than with how the individual
matchers are implemented.

What do I mean by that?  Take, for example, the idea that you cache
results to avoid repeated evaluation.  That is trivial in a "lazy"
language.  But in an eager language (like Julia) you need to intercept
each call to a function, so you can check whether it was cached
earlier.  These are, in a sense, "meta" issues, that have little to do
with checking whether one string is equal to another.

So the question is: can we write our library in a way that makes it
easy to change how the parser "executes"?  And the answer is,
thankfully, yes (or I wouldn't be writing this article)!

How do we do this?  We need two things.  First, we need to use a
"trampoline" to take control of execution.  Second, we need some way
of allowing different behaviors to be "plugged in".  In Julia, this is
typically done by multiple dispatch with methods.  And it works really
well.


Trampolines

All a trampoline is, really, is a "manual" replacement for what a
compiler does automatically: it's a loop, with a stack, that calls
each function in turn, checks what the result is, and then calls
another function, as appropriate.

In practice, what that means is that when we write a function, and we
need to call some other function, we don't just "make the call".
Instead, we return a "message" to the trampoline that says "please
call this other function and then give me the result".

In addition, of course, we need the main loop, which receives these
messages and does the work.

That may sound complicated, but in fact it's not so much work.
Particularly when you're only evaluating matchers in a parser, which
all work in a similar way.

Here's most of the code needed for the main loop, taken from
parsers.jl:

    type NoCache<:Config
        source::Any
        stack::Array{Tuple{Matcher, State},1}
    end

    function dispatch(k::NoCache, e::Execute)
        push!(k.stack, (e.parent, e.parent_state))
        execute(k, e.child, e.child_state, e.iter)
    end

    function dispatch(k::NoCache, s::Success)
        (parent, parent_state) = pop!(k.stack)
        success(k, parent, parent_state, s.child_state, s.iter, s.result)
    end

    function dispatch(k::NoCache, f::Failure)
        (parent, parent_state) = pop!(k.stack)
        failure(k, parent, parent_state)
    end

Those functions are simply responding to the different message types
(Execute, Success and Failure) by manipulating the stack and calling
the appropriate matchers.  Everything is driven by this main loop:

    while true
        msg = dispatch(k, msg)
    end
    
(plus some extra details for handling the setup and teardown).

Now if you're trying to understand the above in detail, you're
probably wondering what execute(), success() and failure() are.  These
are the methods where the matcher itself is implemented.

For example, an "equals" matcher looks like this:

    type Equal<:Matcher
        string
    end

    function execute(k::Config, m::Equal, s::Clean, i)
        for x in m.string
            if done(k.source, i)
                return FAILURE
            end
            y, i = next(k.source, i)
            if x != y
                return FAILURE
            end
        end
        Success(DIRTY, i, Any[m.string])
    end

where FAILURE is the singleton Failure() message and most of the work
is just checking for equality, character by character.

A more complex matcher, which calls out to child matchers, has work
spread across several methods.  Its success() method is called when a
child matcher successfully matches, for example.

You may still be a little confused, however, because Equal, above, is
a type, not a function.  That's because it makes more sense in Julia
to work this way.

Let me explain in more detail...


Grammar As Types

Originally, when I sketched how parser combinators worked, I described
them as functions that returned functions that parsed data (see
"equals()" and "follows()" above).

But in Julia, when you use a trampoline, it makes much more sense to
use types.  So constructing a grammar is not done be calling
functions, but by calling type constructors (like "Equal()" above).

That's because Julia is based around methods, which are dispatched on
multiple types.  By using types to describe our grammar we can then
use method dispatch to implement the work.

And that's what you can see above.  The execute() method includes
"m::Equal", which means that specific execute() is called ONLY when m
is an instance of Equal.  So that execute() is equivalent, roughly, to
the anonymous function in the original sketch.

It's similar to OOP, where the method "belongs" to the type.

To make it completely clear, let's work through things in more detail.

We have to start somewhere, and I am trying to avoid the details of
how we bootstrap or return a final result, since the important idea is
"in the middle".  So let's just assume that some other matcher (an
instance of a type) is similar to the "follows()" function above, in
that it contains two Equals children.  And we'll start with the
execute() method associated with that matcher returning an Execute
message to the trampoline, which contains the first Equal child
(containing the string "hello").

So the parent matcher is saying to the trampoline "call this Equal for
me".

The main loop of the trampoline receives the Execute message, saves
the parent matcher on the stack, and then calls the dispatch() method
for Execute.  That results in calling the execute(... m::Equal, ...)
method, which checks the input against "hello" in the child.

Assuming that matches, the child returns a Success message, which the
trampoline receives.  The trampoline pops the original matcher from
the stack and calls its success() method.  That, in turn, then returns
another Execute message, to match "world", etc etc.

Obviously these matchers need to save data as they work.  The parent
matcher discussed above, for example, needs to save "hello" while it
calls the second Equals for "world".  This is done in the State types
that appear in the dispatch code.

A very nice side effect of having this explicit state is that it makes
adding memoization very easy, because we know exactly when the context
is identical (when the state is identical) and so can decide when to
use the cache.


Was That Worth It?!

OK, so at this point you're probably starting to see how things work
(I hope).  It may help to think of the grammar (the DAG of types) as a
"program" that the trampoline is executing - it's very like an
interpreter, where the grammar is the "program" that the trampoline
"executes" given some input (which is being parsed).

But you may ask "OK, I kind-of get how it works, but was all that
complexity worth it?"

Well, first of all, it's not as complex as it sounds.  It's different,
sure.  Which means it takes some getting used to.  But with a little
experience it becomes surprisingly easy to understand.  The matchers,
for example, are "just" state machines, where the State instances
describe the state, and the execute(), success() and failure() methods
drive transitions from one state to another.

More than "not as bad as it looks" - it's actually surprisingly
elegant.  In a traditional interpreter the main loop is a case
statement that checks the type of what is being executed and then
calls the appropriate function.  Here, in a sense, the function is
always the same - "execute()" - and the right choice of which
particular execute is made for us, by Julia's type system.  Things get
even better when many matchers have similar functionality because they
can, by sharing a common super-type, share a single implementation.
In the case of the ParserCombinator library, for example, most
matchers that call a child matcher are derived from a Delegate
super-type, which provides common support.

And, second of all, it's easy - almost trivial - to hack things at a
"meta" level.  Take the example of caching results.  All you need to
do is add a Map (the cache) to a new sub-type of Config and provide
new dispatch() functions as above: the dispatch for Success should
populate the Map with results; the dispatch for Execute should check
the Map in case a value already exists.  That's it!


Multiple Dispatch v OOP

If the same ideas were implemented in a traditional OOP language, much
of what I have described above would carry across nicely.  There would
be a Trampoline base class that would be subclassed for different
approaches to execution, for example.

But Julia's multiple dispatch wins out when you want behavior to "cut
across" more than one type.

In the ParserCombinator library, one of the approaches to execution
emulates the Parsec library from Haskell.  This doesn't allow
backtracking if the source has been successfully matched.  But one,
"magic", matcher, Try() changes this.  So this one particular matcher
has to "know" about the trampoline in more detail than normal, so that
it can enable backtracking.

How do you do that in a traditional OO language?  It's not so clear,
because the matchers are, presumably, classes that are quite separate
from the trampoline (in Java, for example, you start having to worry
about "package private" access)

But in Julia a method can dispatch on multiple types.  So there's
nothing to stop you having an execute() method that is for one
specific trampoline type and one specific matcher type.  And which is
only called when both those types are used together.

(It may confuse more than it helps, but a similar idea is used here -
http://acooke.org/cute/FizzBuzzin0.html)


Conclusions

I've tried to show how you can implement a parser combinator library
with more than one way for the parser to "execute", by providing
pluggable trampolines.

Much of the approach can be done in any OO language (and the general
idea comes from an earlier attempt in Python, called Lepl).  But
Julia's methods make things particularly easy.

If you want to try ParserCombinator for yourself, please check it out
at https://github.com/andrewcooke/ParserCombinator.jl

Thanks!