forked from andrewcooke/ParserCombinator.jl
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdesign.txt
327 lines (247 loc) · 12.6 KB
/
design.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
Introduction
I am going to describe some of the decisions behind the
ParserCombinator library, which, I hope, will show how useful Julia's
method-centred, multiple dispatch can be.
OK, so I wanted to write a parser combinator library in Julia.
If you don't know what a parser combinator library is, here's a quick
sketch:
There's a pretty neat, easy to understand, way of writing parsers by
hand, which is common in functional languages, where you construct a
"tree" (or even a DAG) of functions that take a stream of characters
as input and return some structured output as a result (along with
the remaining text).
The stream of characters is (of course) what you want to parse, and
the structured output becomes your AST.
Here's a simple (untested!) example (in Julia):
function equals(text)
return function _(string)
n = length(text)
if string[1:n] == text
return string[n:end], text
else
throw(Backtrack())
end
end
end
By itself that doesn't seem very useful, but consider
function follows(m1, m2)
return function _(string)
string1, result1 = m1(string)
string2, result2 = m2(string1)
return string2, [result1, result2]
end
end
which can be used like:
grammar = follows(equals("hello"), equals("world"))
grammar("helloworld")
and returns ["hello", "world"].
And, of course, you can go crazy, with functions that test
alternatives, or repeat a given number of times, or as often as
possible, etc etc.
So that's the kind of thing I wanted to write. But when you start
digging into the details, it's not quite so simple.
Complications
First, Julia isn't intended to be used as a functional language. It's
not designed to support deep, recursive calls of functions (in fact,
the stack limit in a simple test is a little greater than 100,000 on
my machine, so providing you write combinators for multiple matches
that work with iteration rather than recursion, that's probably not
such a serious issue).
Second, there are different variations on the general idea of "parser
combinators", which work in slightly different ways. For example, you
might want to cache calls to matchers so that when the "same" call is
made a second time, the cached value is used. Or you might want to
write the parser so that all possible parses (of an ambiguous grammar)
are returned. Or you might want to restrict backtracking so that
error messages are more reliable. Or you might want to display a
record of what is happening to help with debugging. Or...
These variations, on a little inspection, tend to be connected more
with how the parser "executes", rather than with how the individual
matchers are implemented.
What do I mean by that? Take, for example, the idea that you cache
results to avoid repeated evaluation. That is trivial in a "lazy"
language. But in an eager language (like Julia) you need to intercept
each call to a function, so you can check whether it was cached
earlier. These are, in a sense, "meta" issues, that have little to do
with checking whether one string is equal to another.
So the question is: can we write our library in a way that makes it
easy to change how the parser "executes"? And the answer is,
thankfully, yes (or I wouldn't be writing this article)!
How do we do this? We need two things. First, we need to use a
"trampoline" to take control of execution. Second, we need some way
of allowing different behaviors to be "plugged in". In Julia, this is
typically done by multiple dispatch with methods. And it works really
well.
Trampolines
All a trampoline is, really, is a "manual" replacement for what a
compiler does automatically: it's a loop, with a stack, that calls
each function in turn, checks what the result is, and then calls
another function, as appropriate.
In practice, what that means is that when we write a function, and we
need to call some other function, we don't just "make the call".
Instead, we return a "message" to the trampoline that says "please
call this other function and then give me the result".
In addition, of course, we need the main loop, which receives these
messages and does the work.
That may sound complicated, but in fact it's not so much work.
Particularly when you're only evaluating matchers in a parser, which
all work in a similar way.
Here's most of the code needed for the main loop, taken from
parsers.jl:
type NoCache<:Config
source::Any
stack::Array{Tuple{Matcher, State},1}
end
function dispatch(k::NoCache, e::Execute)
push!(k.stack, (e.parent, e.parent_state))
execute(k, e.child, e.child_state, e.iter)
end
function dispatch(k::NoCache, s::Success)
(parent, parent_state) = pop!(k.stack)
success(k, parent, parent_state, s.child_state, s.iter, s.result)
end
function dispatch(k::NoCache, f::Failure)
(parent, parent_state) = pop!(k.stack)
failure(k, parent, parent_state)
end
Those functions are simply responding to the different message types
(Execute, Success and Failure) by manipulating the stack and calling
the appropriate matchers. Everything is driven by this main loop:
while true
msg = dispatch(k, msg)
end
(plus some extra details for handling the setup and teardown).
Now if you're trying to understand the above in detail, you're
probably wondering what execute(), success() and failure() are. These
are the methods where the matcher itself is implemented.
For example, an "equals" matcher looks like this:
type Equal<:Matcher
string
end
function execute(k::Config, m::Equal, s::Clean, i)
for x in m.string
if done(k.source, i)
return FAILURE
end
y, i = next(k.source, i)
if x != y
return FAILURE
end
end
Success(DIRTY, i, Any[m.string])
end
where FAILURE is the singleton Failure() message and most of the work
is just checking for equality, character by character.
A more complex matcher, which calls out to child matchers, has work
spread across several methods. Its success() method is called when a
child matcher successfully matches, for example.
You may still be a little confused, however, because Equal, above, is
a type, not a function. That's because it makes more sense in Julia
to work this way.
Let me explain in more detail...
Grammar As Types
Originally, when I sketched how parser combinators worked, I described
them as functions that returned functions that parsed data (see
"equals()" and "follows()" above).
But in Julia, when you use a trampoline, it makes much more sense to
use types. So constructing a grammar is not done be calling
functions, but by calling type constructors (like "Equal()" above).
That's because Julia is based around methods, which are dispatched on
multiple types. By using types to describe our grammar we can then
use method dispatch to implement the work.
And that's what you can see above. The execute() method includes
"m::Equal", which means that specific execute() is called ONLY when m
is an instance of Equal. So that execute() is equivalent, roughly, to
the anonymous function in the original sketch.
It's similar to OOP, where the method "belongs" to the type.
To make it completely clear, let's work through things in more detail.
We have to start somewhere, and I am trying to avoid the details of
how we bootstrap or return a final result, since the important idea is
"in the middle". So let's just assume that some other matcher (an
instance of a type) is similar to the "follows()" function above, in
that it contains two Equals children. And we'll start with the
execute() method associated with that matcher returning an Execute
message to the trampoline, which contains the first Equal child
(containing the string "hello").
So the parent matcher is saying to the trampoline "call this Equal for
me".
The main loop of the trampoline receives the Execute message, saves
the parent matcher on the stack, and then calls the dispatch() method
for Execute. That results in calling the execute(... m::Equal, ...)
method, which checks the input against "hello" in the child.
Assuming that matches, the child returns a Success message, which the
trampoline receives. The trampoline pops the original matcher from
the stack and calls its success() method. That, in turn, then returns
another Execute message, to match "world", etc etc.
Obviously these matchers need to save data as they work. The parent
matcher discussed above, for example, needs to save "hello" while it
calls the second Equals for "world". This is done in the State types
that appear in the dispatch code.
A very nice side effect of having this explicit state is that it makes
adding memoization very easy, because we know exactly when the context
is identical (when the state is identical) and so can decide when to
use the cache.
Was That Worth It?!
OK, so at this point you're probably starting to see how things work
(I hope). It may help to think of the grammar (the DAG of types) as a
"program" that the trampoline is executing - it's very like an
interpreter, where the grammar is the "program" that the trampoline
"executes" given some input (which is being parsed).
But you may ask "OK, I kind-of get how it works, but was all that
complexity worth it?"
Well, first of all, it's not as complex as it sounds. It's different,
sure. Which means it takes some getting used to. But with a little
experience it becomes surprisingly easy to understand. The matchers,
for example, are "just" state machines, where the State instances
describe the state, and the execute(), success() and failure() methods
drive transitions from one state to another.
More than "not as bad as it looks" - it's actually surprisingly
elegant. In a traditional interpreter the main loop is a case
statement that checks the type of what is being executed and then
calls the appropriate function. Here, in a sense, the function is
always the same - "execute()" - and the right choice of which
particular execute is made for us, by Julia's type system. Things get
even better when many matchers have similar functionality because they
can, by sharing a common super-type, share a single implementation.
In the case of the ParserCombinator library, for example, most
matchers that call a child matcher are derived from a Delegate
super-type, which provides common support.
And, second of all, it's easy - almost trivial - to hack things at a
"meta" level. Take the example of caching results. All you need to
do is add a Map (the cache) to a new sub-type of Config and provide
new dispatch() functions as above: the dispatch for Success should
populate the Map with results; the dispatch for Execute should check
the Map in case a value already exists. That's it!
Multiple Dispatch v OOP
If the same ideas were implemented in a traditional OOP language, much
of what I have described above would carry across nicely. There would
be a Trampoline base class that would be subclassed for different
approaches to execution, for example.
But Julia's multiple dispatch wins out when you want behavior to "cut
across" more than one type.
In the ParserCombinator library, one of the approaches to execution
emulates the Parsec library from Haskell. This doesn't allow
backtracking if the source has been successfully matched. But one,
"magic", matcher, Try() changes this. So this one particular matcher
has to "know" about the trampoline in more detail than normal, so that
it can enable backtracking.
How do you do that in a traditional OO language? It's not so clear,
because the matchers are, presumably, classes that are quite separate
from the trampoline (in Java, for example, you start having to worry
about "package private" access)
But in Julia a method can dispatch on multiple types. So there's
nothing to stop you having an execute() method that is for one
specific trampoline type and one specific matcher type. And which is
only called when both those types are used together.
(It may confuse more than it helps, but a similar idea is used here -
http://acooke.org/cute/FizzBuzzin0.html)
Conclusions
I've tried to show how you can implement a parser combinator library
with more than one way for the parser to "execute", by providing
pluggable trampolines.
Much of the approach can be done in any OO language (and the general
idea comes from an earlier attempt in Python, called Lepl). But
Julia's methods make things particularly easy.
If you want to try ParserCombinator for yourself, please check it out
at https://github.com/andrewcooke/ParserCombinator.jl
Thanks!