Pearley #103
Pearley is really cool! When you think it's "done", I would happily put a link to it in this repository's README file. Imagine being able to write a parser in nearley and get a Python parser for free! (PlyPlus seems like a pretty famous parsing module; you must share some of your "module advertising" tips with me sometime. :-) )

Okay, so the algorithm, which I've written about here, works by maintaining a table where every column represents a token of input. Each row of the table represents a partial parse (called an Earley state) and is basically the representation of a production rule's expansion, with a dot representing where you are in that production rule. (This is the output of nearley-test, by the way.) An example of an Earley state is

```
statement -> expression • ";" newline
```

At each iteration, nearley looks at each of these states in the current column of the table, and looks for states that are "expecting" the token you fed it. The state above expects a semicolon, so if you fed the algorithm a semicolon, it would create a fresh state in the next column of the table that looks like

```
statement -> expression ";" • newline
```

Then, it goes ahead and adds new states if the new state expects a nonterminal. Here, for example, it'll add the states

```
newline -> • "\r\n"
newline -> • "\n"
```

This is what is happening on line 67.

Whenever you create an Earley state where the dot is at the end (i.e. not expecting anything), you go back and find where that state was created, and push it along. So if the next input to the parser was "\n", then you would add the state

```
newline -> "\n" •
```

Since this state is complete, you would bump along its creator, getting the state

```
statement -> expression ";" newline •
```

and then you'd have to go and find where *that* state was created, and so on.

It turns out that when you have empty rules, this process goes into an infinite loop, because an empty rule is completed as soon as it is created. This is a shame, because empty rules are actually useful. For example:

```
maybe_whitespace -> empty | whitespace
```

So, give or take, that's why there's that whole epsilon_closure trickery. Hope that vaguely explains what's going on there. As a corollary, you should be able to figure out why ambiguity needs to be allowed by this algorithm.

Now, for debugging. I would suggest that you first create special "instrumented" postprocessors for each production rule, such that the processed data knows which production rule was used to make it. So, take

```
expr -> expr "+" expr | expr "-" expr
```

and turn it into

```
expr -> expr "+" expr {% function(d) { return ["production1", d]; } %}
      | ...
```

Then you parse the input that causes ambiguity, and get a data structure for each possible ambiguous parsing: `1+1+1` would yield

```
["production1", [
  ["production2", 1],
  ["production1", [["production2", 1], ["production2", 1]]]
]]
```

Then, do some sort of "tree diff" algorithm to figure out which of the productions is getting parsed in two different ways, and report it to the user. In this case, it would show that a production1 was being parsed in two different ways.

I would recommend writing this as a plugin for nearley before porting it to Python, so that I can help you out if you get stuck. Hope that helps! (Sorry if this is unclear, or if there are bugs in my writing. I typed this in a hurry because I need to go to a meeting right now!)
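To make the scan/predict/complete cycle above concrete, here is a minimal sketch in JavaScript. It is not nearley's actual code: the state shape, the dedup key, and the grammar representation are all invented for illustration, and plain deduplication stands in for nearley's epsilon_closure handling of nullable rules.

```js
// A state is {rule, dot, origin}; `grammar` maps a nonterminal name to its
// list of rules, each rule being {name, symbols}. A column is {items, seen}.
function addState(column, state) {
  // Deduplicate states. Without this, a nullable (empty) rule completes in
  // the same column it was predicted in, and the loop below never
  // terminates; this is the problem the epsilon_closure trickery addresses.
  const key = [state.rule.name, state.rule.symbols.join(" "),
               state.dot, state.origin].join("|");
  if (column.seen.has(key)) return;
  column.seen.add(key);
  column.items.push(state);
}

// Process one column of the table against the next input token, growing the
// column to a fixpoint and seeding the next column with scanned states.
function earleyColumn(grammar, table, index, token) {
  const column = table[index];
  const next = { items: [], seen: new Set() };
  for (let i = 0; i < column.items.length; i++) {
    const state = column.items[i];
    const expected = state.rule.symbols[state.dot];
    if (expected === undefined) {
      // Complete: the dot is at the end, so bump along every state in the
      // origin column that was expecting this rule's nonterminal.
      for (const parent of table[state.origin].items) {
        if (parent.rule.symbols[parent.dot] === state.rule.name) {
          addState(column, { rule: parent.rule, dot: parent.dot + 1,
                             origin: parent.origin });
        }
      }
    } else if (grammar[expected]) {
      // Predict: the expected symbol is a nonterminal, so add each of its
      // rules to this column with the dot at the start.
      for (const rule of grammar[expected]) {
        addState(column, { rule, dot: 0, origin: index });
      }
    } else if (expected === token) {
      // Scan: the state expects exactly this token, so a fresh state with
      // the dot pushed past it goes into the next column.
      addState(next, { rule: state.rule, dot: state.dot + 1,
                       origin: state.origin });
    }
  }
  table.push(next);
}
```

Driving it would mean seeding `table[0]` with the start rule at dot 0 and calling `earleyColumn` once per input token.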
Hi, it's been a while, but I have some updates!

First, thank you for the explanations. I wrote a new library called Lark. It has your Earley implementation as the default, but also lets you switch to LALR(1) parsing with the same grammar. I'm not using PLY anymore; I wrote my own LALR(1) implementation. I can write one for nearley if you like.

It seems that in the meantime nearley has become very famous and has far surpassed PlyPlus (my old library) in popularity. Maybe I should be the one asking for tips.

It would be cool to have a converter from nearley grammars to Lark, and vice versa. I might just write one. It shouldn't be difficult to make an accurate translation, except for user code, of course.
Hey! Lark looks super-cool; I really like the idea of having multiple engines "under the hood" to give users flexibility.

A nearley <=> lark converter would be amazing: it would allow people to convert easily between JavaScript and Python parsers! Even if postprocessors get stripped, it would be incredibly useful to a lot of people. Please do let me know if you embark on this project. :D

(P.S. The reason nearley suddenly became kind of popular is that someone recently wrote a Medium post that became popular on social media.)
I'm definitely putting it on my TODO list!

To that end, I have a small question about your method of tokenization. You define one regular expression for a token, but you also "import" the following one, and these two regular expressions collide; if one is chosen over the other, it could cause a parse error.
nearley doesn't tokenize for you. It either parses a stream of characters "raw" (as in the csscolor example above), or it expects input to be already tokenized. Does that answer your question?
So in the case of a multi-character string literal, nearley treats it as a sequence of single characters?
Yes, it's equivalent to that. In reality, it's slightly subtler: what happens is that nearley creates a nonterminal for the string, and that nonterminal expands to the literal's characters, one terminal per character.

The code that does this, by the way, is here: https://github.com/Hardmath123/nearley/blob/master/lib/nearley.js#L45
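For illustration, the expansion looks roughly like this (the generated nonterminal's name here is made up; the real one is an internal detail):

```
# What you write:
keyword -> "begin"

# Roughly how it is treated: a generated nonterminal whose only
# production matches the literal one character at a time.
keyword      -> begin$string
begin$string -> "b" "e" "g" "i" "n"
```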
Nearley->Lark is done! I might do a Lark->Nearley at some point. The main challenge is splitting up those regular expressions into 1-char chunks lol
Interesting that you wrote it in Python… I would totally have written it in JS, by hacking on top of the existing compiler. But… very cool! I'm going to put a link to this in the README. :D
It should be pretty easy to write a JavaScript version. Btw, I chose to move the tools into the lark directory, so you can install them with pip. So, could you fix the link to point to https://github.com/erezsh/lark/blob/master/lark/tools/nearley.py
Oh, I see the link is simply to Lark. Never mind then :)
Oh, wow, the Js2Py idea is a very clever way to handle postprocessors. Sadly, I don't know of any JS-to-Python converters written in JavaScript. Maybe it can be a two-stage process, where you first convert the grammar and then convert the postprocessors separately? Good luck!
Well, it took longer than I expected, but it's working! It's still experimental, and the only missing feature that I know of is macros.
Oh, goodness, that's fantastic! Thank you, Erez. Yes, nearley macros are certainly a bit of a hack. I'm sorry about that! :-)

I suppose one solution could have been to operate on compiled .ne.js files (and hence have nearleyc handle macro expansion, importing, and EBNF modifiers), but of course, depending on your architecture, at this point it might be too much trouble for you to make the switch. In any case, it shouldn't be too hard to figure out the macros, and I would be happy to help if you run into specific issues.
I'm glad you think so! :) Regarding the macros, we now have two ideas.
I would prefer option 1, if there were an elegant way to do it.
Hmm, yes, I forgot that the output of nearleyc is JS, not JSON. Perhaps the easiest way would be a quick 5-line nodejs frontend to nearley->lark that converts the grammar into a Python-readable format? (One of the unstated goals of nearley at some point was to have some sort of "universal", or at least portable, format for expressing grammars… guess that didn't work out! :P)

As for lexers for lark->nearley, you should talk to @tjvr about that! We're working on (i.e. done with, but kind of too scared to merge) lexer support. :-)
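Regarding that nodejs frontend idea, it could plausibly look something like this (a sketch, not an existing tool; `grammar.ne.js` is an assumed filename for a nearleyc-compiled grammar):

```js
// Hypothetical nearley->JSON frontend: load a nearleyc-compiled grammar and
// dump its rules in a form a Python tool can read. Postprocessors are
// functions, so we keep their source text. Note that charset tokens are
// RegExp objects, which JSON.stringify turns into {}, so they would need
// extra handling.
const grammar = require("./grammar.ne.js");

const rules = grammar.ParserRules.map((rule) => ({
  name: rule.name,
  symbols: rule.symbols,
  postprocess: rule.postprocess ? rule.postprocess.toString() : null,
}));

console.log(JSON.stringify({ start: grammar.ParserStart, rules }, null, 2));
```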
@tjvr How's the lexer coming along? |
@erezsh Nearly there! Have a look at #220 and https://github.com/tjvr/moo :-) |
@tjvr Looks nice. You can find line-breakers automatically, simply by looking for newline characters or for the dot-matches-newline flag. You might get false positives, but they will be rare, and will only slightly affect performance anyway.

Another thing you should keep in mind, if you haven't already, is that the ordering of regexps, by itself, is not enough to solve all the common requirements of a lexer. For example, try to make your lexer match "begin" as a keyword but "beginner" as a variable name (\w+). The easy way out is to use callbacks and let the user deal with it. But in Lark I used another method that solves it automatically for the common case. Let me know if you're interested and I'll explain it.
@erezsh Yeah, moo already handles keywords; see https://github.com/tjvr/moo#keywords :-)
@tjvr Great, this is exactly my solution too :) But I automatically identify keywords, because they are always just strings without any special characters.
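For readers following along, the technique being described looks roughly like this (a hand-rolled sketch, not moo's or Lark's internals; the token names and keyword set are made up):

```js
// Keyword handling in a lexer: match the longest identifier first, then
// reclassify it as a keyword if it is in a fixed set. This way "begin"
// lexes as a keyword but "beginner" stays a plain name.
const KEYWORDS = new Set(["begin", "end", "if", "while"]);
const IDENTIFIER = /[A-Za-z_]\w*/y; // sticky: matches exactly at `offset`

function nextName(input, offset) {
  IDENTIFIER.lastIndex = offset;
  const m = IDENTIFIER.exec(input);
  if (!m) return null;
  const text = m[0];
  return { type: KEYWORDS.has(text) ? "keyword" : "name", text };
}

nextName("beginner = 1", 0); // { type: "name", text: "beginner" }
nextName("begin x end", 0);  // { type: "keyword", text: "begin" }
```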
Hello guys, thanks for your libraries! Did the Lark->Nearley converter ever get written?
@GoMino Sorry, I didn't try to write it. What were you hoping to use it for? FWIW, there's already a lot of infrastructure in place to make such a tool easy to write (e.g. Lark's internal state already serializes into a convenient JSON representation).
Hi, I opened this separate issue so we don't spam everyone :)
I'm glad you liked Pearley. It really is a direct port, and the code is almost identical. I'll try to keep it aligned when new features are added to nearley. There is one minor difference: I'm using an external library for lexing. But that's not significant.
I would be happy if you could explain line 67, but really I don't have a strong grasp on the algorithm anyway, so you may have to add a lot of context :)
I would like to learn more about it, of course. I'm especially interested in making it faster, perhaps by pre-calculating some moves? I'm not sure.
As for ambiguity, I understand it can't be detected in the grammar. I still think it's possible to write code that will catch the common cases. Maybe I will try it some day.
Another solution, which would be relatively satisfying, would be to catch the ambiguity as it happens, instead of at the end of the parse. Perhaps that's impossible for some reason; I'm not sure (see: my lack of understanding of the algorithm).
As for automating the debugging process - that's a nice idea, which I'm afraid to attempt :)
Sorry for the wall of text!
Let me know your thoughts.