Make lark.lark parse the same grammar as load_grammar.py, and make grammar.md document it more fully. #1388
Here are some suggestions and questions I have regarding your PR. I tried to get everything down but I might think of more after you address them.
tests/test_grammar_formal.py
# '%ignore expects a value', '%ignore %import\n'
self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..__ANON_2., .%import..', l.parse, '%ignore %import\n')

# def test_empty_literal(self):
Why all the commented-out parts?
I wanted it to be clear which test_grammar tests weren't implementable for lark.lark. Sort of a "show your work" thing. But I can remove them.
Fixed in 9493f81.
I appreciate your aim for completeness, but I think it was more confusing than anything.
If you think it's important information, perhaps just write a list of the names, without Python syntax, like:
# List of non-testable constructs: empty literal, kraken, foobar, etc.
lark/grammars/lark.lark
name: RULE
    | TOKEN

_VBAR: _NL? "|"
OP: /[+*]|[?](?![a-z])/
RULE: /!?[_?]?[a-z][_a-z0-9]*/
RULE: /_?[a-z][_a-z0-9]*/
RULE_MODIFIERS: /!|![?](?=[a-z])|[?]!?(?=[a-z])/
Why do you need the ?=s here? I don't recall that ? or ! are valid anywhere else in the grammar?
Yeah, I originally had a comment explaining that, but it would have been the only part of the entire grammar that had commentary, so I removed it. The ? rule modifier is not allowed on a rule whose name begins with _ (https://github.com/lark-parser/lark/blob/master/lark/load_grammar.py#L1057-L1059), and that bit of regex blocks that usage. So maybe I should put the comment back in.
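For anyone curious how those lookaheads behave, here is a small stdlib-only sketch (mine, not part of the PR) exercising the RULE_MODIFIERS pattern directly:

```python
import re

# The RULE_MODIFIERS terminal from the modified lark.lark, written as a
# plain Python regex. The (?=[a-z]) lookaheads require a lowercase letter
# (not an underscore) to follow any modifier sequence containing '?',
# which is how the grammar mirrors load_grammar.py's restriction that '?'
# cannot be applied to a rule whose name begins with '_'.
RULE_MODIFIERS = re.compile(r'!|!\?(?=[a-z])|\?!?(?=[a-z])')

print(RULE_MODIFIERS.match('?rule2'))   # matches '?': allowed on a normal rule
print(RULE_MODIFIERS.match('?_rule2'))  # None: '?' is blocked before '_'
print(RULE_MODIFIERS.match('!_rule2'))  # matches '!': '!' alone has no lookahead
```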
Oh I see. But wouldn't users be able to bypass it by adding whitespace between them?
Drat! I missed a test case. But no, this test fails similarly in load_grammar.py, the original lark.lark, and my modified version:
rule1: rule2
?! rule2: "a"
"""
load_grammar.py:
Traceback (most recent call last):
File "C:\Ross\Source\lark\lark\parsers\lalr_parser_state.py", line 77, in feed_token
action, arg = states[state][token.type]
KeyError: 'OP'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Ross\Source\lark\tests\test_lark_lark.py", line 169, in test_11_rule_modifier_query_bang_space_lg
Lark(g)
...
File "C:\Ross\Source\lark\lark\lexer.py", line 674, in lex
raise UnexpectedToken(token, e.allowed, state=parser_state, token_history=[last_token], terminals_by_name=self.root_lexer.terminals_by_name)
lark.exceptions.UnexpectedToken: Unexpected token Token('OP', '?') at line 3, column 13.
Expected one of:
* "%declare"
* "%import"
* "%extend"
* RULE
* "%override"
* _NL
* $END
* "%ignore"
* TOKEN
* RULE_MODIFIERS
Previous tokens: [Token('_NL', '\n ')]
Original lark.lark:
Traceback (most recent call last):
File "C:\Ross\Source\lark\lark\lexer.py", line 665, in lex
yield lexer.next_token(lexer_state, parser_state)
File "C:\Ross\Source\lark\lark\lexer.py", line 598, in next_token
raise UnexpectedCharacters(lex_state.text, line_ctr.char_pos, line_ctr.line, line_ctr.column,
lark.exceptions.UnexpectedCharacters: No terminal matches '?' in the current parser context, at line 3 col 13
?! rule2: "a"
^
...
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Ross\Source\lark\tests\test_lark_lark.py", line 176, in test_11_rule_modifier_query_bang_space_ll
self.lark_parser.parse(g)
...
File "C:\Ross\Source\lark\lark\lexer.py", line 674, in lex
raise UnexpectedToken(token, e.allowed, state=parser_state, token_history=[last_token], terminals_by_name=self.root_lexer.terminals_by_name)
lark.exceptions.UnexpectedToken: Unexpected token Token('OP', '?') at line 3, column 13.
Expected one of:
* _NL
* "%import"
* "%ignore"
* "%override"
* TOKEN
* $END
* "%declare"
* RULE
Previous tokens: [Token('_NL', '\n ')]
Modified lark.lark:
Traceback (most recent call last):
File "C:\Ross\Source\lark\lark\lexer.py", line 665, in lex
yield lexer.next_token(lexer_state, parser_state)
File "C:\Ross\Source\lark\lark\lexer.py", line 598, in next_token
raise UnexpectedCharacters(lex_state.text, line_ctr.char_pos, line_ctr.line, line_ctr.column,
lark.exceptions.UnexpectedCharacters: No terminal matches '?' in the current parser context, at line 3 col 13
?! rule2: "a"
^
...
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Ross\Source\lark\tests\test_lark_lark.py", line 176, in test_11_rule_modifier_query_bang_space_ll
self.lark_parser.parse(g)
...
File "C:\Ross\Source\lark\lark\lexer.py", line 674, in lex
raise UnexpectedToken(token, e.allowed, state=parser_state, token_history=[last_token], terminals_by_name=self.root_lexer.terminals_by_name)
lark.exceptions.UnexpectedToken: Unexpected token Token('OP', '?') at line 3, column 13.
Expected one of:
* "%declare"
* RULE
* RULE_MODIFIERS
* $END
* "%ignore"
* "%extend"
* "%import"
* TOKEN
* _NL
* "%override"
Previous tokens: [Token('_NL', '\n ')]
lark/grammars/lark.lark
?atom: "(" expansions ")"
     | "[" expansions "]" -> maybe
     | value
?token_atom: "(" token_expansions ")"
token_atom and rule_atom are the same, why not merge them using templates? Same with rule_expr and atom_expr, rule_expansion etc.
The problem is that the original lark.lark can derive atom -> value -> template_usage, but templates aren't allowed for terminals (pending resolution of Issue #555). I didn't see any way, with templates, to prevent that derivation. As far as I can tell, templates are effectively a text-substitution macro capability (although I understand the implementation is a little more sophisticated).
I suppose I could have done something like this:
_expansion{expr_type}: expr_type*
_expr{atom_type}: atom_type [OP | "~" NUMBER [".." NUMBER]]
_atom{expansion_type, value_type}: "(" expansion_type ")"
| "[" expansion_type "]" -> maybe
| value_type
?rule_expansions: rule_alias (_VBAR rule_alias)*
?rule_inner_expansions: rule_expansion (_VBAR rule_expansion)*
?rule_alias: rule_expansion ["->" RULE]
?rule_expansion: _expansion{rule_expr}
?rule_expr: _expr{rule_atom}
?rule_atom: _atom{rule_inner_expansions, rule_value}
?rule_value: RULE "{" rule_value ("," rule_value)* "}" -> rule_template_usage
| RULE
| value
?token_expansions: token_expansion (_VBAR token_expansion)*
?token_expansion: _expansion{token_expr}
?token_expr: _expr{token_atom}
?token_atom: _atom{token_expansions, token_value}
?value: STRING ".." STRING -> literal_range
| TOKEN
| (REGEXP | STRING) -> literal
Is that what you're suggesting? It doesn't read as cleanly to my eyes, but I'll grant that I've been studying Lark pretty hard lately.
Before we dig into that, let's come to an agreement in the main conversation (regarding restrictive-validating grammar vs relaxed grammar + validating visitor).
Afterwards if it's still relevant, I will try to craft a solution to this problem.
My general opinion is that …
I see your point, but I think for anyone who isn't writing code in the Lark codebase, …
Mine as well. I was actually quite surprised at how different they were. It was your comments in Issue #1355 that pointed me down this path.
I guess I see it more as specifying the same grammar in two forms, not two grammars. But you're right about semantics - there are some things (e.g., my comments about …)
I missed two lines that I should have deleted in the test files, which I used for before/after testing (…)
…rk_lark.py (trailing whitespace in test_06_*()).
2. Remove what was left of "Literals can be one of: ..." under "Terminals" in grammar.md.
3. Address @erezsh's point about inlined terminals under "Terminals" in grammar.md.
4. Remove "lark.lark-ORIG" references in test_lark_lark.py and test_grammar_formal.py that shouldn't have been pushed.
5. Address @erezsh's point about f.readlines() in test_grammar_formal.py.
6. Address @erezsh's point about commented-out tests in test_grammar_formal.py.
Fixed in 9493f81.
@MegaIng @RossPatterson I can see both points of view, and I think the determining factor should be practicality above all else. In an ideal world, load_grammar would only preload a minimal parser, that would parse lark.lark and use it to create the Lark parser that supports the full syntax. That would be the cleanest approach imho. But it would also be slower, probably in a very noticeable way. (did we ever do this experiment?) As for the BNF in load_grammar, it is non-restrictive on purpose. By moving the validation from the parser to the post-parsing code, we can perform more sophisticated validation, and provide custom error messages that are more helpful and friendly. As for what to do with lark.lark, in my mind it's still an open question, and depends heavily on what its main uses are. Is it to parse and manipulate lark grammars programmatically? Is it to validate correct grammars? Is it to be used in a lark editor? etc. I think the options that we are facing are:
I'm leaning towards (2), because I can't really think of good use-cases for 3, other than it being a "reference grammar" for human eyes. But I don't know if that's a good enough reason, especially since that can be accomplished using better docs, or with comments within lark.lark. What do you guys think? (needless to say, regardless of what we choose, I'm all for improving the documentation regarding the grammar)
That's part of what drove me down this path. I've got a use case where I'm playing with @MegaIng's
That's another part of my motivation. As noted in Issue #1355, I attempted to use aliases on rule-alternates inside a rule-group, and received a weird result (from …)
I think having an inaccurate reference grammar for human purposes is almost worse than not having one at all. I think having a partial grammar for parsing purposes is somewhat useful, but will be surprising to anyone who tries to use it reliably. And I think better documentation might obviate the need for a perfect reference grammar, except for cases like …
I'm not going to be offended by however the decision turns out. If it's something I can help with, fine, if not, also fine. This isn't my project, I'm just a (so far very happy) user who has the skills to contribute a bit.
@RossPatterson Can you explain which option this refers to? I don't think I advocated at any point for an inaccurate grammar, only a non-validating one.
The load_grammar error messages are definitely better than the generic lark error messages. If I overlooked or missed something, it's not proof that the approach is bad. Do you have any objections to option 2 that I suggested? Do you think it doesn't serve your efforts as well as option 3? Do you think it's more confusing? Frankly, I don't accept or reject PRs based on who will be offended. I'm just trying to make Lark as good as it can be. (with the limited time that I can spare for it)
Sorry, I misunderstood.
No, in fact, like I said, it's great. The …
No, I have no objections to option 2. My efforts will be served as long as …
I understand completely. I was trying to say that I'm OK with whatever you decide to do.
@MegaIng Any objections to option 2? i.e. keep lark.lark non-validating and pair it with a visitor that performs the same extra validation as load_grammar?
Currently no, I think that is the ideal solution. Preferably the same visitor is used both internally and externally and is the only source of GrammarErrors in load_grammar, to make sure that everything is in sync.
That seems a bit optimistic, but I agree it would be nice if we could get there. @RossPatterson Would you be okay adapting your PR to solution 2? That means getting lark.lark as close as possible to load_grammar, and adding a validating visitor for the rest (or somehow using the one in load_grammar). If you'd like, we can split this PR in two, one for the docs, which I think we all agree can be merged already as is (or with tiny changes), and a separate one for lark.lark. But it's up to you.
To note: lark.lark should probably still use EBNF and not restrict itself to BNF like the hardcoded grammar does, and IMO it's ok to change the grammar in load_grammar a bit if it makes stuff easier to sync up (if it doesn't have a performance hit and of course doesn't break anything)
Yup, I can do that. I've already had to do the analysis to find the differences, and to fix them. So it looks like I'll just be backing out some of the fixes, and creating their equivalent in a visitor.
I'm fine with keeping them together.
I'm personally biased in favor of BNF, because it's so well defined, whereas "EBNF" means any of a variety of possibilities. But I agree, …
I don't anticipate that being necessary. But I've been wrong before :-)
I sat down tonight to start implementing option 2, and quickly found myself categorizing most of my changes (at the top of this PR) as "making …"
I somehow doubt this is what either of you expect. I think the changes that correct the … So, before I go off and start messing things up, can you tell me if I'm reading things right or wrong, and if wrong, how so?
@RossPatterson I think you misunderstood what we (at least I) meant with option 2. At the top of …
…a visitor that enforces some of its restrictions.
I reverted all my changes to …
I built the validating visitor, …
There wasn't any existing doc on using the … I also moved the doc for templates in … Some notes:
Very nice, this is about what I imagined :-) Thank you for going through this effort.
I think that's fine. I don't think there are too many people out there relying on the details of that grammar, and I think we changed it before without warning.
It might make sense to write an …
We should have a doc page with a reference list for our "stdlib" grammars, but that can be a different PR. An example using the validator might be nice.
With regard to tests, also see that we are already testing lark.lark:
Line 1047 in 7646fb3
# Check that the `lark.lark` grammar represents can parse every example used in these tests.
It might make sense to extend the tests there to include LarkValidatorVisitor and make sure that both also raise the same error messages, although it might be hard to check that we don't expect grammar analysis errors. I don't think earley should be raising any? But I might be wrong with that.
OK, I removed inlining from … Fixed by 2ec5ef3.
The …
If someone wants to write that page, I'll certainly write an example for the validator.
@RossPatterson Nice work. Let me know once you've cleaned it up a bit, and I'll give it another review. I'll see if I find the time to write that doc page. I imagine it should be pretty short and straightforward.
… they can be inlined.
Fixing the validator was pretty easy. Fixed in e9c026e.
…with a nice message.
I'm working on blending my …
While doing that, I found that neither the original … Fixed in 7f02bd1.
I'll be offline for the next few weeks, but I'll wrap this up after I return in mid-March.
While testing and verifying my changes, I realized there are a few more differences between …
…ers and priority, like load_grammar does
Fixed by 4f7a5eb.
@erezsh @MegaIng
That's a bug that was introduced in #1018 that we both missed, specifically see here: Lines 1367 to 1369 in 7061908
That shouldn't be …
Is this work still of interest? I think it's ready for final review, and probably for merging.
Thank you again for all your effort, sorry for not responding for so long, I had other things on my mind.
Looks pretty good for the most part! Of course, there is now a lot of duplication between load_grammar.py and lark_validator.py, but I think that can be reduced in a future PR and is fine for now.
tests/test_grammar.py
def test_errors(self):
    # TODO test_errors needs a lot of work for lark.lark.
    if parser == LarkDotLark:
        self.skipTest("test_errors needs a lot of work for lark.lark.")
I am not sure what you mean with this comment. Do you mean that the error messages still need to be improved? Or that the test takes a lot of time? Or do you mean this won't be feasible to implement?
If the first, an xfail would be more appropriate.
Yeah, that was really vague. What I meant is that test_errors() needs some rebuilding to use it for lark.lark as well as load_grammar(). It's not infeasible, but I didn't attempt it.
@MegaIng If my comment of 2024-06-20 is acceptable, let's resolve this point.
elif stmt.data == 'ignore':
    self._ignore(stmt)
elif stmt.data in ['import', 'multi_import']:
    # TODO How can we process imports in the validator?
imports always explicitly list the names being imported, so we can treat them like %declare, which should be decent enough, right?
Or maybe that produces problems with %extend, hm...
Right, there's %extend and %override, at least. I didn't feel up to recreating the logic of %import.
@MegaIng If my comment of 2024-06-20 is acceptable, let's resolve this point.
Adding to @MegaIng's review.
rule: RULE rule_params priority? ":" expansions
token: TOKEN token_params priority? ":" expansions
rule: rule_modifiers RULE rule_params priority ":" expansions
token: TOKEN priority? ":" expansions
priority is already optional
Yes, but load_grammar.py says priority is a required element of rule, and that priority is _DOT NUMBER or null. I wanted lark.lark to produce the same parse tree as load_grammar.py.
It's different for token (term in load_grammar.py) - there, load_grammar.py says priority is optional (https://github.com/lark-parser/lark/blob/master/lark/load_grammar.py#L162-L163).
@erezsh If my comment of 2024-06-20 is acceptable, let's resolve this point.
I think what I meant was that priority can already be an empty rule, so no point in making it optional.
tree = lark_parser.parse(grammar)
LarkValidator.validate(tree)

def parse(self, text: str, start=None, on_error=None):
Looks to me like it would be better if the tests that try to call parse() weren't inside _make_tests, since their dot-lark variation will be skipped anyway.
I had some hope when I wrote that stuff that I could solve the "can't call parse()" problem. I made several attempts, all failed. If you want me to move them out, I will.
@erezsh Would you like me to do that?
If it's not too much trouble, yes.
tests/test_grammar.py
assert a == b

def test_override_rule2(self):
    if parser == LarkDotLark:
If you always skip for LarkDotLark, why not take it out of _make_tests?
Same question applies to the other tests with the same pattern.
It seemed easier to not break the file up that way. If you want me to do that, I will.
@erezsh Would you like me to do that?
If I said so, then I guess so, yes :)
Unless there is a reason not to.
Make lark.lark parse the same grammar as load_grammar.py, and make grammar.md document it more fully.

Fixes #391, #1382, #1384.
All scenarios have tests in new file tests/test_lark_lark.py. Each test passed load_grammar.py and failed lark.lark before these changes, and passed both after. All existing un-skipped unit tests continue to pass after these changes. The skipped Earley tests have not been checked.

Updates to the formal Lark grammar (lark/grammars/lark.lark) to make it match the actual parser (lark/load_grammar.py):

- Terminal aliases (NAME_1 -> NAME_2) are no longer allowed. load_grammar.py appears to accept them, but forces the new name to be a rule-name. tests/test_grammar.py.test_alias_in_terminal says they should raise a GrammarError.
- Split the expansions rule, and all the rules it calls, into rule_expansions and token_expansions (and similarly named rules for them to call).
- Rule aliases (name_1 -> name_2) are now only valid for alternates at the top-level of a rule.
- The %ignore directive must now be a single token. load_grammar.py accepts multiple tokens, but only processes the first.
- The %extend directive is now defined for rules.
- The %extend directive is now defined for tokens.
- The %override directive now supports overriding token definitions.
- Rule references may now carry modifiers (rule1: !?rule2), and the modifier ?! is now accepted, in addition to the existing !, ?, and !?.

Several tests in tests/test_grammar.py have been copied to new file tests/test_grammar_formal.py and updated to parse using lark.lark instead of load_grammar.py. Only the tests that verify how the Lark grammar is parsed could be implemented. Those that verify how a Lark-generated parser operates could not, because there's no (easy?) way to create a parser from the parse tree.

Oddities not changed:
- Literal-range endpoints are specified in lark.lark and load_grammar.py as STRINGs, but must actually be single characters. load_grammar.py initially accepts multi-character strings, but later rejects them.
- Replacing STRING ".." STRING in lark.lark with /./ ".." /./ failed, as Lark accepts Unicode escapes as strings (e.g. test_unicode_literal_range_escape2 uses "\U0000FFFF".."\U00010002").
- Replacing STRING ".." STRING in lark.lark with /.|"\\U[0-9a-fA-F]+"/ ".." /.|"\\U[0-9a-fA-F]+"/ failed, causing lots of unrelated tests to fail.
- load_grammar.py does not allow STRING tokens to be empty (i.e., ""). There is no (easy?) way to prevent that in lark.lark. The obvious answer, adding /(?<!"")/, broke tests/test_parser.py._TestParser.test_backslash2.
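As a footnote to the Unicode-escape oddity above, a small stdlib check (my own, not from the test suite) shows why a single-character regex such as /./ cannot stand in for STRING as a range endpoint: the escape is many characters of grammar source text, even though it denotes a single character:

```python
import codecs

# "\U0000FFFF" appears as ten characters inside a grammar file, but after
# escape processing it denotes exactly one character -- which is why a
# naive /./ terminal cannot match it at the grammar level.
raw = r"\U0000FFFF"                      # as written in the grammar text
decoded = codecs.decode(raw, "unicode_escape")

print(len(raw))      # 10 source characters
print(len(decoded))  # 1 actual character
```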