Skip to content

Commit

Permalink
fix!: non-consuming alternatives in ordered choices
Browse files Browse the repository at this point in the history
If a branch in ordered choice has a potentially non-consuming successful
alternative (Optional, ZeroOrMore, Not, And), it will succeed if
alternative succeeds and thus will not try further choices.
  • Loading branch information
igordejanovic committed Apr 1, 2023
1 parent 0a57265 commit db503c9
Show file tree
Hide file tree
Showing 3 changed files with 32 additions and 18 deletions.
7 changes: 3 additions & 4 deletions arpeggio/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -416,10 +416,9 @@ def _parse(self, parser):
for e in self.nodes:
try:
result = e.parse(parser)
if result is not None:
match = True
result = [result]
break
match = True
result = [result]
break
except NoMatch:
parser.position = c_pos # Backtracking
finally:
Expand Down
11 changes: 8 additions & 3 deletions arpeggio/tests/regressions/issue_20/test_issue_20.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,15 +9,20 @@
#######################################################################

from __future__ import unicode_literals
import pytest

# Grammar
from arpeggio import ParserPython, Optional, EOF
from arpeggio import ParserPython, Optional, NoMatch, EOF

def g(): return [Optional('first'), Optional('second'), Optional('third')], EOF


def test_optional_in_choice():
parser = ParserPython(g)
# This input fails as the ordered choice will succeed on the first optional

This comment has been minimized.

Copy link
@stanislaw

stanislaw Apr 1, 2023

Contributor

This seems to be something else what I don't understand.

Why would this test not succeed? The second Optional is clearly a direct match.

You are writing that the ordered choice will succeed on the first optional - I cannot understand how Optional('first') can succeed on "second".

This comment has been minimized.

Copy link
@igordejanovic

igordejanovic Apr 1, 2023

Author Member

Optional always succeeds. The only difference is if it consumes the input or not. If it matches it will consume the input if it doesn't match it will not consume the input but will succeed nevertheless.

This comment has been minimized.

Copy link
@igordejanovic

igordejanovic Apr 1, 2023

Author Member

That means that having Optional as a choice in ordered choice is pathological case as the rest of the alternatives will never be tried. Alternatives are optional already so no use in using explicit optional, it only cuts away the rest of the alternatives.

This comment has been minimized.

Copy link
@igordejanovic

igordejanovic Apr 1, 2023

Author Member

These pathological cases are something that #101 is going to handle and report error during parser construction.

This comment has been minimized.

Copy link
@stanislaw

stanislaw Apr 1, 2023

Contributor

This makes sense but seems to be too pathological in a sense that the first optional kills the opportunity for the more appropriate second optional to succeed. Not only to succeed but also to consume the "second" part of the input.

Would not letting optional choice succeed directly in the OrderedChoice class make this case to be non-pathological, but deterministic?

This comment has been minimized.

Copy link
@igordejanovic

igordejanovic via email Apr 1, 2023

Author Member

This comment has been minimized.

Copy link
@stanislaw

stanislaw Apr 1, 2023

Contributor

Your reasoning is something what I definitely had less time to think about than you, so please take my input as very raw:

And you are trying to match 'd'. Should this ordered choice succeed or
not? If we continue to try 'b' and 'c' we'll fail, but obviously the
first branch must succeed and thus the whole ordered choice should
succeed.

I have checked with @mettta, and we tend to interpret the grammar: [Optional('a'), 'b', 'c'] using regular expressions that we are used to.

In this case, we definitely try to think about this expression as (a?|b|c), so in this way of thinking, there is no way how d would be matched. The following
expressions should be matched:

  • (nothing)
  • a
  • b
  • c

Note that d or any other character is not possible to be matched.

In your case, doing Optional('a') makes it match everything, and what's
especially puzzling is that you do not indicate is that everything of one character or everything that can also be a larger string could also be matched this way?

Having this said, we agree that our interpretation becomes equivalent to your alternative suggestion: If you want the whole rule to be optional just wrap the ordered choice in Optional.

I am questioning this also from the perspective of the downstream users (my StrictDoc library). So far, for me and some other users I was debugging my grammars, relying on the regex model was an easy way to get started writing based on TextX grammar. If you are introducing an exception, like this one about OrderedChoice, it might complicate the reasoning of the downstream users (I am certain about myself and like 75% certain about the other users).

This comment has been minimized.

Copy link
@igordejanovic

igordejanovic Apr 3, 2023

Author Member

Sorry for a bit late response.

In this case, we definitely try to think about this expression as (a?|b|c), so in this way of thinking, there is no way how d would be matched.

It seems that I wasn't clear enough. I apologize. What I meant was that if you give d to the above rule the parser won't fail. It will match empty and the next rule will be able to continue. So the above rule in PEG (with the current Arpeggio implementation) could match only:

  • (empty)
  • a

as the rest of the alternatives would be unreachable.

I have checked with @mettta, and we tend to interpret the grammar: [Optional('a'), 'b', 'c'] using regular expressions that we are used to.

Yes, indeed. Regexes works a bit different than PEGs. Regexes check each alternative and use "longest match" strategy when matching alternative. For example:

^(a*|abc|bc|bcd)e

Will match: ae, e, bce, bcde, abce.

But in PEG only: e, ae, aae... as a* is infallible and other alternatives won't be checked.

I checked with some other implementations and they are following current Arpeggio's approach.

For example try this grammar:

Test {
    A = "a"? | "b"
}

In Ohm parser's online editor, with input a and b. b won't match as the a? will match empty.

The problem with OrderedChoice are infallible rules, i.e. rules that always succeeds. The PEG strictly says that if a choice succeeds it is taken and the rest alternatives are not tried, effectively cutting of and making unreachable all alternatives after the infallible choice.

But I understand your concern and we could make an exception for OrderedChoice but I see two problems with that:

  1. If we follow "longest match" rule of regex-es we must always try every branch of OrderedChoice which would significantly impact performance. We won't anymore have linear complexity guarantees of packrat parsing. Thus, in PEG papers it is suggested to organize ordered choice matches from longest to shortest. E.g. instead of:

    a | ab
    

    you should write

    ab | a
    

    Thus, instead of:

    a? | b | c
    

    you should write:

    b | c | a?
    
  2. In the implementation we would have special handling in those situation. Currently, if the rule fails it throws NoMatch, otherwise it succeeded. Previous implementation used neither PEG nor regex semantics. In the previous implementation we looked at the matched content and if it is empty we would continue with other choices, which provided a false feeling that it obeyed regex rules, but it wasn't. It was broken as demonstrated in #96 where successful Not match is ignored as no input was consumed and the parser would go to the next alternative and fail.

There is a strategy that would be the middle ground that works most of the time:

  • If the alternative succeeds without consuming input try next alternative until some of them succeed consuming input.
  • If we have exhausted all alternatives without consuming input but there is infallible alternative then the whole ordered choice succeeds (as it is also infallible) otherwise it fails with NoMatch.

But, it would deviate from PEG and thus I would need some time to think about implementing this.

This should cover cases like #96. But, if we deviate from PEG we can't say that Arpeggio is PEG parser anymore.

For example:

(!a | b) c

In PEG this would parse: only c.

But with our approach: c and bc because !a succeeds without consuming input so b is tried and succeed after which c succeeds.

Thus this grammar defines different languages in PEG and Arpeggio with our modified strategy.

I did some analysis what should be done in #101.

This comment has been minimized.

Copy link
@stanislaw

stanislaw Apr 4, 2023

Contributor

Thank you for your explanations. I am fluctuating between fully understanding your explanation of the difference between regex and PEG and trying to match the explained with what I thought I understood about the OrderedChoice on the TextX level, as described here: http://textx.github.io/textX/3.1/grammar/#ordered-choice. I am trying to figure out if this discussion should change my understanding when I use or adjust the TextX-based StrictDoc grammar in the future. I don't think I ever used Optionals inside OrderedChoice, though.

It seems that I wasn't clear enough. I apologize. What I meant was that if you give d to the above rule the parser won't fail.

I think, I understood you exactly this way and this is what I found confusing with respect to the behavior of the OrderedChoice. I understood OrderedChoice as a construct that should exactly fail on d. Just to make sure:

Stanislav's understanding was (used to be 😄): If you give d to the OrderedChoice([Optional('a'), 'b', 'c']), the parser will fail just in the same way as the regular expression (a?|b|c) will fail on "d".

Now, with your explanation of the difference, you say that for PEG: if a choice succeeds it is taken and the rest alternatives are not tried, effectively cutting of and making unreachable all alternatives after the infallible choice. and that makes me wonder:

What if I would want to translate the regular expression (a?|b|c) into the Arpeggio expression. I.e., how to write an Arpeggio parser rule that does the same:

import re

regex = re.compile('^(a?|b|c)$')
print(regex.match(""))
print(regex.match("a"))
print(regex.match("b"))
print(regex.match("c"))
print(regex.match("d"))

<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='b'>
<re.Match object; span=(0, 1), match='c'>
None

This would be Optional(OrderedChoice(["a", "b", "c"]), correct? For this discussion, it is relevant that I don't have a good real-world TextX-level example at hand, so this regex->Arpeggio translation exercise I would like to see confirmed by you to make sure I wasn't doing anything wrong in my TextX grammars.

I think the essentials of this discussion and the differences between PEG and regex should be documented somewhere, especially on the TextX side, in general or at least around the lines thus the | operator directly translates to Arpeggio's PEG ordered choice. Using ordered choice yields unambiguous parsing. If the text parses there is only one possible parse tree. Maybe it is already documented and I just rushed into the implementation of diagnostics without reading it? 🥲

P.S. Did you have a chance to consider the implementation strategy here: #102? I am eager to do what it takes, so that we can reach the next level of reporting by Arpeggio and TextX.

This comment has been minimized.

Copy link
@stanislaw

stanislaw Apr 4, 2023

Contributor

Also, to answer to this specifically:

But I understand your concern and we could make an exception for OrderedChoice but I see two problems with that:

I definitely do not want to change anything about the behavior of Arpeggio, especially after you explained the difference. The only source of confusion was me hacking on Arpeggio with the regex mindset in mind, and now I am trying to upgrade to PEG. This thread can be definitely closed as per my previous comment.

This comment has been minimized.

Copy link
@igordejanovic

igordejanovic Apr 5, 2023

Author Member

This would be Optional(OrderedChoice(["a", "b", "c"]), correct?

Yes, you are correct. Here, each branch of OC is fallible so branches are tried in order from left to right and only if nothing is matched the parser matches empty due to Optional and continues. So, yes, the behavior is similar to regex a?|b|c as the regex will try every branch and take the longest match where the longest match can still be of zero length due to a?. You could also rewrite that regex to a variant similar to PEG solution (a|b|c)?.

So, infallible rules in OC branches (like Optional, ZeroOrMore, regexes that can match empty, or sequences of infallibles) are always errors and thus should be reported to the user.

I think the essentials of this discussion and the differences between PEG and regex should be documented somewhere, especially on the TextX side

Definitely! I plan to extensively document this and explain the rationale as it is a breaking change. Hopefully #101 will catch all invalid cases in the wild and offer instructions for fixing the problems. So, #101 is a prerequisite for releasing the next version.

In general, this problem is aligned with a well known problem in PEG that the alternatives in PEG must be ordered from the longest matching to the shortest. I'll check do we have that in the docs also as it is important for the user to know it.

P.S. Did you have a chance to consider the implementation strategy here: #102? I am eager to do what it takes, so that we can reach the next level of reporting by Arpeggio and TextX.

Sorry. I caught cold the other day and still recovering. But, I did give it a quick overview and it seems good. I think that having a hierarchy of NoMatch classes is a good idea. As you already mentioned, having tuples in return values and using isinstance is not very elegant but I don't have a better idea to suggest at the moment. I'm also wondering how big of a performance impact it might be. I guess you can continue in that direction and we'll figure out something along the way.

This comment has been minimized.

Copy link
@stanislaw

stanislaw Apr 5, 2023

Contributor

All makes sense. Thanks for the feedback about the PR. I will try to clean it up towards 0 WIPs and Questions. There will be a couple of public holidays here in Germany, so I should be able to focus and have something by the end of this week.

I have a few follow-up questions about that PR, but I will ask them in that separate issue of the PR.

# without consuming the input.
input_str = "second"
parse_tree = parser.parse(input_str)
assert parse_tree is not None
with pytest.raises(NoMatch) as e:
parser.parse(input_str)

assert "Expected 'first' or EOF" in str(e.value)
32 changes: 21 additions & 11 deletions arpeggio/tests/test_error_reporting.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,16 +27,13 @@ def grammar():
assert "Expected 'a' or 'b'" in str(e.value)
assert (e.value.line, e.value.col) == (1, 1)

# This grammar always succeeds due to the optional match
def grammar():
return ['b', Optional('a')]

parser = ParserPython(grammar)

with pytest.raises(NoMatch) as e:
parser.parse('c')

assert "Expected 'b'" in str(e.value)
assert (e.value.line, e.value.col) == (1, 1)
parser.parse('b')
parser.parse('c')

This comment has been minimized.

Copy link
@stanislaw

stanislaw Apr 1, 2023

Contributor

We have started rebasing, and we have found this change. Could you explain why the test no longer raises here?

The c symbol is nowhere in the grammar, but the test still passes?

Doesn't the grammar have to exhaust all the input?

This comment has been minimized.

Copy link
@igordejanovic

igordejanovic Apr 1, 2023

Author Member

It doesn't need to exhaust all the input if the grammar doesn't end with EOF. Please see the first warning box on this page.

Thus, having c at the input is possible as the Optional would make the whole ordered choice to succeed. Maybe the test is a bit misleading. If you put EOF at the end of the grammar rule you wouldn't be able to parse c but you would be able to parse empty string.

This comment has been minimized.

Copy link
@stanislaw

stanislaw Apr 1, 2023

Contributor

The EOF warning box knowledge was something that I missed. Now I will take a look again!



def test_optional_with_better_match():
Expand All @@ -45,7 +42,7 @@ def test_optional_with_better_match():
has precedence over non-optional.
"""

def grammar(): return [first, Optional(second)]
def grammar(): return [first, (Optional(second), 'six')]
def first(): return 'one', 'two', 'three', '4'
def second(): return 'one', 'two', 'three', 'four', 'five'

Expand Down Expand Up @@ -131,9 +128,10 @@ def grammar():
return ['one', Not('two')], _(r'\w+')

parser = ParserPython(grammar)
parser.parse('three ident')

with pytest.raises(NoMatch) as e:
parser.parse(' three ident')
parser.parse(' two ident')
assert "Expected 'one' at " in str(e.value)


Expand Down Expand Up @@ -165,6 +163,18 @@ def grammar():
parser.parse(' three ident')
assert "Expected 'one' or 'two' at" in str(e.value)

with pytest.raises(NoMatch) as e:
parser.parse(' four ident')
assert "Expected 'one' or 'two' at" in str(e.value)
parser.parse(' four ident')


def test_not_succeed_in_ordered_choice():
"""
Test that Not can succeed in ordered choice leading to ordered choice
to succeed.
See: https://github.com/textX/Arpeggio/issues/96
"""

def grammar():
return [Not("a"), "a"], Optional("b")

parser=ParserPython(grammar)
parser.parse('b')

0 comments on commit db503c9

Please sign in to comment.