Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative branch should match shared prefix only once #411

Open
mvorisek opened this issue Jun 4, 2024 · 5 comments
Open

Alternative branch should match shared prefix only once #411

mvorisek opened this issue Jun 4, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@mvorisek
Copy link

mvorisek commented Jun 4, 2024

This is a feature request for engine optimization.

Repro: https://3v4l.org/JKn6D/rfc#vgit.master_jit

Such dummy regexes like (a|b|...|z) are very common for language grammars or routing.

The repro shows massive speedup (from 3.4 ms to 0.82 ms). According regex101.com the speedup is 865 vs. 93 steps only.

The engine should detect shared prefix/subpattern in alternative branch like (ab?c|ab?d) and optimize it into (ab?(?:c|d)).

Why we need this optimization? Consider these patterns:

  • non-consequent shared prefix - ex. (ab?c|x|ab?d) != (ab?(?:c|d)|x)
  • subpatterns with capturing groups - ex. ((a)b|((a)c) != ((a)(?:b|c))

such usecases need internal optimization support as they cannot be optimized outside the engine.

@zherczeg
Copy link
Collaborator

zherczeg commented Jun 4, 2024

Optimizations are not that easy in general. Pattern to pattern optimizations are usually done by other projects such as https://github.com/zherczeg/repan . The general workflow is optimizing patterns at compile time, and the application uses the optimized variants. This saves time and energy, since the costly steps are done once.

@mvorisek
Copy link
Author

mvorisek commented Jun 4, 2024

Pattern to pattern optimizations are usually done by other projects such as https://github.com/zherczeg/repan .

The repan project implements this feature as "merge alternatives" - https://github.com/zherczeg/repan/blob/25d6e3aabf/tests/opt/test_merge_alternatives_expected.txt.

Such optimization is very powerful with a little extra CPU cycles needed when compiling the pattern and with zero to minimal extra CPU cycles needed when evaluating the compiled pattern. Such optimization should be done by PCRE library itself as many typical usages do not involve compile phase (interpreted languages, user search inputs, ...) and requiring this to be done by intermediate library would require parsing the regex twice and requiring all vendors to maintain extra dependency.

I have updated the description to stress there are cases which can and should be optimized internally but cannot be done by "pattern to pattern optimizations" anyway.

@PhilipHazel
Copy link
Collaborator

Now that it's recorded as an issue, somebody might pick this up in the future. It won't be me, because after I manage to get 10.44 released (soon, I hope) I want to start looking for someone to take over the PCRE2 project because I am getting too old. :-(

@PhilipHazel
Copy link
Collaborator

Some experimentation on this is happening (not by me) but we have discovered that you cannot just pull out any old common prefix because it can change the meaning of the pattern. Any kind of backtracking can cause problems. Consider the pattern /(AB|AC)/ where A is a subpattern that has at least one backtracking point. If A matches and then B fails, control goes back to the backtracking point in A. However, if the pattern is /A(B|C)/ then when A matches but B fails, control moves on to C. The example given above, (ab?c|ab?d) looks safe, but it is easily modified into one that isn't. Here's some pcre2test output:

/(ab??c|ab??d|ab??b)/
abc
 0: abc
 1: abc

/ab??(c|d|b)/
abc
 0: ab
 1: b

@carenas
Copy link
Contributor

carenas commented Oct 18, 2024

FWIW, noticed that perl is using DFA to solve this as shown by:

perl -Mre=debug -e'"deEfficiency" =~ /d?(?:eficiency|efficiency|Efficiency)/;'
Compiling REx "d?(?:eficiency|efficiency|Efficiency)"
Final program:
   1: CURLY{0,1} (5)
   3:   EXACT <d> (0)
   5: TRIEC-EXACT[Ee] (21)
      <eficiency> 
      <efficiency> 
      <Efficiency> 
  21: END (0)
minlen 9 

@NWilson NWilson added the enhancement New feature or request label Dec 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants