Alternative branch should match shared prefix only once #411

mvorisek · 2024-06-04T10:11:27Z

This is a feature request for engine optimization.

Repro: https://3v4l.org/JKn6D/rfc#vgit.master_jit

Such dummy regexes like (a|b|...|z) are very common for language grammars or routing.

The repro shows massive speedup (from 3.4 ms to 0.82 ms). According regex101.com the speedup is 865 vs. 93 steps only.

The engine should detect shared prefix/subpattern in alternative branch like (ab?c|ab?d) and optimize it into (ab?(?:c|d)).

Why we need this optimization? Consider these patterns:

non-consequent shared prefix - ex. (ab?c|x|ab?d) != (ab?(?:c|d)|x)
subpatterns with capturing groups - ex. ((a)b|((a)c) != ((a)(?:b|c))

such usecases need internal optimization support as they cannot be optimized outside the engine.

The text was updated successfully, but these errors were encountered:

zherczeg · 2024-06-04T10:19:03Z

Optimizations are not that easy in general. Pattern to pattern optimizations are usually done by other projects such as https://github.com/zherczeg/repan . The general workflow is optimizing patterns at compile time, and the application uses the optimized variants. This saves time and energy, since the costly steps are done once.

mvorisek · 2024-06-04T10:53:44Z

Pattern to pattern optimizations are usually done by other projects such as https://github.com/zherczeg/repan .

The repan project implements this feature as "merge alternatives" - https://github.com/zherczeg/repan/blob/25d6e3aabf/tests/opt/test_merge_alternatives_expected.txt.

Such optimization is very powerful with a little extra CPU cycles needed when compiling the pattern and with zero to minimal extra CPU cycles needed when evaluating the compiled pattern. Such optimization should be done by PCRE library itself as many typical usages do not involve compile phase (interpreted languages, user search inputs, ...) and requiring this to be done by intermediate library would require parsing the regex twice and requiring all vendors to maintain extra dependency.

I have updated the description to stress there are cases which can and should be optimized internally but cannot be done by "pattern to pattern optimizations" anyway.

PhilipHazel · 2024-06-04T15:06:52Z

Now that it's recorded as an issue, somebody might pick this up in the future. It won't be me, because after I manage to get 10.44 released (soon, I hope) I want to start looking for someone to take over the PCRE2 project because I am getting too old. :-(

PhilipHazel · 2024-07-25T08:10:08Z

Some experimentation on this is happening (not by me) but we have discovered that you cannot just pull out any old common prefix because it can change the meaning of the pattern. Any kind of backtracking can cause problems. Consider the pattern /(AB|AC)/ where A is a subpattern that has at least one backtracking point. If A matches and then B fails, control goes back to the backtracking point in A. However, if the pattern is /A(B|C)/ then when A matches but B fails, control moves on to C. The example given above, (ab?c|ab?d) looks safe, but it is easily modified into one that isn't. Here's some pcre2test output:

/(ab??c|ab??d|ab??b)/
abc
 0: abc
 1: abc

/ab??(c|d|b)/
abc
 0: ab
 1: b

carenas · 2024-10-18T19:43:49Z

FWIW, noticed that perl is using DFA to solve this as shown by:

perl -Mre=debug -e'"deEfficiency" =~ /d?(?:eficiency|efficiency|Efficiency)/;'
Compiling REx "d?(?:eficiency|efficiency|Efficiency)"
Final program:
   1: CURLY{0,1} (5)
   3:   EXACT <d> (0)
   5: TRIEC-EXACT[Ee] (21)
      <eficiency> 
      <efficiency> 
      <Efficiency> 
  21: END (0)
minlen 9

mvorisek mentioned this issue Sep 7, 2024

Rewrite regexes where common prefix can be pulled out from alternation branches #464

Draft

NWilson added the enhancement New feature or request label Dec 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alternative branch should match shared prefix only once #411

Alternative branch should match shared prefix only once #411

mvorisek commented Jun 4, 2024 •

edited

Loading

zherczeg commented Jun 4, 2024

mvorisek commented Jun 4, 2024 •

edited

Loading

PhilipHazel commented Jun 4, 2024

PhilipHazel commented Jul 25, 2024

carenas commented Oct 18, 2024

Alternative branch should match shared prefix only once #411

Alternative branch should match shared prefix only once #411

Comments

mvorisek commented Jun 4, 2024 • edited Loading

zherczeg commented Jun 4, 2024

mvorisek commented Jun 4, 2024 • edited Loading

PhilipHazel commented Jun 4, 2024

PhilipHazel commented Jul 25, 2024

carenas commented Oct 18, 2024

mvorisek commented Jun 4, 2024 •

edited

Loading

mvorisek commented Jun 4, 2024 •

edited

Loading