-
Notifications
You must be signed in to change notification settings - Fork 199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternative branch should match shared prefix only once #411
Comments
Optimizations are not that easy in general. Pattern to pattern optimizations are usually done by other projects such as https://github.com/zherczeg/repan . The general workflow is optimizing patterns at compile time, and the application uses the optimized variants. This saves time and energy, since the costly steps are done once. |
The repan project implements this feature as "merge alternatives" - https://github.com/zherczeg/repan/blob/25d6e3aabf/tests/opt/test_merge_alternatives_expected.txt. Such optimization is very powerful with a little extra CPU cycles needed when compiling the pattern and with zero to minimal extra CPU cycles needed when evaluating the compiled pattern. Such optimization should be done by PCRE library itself as many typical usages do not involve compile phase (interpreted languages, user search inputs, ...) and requiring this to be done by intermediate library would require parsing the regex twice and requiring all vendors to maintain extra dependency. I have updated the description to stress there are cases which can and should be optimized internally but cannot be done by "pattern to pattern optimizations" anyway. |
Now that it's recorded as an issue, somebody might pick this up in the future. It won't be me, because after I manage to get 10.44 released (soon, I hope) I want to start looking for someone to take over the PCRE2 project because I am getting too old. :-( |
Some experimentation on this is happening (not by me) but we have discovered that you cannot just pull out any old common prefix because it can change the meaning of the pattern. Any kind of backtracking can cause problems. Consider the pattern /(AB|AC)/ where A is a subpattern that has at least one backtracking point. If A matches and then B fails, control goes back to the backtracking point in A. However, if the pattern is /A(B|C)/ then when A matches but B fails, control moves on to C. The example given above, (ab?c|ab?d) looks safe, but it is easily modified into one that isn't. Here's some pcre2test output:
|
FWIW, noticed that perl is using DFA to solve this as shown by:
|
This is a feature request for engine optimization.
Repro: https://3v4l.org/JKn6D/rfc#vgit.master_jit
Such dummy regexes like
(a|b|...|z)
are very common for language grammars or routing.The repro shows massive speedup (from 3.4 ms to 0.82 ms). According regex101.com the speedup is 865 vs. 93 steps only.
The engine should detect shared prefix/subpattern in alternative branch like
(ab?c|ab?d)
and optimize it into(ab?(?:c|d))
.Why we need this optimization? Consider these patterns:
(ab?c|x|ab?d)
!=(ab?(?:c|d)|x)
((a)b|((a)c)
!=((a)(?:b|c))
such usecases need internal optimization support as they cannot be optimized outside the engine.
The text was updated successfully, but these errors were encountered: