Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lift compile time default maximum pattern length #406

Open
mvorisek opened this issue Apr 17, 2024 · 2 comments
Open

Lift compile time default maximum pattern length #406

mvorisek opened this issue Apr 17, 2024 · 2 comments
Labels
enhancement New feature or request
Milestone

Comments

@mvorisek
Copy link

mvorisek commented Apr 17, 2024

This is a feature request to lift the default length limit of 64k bytes.

PCRE library is used in many complex applications like to parse language grammars and the requirements naturally grow every year.

We are at year 2024 and the PCRE library itself should not impose too much unneeded limitation. Especially compile time limitation.

I propose to lift the compile time limit by raising the LINK_SIZE option to 4 by default. At least of 64-bit platforms.

The matching time/performance should not increase much.

If there should be a limit, it should be runtime configureable like other limits using pcre2_set_*_limit() function with default like 10M (UPDATE: implemented in 05aafb2).

related past issues: #119 #271 JuliaText/TextAnalysis.jl#258

@PhilipHazel
Copy link
Collaborator

I am not at all keen on making such a change, for a number of reasons:

  1. It's a serious incompatible change that may affect existing users. I don't like doing that.
  2. There are very many PCRE2 users for whom 64K is more than enough. I'm thinking of, for example, email routing (for which I originally wrote PCRE), text editing, and similar applications where the patterns are generally quite short.
  3. I know (because there have been issues in the past) that there are applications using PCRE2 that run very large numbers of separate threads, each of which has very limited memory. Such a change may well impact them.
  4. There is of course a performance cost, though it would depend very much on the pattern details.

If I were starting again (knowing what I now know) I would certainly do things differently, probably always compiling into 32-bit units, but hindsight is always 20/20.

@NWilson NWilson self-assigned this Jan 8, 2025
@NWilson NWilson added this to the 10.46 milestone Jan 8, 2025
@NWilson NWilson removed their assignment Jan 8, 2025
@NWilson
Copy link
Member

NWilson commented Jan 8, 2025

It seems to me that the issue here is not LINK_SIZE, but rather that we compile (a|b){1,2000} in such a space-inefficient way.

The default of 64K for compiled patterns is reasonable (although higher might be ideal), since patterns are typically small, and we can handle matching against arbitrarily-long subject strings.

We just need some kind of limit to how much bloat is produced for patterns with fixed-maximum repetitions. Doing a handful of duplications of the pattern is OK; doing 1000 is not.

@NWilson NWilson added the enhancement New feature or request label Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants