Allow to replace replacement (for recursive grammars data) #337

Open
mvorisek opened this issue Nov 14, 2023 · 2 comments
@mvorisek

This is a feature request to allow replacing inside a replacement, i.e. to resume the replace scan at the character after the start of the match instead of at the character after the end of the match.

Currently, recursive regexes must be manually restarted to match inner matches, which implies some unneeded CPU overhead, especially in non-compiled programming languages.
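For illustration, the manual restart described above might look like the following in Python. This is only a sketch: the helper name `replace_until_stable`, the `max_passes` cap, and the bracket-rewriting pattern are assumptions made for the example, not something taken from PCRE2 or from the request.

```python
import re

def replace_until_stable(pattern, repl, text, max_passes=100):
    """Re-run a global substitution until a pass replaces nothing.

    Hypothetical helper: every pass rescans the whole string from
    offset 0, which is the overhead the request wants to avoid.
    """
    compiled = re.compile(pattern)
    for _ in range(max_passes):
        text, count = compiled.subn(repl, text)
        if count == 0:
            return text
    raise RuntimeError("no fixed point reached within max_passes")

# Innermost parentheses are rewritten first, then the enclosing pair.
print(replace_until_stable(r"\(([^()]*)\)", r"[\1]", "(a(b)c)"))  # [a[b]c]
```

Here three full passes are needed (two that replace and one that confirms nothing is left), even though the input is only seven characters long.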

I propose a new PCRE flag that forces the replace process to continue at +1 character from the start of the match instead of +N characters (where N is the number of matched characters).

@PhilipHazel PhilipHazel added the enhancement New feature or request label Nov 15, 2023
@PhilipHazel
Collaborator

I have just had a look at this, and what you suggest is not something that can easily be done, because the code works by creating its output in a different buffer. When the global option is set, the scan continues in the old (input) buffer.

I think you could implement what you want externally fairly efficiently by having two buffers. Start with buffer 1 holding the input, call pcre2_match(), remember the offset where it matched, then call pcre2_substitute() with your match_data block and PCRE2_SUBSTITUTE_MATCHED but NOT the global option, and with buffer 2 as the output. Start the next call to pcre2_match() with buffer 2 as the input, the appropriate offset, and buffer 1 as the output. And so on.
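The loop described above can be sketched in Python as a rough analogue. It uses Python's `re` module rather than the PCRE2 C API, so `search()` stands in for pcre2_match() and a single `Match.expand()` stands in for one non-global pcre2_substitute() call. The function name `rescanning_sub`, the `max_steps` cap, and the choice of "match start + 1" as the "appropriate offset" are assumptions for the example.

```python
import re

def rescanning_sub(pattern, repl, text, max_steps=1000):
    """Replace one match at a time, resuming one character after the
    start of each match (assumed to be the 'appropriate offset')."""
    compiled = re.compile(pattern)
    src = text  # plays the role of the current input buffer
    pos = 0
    for _ in range(max_steps):
        m = compiled.search(src, pos)
        if m is None:
            return src
        # One non-global substitution, written to the other buffer.
        dst = src[:m.start()] + m.expand(repl) + src[m.end():]
        pos = m.start() + 1  # resume inside the replaced text
        src = dst            # the output becomes the next input
    raise RuntimeError("max_steps exceeded; the replacement may never terminate")

print(re.sub("aa", "ba", "aaa"))          # baa
print(rescanning_sub("aa", "ba", "aaa"))  # bba
```

The cap is there because nothing in this scheme guarantees termination; a replacement that reintroduces a match ahead of the resume offset would otherwise loop forever.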

@NWilson
Member

NWilson commented Nov 6, 2024

In general, this would cause infinite recursion.

Consider doing 'aa'.replace(/a/g, 'ba') (JavaScript) or re.sub(r'a', 'ba', 'aa') (Python). After replacing the first 'a' with 'ba', if you continue matching then you'd just inflate the string forever. So other languages don't provide the behaviour you describe.
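The growth can be observed with a small Python trace. This is a sketch of the naive "+1 character" rule only; the generator name `rescan_steps` is made up for the example, and the trace is cut off after four steps because it would not stop on its own.

```python
import itertools
import re

def rescan_steps(pattern, repl, text):
    """Yield the string after each single replacement, resuming the
    scan one character after the start of the previous match."""
    compiled = re.compile(pattern)
    pos = 0
    while True:
        m = compiled.search(text, pos)
        if m is None:
            return
        text = text[:m.start()] + m.expand(repl) + text[m.end():]
        pos = m.start() + 1
        yield text

# Only the first four steps are taken; the generator itself never ends here.
first_steps = list(itertools.islice(rescan_steps("a", "ba", "aa"), 4))
print(first_steps)  # ['baa', 'bbaa', 'bbbaa', 'bbbbaa']
```

Each step finds the 'a' that the previous replacement just wrote, so the string gains one character per step.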

To stop this, you'd have to add a restriction: the matching wouldn't simply continue at +1 character (bumping forwards into the replaced string), but would additionally skip matches until it finds one that intersects the next unmatched portion of the string.

This would allow cases like 'aaa'.replace(/aa/g, 'ba') to do 'aaa' → 'baa' (first replacement) → 'bba' (second). It would terminate, since each replacement is guaranteed to consume at least one further character from the original input string.
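One possible reading of that restriction, sketched in Python: keep a `frontier` index marking where the untouched original input begins, and accept a match only if it extends past that index. The names `restricted_rescanning_sub` and `frontier` are assumptions for the example, and this is one interpretation of the rule rather than a proposed PCRE2 design.

```python
import re

def restricted_rescanning_sub(pattern, repl, text):
    """Rescanning replace that accepts a match only if it reaches into
    the not-yet-consumed part of the original input."""
    compiled = re.compile(pattern)
    frontier = 0  # text[frontier:] is untouched original input
    pos = 0
    while pos <= len(text):
        m = compiled.search(text, pos)
        if m is None:
            break
        if m.end() > frontier:
            # The match consumes at least one original character.
            replacement = m.expand(repl)
            text = text[:m.start()] + replacement + text[m.end():]
            frontier = m.start() + len(replacement)
        # Otherwise the match lies wholly inside replaced text: skip it.
        pos = m.start() + 1
    return text

print(restricted_rescanning_sub("aa", "ba", "aaa"))  # bba
print(restricted_rescanning_sub("a", "ba", "aa"))    # baba
```

The loop terminates because `pos` strictly increases on every iteration and every accepted replacement shrinks the remaining original input, so the inflating 'a' to 'ba' case ends with the same result as an ordinary global replace.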

However, this isn't what you requested, I think.

The substitution functions in other languages, like Perl/Python/JavaScript, don't have this feature.

Maybe we should close this ticket, by adding some documentation explaining how to implement this yourself (in the application, by doing repeated replacements). It seems unlikely we'd want it in PCRE2 itself.

@NWilson NWilson self-assigned this Jan 8, 2025
@NWilson NWilson added this to the 10.46 milestone Jan 8, 2025
@NWilson NWilson removed their assignment Jan 8, 2025