-
Notifications
You must be signed in to change notification settings - Fork 199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Always validate UTF #588
Comments
What is the problem with PCRE2_MATCH_INVALID_UTF / PCRE2_JIT_INVALID_UTF ? |
For people who actually want to do lenient processing of messed-up data with PCRE2, this is not especially useful. For me... the №1 reason to do this change ("always validate UTF") is because I'm scared of the safety implications of the non-validating UTF modes. This is also the reason I opened the ticket "remove lengthptr != NULL pass" - performance is nice, but safety is better. |
The original requests was coming from JIT tests: This one for example parses many invalid bytes: |
It's documented in pcre2unicode:
It really does not seem to be the case, that you can write a pattern |
Yes, the |
A run the raw |
That seems to be what the documentation says - match the But I would have expected that what people want is for |
|
|
You need to add a lot of checks in that case. There are many ways a UTF sequence is invalid, e.g. \xc0\x80 is invalid, because \x00 can be encoded as a single byte character. You can also check jit speed with and without flag, but don't forget jit always do a lot of extra optimizations, even for invalid utf fragments, that the interpreter cannot do. |
Like @zherczeg I am skeptical about the possibility of cheap on the fly testing. There are 21 different UTF-8 error codes in pcre2.h. But if you can figure out a cunning way to do it with very little cost ... |
GNU grep doesn't use mmap anymore, but indeed does their "own" UTF validation which is apparently faster than ours (eventhough I think it breaks with the logic in a few places) |
they csn always get that if UTF mode is NOT enabled. |
The problem with invalid UTF is the same as operator precedence in character classes. It does not matter what you do, there will be people who want it differently. For example, a 0xc0 0x8f is an invalid form of 0x0f or two independent "characters". What is a caseless 0x80? Does it match to |
I did make a reasonable suggestion in my first post: "Any unpaired or bad bytes are just silently treated as U+FFFD." This is a very popular and canonical treatment of bad bytes, and gives them full working Unicode semantics. But I agree... there a many open questions, and it's unclear if we'd be able to even do this task. Maybe after 10.45 (or 10.46...) I'll revisit this. |
Currently, there's something that mildly scares me in the code. We have macros for reading UTF-8 chars (which is all inlined into every callsite... hmm). But worst, the macros don't validate the chars and can do out-of-bounds memory access. If you have a string which ends in 0x80, then
GETCHAR()
will merrily read over the end of the string.This "OK" because we validate upfront both for compile & match. That is... unless the clients tell us they are really, really sure that the input is valid! (If they lie to us, then undefined behaviour / memory badness is allowed to occur.)
Wild idea: it may be quicker to just always validate!
valid_utf()
, then later on, doing a non-validating decode, must surely be slower than simply doing a validating decode later? (Or... is the concern that regexes will do backtracking, and decode the input string many times? And so hoisting the validation to the beginning is worthwhile, anyway?) I still reckon that there's hardly anything to be gained by doing a non-validating UTF decode - it's practically the same cost to validate at the same time.BUT! We can now add a new option, PCRE2_PERMISSIVE_UTF, which would allow PCRE2 to process invalid UTF input. Any unpaired or bad bytes are just silently treated as U+FFFD. This would require, as a prerequisite, having the match code be able to handle invalid UTF input.
Currently, if you have a log file with some messed-up lines in it.... tough, PCRE2 won't process it (😞). So I think there would be clients that might well be interested in having a non-judgemental UTF mode. Bad UTF exists, we can't change the world... just handle it gracefully.
The text was updated successfully, but these errors were encountered: