Description
Currently, there's something that mildly scares me in the code. We have macros for reading UTF-8 chars (which are all inlined into every callsite... hmm). But worse, the macros don't validate the chars and can do out-of-bounds memory access. If you have a string which ends in 0x80, then GETCHAR() will merrily read over the end of the string.
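To make the failure mode concrete, here's a minimal sketch of a non-validating decode in the spirit of GETCHAR() (this is not PCRE2's actual macro, just the shape of the problem):

```c
#include <stdint.h>

/* Simplified, NON-validating UTF-8 decode, illustrative only.
 * It trusts the length bits of the lead byte and reads that many
 * continuation bytes without ever checking against the buffer end. */
static uint32_t decode_unchecked(const uint8_t *p)
{
    uint32_t c = *p;
    if (c >= 0xC0)
    {
        if (c < 0xE0)        /* 2-byte sequence: reads p[1] */
            c = ((c & 0x1F) << 6) | (p[1] & 0x3F);
        else if (c < 0xF0)   /* 3-byte sequence: reads p[1..2] */
            c = ((c & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F);
        else                 /* 4-byte sequence: reads p[1..3] */
            c = ((c & 0x07) << 18) | ((p[1] & 0x3F) << 12) |
                ((p[2] & 0x3F) << 6) | (p[3] & 0x3F);
    }
    return c;
}

/* If the subject ends mid-sequence (e.g. the last byte is the lead
 * byte 0xE2), the decode above reads one or two bytes past the end
 * of the buffer. */
```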
This "OK" because we validate upfront both for compile & match. That is... unless the clients tell us they are really, really sure that the input is valid! (If they lie to us, then undefined behaviour / memory badness is allowed to occur.)
Wild idea: it may be quicker to just always validate!
- For clients which don't pass in PCRE2_NO_UTF_CHECK, merging the two passes into one should be quicker. Doing a call to valid_utf(), then later on doing a non-validating decode, must surely be slower than simply doing a validating decode? (Or... is the concern that regexes will do backtracking, and decode the input string many times? And so hoisting the validation to the beginning is worthwhile anyway?) I still reckon that there's hardly anything to be gained by doing a non-validating UTF decode - it's practically the same cost to validate at the same time (see the sketch after this list).
- For clients which do pass PCRE2_NO_UTF_CHECK, we just make that option a no-op.
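Here's what I mean by "practically the same cost": a simplified validating decode, one pass that checks and decodes together (illustrative only; a real one must also reject overlong encodings, surrogates, and values above U+10FFFF):

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one char, validating as we go. Returns the number of bytes
 * consumed, or 0 for a bad or truncated sequence. The length check
 * against `end` means it can never overrun the buffer. */
static size_t decode_checked(const uint8_t *p, const uint8_t *end,
                             uint32_t *out)
{
    if (p >= end) return 0;
    uint32_t c = *p;
    size_t len;
    if (c < 0x80) { *out = c; return 1; }
    else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; }
    else return 0;                            /* bare continuation byte */
    if ((size_t)(end - p) < len) return 0;    /* truncated sequence */
    for (size_t i = 1; i < len; i++)
    {
        if ((p[i] & 0xC0) != 0x80) return 0;  /* bad continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    *out = c;
    return len;
}
```

Compared to the unchecked version, the extra work is a bounds compare and a mask per continuation byte, which is why I doubt the separate valid_utf() pass buys much.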
BUT! We can now add a new option, PCRE2_PERMISSIVE_UTF, which would allow PCRE2 to process invalid UTF input. Any truncated or malformed sequences are just silently treated as U+FFFD. This would require, as a prerequisite, having the match code be able to handle invalid UTF input.
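Building on the decode_checked() sketch above, the permissive behaviour could look something like this (PCRE2_PERMISSIVE_UTF is just my proposed name; nothing is implemented):

```c
/* Sketch of the permissive decode: any invalid or truncated sequence
 * yields U+FFFD and consumes exactly one byte, so the matcher always
 * makes forward progress. A return of 0 means end of input. Uses
 * decode_checked() from the sketch above. */
static size_t decode_permissive(const uint8_t *p, const uint8_t *end,
                                uint32_t *out)
{
    size_t len = decode_checked(p, end, out);
    if (len == 0 && p < end)
    {
        *out = 0xFFFD;    /* U+FFFD REPLACEMENT CHARACTER */
        return 1;         /* skip the offending byte and carry on */
    }
    return len;
}
```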
Currently, if you have a log file with some messed-up lines in it... tough, PCRE2 won't process it (😞). So I think there would be clients that might well be interested in having a non-judgemental UTF mode. Bad UTF exists, we can't change the world... just handle it gracefully.