Always validate UTF

Currently, there's something in the code that mildly scares me. We have macros for reading UTF-8 chars (which are all inlined into every callsite... hmm). But worse, the macros _don't validate the chars_ and can do out-of-bounds memory access. If you have a string which ends in 0x80, then `GETCHAR()` will merrily read over the end of the string.

This is "OK" because we validate upfront, for both compile & match. That is... unless the client tells us they are really, really sure that the input is valid! (If they lie to us, then undefined behaviour / memory badness is allowed to occur.)

Wild idea: it may be quicker to just always validate!
* For clients which don't pass in PCRE2_NO_UTF_CHECK, merging the two passes into one should be quicker. Calling `valid_utf()` up front, and then doing a non-validating decode later on, must surely be slower than simply doing a validating decode later? (Or... is the concern that regexes backtrack, and so decode the input string many times, which makes hoisting the validation to the beginning worthwhile anyway?) I still reckon that there's hardly anything to be gained by doing a non-validating UTF decode: validating at the same time costs practically the same.
* For clients which do pass PCRE2_NO_UTF_CHECK, we just make that option a no-op.
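To make the "validate while decoding" idea concrete, here is a minimal sketch of a bounds-checked, validating UTF-8 decode. The name `decode_utf8_checked` and its signature are made up for illustration and are not existing PCRE2 code. Unlike a non-validating macro, it takes an `end` pointer and never reads at or past it, so a truncated sequence at the end of the subject is reported rather than read through.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE2): decode one UTF-8 character from
   the range [p, end). On success, stores the code point in *out and returns
   the number of bytes consumed (1..4). Returns 0 if the sequence is invalid
   or truncated. Never reads at or beyond `end`. */
static size_t decode_utf8_checked(const uint8_t *p, const uint8_t *end,
                                  uint32_t *out)
{
  uint32_t c, min;
  size_t len, i;

  if (p >= end) return 0;                 /* nothing left to read */
  c = p[0];
  if (c < 0x80) { *out = c; return 1; }   /* ASCII fast path */

  /* The lead byte determines the sequence length and the smallest code
     point that may legally be encoded with that length. */
  if ((c & 0xE0) == 0xC0)      { len = 2; min = 0x80;    c &= 0x1F; }
  else if ((c & 0xF0) == 0xE0) { len = 3; min = 0x800;   c &= 0x0F; }
  else if ((c & 0xF8) == 0xF0) { len = 4; min = 0x10000; c &= 0x07; }
  else return 0;             /* stray continuation byte or invalid lead */

  if ((size_t)(end - p) < len) return 0;  /* truncated at end of input */

  for (i = 1; i < len; i++) {
    if ((p[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
    c = (c << 6) | (p[i] & 0x3F);
  }

  if (c < min) return 0;                      /* overlong encoding */
  if (c > 0x10FFFF) return 0;                 /* beyond Unicode range */
  if (c >= 0xD800 && c <= 0xDFFF) return 0;   /* surrogate code point */

  *out = c;
  return len;
}
```

Compared with an unchecked decode, the extra work is one length comparison plus a mask-and-compare per continuation byte, which is the cost being weighed against a separate up-front validation pass. Whether that wins once backtracking re-decodes the same bytes is something only a benchmark can answer.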

BUT! We can now add a new option, PCRE2_PERMISSIVE_UTF, which would allow PCRE2 to process invalid UTF input. Any unpaired or bad bytes are just silently treated as U+FFFD. This would require, as a prerequisite, having the match code be able to handle invalid UTF input.
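As a rough illustration of what a permissive decode could look like, here is a sketch in which every offending byte decodes as U+FFFD and the decoder advances by exactly one byte. All names (`decode_utf8_permissive`, `decode_all_permissive`, `REPLACEMENT_CHAR`) are hypothetical and do not exist in PCRE2, and one U+FFFD per bad byte is just one possible policy, chosen here because it matches the wording above.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

/* Hypothetical helper (not part of PCRE2): decode one character from
   [p, end), which must be non-empty. Valid sequences decode normally.
   Any invalid or truncated sequence yields U+FFFD and consumes exactly
   one byte, so the caller always makes progress. */
static size_t decode_utf8_permissive(const uint8_t *p, const uint8_t *end,
                                     uint32_t *out)
{
  uint32_t c = p[0], min;
  size_t len, i;

  *out = REPLACEMENT_CHAR;                /* assume bad until proven valid */
  if (c < 0x80) { *out = c; return 1; }   /* ASCII */

  if ((c & 0xE0) == 0xC0)      { len = 2; min = 0x80;    c &= 0x1F; }
  else if ((c & 0xF0) == 0xE0) { len = 3; min = 0x800;   c &= 0x0F; }
  else if ((c & 0xF8) == 0xF0) { len = 4; min = 0x10000; c &= 0x07; }
  else return 1;             /* stray continuation byte or invalid lead */

  if ((size_t)(end - p) < len) return 1;  /* truncated sequence */
  for (i = 1; i < len; i++) {
    if ((p[i] & 0xC0) != 0x80) return 1;  /* broken continuation */
    c = (c << 6) | (p[i] & 0x3F);
  }
  /* Reject overlong forms, out-of-range values, and surrogates. */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 1;

  *out = c;
  return len;
}

/* Decode a whole buffer into dst (capacity cap); returns code points written. */
static size_t decode_all_permissive(const uint8_t *s, size_t n,
                                    uint32_t *dst, size_t cap)
{
  const uint8_t *p = s, *end = s + n;
  size_t count = 0;
  while (p < end && count < cap) {
    p += decode_utf8_permissive(p, end, &dst[count]);
    count++;
  }
  return count;
}
```

With this policy the bytes `'a', 0x80, 'b', 0xE2, 0x82` decode as `a`, U+FFFD, `b`, U+FFFD, U+FFFD. A real implementation would also have to decide how backwards stepping and match offsets behave on such input, which this sketch does not address.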

Currently, if you have a log file with some messed-up lines in it... tough, PCRE2 won't process it (😞). So I think there are clients that might well be interested in having a non-judgemental UTF mode. Bad UTF exists, we can't change the world... just handle it gracefully.

Always validate UTF #588
