Always validate UTF #588

Open
@NWilson


Currently, there's something in the code that mildly scares me. We have macros for reading UTF-8 chars (which are all inlined into every callsite... hmm). But worse, the macros don't validate the chars and can do out-of-bounds memory access. If you have a string which ends in 0x80, then GETCHAR() will merrily read over the end of the string.
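To make the hazard concrete, here is a minimal sketch of why a decoder that trusts the lead byte can run off the end of the buffer. It is not PCRE2's macro code: `utf8_len_from_lead` and `trusting_decode_overruns` are hypothetical names, and the sketch assumes a decoder that treats any byte below 0xC0 as a single unit, so the overrun is shown with a truncated lead byte. Which trailing bytes trigger the problem in PCRE2 depends on the real macro definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many bytes a decoder that trusts the lead
 * byte would consume, without looking at the bytes that follow. */
static size_t utf8_len_from_lead(uint8_t lead)
{
    if (lead < 0xC0) return 1;  /* ASCII, or a stray continuation byte taken as-is */
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    return 4;                   /* 0xF0 and above, trusted blindly */
}

/* Returns 1 if a trusting decode starting at p would touch memory at or
 * beyond end, 0 otherwise. This only computes the answer; it never
 * performs the out-of-bounds read itself. */
static int trusting_decode_overruns(const uint8_t *p, const uint8_t *end)
{
    return p < end && utf8_len_from_lead(*p) > (size_t)(end - p);
}
```

For a subject ending in the single byte 0xE2, `trusting_decode_overruns` reports an overrun, because the lead byte promises two continuation bytes that are not there.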

This is "OK" because we validate upfront for both compile & match. That is... unless the clients tell us they are really, really sure that the input is valid! (If they lie to us, then undefined behaviour / memory badness is allowed to occur.)

Wild idea: it may be quicker to just always validate!

  • For clients which don't pass in PCRE2_NO_UTF_CHECK, merging the two passes into one should be quicker. Doing a call to valid_utf(), then later on doing a non-validating decode, must surely be slower than simply doing a validating decode later? (Or... is the concern that regexes will backtrack and decode the input string many times, so hoisting the validation to the beginning is worthwhile anyway?) I still reckon that there's hardly anything to be gained by doing a non-validating UTF decode: it's practically the same cost to validate at the same time.
  • For clients which do pass PCRE2_NO_UTF_CHECK, we just make that option a no-op.

BUT! We can now add a new option, PCRE2_PERMISSIVE_UTF, which would allow PCRE2 to process invalid UTF input. Any unpaired or bad bytes are just silently treated as U+FFFD. This would require, as a prerequisite, having the match code be able to handle invalid UTF input.
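As a rough illustration of both ideas, a validating decode and a permissive mode on top of it, here is a minimal sketch. Everything in it is an assumption for illustration: `decode_checked` and `decode_permissive` are hypothetical names, not existing PCRE2 functions, and the policy of replacing each bad byte individually is just one possible choice (the Unicode Standard also describes substituting one U+FFFD per maximal ill-formed subpart).

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

/* Decode one code point from [p, end). Returns the number of bytes
 * consumed (1..4) on success, or 0 if the sequence is truncated or
 * ill-formed. Never reads at or beyond end. */
static size_t decode_checked(const uint8_t *p, const uint8_t *end, uint32_t *out)
{
    uint32_t c, min;
    size_t n, i;

    if (p >= end) return 0;
    c = p[0];
    if (c < 0x80) { *out = c; return 1; }
    else if (c >= 0xC2 && c <= 0xDF) { n = 2; c &= 0x1F; min = 0x80; }
    else if (c >= 0xE0 && c <= 0xEF) { n = 3; c &= 0x0F; min = 0x800; }
    else if (c >= 0xF0 && c <= 0xF4) { n = 4; c &= 0x07; min = 0x10000; }
    else return 0;  /* stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF */

    if ((size_t)(end - p) < n) return 0;          /* truncated at end of input */
    for (i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;      /* not a continuation byte */
        c = (c << 6) | (p[i] & 0x3Fu);
    }
    if (c < min) return 0;                        /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;     /* UTF-16 surrogate */
    if (c > 0x10FFFF) return 0;                   /* beyond Unicode range */
    *out = c;
    return n;
}

/* Permissive variant: on bad input, report U+FFFD and consume exactly
 * one byte, so the caller always makes progress. Returns 0 only when
 * the input is exhausted. */
static size_t decode_permissive(const uint8_t *p, const uint8_t *end, uint32_t *out)
{
    size_t n = decode_checked(p, end, out);
    if (n == 0 && p < end) { *out = REPLACEMENT_CHAR; n = 1; }
    return n;
}
```

With this per-byte policy, a truncated `E2 82` at the end of the subject decodes as two U+FFFD characters. Whether the extra comparisons cost anything measurable in the matcher's inner loops is exactly the open question raised above, and this sketch does not settle it.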

Currently, if you have a log file with some messed-up lines in it... tough, PCRE2 won't process it (😞). So I think there are clients that might well be interested in having a non-judgemental UTF mode. Bad UTF exists, we can't change the world... just handle it gracefully.

Labels: enhancement
