syntax: reject '(?-u)\W' when UTF-8 mode is enabled
When Unicode mode is disabled (i.e., (?-u)), the Perl character classes
(\w, \d and \s) revert to their ASCII definitions. The negated forms
of these classes are also derived from their ASCII definitions, and this
means that they may actually match bytes outside of ASCII and thus
possibly invalid UTF-8. For this reason, when the translator is
configured to only produce HIR that matches valid UTF-8, '(?-u)\W'
should be rejected.
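
To illustrate the paragraph above, here is a minimal std-only sketch (not the regex crate's implementation). The helpers `is_ascii_word_byte` and `find_non_word_byte` are hypothetical stand-ins for an ASCII-derived `\W` search over raw bytes. The sketch shows that such a class matches the first byte of a multi-byte codepoint, so the reported range does not fall on a char boundary.

```rust
use std::ops::Range;

/// ASCII definition of `\w`: [0-9A-Za-z_].
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Hypothetical stand-in for a byte-oriented `(?-u)\W` search: returns the
/// one-byte range of the first byte that is not an ASCII word byte.
fn find_non_word_byte(haystack: &[u8]) -> Option<Range<usize>> {
    haystack
        .iter()
        .position(|&b| !is_ascii_word_byte(b))
        .map(|i| i..i + 1)
}

fn main() {
    // U+2603 is encoded in UTF-8 as the three bytes E2 98 83.
    let haystack = "☃";
    let m = find_non_word_byte(haystack.as_bytes()).unwrap();
    // The match covers only the first byte of the codepoint.
    println!("{:?}", m);
    // The end of the match is not a valid place to slice the &str.
    println!("{}", haystack.is_char_boundary(m.end));
    // `get` returns None here instead of panicking as `&haystack[m]` would.
    println!("{:?}", haystack.get(m));
}
```

Using `str::get` instead of indexing avoids the panic in the sketch, but it does not address the underlying problem: the pattern itself can match invalid UTF-8.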
Previously, it was not being rejected, which could actually lead to
matches that produced offsets that split codepoints, and thus lead to
panics when match offsets are used to slice a string. For example, this
code

    fn main() {
        let re = regex::Regex::new(r"(?-u)\W").unwrap();
        let haystack = "☃";
        if let Some(m) = re.find(haystack) {
            println!("{:?}", &haystack[m.range()]);
        }
    }

panics with

    byte index 1 is not a char boundary; it is inside '☃' (bytes 0..3) of `☃`

That is, it reports a match at 0..1, which is technically correct, but
the regex itself should have been rejected in the first place since the
top-level Regex API always has UTF-8 mode enabled.
Also, many of the replacement tests were using '(?-u)\W' (or similar)
for some reason. I'm not sure why, so I just removed the '(?-u)' to make
those tests pass. Whether Unicode is enabled or not doesn't seem to be
an interesting detail for those tests. (All haystacks and replacements
appear to be ASCII.)
Fixes #895, Partially addresses #738