Unicode case folding, caseless matching, and iterator methods #19277
Comments
I can certainly see The proposed Here's a suggested implementation:
use std::iter::{Chain, Map, repeat, Repeat, TakeWhile, Zip};

type ZipLonger<'a, 'b, 'c, T, U, I1, I2> = std::iter::TakeWhile<
    'a,
    (Option<T>, Option<U>),
    Zip<
        Chain<
            Map<'b, T, Option<T>, I1>,
            Repeat<Option<T>>
        >,
        Chain<
            Map<'c, U, Option<U>, I2>,
            Repeat<Option<U>>
        >
    >
>;

fn zip_longer<'a, 'b, 'c, T: Clone, U: Clone, I1, I2>(a: I1, b: I2)
    -> ZipLonger<'a, 'b, 'c, T, U, I1, I2>
    where I1: Iterator<T>,
          I2: Iterator<U> {
    a.map(Some).chain(repeat(None))
        .zip(b.map(Some).chain(repeat(None)))
        .take_while(|&(ref left, ref right)| left.is_some() || right.is_some())
}
(as an Iterator method rather than a free-standing function) IMHO as for
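The snippet above is written in the pre-1.0 Rust of the time, where adaptor types such as Map carried lifetime and item-type parameters and the trait was written Iterator<T>. As a rough sketch only, and not the commenter's code, the same idea might look like this in current Rust, assuming an impl Trait return type in place of the ZipLonger alias:

```rust
use std::iter::repeat;

/// Zips two iterators, padding the shorter one with `None`
/// until both are exhausted (same approach as the snippet above).
fn zip_longer<I1, I2>(
    a: I1,
    b: I2,
) -> impl Iterator<Item = (Option<I1::Item>, Option<I2::Item>)>
where
    I1: Iterator,
    I2: Iterator,
    // `repeat(None)` needs `Option<Item>: Clone`, hence these bounds.
    I1::Item: Clone,
    I2::Item: Clone,
{
    a.map(Some)
        .chain(repeat(None))
        .zip(b.map(Some).chain(repeat(None)))
        // Stop once both sides have run out of real elements.
        .take_while(|(left, right)| left.is_some() || right.is_some())
}

fn main() {
    let pairs: Vec<_> = zip_longer(1..=3, "ab".chars()).collect();
    println!("{:?}", pairs);
}
```

The Clone bounds are kept only because repeat requires them; a hand-written iterator struct could avoid them.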
@jakub- Indeed,
Closing:
I made https://github.com/SimonSapin/rust-casefold for Servo, because the HTML spec requires “compatibility caseless matching”. Some of it might be interesting to have in libunicode/libcollections. @aturon, @alexcrichton, how much do you think is appropriate to include? I’d like your input before I prepare a PR (and have to deal with Rust bootstrapping).
zip_all and iter_eq are two generic functions (independent of Unicode) that could be default methods of Iterator. The former is like i.zip(j).all(f), but also returns false if the two iterators have different lengths. The latter (which uses the former) checks that the iterators have the same content. That is, it is equivalent to i.collect::<Vec<_>>() == j.collect::<Vec<_>>(), but compares elements one by one and does not allocate. (It also stops at the first difference rather than consuming both iterators until the end.)
Case folding is fairly straightforward. The data could be generated with src/etc/unicode.py and kept in src/libunicode/tables.rs, like existing Unicode data.
Caseless matching, however, is more complex: there are different variants of it. Other than the “default” variant, they require NFD and NFKD normalization. libunicode already has nfd_chars and nfkd_chars methods on &str, but using them here would require allocating an intermediate String. So, in the same spirit as #19042, it might be useful to expose another API for Unicode normalization (all four variants of it, while we’re at it) that works from a generic Iterator<char> rather than just &str/Chars.
Thoughts?
Nothing urgent here, but consider this when stabilizing libunicode.
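To make the zip_all / iter_eq description above concrete, here is a minimal sketch in current Rust. It is written as free functions with guessed signatures rather than the Iterator default methods proposed in the issue, and simple_fold is a hypothetical helper that uses char::to_lowercase as a stand-in; it is not real Unicode case folding and not the rust-casefold API.

```rust
/// Like `i.zip(j).all(f)`, but also returns `false` when the two
/// iterators yield a different number of items.
fn zip_all<I, J, F>(mut i: I, mut j: J, mut f: F) -> bool
where
    I: Iterator,
    J: Iterator,
    F: FnMut(I::Item, J::Item) -> bool,
{
    loop {
        match (i.next(), j.next()) {
            // Both sides still have items: stop at the first failure.
            (Some(a), Some(b)) => {
                if !f(a, b) {
                    return false;
                }
            }
            // Both ended together: every pair passed.
            (None, None) => return true,
            // One side ended early: lengths differ.
            _ => return false,
        }
    }
}

/// Element-by-element equality without collecting into a `Vec`.
fn iter_eq<I, J>(i: I, j: J) -> bool
where
    I: Iterator,
    J: Iterator,
    I::Item: PartialEq<J::Item>,
{
    zip_all(i, j, |a, b| a == b)
}

/// Hypothetical stand-in for case folding (an assumption for this
/// example): lowercasing is NOT the same as Unicode case folding.
fn simple_fold(s: &str) -> impl Iterator<Item = char> + '_ {
    s.chars().flat_map(char::to_lowercase)
}

fn main() {
    assert!(iter_eq("abc".chars(), "abc".chars()));
    assert!(!iter_eq("abc".chars(), "abcd".chars()));
    // Compare lazily, with no intermediate String.
    assert!(iter_eq(simple_fold("Hello"), simple_fold("hELLO")));
    // Lowercasing leaves 'ß' unchanged, whereas full case folding maps
    // it to "ss", so this stand-in does not match these two strings.
    assert!(!iter_eq(simple_fold("Straße"), simple_fold("STRASSE")));
}
```

Today's standard library has an Iterator::eq method with the same short-circuiting, non-allocating behaviour as iter_eq, so only zip_all would still need to be written by hand.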