-
Notifications
You must be signed in to change notification settings - Fork 214
Use ZeroVec in UnicodeSet #922
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
components/uniset/src/uniset.rs
Outdated
// TODO: need advice - how should we remove Hash and Eq from UnicodeSet unless we need it? | ||
|
||
// If we wanted to use an array to keep the memory on the stack, there is an unsafe nightly feature | ||
// https://doc.rust-lang.org/nightly/core/array/trait.FixedSizeArray.html | ||
// Allows for traits of fixed size arrays | ||
|
||
// Implements an [inversion list.](https://en.wikipedia.org/wiki/Inversion_list) | ||
inv_list: Vec<u32>, | ||
inv_list: ZeroVec<'d, u32>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do I solve the following compiler error(s)?:
error[E0277]: the trait bound `ZeroVec<'_, u32>: Hash` is not satisfied
--> components/uniset/src/uniset.rs:36:5
|
36 | inv_list: ZeroVec<'d, u32>,
| ^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `Hash` is not implemented for `ZeroVec<'_, u32>`
|
::: /home/elango/.rustup/toolchains/1.51.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/hash/mod.rs:168:13
|
168 | fn hash<H: Hasher>(&self, state: &mut H);
| - required by this bound in `core::hash::Hash::hash`
|
= note: this error originates in a derive macro (in Nightly builds, run with -Z macro-backtrace for more info)
error[E0277]: the trait bound `ZeroVec<'d, u32>: Eq` is not satisfied
--> components/uniset/src/uniset.rs:36:5
|
36 | inv_list: ZeroVec<'d, u32>,
| ^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `Eq` is not implemented for `ZeroVec<'d, u32>`
|
::: /home/elango/.rustup/toolchains/1.51.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs:297:31
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hash:
Easiest would be to stop implementing Hash on UnicodeSet. I remember that was only used by some test code or something. It never really made sense to me.
Otherwise, you could look into adding a Hash impl on ZeroVec. It's probably not too hard.
Eq:
Seems like we need an Eq impl on ZeroVec. You could make a separate PR to add it.
components/uniset/src/uniset.rs
Outdated
@@ -89,6 +90,7 @@ impl UnicodeSet { | |||
pub fn from_inversion_list(inv_list: Vec<u32>) -> Result<Self, UnicodeSetError> { | |||
if is_valid(&inv_list) { | |||
let size: usize = inv_list.chunks(2).map(|end_points| end_points[1] - end_points[0]).sum::<u32>() as usize; | |||
// TODO: how to convert a `Vec<u32>` into ZeroVec? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function from_inversion_list
takes an inversion list of Vec<u32>
and needs to turn that into a ZeroVec inversion list when constructing the UnicodeSet.
But my understanding is that ZeroVec is designed for taking a byte array and converting it via allocation to an aligned array/vector of other types/widths only at the time it is needed. When the input is a non-byte slice, then an allocation is required to convert it into the backing byte array.
So is there a way to convert the Vec<u32>
into the &[u8]
without allocations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should change the function to take ZeroVec
instead of Vec
.
If desired, you could keep these as two separate functions.
I could also use feedback on all things related to lifetime parameters here: whether I added lifetime parameters on the correct structs, in the right places, got the syntax right, etc. |
components/uniset/src/uniset.rs
Outdated
// TODO: need advice - how should we remove Hash and Eq from UnicodeSet unless we need it? | ||
|
||
// If we wanted to use an array to keep the memory on the stack, there is an unsafe nightly feature | ||
// https://doc.rust-lang.org/nightly/core/array/trait.FixedSizeArray.html | ||
// Allows for traits of fixed size arrays | ||
|
||
// Implements an [inversion list.](https://en.wikipedia.org/wiki/Inversion_list) | ||
inv_list: Vec<u32>, | ||
inv_list: ZeroVec<'d, u32>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hash:
Easiest would be to stop implementing Hash on UnicodeSet. I remember that was only used by some test code or something. It never really made sense to me.
Otherwise, you could look into adding a Hash impl on ZeroVec. It's probably not too hard.
Eq:
Seems like we need an Eq impl on ZeroVec. You could make a separate PR to add it.
components/uniset/src/uniset.rs
Outdated
@@ -89,6 +90,7 @@ impl UnicodeSet { | |||
pub fn from_inversion_list(inv_list: Vec<u32>) -> Result<Self, UnicodeSetError> { | |||
if is_valid(&inv_list) { | |||
let size: usize = inv_list.chunks(2).map(|end_points| end_points[1] - end_points[0]).sum::<u32>() as usize; | |||
// TODO: how to convert a `Vec<u32>` into ZeroVec? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should change the function to take ZeroVec
instead of Vec
.
If desired, you could keep these as two separate functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(setting review bit)
components/uniset/src/builder.rs
Outdated
@@ -16,15 +16,15 @@ pub struct UnicodeSetBuilder { | |||
intervals: Vec<u32>, | |||
} | |||
|
|||
impl UnicodeSetBuilder { | |||
impl<'d> UnicodeSetBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this 'd doing anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. No, it's not doing anything -- now removed.
This commit manages to fix the crash, fwiw commit 6cfb09f329bbe0b5bc597e7cb0379ea62ae3227f
Author: Manish Goregaokar <[email protected]>
Date: Wed Aug 11 00:12:40 2021 -0700
work around rust bug
diff --git a/components/uniset/src/lib.rs b/components/uniset/src/lib.rs
index 92b24de7..52dc0579 100644
--- a/components/uniset/src/lib.rs
+++ b/components/uniset/src/lib.rs
@@ -51,6 +51,10 @@
#![cfg_attr(not(any(test, feature = "std")), no_std)]
+// Workaround for https://github.com/rust-lang/rust/issues/87932
+#[cfg(feature = "serde")]
+extern crate serde;
+
extern crate alloc;
#[macro_use]
diff --git a/components/uniset/src/utils.rs b/components/uniset/src/utils.rs
index 80ef039c..aea83a3e 100644
--- a/components/uniset/src/utils.rs
+++ b/components/uniset/src/utils.rs
@@ -7,7 +7,7 @@ use core::{
ops::{Bound::*, RangeBounds},
};
use zerovec::ZeroVec;
-// use zerovec::ule::AsULE;
+use zerovec::ule::AsULE;
/// Returns whether the vector is sorted ascending non inclusive, of even length,
/// and within the bounds of `0x0 -> 0x10FFFF` inclusive. |
I also am not sure why the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not done reviewing, but here are some comments to get started
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Picking a few nits, but generally this looks pretty good to me.
Cargo.toml
Outdated
@@ -16,8 +16,8 @@ members = [ | |||
"components/plurals", | |||
"components/uniset", | |||
"experimental/bies", | |||
# "experimental/calendar", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: did you mean to add this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed, thanks. I didn't realize that it was an artifact of merged from latest on main.
@@ -18,13 +18,13 @@ icu_benchmark_macros::static_setup!(); | |||
|
|||
use icu_uniset::{UnicodeSet, UnicodeSetBuilder}; | |||
|
|||
fn get_basic_latin_block() -> UnicodeSet { | |||
fn get_basic_latin_block<'data>() -> UnicodeSet<'data> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: should this be fn get_basic_latin_block() -> UnicodeSet<'static>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
components/uniset/src/lib.rs
Outdated
|
||
/// Custom Errors for [`UnicodeSet`]. | ||
#[derive(Display, Debug)] | ||
#[allow(missing_docs)] // TODO(#1030) - Add missing docs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you intend to remove this attribute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unremoved, thanks. That answers the question I had in the PR-level comment.
components/uniset/src/props.rs
Outdated
|
||
// helper fn | ||
fn get_prop<'data, D>(ppucd_provider: &D, resc_key: ResourceKey) -> UnisetResult | ||
fn get_prop<'data, D>(provider: &'data D, resc_key: ResourceKey) -> UnisetResult<'data> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch on changing the name here.
#[cfg_attr( | ||
feature = "provider_serde", | ||
derive(serde::Serialize, serde::Deserialize) | ||
)] | ||
pub struct UnicodePropertyV1<'data> { | ||
pub name: Cow<'data, str>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there still a purpose for the name field here? My understanding is that it was only used in the experimental ppucd provider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Based on @sffc 's comment above, I'm not sure yet about unlinking ppucd_provider, and maybe it provides an alternate way of exercising the provider logic as a temporary measure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we get rid of the field in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
components/uniset/src/utils.rs
Outdated
// let v_iter = v.iter(); | ||
// let v_len = v.len(); | ||
// let v_last = v.last(); | ||
// is_iter_valid(v_iter, v_len, v_last) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: should this be removed, or uncommented?
components/uniset/src/uniset.rs
Outdated
} | ||
} | ||
|
||
impl UnicodeSet { | ||
impl<'data> Eq for UnicodeSet<'data> {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own education: is there a reason we can't derive Eq
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like Eq
is not implemented on ZeroVec. When I remove this line and put Eq
into the derive for UnicodeSet, I got/get:
error[E0277]: the trait bound `usize: icu_provider::yoke::ZeroCopyFrom<_>` is not satisfied
--> components/uniset/src/uniset.rs:22:45
|
22 | #[derive(Debug, PartialEq, Clone, Yokeable, ZeroCopyFrom)]
| ^^^^^^^^^^^^ the trait `icu_provider::yoke::ZeroCopyFrom<_>` is not implemented for `usize`
|
= note: required by `zero_copy_from`
= note: this error originates in the derive macro `ZeroCopyFrom` (in Nightly builds, run with -Z macro-backtrace for more info)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you implement Eq on ZeroVec (where T is Eq) then? Something like:
impl<T> Eq for ZeroVec<T> where T: Eq {}
// or maybe just add #[derive(Eq)] on ZeroVec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
components/uniset/src/uniset.rs
Outdated
/// Note: this function currently causes an allocation due to the requirements from | ||
/// the constructors of the underlying [`ZeroVec`]. | ||
/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we instead / also have a constructor along the lines of pub fn from_inversion_list<'a>(inv_list: &'a [u32]) -> Result<UnicodeSet<'a>, UnicodeSetError>
?
It's a little awkward that ZeroVec doesn't give us direct access to a slice of u32, or else we could replace the previous constructor with a more general form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think we should basically have 3 constructors:
ZeroVec<'data, u32>
&'data [u32]
&[u32]
(requires cloning the data)
Note that technically option 2 is a subset of option 1, but I think it's common enough to warrant its own constructor.
components/uniset/src/uniset.rs
Outdated
let result: Vec<u32> = self | ||
.as_inversion_list() // Only crate public, to not leak impl | ||
.to_vec(); | ||
let result: Vec<u32> = self.as_inversion_list(); // Only crate public, to not leak impl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verifying my understanding: we're keeping this private because ZeroVec can't expose a u32 slice, and ZeroVec can't expose a u32 slice because of endianness concerns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and alignment concerns. yes.
components/uniset/src/uniset.rs
Outdated
if let Some(x) = self.inv_list.get(pos + 1) { | ||
(till) <= x | ||
} else { | ||
// This code should be unreachable because: | ||
// Inversion list query should not return out of bounds index | ||
false | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: What do you think of till <= self.inv_list.get(pos + 1).expect("Inversion list query should be in bounds")
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally put in this change to avoid the possibility of a panic. @sffc -- should that be a concern here?
OTOH, the comment asserts that it should be unreachable (based on the binary search result from the inversion list query not giving an out-of-bounds index). There is some precedent -- we used an .unwrap()
in the CPT code because we couldn't avoid it and there weren't doubts about the code (which came from ICU4C).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do prefer what Elango has written over having unwraps in our code. However, I would also be okay with an unreachable!()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
… UnicodePropertyV1 Serde requires specifying the lifetime bounds on the struct that is borrowed, so that UnicodePropertyV1 provider struct can use the default serde derive when borrowing UnicodeSet as a field
…provider_serde feature enabled
Closes #855.