Skip to content

Use ZeroVec in UnicodeSet #922

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 50 commits into from
Sep 27, 2021
Merged

Use ZeroVec in UnicodeSet #922

merged 50 commits into from
Sep 27, 2021

Conversation

echeran
Copy link
Contributor

@echeran echeran commented Jul 30, 2021

Closes #855.

// TODO: need advice - how should we remove Hash and Eq from UnicodeSet unless we need it?

// If we wanted to use an array to keep the memory on the stack, there is an unsafe nightly feature
// https://doc.rust-lang.org/nightly/core/array/trait.FixedSizeArray.html
// Allows for traits of fixed size arrays

// Implements an [inversion list.](https://en.wikipedia.org/wiki/Inversion_list)
inv_list: Vec<u32>,
inv_list: ZeroVec<'d, u32>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do I solve the following compiler error(s)?:

error[E0277]: the trait bound `ZeroVec<'_, u32>: Hash` is not satisfied
   --> components/uniset/src/uniset.rs:36:5
    |
36  |     inv_list: ZeroVec<'d, u32>,
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `Hash` is not implemented for `ZeroVec<'_, u32>`
    | 
   ::: /home/elango/.rustup/toolchains/1.51.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/hash/mod.rs:168:13
    |
168 |     fn hash<H: Hasher>(&self, state: &mut H);
    |             - required by this bound in `core::hash::Hash::hash`
    |
    = note: this error originates in a derive macro (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0277]: the trait bound `ZeroVec<'d, u32>: Eq` is not satisfied
   --> components/uniset/src/uniset.rs:36:5
    |
36  |     inv_list: ZeroVec<'d, u32>,
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `Eq` is not implemented for `ZeroVec<'d, u32>`
    | 
   ::: /home/elango/.rustup/toolchains/1.51.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs:297:31

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hash:

Easiest would be to stop implementing Hash on UnicodeSet. I remember that was only used by some test code or something. It never really made sense to me.

Otherwise, you could look into adding a Hash impl on ZeroVec. It's probably not too hard.

Eq:

Seems like we need an Eq impl on ZeroVec. You could make a separate PR to add it.

@@ -89,6 +90,7 @@ impl UnicodeSet {
pub fn from_inversion_list(inv_list: Vec<u32>) -> Result<Self, UnicodeSetError> {
if is_valid(&inv_list) {
let size: usize = inv_list.chunks(2).map(|end_points| end_points[1] - end_points[0]).sum::<u32>() as usize;
// TODO: how to convert a `Vec<u32>` into ZeroVec?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function from_inversion_list takes an inversion list of Vec<u32> and needs to turn that into a ZeroVec inversion list when constructing the UnicodeSet.

But my understanding is that ZeroVec is designed for taking a byte array and converting it via allocation to an aligned array/vector of other types/widths only at the time it is needed. When the input is a non-byte slice, then an allocation is required to convert it into the backing byte array.

So is there a way to convert the Vec<u32> into the &[u8] without allocations?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should change the function to take ZeroVec instead of Vec.

If desired, you could keep these as two separate functions.

@echeran
Copy link
Contributor Author

echeran commented Aug 2, 2021

I could also use feedback on all things related to lifetime parameters here: whether I added lifetime parameters on the correct structs, in the right places, got the syntax right, etc.

@echeran echeran requested a review from sffc August 2, 2021 04:09
// TODO: need advice - how should we remove Hash and Eq from UnicodeSet unless we need it?

// If we wanted to use an array to keep the memory on the stack, there is an unsafe nightly feature
// https://doc.rust-lang.org/nightly/core/array/trait.FixedSizeArray.html
// Allows for traits of fixed size arrays

// Implements an [inversion list.](https://en.wikipedia.org/wiki/Inversion_list)
inv_list: Vec<u32>,
inv_list: ZeroVec<'d, u32>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hash:

Easiest would be to stop implementing Hash on UnicodeSet. I remember that was only used by some test code or something. It never really made sense to me.

Otherwise, you could look into adding a Hash impl on ZeroVec. It's probably not too hard.

Eq:

Seems like we need an Eq impl on ZeroVec. You could make a separate PR to add it.

@@ -89,6 +90,7 @@ impl UnicodeSet {
pub fn from_inversion_list(inv_list: Vec<u32>) -> Result<Self, UnicodeSetError> {
if is_valid(&inv_list) {
let size: usize = inv_list.chunks(2).map(|end_points| end_points[1] - end_points[0]).sum::<u32>() as usize;
// TODO: how to convert a `Vec<u32>` into ZeroVec?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should change the function to take ZeroVec instead of Vec.

If desired, you could keep these as two separate functions.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(setting review bit)

@@ -16,15 +16,15 @@ pub struct UnicodeSetBuilder {
intervals: Vec<u32>,
}

impl UnicodeSetBuilder {
impl<'d> UnicodeSetBuilder {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this 'd doing anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. No, it's not doing anything -- now removed.

@Manishearth
Copy link
Member

This commit manages to fix the crash, fwiw

commit 6cfb09f329bbe0b5bc597e7cb0379ea62ae3227f
Author: Manish Goregaokar <[email protected]>
Date:   Wed Aug 11 00:12:40 2021 -0700

    work around rust bug

diff --git a/components/uniset/src/lib.rs b/components/uniset/src/lib.rs
index 92b24de7..52dc0579 100644
--- a/components/uniset/src/lib.rs
+++ b/components/uniset/src/lib.rs
@@ -51,6 +51,10 @@
 
 #![cfg_attr(not(any(test, feature = "std")), no_std)]
 
+// Workaround for https://github.com/rust-lang/rust/issues/87932
+#[cfg(feature = "serde")]
+extern crate serde;
+
 extern crate alloc;
 
 #[macro_use]
diff --git a/components/uniset/src/utils.rs b/components/uniset/src/utils.rs
index 80ef039c..aea83a3e 100644
--- a/components/uniset/src/utils.rs
+++ b/components/uniset/src/utils.rs
@@ -7,7 +7,7 @@ use core::{
     ops::{Bound::*, RangeBounds},
 };
 use zerovec::ZeroVec;
-// use zerovec::ule::AsULE;
+use zerovec::ule::AsULE;
 
 /// Returns whether the vector is sorted ascending non inclusive, of even length,
 /// and within the bounds of `0x0 -> 0x10FFFF` inclusive.

@echeran echeran marked this pull request as ready for review September 14, 2021 16:26
@echeran echeran requested review from iainireland and a team as code owners September 14, 2021 16:26
@echeran echeran requested a review from sffc September 14, 2021 16:26
@echeran
Copy link
Contributor Author

echeran commented Sep 14, 2021

I also am not sure why the #[displaydoc("...")] annotations on the UnicodeSetError enum variants are triggering a missing docs clippy error. I can always duplicate the doc string argument as the Rust doc string. I see #[displaydoc()] elsewhere in the codebase that doesn't do that, though. How should best address the error?

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not done reviewing, but here are some comments to get started

Copy link
Contributor

@iainireland iainireland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Picking a few nits, but generally this looks pretty good to me.

Cargo.toml Outdated
@@ -16,8 +16,8 @@ members = [
"components/plurals",
"components/uniset",
"experimental/bies",
# "experimental/calendar",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: did you mean to add this line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, thanks. I didn't realize that it was an artifact of merged from latest on main.

@@ -18,13 +18,13 @@ icu_benchmark_macros::static_setup!();

use icu_uniset::{UnicodeSet, UnicodeSetBuilder};

fn get_basic_latin_block() -> UnicodeSet {
fn get_basic_latin_block<'data>() -> UnicodeSet<'data> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should this be fn get_basic_latin_block() -> UnicodeSet<'static>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


/// Custom Errors for [`UnicodeSet`].
#[derive(Display, Debug)]
#[allow(missing_docs)] // TODO(#1030) - Add missing docs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you intend to remove this attribute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unremoved, thanks. That answers the question I had in the PR-level comment.


// helper fn
fn get_prop<'data, D>(ppucd_provider: &D, resc_key: ResourceKey) -> UnisetResult
fn get_prop<'data, D>(provider: &'data D, resc_key: ResourceKey) -> UnisetResult<'data>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on changing the name here.

#[cfg_attr(
feature = "provider_serde",
derive(serde::Serialize, serde::Deserialize)
)]
pub struct UnicodePropertyV1<'data> {
pub name: Cow<'data, str>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there still a purpose for the name field here? My understanding is that it was only used in the experimental ppucd provider.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Based on @sffc 's comment above, I'm not sure yet about unlinking ppucd_provider, and maybe it provides an alternate way of exercising the provider logic as a temporary measure?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we get rid of the field in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 19 to 22
// let v_iter = v.iter();
// let v_len = v.len();
// let v_last = v.last();
// is_iter_valid(v_iter, v_len, v_last)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should this be removed, or uncommented?

}
}

impl UnicodeSet {
impl<'data> Eq for UnicodeSet<'data> {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own education: is there a reason we can't derive Eq here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like Eq is not implemented on ZeroVec. When I remove this line and put Eq into the derive for UnicodeSet, I got/get:

error[E0277]: the trait bound `usize: icu_provider::yoke::ZeroCopyFrom<_>` is not satisfied
  --> components/uniset/src/uniset.rs:22:45
   |
22 | #[derive(Debug, PartialEq, Clone, Yokeable, ZeroCopyFrom)]
   |                                             ^^^^^^^^^^^^ the trait `icu_provider::yoke::ZeroCopyFrom<_>` is not implemented for `usize`
   |
   = note: required by `zero_copy_from`
   = note: this error originates in the derive macro `ZeroCopyFrom` (in Nightly builds, run with -Z macro-backtrace for more info)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you implement Eq on ZeroVec (where T is Eq) then? Something like:

impl<T> Eq for ZeroVec<T> where T: Eq {}

// or maybe just add #[derive(Eq)] on ZeroVec

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 91 to 93
/// Note: this function currently causes an allocation due to the requirements from
/// the constructors of the underlying [`ZeroVec`].
///
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we instead / also have a constructor along the lines of pub fn from_inversion_list<'a>(inv_list: &'a [u32]) -> Result<UnicodeSet<'a>, UnicodeSetError>?

It's a little awkward that ZeroVec doesn't give us direct access to a slice of u32, or else we could replace the previous constructor with a more general form.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we should basically have 3 constructors:

  1. ZeroVec<'data, u32>
  2. &'data [u32]
  3. &[u32] (requires cloning the data)

Note that technically option 2 is a subset of option 1, but I think it's common enough to warrant its own constructor.

let result: Vec<u32> = self
.as_inversion_list() // Only crate public, to not leak impl
.to_vec();
let result: Vec<u32> = self.as_inversion_list(); // Only crate public, to not leak impl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verifying my understanding: we're keeping this private because ZeroVec can't expose a u32 slice, and ZeroVec can't expose a u32 slice because of endianness concerns?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and alignment concerns. yes.

Comment on lines 332 to 339
if let Some(x) = self.inv_list.get(pos + 1) {
(till) <= x
} else {
// This code should be unreachable because:
// Inversion list query should not return out of bounds index
false
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: What do you think of till <= self.inv_list.get(pos + 1).expect("Inversion list query should be in bounds")?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally put in this change to avoid the possibility of a panic. @sffc -- should that be a concern here?

OTOH, the comment asserts that it should be unreachable (based on the binary search result from the inversion list query not giving an out-of-bounds index). There is some precedent -- we used an .unwrap() in the CPT code because we couldn't avoid it and there weren't doubts about the code (which came from ICU4C).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do prefer what Elango has written over having unwraps in our code. However, I would also be okay with an unreachable!().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@echeran echeran merged commit 19e4a15 into unicode-org:main Sep 27, 2021
@echeran echeran deleted the uniset-zerovec branch September 27, 2021 14:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support zero-copy in UnicodeSet
4 participants