Skip to content

Support data for Bidi properties with combined data structure #3026

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 73 commits into from
Mar 27, 2023

Conversation

echeran
Copy link
Contributor

@echeran echeran commented Jan 25, 2023

Fixes #2833

@echeran echeran requested a review from hsivonen January 25, 2023 19:00
#[allow(clippy::exhaustive_structs)] // newtype
#[repr(transparent)]
#[zerovec::make_ule(BidiMirroringGlyphULE)]
pub struct BidiMirroringGlyph(pub u32);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: We should make this a 3-byte type so that the data provider also stores it as 3 bytes.

[u8; 3] is one way to accomplish this. Alternatively, a cleaner approach would be to add UnvalidatedChar in the zerovec crate with the correct semantics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to use the high 3 bits to fit more related properties in this data structure, so just "unvalidated char" wouldn't the the whole semantic. We might reuse the U24 struct that the normalizer had before.

Quoting self from Slack:

I do think it makes sense to bundle at least Bidi_Mirroring_Glyph, Bidi_Mirrored, Bidi_Paired_Bracket , and Bidi_Paired_Bracket_Type as a 24-bit value in one trie. I'm less sure of whether it makes sense to extend the trie value to 32 bits and bake Bidi_Class in the additional 8 bits.

@echeran echeran changed the title Add newtype enum for Bidi_Mirroring_Glyph Support data for Bidi properties with combined data structure Mar 3, 2023
///


/// 20..0 Code point return value for Bidi_Mirroring_Glyph value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these docs should go on the ULE type

Also, this module should probably be a submodule of provider and marked unstable the way we usually do provider code, but not doc(hidden)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also these ranges seem backwards?

//
// TODO: is the above statement correct and okay?
// TODO: or do we have a way to validate the enum discriminant against the enum for
// newtype idiom enums?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: we must validate that all unused bytes are 0 (or some other fixed value)

@@ -92,7 +92,7 @@ pub mod versions {
/// # Examples
///
/// ```
/// assert_eq!("icu4x/2023-02-24/72.x", icu_testdata::versions::icu_tag());
/// assert_eq!("icu4x/2023-03-22a/72.x", icu_testdata::versions::icu_tag());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I'd prefer for the tag change to be its own standalone PR (I've given similar feedback to others in the past)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just remembered: in order for ci-job-testdata to pass, the tag needs to be updated because it refers to the data from the latest tag when running cargo make testdata. So if we want the latest version tag to be changed in a separate PR, then merging this PR will require overriding the fact that the relevant CI check will be failing (and that the follow on CI check full-datagen won't run as a result, too).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can include the tag update in this PR but also pull it out, merge the other PR, and then update this PR to the latest main, which should be a small change. But I'll leave it up to you if you want to do this. It's a preference of mine but not super strong since this is only an additive change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@echeran echeran marked this pull request as ready for review March 24, 2023 19:30
@echeran echeran requested review from robertbastian and a team as code owners March 24, 2023 19:30
@echeran echeran requested a review from sffc March 24, 2023 19:30
Comment on lines +158 to +159
CheckedBidiPairedBracketType::Open => BidiPairingProperties::Open(mirroring_glyph),
CheckedBidiPairedBracketType::Close => BidiPairingProperties::Close(mirroring_glyph),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the place where you consume the CheckedBidiPairedBracketType, right? I don't understand the advantage of using the closed enum instead of using the open enum throughout.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't use the open enum throughout: it has states that cannot be represented in the ULE type because we don't have space.

If we want to use the open enum, MirroredPairedBracketData needs to have private fields, with a custom validating serde impl (and a custom ctor for databake).


use icu_provider::prelude::*;

/// A wrapper around certain Bidi properties data. Can be obtained via [`load_bidi_auxiliary_properties_unstable()`] and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought (nb): Now that we have all these nice wrapper types I kinda feel like the load methods should have gone on those types instead of being free functions.

Might be a thing to do for 2.0? What do y'all think?

#[cfg_attr(feature = "datagen", databake(path = icu_properties))]
#[allow(clippy::exhaustive_structs)] // newtype
#[repr(transparent)]
#[zerovec::make_ule(BidiPairedBracketTypeULE)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's necessary to be able to make CPTs for datagen. That said, we could cfg_attr(feature = "datagen) this, or just use CPT<u8> in datagen.

pub struct BidiPairedBracketType(pub u8);

#[allow(non_upper_case_globals)]
impl BidiPairedBracketType {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: hmm, so I was under the impression we already had this type when I suggested making a closed and open.

If this is a new type, perhaps this can just be a closed enum for now (and we have a single type). @sffc thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is being discussed below

@echeran echeran requested a review from sffc March 25, 2023 18:18
@@ -31,6 +31,9 @@ pub enum PropertiesError {
/// An unknown or unexpected property name was used for an API dealing with properties specified as strings at runtime
#[displaydoc("Unexpected or unknown property name")]
UnexpectedPropertyName,
/// A `u32` serialized value of `MirroredPairedBracketData` did not encode either a valid Bidi_Mirroring_Glyph or a valid Bidi_Paired_Bracket_Type
#[displaydoc("Invalid MirroredPairedBracketData serialized in int: {0}")]
MirroredPairedBracketDataFromU32(u32),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: I think slightly more idiomatic in Rust would be for this to be its own little error struct named MirroredPairedBracketDataTryFromError

@echeran echeran merged commit b675c71 into unicode-org:main Mar 27, 2023
@echeran echeran deleted the bmg-prop branch March 27, 2023 23:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide data for Bidi pairing of brackets Provide lookup for the Bidi_Mirroring_Glyph property
4 participants