Skip to content

Provide lookup for the Bidi_Mirroring_Glyph property #2833

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
hsivonen opened this issue Nov 18, 2022 · 26 comments · Fixed by #3026 or #3229
Closed

Provide lookup for the Bidi_Mirroring_Glyph property #2833

hsivonen opened this issue Nov 18, 2022 · 26 comments · Fixed by #3026 or #3229
Assignees
Labels
C-unicode Component: Props, sets, tries S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality

Comments

@hsivonen
Copy link
Member

HarfBuzz wants to be able to look up the Bidi_Mirroring_Glyph property for a character.

To be able to use ICU4X as a HarfBuzz Unicode data back end, ICU4X should provide lookup for this property.

@hsivonen hsivonen added the C-unicode Component: Props, sets, tries label Nov 18, 2022
@sffc
Copy link
Member

sffc commented Nov 18, 2022

Note: This is a string property. The string properties in ICU4C are:

Enum Description
UCHAR_AGE String property Age.Corresponds to u_charAge.Stable:ICU 2.4
UCHAR_BIDI_MIRRORING_GLYPH String property Bidi_Mirroring_Glyph.Corresponds to u_charMirror.Stable:ICU 2.4
UCHAR_CASE_FOLDING String property Case_Folding.Corresponds to u_strFoldCase in ustring.h.Stable:ICU 2.4
UCHAR_ISO_COMMENT Deprecated string property ISO_Comment.Corresponds to u_getISOComment.Deprecated:ICU 49
UCHAR_LOWERCASE_MAPPING String property Lowercase_Mapping.Corresponds to u_strToLower in ustring.h.Stable:ICU 2.4
UCHAR_NAME String property Name.Corresponds to u_charName.Stable:ICU 2.4
UCHAR_SIMPLE_CASE_FOLDING String property Simple_Case_Folding.Corresponds to u_foldCase.Stable:ICU 2.4
UCHAR_SIMPLE_LOWERCASE_MAPPING String property Simple_Lowercase_Mapping.Corresponds to u_tolower.Stable:ICU 2.4
UCHAR_SIMPLE_TITLECASE_MAPPING String property Simple_Titlecase_Mapping.Corresponds to u_totitle.Stable:ICU 2.4
UCHAR_SIMPLE_UPPERCASE_MAPPING String property Simple_Uppercase_Mapping.Corresponds to u_toupper.Stable:ICU 2.4
UCHAR_TITLECASE_MAPPING String property Titlecase_Mapping.Corresponds to u_strToTitle in ustring.h.Stable:ICU 2.4
UCHAR_UNICODE_1_NAME String property Unicode_1_Name.This property is of little practical value. Beginning with ICU 49, ICU APIs return an empty string for this property. Corresponds to u_charName(U_UNICODE_10_CHAR_NAME).Deprecated:ICU 49
UCHAR_UPPERCASE_MAPPING String property Uppercase_Mapping.Corresponds to u_strToUpper in ustring.h.Stable:ICU 2.4
UCHAR_BIDI_PAIRED_BRACKET String property Bidi_Paired_Bracket (new in Unicode 6.3).Corresponds to u_getBidiPairedBracket.Stable:ICU 52

@sffc
Copy link
Member

sffc commented Nov 18, 2022

CC @echeran

@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Nov 18, 2022
@sffc
Copy link
Member

sffc commented Nov 18, 2022

@echeran @markusicu How should we implement string properties?

I assume a CodePointTrie with indices into a VarZeroVec of strings. Does that sound right?

We'll need to add these to icuexportdata.zip. Probably they should be encoded as a CodePointTrie paired with a plain list of strings.

@sffc sffc added needs-approval One or more stakeholders need to approve proposal and removed discuss Discuss at a future ICU4X-SC meeting labels Nov 18, 2022
@zbraniecki
Copy link
Member

Can we do it as an enum? How is it different from Plural Categories?

@sffc
Copy link
Member

sffc commented Nov 18, 2022

It's a mapping from string to string (in this case, code point to code point).

@sffc sffc added this to the ICU4X 1.2 milestone Dec 1, 2022
@sffc sffc added T-core Type: Required functionality S-medium Size: Less than a week (larger bug fix or enhancement) labels Dec 1, 2022
@hsivonen
Copy link
Member Author

If we know this property is always code point to code point, it seems to me we shouldn't involve strings here and should use a CodePointTrie with 32-bit value.

@sffc
Copy link
Member

sffc commented Dec 14, 2022

@markusicu Is the Bidi_Mirroring_Glyph property always Code Point to Code Point?

@hsivonen
Copy link
Member Author

At least HarfBuzz expects this property to be scalar value to scalar value.

@hsivonen
Copy link
Member Author

The comment in the data file as well as the actual data very strongly suggest that this is always a mapping from char to Option<char>.

@echeran
Copy link
Contributor

echeran commented Dec 15, 2022

IIUC what @markusicu said, it makes sense, either way, whether you want to represent the mapping as one of these options:

  1. char -> Option<char>
  2. char -> char (as ICU does)

The behavior of ICU's API u_charMirror() is to return "another Unicode code point that may serve as a mirror-image substitute, or c itself if there is no such mapping or c does not have the Bidi_Mirrored property".

A few reasons for why ICU's API return type is char, including having a behavior of returning the input for non-mirror-able characters, beyond the obvious impossibility of a nullable primitive in C++:

  1. this matches the existing case mapping APIs' behavior
  2. if you do get a character that has no mirror-image substitute, then what do you ultimately do with it? you (the rendering engine) can't just go and flip the character yourself, that is defined to not make any sense. since there is no mirror-image substitute character defined, there also will not be a glyph available in the font. so all you can do is render the character in question as-is.

If you choose to return Option<char>, then someone (the rendering engine?) has to check for None. If you choose to return char, then someone has to check if input == output. So it shouldn't make a big difference either way.

And here is somewhat related background info from TR9:

In implementation, sometimes pairs of characters are acceptable mirrors for one another—for example, U+0028 “(” LEFT PARENTHESIS and U+0029 “)” RIGHT PARENTHESIS or U+22E0 “⋠” DOES NOT PRECEDE OR EQUAL and U+22E1 “⋡” DOES NOT SUCCEED OR EQUAL. Other characters such as U+2231 “∱” CLOCKWISE INTEGRAL do not have corresponding characters that can be used for acceptable mirrors. The informative BidiMirroring.txt data file [Data9], lists the paired characters with acceptable mirror glyphs. The formal property name for this data in the Unicode Character Database [UCD] is Bidi_Mirroring_Glyph. A comment in the file indicates where the pairs are “best fit”: they should be acceptable in rendering, although ideally the mirrored glyphs may have somewhat different shapes.

@echeran
Copy link
Contributor

echeran commented Dec 15, 2022

Btw, the interpretation of Bidi_Mirroring_Glyph as returning a code point is correct. According to UAX 44 on properties/UCD, Bidi_Mirroring_Glyph is a miscellaneous property, not a string property.

In general, a property's type describes the return value, and "miscellaneous" just means it's not one of the other types: Catalog/Enumeration (enum), Binary (boolean), String-valued (string), Numeric (int or float). I guess since returning a code point is not represented as one of the other property types, "Miscellaneous" is the catch-all type assigned.

I'm not sure of the original reason for why it got classified as a string property in ICU4C. Similarly, the Age property is also in the list of properties defined in ICU4C as string properties, and in UAX 44 it is a Catalog property (returns an enumerated value), but it behaves a little strange because it's a semantic version string whose most useful interpretation is not with an equals comparison like most enums but with a semantic version string less-than/equals/greater-than-or-equals comparison. So I guess the practical interpretation of these properties may differ from the official UAX 44 categorization or the ICU4C header file categorization for the property's enum?

@sffc
Copy link
Member

sffc commented Dec 15, 2022

It seems fine then to store this in the data provider as simply a CodePointTrie mapping from a code point to a u32 that we interpret as a code point.

@hsivonen
Copy link
Member Author

We should probably ensure that Option<char> gets a 32-bit ULE and is usable directly as the trie value (if that's not already the case) so that we get deserialization-time validation. (We could assign e.g. 0xFFFFFFFF to None.)

@sffc
Copy link
Member

sffc commented Dec 15, 2022

Unfortunately we can't deserialization-time validate the CodePointTrie or Char16Trie because the values are deep inside the trie buffer. I'm fine defining the data to be that if the code point is in range, we return it, and otherwise we return None (so all error cases become None).

@sffc sffc removed the needs-approval One or more stakeholders need to approve proposal label Dec 15, 2022
@hsivonen
Copy link
Member Author

I understand that Char16Trie isn't suitable for validation, but why isn't the value ZeroVec of CodePointTrie as deserialization-time-validatable based on checking TrieValue's try_from_u32 for each item in the ZeroVec?

@sffc
Copy link
Member

sffc commented Jan 16, 2023

Yes, CodePointTrie<char> is well-defined, for example. For an optional char, to keep the value encoded as 3 bytes, we should use CodePointTrie<NichedOption<char, 3>>, providing all the necessary impls to make that work.

@Manishearth
Copy link
Member

On the topic of whether or not the data should be stored alongside Bidi_Paired_Bracket[_Type] data:

The mirroring ones have overlap with Bidi_Paired_Bracket in that the data is similar (most Bidi_Mirroring_Glyph values are identical to the Bidi_Paired_Bracket values, IF the character is a paired bracket). So it may be smaller overall to have a single queryable structure for both mirroring and pairing, that specifies the mirrored glyph and then specifies whether the pairing is formulaic of an exception.

Looking through this set query, it seems like the following is true about the current data, for characters with Bidi_M true:

  • Bidi_Mirrored_Glyph is always a mirrored glyph (a different codepoint)
  • Bidi_Paired_Bracket is set to the mirrored glyph if Bidi_Paired_Bracket_Type is set to something (not None)
  • Bidi_Paired_Bracket defaults to the same character otherwise
  • There are many characters with Bidi_M=Yes but no Bidi_Paired_Bracket_Type=Yes/No
  • There sem to be no characters with Bidi_Paired_Bracket_Type=Yes/No but not Bidi_M=Yes

However the second thing does not seem to be a "rule" and just the way the data is now, it seems possible that this data could change in the future, so we still need to have space in our data for exceptions (at which point it may not be worth having an optimization)

@sffc
Copy link
Member

sffc commented Feb 23, 2023

Discussion:

  • Can we use CodePointSetData for this?
  • @sffc - No, because each property is supposed to point to its own data key, so we reply on deduplication, which we want to avoid
  • @Manishearth - Bidi_Paired_Bracket should be a derived property.
  • @hsivonen - There are docs about how to do that in the data files.
  • @sffc - I'd propose that we just introduce two data keys, one for harfbuzz bidi data and one for skia bidi data, and they have getters for things.
  • @Manishearth - The properties overlap a lot.
  • @hsivonen - And apps that use harfbuzz probably have skia or similar.
  • @Manishearth - I'd like to make this code-driven: not the value of bidi paired bracket, but the useful value of bidi paired bracket (the normalized one).
  • @hsivonen - I agree
  • @sffc - Why should we give normalized-bidi-paired-bracket special treatment?
  • @Manishearth - It's not officially stable but I think we could ask PAG to make it stable.
  • @sffc - I'd like to stay clean as much as possible
  • @hsivonen - The normalizer is already not clean on like 3 characters
  • @sffc - So back to the data key splitting, how should we do it?
  • @hsivonen - I think we should put Bidi_Class separately, and combine Bidi_Mirroring_Glyph, Bidi_Paired_Bracket, Bidi_Paired_Bracket_Type.
  • @Manishearth - and Bidi_Mirrored.
  • @Manishearth - (some things are baked into harfbuzz)
  • @Manishearth - It is valid to investigate whether harfbuzz needs the mirrored properties at all
  • @hsivonen - How about Bidi_Control?
  • @sffc - Appears this one is not used in the bidi algorithm
  • @Manishearth - I think the property probably exists so that code can detect these characters, but they're not used in the bidi algorithm or harfbuzz.
  • @Manishearth - What do we need for regex?
  • @sffc - Looks like we need Bidi_Mirrored and Bidi_Control. Bidi_Mirrored is 915B in Postcard, so I'm fine carrying it even if it duplicates other data.
  • @echeran - Under what conditions to we add a la cart properties?
  • @sffc - Only if there's a compelling reason and use case for it. For Bidi_Mirrored, the reason is that it's pre-existing, and it is in ECMA_262.

Conclusion:

  • Bidi_Control remains its own key
  • Bidi_Class remains its own key
  • Bidi_Mirrored remains its own key because it is needed for regex
  • Add a new all-inclusive key containing: Bidi_Mirroring_Glyph, Bidi_Paired_Bracket, Bidi_Paired_Bracket_Type, and Bidi_Mirrored

@sffc
Copy link
Member

sffc commented Feb 23, 2023

Note: We can store the data in 24 bits and we should make a custom ULE for it.

@Manishearth
Copy link
Member

It's not officially stable but I think we could ask PAG to make it stable

discussed with PAG: their tentative direction here is something that works with our plan to hardcode, more to be seen in the final recommendation doc

@echeran
Copy link
Contributor

echeran commented Mar 19, 2023

I'm currently facing an issue in PR #3026 that is effectively preventing using the 24-bit structure as-is (unless we widen it to 32 bits): the datagen code that attempts to create the CodePointTrie using our bit-packed 24-bit struct as the trie's data type runs into an error. Manish diagnosed the error as coming from the CPT builder:

thread '<unnamed>' panicked at 'Don't know how to make trie with width 3', /home/manishearth/dev/icu4x/components/collections/codepointtrie_builder/src/native.rs:118:18

The ICU4C CPT builder code takes an enum for the data array value width that is either 8, 16, or 32 bits. The Rust code that interops with the ICU4C CPT builder merely propagates/enforces that requirement to us.

So 2 options:

  1. Widen the 24-bit representation to 32 bits to fit the ICU4C CPT builder requirements, and leave it at that
  2. Pursue a workaround that involves doing Option 1, followed by creating some type of Rust CPT method that can convert a CPT<T> to CPT<U>.

Option 1:

  • Pro: It would get CPT working, easier to implement
  • Con: If we widen the 24 bits to 32 bits, then we would not actually be realizing the space savings that we originally intended through the creation of the 24-bit struct.

Option 2: If we follow Option 1 to get a CPT<T> for some 32-bit type T, and perhaps expose some method on CPT that can convert the data array field from ZeroVec<T> to ZeroVec<U> for our 24-bit type U, then in theory we would finally achieve having a 24-bit CPT in Rust.

  • Pro: If it works, gets us the 25% savings that we desire
  • Con: 1) It would require adding a (doc-hidden?) public method to CPT, and 2) Can we create such a method that avoids taking a closure fn as an argument? i think so, maybe if we create a new trait as an extra constraint, but I could use some design help.

@echeran echeran added the discuss-priority Discuss at the next ICU4X meeting label Mar 19, 2023
@sffc
Copy link
Member

sffc commented Mar 19, 2023

I vote for option 2; it should be pretty easy even with the current abstractions we have. If I have time later I might send a PR to add the CPT conversion function.

@sffc
Copy link
Member

sffc commented Mar 19, 2023

@sffc
Copy link
Member

sffc commented Mar 19, 2023

Actually that function converts types but not sizes. Let me open a PR that does what you want...

@sffc
Copy link
Member

sffc commented Mar 19, 2023

#3207

@echeran echeran removed the discuss-priority Discuss at the next ICU4X meeting label Mar 20, 2023
@echeran
Copy link
Contributor

echeran commented Mar 27, 2023

Reopening because we still have TODOs marked for this issue (TODO(#2833)) related to the HarfBuzz glue code, even though that specific work is tracked in #2514.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-unicode Component: Props, sets, tries S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality
Projects
None yet
5 participants