Provide lookup for the Bidi_Mirroring_Glyph property #2833

hsivonen · 2022-11-18T06:05:01Z

HarfBuzz wants to be able to look up the Bidi_Mirroring_Glyph property for a character.

To be able to use ICU4X as a HarfBuzz Unicode data back end, ICU4X should provide lookup for this property.

The text was updated successfully, but these errors were encountered:

sffc · 2022-11-18T18:46:45Z

Note: This is a string property. The string properties in ICU4C are:

Enum	Description
UCHAR_AGE	String property Age.Corresponds to u_charAge.Stable:ICU 2.4
UCHAR_BIDI_MIRRORING_GLYPH	String property Bidi_Mirroring_Glyph.Corresponds to u_charMirror.Stable:ICU 2.4
UCHAR_CASE_FOLDING	String property Case_Folding.Corresponds to u_strFoldCase in ustring.h.Stable:ICU 2.4
UCHAR_ISO_COMMENT	Deprecated string property ISO_Comment.Corresponds to u_getISOComment.Deprecated:ICU 49
UCHAR_LOWERCASE_MAPPING	String property Lowercase_Mapping.Corresponds to u_strToLower in ustring.h.Stable:ICU 2.4
UCHAR_NAME	String property Name.Corresponds to u_charName.Stable:ICU 2.4
UCHAR_SIMPLE_CASE_FOLDING	String property Simple_Case_Folding.Corresponds to u_foldCase.Stable:ICU 2.4
UCHAR_SIMPLE_LOWERCASE_MAPPING	String property Simple_Lowercase_Mapping.Corresponds to u_tolower.Stable:ICU 2.4
UCHAR_SIMPLE_TITLECASE_MAPPING	String property Simple_Titlecase_Mapping.Corresponds to u_totitle.Stable:ICU 2.4
UCHAR_SIMPLE_UPPERCASE_MAPPING	String property Simple_Uppercase_Mapping.Corresponds to u_toupper.Stable:ICU 2.4
UCHAR_TITLECASE_MAPPING	String property Titlecase_Mapping.Corresponds to u_strToTitle in ustring.h.Stable:ICU 2.4
UCHAR_UNICODE_1_NAME	String property Unicode_1_Name.This property is of little practical value. Beginning with ICU 49, ICU APIs return an empty string for this property. Corresponds to u_charName(U_UNICODE_10_CHAR_NAME).Deprecated:ICU 49
UCHAR_UPPERCASE_MAPPING	String property Uppercase_Mapping.Corresponds to u_strToUpper in ustring.h.Stable:ICU 2.4
UCHAR_BIDI_PAIRED_BRACKET	String property Bidi_Paired_Bracket (new in Unicode 6.3).Corresponds to u_getBidiPairedBracket.Stable:ICU 52

sffc · 2022-11-18T18:46:53Z

CC @echeran

sffc · 2022-11-18T18:48:36Z

@echeran @markusicu How should we implement string properties?

I assume a CodePointTrie with indices into a VarZeroVec of strings. Does that sound right?

We'll need to add these to icuexportdata.zip. Probably they should be encoded as a CodePointTrie paired with a plain list of strings.

zbraniecki · 2022-11-18T20:23:16Z

Can we do it as an enum? How is it different from Plural Categories?

sffc · 2022-11-18T20:47:31Z

It's a mapping from string to string (in this case, code point to code point).

hsivonen · 2022-12-14T09:03:28Z

If we know this property is always code point to code point, it seems to me we shouldn't involve strings here and should use a CodePointTrie with 32-bit value.

sffc · 2022-12-14T09:28:11Z

@markusicu Is the Bidi_Mirroring_Glyph property always Code Point to Code Point?

hsivonen · 2022-12-14T09:48:27Z

At least HarfBuzz expects this property to be scalar value to scalar value.

hsivonen · 2022-12-14T09:51:20Z

The comment in the data file as well as the actual data very strongly suggest that this is always a mapping from char to Option<char>.

echeran · 2022-12-15T00:28:27Z

IIUC what @markusicu said, it makes sense, either way, whether you want to represent the mapping as one of these options:

char -> Option<char>
char -> char (as ICU does)

The behavior of ICU's API u_charMirror() is to return "another Unicode code point that may serve as a mirror-image substitute, or c itself if there is no such mapping or c does not have the Bidi_Mirrored property".

A few reasons for why ICU's API return type is char, including having a behavior of returning the input for non-mirror-able characters, beyond the obvious impossibility of a nullable primitive in C++:

this matches the existing case mapping APIs' behavior
if you do get a character that has no mirror-image substitute, then what do you ultimately do with it? you (the rendering engine) can't just go and flip the character yourself, that is defined to not make any sense. since there is no mirror-image substitute character defined, there also will not be a glyph available in the font. so all you can do is render the character in question as-is.

If you choose to return Option<char>, then someone (the rendering engine?) has to check for None. If you choose to return char, then someone has to check if input == output. So it shouldn't make a big difference either way.

And here is somewhat related background info from TR9:

In implementation, sometimes pairs of characters are acceptable mirrors for one another—for example, U+0028 “(” LEFT PARENTHESIS and U+0029 “)” RIGHT PARENTHESIS or U+22E0 “⋠” DOES NOT PRECEDE OR EQUAL and U+22E1 “⋡” DOES NOT SUCCEED OR EQUAL. Other characters such as U+2231 “∱” CLOCKWISE INTEGRAL do not have corresponding characters that can be used for acceptable mirrors. The informative BidiMirroring.txt data file [Data9], lists the paired characters with acceptable mirror glyphs. The formal property name for this data in the Unicode Character Database [UCD] is Bidi_Mirroring_Glyph. A comment in the file indicates where the pairs are “best fit”: they should be acceptable in rendering, although ideally the mirrored glyphs may have somewhat different shapes.

echeran · 2022-12-15T00:45:48Z

Btw, the interpretation of Bidi_Mirroring_Glyph as returning a code point is correct. According to UAX 44 on properties/UCD, Bidi_Mirroring_Glyph is a miscellaneous property, not a string property.

In general, a property's type describes the return value, and "miscellaneous" just means it's not one of the other types: Catalog/Enumeration (enum), Binary (boolean), String-valued (string), Numeric (int or float). I guess since returning a code point is not represented as one of the other property types, "Miscellaneous" is the catch-all type assigned.

I'm not sure of the original reason for why it got classified as a string property in ICU4C. Similarly, the Age property is also in the list of properties defined in ICU4C as string properties, and in UAX 44 it is a Catalog property (returns an enumerated value), but it behaves a little strange because it's a semantic version string whose most useful interpretation is not with an equals comparison like most enums but with a semantic version string less-than/equals/greater-than-or-equals comparison. So I guess the practical interpretation of these properties may differ from the official UAX 44 categorization or the ICU4C header file categorization for the property's enum?

sffc · 2022-12-15T10:00:44Z

It seems fine then to store this in the data provider as simply a CodePointTrie mapping from a code point to a u32 that we interpret as a code point.

hsivonen · 2022-12-15T13:05:54Z

We should probably ensure that Option<char> gets a 32-bit ULE and is usable directly as the trie value (if that's not already the case) so that we get deserialization-time validation. (We could assign e.g. 0xFFFFFFFF to None.)

sffc · 2022-12-15T13:30:21Z

Unfortunately we can't deserialization-time validate the CodePointTrie or Char16Trie because the values are deep inside the trie buffer. I'm fine defining the data to be that if the code point is in range, we return it, and otherwise we return None (so all error cases become None).

hsivonen · 2022-12-16T09:10:05Z

I understand that Char16Trie isn't suitable for validation, but why isn't the value ZeroVec of CodePointTrie as deserialization-time-validatable based on checking TrieValue's try_from_u32 for each item in the ZeroVec?

sffc · 2023-01-16T17:05:44Z

Yes, CodePointTrie<char> is well-defined, for example. For an optional char, to keep the value encoded as 3 bytes, we should use CodePointTrie<NichedOption<char, 3>>, providing all the necessary impls to make that work.

Manishearth · 2023-02-21T18:51:20Z

On the topic of whether or not the data should be stored alongside Bidi_Paired_Bracket[_Type] data:

The mirroring ones have overlap with Bidi_Paired_Bracket in that the data is similar (most Bidi_Mirroring_Glyph values are identical to the Bidi_Paired_Bracket values, IF the character is a paired bracket). So it may be smaller overall to have a single queryable structure for both mirroring and pairing, that specifies the mirrored glyph and then specifies whether the pairing is formulaic of an exception.

Looking through this set query, it seems like the following is true about the current data, for characters with Bidi_M true:

Bidi_Mirrored_Glyph is always a mirrored glyph (a different codepoint)
Bidi_Paired_Bracket is set to the mirrored glyph if Bidi_Paired_Bracket_Type is set to something (not None)
Bidi_Paired_Bracket defaults to the same character otherwise
There are many characters with Bidi_M=Yes but no Bidi_Paired_Bracket_Type=Yes/No
There sem to be no characters with Bidi_Paired_Bracket_Type=Yes/No but not Bidi_M=Yes

However the second thing does not seem to be a "rule" and just the way the data is now, it seems possible that this data could change in the future, so we still need to have space in our data for exceptions (at which point it may not be worth having an optimization)

sffc · 2023-02-23T19:22:03Z

Discussion:

Can we use CodePointSetData for this?
@sffc - No, because each property is supposed to point to its own data key, so we reply on deduplication, which we want to avoid
@Manishearth - Bidi_Paired_Bracket should be a derived property.
@hsivonen - There are docs about how to do that in the data files.
@sffc - I'd propose that we just introduce two data keys, one for harfbuzz bidi data and one for skia bidi data, and they have getters for things.
@Manishearth - The properties overlap a lot.
@hsivonen - And apps that use harfbuzz probably have skia or similar.
@Manishearth - I'd like to make this code-driven: not the value of bidi paired bracket, but the useful value of bidi paired bracket (the normalized one).
@hsivonen - I agree
@sffc - Why should we give normalized-bidi-paired-bracket special treatment?
@Manishearth - It's not officially stable but I think we could ask PAG to make it stable.
@sffc - I'd like to stay clean as much as possible
@hsivonen - The normalizer is already not clean on like 3 characters
@sffc - So back to the data key splitting, how should we do it?
@hsivonen - I think we should put Bidi_Class separately, and combine Bidi_Mirroring_Glyph, Bidi_Paired_Bracket, Bidi_Paired_Bracket_Type.
@Manishearth - and Bidi_Mirrored.
@Manishearth - (some things are baked into harfbuzz)
@Manishearth - It is valid to investigate whether harfbuzz needs the mirrored properties at all
@hsivonen - How about Bidi_Control?
@sffc - Appears this one is not used in the bidi algorithm
@Manishearth - I think the property probably exists so that code can detect these characters, but they're not used in the bidi algorithm or harfbuzz.
@Manishearth - What do we need for regex?
@sffc - Looks like we need Bidi_Mirrored and Bidi_Control. Bidi_Mirrored is 915B in Postcard, so I'm fine carrying it even if it duplicates other data.
@echeran - Under what conditions to we add a la cart properties?
@sffc - Only if there's a compelling reason and use case for it. For Bidi_Mirrored, the reason is that it's pre-existing, and it is in ECMA_262.

Conclusion:

Bidi_Control remains its own key
Bidi_Class remains its own key
Bidi_Mirrored remains its own key because it is needed for regex
Add a new all-inclusive key containing: Bidi_Mirroring_Glyph, Bidi_Paired_Bracket, Bidi_Paired_Bracket_Type, and Bidi_Mirrored

sffc · 2023-02-23T19:25:18Z

Note: We can store the data in 24 bits and we should make a custom ULE for it.

Manishearth · 2023-02-23T21:33:52Z

It's not officially stable but I think we could ask PAG to make it stable

discussed with PAG: their tentative direction here is something that works with our plan to hardcode, more to be seen in the final recommendation doc

echeran · 2023-03-19T05:32:14Z

I'm currently facing an issue in PR #3026 that is effectively preventing using the 24-bit structure as-is (unless we widen it to 32 bits): the datagen code that attempts to create the CodePointTrie using our bit-packed 24-bit struct as the trie's data type runs into an error. Manish diagnosed the error as coming from the CPT builder:

thread '<unnamed>' panicked at 'Don't know how to make trie with width 3', /home/manishearth/dev/icu4x/components/collections/codepointtrie_builder/src/native.rs:118:18

The ICU4C CPT builder code takes an enum for the data array value width that is either 8, 16, or 32 bits. The Rust code that interops with the ICU4C CPT builder merely propagates/enforces that requirement to us.

So 2 options:

Widen the 24-bit representation to 32 bits to fit the ICU4C CPT builder requirements, and leave it at that
Pursue a workaround that involves doing Option 1, followed by creating some type of Rust CPT method that can convert a CPT<T> to CPT<U>.

Option 1:

Pro: It would get CPT working, easier to implement
Con: If we widen the 24 bits to 32 bits, then we would not actually be realizing the space savings that we originally intended through the creation of the 24-bit struct.

Option 2: If we follow Option 1 to get a CPT<T> for some 32-bit type T, and perhaps expose some method on CPT that can convert the data array field from ZeroVec<T> to ZeroVec<U> for our 24-bit type U, then in theory we would finally achieve having a 24-bit CPT in Rust.

Pro: If it works, gets us the 25% savings that we desire
Con: 1) It would require adding a (doc-hidden?) public method to CPT, and 2) Can we create such a method that avoids taking a closure fn as an argument? i think so, maybe if we create a new trait as an extra constraint, but I could use some design help.

sffc · 2023-03-19T05:43:40Z

I vote for option 2; it should be pretty easy even with the current abstractions we have. If I have time later I might send a PR to add the CPT conversion function.

sffc · 2023-03-19T06:04:59Z

Actually the conversion method already exists:

https://unicode-org.github.io/icu4x/docs/icu_collections/codepointtrie/struct.CodePointTrie.html#method.try_into_converted

sffc · 2023-03-19T06:05:45Z

Actually that function converts types but not sizes. Let me open a PR that does what you want...

sffc · 2023-03-19T06:30:24Z

#3207

echeran · 2023-03-27T23:20:29Z

Reopening because we still have TODOs marked for this issue (TODO(#2833)) related to the HarfBuzz glue code, even though that specific work is tracked in #2514.

hsivonen added the C-unicode Component: Props, sets, tries label Nov 18, 2022

hsivonen mentioned this issue Nov 18, 2022

Provide glue code for HarfBuzz #2514

Closed

sffc added the discuss Discuss at a future ICU4X-SC meeting label Nov 18, 2022

sffc added needs-approval One or more stakeholders need to approve proposal and removed discuss Discuss at a future ICU4X-SC meeting labels Nov 18, 2022

hsivonen mentioned this issue Nov 22, 2022

First part of glue code for HarfBuzz #2839

Merged

sffc added this to the ICU4X 1.2 milestone Dec 1, 2022

sffc assigned echeran Dec 1, 2022

sffc added T-core Type: Required functionality S-medium Size: Less than a week (larger bug fix or enhancement) labels Dec 1, 2022

sffc removed the needs-approval One or more stakeholders need to approve proposal label Dec 15, 2022

sffc mentioned this issue Dec 16, 2022

Represent Property data for strings #1667

Open

echeran mentioned this issue Jan 18, 2023

move props bmg bpb EqUIdeo from misc to string unicode-org/unicodetools#383

Merged

sffc mentioned this issue Jan 25, 2023

How to handle non-orthogonal data #3022

Open

echeran mentioned this issue Jan 25, 2023

Support data for Bidi properties with combined data structure #3026

Merged

sffc modified the milestones: 1.2 Blocking ⟨P1⟩, 1.x Priority ⟨P2⟩ Feb 23, 2023

sffc mentioned this issue Feb 23, 2023

Provide data for Bidi pairing of brackets #3030

Closed

echeran added the discuss-priority Discuss at the next ICU4X meeting label Mar 19, 2023

sffc mentioned this issue Mar 19, 2023

Add CPT::try_alloc_map_value #3207

Merged

echeran removed the discuss-priority Discuss at the next ICU4X meeting label Mar 20, 2023

echeran linked a pull request Mar 27, 2023 that will close this issue

Support data for Bidi properties with combined data structure #3026

Merged

echeran closed this as completed in #3026 Mar 27, 2023

echeran reopened this Mar 27, 2023

echeran mentioned this issue Mar 28, 2023

Update HarfBuzz glue code for Bidi_Mirroring_Glyph data #3229

Merged

echeran closed this as completed in #3229 Mar 28, 2023

Provide lookup for the Bidi_Mirroring_Glyph property #2833

Provide lookup for the Bidi_Mirroring_Glyph property #2833

Comments

hsivonen commented Nov 18, 2022

sffc commented Nov 18, 2022

Uh oh!

sffc commented Nov 18, 2022

Uh oh!

sffc commented Nov 18, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zbraniecki commented Nov 18, 2022

Uh oh!

sffc commented Nov 18, 2022

Uh oh!

hsivonen commented Dec 14, 2022

Uh oh!

sffc commented Dec 14, 2022

Uh oh!

hsivonen commented Dec 14, 2022

Uh oh!

hsivonen commented Dec 14, 2022

Uh oh!

echeran commented Dec 15, 2022

Uh oh!

echeran commented Dec 15, 2022

Uh oh!

sffc commented Dec 15, 2022

Uh oh!

hsivonen commented Dec 15, 2022

Uh oh!

sffc commented Dec 15, 2022

Uh oh!

hsivonen commented Dec 16, 2022

Uh oh!

sffc commented Jan 16, 2023

Uh oh!

Manishearth commented Feb 21, 2023

Uh oh!

sffc commented Feb 23, 2023

Uh oh!

sffc commented Feb 23, 2023

Uh oh!

Manishearth commented Feb 23, 2023

Uh oh!

echeran commented Mar 19, 2023

Uh oh!

sffc commented Mar 19, 2023

Uh oh!

sffc commented Mar 19, 2023

Uh oh!

sffc commented Mar 19, 2023

Uh oh!

sffc commented Mar 19, 2023

Uh oh!

echeran commented Mar 27, 2023

Uh oh!

sffc commented Nov 18, 2022 •

edited

Loading