-
Notifications
You must be signed in to change notification settings - Fork 214
Provide lookup for the Bidi_Mirroring_Glyph property #2833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Note: This is a string property. The string properties in ICU4C are:
|
CC @echeran |
@echeran @markusicu How should we implement string properties? I assume a CodePointTrie with indices into a VarZeroVec of strings. Does that sound right? We'll need to add these to icuexportdata.zip. Probably they should be encoded as a CodePointTrie paired with a plain list of strings. |
Can we do it as an enum? How is it different from Plural Categories? |
It's a mapping from string to string (in this case, code point to code point). |
If we know this property is always code point to code point, it seems to me we shouldn't involve strings here and should use a |
@markusicu Is the |
At least HarfBuzz expects this property to be scalar value to scalar value. |
The comment in the data file as well as the actual data very strongly suggest that this is always a mapping from |
IIUC what @markusicu said, it makes sense, either way, whether you want to represent the mapping as one of these options:
The behavior of ICU's API A few reasons for why ICU's API return type is
If you choose to return And here is somewhat related background info from TR9:
|
Btw, the interpretation of Bidi_Mirroring_Glyph as returning a code point is correct. According to UAX 44 on properties/UCD, In general, a property's type describes the return value, and "miscellaneous" just means it's not one of the other types: Catalog/Enumeration (enum), Binary (boolean), String-valued (string), Numeric (int or float). I guess since returning a code point is not represented as one of the other property types, "Miscellaneous" is the catch-all type assigned. I'm not sure of the original reason for why it got classified as a string property in ICU4C. Similarly, the |
It seems fine then to store this in the data provider as simply a |
We should probably ensure that |
Unfortunately we can't deserialization-time validate the CodePointTrie or Char16Trie because the values are deep inside the trie buffer. I'm fine defining the data to be that if the code point is in range, we return it, and otherwise we return None (so all error cases become None). |
I understand that |
Yes, |
On the topic of whether or not the data should be stored alongside The mirroring ones have overlap with Looking through this set query, it seems like the following is true about the current data, for characters with
However the second thing does not seem to be a "rule" and just the way the data is now, it seems possible that this data could change in the future, so we still need to have space in our data for exceptions (at which point it may not be worth having an optimization) |
Discussion:
Conclusion:
|
Note: We can store the data in 24 bits and we should make a custom ULE for it. |
discussed with PAG: their tentative direction here is something that works with our plan to hardcode, more to be seen in the final recommendation doc |
I'm currently facing an issue in PR #3026 that is effectively preventing using the 24-bit structure as-is (unless we widen it to 32 bits): the datagen code that attempts to create the CodePointTrie using our bit-packed 24-bit struct as the trie's data type runs into an error. Manish diagnosed the error as coming from the CPT builder:
The ICU4C CPT builder code takes an enum for the data array value width that is either 8, 16, or 32 bits. The Rust code that interops with the ICU4C CPT builder merely propagates/enforces that requirement to us. So 2 options:
Option 1:
Option 2: If we follow Option 1 to get a
|
I vote for option 2; it should be pretty easy even with the current abstractions we have. If I have time later I might send a PR to add the CPT conversion function. |
Actually the conversion method already exists: |
Actually that function converts types but not sizes. Let me open a PR that does what you want... |
Reopening because we still have TODOs marked for this issue ( |
HarfBuzz wants to be able to look up the
Bidi_Mirroring_Glyph
property for a character.To be able to use ICU4X as a HarfBuzz Unicode data back end, ICU4X should provide lookup for this property.
The text was updated successfully, but these errors were encountered: