move props bmg bpb EqUIdeo from misc to string #383

Merged
merged 1 commit into from
Jan 18, 2023

Conversation

markusicu
Member

@markusicu markusicu commented Jan 11, 2023

For

[172-A77] Action Item for Markus Scherer, PAG: For a future version of the Unicode Standard, change PropertyAliases.txt to list Bidi_Mirroring_Glyph and Equivalent_Unified_Ideograph under “String Properties”, and to list Bidi_Paired_Bracket under “Enumerated Properties”.

except:
When writing the action item (L2/22-124 item UCD17), we seem to have mixed up Bidi_Paired_Bracket (Miscellaneous, should be String) with Bidi_Paired_Bracket_Type (is Enumerated). I changed Bidi_Paired_Bracket to String as well.
See

Old parts of the tools code hardcode the property type and @missing value for some properties. I changed that code to achieve the desired output, including keeping the Bidi_Paired_Bracket @missing value as <none> (as documented in the data file), rather than having it switch to <code point>. (See PR discussion.)

See https://www.unicode.org/reports/tr44/

Tag            Interpretation
<none>         the empty string
<code point>   the string representation of the code point value
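
As a rough illustration of the two tags quoted above, the following Python sketch resolves a string-property value for a code point that has no explicit entry. The function names and the four-entry table are hypothetical and are not taken from the Unicode tools code; only the tag interpretations come from the UAX #44 excerpt.

```python
# Hypothetical excerpt of explicit Bidi_Paired_Bracket values
# (code point -> paired code point). The real data lives in BidiBrackets.txt.
EXPLICIT_BPB = {0x0028: 0x0029, 0x0029: 0x0028, 0x005B: 0x005D, 0x005D: 0x005B}


def interpret_missing_tag(tag: str, cp: int) -> str:
    """Turn an @missing tag into the default string value for code point cp."""
    if tag == "<none>":
        return ""          # the empty string
    if tag == "<code point>":
        return chr(cp)     # the code point maps to itself
    raise ValueError(f"unsupported @missing tag: {tag!r}")


def string_property_value(cp: int, explicit: dict, missing_tag: str) -> str:
    """Return the explicit value if one is listed, else the @missing default."""
    if cp in explicit:
        return chr(explicit[cp])
    return interpret_missing_tag(missing_tag, cp)


print(repr(string_property_value(0x28, EXPLICIT_BPB, "<none>")))        # explicit entry
print(repr(string_property_value(0x41, EXPLICIT_BPB, "<none>")))        # default under <none>
print(repr(string_property_value(0x41, EXPLICIT_BPB, "<code point>")))  # default under <code point>
```

Under these assumptions, switching the `@missing` tag changes only the result for code points without an explicit entry, which is the difference the PR discussion is about.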

@macchiati
Member

macchiati commented Jan 11, 2023

When I switched Bidi_Paired_Bracket to String, it changed the @missing value from <none> (as documented in the data file) to <code point>. We need to decide which is right and make the data match the docs.

<code point> makes sense for any property that is fundamentally used for a transformation that leaves unmentioned code points of the string alone (like lowercase). This isn't used as that kind of transformation, so I think <none> is better.

@asmusf

asmusf commented Jan 11, 2023

Agree with Mark.
Now, we need to check that we don't promise somewhere that a String property is defined for every code point.
If that is the case, that could be the reason why we went with "miscellaneous".

@markusicu
Member Author

Agree with Mark.

Ok. I will see how I can make that happen.

Now, we need to check that we don't promise somewhere that a String property is defined for every code point.

Every character property is defined for every code point, with an explicit value or with a "missing" value.

@asmusf

asmusf commented Jan 12, 2023

Technically you are correct, but that's not what I'm after. A value of "none" means that there's no string defined for that code point. So, to ask the question again, do we promise somewhere that string-valued properties are string-valued for all code points? If we do, then we should either change that promise, or make this a "miscellaneous" property again, where we can then spell out precisely that for this property, some code points are mapped to a string, while others are mapped to nothing (or "none").

And I would argue, further, that defining a missing value of "none" or "n/a" is only mathematically, and not practically, different from a property that is simply undefined for some ranges of code points.

Alternatively, we could make a stronger point of clarifying how mappings differ from other string-valued properties, including the guarantee that they have a string for each code point (and if needed, we can introduce the empty string).
It may be that we already did that, I just don't recall offhand and can't devote time right now to tracking this down.

@markusicu markusicu force-pushed the move-props-misc-strings branch from 16aabc5 to 1d40037 Compare January 17, 2023 22:55
@markusicu markusicu marked this pull request as ready for review January 17, 2023 23:00
@echeran
Contributor

echeran commented Jan 18, 2023

...A value of "none" means that there's no string defined for that code point. So, to ask the question again, do we promise somewhere that string-valued properties are string-valued for all code points? If we do, then we should either change that promise, or make this a "miscellaneous" property again, where we can then spell out precisely that for this property, some code points are mapped to a string, while others are mapped to nothing (or "none").

And I would argue, further, that defining a missing value of "none" or "n/a" is only mathematically, and not practically, different from a property that is simply undefined for some ranges of code points.

I had similar questions recently in ICU4X (unicode-org/icu4x#2833) about the property value type for Bidi_Mirroring_Glyph, and Bidi_Paired_Bracket appears analogous. My comments there don't include later discussions, as already discussed here in this PR, that indicated that these properties apparently really should be considered as string properties because they return code points. (The header info for the corresponding BidiMirroring.txt and BidiBrackets.txt also describe their return values to occupy the same range.)

To the question of whether <none> is a String value, I looked up the section on String values in UTR #23, Section 3.6:

PD32. String
An ordered sequence of zero or more code points.
At its most general, a string is any coded character sequence but extending the concept to encompass the empty sequence. Character mappings are common examples of properties for which the values are strings but not necessarily Unicode strings.
All code points in a string are from the same character encoding.

PD32a. Empty String
A string consisting of exactly zero code points.
Note that in principle any empty string is equivalent to any other empty string, so in many contexts, an instance of an empty string is simply referred to as the empty string.
...

IIUC, the <none> value is interpreted as the empty string, but it is still a string.

@markusicu
Member Author

All <none> does is say that there is no real value there. APIs can express that in different ways. In ICU functions returning UChar32, the cleanest would be returning -1 (U_SENTINEL). I don't remember why I made u_charMirror(c) return the input c instead. Probably because simple case mapping functions do so, although that's semantically a bit different, as Mark said.
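
To make the two API shapes mentioned here concrete, this is a minimal Python sketch, not ICU code. The table is a hypothetical four-entry excerpt of Bidi_Mirroring_Glyph, and the function names are invented for illustration. The constant -1 mirrors the U_SENTINEL value named in the comment above.

```python
# Hypothetical excerpt of Bidi_Mirroring_Glyph (code point -> mirrored code point).
MIRROR = {0x0028: 0x0029, 0x0029: 0x0028, 0x003C: 0x003E, 0x003E: 0x003C}

NO_VALUE = -1  # same numeric value as the U_SENTINEL mentioned in the comment


def mirror_or_sentinel(cp: int) -> int:
    """Report 'no value' explicitly with a sentinel that is not a code point."""
    return MIRROR.get(cp, NO_VALUE)


def mirror_or_self(cp: int) -> int:
    """Return the input unchanged when there is no mirroring glyph,
    the behavior described above for u_charMirror(c)."""
    return MIRROR.get(cp, cp)


for cp in (0x0028, 0x0041):
    print(f"U+{cp:04X}: sentinel style -> {mirror_or_sentinel(cp)}, "
          f"identity style -> U+{mirror_or_self(cp):04X}")
```

With the sentinel style a caller can tell "no mirror exists" apart from a real result. With the identity style the caller has to compare the result against the input to detect that case.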


Could I please get someone to approve this PR?

@asmusf

asmusf commented Jan 18, 2023

I think we should not proceed hastily without solving the deeper issue. We are not doing API design here, so it matters whether we conceive of <none> as a way to say "undefined" or "empty string".

From memory, what I think I know about these two properties is that the values are truly "undefined", "not there" or "not defined", in that there is the absence of a mapping.

If an API gave you the empty string, you would not use that literally, but to detect that a mapping isn't possible. In other words, for all other values you would map some code point to the string value as part of processing it, but if it's undefined you would not map to the empty string, but do something else. Which is the tell that <none> in this context is not the empty string.

That means, we still need to find out whether string-valued properties make an implied or explicit promise that <none> is the same as the "empty string". Which, in this case would be wrong.

I'm suggesting that in TR23 and chapter 3 we may need to have a definition for properties that are undefined for some inputs.
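
The distinction drawn in this comment between "undefined" and "the empty string" can be sketched in Python as follows. This is only an illustration under assumed names: the table and all four functions are hypothetical, and the example treats the property as a text transformation purely to show where the two readings diverge.

```python
from typing import Optional

# Hypothetical excerpt of paired brackets, keyed by character for readability.
PAIRED = {"(": ")", ")": "(", "[": "]", "]": "["}


def bpb_as_empty_string(ch: str) -> str:
    """Reading 1: <none> is literally the empty string."""
    return PAIRED.get(ch, "")


def bpb_as_optional(ch: str) -> Optional[str]:
    """Reading 2: <none> means 'no mapping exists', modeled as None."""
    return PAIRED.get(ch)


def swap_literal(text: str) -> str:
    """Applying reading 1 literally as a mapping deletes unmapped characters."""
    return "".join(bpb_as_empty_string(ch) for ch in text)


def swap_checked(text: str) -> str:
    """With reading 2 the caller detects the absence and does something else,
    here keeping the original character."""
    out = []
    for ch in text:
        pair = bpb_as_optional(ch)
        out.append(ch if pair is None else pair)
    return "".join(out)


print(repr(swap_literal("(a)")))
print(repr(swap_checked("(a)")))
```

In this sketch the literal reading silently drops `a`, while the checked reading keeps it, which is the behavior the comment describes as the sign that `<none>` is being used as "no mapping" rather than as a string value.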

echeran
echeran previously approved these changes Jan 18, 2023
Contributor

@echeran echeran left a comment

For the purposes of annotating the property as a string in the data files and supporting tools, this PR LGTM.

It SGTM to have followup work in TR23 (and touchups to UAX44?) accordingly, which I assume has to happen outside of this PR.

@markusicu
Member Author

Overthinking here. No one is confused about Bidi_Paired_Bracket mapping to <none>. Everyone looking at BidiBrackets.txt understands that for a character with bpb=<none> there exists no character that is its paired bracket.

The point of this AI & PR is that these three properties map characters to characters, and therefore we can call them String properties and don't have to use the "Miscellaneous" escape hatch.

@markusicu
Member Author

I just made one more fix in BidiBrackets.txt which used to say unnecessarily what types its properties are. Rather than keeping those in sync, I removed that, leaving it up to PropertyAliases.txt and UAX 44 to define the types of properties, as usual.

@markusicu markusicu merged commit 6aa0313 into unicode-org:main Jan 18, 2023
@markusicu markusicu deleted the move-props-misc-strings branch January 18, 2023 19:13
4 participants