Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make codepoint(c) work for overlong chars #55152

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
Open

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented Jul 17, 2024

As discussed in #54393, codepoint(c) should succeed for overlong encodings, and whenever ismalformed(c) returns false. This should be backwards compatible since it simply removes an error, and should be strictly faster than before since it merely removes a call to Base.is_overlong_enc.

Also, Base.ismalformed and Base.isoverlong are declared public (but not yet exported) and are included in the manual, since they are referenced in the docstring of codepoint etcetera. I also made Base.show_invalid
a public and documented function, since it is referenced from the ismalformed docs and is required by new implementations of AbstractChar types that support malformed data.

Fixes #54343, closes #54393.

@stevengj stevengj added the unicode Related to unicode characters and encodings label Jul 17, 2024
@nhz2
Copy link
Contributor

nhz2 commented Jul 28, 2024

These changes seems to be at odds with other docs, because now codepoint(a) == codepoint(b) no longer implies a == b

julia/base/char.jl

Lines 4 to 10 in 197295c

The `AbstractChar` type is the supertype of all character implementations
in Julia. A character represents a Unicode code point, and can be converted
to an integer via the [`codepoint`](@ref) function in order to obtain the
numerical value of the code point, or constructed from the same integer.
These numerical values determine how characters are compared with `<` and `==`,
for example. New `T <: AbstractChar` types should define a `codepoint(::T)`
method and a `T(::UInt32)` constructor, at minimum.

Also, the information that a Char is overlong will be silently destroyed by conversion to UInt32 instead of throwing an error.

julia/base/char.jl

Lines 40 to 44 in 197295c

In order to losslessly represent arbitrary byte streams stored in a `String`,
a `Char` value may store information that cannot be converted to a Unicode
codepoint — converting such a `Char` to `UInt32` will throw an error.
The [`isvalid(c::Char)`](@ref) function can be used to query whether `c`
represents a valid Unicode character.

@stevengj
Copy link
Member Author

stevengj commented Jul 28, 2024

A character represents a Unicode code point

The basic problem is that this is an oversimplification. The Char type represents the encoding, not just the codepoint, and can represent byte sequences that don’t encode Unicode code points.

I updated the AbstractChar docs to be more accurate.

@stevengj stevengj added the triage This should be discussed on a triage call label Jan 1, 2025
Copy link
Member

@StefanKarpinski StefanKarpinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me

@stevengj
Copy link
Member Author

stevengj commented Jan 2, 2025

Unrelated CI failure, updating and re-running CI.

@stevengj stevengj added merge me PR is reviewed. Merge when all tests are passing and removed triage This should be discussed on a triage call labels Jan 2, 2025
base/char.jl Outdated Show resolved Hide resolved
@LilithHafner
Copy link
Member

The docstring of codepoint currently reads "...throw an exception if c does not represent a valid character...". That should be changes to "...represent a malformed character..." and link to the definition of invalid but not malformed.

@StefanKarpinski
Copy link
Member

Triage likes but would also like for "malformed" to be documented somewhere and to adjust the docstring of codepoint to refer to malformed rather than invalid. @LilithHafner feels that it would be good to block the publicness of ismalformed on documentation of what it means, so maybe that's a good ordering:

  1. Add docstring for ismalformed defining what it does
  2. Make ismalformed public
  3. Update codepoint docstring to refer to malformed versus invalid

An additional comment regarding equality and comparison:

  • Valid strings are compared as lexicographically ordered sequences of code points
  • A valid string and an invalid string must never be equal
  • Comparison of invalid strings is implementation-defined and may error but should be an ordering:
    • Reflexive: s == s for all strings
    • Antisymmetric: s <= t and t <= s implies s == t
    • Transitive: s <= t and t <= u implies s <= u
    • Total: either s <= t or t <= s or both are an error

This allows each string type to define a total ordering on valid and invalid strings in a way that's efficient and consistent within the type, but comparisons of invalid strings across types can simply error since there's no sensible way to implement that and forcing it to be consistent would force valid comparisons to be done inefficiently.

@stevengj
Copy link
Member Author

stevengj commented Jan 2, 2025

It seems like the triage requests were all already addressed.

  1. ismalformed already has a docstring in this PR (and is included in the manual)
  2. ismalformed is already public (but not exported) in this PR
  3. the codepoint docs already refer to malformed rather than valid in this PR

Removing the "merge me" label, however, until it is clear that everyone is satisfied.

@stevengj stevengj removed the merge me PR is reviewed. Merge when all tests are passing label Jan 2, 2025
base/char.jl Outdated Show resolved Hide resolved
base/char.jl Outdated Show resolved Hide resolved
base/char.jl Show resolved Hide resolved
@stevengj
Copy link
Member Author

stevengj commented Jan 2, 2025

Windows build failure looks unrelated: ERROR: Unable to open agent private key path 'C:\secrets/agent.key'! Make sure your agent has this file deployed within it!

Copy link
Member

@LilithHafner LilithHafner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like a more complete/specific/accessible definition of malformed vs invalid, or a link to the specific part of the unicode standard that defines it; but I don't think that is blocking given the level of docs already in this PR.

@stevengj stevengj added the merge me PR is reviewed. Merge when all tests are passing label Jan 3, 2025
@inkydragon
Copy link
Member

inkydragon commented Jan 3, 2025

malformed vs invalid

I didn't find a definition for either word, but did find definitions for their synonyms/antonyms..
Glossary of Unicode Terms

D84 Ill-formed: A Unicode code unit sequence that purports to be in a Unicode encoding form is called ill-formed if and only if it does not follow the specification of that Unicode encoding form.

  • Any code unit sequence that would correspond to a code point outside the defined range of Unicode scalar values would, for example, be ill-formed.
  • UTF-8 has some strong constraints on the possible byte ranges for leading and trailing bytes. A violation of those constraints would produce a code unit sequence that could not be mapped to a Unicode scalar value, resulting in an ill-formed code unit sequence.

xref:

D89 In a Unicode encoding form: ...

  • A Unicode string consisting of a well-formed UTF-8 code unit sequence is said to be in UTF-8. Such a Unicode string is referred to as a valid UTF-8 string, or a UTF-8 string for short.

@LilithHafner
Copy link
Member

Assuming UTF-8,

Unicode specifies that any code unit sequence not listed in this table is ill-formed and not well-formed.

IIUC, our definition of malformed is different from Unicode's definition of ill-formed. For example, overlong characters are not Base.ismalformed but are ill-formed according to Unicode.

@LilithHafner LilithHafner removed the merge me PR is reviewed. Merge when all tests are passing label Jan 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

isuppercase/islowercase fail on invalid characters
5 participants