make codepoint(c) work for overlong chars #55152

stevengj · 2024-07-17T15:20:38Z

As discussed in #54393, codepoint(c) should succeed for overlong encodings, and whenever ismalformed(c) returns false. This should be backwards compatible since it simply removes an error, and should be strictly faster than before since it merely removes a call to Base.is_overlong_enc.

Also, Base.ismalformed and Base.isoverlong are declared public (but not yet exported) and are included in the manual, since they are referenced in the docstring of codepoint, etc. I also made Base.show_invalid a public and documented function, since it is referenced from the ismalformed docs and is required by new implementations of AbstractChar types that support malformed data.

Fixes #54343, closes #54393.
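
A minimal sketch of the intended behavior, not taken from the PR diff. The helper name `try_codepoint` and the sample characters are illustrative assumptions. The call is wrapped in `try`/`catch` so the snippet runs both with and without this change: without it, `codepoint` throws for the overlong input as well.

```julia
# Overlong two-byte encoding (0xc0 0xa0) of U+0020, and a malformed lone byte.
overlong  = first("\xc0\xa0")
malformed = first("\xff")

# Hypothetical helper: return the code point, or the exception that was thrown.
try_codepoint(c) = try
    codepoint(c)
catch err
    err
end

for (label, c) in ("overlong" => overlong, "malformed" => malformed)
    r = try_codepoint(c)
    if r isa Exception
        println(label, ": codepoint threw ", typeof(r))
    else
        println(label, ": codepoint returned ", repr(r))
    end
end
```

With the change described above, the overlong case is expected to return the code point of the space character, while the malformed case still throws.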

nhz2 · 2024-07-28T16:20:54Z

These changes seem to be at odds with other docs, because now codepoint(a) == codepoint(b) no longer implies a == b:

julia/base/char.jl

Lines 4 to 10 in 197295c

    
    The `AbstractChar` type is the supertype of all character implementations
    in Julia. A character represents a Unicode code point, and can be converted
    to an integer via the [`codepoint`](@ref) function in order to obtain the
    numerical value of the code point, or constructed from the same integer.
    These numerical values determine how characters are compared with `<` and `==`,
    for example.  New `T <: AbstractChar` types should define a `codepoint(::T)`
    method and a `T(::UInt32)` constructor, at minimum.

Also, the information that a Char is overlong will be silently destroyed by conversion to UInt32 instead of throwing an error.

julia/base/char.jl

Lines 40 to 44 in 197295c

    
    In order to losslessly represent arbitrary byte streams stored in a `String`,
    a `Char` value may store information that cannot be converted to a Unicode
    codepoint — converting such a `Char` to `UInt32` will throw an error.
    The [`isvalid(c::Char)`](@ref) function can be used to query whether `c`
    represents a valid Unicode character.

stevengj · 2024-07-28T19:54:05Z

"A character represents a Unicode code point"

The basic problem is that this is an oversimplification. The Char type represents the encoding, not just the codepoint, and can represent byte sequences that don’t encode Unicode code points.
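
To make this concrete, here is a small sketch (the names `plain`, `overlong` and `bits` are illustrative, not from the PR). It relies on `Char` being reinterpretable as a `UInt32` that holds the UTF-8 bytes left-aligned, which is how `Char` is laid out in current Julia:

```julia
plain    = ' '                   # U+0020 in its standard one-byte encoding
overlong = first("\xc0\xa0")     # the same code point, encoded in two bytes

# Hypothetical helper: show the raw 32 bits stored in a Char as hex.
bits(c::Char) = string(reinterpret(UInt32, c), base = 16, pad = 8)

println(bits(plain))             # encoding byte 0x20, left-aligned
println(bits(overlong))          # encoding bytes 0xc0 0xa0, left-aligned
println(plain == overlong)       # the two Chars compare by encoding
```

The two values store different encodings and compare as unequal, even though both encodings name the same code point.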

I updated the AbstractChar docs to be more accurate.

StefanKarpinski

Looks great to me

stevengj · 2025-01-02T18:55:14Z

Unrelated CI failure, updating and re-running CI.

base/char.jl

LilithHafner · 2025-01-02T19:26:49Z

The docstring of codepoint currently reads "...throw an exception if c does not represent a valid character...". That should be changed to "...represent a malformed character..." and link to the definition of invalid but not malformed.

StefanKarpinski · 2025-01-02T19:32:50Z

Triage likes but would also like for "malformed" to be documented somewhere and to adjust the docstring of codepoint to refer to malformed rather than invalid. @LilithHafner feels that it would be good to block the publicness of ismalformed on documentation of what it means, so maybe that's a good ordering:

1. Add docstring for ismalformed defining what it does
2. Make ismalformed public
3. Update codepoint docstring to refer to malformed versus invalid

An additional comment regarding equality and comparison:

- Valid strings are compared as lexicographically ordered sequences of code points
- A valid string and an invalid string must never be equal
- Comparison of invalid strings is implementation-defined and may error but should be an ordering:
  - Reflexive: s == s for all strings
  - Antisymmetric: s <= t and t <= s implies s == t
  - Transitive: s <= t and t <= u implies s <= u
  - Total: either s <= t or t <= s or both are an error

This allows each string type to define a total ordering on valid and invalid strings in a way that's efficient and consistent within the type, but comparisons of invalid strings across types can simply error since there's no sensible way to implement that and forcing it to be consistent would force valid comparisons to be done inefficiently.
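
As a rough illustration, the properties listed above can be spot-checked for the built-in `String` type on a handful of valid and invalid samples. This is only a sketch: the sample list and the helper `implies` are assumptions for illustration, and a finite check is not a proof.

```julia
# A mix of valid strings and strings containing overlong or malformed data.
samples = ["", " ", "a", "ab", "b", "\xc0\xa0", "\xe2\x88", "\xff"]

# Hypothetical helper: logical implication p => q.
implies(p, q) = !p || q

reflexive     = all(s == s for s in samples)
antisymmetric = all(implies(s <= t && t <= s, s == t) for s in samples, t in samples)
transitive    = all(implies(s <= t && t <= u, s <= u)
                    for s in samples, t in samples, u in samples)
total         = all(s <= t || t <= s for s in samples, t in samples)

# A valid string and an invalid string must never compare equal.
valid   = filter(isvalid, samples)
invalid = filter(s -> !isvalid(s), samples)
never_equal = all(s != t for s in valid, t in invalid)

println((; reflexive, antisymmetric, transitive, total, never_equal))
```

Comparisons between two `String` values do not error, so the "both are an error" branch of the totality rule is not exercised here; it would only matter for comparisons across different string types.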

stevengj · 2025-01-02T19:40:56Z

It seems like the triage requests were all already addressed.

1. ismalformed already has a docstring in this PR (and is included in the manual)
2. ismalformed is already public (but not exported) in this PR
3. the codepoint docs already refer to malformed rather than valid in this PR

Removing the "merge me" label, however, until it is clear that everyone is satisfied.

base/char.jl

stevengj · 2025-01-02T21:57:12Z

Windows build failure looks unrelated: ERROR: Unable to open agent private key path 'C:\secrets/agent.key'! Make sure your agent has this file deployed within it!

LilithHafner

I'd like a more complete/specific/accessible definition of malformed vs invalid, or a link to the specific part of the unicode standard that defines it; but I don't think that is blocking given the level of docs already in this PR.

inkydragon · 2025-01-03T03:31:45Z

"malformed vs invalid"

I didn't find a definition for either word, but did find definitions for their synonyms/antonyms:
Glossary of Unicode Terms

D84 Ill-formed: A Unicode code unit sequence that purports to be in a Unicode encoding form is called ill-formed if and only if it does not follow the specification of that Unicode encoding form.

Any code unit sequence that would correspond to a code point outside the defined range of Unicode scalar values would, for example, be ill-formed.

UTF-8 has some strong constraints on the possible byte ranges for leading and trailing bytes. A violation of those constraints would produce a code unit sequence that could not be mapped to a Unicode scalar value, resulting in an ill-formed code unit sequence.

xref:

D89 In a Unicode encoding form: ...

A Unicode string consisting of a well-formed UTF-8 code unit sequence is said to be in UTF-8. Such a Unicode string is referred to as a valid UTF-8 string, or a UTF-8 string for short.

LilithHafner · 2025-01-03T15:34:07Z

Assuming UTF-8,

Unicode specifies that any code unit sequence not listed in this table is ill-formed and not well-formed.

IIUC, our definition of malformed is different from Unicode's definition of ill-formed. For example, overlong characters are not Base.ismalformed but are ill-formed according to Unicode.
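
A short sketch of how the three Base predicates classify a valid, an overlong, and a truncated character (the sample characters and labels are illustrative). By Unicode's definition the last two would both be ill-formed, but only the last one is reported by `Base.ismalformed`:

```julia
chars = [
    "valid"     => ' ',                  # U+0020, standard encoding
    "overlong"  => first("\xc0\xa0"),    # two-byte encoding of U+0020
    "malformed" => first("\xe2\x88"),    # truncated three-byte sequence
]

for (label, c) in chars
    println(rpad(label, 10),
            " isvalid=", isvalid(c),
            " ismalformed=", Base.ismalformed(c),
            " isoverlong=", Base.isoverlong(c))
end
```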

make codepoint(c) work for overlong chars

0b6cf37

stevengj added the unicode Related to unicode characters and encodings label Jul 17, 2024

stevengj assigned StefanKarpinski Jul 17, 2024

add PR # to NEWS

28f59dd

stevengj added 4 commits July 30, 2024 19:10

clarify AbstractChar docs

1cb0291

Merge branch 'master' into codepoint_overlong

6df8031

Update char.jl: rm trailing whitespace

bb40574

Merge branch 'master' into codepoint_overlong

596469b

stevengj added the triage This should be discussed on a triage call label Jan 1, 2025

stevengj mentioned this pull request Jan 1, 2025

add hascodepoint(c::AbstractChar) and use it #54393

Open

Merge branch 'master' into codepoint_overlong

7a15f00

StefanKarpinski approved these changes Jan 2, 2025

Merge branch 'master' into codepoint_overlong

a60f0c6

stevengj added merge me PR is reviewed. Merge when all tests are passing and removed triage This should be discussed on a triage call labels Jan 2, 2025

LilithHafner reviewed Jan 2, 2025

move public decls to public.jl

f778fab

stevengj removed the merge me PR is reviewed. Merge when all tests are passing label Jan 2, 2025

LilithHafner reviewed Jan 2, 2025

stevengj added 2 commits January 2, 2025 15:46

Update base/char.jl

b8a06bd

not well-formed -> malformed

c717418

LilithHafner approved these changes Jan 2, 2025

stevengj added the merge me PR is reviewed. Merge when all tests are passing label Jan 3, 2025

LilithHafner removed the merge me PR is reviewed. Merge when all tests are passing label Jan 3, 2025
