-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make codepoint(c) work for overlong chars #55152
base: master
Are you sure you want to change the base?
Conversation
These changes seems to be at odds with other docs, because now Lines 4 to 10 in 197295c
Also, the information that a Lines 40 to 44 in 197295c
|
The basic problem is that this is an oversimplification. The I updated the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me
Unrelated CI failure, updating and re-running CI. |
The docstring of |
Triage likes but would also like for "malformed" to be documented somewhere and to adjust the docstring of codepoint to refer to malformed rather than invalid. @LilithHafner feels that it would be good to block the publicness of
An additional comment regarding equality and comparison:
This allows each string type to define a total ordering on valid and invalid strings in a way that's efficient and consistent within the type, but comparisons of invalid strings across types can simply error since there's no sensible way to implement that and forcing it to be consistent would force valid comparisons to be done inefficiently. |
It seems like the triage requests were all already addressed.
Removing the "merge me" label, however, until it is clear that everyone is satisfied. |
Windows build failure looks unrelated: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like a more complete/specific/accessible definition of malformed vs invalid, or a link to the specific part of the unicode standard that defines it; but I don't think that is blocking given the level of docs already in this PR.
I didn't find a definition for either word, but did find definitions for their synonyms/antonyms..
xref:
|
Assuming UTF-8, Unicode specifies that any code unit sequence not listed in this table is ill-formed and not well-formed. IIUC, our definition of malformed is different from Unicode's definition of ill-formed. For example, overlong characters are not |
As discussed in #54393,
codepoint(c)
should succeed for overlong encodings, and wheneverismalformed(c)
returnsfalse
. This should be backwards compatible since it simply removes an error, and should be strictly faster than before since it merely removes a call toBase.is_overlong_enc
.Also,
Base.ismalformed
andBase.isoverlong
are declaredpublic
(but not yet exported) and are included in the manual, since they are referenced in the docstring ofcodepoint
etcetera. I also madeBase.show_invalid
a
public
and documented function, since it is referenced from theismalformed
docs and is required by new implementations ofAbstractChar
types that support malformed data.Fixes #54343, closes #54393.