Commit e22643e (1 parent: 0e7c154). Showing 1 changed file with 226 additions and 0 deletions.
# New API for conversion from UTF-8 to UV

## Preamble

    Author: K. H. Williamson <[email protected]>
    ID: RFC-0022
    Status: Draft

## Abstract

Introduce an API for converting UTF-8 to a code point that is more convenient
to use safely.

## Motivation

The existing API requires the caller to disambiguate between a NUL character
and malformed UTF-8, which callers tend not to do. The caller also has to take
extra steps to correctly implement current best practices for dealing with
malformed UTF-8.

## Rationale

This API has no ambiguity between success and failure, and its results always
follow current best practices. Other proposals were discarded in the pre-RFC
process:
[Pre-RFC: New C API for converting from UTF-8 to code point](http://nntp.perl.org/group/perl.perl5.porters/264207)

## Specification

I think it best to start with just the two functions that will be applicable in
almost all situations. The pod for these is:

    =for apidoc |bool|utf8_to_uvchr |const char * s|const char * e|UV *cp|Size_t * len
    =for apidoc |bool|utf8_to_uvchr_flags |const char * s|const char * e|UV *cp|Size_t * len|U32 flags
    =for apidoc_item |bool|utf8_to_uvchr_nowarn|const char * s|const char * e|UV *cp|Size_t * len

These each translate UTF-8 into UTF-32 (or UTF-64 on platforms where a UV is 64
bits long), returning <true> if the operation succeeded without problems, and
<false> otherwise. (On EBCDIC platforms, the input is considered to be
UTF-EBCDIC rather than UTF-8.)

They differ in how they handle problematic input.

More precisely, they each calculate the first code point represented by the
sequence of bytes bounded by <*s> .. <*(e - 1)>, interpreted as UTF-8.
<e> must be strictly greater than <s>; this is asserted in debugging
builds.

Since UTF-8 is a variable length encoding, the number of bytes examined also
varies. The algorithm is to first look at <*s>. If that represents a full
UTF-8 character, no other byte is examined, and the code point is calculated
from just it. If more bytes are required to represent a complete UTF-8
character, <*(s + 1)> is examined as well. If that isn't enough, <*(s + 2)> is
examined next, and so forth, up through <*(e - 1)>, quitting as soon as a
complete character is found.

If the input is valid, <true> is returned; the calculated code point is
stored in <*cp>; and the number of UTF-8 bytes it consumes from <s> is
stored in <*len>.

If the input is in some way problematic, <false> is returned; the Unicode
REPLACEMENT CHARACTER is stored in <*cp>; and the number of UTF-8 bytes
consumed from <s> is stored in <*len>. This number will always be > 0, and
is the correct number to add to <s> to continue examining the input for
subsequent characters. This behavior follows current best practices for
handling problematic UTF-8, which have evolved based on experiences with
security attacks using malformations.
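The return convention described above can be illustrated with a small stand-alone decoder. This is only a sketch, not Perl's implementation: `sketch_utf8_to_cp` and `SKETCH_REPLACEMENT_CHARACTER` are hypothetical names, only standard sequences of up to four bytes are understood, and the byte counts it reports for each kind of malformation are a simplification that the real functions need not share.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT_CHARACTER 0xFFFDu    /* U+FFFD */

/* Hypothetical stand-in, NOT Perl's implementation: decode the first code
 * point in the bytes s .. e - 1.  Surrogates and other reserved code points
 * are accepted; Perl's extended UTF-8 is not handled. */
static bool
sketch_utf8_to_cp(const unsigned char *s, const unsigned char *e,
                  uint32_t *cp, size_t *len)
{
    uint32_t value;     /* code point accumulated so far */
    uint32_t minimum;   /* smallest value that needs this many bytes */
    size_t   needed;    /* sequence length announced by the start byte */
    size_t   i;

    assert(e > s);      /* e must be strictly greater than s */

    if (*s < 0x80) {    /* a single byte is a full character */
        *cp = *s;
        *len = 1;
        return true;
    }
    else if ((*s & 0xE0) == 0xC0) { needed = 2; value = *s & 0x1F; minimum = 0x80;    }
    else if ((*s & 0xF0) == 0xE0) { needed = 3; value = *s & 0x0F; minimum = 0x800;   }
    else if ((*s & 0xF8) == 0xF0) { needed = 4; value = *s & 0x07; minimum = 0x10000; }
    else {              /* stray continuation or unsupported start byte */
        *cp = SKETCH_REPLACEMENT_CHARACTER;
        *len = 1;
        return false;
    }

    /* Examine *(s + 1), *(s + 2), ... only as far as necessary. */
    for (i = 1; i < needed; i++) {
        if (s + i == e || (s[i] & 0xC0) != 0x80) {
            /* The input ended early, or a non-continuation byte interrupted
             * the sequence: consume only the i bytes before that point. */
            *cp = SKETCH_REPLACEMENT_CHARACTER;
            *len = i;
            return false;
        }
        value = (value << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (value < minimum) {      /* overlong encoding */
        *cp = SKETCH_REPLACEMENT_CHARACTER;
        *len = needed;
        return false;
    }

    *cp = value;
    *len = needed;
    return true;
}
```

In this sketch, every path stores a value in `*cp` and a non-zero count in `*len`, so a caller can always advance by `*len` whether or not the call succeeded.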

UTF-8 syntax allows for the expression of 31 bit (30 in EBCDIC) code points.
But Unicode has deemed all those above U+10FFFF to be illegal, and reserves
certain others for internal use. Perl predates Unicode, and by default
considers the code points above Unicode to be valid, as well as the reserved
ones. Furthermore, Perl has created an extended UTF-8 (and UTF-EBCDIC) that
allows for the expression of code points up to 64 bits wide.
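The 31-bit figure follows from the byte layout of the original UTF-8 syntax, where a sequence may be up to six bytes long. The arithmetic can be sketched as below; `sketch_utf8_payload_bits` is a hypothetical helper written for this illustration, and it covers only the ASCII-platform layout, not UTF-EBCDIC or Perl's extended UTF-8.

```c
#include <assert.h>

/* Number of code point bits carried by an n-byte sequence in the original
 * UTF-8 syntax, for n = 1 .. 6.  Hypothetical helper for illustration. */
static unsigned
sketch_utf8_payload_bits(unsigned n)
{
    assert(n >= 1 && n <= 6);

    if (n == 1)
        return 7;       /* 0xxxxxxx */

    /* The start byte spends n one-bits plus a zero bit announcing the
     * length, which leaves 7 - n payload bits.  Each of the n - 1
     * continuation bytes (10xxxxxx) contributes 6 more. */
    return (7 - n) + 6 * (n - 1);
}
```

Lengths 1 through 6 give 7, 11, 16, 21, 26 and 31 bits. U+10FFFF needs 21 bits, so everything Unicode allows fits in four bytes.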

<utf8_to_uvchr> and <utf8_to_uvchr_nowarn> presume all the Perl extensions and
reserved code points are valid. They are suitable for use when there is no
need to worry about those being an issue. The other functions allow the caller
to control more precisely what inputs are considered valid.

Most callers of these functions will want to either croak on malformed input or
forge ahead (using the returned REPLACEMENT CHARACTER), depending on the
circumstances of the call. In the latter case, the results won't be "correct",
but will be as good as possible, and would be apparent to anyone examining the
outputs, as the REPLACEMENT CHARACTER has no use in Unicode other than to
signify such an error.

A typical use case for forging ahead no matter what would be:

    while (s < e) {
        UV cp;
        Size_t len;

        (void) utf8_to_uvchr(s, e, &cp, &len);
        // handle the code point

        s += len;
    }

And if the caller wants to do something different when the input isn't valid:

    while (s < e) {
        UV cp;
        Size_t len;

        if (utf8_to_uvchr(s, e, &cp, &len)) {
            // handle the code point
        }
        else {
            // croak or recover from the error
        }

        s += len;
    }

C<utf8_to_uvchr> will raise warnings for malformations if UTF8 warnings are
enabled; C<utf8_to_uvchr_nowarn> will never raise a warning.

Neither C<utf8_to_uvchr> nor C<utf8_to_uvchr_nowarn> will raise warnings for
the extended set of code points accepted by Perl.

In the unlikely event that you need the Unicode rather than the native code
point on an EBCDIC machine, modify the success case in the above example to:

    if (utf8_to_uvchr(s, e, &cp, &len)) {
        cp = NATIVE_TO_UNICODE(cp);
    }

(C<REPLACEMENT CHARACTER> is the same in both character sets, so the failure
case doesn't need to be modified.)

C<utf8_to_uvchr_flags> can be used to more finely control what classes of UTF-8
return <true> versus <false> and what classes raise warnings when encountered.

There are three classes (one of which has two subclasses) of code points that
can independently raise a warning when encountered, and/or be disallowed,
returning <false> with the code point set to REPLACEMENT CHARACTER.

First are the surrogate characters, withdrawn from general use by Unicode (and
now reserved for aid in specifying a different encoding, UTF-16). Including
the flags UTF8_DISALLOW_SURROGATE and/or UTF8_WARN_SURROGATE in the <flags>
parameter will respectively cause the function to return <false> when one is
encountered and to raise a warning, if either <utf8> or <surrogate> warnings
are enabled.

Second are the non-character code points. These are reserved by Unicode mostly
for use as sentinels. UTF8_DISALLOW_NONCHAR and UTF8_WARN_NONCHAR control the
behavior when these are encountered. Either the <utf8> or <nonchar> warning
must be enabled for warnings to actually be raised.

Third are the code points above the Unicode-allowed maximum of U+10FFFF.
These are called "supers" in Perl terminology. UTF8_DISALLOW_SUPER and
UTF8_WARN_SUPER control the behavior for these, with the warnings categories
<utf8> or <non_unicode>. Since it is a Perl-designed extension to express code
points using more than 31 bits, it is much less likely that a program written
in another language would understand these than the smaller ones, which were
acceptable until withdrawn from use by Unicode. Therefore, you can allow/not
warn on the smaller ones, while disallowing or warning on the Perl-extended
ones. Use UTF8_DISALLOW_PERL_EXTENDED and UTF8_WARN_PERL_EXTENDED. The
warnings categories are the same for all supers: <utf8> and <non_unicode>.
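The three classes can be told apart by simple range checks on the code point. The following is a sketch only: `sketch_classify`, `sketch_is_perl_extended` and the enum names are hypothetical, the ranges are the ones Unicode defines for surrogates and non-characters, and the "more than 31 bits" test assumes an ASCII platform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the classes described in the text. */
enum sketch_cp_class {
    SKETCH_CP_ORDINARY,
    SKETCH_CP_SURROGATE,
    SKETCH_CP_NONCHAR,
    SKETCH_CP_SUPER
};

static enum sketch_cp_class
sketch_classify(uint64_t cp)
{
    if (cp > 0x10FFFF)                      /* above the Unicode maximum */
        return SKETCH_CP_SUPER;
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* reserved for UTF-16 */
        return SKETCH_CP_SURROGATE;
    /* U+FDD0..U+FDEF, plus the last two code points of every plane */
    if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)
        return SKETCH_CP_NONCHAR;
    return SKETCH_CP_ORDINARY;
}

/* Subclass of the supers: code points that need more than 31 bits
 * (assumption: ASCII platform). */
static bool
sketch_is_perl_extended(uint64_t cp)
{
    return cp > UINT64_C(0x7FFFFFFF);
}

/* A few spot checks showing how the helpers are meant to be used. */
static void
sketch_classify_spot_checks(void)
{
    assert(sketch_classify(0x41)     == SKETCH_CP_ORDINARY);
    assert(sketch_classify(0xD800)   == SKETCH_CP_SURROGATE);
    assert(sketch_classify(0xFDD0)   == SKETCH_CP_NONCHAR);
    assert(sketch_classify(0x10FFFF) == SKETCH_CP_NONCHAR);
    assert(sketch_classify(0x110000) == SKETCH_CP_SUPER);
    assert(!sketch_is_perl_extended(0x110000));
    assert(sketch_is_perl_extended(UINT64_C(0x80000000)));
}
```

Note that U+FFFD, the REPLACEMENT CHARACTER itself, falls in none of these classes.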

To disallow and/or warn on all three categories at once, the shortcut flags
UTF8_DISALLOW_ILLEGAL_INTERCHANGE and/or UTF8_WARN_ILLEGAL_INTERCHANGE can be
used. Because Unicode changed its guidance on non-character code points in its
Corrigendum 9, there are UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE and
UTF8_WARN_ILLEGAL_C9_INTERCHANGE, which make illegal just the surrogates and
above-Unicode code points.

When a code point is disallowed, <false> is returned and C<*cp> is set to
REPLACEMENT CHARACTER.
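How the disallow flags combine can be modelled with a few bit masks. This is a sketch under stated assumptions: the `SKETCH_*` constants and `sketch_apply_disallow` are hypothetical, the bit values are invented for the illustration and are not Perl's, and warnings are not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values for illustration only; Perl's real
 * UTF8_DISALLOW_* flags are defined in its headers and need not match. */
#define SKETCH_DISALLOW_SURROGATE 0x1u
#define SKETCH_DISALLOW_NONCHAR   0x2u
#define SKETCH_DISALLOW_SUPER     0x4u
/* Corrigendum 9 variant: surrogates and above-Unicode code points only */
#define SKETCH_DISALLOW_ILLEGAL_C9_INTERCHANGE \
    (SKETCH_DISALLOW_SURROGATE | SKETCH_DISALLOW_SUPER)
/* All three classes at once */
#define SKETCH_DISALLOW_ILLEGAL_INTERCHANGE \
    (SKETCH_DISALLOW_ILLEGAL_C9_INTERCHANGE | SKETCH_DISALLOW_NONCHAR)

/* Returns false and overwrites *cp with U+FFFD when the already decoded
 * code point belongs to a class that `flags` disallows; otherwise leaves
 * *cp alone and returns true. */
static bool
sketch_apply_disallow(uint64_t *cp, unsigned flags)
{
    uint64_t c;
    bool surrogate, nonchar, super;

    assert(cp != NULL);
    c = *cp;

    super     = c > 0x10FFFF;
    surrogate = c >= 0xD800 && c <= 0xDFFF;
    nonchar   = !super
                && ((c >= 0xFDD0 && c <= 0xFDEF) || (c & 0xFFFE) == 0xFFFE);

    if (   (surrogate && (flags & SKETCH_DISALLOW_SURROGATE))
        || (nonchar   && (flags & SKETCH_DISALLOW_NONCHAR))
        || (super     && (flags & SKETCH_DISALLOW_SUPER)))
    {
        *cp = 0xFFFD;   /* REPLACEMENT CHARACTER */
        return false;
    }

    return true;
}
```

With the C9 mask a non-character such as U+FFFE passes through unchanged, while the full interchange mask rejects it.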

The same basic logic is used for this function. For example,

    while (s < e) {
        UV cp;
        Size_t len;

        if (utf8_to_uvchr_flags(s, e, &cp, &len,
                                (UTF8_DISALLOW_ILLEGAL_INTERCHANGE
                                 | UTF8_WARN_ILLEGAL_INTERCHANGE)))
        {
            // handle the code point
        }
        else {
            // croak or recover from the error
        }

        s += len;
    }

## Backwards Compatibility

This is a new interface, for which I will add support in Devel::PPPort. After
that is done, I will issue pull requests to the relatively few places on CPAN
that use the current API.

## Security Implications

This aims to remove any existing security flaws, and to make it easy to fix any
new ones that may come along, without any XS changes.

## Examples

See the Specification.

## Prototype Implementation

None; this is just an alternative API to the existing implementation.

## Future Scope

After the above is worked through, several more specific functions need to be
added, for situations where non-typical handling is required.

## Rejected Ideas

See [Pre-RFC: New C API for converting from UTF-8 to code point](http://nntp.perl.org/group/perl.perl5.porters/264207)

## Open Issues

Use this to summarise any points that are still to be resolved.

## Copyright

Copyright (C) 2022, K. H. Williamson

This document and code and documentation within it may be used, redistributed and/or modified under the same terms as Perl itself.