New API for translating UTF-8 to UV

Perl · Oct 19, 2022 · e22643e
1 parent 0e7c154
Showing 1 changed file with 226 additions and 0 deletions.
diff --git a/md b/md
@@ -0,0 +1,226 @@
+# New API for conversion from UTF-8 to UV
+
+## Preamble
+
+    Author:  K. H. Williamson <[email protected]>
+    ID:      RFC-0022
+    Status:  Draft
+
+
+## Abstract
+
+Introduce an API for converting UTF-8 to a code point (UV) that is more
+convenient to use safely than the existing one.
+
+## Motivation
+
+The existing API requires callers to disambiguate between a NUL character and
+malformed UTF-8, which they tend not to do.  Callers also have to take extra
+steps to correctly implement current best practices for dealing with
+malformed UTF-8.
+
+## Rationale
+
+This API has no ambiguity between success and failure, and what it returns
+always follows current best practices for handling malformed input.  Other
+proposals were discarded in the pre-RFC process:
+[Pre-RFC: New C API for converting from UTF-8 to code point](http://nntp.perl.org/group/perl.perl5.porters/264207)
+
+## Specification
+
+I think it best to start with just the two functions that will be applicable in
+almost all situations.  The pod for these is
+
+=for apidoc      |bool|utf8_to_uvchr       |const char * s|const char * e|UV *cp|Size_t * len
+=for apidoc      |bool|utf8_to_uvchr_flags |const char * s|const char * e|UV *cp|Size_t * len|U32 flags
+=for apidoc_item |bool|utf8_to_uvchr_nowarn|const char * s|const char * e|UV *cp|Size_t * len
+
+These each translate UTF-8 into UTF-32 (or UTF-64 on platforms where a UV is 64
+bits long), returning <true> if the operation succeeded without problems, and
+<false> otherwise.  (On EBCDIC platforms, the input is considered to be
+UTF-EBCDIC rather than UTF-8.)
+
+They differ in how they handle problematic input.
+
+More precisely, they each calculate the first code point represented by the
+sequence of bytes bounded by <*s> .. <*(e - 1)>, interpreted as UTF-8.
+<e> must be strictly greater than <s>; this is asserted in debugging
+builds.
+
+Since UTF-8 is a variable length encoding, the number of bytes examined also
+varies.  The algorithm is to first look at <*s>.  If that represents a full
+UTF-8 character, no other byte is examined, and the code point is calculated
+from just it.  If more bytes are required to represent a complete UTF-8
+character, <*(s + 1)> is examined as well.  If that isn't enough, <*(s + 2)> is
+examined next, and so forth, up through <*(e - 1)>, quitting as soon as a
+complete character is found.
+
+If the input is valid, <true> is returned; the calculated code point is
+stored in <*cp>; and the number of UTF-8 bytes it consumes from <s> is
+stored in <*len>.
+
+If the input is in some way problematic, <false> is returned; the Unicode
+REPLACEMENT CHARACTER is stored in <*cp>; and the number of UTF-8 bytes
+consumed from <s> is stored in <*len>.  This number will always be > 0, and
+is the correct number to add to <s> to continue examining the input for
+subsequent characters.  This behavior follows current best practices for
+handling problematic UTF-8, which have evolved based on experiences with
+security attacks using malformations.
+
+UTF-8 syntax allows for the expression of 31 bit (30 in EBCDIC) code points.
+But Unicode has deemed all those above U+10FFFF to be illegal, and reserves
+certain others for internal use.  Perl predates Unicode, and by default
+considers the code points above Unicode to be valid, as well as the reserved
+ones.  Furthermore, Perl has created an extended UTF-8 (and UTF-EBCDIC) that
+allows for the expression of code points up to 64 bits wide.
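+
+As a small worked check of the 31-bit figure for ASCII platforms: in the
+original UTF-8 design a sequence of <n> bytes (2 to 6) carries <7 - n> payload
+bits in its lead byte plus 6 bits in each continuation byte.  The helper name
+<sketch_payload_bits> below is hypothetical and is not part of Perl or of this
+proposal; it covers neither UTF-EBCDIC nor Perl's extended sequences.

```c
#include <assert.h>

/* Hypothetical helper, for illustration only: payload bits carried by an
 * n-byte sequence in the original (1 to 6 byte) UTF-8 design. */
static int sketch_payload_bits(int n)
{
    assert(n >= 1 && n <= 6);

    if (n == 1)
        return 7;                   /* 0xxxxxxx: a single ASCII-range byte */

    /* The lead byte keeps (7 - n) payload bits; each of the (n - 1)
     * continuation bytes (10xxxxxx) contributes 6 more. */
    return (7 - n) + 6 * (n - 1);
}
```
+
+With this formula a 4-byte sequence carries 21 bits, which is enough for
+U+10FFFF, and a 6-byte sequence carries the 31 bits mentioned above.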
+
+<utf8_to_uvchr> and <utf8_to_uvchr_nowarn> presume all the Perl extensions and
+reserved code points are valid.  They are suitable for use when there is no
+need to worry about those being an issue.  The other functions allow the caller
+to control more precisely what inputs are considered valid.
+
+Most callers of these functions will want to either croak on malformed input or
+forge ahead (using the returned REPLACEMENT CHARACTER), depending on the
+circumstances of the call.  In the latter case, the results won't be "correct",
+but will be as good as possible, and would be apparent to anyone examining the
+outputs, as the REPLACEMENT CHARACTER has no use in Unicode other than to
+signify such an error.
+
+A typical use case for forging ahead no matter what would be:
+
+ while (s < e) {
+     UV cp;
+     Size_t len;
+
+     (void) utf8_to_uvchr(s, e, &cp, &len);
+     // handle the code point
+
+     s += len;
+ }
+
+And if the caller wants to do something different when the input isn't valid:
+
+ while (s < e) {
+     UV cp;
+     Size_t len;
+
+     if (utf8_to_uvchr(s, e, &cp, &len)) {
+        // handle the code point
+     }
+     else {
+        // croak or recover from the error
+     }
+
+     s += len;
+ }
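+
+To experiment with this calling convention outside of Perl, the following is a
+minimal stand-in sketch.  It is NOT the proposed implementation: the names
+<sketch_utf8_to_uv>, <sketch_count_bad> and <SKETCH_REPLACEMENT> are
+hypothetical, plain C types replace <UV> and <Size_t>, and the decoder is
+deliberately simplified.  It accepts only standard 1 to 4 byte UTF-8 up to
+U+10FFFF (so none of the Perl extensions), never warns, and the number of
+bytes it consumes on failure is an assumption that may differ from what Perl
+would do.  What it does mirror is the contract: a boolean result, the
+REPLACEMENT CHARACTER on failure, and a length that is always greater than 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu   /* U+FFFD REPLACEMENT CHARACTER */

/* Hypothetical stand-in for the proposed contract; not Perl's decoder. */
static bool sketch_utf8_to_uv(const unsigned char *s, const unsigned char *e,
                              uint64_t *cp, size_t *len)
{
    /* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
    static const uint64_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t need, i;
    uint64_t v;

    assert(s < e);                   /* e must be strictly greater than s */

    if (s[0] < 0x80) {               /* one byte; this includes NUL */
        *cp = s[0]; *len = 1; return true;
    }
    if      ((s[0] & 0xE0) == 0xC0) { need = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; v = s[0] & 0x07; }
    else {                           /* stray continuation or unsupported lead */
        *cp = SKETCH_REPLACEMENT; *len = 1; return false;
    }

    for (i = 1; i < need; i++) {
        if (s + i >= e || (s[i] & 0xC0) != 0x80) {
            /* Truncated or interrupted: consume only the bytes accepted so
             * far, so the caller resumes at the offending position. */
            *cp = SKETCH_REPLACEMENT; *len = i; return false;
        }
        v = (v << 6) | (uint64_t)(s[i] & 0x3F);
    }

    if (v < min_cp[need] || v > 0x10FFFF) {   /* overlong or above Unicode */
        *cp = SKETCH_REPLACEMENT; *len = need; return false;
    }
    *cp = v; *len = need; return true;
}

/* Forge ahead over a whole buffer; returns how many sequences were replaced
 * and stores the total number of code points produced in *n_chars. */
static size_t sketch_count_bad(const unsigned char *s, const unsigned char *e,
                               size_t *n_chars)
{
    size_t bad = 0;

    *n_chars = 0;
    while (s < e) {
        uint64_t cp;
        size_t len;

        if (!sketch_utf8_to_uv(s, e, &cp, &len))
            bad++;                   /* cp holds SKETCH_REPLACEMENT here */
        (*n_chars)++;
        s += len;                    /* len > 0, so the loop always advances */
    }
    return bad;
}
```
+
+In this sketch a NUL byte in the input yields <true> with a code point of 0,
+so it cannot be confused with a failure, which is the ambiguity the proposal
+sets out to remove.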
+
+C<utf8_to_uvchr> will raise warnings for malformations if <utf8> warnings are
+enabled; C<utf8_to_uvchr_nowarn> will never raise a warning.
+
+Neither C<utf8_to_uvchr> nor C<utf8_to_uvchr_nowarn> will raise warnings for
+the extended set of code points accepted by Perl.
+
+In the unlikely event that you need the Unicode code point rather than the
+native one on an EBCDIC machine, modify the success case in the above example
+to:
+
+     if (utf8_to_uvchr(s, e, &cp, &len)) {
+        cp = NATIVE_TO_UNICODE(cp);
+     }
+
+(C<REPLACEMENT CHARACTER> is the same in both character sets, so the failure
+case doesn't need to be modified.)
+
+C<utf8_to_uvchr_flags> can be used to more finely control what classes of UTF-8
+return <true> versus <false> and what classes raise warnings when encountered.
+
+There are three classes (one of which has two subclasses) of code points that
+can independently raise a warning when encountered, and/or be disallowed,
+returning <false> with the code point set to REPLACEMENT CHARACTER.
+
+First are the surrogate characters, withdrawn from general use by Unicode (and
+now reserved for aid in specifying a different encoding, UTF-16).  Including
+the flag UTF8_DISALLOW_SURROGATE in the <flags> parameter causes the function
+to return <false> when one is encountered; including UTF8_WARN_SURROGATE
+causes it to raise a warning, provided either <utf8> or <surrogate> warnings
+are enabled.
+
+Second are the non-character code points.  These are reserved by Unicode mostly
+for use as sentinels.  UTF8_DISALLOW_NONCHAR and UTF8_WARN_NONCHAR control the
+behavior when these are encountered.  Either the <utf8> or <nonchar> warning
+must be enabled for warnings to actually be raised.
+
+Third are the code points above the Unicode-allowed maximum of U+10FFFF.
+These are called "supers" in Perl terminology.  UTF8_DISALLOW_SUPER and
+UTF8_WARN_SUPER control the behavior for these.  Expressing code points that
+need more than 31 bits is a Perl-designed extension, so a program written in
+another language is much less likely to understand these than the smaller
+supers, which were acceptable until withdrawn from use by Unicode.  Therefore,
+you can allow (or not warn on) the smaller ones while disallowing or warning on
+the Perl-extended ones, using UTF8_DISALLOW_PERL_EXTENDED and
+UTF8_WARN_PERL_EXTENDED.  The warnings categories are the same for all supers:
+<utf8> and <non_unicode>.
+
+To disallow and/or warn on all three categories at once, the shortcut flags
+UTF8_DISALLOW_ILLEGAL_INTERCHANGE and/or UTF8_WARN_ILLEGAL_INTERCHANGE can be
+used.  Because Unicode changed its guidance on non-character code points in its
+Corrigendum 9, there are also UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE and
+UTF8_WARN_ILLEGAL_C9_INTERCHANGE, which make illegal just the surrogates and
+the above-Unicode code points.
+
+When a code point is disallowed, <false> is returned and C<*cp> is set to
+REPLACEMENT CHARACTER.
+
+The same basic loop structure applies to this function.  For example:
+
+ while (s < e) {
+     UV cp;
+     Size_t len;
+
+     if (utf8_to_uvchr_flags(s, e, &cp, &len, (UTF8_DISALLOW_ILLEGAL_INTERCHANGE
+                                             | UTF8_WARN_ILLEGAL_INTERCHANGE))
+     {
+        // handle the code point
+     }
+     else {
+        // croak or recover from the error
+     }
+
+     s += len;
+ }
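+
+The three classes of problematic code points described above can be sketched
+as a classification step followed by a mask test.  This is only an
+illustration: the <SKETCH_*> names and their bit values are hypothetical (the
+real UTF8_DISALLOW_* values are defined by Perl and are not reproduced here),
+the warning flags are left out, and the Perl-extended subclass is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits, chosen only for this sketch. */
enum {
    SKETCH_DISALLOW_SURROGATE = 1 << 0,
    SKETCH_DISALLOW_NONCHAR   = 1 << 1,
    SKETCH_DISALLOW_SUPER     = 1 << 2,

    /* Corrigendum 9 style: surrogates and supers, but not non-characters. */
    SKETCH_DISALLOW_ILLEGAL_C9_INTERCHANGE =
        SKETCH_DISALLOW_SURROGATE | SKETCH_DISALLOW_SUPER,

    /* All three classes at once. */
    SKETCH_DISALLOW_ILLEGAL_INTERCHANGE =
        SKETCH_DISALLOW_ILLEGAL_C9_INTERCHANGE | SKETCH_DISALLOW_NONCHAR
};

/* Return the flag bit of the class cp belongs to, or 0 if it is ordinary. */
static unsigned sketch_problem_class(uint64_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)        /* UTF-16 surrogates */
        return SKETCH_DISALLOW_SURROGATE;
    if (cp > 0x10FFFF)                       /* "supers": above Unicode */
        return SKETCH_DISALLOW_SUPER;
    if ((cp >= 0xFDD0 && cp <= 0xFDEF)       /* the contiguous block of 32 */
        || (cp & 0xFFFE) == 0xFFFE)          /* last two of every plane */
        return SKETCH_DISALLOW_NONCHAR;
    return 0;
}

/* A code point is allowed unless the caller's flags disallow its class. */
static bool sketch_is_allowed(uint64_t cp, unsigned flags)
{
    return (sketch_problem_class(cp) & flags) == 0;
}
```
+
+With these definitions, U+FFFE is allowed under the C9 shortcut but rejected
+under the full ILLEGAL_INTERCHANGE shortcut, while a surrogate such as U+D800
+is rejected by both.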
+
+## Backwards Compatibility
+
+This is a new interface, for which I will add support in Devel::PPPort.  After
+that is done, I will issue pull requests to the relatively few places on CPAN
+that use the current API.
+
+## Security Implications
+
+This aims to remove any existing security flaws, and to make it easy to fix any
+new ones that may come along, without any XS changes.
+
+## Examples
+
+See the Specification
+
+## Prototype Implementation
+
+None; this is just an alternative API to the existing implementation
+
+## Future Scope
+
+After the above is worked through, several more specific functions need to be
+added, for situations where non-typical handling is required.
+
+## Rejected Ideas
+
+See [Pre-RFC: New C API for converting from UTF-8 to code point](http://nntp.perl.org/group/perl.perl5.porters/264207)
+
+## Open Issues
+
+Use this to summarise any points that are still to be resolved.
+
+## Copyright
+
+Copyright (C) 2022, K. H. Williamson
+
+This document and code and documentation within it may be used, redistributed and/or modified under the same terms as Perl itself.
+