New API for translating UTF-8 to UV
khwilliamson committed Oct 19, 2022
1 parent 0e7c154 commit e22643e
# New API for conversion from UTF-8 to UV

## Preamble

Author: K. H. Williamson <[email protected]>
ID: RFC-0022
Status: Draft


## Abstract

Introduce an API that is more convenient to use safely.

## Motivation

The existing API requires the caller to disambiguate between a NUL character
and malformed UTF-8, which callers tend not to do. The caller also has to take
extra steps to correctly implement current best practices for dealing with
malformed UTF-8.
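
The ambiguity can be illustrated with a toy decoder written in the same style. This is only a sketch: `toy_decode_legacy` is a hypothetical name, it handles just 1- and 2-byte sequences, and it is not the existing Perl function. It shows how a return value of 0 can mean either a genuine NUL or a failure, so the caller must remember to check a second output.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical legacy-style decoder (NOT a Perl API): returns the code
 * point, or 0 on failure.  Only 1- and 2-byte sequences are handled, to
 * keep the sketch short. */
static uint32_t toy_decode_legacy(const unsigned char *s,
                                  const unsigned char *e,
                                  ptrdiff_t *retlen)
{
    if (s[0] < 0x80) {                      /* ASCII, including NUL */
        *retlen = 1;
        return s[0];
    }
    if (s[0] >= 0xC2 && s[0] <= 0xDF
        && e - s >= 2 && (s[1] & 0xC0) == 0x80)
    {
        *retlen = 2;
        return ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
    }
    *retlen = -1;                           /* the only sign of failure */
    return 0;                               /* same value as a real NUL */
}
```

A caller that tests only the return value cannot tell the byte `0x00` from the stray continuation byte `0x80`; both yield 0, and only `*retlen` differs.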

## Rationale

This API has no ambiguity between success and failure, and always returns
based on best practices. Other proposals were discarded in the pre-RFC process:
[Pre-RFC: New C API for converting from UTF-8 to code point](http://nntp.perl.org/group/perl.perl5.porters/264207)

## Specification

I think it best to start with just the two functions that will be applicable in
almost all situations. The pod for these is

=for apidoc |bool|utf8_to_uvchr |const char * s|const char * e|UV *cp|Size_t * len
=for apidoc |bool|utf8_to_uvchr_flags |const char * s|const char * e|UV *cp|Size_t * len|U32 flags
=for apidoc_item |bool|utf8_to_uvchr_nowarn|const char * s|const char * e|UV *cp|Size_t * len

These each translate UTF-8 into UTF-32 (or UTF-64 on platforms where a UV is 64
bits long), returning <true> if the operation succeeded without problems, and
<false> otherwise. (On EBCDIC platforms, the input is considered to be
UTF-EBCDIC rather than UTF-8.)

They differ in how they handle problematic input.

More precisely, they each calculate the first code point represented by the
sequence of bytes bounded by <*s> .. <*(e - 1)>, interpreted as UTF-8.
<e> must be strictly greater than <s>; this is asserted in debugging
builds.

Since UTF-8 is a variable length encoding, the number of bytes examined also
varies. The algorithm is to first look at <*s>. If that represents a full
UTF-8 character, no other byte is examined, and the code point is calculated
from just it. If more bytes are required to represent a complete UTF-8
character, <*(s + 1)> is examined as well. If that isn't enough, <*(s + 2)> is
examined next, and so forth, up through <*(e - 1)>, quitting as soon as a
complete character is found.

If the input is valid, <true> is returned; the calculated code point is
stored in <*cp>; and the number of UTF-8 bytes it consumes from <s> is
stored in <*len>.

If the input is in some way problematic, <false> is returned; the Unicode
REPLACEMENT CHARACTER is stored in <*cp>; and the number of UTF-8 bytes
consumed from <s> is stored in <*len>. This number will always be > 0, and
is the correct number to add to <s> to continue examining the input for
subsequent characters. This behavior follows current best practices for
handling problematic UTF-8, which have evolved based on experiences with
security attacks using malformations.
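
The contract described above (a boolean result, a code point that is always usable, and a length that is always greater than zero) can be sketched in standalone C. Everything here is an assumption made for illustration: `sketch_utf8_to_cp` and `SKETCH_REPLACEMENT` are hypothetical names, only standard 1- to 4-byte UTF-8 is handled, and the length reported for an overlong sequence is a simplification. It is not the Perl implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu  /* U+FFFD REPLACEMENT CHARACTER */

/* Hypothetical stand-in, NOT the Perl implementation.  Handles only
 * standard 1- to 4-byte UTF-8; Perl's extended forms are ignored.
 * Surrogates are accepted, as in the default behavior described above.
 * Requires s < e. */
static bool sketch_utf8_to_cp(const unsigned char *s, const unsigned char *e,
                              uint32_t *cp, size_t *len)
{
    /* Smallest code point that legitimately needs N bytes. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t need, i;
    uint32_t v;

    if (s[0] < 0x80) {                        /* single-byte character */
        *cp = s[0]; *len = 1;
        return true;
    }
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) { need = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0)        { need = 3; v = s[0] & 0x0F; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { need = 4; v = s[0] & 0x07; }
    else {                                    /* not a valid start byte */
        *cp = SKETCH_REPLACEMENT; *len = 1;
        return false;
    }

    for (i = 1; i < need; i++) {
        if (s + i >= e || (s[i] & 0xC0) != 0x80) {
            /* Truncated, or next byte is not a continuation: consume
             * only the bytes examined so far (always at least one). */
            *cp = SKETCH_REPLACEMENT; *len = i;
            return false;
        }
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (v < min_cp[need] || v > 0x10FFFF) {   /* overlong or above Unicode */
        *cp = SKETCH_REPLACEMENT; *len = need;
        return false;
    }

    *cp = v; *len = need;
    return true;
}
```

Because `*len` is at least 1 even on failure, a caller can always advance by `*len` without risking an infinite loop.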

UTF-8 syntax allows for the expression of 31-bit (30 in EBCDIC) code points.
But Unicode has deemed all those above U+10FFFF to be illegal, and reserves
certain others for internal use. Perl predates Unicode, and by default
considers the code points above Unicode to be valid, as well as the reserved
ones. Furthermore, Perl has created an extended UTF-8 (and UTF-EBCDIC) that
allows for the expression of code points up to 64 bits wide.
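
To make the 31-bit limit concrete, the following sketch computes how many bytes the original UTF-8 syntax needs for a given code point on an ASCII platform. The helper name `sketch_utf8_len` is hypothetical, and the sketch deliberately stops at 31 bits; Perl's extended forms for larger values and the UTF-EBCDIC layout are not modeled.

```c
#include <stdint.h>

/* Number of bytes the original UTF-8 syntax needs to express cp, or 0 if
 * cp does not fit in 31 bits.  Illustrative helper only, not a Perl API;
 * ASCII platforms only. */
static int sketch_utf8_len(uint64_t cp)
{
    if (cp <= 0x7F)       return 1;   /*  7 bits */
    if (cp <= 0x7FF)      return 2;   /* 11 bits */
    if (cp <= 0xFFFF)     return 3;   /* 16 bits */
    if (cp <= 0x1FFFFF)   return 4;   /* 21 bits */
    if (cp <= 0x3FFFFFF)  return 5;   /* 26 bits */
    if (cp <= 0x7FFFFFFF) return 6;   /* 31 bits */
    return 0;                         /* needs an extended form */
}
```

Note that U+10FFFF, the Unicode maximum, and U+110000, the first above-Unicode code point, both need four bytes, so the Unicode limit cannot be detected from the sequence length alone.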

<utf8_to_uvchr> and <utf8_to_uvchr_nowarn> presume all the Perl extensions and
reserved code points are valid. They are suitable for use when there is no
need to worry about those being an issue. <utf8_to_uvchr_flags> allows the caller
to control more precisely what inputs are considered valid.

Most callers of these functions will want to either croak on malformed input or
forge ahead (using the returned REPLACEMENT CHARACTER), depending on the
circumstances of the call. In the latter case, the results won't be "correct",
but will be as good as possible, and would be apparent to anyone examining the
outputs, as the REPLACEMENT CHARACTER has no use in Unicode other than to
signify such an error.

A typical use case for forging ahead no matter what would be:

    while (s < e) {
        UV cp;
        Size_t len;

        (void) utf8_to_uvchr(s, e, &cp, &len);
        // handle the code point

        s += len;
    }

And if the caller wants to do something different when the input isn't valid:

    while (s < e) {
        UV cp;
        Size_t len;

        if (utf8_to_uvchr(s, e, &cp, &len)) {
            // handle the code point
        }
        else {
            // croak or recover from the error
        }

        s += len;
    }

C<utf8_to_uvchr> will raise warnings for malformations if UTF8 warnings are
enabled; C<utf8_to_uvchr_nowarn> will never raise a warning.

Neither C<utf8_to_uvchr> nor C<utf8_to_uvchr_nowarn> will raise warnings for
the extended set of code points accepted by Perl.

In the unlikely event that you need the Unicode rather than the native code
point on an EBCDIC machine, modify the success case in the above example to:

    if (utf8_to_uvchr(s, e, &cp, &len)) {
        cp = NATIVE_TO_UNICODE(cp);
    }

(C<REPLACEMENT CHARACTER> is the same in both character sets, so the failure
case doesn't need to be modified.)

C<utf8_to_uvchr_flags> can be used to more finely control what classes of UTF-8
return <true> versus <false> and what classes raise warnings when encountered.

There are three classes (one of which has two subclasses) of code points that
can independently raise a warning when encountered, and/or be disallowed,
returning <false> with the code point set to REPLACEMENT CHARACTER.

First are the surrogate characters, withdrawn from general use by Unicode (and
now reserved for aid in specifying a different encoding, UTF-16). Including
the flags UTF8_DISALLOW_SURROGATE and/or UTF8_WARN_SURROGATE in the <flags>
parameter will respectively cause the function to return <false> when one is
encountered, and to raise a warning if either <utf8> or <surrogate> warnings
are enabled.

Second are the non-character code points. These are reserved by Unicode mostly
for use as sentinels. UTF8_DISALLOW_NONCHAR and UTF8_WARN_NONCHAR control the
behavior when these are encountered. Either the <utf8> or <nonchar> warning
must be enabled for warnings to actually be raised.

Third are the code points above the Unicode-allowed maximum of U+10FFFF.
These are called "supers" in Perl terminology. UTF8_DISALLOW_SUPER and
UTF8_WARN_SUPER control the behavior for these, with the warnings categories
<utf8> or <non_unicode>. Expressing code points that need more than 31 bits is
a Perl-designed extension, so a program written in another language is much
less likely to understand these than the smaller supers, which were
acceptable until withdrawn from use by Unicode. Therefore, you can allow (or
not warn on) the smaller ones while disallowing or warning on the
Perl-extended ones, using UTF8_DISALLOW_PERL_EXTENDED and
UTF8_WARN_PERL_EXTENDED. The warnings categories are the same for all supers:
<utf8> and <non_unicode>.

To disallow and/or warn on all three categories at once, the shortcut flags
UTF8_DISALLOW_ILLEGAL_INTERCHANGE and/or UTF8_WARN_ILLEGAL_INTERCHANGE can be
used. Because Unicode changed its guidance on non-character code points in its
Corrigendum 9, there are also UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE and
UTF8_WARN_ILLEGAL_C9_INTERCHANGE, which make illegal just the surrogates and
the above-Unicode code points.

When a code point is disallowed, <false> is returned and C<*cp> is set to
REPLACEMENT CHARACTER.
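
The three classes can be expressed as simple range tests on the decoded code point. The sketch below is an illustration under stated assumptions: the `SK_*` flag values and the `sk_*` helpers are hypothetical names invented here (the real `UTF8_DISALLOW_*` values are defined by Perl and are not reproduced), and the Perl-extended subclass and the warning flags are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch only; they merely mirror the
 * relationships between the UTF8_DISALLOW_* flags described above. */
enum {
    SK_DISALLOW_SURROGATE = 1 << 0,
    SK_DISALLOW_NONCHAR   = 1 << 1,
    SK_DISALLOW_SUPER     = 1 << 2,
    /* Corrigendum 9 style: surrogates and supers only. */
    SK_DISALLOW_ILLEGAL_C9_INTERCHANGE =
        SK_DISALLOW_SURROGATE | SK_DISALLOW_SUPER,
    /* All three classes. */
    SK_DISALLOW_ILLEGAL_INTERCHANGE =
        SK_DISALLOW_ILLEGAL_C9_INTERCHANGE | SK_DISALLOW_NONCHAR
};

/* Surrogates: U+D800 .. U+DFFF. */
static bool sk_is_surrogate(uint64_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Non-characters: U+FDD0 .. U+FDEF, plus the last two code points of
 * each of the 17 Unicode planes (U+xxFFFE and U+xxFFFF). */
static bool sk_is_nonchar(uint64_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF)
        || (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}

/* Supers: anything above the Unicode maximum. */
static bool sk_is_super(uint64_t cp)
{
    return cp > 0x10FFFF;
}

/* True if cp is acceptable under the given disallow flags. */
static bool sk_is_allowed(uint64_t cp, unsigned flags)
{
    if ((flags & SK_DISALLOW_SURROGATE) && sk_is_surrogate(cp)) return false;
    if ((flags & SK_DISALLOW_NONCHAR)   && sk_is_nonchar(cp))   return false;
    if ((flags & SK_DISALLOW_SUPER)     && sk_is_super(cp))     return false;
    return true;
}
```

With these definitions, U+FFFE is accepted under the C9 combination but rejected under the full interchange combination, which is the distinction the two shortcut flags are meant to capture.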

The same basic logic is used for this function. For example,

    while (s < e) {
        UV cp;
        Size_t len;

        if (utf8_to_uvchr_flags(s, e, &cp, &len,
                                (UTF8_DISALLOW_ILLEGAL_INTERCHANGE
                                 | UTF8_WARN_ILLEGAL_INTERCHANGE)))
        {
            // handle the code point
        }
        else {
            // croak or recover from the error
        }

        s += len;
    }

## Backwards Compatibility

This is a new interface, for which I will add support in Devel::PPPort. After
that is done, I will issue pull requests to the relatively few places in CPAN
that use the current API.

## Security Implications

This aims to remove any existing security flaws, and to make it easy to fix any
new ones that may come along, without any XS changes.

## Examples

See the Specification

## Prototype Implementation

None; this is just an alternative API to the existing implementation

## Future Scope

After the above is worked through, several more specific functions need to be
added, for situations where non-typical handling is required.

## Rejected Ideas

See [Pre-RFC: New C API for converting from UTF-8 to code point](http://nntp.perl.org/group/perl.perl5.porters/264207)

## Open Issues

Use this to summarise any points that are still to be resolved.

## Copyright

Copyright (C) 2022, K. H. Williamson

This document and code and documentation within it may be used, redistributed and/or modified under the same terms as Perl itself.
