Big5 encoding mishandles some trailing bytes, with possible XSS #171
Heya, welcome! This is by design: not consuming the ASCII byte prevents attackers from using malformed sequences to hide ASCII delimiters. I took some time to find the rationale to make sure we're all on the same page going forward.
Thank you for the quick answer.

==== TLDR ====

Because the byte pair is a structurally valid lead-trail sequence, the trailing byte should be consumed (and replaced), not emitted as-is.

==== The long version ====

I have tried to carefully read the links provided, but I think this case is kind of the opposite. That handling is correct, because it is a valid lead followed by an invalid trailing byte. This case is different: the trailing byte is in the valid trail range, and only the index lookup fails. This can lead to an exploit similar to the one quoted in the bug (from 2011).

One might not see the problem, because to make things more interesting that sequence is actually mapped to a PUA character in some Big5 extensions. Considering that there are many Big5 extensions (and tables), this is quite likely to happen. In fact, I discovered this exactly from client-side JSON parsing (using this algorithm), with data produced server-side (using the Big5-HKSCS table by default). None of these extensions are registered with IANA, so there is no standard way to communicate that information to another client.
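To make the JSON failure mode concrete, here is a minimal sketch. The decoded string is hard-coded to what a spec-conformant Big5 decoder yields for the bytes `83 5C` inside a string value (U+FFFD followed by a literal backslash, per the analysis in this issue); the `"name"` field is a made-up example, not taken from the actual affected site.

```python
import json

# A producer using a Big5 extension emitted the two-byte character
# 0x83 0x5C inside a JSON string.  A spec-conformant Big5 decoder
# turns that pair into U+FFFD followed by a literal backslash:
decoded = '{"name":"\ufffd\\"}'

try:
    json.loads(decoded)
except json.JSONDecodeError as e:
    # The stray backslash escapes the closing quote, so the string
    # literal never terminates and parsing fails.
    print("parse failed:", e)
```

In a browser the same stray backslash lands inside a JavaScript string literal, which is where the XSS potential comes from.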
Thanks for that analysis, I guess that is indeed a novel angle that I'm not sure was fully considered. That would affect most legacy encodings to some extent, including Shift_JIS I suspect (we tend to unwind for ASCII bytes as a general principle). Since most implementations are now aligned I wonder to what extent they want to change this again. At the very least we should add a warning somewhere or maybe add a paragraph to the security considerations that suggests that if you're using a legacy encoding, you have to be sure that it's identical to those defined and that otherwise you need to account for the difference in behavior being exploited.
If there is agreement that this is a possible security problem waiting to happen, then a fix is a lot better than a warning. Plus, it really "feels" incorrect: a valid lead-trail byte sequence gets only half-converted to U+FFFD. It is not an intrinsic problem with legacy encodings, it is a problem with the algorithm as described in this spec. Let's say we add a warning. "You have to be sure that it's identical to those defined": there is no way to do that. You get some content from somewhere, you don't even control it, and it is tagged as Big5. There is no way to tag it as some variant of Big5, because none of them is IANA registered. Would it help if I get someone from the Chrome team to chime in saying they would implement this if the spec changes? Thank you very much!
The warning would be for content producers, who also must use UTF-8 per the standard already, so they are already in violation of sorts. If Chrome were willing to drive this work (including the change proposals and test changes required) that would certainly help, but we'd need one more implementer (Apple or Mozilla) to also be on board.
I'd rather not tweak legacy encoding implementations any further in Chromium's copy of ICU unless it's absolutely necessary. |
While this is a novel angle for the Encoding Standard, my understanding is that a similar structural concern kept Shift_JIS unsupported as a system encoding in Fedora, so that the transition was direct from EUC-JP to UTF-8. (I don't know how DOS and pre-NT Windows dealt internally with 0x5C in two-byte characters in file paths under the Japanese locale. I also don't know if Fedora supported Big5 as a system encoding previously.)
Interesting, both because JSON is not allowed to be Big5-encoded (irrelevant as far as providing security even for people who don't conform with specs goes) and because the spec is trying to be able to decode HKSCS. While a backslash may have different implications than other ASCII-range bytes as the second byte of an unmappable sequence, I'm reluctant to make a special case for 0x5C in the general ASCII unmasking policy with the level of evidence offered so far. Before debating special-casing 0x5C as the trail byte when an index lookup fails, I'd be interested in learning what Big5-HKSCS generators can generate byte pairs that the index in the Encoding Standard does not have mappings for. We have mappings for the 0x5C trail byte for every lead byte from 0x87 onwards. We have no mappings for any byte pair whose lead is in the range 0x81 to 0x86, inclusive. What software produced the 0x83, 0x5C byte sequence and what Big5 extension does it belong to? (CC @foolip)
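For reference, the pointer values the spec's formula `(lead − 0x81) × 157 + (byte − offset)` assigns to these pairs can be computed directly. A quick sketch for trail byte 0x5C (offset 0x40, since 0x5C < 0x7F) with each lead in the 0x81 to 0x86 range:

```python
# Pointer values per the Encoding Standard's Big5 formula for
# trail byte 0x5C (ASCII backslash) with leads 0x81..0x86.
# Offset is 0x40 because the trail byte is below 0x7F.
pointers = [(lead - 0x81) * 157 + (0x5C - 0x40) for lead in range(0x81, 0x87)]
print([hex(p) for p in pointers])
# lead 0x83 gives pointer 0x156 (342), the pair discussed in this issue
```

None of these pointers has a mapping in index-big5.txt, which is exactly the unmapped-pair case under discussion.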
That's relatively useless advice, especially in the case of Big5, which, as defined in the spec, is a WHATWG synthesis that is unlikely to be an exact match for any legacy implementation.
The old Big5 study emails have evidence of lead bytes in the 0x81 to 0x86 (inclusive) range in pages whose URLs no longer seem to work. How did those items of evidence not result in mappings for lead bytes in that range?
The trouble here is that XSS attacks are not coming from "friendly" producers. If we say "don't do this, it is a possible vulnerability", we are just inviting bad actors to abuse it.
I am not advocating special treatment for 0x5C.
Anything using Windows code page 950, so probably most (all?) Windows APIs. And also ICU: the code I ran decodes the 0x83 0x5C sequence to a PUA character. We notice that Java considers Big5 and cp950 to be different charsets, but ICU4J considers them aliases. I totally understand the reluctance to change a spec, and to change implementations. I would really hate to see some exploit based on something like this a few months down the line...
If it helps, you can also diff the mapping tables from the Unicode site.
Interesting. So it maps to the PUA. Maybe we should map byte pairs with leads in the 0x81 to 0x86 (inclusive) range to the PUA. It seems worthwhile to check what Microsoft does for codepage 950.
These tables have mappings only starting from lead byte 0xA1 upwards.
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt says "EUDC" for those lead bytes (but also for 0x87 to 0xA0, inclusive). The Unicode.org vendor repo does not appear to have a Microsoft submission for the Hong Kong "951" shadow codepage.
I'm afraid that I don't know much about the software which produced the content that @annevk, I and others analyzed back in the day. I feel like the best way to answer questions about what the spec should say now would be to perform a new scrape looking for certain patterns, perhaps starting from the list of URLs in httparchive.
@jungshik are we now in alignment with the Encodings spec for Big5? If not, do you know where the differences are?
It is only used to convert from Unicode to a charset. There is no such flag for the decoding direction.
That seems relevant in terms of identifying an encoder that can produce byte sequences below the HKSCS area. |
Some findings:

950/Big5: The U+FFFD cells that can be seen in the Big5 visualization are filled with PUA code points. U+0080 maps to 0x80 and U+F8F8 (PUA) maps to 0xFF.

949/EUC-KR: U+0080 maps to 0x80. Additionally, the row immediately above Hanja and the row immediately below Hanja (starting from 0xA1 trail, i.e. really just the part of the row where the trail is part of the original EUC trail range) are filled with PUA code points. Since the trail is always non-ASCII, this doesn't pose a security risk.

932/Shift_JIS: There are 4 PUA code points that map to a single non-ASCII byte each. We already knew about this. Since these are non-ASCII, they don't pose a security risk.
Random thought: eudcedit.exe offers to start from the start of the Unicode Private Use Area. In the case of Shift_JIS and gbk, the Encoding Standard is compatible with this. (Let's ignore for the moment whether it's a good thing for text interchange.) The 949 PUA mappings that we don't have would also start from the start of the Unicode Private Use Area. However, in the case of Big5, HKSCS compatibility makes the start of the Private Use Area unavailable. Is there a good reason why the Encoding Standard doesn't have the EUDC bits for EUC-KR even though it explicitly has them for Shift_JIS and, as a side effect of gb18030 support, for gbk?
(This is not yet a suggestion to make our EUC-KR PUA-consistent with 949. At present I'm just trying to understand why things are the way they are.)
It seems that not all PUA mappings in Windows legacy code pages are strictly considered part of EUDC.
moztw.org maintains a repository of information about Big5. Checking the UAO tables there, it's worth noting that UAO had mappings for the byte pairs whose lead is in the range 0x81 to 0x86 (inclusive), so even though the research that led to the current state of the spec concluded that HKSCS was used on the Web and UAO pretty much not (and, therefore, the spec went with HKSCS instead of UAO), a UAO encoder could produce byte pairs whose lead is in the range 0x81 to 0x86 (inclusive). Both Python and Node have quite recent UAO packages.
I'm against any more tweaking in general, and extremely strongly against introducing any new decoding mappings to the Unicode PUA (e.g. for EUC-KR).
The more I've examined the issue reported here, the more convinced I am that it is a real problem.
So at the very least I think we should tweak Big5 decode such that Big5 leads in the 0x81 to 0x86, inclusive, range consume the next byte, too, if it is in the Big5 trail range. I don't yet have an opinion on whether the byte pairs should result in error (U+FFFD) or in 950-consistent PUA code points. (Mixing UAO with HKSCS probably would hurt more than it would help.) Since JIS X 0208 has undergone less extension and the extensions have happened ages ago, I think we probably don't need to change Shift_JIS analogously for trails that are in the trail range but unmapped, because there probably aren't any Shift_JIS-ish encoders around that could emit leads 0x82, 0x85, 0x86, 0x88, 0xEB, 0xEF, 0xEC or 0xFC with trail 0x5C for some Unicode input. The 0x5C issue is moot for EUC-KR and gbk (but for opposite reasons).
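As an illustrative sketch (not spec text) of the tweak proposed above: the function below models what a decoder emits for an unmapped byte pair with a lead in the 0x81 to 0x86 range, with and without the proposed behavior. Since the choice between error and 950-consistent PUA output is left open above, the "consume" branch here simply assumes a single U+FFFD.

```python
def decode_unmapped_pair(lead: int, trail: int, consume_trail: bool) -> str:
    """Model the decoder's output for an unmapped Big5 byte pair.

    consume_trail=False models the current spec: an ASCII trail byte is
    unread and then re-decoded on its own, so it survives verbatim.
    consume_trail=True models the proposed tweak: a trail byte in the
    Big5 trail range is consumed along with the lead (assumed here to
    yield a single U+FFFD; the spec discussion leaves the exact
    replacement open).
    """
    assert 0x81 <= lead <= 0x86
    in_trail_range = 0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE
    if consume_trail and in_trail_range:
        return "\ufffd"        # whole pair replaced
    result = "\ufffd"          # error for the lead byte
    if trail <= 0x7F:
        result += chr(trail)   # ASCII trail resurfaces as-is (current spec)
    return result

# Current spec: U+FFFD followed by a live backslash.
print(repr(decode_unmapped_pair(0x83, 0x5C, consume_trail=False)))
# Proposed tweak: the backslash is swallowed with the lead.
print(repr(decode_unmapped_pair(0x83, 0x5C, consume_trail=True)))
```

The security-relevant difference is only the second branch: whether an ASCII byte such as 0x5C can escape from the middle of a two-byte sequence.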
If we don't change EUC-KR, Shift_JIS and gbk to match the corresponding Windows code pages, I think we should at the very least document how and why encodings that match the Windows code pages for non-PUA, non-U+0080 code points don't match them for PUA and U+0080.
For completeness, it seems worthwhile to mention: after a byte in the Shift_JIS lead range, IE and pre-Chromium Edge consume the following byte if it is in the Shift_JIS trail range regardless of its mappability, so if we were to change Shift_JIS, we'd be changing it to an IE/Edge-consistent state security-wise (even if not in terms of the character used for replacement of unmappables).
I think we should do this at least for Big5, because it would better protect against Java and Windows (kernel32.dll) -based generators getting XSSed. @achristensen07, do you have an opinion? |
Note that UTC disagreed with the Encoding Standard on this point: https://www.unicode.org/L2/L2019/19192-review-docs.pdf I.e. making the requested change here would align with the UTC position.
Their rationale cites ISO-2022-JP, which seems like a red herring. |
If I've understood this correctly, this seems to only change behavior when decoding "invalid" input. I would like to believe that most content comes from valid encoders, so I would hope compatibility issues from such a change would be minimal. I don't think I have a strong opinion.
The paragraphs referring to VDCs are directly relevant to this issue. |
We have telemetry from Firefox 86 that is best explained by the hypothesis that users in Taiwan and Hong Kong encounter unlabeled Big5 containing byte sequences that the Encoding Standard considers unmapped. (I'm working on getting the numbers that I'm looking at OKed for publication.) Of the byte pairs in the Big5 range that aren't mapped by the Encoding Standard, only the ones with lead byte 0xA3 are unmapped in Internet Explorer. The rest map to the Private Use Area. In that demo, the middle column labeled "Bytes" can be used to verify the decoding in IE. The rightmost column labeled "PUA NCR" contains a cross-browser numeric character reference for what IE decodes the bytes to. This can be used for probing fonts for glyph assignments in non-IE browsers. (The page is declared as …)
AR PL UMing HK on Ubuntu 20.04 provides glyphs for U+EEFF and U+F7EF through U+F816, inclusive. |
For clarity, the telemetry measures Text Encoding menu usage and not byte sequences. The connection to byte sequences is my hypothesis from menu usage. |
On Windows 10, I see Arial provides an empty glyph for U+F301 and MingLiU_HKSCS provides ideographic glyphs for U+ED2B through U+EE9A, inclusive. (The glyphs in AR PL UMing HK were not ideographic.) |
I forgot to report back the progress here. Firefox 89 shipped a version of chardetng that doesn't reject the Big5ness of the input if it contains structurally-valid but unmapped Big5 byte pairs. Telemetry changes don't look like they were caused by this change and instead look like they were dominated by removal of one of the encoding menu UI entry points, but the analysis of the telemetry data was structured to answer a different question and not this one. |
There are some sequences of bytes that are valid lead-trail pairs according to the description at https://encoding.spec.whatwg.org/#big5-decoder, but don't have a corresponding Unicode code point in the `index-big5.txt` mapping table. In this case the first byte is converted to `U+FFFD`, but the second one is left "as is". In some cases that second byte can be a backslash (`\`, `5C`), which can be used to "escape" the ending quote of a string in JavaScript, potentially resulting in XSS exploits.

Example (attached): the `83 5C` sequence.

According to the algorithm at https://encoding.spec.whatwg.org/#big5-decoder, the pointer is `(lead − 0x81) × 157 + (byte − offset)`. The result of that is `(0x83 − 0x81) × 157 + (0x5C − 0x40)`, which is `0x156`. That pointer has no mapping in `index-big5.txt`, and because the byte is an ASCII byte we "prepend byte to stream" (case 3.6) and "return error" (case 3.7).

The end result is a `U+FFFD` (from the error) followed by a `5C` (the trailing byte, "as is"). You can see this in the attached file: when opened in both Chrome and Firefox, the text is rendered as the Unicode REPLACEMENT CHARACTER (correct) followed by a backslash (incorrect).

This is a valid lead-trail byte sequence that should be replaced either by one single `U+FFFD` character or by two `U+FFFD` characters, whichever the policy is (I think the second case). But the trailing byte should definitely not be left "as is".

A possible exploit can use the trailing byte (which is a backslash) to escape the end of a string, for example. Checking the console of Firefox you will see the message 'SyntaxError: "" string literal contains an unescaped line break'. In Chrome the message is 'Uncaught SyntaxError: Invalid or unexpected token'.

I did not check, but this might also happen in other DBCSs (Double Byte Character Sets) that have the second byte in the ASCII range (for instance in Shift JIS?).
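The decoder steps described above can be sketched in a few lines of Python. This is a minimal model of the WHATWG Big5 decoder covering only the branch relevant here: the index lookup is stubbed out as an empty dict (accurate for pointer 0x156, which has no mapping), and the special pointers 1133/1135/1164/1166 and the real index are omitted.

```python
def big5_decode(data: bytes) -> str:
    """Minimal model of the WHATWG Big5 decoder for this issue.

    INDEX stands in for index-big5.txt; it is left empty here, which is
    accurate for pointer 0x156 (the 0x83 0x5C pair has no mapping).
    """
    INDEX = {}
    out, stream, lead, i = [], list(data), 0x00, 0
    while i < len(stream):
        byte = stream[i]
        i += 1
        if lead != 0x00:
            pointer = None
            offset = 0x40 if byte < 0x7F else 0x62
            if 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE:
                pointer = (lead - 0x81) * 157 + (byte - offset)
            lead = 0x00
            code_point = INDEX.get(pointer)
            if code_point is not None:
                out.append(chr(code_point))
                continue
            if byte <= 0x7F:      # case 3.6: ASCII byte, prepend to stream
                i -= 1
            out.append("\ufffd")  # case 3.7: return error
        elif byte <= 0x7F:
            out.append(chr(byte))
        elif 0x81 <= byte <= 0xFE:
            lead = byte
        else:
            out.append("\ufffd")
    return "".join(out)

# The pair 0x83 0x5C decodes to U+FFFD followed by a literal backslash.
print(repr(big5_decode(b"A\x83\x5cB")))
```

Running this on `b"\x83\x5c"` reproduces the reported behavior: the backslash is unread after the error and then re-decoded as a plain ASCII character.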
big5.zip