Big5 encoding mishandles some trailing bytes, with possible XSS #171
Heya, welcome! This is by design: not consuming the ASCII byte prevents attackers from using malformed sequences to hide ASCII delimiters. I took some time to find the rationale to make sure we're all on the same page going forward.
Thank you for the quick answer.

==== TLDR ====

Because the byte pair is a structurally valid lead-trail sequence, the trailing byte should be consumed (and replaced), not emitted as-is.

==== The long version ====

I have tried to carefully read the links provided, but I think this case is kind of the opposite. That handling is correct, because it is a valid lead followed by an invalid trailing byte. This case is different: the trailing byte is in the valid trail range, and only the index lookup fails. This can lead to an exploit similar to the one quoted in the bug (from 2011).

One might not see the problem, because to make things more interesting that sequence is actually mapped to a PUA character in some Big5 extensions. Considering that there are many Big5 extensions (and tables), this is quite likely to happen. In fact, I discovered this exactly from client-side JSON parsing (using this algorithm), with data produced server-side (using the Big5-HKSCS table by default). None of these extensions are registered with IANA, so there is no standard way to communicate that information to another client.
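To make the JSON failure mode concrete, here is a minimal sketch. The decoded string is hard-coded to what a spec-conformant Big5 decoder yields for the bytes `83 5C` inside a string value (U+FFFD followed by a literal backslash, per the analysis in this issue); the `"name"` field is a made-up example, not taken from the actual affected site.

```python
import json

# A producer using a Big5 extension emitted the two-byte character
# 0x83 0x5C inside a JSON string.  A spec-conformant Big5 decoder
# turns that pair into U+FFFD followed by a literal backslash:
decoded = '{"name":"\ufffd\\"}'

try:
    json.loads(decoded)
except json.JSONDecodeError as e:
    # The stray backslash escapes the closing quote, so the string
    # literal never terminates and parsing fails.
    print("parse failed:", e)
```

In a browser the same stray backslash lands inside a JavaScript string literal, which is where the XSS potential comes from.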
Thanks for that analysis, I guess that is indeed a novel angle that I'm not sure was fully considered. That would affect most legacy encodings to some extent, including Shift_JIS I suspect (we tend to unwind for ASCII bytes as a general principle). Since most implementations are now aligned I wonder to what extent they want to change this again. At the very least we should add a warning somewhere or maybe add a paragraph to the security considerations that suggests that if you're using a legacy encoding, you have to be sure that it's identical to those defined and that otherwise you need to account for the difference in behavior being exploited.
If there is agreement that this is a possible security problem waiting to happen, then a fix is a lot better than a warning. Plus, it really "feels" incorrect: a valid lead-trail byte sequence gets only half-converted to U+FFFD. It is not an intrinsic problem with legacy encodings, it is a problem with the algorithm as described in this spec. Let's say we add a warning. "You have to be sure that it's identical to those defined": there is no way to do that. You get some content from somewhere, you don't even control it, and it is tagged as Big5. There is no way to tag it as some variant of Big5, because none of them is IANA registered. Would it help if I get someone from the Chrome team to chime in saying they would implement this if the spec changes? Thank you very much!
The warning would be for content producers, who also must use UTF-8 per the standard already, so they are already in violation of sorts. If Chrome were willing to drive this work (including the change proposals and test changes required) that would certainly help, but we'd need one more implementer (Apple or Mozilla) to also be on board.
I'd rather not tweak legacy encoding implementations any further in Chromium's copy of ICU unless it's absolutely necessary. |
While this is a novel angle for the Encoding Standard, my understanding is that a similar structural concern kept Shift_JIS unsupported as a system encoding in Fedora, so that the transition was direct from EUC-JP to UTF-8. (I don't know how DOS and pre-NT Windows dealt internally with 0x5C in two-byte characters in file paths under the Japanese locale. I also don't know if Fedora supported Big5 as a system encoding previously.)
Interesting, both because JSON is not allowed to be Big5-encoded (irrelevant as far as providing security even for people who don't conform with specs goes) and because the spec is trying to be able to decode HKSCS. While a backslash may have different implications than other ASCII-range bytes as the second byte of an unmappable sequence, I'm reluctant to make a special case for 0x5C in the general ASCII unmasking policy with the level of evidence offered so far. Before debating special-casing 0x5C as the trail byte when an index lookup fails, I'd be interested in learning what Big5-HKSCS generators can generate byte pairs that the index in the Encoding Standard does not have mappings for. We have mappings for the 0x5C trail byte for every lead byte from 0x87 onwards. We have no mappings for any byte pair whose lead is in the range 0x81 to 0x86, inclusive. What software produced the 0x83, 0x5C byte sequence and what Big5 extension does it belong to? (CC @foolip)
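For reference, the pointer values the spec's formula `(lead − 0x81) × 157 + (byte − offset)` assigns to these pairs can be computed directly. A quick sketch for trail byte 0x5C (offset 0x40, since 0x5C < 0x7F) with each lead in the 0x81 to 0x86 range:

```python
# Pointer values per the Encoding Standard's Big5 formula for
# trail byte 0x5C (ASCII backslash) with leads 0x81..0x86.
# Offset is 0x40 because the trail byte is below 0x7F.
pointers = [(lead - 0x81) * 157 + (0x5C - 0x40) for lead in range(0x81, 0x87)]
print([hex(p) for p in pointers])
# lead 0x83 gives pointer 0x156 (342), the pair discussed in this issue
```

None of these pointers has a mapping in index-big5.txt, which is exactly the unmapped-pair case under discussion.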
That's relatively useless advice, especially in the case of Big5, which, as defined in the spec, is a WHATWG synthesis that is unlikely to be an exact match for any legacy implementation.
The old Big5 study emails have evidence of lead bytes in the 0x81 to 0x86 (inclusive) range in pages whose URLs no longer seem to work. How did those items of evidence not result in mappings for lead bytes in that range?
The trouble here is that XSS attacks are not coming from "friendly" producers. If we say "don't do this, it is a possible vulnerability", we are just inviting bad actors to abuse it.
I am not advocating special treatment for 0x5C.
Anything using Windows code page 950, so probably most (all?) Windows APIs. And also ICU: the code I ran decodes the 0x83 0x5C sequence to a PUA character. We notice that Java considers Big5 and cp950 to be different charsets, but ICU4J considers them aliases. I totally understand the reluctance to change a spec, and to change implementations. I would really hate to see some exploit based on something like this a few months down the line...
If it helps, you can also diff the mapping tables from the Unicode site.
Interesting. So it maps to the PUA. Maybe we should map byte pairs with leads in the 0x81 to 0x86 (inclusive) range to the PUA. It seems worthwhile to check what Microsoft does for codepage 950.
These tables have mappings only starting from lead byte 0xA1 upwards.
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt says "EUDC" for those lead bytes (but also for 0x87 to 0xA0, inclusive). The Unicode.org vendor repo does not appear to have a Microsoft submission for the Hong Kong "951" shadow codepage.
I'm afraid that I don't know much about the software which produced the content that @annevk, I and others analyzed back in the day. I feel like the best way to answer questions about what the spec should say now would be to perform a new scrape looking for certain patterns, perhaps starting from the list of URLs in httparchive.
@jungshik are we now in alignment with the Encodings spec for Big5? If not, do you know where the differences are?
It is only used to convert from Unicode to a charset. There is no such flag for the decoding direction.
That seems relevant in terms of identifying an encoder that can produce byte sequences below the HKSCS area. |
Some findings:

950/Big5: The U+FFFD cells that can be seen in the Big5 visualization are filled with PUA code points. U+0080 maps to 0x80 and U+F8F8 (PUA) maps to 0xFF.

949/EUC-KR: U+0080 maps to 0x80. Additionally, the row immediately above Hanja and the row immediately below Hanja (starting from 0xA1 trail, i.e. really just the part of the row where the trail is part of the original EUC trail range) are filled with PUA code points. Since the trail is always non-ASCII, this doesn't pose a security risk.

932/Shift_JIS: There are 4 PUA code points that map to a single non-ASCII byte each. We already knew about this. Since these are non-ASCII, they don't pose a security risk.
Random thought: eudcedit.exe offers to start from the start of the Unicode Private Use Area. In the case of Shift_JIS and gbk, the Encoding Standard is compatible with this. (Let's ignore for the moment whether it's a good thing for text interchange.) The 949 PUA mappings that we don't have would also start from the start of the Unicode Private Use Area. However, in the case of Big5, HKSCS compatibility makes the start of the Private Use Area unavailable. Is there a good reason why the Encoding Standard doesn't have the EUDC bits for EUC-KR even though it explicitly has them for Shift_JIS and, as a side effect of gb18030 support, for gbk?
(This is not yet a suggestion to make our EUC-KR PUA-consistent with 949. At present I'm just trying to understand why things are the way they are.)
It seems that not all PUA mappings in Windows legacy code pages are strictly considered part of EUDC.
moztw.org maintains a repository of information about Big5. Checking the UAO tables there, it's worth noting that UAO had mappings for the byte pairs whose lead is in the range 0x81 to 0x86 (inclusive), so even though the research that led to the current state of the spec concluded that HKSCS was used on the Web and UAO pretty much not (and, therefore, the spec went with HKSCS instead of UAO), a UAO encoder could produce byte pairs whose lead is in the range 0x81 to 0x86 (inclusive). Both Python and Node have quite recent UAO packages.
I'm against any more tweaking in general, and extremely strongly against introducing any new decoding mappings to the Unicode PUA (e.g. for EUC-KR).
The more I've examined the issue reported here, the more convinced I am that it is a real problem.
So at the very least I think we should tweak Big5 decode such that Big5 leads in the 0x81 to 0x86, inclusive, range consume the next byte, too, if it is in the Big5 trail range. I don't yet have an opinion on whether the byte pairs should result in error (U+FFFD) or in 950-consistent PUA code points. (Mixing UAO with HKSCS probably would hurt more than it would help.) Since JIS X 0208 has undergone less extension and the extensions have happened ages ago, I think we probably don't need to change Shift_JIS analogously for trails that are in the trail range but unmapped, because there probably aren't any Shift_JIS-ish encoders around that could emit leads 0x82, 0x85, 0x86, 0x88, 0xEB, 0xEF, 0xEC or 0xFC with trail 0x5C for some Unicode input. The 0x5C issue is moot for EUC-KR and gbk (but for opposite reasons).
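As an illustrative sketch (not spec text) of the tweak proposed above: the function below models what a decoder emits for an unmapped byte pair with a lead in the 0x81 to 0x86 range, with and without the proposed behavior. Since the choice between error and 950-consistent PUA output is left open above, the "consume" branch here simply assumes a single U+FFFD.

```python
def decode_unmapped_pair(lead: int, trail: int, consume_trail: bool) -> str:
    """Model the decoder's output for an unmapped Big5 byte pair.

    consume_trail=False models the current spec: an ASCII trail byte is
    unread and then re-decoded on its own, so it survives verbatim.
    consume_trail=True models the proposed tweak: a trail byte in the
    Big5 trail range is consumed along with the lead (assumed here to
    yield a single U+FFFD; the spec discussion leaves the exact
    replacement open).
    """
    assert 0x81 <= lead <= 0x86
    in_trail_range = 0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE
    if consume_trail and in_trail_range:
        return "\ufffd"        # whole pair replaced
    result = "\ufffd"          # error for the lead byte
    if trail <= 0x7F:
        result += chr(trail)   # ASCII trail resurfaces as-is (current spec)
    return result

# Current spec: U+FFFD followed by a live backslash.
print(repr(decode_unmapped_pair(0x83, 0x5C, consume_trail=False)))
# Proposed tweak: the backslash is swallowed with the lead.
print(repr(decode_unmapped_pair(0x83, 0x5C, consume_trail=True)))
```

The security-relevant difference is only the second branch: whether an ASCII byte such as 0x5C can escape from the middle of a two-byte sequence.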
If we don't change EUC-KR, Shift_JIS and gbk to match the corresponding Windows code pages, I think we should at the very least document how and why encodings that match the Windows code pages for non-PUA, non-U+0080 code points don't match them for PUA and U+0080.
For completeness, it seems worthwhile to mention: after a byte in the Shift_JIS lead range, IE and pre-Chromium Edge consume the following byte if it is in the Shift_JIS trail range regardless of its mappability, so if we were to change Shift_JIS, we'd be changing it to an IE/Edge-consistent state security-wise (even if not in terms of the character used for replacement of unmappables).
I think we should do this at least for Big5, because it would better protect against Java and Windows (kernel32.dll) -based generators getting XSSed. @achristensen07, do you have an opinion? |
Note that UTC disagreed with the Encoding Standard on this point: https://www.unicode.org/L2/L2019/19192-review-docs.pdf I.e. making the requested change here would align with the UTC position.
Their rationale cites ISO-2022-JP, which seems like a red herring. |
If I've understood this correctly, this seems to only change behavior when decoding "invalid" input. I would like to believe that most content comes from valid encoders, so I would hope compatibility issues from such a change would be minimal. I don't think I have a strong opinion.
The paragraphs referring to VDCs are directly relevant to this issue. |
We have telemetry from Firefox 86 that is best explained by the hypothesis that users in Taiwan and Hong Kong encounter unlabeled Big5 containing byte sequences that the Encoding Standard considers unmapped. (I'm working on getting the numbers that I'm looking at OKed for publication.) Of the byte pairs in the Big5 range that aren't mapped by the Encoding Standard, only the ones with lead byte 0xA3 are unmapped in Internet Explorer. The rest map to the Private Use Area. In that demo, the middle column labeled "Bytes" can be used to verify the decoding in IE. The rightmost column labeled "PUA NCR" contains a cross-browser numeric character reference for what IE decodes the bytes to. This can be used for probing fonts for glyph assignments in non-IE browsers. (The page is declared as …)
AR PL UMing HK on Ubuntu 20.04 provides glyphs for U+EEFF and U+F7EF through U+F816, inclusive. |
For clarity, the telemetry measures Text Encoding menu usage and not byte sequences. The connection to byte sequences is my hypothesis from menu usage. |
On Windows 10, I see Arial provides an empty glyph for U+F301 and MingLiU_HKSCS provides ideographic glyphs for U+ED2B through U+EE9A, inclusive. (The glyphs in AR PL UMing HK were not ideographic.) |
I forgot to report back the progress here. Firefox 89 shipped a version of chardetng that doesn't reject the Big5ness of the input if it contains structurally-valid but unmapped Big5 byte pairs. Telemetry changes don't look like they were caused by this change and instead look like they were dominated by removal of one of the encoding menu UI entry points, but the analysis of the telemetry data was structured to answer a different question and not this one. |
There are some sequences of bytes that are valid lead-trail pairs according to the description at https://encoding.spec.whatwg.org/#big5-decoder, but don't have a corresponding Unicode code point in the `index-big5.txt` mapping table. In this case the first byte is converted to `U+FFFD`, but the second one is left "as is". In some cases that second byte can be a backslash (`\`, `5C`), which can be used to "escape" the ending quote of a string in JavaScript, potentially resulting in XSS exploits.

Example (attached): the `83 5C` sequence.

According to the algorithm at https://encoding.spec.whatwg.org/#big5-decoder, the pointer is `(lead − 0x81) × 157 + (byte − offset)`. The result of that is `(0x83 − 0x81) × 157 + (0x5C − 0x40)`, which is `0x156`. That pointer has no mapping in `index-big5.txt`, and because the byte is an ASCII byte we "prepend byte to stream" (case 3.6) and "return error" (case 3.7).

The end result is a `U+FFFD` (from the error) followed by a `5C` (the trailing byte, "as is"). You can see this in the attached file: when opened in both Chrome and Firefox, the text is rendered as the Unicode REPLACEMENT CHARACTER (correct) followed by a backslash (incorrect).

This is a valid lead-trail byte sequence that should be replaced either by one single `U+FFFD` character or by two `U+FFFD` characters, whichever the policy is (I think the second case). But the trailing byte should definitely not be left "as is".

A possible exploit can use the trailing byte (which is a backslash) to escape the end of a string, for example. Checking the console of Firefox you will see the message 'SyntaxError: "" string literal contains an unescaped line break'. In Chrome the message is 'Uncaught SyntaxError: Invalid or unexpected token'.

I did not check, but this might also happen in other DBCSs (Double Byte Character Sets) that have the second byte in the ASCII range (for instance in Shift JIS?).
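The decoder steps described above can be sketched in a few lines of Python. This is a minimal model of the WHATWG Big5 decoder covering only the branch relevant here: the index lookup is stubbed out as an empty dict (accurate for pointer 0x156, which has no mapping), and the special pointers 1133/1135/1164/1166 and the real index are omitted.

```python
def big5_decode(data: bytes) -> str:
    """Minimal model of the WHATWG Big5 decoder for this issue.

    INDEX stands in for index-big5.txt; it is left empty here, which is
    accurate for pointer 0x156 (the 0x83 0x5C pair has no mapping).
    """
    INDEX = {}
    out, stream, lead, i = [], list(data), 0x00, 0
    while i < len(stream):
        byte = stream[i]
        i += 1
        if lead != 0x00:
            pointer = None
            offset = 0x40 if byte < 0x7F else 0x62
            if 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE:
                pointer = (lead - 0x81) * 157 + (byte - offset)
            lead = 0x00
            code_point = INDEX.get(pointer)
            if code_point is not None:
                out.append(chr(code_point))
                continue
            if byte <= 0x7F:      # case 3.6: ASCII byte, prepend to stream
                i -= 1
            out.append("\ufffd")  # case 3.7: return error
        elif byte <= 0x7F:
            out.append(chr(byte))
        elif 0x81 <= byte <= 0xFE:
            lead = byte
        else:
            out.append("\ufffd")
    return "".join(out)

# The pair 0x83 0x5C decodes to U+FFFD followed by a literal backslash.
print(repr(big5_decode(b"A\x83\x5cB")))
```

Running this on `b"\x83\x5c"` reproduces the reported behavior: the backslash is unread after the error and then re-decoded as a plain ASCII character.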
big5.zip