[Feature Request] UTF-8 text filter #19
Excluding the headers and footers, the file has 125 different Unicode code points. Thus one byte per code point is possible. One needs something to tell which bytes match which code points, ideally without requiring the user to specify the charset. That information will take some space.
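A minimal sketch of how such a mapping could be built (hypothetical code, not from any prototype): scan the input once and hand out byte values first come, first served. The table itself is the information that takes some space, since the decoder needs it to reverse the mapping.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Assign each distinct code point the next free byte value.
    // Only usable when the text has at most 256 distinct code points;
    // the table must be stored in the output for the decoder.
    static Map<Integer, Integer> buildTable(String text) {
        Map<Integer, Integer> table = new LinkedHashMap<>();
        text.codePoints().forEach(cp -> {
            if (!table.containsKey(cp))
                table.put(cp, table.size());
        });
        return table;
    }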
Also try UTF-16BE for comparison. I got 403156 bytes, but my input was two bytes bigger than yours.
iconv -f utf8 -t utf16be < ukrainskakuhnya1998_djvu.txt | xz --lzma2=pb=1,lp=1 | wc -c
SCSU sounds more complex than some of the other ideas, which means it would also need to be clearly better to be worth it. Still, it's nice to see it tested. :-)
SCSU itself is already a compression method. A filter doesn't necessarily need to make the file smaller; a filter needs to make the file easier to compress. For example, UTF-16 makes the file 13 % bigger, but it then compresses better than UTF-8 in the case of this test file.
I'll test UTF-16 and UTF-32 to see whether that improves things, but the issue I see with not forcing the user to specify the charset is that the only encodings detectable without trying to decode everything are the Unicode ones with a BOM, which a lot of the time wouldn't help because UTF-8 tends to be written without a BOM. And that's if decoding would even help: while the ISO-8859 series or KOI8-B have unassigned bytes, which would allow us to declare that a text isn't encoded in them, and the ISO/IEC 2022 series is stateful, so invalid states would let us declare it can't be one of them, technically any octet stream is valid, say, KOI8-R or VSCII-1. It may be full of enough control characters for us to think it isn't even text at all, but that wouldn't make it invalid. So personally I would be for this: either the user specifies the charset, there's a BOM, or it has to be UTF-8.
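To illustrate the point, a BOM sniff is about the only cheap, decode-free check available (a sketch; the magic bytes are the standard Unicode BOMs):

    // A BOM is the only cheap, decode-free signal, and most
    // UTF-8 text won't carry one.
    static String sniffBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF)
            return "UTF-8";
        if (head.length >= 4 && head[0] == 0 && head[1] == 0
                && (head[2] & 0xFF) == 0xFE && (head[3] & 0xFF) == 0xFF)
            return "UTF-32BE";
        if (head.length >= 4 && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xFE && head[2] == 0 && head[3] == 0)
            return "UTF-32LE";
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF)
            return "UTF-16BE";
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE)
            return "UTF-16LE";
        return null; // no BOM, the common case for UTF-8
    }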
Sorry, I think you misunderstood me. The input is UTF-8; no need to guess that or to ask the user.
If you knew the language, you could manually tell the filter to convert to a slightly modified KOI8-U. One could reserve a few control bytes as escapes so that Unicode code points not in KOI8-U, and also invalid UTF-8, could be encoded too.
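A rough sketch of that escape idea (the escape byte 0x01 and the toByte table are placeholder assumptions, not a worked-out design): each mappable code point becomes one byte, and anything outside the table is written as an escape byte plus the code point in three bytes.

    // ESCAPE is an arbitrary placeholder for a control byte the mapping
    // leaves unused. Handling of invalid UTF-8 input would need a second
    // escape and is omitted here.
    static final int ESCAPE = 0x01;

    static void encode(int codePoint, java.io.OutputStream out,
                       java.util.Map<Integer, Integer> toByte)
            throws java.io.IOException {
        Integer b = toByte.get(codePoint);
        if (b != null) {
            out.write(b);
        } else {
            out.write(ESCAPE);
            out.write((codePoint >>> 16) & 0xFF); // U+0000..U+10FFFF fits in 3 bytes
            out.write((codePoint >>> 8) & 0xFF);
            out.write(codePoint & 0xFF);
        }
    }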
My point was that it's nicer for users if the encoder can determine the UTF-8-to-8-bit mapping automatically. That is assuming the 8-bit mapping is a good method at all; it's not the only way to go, as your SCSU result shows. It's about figuring out what keeps the code simple and small while still giving good results.
For an automatic conversion, then, I think the target ought to be SCSU, BOCU-1, and their ilk: pinning the text, for all intents and purposes, to a given Unicode block and escaping to UTF-16 for whatever falls outside is basically what a good encoder for them does. Admitting, of course, that converting to UTF-16 or UTF-32 before setting LZMA2 on it doesn't give bigger wins when all is said and done.
I have a question before trying the UTF-16/UTF-32 conversions: for data which is 2 bytes (resp. 4 bytes) wide, we set lc to 0 and lp to 1 (resp. 2), right?
UTF-16BE: pb=1, lp=1, lc=3
UTF-32BE: pb=2, lp=2, lc=2
pb and lp are about alignment. lp + lc must not exceed 4, thus one has to use lc=2 with lp=2. Also:
UTF-8: pb=0, lp=0, lc=3 (or sometimes lc=4)
With UTF-16 and UTF-32, big endian should compress better than little endian.
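Since the prototype discussed below is in Java, here is how the UTF-16BE settings above would look with XZ for Java's LZMA2Options (a sketch; the class name and file arguments are made up):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import org.tukaani.xz.LZMA2Options;
    import org.tukaani.xz.XZOutputStream;

    public class CompressUtf16be {
        public static void main(String[] args) throws Exception {
            LZMA2Options opts = new LZMA2Options();
            opts.setPb(1);      // 2-byte alignment for UTF-16BE
            opts.setLcLp(3, 1); // lc + lp must not exceed 4
            try (FileInputStream in = new FileInputStream(args[0]);
                 XZOutputStream out =
                         new XZOutputStream(new FileOutputStream(args[1]), opts)) {
                in.transferTo(out);
            }
        }
    }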
I'm currently doing a Java prototype for tukaani-project/xz#50, and so far the results look pretty good. My choice was to convert to SCSU because that way I'm sure both that it's reversible and that it doesn't require bringing in something like ICU. Here are the results of the tests I did, using Українська кухня. Підручник from the C library's issue, after first stripping the HTML prologue and epilogue the Internet Archive stuck in there:
I still need to polish the code before even considering a draft pull request but so far so good.