[Feature Request] UTF-8 text filter #19
Excluding the headers and footers, the file has 125 different Unicode code points. Thus one byte per code point is possible. One needs something to tell which bytes match which code points, ideally without requiring the user to specify the charset. That information will take some space.
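A minimal sketch of how such a mapping could be built (hypothetical code, not from any prototype): scan the input once and hand out byte values first come, first served. The table itself is the information that takes some space, since the decoder needs it to reverse the mapping.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Assign each distinct code point the next free byte value.
    // Only usable when the text has at most 256 distinct code points;
    // the table must be stored in the output for the decoder.
    static Map<Integer, Integer> buildTable(String text) {
        Map<Integer, Integer> table = new LinkedHashMap<>();
        text.codePoints().forEach(cp -> {
            if (!table.containsKey(cp))
                table.put(cp, table.size());
        });
        return table;
    }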
Also try UTF-16BE for comparison. I got 403156 bytes, but my input was two bytes bigger than yours.
iconv -f utf8 -t utf16be < ukrainskakuhnya1998_djvu.txt | xz --lzma2=pb=1,lp=1 | wc -c
SCSU sounds more complex than some of the other ideas, which means it would also need to be clearly better to be worth it. Still, it's nice to see it tested. :-)
SCSU itself is already a compression method. A filter doesn't necessarily need to make the file smaller; a filter needs to make the file easier to compress. For example, UTF-16 makes the file 13 % bigger, but it then compresses better than UTF-8 in the case of this test file.
I'll test UTF-16 and UTF-32 to see whether that improves things, but the issue I see with not forcing the user to specify the charset is that the only encodings detectable without trying to decode everything are the Unicode ones with a BOM, which a lot of the time wouldn't help because UTF-8 tends to be written without a BOM. And that's if decoding would even help: while the ISO-8859 series or KOI8-B have unassigned bytes, which would allow us to declare that a text isn't encoded in them, and the ISO/IEC 2022 series is stateful, so invalid states would let us declare it can't be one of them, technically any octet stream is valid, say, KOI8-R or VSCII-1. It may be full of enough control characters for us to think it isn't even text at all, but that wouldn't make it invalid. So personally I would be for this: either the user specifies the charset, there's a BOM, or it has to be UTF-8.
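To illustrate the point, a BOM sniff is about the only cheap, decode-free check available (a sketch; the magic bytes are the standard Unicode BOMs):

    // A BOM is the only cheap, decode-free signal, and most
    // UTF-8 text won't carry one.
    static String sniffBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF)
            return "UTF-8";
        if (head.length >= 4 && head[0] == 0 && head[1] == 0
                && (head[2] & 0xFF) == 0xFE && (head[3] & 0xFF) == 0xFF)
            return "UTF-32BE";
        if (head.length >= 4 && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xFE && head[2] == 0 && head[3] == 0)
            return "UTF-32LE";
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF)
            return "UTF-16BE";
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE)
            return "UTF-16LE";
        return null; // no BOM, the common case for UTF-8
    }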
Sorry, I think you misunderstood me. The input is UTF-8; no need to guess that or to ask the user.
If you knew the language, you could manually tell the filter to convert to a slightly modified KOI8-U. One could reserve a few control bytes as escapes so that Unicode code points not in KOI8-U, and also invalid UTF-8, could be encoded too.
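A rough sketch of that escape idea (the escape byte 0x01 and the toByte table are placeholder assumptions, not a worked-out design): each mappable code point becomes one byte, and anything outside the table is written as an escape byte plus the code point in three bytes.

    // ESCAPE is an arbitrary placeholder for a control byte the mapping
    // leaves unused. Handling of invalid UTF-8 input would need a second
    // escape and is omitted here.
    static final int ESCAPE = 0x01;

    static void encode(int codePoint, java.io.OutputStream out,
                       java.util.Map<Integer, Integer> toByte)
            throws java.io.IOException {
        Integer b = toByte.get(codePoint);
        if (b != null) {
            out.write(b);
        } else {
            out.write(ESCAPE);
            out.write((codePoint >>> 16) & 0xFF); // U+0000..U+10FFFF fits in 3 bytes
            out.write((codePoint >>> 8) & 0xFF);
            out.write(codePoint & 0xFF);
        }
    }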
My point was that it's nicer for users if the encoder can determine the UTF-8-to-8-bit mapping automatically. That is assuming the 8-bit mapping is a good method at all; it's not the only way to go, as your SCSU result shows. It's about figuring out what keeps the code simple and small while still giving good results.
For an automatic conversion, then, I think the target ought to be SCSU, BOCU-1, and their ilk: pinning the text, for all intents and purposes, to a given Unicode block and escaping to UTF-16 for whatever falls outside is basically what a good encoder for them does. Admitting, of course, that converting to UTF-16 or UTF-32 before setting LZMA2 on it doesn't give bigger wins when all is said and done.
I have a question before trying the UTF-16/UTF-32 conversions: for data which is 2 bytes (resp. 4 bytes) wide, we set lc to 0 and lp to 1 (resp. 2), right?
UTF-16BE: pb=1, lp=1, lc=3
UTF-32BE: pb=2, lp=2, lc=2
pb and lp are about alignment. lp + lc must not exceed 4, thus one has to use lc=2 with lp=2. Also:
UTF-8: pb=0, lp=0, lc=3 (or sometimes lc=4)
With UTF-16 and UTF-32, big endian should compress better than little endian.
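Since the prototype discussed below is in Java, here is how the UTF-16BE settings above would look with XZ for Java's LZMA2Options (a sketch; the class name and file arguments are made up):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import org.tukaani.xz.LZMA2Options;
    import org.tukaani.xz.XZOutputStream;

    public class CompressUtf16be {
        public static void main(String[] args) throws Exception {
            LZMA2Options opts = new LZMA2Options();
            opts.setPb(1);      // 2-byte alignment for UTF-16BE
            opts.setLcLp(3, 1); // lc + lp must not exceed 4
            try (FileInputStream in = new FileInputStream(args[0]);
                 XZOutputStream out =
                         new XZOutputStream(new FileOutputStream(args[1]), opts)) {
                in.transferTo(out);
            }
        }
    }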
I'm currently doing a Java prototype for tukaani-project/xz#50, and so far the results look pretty good. My choice was to convert to SCSU because that way I'm sure both that it's reversible and that it doesn't require bringing in something like ICU. Here are the results of the tests I did, using Українська кухня. Підручник from the C library's issue, after first stripping the HTML prologue and epilogue the Internet Archive stuck in there:
I still need to polish the code before even considering a draft pull request but so far so good.