[Feature Request] UTF-8 text filter #19

Open
ljdarj opened this issue Sep 30, 2024 · 6 comments

Comments


ljdarj commented Sep 30, 2024

I'm currently working on a Java prototype for tukaani-project/xz#50 and so far the results look pretty good. I chose to convert the text to SCSU because that way I'm sure the conversion is reversible and it doesn't require bringing in something like ICU. Here are the results of the tests I did, using Українська кухня. Підручник from the C library's issue, after first stripping the HTML prologue and epilogue the Internet Archive stuck in there:

  • UTF-8 file size: 2729604 bytes
  • SCSU file size: 1566123 bytes
  • KOI8-U file size: 1540610 bytes (non-reversible)
  • UTF-8 compressed file size: 425436 bytes
  • SCSU compressed file size: 399020 bytes
  • KOI8-U compressed file size: 394820 bytes (non-reversible)

I still need to polish the code before even considering a draft pull request, but so far so good.

Member

Larhzu commented Oct 3, 2024 via email

Author

ljdarj commented Oct 5, 2024

I'll test UTF-16 and UTF-32 to see whether that improves things. The issue I see with not forcing the user to specify the charset is that the only encodings where I see that being possible, without trying to decode everything, are the Unicode encodings with a BOM. And a lot of the time that wouldn't help, because UTF-8 tends to be written without a BOM.

And that's if decoding would even help: the ISO-8859 series and KOI8-B have unassigned bytes, which would let us declare that a text isn't encoded in them, and the ISO/IEC 2022 series is stateful, so invalid states would let us rule those out too. But technically any octet stream is valid as, say, KOI8-R or VSCII-1. It might be full of enough control characters for us to think it's not even text at all, but that wouldn't make it invalid.
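To illustrate the point about validation, here is a minimal sketch using only the JDK's `CharsetDecoder` in strict mode. The class and method names (`StrictDecodeCheck`, `decodesCleanly`) are made up for the example and are not part of the prototype. UTF-8 can reject a byte sequence, whereas I'd expect a charset that assigns all 256 byte values, such as KOI8-R, to accept anything; the KOI8-R part is guarded because that charset is not guaranteed to be present on every Java runtime.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeCheck {
    /** True if the bytes decode in the given charset with no malformed or unmappable input. */
    public static boolean decodesCleanly(byte[] data, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xC3 announces a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] notUtf8 = {(byte) 0xC3, (byte) 0x28};
        System.out.println("UTF-8 accepts it: "
                + decodesCleanly(notUtf8, StandardCharsets.UTF_8));

        // Assumption: KOI8-R is optional in Java, so only try it when available.
        if (Charset.isSupported("KOI8-R")) {
            byte[] everyByte = new byte[256];
            for (int i = 0; i < everyByte.length; i++) {
                everyByte[i] = (byte) i;
            }
            System.out.println("KOI8-R accepts all 256 byte values: "
                    + decodesCleanly(everyByte, Charset.forName("KOI8-R")));
        }
    }
}
```

So a strict decode can rule UTF-8 out, but a successful decode in a fully assigned single-byte charset says nothing about whether the guess was right.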

So personally I'd go with this: either the user specifies the charset, or there's a BOM, or the input has to be UTF-8.
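That rule could be sketched roughly as follows. This is only an illustration, not the prototype's code: `CharsetGuess`, `pickCharset` and the idea of passing `null` for "no user choice" are all assumptions made for the example, and the function returns charset names as plain strings so that it doesn't depend on which charsets a given JDK ships.

```java
public class CharsetGuess {
    /**
     * Picks a charset name: the user's choice if given, else whatever the BOM says,
     * else UTF-8. "head" is the first few bytes of the input.
     */
    public static String pickCharset(byte[] head, String userCharset) {
        if (userCharset != null) {
            return userCharset;
        }
        // UTF-32 must be tested before UTF-16: the UTF-32LE BOM (FF FE 00 00)
        // begins with the UTF-16LE BOM (FF FE). A UTF-16LE text whose first
        // character is U+0000 would be misread here; this sketch accepts that.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return "UTF-8"; // no BOM and no user choice: assume UTF-8
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 0x1F, 0x04};
        System.out.println(pickCharset(utf16le, null));      // UTF-16LE
        System.out.println(pickCharset(utf16le, "KOI8-U"));  // KOI8-U
        System.out.println(pickCharset(new byte[0], null));  // UTF-8
    }
}
```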

Member

Larhzu commented Oct 8, 2024 via email

Author

ljdarj commented Oct 11, 2024

For an automatic conversion, then, I think the target ought to be SCSU, BOCU-1 and their ilk: for all intents and purposes, mapping the text onto a given Unicode block and escaping to UTF-16 for whatever falls outside it is basically what a good encoder for them does... assuming, of course, that converting to UTF-16 or UTF-32 before running LZMA2 on it doesn't give bigger wins when all is said and done.
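To make the "one block plus escapes" idea concrete, here is a toy encoder. It is deliberately not SCSU (no dynamic windows, no tag set, no Unicode mode) and not the prototype: `WindowToy`, the fixed window at U+0400 and the reuse of byte 0x0E as the escape marker are all choices made for the illustration. Characters inside the window cost one byte, printable ASCII passes through, and everything else is quoted as a marker byte followed by the big-endian UTF-16 code unit.

```java
import java.io.ByteArrayOutputStream;

public class WindowToy {
    static final int WINDOW_BASE = 0x0400; // assumed fixed window: start of the Cyrillic block
    static final int QUOTE = 0x0E;         // toy escape marker, followed by one UTF-16 code unit

    public static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean plainAscii = (c >= 0x20 && c < 0x80)
                    || c == '\t' || c == '\n' || c == '\r';
            if (plainAscii) {
                out.write(c);                          // ASCII passes through unchanged
            } else if (c >= WINDOW_BASE && c < WINDOW_BASE + 0x80) {
                out.write(0x80 + (c - WINDOW_BASE));   // one byte per in-window character
            } else {
                out.write(QUOTE);                      // escape: marker + big-endian code unit
                out.write(c >> 8);
                out.write(c & 0xFF);
            }
        }
        return out.toByteArray();
    }

    public static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF;
            if (b == QUOTE) {
                sb.append((char) (((data[i + 1] & 0xFF) << 8) | (data[i + 2] & 0xFF)));
                i += 2;
            } else if (b >= 0x80) {
                sb.append((char) (WINDOW_BASE + (b - 0x80)));
            } else {
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String word = "\u041F\u0440\u0438\u0432\u0456\u0442"; // "Привіт"
        byte[] packed = encode(word);
        System.out.println(packed.length);                // 6 (UTF-8 needs 12 bytes)
        System.out.println(decode(packed).equals(word));  // true
    }
}
```

A real SCSU or BOCU-1 encoder additionally has to decide when to move the window, which is where most of the encoder's quality comes from.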

Author

ljdarj commented Oct 19, 2024

I have a question before trying the UTF-16/UTF-32 conversions: for data that is 2 bytes (resp. 4 bytes) wide, we set lc to 0 and lp to 1 (resp. 2), right?
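For context on why the width matters here (lp being the number of literal position bits), this small sketch just transcodes a few Cyrillic characters with the JDK and prints the bytes. It doesn't settle which lc/lp values are best; `WidthDemo` and `hex` are names invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WidthDemo {
    /** Formats bytes as space-separated uppercase hex. */
    public static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u041F\u0440\u0438"; // "При"
        // 2 bytes per character: every second byte is the block's high byte, 0x04.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16LE)));
        // prints: 1F 04 40 04 38 04
        // 4 bytes per character: the last two bytes of each unit are zero.
        System.out.println(hex(text.getBytes(Charset.forName("UTF-32LE"))));
        // prints: 1F 04 00 00 40 04 00 00 38 04 00 00
    }
}
```

What a byte is likely to be depends strongly on its position within the 2-byte or 4-byte unit, which is the kind of structure the position bits are meant to capture.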

Member

Larhzu commented Oct 19, 2024 via email

2 participants