unzip non-UTF-8 archives #112
Comments
There is also this craziness: https://unix.stackexchange.com/a/364344
Interesting. I've never had to deal with locales, so this could be a challenge, but I'll try to implement this when I get the chance. Since it's a larger change, it could take a while.
https://github.com/vlm/zip-fix-filename-encoding/blob/master/src/runzip.c might help a little. They try to guess the encoding from character frequencies there. There are also some test files that might be useful (https://github.com/Stuk/jszip/tree/master/test/ref). I was probably wrong about jszip in the first comment; apparently they handle it somehow (or at least they have tests for it), yet for some of my files jszip produces wrong file names. And I definitely have cp866.
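To make the frequency idea concrete, here is a much-simplified sketch. It is not the algorithm from runzip.c; it only counts how many bytes would be lowercase Cyrillic letters under cp866 versus cp1251 and picks the higher score. The function name `guessCyrillicEncoding` and the choice of exactly these two candidates are assumptions made for illustration.

```javascript
// Hypothetical helper: guess whether raw file name bytes are cp866 or cp1251.
// Lowercase letters dominate ordinary names, so each code page is scored by
// how many bytes fall into its lowercase Cyrillic range.
function guessCyrillicEncoding(bytes) {
  let cp866 = 0;
  let cp1251 = 0;
  for (const b of bytes) {
    // cp866 lowercase: 0xA0-0xAF and 0xE0-0xEF
    if ((b >= 0xa0 && b <= 0xaf) || (b >= 0xe0 && b <= 0xef)) cp866++;
    // cp1251 lowercase: 0xE0-0xFF
    if (b >= 0xe0) cp1251++;
  }
  if (cp866 === 0 && cp1251 === 0) return null; // pure ASCII, nothing to guess
  return cp866 >= cp1251 ? 'cp866' : 'cp1251';
}
```

The two ranges overlap (0xE0-0xEF counts for both), so short names can be misclassified; a real detector would need per-letter frequencies rather than range counts.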
I've found that Unicode filenames in general are annoying to deal with. You need to set
JSZip punts the issue to the end user: https://github.com/Stuk/jszip/blob/master/documentation/api_jszip/load_async.md#decodefilename-option
That actually seems like a decent solution. I'm assuming no sane person would use cp866 for creating new ZIP files, so just decode support might work OK with a similar setting in
I think it is tempting for archiver authors to use one-byte encodings to save a few more bytes, so the problem is not going away any time soon. But yeah, putting the problem on the user is fine by me too.
I still want to fix this issue but I can't promise I'll get to it in v0.8.2. I will give it another honest attempt though.
Does this issue only occur with non-UTF-8 archives? I'm having this issue with a zip created on macOS:
becomes
And ChatGPT has me believe macOS uses UTF-8 for filenames inside zip.
What can't you do right now?
In Russia, file names inside zip files are often encoded with cp866. Such file names are currently decoded incorrectly by fflate. The best I can do is
but it produces correct characters interleaved with some gibberish.
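For reference, once the raw name bytes are available, decoding the Cyrillic part of cp866 takes only a few lines. The sketch below is a partial decoder written for illustration: `decodeCp866` is a hypothetical name, it assumes the caller already has the undecoded bytes, and it deliberately maps only ASCII and the Russian letters, replacing everything else (box-drawing characters and the remaining symbols) with U+FFFD.

```javascript
// Hypothetical partial cp866 decoder: ASCII plus Russian letters only.
function decodeCp866(bytes) {
  let out = '';
  for (const b of bytes) {
    if (b < 0x80) {
      out += String.fromCharCode(b); // ASCII is unchanged
    } else if (b <= 0xaf) {
      out += String.fromCharCode(0x0410 + (b - 0x80)); // 0x80-0xAF -> U+0410-U+043F
    } else if (b >= 0xe0 && b <= 0xef) {
      out += String.fromCharCode(0x0440 + (b - 0xe0)); // 0xE0-0xEF -> U+0440-U+044F
    } else if (b === 0xf0) {
      out += '\u0401'; // Ё
    } else if (b === 0xf1) {
      out += '\u0451'; // ё
    } else {
      out += '\ufffd'; // not covered by this sketch
    }
  }
  return out;
}
```

In runtimes whose `TextDecoder` supports the `ibm866` label from the WHATWG Encoding Standard, `new TextDecoder('ibm866').decode(bytes)` should cover the whole code page without a hand-written table.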
An optimal solution
Either provide the raw name in UnzipFile, or make it possible to provide an encoding for entries marked as not utf-8.
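A rough sketch of what the second option could look like. In the ZIP format, bit 11 (0x0800) of an entry's general purpose bit flag marks the name as UTF-8, so a decoder only needs the caller's help when that bit is clear. `decodeEntryName` and its `decodeLegacy` callback are hypothetical names for illustration, not part of fflate's API.

```javascript
// Bit 11 of the ZIP general purpose bit flag: file name is UTF-8.
const UTF8_FLAG = 0x0800;

// Hypothetical hook: use UTF-8 when the entry says so, otherwise defer
// to a user-supplied decoder for the legacy encoding.
function decodeEntryName(rawName, flags, decodeLegacy) {
  if (flags & UTF8_FLAG) {
    return new TextDecoder('utf-8').decode(rawName);
  }
  return decodeLegacy(rawName);
}
```

A caller who knows their archives use cp866 would pass a cp866 decoder as `decodeLegacy`; everyone else could pass a Latin-1 or cp437 fallback.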
(How) is this done by other libraries?
jszip also fails to decode it correctly.
There is
unzip -O cp866
in Ubuntu starting from some version, and before that version I believe they had a hack that used cp866 automatically if it saw a Russian locale in the OS.
A browser equivalent for that hack would be
navigator.language == 'ru-RU'
if you are willing to use that approach.
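A minimal sketch of that locale hack, assuming only the Russian case described above. `legacyEncodingForLocale` is a hypothetical helper; it compares the language subtag rather than the full `ru-RU` string so that a bare `ru` also matches.

```javascript
// Hypothetical helper: pick a fallback encoding for non-UTF-8 entries
// from a BCP 47 locale string. Only the Russian -> cp866 case is covered.
function legacyEncodingForLocale(locale) {
  const lang = String(locale || '').toLowerCase().split('-')[0];
  return lang === 'ru' ? 'cp866' : null;
}

// In a browser this would be called as:
//   legacyEncodingForLocale(navigator.language)
```

The guess is only a default; a user on a Russian locale can still open archives created elsewhere, so an explicit override remains useful.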