Unexpected Crash When Archiving Files with Special Characters in Names #241
Comments
Can you attach a zip archive with a set of files that will reproduce the issue? I'll take a look when I'm back from vacation. The test suite explicitly covers files with Unicode characters (including emojis) and runs fine on Windows. |
Okay, here's something. At least one of those files is causing the crash. |
Thanks, that helped a lot! The root cause is that the non-empty file in the archive does indeed contain a character in the file name that cannot be represented in Unicode (this is what you see as �). This triggers an error when converting the file name to UTF-8 (which DwarFS uses internally). This error causes
This is certainly fixable, but I'm not entirely sure yet exactly how to fix it. The logging code crashing is obviously a bug, and for that bit I already have a fix. Producing a useful log message is going to be more challenging, but is likely doable with some Windows-specific code to convert the "raw" file name to Unicode without throwing an exception.
But given we can replace invalid file name characters by �, what should the overall behavior be? I'm leaning towards adding the file with the modified file name, although this would mean the stored file name is different from the original file name. Alternatively, files with invalid file names could be skipped. |
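To make the "replace invalid file name characters by �" option concrete, here is a minimal Python sketch. It is illustrative only and not DwarFS code; the helper name `utf16_units_to_utf8` is hypothetical. It walks the 16-bit code units of a Windows-style file name, combines valid surrogate pairs, and substitutes U+FFFD for any unpaired surrogate, so the conversion to UTF-8 cannot fail.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def utf16_units_to_utf8(units):
    """Convert UTF-16 code units to UTF-8 bytes, replacing unpaired surrogates.

    Hypothetical helper for illustration; `units` is a sequence of ints in
    the range 0..0xFFFF, the way a Windows file name is stored.
    """
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:  # high surrogate
            nxt = units[i + 1] if i + 1 < len(units) else None
            if nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
                # Valid pair: combine into one supplementary code point.
                code_point = 0x10000 + ((unit - 0xD800) << 10) + (nxt - 0xDC00)
                chars.append(chr(code_point))
                i += 2
                continue
            chars.append(REPLACEMENT)  # high surrogate without a partner
        elif 0xDC00 <= unit <= 0xDFFF:
            chars.append(REPLACEMENT)  # low surrogate without a preceding high
        else:
            chars.append(chr(unit))
        i += 1
    return "".join(chars).encode("utf-8")


# A valid pair (U+1F6A8, the siren emoji) survives unchanged;
# a lone 0xD800 is replaced by U+FFFD.
valid = utf16_units_to_utf8([0x61, 0xD83D, 0xDEA8])
broken = utf16_units_to_utf8([0x61, 0xD800, 0x62])
```

The mapping is lossy: two different invalid names can collapse to the same stored name, which is the trade-off mentioned above when the stored file name differs from the original.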
Thanks for the detailed explanation! I agree with your suggestion to add the file with the modified file name. While it means the stored file name would be different from the original, it seems like the more practical solution compared to skipping the file entirely. This way, at least the file will still be included in the archive, even if the name has been altered slightly. I'm also encountering another issue. The process seems to work by first searching for all files, then scanning and hashing them. In my case, after completing this phase, the process stopped accessing the disk and started using around 20% of the CPU. It stayed in this state for about 3 minutes without showing any progress, and then it crashed again without any error message. Looking forward to hearing your thoughts. |
The CI workflow produces debug build artifacts for Windows, e.g. here, look for dwarfs-0.10.1-Windows-AMD64-debug.7z. Instead of a silent crash, these will (hopefully) show an assertion / more details on why the program crashed. Maybe you can give this a try? |
For some reason, Windows allows invalid UTF-16 characters in file names. Try to handle these gracefully when converting to UTF-8.
Can you check if dwarfs-0.10.1-71-ga5b71e2cb3-Windows-AMD64.7z or dwarfs-0.10.1-71-ga5b71e2cb3-Windows-AMD64-debug.7z fixes this issue? |
Thanks a lot. I’ll definitely try it, but probably sometime next week. My use case is quite time-consuming. This got me thinking that I’d like to ask for some advice, if you don’t mind it being a bit off-topic, on how to set the command line parameters for the best possible result. To give you an idea of what I’m trying to do – I have a directory that I want to archive. These are various backups of different things over several years. But mostly, they are full copies of system disks of several (Windows) PCs. Therefore, I’m counting a lot on deduplication because these disks are surely full of duplicate files (or parts of files). The whole thing is 7.5TB and contains 17 million files. The goal is to reduce the disk space it occupies while still being able to access the files relatively quickly and conveniently. I’ve already applied KopiaUI to it, which reduced it to 3.7TB, and it took about 26 hours. I was curious how DwarFS would handle it. (Of course, I’d be happy to share the results if you’re interested.) So, could you advise me on how to set the parameters in the best way to achieve the best result, while also making sure it doesn’t take forever? (It might be important to note that I have about 16GB of free RAM available.) Thanks in advance! |
The holy grail of archiving: producing the best possible result regardless of input. :) In all seriousness, if there was a set of "best possible parameters", I'd make it the default. Most options involve trade-offs. If you want to maximize compression, it's likely going to be (much) slower and use (much) more memory. If you want to maximize access speed, you won't be able to maximize compression. If you have a good understanding of the data you're going to compress, this can guide you in tweaking different options. But that's going to be hard with a very heterogeneous set of data. Here's what I'd do:
Definitely let me know your results! |
Thanks a lot for the valuable advice. Thanks again for your help, and I'll definitely let you know how it turns out. |
It's only the block compression that can be changed after the fact. I'm thinking about the possibility of metadata manipulation, but there's currently no clear plan. It should be possible to change the block size, too, once the metadata can be manipulated. Everything else (window size/shift, lookback, categorizers, ...) cannot be changed later.
|
Hello, |
Okay, time to get out some slightly bigger guns... That "Debug Error!" dialog isn't very useful and it's been annoying me ever since I started porting DwarFS to Windows. I've made a change to:
Please try again using dwarfs-0.10.1-76-g9e6ed1fec6-Windows-AMD64-debug.7z. |
Here's the stack trace, though I don't know whether it will help.
|
Sorry, that's not what this is supposed to look like. :/ I have to admit that I haven't used Windows seriously for more than 20 years, so obviously my experience here is limited. As it turns out, for the stack traces to work, you need a
I've made a new build and tried really hard to make sure it actually works this time. Please get dwarfs-0.10.1-78-g4a0dba5aec-Windows-AMD64-debug-stacktrace.7z and make sure you're running |
OK, here is a slightly more informative stack trace. I hope it helps.
|
Thanks so much, that was helpful. Not quite to the extent that I had hoped, but definitely better than nothing. :) I've actually managed to get the code to crash on my machine under some circumstances, but never with the same stack trace as on your machine. I've made two changes:
Here's a new version including these changes: dwarfs-0.10.1-115-g53ac77f237-Windows-AMD64-debug-stacktrace.7z. It should (hopefully) run without crashing, but in either case I'd be very interested in the full output. |
Hi,
As a final note for this issue, I’d like to ask something somewhat unrelated. I admit I didn’t read through all the discussions, so I might be touching on a topic that has already been mentioned, but still: would it really be so technically challenging to implement some option to save an intermediate state of the application, either periodically or phase by phase, or perhaps a “pre-crash” state of the app or an "on-demand" state triggered by a key? This way, it would be possible to resume from that point rather than starting completely from the beginning.
Thank you very much for your interest and help, regardless. Please also keep in mind that English is not my native language, so everything you're reading from me has gone through a translator, and it might not come across exactly as I intended. Especially if something sounds offensive in any way, it's definitely a translation mistake, because my main intention is, in fact, gratitude. |
And a small note regarding logging? |
I wouldn't have been able to tell that you were using a translator. I really appreciate the time and effort you've put not only into reproducing the errors, but also into writing this up properly.
That's great! :)
Probably not lost, but more likely very hard to spot.
There should have been at least errors for the file(s) with changes to their file name. If you could at some point run this again with the default log level and check if there's anything suspicious; in particular, I'm looking for anything that would look like this:
If you don't see anything like this, that means I've somehow managed to fix the issue you ran into. :)
It would be technically challenging. A "pre-crash" version would be even more challenging. Here's the problem: There aren't that many "phases" with a singular transition state. What do I mean by that? To remind myself of what's going on, I've made this sequence diagram. There are 3 "phases":
So the only somewhat feasible "checkpoint" would be between phases 2 and 3; this is the only time when
I think having the possibility to "checkpoint" after the scanning phase would be nice, although you'd have to be aware that when you resume (which could be two years later in theory), the input data may have significantly changed. I'll add it to my huge bucket of ideas. Any more granular "checkpointing" is completely off the table. It would make the code much more complex, likely slower, and almost impossible to test.
The request is definitely reasonable! But the options are really limited. In the meantime, the best thing you can do to limit the amount of time it takes to create an image, as mentioned earlier, is to use a fast compression algorithm and recompress the image later.
Yes, the speed is per-core. Hashing is the first thing that happens, so this one is typically limited by I/O speed. Both categorization and similarity hashing can show much higher speeds because they happen after hashing and benefit from the fact that the files are already cached. Take these numbers with a grain of salt; the way they're currently defined (per-core wall-clock speed) is probably not what you'd intuitively expect. |
Hi,
Otherwise, a save point at least between the scanning and compression phases would be quite nice. For me, the scanning phase takes over 20 hours. And when I finally got to the compression phase, I quickly calculated, based on the current compression speed (with the settings you recommended), that the compression phase would take another 5 days or so. Unfortunately, I wasn’t able to let it run that long, so I had to stop it, and now I’ll have to start over from the beginning. I’ll either try an even faster compression or figure out how to run it on a more powerful machine. In any case, while I don't yet know what the final compression ratio will be, I have to note that KopiaUI finished completely in 26 hours, as I mentioned earlier, so based on my current testing, I’m afraid you might not be able to beat that time. But I definitely don’t want to say I’m giving up. ;-) |
I hope you can make sense of the output. Unfortunately, the formatting got messed up somehow. |
And... |
Okay, there are two potential bottlenecks here:
So I would guess that it's a file I/O bottleneck, based on the fact that you're not seeing a single saturated core.
I don't know for sure (I'm definitely not a Windows expert), but in my tests, accessing files on Windows was painfully slow compared to Linux. Things got slightly better when I disabled real-time malware scanning (this isn't a recommendation).
I don't know how Kopia works exactly, but I assume that it's doing a single pass over the files rather than two passes. The fact that DwarFS' scanning phase takes roughly the same amount of time as Kopia's whole backup operation is a strong hint. However, this means that (depending on the input data), Kopia will likely fail to detect a lot of sub-file-level duplication.
In fact, I've just downloaded Kopia and run it on one of my test cases. Kopia compressed 51 GB of input data in ~2 million files to 18 GB in about 30 seconds. DwarFS compressed the same input data to 325 MB (that's more than 50 times smaller) in less than 2 minutes. Note that this is not an average use case, so the difference on average can be much smaller. But when you have an input dataset that you know is highly redundant across files, DwarFS will very likely perform much better.
Ultimately, the two tools have slightly different goals. The goal of DwarFS is to compress redundant data as much as possible without being ridiculously slow. Kopia's primary goal is likely speed, while still offering acceptable compression. |
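The ratios quoted above can be checked with a few lines of Python. This sketch assumes decimal units (1 GB = 1000 MB); the thread does not say which convention the figures use.

```python
input_mb = 51 * 1000   # 51 GB of input data
kopia_mb = 18 * 1000   # Kopia result: 18 GB
dwarfs_mb = 325        # DwarFS result: 325 MB

kopia_ratio = input_mb / kopia_mb        # input size / Kopia output size
dwarfs_ratio = input_mb / dwarfs_mb      # input size / DwarFS output size
dwarfs_vs_kopia = kopia_mb / dwarfs_mb   # how much smaller the DwarFS image is

print(f"Kopia:  {kopia_ratio:.1f}x")
print(f"DwarFS: {dwarfs_ratio:.1f}x")
print(f"DwarFS image is {dwarfs_vs_kopia:.1f}x smaller than the Kopia result")
```

Under that assumption the DwarFS image comes out roughly 55 times smaller than the Kopia result, which is consistent with the "more than 50 times smaller" figure.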
Ahhhh, and that is likely what triggered the crash. Not the underlying error, but the fact that the Czech error message wasn't UTF-8, but rather some Windows code page ( |
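A sketch of this failure mode, in Python rather than DwarFS's C++: bytes produced in a legacy Windows code page are generally not valid UTF-8, so code that assumes UTF-8 has to validate the input or fall back to another decoding. The code page `cp1250` and the sample message are assumptions chosen purely for illustration, and `message_to_text` is a hypothetical helper, not a DwarFS function.

```python
def message_to_text(raw, fallback_encoding="cp1250"):
    """Decode an OS error message, tolerating non-UTF-8 input.

    `fallback_encoding` is an assumption made for this example; it is not
    taken from the thread.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: decode with the assumed code page instead and
        # replace any byte that still cannot be mapped.
        return raw.decode(fallback_encoding, errors="replace")


# Made-up Czech sample text, encoded the way a legacy code page would.
sample = "Přístup byl odepřen"
legacy_bytes = sample.encode("cp1250")

try:
    legacy_bytes.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False  # strict UTF-8 decoding rejects these bytes

recovered = message_to_text(legacy_bytes)
```

The strict decode fails on the very first accented letter, whereas the fallback path recovers the text; the point is only that the message has to be converted or sanitized before it reaches code that expects UTF-8.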
|
Description: When attempting to archive a complex directory structure with a large number of files using DwarFS, the program crashes unexpectedly without any error message. After some investigation and testing, it appears that the crash is caused by files with special characters in their names.
An example of a problematic file name is:
2022-06-07 16-55-41 URGENT_ 🚨Someone have Run a Background-check on Your (Public Records)🔎 Tape Your Name & See Results Yourself!⭐ N°7267.eml
It seems that characters such as emojis (e.g., 🚨, 🔎, ⭐), special symbols, or other non-standard characters in file names are not handled properly by DwarFS, resulting in the crash.
Steps to Reproduce:
1. Create a directory structure with files containing special characters and emojis in their names.
2. Attempt to archive the directory using DwarFS.
3. The program crashes without an error message.
Expected Behavior: DwarFS should either handle the special characters gracefully or provide a clear error message indicating which file name is causing the issue.
Environment:
Dwarfs version: 0.10.1
OS: Windows 10
Please let me know if more information is needed to address this issue. Thank you!
EDIT:
After further investigation, it seems that the issue might not be caused solely by emoji characters. There are indications that the Mathematical Alphanumeric Symbols could be contributing to the problem as well. These symbols occupy the Unicode block ranging from U+1D400 to U+1D7FF and include various styled versions of Latin letters and digits (e.g., bold, italic, double-struck). They are typically used in mathematical notation but can appear in file names and text. Because they lie outside the Basic Multilingual Plane (each one needs a surrogate pair in UTF-16), they may not be handled properly by DwarFS.
Additionally, there is a confirmed issue with the replacement character (�), which is represented by U+FFFD in Unicode. This character often appears when there is an invalid or undecodable sequence in the text. It seems that DwarFS is having difficulty processing file names containing this character, which is likely causing it to crash.
Apologies for any confusion caused, as this issue has proven to be quite difficult to diagnose. One major challenge is that DwarFS doesn't provide a clear error message indicating which specific file or character caused the failure, making it tough to isolate the root cause.
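One way to narrow down the offending names is a small diagnostic script. The following is a minimal Python sketch, not part of DwarFS; `name_problems` and `scan_tree` are hypothetical helpers. It assumes that Python on Windows reports ill-formed UTF-16 in a file name as lone surrogate code points in the returned `str`, and it flags those along with literal U+FFFD characters and characters outside the Basic Multilingual Plane.

```python
import os


def name_problems(name):
    """Return a list of reasons why `name` might be hard to convert or display."""
    problems = []
    for ch in name:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # Ill-formed UTF-16 in the original name shows up like this.
            problems.append(f"unpaired surrogate U+{cp:04X}")
        elif cp == 0xFFFD:
            problems.append("literal replacement character U+FFFD")
        elif cp > 0xFFFF:
            # Valid, but needs a surrogate pair in UTF-16 (emoji, math symbols).
            problems.append(f"non-BMP character U+{cp:04X}")
    return problems


def scan_tree(root):
    """Yield (path, problems) for every entry under `root` with a flagged name."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            problems = name_problems(name)
            if problems:
                yield os.path.join(dirpath, name), problems
```

Looping over `scan_tree` on the directory to be archived and printing `ascii(path)` for each hit keeps the output readable even when the console cannot render the name. Non-BMP characters are listed only for information; unpaired surrogates are the entries that cannot be converted to UTF-8 without replacement.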