Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected Crash When Archiving Files with Special Characters in Names #241

Open
Zpovednice-adm opened this issue Oct 5, 2024 · 25 comments
Assignees
Labels
bug Something isn't working

Comments

@Zpovednice-adm
Copy link

Zpovednice-adm commented Oct 5, 2024

Description: When attempting to archive a complex directory structure with a large number of files using Dwarfs, the program crashes unexpectedly without any error message. After some investigation and testing, it appears that the crash is caused by files with special characters in their names.

An example of a problematic file name is:

2022-06-07 16-55-41 URGENT_ 🚨Someone have Run a Background-check on Your (Public Records)🔎 Tape Your Name & See Results Yourself!⭐ N°7267.eml

It seems that characters such as emojis (e.g., 🚨, 🔎, ⭐), special symbols, or other non-standard characters in file names are not handled properly by Dwarfs, resulting in the crash.

Steps to Reproduce:

Create a directory structure with files containing special characters and emojis in their names.
Attempt to archive the directory using Dwarfs.
The program crashes without an error message.
Expected Behavior: Dwarfs should either handle the special characters gracefully or provide a clear error message indicating which file name is causing the issue.

Environment:

Dwarfs version: 0.10.1
OS: Windows 10

Please let me know if more information is needed to address this issue. Thank you!

EDIT:
After further investigation, it seems that the issue might not be solely caused by emoji characters. There are indications that the Mathematical Alphanumeric Symbols could be contributing to the problem as well. These symbols are part of the Unicode block ranging from U+1D400 to U+1D7FF and include various styled versions of Latin letters and digits (e.g., bold, italic, double-struck). They are typically used in mathematical notations but can appear in file names and text. Since they have different Unicode representations, they may not be handled properly by dwarfs.

Additionally, there is a confirmed issue with the replacement character (�), which is represented by U+FFFD in Unicode. This character often appears when there is an invalid or undecodable sequence in the text. It seems that dwarfs is having difficulty processing file names containing this character, which is likely causing it to crash.

Apologies for any confusion caused, as this issue has proven to be quite difficult to diagnose. One major challenge is that dwarfs doesn't provide a clear error message indicating which specific file or character caused the failure, making it tough to isolate the root cause.

@mhx
Copy link
Owner

mhx commented Oct 5, 2024

Can you attach a zip archive with a set of files that will reproduce the issue? I'll take a look when I'm back from vacation. The test suite explicitly covers files with Unicode characters (including emojis) and runs fine on Windows.

@Zpovednice-adm
Copy link
Author

Okay, here's something. At least one of those files is causing the crash.
dwarfs issue files.zip

@mhx
Copy link
Owner

mhx commented Oct 8, 2024

Thanks, that helped a lot!

The root cause is that the non-empty file in the archive does indeed contain a character in the file name that cannot be represented in Unicode (this is what you see as �). This triggers an error when converting the file name to UTF-8 (which DwarFS uses internally).

This error causes mkdwarfs to try to log a message about the error that just happened. However, the log statement assumes that it's possible to output the file name, which triggers the same error again. So it fails to produce a log message and that empty log message makes the logging code fall over, triggering an assertion (the logging code didn't expect empty log messages). In the release build, this assertion is compiled out and the code performs some invalid memory access, resulting in the crash you're seeing.

This is certainly fixable, but I'm not entirely sure yet exactly how to fix it.

The logging code crashing is obviously a bug, and for that bit I already have a fix.

Producing a useful log message is going to be more challenging, but is likely doable with some Windows-specific code to convert the "raw" file name to Unicode without throwing an exception.

But given we can replace invalid file name characters by �, what should the overall behavior be? I'm leaning towards adding the file with the modified file name, although this would mean the stored file name is different from the original file name. Alternatively, files with invalid file names could be skipped.

@Zpovednice-adm
Copy link
Author

Thanks for the detailed explanation!

I agree with your suggestion to add the file with the modified file name. While it means the stored file name would be different from the original, it seems like the more practical solution compared to skipping the file entirely. This way, at least the file will still be included in the archive, even if the name has been altered slightly.

I'm also encountering another issue. The process seems to work by first searching for all files, then scanning and hashing them. In my case, after completing this phase, the process stopped accessing the disk and started using around 20% of the CPU. It stayed in this state for about 3 minutes without showing any progress, and then it crashed again without any error message.

Looking forward to hearing your thoughts.

@mhx
Copy link
Owner

mhx commented Oct 10, 2024

The CI workflow produces debug build artifacts for Windows, e.g. here, look for dwarfs-0.10.1-Windows-AMD64-debug.7z.

Instead of a silent crash, these will (hopefully) show an assertion / more details on why the program crashed. Maybe you can give this a try?

mhx added a commit that referenced this issue Oct 10, 2024
For some reason, Windows allows invalid UTF-16 characters in file names.
Try to handle these gracefully when converting to UTF-8.
@mhx
Copy link
Owner

mhx commented Oct 10, 2024

@Zpovednice-adm
Copy link
Author

Thanks a lot.

I’ll definitely try it, but probably sometime next week. My use case is quite time-consuming.

This got me thinking that I’d like to ask for some advice, if you don’t mind it being a bit off-topic, on how to set the command line parameters for the best possible result. To give you an idea of what I’m trying to do – I have a directory that I want to archive. These are various backups of different things over several years. But mostly, they are full copies of system disks of several (Windows) PCs. Therefore, I’m counting a lot on deduplication because these disks are surely full of duplicate files (or parts of files). The whole thing is 7.5TB and contains 17 million files. The goal is to reduce the disk space it occupies while still being able to access the files relatively quickly and conveniently. I’ve already applied KopiaUI to it, which reduced it to 3.7TB, and it took about 26 hours. I was curious how DwarFS would handle it. (Of course, I’d be happy to share the results if you’re interested.) So, could you advise me on how to set the parameters in the best way to achieve the best result, while also making sure it doesn’t take forever? (It might be important to note that I have about 16GB of free RAM available.)

Thanks in advance!

@mhx
Copy link
Owner

mhx commented Oct 12, 2024

The holy grail of archiving: producing the best possible result regardless of input. :)

In all seriousness, if there was a set of "best possible parameters", I'd make it the default.

Most options involve trade-offs. If you want to maximize compression, it's likely going to be (much) slower and use (much) more memory. If you want to maximize access speed, you won't be able to maximize compression. If you have a good understanding of the data you're going to compress, this can guide you in tweaking different options. But that's going to be hard with a very heterogeneous set of data.

Here's what I'd do:

  • First, think about how you want to access the data later. This should affect the block size of the DwarFS image (-S). A large block size means that if you want to access a small file (say, 10KB), one or more file system blocks have to be decompressed. These blocks are typically several MB in size. If the chosen compression algorithm is slow to decompress (e.g. lzma), or if you want to access lots of small files, chances are a large block size will slow down access significantly (whether or not this is noticeable in your scenario is a different question, but it will be measurable). Also, fewer large blocks will fit in the block cache, which again can be detrimental for access performance. However, larger block sizes also mean (much) better compression. So if you want maximum compression, go for a larger block size (64 MB, or -S 26). If you want fast access, pick a smaller one (8/16 MB, or -S 23/-S 24).
  • If you know that there's a significant amount of already-compressed files (MP3, JPG, ZIP, all sorts of video) in the input, you can turn on the incompressible categorizer. This will significantly speed up the compression stage by not trying to compress the already compressed data.
  • If you know that there are uncompressed audio files (WAV, ...) in the input, turn on the pcmaudio categorizer.
  • While you can play with the -W / -w options to tweak the segmenter, this can become quite tedious and I don't recommend it unless you have massive amounts of time to spare or know that this will help compression based on knowing your data.
  • If you have enough temporary disk space, don't worry about the compression algorithm and its configuration yet. You can always change the algorithm later by re-writing the DwarFS image (see --recompress). I'd probably start with something like -C zstd:level=9 and once I have a working image, recompress it using different algorithms to see which one performs best. That way, you can skip the whole scanning / de-duplication stage and only focus on selecting the compression algorithm.
  • If you don't have the temporary disk space, choosing the compression algorithm, again, is a trade-off. lzma will, most of the time, give you the best compression, but at the expense of being about 10 times slower to decompress compared to zstd. Whether that slower decompression speed is something you'd notice is a different story and again, it's a question of how you're planning to access the data. If you're occasionally opening a couple of files, this choice won't matter. If you regularly want to search through the contents of thousands of files, it will.
  • There isn't much point in tweaking the "memory limit" (-L) unless you have very large blocks or many CPU cores. As a rule of thumb, if the memory limit is twice as large as the block size multiplied by the number of CPU cores, you're most likely fine.
  • Last but not least, once you have a DwarFS image, make sure you protect it from bit rot. A single bit flip in the metadata block of the DwarFS image can render your whole image useless.

Definitely let me know your results!

mhx added a commit that referenced this issue Oct 12, 2024
For some reason, Windows allows invalid UTF-16 characters in file names.
Try to handle these gracefully when converting to UTF-8.
mhx added a commit that referenced this issue Oct 13, 2024
For some reason, Windows allows invalid UTF-16 characters in file names.
Try to handle these gracefully when converting to UTF-8.
@Zpovednice-adm
Copy link
Author

Thanks a lot for the valuable advice.
I'll try ZSTD compression to start with – it was also used in the case of KopiaUI. I’ll see what comes out of it, and if necessary, I’ll try --recompress afterwards. That's very good idea, disk space is not problem for me. I just have one question about the --max-lookback-blocks option. I don’t think I fully understand how it works exactly, and mainly which phase it affects – whether it can also be changed during recompression, or if it already impacts the scanning phase.

Thanks again for your help, and I'll definitely let you know how it turns out.

@mhx
Copy link
Owner

mhx commented Oct 15, 2024

It's only the block compression that can be changed after the fact.

I'm thinking about the possibility of metadata manipulation, but there's currently no clear plan. It should be possible to change the block size, too, once the metadata can be manipulated.

Everything else (window size/shift, lookback, categorizers, ...) cannot be changed later.

--max-lookback-blocks sets the number of blocks the segmenter can use to find matches. The segmenter receives a stream of data and checks if it has seen some of that data before. If a match is found by the segmenter, it will insert a reference to the earlier data instead of the data itself. The -W option determines the minimum size of a match, and --max-lookback-blocks determines how far back the segmenter will look. The output of the segmenter are the final file system blocks and a list of "chunks" for each file, referencing the blocks. The blocks produced by the segmenter are passed on to the compression stage, but will be kept in memory for the segmenter for matching. Up to the last --max-lookback-blocks blocks are kept in memory. So that means the total lookback in bytes is the product of the block size and the number of lookback blocks. Hope that clears things up a bit.

@Zpovednice-adm
Copy link
Author

Hello,
I apologize for the bad news.
I tried dwarfs-0.10.1-71-ga5b71e2cb3-Windows-AMD64-debug.7z on my original data, and once again, it crashed on the file "2019-12-23 13-50-24 𝗘𝗹𝗶𝗺𝗶𝗻𝗮𝘁𝗲 𝗕𝗮𝘁𝗵𝗿𝗼𝗼𝗺 𝗕𝗮𝘁𝗵𝗶𝗻𝗴 𝗛𝗮𝘇𝗮𝗿𝗱𝘀—𝗚𝗲𝘁 𝗮 𝗪𝗮𝗹𝗸 𝗜𝗻 𝗕𝗮𝘁𝗵𝘁𝘂𝗯 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶.eml". I got a Microsoft Visual C++ Runtime Library error message, as shown in the screenshot. Nothing was printed to the console. I had the log set to trace, which allowed me to see in the log the last successfully processed file "... [file_scanner.cpp:273] scanning file ...".
But it neither processed nor logged the problematic file.

image

@mhx
Copy link
Owner

mhx commented Oct 17, 2024

Okay, time to get out some slightly bigger guns...

That "Debug Error!" dialog isn't very useful and it's been annoying me ever since I started porting DwarFS to Windows.

I've made a change to:

  1. disable that useless dialog and
  2. print a stack trace when abort() is called. This will also cover assertions, as they ultimately call abort() as well.

Please try again using dwarfs-0.10.1-76-g9e6ed1fec6-Windows-AMD64-debug.7z.

@Zpovednice-adm
Copy link
Author

Zpovednice-adm commented Oct 17, 2024

So here's the stacktrace. But I don't know if it will help anything.

Caught signal 22
Stack trace (most recent call first):
#0  0x00007ff70dff091a in  ??
#1  0x00007ff70e9e5174 in  ??
#2  0x00007ff70e9638a8 in  ??
#3  0x00007ff70e9c7a4f in  ??
#4  0x00007ff70e95c1c7 in  ??
#5  0x00007ff70e95dbc7 in  ??
#6  0x00007ff70e95ddf9 in  ??
#7  0x00007ff70e941e5a in  ??
#8  0x00007ff70e93af2f in  ??
#9  0x00007ff8a2d528be in _chkstk
#10 0x00007ff8a2d02553 in RtlRaiseException
#11 0x00007ff8a2d022a6 in RtlRaiseException
#12 0x00007ff8a072b698 in RaiseException
#13 0x00007ff70e93ffe1 in  ??
#14 0x00007ff70da5131a in  ??
#15 0x00007ff70da4bd53 in  ??
#16 0x00007ff70da4bf31 in  ??
#17 0x00007ff70da55dc8 in  ??
#18 0x00007ff70da4b137 in  ??
#19 0x00007ff70da4af17 in  ??
#20 0x00007ff70da4b097 in  ??
#21 0x00007ff70da4ae67 in  ??
#22 0x00007ff70e016465 in  ??
#23 0x00007ff70e00e5f7 in  ??
#24 0x00007ff70e00e3d4 in  ??
#25 0x00007ff70d9b8f50 in  ??
#26 0x00007ff70eb1257b in  ??
#27 0x00007ff70e96183f in  ??
#28 0x00007ff70e95ee2d in  ??
#29 0x00007ff8a2d51c25 in RtlCaptureContext2
#30 0x00007ff70dcda892 in  ??
#31 0x00007ff70dcf097a in  ??
#32 0x00007ff70dce4ee5 in  ??
#33 0x00007ff70d9f91a0 in  ??
#34 0x00007ff70d949a4c in  ??
#35 0x00007ff70d9fee43 in  ??
#36 0x00007ff70d9ff076 in  ??
#37 0x00007ff70da00cd2 in  ??
#38 0x00007ff70d9ff2f9 in  ??
#39 0x00007ff70da01c5e in  ??
#40 0x00007ff70da01b13 in  ??
#41 0x00007ff70d9fef4c in  ??
#42 0x00007ff70da4a925 in  ??
#43 0x00007ff70e93b5b8 in  ??
#44 0x00007ff70e93b461 in  ??
#45 0x00007ff70e93b31d in  ??
#46 0x00007ff70e93b64d in  ??
#47 0x00007ff8a1547373 in BaseThreadInitThunk
#48 0x00007ff8a2cfcc90 in RtlUserThreadStart

@mhx
Copy link
Owner

mhx commented Oct 18, 2024

Sorry, that's not how this is supposed to look like. :/

I have to admit that I haven't used Windows seriously for more than 20 years, so obviously my experience here is limited.

As it turns out, for the stack traces to work, you need a .pdb file alongside the .exe file and you need to run the .exe from within its directory, as otherwise it doesn't seem to find the .pdb.

I've made a new build and tried really hard to make sure it actually works this time. Please get dwarfs-0.10.1-78-g4a0dba5aec-Windows-AMD64-debug-stacktrace.7z and make sure you're running mkdwarfs from within its directory.

@mhx mhx self-assigned this Oct 18, 2024
@mhx mhx added the bug Something isn't working label Oct 18, 2024
@Zpovednice-adm
Copy link
Author

Zpovednice-adm commented Oct 20, 2024

Ok, here You are a little more informative stacktrace. I hope it helps.

Caught signal SIGABRT
Stack trace (most recent call first):
#0  0x00007ff6bdc00adf in dwarfs::`anonymous namespace'::fatal_signal_handler_win(int) at C:\actions-runner\_work\dwarfs\dwarfs\src\util.cpp:450
#1  0x00007ff6be62dfb4 in raise(int) at minkernel\crts\ucrt\src\appcrt\misc\signal.cpp:541
#2  0x00007ff6be5ac6e8 in abort() at minkernel\crts\ucrt\src\appcrt\startup\abort.cpp:64
#3  0x00007ff6be61088f in terminate() at minkernel\crts\ucrt\src\appcrt\misc\terminate.cpp:58
#4  0x00007ff6be5a5007 in FindHandler<__FrameHandler4>(EHExceptionRecord*, unsigned int*, _CONTEXT*, _xDISPATCHER_CONTEXT*, FH4::FuncInfo4*, unsigned int, int, unsigned int*) at D:\a\_work\1\s\src\vctools\crt\vcruntime\src\eh\frame.cpp:739
#5  0x00007ff6be5a6a07 in __InternalCxxFrameHandler<__FrameHandler4>(EHExceptionRecord*, unsigned int*, _CONTEXT*, _xDISPATCHER_CONTEXT*, FH4::FuncInfo4*, int, unsigned int*, unsigned int) at D:\a\_work\1\s\src\vctools\crt\vcruntime\src\eh\frame.cpp:396
#6  0x00007ff6be5a6c39 in __InternalCxxFrameHandlerWrapper<__FrameHandler4>(EHExceptionRecord*, unsigned int*, _CONTEXT*, _xDISPATCHER_CONTEXT*, FH4::FuncInfo4*, int, unsigned int*, unsigned int) at D:\a\_work\1\s\src\vctools\crt\vcruntime\src\eh\frame.cpp:236
#7  0x00007ff6be58ac9a in __CxxFrameHandler4(EHExceptionRecord*, unsigned int, _CONTEXT*, _xDISPATCHER_CONTEXT*) at D:\a\_work\1\s\src\vctools\crt\vcruntime\src\eh\risctrnsctrl.cpp:304
#8  0x00007ff6be583d6f in __GSHandlerCheck_EH4(_EXCEPTION_RECORD*, void*, _CONTEXT*, _DISPATCHER_CONTEXT*) at D:\a\_work\1\s\src\vctools\crt\vcstartup\src\gs\amd64\gshandlereh4.cpp:73
#9  0x00007ff9920d28be in _chkstk
#10 0x00007ff992082553 in RtlRaiseException
#11 0x00007ff9920822a6 in RtlRaiseException
#12 0x00007ff98f7eb698 in RaiseException
#13 0x00007ff6be588e21 in _CxxThrowException(void*, _s__ThrowInfo*) at D:\a\_work\1\s\src\vctools\crt\vcruntime\src\eh\throw.cpp:81
#14 0x00007ff6bd66132a in fmt::v10::detail::do_throw<std::runtime_error>(std::runtime_error&) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format.h:124
#15 0x00007ff6bd65bd63 in <lambda_2476853628ce5e4d38026ca625155c1d>::operator()(unsigned int, fmt::v10::basic_string_view<char>) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format-inl.h:1397
#16 0x00007ff6bd65bf41 in <lambda_22829297f0ddcbf21f33e8075ba356b8>::operator()(char*, char*) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format.h:672
#17 0x00007ff6bd665dd8 in fmt::v10::detail::for_each_codepoint<<lambda_2476853628ce5e4d38026ca625155c1d> >(fmt::v10::basic_string_view<char>, fmt::v10::detail::utf8_to_utf16::{ctor}::__l2::<lambda_2476853628ce5e4d38026ca625155c1d>) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format.h:680
#18 0x00007ff6bd65b147 in fmt::v10::detail::utf8_to_utf16::utf8_to_utf16(fmt::v10::basic_string_view<char>) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format-inl.h:1396
#19 0x00007ff6bd65af27 in fmt::v10::detail::write_console(int, fmt::v10::basic_string_view<char>) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format-inl.h:1444
#20 0x00007ff6bd65b0a7 in fmt::v10::detail::print(_iobuf*, fmt::v10::basic_string_view<char>) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format-inl.h:1468
#21 0x00007ff6bd65ae77 in fmt::v10::vprint(_iobuf*, fmt::v10::basic_string_view<char>, fmt::v10::basic_format_args<fmt::v10::basic_format_context<fmt::v10::appender, char> >) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format-inl.h:1478
#22 0x00007ff6bdc27955 in fmt::v10::print<std::basic_string_view<char, std::char_traits<char> > &>(_iobuf*, fmt::v10::basic_format_string<char, std::basic_string_view<char, std::char_traits<char> > &>, std::basic_string_view<char, std::char_traits<char> >&) at C:\actions-runner\_work\dwarfs\vcpkg-install-dwarfs\x64-windows-static\include\fmt\core.h:2941
#23 0x00007ff6bdc1ee07 in dwarfs::stream_logger::write_nolock(std::basic_string_view<char, std::char_traits<char> >) at C:\actions-runner\_work\dwarfs\dwarfs\src\logger.cpp:127
#24 0x00007ff6bdc1ebcc in dwarfs::stream_logger::write(dwarfs::logger::level_type, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, char*, int) at C:\actions-runner\_work\dwarfs\dwarfs\src\logger.cpp:249
#25 0x00007ff6bd5c8f60 in dwarfs::level_logger::~level_logger() at C:\actions-runner\_work\dwarfs\dwarfs\include\dwarfs\logger.h:154
#26 0x00007ff6be72521b in `dwarfs::writer::internal::scanner_<dwarfs::prod_logger_policy>::add_entry'::`1'::catch$26() at C:\actions-runner\_work\dwarfs\dwarfs\src\writer\scanner.cpp:372
#27 0x00007ff6be5aa67f in _CallSettingFrame_LookupContinuationIndex() at D:\a\_work\1\s\src\vctools\crt\vcruntime\src\eh\amd64\handlers.asm:97
#28 0x00007ff6be5a7c6d in __FrameHandler4::CxxCallCatchBlock() at D:\a\_work\1\s\src\vctools\crt\vcruntime\src\eh\frame.cpp:1439
#29 0x00007ff9920d1c25 in RtlCaptureContext2
#30 0x00007ff6bd8ebad2 in dwarfs::writer::internal::scanner_<dwarfs::prod_logger_policy>::add_entry(std::filesystem::path&, std::shared_ptr<dwarfs::writer::internal::dir>*, dwarfs::writer::internal::progress&, dwarfs::writer::internal::file_scanner&, bool) at C:\actions-runner\_work\dwarfs\dwarfs\src\writer\scanner.cpp:370
#31 0x00007ff6bd9013aa in dwarfs::writer::internal::scanner_<dwarfs::prod_logger_policy>::scan_tree(std::filesystem::path&, dwarfs::writer::internal::progress&, dwarfs::writer::internal::file_scanner&) at C:\actions-runner\_work\dwarfs\dwarfs\src\writer\scanner.cpp:545
#32 0x00007ff6bd8fa0f5 in dwarfs::writer::internal::scanner_<dwarfs::prod_logger_policy>::scan(dwarfs::writer::filesystem_writer&, std::filesystem::path&, dwarfs::writer::writer_progress&, std::optional<std::span<std::filesystem::path const , -1> >, std::shared_ptr<dwarfs::file_access const >*) at C:\actions-runner\_work\dwarfs\dwarfs\src\writer\scanner.cpp:667
#33 0x00007ff6bd6091b0 in dwarfs::writer::scanner::scan(dwarfs::writer::filesystem_writer&, std::filesystem::path&, dwarfs::writer::writer_progress&, std::optional<std::span<std::filesystem::path const , -1> >, std::shared_ptr<dwarfs::file_access const >*) at C:\actions-runner\_work\dwarfs\dwarfs\include\dwarfs\writer\scanner.h:67
#34 0x00007ff6bd559a5c in dwarfs::tool::mkdwarfs_main(int, wchar_t**, dwarfs::tool::iolayer&) at C:\actions-runner\_work\dwarfs\dwarfs\tools\src\mkdwarfs_main.cpp:1366
#35 0x00007ff6bd60ee53 in dwarfs::tool::main_adapter::operator()(int, wchar_t**) at C:\actions-runner\_work\dwarfs\dwarfs\src\tool\main_adapter.cpp:49
#36 0x00007ff6bd60f086 in `dwarfs::tool::main_adapter::safe'::`2'::<lambda_1>::operator()() at C:\actions-runner\_work\dwarfs\dwarfs\src\tool\main_adapter.cpp:63
#37 0x00007ff6bd610ce2 in std::invoke<`dwarfs::tool::main_adapter::safe'::`2'::<lambda_1> &>(dwarfs::tool::main_adapter::safe::__l2::<lambda_1>&) at C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.40.33807\include\type_traits:1705
#38 0x00007ff6bd60f309 in std::_Func_impl_no_alloc<`dwarfs::tool::main_adapter::safe'::`2'::<lambda_1>, int>::_Do_call() at C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.40.33807\include\functional:876
#39 0x00007ff6bd611c6e in std::_Func_class<int>::operator()() at C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.40.33807\include\functional:920
#40 0x00007ff6bd611b23 in dwarfs::tool::safe_main(std::function<int __cdecl(void)>*) at C:\actions-runner\_work\dwarfs\dwarfs\src\tool\safe_main.cpp:40
#41 0x00007ff6bd60ef5c in dwarfs::tool::main_adapter::safe(int, wchar_t**) at C:\actions-runner\_work\dwarfs\dwarfs\src\tool\main_adapter.cpp:63
#42 0x00007ff6bd65a935 in wmain(int, wchar_t**) at C:\actions-runner\_work\dwarfs\dwarfs\tools\src\mkdwarfs.cpp:28
#43 0x00007ff6be5843f8 in invoke_main() at D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:90
#44 0x00007ff6be5842a1 in __scrt_common_main_seh() at D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288
#45 0x00007ff6be58415d in __scrt_common_main() at D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:330
#46 0x00007ff6be58448d in wmainCRTStartup(void*) at D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_wmain.cpp:16
#47 0x00007ff9900e7373 in BaseThreadInitThunk
#48 0x00007ff99207cc90 in RtlUserThreadStart

@mhx
Copy link
Owner

mhx commented Oct 20, 2024

Thanks so much, that was helpful. Not quite to the extent that I had hoped, but definitely better than nothing. :)

I've actually managed the code to crash on my machine under some circumstances, but never with the same stack trace as on your machine.

I've made two changes:

  • Replaced (hopefully) all file path conversions with the "safe" conversions I introduced with the previous change. This got rid of the crashes on my machine.
  • Added code to (hopefully) catch the error I see happening on your machine and print a hexdump of the string that it is unable to write to the console.

Here's a new version including these changes: dwarfs-0.10.1-115-g53ac77f237-Windows-AMD64-debug-stacktrace.7z

It should (hopefully) run without crashing, but in either case I'd be very interested in the full output.

@Zpovednice-adm
Copy link
Author

Hi,
I tested the latest version on a sample dataset where that problematic file name was.
From my point of view, everything went more or less exactly as it should. No error occurred. In the resulting image file, the file name was corrected to "2019-12-23 13-50-24 𝗘𝗹𝗶𝗺𝗶𝗻𝗮𝘁𝗲 𝗕𝗮𝘁𝗵𝗿𝗼𝗼𝗺 𝗕𝗮𝘁𝗵𝗶𝗻𝗴 𝗛𝗮𝘇𝗮𝗿𝗱𝘀—𝗚𝗲𝘁 𝗮 𝗪𝗮𝗹𝗸 𝗜𝗻 𝗕𝗮𝘁𝗵𝘁𝘂𝗯 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶�.eml"
However, I had logging set to trace, so I’m not sure if any potential warning might have been lost within the tracing, but I didn’t get any visible alert about an error or filename change (at least, I didn’t see anything like that).
But the entire process completed successfully, and the file was found, with its contents matching. So, from my perspective, it looks good.

As a final note for this issue, I’d like to ask something somewhat unrelated. – I admit I didn’t read through all the discussions, so I might be touching on a topic that has already been mentioned, but still – would it really be so technically challenging to implement some option to save an intermediate state of the application, either periodically or phase by phase, or perhaps a “pre-crash” state of the app or an "on-demand" state triggered by a key? This way, it would be possible to resume from that point rather than starting completely from the beginning.
Sorry if I’m already annoying you with my requests ;-) – I really appreciate all the effort you’ve put into actively trying to help me. On the other hand, it’s true that I’ve gotten quite attached to your project, and my computer has spent hundreds of hours running your code... (For example, after 20 hours of running your script, my (fucking) MS Windows decided to restart due to a required update... Which, of course, I understand is not your problem. But still, it would be nice if I didn’t have to start the process all over again. Please don’t take this as an unreasonable request.)

Thank you very much for your interest and help, regardless.

Please also keep in mind that English is not my native language, so everything you're reading from me has gone through a translator, and it might not come across exactly as I intended. Especially if something sounds offensive in any way, it's definitely a translation mistake, because my main intention is, in fact, gratitude.

@Zpovednice-adm
Copy link
Author

And a small note regarding logging?
scanning: ?\S:\b\ZALOHA FOTEK\fotky_2016\novy fotak\6\jpg\DSC_2008.JPG
2,100,023 dirs, 0/0 soft/hard links, 12,142,778/17,931,612 files, 0 other
original size: 6.84 TiB, hashed: 2.266 TiB (14,134,318 files, 50.47 MiB/s)
scanned: 2.793 TiB (2,918,309 files, 42.42 MiB/s), categorizing: 0 B/s
saved by deduplication: 1.424 TiB (9,224,469 files), saved by segmenting: 0 B
filesystem: 0 B in 0 blocks (0 chunks, 2,918,308 fragments, 2,918,309/8,707,143 inodes)
compressed filesystem: 0 blocks/0 B written

shouldn’t that 50.47 MiB/s be multiplied by the number of CPU cores? Because then it would roughly correspond to what I see in the Task Manager in Windows, where I can see how many MiB/s were read from the HDD...

@mhx
Copy link
Owner

mhx commented Oct 21, 2024

I wouldn't have been able to tell that you were using a translator. I really appreciate the time and effort you've put not only into reproducing the errors, but also into writing this up properly.

So, from my perspective, it looks good.

That's great! :)

However, I had logging set to trace, so I’m not sure if any potential warning might have been lost within the tracing,

Probably not lost, but more likely very hard to spot.

but I didn’t get any visible alert about an error or filename change (at least, I didn’t see anything like that).

There should have been at least errors for the file(s) with changes to their file name. If you could at some point run this again with the default log level and check if there's anything suspicious; in particular, I'm looking for anything that would look like this:

Unexpected error writing string:
00000000  0d 1b 5b 41 1b 5b 41 1b  5b 41 1b 5b 41 1b 5b 41  |..[A.[A.[A.[A.[A|
00000010  1b 5b 41 1b 5b 41 1b 5b  41 1b 5b 41 49 20 31 38  |.[A.[A.[A.[AI 18|
00000020  3a 35 35 3a 35 33 2e 34  31 35 31 30 38 20 73 61  |:55:53.415108 sa|
00000030  76 69 6e 67 20 73 79 6d  6c 69 6e 6b 73 20 74 61  |ving symlinks ta|
00000040  62 6c 65 2e 2e 2e 20 5b  32 33 2e 32 75 73 5d 1b  |ble... [23.2us].|
00000050  5b 4b 0a e2 8e af e2 8e  af e2 8e af e2 8e af e2  |[K..............|
00000060  8e af e2 8e af e2 8e af  e2 8e af e2 8e af e2 8e  |................|
[...]

If you don't see anything like this, that means I've somehow managed to fix the issue you ran into. :)

would it really be so technically challenging to implement some option to save an intermediate state of the application, either periodically or phase by phase, or perhaps a “pre-crash” state of the app or an "on-demand" state triggered by a key? This way, it would be possible to resume from that point rather than starting completely from the beginning.

It would be technically challenging. A "pre-crash" version would be even more challenging.

Here's the problem: There aren't that many "phases" with a singular transition state.

What do I mean by that? mkdwarfs is highly asynchronous. Many tasks are performed in parallel. Processing isn't strictly linear, meaning that you cannot stop at an arbitrary point and know exactly what the state of the program is.

To remind myself of what's going on, I've made this sequence diagram. There are 3 "phases":

  1. The "startup" phase is just setup, no input has been processed yet when this finishes. Having a resumable "checkpoint" after this phase is useless.
  2. The "scanning" phase: this is where mkdwarfs builds its internal view of the input data, and performs tasks like categorization, file-level deduplication, and similarity hashing.
  3. The "segmentation" phase: this includes similarity ordering, block-level deduplication (segmentation), metadata packing, and block compression.

So the only somewhat feasible "checkpoint" would be between phases 2 and 3, this is the only time when mkdwarfs is running synchronously. Saving/resuming the state at this point would still be non-trivial, but at least feasible. Trying to save/resume in the middle of phases 2/3 is not impossible, but it would make the code much more complex and nearly impossible to test.

I think having the possibility to "checkpoint" after the scanning phase would be nice, although you'd have to be aware that when you resume (which could be two years later in theory), the input data may have significantly changed. I'll add it to my huge bucket of ideas.

Any more granular "checkpointing" is completely off the table. It would make the code much more complex, likely slower, and almost impossible to test.

Please don’t take this as an unreasonable request.

The request is definitely reasonable! But the options are really limited.

In the meantime, the best thing you can do to limit the amount of time it takes to create an image, as mentioned earlier, is to use a fast compression algorithm and recompress the image later.

shouldn’t that 50.47 MiB/s be multiplied by the number of CPU cores? Because then it would roughly correspond to what I see in the Task Manager in Windows, where I can see how many MiB/s were read from the HDD...

Yes, the speed is per-core. Hashing is the first thing that happens, so this one is typically limited by I/O speed. Both categorization and similarity hashing can show much higher speeds because they happen after hashing and benefit from the fact that the files are already cached. Take these numbers with a grain of salt, the way they're currently defined (per-core wallclock speed) is probably not what you'd intuitively expect.

@Zpovednice-adm
Copy link
Author

Hi,
so I tried running it again without detailed logging, and the error appeared. I'm attaching the complete output. I can point out that the string "V.c.lov. v.cebajtov. znakov. str.nce neexistuje mapov.n. pro Unicode" is a Czech localized error message, meaning something like "No mapping for the Unicode character exists in the target multi-byte code page...".

c:\Users\Tom\Downloads\dwarfs-0.10.1-115-g53ac77f237-Windows-AMD64-debug-stacktrace\bin>mkdwarfs -i s:/b -o m:/backup.dwarfs -C zstd:level=9 --force I 17:36:01.670433 scanning "\\\\?\\S:\\b" ????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? scanning: \\?\S:\b\netboocek\!!! původní složky\DŮLEŽITÉ Z PC\FOTKY\Mamincin fotak\! Školní rok 2013-14 (zkomplet. k 2013-12-20)\2013-10-03 - Výlet za koníčky\Thumbs.db 2,038,902 dirs, 0/0 soft/hard links, 680,475/15,002,656 files, 0 other original size: 5.823 TiB, hashed: 473.9 GiB (829,424 files, 33.95 MiB/s) scanned: 571.8 GiB (380,694 files, 8.421 MiB/s), categorizing: 0 B/s saved by deduplication: 318.2 GiB (299,781 files), saved by segmenting: 0 B filesystem: 0 B in 0 blocks (0 chunks, 380,694 fragments, 380,694/14,702,876 inodes) compressed filesystem: 0 blocks/0 B written ? ? 0% ?? [scanning] ...\!!! původní složky\DŮLEŽITÉ Z PC\FOTKY\Mamincin fotak\! Školní rok 2013-14 (zkomplet. k 2013-12-20)\2013-10-03 - Výlet za koníčky\SDC15675.JPG -----------------------------------? ?7.042 MiB/s [scanning] ...\!!! původní složky\DŮLEŽITÉ Z PC\FOTKY\Mamincin fotak\! Školní rok 2013-14 (zkomplet. k 2013-12-20)\2013-10-03 - Výlet za koníčky\SDC15676.JPG -----------------------? ?4.695 MiB/s [scanning] ...\!!! původní složky\DŮLEŽITÉ Z PC\FOTKY\Mamincin fotak\! Školní rok 2013-14 (zkomplet. k 2013-12-20)\2013-10-03 - Výlet za koníčky\SDC15679.JPG -----------? ?0 B/s [scanning] ...\!!! původní složky\DŮLEŽITÉ Z PC\FOTKY\Mamincin fotak\! Školní rok 2013-14 (zkomplet. k 2013-12-20)\2013-10-03 - Výlet za koníčky\SDC15683.JPG ? ?0 B/s [scanning] ...\!!! původní složky\DŮLEŽITÉ Z PC\FOTKY\Mamincin fotak\! Školní rok 2013-14 (zkomplet. k 2013-12-20)\2013-10-03 - Výlet za koníčky\SDC15685.JPG ? ?0 B/s [scanning] ...\DŮLEŽITÉ Z PC\FOTKY\Mamincin fotak\! Školní rok 2013-14 (zkomplet. k 2013-12-20)\2013-10-04 - Tvoření s rodiči Dřevěné domečky\SDC15686.JPG ? ?0 B/s Unexpected error writing string: 00000000 0d 1b 5b 41 1b 5b 41 1b 5b 41 1b 5b 41 1b 5b 41 |..[A.[A.[A.[A.[A| 00000010 1b 5b 41 1b 5b 41 1b 5b 41 1b 5b 41 1b 5b 41 1b |.[A.[A.[A.[A.[A.| 00000020 5b 41 1b 5b 41 1b 5b 41 1b 5b 41 1b 5b 41 1b 5b |[A.[A.[A.[A.[A.[| 00000030 31 3b 33 31 6d 45 20 32 32 3a 30 35 3a 35 31 2e |1;31mE 22:05:51.| 00000040 33 39 31 39 34 39 20 69 6e 76 61 6c 69 64 20 66 |391949 invalid f| 00000050 69 6c 65 20 6e 61 6d 65 20 69 6e 20 22 5c 5c 3f |ile name in "\\?| 00000060 5c 53 3a 5c 62 5c 7a 61 6c 6f 68 79 5c 70 6f c5 |\S:\b\zalohy\po.| 00000070 a1 74 61 5c 61 6c 69 2e 73 74 62 40 73 65 7a 6e |.ta\ali.stb@sezn| 00000080 61 6d 2e 63 7a 5c 44 6f 72 75 c4 8d 65 6e c3 a1 |am.cz\Doru..en..| 00000090 20 70 6f c5 a1 74 61 22 2c 20 73 74 6f 72 69 6e | po..ta", storin| 000000a0 67 20 61 73 20 22 32 30 31 39 2d 31 32 2d 32 33 |g as "2019-12-23| 000000b0 20 31 33 2d 35 30 2d 32 34 20 f0 9d 97 98 f0 9d | 13-50-24 ......| 000000c0 97 b9 f0 9d 97 b6 f0 9d 97 ba f0 9d 97 b6 f0 9d |................| 000000d0 97 bb f0 9d 97 ae f0 9d 98 81 f0 9d 97 b2 20 f0 |.............. .| 000000e0 9d 97 95 f0 9d 97 ae f0 9d 98 81 f0 9d 97 b5 f0 |................| 000000f0 9d 97 bf f0 9d 97 bc f0 9d 97 bc f0 9d 97 ba 20 |............... | 00000100 f0 9d 97 95 f0 9d 97 ae f0 9d 98 81 f0 9d 97 b5 |................| 00000110 f0 9d 97 b6 f0 9d 97 bb f0 9d 97 b4 20 f0 9d 97 |............ ...| 00000120 9b f0 9d 97 ae f0 9d 98 87 f0 9d 97 ae f0 9d 97 |................| 00000130 bf f0 9d 97 b1 f0 9d 98 80 e2 80 94 f0 9d 97 9a |................| 00000140 f0 9d 97 b2 f0 9d 98 81 20 f0 9d 97 ae 20 f0 9d |........ .... ..| 00000150 97 aa f0 9d 97 ae f0 9d 97 b9 f0 9d 97 b8 20 f0 |.............. .| 00000160 9d 97 9c f0 9d 97 bb 20 f0 9d 97 95 f0 9d 97 ae |....... ........| 00000170 f0 9d 98 81 f0 9d 97 b5 f0 9d 98 81 f0 9d 98 82 |................| 00000180 f0 9d 97 af 20 f0 9d 97 98 f0 9d 98 83 f0 9d 97 |.... ...........| 00000190 ae f0 9d 97 b9 f0 9d 98 82 f0 9d 97 ae f0 9d 98 |................| 000001a0 81 f0 9d 97 b6 ef bf bd 2e 65 6d 6c 22 3a 20 56 |.........eml": V| 000001b0 a0 63 ed 6c 6f 76 e9 20 76 ed 63 65 62 61 6a 74 |.c.lov. v.cebajt| 000001c0 6f 76 e9 20 7a 6e 61 6b 6f 76 e9 20 73 74 72 e1 |ov. znakov. str.| 000001d0 6e 63 65 20 6e 65 65 78 69 73 74 75 6a 65 20 6d |nce neexistuje m| 000001e0 61 70 6f 76 e1 6e ed 20 70 72 6f 20 55 6e 69 63 |apov.n. pro Unic| 000001f0 6f 64 65 2e 1b 5b 30 6d 1b 5b 4b 0a e2 8e af e2 |ode..[0m.[K.....| 00000200 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e |................| 00000210 af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af |................| 00000220 e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 |................| 00000230 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e |................| 00000240 af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af |................| 00000250 e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 |................| 00000260 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e |................| 00000270 af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af |................| 00000280 e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 |................| 00000290 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e |................| 000002a0 af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af |................| 000002b0 e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 |................| 000002c0 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e |................| 000002d0 af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af |................| 000002e0 e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 |................| 000002f0 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e |................| 00000300 af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af |................| 00000310 e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 |................| 00000320 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e |................| 00000330 af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af |................| 00000340 e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 |................| 00000350 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e |................| 00000360 af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af |................| 00000370 e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 |................| 00000380 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e |................| 00000390 af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af |................| 000003a0 e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 |................| 000003b0 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e |................| 000003c0 af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af |................| 000003d0 e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 |................| 000003e0 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e |................| 000003f0 af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af |................| 00000400 e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 |................| 00000410 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e |................| 00000420 af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af |................| 00000430 e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 |................| 00000440 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e |................| 00000450 af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af |................| 00000460 e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 |................| 00000470 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e |................| 00000480 af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af |................| 00000490 e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 |................| 000004a0 8e af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e |................| 000004b0 af e2 8e af e2 8e af e2 8e af e2 8e af e2 8e af |................| 000004c0 e2 8e af 0a 1b 5b 31 3b 33 36 6d 73 63 61 6e 6e |.....[1;36mscann| 000004d0 69 6e 67 3a 20 5c 5c 3f 5c 53 3a 5c 62 5c 6e 65 |ing: \\?\S:\b\ne| 000004e0 74 62 6f 6f 63 65 6b 5c 21 21 21 20 70 c5 af 76 |tboocek\!!! p..v| 000004f0 6f 64 6e c3 ad 20 73 6c 6f c5 be 6b 79 5c 44 c5 |odn.. slo..ky\D.| 00000500 ae 4c 45 c5 bd 49 54 c3 89 20 5a 20 50 43 5c 46 |.LE..IT.. Z PC\F| 00000510 4f 54 4b 59 5c 4d 61 6d 69 6e 63 69 6e 20 66 6f |OTKY\Mamincin fo| 00000520 74 61 6b 5c 21 20 c5 a0 6b 6f 6c 6e c3 ad 20 72 |tak\! ..koln.. r| 00000530 6f 6b 20 32 30 31 33 2d 31 34 20 28 7a 6b 6f 6d |ok 2013-14 (zkom| 00000540 70 6c 65 74 2e 20 6b 20 32 30 31 33 2d 31 32 2d |plet. k 2013-12-| 00000550 32 30 29 5c 32 30 31 33 2d 31 30 2d 30 33 20 2d |20)\2013-10-03 -| 00000560 20 56 c3 bd 6c 65 74 20 7a 61 20 6b 6f 6e c3 ad | V..let za kon..| 00000570 c4 8d 6b 79 5c 54 68 75 6d 62 73 2e 64 62 1b 5b |..ky\Thumbs.db.[| 00000580 30 6d 1b 5b 4b 0a 32 2c 30 33 38 2c 39 30 32 20 |0m.[K.2,038,902 | 00000590 64 69 72 73 2c 20 30 2f 30 20 73 6f 66 74 2f 68 |dirs, 0/0 soft/h| 000005a0 61 72 64 20 6c 69 6e 6b 73 2c 20 36 38 30 2c 34 |ard links, 680,4| 000005b0 37 35 2f 31 35 2c 30 30 32 2c 36 35 36 20 66 69 |75/15,002,656 fi| 000005c0 6c 65 73 2c 20 30 20 6f 74 68 65 72 1b 5b 4b 0a |les, 0 other.[K.| 000005d0 6f 72 69 67 69 6e 61 6c 20 73 69 7a 65 3a 20 35 |original size: 5| 000005e0 2e 38 32 33 20 54 69 42 2c 20 68 61 73 68 65 64 |.823 TiB, hashed| 000005f0 3a 20 34 37 33 2e 39 20 47 69 42 20 28 38 32 39 |: 473.9 GiB (829| 00000600 2c 34 32 34 20 66 69 6c 65 73 2c 20 33 33 2e 39 |,424 files, 33.9| 00000610 35 20 4d 69 42 2f 73 29 1b 5b 4b 0a 73 63 61 6e |5 MiB/s).[K.scan| 00000620 6e 65 64 3a 20 35 37 31 2e 38 20 47 69 42 20 28 |ned: 571.8 GiB (| 00000630 33 38 30 2c 36 39 34 20 66 69 6c 65 73 2c 20 38 |380,694 files, 8| 00000640 2e 34 32 31 20 4d 69 42 2f 73 29 2c 20 63 61 74 |.421 MiB/s), cat| 00000650 65 67 6f 72 69 7a 69 6e 67 3a 20 30 20 42 2f 73 |egorizing: 0 B/s| 00000660 1b 5b 4b 0a 73 61 76 65 64 20 62 79 20 64 65 64 |.[K.saved by ded| 00000670 75 70 6c 69 63 61 74 69 6f 6e 3a 20 33 31 38 2e |uplication: 318.| 00000680 32 20 47 69 42 20 28 32 39 39 2c 37 38 31 20 66 |2 GiB (299,781 f| 00000690 69 6c 65 73 29 2c 20 73 61 76 65 64 20 62 79 20 |iles), saved by | 000006a0 73 65 67 6d 65 6e 74 69 6e 67 3a 20 30 20 42 1b |segmenting: 0 B.| 000006b0 5b 4b 0a 66 69 6c 65 73 79 73 74 65 6d 3a 20 30 |[K.filesystem: 0| 000006c0 20 42 20 69 6e 20 30 20 62 6c 6f 63 6b 73 20 28 | B in 0 blocks (| 000006d0 30 20 63 68 75 6e 6b 73 2c 20 33 38 30 2c 36 39 |0 chunks, 380,69| 000006e0 34 20 66 72 61 67 6d 65 6e 74 73 2c 20 33 38 30 |4 fragments, 380| 000006f0 2c 36 39 34 2f 31 34 2c 37 30 32 2c 38 37 36 20 |,694/14,702,876 | 00000700 69 6e 6f 64 65 73 29 1b 5b 4b 0a 63 6f 6d 70 72 |inodes).[K.compr| 00000710 65 73 73 65 64 20 66 69 6c 65 73 79 73 74 65 6d |essed filesystem| 00000720 3a 20 30 20 62 6c 6f 63 6b 73 2f 30 20 42 20 77 |: 0 blocks/0 B w| 00000730 72 69 74 74 65 6e 1b 5b 4b 0a e2 96 8f 20 20 20 |ritten.[K.... | 00000740 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000750 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000760 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000770 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000780 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000790 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 000007a0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 000007b0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 000007c0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 000007d0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 000007e0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 000007f0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000800 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000810 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000820 e2 96 8f 20 20 30 25 20 f0 9f 8c 91 0a 1b 5b 31 |... 0% ......[1| 00000830 3b 33 33 6d 5b 73 63 61 6e 6e 69 6e 67 5d 20 1b |;33m[scanning] .| 00000840 5b 30 6d 1b 5b 33 33 6d 2e 2e 2e 5c 21 21 21 20 |[0m.[33m...\!!! | 00000850 70 c5 af 76 6f 64 6e c3 ad 20 73 6c 6f c5 be 6b |p..vodn.. slo..k| 00000860 79 5c 44 c5 ae 4c 45 c5 bd 49 54 c3 89 20 5a 20 |y\D..LE..IT.. Z | 00000870 50 43 5c 46 4f 54 4b 59 5c 4d 61 6d 69 6e 63 69 |PC\FOTKY\Maminci| 00000880 6e 20 66 6f 74 61 6b 5c 21 20 c5 a0 6b 6f 6c 6e |n fotak\! ..koln| 00000890 c3 ad 20 72 6f 6b 20 32 30 31 33 2d 31 34 20 28 |.. rok 2013-14 (| 000008a0 7a 6b 6f 6d 70 6c 65 74 2e 20 6b 20 32 30 31 33 |zkomplet. k 2013| 000008b0 2d 31 32 2d 32 30 29 5c 32 30 31 33 2d 31 30 2d |-12-20)\2013-10-| 000008c0 30 33 20 2d 20 56 c3 bd 6c 65 74 20 7a 61 20 6b |03 - V..let za k| 000008d0 6f 6e c3 ad c4 8d 6b 79 5c 53 44 43 31 35 36 37 |on....ky\SDC1567| 000008e0 35 2e 4a 50 47 20 20 20 20 20 20 20 20 20 e2 96 |5.JPG ..| 000008f0 88 e2 96 88 e2 96 88 e2 96 88 e2 96 88 e2 96 88 |................| 00000900 e2 96 88 e2 96 88 e2 96 88 e2 96 88 e2 96 88 e2 |................| 00000910 96 88 e2 96 88 e2 96 88 e2 96 88 e2 96 88 e2 96 |................| 00000920 88 e2 96 88 e2 96 88 e2 96 88 e2 96 88 e2 96 88 |................| 00000930 e2 96 88 e2 96 88 e2 96 88 e2 96 88 e2 96 88 e2 |................| 00000940 96 88 e2 96 88 e2 96 88 e2 96 88 e2 96 88 e2 96 |................| 00000950 88 e2 96 88 e2 96 88 e2 96 8c 20 20 20 20 20 20 |.......... | 00000960 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000970 e2 96 8f 37 2e 30 34 32 20 4d 69 42 2f 73 1b 5b |...7.042 MiB/s.[| 00000980 30 6d 1b 5b 4b 0a 1b 5b 31 3b 33 33 6d 5b 73 63 |0m.[K..[1;33m[sc| 00000990 61 6e 6e 69 6e 67 5d 20 1b 5b 30 6d 1b 5b 33 33 |anning] .[0m.[33| 000009a0 6d 2e 2e 2e 5c 21 21 21 20 70 c5 af 76 6f 64 6e |m...\!!! p..vodn| 000009b0 c3 ad 20 73 6c 6f c5 be 6b 79 5c 44 c5 ae 4c 45 |.. slo..ky\D..LE| 000009c0 c5 bd 49 54 c3 89 20 5a 20 50 43 5c 46 4f 54 4b |..IT.. Z PC\FOTK| 000009d0 59 5c 4d 61 6d 69 6e 63 69 6e 20 66 6f 74 61 6b |Y\Mamincin fotak| 000009e0 5c 21 20 c5 a0 6b 6f 6c 6e c3 ad 20 72 6f 6b 20 |\! ..koln.. rok | 000009f0 32 30 31 33 2d 31 34 20 28 7a 6b 6f 6d 70 6c 65 |2013-14 (zkomple| 00000a00 74 2e 20 6b 20 32 30 31 33 2d 31 32 2d 32 30 29 |t. k 2013-12-20)| 00000a10 5c 32 30 31 33 2d 31 30 2d 30 33 20 2d 20 56 c3 |\2013-10-03 - V.| 00000a20 bd 6c 65 74 20 7a 61 20 6b 6f 6e c3 ad c4 8d 6b |.let za kon....k| 00000a30 79 5c 53 44 43 31 35 36 37 36 2e 4a 50 47 20 20 |y\SDC15676.JPG | 00000a40 20 20 20 20 20 20 20 e2 96 88 e2 96 88 e2 96 88 | .........| 00000a50 e2 96 88 e2 96 88 e2 96 88 e2 96 88 e2 96 88 e2 |................| 00000a60 96 88 e2 96 88 e2 96 88 e2 96 88 e2 96 88 e2 96 |................| 00000a70 88 e2 96 88 e2 96 88 e2 96 88 e2 96 88 e2 96 88 |................| 00000a80 e2 96 88 e2 96 88 e2 96 88 e2 96 88 e2 96 8c 20 |............... | 00000a90 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000aa0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000ab0 20 e2 96 8f 34 2e 36 39 35 20 4d 69 42 2f 73 1b | ...4.695 MiB/s.| 00000ac0 5b 30 6d 1b 5b 4b 0a 1b 5b 31 3b 33 33 6d 5b 73 |[0m.[K..[1;33m[s| 00000ad0 63 61 6e 6e 69 6e 67 5d 20 1b 5b 30 6d 1b 5b 33 |canning] .[0m.[3| 00000ae0 33 6d 2e 2e 2e 5c 21 21 21 20 70 c5 af 76 6f 64 |3m...\!!! p..vod| 00000af0 6e c3 ad 20 73 6c 6f c5 be 6b 79 5c 44 c5 ae 4c |n.. slo..ky\D..L| 00000b00 45 c5 bd 49 54 c3 89 20 5a 20 50 43 5c 46 4f 54 |E..IT.. Z PC\FOT| 00000b10 4b 59 5c 4d 61 6d 69 6e 63 69 6e 20 66 6f 74 61 |KY\Mamincin fota| 00000b20 6b 5c 21 20 c5 a0 6b 6f 6c 6e c3 ad 20 72 6f 6b |k\! ..koln.. rok| 00000b30 20 32 30 31 33 2d 31 34 20 28 7a 6b 6f 6d 70 6c | 2013-14 (zkompl| 00000b40 65 74 2e 20 6b 20 32 30 31 33 2d 31 32 2d 32 30 |et. k 2013-12-20| 00000b50 29 5c 32 30 31 33 2d 31 30 2d 30 33 20 2d 20 56 |)\2013-10-03 - V| 00000b60 c3 bd 6c 65 74 20 7a 61 20 6b 6f 6e c3 ad c4 8d |..let za kon....| 00000b70 6b 79 5c 53 44 43 31 35 36 37 39 2e 4a 50 47 20 |ky\SDC15679.JPG | 00000b80 20 20 20 20 20 20 20 20 e2 96 88 e2 96 88 e2 96 | ........| 00000b90 88 e2 96 88 e2 96 88 e2 96 88 e2 96 88 e2 96 88 |................| 00000ba0 e2 96 88 e2 96 88 e2 96 88 e2 96 8b 20 20 20 20 |............ | 00000bb0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000bc0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000bd0 20 20 20 20 20 20 20 20 20 20 e2 96 8f 30 20 42 | ...0 B| 00000be0 2f 73 1b 5b 30 6d 1b 5b 4b 0a 1b 5b 31 3b 33 33 |/s.[0m.[K..[1;33| 00000bf0 6d 5b 73 63 61 6e 6e 69 6e 67 5d 20 1b 5b 30 6d |m[scanning] .[0m| 00000c00 1b 5b 33 33 6d 2e 2e 2e 5c 21 21 21 20 70 c5 af |.[33m...\!!! p..| 00000c10 76 6f 64 6e c3 ad 20 73 6c 6f c5 be 6b 79 5c 44 |vodn.. slo..ky\D| 00000c20 c5 ae 4c 45 c5 bd 49 54 c3 89 20 5a 20 50 43 5c |..LE..IT.. Z PC\| 00000c30 46 4f 54 4b 59 5c 4d 61 6d 69 6e 63 69 6e 20 66 |FOTKY\Mamincin f| 00000c40 6f 74 61 6b 5c 21 20 c5 a0 6b 6f 6c 6e c3 ad 20 |otak\! ..koln.. | 00000c50 72 6f 6b 20 32 30 31 33 2d 31 34 20 28 7a 6b 6f |rok 2013-14 (zko| 00000c60 6d 70 6c 65 74 2e 20 6b 20 32 30 31 33 2d 31 32 |mplet. k 2013-12| 00000c70 2d 32 30 29 5c 32 30 31 33 2d 31 30 2d 30 33 20 |-20)\2013-10-03 | 00000c80 2d 20 56 c3 bd 6c 65 74 20 7a 61 20 6b 6f 6e c3 |- V..let za kon.| 00000c90 ad c4 8d 6b 79 5c 53 44 43 31 35 36 38 33 2e 4a |...ky\SDC15683.J| 00000ca0 50 47 20 20 20 20 20 20 20 20 20 e2 96 8f 20 20 |PG ... | 00000cb0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000cc0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000cd0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000ce0 20 20 20 20 20 20 20 e2 96 8f 30 20 42 2f 73 1b | ...0 B/s.| 00000cf0 5b 30 6d 1b 5b 4b 0a 1b 5b 31 3b 33 33 6d 5b 73 |[0m.[K..[1;33m[s| 00000d00 63 61 6e 6e 69 6e 67 5d 20 1b 5b 30 6d 1b 5b 33 |canning] .[0m.[3| 00000d10 33 6d 2e 2e 2e 5c 21 21 21 20 70 c5 af 76 6f 64 |3m...\!!! p..vod| 00000d20 6e c3 ad 20 73 6c 6f c5 be 6b 79 5c 44 c5 ae 4c |n.. slo..ky\D..L| 00000d30 45 c5 bd 49 54 c3 89 20 5a 20 50 43 5c 46 4f 54 |E..IT.. Z PC\FOT| 00000d40 4b 59 5c 4d 61 6d 69 6e 63 69 6e 20 66 6f 74 61 |KY\Mamincin fota| 00000d50 6b 5c 21 20 c5 a0 6b 6f 6c 6e c3 ad 20 72 6f 6b |k\! ..koln.. rok| 00000d60 20 32 30 31 33 2d 31 34 20 28 7a 6b 6f 6d 70 6c | 2013-14 (zkompl| 00000d70 65 74 2e 20 6b 20 32 30 31 33 2d 31 32 2d 32 30 |et. k 2013-12-20| 00000d80 29 5c 32 30 31 33 2d 31 30 2d 30 33 20 2d 20 56 |)\2013-10-03 - V| 00000d90 c3 bd 6c 65 74 20 7a 61 20 6b 6f 6e c3 ad c4 8d |..let za kon....| 00000da0 6b 79 5c 53 44 43 31 35 36 38 35 2e 4a 50 47 20 |ky\SDC15685.JPG | 00000db0 20 20 20 20 20 20 20 20 e2 96 8f 20 20 20 20 20 | ... | 00000dc0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000dd0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000de0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | | 00000df0 20 20 20 20 e2 96 8f 30 20 42 2f 73 1b 5b 30 6d | ...0 B/s.[0m| 00000e00 1b 5b 4b 0a 1b 5b 31 3b 33 33 6d 5b 73 63 61 6e |.[K..[1;33m[scan| 00000e10 6e 69 6e 67 5d 20 1b 5b 30 6d 1b 5b 33 33 6d 2e |ning] .[0m.[33m.| 00000e20 2e 2e 5c 44 c5 ae 4c 45 c5 bd 49 54 c3 89 20 5a |..\D..LE..IT.. Z| I 22:22:42.820576 assigning directory and link inodes... I 22:23:24.603407 waiting for background scanners... I 22:29:09.165063 scanning CPU time: 131.9h I 22:29:09.173320 finalizing file inodes... I 22:33:20.708348 saved 2.527 TiB / 6.84 TiB in 11,277,347/17,931,614 duplicate files I 22:33:22.247638 assigning device inodes... I 22:33:31.536723 assigning pipe/socket inodes... I 22:33:32.380912 building metadata... I 22:33:32.381907 building blocks... I 22:33:32.381893 saving names and symlinks... I 22:33:32.389204 waiting for segmenting/blockifying to finish... I 22:34:50.041672 updating name and link indices... ????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? writing: \\?\S:\b\Pandora\Puvodni D\2015-12-31 - Ostravský Štědrý den, stromeček\DSC_8464.JPG 2,100,023 dirs, 0/0 soft/hard links, 17,931,614/17,931,614 files, 0 other original size: 6.84 TiB, hashed: 3.919 TiB (17,682,121 files, 55.88 MiB/s) scanned: 4.313 TiB (6,654,267 files, 9.227 MiB/s), categorizing: 0 B/s saved by deduplication: 2.527 TiB (11,277,347 files), saved by segmenting: 81.83 GiB filesystem: 595.4 GiB in 38,105 blocks (15,864,560 chunks, 386,260/6,654,266 fragments, 6,654,267 inodes) compressed filesystem: 38,105 blocks/370.6 GiB written ------------------------------------------------------------------------------------------------------------------------------------? ? 58% ?? \\?\S:\b\Pandora\Puvodni D\2015-12-31 - Ostravský Štědrý den, stromeček\DSC_8464.JPG ---------? ?13.01 MiB/s [compressing] compressed 566.1 GiB to 341.3 GiB (ratio 60.29%) 13.27 MiB/s ^C

Otherwise, a save point at least between the scanning and compression phases would be quite nice. For me, the scanning phase takes over 20 hours. And when I finally got to the compression phase, I quickly calculated, based on the current compression speed (with the settings you recommended), that the compression phase would take another 5 days or so. Unfortunately, I wasn’t able to let it run that long, so I had to stop it, and now I’ll have to start over from the beginning. I’ll either try an even faster compression or figure out how to run it on a more powerful machine. In any case, while I don't yet know what the final compression ratio will be, I have to note that KopiaUI finished completely in 26 hours, as I mentioned earlier, so based on my current testing, I’m afraid you might not be able to beat that time. But I definitely don’t want to say I’m giving up. ;-)

@Zpovednice-adm
Copy link
Author

I hope you can make sense of the output. Unfortunately, the formatting got messed up somehow.

@Zpovednice-adm
Copy link
Author

Zpovednice-adm commented Oct 24, 2024

And...
I have to admit that I don’t really understand the resource usage during the compression phase. The disk is, of course, barely being used, there's plenty of free RAM, and my CPU is running at about 20%, with 6 cores, each loaded differently, but none of them going above 50%. So, I really don’t understand what the bottleneck is that's slowing it down, when it seems at first glance like it could easily run 5 times faster. I’m not sure if it’s some issue with Windows and if it behaves more efficiently under Linux. But it's probably obvious (and I really don’t mean to annoy you) that, for instance, KopiaUI seems to make better use of the available resources.
In the previous phase, where scanning and hashing were happening, I totally understood it – either the disk was running at 100%, or the CPU was at 100%. At times, even the memory was running low, and it might have even started swapping. However, in the compression phase, it seems like most of the resources are freely available, and I really don’t know what or who is holding it back – I hope you understand what I’m trying to say.

@mhx
Copy link
Owner

mhx commented Oct 25, 2024

I have to admit that I don’t really understand the resource usage during the compression phase. The disk is, of course, barely being used, there's plenty of free RAM, and my CPU is running at about 20%, with 6 cores, each loaded differently, but none of them going above 50%. So, I really don’t understand what the bottleneck is that's slowing it down, when it seems at first glance like it could easily run 5 times faster.

Okay, there are two potential bottlenecks here:

  1. The segmenter. This usually becomes the bottleneck if you have a lot of cores, a fast compression algorithm, too much data to fit in the filesystem cache, or a combination thereof. The segmenter (currently) cannot run multi-threaded, so all data has to squeeze through it. If the segmenter was the bottleneck, I'd expect one CPU core to be saturated, though. On my machine (i9-13900K running Linux), the segmenter runs at speeds of 200 MiB/s up to more than 1 GiB/s with the default configuration. The speed is affected by the configuration (e.g. window size, lookback size, ...).
  2. File I/O. DwarFS accesses all files that it cannot deduplicate based on the file hash at least twice; first during hashing/categorization, then again during segmenting. The first time, it'll discover files in "filesystem order", which might be faster to read. Also, during that first scan, there are no writes to disk. The second time files are read is during segmenting, and this time the order in which files are accessed depends on file similarity, so it can be pretty random, at least a lot more unpredictable than the first time. Also, this time there are concurrent writes, though that's probably only relevant if the image is saved to the same disk.

So I would guess that it's a file I/O bottleneck, based on the fact that you're not seeing a single saturated core.

I’m not sure if it’s some issue with Windows and if it behaves more efficiently under Linux.

I don't know for sure (I'm definitely not a Windows expert), but in my tests, accessing files on Windows was painfully slow compared to Linux. Things got slightly better when I disabled real-time malware scanning (this isn't a recommendation).

I have to note that KopiaUI finished completely in 26 hours

I don't know how Kopia works exactly, but I assume that it's doing a single pass over the files rather than two passes. The fact that DwarFS' scanning phase takes roughly the same amount of time as Kopia's whole backup operation is a strong hint.

However, this means that (depending on the input data), Kopia will likely fail to detect a lot of sub-file-level duplication. In fact, I've just downloaded Kopia and ran it on one of my test cases. Kopia compressed 51 GB of input data in ~2 million files to 18 GB in about 30 seconds. DwarFS compressed the same input data to 325 MB (that's more than 50 times smaller) in less than 2 minutes. Note that this is not an average use case, so the difference on average can be much smaller. But when you have an input dataset that you know is highly redundant across files, DwarFS will very likely perform much better.

Ultimately, the two tools have slightly different goals. The goal of DwarFS is to compress redundant data as much as possible without being ridiculously slow. Kopia's primary goal is likely speed, while still offering acceptable compression.

@mhx
Copy link
Owner

mhx commented Oct 25, 2024

I can point out that the string "V.c.lov. v.cebajtov. znakov. str.nce neexistuje mapov.n. pro Unicode" is a Czech localized error message, meaning something like "No mapping for the Unicode character exists in the target multi-byte code page...".

Ahhhh, and that is likely what triggered the crash. Not the underlying error, but the fact that the Czech error message wasn't UTF-8, but rather some Windows code page (cp1250?). You can try running chcp on the command line to see the current code page.

@Zpovednice-adm
Copy link
Author

I can point out that the string "V.c.lov. v.cebajtov. znakov. str.nce neexistuje mapov.n. pro Unicode" is a Czech localized error message, meaning something like "No mapping for the Unicode character exists in the target multi-byte code page...".

Ahhhh, and that is likely what triggered the crash. Not the underlying error, but the fact that the Czech error message wasn't UTF-8, but rather some Windows code page (cp1250?). You can try running chcp on the command line to see the current code page.

C:\Users\Tom>chcp
Active code page: 852

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants