Discussion: Can jwalk gain from pdu's directory walking? #30

Closed
Byron opened this issue May 29, 2021 · 14 comments
Comments

@Byron
Owner

Byron commented May 29, 2021

pdu is a tool to analyse disk usage and display it similarly to dutree. It does so while being faster than what dua can achieve for simple aggregation; at least according to my profiling, most of dua's program time is indeed spent in jwalk. I additionally verified this by removing all of dua's counting code to measure only jwalk throughput.

Another oddity is that on an M1, walking directories gets significantly slower when all 8 cores are used (4 high-efficiency cores and 4 high-performance ones), to the point where I changed dua's default to 4 threads on Apple Silicon to get back to the previous performance numbers. pdu does not suffer from this issue.

Is there something special about how pdu traverses the directory tree, and is it something jwalk can benefit from?

@KSXGitHub

I am the author of pdu. All I did was recursively call read_dir combined with rayon's .into_par_iter().map(...).collect(). The resulting data structure is a nested tree instead of a flat list, which I guess also helped.
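
Roughly this shape (a minimal sketch for illustration, not pdu's actual code; the Node type and error handling are simplified):

use rayon::prelude::*;
use std::{fs, path::PathBuf};

struct Node {
    path: PathBuf,
    size: u64,
    children: Vec<Node>,
}

// Recurse with read_dir; fan the children out over rayon's thread pool
// and collect the results into a nested tree.
fn walk(path: PathBuf) -> Node {
    let own_size = fs::symlink_metadata(&path).map(|m| m.len()).unwrap_or(0);
    let child_paths: Vec<PathBuf> = fs::read_dir(&path)
        .map(|rd| rd.filter_map(|e| e.ok()).map(|e| e.path()).collect())
        .unwrap_or_default();
    let children: Vec<Node> = child_paths.into_par_iter().map(walk).collect();
    let size = own_size + children.iter().map(|c| c.size).sum::<u64>();
    Node { path, size, children }
}

fn main() {
    let root = walk(PathBuf::from(std::env::args().nth(1).unwrap_or_else(|| ".".into())));
    println!("{}: {} bytes", root.path.display(), root.size);
}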

@Byron
Owner Author

Byron commented May 30, 2021

I couldn't let it rest and did some more research - if the receiving end of the iterator is doing nothing, what else could affect iteration performance?
Since jwalk 0.5, client state can be populated during traversal and thus benefit from multi-threading; dua does this as well. When removing the client state acquisition, the runtime on my dataset drops from ~12s to ~6s. Interestingly, without the metadata calls there is also no performance penalty when 8 threads are used on an M1 - an effect that wasn't observed on Intel previously, but that was a 4-core machine.

To my mind there is no way around this, as per-entry metadata is required for dua to do its job, and I wouldn't know a way to obtain it faster.
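
For reference, a minimal sketch of what populating per-entry metadata as client state looks like (assuming jwalk 0.6's WalkDirGeneric/process_read_dir API; the state types are just for illustration, not dua's actual code):

use jwalk::WalkDirGeneric;

fn main() {
    // Per-directory read state is unused, per-entry client state carries the file size.
    let walk = WalkDirGeneric::<((), Option<u64>)>::new(".")
        .process_read_dir(|_depth, _dir_path, _read_dir_state, children| {
            for child in children.iter_mut().flatten() {
                if !child.file_type.is_dir() {
                    // Runs on the walker's thread pool, not on the consuming thread.
                    child.client_state =
                        std::fs::symlink_metadata(child.path()).ok().map(|m| m.len());
                }
            }
        });

    let total: u64 = walk
        .into_iter()
        .filter_map(Result::ok)
        .filter_map(|entry| entry.client_state)
        .sum();
    println!("total bytes: {}", total);
}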

If memory serves, jwalk 0.4 would provide metadata itself, and an investigation of dua-e873656 showed that it is indeed performing the same task in ~11s. Interestingly, using 8 threads slows it down to ~12.5s. pdu does not exhibit this behaviour, or if it does, it's still faster at 10.5s on the same dataset.

After forcing pdu to use 4 threads with this:

rayon::ThreadPoolBuilder::new().num_threads(4).build_global().unwrap();

it also got the work done faster, in 9.7s instead of 10.5s.

Summary

  • jwalk 0.4 for some reason is a little faster in obtaining metadata than jwalk 0.5 if the same task is performed by client code.
  • On my system, more than 4 threads doing metadata() calls slow down the overall performance universally across pdu and jwalk.

I will take a closer look at the metadata calls that are performed to see if improvements are possible either in jwalk or in dua.

Corrections

Whenever jwalk 0.5 is mentioned, it refers to jwalk 0.6.

@Byron
Owner Author

Byron commented May 30, 2021

I did find a regression in dua: it would get metadata for directories as well. Now that this is fixed, the standard dataset went from ~12s to ~11s, about the same as jwalk 0.4.

Investigating whether following symlinks or not has anything to do with it led nowhere.

Then I switched from the 1.4 million file dataset to a smaller one with only ~90k files and ran it in a profiler. Here are the results for dua and pdu, in that order.

[profiler screenshot: dua]

[profiler screenshot: pdu]

Both seem to list the same number of directories, so I would assume there are no superfluous ones here, and overall dua makes fewer syscalls, which should result in better performance.

This was also when I found a critical bug I introduced yesterday which would add 1s of sleep time to the main thread 🤦‍♂️. In this dataset, pdu is about 17% faster according to hyperfine, but with an overall runtime of ~200ms it's not very interesting.

Back on the bigger dataset, the times are now at 10.3s for dua and 9.5s for pdu. The latter spends ~1.4s in userland, the former a whopping ~5.3s.

With all dua code removed (except for the iteration itself and the metadata calls) this only goes down to 5.2s. Definitely an avenue to explore to see what's costing so much there.

Progress thus far

  • found a regression in dua where it asked for metadata for directories, which wasn't required
  • found quite a terrible bug that added 1s to all dua invocations
  • investigating where the 5.2s of userspace activity comes from

@Byron
Owner Author

Byron commented May 30, 2021

🎉🎉🎉 🎉🎉🎉

diff --git a/src/core/jwalk_par_bridge.rs b/src/core/jwalk_par_bridge.rs
index 7bf35d0..42b7564 100644
--- a/src/core/jwalk_par_bridge.rs
+++ b/src/core/jwalk_par_bridge.rs
@@ -60,7 +60,7 @@ where
     where
         C: UnindexedConsumer<Self::Item>,
     {
-        let split_count = AtomicUsize::new(current_num_threads());
+        let split_count = AtomicUsize::new(current_num_threads().min(4));
         let worker = Worker::new_fifo();
         let stealer = worker.stealer();
         let done = AtomicBool::new(false);

When looking at the jwalk machinery, it's quite noticeable that there is a lot of complexity, probably stemming from the requirement of being mostly a drop-in for walkdir.

When reading the docs for the JWalkIterBridge, this sentence made me suspicious.

This has the advantage of being able to parallelize just about anything, but the resulting ParallelIterator can be less efficient than if you started with par_iter instead.

Skimming the code at least told me that indeed a lot is going on here with queues, orderings and atomics - way above my head for sure. But in conjunction with the observation that it doesn't scale well with more cores (or at least with high-efficiency cores), and by mere chance, I thought fewer splits might cause bigger chunks and less contention around work stealing. My theory is that the system works so well that indeed all threads are being fed, but low-performance cores may now compete with the high-performance ones and slow down the entire operation.

And indeed, with the patch above dua -t8 gets down to ~9.8s, which is indistinguishable from pdu. The user space time is now ~4.6s, still higher than pdu but ultimately not very relevant to the wallclock time. It's notable that dua can now use 8 threads similar to pdu.

What follows are more tests on an Intel machine with 4 cores (+4 with hyperthreading) to see if this 'hack' is applicable there as well. My guess is that it reduces performance there, and that what we are seeing here is actually related to the 'fast and slow' core split.

@Byron
Owner Author

Byron commented May 30, 2021

Running some tests on an MBP with 4 cores (+4 via hyperthreading), the split_count had to match the number of threads used for maximum performance. Thus what I am seeing is specific to the M1, and I am pretty sure the split_count value has to match the number of high-performance cores. Unfortunately there is no way yet to actually query this value; all cores count as physical cores without differentiating them further. Right now 4 threads are hard-coded in dua when it compiles for Apple Silicon, which will probably break when the new machines arrive. On the other hand, these machines are rumoured to have 8 performance cores and only 2 high-efficiency cores, a much more favourable ratio where the effect will probably be less noticeable even though it won't yield the best achievable performance.
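
That hard-coded default is roughly of this shape (a sketch, not dua's actual code; num_cpus is assumed for the fallback):

fn default_thread_count() -> usize {
    // Cap at 4 threads on Apple Silicon builds until performance cores can be queried;
    // use all logical cores everywhere else.
    if cfg!(all(target_os = "macos", target_arch = "aarch64")) {
        4
    } else {
        num_cpus::get()
    }
}

fn main() {
    println!("default threads: {}", default_thread_count());
}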

On Intel, pdu was maybe 500ms faster than dua with a warm cache and a wallclock time of ~15.3s, an advantage I am happy about as it shows how fast parallel iteration could be if there are no constraints (or how much is currently lost within jwalk's machinery).

Learnings

  • The issue I have been investigating here is all about regressions and bugs in dua
  • On the M1 specifically, the default split_count doesn't fare well and should rather be the number of high-performance cores (4), while the actual thread count is all physical cores (8)
  • On future machines, which will speculatively have fewer low-power cores compared to their high-performance counterparts, this performance loss should be less significant.
  • pdu shows how an unconstrained parallel iteration engine can perform and use significantly less user-space time, which leads to some noticeable performance wins along with a lack of scheduling issues on M1s.

Further steps

It would be possible to make the split_count configurable to let current M1 builds take advantage of it. This would currently need hardcoded numbers on the caller's side and will probably break with the arrival of new machines.

Even though I would be happy to contribute this in a PR as I would benefit right now, I would love to hear @jessegrosjean's opinion first. I could understand postponing this or discarding it entirely; maybe, and hopefully, there are much better solutions.

Independently of that, this issue should be closed as all the learnings are summarised here.

@Byron Byron closed this as completed May 30, 2021
@KSXGitHub

@Byron You actually ended up writing what is in essence a small blog post. If you think others can also benefit from your learning experience, may I recommend sharing it on programming forums? (disclaimer: I indeed have an ulterior motive)

@Byron
Owner Author

Byron commented May 30, 2021

Here is the headline: "pdu is fastest, and here is why…" ;D. Despite my sparking love for the tool I am all out of time, having overstretched my maintenance budget considerably just to satisfy my unrelenting curiosity. Even though I am glad I did it, I am also glad it's over, as the whole thing had obsessive tendencies which would get in the way of doing what I actually do.

If you find anything interesting here or think there is a story to tell, I would definitely be looking forward to reading what you come up with.

@KSXGitHub

To be honest, my implementation of multi-threading is very naive because my understanding of parallel computing is very limited. I would totally write a blog post if only I knew what to say.

@jessegrosjean
Collaborator

Interesting about the M1's different-speed cores. I would certainly like to address this... but it must be affecting other Rust crates with parallel loads?... and smarter developers? I'm also not a parallel expert :) I guess I would like to wait and see how other crates with parallel work handle the M1 before adding anything specific to jwalk. Please let me know if you hear of any good approaches to this.

I would expect jwalk to be a "bit" slower than pdu because I think jwalk is doing more work to support "streaming ordered results". Meaning that you'll start getting ordered results right away, without having to wait for the entire traversal to finish.
To enable this, jwalk maintains an ordered queue and pulls items off that queue using rayon's par_bridge instead of using rayon's more efficient par_iter on a Vec like pdu does.
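
Roughly, the two rayon entry points being contrasted look like this (a standalone sketch, not jwalk's or pdu's code):

use rayon::prelude::*;

fn main() {
    let v: Vec<u64> = (0..1_000_000).collect();

    // par_iter: rayon knows the length up front and splits the Vec directly.
    let a: u64 = v.par_iter().cloned().sum();

    // par_bridge: any serial Iterator is fed through a work-stealing bridge,
    // which is more general but usually less efficient.
    let b: u64 = v.iter().cloned().par_bridge().sum();

    assert_eq!(a, b);
}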

The benefit of this extra work is that if you have these directories:

a
b
c

And each directory has 10 seconds' worth of files to traverse, then using jwalk you should be able to print the final total for "a" after 10 seconds, "b" after 20 seconds, and "c" after 30 seconds. On the other hand, I "think" with pdu you would need to wait the entire 30 seconds until you were sure you had the final calculation for any particular directory. I'm not certain about that, but I think that's what's happening in pdu's code.

Jesse

@Byron
Owner Author

Byron commented May 30, 2021

I guess I would like to wait and see how other crates with parallel work handle the M1 before adding anything specific to jwalk.

I couldn't agree more.

However, the highlight of the 'ordered' functionality got me intrigued: what if there was an iteration mode that doesn't enforce the 'ordered' behaviour? Programs like pdu and dua aggregate might benefit from the performance that comes with it without the specifics of the traversal mode being observable. dua interactive, for instance, would certainly still use the current ordered traversal mode, as it updates traversal results in real time and users should see each directory as soon as its traversal is complete.

Ordering sits deeply in jwalk as far as I can see, but if you have a little guidance for me I might just try to provide a PR for this.

@jessegrosjean
Collaborator

jessegrosjean commented May 31, 2021

What if there was an iteration mode that doesn't enforce the 'ordered' behaviour?

I think this would be difficult, but more importantly I was really only guessing; I'm not even sure that's causing the difference. I think the most useful thing would be to get a file traversal in the style of pdu's .into_par_iter().map(...).collect() into jwalk's benchmark suite. That would make comparisons easier, and then it would be much easier to see if there are still ways to make jwalk faster. If anyone has time to write that, it would be a great addition to the benchmark suite.
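
For illustration, such a benchmark entry could look roughly like this (assuming the suite uses criterion; the fixture path and names are placeholders, not jwalk's actual files):

use criterion::{criterion_group, criterion_main, Criterion};
use rayon::prelude::*;
use std::{fs, path::PathBuf};

// Sum apparent sizes pdu-style: recursive read_dir fanned out via rayon.
fn par_size(path: PathBuf) -> u64 {
    let own = fs::symlink_metadata(&path).map(|m| m.len()).unwrap_or(0);
    let children: Vec<PathBuf> = fs::read_dir(&path)
        .map(|rd| rd.filter_map(|e| e.ok()).map(|e| e.path()).collect())
        .unwrap_or_default();
    own + children.into_par_iter().map(par_size).sum::<u64>()
}

fn bench_pdu_style(c: &mut Criterion) {
    // "benches/assets/test_dir" is a placeholder fixture path.
    c.bench_function("pdu-style par_iter walk", |b| {
        b.iter(|| par_size(PathBuf::from("benches/assets/test_dir")))
    });
}

criterion_group!(benches, bench_pdu_style);
criterion_main!(benches);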

@Byron
Owner Author

Byron commented Jun 1, 2021

I of course couldn't resist and added the simplest imaginable iteration mode. The benchmarks show it's a tiny bit faster, maybe, but on the real-world dataset, when used by dua, it suffers from the same M1 threading issue that pdu for some reason doesn't exhibit; at least dua is now on par with it with dua --apparent-size -t4 ~/dev.

You know, I am not bothered by these 500ms at all, but it's the M1 weirdness that really gives me a hard time. The answer should be in the pdu codebase, but it's not at all obvious to me. There is no special rayon configuration… but I have a hunch that there is other synchronization going on that the threads get stuck on. That way they are slowed down by other means, which happens to be a good thing on my system.

Talking to myself really: I will stop this investigation and do it again once I have access to Apple Silicon with more high-performance cores. My hypothesis is that pdu in its current form won't scale due to that synchronisation.

…to be continued…

Ok, that's it - I ripped out all the reporter calls and now pdu is seeing the same M1-related issues.

./target/release/pdu ~/dev  2.55s user 69.83s system 604% cpu 11.973 total

Great - puzzle solved.

What remains here is the M1 oddity, which should probably be picked up again once somebody has access to hardware with more fast cores, or once num_cpus allows querying the number of fast cores, at which point the issue can probably be fixed for good.
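
For reference, num_cpus today distinguishes only logical from physical cores; the kind of query the fix would need does not exist yet (the function name in the comment below is purely hypothetical):

fn main() {
    // What can be queried today:
    println!("logical cores:  {}", num_cpus::get());
    println!("physical cores: {}", num_cpus::get_physical());
    // Something like a hypothetical num_cpus::get_performance_cores() would be needed
    // to derive the right split_count automatically on Apple Silicon.
}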

@jessegrosjean
Collaborator

Thanks for the benchmark code, I've merged that. I also added a simple du example that can be useful for testing, since you can point it at any directory instead of just testing against the benchmark directory.

So for example:

% cargo build --release --example du
% time target/release/examples/du ~/Documents
path: /Users/jessegrosjean/Documents total bytes: 40416183344
target/release/examples/du ~/Documents  3.39s user 22.47s system 661% cpu 3.906 total

@Byron
Owner Author

Byron commented Jun 2, 2021

That's a great idea, thank you! I did indeed use it on my dataset, on the off chance that the other changes/refactorings have any effect on performance, and initially I got confusing results! With 8 threads and the 'M1 problem', it performed about a second better, just as well as dua with only 4 threads. How was that possible?

The answer is in this PR.

To me, skipping hidden files by default seems more like a foot-gun, and I wonder, merely out of curiosity, where this is coming from. I checked walkdir, and there skipping hidden files has to be implemented by hand.
