Discussion: Can jwalk gain from pdu's directory walking? #30
I couldn't let it rest and did some more research: if the receiving end of the iterator is doing nothing, what else could affect iteration performance? To my mind there is no way around this, as metadata per entry is required to do the job, and I wouldn't know a way to do that faster. If memory serves, after forcing rayon::ThreadPoolBuilder::new().num_threads(4).build_global().unwrap(); it managed to get the work done faster as well, in 9.7s instead of 10.5s.
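For reference, a minimal, self-contained version of that workaround (the thread count of 4 is just the value used in the experiment above):

```rust
// Minimal sketch of the workaround quoted above: pin rayon's global thread
// pool to 4 threads before any parallel traversal starts.
fn main() {
    rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .build_global()
        .expect("the global thread pool can only be initialized once");

    // ...run the jwalk/dua traversal afterwards...
}
```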
Summary

I will take a closer look at the metadata calls that are performed to see if improvements are possible, either in ….

Corrections

Whenever …
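To make the "metadata calls per entry" mentioned in the summary concrete, here is a rough illustration (my own sketch, not code from either project) of the per-entry cost in question:

```rust
// Illustration only: for size aggregation, each directory entry typically
// costs one lstat-style syscall, e.g. via `symlink_metadata`.
use std::fs;
use std::io;
use std::path::Path;

fn entry_size(path: &Path) -> io::Result<u64> {
    // One metadata call per entry; this is the part that is hard to avoid
    // when sizes (and not just names) are needed.
    let md = fs::symlink_metadata(path)?;
    Ok(if md.is_file() { md.len() } else { 0 })
}
```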
I did find a regression in …. Investigations into whether following symlinks or not has anything to do with it led to nothing. Then I changed to a smaller dataset, from 1.4 million files to only ~90k files, and ran it in a profiler. Here are the results for ….

Both seem to be listing the same amount of directories, so I would assume there are no superfluous ones here, and overall …. This was also when I found a critical bug I introduced yesterday which would add 1s of sleep time to the main thread 🤦♂️. In this dataset, ….

Back to the bigger dataset, the times are now at 10.3s for …. With all ….

Progress thus far
…
🎉🎉🎉 … 🎉🎉🎉

diff --git a/src/core/jwalk_par_bridge.rs b/src/core/jwalk_par_bridge.rs
index 7bf35d0..42b7564 100644
--- a/src/core/jwalk_par_bridge.rs
+++ b/src/core/jwalk_par_bridge.rs
@@ -60,7 +60,7 @@ where
where
C: UnindexedConsumer<Self::Item>,
{
- let split_count = AtomicUsize::new(current_num_threads());
+ let split_count = AtomicUsize::new(current_num_threads().min(4));
let worker = Worker::new_fifo();
let stealer = worker.stealer();
    let done = AtomicBool::new(false);

When looking at the …. When reading the docs for the ….
Skimming the code at least told me that indeed a lot is going on here with queues, orderings and atomics - way above my head for sure. But in conjunction with the observation that it doesn't scale well with cores (or at least: with high-efficiency cores), and by mere chance, I thought fewer splits might mean bigger chunks and less contention around work stealing. My theory is that the system works so well that indeed all threads are being fed, but low-performance cores may now compete with the high-performance ones and slow down the entire operation. And indeed, with the patch above ….

What follows is more tests on an Intel machine with 4 cores (+4 with hyper-threading) to see if this 'hack' is applicable there as well. My guess is that it reduces performance there, and that what we are seeing here is actually related to the 'fast and slow' cores split.
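For readers unfamiliar with the primitives named in the patch, here is a tiny, standalone illustration of crossbeam-deque's Worker/Stealer pair (my own example, unrelated to jwalk's actual queue setup):

```rust
// Small illustration of the work-stealing primitives named in the diff above
// (crossbeam-deque's Worker/Stealer), to make "queues and atomics" concrete.
use crossbeam_deque::{Steal, Worker};

fn main() {
    let worker = Worker::new_fifo();
    let stealer = worker.stealer();

    worker.push(1);
    worker.push(2);

    // Another thread would normally do the stealing; shown inline for brevity.
    match stealer.steal() {
        Steal::Success(item) => println!("stole {item}"),
        Steal::Empty => println!("nothing to steal"),
        Steal::Retry => println!("race, try again"),
    }
}
```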
Running some tests on an MBP with 4 cores + 4 via hyper-threading, the …
Learnings
Further steps

It would be possible to make the …. Even though I would be happy to contribute this in a PR, as I would benefit right now, I would love to hear @jessegrosjean's opinion first. I could understand postponing this or discarding it entirely; maybe, and hopefully, there are much better solutions.

Independently of that, this issue should be closed as all the learnings are summarised here.
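Purely as a sketch of what such a knob could look like, assuming (from the surrounding discussion) that the idea is an opt-in cap on rayon splits - none of these names exist in jwalk:

```rust
// Hypothetical only: a configurable cap on how many splits the parallel
// bridge is allowed to create. Names and structure are invented for
// illustration; this is not jwalk's API.
pub struct WalkOptions {
    /// Upper bound on splits; `None` keeps rayon's default behaviour.
    pub max_splits: Option<usize>,
}

impl Default for WalkOptions {
    fn default() -> Self {
        WalkOptions { max_splits: None }
    }
}

fn effective_split_count(opts: &WalkOptions, num_threads: usize) -> usize {
    match opts.max_splits {
        Some(cap) => num_threads.min(cap),
        None => num_threads,
    }
}
```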
@Byron You actually ended up writing what is in essence a small blog post. If you think others can also benefit from your learning experience, may I recommend sharing it on programming forums? (disclaimer: I indeed have an ulterior motive)
Here is the headline: "…". If you find anything interesting here or think there is a story to tell, I would definitely look forward to reading what you come up with.
To be honest, my implementation of multi-threading is very naive because my understanding of parallel computing is very limited. I would totally write a blog post if only I knew what to say.
Interesting about the M1's different-speed cores. I would certainly like to address this... but it must be affecting other Rust crates with parallel loads too?... and smarter developers? I'm also not a parallel expert :) I guess I would like to wait and see how other crates with parallel work handle the M1 before adding anything specific to jwalk. Please let me know if you hear of any good approaches to this.

I would expect jwalk to be a "bit" slower than pdu because I think jwalk is doing more work to support "streaming ordered results". Meaning that you'll start getting ordered results right away, without having to wait for the entire traversal to finish. The benefit of this extra work is that if you have these directories:

a
b
c
And each directory has 10 seconds' worth of files to traverse, then using jwalk you should be able to print the final total for "a" after 10 seconds, "b" after 20 seconds, and "c" after 30 seconds. On the other hand, I "think" with pdu you would need to wait the entire 30 seconds until you were sure that you had the final calculation for any particular directory. I'm not certain about that, but I think that's what's happening in pdu's code.

Jesse
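To make that streaming-ordered behaviour concrete, here is a rough sketch (my own, assuming jwalk's sorted streaming iteration; not code from dua or pdu) of how totals for top-level directories could be printed as soon as each one has fully streamed by:

```rust
use jwalk::WalkDir;
use std::path::PathBuf;

fn main() {
    let root = PathBuf::from(".");
    let mut current_top: Option<PathBuf> = None;
    let mut total: u64 = 0;

    // With sorting enabled, entries stream in depth-first, sorted order, so
    // all entries of one top-level child arrive before the next one begins.
    // (Each immediate child of the root is treated as its own bucket.)
    for entry in WalkDir::new(&root).sort(true) {
        let entry = match entry {
            Ok(e) => e,
            Err(_) => continue,
        };
        let path = entry.path();

        // Which top-level child does this entry belong to?
        let top = path
            .strip_prefix(&root)
            .ok()
            .and_then(|rel| rel.components().next().map(|c| root.join(c.as_os_str())));

        if top != current_top {
            // The previous top-level child is complete; its total is final.
            if let Some(dir) = current_top.take() {
                println!("{}: {} bytes", dir.display(), total);
            }
            current_top = top;
            total = 0;
        }

        if let Ok(md) = std::fs::symlink_metadata(&path) {
            if md.is_file() {
                total += md.len();
            }
        }
    }

    if let Some(dir) = current_top {
        println!("{}: {} bytes", dir.display(), total);
    }
}
```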
I couldn't agree more. However, the highlight of the 'ordered' functionality got me intrigued: what if there was an iteration mode that doesn't enforce the 'ordered' behaviour? Programs like …. Ordering sits deeply in ….
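As a point of reference for what an unordered traversal looks like, here is a minimal sketch (my own; explicitly not pdu's or jwalk's code) that aggregates sizes with rayon without preserving any entry order:

```rust
// Sketch of an unordered, parallel size aggregation: recursion over
// directories, with rayon parallelising each directory's children.
// No ordering guarantees; only the final sum matters.
use rayon::prelude::*;
use std::fs;
use std::path::{Path, PathBuf};

fn dir_size(path: &Path) -> u64 {
    let children: Vec<PathBuf> = match fs::read_dir(path) {
        Ok(read_dir) => read_dir.filter_map(|e| e.ok().map(|e| e.path())).collect(),
        Err(_) => return 0,
    };

    children
        .par_iter()
        .map(|child| match fs::symlink_metadata(child) {
            Ok(md) if md.is_dir() => dir_size(child),
            Ok(md) if md.is_file() => md.len(),
            _ => 0,
        })
        .sum()
}

fn main() {
    println!("{} bytes", dir_size(Path::new(".")));
}
```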
I think this would be difficult, but more importantly, I was really only guessing - I'm not even sure that's causing the difference. I think the most useful thing would be to get a file traversal in the style of pdu …
I of course couldn't resist and added the simplest imaginable iteration mode. The benchmarks show it's a tiny bit faster, maybe, but on the real-world data set, if used by ….

You know, I am not bothered by these 500ms at all, but it's the M1 weirdness that really gives me a hard time. The answer should be in the ….

Talking to myself really: I will stop this investigation and do it again once I have access to Apple Silicon with more high-performance cores. My hypothesis is that ….

…to be continued…

Ok, that's it - I ripped out all the …
Great - puzzle solved. What remains here is the M1 oddity, which probably should be continued once somebody has access to hardware with more fast cores, or once ….
Thanks for the benchmark code, I've merged that. I also added a simple …. So, for example:

…
That's a great idea, thank you! I did indeed use it on my dataset on the off chance that the other changes/refactorings have any effect on performance, and initially I got confusing results! With 8 threads, and despite the 'M1 problem', it performed about a second better, just as good as …. The answer is in this PR.

To me, skipping hidden files by default seemed more like a foot-gun, and I wonder where this is coming from, merely out of curiosity. I checked …
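For context, this is the option in question as I understand it (a small sketch; check jwalk's docs for the exact builder method and default):

```rust
// Sketch: jwalk is assumed to skip hidden entries by default, so a
// disk-usage-style traversal would want to opt out of that explicitly.
use jwalk::WalkDir;

fn main() {
    for entry in WalkDir::new(".").skip_hidden(false) {
        if let Ok(entry) = entry {
            println!("{}", entry.path().display());
        }
    }
}
```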
pdu is a tool to analyse disk usage and display it similarly to dutree. It does so while being faster than what dua can achieve for simple aggregation; that is (at least according to my profiling), it seems most of the program time is indeed spent in jwalk. Additionally, I verified this by removing all of dua's counting code to measure only jwalk throughput.

Another oddity is that on an M1, walking directories gets significantly slower when all 8 cores are used (4 high-efficiency cores and 4 high-performance ones), to the point where I changed the defaults to 4 on Apple Silicon to get back to previous performance numbers. pdu does not suffer from this issue.

Is there something special about how pdu traverses the directory tree, and is this something jwalk can benefit from?
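The "defaults to 4 on Apple Silicon" workaround could look roughly like this (a hedged sketch of the idea, not dua's actual code; the count of 4 performance cores is an assumption about the M1 in question):

```rust
// Sketch: choose a default traversal thread count, capping it on Apple
// Silicon where efficiency cores were observed to slow the walk down.
fn default_thread_count() -> usize {
    let logical = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);

    if cfg!(all(target_os = "macos", target_arch = "aarch64")) {
        // Assumption: 4 performance cores; avoid feeding the efficiency cores.
        logical.min(4)
    } else {
        logical
    }
}

fn main() {
    println!("using {} threads", default_thread_count());
}
```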