Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Boost indexing performance by using parallel iterators #12

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ghassanachi
Copy link

@ghassanachi ghassanachi commented Mar 2, 2023

Just came across your channel by watching https://www.youtube.com/watch?v=b0KIDIOL_i4.

Since it was focussed on performance, I was thinking indexing (and maybe searching as well) could be improved by using parallel iterators. I decided to give it a shot using jwalk and rayon -- re-exported by jwalk.

This is just a proof of concept, and the speed could be improved further if instead of collecting the documents into a Vec before adding them to the Model, we instead used an Arc<Mutex<Model>> and inserted the documents directly into the model during the parallel iteration.

The WalkDir also takes care of recursively traversing the directory structure, so this removes the overhead of the recursion on the add_to_model call. It also supports following symlinks, so enabling that option would also take care of that TODO.

I was initially hoping to not have to collect into dir, but it looks like WalkDir does the walking in parallel, but then collects the results when calling into_iter and I could not find a convenient way of getting a parallel iterator directly from the walk.process_read_dir seemed like the solution, but given it's function signature / arguments it would complicate the code a lot.

Leaving this as a draft, if you like the prototype I'll polish it a little more.

Anyway here are the results
Machine: 2019 MacBook Pro 16in -- i7-9750H CPU @ 2.60GHz (6 core)
Data: https://github.com/BSVino/docs.gl (downloaded via zip option in github)

#### Debug Sync
Executed in   __21.78 secs__    fish           external
   usr time   21.36 secs    0.19 millis   21.36 secs
   sys time    0.33 secs    1.66 millis    0.32 secs

#### Release Sync
Executed in   __3.64 secs__    fish           external
   usr time    3.36 secs    0.14 millis    3.36 secs
   sys time    0.27 secs    1.90 millis    0.27 secs

#### Debug Parallel
Executed in   __6.13 secs__   fish           external
   usr time   22.93 secs  154.00 micros   22.93 secs
   sys time    0.40 secs  969.00 micros    0.40 secs

#### Release Parallel
Executed in  __875.63 millis__    fish           external
   usr time    3.77 secs      0.16 millis    3.77 secs
   sys time    0.17 secs      1.13 millis    0.17 secs

@ghassanachi
Copy link
Author

I just tested the Arc<Mutex<Model>> solution and there was about a 15% regression in performance. I didn't profile it, but my guess would be that lock-contention is introducing more of a bottleneck and ends up being slower than the collect call.

With that in mind, I'll move this PR from a Draft state.

Let me know if you want me to add symlink support.

@ghassanachi ghassanachi marked this pull request as ready for review March 4, 2023 05:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant