Boost indexing performance by using parallel iterators #12

ghassanachi · 2023-03-02T22:39:42Z

Just came across your channel by watching https://www.youtube.com/watch?v=b0KIDIOL_i4.

Since it was focussed on performance, I was thinking indexing (and maybe searching as well) could be improved by using parallel iterators. I decided to give it a shot using jwalk and rayon -- re-exported by jwalk.

This is just a proof of concept, and the speed could be improved further if instead of collecting the documents into a Vec before adding them to the Model, we instead used an Arc<Mutex<Model>> and inserted the documents directly into the model during the parallel iteration.

The WalkDir also takes care of recursively traversing the directory structure, so this removes the overhead of the recursion on the add_to_model call. It also supports following symlinks, so enabling that option would also take care of that TODO.

I was initially hoping to not have to collect into dir, but it looks like WalkDir does the walking in parallel, but then collects the results when calling into_iter and I could not find a convenient way of getting a parallel iterator directly from the walk.process_read_dir seemed like the solution, but given it's function signature / arguments it would complicate the code a lot.

Leaving this as a draft, if you like the prototype I'll polish it a little more.

Anyway here are the results
Machine: 2019 MacBook Pro 16in -- i7-9750H CPU @ 2.60GHz (6 core)
Data: https://github.com/BSVino/docs.gl (downloaded via zip option in github)

#### Debug Sync
Executed in   __21.78 secs__    fish           external
   usr time   21.36 secs    0.19 millis   21.36 secs
   sys time    0.33 secs    1.66 millis    0.32 secs

#### Release Sync
Executed in   __3.64 secs__    fish           external
   usr time    3.36 secs    0.14 millis    3.36 secs
   sys time    0.27 secs    1.90 millis    0.27 secs

#### Debug Parallel
Executed in   __6.13 secs__   fish           external
   usr time   22.93 secs  154.00 micros   22.93 secs
   sys time    0.40 secs  969.00 micros    0.40 secs

#### Release Parallel
Executed in  __875.63 millis__    fish           external
   usr time    3.77 secs      0.16 millis    3.77 secs
   sys time    0.17 secs      1.13 millis    0.17 secs

ghassanachi · 2023-03-04T05:45:54Z

I just tested the Arc<Mutex<Model>> solution and there was about a 15% regression in performance. I didn't profile it, but my guess would be that lock-contention is introducing more of a bottleneck and ends up being slower than the collect call.

With that in mind, I'll move this PR from a Draft state.

Let me know if you want me to add symlink support.

feat: use parallel iterator for indexing

4d9bb59

ghassanachi marked this pull request as ready for review March 4, 2023 05:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Boost indexing performance by using parallel iterators #12

Boost indexing performance by using parallel iterators #12

ghassanachi commented Mar 2, 2023 •

edited

Loading

ghassanachi commented Mar 4, 2023

Boost indexing performance by using parallel iterators #12

Are you sure you want to change the base?

Boost indexing performance by using parallel iterators #12

Conversation

ghassanachi commented Mar 2, 2023 • edited Loading

ghassanachi commented Mar 4, 2023

ghassanachi commented Mar 2, 2023 •

edited

Loading