Boost indexing performance by using parallel iterators #12
Just came across your channel by watching https://www.youtube.com/watch?v=b0KIDIOL_i4.
Since it was focussed on performance, I was thinking indexing (and maybe searching as well) could be improved by using parallel iterators. I decided to give it a shot using jwalk and rayon -- re-exported by jwalk.
This is just a proof of concept, and the speed could be improved further if, instead of collecting the documents into a `Vec` before adding them to the `Model`, we used an `Arc<Mutex<Model>>` and inserted the documents directly into the model during the parallel iteration.

`WalkDir` also takes care of recursively traversing the directory structure, so this removes the overhead of the recursion on the `add_to_model` call. It also supports following symlinks, so enabling that option would also take care of that `TODO`.

I was initially hoping not to have to collect into `dir`, but it looks like `WalkDir` does the walking in parallel and then collects the results when calling `into_iter`, and I could not find a convenient way of getting a parallel iterator directly from the walk. `process_read_dir` seemed like the solution, but given its function signature and arguments it would complicate the code a lot.

Leaving this as a draft; if you like the prototype, I'll polish it a little more.
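
To make the `Arc<Mutex<Model>>` idea concrete, here is a rough sketch of inserting documents into a shared model from several workers instead of collecting into a `Vec` first. It is not the code in this PR: the `Model` struct, `term_frequencies`, and `index_parallel` are hypothetical stand-ins, the input is a list of in-memory `(path, content)` pairs rather than a real directory walk, and it uses plain `std::thread::scope` in place of rayon's parallel iterators so that it has no dependencies.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for the project's Model:
// maps a document path to its term counts.
#[derive(Default)]
struct Model {
    docs: HashMap<String, HashMap<String, usize>>,
}

// Hypothetical tokenizer: lowercase, whitespace-separated terms.
fn term_frequencies(content: &str) -> HashMap<String, usize> {
    let mut tf = HashMap::new();
    for term in content.split_whitespace() {
        *tf.entry(term.to_lowercase()).or_insert(0) += 1;
    }
    tf
}

// Index `files` (path, content) on up to `workers` threads,
// inserting each document directly into the shared model.
fn index_parallel(files: &[(String, String)], workers: usize) -> Model {
    let model = Arc::new(Mutex::new(Model::default()));
    let workers = workers.max(1);
    let chunk_size = ((files.len() + workers - 1) / workers).max(1);

    thread::scope(|s| {
        for chunk in files.chunks(chunk_size) {
            let model = Arc::clone(&model);
            s.spawn(move || {
                for (path, content) in chunk {
                    // Do the expensive work outside the lock...
                    let tf = term_frequencies(content);
                    // ...and hold the lock only for the insert.
                    model.lock().unwrap().docs.insert(path.clone(), tf);
                }
            });
        }
    });

    // All worker clones are dropped once the scope ends,
    // so this is the only remaining reference.
    Arc::try_unwrap(model)
        .ok()
        .expect("no other references to the model remain")
        .into_inner()
        .unwrap()
}

fn main() {
    let files = vec![
        ("a.txt".to_string(), "Foo bar foo".to_string()),
        ("b.txt".to_string(), "baz".to_string()),
    ];
    let model = index_parallel(&files, 2);
    println!("indexed {} documents", model.docs.len());
}
```

Whether this actually beats collecting into a `Vec` would depend on how much time is spent under the lock; keeping the tokenizing outside the critical section is the main point of the sketch.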
Anyway, here are the results:
Machine: 2019 MacBook Pro 16in -- i7-9750H CPU @ 2.60GHz (6 core)
Data: https://github.com/BSVino/docs.gl (downloaded via zip option in github)