
Explore using multithreading in dump parsing #532

Open · andrewtavis opened this issue Dec 19, 2024 · 0 comments
Labels: feature (New feature or request), help wanted (Extra attention is needed)

@andrewtavis (Member)
Description

Currently the total time to parse a Wikidata lexeme dump in Google Colab is ~250 seconds. It would be great to explore parallelizing this process to bring that time down further. The number of workers should be derived from the CPUs the user has available, but likely capped below the maximum so their system isn't overloaded.
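As a rough illustration (not Scribe-Data's actual implementation), here is a minimal sketch of splitting a line-per-entity dump into batches and parsing them across a capped worker pool. The function names `parse_lines` and `chunked`, the batch size, and the dump path are all hypothetical placeholders:

```python
# Sketch only: parallel parsing of a Wikidata lexeme dump, assuming the dump
# is a JSON array with one entity per line (as the decompressed dumps are).
# `parse_lines`, `chunked`, and the file path are hypothetical stand-ins.
import json
import os
from multiprocessing import Pool


def parse_lines(lines: list[str]) -> list[dict]:
    # Placeholder: parse a batch of lexeme JSON lines into dicts,
    # skipping the opening "[" and closing "]" lines of the array.
    return [json.loads(line.rstrip(",\n")) for line in lines if line.startswith("{")]


def chunked(iterable, size):
    # Yield successive batches of `size` items from any iterable.
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    # Leave one CPU free so the user's system stays responsive.
    workers = max(1, (os.cpu_count() or 1) - 1)
    results = []
    with open("latest-lexemes.json") as f, Pool(processes=workers) as pool:
        for parsed in pool.imap_unordered(parse_lines, chunked(f, 10_000)):
            results.extend(parsed)
```

One design note: although the issue title says "multithreading", for CPU-bound JSON parsing in CPython a process pool like the one above sidesteps the GIL, whereas threads would mostly help if the workload were I/O-bound (e.g., decompression or download).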

Contribution

@axif0 will be working on this as part of Outreachy! 📶✈️

@andrewtavis added the feature (New feature or request) and help wanted (Extra attention is needed) labels on Dec 19, 2024
Projects: Status: Todo
Development: No branches or pull requests
2 participants