
Explore using multithreading in dump parsing #532

Open · andrewtavis opened this issue Dec 19, 2024 · 0 comments
Labels: feature (New feature or request), help wanted (Extra attention is needed)

@andrewtavis (Member)
Description

Currently the total time to parse a Wikidata lexeme dump in Google Colab is ~250 seconds. It would be great to explore parallelizing this process to bring that time down further. The number of workers should be derived from the CPUs the user has available, but likely capped below the maximum so their system isn't overloaded.
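As a rough illustration (not Scribe-Data's actual implementation), here is a minimal sketch of splitting a line-per-entity dump into batches and parsing them across a capped worker pool. The function names `parse_lines` and `chunked`, the batch size, and the dump path are all hypothetical placeholders:

```python
# Sketch only: parallel parsing of a Wikidata lexeme dump, assuming the dump
# is a JSON array with one entity per line (as the decompressed dumps are).
# `parse_lines`, `chunked`, and the file path are hypothetical stand-ins.
import json
import os
from multiprocessing import Pool


def parse_lines(lines: list[str]) -> list[dict]:
    # Placeholder: parse a batch of lexeme JSON lines into dicts,
    # skipping the opening "[" and closing "]" lines of the array.
    return [json.loads(line.rstrip(",\n")) for line in lines if line.startswith("{")]


def chunked(iterable, size):
    # Yield successive batches of `size` items from any iterable.
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    # Leave one CPU free so the user's system stays responsive.
    workers = max(1, (os.cpu_count() or 1) - 1)
    results = []
    with open("latest-lexemes.json") as f, Pool(processes=workers) as pool:
        for parsed in pool.imap_unordered(parse_lines, chunked(f, 10_000)):
            results.extend(parsed)
```

One design note: although the issue title says "multithreading", for CPU-bound JSON parsing in CPython a process pool like the one above sidesteps the GIL, whereas threads would mostly help if the workload were I/O-bound (e.g., decompression or download).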

Contribution

@axif0 will be working on this as part of Outreachy! 📶✈️

@andrewtavis added the feature (New feature or request) and help wanted (Extra attention is needed) labels on Dec 19, 2024
Projects: Status: Todo
Development: No branches or pull requests
2 participants