load_dataset('bigcode/the-stack-dedup', streaming=True) very slow! #5846
Comments
This is due to the slow resolution of the data files: #5537. We plan to switch to […]
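In the meantime, a workaround sketch (not from this thread): restricting `data_files` to the subset you actually need shrinks the number of remote files the resolution step has to enumerate. The path pattern below is an assumption about this repository's layout, not verified.

```python
from datasets import load_dataset

# Sketch: narrowing data_files reduces how many remote files must be
# resolved. The "data/python/*" pattern is an assumed layout for this
# repo, not verified against its actual structure.
ds = load_dataset(
    "bigcode/the-stack-dedup",
    data_files={"train": "data/python/*"},
    streaming=True,
)
```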
You're right: when I try to parse more than 50 GB of text data, it is also very slow for me, usually taking hours or even tens of hours.
That's unrelated to the problem discussed in this issue.
Sorry, I misunderstood it.
Closing this issue as it has been addressed in […]. (This now takes 25s to execute on my machine.)
Thanks for the improvements! 🎉🎉 25 seconds is better, but still about 2500x slower than this should be! Loading a tiny 1-2 KB metadata file is all that would be necessary with a better design.
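To make that suggestion concrete, here is a hypothetical sketch of such a pre-computed manifest. Every field name and number below is invented for illustration; this is not an actual Hub format.

```python
# Hypothetical manifest, pre-computed server-side so that clients never
# have to enumerate the repository tree themselves. All field names and
# numbers here are invented for illustration only.
manifest = {
    "dataset": "bigcode/the-stack-dedup",
    "builder": "parquet",
    "splits": {"train": {"num_shards": 3000, "num_bytes": 300_000_000_000}},
    # A single glob pattern instead of a per-file listing keeps the file tiny:
    "data_files": {"train": "data/*/train-*.parquet"},
}
```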
Once we merge huggingface/huggingface_hub#2103, this should only take a few seconds. For the 2500x speed-up (without metadata files with pre-cached results), we wouldn't even be allowed to use […]. The HF Datasets Hub is (almost) platform-agnostic, so you are free to implement your own library (in a faster language than Python) to achieve this kind of performance, and we would be happy to support it 🙂.
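For anyone wanting to see where the time goes, here is a minimal timing sketch of the file-resolution step itself, using `HfFileSystem`, the fsspec interface from `huggingface_hub`. The glob pattern is an assumption, chosen to list everything in the repo.

```python
import time
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
start = time.perf_counter()
# Enumerating the repository's files is the resolution step discussed
# above; the "**" pattern is an assumption meant to list every path.
paths = fs.glob("datasets/bigcode/the-stack-dedup/**")
print(f"Resolved {len(paths)} paths in {time.perf_counter() - start:.1f}s")
```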
Describe the bug
Running the following takes about 2.5 minutes!
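A minimal, self-contained form of the reproduction (the call itself is taken from the issue title):

```python
from datasets import load_dataset

# Streaming mode does not download the data, yet this single call
# took about 2.5 minutes at the time of reporting.
ds = load_dataset("bigcode/the-stack-dedup", streaming=True)
```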
I would expect this to be near instantaneous. With other datasets, the runtime is one or two seconds.
Environment info
- `datasets` version: 2.11.0