
load_dataset('bigcode/the-stack-dedup', streaming=True) very slow! #5846

Closed
tbenthompson opened this issue May 11, 2023 · 7 comments

tbenthompson commented May 11, 2023

Describe the bug

Running

import datasets
ds = datasets.load_dataset('bigcode/the-stack-dedup', streaming=True)

takes about 2.5 minutes!

I would expect this to be near instantaneous. With other datasets, the runtime is one or two seconds.
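
A minimal way to time the call for reproduction (assuming network access to the Hub; the ~150s figure is from the environment below):

import time

import datasets

start = time.perf_counter()
ds = datasets.load_dataset('bigcode/the-stack-dedup', streaming=True)
print(f"load_dataset took {time.perf_counter() - start:.1f}s")  # prints ~150s here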

Environment info

  • datasets version: 2.11.0
  • Platform: macOS-13.3.1-arm64-arm-64bit
  • Python version: 3.10.10
  • Huggingface_hub version: 0.13.4
  • PyArrow version: 11.0.0
  • Pandas version: 2.0.0
mariosasko (Collaborator) commented May 12, 2023

This is due to the slow resolution of the data files: #5537.

We plan to switch to huggingface_hub's HfFileSystem soon to make the resolution faster (it will be up to 20x faster once we merge huggingface/huggingface_hub#1443).
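
For illustration, a minimal sketch of resolving the data files directly through HfFileSystem, huggingface_hub's fsspec-compatible view of the Hub (the parquet glob pattern is an assumption about this dataset's layout, not something datasets exposes):

from huggingface_hub import HfFileSystem

fs = HfFileSystem()
# Lists the repo tree via the Hub API instead of issuing per-file HTTP requests
data_files = fs.glob("datasets/bigcode/the-stack-dedup/data/**/*.parquet")
print(f"{len(data_files)} data files resolved")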

mariosasko self-assigned this May 12, 2023
enze5088 commented

You're right. When I try to parse more than 50GB of text data, it is also very slow, usually taking hours or even tens of hours.

tbenthompson (Author) commented

> You're right. When I try to parse more than 50GB of text data, it is also very slow, usually taking hours or even tens of hours.

That's unrelated to the problem discussed in this issue.

enze5088 commented

> > You're right. When I try to parse more than 50GB of text data, it is also very slow, usually taking hours or even tens of hours.
>
> That's unrelated to the problem discussed in this issue.

Sorry, I misunderstood.

mariosasko (Collaborator) commented

Closing this issue as it has been addressed in huggingface_hub!

(This now takes 25s to execute on my machine.)

tbenthompson (Author) commented

Thanks for the improvements! 🎉🎉

25 seconds is better, but still about 2500x slower than it should be! With a better design, loading a tiny 1-2KB metadata file would be all that's necessary.
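
For a sense of scale, the full file list of a repo is already available from a single Hub API call; a sketch (assuming the file list alone is enough to resolve the data files):

from huggingface_hub import HfApi

api = HfApi()
# One HTTP request returns the name of every file in the dataset repo
files = api.list_repo_files("bigcode/the-stack-dedup", repo_type="dataset")
print(len(files))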

mariosasko (Collaborator) commented Apr 8, 2024

Once we merge huggingface/huggingface_hub#2103, this should only take a few seconds.

For the 2500x speed-up (without metadata files containing pre-cached results), we couldn't even afford to use os.path functions or requests/aiohttp for HTTP requests, so I don't think this is feasible for us; it would make the code unreadable.

The HF Datasets Hub is (almost) platform-agnostic, so you are free to implement your own library (in a faster language than Python) to achieve this kind of performance, and we would be happy to support it 🙂.
