
load_dataset('bigcode/the-stack-dedup', streaming=True) very slow! #5846

Closed
tbenthompson opened this issue May 11, 2023 · 7 comments

tbenthompson commented May 11, 2023

Describe the bug

Running

import datasets
ds = datasets.load_dataset('bigcode/the-stack-dedup', streaming=True)

takes about 2.5 minutes!

I would expect this to be near instantaneous. With other datasets, the runtime is one or two seconds.
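
A minimal way to time the call for reproduction (assuming network access to the Hub; the ~150s figure is from the environment below):

import time

import datasets

start = time.perf_counter()
ds = datasets.load_dataset('bigcode/the-stack-dedup', streaming=True)
print(f"load_dataset took {time.perf_counter() - start:.1f}s")  # prints ~150s here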

Environment info

  • datasets version: 2.11.0
  • Platform: macOS-13.3.1-arm64-arm-64bit
  • Python version: 3.10.10
  • Huggingface_hub version: 0.13.4
  • PyArrow version: 11.0.0
  • Pandas version: 2.0.0
mariosasko (Collaborator) commented May 12, 2023

This is due to the slow resolution of the data files: #5537.

We plan to switch to huggingface_hub's HfFileSystem soon to make the resolution faster (it will be up to 20x faster once we merge huggingface/huggingface_hub#1443).
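
For illustration, a minimal sketch of resolving the data files directly through HfFileSystem, huggingface_hub's fsspec-compatible view of the Hub (the parquet glob pattern is an assumption about this dataset's layout, not something datasets exposes):

from huggingface_hub import HfFileSystem

fs = HfFileSystem()
# Lists the repo tree via the Hub API instead of issuing per-file HTTP requests
data_files = fs.glob("datasets/bigcode/the-stack-dedup/data/**/*.parquet")
print(f"{len(data_files)} data files resolved")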

mariosasko self-assigned this May 12, 2023
enze5088 commented

You're right. When I try to parse more than 50GB of text data, it is also very slow, usually taking hours or even tens of hours.

tbenthompson (Author) commented

> You're right. When I try to parse more than 50GB of text data, it is also very slow, usually taking hours or even tens of hours.

That's unrelated to the problem discussed in this issue.

enze5088 commented

> > You're right. When I try to parse more than 50GB of text data, it is also very slow, usually taking hours or even tens of hours.
>
> That's unrelated to the problem discussed in this issue.

Sorry, I misunderstood.

mariosasko (Collaborator) commented

Closing this issue as it has been addressed in huggingface_hub!

(This now takes 25s to execute on my machine.)

tbenthompson (Author) commented

Thanks for the improvements! 🎉🎉

25 seconds is better, but still about 2500x slower than it should be! With a better design, loading a tiny 1-2KB metadata file would be all that's necessary.
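
For a sense of scale, the full file list of a repo is already available from a single Hub API call; a sketch (assuming the file list alone is enough to resolve the data files):

from huggingface_hub import HfApi

api = HfApi()
# One HTTP request returns the name of every file in the dataset repo
files = api.list_repo_files("bigcode/the-stack-dedup", repo_type="dataset")
print(len(files))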

mariosasko (Collaborator) commented Apr 8, 2024

Once we merge huggingface/huggingface_hub#2103, this should only take a few seconds.

For the 2500x speed-up (without metadata files containing pre-cached results), we couldn't even afford to use os.path functions or requests/aiohttp for HTTP requests, so I don't think this is feasible for us; it would make the code unreadable.

The HF Datasets Hub is (almost) platform-agnostic, so you are free to implement your own library (in a faster language than Python) to achieve this kind of performance, and we would be happy to support it 🙂.
