
[feat] Webdataset support #111


Open: blefaudeux wants to merge 3 commits into main from ben/webdataset

Conversation

@blefaudeux (Contributor) commented Apr 28, 2025

cc @photoroman for early visibility

  • New Webdataset head
  • Python demo, benchmark
  • Multithreading is a bit of a mess, could lean on tokio more
  • Perf optimization: probably way too many copies at the moment
  • Implement shuffling, probably both buffer-wise and shard-wise (see the sketch after this list)
  • Unit tests are missing left and right
  • Expose max download concurrency
  • Dispatch tarball contents on the fly, instead of waiting for the full buffer. PD12M (11 GB shards) is a great test set
  • Gracefully handle paths which require authentication / tokens (e.g. HF). If the token is not provided, stop instead of trying to pull all payloads one by one
  • Support rank and world size
  • Expose the image key on which to align all image payloads, if any
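
To make the shuffling and rank/world-size items above concrete, here is a minimal illustrative sketch. This is not the datago API: shards_for_rank and shuffle_buffered are hypothetical names, and the shuffle buffer assumes the rand 0.8 crate.

use rand::seq::SliceRandom; // rand 0.8
use rand::Rng;

// Round-robin shard assignment: every rank sees a disjoint subset of the tarballs.
fn shards_for_rank(shards: &[String], rank: usize, world_size: usize) -> Vec<String> {
    assert!(world_size > 0 && rank < world_size);
    shards.iter().skip(rank).step_by(world_size).cloned().collect()
}

// Buffer-wise shuffling: fill a bounded buffer, then for every incoming sample
// emit a random element from the buffer (an approximate, streaming shuffle).
fn shuffle_buffered<T>(samples: impl Iterator<Item = T>, capacity: usize, rng: &mut impl Rng) -> Vec<T> {
    let capacity = capacity.max(1);
    let mut buffer: Vec<T> = Vec::with_capacity(capacity);
    let mut out = Vec::new();
    for sample in samples {
        buffer.push(sample);
        if buffer.len() >= capacity {
            let idx = rng.gen_range(0..buffer.len());
            out.push(buffer.swap_remove(idx));
        }
    }
    buffer.shuffle(rng); // flush whatever is left once the stream is exhausted
    out.append(&mut buffer);
    out
}

Shard-wise shuffling would then just be a plain shuffle of the shard list before splitting it across ranks.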


@blefaudeux force-pushed the ben/webdataset branch 5 times, most recently from d7412f6 to 7118b34 on April 30, 2025 21:55
@blefaudeux force-pushed the ben/webdataset branch 2 times, most recently from 307f2c6 to dff6f43 on May 2, 2025 14:42
@blefaudeux changed the base branch from main to ben/refactor on May 2, 2025 14:42
- async tarball pull, but behavior is clunky
- general arch could be simpler and lean on tokio more
- handling jpg/png/jpeg/cls/txt/json types
- some shuffling handling

still missing unit tests and better behavior; it pauses at the moment

better documentation

big rewrite, nicer and smaller code I believe (#117)

Co-authored-by: Benjamin Lefaudeux <[email protected]>

Async tarball pull and dispatch

Random_sampling in the config, at least for now. Thanks for the review, Roman!
@blefaudeux changed the base branch from ben/unrelated_changes to main on May 26, 2025 11:36
@blefaudeux changed the title from [WIP] Webdataset support to [feat] Webdataset support on May 26, 2025
@blefaudeux marked this pull request as ready for review on May 26, 2025 11:36
@photoroman (Contributor) left a comment

Great work! I left mostly nits, and some questions around error handling.

        .map(|url| serde_json::Value::String(url.clone()))
        .collect())
} else {
    assert!(config.url.contains("https://storage.googleapis.com/"));

@photoroman (Contributor):
What is this assert about?

@blefaudeux (Contributor, Author):
I have no idea whether this works for anything other than Google storage actually, so I guarded it when I got started with this. We would need to ask the HF folks, for instance; maybe it's actually fairly generic?

@photoroman (Contributor):
To me it looks like it would. Which part do you think could be specific to GCS?

@blefaudeux (Contributor, Author):
Something like

curl -s "https://storage.googleapis.com/storage/v1/b/webdataset/o?prefix=fake-imagenet/" | jq

returns a list of all the tarballs, but the same with https://huggingface.co/datasets/sayakpaul/pd12m-full/resolve/main (even with the auth token in the header) doesn't; that's pretty much all I know.
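
(For illustration only, not code from this PR: the listing above could be reproduced roughly like this in Rust, assuming the reqwest "blocking" feature, serde_json, a public bucket and a URL-safe prefix; pagination via pageToken is ignored.)

use serde_json::Value;

fn list_gcs_tarballs(bucket: &str, prefix: &str) -> Result<Vec<String>, Box<dyn std::error::Error>> {
    // GCS JSON API object listing, the same endpoint as the curl call above
    let listing_url = format!("https://storage.googleapis.com/storage/v1/b/{bucket}/o?prefix={prefix}");
    let body: Value = reqwest::blocking::get(&listing_url)?.json()?;
    Ok(body["items"]
        .as_array()
        .map(|items| {
            items
                .iter()
                .filter_map(|item| item["name"].as_str())
                .filter(|name| name.ends_with(".tar"))
                // public objects are directly reachable at bucket/name
                .map(|name| format!("https://storage.googleapis.com/{bucket}/{name}"))
                .collect()
        })
        .unwrap_or_default())
}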

I wrote these two paths when debugging. Maybe we can actually remove the Google one, as I think
"https://huggingface.co/datasets/sayakpaul/pd12m-full/resolve/dataset/main/{00155..02480}.tar" is more typical, but maybe we could ask our HF friends?

@photoroman (Contributor):
I see, makes sense. I don't know what's typical. Yes, best to ask the HF folks.

Some items are still missing (it would be good to propagate the archive name, for instance), but most fixes should be there

@blefaudeux (Contributor, Author) commented:

Perf profile, since I was a bit curious: the image resize is really what is eating CPU cycles, the rest seems pretty OK.
[perf profile screenshot]
