[feat] Webdataset support #111
base: main
d7412f6 to 7118b34
307f2c6 to dff6f43
fd20f69 to 05ae0e3
- async tarball pull, but behavior is clunky
- general arch could be simpler and using tokio more
- handling jpg/png/jpeg/cls/txt/json types
- some shuffling handling

missing unit tests, and better behavior, doing pauses at the moment
better documentation
big rewrite, nicer and smaller code I believe (#117)

Co-authored-by: Benjamin Lefaudeux <[email protected]>

Async tarball pull and dispatch
Random_sampling in the config, at least for now. Thanks for the review Roman!
5b9d6a1 to cc6827f
Great work! I left mostly nits and some questions around error handling.
.map(|url| serde_json::Value::String(url.clone()))
.collect())
} else {
assert!(config.url.contains("https://storage.googleapis.com/"));
What is this assert about?
I have no idea whether this works for anything other than Google Cloud Storage, so I guarded it when I got started on this. We would need to ask the HF folks, for instance; maybe it's actually fairly generic?
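For reference, a minimal sketch of how the guard above could be surfaced as a recoverable error rather than a panic, which is one way to address the error-handling questions. The names `ListingError` and `check_listing_url` are hypothetical and not part of this PR; the check deliberately mirrors the `contains` test from the assert.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; not part of the PR.
#[derive(Debug, PartialEq)]
enum ListingError {
    /// The URL does not point at the only host the listing path was tried against.
    UnsupportedHost(String),
}

impl fmt::Display for ListingError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ListingError::UnsupportedHost(url) => {
                write!(f, "bucket listing was only tried against GCS, got: {url}")
            }
        }
    }
}

impl std::error::Error for ListingError {}

/// Host string copied from the assert under review.
const GCS_HOST: &str = "https://storage.googleapis.com/";

/// Same condition as the assert, but the caller decides what to do on failure.
fn check_listing_url(url: &str) -> Result<(), ListingError> {
    if url.contains(GCS_HOST) {
        Ok(())
    } else {
        Err(ListingError::UnsupportedHost(url.to_string()))
    }
}
```

A caller could then use `check_listing_url(&config.url)?` and report the error to the user, assuming the surrounding function returns a compatible `Result`.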
To me it looks like it would. Which part do you think could be specific to GCS?
Something like
curl -s "https://storage.googleapis.com/storage/v1/b/webdataset/o?prefix=fake-imagenet/" | jq
returns a list of all the tarballs, but the same request against https://huggingface.co/datasets/sayakpaul/pd12m-full/resolve/main (even with the auth token in the header) doesn't. That's pretty much all I know.
I wrote these two paths when debugging. Maybe we can actually remove the Google one, as I think
"https://huggingface.co/datasets/sayakpaul/pd12m-full/resolve/dataset/main/{00155..02480}.tar" is more typical, but maybe we could ask our HF friends?
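To make the brace pattern in that URL concrete, here is a rough sketch of expanding a single `{start..end}` range into one URL per tarball. It is only an illustration under assumptions: `expand_shard_range` is a hypothetical helper, not the code in this PR, and it handles exactly one numeric range, with zero-padding taken from the width of the start token.

```rust
/// Expand one `{start..end}` numeric range in a URL template into a list of URLs.
/// Returns `None` when the template has no well-formed range or when start > end.
/// Hypothetical helper for illustration; the PR may expand ranges differently.
fn expand_shard_range(template: &str) -> Option<Vec<String>> {
    // Locate the first `{` and the first `}` that follows it.
    let open = template.find('{')?;
    let close = open + template[open..].find('}')?;

    // Split the inner text on `..` and parse both bounds.
    let (start, end) = template[open + 1..close].split_once("..")?;
    let width = start.len(); // zero-padding width, e.g. 5 for "00155"
    let first: u32 = start.parse().ok()?;
    let last: u32 = end.parse().ok()?;
    if first > last {
        return None;
    }

    let prefix = &template[..open];
    let suffix = &template[close + 1..];
    Some(
        (first..=last)
            .map(|i| format!("{}{:0width$}{}", prefix, i, suffix, width = width))
            .collect(),
    )
}
```

With the pattern quoted above, this would produce 2326 URLs, from `.../00155.tar` to `.../02480.tar`, with no listing request to the server at all, which is why it does not depend on a GCS-style listing endpoint.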
I see, makes sense. I don't know what's typical. Yes, best to ask the HF folks.
Some missing items (would be good to propagate the archive name for instance), but most fixes should be there
… competing sample pull
1d6c20c to 72fc230
cc @photoroman for early visibility