
Improve estimated row count #3055

Open
hlky opened this issue Aug 31, 2024 · 3 comments
hlky commented Aug 31, 2024

Currently, the estimated row count assumes file sizes are consistent across the entire set; from what I've seen, this results in wildly inaccurate estimates for WebDataset. WebDatasets are typically created with a fixed number of samples per file, so a simpler and more accurate estimate can be calculated by multiplying the row count of one shard by the total number of shards.
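A minimal sketch of the proposed shard-based estimate (function and variable names are hypothetical, not from the codebase):

```python
def estimate_rows_from_shards(rows_in_first_shard: int, num_shards: int) -> int:
    """Estimate the total row count assuming every shard holds the same
    number of samples, as is typical for WebDatasets written with a fixed
    samples-per-shard setting."""
    return rows_in_first_shard * num_shards

# e.g. a dataset with 128 shards of 10_000 samples each
estimate_rows_from_shards(10_000, 128)  # -> 1_280_000
```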


lhoestq commented Sep 3, 2024

The current estimator uses this formula:

estimated_num_rows = total_files_bytes / sampled_bytes * num_rows_in_sample

where the sample is obtained by streaming the first 5GB of in-memory data from the dataset.

It was made to work for arbitrary file formats.
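As a sketch, the size-based extrapolation described above amounts to the following (names are illustrative, not the actual implementation):

```python
def estimate_rows_from_bytes(
    total_files_bytes: int, sampled_bytes: int, num_rows_in_sample: int
) -> int:
    """Extrapolate the total row count from a sampled prefix, assuming the
    bytes-per-row ratio observed in the sample holds for the remaining files."""
    return int(total_files_bytes / sampled_bytes * num_rows_in_sample)

# e.g. 1000 rows found in a 5GB sample of a 50GB dataset
estimate_rows_from_bytes(50 * 1024**3, 5 * 1024**3, 1000)  # -> 10000
```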

For WebDataset, AFAIK there is no strict rule requiring a fixed number of samples per shard, so I don't know how often your method would be more or less accurate. Unless this rule is enforced somewhere?


hlky commented Sep 3, 2024

Yes, I am aware of how the current estimator works; as stated in the issue, it assumes a consistent file size across the entire set.

There may not be a strict rule requiring a fixed number of samples, just as there is no rule that the number of samples in the first 5GB matches the rest of the dataset. Nevertheless, a fixed number of samples per shard is the typical usage, and WebDataset's ShardWriter does enforce a fixed number of samples per shard. I don't know how inaccurate the current method is across all of Hugging Face, you'd have to check that, but I do know it's inaccurate for at least 2 of my own datasets: 288k estimated vs 237k actual in one case, and 689k estimated vs 929k actual in another.
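For reference, the relative errors implied by those two datasets (pure arithmetic on the figures above; the helper name is hypothetical):

```python
def relative_error(estimated: int, actual: int) -> float:
    """Signed relative error of an estimate (positive means overestimate)."""
    return (estimated - actual) / actual

relative_error(288_000, 237_000)  # roughly +0.215, a ~21% overestimate
relative_error(689_000, 929_000)  # roughly -0.258, a ~26% underestimate
```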


lhoestq commented Sep 4, 2024

Oh, great to see that ShardWriter does enforce this. Since it's the official implementation and most people use it, we can probably rely on this assumption :)

I'd be happy to provide some guidance if you want to look into how to implement this!
