Inefficiencies in DataLoader Instantiation #258

Open
le1nux opened this issue Sep 17, 2024 · 0 comments
Labels
bug Something isn't working

System Info

modalities v0.02

🐛 Describe the bug

When instantiating the dataloader, there are two inefficiencies that cause very long instantiation times when using large datasets.

  1. The index generation for the packed data uses a for loop over all samples. Since block_size and num_samples are known, the loop can be replaced with a vectorized operation.

https://github.com/Modalities/modalities/blob/c9b4aabd1d216931c33cbf2b10e227a2502767f2/src/modalities/dataloader/dataset.py#L331-#L334
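As a rough illustration of the first point, here is a minimal sketch of what a vectorized index generation could look like. It does not reproduce the linked code: the function names, the `token_size_in_bytes` parameter, and the assumption that each sample is a non-overlapping `(byte_offset, byte_length)` pair are all hypothetical and only serve to show the idea.

```python
import numpy as np


def build_index_loop(num_samples: int, block_size: int, token_size_in_bytes: int):
    # Loop-based reference: one Python-level iteration per sample.
    # Assumed layout (hypothetical): sample i starts at i * block_size tokens.
    sample_bytes = block_size * token_size_in_bytes
    return [(i * sample_bytes, sample_bytes) for i in range(num_samples)]


def build_index_vectorized(num_samples: int, block_size: int, token_size_in_bytes: int) -> np.ndarray:
    # Same index, but computed with array operations instead of a Python loop.
    sample_bytes = block_size * token_size_in_bytes
    offsets = np.arange(num_samples, dtype=np.int64) * sample_bytes
    lengths = np.full(num_samples, sample_bytes, dtype=np.int64)
    # Shape (num_samples, 2): column 0 is the byte offset, column 1 the byte length.
    return np.stack([offsets, lengths], axis=1)
```

If the real index uses overlapping blocks or a different stride, only the expression for `offsets` would need to change; the point is that the index follows from `block_size` and `num_samples` alone.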

  2. The Dataloader implementation uses a ResumableBatchSampler to allow for skipping samples (e.g., for warm starts).
    The index in the ResumableBatchSampler is created in the constructor via
    self.indices = list(iter(self.underlying_batch_sampler))

and the samples are later skipped via

return iter(self.indices[self.start_index :])

Since the DistributedSampler only gives us an iterable, the only way to build a copy of the index is to loop over it in full, which is what list(iter(...)) does in the constructor.

A solution would be to adapt the original DistributedSampler with sample-skipping functionality here:
https://github.com/pytorch/pytorch/blob/e248c1d7ebe437094d42d6cad0acf5ffd0a27cad/torch/utils/data/distributed.py#L114

Skipping samples directly in DistributedSampler would allow us to remove the for-loop.
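To make the proposal more concrete, here is a minimal sketch of a sampler that skips samples itself. It is a simplified, hypothetical stand-in and not the PyTorch class: `SkippingShardSampler` and `skip_num_samples` are made-up names, it only mirrors the rank-strided sharding (`indices[rank:total_size:num_replicas]`), it drops the remainder instead of padding, and it skips at the sample level rather than the batch level.

```python
import random
from typing import Iterator, Sequence


class SkippingShardSampler:
    """Hypothetical sketch of a distributed sampler with built-in skipping."""

    def __init__(self, dataset_len: int, num_replicas: int, rank: int,
                 skip_num_samples: int = 0, shuffle: bool = False, seed: int = 0):
        self.dataset_len = dataset_len
        self.num_replicas = num_replicas
        self.rank = rank
        self.skip_num_samples = skip_num_samples
        self.shuffle = shuffle
        self.seed = seed
        # Simplification: drop the tail so every rank gets the same number of samples.
        self.num_samples = dataset_len // num_replicas
        self.total_size = self.num_samples * num_replicas

    def _shard(self) -> Sequence[int]:
        if not self.shuffle:
            # A range is lazy, so slicing it below costs O(1) and copies nothing.
            return range(self.rank, self.total_size, self.num_replicas)
        # Shuffled case: the permutation is materialized once, then sharded by rank.
        indices = list(range(self.dataset_len))
        random.Random(self.seed).shuffle(indices)
        return indices[self.rank:self.total_size:self.num_replicas]

    def __iter__(self) -> Iterator[int]:
        # Skipping happens here, so no wrapper has to copy the index first.
        return iter(self._shard()[self.skip_num_samples:])

    def __len__(self) -> int:
        return max(self.num_samples - self.skip_num_samples, 0)
```

For example, `list(SkippingShardSampler(10, num_replicas=2, rank=1, skip_num_samples=2))` yields `[5, 7, 9]`. With skipping handled inside the sampler, a wrapper like the ResumableBatchSampler would no longer need to build its own copy of the index.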

@le1nux le1nux added the bug Something isn't working label Sep 17, 2024