
How to support num workers > batch_size ? #10

Open
kaixinbear opened this issue Mar 21, 2023 · 1 comment

Comments

@kaixinbear

Hi, etienne87:
I find that this repo caps num_workers at the batch size (https://github.com/etienne87/pytorch-stream-dataloader/blob/master/pytorch_stream_dataloader/stream_dataloader.py#L45).
However, this makes the dataloader much slower than the vanilla dataloader, which makes training take longer.
Do you have any suggestions on how to support num_workers just like the original dataloader in PyTorch?

@etienne87
Owner

etienne87 commented Mar 21, 2023

Hi @kaixinbear. The original intent of the repository was to stream from uninterrupted stream readers (e.g. hard-to-seek video, online data, etc.), so by definition 1 process = 1 or multiple streams. The use case where 1 stream is handled by several processes in parallel (so num_workers can be >> batch_size) is possible if you can seek inside your stream. In that case you can use the original PyTorch dataloader: you just need to map the batch index to a file position and use it in the __getitem__ method.
However, I am not totally convinced this is so much slower. Can you provide an example?
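The index-to-file-position mapping described above could be sketched roughly as follows. This is a minimal illustration, not part of pytorch-stream-dataloader: `SeekableRecordDataset` and `RECORD_SIZE` are hypothetical names, and the example assumes a seekable binary file of fixed-size records. In real code the class would subclass `torch.utils.data.Dataset` and decode the bytes into tensors; a plain class with `__len__`/`__getitem__` is shown here since that is all a map-style DataLoader requires.

```python
import os
import tempfile

RECORD_SIZE = 8  # bytes per record (assumption for this example)


class SeekableRecordDataset:
    """Map-style dataset: index -> file offset -> record bytes.

    Because each __getitem__ call opens and seeks independently,
    the standard DataLoader can shard indices across any number of
    workers; num_workers is not tied to batch_size.
    """

    def __init__(self, path, record_size=RECORD_SIZE):
        self.path = path
        self.record_size = record_size
        self._length = os.path.getsize(path) // record_size

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        # Map the index to a file position and seek there.
        with open(self.path, "rb") as f:
            f.seek(index * self.record_size)
            return f.read(self.record_size)


if __name__ == "__main__":
    # Write 4 dummy records, then read one back by random access.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        for i in range(4):
            tmp.write(bytes([i]) * RECORD_SIZE)
    ds = SeekableRecordDataset(tmp.name)
    assert len(ds) == 4
    assert ds[2] == bytes([2]) * RECORD_SIZE
    os.unlink(tmp.name)
```

With such a dataset you would pass it straight to `torch.utils.data.DataLoader(ds, batch_size=..., num_workers=...)`, with no constraint between the two values, since the workers only pull independent indices rather than holding open stream state.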
