
Training time and implementation details #59

Open

MLDeS opened this issue Sep 14, 2023 · 1 comment

Comments

MLDeS commented Sep 14, 2023

How long did it take to train the TAPIR model on 64 TPU-v3 cores on the complete MOVi-E dataset? How is the training time expected to scale to 4 A100 (80 GB or 40 GB) GPUs? Also, by 50,000 training steps, I assume that means the number of gradient updates? Approximately how many epochs would that correspond to?

Edit: I see that the batch size is 8 and the dataset size is around 10k, so would it be close to ~40 epochs?
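
For reference, here's the back-of-envelope arithmetic behind that estimate (a sketch using only the figures quoted above, not values read from the codebase):

```python
# Rough epoch estimate: one "epoch" = one full pass over the dataset.
# All figures are those quoted in the question, not from the code.
train_steps = 50_000   # gradient updates
batch_size = 8         # samples per step (assumed to be the global batch)
dataset_size = 10_000  # approximate number of MOVi-E videos

samples_seen = train_steps * batch_size   # 400,000
epochs = samples_seen / dataset_size      # 40.0
print(f"~{epochs:.0f} epoch-equivalents")  # ~40
```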

@cdoersch
Collaborator

It's a bit difficult to define "epochs," since we sample different points on every step. Our internal dataset is more like 100K videos, and the batch size is 8 per device, so you need to multiply the batch size by 64 (one per TPU core), giving an effective global batch size of 512.
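
Revisiting the earlier estimate with these corrected figures (a sketch only; "epoch-equivalents" is a loose notion here since different points are sampled every step, and the 50,000-step count is carried over from the question):

```python
# Redo the estimate with the figures from this reply.
train_steps = 50_000        # assumed from the question above
per_device_batch = 8
num_devices = 64            # TPU-v3 cores
dataset_size = 100_000      # internal dataset, ~100K videos

global_batch = per_device_batch * num_devices        # 512
epochs = train_steps * global_batch / dataset_size   # 256.0
print(f"global batch: {global_batch}, ~{epochs:.0f} epoch-equivalents")
```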

Overall, training finishes in about 3 days. For what it's worth, we suspect it would be more efficient to train TAPIR on NVIDIA hardware, since TAPIR has lots of gather operations, and gathers are much faster on GPUs than on TPUs. However, we don't have access to larger multi-GPU setups internally, so for us it's still faster to use TPUs.
