How long did it take to train the TAPIR model on 64 TPU-v3 cores on the complete MOVi-E dataset? How would the training time be expected to scale to 4 A100 80GB or 40GB GPUs? Also, I assume the 50,000 training steps refer to the number of gradient updates? Approximately how many epochs would that correspond to?
Edit: I see that the batch size is 8 and the dataset size is around 10k, so would that be close to ~40 epochs?
It's a bit difficult to define "epochs" since we sample different points on every step. Our internal dataset is more like 100K videos, and the batch size is 8 per device, so you need to multiply that batch size by 64 devices.
Overall, training finishes in about 3 days. For what it's worth, we suspect it would be more efficient to train TAPIR on NVIDIA hardware, since TAPIR has lots of gather operations, and gathers are much faster on GPUs than on TPUs. However, we don't have access to larger multi-GPU setups internally, so for us it's still faster to use TPUs.
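For reference, a rough back-of-the-envelope sketch of the "epoch" count implied by the numbers in this thread, assuming one epoch means one full pass over the videos (the 10k and 100K figures are the public MOVi-E size mentioned above and the internal dataset size, respectively):

```python
# Rough epoch estimate; the dataset sizes and batch configuration below are
# the figures quoted in this thread, not values read from the training config.
num_devices = 64          # TPU-v3 cores
batch_per_device = 8      # per-device batch size
train_steps = 50_000      # gradient updates

effective_batch = num_devices * batch_per_device   # 512 videos per step
videos_seen = train_steps * effective_batch        # 25.6M video samples total

for dataset_size in (10_000, 100_000):
    print(f"{dataset_size:>7} videos -> ~{videos_seen / dataset_size:.0f} epochs")
```

Since points are resampled on every step, these numbers are only a loose notion of an epoch.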