
Training time and implementation details #59

Open

MLDeS opened this issue Sep 14, 2023 · 1 comment

Comments

MLDeS commented Sep 14, 2023

How long did it take to train the TAPIR model on 64 TPU-v3 cores on the complete MOVi-E dataset? How is the training time expected to scale to 4 A100 (80 GB or 40 GB) GPUs? Also, by 50,000 training steps, I assume that means the number of gradient updates? Approximately how many epochs would that correspond to?

Edit: I see that the batch size is 8 and the dataset size is around 10k, so would it be close to ~40 epochs?
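
For reference, here's the back-of-envelope arithmetic behind that estimate (a sketch using only the figures quoted above, not values read from the codebase):

```python
# Rough epoch estimate: one "epoch" = one full pass over the dataset.
# All figures are those quoted in the question, not from the code.
train_steps = 50_000   # gradient updates
batch_size = 8         # samples per step (assumed to be the global batch)
dataset_size = 10_000  # approximate number of MOVi-E videos

samples_seen = train_steps * batch_size   # 400,000
epochs = samples_seen / dataset_size      # 40.0
print(f"~{epochs:.0f} epoch-equivalents")  # ~40
```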

@cdoersch
Collaborator

It's a bit difficult to define "epochs," since we sample different points on every step. Our internal dataset is more like 100K videos, and the batch size is 8 per device, so you need to multiply the batch size by 64 (one per TPU core), giving an effective global batch size of 512.
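
Revisiting the earlier estimate with these corrected figures (a sketch only; "epoch-equivalents" is a loose notion here since different points are sampled every step, and the 50,000-step count is carried over from the question):

```python
# Redo the estimate with the figures from this reply.
train_steps = 50_000        # assumed from the question above
per_device_batch = 8
num_devices = 64            # TPU-v3 cores
dataset_size = 100_000      # internal dataset, ~100K videos

global_batch = per_device_batch * num_devices        # 512
epochs = train_steps * global_batch / dataset_size   # 256.0
print(f"global batch: {global_batch}, ~{epochs:.0f} epoch-equivalents")
```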

Overall, training finishes in about 3 days. For what it's worth, we suspect it would be more efficient to train TAPIR on NVIDIA hardware, since TAPIR has lots of gather operations, and gathers are much faster on GPUs than on TPUs. However, we don't have access to larger multi-GPU setups internally, so for us it's still faster to use TPUs.
