All released checkpoints are saved together with the `clip_teacher` weights, which are loaded from the CLIP vision encoder.
The models are initialized from K710 pretraining (Stage 1) and then further pretrained on multimodal data (Stage 2) in one of three settings:
- 5M: CC3M + WebVid2M
- 17M: CC3M + CC12M + COCO + VG + SBU + WebVid2M
- 25M: CC3M + CC12M + COCO + VG + SBU + WebVid10M
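
Because the checkpoints carry the `clip_teacher` weights alongside the model, it can be convenient to drop them when the teacher is not needed downstream. The snippet below is a minimal sketch of that filtering step on a plain dictionary. The `clip_teacher.` key prefix, the other key names, and the helper `drop_teacher_weights` are assumptions made for illustration; the actual key layout should be checked against a loaded checkpoint.

```python
from collections import OrderedDict


def drop_teacher_weights(state_dict, prefix="clip_teacher."):
    """Return a copy of state_dict without entries whose key starts with prefix.

    The default prefix is an assumption about how the teacher is stored.
    """
    return OrderedDict(
        (key, value)
        for key, value in state_dict.items()
        if not key.startswith(prefix)
    )


# Toy stand-in for a loaded checkpoint; real values would be tensors
# and the key names below are hypothetical.
toy_state = OrderedDict([
    ("clip_teacher.visual.proj", [0.1, 0.2]),
    ("vision_encoder.patch_embed.weight", [0.3]),
    ("text_encoder.embeddings.weight", [0.4]),
])

student_only = drop_teacher_weights(toy_state)
print(list(student_only))
```

The original dictionary is left untouched, so the full checkpoint stays available if Stage 2 pretraining is resumed later.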

Model | Setting | Checkpoint | Script |
---|---|---|---|
UMT-B/16 | 5M | ckpt | script |
UMT-B/16 | 17M | ckpt | script |
UMT-B/16 | 25M | ckpt | script |
UMT-L/16 | 5M | ckpt | script |
UMT-L/16 | 17M | ckpt | script |
UMT-L/16 | 25M | ckpt | script |

Each result cell lists R@1 / R@5 / R@10.

Dataset | Retrieval | UMT-B/16, 5M | UMT-B/16, 17M | UMT-B/16, 25M | UMT-L/16, 5M | UMT-L/16, 17M | UMT-L/16, 25M |
---|---|---|---|---|---|---|---|
MSRVTT | T2V | 29.6 / 52.8 / 61.9 | 35.5 / 59.3 / 68.6 | 35.2 / 57.8 / 66.0 | 33.3 / 58.1 / 66.7 | 42.6 / 64.4 / 73.1 | 40.7 / 63.4 / 71.8 |
MSRVTT | V2T | 26.2 / 46.7 / 54.9 | 31.6 / 53.5 / 64.1 | 30.3 / 50.7 / 61.4 | 30.2 / 51.3 / 61.6 | 38.6 / 59.8 / 69.6 | 37.1 / 58.7 / 68.9 |
MSRVTT | Material | script | script | script | script | script | script |
DiDeMo | T2V | 33.4 / 58.3 / 67.0 | 41.9 / 66.7 / 75.0 | 41.2 / 65.4 / 74.9 | 34.0 / 60.4 / 68.7 | 46.4 / 70.0 / 78.8 | 48.6 / 72.9 / 79.0 |
DiDeMo | V2T | 32.0 / 58.7 / 68.2 | 40.3 / 66.6 / 75.8 | 40.8 / 67.7 / 76.7 | 36.2 / 60.0 / 68.6 | 46.5 / 72.2 / 79.5 | 49.9 / 74.8 / 81.4 |
DiDeMo | Material | script | script | script | script | script | script |
ActivityNet | T2V | 28.3 / 53.0 / 64.2 | 33.8 / 59.1 / 70.4 | 35.5 / 60.6 / 71.8 | 31.9 / 60.2 / 72.0 | 42.8 / 69.6 / 79.8 | 41.9 / 68.9 / 80.3 |
ActivityNet | V2T | 25.9 / 50.2 / 61.7 | 31.6 / 56.2 / 67.9 | 32.8 / 57.6 / 69.2 | 30.0 / 59.1 / 71.3 | 40.7 / 67.6 / 78.6 | 39.4 / 66.8 / 78.3 |
ActivityNet | Material | script | script | script | script | script | script |
LSMDC | T2V | 16.8 / 30.5 / 37.6 | 18.1 / 33.1 / 40.0 | 19.1 / 33.4 / 42.2 | 20.0 / 37.2 / 43.7 | 25.2 / 43.0 / 50.5 | 24.9 / 41.7 / 51.8 |
LSMDC | V2T | 12.9 / 27.4 / 33.6 | 16.0 / 29.9 / 35.7 | 15.7 / 30.6 / 37.4 | 16.1 / 32.0 / 39.2 | 23.2 / 37.7 / 44.2 | 21.9 / 37.8 / 45.7 |
LSMDC | Material | script | script | script | script | script | script |
MSVD | T2V | 36.2 / 65.7 / 76.1 | 41.4 / 70.6 / 80.1 | 42.3 / 71.7 / 80.8 | 44.4 / 73.3 / 82.4 | 49.9 / 77.7 / 85.3 | 49.0 / 76.9 / 84.7 |
MSVD | V2T | 58.5 / 78.7 / 84.3 | 62.5 / 80.8 / 87.0 | 61.9 / 82.5 / 88.5 | 66.1 / 85.5 / 89.4 | 75.4 / 89.6 / 94.0 | 74.5 / 89.7 / 92.8 |
MSVD | Material | script | script | script | script | script | script |
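
The R@1, R@5, and R@10 figures in the retrieval results are recall-at-K scores: the percentage of queries whose correct match appears among the top K ranked candidates. The sketch below shows one common way to compute them from a query-by-candidate similarity matrix. It assumes a one-to-one pairing in which candidate `i` is the correct match for query `i`, which is a simplification (datasets with several captions per video need a different ground-truth mapping), and the function name `recall_at_k` is hypothetical rather than taken from the evaluation scripts.

```python
def recall_at_k(similarity, ks=(1, 5, 10)):
    """Compute recall@K in percent from a square similarity matrix.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        # Zero-based rank: how many candidates score strictly higher
        # than the correct match.
        ranks.append(sum(1 for score in row if score > row[i]))
    n = len(ranks)
    return {k: 100.0 * sum(1 for r in ranks if r < k) / n for k in ks}


# Toy text-to-video similarity matrix (rows: texts, columns: videos).
sim = [
    [0.9, 0.1, 0.2],  # correct video ranked 1st
    [0.8, 0.3, 0.5],  # correct video ranked 3rd
    [0.2, 0.6, 0.4],  # correct video ranked 2nd
]

t2v = recall_at_k(sim, ks=(1, 2, 3))

# Video-to-text uses the transposed matrix (rows: videos, columns: texts).
v2t = recall_at_k([list(col) for col in zip(*sim)], ks=(1, 2, 3))
print(t2v, v2t)
```

In this toy case one of three texts retrieves its video at rank 1, two of three within the top 2, and all three within the top 3.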

Each result cell lists R@1 / R@5 / R@10.

Dataset | Retrieval | UMT-B/16, 5M | UMT-B/16, 17M | UMT-B/16, 25M | UMT-L/16, 5M | UMT-L/16, 17M | UMT-L/16, 25M |
---|---|---|---|---|---|---|---|
MSRVTT | T2V | 46.3 / 72.7 / 82.0 | 50.6 / 75.4 / 83.5 | 51.0 / 76.5 / 84.2 | 53.3 / 76.6 / 83.9 | 56.5 / 80.1 / 87.4 | 58.8 / 81.0 / 87.1 |
MSRVTT | V2T | 44.4 / 72.8 / 80.7 | 49.4 / 76.7 / 83.5 | 49.0 / 77.0 / 84.7 | 51.4 / 76.3 / 82.8 | 56.7 / 79.6 / 86.7 | 58.6 / 81.6 / 86.5 |
MSRVTT | Material | script | script | script [ckpt] | script | script | script [ckpt] |
DiDeMo | T2V | 54.8 / 83.0 / 89.0 | 60.8 / 85.1 / 91.0 | 61.6 / 86.8 / 91.5 | 59.7 / 84.9 / 90.8 | 66.6 / 89.9 / 93.7 | 70.4 / 90.1 / 93.5 |
DiDeMo | V2T | 52.9 / 80.2 / 85.8 | 59.5 / 83.8 / 90.7 | 59.5 / 84.9 / 90.5 | 59.5 / 84.5 / 90.7 | 66.4 / 87.5 / 92.9 | 65.7 / 89.6 / 93.3 |
DiDeMo | Material | script | script | script [ckpt] | script | script | script [ckpt] |
ActivityNet | T2V | 52.1 / 80.5 / 89.6 | 56.1 / 82.5 / 91.2 | 58.3 / 83.9 / 91.5 | 58.1 / 85.5 / 92.9 | 66.6 / 88.6 / 94.7 | 66.8 / 89.1 / 94.9 |
ActivityNet | V2T | 50.0 / 79.8 / 88.2 | 54.6 / 82.1 / 91.1 | 56.0 / 83.5 / 91.7 | 55.4 / 84.4 / 92.9 | 64.3 / 87.8 / 94.8 | 64.4 / 89.1 / 94.8 |
ActivityNet | Material | script | script | script [ckpt] | script | script | script [ckpt] |
LSMDC | T2V | 30.3 / 51.8 / 61.4 | 32.3 / 54.5 / 61.9 | 32.7 / 54.7 / 63.4 | 37.7 / 60.6 / 67.3 | 41.4 / 63.8 / 72.3 | 43.0 / 65.5 / 73.0 |
LSMDC | V2T | 29.8 / 52.2 / 60.5 | 31.5 / 53.6 / 61.9 | 32.7 / 53.5 / 63.2 | 36.2 / 58.9 / 65.7 | 40.3 / 63.1 / 71.1 | 41.4 / 64.3 / 71.5 |
LSMDC | Material | script | script | script [ckpt] | script | script | script [ckpt] |
MSVD | T2V | 47.4 / 76.8 / 84.0 | 49.6 / 78.5 / 85.7 | 50.8 / 79.7 / 86.2 | 53.7 / 80.5 / 86.8 | 57.4 / 83.0 / 88.5 | 58.2 / 83.9 / 89.6 |
MSVD | V2T | 69.1 / 85.8 / 92.1 | 71.6 / 88.8 / 92.7 | 73.3 / 89.6 / 93.7 | 77.2 / 91.6 / 94.8 | 82.4 / 93.6 / 96.0 | 82.4 / 94.6 / 96.7 |
MSVD | Material | script | script | script [ckpt] | script | script | script [ckpt] |
SSV2-label | T2V | 63.1 / 87.1 / 92.3 | 63.4 / 88.0 / 92.9 | 64.2 / 88.2 / 92.7 | 70.5 / 92.4 / 95.5 | 73.1 / 93.2 / 96.4 | 73.3 / 92.7 / 96.9 |
SSV2-label | Material | script | script | script [ckpt] | script | script | script [ckpt] |
SSV2-template | T2V | 87.3 / 100 / 100 | 86.8 / 99.4 / 100 | 87.9 / 99.4 / 100 | 90.2 / 99.4 / 100 | 90.8 / 100 / 100 | 90.8 / 99.4 / 100 |
SSV2-template | Material | script | script | script [ckpt] | script | script | script [ckpt] |

Dataset | Row | UMT-B/16, 5M | UMT-B/16, 17M | UMT-B/16, 25M | UMT-L/16, 5M | UMT-L/16, 17M | UMT-L/16, 25M |
---|---|---|---|---|---|---|---|
ActivityNet-QA | Result | 43.5 | 44.9 | 44.8 | 45.1 | 47.3 | 47.9 |
ActivityNet-QA | Material | script | script | script [ckpt] | script | script | script [ckpt] |
MSRVTT-QA | Result | 44.3 | 44.9 | 44.9 | 45.5 | 46.4 | 47.1 |
MSRVTT-QA | Material | script | script | script [ckpt] | script | script | script [ckpt] |
MSRVTT-MC | Result | 95.9 | 96.3 | 96.3 | 96.8 | 97.7 | 97.3 |
MSRVTT-MC | Material | script | script | script | script | script | script |
MSVD-QA | Result | 49.1 | 48.9 | 49.5 | 51.3 | 53.4 | 55.2 |
MSVD-QA | Material | script | script | script [ckpt] | script | script | script [ckpt] |