TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou
- (Nov 11, 2023)
- Uploaded the 32-frame fine-tuned checkpoint for paragraph-to-video retrieval.
- (Oct 29, 2023)
- Released code for video pre-training, video QA, and paragraph-to-video retrieval.
- Released the checkpoint of the pre-trained TESTA-base model.
- (Oct 8, 2023)
- Our paper has been accepted by EMNLP 2023 (Findings).
- We introduce an efficient method named TESTA (TEmporal-Spatial Token Aggregation) for long-form video understanding. TESTA progressively aggregates similar visual tokens during video encoding, which can reduce the number of visual tokens by 75% and thus accelerate video encoding.
- Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block (a simplified sketch of the aggregation step is shown after this list).
- Experimental results on five datasets for paragraph-to-video retrieval and long-form VideoQA show that TESTA improves computing efficiency by 1.7× and achieves significant performance gains from its scalability in processing more input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movies.
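TESTA's aggregation step is conceptually simple: inside each encoder block, the most similar tokens are matched and merged by averaging, separately along the temporal axis and the spatial axis, so the token count shrinks as the video passes through the network. The snippet below is a minimal sketch of one such merging step along a single axis, in the spirit of the ToMe-style bipartite matching that TESTA builds on; the function name, the even/odd split, the fixed merge count `r`, and the plain averaging are illustrative assumptions, not the repository's actual module.

```python
# Minimal sketch of similarity-based token merging along one axis (time or space).
# Illustrative only -- NOT the repository's implementation of TESTA aggregation.
import torch
import torch.nn.functional as F


def merge_similar_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs by averaging.

    x: (batch, num_tokens, dim) tokens laid out along a single axis.
    Returns (batch, num_tokens - r, dim).
    """
    # Bipartite split: even-indexed tokens (set A) vs. odd-indexed tokens (set B).
    a, b = x[:, ::2], x[:, 1::2]
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)

    # Best match in B for every token in A; pick the r most similar A-tokens to merge.
    best_sim, best_idx = sim.max(dim=-1)                # (batch, |A|)
    order = best_sim.argsort(dim=-1, descending=True)
    src_idx, keep_idx = order[:, :r], order[:, r:]      # merged vs. kept A-tokens
    dst_idx = best_idx.gather(-1, src_idx)              # their matched targets in B

    d = x.size(-1)
    src = a.gather(1, src_idx.unsqueeze(-1).expand(-1, -1, d))
    # Average each merged A-token into its matched B-token.
    b = b.scatter_reduce(1, dst_idx.unsqueeze(-1).expand(-1, -1, d), src, reduce="mean")

    kept_a = a.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return torch.cat([kept_a, b], dim=1)


# Example: temporal merging of the tokens at one spatial position, 32 frames -> 24.
frames = torch.randn(2, 32, 768)
print(merge_similar_tokens(frames, r=8).shape)  # torch.Size([2, 24, 768])
```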
Currently, the repository contains the code for pre-training a general-purpose video-language model and fine-tuning it on downstream video understanding tasks, including paragraph-to-video retrieval and VideoQA.
To install the dependencies, run:
```bash
# create
conda env create -f environment.yml
# activate
conda activate testa
```
Please follow the instructions in DATASETS.md to prepare all datasets.
Zero-shot performance on paragraph-to-video retrieval:
Model | frames | QuerYD R@1 | DiDeMo R@1 | ActivityNet Captions R@1 | GFLOPs | Checkpoint |
---|---|---|---|---|---|---|
TESTA-base (ViT-B/16) | 32 | 64.4 | 64.9 | 37.1 | 786 | testa_model_base_pretrain.pth |
Fine-tuned performance on QuerYD paragraph-to-video retrieval:
Model | frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
---|---|---|---|---|---|---|
TESTA-base (ViT-B/16) | 32 | 77.0 | 90.8 | 92.6 | 420 | testa_model_base_queryd_f32_f1p12.pth |
Fine-tuned performance on ActivityNet Captions paragraph-to-video retrieval:
Model | frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
---|---|---|---|---|---|---|
TESTA-base (ViT-B/16) | 32 | 51.6 | 79.1 | 88.3 | 420 | testa_model_base_anet_f32_f1p12.pth |
Fine-tuned performance on DiDeMo paragraph-to-video retrieval:
Model | frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
---|---|---|---|---|---|---|
TESTA-base (ViT-B/16) | 32 | 57.7 | 83.3 | 89.4 | 420 | testa_model_base_didemo_f32_f1p12.pth |
Fine-tuned performance on Condensed Movies paragraph-to-video retrieval:
Model | frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
---|---|---|---|---|---|---|
TESTA-base (ViT-B/16) | 32 | 21.5 | 42.4 | 50.7 | 420 | testa_model_base_cm_f32_f1p12.pth |
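The checkpoints listed in the tables above are plain PyTorch `.pth` files. If you want to inspect one before plugging it into the provided scripts, a generic `torch.load` is enough; the snippet below only prints the top-level keys and makes no assumption about the internal layout of the TESTA checkpoints.

```python
# Generic inspection of a downloaded checkpoint; no TESTA-specific API is assumed.
import torch

ckpt = torch.load("testa_model_base_pretrain.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    # Typical entries are model weights and training metadata; the exact keys
    # depend on how the checkpoint was saved.
    for key in ckpt:
        print(key)
```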
Please refer to RUN.md for detailed instructions on training, evaluation, and reproducing the results.
- Upload fine-tuned checkpoints
- Add visualization code
- Add demos
If you have any questions, please feel free to create an issue on this repository.
If you find this code useful for your research, please consider citing:
```bibtex
@article{Ren2023TESTA,
  title={TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding},
  author={Shuhuai Ren and Sishuo Chen and Shicheng Li and Xu Sun and Lu Hou},
  journal={ArXiv},
  year={2023},
  volume={abs/2310.19060},
}
```
The codebase relies on resources from BLIP, ToMe, and TimeSformer. We thank the original authors for open-sourcing their work.