This paper was accepted at CVPR 2022!
Slow description video.
This repo provides code for QB-Norm (Cross Modal Retrieval with Querybank Normalisation).
Usage example
python dynamic_inverted_softmax.py --sims_train_test_path msrvtt/tt-ce-train-captions-test-videos-seed0.pkl --sims_test_path msrvtt/tt-ce-test-captions-test-videos-seed0.pkl --test_query_masks_path msrvtt/tt-ce-test-query_masks.pkl
To test QB-Norm on your own data you need to:
- Extract the similarity matrix between the captions from the training split and the videos from the testing split
path/to/sims/train/test
- Extract the testing split similarity matrix (similarities between the testing captions and the testing videos)
path/to/sims/test
- Run QB-Norm
python dynamic_inverted_softmax.py --sims_train_test_path path/to/sims/train/test --sims_test_path path/to/sims/test
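As a rough illustration of the first two steps, the sketch below builds both similarity matrices from pre-computed embeddings and pickles them. The random embeddings, the `cosine_sims` helper, the output file names, and the assumption that each file holds a pickled NumPy array of shape (captions, videos) are ours and are not confirmed by this repo; check `dynamic_inverted_softmax.py` for the exact format it loads.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def cosine_sims(query_embs, video_embs):
    """Return a (num_queries, num_videos) matrix of cosine similarities."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    return q @ v.T


# Placeholder embeddings; in practice these come from your retrieval model.
rng = np.random.default_rng(0)
train_caption_embs = rng.normal(size=(50, 16))
test_caption_embs = rng.normal(size=(10, 16))
test_video_embs = rng.normal(size=(10, 16))

# Step 1: training captions vs. testing videos (the querybank similarities).
sims_train_test = cosine_sims(train_caption_embs, test_video_embs)
# Step 2: testing captions vs. testing videos.
sims_test = cosine_sims(test_caption_embs, test_video_embs)

# Hypothetical output location; replace with your own directory.
out_dir = Path(tempfile.mkdtemp())
for name, sims in [("sims_train_test.pkl", sims_train_test),
                   ("sims_test.pkl", sims_test)]:
    with open(out_dir / name, "wb") as f:
        pickle.dump(sims, f)
```

The two resulting paths would then be passed as `--sims_train_test_path` and `--sims_test_path`.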
The similarity matrices for each method were extracted using the official repositories: CE+, TT-CE+, CLIP2Video, CLIP4Clip (for CLIP4Clip we used the official repo to train new models from scratch, since pre-trained weights are not provided), CLIP, MMT, Audio-Retrieval.
Here you can find our trained weights for CLIP4Clip: MSRVTT, DiDeMo, LSMDC, Activity-Net.
You can download the extracted similarity matrices for training and testing here: MSRVTT, MSRVTT 1kA CLIP2Video, MSVD, DiDeMo, LSMDC.
The inverse temperature is set to 20, with the exception of CLIP2Video, where we used 1/1.99.
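To show where the inverse temperature enters, here is a minimal sketch of a dynamic inverted softmax written from the paper's description, not copied from `dynamic_inverted_softmax.py`. The function name, its arguments, and the default `k=1` are assumptions made for illustration.

```python
import numpy as np


def dynamic_inverted_softmax(sims_train_test, sims_test, beta=20.0, k=1):
    """Illustrative QB-Norm sketch (assumed interface).

    sims_train_test: (num_bank_queries, num_videos) querybank similarities.
    sims_test:       (num_test_queries, num_videos) test similarities.
    beta:            inverse temperature.
    """
    # Videos ranked in the top-k by at least one querybank query (likely hubs).
    topk = np.argsort(-sims_train_test, axis=1)[:, :k]
    activation_set = np.unique(topk)

    # Per-video normaliser: sum of exp(beta * similarity) over the querybank.
    normaliser = np.exp(beta * sims_train_test).sum(axis=0)

    # Only normalise test queries whose top-1 video lies in the activation set.
    top1 = np.argmax(sims_test, axis=1)
    active = np.isin(top1, activation_set)

    out = sims_test.astype(float).copy()
    out[active] = np.exp(beta * sims_test[active]) / normaliser
    return out


# Toy example: video 0 is the top match for every querybank query (a hub).
bank = np.array([[0.9, 0.1, 0.2],
                 [0.8, 0.2, 0.1]])
test = np.array([[0.6, 0.5, 0.1],
                 [0.2, 0.3, 0.7]])
normalised = dynamic_inverted_softmax(bank, test, beta=20.0)
```

In this toy case the first test query initially retrieves the hub video 0; after normalisation its best match becomes video 1, while the second query (whose top-1 is not in the activation set) is left unchanged. Rankings are computed per query, so mixing normalised and raw rows does not affect the metrics.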
QB-Norm Results on MSRVTT Benchmark
Model | Split | Task | R@1 | R@5 | R@10 | MdR | Geom |
---|---|---|---|---|---|---|---|
CE+ | Full | t2v | 14.4(0.1) | 37.4(0.1) | 50.2(0.1) | 10.0(0.0) | 30.0(0.1) |
CE+ (+QB-Norm) | Full | t2v | 16.4(0.0) | 40.3(0.1) | 52.9(0.1) | 9.0(0.0) | 32.7(0.1) |
TT-CE+ | Full | t2v | 14.9(0.1) | 38.3(0.1) | 51.5(0.1) | 10.0(0.0) | 30.9(0.1) |
TT-CE+ (+QB-Norm) | Full | t2v | 17.3(0.0) | 42.1(0.2) | 54.9(0.1) | 8.0(0.0) | 34.2(0.1) |
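For readers who want to reproduce the columns of these tables, the following sketch computes R@1, R@5, R@10 and MdR from a similarity matrix. It assumes one query per item, with item i being the correct match for query i, which is a simplification (some benchmarks have several captions per video). `Geom` is taken to be the geometric mean of R@1, R@5 and R@10, which is consistent with the reported values. The `retrieval_metrics` name is hypothetical.

```python
import numpy as np


def retrieval_metrics(sims):
    """sims: (num_queries, num_items); item i is assumed correct for query i."""
    # Items sorted by decreasing similarity for each query.
    order = np.argsort(-sims, axis=1)
    # 1-based rank of the ground-truth item for each query.
    targets = np.arange(sims.shape[0])[:, None]
    ranks = np.argmax(order == targets, axis=1) + 1

    r1, r5, r10 = (100.0 * float(np.mean(ranks <= k)) for k in (1, 5, 10))
    return {
        "R@1": r1,
        "R@5": r5,
        "R@10": r10,
        "MdR": float(np.median(ranks)),
        # Geometric mean of the three recall values.
        "Geom": (r1 * r5 * r10) ** (1.0 / 3.0),
    }


# A perfect similarity matrix ranks every ground-truth item first.
metrics = retrieval_metrics(np.eye(20))
```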
QB-Norm Results on MSVD Benchmark
Model | Split | Task | R@1 | R@5 | R@10 | MdR | Geom |
---|---|---|---|---|---|---|---|
TT-CE+ | Full | t2v | 25.4(0.3) | 56.9(0.4) | 71.3(0.2) | 4.0(0.0) | 46.9(0.3) |
TT-CE+ (+QB-Norm) | Full | t2v | 28.9(0.3) | 62.0(0.4) | 74.8(0.3) | 3.0(0.0) | 51.2(0.1) |
CLIP2Video | Full | t2v | 47.0 | 76.8 | 85.9 | 2.0 | 67.7 |
CLIP2Video (+QB-Norm) | Full | t2v | 47.6 | 77.6 | 86.1 | 2.0 | 68.5 |
QB-Norm Results on DiDeMo Benchmark
Model | Split | Task | R@1 | R@5 | R@10 | MdR | Geom |
---|---|---|---|---|---|---|---|
TT-CE+ | Full | t2v | 21.6(0.7) | 48.6(0.4) | 62.9(0.6) | 6.0(0.0) | 40.4(0.4) |
TT-CE+ (+QB-Norm) | Full | t2v | 24.2(0.7) | 50.8(0.7) | 64.4(0.1) | 5.3(0.5) | 43.0(0.2) |
CLIP4Clip | Full | t2v | 43.0 | 70.5 | 80.0 | 2.0 | 62.4 |
CLIP4Clip (+QB-Norm) | Full | t2v | 43.5 | 71.4 | 80.9 | 2.0 | 63.1 |
QB-Norm Results on LSMDC Benchmark
Model | Split | Task | R@1 | R@5 | R@10 | MdR | Geom |
---|---|---|---|---|---|---|---|
TT-CE+ | Full | t2v | 17.2(0.4) | 36.5(0.6) | 46.3(0.3) | 13.7(0.5) | 30.7(0.3) |
TT-CE+ (+QB-Norm) | Full | t2v | 17.8(0.4) | 37.7(0.5) | 47.6(0.6) | 12.7(0.5) | 31.7(0.3) |
CLIP4Clip | Full | t2v | 21.3 | 40.0 | 49.5 | 11.0 | 34.8 |
CLIP4Clip (+QB-Norm) | Full | t2v | 22.3 | 40.1 | 49.5 | 11.0 | 35.4 |
The temperature used for the CLIP4Clip method on the LSMDC dataset is 0.8.
QB-Norm Results on VaTeX Benchmark
Model | Split | Task | R@1 | R@5 | R@10 | MdR | Geom |
---|---|---|---|---|---|---|---|
TT-CE+ | Full | t2v | 53.2(0.2) | 87.4(0.1) | 93.3(0.0) | 1.0(0.0) | 75.7(0.1) |
TT-CE+ (+QB-Norm) | Full | t2v | 54.8(0.1) | 88.2(0.1) | 93.8(0.1) | 1.0(0.0) | 76.8(0.0) |
CLIP2Video | Full | t2v | 57.4 | 87.9 | 93.6 | 1.0 | 77.9 |
CLIP2Video (+QB-Norm) | Full | t2v | 58.8 | 88.3 | 93.8 | 1.0 | 78.7 |
QB-Norm Results on QuerYD Benchmark
Model | Split | Task | R@1 | R@5 | R@10 | MdR | Geom |
---|---|---|---|---|---|---|---|
CE+ | Full | t2v | 13.2(2.0) | 37.1(2.9) | 50.5(1.9) | 10.3(1.2) | 29.1(2.2) |
CE+ (+QB-Norm) | Full | t2v | 14.1(1.8) | 38.6(1.3) | 51.1(1.6) | 10.0(0.8) | 30.2(1.7) |
TT-CE+ | Full | t2v | 14.4(0.5) | 37.7(1.7) | 50.9(1.6) | 9.8(1.0) | 30.3(0.9) |
TT-CE+ (+QB-Norm) | Full | t2v | 15.1(1.6) | 38.3(2.4) | 51.2(2.8) | 10.3(1.7) | 30.9(2.3) |
QB-Norm Results on MSCoCo Benchmark
Model | Split | Task | R@1 | R@5 | R@10 | MdR | Geom |
---|---|---|---|---|---|---|---|
CLIP | 5k | t2i | 30.3 | 56.1 | 67.1 | 4.0 | 48.5 |
CLIP (+QB-Norm) | 5k | t2i | 34.8 | 59.9 | 70.4 | 3.0 | 52.8 |
MMT-Oscar | 5k | t2i | 52.2 | 80.2 | 88.0 | 1.0 | 71.7 |
MMT-Oscar (+QB-Norm) | 5k | t2i | 53.9 | 80.5 | 88.1 | 1.0 | 72.6 |
QB-Norm Results on AudioCaps Benchmark
Model | Split | Task | R@1 | R@5 | R@10 | MdR | Geom |
---|---|---|---|---|---|---|---|
AR-CE | Full | t2a | 23.1(0.6) | 55.1(0.7) | 70.7(0.6) | 4.7(0.5) | 44.8(0.7) |
AR-CE (+QB-Norm) | Full | t2a | 23.9(0.2) | 57.1(0.3) | 71.6(0.4) | 4.0(0.0) | 46.0(0.3) |
If you find this code useful or use the extracted similarity matrices, please consider citing:
@inproceedings{bogolin2021cross,
title={Cross Modal Retrieval with Querybank Normalisation},
author={Simion-Vlad Bogolin and Ioana Croitoru and Hailin Jin and Yang Liu and Samuel Albanie},
booktitle={CVPR},
year={2022}
}