Source code of our paper Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning.
- Environments
- Required Data
- NRCCR on VATEX
- NRCCR on MSRVTT10K-CN
- NRCCR on Multi-30K
- NRCCR on MSCOCO
- Reference
- CUDA 11.3
- Python 3.8.5
- PyTorch 1.10.2
We used Anaconda to set up a deep learning workspace that supports PyTorch. Run the following script to install the required packages.
conda create --name nrccr_env python=3.8.5
conda activate nrccr_env
git clone https://github.com/LiJiaBei-7/nrccr.git
cd nrccr
pip install -r requirements.txt
conda deactivate
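As a quick sanity check after installation, the sketch below compares the running interpreter and the installed PyTorch package against the versions listed above. The names `EXPECTED`, `installed_version`, and `report_versions` are illustrative and not part of the repository, and the CUDA version is not checked here.

```python
import sys
from importlib import metadata

# Versions listed in the Environments section of this README.
EXPECTED = {"python": "3.8.5", "torch": "1.10.2"}


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_versions(expected=EXPECTED):
    """Map each component to a tuple (expected, found, matches)."""
    found = {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "torch": installed_version("torch"),
    }
    report = {}
    for name, want in expected.items():
        have = found.get(name)
        # Ignore local build suffixes such as "+cu113" when comparing.
        base = have.split("+")[0] if have else None
        report[name] = (want, have, base == want)
    return report


if __name__ == "__main__":
    for name, (want, have, ok) in report_versions().items():
        status = "ok" if ok else "differs"
        print(f"{name}: expected {want}, found {have} ({status})")
```

Run it inside the activated `nrccr_env` environment; a "differs" line does not necessarily mean the code will fail, only that the setup deviates from the one used here.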
We use three public datasets: VATEX, MSR-VTT-CN, and Multi-30K. The extracted features are placed in $HOME/VisualSearch/.
For Multi-30K, we provide translated versions (from Google Translate) of Task1 and Task2. [Task1: applied to translation tasks. Task2: applied to captioning tasks.]
In addition, we provide the MSCOCO dataset here, with the corresponding performance reported below. The Japanese validation and test sets come from STAIR Captions, and the Chinese ones from COCO-CN.
Training set:
source(en) + translation(en2xx) + back-translation(en2xx2en)
Validation set and test set:
target(xx) + translation(xx2en)
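The composition above can be pictured as two record types: a training sample carries three captions, and an evaluation sample carries two. This is only an illustrative sketch. The class names, field names, and example captions are hypothetical and do not reflect how the repository stores its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSample:
    source_en: str            # original English caption
    translation_xx: str       # machine translation en -> xx
    back_translation_en: str  # machine translation en -> xx -> en

    def texts(self):
        return [self.source_en, self.translation_xx, self.back_translation_en]


@dataclass(frozen=True)
class EvalSample:
    target_xx: str       # caption written in the target language
    translation_en: str  # machine translation xx -> en

    def texts(self):
        return [self.target_xx, self.translation_en]


# Placeholder captions, with German standing in for the target language "xx".
train = TrainSample(
    source_en="a man rides a horse",
    translation_xx="ein Mann reitet ein Pferd",
    back_translation_en="a man is riding a horse",
)
evaluation = EvalSample(
    target_xx="ein Hund rennt im Park",
    translation_en="a dog runs in the park",
)
print(len(train.texts()), len(evaluation.texts()))  # 3 2
```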
Dataset | feature | caption |
---|---|---|
VATEX | vatex-i3d.tar.gz, pwd:p3p0 | vatex_caption, pwd:oy27 |
MSR-VTT-CN | msrvtt10k-resnext101_resnet152.tar.gz, pwd:p3p0 | cn_caption, pwd:es37 |
Multi-30K | multi30k-resnet152.tar.gz, pwd:5khe | multi30k_caption, pwd:oy27 |
MSCOCO | | mscoco_caption, pwd:21kx |
ROOTPATH=$HOME/VisualSearch
mkdir -p $ROOTPATH && cd $ROOTPATH
Organize these files like this:
# download the data of VATEX [English, Chinese]
VisualSearch/VATEX/
    FeatureData/
        i3d_kinetics/
            feature.bin
            id.txt
            shape.txt
            video2frames.txt
    TextData/
        xx.txt
# download the data of MSR-VTT-CN [English, Chinese]
VisualSearch/msrvttcn/
    FeatureData/
        resnext101-resnet152/
            feature.bin
            id.txt
            shape.txt
            video2frames.txt
    TextData/
        xx.txt
# download the data of Multi-30K [English, German, French, Czech]
# For Task2, the training set was translated from Flickr30K, which contains five captions per image, while for Task1, each image corresponds to one caption.
# The validation and test sets for French and Czech are the same in both tasks.
VisualSearch/multi30k/
    FeatureData/
        train_id.txt
        val_id.txt
        test_id_2016.txt
        resnet_152[optional]/
            train-resnet_152-avgpool.npy
            val-resnet_152-avgpool.npy
            test_2016_flickr-resnet_152-avgpool.npy
    TextData/
        xx.txt
    flickr30k-images/
        xx.jpg
# download the data of MSCOCO [English, Chinese, Japanese]
VisualSearch/mscoco/
    FeatureData/
        train_id.txt
        ja_val_id.txt
        zh_val_id.txt
        ja_test_id.txt
        zh_test_id.txt
    TextData/
        xx.txt
    all_pics/
        xx.jpg
    image_ids.txt
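Before training, it can help to confirm that the downloaded files landed where the scripts expect them. The sketch below checks the VATEX layout listed above and reports anything missing. The helper `missing_paths` and the `VATEX_LAYOUT` list are illustrative, not part of the repository, and they assume the feature folder sits under `FeatureData/`.

```python
from pathlib import Path

# Relative paths taken from the VATEX layout listed above
# (assumed nesting: VATEX/FeatureData/i3d_kinetics/...).
VATEX_LAYOUT = [
    "VATEX/FeatureData/i3d_kinetics/feature.bin",
    "VATEX/FeatureData/i3d_kinetics/id.txt",
    "VATEX/FeatureData/i3d_kinetics/shape.txt",
    "VATEX/FeatureData/i3d_kinetics/video2frames.txt",
    "VATEX/TextData",
]


def missing_paths(rootpath, relative_paths):
    """Return the entries of `relative_paths` that do not exist under `rootpath`."""
    root = Path(rootpath).expanduser()
    return [rel for rel in relative_paths if not (root / rel).exists()]


if __name__ == "__main__":
    root = Path.home() / "VisualSearch"
    missing = missing_paths(root, VATEX_LAYOUT)
    for rel in missing:
        print(f"missing: {root / rel}")
    if not missing:
        print("VATEX layout looks complete")
```

The same function can be reused for the other datasets by passing a list built from their respective layouts.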
Run the following script to train and evaluate the NRCCR network on VATEX. Specifically, it trains the NRCCR network and selects the checkpoint that performs best on the validation set as the final model. Note that only the best-performing checkpoint on the validation set is saved, in order to save disk space.
ROOTPATH=$HOME/VisualSearch
conda activate nrccr_env
# To train the model on VATEX
# Template:
./do_all_vatex.sh $ROOTPATH <gpu-id>
# Example:
# Train NRCCR
./do_all_vatex.sh $ROOTPATH 0
<gpu-id> is the index of the GPU to train on.
Download the checkpoint trained on VATEX from Baidu pan (url, pwd:ise6) and run the following script to evaluate it.
ROOTPATH=$HOME/VisualSearch/
tar zxf $ROOTPATH/<best_model>.pth.tar -C $ROOTPATH
./do_test_vatex.sh $ROOTPATH $MODELDIR <gpu-id>
# $MODELDIR is the path of checkpoints, $ROOTPATH/.../runs_0
T2V: Text-to-Video Retrieval; V2T: Video-to-Text Retrieval.

Type | T2V R@1 | T2V R@5 | T2V R@10 | T2V MedR | T2V mAP | V2T R@1 | V2T R@5 | V2T R@10 | V2T MedR | V2T mAP | SumR |
---|---|---|---|---|---|---|---|---|---|---|---|
en2cn | 30.8 | 64.4 | 74.6 | 3.0 | 45.78 | 43.1 | 72.3 | 81.4 | 2.0 | 32.57 | 366.5 |
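For readers unfamiliar with the columns, R@K and MedR are conventionally derived from the rank at which the correct item appears for each query. The sketch below uses these standard definitions with made-up ranks. It is not the repository's evaluation code, and mAP is omitted.

```python
from statistics import median


def recall_at_k(ranks, k):
    """Percentage of queries whose correct item is ranked in the top k (1-based ranks)."""
    hits = sum(1 for rank in ranks if rank <= k)
    return 100.0 * hits / len(ranks)


def median_rank(ranks):
    """Median rank of the correct item over all queries (lower is better)."""
    return float(median(ranks))


# Hypothetical 1-based ranks of the correct item for eight queries.
ranks = [1, 3, 2, 12, 1, 7, 5, 1]
print(recall_at_k(ranks, 1))   # 37.5
print(recall_at_k(ranks, 5))   # 75.0
print(recall_at_k(ranks, 10))  # 87.5
print(median_rank(ranks))      # 2.5
```

When a query has several correct items, as with a video that has multiple captions, implementations commonly use the best-ranked one; whether this repository does so is an assumption to verify in its evaluation code.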
Run the following script to train and evaluate the NRCCR network on MSR-VTT-CN.
ROOTPATH=$HOME/VisualSearch
conda activate nrccr_env
# To train the model on MSR-VTT-CN
./do_all_msrvttcn.sh $ROOTPATH <gpu-id>
Download the checkpoint trained on MSR-VTT-CN from Baidu pan (url, pwd:ise6) and run the following script to evaluate it.
ROOTPATH=$HOME/VisualSearch/
tar zxf $ROOTPATH/<best_model>.pth.tar -C $ROOTPATH
./do_test_msrvttcn.sh $ROOTPATH $MODELDIR <gpu-id>
# $MODELDIR is the path of checkpoints, $ROOTPATH/.../runs_0
T2V: Text-to-Video Retrieval; V2T: Video-to-Text Retrieval.

Type | T2V R@1 | T2V R@5 | T2V R@10 | T2V MedR | T2V mAP | V2T R@1 | V2T R@5 | V2T R@10 | V2T MedR | V2T mAP | SumR |
---|---|---|---|---|---|---|---|---|---|---|---|
en2cn | 28.9 | 56.3 | 67.3 | 4.0 | 41.28 | 28.9 | 57.6 | 69.0 | 4.0 | 42.02 | 308 |
Run the following script to train and evaluate the NRCCR network on Multi-30K. To train with CLIP as the backbone, you also need to download the raw Flickr30K images from here.
ROOTPATH=$HOME/VisualSearch
conda activate nrccr_env
# To train the model on the Multi-30K
./do_all_multi30k.sh $ROOTPATH <task> <gpu-id>
Download the checkpoint trained on Multi-30K from Baidu pan (url, pwd:ise6) and run the following script to evaluate it.
ROOTPATH=$HOME/VisualSearch/
tar zxf $ROOTPATH/<best_model>.pth.tar -C $ROOTPATH
./do_test_multi30k.sh $ROOTPATH $MODELDIR $image_path <gpu-id>
# $MODELDIR is the path of checkpoints, $ROOTPATH/.../runs_0
# $image_path is the path to the raw Flickr30K images; if you use the frozen ResNet-152 features, set it to None.
Task1:
T2I: Text-to-Image Retrieval; I2T: Image-to-Text Retrieval.

Type | T2I R@1 | T2I R@5 | T2I R@10 | T2I MedR | T2I mAP | I2T R@1 | I2T R@5 | I2T R@10 | I2T MedR | I2T mAP | SumR |
---|---|---|---|---|---|---|---|---|---|---|---|
en2de_clip | 53.8 | 81.8 | 88.3 | 1.0 | 66.60 | 53.8 | 82.7 | 90.3 | 1.0 | 66.66 | 450.7 |
en2fr_clip | 54.7 | 81.7 | 89.2 | 1.0 | 67.05 | 54.9 | 82.7 | 89.7 | 1.0 | 67.29 | 452.9 |
en2cs_clip | 52.6 | 79.4 | 87.9 | 1.0 | 65.26 | 52.3 | 78.7 | 87.8 | 1.0 | 64.68 | 438.7 |
en2cs_resnet152 | 29.5 | 56.0 | 68.1 | 4.0 | 41.89 | 27.5 | 55.1 | 67.4 | 4.0 | 40.59 | 303.6 |
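The SumR column in these tables is consistent with the sum of the six recall values (R@1, R@5, and R@10 in both retrieval directions). The sketch below recomputes it for the Task1 rows above. `sum_r` is an illustrative helper, not a function from the repository.

```python
def sum_r(text_query_recalls, visual_query_recalls):
    """Sum of R@1, R@5, R@10 over both retrieval directions, rounded to one decimal."""
    return round(sum(text_query_recalls) + sum(visual_query_recalls), 1)


# (R@1, R@5, R@10) for text queries, then for visual queries, copied from the Task1 table.
rows = {
    "en2de_clip": ((53.8, 81.8, 88.3), (53.8, 82.7, 90.3)),
    "en2fr_clip": ((54.7, 81.7, 89.2), (54.9, 82.7, 89.7)),
    "en2cs_clip": ((52.6, 79.4, 87.9), (52.3, 78.7, 87.8)),
    "en2cs_resnet152": ((29.5, 56.0, 68.1), (27.5, 55.1, 67.4)),
}

for name, (text_side, visual_side) in rows.items():
    print(name, sum_r(text_side, visual_side))
# en2de_clip 450.7
# en2fr_clip 452.9
# en2cs_clip 438.7
# en2cs_resnet152 303.6
```

Because the table entries are already rounded, a SumR recomputed this way can differ from a reported value by about 0.1.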
Task2 (with CLIP):
en2de_SumR | en2fr_SumR | en2cs_SumR |
---|---|---|
480.9 | 482.1 | 467.1 |
Run the following script to train and evaluate the NRCCR network on MSCOCO.
ROOTPATH=$HOME/VisualSearch
conda activate nrccr_env
# To train the model on MSCOCO
./do_all_mscoco.sh $ROOTPATH <gpu-id>
(with CLIP)
en2cn_SumR | en2ja_SumR |
---|---|
512.4 | 507.0 |
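The four training scripts shown in this README share one calling convention: the root path first and the GPU index last, with Multi-30K taking an extra `<task>` argument in between. The sketch below assembles those command lines so that several runs can be launched from Python. `build_train_command` and `TRAIN_SCRIPTS` are illustrative names, the root path is a placeholder, and the accepted values of `<task>` are not documented here, so none are assumed.

```python
import shlex

# Script names as they appear in this README.
TRAIN_SCRIPTS = {
    "vatex": "./do_all_vatex.sh",
    "msrvttcn": "./do_all_msrvttcn.sh",
    "multi30k": "./do_all_multi30k.sh",
    "mscoco": "./do_all_mscoco.sh",
}


def build_train_command(dataset, rootpath, gpu_id, task=None):
    """Return the argument list for one training run; only Multi-30K takes <task>."""
    if dataset not in TRAIN_SCRIPTS:
        raise ValueError(f"unknown dataset: {dataset}")
    if (dataset == "multi30k") != (task is not None):
        raise ValueError("<task> is required for multi30k and unused elsewhere")
    argv = [TRAIN_SCRIPTS[dataset], str(rootpath)]
    if task is not None:
        argv.append(str(task))
    argv.append(str(gpu_id))
    return argv


cmd = build_train_command("vatex", "/data/VisualSearch", 0)
print(shlex.join(cmd))  # ./do_all_vatex.sh /data/VisualSearch 0
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, inside the activated `nrccr_env` environment.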
If you find the package useful, please consider citing our paper:
@inproceedings{wang2022cross,
title={Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning},
author={Yabing Wang and Jianfeng Dong and Tianxiang Liang and Minsong Zhang and Rui Cai and Xun Wang},
booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
year={2022}
}