Source code of our TPAMI'21 paper Dual Encoding for Video Retrieval by Text and CVPR'19 paper Dual Encoding for Zero-Example Video Retrieval.
- Ubuntu 16.04
- CUDA 10.1
- Python 3.8
- PyTorch 1.5.1
We used Anaconda to set up a deep learning workspace that supports PyTorch. Run the following script to install the required packages.
conda create --name ws_dual_py3 python=3.8
conda activate ws_dual_py3
git clone https://github.com/danieljf24/hybrid_space.git
cd hybrid_space
pip install -r requirements.txt
conda deactivate
Run the following script to download and extract the MSR-VTT dataset (msrvtt10k-resnext101_resnet152.tar.gz, 4.3G) and a pre-trained word2vec (vec500flickr30m.tar.gz, 3.0G). The data can also be downloaded from Baidu pan (url, password:p3p0) or Google drive (url).
The extracted data is placed in $HOME/VisualSearch/.
ROOTPATH=$HOME/VisualSearch
mkdir -p $ROOTPATH && cd $ROOTPATH
# download and extract dataset
wget http://8.210.46.84:8787/msrvtt10k-resnext101_resnet152.tar.gz
tar zxf msrvtt10k-resnext101_resnet152.tar.gz -C $ROOTPATH
# download and extract pre-trained word2vec
wget http://lixirong.net/data/w2vv-tmm2018/word2vec.tar.gz
tar zxf word2vec.tar.gz -C $ROOTPATH
Run the following script to train and evaluate the Dual Encoding network with hybrid space on the official partition of MSR-VTT. The video features are the concatenation of ResNeXt-101 and ResNet-152 features.
conda activate ws_dual_py3
./do_all.sh msrvtt10k hybrid resnext101-resnet152
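The resnext101-resnet152 feature named above is simply the two CNN features concatenated per video. A minimal NumPy sketch (the random vectors stand in for real features loaded from FeatureData):

```python
import numpy as np

# Stand-ins for real per-video features; both backbones emit 2048-dim vectors.
resnext101_feat = np.random.rand(2048).astype(np.float32)  # ResNeXt-101
resnet152_feat = np.random.rand(2048).astype(np.float32)   # ResNet-152

# Concatenate into a single 4096-dim video representation.
video_feat = np.concatenate([resnext101_feat, resnet152_feat])
print(video_feat.shape)  # (4096,)
```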
Running the script will do the following things:
- Train the Dual Encoding network with hybrid space and select the checkpoint that performs best on the validation set as the final model. Notice that we only save the best-performing checkpoint on the validation set to save disk space.
- Evaluate the final model on the test set.
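The best-checkpoint selection amounts to keeping only the epoch with the highest validation score. A toy sketch (epochs, scores, and the metric are illustrative, not the repository's actual code):

```python
# Hypothetical epoch -> validation score (e.g. sum of recalls on the val set).
val_scores = {1: 180.5, 2: 195.2, 3: 192.7}

# Keep the epoch with the best validation score as the final model.
best_epoch = max(val_scores, key=val_scores.get)
print(best_epoch)  # 2
# Only this epoch's weights would be written to disk (e.g. as
# msrvtt10k_model_best.pth.tar); other checkpoints are discarded.
```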
Note that the dataset already includes vocabulary and concept annotations. If you would like to generate the vocabulary and concepts yourself, run the following script:
./do_vocab_concept.sh msrvtt10k 1
If you would like to train the Dual Encoding network with latent space (conference version), please run the following script:
./do_all.sh msrvtt10k latent resnext101-resnet152
To train the model on the Test1k-Miech partition and Test1k-Yu partition of MSR-VTT, please run the following scripts:
./do_all.sh msrvtt10kmiech hybrid resnext101-resnet152
./do_all.sh msrvtt10kyu hybrid resnext101-resnet152
Run the following script to download and evaluate our trained models on MSR-VTT. The trained models can also be downloaded from Baidu pan (url, password:p3p0). Note that if you would like to evaluate using our trained models, please make sure to use the vocabulary and concept annotations provided in msrvtt10k-resnext101_resnet152.tar.gz.
MODELDIR=$HOME/VisualSearch/checkpoints
mkdir -p $MODELDIR
# download trained checkpoints
wget -P $MODELDIR http://8.210.46.84:8787/checkpoints/msrvtt10k_model_best.pth.tar
# evaluate on official split of MSR-VTT
CUDA_VISIBLE_DEVICES=0 python tester.py --testCollection msrvtt10k --logger_name $MODELDIR --checkpoint_name msrvtt10k_model_best.pth.tar
To evaluate on the other splits, please download the corresponding checkpoints and set the checkpoint_name parameter to msrvtt10kmiech_model_best.pth.tar (Test1k-Miech) or msrvtt10kyu_model_best.pth.tar (Test1k-Yu).
The overview of pre-trained checkpoints on MSR-VTT is as follows.
Split | Pre-trained Model
---|---
Official | msrvtt10k_model_best.pth.tar (264M)
Test1k-Miech | msrvtt10kmiech_model_best.pth.tar (267M)
Test1k-Yu | msrvtt10kyu_model_best.pth.tar (267M)
The expected performance of Dual Encoding on MSR-VTT is as follows. Notice that due to random factors in SGD-based training, the numbers may differ slightly from those reported in the paper.
Split | Text-to-Video (R@1 / R@5 / R@10 / MedR / mAP) | Video-to-Text (R@1 / R@5 / R@10 / MedR / mAP) | SumR
---|---|---|---
Official | 11.8 / 30.6 / 41.8 / 17 / 21.4 | 21.6 / 45.9 / 58.5 / 7 / 10.3 | 210.2
Test1k-Miech | 22.7 / 50.2 / 63.1 / 5 / 35.6 | 24.7 / 52.3 / 64.2 / 5 / 37.2 | 277.2
Test1k-Yu | 21.5 / 48.8 / 60.2 / 6 / 34.0 | 21.7 / 49.0 / 61.4 / 6 / 34.6 | 262.6
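SumR is the sum of the six recall scores (R@1/5/10 in both retrieval directions). For example, for the official split:

```python
# SumR = R@1 + R@5 + R@10, summed over both retrieval directions.
t2v = [11.8, 30.6, 41.8]  # text-to-video R@1, R@5, R@10 (official split)
v2t = [21.6, 45.9, 58.5]  # video-to-text R@1, R@5, R@10
sum_r = round(sum(t2v) + sum(v2t), 1)
print(sum_r)  # 210.2
```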
Download the VATEX dataset (vatex-i3d.tar.gz, 3.0G) and a pre-trained word2vec (vec500flickr30m.tar.gz, 3.0G). The data can also be downloaded from Baidu pan (url, password:p3p0) or Google drive (url).
Please extract the data into $HOME/VisualSearch/.
Run the following script to train and evaluate the Dual Encoding network with hybrid space on VATEX.
# download and extract dataset
wget http://8.210.46.84:8787/vatex-i3d.tar.gz
tar zxf vatex-i3d.tar.gz -C $ROOTPATH
./do_all.sh vatex hybrid
Run the following script to download and evaluate our trained model (vatex_model_best.pth.tar, 230M) on VATEX.
MODELDIR=$HOME/VisualSearch/checkpoints
# download trained checkpoints
wget -P $MODELDIR http://8.210.46.84:8787/checkpoints/vatex_model_best.pth.tar
CUDA_VISIBLE_DEVICES=0 python tester.py --testCollection vatex --logger_name $MODELDIR --checkpoint_name vatex_model_best.pth.tar
The expected performance of Dual Encoding with hybrid space learning on VATEX is as follows.
Split | Text-to-Video (R@1 / R@5 / R@10 / MedR / mAP) | Video-to-Text (R@1 / R@5 / R@10 / MedR / mAP) | SumR
---|---|---|---
VATEX | 35.8 / 72.8 / 82.9 / 2 / 52.0 | 47.5 / 76.0 / 85.3 / 2 / 39.1 | 400.3
The following three datasets are used for training, validation and testing: tgif-msrvtt10k, tv2016train and iacc.3. For more information about these datasets, please refer to https://github.com/li-xirong/avs.
Run the following scripts to download and extract these datasets. The extracted data is placed in $HOME/VisualSearch/.
- Sentences: tgif-msrvtt10k, tv2016train
- TRECVID 2016 / 2017 / 2018 AVS topics and ground truth: iacc.3
- 2048-dim ResNeXt-101: tgif(7G), msrvtt10k(2G), tv2016train(42M), iacc.3(27G)
ROOTPATH=$HOME/VisualSearch
cd $ROOTPATH
# download and extract dataset
wget http://39.104.114.128/avs/tgif_ResNext-101.tar.gz
tar zxf tgif_ResNext-101.tar.gz
wget http://39.104.114.128/avs/msrvtt10k_ResNext-101.tar.gz
tar zxf msrvtt10k_ResNext-101.tar.gz
wget http://39.104.114.128/avs/tv2016train_ResNext-101.tar.gz
tar zxf tv2016train_ResNext-101.tar.gz
wget http://39.104.114.128/avs/iacc.3_ResNext-101.tar.gz
tar zxf iacc.3_ResNext-101.tar.gz
# combine feature of tgif and msrvtt10k
./do_combine_features.sh
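Combining tgif and msrvtt10k into the joint tgif-msrvtt10k training collection conceptually stacks the two feature matrices and concatenates the video-id lists. A rough sketch (shapes and ids are illustrative):

```python
import numpy as np

# Hypothetical per-collection feature matrices (rows = videos) and id lists.
tgif_feats = np.ones((3, 2048), dtype=np.float32)
msrvtt_feats = np.zeros((2, 2048), dtype=np.float32)
tgif_ids = ["tgif_v1", "tgif_v2", "tgif_v3"]
msrvtt_ids = ["msr_v1", "msr_v2"]

# Stack rows and concatenate ids to form the combined collection.
combined_feats = np.vstack([tgif_feats, msrvtt_feats])
combined_ids = tgif_ids + msrvtt_ids
print(combined_feats.shape, len(combined_ids))  # (5, 2048) 5
```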
source ~/ws_dual/bin/activate
trainCollection=tgif-msrvtt10k
visual_feature=pyresnext-101_rbps13k,flatten0_output,os
# Generate a vocabulary on the training set
./do_get_vocab.sh $trainCollection
# Generate video frame info
#./do_get_frameInfo.sh $trainCollection $visual_feature
# training and testing
./do_all_avs.sh
deactivate
Store the training, validation and test subsets in three folders with the following structure, respectively.
${subset_name}
├── FeatureData
│   └── ${feature_name}
│       ├── feature.bin
│       ├── shape.txt
│       └── id.txt
└── TextData
    ├── ${subset_name}train.caption.txt
    ├── ${subset_name}val.caption.txt
    └── ${subset_name}test.caption.txt
- FeatureData: video frame features. Use txt2bin.py to convert video frame features into the required binary format.
- ${subset_name}.caption.txt: caption data. The file structure is as follows, where the video and sentence on the same line are relevant.
video_id_1#1 sentence_1
video_id_1#2 sentence_2
...
video_id_n#1 sentence_k
...
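A sketch of reading both formats, assuming feature.bin stores float32 vectors row-major in the order given by id.txt, with "n dim" in shape.txt (this is our reading of the layout txt2bin.py produces; treat it as an assumption):

```python
import numpy as np

def load_features(feat_dir):
    """Read FeatureData: shape.txt ("n dim"), id.txt (whitespace-separated
    video ids), and feature.bin (an n x dim float32 matrix, row-major)."""
    with open(f"{feat_dir}/shape.txt") as f:
        n, dim = map(int, f.read().split())
    with open(f"{feat_dir}/id.txt") as f:
        ids = f.read().split()
    feats = np.fromfile(f"{feat_dir}/feature.bin", dtype=np.float32).reshape(n, dim)
    return dict(zip(ids, feats))  # video_id -> feature vector

def load_captions(caption_file):
    """Parse caption lines of the form "video_id#k sentence";
    returns a list of (video_id, sentence) pairs."""
    pairs = []
    with open(caption_file) as f:
        for line in f:
            cap_id, sentence = line.strip().split(" ", 1)
            pairs.append((cap_id.split("#")[0], sentence))
    return pairs
```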
You can run the following script to check whether the data is ready:
./do_format_check.sh ${train_set} ${val_set} ${test_set} ${rootpath} ${feature_name}
where train_set, val_set and test_set indicate the names of the training, validation and test sets, respectively, ${rootpath} denotes the path where the datasets are saved, and feature_name is the video frame feature name.
If you pass the format check, use the following script to train and evaluate Dual Encoding on your own dataset:
source ~/ws_dual/bin/activate
./do_all_own_data.sh ${train_set} ${val_set} ${test_set} ${rootpath} ${feature_name} ${caption_num} full
deactivate
If the training data of your task is relatively limited, we suggest dual encoding with levels 2 and 3. Compared to the full edition, this version gives nearly comparable performance on MSR-VTT, but with fewer trainable parameters.
source ~/ws_dual/bin/activate
./do_all_own_data.sh ${train_set} ${val_set} ${test_set} ${rootpath} ${feature_name} ${caption_num} reduced
deactivate
If you find the package useful, please consider citing our TPAMI'21 or CVPR'19 paper:
@article{dong2021dual,
title={Dual Encoding for Video Retrieval by Text},
author={Dong, Jianfeng and Li, Xirong and Xu, Chaoxi and Yang, Xun and Yang, Gang and Wang, Xun and Wang, Meng},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
doi = {10.1109/TPAMI.2021.3059295},
year={2021}
}
@inproceedings{cvpr2019-dual-dong,
title = {Dual Encoding for Zero-Example Video Retrieval},
author = {Jianfeng Dong and Xirong Li and Chaoxi Xu and Shouling Ji and Yuan He and Gang Yang and Xun Wang},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2019},
}