This repository contains the PyTorch implementation of MmNas for Visual Question Answering (VQA), Visual Grounding (VGD), and Image-Text Matching (ITM).
You may need a machine with at least 4 GPUs (>= 8GB each), 50GB of memory for VQA and VGD (150GB for ITM), and 50GB of free disk space. We strongly recommend using an SSD drive to guarantee high-speed I/O.
You should first install some necessary packages.
- Install Python >= 3.6
- Install PyTorch >= 0.4.1 with CUDA (PyTorch 1.x is also supported).
- Install SpaCy and initialize the GloVe vectors as follows:

$ pip install -r requirements.txt
$ wget https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz -O en_vectors_web_lg-2.1.0.tar.gz
$ pip install en_vectors_web_lg-2.1.0.tar.gz
The image features are extracted using the bottom-up-attention strategy, with each image represented by a dynamic number (from 10 to 100) of 2048-D features. We store the features for each image in a .npz file. You can prepare the visual features yourself or download the extracted features from OneDrive or BaiduYun. The download contains three files: train2014.tar.gz, val2014.tar.gz, and test2015.tar.gz, corresponding to the features of the train/val/test images of VQA-v2, respectively. You should place them as follows:
|-- data
|-- coco_extract
| |-- train2014.tar.gz
| |-- val2014.tar.gz
| |-- test2015.tar.gz
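If you prepare the features yourself, a quick sanity check like the sketch below can confirm that each .npz file holds an N x 2048 feature matrix with 10 <= N <= 100 (the key layout inside the .npz files depends on your extraction script, so the loop below just prints whatever keys are present):

import glob
import numpy as np

# Pick any extracted feature file (path layout follows the tree above).
feat_file = sorted(glob.glob('data/coco_extract/train2014/*.npz'))[0]

npz = np.load(feat_file)
for key in npz.files:
    # The region feature matrix is expected to be (N, 2048) with 10 <= N <= 100.
    print(key, npz[key].shape, npz[key].dtype)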
In addition, we use the VQA samples from the Visual Genome dataset to expand the training set. Similar to existing strategies, we preprocess the samples with two rules (a sketch of this filtering follows the list):
- Select only the QA pairs whose corresponding images appear in the MSCOCO train and val splits.
- Select only the QA pairs whose answers appear in the processed answer list (i.e., answers that occur more than 8 times among all VQA-v2 answers).
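A minimal sketch of this filtering, assuming the VG QA pairs have already been parsed into Python dicts with hypothetical 'image_id' and 'answer' fields (the actual preprocessing script may use different names):

def filter_vg_samples(vg_qa_pairs, coco_train_val_ids, answer_list):
    # coco_train_val_ids: set of MSCOCO train+val image ids
    # answer_list: set of answers occurring more than 8 times in VQA-v2
    kept = []
    for qa in vg_qa_pairs:
        # Rule 1: the image must appear in the MSCOCO train/val splits.
        if qa['image_id'] not in coco_train_val_ids:
            continue
        # Rule 2: the answer must appear in the processed answer list.
        if qa['answer'] not in answer_list:
            continue
        kept.append(qa)
    return kept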
For convenience, we provide our processed VG questions and annotations files; you can download them from OneDrive or BaiduYun and place them as follows:
|-- data
|-- vqa
| |-- VG_questions.json
| |-- VG_annotations.json
After that, you should:
- Download the QA files for VQA-v2.
- Unzip the bottom-up features (example commands follow the directory tree below).

Finally, the data folder will have the following structure:
|-- data
|-- coco_extract
| |-- train2014
| | |-- COCO_train2014_...jpg.npz
| | |-- ...
| |-- val2014
| | |-- COCO_val2014_...jpg.npz
| | |-- ...
| |-- test2015
| | |-- COCO_test2015_...jpg.npz
| | |-- ...
|-- vqa
| |-- v2_OpenEnded_mscoco_train2014_questions.json
| |-- v2_OpenEnded_mscoco_val2014_questions.json
| |-- v2_OpenEnded_mscoco_test2015_questions.json
| |-- v2_OpenEnded_mscoco_test-dev2015_questions.json
| |-- v2_mscoco_train2014_annotations.json
| |-- v2_mscoco_val2014_annotations.json
| |-- VG_questions.json
| |-- VG_annotations.json
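To unzip the bottom-up features into this layout, you can run something like the following (assuming each archive extracts into its own per-split folder; adjust the commands if your archives are organized differently):

$ cd data/coco_extract
$ tar -xzvf train2014.tar.gz
$ tar -xzvf val2014.tar.gz
$ tar -xzvf test2015.tar.gz
$ cd ../..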
For visual grounding, the image features are also extracted using the bottom-up-attention strategy, with two types of features being used: 1. a Faster R-CNN detector pre-trained on Visual Genome (without the reference images); 2. a Mask R-CNN detector pre-trained on COCO, following MAttNet. We store the features for each image in a .npz file. You can prepare the visual features yourself or download the extracted features from OneDrive and place them in the ./data folder.
The Refs datasets {refcoco, refcoco+, refcocog} were introduced here; build and place them as follows:
|-- data
|-- vgd_coco
| |-- fix100
| | |-- refcoco_unc
| | |-- refcoco+_unc
| | |-- refcocog_umd
|-- detfeat100_woref
|-- refs
| |-- refcoco
| | |-- instances.json
| | |-- refs(google).p
| | |-- refs(unc).p
| |-- refcoco+
| | |-- instances.json
| | |-- refs(unc).p
| |-- refcocog
| | |-- instances.json
| | |-- refs(google).p
Additionally, you also need to build the extension modules as follows:
cd mmnas/utils
python3 setup.py build
cp build/lib.*/*.so .
cd ../..
For image-text matching, the image features are extracted using the bottom-up-attention strategy, with each image represented by a fixed number (36) of 2048-D features. We store the features for each image in a .npz file. You can prepare the visual features yourself or download the extracted features from OneDrive and place them in the ./data folder.
The retrieval datasets {flickr, coco} can be found here; extract and place them as follows:
|-- data
|-- rois_resnet101_fix36
| |-- train2014
| |-- val2014
|-- flickr_rois_resnet101_fix36
|-- itm
| |-- coco_precomp
| |-- f30k_precomp
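As with the VQA features, a quick sanity check (a sketch, assuming the same .npz layout as above; the key names depend on the extraction script) can confirm that each file holds a fixed 36 x 2048 feature matrix:

import glob
import numpy as np

# Pick any extracted ITM feature file (path layout follows the tree above).
feat_file = sorted(glob.glob('data/rois_resnet101_fix36/train2014/*.npz'))[0]

npz = np.load(feat_file)
for key in npz.files:
    # The region feature matrix is expected to be (36, 2048).
    print(key, npz[key].shape)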
The following scripts will start training with the default hyperparameters:
- VQA
$ python3 train_vqa.py --RUN='train' --GENO_PATH='./logs/ckpts/arch/train_vqa.json'
- VGD
$ python3 train_vgd.py --RUN='train' --GENO_PATH='./logs/ckpts/arch/train_vgd.json'
- ITM
$ python3 train_itm.py --RUN='train' --GENO_PATH='./logs/ckpts/arch/train_itm.json'
You can add the following options (an example command combining several of them is shown after this list):
- --VERSION=str, e.g. --VERSION='small_model', to assign a name to your model.
- --GPU=str, e.g. --GPU='0, 1, 2, 3', to train the model on the specified GPU devices.
- --NW=int, e.g. --NW=8, to accelerate I/O speed.
- --MODEL={'small', 'large'} (Warning: the large model will consume more GPU memory; Multi-GPU Training and Gradient Accumulation may help if you want to train the model with limited GPU memory.)
- --SPLIT={'train', 'train+val', 'train+val+vg'} to combine the training datasets as you want. The default training split is 'train+val+vg'. Setting --SPLIT='train' will trigger the evaluation script to run the validation score after every epoch automatically.
- --RESUME to start training from saved checkpoint parameters.
- --GENO_PATH to use a different searched architecture.
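For example, a typical training command combining several of these options (the values are only illustrative):
$ python3 train_vqa.py --RUN='train' --VERSION='small_model' --MODEL='small' --GPU='0,1,2,3' --NW=8 --SPLIT='train' --GENO_PATH='./logs/ckpts/arch/train_vqa.json'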
Warning: if you train the model with the --MODEL argument or with multi-GPU training, the same settings must also be used during evaluation.
To run the val or test split, simply modify the following arguments: --RUN={'val', 'test'} and --CKPT_PATH=[Your Model Path].
Example:
$ python3 train_vqa.py --RUN='test' --CKPT_PATH=[Your Model Path] --GENO_PATH=[Searched Architecture Path]
The test result file will be stored in ./logs/ckpts/result_test/result_run_[Your Version].json
You can upload the obtained result JSON file to EvalAI to evaluate the scores on the test-dev and test-std splits.
If this repository is helpful for your research, we'd really appreciate it if you could cite the following paper:
@article{yu2020deep,
title={Deep Multimodal Neural Architecture Search},
author={Yu, Zhou and Cui, Yuhao and Yu, Jun and Wang, Meng and Tao, Dacheng and Tian, Qi},
journal={arXiv preprint arXiv:2004.12070},
year={2020}
}