
MmNas - Deep Multimodal Neural Architecture Search

This repository corresponds to the PyTorch implementation of MmNas for {Visual Question Answering, Visual Grounding, Image-Text Matching}.

Prerequisites

Software and Hardware Requirements

You may need a machine with at least 4 GPUs (each with >= 8GB memory), 50GB of RAM for VQA and VGD (150GB for ITM), and 50GB of free disk space. We strongly recommend using an SSD drive to guarantee high-speed I/O.

You should first install some necessary packages.

  1. Install Python >= 3.6

  2. Install CUDA >= 9.0 and cuDNN

  3. Install PyTorch >= 0.4.1 with CUDA (PyTorch 1.x is also supported).

  4. Install SpaCy and initialize the GloVe vectors as follows:

    $ pip install -r requirements.txt
    $ wget https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz -O en_vectors_web_lg-2.1.0.tar.gz
    $ pip install en_vectors_web_lg-2.1.0.tar.gz
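
After installation, you can quickly check that the GloVe vectors are available. A minimal sketch; it only assumes the en_vectors_web_lg package installed above:

    import spacy

    # Load the pre-trained GloVe vectors installed via en_vectors_web_lg
    nlp = spacy.load('en_vectors_web_lg')
    doc = nlp('what color is the cat')
    print(doc[0].vector.shape)  # each token maps to a 300-D GloVe vector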

Setup for VQA

The image features are extracted using the bottom-up-attention strategy, with each image being represented as a dynamic number (from 10 to 100) of 2048-D features. We store the features for each image in a .npz file. You can prepare the visual features yourself or download the extracted features from OneDrive or BaiduYun. The download contains three files: train2014.tar.gz, val2014.tar.gz, and test2015.tar.gz, corresponding to the features of the train/val/test images of VQA-v2, respectively. You should place them as follows:

|-- data
	|-- coco_extract
	|  |-- train2014.tar.gz
	|  |-- val2014.tar.gz
	|  |-- test2015.tar.gz
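
Once the archives are unzipped (see the steps below), each per-image .npz file can be inspected with NumPy. A minimal sketch; the file name is illustrative, and the array keys inside the archive depend on the extraction script, so list them first:

    import numpy as np

    # Inspect one extracted bottom-up feature file (path is illustrative)
    feat = np.load('data/coco_extract/train2014/COCO_train2014_xxx.jpg.npz')
    for key in feat.keys():
        print(key, feat[key].shape)  # e.g. the per-region 2048-D features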

Besides, we use the VQA samples from the Visual Genome dataset to expand the training samples. Similar to existing strategies, we preprocessed the samples with two rules (a sketch of this filtering is shown below):

  1. Select the QA pairs whose corresponding images appear in the MSCOCO train and val splits.
  2. Select the QA pairs whose answers appear in the processed answer list (i.e., answers occurring more than 8 times among all VQA-v2 answers).
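
A minimal sketch of this filtering logic; the field names image_id and answer are assumptions about the raw Visual Genome QA format, not the exact code used in this repository:

    # Hypothetical illustration of the two filtering rules above
    def filter_vg_samples(vg_qa_pairs, coco_trainval_image_ids, answer_list):
        kept = []
        for qa in vg_qa_pairs:
            in_coco = qa['image_id'] in coco_trainval_image_ids   # rule 1
            has_valid_answer = qa['answer'] in answer_list        # rule 2
            if in_coco and has_valid_answer:
                kept.append(qa)
        return kept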

For convenience, we provide our processed VG questions and annotations files. You can download them from OneDrive or BaiduYun and place them as follows:

|-- datasets
	|-- vqa
	|  |-- VG_questions.json
	|  |-- VG_annotations.json

After that, you should:

  1. Download the QA files for VQA-v2.
  2. Unzip the bottom-up features.

Finally, the data folders will have the following structure:

|-- data
	|-- coco_extract
	|  |-- train2014
	|  |  |-- COCO_train2014_...jpg.npz
	|  |  |-- ...
	|  |-- val2014
	|  |  |-- COCO_val2014_...jpg.npz
	|  |  |-- ...
	|  |-- test2015
	|  |  |-- COCO_test2015_...jpg.npz
	|  |  |-- ...
	|-- vqa
	|  |-- v2_OpenEnded_mscoco_train2014_questions.json
	|  |-- v2_OpenEnded_mscoco_val2014_questions.json
	|  |-- v2_OpenEnded_mscoco_test2015_questions.json
	|  |-- v2_OpenEnded_mscoco_test-dev2015_questions.json
	|  |-- v2_mscoco_train2014_annotations.json
	|  |-- v2_mscoco_val2014_annotations.json
	|  |-- VG_questions.json
	|  |-- VG_annotations.json
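
Before training, you can sanity-check that the files are in place with a few lines of Python. A minimal sketch that only uses the paths listed above; adjust it if your data root differs:

    import os

    # Check that the unzipped feature folders and QA files exist
    required = [
        'data/coco_extract/train2014',
        'data/coco_extract/val2014',
        'data/coco_extract/test2015',
        'data/vqa/v2_OpenEnded_mscoco_train2014_questions.json',
        'data/vqa/v2_mscoco_train2014_annotations.json',
        'data/vqa/VG_questions.json',
        'data/vqa/VG_annotations.json',
    ]
    for path in required:
        print(path, 'OK' if os.path.exists(path) else 'MISSING')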

Setup for Visual Grounding

The image features are extracted using the bottom-up-attention strategy, with two types of features used: 1. a Faster R-CNN detector pre-trained on Visual Genome (without the reference images); 2. a Mask R-CNN detector pre-trained on COCO, following MAttNet. We store the features for each image in a .npz file. You can prepare the visual features yourself or download the extracted features from OneDrive and place them in the ./data folder.

The Refs datasets {refcoco, refcoco+, refcocog} were introduced here; build and place them as follows:

|-- data
	|-- vgd_coco
	|  |-- fix100
	|  |  |-- refcoco_unc
	|  |  |-- refcoco+_unc
	|  |  |-- refcocog_umd
	|-- detfeat100_woref
	|-- refs
	|  |-- refcoco
	|  |   |-- instances.json
	|  |   |-- refs(google).p
	|  |   |-- refs(unc).p
	|  |-- refcoco+
	|  |   |-- instances.json
	|  |   |-- refs(unc).p
	|  |-- refcocog
	|  |   |-- instances.json
	|  |   |-- refs(google).p
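
The refs(*.p) files are Python pickles; a minimal sketch for inspecting one of them, assuming the standard REFER toolkit format (a pickled list of reference dicts):

    import pickle

    # Load one reference file and look at its structure (path follows the layout above)
    with open('data/refs/refcoco/refs(unc).p', 'rb') as f:
        refs = pickle.load(f)
    print(len(refs))         # number of referring expressions
    print(refs[0].keys())    # fields available for a single reference entry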

Additionally, you also need to build the utility extensions as follows (copying the compiled .so file from the generated build/lib.* directory into mmnas/utils):

cd mmnas/utils
python3 setup.py build
cp build/[lib.*/*.so] .
cd ../..

Setup for Image-Text Matching

The image features are extracted using the bottom-up-attention strategy, with each image being represented as a fixed number (36) of 2048-D features. We store the features for each image in a .npz file. You can prepare the visual features yourself or download the extracted features from OneDrive and place them in the ./data folder.

The retrieval datasets {flickr, coco} can be found here; extract and place them as follows:

|-- data
	|-- rois_resnet101_fix36
	|  |-- train2014
	|  |-- val2014
	|-- flickr_rois_resnet101_fix36
	|-- itm
	|  |-- coco_precomp
	|  |-- f30k_precomp

Training

The following scripts will start training with the default hyperparameters:

  1. VQA
$ python3 train_vqa.py --RUN='train' --GENO_PATH='./logs/ckpts/arch/train_vqa.json'
  2. VGD
$ python3 train_vgd.py --RUN='train' --GENO_PATH='./logs/ckpts/arch/train_vgd.json'
  3. ITM
$ python3 train_itm.py --RUN='train' --GENO_PATH='./logs/ckpts/arch/train_itm.json'

Optional arguments:

  1. --VERSION=str, e.g. --VERSION='small_model', to assign a name to your model.

  2. --GPU=str, e.g. --GPU='0, 1, 2, 3', to train the model on the specified GPU devices.

  3. --NW=int, e.g. --NW=8, to accelerate I/O speed.

  4. --MODEL={'small', 'large'} (Warning: the large model consumes more GPU memory; multi-GPU training and gradient accumulation can help if you want to train it with limited GPU memory.)

  5. --SPLIT={'train', 'train+val', 'train+val+vg'} to combine the training datasets as you want. The default training split is 'train+val+vg'. Setting --SPLIT='train' will trigger the evaluation script to run the validation score after every epoch automatically.

  6. --RESUME to start training from saved checkpoint parameters.

  7. --GENO_PATH=str to use a different searched architecture.
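
For example, several of the options above can be combined in a single command (an illustrative invocation; the version name is arbitrary):

$ python3 train_vqa.py --RUN='train' --VERSION='small_model' --GPU='0, 1, 2, 3' --NW=8 --SPLIT='train+val+vg' --GENO_PATH='./logs/ckpts/arch/train_vqa.json'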

Validation and Testing

Warning: If you trained the model with the --MODEL argument or with multi-GPU training, the same settings should also be used for evaluation.

Offline Evaluation

The easiest way is to modify the following arguments: --RUN={'val', 'test'} and --CKPT_PATH=[Your Model Path] to run the val or test split.

Example:

$ python3 train_vqa.py --RUN='test' --CKPT_PATH=[Your Model Path] --GENO_PATH=[Searched Architecture Path]

Online Evaluation (ONLY FOR VQA)

The test result file will be stored at ./logs/ckpts/result_test/result_run_[Your Version].json.

You can upload the obtained result JSON file to EvalAI to evaluate the scores on the test-dev and test-std splits.
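
For reference, the submission file is a JSON list of question-answer entries. A minimal sketch of writing one; the field names below follow the standard VQA challenge submission format and are stated here as an assumption, since the training script generates this file for you:

    import json

    # Illustrative dummy entries in the assumed VQA submission format
    results = [
        {'question_id': 1, 'answer': 'yes'},
        {'question_id': 2, 'answer': '2'},
    ]
    with open('./logs/ckpts/result_test/result_run_example.json', 'w') as f:
        json.dump(results, f)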

Citation

If this repository is helpful for your research, we'd really appreciate it if you could cite the following paper:

@article{yu2020deep,
  title={Deep Multimodal Neural Architecture Search},
  author={Yu, Zhou and Cui, Yuhao and Yu, Jun and Wang, Meng and Tao, Dacheng and Tian, Qi},
  journal={arXiv preprint arXiv:2004.12070},
  year={2020}
}
