The datasets for both the pretraining and finetuning stages consist of two parts: Image Features and Text Annotations. The datasets folder has the following structure:
|-- datasets
|-- imgfeats
| |-- ...
|-- annotations
|-- ...
For pretraining, we provide the method to extract image region features along with the formatted text annotations. For each downstream task, we provide the extracted image region features and formatted text annotations.
We use pre-extracted region features for each image. For the pretraining stage in this repository, four image datasets are used: COCO, VG, SBU, and Conceptual.
The image features are extracted in the commonly used bottom-up attention manner, with each image represented as a fixed number (k=36) of 2048-D features. You can extract the visual features yourself using our bottom-up-attention.pytorch repository. To ensure a fair comparison with other VLP methods, we use the standard Faster R-CNN with R101-fix36 model (ckpt) to extract the features.
The following command extracts features for each image in $IMAGE_DIR and writes the corresponding features in .npz format to $OUT_DIR:
$ python extract_features.py --mode caffe \
--num-cpus 32 --gpu '0,1,2,3' \
--extract-mode roi_feats \
--min-max-boxes 36,36 \
--config-file configs/bua-caffe/extract-bua-caffe-r101-fix36.yaml \
--image-dir $IMAGE_DIR \
--out-dir $OUT_DIR \
--resume
Here, $IMAGE_DIR refers to a folder containing .jpg images, and $OUT_DIR refers to the folder that stores the extracted .npz feature files.
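As a quick sanity check, you can inspect one of the generated files with NumPy. This is a minimal sketch: the file path is illustrative, and the array names stored in the archive depend on the bottom-up-attention.pytorch version, so list the keys before relying on any particular one.

import numpy as np

# Path is illustrative; point it at any file produced in $OUT_DIR.
feat = np.load("datasets/imgfeats/mscoco_bua_r101_fix36/npz_files/example.npz",
               allow_pickle=True)

# List the stored array names first; the exact keys depend on the extraction script version.
print(feat.files)

# Typically one array holds the 36 x 2048 region features, e.g.:
# print(feat["x"].shape)  # "x" is an assumed key name; check feat.files first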
After preparing the visual features, the datasets folder will have the following structure:
|-- datasets
|-- imgfeats
| |-- $COCO_NPZ_DIR
| | |-- npz_files
| | |-- ***.npz
| | |-- ...
| |-- $VG_NPZ_DIR
| | |-- npz_files
| | |-- ***.npz
| | |-- ...
| |-- $SBU_NPZ_DIR
| | |-- npz_files
| | |-- ***.npz
| | |-- ...
| |-- $CONCEPTUAL_NPZ_DIR
| |-- npz_files
| |-- ***.npz
| |-- ...
|-- annotations
|-- ...
$COCO_NPZ_DIR, $VG_NPZ_DIR, $SBU_NPZ_DIR, and $CONCEPTUAL_NPZ_DIR are the $NPZ_DIR folders for COCO, VG, SBU, and Conceptual, respectively. Make sure that the paths are consistent with the settings in the config files. For simplicity, we recommend setting $COCO_NPZ_DIR, $VG_NPZ_DIR, $SBU_NPZ_DIR, and $CONCEPTUAL_NPZ_DIR to mscoco_bua_r101_fix36, visualgenome_bua_r101_fix36, sbu_bua_r101_fix36, and conceptual_bua_r101_fix36, respectively.
For each pretraining dataset, we provide the formatted annotation files in .tsv format, each accompanied by a .lineidx index file, as follows.
For COCO, you can download the formatted annotations here and run the following command to unzip the annotations:
$ cd datasets/annotations
$ tar -xzvf pt-coco.tar.gz
After unzipping, there will be a pt-coco folder containing several files, and the datasets folder will have the following structure:
|-- datasets
|-- imgfeats
| |-- ...
|-- annotations
|-- pt-coco
| |-- text_piror_coco_vg_cc_sbu.json
| |-- pt_coco_annotations_train.tsv
| |-- pt_coco_annotations_train.lineidx
| |-- pt_coco_annotations_test.tsv
| |-- pt_coco_annotations_test.lineidx
| |-- pt_coco_annotations_testall.tsv
| |-- pt_coco_annotations_testall.lineidx
| |-- pt_coco_annotations_dev.tsv
| |-- pt_coco_annotations_dev.lineidx
|-- ...
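Each pt_*_annotations_*.tsv file is paired with a .lineidx file. Assuming the usual convention for such index files, line i of the .lineidx stores the byte offset of line i in the .tsv, which allows random access to individual annotations without loading the whole file. A minimal sketch (the column layout of each row is not assumed):

tsv_path = "datasets/annotations/pt-coco/pt_coco_annotations_train.tsv"

# Byte offset of each row in the tsv file.
with open(tsv_path.replace(".tsv", ".lineidx")) as f:
    offsets = [int(line.strip()) for line in f]

def read_row(idx):
    """Return the idx-th annotation row as a list of tab-separated columns."""
    with open(tsv_path, "rb") as f:
        f.seek(offsets[idx])
        return f.readline().decode("utf-8").rstrip("\n").split("\t")

print(len(offsets))  # number of annotation rows
print(read_row(0))   # first row; inspect it to see the column layout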
For VG, you can download the formatted annotations here and run the following command to unzip the annotations:
$ cd datasets/annotations
$ tar -xzvf pt-vg.tar.gz
After unzipping, there will be a pt-vg folder containing several files, and the datasets folder will have the following structure:
|-- datasets
|-- imgfeats
| |-- ...
|-- annotations
|-- pt-vg
| |-- pt_vg_annotations_train.tsv
| |-- pt_vg_annotations_train.lineidx
|-- ...
For SBU, you can download the formatted annotations here and run the following command to unzip the annotations:
$ cd datasets/annotations
$ tar -xzvf pt-sbu.tar.gz
After unzipping, there will be a pt-sbu folder containing several files, and the datasets folder will have the following structure:
|-- datasets
|-- imgfeats
| |-- ...
|-- annotations
|-- pt-sbu
| |-- pt_sbu_annotations_train.tsv
| |-- pt_sbu_annotations_train.lineidx
|-- ...
For Conceptual, you can download the formatted annotations here and run the following command to unzip the annotations:
$ cd datasets/annotations
$ tar -xzvf pt-conceptual.tar.gz
After unzipping, there will be a pt-conceptual folder containing several files, and the datasets folder will have the following structure:
|-- datasets
|-- imgfeats
| |-- ...
|-- annotations
|-- pt-conceptual
| |-- pt_conceptual_annotations_train.tsv
| |-- pt_conceptual_annotations_train.lineidx
| |-- pt_conceptual_annotations_val.tsv
| |-- pt_conceptual_annotations_val.lineidx
|-- ...
Finally, the datasets folder will have the following structure:
|-- datasets
|-- imgfeats
| |-- mscoco_bua_r101_fix36
| | |-- npz_files
| | |-- ***.npz
| | |-- ...
| |-- visualgenome_bua_r101_fix36
| | |-- npz_files
| | |-- ***.npz
| | |-- ...
| |-- sbu_bua_r101_fix36
| | |-- npz_files
| | |-- ***.npz
| | |-- ...
| |-- conceptual_bua_r101_fix36
| |-- npz_files
| |-- ***.npz
| |-- ...
|-- annotations
|-- pt-coco
| |-- text_piror_coco_vg_cc_sbu.json
| |-- pt_coco_annotations_train.tsv
| |-- pt_coco_annotations_train.lineidx
| |-- pt_coco_annotations_test.tsv
| |-- pt_coco_annotations_test.lineidx
| |-- pt_coco_annotations_testall.tsv
| |-- pt_coco_annotations_testall.lineidx
| |-- pt_coco_annotations_dev.tsv
| |-- pt_coco_annotations_dev.lineidx
|-- pt-vg
| |-- pt_vg_annotations_train.tsv
| |-- pt_vg_annotations_train.lineidx
|-- pt-sbu
| |-- pt_sbu_annotations_train.tsv
| |-- pt_sbu_annotations_train.lineidx
|-- pt-conceptual
|-- pt_conceptual_annotations_train.tsv
|-- pt_conceptual_annotations_train.lineidx
|-- pt_conceptual_annotations_val.tsv
|-- pt_conceptual_annotations_val.lineidx
We provide the extracted image region features and formatted text annotations for each downstream task.
We use pre-extracted region features for each image. For the finetuning tasks in this repository, two image datasets are used: COCO and Flickr.
The image features are extracted in the commonly used bottom-up attention manner, with each image represented as a fixed number (k=36) of 2048-D features. You can download the extracted features or extract the visual features yourself.
We provide the extracted image features for the two datasets in .tsv format, namely mscoco_bua_r101_fix36.tar.gz and flickr_bua_r101_fix36.tar.gz, corresponding to the features for COCO and Flickr, respectively. Use the commands below to unzip the downloaded files to the proper places:
$ tar -xzvf mscoco_bua_r101_fix36.tar.gz -C datasets/imgfeats/
$ tar -xzvf flickr_bua_r101_fix36.tar.gz -C datasets/imgfeats/
Each zipped file contains three files: imgfeat.tsv, imgfeat.lineidx, and img_feat_offset_map.json.
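These three files work together: imgfeat.tsv stores one feature row per image, imgfeat.lineidx stores the byte offset of each row, and img_feat_offset_map.json maps an image identifier to its row index. The exact key format of the offset map is an assumption in the sketch below, so inspect the JSON for your copy before using it:

import json

feat_dir = "datasets/imgfeats/mscoco_bua_r101_fix36"

# image id -> row index in imgfeat.tsv (the key format is an assumption; inspect the JSON).
with open(f"{feat_dir}/img_feat_offset_map.json") as f:
    offset_map = json.load(f)

# Byte offset of each row in imgfeat.tsv.
with open(f"{feat_dir}/imgfeat.lineidx") as f:
    offsets = [int(line.strip()) for line in f]

def load_feat_row(image_id):
    """Fetch the raw tsv row for one image without scanning the whole file."""
    row_idx = int(offset_map[str(image_id)])
    with open(f"{feat_dir}/imgfeat.tsv", "rb") as f:
        f.seek(offsets[row_idx])
        return f.readline().decode("utf-8").rstrip("\n").split("\t")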
The datasets folder will have the following structure:
|-- datasets
|-- imgfeats
| |-- mscoco_bua_r101_fix36
| | |-- imgfeat.tsv
| | |-- imgfeat.lineidx
| | |-- img_feat_offset_map.json
| |-- flickr_bua_r101_fix36
| |-- imgfeat.tsv
| |-- imgfeat.lineidx
| |-- img_feat_offset_map.json
|-- annotations
|-- ...
Alternatively, the above image features can be extracted using our bottom-up-attention.pytorch repository. To ensure a fair comparison with other VLP methods, we use the standard Faster R-CNN with R101-fix36 model (ckpt) to extract the features.
The following command extracts features for each image in $IMAGE_DIR and writes the corresponding features in .npz format to $OUT_DIR:
$ python extract_features.py --mode caffe \
--num-cpus 32 --gpu '0,1,2,3' \
--extract-mode roi_feats \
--min-max-boxes 36,36 \
--config-file configs/bua-caffe/extract-bua-caffe-r101-fix36.yaml \
--image-dir $IMAGE_DIR \
--out-dir $OUT_DIR \
--resume
Here, $IMAGE_DIR refers to a folder containing .jpg images, and $OUT_DIR refers to the folder that stores the extracted .npz feature files.
After obtaining the .npz features for the whole dataset, you can use transfer_npz2tsv.py to convert these .npz features into one .tsv file as follows:
$ python transfer_npz2tsv.py \
--npz-dir $NPZ_DIR \
--tsv-dir $TSV_DIR
$NPZ_DIR is the folder containing the .npz format features, and $TSV_DIR is the folder that stores the converted .tsv format features; it will contain three files: imgfeat.tsv, imgfeat.lineidx, and img_feat_offset_map.json.
After preparing the visual features, the datasets folder will have the following structure:
|-- datasets
|-- imgfeats
| |-- $COCO_TSV_DIR
| | |-- imgfeat.tsv
| | |-- imgfeat.lineidx
| | |-- img_feat_offset_map.json
| |-- $Flickr_TSV_DIR
| |-- imgfeat.tsv
| |-- imgfeat.lineidx
| |-- img_feat_offset_map.json
|-- annotations
|-- ...
$COCO_TSV_DIR and $Flickr_TSV_DIR are the $TSV_DIR folders for COCO and Flickr, respectively. Make sure that the paths are consistent with the settings in the config files. For simplicity, we recommend setting $COCO_TSV_DIR and $Flickr_TSV_DIR to mscoco_bua_r101_fix36 and flickr_bua_r101_fix36, respectively.
For each downstream task (i.e., dataset), we provide the formatted annotation files as follows.
For the VQA task, we use the VQAv2 dataset. You can download the formatted annotations here and run the following command to unzip the annotations:
$ cd datasets/annotations
$ tar -xzvf vqa-vqav2.tar.gz
After unzipping, there will be a vqa-vqav2 folder containing several files. The vqa_vqav2_annotations.json file is the primary annotation file for VQAv2, covering the train, val, and test splits. To perform offline validation on the val and minival splits, we additionally provide four files (a minimal offline accuracy check is sketched after the list):
- v2_OpenEnded_mscoco_val2014_questions.json
- v2_mscoco_val2014_annotations.json
- v2_OpenEnded_mscoco_minival2014_questions.json
- v2_mscoco_minival2014_annotations.json
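The training code performs evaluation itself, but these files also allow a quick offline sanity check. The sketch below assumes the standard VQAv2 annotation layout (an "annotations" list whose entries carry a "question_id" and ten human "answers") and uses the simplified accuracy min(#matching answers / 3, 1) without the official answer normalization; your predictions are assumed to be a question_id -> answer mapping.

import json

def vqa_accuracy(predictions, annotation_file):
    """Simplified VQA accuracy for a dict mapping question_id -> predicted answer."""
    with open(annotation_file) as f:
        annotations = json.load(f)["annotations"]
    scores = []
    for ann in annotations:
        pred = predictions.get(ann["question_id"], "")
        matches = sum(a["answer"] == pred for a in ann["answers"])
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)

# Example with a hypothetical prediction dict:
# acc = vqa_accuracy({262148000: "down"},
#                    "datasets/annotations/vqa-vqav2/v2_mscoco_minival2014_annotations.json")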
After that, the datasets folder will have the following structure:
|-- datasets
|-- imgfeats
| |-- ...
|-- annotations
|-- vqa-vqav2
| |-- vqa_vqav2_annotations.json
| |-- v2_OpenEnded_mscoco_val2014_questions.json
| |-- v2_mscoco_val2014_annotations.json
| |-- v2_OpenEnded_mscoco_minival2014_questions.json
| |-- v2_mscoco_minival2014_annotations.json
|-- ...
For the REC task, we use the RefCOCO, RefCOCOplus, and RefCOCOg datasets. You can download the formatted annotations here (RefCOCO, RefCOCOplus, RefCOCOg) and run the following commands to unzip the annotations:
$ cd datasets/annotations
$ tar -xzvf rec-refcoco.tar.gz
$ tar -xzvf rec-refcocoplus.tar.gz
$ tar -xzvf rec-refcocog.tar.gz
Similarly, the datasets folder will have the following structure:
|-- datasets
|-- imgfeats
| |-- ...
|-- annotations
|-- rec-refcoco
| |-- rec_refcoco_annotations.json
|-- rec-refcocoplus
| |-- rec_refcocoplus_annotations.json
|-- rec-refcocog
| |-- rec_refcocog_annotations.json
|-- ...
For the ITR task, we use the ITR-COCO and ITR-Flickr datasets. You can download the formatted annotations here (ITR-COCO, ITR-Flickr) and run the following commands to unzip the annotations:
$ cd datasets/annotations
$ tar -xzvf itr-coco.tar.gz
$ tar -xzvf itr-flickr.tar.gz
After unzipping, you will obtain an itr-coco folder and an itr-flickr folder. Each folder contains a *_annotations.json file and an img_text_map.json file; the latter stores the mappings between images and their paired texts.
After that, the datasets folder will have the following structure:
|-- datasets
|-- imgfeats
| |-- ...
|-- annotations
|-- itr-coco
| |-- itr_coco_annotations.json
| |-- img_text_map.json
|-- itr-flickr
| |-- itr_flickr_annotations.json
| |-- img_text_map.json
|-- ...
Finally, the datasets folder will have the following structure:
|-- datasets
|-- imgfeats
| |-- mscoco_bua_r101_fix36
| | |-- imgfeat.tsv
| | |-- imgfeat.lineidx
| | |-- img_feat_offset_map.json
| |-- flickr_bua_r101_fix36
| |-- imgfeat.tsv
| |-- imgfeat.lineidx
| |-- img_feat_offset_map.json
|-- annotations
|-- vqa-vqav2
| |-- vqa_vqav2_annotations.json
| |-- v2_mscoco_val2014_annotations.json
| |-- v2_OpenEnded_mscoco_val2014_questions.json
| |-- v2_mscoco_minival2014_annotations.json
| |-- v2_OpenEnded_mscoco_minival2014_questions.json
|-- rec-refcoco
| |-- rec_refcoco_annotations.json
|-- rec-refcocoplus
| |-- rec_refcocoplus_annotations.json
|-- rec-refcocog
| |-- rec_refcocog_annotations.json
|-- itr-coco
| |-- itr_coco_annotations.json
| |-- img_text_map.json
|-- itr-flickr
|-- itr_flickr_annotations.json
|-- img_text_map.json