Codebase for the EACL 2024 paper "The Role of Data Curation in Image Captioning".
(BEiT-3) Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks
We use the base-size BEiT3 checkpoint (BEiT3-base) from the original unilm repo.
BEiT3-base: #layer=12; hidden=768; FFN factor=4x; #head=12; patch=16x16; #parameters: 276M => download checkpoint
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip3 install -r beit3/requirements.txt
Alternatively, set up the environment as described in https://github.com/microsoft/unilm/tree/master/beit3
Follow the instructions in get_started_for_image_captioning.md to download the dataset images and preprocess them.
We provide preprocessed annotations here.
We provide a sample bash script for finetuning the BEiT3 model with dynamic data curation: run/finetune_flickr_captioning_curation_tmp.sh
Curation methods and ratio:
Use curation_method and curation_ratio to configure the curation process.
--dynamic enables dynamic curation, which targets samples whose loss is 2 standard deviations away from the mean.
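The 2-standard-deviation rule can be illustrated with a minimal sketch. This is not the repository's implementation: the function name `flag_high_loss_samples` and the `num_std` parameter are hypothetical, and the sketch assumes that only the high-loss side (loss above mean + 2 std) is selected and that the statistics are computed over one collection of per-sample losses.

```python
from statistics import mean, pstdev


def flag_high_loss_samples(losses, num_std=2.0):
    """Return indices of samples whose loss exceeds mean + num_std * std.

    Hypothetical helper for illustration only; the training code in this
    repository may compute the threshold differently.
    """
    mu = mean(losses)
    sigma = pstdev(losses)  # population standard deviation
    threshold = mu + num_std * sigma
    # Keep only the indices of samples strictly above the threshold.
    return [i for i, loss in enumerate(losses) if loss > threshold]


if __name__ == "__main__":
    # Nine ordinary samples and one sample with an unusually high loss.
    example_losses = [1.0] * 9 + [10.0]
    print(flag_high_loss_samples(example_losses))
```

In this toy example only the last sample is flagged; how flagged samples are then treated would depend on the chosen curation_method and curation_ratio.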
Dataset:
--task specifies which dataset to use: flickr30k_captioning or coco_captioning
updating soon
@misc{li2024role,
title={The Role of Data Curation in Image Captioning},
author={Wenyan Li and Jonas F. Lotz and Chen Qiu and Desmond Elliott},
year={2024},
eprint={2305.03610},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This repository is built using the BEiT3 repository and the BLIP library.
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.