This repository contains the code, models, and data for the paper MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment.
This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question-answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmentation and summarization. It then matches these instructions with images and uses an open-sourced large language model (LLM) to generate coherent answers for the resulting instruction-image pairs. Throughout answer generation, the LLM is grounded in detailed text descriptions of the images to guarantee the alignment of the instruction data. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities compared to LLaVA-1.5 models.
- Diverse Instructions: MM-Instruct goes beyond simple question-answering, covering a wide range of instructions including creative writing, summarization, and image analysis.
- High-Quality: Instructions and answers are carefully generated and filtered for coherence and alignment with image content.
- Automated Pipeline: We provide an automated pipeline for data generation.
- Benchmark for Evaluation: A held-out subset of MM-Instruct serves as a benchmark for evaluating the instruction-following capabilities of LMMs.
The MM-Instruct dataset consists of 234k high-quality visual instruction-answer pairs derived from image captioning datasets. You can download it from Hugging Face.
We release 7B and 13B models trained with the LLaVA framework.
Model | Link |
---|---|
MM-Instruct-7B | weights |
MM-Instruct-13B | weights |
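As a quick sanity check, the released checkpoints should load with the standard LLaVA inference CLI once the environment below is set up (an assumption based on the models being trained with the LLaVA framework). The following is a minimal sketch; the checkpoint and image paths are placeholders for wherever you saved the downloaded weights and a local test image.

# minimal inference sketch -- the paths below are placeholders, not files shipped with this repo
python -m llava.serve.cli \
    --model-path ./checkpoints/MM-Instruct-13B \
    --image-file ./examples/test_image.jpg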
If you are not using Linux, do NOT proceed; see the instructions for macOS and Windows.
- Clone this repository and navigate to the MM-Instruct folder
git clone https://github.com/jihaonew/MM-Instruct.git
cd MM-Instruct
- Install Package
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
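Optionally, a quick check (a suggestion, not part of the official instructions) that FlashAttention built correctly against your local CUDA toolkit:

# optional sanity check that flash-attn imports and reports its version
python -c "import flash_attn; print(flash_attn.__version__)"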
- Upgrade to the latest code base
git pull
pip install -e .
# if you see some import errors when you upgrade,
# please try running the command below (without #)
# pip install flash-attn --no-build-isolation --no-cache-dir
We follow the original LLaVA for training and evaluation. One exception is that we use our generated data in the instruction tuning stage. LLaVA training consists of two stages: (1) feature alignment stage: use the 558K subset of the LAION-CC-SBU dataset to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following data, around 515K VQA data from academic-oriented tasks, and 234K of our generated instruction-tuning data to teach the model to follow multimodal instructions.
LLaVA is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
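As a concrete illustration (the per-device value of 16 is an assumption taken from typical LLaVA finetuning scripts; use whatever your script specifies), the 128 global batch size used for finetuning can be preserved when moving from 8 GPUs to 4 by doubling the accumulation steps:

# global batch size = per_device_train_batch_size x gradient_accumulation_steps x num_gpus
# 8 GPUs (default): 16 x 1 x 8 = 128
# 4 GPUs:           16 x 2 x 4 = 128  (double gradient_accumulation_steps)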
We use a similar set of hyperparameters as Vicuna in finetuning. The hyperparameters used in pretraining and finetuning are provided below.
- Pretraining
Model | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
---|---|---|---|---|---|
LLaVA-v1.5-13B | 256 | 1e-3 | 1 | 2048 | 0 |
- Finetuning
Model | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
---|---|---|---|---|---|
LLaVA-v1.5-13B | 128 | 2e-5 | 1 | 2048 | 0 |
Our base model Vicuna v1.5, which is an instruction-tuned chatbot, will be downloaded automatically when you run our provided training scripts. No action is needed.
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions used in the paper here.
Pretrain takes around 5.5 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. It takes around 3.5 hours for LLaVA-v1.5-7B.
Training script with DeepSpeed ZeRO-2: `pretrain.sh`.

- `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.
- `--vision_tower openai/clip-vit-large-patch14-336`: CLIP ViT-L/14 336px.
Pretrain takes around 20 hours for LLaVA-7B on 8x V100 (32G).
We provide a training script with DeepSpeed here. Tips:
- If you are using V100s, which are not supported by FlashAttention, you can use the memory-efficient attention implemented in xFormers. Install xformers and replace `llava/train/train_mem.py` above with `llava/train/train_xformers.py` (a minimal sketch of the swap is shown below).
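A minimal sketch of that swap, assuming your pretraining script launches `llava/train/train_mem.py` as in upstream LLaVA (the `scripts/pretrain.sh` path is an assumption; edit whichever script you actually run):

pip install xformers
# point the launcher at the xFormers entry point instead of the FlashAttention one
# (scripts/pretrain.sh is an assumed path -- adjust to the script you actually use)
sed -i 's|llava/train/train_mem.py|llava/train/train_xformers.py|' scripts/pretrain.sh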
- Prepare data
Download the annotation of the final mixture instruction tuning data llava_v1_5_mix665k.json from LLaVA. Download our generated instruction-tuning data mm_instruct_234k and combine it with llava_v1_5_mix665k using the script (a minimal merge sketch is shown after the directory layout below).
Then download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; we save all files as `.jpg`
- TextVQA: train_val_images
- VisualGenome: part1, part2
- datacomp_1b: datacomp_1b
- sa_1b: sa_1b
After downloading all of them, organize the data as follows in `./playground/data`:
├── coco
│ └── train2017
├── gqa
│ └── images
├── ocr_vqa
│ └── images
├── textvqa
│ └── train_images
├── datacomp
│ └── images
├── sa1b
└── vg
├── VG_100K
└── VG_100K_2
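A minimal sketch of the merge step referenced above, assuming both annotation files are plain JSON lists in the LLaVA conversation format and that `jq` is available (file names are placeholders for wherever you saved the downloads; prefer the merge script provided in this repo if it differs):

# concatenate the two JSON lists into a single mixture annotation file
jq -s 'add' llava_v1_5_mix665k.json mm_instruct_234k.json > llava_mm_instruct_mix.json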
- Start training!
You may download our pretrained projectors in the Model Zoo. It is not recommended to use legacy projectors, as they may have been trained with a different version of the codebase; if any option is off, the model will not function or train as expected.
Visual instruction tuning takes around 20 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. It takes around 10 hours for LLaVA-v1.5-7B on 8x A100 (40G).
Training script with DeepSpeed ZeRO-3: `finetune.sh`.
If you do not have enough GPU memory:
- Use LoRA: `finetune_lora.sh`. We are able to fit 13B training on 8x A100 (40G) or 8x A6000, and 7B training on 8x RTX 3090. Make sure `per_device_train_batch_size*gradient_accumulation_steps` is the same as in the provided script for best reproducibility.
- Replace `zero3.json` with `zero3_offload.json`, which offloads some parameters to CPU RAM; this slows down training (an example of the swap is shown after this list).
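For reference, a hedged example of the second tip, assuming your finetuning script passes the DeepSpeed config via `--deepspeed` as in upstream LLaVA (paths are assumptions; edit the script you actually use):

# in finetune.sh, change the DeepSpeed config argument from
#   --deepspeed ./scripts/zero3.json
# to the CPU-offload variant
#   --deepspeed ./scripts/zero3_offload.json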
If you are interested in finetuning the LLaVA model on your own task/data, please check out Finetune_Custom_Data.md.
New options to note:
- `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.
- `--vision_tower openai/clip-vit-large-patch14-336`: CLIP ViT-L/14 336px.
- `--image_aspect_ratio pad`: this pads the non-square images to square, instead of cropping them; it slightly reduces hallucination.
- `--group_by_modality_length True`: this should only be used when your instruction tuning dataset contains both language data (e.g. ShareGPT) and multimodal data (e.g. LLaVA-Instruct). It makes the training sampler only sample a single modality (either image or language) during training, which we observe to speed up training by ~25% and does not affect the final outcome.
We first evaluate our MM-Instruct model on 12 existing benchmarks, following LLaVA-1.5. To ensure reproducibility, we evaluate the models with greedy decoding; we do not evaluate with beam search, to keep the inference process consistent with the real-time outputs of the chat demo. See Evaluation.md.
Besides the evaluation in LLaVA-1.5, we provide an extra dataset to evaluate the instruction-following capabilities of LMMs. Our GPT-assisted evaluation of instruction-following capability is provided for thoroughly examining how well existing LMMs follow instructions. Please see our paper for more details.
Please follow the steps below to evaluate your model.
- Data preparation: first, download the test images
- Generate responses with your own model
python model_vqa.py \
--model-path ./checkpoints/LLaVA-13B-v0 \
--question-file \
./llava/eval/MM-Instruct-eval/MM-Instruct-Eval-question.json \
--image-folder \
/path/to/eval_image \
--answers-file \
/path/to/answer-file-our.json
- Evaluate the generated responses by comparing your answers with a baseline result (e.g., GPT-4V here). We provide baseline results from `GPT4-V`, `Gemini-Pro`, `CogVLM`, `Instruct-BLIP`, `LLaVA-1.5`, and `Qwen-VL`.
OPENAI_API_KEY="sk-***********************************"
python llava/eval/mminstruct-eval-gpt4v.py \
--image-folder /path/to/eval_image \
--question llava/eval/MM-Instruct-eval/MM-Instruct-Eval-question.json \
--m1 /path/to/answer-file-our.json \
--m2 ./llava/eval/MM-Instruct-eval/baseline_answers/gpt4v.json
If you find this work helpful, please cite our paper:
@misc{liu2024mminstructgeneratedvisualinstructions,
title={MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment},
author={Jihao Liu and Xin Huang and Jinliang Zheng and Boxiao Liu and Jia Wang and Osamu Yoshie and Yu Liu and Hongsheng Li},
year={2024},
eprint={2406.19736},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2406.19736},
}
Contributions to this repository are welcome. Please feel free to open an issue or submit a pull request.