Official implementation of MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Abstract: Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies investigate VLM personalization to understand user-provided concepts. However, they mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes the first multi-concept personalization paradigm, MC-LLaVA. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the costs related to joint training, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location confidence maps for enhanced recognition and grounding capabilities. To advance multi-concept personalization research, we further contribute a high-quality instruction tuning dataset. We carefully collect images with multiple characters and objects from movies and manually generate question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants.
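For intuition, the snippet below is a minimal PyTorch sketch of the general idea behind the personalized textual prompt: new concept tokens are initialized from visual-token information rather than at random. This is not the repository's actual implementation; the mean pooling, the linear projector, the number of concept tokens, and all tensor dimensions are illustrative assumptions.

```python
# Minimal sketch (NOT the repo's code) of initializing concept tokens from
# visual-token information. Shapes, pooling, and projection are assumptions.
import torch
import torch.nn as nn

def init_concept_tokens(visual_tokens: torch.Tensor,
                        projector: nn.Module,
                        num_concept_tokens: int = 4) -> nn.Parameter:
    """visual_tokens: (num_images, num_patches, vision_dim) features of one
    concept's reference images; projector maps vision_dim -> text_dim."""
    with torch.no_grad():
        pooled = visual_tokens.mean(dim=(0, 1))          # (vision_dim,)
        text_space = projector(pooled)                   # (text_dim,)
        init = text_space.unsqueeze(0).repeat(num_concept_tokens, 1)
        init = init + 0.01 * torch.randn_like(init)      # break symmetry
    return nn.Parameter(init)                            # trainable concept tokens

# Toy usage with random features standing in for real visual tokens.
vision_dim, text_dim = 1024, 4096
projector = nn.Linear(vision_dim, text_dim)
visual_tokens = torch.randn(15, 576, vision_dim)          # e.g. 15 images per concept
concept_tokens = init_concept_tokens(visual_tokens, projector)
print(concept_tokens.shape)                               # torch.Size([4, 4096])
```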
- Release training and testing code. (2024/12/6)
- Open-source full datasets for research and development. (2024/12/15)
- Update code for testing with the personalized visual prompt. (2025/6/4)
- Clone this repository:
git clone https://github.com/arctanxarc/MC-LLaVA.git
cd MC-LLaVA
- Set up your Python environment:
conda create -n mcllava python=3.10 -y
conda activate mcllava
pip install --upgrade pip
pip install -e .
Set up training data (TODO)
Start training (TODO)
Once the trained parameters are ready, you can run either of two versions of the testing code:
- Without personalized visual prompt:
cd eval; python eval.py
- With personalized visual prompt:
- For easier debugging, we first compute and save the images with personalized visual prompts for all concepts in an offline manner:
cd eval; python gen_som.py --dataset_root <your dataset path>
- Then, run the evaluation:
cd eval; python eval_w_som.py
(For details on the script parameters, see the corresponding .py files.)
(The personalized visual prompt currently covers only character concepts, not object concepts; support for object concepts will be added as soon as possible.)
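For intuition only, the sketch below shows one way per-concept location confidence maps could be aggregated into a single visually prompted image, which is roughly the kind of offline artifact `gen_som.py` saves. It is not the actual script: the thresholding, peak picking, drawing style, and concept names are all illustrative assumptions.

```python
# Rough illustration (NOT gen_som.py) of aggregating per-concept location
# confidence maps into one marked-up image used as a visual prompt.
import numpy as np
from PIL import Image, ImageDraw

def add_visual_prompt(image: Image.Image,
                      confidence_maps: dict[str, np.ndarray],
                      threshold: float = 0.5) -> Image.Image:
    """confidence_maps: concept name -> (H, W) map in [0, 1], same size as image."""
    out = image.convert("RGB").copy()
    draw = ImageDraw.Draw(out)
    for name, conf in confidence_maps.items():
        if conf.max() < threshold:
            continue                                  # concept not confidently located
        y, x = map(int, np.unravel_index(conf.argmax(), conf.shape))
        r = 12
        draw.ellipse([x - r, y - r, x + r, y + r], outline="red", width=3)
        draw.text((x + r + 2, y - r), name, fill="red")
    return out

# Toy usage with a blank image and synthetic Gaussian-shaped confidence maps.
H, W = 240, 320
img = Image.new("RGB", (W, H), "gray")
yy, xx = np.mgrid[0:H, 0:W]
maps = {
    "<concept_a>": np.exp(-(((yy - 80) ** 2 + (xx - 100) ** 2) / (2 * 20 ** 2))),
    "<concept_b>": np.exp(-(((yy - 150) ** 2 + (xx - 240) ** 2) / (2 * 20 ** 2))),
}
add_visual_prompt(img, maps).save("prompted_example.png")
```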
The multi-concept personalized dataset proposed in our paper, together with part of the trained weights, can be downloaded here.
The dataset is a vision-language dataset tailored for multi-concept scenario customization of Vision-Language Models (VLMs). It includes 99 images sourced from the open dataset CC12M, 1,595 screenshots from 40 films and TV shows, and 12,639 Q&A pairs generated by GPT-4o.
Specifically, the dataset comprises 40 scenarios (one film or TV show corresponds to one scenario), each with several (2 to 4) characters. Each character is represented by 15 images and several corresponding Q&A pairs. Additionally, each scenario includes a set of images featuring multiple characters along with corresponding Q&A pairs.
| Number of Characters per Scenario | Number of Scenarios | Number of Images | Number of Q&A Pairs |
|---|---|---|---|
| 2 | 30 | 1,050 | 7,950 |
| 3 | 7 | 350 | 2,940 |
| 4 | 3 | 195 | 1,749 |
.
├── scenarios.json                 # List of all scenario names
├── 4o_generate_training_data     # Visual Q&A training data generated by GPT-4o; each character has a file with 100 Q&A pairs
│   ├── Alex
│   ...
│   └── ziqiao
│       └── conversation.json
├── random-images                 # Random images, 99 selected from the open dataset CC12M, used to form training data
│   ├── 10192.png
│   ...
│   └── 9943.png
├── three_concept                 # Scenarios with three or more characters; includes ten such scenarios
│   ├── concept
│   │   ├── test                  # Single-character test set, grouped by character
│   │   │   ...
│   │   │   └── ziqiao
│   │   │       ├── 0.png
│   │   │       ...
│   │   │       ├── 4.png
│   │   │       ├── choice.json   # Visual multiple-choice Q&A pairs for testing
│   │   │       ├── qa.json       # Visual Q&A pairs for testing
│   │   │       └── vqa.json      # Text-only Q&A pairs for testing
│   │   └── train                 # Training set, grouped by character
│   │       ├── Alex
│   │       ...
│   │       └── ziqiao
│   │           ├── 0.png
│   │           ├── 0_mask.png    # Mask images generated using [GroundedSAM](https://github.com/IDEA-Research/Grounded-Segment-Anything)
│   │           ...
│   │           ├── 9.png
│   │           ├── 9_mask.png
│   │           └── caption_llava.json  # Descriptions generated by LLaVA for each training image, used to reproduce the baseline
│   └── multi                     # Multi-character test set, grouped by scenario
│       ├── Alex_Gloria_Marty
│       ...
│       └── zhongtangxi_sanchengmeiqin_jiubuliulang_donghailinxizi
│           ├── 0.png
│           ...
│           ├── 4.png
│           ├── choice.json        # Visual multiple-choice Q&A pairs for testing
│           ├── position.json      # Positional information specifying the relative positions of characters in each image
│           ├── vqa_position.json  # Visual Q&A pairs for visual grounding testing
│           ├── qa.json            # Visual Q&A pairs for testing
│           └── vqa.json           # Text-only Q&A pairs for testing
└── two_concept                   # Scenarios with two characters; includes thirty such scenarios. The file structure is the same as three_concept.
    ├── concept
    │   ├── test
    │   └── train
    └── multi
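The snippet below is a small helper sketch for browsing the layout above: it reads `scenarios.json`, picks one multi-character test folder, and reports which annotation files it contains. The dataset root path is a placeholder, and no assumption is made about the internal schema of the JSON files.

```python
# Minimal sketch for inspecting the downloaded dataset layout shown above.
# Only file/folder names from the tree are used; JSON schemas are not assumed.
import json
from pathlib import Path

dataset_root = Path("path/to/MC-LLaVA-dataset")   # placeholder path (adjust to yours)

# scenarios.json lists all scenario names.
scenarios = json.loads((dataset_root / "scenarios.json").read_text(encoding="utf-8"))
print(f"{len(scenarios)} scenarios, e.g. {list(scenarios)[:3]}")

# Pick the first multi-character test folder from the two-concept split.
multi_dir = dataset_root / "two_concept" / "multi"
scenario_dir = next(p for p in sorted(multi_dir.iterdir()) if p.is_dir())
images = sorted(scenario_dir.glob("*.png"))
print(f"{scenario_dir.name}: {len(images)} test images")

# Report the annotation files present and how many entries each holds.
for name in ("qa.json", "vqa.json", "choice.json", "position.json", "vqa_position.json"):
    path = scenario_dir / name
    if path.exists():
        data = json.loads(path.read_text(encoding="utf-8"))
        print(f"{name}: {type(data).__name__} with {len(data)} entries")
```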
The scenario A_L_Y comes from the TV show “Spy × Family” and includes three characters: A, L, and Y, as shown in the figure.
(TODO)
@article{an2024mc,
title={{MC-LLaVA}: Multi-Concept Personalized Vision-Language Model},
author={An, Ruichuan and Yang, Sihan and Lu, Ming and Zhang, Renrui and Zeng, Kai and Luo, Yulin and Cao, Jiajun and Liang, Hao and Chen, Ying and She, Qi and others},
journal={arXiv preprint arXiv:2411.11706},
year={2024}
}
This code is heavily inspired by LLaVA and Yo’LLaVA. We thank the authors for their outstanding work!
This dataset is licensed under the MIT License. You are free to use, modify, and distribute this dataset under the terms of the MIT License.