Official implementation of MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Abstract: Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies investigate VLM personalization to understand user-provided concepts. However, they mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes the first multi-concept personalization paradigm, MC-LLaVA. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the costs related to joint training, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location confidence maps for enhanced recognition and grounding capabilities. To advance multi-concept personalization research, we further contribute a high-quality instruction tuning dataset. We carefully collect images with multiple characters and objects from movies and manually generate question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants.
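For intuition, the snippet below is a minimal PyTorch sketch of the general idea behind the personalized textual prompt: new concept tokens are initialized from visual-token information rather than at random. This is not the repository's actual implementation; the mean pooling, the linear projector, the number of concept tokens, and all tensor dimensions are illustrative assumptions.

```python
# Minimal sketch (NOT the repo's code) of initializing concept tokens from
# visual-token information. Shapes, pooling, and projection are assumptions.
import torch
import torch.nn as nn

def init_concept_tokens(visual_tokens: torch.Tensor,
                        projector: nn.Module,
                        num_concept_tokens: int = 4) -> nn.Parameter:
    """visual_tokens: (num_images, num_patches, vision_dim) features of one
    concept's reference images; projector maps vision_dim -> text_dim."""
    with torch.no_grad():
        pooled = visual_tokens.mean(dim=(0, 1))          # (vision_dim,)
        text_space = projector(pooled)                   # (text_dim,)
        init = text_space.unsqueeze(0).repeat(num_concept_tokens, 1)
        init = init + 0.01 * torch.randn_like(init)      # break symmetry
    return nn.Parameter(init)                            # trainable concept tokens

# Toy usage with random features standing in for real visual tokens.
vision_dim, text_dim = 1024, 4096
projector = nn.Linear(vision_dim, text_dim)
visual_tokens = torch.randn(15, 576, vision_dim)          # e.g. 15 images per concept
concept_tokens = init_concept_tokens(visual_tokens, projector)
print(concept_tokens.shape)                               # torch.Size([4, 4096])
```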
- Release training and testing code. (2024/12/6)
- Open-source full datasets for research and development. (2024/12/15)
- Update code for testing with the personalized visual prompt. (2025/6/4)
- Clone this repository:
git clone https://github.com/arctanxarc/MC-LLaVA.git
cd MC-LLaVA
- Set up your Python environment:
conda create -n mcllava python=3.10 -y
conda activate mcllava
pip install --upgrade pip
pip install -e .
Set up training data (TODO)
Start training (TODO)
Once the trained parameters are ready, you can run either of two versions of the testing code:
- Without personalized visual prompt:
cd eval; python eval.py
- With personalized visual prompt:
- For easier debugging, we first compute and save the images with personalized visual prompts for all concepts in an offline manner:
cd eval; python gen_som.py --dataset_root <your dataset path>
- Then, run the evaluation:
cd eval; python eval_w_som.py
(For details on the script parameters, see the corresponding .py files.)
(The personalized visual prompt currently covers only character concepts, not object concepts; support for object concepts will be added as soon as possible.)
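For intuition only, the sketch below shows one way per-concept location confidence maps could be aggregated into a single visually prompted image, which is roughly the kind of offline artifact `gen_som.py` saves. It is not the actual script: the thresholding, peak picking, drawing style, and concept names are all illustrative assumptions.

```python
# Rough illustration (NOT gen_som.py) of aggregating per-concept location
# confidence maps into one marked-up image used as a visual prompt.
import numpy as np
from PIL import Image, ImageDraw

def add_visual_prompt(image: Image.Image,
                      confidence_maps: dict[str, np.ndarray],
                      threshold: float = 0.5) -> Image.Image:
    """confidence_maps: concept name -> (H, W) map in [0, 1], same size as image."""
    out = image.convert("RGB").copy()
    draw = ImageDraw.Draw(out)
    for name, conf in confidence_maps.items():
        if conf.max() < threshold:
            continue                                  # concept not confidently located
        y, x = map(int, np.unravel_index(conf.argmax(), conf.shape))
        r = 12
        draw.ellipse([x - r, y - r, x + r, y + r], outline="red", width=3)
        draw.text((x + r + 2, y - r), name, fill="red")
    return out

# Toy usage with a blank image and synthetic Gaussian-shaped confidence maps.
H, W = 240, 320
img = Image.new("RGB", (W, H), "gray")
yy, xx = np.mgrid[0:H, 0:W]
maps = {
    "<concept_a>": np.exp(-(((yy - 80) ** 2 + (xx - 100) ** 2) / (2 * 20 ** 2))),
    "<concept_b>": np.exp(-(((yy - 150) ** 2 + (xx - 240) ** 2) / (2 * 20 ** 2))),
}
add_visual_prompt(img, maps).save("prompted_example.png")
```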
The multi-concept personalized dataset proposed in our paper, together with part of the trained weights, can be downloaded here.
The dataset is a vision-language dataset tailored for multi-concept scenario customization of Vision-Language Models (VLMs). It includes 99 images sourced from the open dataset CC12M, 1,595 screenshots from 40 films and TV shows, and 12,639 Q&A pairs generated by GPT-4o.
Specifically, the dataset comprises 40 scenarios (one film or TV show corresponds to one scenario), each with several (2 to 4) characters. Each character is represented by 15 images and several corresponding Q&A pairs. Additionally, each scenario includes a set of images featuring multiple characters along with corresponding Q&A pairs.
| Number of Characters per Scenario | Number of Scenarios | Number of Images | Number of Q&A Pairs |
|---|---|---|---|
| 2 | 30 | 1,050 | 7,950 |
| 3 | 7 | 350 | 2,940 |
| 4 | 3 | 195 | 1,749 |
.
├── scenarios.json                 # List of all scenario names
├── 4o_generate_training_data     # Visual Q&A training data generated by GPT-4o; each character has a file with 100 Q&A pairs
│   ├── Alex
│   ...
│   └── ziqiao
│       └── conversation.json
├── random-images                 # Random images, 99 selected from the open dataset CC12M, used to form training data
│   ├── 10192.png
│   ...
│   └── 9943.png
├── three_concept                 # Scenarios with three or more characters; includes ten such scenarios
│   ├── concept
│   │   ├── test                  # Single-character test set, grouped by character
│   │   │   ...
│   │   │   └── ziqiao
│   │   │       ├── 0.png
│   │   │       ...
│   │   │       ├── 4.png
│   │   │       ├── choice.json   # Visual multiple-choice Q&A pairs for testing
│   │   │       ├── qa.json       # Visual Q&A pairs for testing
│   │   │       └── vqa.json      # Text-only Q&A pairs for testing
│   │   └── train                 # Training set, grouped by character
│   │       ├── Alex
│   │       ...
│   │       └── ziqiao
│   │           ├── 0.png
│   │           ├── 0_mask.png    # Mask images generated using [GroundedSAM](https://github.com/IDEA-Research/Grounded-Segment-Anything)
│   │           ...
│   │           ├── 9.png
│   │           ├── 9_mask.png
│   │           └── caption_llava.json  # Descriptions generated by LLaVA for each training image, used to reproduce the baseline
│   └── multi                     # Multi-character test set, grouped by scenario
│       ├── Alex_Gloria_Marty
│       ...
│       └── zhongtangxi_sanchengmeiqin_jiubuliulang_donghailinxizi
│           ├── 0.png
│           ...
│           ├── 4.png
│           ├── choice.json        # Visual multiple-choice Q&A pairs for testing
│           ├── position.json      # Positional information specifying the relative positions of characters in each image
│           ├── vqa_position.json  # Visual Q&A pairs for visual grounding testing
│           ├── qa.json            # Visual Q&A pairs for testing
│           └── vqa.json           # Text-only Q&A pairs for testing
└── two_concept                   # Scenarios with two characters; includes thirty such scenarios. The file structure is the same as three_concept.
    ├── concept
    │   ├── test
    │   └── train
    └── multi
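The snippet below is a small helper sketch for browsing the layout above: it reads `scenarios.json`, picks one multi-character test folder, and reports which annotation files it contains. The dataset root path is a placeholder, and no assumption is made about the internal schema of the JSON files.

```python
# Minimal sketch for inspecting the downloaded dataset layout shown above.
# Only file/folder names from the tree are used; JSON schemas are not assumed.
import json
from pathlib import Path

dataset_root = Path("path/to/MC-LLaVA-dataset")   # placeholder path (adjust to yours)

# scenarios.json lists all scenario names.
scenarios = json.loads((dataset_root / "scenarios.json").read_text(encoding="utf-8"))
print(f"{len(scenarios)} scenarios, e.g. {list(scenarios)[:3]}")

# Pick the first multi-character test folder from the two-concept split.
multi_dir = dataset_root / "two_concept" / "multi"
scenario_dir = next(p for p in sorted(multi_dir.iterdir()) if p.is_dir())
images = sorted(scenario_dir.glob("*.png"))
print(f"{scenario_dir.name}: {len(images)} test images")

# Report the annotation files present and how many entries each holds.
for name in ("qa.json", "vqa.json", "choice.json", "position.json", "vqa_position.json"):
    path = scenario_dir / name
    if path.exists():
        data = json.loads(path.read_text(encoding="utf-8"))
        print(f"{name}: {type(data).__name__} with {len(data)} entries")
```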
The scenario A_L_Y comes from the TV show “Spy × Family” and includes three characters: A, L, and Y, as shown in the figure.
(TODO)
@article{an2024mc,
title={{MC-LLaVA}: Multi-Concept Personalized Vision-Language Model},
author={An, Ruichuan and Yang, Sihan and Lu, Ming and Zhang, Renrui and Zeng, Kai and Luo, Yulin and Cao, Jiajun and Liang, Hao and Chen, Ying and She, Qi and others},
journal={arXiv preprint arXiv:2411.11706},
year={2024}
}
This code is heavily inspired by LLaVA and Yo’LLaVA. We thank the authors for their outstanding work!
This dataset is licensed under the MIT License. You are free to use, modify, and distribute this dataset under the terms of the MIT License.