GitHub - AIDC-AI/Wings: The code repository for "Wings: Learning Multimodal LLMs without Text-only Forgetting" [NeurIPS 2024]

Wings: A Versatile Multimodal LLM without Text-only Forgetting

📝 Paper • 🤗 Hugging Face

🚀 Ask questions or discuss ideas on GitHub

Table of Contents

Why Wings?
How to use
- Quick start
- Citation
License
Disclaimer
Acknowledgements

Why Wings?

💡 TL;DR

Wings is a new, general-purpose Multimodal Large Language Model (MLLM). Its flexible multimodal structure acts like a pair of wings for the underlying LLM: it improves multimodal performance while minimizing text-only forgetting.
The Wings component can be added to any MLLM architecture.

Multimodal large language models (MLLMs) start from a trained LLM, first aligning images with text and then fine-tuning on mixed multimodal inputs. In the process, however, the MLLM catastrophically forgets how to follow text-only instructions, which contain no images and which the initial LLM could already handle.

In this work, we present Wings, a novel MLLM that excels in both text-only dialogues and multimodal comprehension. Analyzing MLLM attention on multimodal instructions reveals that text-only forgetting is related to attention shifts from pre-image to post-image text. Based on this, we construct extra modules that act as boosted learners to compensate for the attention shift. The complementary visual and textual learners, like "wings" on either side, are connected in parallel within each layer's attention block. Initially, image and text inputs are aligned with visual learners operating alongside the main attention, balancing focus on visual elements. Textual learners are later collaboratively integrated, with attention-based routing blending the outputs of the visual and textual learners. We design Low-Rank Residual Attention (LoRRA) to keep the learners efficient.

Our experimental results demonstrate that Wings outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks. On a newly constructed Interleaved Image-Text (IIT) benchmark, Wings exhibits superior performance from text-only-rich to multimodal-rich question-answering tasks.
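
To make the parallel-learner idea above more concrete, here is a minimal, dependency-free sketch. This README does not spell out the LoRRA formulation or the router inputs, so everything below is an illustrative assumption rather than the repository's implementation: the helper names (`low_rank_residual`, `learner`, `route`), the choice to apply the low-rank update to the query and key projections, and the use of per-token router logits are all hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def low_rank_residual(x, down, up):
    """Return x + (x @ down) @ up: a rank-r update on top of the identity (assumed form)."""
    return add(x, matmul(matmul(x, down), up))

def learner(hidden, side, q_factors, k_factors):
    """Hypothetical learner: hidden states attend to side (visual or textual) features."""
    q = low_rank_residual(hidden, *q_factors)
    k = low_rank_residual(side, *k_factors)
    scale = 1.0 / math.sqrt(len(hidden[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q @ k^T
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, side)

def route(main_out, visual_out, textual_out, router_logits):
    """Blend both learner outputs per token and add them to the main attention output."""
    blended = []
    for m, v, t, logits in zip(main_out, visual_out, textual_out, router_logits):
        w_v, w_t = softmax(logits)  # two weights per token that sum to 1
        blended.append([mi + w_v * vi + w_t * ti for mi, vi, ti in zip(m, v, t)])
    return blended
```

In this sketch each learner adds only two thin factor matrices per projection (shapes `d x r` and `r x d`), which is one plausible reading of why a low-rank design keeps the extra modules cheap. The router here simply takes logits as input; how the actual model derives them from attention is not described in this README.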

How to use

Quick start

Environment setup:

conda create --name your_env_name python=3.10
conda activate your_env_name
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
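
After installation, it can be useful to confirm that the pinned PyTorch packages actually resolved to the versions listed above. The following is a small optional check, not part of the repository: the `PINNED` table and the `check_pins` helper are hypothetical names introduced here, and the script relies only on the standard-library `importlib.metadata` module.

```python
from importlib import metadata

# Versions pinned by the pip command above.
PINNED = {"torch": "2.2.0", "torchvision": "0.17.0", "torchaudio": "2.2.0"}

def check_pins(pinned, lookup=metadata.version):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Wheels from the cu121 index may carry a local tag such as "+cu121",
        # so only the part before "+" is compared with the pin.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, matches)
    return report

if __name__ == "__main__":
    for name, (installed, ok) in check_pins(PINNED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: {installed or 'not installed'} [{status}]")
```

Run it inside the activated conda environment; any line marked `MISMATCH` suggests that `requirements.txt` or an earlier install replaced one of the pinned packages.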

Training:

# Stage 1: pretraining
bash run/pretrain_base.sh
# Stage 2: set the path to the pretrained MLLM (the stage 1 output), then fine-tune
bash run/finetune_base.sh

Inference:

# Set the path to the directory containing the trained safetensors
bash run/infer.sh

Citation

If you find Wings useful, please cite the paper:

@article{zhang_wings,
  author       = {Yi{-}Kai Zhang and
                  Shiyin Lu and
                  Yang Li and
                  Yanqing Ma and
                  Qing{-}Guo Chen and
                  Zhao Xu and
                  Weihua Luo and
                  Kaifu Zhang and
                  De{-}Chuan Zhan and
                  Han{-}Jia Ye},
  title        = {Wings: Learning Multimodal LLMs without Text-only Forgetting},
  journal      = {CoRR},
  volume       = {abs/2406.03496},
  year         = {2024}
}

License

This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).

Disclaimer

We used compliance-checking algorithms during training to ensure, to the best of our ability, that the trained model is compliant. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe the model infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.

Acknowledgements

We express our gratitude to the LLaVA project and its contributors. Special thanks to Xudong Lu for fixing the bug in modeling_wings_llama.py.

