Self-Visual-RAG

This repository provides an unofficial implementation of the Col-retriever model described in the paper "SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding". The code is adapted from the ColPali repository. The project builds on two base models:

Phi-3-V (Microsoft)
InternVL2 (OpenGVLab)

These models are fine-tuned with LoRA adapters for the document retrieval task.
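
As a rough illustration of what a LoRA adapter does (this is not the repository's training code), the sketch below applies a low-rank update to a frozen weight matrix. The effective weight is W + (alpha / r) * B A, where only the small matrices A and B are trained. The function names, shapes, and values are hypothetical and chosen only for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A).

    w: frozen base weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    alpha: scaling hyperparameter (hypothetical value in the example below)
    """
    r = len(a)               # adapter rank
    delta = matmul(b, a)     # low-rank update, shape (d_out, d_in)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example: rank-1 adapter on a 2x3 weight matrix.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 2.0, 3.0]]
b_zero = [[0.0], [0.0]]   # B is commonly initialised to zero, so training starts from W
b_trained = [[1.0], [0.0]]

unchanged = lora_effective_weight(w, a, b_zero, alpha=2.0)
adapted = lora_effective_weight(w, a, b_trained, alpha=2.0)
```

With B set to zero the adapted weight equals the base weight, which is why a freshly initialised adapter does not change the model's behavior.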

Installation

conda create -n svrag python=3.10 -y
conda activate svrag
pip install -e .
pip install -r requirements.txt

Retrieval Inference

We trained two retrieval models using LoRA: Col-Phi-3-V and Col-InternVL2. You can test either model with the test_retrieval.py script on the demo data slidevqa_dev.json:

python test_retrieval.py --model InternVL2
python test_retrieval.py --model Phi
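
For readers unfamiliar with ColPali-style retrievers, the following is a minimal sketch of late-interaction (MaxSim) scoring, assuming each query and each page is represented by a list of token embedding vectors. It is not the code path used by the test script; the function name and the toy embeddings are hypothetical.

```python
def late_interaction_score(query_emb, page_emb):
    """Sum, over query tokens, of the best dot product with any page token."""
    total = 0.0
    for q in query_emb:
        # Best-matching page token for this query token.
        total += max(sum(qi * pi for qi, pi in zip(q, p)) for p in page_emb)
    return total


# Toy 2-dimensional embeddings: two query tokens, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0]]   # matches both query tokens exactly
page_b = [[0.5, 0.5]]               # matches each query token only partially

score_a = late_interaction_score(query, page_a)
score_b = late_interaction_score(query, page_b)
```

Pages are then ranked by this score, so page_a would be retrieved ahead of page_b in this toy case.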

Self-Visual-RAG Inference

The test_sv_rag.py script demonstrates the full Self-Visual-RAG pipeline on the demo query data:

python test_sv_rag.py --model InternVL2 --k 5
python test_sv_rag.py --model Phi

For the InternVL2 backbone, the vision-language model supports multiple image inputs, so you can use the --k flag to set how many of the top-ranked retrieved images are passed to answer generation. Increasing k significantly increases memory consumption.
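
To make the role of k concrete, here is a small sketch of top-k selection over retrieval scores, assuming one score per page. The helper name and the scores are hypothetical; the actual script may select pages differently.

```python
import heapq


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring pages, best first."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # nlargest handles k larger than the number of pages by returning all of them.
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


page_scores = [0.2, 0.9, 0.5, 0.7]
top_two = select_top_k(page_scores, k=2)
all_ranked = select_top_k(page_scores, k=10)
```

Each selected index corresponds to one page image passed to the generator, which is why memory use grows with k.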

Training

Col-InternVL2-4B model:

torchrun --nproc_per_node=8 --master_port=20001 scripts/train/train_colbert.py train_colInternVL2_4b_model.yaml

Col-Phi-3-V model:

torchrun --nproc_per_node=8 --master_port=20001 scripts/train/train_colbert.py train_colphi_model.yaml

📖 Reference

@inproceedings{chen2025svrag,
  title={{SV}-{RAG}: Lo{RA}-Contextualizing Adaptation of {MLLM}s for Long Document Understanding},
  author={Jian Chen and Ruiyi Zhang and Yufan Zhou and Tong Yu and Franck Dernoncourt and Jiuxiang Gu and Ryan A. Rossi and Changyou Chen and Tong Sun},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=FDaHjwInXO}
}

@article{chen2024lora,
  title={LoRA-Contextualizing Adaptation of Large Multimodal Models for Long Document Understanding},
  author={Chen, Jian and Zhang, Ruiyi and Zhou, Yufan and Yu, Tong and Dernoncourt, Franck and Gu, Jiuxiang and Rossi, Ryan A and Chen, Changyou and Sun, Tong},
  journal={arXiv preprint arXiv:2411.01106},
  year={2024}
}
