Visual AI Lab, The University of Hong Kong & Baidu VIS
* Corresponding author
Overview of our 3DRS framework for 3D-aware representation supervision in MLLMs.
Recent advances in Multimodal Large Language Models (MLLMs) have revolutionized multimodal reasoning, yet scene understanding in complex 3D environments remains a challenge. Existing MLLMs, primarily trained on 2D data, lack explicit 3D-aware representation, limiting their effectiveness in spatially-grounded tasks.
We propose 3DRS, a general framework that introduces explicit 3D-aware representation supervision into MLLMs using powerful 3D foundation models. By aligning the visual features of MLLMs with rich 3D representations, our method enables stronger geometric and spatial reasoning, bridging the gap between 2D pretraining and real-world 3D scene understanding.
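The supervision can be summarized as a feature-alignment objective between the MLLM's visual tokens and features from a frozen 3D foundation model. Below is a minimal sketch of such a loss, assuming per-patch MLLM visual tokens and spatially corresponding 3D features; the projection head, feature dimensions, and cosine-similarity formulation are illustrative assumptions, not the exact implementation in this repo.

```python
# Minimal sketch of 3D-aware representation supervision (illustrative, not the official code).
# Assumptions: `mllm_visual_tokens` are per-patch features from the MLLM's vision pathway,
# and `features_3d` are per-patch features from a frozen 3D foundation model (e.g. VGGT),
# spatially aligned so that token i corresponds to 3D feature i.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignmentHead(nn.Module):
    """Projects MLLM visual tokens into the 3D feature space before alignment."""

    def __init__(self, mllm_dim: int, feat3d_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(mllm_dim, feat3d_dim),
            nn.GELU(),
            nn.Linear(feat3d_dim, feat3d_dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(tokens)


def alignment_loss(mllm_visual_tokens: torch.Tensor,
                   features_3d: torch.Tensor,
                   head: AlignmentHead) -> torch.Tensor:
    """Cosine-similarity alignment between projected MLLM tokens and frozen 3D features."""
    pred = F.normalize(head(mllm_visual_tokens), dim=-1)   # (B, N, D)
    target = F.normalize(features_3d.detach(), dim=-1)     # 3D model is frozen: no gradient
    return (1.0 - (pred * target).sum(dim=-1)).mean()

# Usage: add a weighted alignment_loss(...) term to the standard language-modeling loss.
```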
- 2025-06-03: We release our paper on arXiv, along with the processed data, training code, and evaluation code.
3DRS achieves state-of-the-art results on ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
3DRS achieves consistent performance improvement on different MLLMs.
- Release the training code.
- Release the evaluation script.
- Release the training data.
- Release the model checkpoint.
- Clone this repository:
```bash
git clone https://github.com/Visual-AI/3DRS.git
cd 3DRS
```
- Install dependencies:
```bash
conda create -n 3drs python=3.10
conda activate 3drs
pip install --upgrade pip
pip install -e ".[train]"
pip install flash-attn --no-build-isolation  # install flash attention
pip install -e transformers
```
The processed training data is available here. Download it and place it in the `data/` folder.
Download the VGGT model from vggt and place it in the `checkpoints/` folder.
Afterwards, run:

```bash
python extract_vggt_feature
```

This script extracts the VGGT features into the `data/` folder.
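For reference, this step conceptually loops over the posed images of each ScanNet scene, runs the frozen 3D foundation model, and caches the resulting features alongside the other scene data. The sketch below only illustrates that flow; the loader `load_vggt`, the output filename, and the feature layout are assumptions, so consult `extract_vggt_feature` for the actual interface.

```python
# Illustrative sketch of the feature-caching step (assumed structure, not the official script).
import glob
import os

import torch
from PIL import Image
from torchvision import transforms

POSED_IMAGES_DIR = "data/scannet/posed_images"
OUTPUT_DIR = "data/scannet/posed_images_3d_feature_vggt"

to_tensor = transforms.ToTensor()


@torch.no_grad()
def extract_scene(model, scene_dir: str, out_dir: str, device: str = "cuda") -> None:
    """Run the frozen 3D foundation model on every frame of one scene and cache the features."""
    os.makedirs(out_dir, exist_ok=True)
    frames = sorted(glob.glob(os.path.join(scene_dir, "*.jpg")))
    images = torch.stack([to_tensor(Image.open(f).convert("RGB")) for f in frames]).to(device)
    feats = model(images)                    # assumed output: (num_frames, tokens, dim) features
    torch.save(feats.cpu(), os.path.join(out_dir, "features.pt"))


# model = load_vggt("checkpoints/vggt")      # hypothetical loader for the downloaded checkpoint
# for scene in sorted(os.listdir(POSED_IMAGES_DIR)):
#     extract_scene(model, os.path.join(POSED_IMAGES_DIR, scene),
#                   os.path.join(OUTPUT_DIR, scene))
```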
The pre-trained LLaVA-Next-Video model can be downloaded from Hugging Face. Place it in `data/models` as the `LLaVA-Video-7B-Qwen2` folder.
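If you prefer to script the download, the snippet below uses `huggingface_hub`; the repo id `lmms-lab/LLaVA-Video-7B-Qwen2` is the commonly used Hugging Face identifier for this model and is an assumption here, so adjust it if the project uses a different mirror.

```python
# Download the pre-trained MLLM into the expected local folder (repo id is an assumption).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lmms-lab/LLaVA-Video-7B-Qwen2",
    local_dir="data/models/LLaVA-Video-7B-Qwen2",
)
```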
The final data structure should be organized as follows:
```
data/
├── balanced/
├── benchmark/
├── embodiedscan/
├── metadata/
├── models/
│   └── LLaVA-Video-7B-Qwen2/
├── processed/
└── scannet/
    ├── mask/
    ├── pcd_with_object_aabbs/
    ├── posed_images/
    └── posed_images_3d_feature_vggt/
```
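As a quick sanity check before training, you can verify that the expected folders are in place; the short script below simply mirrors the layout above and is only a convenience, not part of the release.

```python
# Quick check that the data layout above is in place before launching training.
import os

EXPECTED = [
    "data/balanced",
    "data/benchmark",
    "data/embodiedscan",
    "data/metadata",
    "data/models/LLaVA-Video-7B-Qwen2",
    "data/processed",
    "data/scannet/mask",
    "data/scannet/pcd_with_object_aabbs",
    "data/scannet/posed_images",
    "data/scannet/posed_images_3d_feature_vggt",
]

missing = [p for p in EXPECTED if not os.path.isdir(p)]
print("All expected folders found." if not missing else f"Missing: {missing}")
```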
To start training and evaluation, run:

```bash
sh train_eval.sh
```
You can modify `MID_RUN_NAME` to change the experiment name; it should match the name used in the `train_eval.sh` file.
If you find this work useful, please cite:
@article{huang2025,
  title={MLLMs Need 3D-Aware Representation Supervision for Scene Understanding},
  author={Xiaohu Huang and Jingjing Wu and Qunyi Xie and Kai Han},
  journal={arXiv preprint},
  year={2025}
}