Visual AI Lab, The University of Hong Kong & Baidu VIS
* Corresponding author
Overview of our 3DRS framework for 3D-aware representation supervision in MLLMs.
Recent advances in Multimodal Large Language Models (MLLMs) have revolutionized multimodal reasoning, yet scene understanding in complex 3D environments remains a challenge. Existing MLLMs, primarily trained on 2D data, lack explicit 3D-aware representation, limiting their effectiveness in spatially-grounded tasks.
We propose 3DRS, a general framework that introduces explicit 3D-aware representation supervision into MLLMs using powerful 3D foundation models. By aligning the visual features of MLLMs with rich 3D representations, our method enables stronger geometric and spatial reasoning, bridging the gap between 2D pretraining and real-world 3D scene understanding.
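The supervision can be summarized as a feature-alignment objective between the MLLM's visual tokens and features from a frozen 3D foundation model. Below is a minimal sketch of such a loss, assuming per-patch MLLM visual tokens and spatially corresponding 3D features; the projection head, feature dimensions, and cosine-similarity formulation are illustrative assumptions, not the exact implementation in this repo.

```python
# Minimal sketch of 3D-aware representation supervision (illustrative, not the official code).
# Assumptions: `mllm_visual_tokens` are per-patch features from the MLLM's vision pathway,
# and `features_3d` are per-patch features from a frozen 3D foundation model (e.g. VGGT),
# spatially aligned so that token i corresponds to 3D feature i.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignmentHead(nn.Module):
    """Projects MLLM visual tokens into the 3D feature space before alignment."""

    def __init__(self, mllm_dim: int, feat3d_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(mllm_dim, feat3d_dim),
            nn.GELU(),
            nn.Linear(feat3d_dim, feat3d_dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(tokens)


def alignment_loss(mllm_visual_tokens: torch.Tensor,
                   features_3d: torch.Tensor,
                   head: AlignmentHead) -> torch.Tensor:
    """Cosine-similarity alignment between projected MLLM tokens and frozen 3D features."""
    pred = F.normalize(head(mllm_visual_tokens), dim=-1)   # (B, N, D)
    target = F.normalize(features_3d.detach(), dim=-1)     # 3D model is frozen: no gradient
    return (1.0 - (pred * target).sum(dim=-1)).mean()

# Usage: add a weighted alignment_loss(...) term to the standard language-modeling loss.
```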
- 2025-06-03: We release our paper on arXiv, along with the processed data, training code, and evaluation code.
3DRS achieves state-of-the-art results on ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
3DRS achieves consistent performance improvement on different MLLMs.
- Release the training code.
- Release the evaluation script.
- Release the training data.
- Release the model checkpoint.
- Clone this repository:
```bash
git clone https://github.com/Visual-AI/3DRS.git
cd 3DRS
```
- Install dependencies:
```bash
conda create -n 3drs python=3.10
conda activate 3drs
pip install --upgrade pip
pip install -e ".[train]"
pip install flash-attn --no-build-isolation  # install flash attention
pip install -e transformers
```
The processed training data is available here. Download it and place it in the `data/` folder.
Download the VGGT model from vggt and place it in the `checkpoints/` folder.
Afterwards, run:

```bash
python extract_vggt_feature
```

This script extracts the VGGT features into the `data/` folder.
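For reference, this step conceptually loops over the posed images of each ScanNet scene, runs the frozen 3D foundation model, and caches the resulting features alongside the other scene data. The sketch below only illustrates that flow; the loader `load_vggt`, the output filename, and the feature layout are assumptions, so consult `extract_vggt_feature` for the actual interface.

```python
# Illustrative sketch of the feature-caching step (assumed structure, not the official script).
import glob
import os

import torch
from PIL import Image
from torchvision import transforms

POSED_IMAGES_DIR = "data/scannet/posed_images"
OUTPUT_DIR = "data/scannet/posed_images_3d_feature_vggt"

to_tensor = transforms.ToTensor()


@torch.no_grad()
def extract_scene(model, scene_dir: str, out_dir: str, device: str = "cuda") -> None:
    """Run the frozen 3D foundation model on every frame of one scene and cache the features."""
    os.makedirs(out_dir, exist_ok=True)
    frames = sorted(glob.glob(os.path.join(scene_dir, "*.jpg")))
    images = torch.stack([to_tensor(Image.open(f).convert("RGB")) for f in frames]).to(device)
    feats = model(images)                    # assumed output: (num_frames, tokens, dim) features
    torch.save(feats.cpu(), os.path.join(out_dir, "features.pt"))


# model = load_vggt("checkpoints/vggt")      # hypothetical loader for the downloaded checkpoint
# for scene in sorted(os.listdir(POSED_IMAGES_DIR)):
#     extract_scene(model, os.path.join(POSED_IMAGES_DIR, scene),
#                   os.path.join(OUTPUT_DIR, scene))
```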
The pre-trained LLaVA-Next-Video model can be downloaded from Hugging Face. Place it in `data/models` as the `LLaVA-Video-7B-Qwen2` folder.
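If you prefer to script the download, the snippet below uses `huggingface_hub`; the repo id `lmms-lab/LLaVA-Video-7B-Qwen2` is the commonly used Hugging Face identifier for this model and is an assumption here, so adjust it if the project uses a different mirror.

```python
# Download the pre-trained MLLM into the expected local folder (repo id is an assumption).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lmms-lab/LLaVA-Video-7B-Qwen2",
    local_dir="data/models/LLaVA-Video-7B-Qwen2",
)
```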
The final data structure should be organized as follows:
```
data/
├── balanced/
├── benchmark/
├── embodiedscan/
├── metadata/
├── models/
│   └── LLaVA-Video-7B-Qwen2/
├── processed/
└── scannet/
    ├── mask/
    ├── pcd_with_object_aabbs/
    ├── posed_images/
    └── posed_images_3d_feature_vggt/
```
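As a quick sanity check before training, you can verify that the expected folders are in place; the short script below simply mirrors the layout above and is only a convenience, not part of the release.

```python
# Quick check that the data layout above is in place before launching training.
import os

EXPECTED = [
    "data/balanced",
    "data/benchmark",
    "data/embodiedscan",
    "data/metadata",
    "data/models/LLaVA-Video-7B-Qwen2",
    "data/processed",
    "data/scannet/mask",
    "data/scannet/pcd_with_object_aabbs",
    "data/scannet/posed_images",
    "data/scannet/posed_images_3d_feature_vggt",
]

missing = [p for p in EXPECTED if not os.path.isdir(p)]
print("All expected folders found." if not missing else f"Missing: {missing}")
```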
To start training and evaluation, run:

```bash
sh train_eval.sh
```
You can modify `MID_RUN_NAME` to change the experiment name; it should match the name used in the `train_eval.sh` file.
If you find this work useful, please cite:
@article{huang2025,
  title={MLLMs Need 3D-Aware Representation Supervision for Scene Understanding},
  author={Xiaohu Huang and Jingjing Wu and Qunyi Xie and Kai Han},
  journal={arXiv preprint},
  year={2025}
}