
MLLMs Need 3D-Aware Representation Supervision for Scene Understanding

Xiaohu Huang, Jingjing Wu, Qunyi Xie, Kai Han*
Visual AI Lab, The University of Hong Kong & Baidu VIS
* Corresponding author


Overview of our 3DRS framework for 3D-aware representation supervision in MLLMs.


Introduction

Recent advances in Multimodal Large Language Models (MLLMs) have revolutionized multimodal reasoning, yet scene understanding in complex 3D environments remains a challenge. Existing MLLMs, primarily trained on 2D data, lack explicit 3D-aware representations, limiting their effectiveness in spatially grounded tasks.

We propose 3DRS, a general framework that introduces explicit 3D-aware representation supervision into MLLMs using powerful 3D foundation models. By aligning the visual features of MLLMs with rich 3D representations, our method enables stronger geometric and spatial reasoning, bridging the gap between 2D pretraining and real-world 3D scene understanding.
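
For intuition, here is a minimal sketch of what 3D-aware representation supervision can look like: visual tokens from the MLLM are projected into the feature space of a frozen 3D foundation model and aligned with a cosine-similarity loss. The module name, feature dimensions, and loss choice below are illustrative assumptions, not the exact formulation used in the paper.

    # Minimal sketch (not the official implementation): align MLLM visual tokens
    # with frozen 3D foundation-model features via a cosine-distance loss.
    # Dimensions and the projection head are assumptions for illustration.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Rep3DSupervision(nn.Module):
        def __init__(self, mllm_dim: int = 3584, feat3d_dim: int = 1024):
            super().__init__()
            # Small head mapping MLLM visual tokens into the 3D feature space.
            self.proj = nn.Sequential(
                nn.Linear(mllm_dim, feat3d_dim),
                nn.GELU(),
                nn.Linear(feat3d_dim, feat3d_dim),
            )

        def forward(self, mllm_tokens: torch.Tensor, feat3d: torch.Tensor) -> torch.Tensor:
            # mllm_tokens: (B, N, mllm_dim) visual tokens from the MLLM
            # feat3d:      (B, N, feat3d_dim) per-token features from a frozen 3D model
            pred = F.normalize(self.proj(mllm_tokens), dim=-1)
            target = F.normalize(feat3d.detach(), dim=-1)  # 3D model stays frozen
            return (1.0 - (pred * target).sum(dim=-1)).mean()

In such a setup, this alignment loss would be added to the standard language-modeling objective during training.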


News

  • 2025-06-03: We release our paper on arXiv, together with the processed data, training code, and evaluation code.

State-of-the-Art Performance


3DRS achieves state-of-the-art results on ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.


3DRS achieves consistent performance improvements across different MLLMs.


TODO List

  • Release the training code.
  • Release the evaluation script.
  • Release the training data.
  • Release the model checkpoint.

Installation

  1. Clone this repository:
    git clone https://github.com/Visual-AI/3DRS.git
    cd 3DRS
  2. Install dependencies:
    conda create -n 3drs python=3.10
    conda activate 3drs
    pip install --upgrade pip
    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation     # install flash attention
    pip install -e transformers
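
After installation, a quick optional sanity check from Python can confirm that the key packages import correctly. This check is just a convenience and is not part of the official setup.

    # Optional environment check; the packages follow the install steps above.
    import torch
    import flash_attn      # fails here if flash-attn did not build correctly
    import transformers

    print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    print("transformers", transformers.__version__)
    print("flash-attn", flash_attn.__version__)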

Preparing the training data

The processed training data can be downloaded here. After downloading, place it in the data/ folder.

Extracting VGGT features

Download the VGGT model from vggt and place it in the checkpoints folder.

Then run the following command:

python extract_vggt_feature.py

This script extracts the VGGT features into the data/ folder.
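
If you need to adapt the extraction to your own frames, the sketch below shows the general shape of such a script using the public vggt interface (VGGT.from_pretrained and the aggregator call follow the facebookresearch/vggt README). The checkpoint name, scene paths, choice of tokens, and output file naming are assumptions; the provided extract_vggt_feature script remains the authoritative version.

    # Hedged sketch of per-scene VGGT feature extraction. Checkpoint name, paths,
    # token choice, and output naming are assumptions; use the provided script
    # for the official pipeline.
    import os
    import torch
    from vggt.models.vggt import VGGT
    from vggt.utils.load_fn import load_and_preprocess_images

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = VGGT.from_pretrained("facebook/VGGT-1B").to(device).eval()

    scene_id = "scene0000_00"                                # example ScanNet scene
    scene_dir = f"data/scannet/posed_images/{scene_id}"
    out_dir = "data/scannet/posed_images_3d_feature_vggt"    # matches the data tree below
    frames = sorted(
        os.path.join(scene_dir, f) for f in os.listdir(scene_dir) if f.endswith(".jpg")
    )

    images = load_and_preprocess_images(frames).to(device)   # (S, 3, H, W)
    with torch.no_grad():
        # Aggregated multi-view tokens; which tokens/layer 3DRS uses is an assumption.
        aggregated_tokens_list, _ = model.aggregator(images[None])
    feats = aggregated_tokens_list[-1].squeeze(0).cpu()      # last-layer tokens

    os.makedirs(out_dir, exist_ok=True)
    torch.save(feats, os.path.join(out_dir, f"{scene_id}.pt"))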

Model Preparation

The pre-trained LLaVA-Next-Video model can be downloaded from Hugging Face.

Place it under data/models as the LLaVA-Video-7B-Qwen2 folder.
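
One convenient way to do this is with huggingface_hub; the repository id below is an assumption, so use the model id linked from this README if it differs.

    # Hedged example: download the model into the expected folder.
    # The repo_id is an assumption; use the model id linked from this README.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="lmms-lab/LLaVA-Video-7B-Qwen2",
        local_dir="data/models/LLaVA-Video-7B-Qwen2",
    )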

Data Structure

The final data structure should be organized as follows:

data/
├── balanced/
├── benchmark/
├── embodiedscan/
├── metadata/
├── models/
│   └── LLaVA-Video-7B-Qwen2/
├── processed/
└── scannet/
    ├── mask/
    ├── pcd_with_object_aabbs/
    ├── posed_images/
    └── posed_images_3d_feature_vggt/
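
Before launching training, a small check like the one below, which simply mirrors the tree above, can confirm that the layout is in place.

    # Verify that the expected data layout is in place before training.
    import os

    expected = [
        "data/balanced",
        "data/benchmark",
        "data/embodiedscan",
        "data/metadata",
        "data/models/LLaVA-Video-7B-Qwen2",
        "data/processed",
        "data/scannet/mask",
        "data/scannet/pcd_with_object_aabbs",
        "data/scannet/posed_images",
        "data/scannet/posed_images_3d_feature_vggt",
    ]
    missing = [d for d in expected if not os.path.isdir(d)]
    print("All directories present." if not missing else f"Missing: {missing}")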

Run the training and evaluation

sh train_eval.sh

To change the name of an experiment, modify MID_RUN_NAME; keep the name consistent wherever it appears in the train_eval.sh file.

Citation

If you find this work useful, please cite:

@article{huang2025,
  title={MLLMs Need 3D-Aware Representation Supervision for Scene Understanding},
  author={Xiaohu Huang and Jingjing Wu and Qunyi Xie and Kai Han},
  journal={arXiv preprint},
  year={2025}
}
