Sayan Deb Sarkar¹ · Ondrej Miksik² · Marc Pollefeys²,³ · Dániel Béla Baráth³ · Iro Armeni¹
¹Stanford University · ²Microsoft Spatial AI Lab · ³ETH Zürich
Multi-modal 3D object understanding has gained significant attention, yet current approaches often rely on rigid object-level modality alignment or assume complete data availability across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require paired data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities - RGB images, point clouds, CAD models, floorplans, and text descriptions - without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting CrossOver’s adaptability for real-world applications in 3D scene understanding.
- Flexible Scene-Level Alignment 🌐 - Aligns RGB, point clouds, CAD, floorplans, and text at the scene level - no perfect data needed!
- Emergent Cross-Modal Behaviors 🤯 - Learns unseen modality pairs (e.g., floorplan ↔ text) without explicit pairwise training.
- Real-World Applications 🌍 - AR/VR, robotics, construction - handles temporal changes (e.g., object rearrangement) effortlessly.
[2025-02] We release the CrossOver codebase and pre-trained checkpoints. Paper coming out soon, check out our website!
The code has been tested on:
Ubuntu: 22.04 LTS
Python: 3.9.20
CUDA: 12.1
GPU: GeForce RTX 4090/RTX 3090
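As an optional check before installing, you can compare your local versions against the ones above (exact output will differ per machine):

$ python --version    # tested with Python 3.9.20
$ nvcc --version      # tested with CUDA 12.1
$ nvidia-smi          # lists the GPU and driver version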
Clone the repo and set it up as follows:
$ git clone git@github.com:GradientSpaces/CrossOver.git
$ cd CrossOver
$ conda env create -f req.yml
$ conda activate crossover
Further installation is required for MinkowskiEngine, Pointnet2_PyTorch, and GPU kNN (for the I2P-MAE setup). Set up as follows:
$ git clone --recursive "https://github.com/EthenJ/MinkowskiEngine"
$ conda install openblas-devel -c anaconda
# Minkowski Engine
$ cd MinkowskiEngine/ && python setup.py install --blas_include_dirs=${CONDA_PREFIX}/include --force_cuda --blas=openblas
# Pointnet2_PyTorch
$ cd .. && git clone https://github.com/erikwijmans/Pointnet2_PyTorch.git
$ cd Pointnet2_PyTorch && pip install pointnet2_ops_lib/.
# GPU kNN
$ cd .. && pip install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl
Since we use CUDA 12.1, we use the above MinkowskiEngine fork; for other CUDA drivers, please refer to the official repo.
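Once these are installed, an optional sanity check (not part of the official setup) is to confirm that PyTorch sees the GPU and that the compiled extensions import cleanly:

$ python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
$ python -c "import MinkowskiEngine as ME; print('MinkowskiEngine', ME.__version__)"
$ python -c "from pointnet2_ops import pointnet2_utils; from knn_cuda import KNN; print('pointnet2_ops and KNN_CUDA OK')"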
See DATA.MD for detailed instructions on data download, preparation and preprocessing. We list the available data used in the current version of CrossOver in the table below:
| Dataset Name | Object Modality | Scene Modality | Object Temporal Information | Scene Temporal Information |
|---|---|---|---|---|
| ScanNet | [point, rgb, cad, referral] | [point, rgb, floorplan, referral] | ❌ | ✅ |
| 3RScan | [point, rgb, referral] | [point, rgb, referral] | ✅ | ✅ |
| MultiScan | [point, rgb, referral] | [point, rgb, referral] | ❌ | ✅ |
To run our demo, you only need to download the generated embedding data; there is no need to download the preprocessed data.
This demo script allows users to process a custom scene and retrieve the closest match from the supported datasets using different modalities. Detailed usage can be found inside the script. Example usage below:
$ python demo/demo_scene_retrieval.py
Various configurable parameters:

- `--query_path`: Path to the query scene file (e.g., `./example_data/dining_room/scene_cropped.ply`).
- `--database_path`: Path to the precomputed embeddings of the database scenes downloaded before (e.g., `./release_data/embed_scannet.pt`).
- `--query_modality`: Modality of the query scene. Options: `point`, `rgb`, `floorplan`, `referral`.
- `--database_modality`: Modality used for retrieval (same options as above).
- `--ckpt`: Path to the pre-trained scene crossover model checkpoint (details here), example path: `./checkpoints/scene_crossover_scannet+scan3r.pth`.
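As a concrete illustration, a point-cloud query against the ScanNet referral (text) embeddings could be invoked as below; the flag values come from the options listed above, so adjust the paths to wherever you placed the downloaded files:

$ python demo/demo_scene_retrieval.py \
    --query_path ./example_data/dining_room/scene_cropped.ply \
    --database_path ./release_data/embed_scannet.pt \
    --query_modality point \
    --database_modality referral \
    --ckpt ./checkpoints/scene_crossover_scannet+scan3r.pth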
For pre-trained model downloads, refer to the data download and checkpoints sections.
We also provide scripts for inference on a single scan of the supported datasets. Details are in the Single Inference section of TRAIN.md.
See TRAIN.md for the inventory of available checkpoints and detailed instructions on training and inference/evaluation with pre-trained checkpoints. The checkpoint inventory is listed below:
We provide all available checkpoints on G-Drive here. Detailed descriptions are in the tables below:
| Description | Checkpoint Link |
|---|---|
| Instance Baseline trained on 3RScan | 3RScan |
| Instance Baseline trained on ScanNet | ScanNet |
| Instance Baseline trained on ScanNet + 3RScan | ScanNet+3RScan |

| Description | Checkpoint Link |
|---|---|
| Instance CrossOver trained on 3RScan | 3RScan |
| Instance CrossOver trained on ScanNet | ScanNet |
| Instance CrossOver trained on ScanNet + 3RScan | ScanNet+3RScan |
| Instance CrossOver trained on ScanNet + 3RScan + MultiScan | ScanNet+3RScan+MultiScan |

| Description | Checkpoint Link |
|---|---|
| Unified CrossOver trained on ScanNet + 3RScan | ScanNet+3RScan |
| Unified CrossOver trained on ScanNet + 3RScan + MultiScan | ScanNet+3RScan+MultiScan |
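The demo above looks for the checkpoint under `./checkpoints/` (see the `--ckpt` example path). After downloading from the Drive links, one possible layout is shown below; the source filename of the download is an assumption, so adjust it to whatever the Drive file is actually called:

$ mkdir -p checkpoints
$ mv ~/Downloads/scene_crossover_scannet+scan3r.pth checkpoints/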
- Release evaluation on temporal instance matching
- Release evaluation on single image-based scene retrieval
- Release inference on single scan cross-modal object retrieval
We thank the authors of 3D-VisTA, SceneVerse, and SceneGraphLoc for open-sourcing their codebases.
@article{
}