Sayan Deb Sarkar¹ · Ondrej Miksik² · Marc Pollefeys²,³ · Dániel Béla Baráth³ · Iro Armeni¹
¹Stanford University · ²Microsoft Spatial AI Lab · ³ETH Zürich
Multi-modal 3D object understanding has gained significant attention, yet current approaches often rely on rigid object-level modality alignment or assume complete data availability across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require paired data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities - RGB images, point clouds, CAD models, floorplans, and text descriptions - without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting CrossOver’s adaptability for real-world applications in 3D scene understanding.
- Flexible Scene-Level Alignment 🌐 - Aligns RGB, point clouds, CAD, floorplans, and text at the scene level - no perfect data needed!
- Emergent Cross-Modal Behaviors 🤯 - Learns unseen modality pairs (e.g., floorplan ↔ text) without explicit pairwise training.
- Real-World Applications 🌍 - AR/VR, robotics, construction - handles temporal changes (e.g., object rearrangement) effortlessly.
[2025-02] We release the CrossOver codebase and pre-trained checkpoints. Paper coming out soon, check out our website!
The code has been tested on:
Ubuntu: 22.04 LTS
Python: 3.9.20
CUDA: 12.1
GPU: GeForce RTX 4090/RTX 3090
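As an optional check before installing, you can compare your local versions against the ones above (exact output will differ per machine):

$ python --version    # tested with Python 3.9.20
$ nvcc --version      # tested with CUDA 12.1
$ nvidia-smi          # lists the GPU and driver version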
Clone the repo and set it up as follows:
$ git clone git@github.com:GradientSpaces/CrossOver.git
$ cd CrossOver
$ conda env create -f req.yml
$ conda activate crossover
Further installation is required for MinkowskiEngine, Pointnet2_PyTorch, and GPU kNN (for the I2P-MAE setup). Set up as follows:
$ git clone --recursive "https://github.com/EthenJ/MinkowskiEngine"
$ conda install openblas-devel -c anaconda
# Minkowski Engine
$ cd MinkowskiEngine/ && python setup.py install --blas_include_dirs=${CONDA_PREFIX}/include --force_cuda --blas=openblas
# Pointnet2_PyTorch
$ cd .. && git clone https://github.com/erikwijmans/Pointnet2_PyTorch.git
$ cd Pointnet2_PyTorch && pip install pointnet2_ops_lib/.
# GPU kNN
$ cd .. && pip install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl
Since we use CUDA 12.1, we use the above MinkowskiEngine fork; for other CUDA drivers, please refer to the official repo.
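Once these are installed, an optional sanity check (not part of the official setup) is to confirm that PyTorch sees the GPU and that the compiled extensions import cleanly:

$ python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
$ python -c "import MinkowskiEngine as ME; print('MinkowskiEngine', ME.__version__)"
$ python -c "from pointnet2_ops import pointnet2_utils; from knn_cuda import KNN; print('pointnet2_ops and KNN_CUDA OK')"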
See DATA.MD for detailed instructions on data download, preparation and preprocessing. We list the available data used in the current version of CrossOver in the table below:
| Dataset Name | Object Modality | Scene Modality | Object Temporal Information | Scene Temporal Information |
|---|---|---|---|---|
| ScanNet | [point, rgb, cad, referral] | [point, rgb, floorplan, referral] | ❌ | ✅ |
| 3RScan | [point, rgb, referral] | [point, rgb, referral] | ✅ | ✅ |
| MultiScan | [point, rgb, referral] | [point, rgb, referral] | ❌ | ✅ |
To run our demo, you only need to download the generated embedding data; there is no need to download the preprocessed data.
This demo script allows users to process a custom scene and retrieve the closest match from the supported datasets using different modalities. Detailed usage can be found inside the script. Example usage below:
$ python demo/demo_scene_retrieval.py
Various configurable parameters:

- `--query_path`: Path to the query scene file (e.g., `./example_data/dining_room/scene_cropped.ply`).
- `--database_path`: Path to the precomputed embeddings of the database scenes downloaded before (e.g., `./release_data/embed_scannet.pt`).
- `--query_modality`: Modality of the query scene. Options: `point`, `rgb`, `floorplan`, `referral`.
- `--database_modality`: Modality used for retrieval (same options as above).
- `--ckpt`: Path to the pre-trained scene crossover model checkpoint (details here), example path: `./checkpoints/scene_crossover_scannet+scan3r.pth`.
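As a concrete illustration, a point-cloud query against the ScanNet referral (text) embeddings could be invoked as below; the flag values come from the options listed above, so adjust the paths to wherever you placed the downloaded files:

$ python demo/demo_scene_retrieval.py \
    --query_path ./example_data/dining_room/scene_cropped.ply \
    --database_path ./release_data/embed_scannet.pt \
    --query_modality point \
    --database_modality referral \
    --ckpt ./checkpoints/scene_crossover_scannet+scan3r.pth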
For pre-trained model downloads, refer to the data download and checkpoints sections.
We also provide scripts for inference on a single scan of the supported datasets. Details are in the Single Inference section of TRAIN.md.
See TRAIN.md for the inventory of available checkpoints and detailed instructions on training and inference/evaluation with pre-trained checkpoints. The checkpoint inventory is listed below:
We provide all available checkpoints on G-Drive here. Detailed descriptions are in the tables below:
| Description | Checkpoint Link |
|---|---|
| Instance Baseline trained on 3RScan | 3RScan |
| Instance Baseline trained on ScanNet | ScanNet |
| Instance Baseline trained on ScanNet + 3RScan | ScanNet+3RScan |

| Description | Checkpoint Link |
|---|---|
| Instance CrossOver trained on 3RScan | 3RScan |
| Instance CrossOver trained on ScanNet | ScanNet |
| Instance CrossOver trained on ScanNet + 3RScan | ScanNet+3RScan |
| Instance CrossOver trained on ScanNet + 3RScan + MultiScan | ScanNet+3RScan+MultiScan |

| Description | Checkpoint Link |
|---|---|
| Unified CrossOver trained on ScanNet + 3RScan | ScanNet+3RScan |
| Unified CrossOver trained on ScanNet + 3RScan + MultiScan | ScanNet+3RScan+MultiScan |
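The demo above looks for the checkpoint under `./checkpoints/` (see the `--ckpt` example path). After downloading from the Drive links, one possible layout is shown below; the source filename of the download is an assumption, so adjust it to whatever the Drive file is actually called:

$ mkdir -p checkpoints
$ mv ~/Downloads/scene_crossover_scannet+scan3r.pth checkpoints/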
- Release evaluation on temporal instance matching
- Release evaluation on single image-based scene retrieval
- Release inference on single scan cross-modal object retrieval
We thank the authors of 3D-VisTA, SceneVerse, and SceneGraphLoc for open-sourcing their codebases.
@article{
}