CrossOver: 3D Scene Cross-Modal Alignment

Sayan Deb Sarkar1 · Ondrej Miksik2 · Marc Pollefeys2,3 · Dániel Béla Baráth3 · Iro Armeni1

1Stanford University · 2Microsoft Spatial AI Lab · 3ETH Zürich

📃 Abstract

Multi-modal 3D object understanding has gained significant attention, yet current approaches often rely on rigid object-level modality alignment or assume complete data availability across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require paired data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities - RGB images, point clouds, CAD models, floorplans, and text descriptions - without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting CrossOver’s adaptability for real-world applications in 3D scene understanding.

🚀 Features

  • Flexible Scene-Level Alignment 🌐 - Aligns RGB, point clouds, CAD, floorplans, and text at the scene level - no complete paired data needed!
  • Emergent Cross-Modal Behaviors 🤯 - Learns unseen modality pairs (e.g., floorplan ↔ text) without explicit pairwise training.
  • Real-World Applications 🌍 - AR/VR, robotics, construction - handles temporal changes (e.g., object rearrangement) effortlessly.

Table of Contents
  1. Installation
  2. Data
  3. Demo
  4. Training & Inference
  5. Acknowledgements
  6. Citation

📰 News

  • [2025-02] We release the CrossOver codebase and pre-trained checkpoints. The paper is coming out soon; check out our website!

🛠️ Installation

The code has been tested on:

Ubuntu: 22.04 LTS
Python: 3.9.20
CUDA: 12.1
GPU: GeForce RTX 4090/RTX 3090

📦 Setup

Clone the repo and set up the environment as follows:

$ git clone [email protected]:GradientSpaces/CrossOver.git
$ cd CrossOver
$ conda env create -f req.yml
$ conda activate crossover

Next, install MinkowskiEngine, Pointnet2_PyTorch, and GPU kNN (required for the I2P-MAE setup):

$ git clone --recursive "https://github.com/EthenJ/MinkowskiEngine"
$ conda install openblas-devel -c anaconda

# Minkowski Engine
$ cd MinkowskiEngine/ && python setup.py install --blas_include_dirs=${CONDA_PREFIX}/include --force_cuda --blas=openblas

# Pointnet2_PyTorch
$ cd .. && git clone https://github.com/erikwijmans/Pointnet2_PyTorch.git
$ cd Pointnet2_PyTorch && pip install pointnet2_ops_lib/.

# GPU kNN
$ cd .. && pip install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl

Since we use CUDA 12.1, we rely on the MinkowskiEngine fork above; for other CUDA versions, please refer to the official repo.
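
To confirm the environment is set up correctly (an optional sanity check, not part of the official instructions), the compiled extensions should import cleanly:

$ python -c "import torch, MinkowskiEngine; print(torch.cuda.is_available())"
$ python -c "from knn_cuda import KNN; import pointnet2_ops; print('ok')"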

⬇️ Data

See DATA.md for detailed instructions on data download, preparation, and preprocessing. The data used in the current version of CrossOver is listed in the table below:

| Dataset Name | Object Modality | Scene Modality | Object Temporal Information | Scene Temporal Information |
| --- | --- | --- | --- | --- |
| ScanNet | [point, rgb, cad, referral] | [point, rgb, floorplan, referral] | | |
| 3RScan | [point, rgb, referral] | [point, rgb, referral] | | |
| MultiScan | [point, rgb, referral] | [point, rgb, referral] | | |

To run our demo, you only need to download the generated embedding data; there is no need to download the preprocessed data.

📽️ Demo

This demo script lets users process a custom scene and retrieve the closest match from the supported datasets using different modalities. Detailed usage can be found inside the script. Example usage:

$ python demo/demo_scene_retrieval.py

Various configurable parameters:

  • --query_path: Path to the query scene file (e.g., ./example_data/dining_room/scene_cropped.ply).
  • --database_path: Path to the precomputed embeddings of the database scenes downloaded earlier (e.g., ./release_data/embed_scannet.pt).
  • --query_modality: Modality of the query scene. Options: point, rgb, floorplan, referral.
  • --database_modality: Modality used for retrieval (same options as above).
  • --ckpt: Path to the pre-trained scene crossover model checkpoint (details here), e.g., ./checkpoints/scene_crossover_scannet+scan3r.pth.
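
A full invocation combining these flags (using the example paths listed above) might look like:

$ python demo/demo_scene_retrieval.py \
    --query_path ./example_data/dining_room/scene_cropped.ply \
    --query_modality point \
    --database_path ./release_data/embed_scannet.pt \
    --database_modality rgb \
    --ckpt ./checkpoints/scene_crossover_scannet+scan3r.pth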

To download the pre-trained models, refer to the data download and checkpoints sections.

We also provide scripts for inference on a single scan of the supported datasets; see the Single Inference section in TRAIN.md.
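
Under the hood, cross-modal scene retrieval reduces to a nearest-neighbor search in the unified embedding space. The sketch below illustrates that idea with random tensors; the sizes, the embedding dimension, and the structure of the precomputed embedding files are hypothetical here, and the authoritative logic lives in demo/demo_scene_retrieval.py:

import torch
import torch.nn.functional as F

# Hypothetical sizes: 100 database scenes, 512-dim embeddings.
db = F.normalize(torch.randn(100, 512), dim=-1)   # database scene embeddings (one modality)
query = F.normalize(torch.randn(512), dim=-1)     # query scene embedding (possibly a different modality)

# Cosine similarity in the shared space; the best match is the retrieved scene.
scores = db @ query
best = scores.argmax().item()
print(f"closest scene: index {best}, score {scores[best]:.3f}")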

🏋️ Training and Inference

See TRAIN.md for detailed instructions on training and on inference/evaluation with pre-trained checkpoints. The checkpoint inventory is listed below:

Checkpoints

We provide all available checkpoints on G-Drive here. Detailed descriptions are in the tables below:

instance_baseline

| Description | Checkpoint Link |
| --- | --- |
| Instance Baseline trained on 3RScan | 3RScan |
| Instance Baseline trained on ScanNet | ScanNet |
| Instance Baseline trained on ScanNet + 3RScan | ScanNet+3RScan |

instance_crossover

| Description | Checkpoint Link |
| --- | --- |
| Instance CrossOver trained on 3RScan | 3RScan |
| Instance CrossOver trained on ScanNet | ScanNet |
| Instance CrossOver trained on ScanNet + 3RScan | ScanNet+3RScan |
| Instance CrossOver trained on ScanNet + 3RScan + MultiScan | ScanNet+3RScan+MultiScan |

scene_crossover

| Description | Checkpoint Link |
| --- | --- |
| Unified CrossOver trained on ScanNet + 3RScan | ScanNet+3RScan |
| Unified CrossOver trained on ScanNet + 3RScan + MultiScan | ScanNet+3RScan+MultiScan |
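
The checkpoints are standard PyTorch files, so a quick inspection is possible without the full pipeline; note that the key layout below is an assumption, and TRAIN.md describes the supported way to load them:

import torch

# Load on CPU so no GPU is required; the path matches the demo example above.
ckpt = torch.load("./checkpoints/scene_crossover_scannet+scan3r.pth", map_location="cpu")
if isinstance(ckpt, dict):
    # Exact keys depend on how the checkpoint was saved (e.g., a raw state_dict
    # vs. a dict with 'model'/'optimizer' entries); printing them is a safe first step.
    print(list(ckpt.keys()))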

🚧 TODO List

  • Release evaluation on temporal instance matching
  • Release evaluation on single image-based scene retrieval
  • Release inference on single scan cross-modal object retrieval

🙏 Acknowledgements

We thank the authors of 3D-VisTA, SceneVerse, and SceneGraphLoc for open-sourcing their codebases.

📄 Citation

@article{

}
