Official implementation of "Multi-modal CrossViT using 3D spatial information for visual localization" (Multimedia Tools and Applications, 2024) by Junekoo Kang, Mark Mpabulungi, and Hyunki Hong. | Paper | Online |
- Overview
- Architecture
- Requirements
- Project Structure
- Pipeline Steps
- Evaluation
- Performance Highlights
- Citation
- Acknowledgments
- Contact
This research introduces a hierarchical framework for visual localization built around a multi-modal CrossViT architecture. The approach combines image features with 3D spatial information, delivering strong localization and retrieval accuracy with far fewer FLOPs and parameters than prior retrieval models.
- Multi-modal Architecture: Dual-branch CrossViT integrating visual and 3D spatial information
- RoPE-based Encoding: 3D spatial information encoded with Rotary Position Embedding
- Spatial Contrastive Learning: Novel training strategy utilizing shared 3D points and IoU-based similarity metrics
- Knowledge Distillation: Optimized inference through teacher-student model transfer
- Computational Efficiency: Achieves 58.9× fewer FLOPs and 21.6× fewer parameters compared to NetVLAD
The training pipeline implements a dual-branch teacher architecture (a minimal sketch follows this list):
- Image Branch: visual feature extraction with CrossViT
- Spatial Branch: 3D spatial information encoded with RoPE
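The cross-branch exchange can be pictured roughly as below, where each branch attends to the other branch's tokens. This is a minimal sketch in plain PyTorch (assuming PyTorch ≥ 1.9 for `batch_first` attention); the class, dimensions, and tensor names are illustrative, and the actual model is defined in `models/crossvit_PE_RT_official_MultiModal.py`.

```python
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    """Cross-branch fusion: image tokens attend to spatial tokens and vice versa."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_spatial = nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, spatial_tokens: torch.Tensor):
        # Image branch queries the spatial branch ...
        img_update, _ = self.attn_img(self.norm_img(img_tokens),
                                      spatial_tokens, spatial_tokens)
        # ... and the spatial branch queries the image branch.
        spatial_update, _ = self.attn_spatial(self.norm_spatial(spatial_tokens),
                                              img_tokens, img_tokens)
        return img_tokens + img_update, spatial_tokens + spatial_update

# Toy usage: a batch of 2 images, 197 tokens per branch, embedding dim 256.
block = DualBranchBlock()
img = torch.randn(2, 197, 256)
spatial = torch.randn(2, 197, 256)
img, spatial = block(img, spatial)
```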
The inference model is a lightweight student network that preserves the teacher's spatial awareness through knowledge distillation and is optimized for single-image, real-world deployment.
conda create -n spatial_contrastive_learning python=3.8
conda activate spatial_contrastive_learning
pip install -r requirements.txt
torch>=1.7.0
hloc
opencv-python
numpy
pandas
scikit-learn
tensorboard
tqdm
├── datasets/
│ └── aachen/ # Aachen Day-Night dataset
├── models/
│ ├── crossvit_official.py # Base CrossViT
│ └── crossvit_PE_RT_official_MultiModal.py # Multi-modal CrossViT
├── DataBase/ # Preprocessed data
├── pipeline_sfm_visuallocalization.ipynb # SfM pipeline
├── preprocessing.py # Point cloud processing
├── generate_RoPE_embeddings.ipynb # RoPE encoding
├── train_multimodal_crossvit_teacher.py # Teacher model training
├── train_knowledge_distillation_student.py # Student model training
├── generate_localization_pairs.py # Pair generation
└── final_pipeline.ipynb # End-to-end pipeline
# SfM and preprocessing
jupyter notebook pipeline_sfm_visuallocalization.ipynb
python preprocessing.py
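`preprocessing.py` prepares the per-image 3D point data used later for spatial contrastive learning. As a rough illustration of the IoU-based similarity between two database images described in the overview, the snippet below scores the overlap of their observed 3D point IDs; the function and variable names are hypothetical and not taken from the script.

```python
def point_iou(points_a: set, points_b: set) -> float:
    """IoU of the sets of 3D point IDs observed in two database images."""
    union = points_a | points_b
    if not union:
        return 0.0
    return len(points_a & points_b) / len(union)

# Example with COLMAP-style point3D IDs per image.
points_img1 = {101, 102, 103, 104}
points_img2 = {103, 104, 105}
similarity = point_iou(points_img1, points_img2)  # 0.4, used as a pair-similarity signal
```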
# RoPE embeddings generation
jupyter notebook generate_RoPE_embeddings.ipynb
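`generate_RoPE_embeddings.ipynb` encodes 3D spatial information with Rotary Position Embedding (RoPE). The sketch below shows one plausible way to rotate token features by angles derived from their 3D coordinates, splitting the channels evenly across the x/y/z axes; this channel split and all names are assumptions rather than the notebook's exact scheme.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive channel pairs of x by angles derived from pos.

    x:   (..., dim) features, dim must be even
    pos: (...,) scalar positions (one coordinate axis)
    """
    dim = x.shape[-1]
    freqs = base ** (-torch.arange(0, dim, 2, dtype=x.dtype) / dim)   # (dim/2,)
    angles = pos[..., None] * freqs                                    # (..., dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
    """Apply RoPE independently per axis to three equal channel groups."""
    d = x.shape[-1] // 3
    parts = [rope_1d(x[..., i * d:(i + 1) * d], xyz[..., i]) for i in range(3)]
    return torch.cat(parts, dim=-1)

# Toy usage: 197 tokens with 192 channels (64 per axis) and per-token 3D coordinates.
tokens = torch.randn(2, 197, 192)
coords = torch.rand(2, 197, 3) * 10.0
encoded = rope_3d(tokens, coords)
```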
python train_multimodal_crossvit_teacher.py
Implements:
- Dual-branch architecture (Multi-modal CrossViT)
- Spatial contrastive learning
- Hard negative sampling (both are sketched after this list)
- Knowledge Distillation
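A simplified, hedged sketch of the spatial contrastive objective and hard negative mining is given below: pairwise IoU of shared 3D points serves as a soft similarity target, and visually similar pairs with negligible 3D overlap are kept as hard negatives. The exact loss and sampling strategy live in `train_multimodal_crossvit_teacher.py`; the formulation, threshold, and temperature here are illustrative only.

```python
import torch
import torch.nn.functional as F

def spatial_contrastive_loss(emb: torch.Tensor, iou: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """emb: (B, D) image embeddings; iou: (B, B) pairwise IoU of shared 3D points."""
    b = emb.size(0)
    eye = torch.eye(b, dtype=torch.bool, device=emb.device)
    emb = F.normalize(emb, dim=-1)
    logits = (emb @ emb.t() / temperature).masked_fill(eye, -1e9)   # drop self-pairs
    targets = iou.masked_fill(eye, 0.0)
    targets = targets / targets.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def hard_negative_indices(emb: torch.Tensor, iou: torch.Tensor,
                          k: int = 5, iou_thresh: float = 0.05) -> torch.Tensor:
    """Visually similar pairs that share (almost) no 3D points are kept as hard negatives."""
    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.t()
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    sim = sim.masked_fill(eye | (iou > iou_thresh), -1.0)
    return sim.topk(k, dim=-1).indices

# Toy usage: 8 embeddings of dim 128 and a random symmetric IoU matrix.
emb = torch.randn(8, 128)
iou = torch.rand(8, 8); iou = (iou + iou.t()) / 2
loss = spatial_contrastive_loss(emb, iou)
negatives = hard_negative_indices(emb, iou)
```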
python train_knowledge_distillation_student.py
Features:
- Gaussian kernel-based embedding transfer (sketched below)
- Single-image inference optimization
- Spatial awareness preservation
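The Gaussian kernel-based transfer can be pictured as matching the student's pairwise kernel matrix to the frozen teacher's, as sketched below. The bandwidth `sigma` and the MSE matching criterion are assumptions made for illustration; the actual objective is implemented in `train_knowledge_distillation_student.py`.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(emb: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Pairwise Gaussian (RBF) kernel matrix over a batch of embeddings."""
    dist_sq = torch.cdist(emb, emb).pow(2)
    return torch.exp(-dist_sq / (2.0 * sigma ** 2))

def kernel_distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor,
                             sigma: float = 1.0) -> torch.Tensor:
    """Train the student so its kernel matrix matches the frozen teacher's."""
    with torch.no_grad():
        k_teacher = gaussian_kernel(teacher_emb, sigma)
    k_student = gaussian_kernel(student_emb, sigma)
    return F.mse_loss(k_student, k_teacher)

# Toy usage: teacher embeddings come from the multi-modal model, student from images only.
teacher = torch.randn(8, 128)
student = torch.randn(8, 128, requires_grad=True)
loss = kernel_distillation_loss(student, teacher)
loss.backward()
```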
python generate_localization_pairs.py
jupyter notebook final_pipeline.ipynb
Pipeline components:
- Image retrieval with the student model
- Local feature matching
- PnP-RANSAC pose estimation (see the sketch after this list)
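The sketch below outlines the retrieval and pose-estimation stages under simple assumptions: nearest-neighbour search over the student embeddings with NumPy, then pose from 2D-3D matches with OpenCV's `cv2.solvePnPRansac`. The helper names, reprojection threshold, and iteration count are illustrative; in the repo these steps are orchestrated by `final_pipeline.ipynb` on top of the hloc toolchain.

```python
import numpy as np
import cv2

def retrieve_top_k(query_emb: np.ndarray, db_embs: np.ndarray, k: int = 20) -> np.ndarray:
    """Indices of the k database images most similar to the query (cosine similarity)."""
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    return np.argsort(db @ q)[::-1][:k]

def estimate_pose(points_3d: np.ndarray, points_2d: np.ndarray, K: np.ndarray):
    """6-DoF query pose from 2D-3D correspondences via PnP inside a RANSAC loop."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64), points_2d.astype(np.float64), K, None,
        reprojectionError=8.0, iterationsCount=5000)
    return (rvec, tvec, inliers) if ok else None
```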
Localization accuracy on the Aachen Day-Night dataset at standard pose-error thresholds (translation, rotation):

| Condition | (0.25m, 2°) | (0.5m, 5°) | (5m, 10°) |
|---|---|---|---|
| Daytime | 87.3% | 95.0% | 97.6% |
| Nighttime | 87.8% | 89.8% | 95.9% |
Retrieval precision at top-k (P@k):

| Models | P@200 | P@150 | P@100 | P@50 | P@20 | P@5 | P@1 |
|---|---|---|---|---|---|---|---|
| Ours | 0.8209 | 0.7727 | 0.7206 | 0.6889 | 0.7383 | 0.8368 | 0.8976 |
| NetVLAD | 0.4611 | 0.4427 | 0.4257 | 0.4529 | 0.5980 | 0.8219 | 0.9425 |
Computational cost:

| Models | FLOPs (G) | Parameters (M) |
|---|---|---|
| NetVLAD | 94.3 | 148.9 |
| AP-GEM | 86.2 | 105.3 |
| CRN | 94.3 | 148.9 |
| SARE | 94.3 | 148.9 |
| HAF | 1791.2 | 158.8 |
| Patch-NetVLAD | 94.2 | 148.7 |
| Ours | 1.6 | 6.9 |
@article{kang2024multi,
title={Multi-modal {CrossViT} using {3D} spatial information for visual localization},
author={Kang, Junekoo and Mpabulungi, Mark and Hong, Hyunki},
journal={Multimedia Tools and Applications},
year={2024},
publisher={Springer},
doi={10.1007/s11042-024-20382-w}
}
- HLOC for the hierarchical localization framework
- CrossViT for the transformer architecture
- Aachen Day-Night dataset for evaluation
For questions or issues:
- Junekoo Kang ([email protected]) or ([email protected])
- Google Drive: download link