Commit f23c63e (1 parent: 9cf830d): 154 changed files with 24,083 additions and 9,086 deletions.

[submodule "projects/SAN"]
    path = projects/SAN
    url = https://github.com/waveleaf27/SAN.git

# [ECCV'24] CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

Zhangchen Ye<sup>1*</sup>, Tao Jiang<sup>1,2*</sup>, Chenfeng Xu<sup>3</sup>, Yiming Li<sup>4</sup>, and Hang Zhao<sup>1,2,5✉</sup>

<sup>1</sup>IIIS, Tsinghua University <sup>2</sup>Shanghai AI Lab <sup>3</sup>UC Berkeley <sup>4</sup>New York University <sup>5</sup>Shanghai Qi Zhi Institute



## News

- [2024/07/29]: Code released.
- [2024/07/04]: Our paper was accepted by [ECCV 2024](https://eccv2024.ecva.net/).

## Abstract

Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to improve the accuracy of 3D occupancy predictions. By sampling points along the line of sight of each voxel and integrating the features of these points from historical frames, we construct a cost volume feature map that refines current volume features for improved prediction outcomes. Our method takes advantage of parallax cues from historical observations and employs a data-driven approach to learn the cost volume. We validate the effectiveness of CVT-Occ through rigorous experiments on the Occ3D-Waymo dataset, where it outperforms state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost.

## Get Started

- [install.md](docs/install.md)
- [dataset.md](docs/dataset.md)
- [trainval.md](docs/trainval.md)

## Model Zoo

All models can be downloaded from [HERE](https://drive.google.com/drive/folders/1m3r4VrjY1G8N-h7NVyUDc0oOI5g8f7nT?usp=sharing).

### Occ3D-Waymo

| Method | mIoU | Go | Vehicle | Pedestrian | Sign | Bicyclist | Traffic Light | Pole | Cons. Cone | Bicycle | Building | Vegetation | Tree Trunk | Road | Walkable |
|---------------------|-------|-------|---------|------------|-------|-----------|---------------|-------|------------|---------|----------|------------|------------|-------|----------|
| [BEVFormer-w/o TSA](projects/configs/cvtocc/bevformer_wotsa_waymo.py) | 23.87 | **7.50** | 34.54 | 21.07 | 9.69 | **20.96** | 11.48 | 11.48 | 14.06 | 14.51 | 23.14 | 21.82 | 8.57 | 78.45 | 56.89 |
| [BEVFormer](projects/configs/cvtocc/bevformer_waymo.py) | 24.58 | 7.18 | 36.06 | 21.00 | 9.76 | 20.23 | 12.61 | 14.52 | 14.70 | 16.06 | 23.98 | 22.50 | 9.39 | 79.11 | 57.04 |
| [SOLOFusion](projects/configs/cvtocc/solofusion_waymo.py) | 24.73 | 4.97 | 32.45 | 18.28 | 10.33 | 17.14 | 8.07 | 17.83 | 16.23 | 19.30 | **31.49** | **28.98** | **16.93** | 70.95 | 53.28 |
| [BEVFormer-WrapConcat](projects/configs/cvtocc/bevformer_wrapconcat_waymo.py) | 25.07 | 6.20 | 36.17 | 20.95 | 9.56 | 20.58 | **12.82** | 16.24 | 14.31 | 16.78 | 25.14 | 23.56 | 12.81 | 79.04 | 56.83 |
| [CVT-Occ (ours)](projects/configs/cvtocc/cvtocc_waymo.py) | **27.37** | 7.44 | **41.00** | **23.93** | **11.92** | 20.81 | 12.07 | **18.03** | **16.88** | **21.37** | 29.40 | 27.42 | 14.67 | **79.12** | **59.09** |

### Occ3D-NuScenes

| Method | mIoU | others | barrier | bicycle | bus | car | Cons. vehicle | motorcycle | pedestrian | traffic cone | trailer | truck | Dri. Sur | other flat | sidewalk | terrain | manmade | vegetation |
|---------------------|-------|--------|---------|---------|------|------|---------------|------------|------------|--------------|---------|-------|----------|------------|----------|---------|---------|------------|
| [BEVFormer-w/o TSA](projects/configs/cvtocc/bevformer_wotsa_nuscenes.py) | 38.05 | 9.11 | 45.68 | 22.61 | 46.19 | 52.97 | 20.27 | 26.5 | 26.8 | 26.21 | 32.29 | 37.58 | 80.5 | 40.6 | 49.93 | 52.48 | 41.59 | 35.51 |
| [BEVFormer](projects/configs/cvtocc/bevformer_nuscenes.py) | 39.04 | **9.57** | 47.13 | 22.52 | 47.61 | 54.14 | 20.39 | 26.44 | 28.12 | 27.46 | 34.53 | 39.69 | 81.44 | **41.14** | 50.79 | 54.00 | 43.08 | 35.60 |
| [CVT-Occ (ours)](projects/configs/cvtocc/cvtocc_nuscenes.py) | **40.34** | 9.45 | **49.46** | **23.57** | **49.18** | **55.63** | **23.1** | **27.85** | **28.88** | **29.07** | **34.97** | **40.98** | **81.44** | 40.92 | **51.37** | **54.25** | **45.94** | **39.71** |

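The sampling idea from the abstract can be illustrated with a toy sketch. This is not the repository's implementation: the function name and the sampling fractions below are illustrative assumptions. For each voxel, a few points are taken along its line of sight toward the camera; in the full method, features gathered at these points from historical frames are stacked into a cost volume that refines the current volume features.

```python
import numpy as np

def sample_along_sight(voxel_centers, cam_origin, num_points=4):
    """For each voxel center, sample points along the line of sight
    toward the camera origin (toy sketch of cost-volume sampling)."""
    # Fractions in (0, 1]: 1.0 is the voxel itself; smaller values move toward the camera.
    fracs = np.linspace(1.0, 0.25, num_points).reshape(1, num_points, 1)
    rays = voxel_centers[:, None, :] - cam_origin[None, None, :]  # (N, 1, 3)
    return cam_origin + fracs * rays                              # (N, K, 3)

voxels = np.array([[4.0, 0.0, 0.0], [0.0, 8.0, 0.0]])
points = sample_along_sight(voxels, cam_origin=np.zeros(3))
print(points.shape)  # (2, 4, 3)
```
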
# Preparing Dataset

## Occ3D-Waymo

| Type | Info |
| :----: | :----: |
| train | 798 scenes |
| val | 202 scenes |
| Time Span | 20s |
| Frame | 200 per scene |
| Time Interval | 0.1s |
| cameras | 5 |
| voxel size | [0.4m, 0.4m, 0.4m] |
| range | [-40m, -40m, -1m, 40m, 40m, 5.4m] |
| volume size | [200, 200, 16] |
| classes label | 0 - 14, 23 |

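As a quick sanity check, the volume size in the table follows directly from the range and the voxel size:

```python
# range is [x_min, y_min, z_min, x_max, y_max, z_max] in meters; voxels are 0.4 m cubes
pc_range = [-40.0, -40.0, -1.0, 40.0, 40.0, 5.4]
voxel_size = 0.4

volume = [round((pc_range[i + 3] - pc_range[i]) / voxel_size) for i in range(3)]
print(volume)  # [200, 200, 16]
```
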
- sensor:

    - 5 cameras: front (image_0), front left (image_1), front right (image_2), side left (image_3), side right (image_4); the corresponding data folder is given in parentheses. Note that the pose info index does not match the image data files; the `get_data_info` function in `waymo_temporal_zlt.py` contains a workaround for this mismatch.

    - Image sizes: image_0 to image_2 are 1280x1920; image_3 and image_4 are 886x1920. All images are resized and padded to 640x960.

- coordinate:

    - The whole dataset obeys the `right-hand rule`.

    - Global coordinate system: the up (z) axis is aligned with the direction of gravity, positive upward; east (x) points due east along the line of latitude, and north (y) points toward the North Pole.

    - Vehicle coordinate system: moves with the car, with the x-axis pointing forward, the y-axis pointing left, and the z-axis pointing up.

    - Sensor coordinates are obtained from vehicle coordinates via a rotation matrix, which can be viewed as an extrinsic matrix.

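The vehicle-to-sensor mapping described above can be sketched as follows, assuming the extrinsic is available as a 4x4 homogeneous sensor-to-vehicle matrix; the storage format in the actual pose files may differ, and the toy extrinsic below is purely illustrative:

```python
import numpy as np

def vehicle_to_sensor(points_vehicle, extrinsic):
    """Map Nx3 points from vehicle to sensor coordinates.
    `extrinsic` is assumed to be the 4x4 sensor-to-vehicle transform."""
    homo = np.hstack([points_vehicle, np.ones((len(points_vehicle), 1))])
    return (np.linalg.inv(extrinsic) @ homo.T).T[:, :3]

# Toy extrinsic: sensor mounted 2 m ahead of the vehicle origin, no rotation.
extrinsic = np.eye(4)
extrinsic[0, 3] = 2.0
pts = vehicle_to_sensor(np.array([[3.0, 0.0, 0.0]]), extrinsic)
print(pts)  # [[1. 0. 0.]]
```
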
- Voxel semantics for each sample frame are given as `[semantics]` in `labels.npz`. Please note that there is a slight difference between the occupancy classes and the classes used in the [Waymo LiDAR segmentation](https://github.com/waymo-research/waymo-open-dataset/blob/bae19fa0a36664da18b691349955b95b29402713/waymo_open_dataset/protos/segmentation.proto#L20).

    - The dataset contains 15 classes. Classes 0 to 14 are defined as `TYPE_GENERALOBJECT, TYPE_VEHICLE, TYPE_PEDESTRIAN, TYPE_SIGN, TYPE_CYCLIST, TYPE_TRAFFIC_LIGHT, TYPE_POLE, TYPE_CONSTRUCTION_CONE, TYPE_BICYCLE, TYPE_MOTORCYCLE, TYPE_BUILDING, TYPE_VEGETATION, TYPE_TREE_TRUNK, TYPE_ROAD, TYPE_WALKABLE`.

    - Label 15 denotes voxels that are not occupied by anything and is named `free`. In the ground-truth files, `free` is stored as label `23`; the dataloader converts it to `15`.

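The label conversion above can be sketched in a few lines (an illustration, not the dataloader's exact code):

```python
import numpy as np

FREE_GT = 23     # `free` label in the ground-truth files
FREE_TRAIN = 15  # `free` label used at training time

def remap_free(semantics):
    """Convert the ground-truth `free` label 23 to 15."""
    out = semantics.copy()
    out[out == FREE_GT] = FREE_TRAIN
    return out

labels = np.array([0, 14, 23, 23, 7])
print(remap_free(labels))  # [ 0 14 15 15  7]
```
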
**1. Prepare Waymo dataset**

Download the Waymo v1.3.1 full dataset from the [Waymo website](https://waymo.com/open/download/).

**2. Prepare 3D Occupancy ground truth**

Download the ground truth with voxel size 0.4m, the annotation files (`waymo_infos_{train, val}.pkl`), and the pose files (`cam_infos.pkl` and `cam_infos_vali.pkl`) we provide [HERE](https://drive.google.com/drive/folders/13WxRl9Zb_AshEwvD96Uwz8cHjRNrtfQk), and organize your folder structure as below:

```
└── Occ3D-Waymo
    ├── waymo_infos_train.pkl
    ├── waymo_infos_val.pkl
    ├── cam_infos.pkl
    ├── cam_infos_vali.pkl
    ├── training
    |   ├── 000
    |   |   ├── 000_04.npz
    |   |   ├── 001_04.npz
    |   |   ├── 002_04.npz
    |   |   └── ...
    |   |
    |   ├── 001
    |   |   ├── 000_04.npz
    |   |   └── ...
    |   ├── ...
    |   |
    |   └── 797
    |       ├── 000_04.npz
    |       └── ...
    |
    ├── validation
    |   ├── 000
    |   |   ├── 000_04.npz
    |   |   └── ...
    |   ├── ...
    |   |
    |   └── 201
    |       ├── 000_04.npz
    |       └── ...
```

- `training` and `validation` contain the data for each scene; each scene directory holds the ground truth for every frame.

- Each `*.npz` contains `[voxel_label]`, `[origin_voxel_state]`, `[final_voxel_state]`, and `[infov]` for one frame.

    - `[voxel_label]`: semantic ground truth.

    - `[origin_voxel_state]`: lidar mask.

    - `[final_voxel_state]`: camera mask. Since we focus on a vision-centric task, we provide a binary voxel mask `[mask_camera]` indicating whether each voxel is observed in the current camera view.

    - `[infov]`: field-of-view mask. Since Waymo has only 5 cameras and does not provide a 360-degree surround view, we additionally provide `[mask_fov]`.

- `*_04.npz` denotes data with a voxel size of 0.4m.

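Reading one ground-truth file can be sketched as below. The key names come from the list above; the arrays are dummies with the documented 200x200x16 volume shape, and their dtypes are an assumption made for illustration:

```python
import numpy as np

shape = (200, 200, 16)
# Build a dummy frame with the documented keys (dtypes assumed, for illustration).
np.savez(
    "000_04.npz",
    voxel_label=np.full(shape, 23, dtype=np.uint8),      # semantic ground truth
    origin_voxel_state=np.zeros(shape, dtype=np.uint8),  # lidar mask
    final_voxel_state=np.zeros(shape, dtype=np.uint8),   # camera mask
    infov=np.ones(shape, dtype=bool),                    # field-of-view mask
)

data = np.load("000_04.npz")
print(sorted(data.files))         # ['final_voxel_state', 'infov', 'origin_voxel_state', 'voxel_label']
print(data["voxel_label"].shape)  # (200, 200, 16)
```
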
## Occ3D-NuScenes

| Type | Info |
| :----: | :----: |
| train | 600 scenes |
| val | 150 scenes |
| Time Span | 20s |
| Frame | 40 per scene |
| Time Interval | 0.5s |
| cameras | 6 |
| voxel size | [0.4m, 0.4m, 0.4m] |
| range | [-40m, -40m, -1m, 40m, 40m, 5.4m] |
| volume size | [200, 200, 16] |
| classes | 0 - 17 |

- sensor:

    - 6 cameras: front, front right, front left, back, back right, back left.

    - image size: 1600x900.

- The dataset contains 18 classes. Classes 0 to 16 are defined as in the [nuScenes-lidarseg](https://github.com/nutonomy/nuscenes-devkit/blob/fcc41628d41060b3c1a86928751e5a571d2fc2fa/python-sdk/nuscenes/eval/lidarseg/README.md) dataset, and label 17 represents `free`. Voxel semantics for each sample frame are given as `[semantics]` in `labels.npz`.

**1. Prepare NuScenes dataset**

Download the nuScenes v1.0 full dataset and the CAN bus data from the [NuScenes website](https://www.nuscenes.org/download), then organize the folder structure as below:

```
cvtocc
├── project code/
├── data/
│   ├── can_bus/
│   ├── occ3d-nus/
│   │   ├── maps/
│   │   ├── samples/
│   │   │   ├── CAM_BACK
│   │   │   │   ├── n015-2018-07-18-11-07-57+0800__CAM_BACK__1531883530437525.jpg
│   │   │   │   └── ...
│   │   │   ├── CAM_BACK_LEFT
│   │   │   │   ├── n015-2018-07-18-11-07-57+0800__CAM_BACK_LEFT__1531883530447423.jpg
│   │   │   │   └── ...
│   │   │   └── ...
│   │   ├── v1.0-trainval
```

- `samples/` contains the images captured by the various cameras.

**2. Prepare 3D Occupancy ground truth**

Download the `gts` and `annotations.json` we provide [HERE](https://drive.google.com/drive/folders/1Xarc91cNCNN3h8Vum-REbI-f0UlSf5Fc) and organize your folder structure as below:

```
cvtocc
├── data/
│   ├── can_bus/
│   ├── occ3d-nus/
│   │   ├── maps/
│   │   ├── samples/
│   │   ├── v1.0-trainval/
│   │   ├── gts/
│   │   │   ├── [scene_name]
│   │   │   │   ├── [frame_token]
│   │   │   │   │   └── labels.npz
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── annotations.json
```

- `gts/` contains the ground truth for each sample. `[scene_name]` specifies a sequence of frames, and `[frame_token]` specifies a single frame in that sequence. `labels.npz` contains `[semantics]`, `[mask_lidar]`, and `[mask_camera]` for each frame.

- `annotations.json` contains the meta information of the dataset.

**3. Generate the info files for training and validation:**

```shell
python tools/create_data.py occ --root-path ./data/occ3d-nus --out-dir ./data/occ3d-nus --extra-tag occ --version v1.0 --canbus ./data --occ-path ./data/occ3d-nus
```

Running the above command generates `data/occ3d-nus/occ_infos_temporal_{train, val}.pkl`.

# Installation instructions

Follow the [mmdetection3d v0.17.1 installation guide](https://mmdetection3d.readthedocs.io/en/v0.17.1/getting_started.html#installation).

**a. Create a conda virtual environment and activate it.**
```shell
conda create -n cvtocc python=3.8 -y
conda activate cvtocc
```

**b. Install PyTorch, torchvision and torchaudio following the [official instructions](https://pytorch.org/).**
```shell
pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
```

**c. Install mmcv-full.**
```shell
pip install mmcv-full==1.4.0
```

**d. Install mmdet and mmseg.**
```shell
pip install mmdet==2.22.0
pip install mmsegmentation==0.14.1
```

**e. Install mmdet3d from source.**
```shell
git clone https://github.com/open-mmlab/mmdetection3d.git
cd mmdetection3d
git checkout v0.17.1
python setup.py install
```

**f. Clone cvtocc.**
```shell
git clone git@github.com:Tsinghua-MARS-Lab/CVT-Occ.git
```

**g. Prepare pretrained models.**
```shell
cd cvtocc
mkdir ckpts

cd ckpts && wget https://github.com/zhiqi-li/storage/releases/download/v1.0/r101_dcn_fcos3d_pretrain.pth
```

# Training & Evaluation

## training

### single GPU
```sh
./tools/dist_train.sh projects/configs/cvtocc/cvtocc_waymo.py 1
```

### single machine
```sh
./tools/dist_train.sh projects/configs/cvtocc/cvtocc_waymo.py 8
```

## evaluation

### single GPU
```sh
./tools/dist_test.sh projects/configs/cvtocc/cvtocc_waymo.py work_dirs/cvtocc_waymo/latest.pth 1 --eval mIoU
```

### single machine
```sh
./tools/dist_test.sh projects/configs/cvtocc/cvtocc_waymo.py work_dirs/cvtocc_waymo/latest.pth 8 --eval mIoU
```

## save results

### single machine
```sh
./tools/dist_test.sh projects/configs/cvtocc/cvtocc_waymo.py work_dirs/cvtocc_waymo/latest.pth 8 --out work_dirs/cvtocc_waymo/results.pkl
```
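
The saved `results.pkl` can then be inspected with standard `pickle`. This is a sketch: the structure of the pickled object depends on the test script (typically a list with one entry per sample), so the dummy file below only stands in for the real `work_dirs/cvtocc_waymo/results.pkl`:

```python
import pickle

# Stand-in for work_dirs/cvtocc_waymo/results.pkl (contents assumed).
with open("results.pkl", "wb") as f:
    pickle.dump([{"sample": 0}, {"sample": 1}], f)

with open("results.pkl", "rb") as f:
    results = pickle.load(f)
print(len(results))  # 2
```
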