Commit

release code
waveleaf27 committed Jul 29, 2024
1 parent 9cf830d commit f23c63e
Showing 154 changed files with 24,083 additions and 9,086 deletions.
11 changes: 9 additions & 2 deletions .gitignore
@@ -121,6 +121,9 @@ data
*.pkl.json
*.log.json
work_dirs/
ckpts/
ckpts
work_dirs
work_dirs*/
exps/
*~
@@ -130,19 +130,23 @@ exps/

# demo
*.jpg
*.png
.png
*.obj
*.ply
*.pt




tmp.py
*tmp*
build/
data/
output/
work_dirs


result.txt
*.csv
projects/configs/vis_weight*
example.txt
important.txt
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "projects/SAN"]
path = projects/SAN
url = https://github.com/waveleaf27/SAN.git
43 changes: 43 additions & 0 deletions README.md
@@ -0,0 +1,43 @@
# [ECCV'24] CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

Zhangchen Ye<sup>1*</sup>, Tao Jiang<sup>1,2*</sup>, Chenfeng Xu<sup>3</sup>, Yiming Li<sup>4</sup>, and Hang Zhao<sup>1,2,5&#x2709;</sup>

<sup>1</sup>IIIS, Tsinghua University <sup>2</sup>Shanghai AI Lab <sup>3</sup>UC Berkeley <sup>4</sup>New York University <sup>5</sup>Shanghai Qi Zhi Institute

![network](./docs/figs/network.png)

## News
- [2024/07/29]: Code Released.

- [2024/07/04]: Our paper has been accepted by [ECCV2024](https://eccv2024.ecva.net/).

## Abstract
Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to improve the accuracy of 3D occupancy predictions. By sampling points along the line of sight of each voxel and integrating the features of these points from historical frames, we construct a cost volume feature map that refines current volume features for improved prediction outcomes. Our method takes advantage of parallax cues from historical observations and employs a data-driven approach to learn the cost volume. We validate the effectiveness of CVT-Occ through rigorous experiments on the Occ3D-Waymo dataset, where it outperforms state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost.
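
The core idea can be summarized in a small PyTorch-style sketch (illustrative only, not the actual implementation in this repository; the tensor names, shapes, and the computation of the sampling grids are assumptions):

```python
import torch
import torch.nn.functional as F

def build_cost_volume(curr_feat, hist_feats, sample_grids):
    """Illustrative sketch of cost-volume temporal fusion.

    curr_feat:    (B, C, D, H, W) voxel features of the current frame
    hist_feats:   list of (B, C, D, H, W) voxel features from historical frames
    sample_grids: list of (B, D, H, W, 3) normalized sampling grids mapping each
                  current voxel to its location in a historical frame (computed
                  from ego motion, i.e. the geometric correspondence of voxels)
    """
    warped = []
    for feat, grid in zip(hist_feats, sample_grids):
        # Trilinearly sample historical features at the corresponding locations.
        warped.append(F.grid_sample(feat, grid, align_corners=False))
    # Concatenate current and warped historical features into a cost volume;
    # a small data-driven network then reduces it to refine curr_feat.
    return torch.cat([curr_feat] + warped, dim=1)  # (B, C * (T + 1), D, H, W)
```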

## Get Started
- [install.md](docs/install.md)
- [dataset.md](docs/dataset.md)
- [trainval.md](docs/trainval.md)

## Model Zoo

All models can be downloaded from [HERE](https://drive.google.com/drive/folders/1m3r4VrjY1G8N-h7NVyUDc0oOI5g8f7nT?usp=sharing).

### Occ3D-Waymo

| Method | mIoU | Go | Vehicle | Pedestrian | Sign | Bicyclist | Traffic Light | Pole | Cons. Cone | Bicycle | Building | Vegetation | Tree Trunk | Road | Walkable |
|---------------------|-------|-------|---------|------------|-------|-----------|---------------|-------|------------|---------|----------|------------|------------|-------|----------|
| [BEVFormer-w/o TSA](projects/configs/cvtocc/bevformer_wotsa_waymo.py) | 23.87 | **7.50** | 34.54 | 21.07 | 9.69 | **20.96** | 11.48 | 11.48 | 14.06 | 14.51 | 23.14 | 21.82 | 8.57 | 78.45 | 56.89 |
| [BEVFormer](projects/configs/cvtocc/bevformer_waymo.py) | 24.58 | 7.18 | 36.06 | 21.00 | 9.76 | 20.23 | 12.61 | 14.52 | 14.70 | 16.06 | 23.98 | 22.50 | 9.39 | 79.11 | 57.04 |
| [SOLOFusion](projects/configs/cvtocc/solofusion_waymo.py) | 24.73 | 4.97 | 32.45 | 18.28 | 10.33 | 17.14 | 8.07 | 17.83 | 16.23 | 19.30 | **31.49**| **28.98** | **16.93** | 70.95 | 53.28 |
| [BEVFormer-WrapConcat](projects/configs/cvtocc/bevformer_wrapconcat_waymo.py) | 25.07 | 6.20 | 36.17 | 20.95 | 9.56 | 20.58 | **12.82** | 16.24 | 14.31 | 16.78 | 25.14 | 23.56 | 12.81 | 79.04 | 56.83 |
| [CVT-Occ (ours)](projects/configs/cvtocc/cvtocc_waymo.py) | **27.37** | 7.44 | **41.00** | **23.93** | **11.92** | 20.81 | 12.07 | **18.03** | **16.88**| **21.37**| 29.40 | 27.42 | 14.67 | **79.12** | **59.09** |

### Occ3D-NuScenes

| Method | mIoU | others | barrier | bicycle | bus | car | Cons. vehicle | motorcycle | pedestrian | traffic cone | trailer | truck | Dri. Sur | other flat | sidewalk | terrain | manmade | vegetation |
|---------------------|-------|--------|---------|---------|------|------|---------------|------------|------------|--------------|---------|-------|----------|------------|----------|---------|---------|------------|
| [BEVFormer-w/o TSA](projects/configs/cvtocc/bevformer_wotsa_nuscenes.py) | 38.05 | 9.11 | 45.68 | 22.61 | 46.19 | 52.97 | 20.27 | 26.5 | 26.8 | 26.21 | 32.29 | 37.58 | 80.5 | 40.6 | 49.93 | 52.48 | 41.59 | 35.51 |
| [BEVFormer](projects/configs/cvtocc/bevformer_nuscenes.py) | 39.04 | **9.57** | 47.13 | 22.52 | 47.61 | 54.14| 20.39 | 26.44 | 28.12 | 27.46 | 34.53 | 39.69 | 81.44 | **41.14** | 50.79 | 54.00 | 43.08 | 35.60 |
| [CVT-Occ (ours)](projects/configs/cvtocc/cvtocc_nuscenes.py) | **40.34**| 9.45 | **49.46** | **23.57** | **49.18** | **55.63**| **23.1** | **27.85** | **28.88** | **29.07** | **34.97** | **40.98** | **81.44** | 40.92 | **51.37** | **54.25** | **45.94** | **39.71** |

173 changes: 173 additions & 0 deletions docs/dataset.md
@@ -0,0 +1,173 @@
# Preparing Dataset

## Occ3D-Waymo

| Type | Info |
| :----: | :----: |
| train | 798 scenes|
| val | 202 scenes|
| Time Span | 20s |
| Frame | 200 per scene|
| Time Interval | 0.1s|
| cameras | 5 |
| voxel size | [0.4m, 0.4m, 0.4m] |
| range | [-40m, -40m, -1m, 40m, 40m, 5.4m] |
| volume size | [200, 200, 16] |
| classes label | 0 - 14, 23 |

- sensor:

  - 5 cameras: front (`image_0`), front left (`image_1`), front right (`image_2`), side left (`image_3`), side right (`image_4`); the corresponding data folder is given in brackets. Note that the pose info index does not match the image data folder; the `get_data_info` function in `waymo_temporal_zlt.py` contains code that fixes this mismatch.

  - Image sizes: `image_0`-`image_2` are 1280x1920 and `image_3`-`image_4` are 886x1920; all images are resized and padded to 640x960.

- coordinate:

  - The whole dataset's coordinate systems follow the right-hand rule.

  - The global coordinate system: the up (z) axis is aligned with gravity and points upward; east (x) points due east along the line of latitude, and north (y) points toward the North Pole.

  - The vehicle coordinate system moves with the car, with the x-axis pointing forward, the y-axis pointing left, and the z-axis pointing up.

  - Sensor coordinates are obtained from vehicle coordinates via a rotation matrix, which can be viewed as the extrinsic (external-parameter) matrix.

- Voxel semantics for each sample frame are given as `[semantics]` in `labels.npz`. Please note that there is a slight difference between the Occupancy classes and the classes used in the [Waymo LiDAR segmentation](https://github.com/waymo-research/waymo-open-dataset/blob/bae19fa0a36664da18b691349955b95b29402713/waymo_open_dataset/protos/segmentation.proto#L20).

- The dataset contains 15 classes. The definition of classes from 0 to 14 is `TYPE_GENERALOBJECT, TYPE_VEHICLE, TYPE_PEDESTRIAN, TYPE_SIGN, TYPE_CYCLIST, TYPE_TRAFFIC_LIGHT, TYPE_POLE, TYPE_CONSTRUCTION_CONE, TYPE_BICYCLE, TYPE_MOTORCYCLE, TYPE_BUILDING, TYPE_VEGETATION, TYPE_TREE_TRUNK, TYPE_ROAD, TYPE_WALKABLE`.

- Label 15 represents voxels that are not occupied by anything and is named `free`. Note that `free` is stored as label `23` in the ground-truth files and is converted to `15` in the dataloader.
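
A minimal sketch of this `free`-label remapping (the authoritative conversion lives in the dataloader; the array here is the `[voxel_label]` field described later in this document):

```python
import numpy as np

FREE_LABEL_RAW = 23   # value stored for free voxels in the ground-truth files
FREE_LABEL = 15       # value used for free voxels during training/evaluation

def remap_free_label(voxel_label: np.ndarray) -> np.ndarray:
    """Convert the raw free label (23) to the training label (15)."""
    voxel_label = voxel_label.copy()
    voxel_label[voxel_label == FREE_LABEL_RAW] = FREE_LABEL
    return voxel_label
```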

**1. Prepare Waymo dataset**

Download Waymo v1.3.1 full dataset from [Waymo website](https://waymo.com/open/download/).

**2. Prepare 3D Occupancy ground truth**

Download the ground truth with voxel size 0.4m, the annotation files (`waymo_infos_{train, val}.pkl`), and the pose files (`cam_infos.pkl` and `cam_infos_vali.pkl`) we provide [HERE](https://drive.google.com/drive/folders/13WxRl9Zb_AshEwvD96Uwz8cHjRNrtfQk), and organize your folder structure as below:

```
└── Occ3D-Waymo
├── waymo_infos_train.pkl
├── waymo_infos_val.pkl
├── cam_infos.pkl
├── cam_infos_vali.pkl
├── training
| ├── 000
| | ├── 000_04.npz
| | ├── 001_04.npz
| | ├── 002_04.npz
| | └── ...
| |
| ├── 001
| | ├── 000_04.npz
| | └── ...
| ├── ...
| |
| └── 797
| ├── 000_04.npz
| └── ...
|
├── validation
| ├── 000
| | ├── 000_04.npz
| | └── ...
| ├── ...
| |
| └── 201
| ├── 000_04.npz
| └── ...
```

- `training` and `validation` contain the data for each scene. Each scene includes the corresponding ground truth for each frame.

- `*.npz` contains `[voxel_label]`, `[origin_voxel_state]`, `[final_voxel_state]`, and `[infov]` for each frame (see the loading sketch after this list).

- `[voxel_label]`: semantic ground truth.

- `[origin_voxel_state]`: lidar mask.

- `[final_voxel_state]`: camera mask. Since we focus on a vision-centric task, we provide a binary voxel mask `[mask_camera]`, indicating whether the voxels are observed or not in the current camera view.

- `[infov]`: infov mask. Since Waymo only has 5 cameras and does not provide a 360-degree surround view, we additionally provide `[mask_fov]`.

- `*_04.npz` represents the data with a voxel size of 0.4m.
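
A loading sketch for the fields above (treat the mask semantics — e.g. which value means "observed" — as assumptions and defer to the dataloader):

```python
import numpy as np

gt = np.load('Occ3D-Waymo/training/000/000_04.npz')
voxel_label = gt['voxel_label']          # (200, 200, 16) semantic labels, free voxels = 23
lidar_mask = gt['origin_voxel_state']    # lidar visibility mask
camera_mask = gt['final_voxel_state']    # camera visibility mask
infov_mask = gt['infov']                 # field-of-view mask (5-camera coverage)

# e.g. score only voxels that are inside the camera FOV and observed by a camera
valid = (camera_mask > 0) & (infov_mask > 0)
print(voxel_label.shape, int(valid.sum()))
```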

## Occ3D-NuScenes

| Type | Info |
| :----: | :----: |
| train | 600 scenes|
| val | 150 scenes|
| Time Span | 20s |
| Frame | 40 per scene |
| Time Interval | 0.5s|
| cameras | 6 |
| voxel size | [0.4m, 0.4m, 0.4m] |
| range | [-40m, -40m, -1m, 40m, 40m, 5.4m] |
| volume size | [200, 200, 16]|
| classes | 0 - 17 |

- sensor:

- 6 cameras. Front, Front Right, Front Left, Back, Back Right, Back Left.

- size of image: 1600x900

- The dataset contains 18 classes. The definition of classes from 0 to 16 is the same as in the [nuScenes-lidarseg](https://github.com/nutonomy/nuscenes-devkit/blob/fcc41628d41060b3c1a86928751e5a571d2fc2fa/python-sdk/nuscenes/eval/lidarseg/README.md) dataset. Label 17 represents `free`. Voxel semantics for each sample frame are given as `[semantics]` in `labels.npz`.

**1. Prepare NuScenes dataset**

Download the nuScenes v1.0 full dataset and the CAN bus data from the [nuScenes website](https://www.nuscenes.org/download), and organize the folder structure as below:

```
cvtocc
├── project code/
├── data/
│ ├── can_bus/
│ ├── occ3d-nus/
│ │ ├── maps/
│ │ ├── samples/
| | | ├── CAM_BACK
| | | | ├── n015-2018-07-18-11-07-57+0800__CAM_BACK__1531883530437525.jpg
| | | | └── ...
| | | ├── CAM_BACK_LEFT
| | | | ├── n015-2018-07-18-11-07-57+0800__CAM_BACK_LEFT__1531883530447423.jpg
| | | | └── ...
| | | └── ...
│ │ ├── v1.0-trainval
```

- samples/ contains images captured by various cameras.

**2. Prepare 3D Occupancy ground truth**

Download the `gts` and `annotations.json` we provide [HERE](https://drive.google.com/drive/folders/1Xarc91cNCNN3h8Vum-REbI-f0UlSf5Fc) and organize your folder structure as below:

```
cvtocc
├── data/
│ ├── can_bus/
│ ├── occ3d-nus/
│ │ ├── maps/
│ │ ├── samples/
│ │ ├── v1.0-trainval/
│ │ ├── gts/
| | | ├── [scene_name]
| | | | ├── [frame_token]
| | | | | └── labels.npz
| | | | └── ...
| | | └── ...
│ │ └── annotations.json
```

- `gts/` contains the ground truth for each sample. `[scene_name]` specifies a sequence of frames, and `[frame_token]` specifies a single frame in that sequence. `labels.npz` contains `[semantics]`, `[mask_lidar]`, and `[mask_camera]` for each frame (see the sketch after this list).

- `annotations.json` contains the meta information of the dataset.
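
A reading sketch for one frame (the scene name and frame token in the path are placeholders, and the boolean interpretation of the masks is an assumption):

```python
import numpy as np

frame = np.load('data/occ3d-nus/gts/[scene_name]/[frame_token]/labels.npz')
semantics = frame['semantics']      # (200, 200, 16) class labels, 17 = free
mask_lidar = frame['mask_lidar']    # voxels observed by lidar
mask_camera = frame['mask_camera']  # voxels observed by the 6 cameras

# For the vision-centric setting, only camera-visible voxels are evaluated.
visible = semantics[mask_camera.astype(bool)]
print(semantics.shape, visible.shape)
```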

**3. Generate the info files for training and validation:**

```shell
python tools/create_data.py occ --root-path ./data/occ3d-nus --out-dir ./data/occ3d-nus --extra-tag occ --version v1.0 --canbus ./data --occ-path ./data/occ3d-nus
```

Running the above command generates the info files `data/occ3d-nus/occ_infos_temporal_{train, val}.pkl`.
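
A quick way to sanity-check the generated files (assuming the usual mmdet3d info format with `metadata` and `infos` keys — adjust if the keys differ):

```python
import pickle

with open('data/occ3d-nus/occ_infos_temporal_train.pkl', 'rb') as f:
    data = pickle.load(f)

print(data['metadata'])                  # dataset version info
print(len(data['infos']))                # number of samples
print(sorted(data['infos'][0].keys()))   # per-frame fields (image paths, poses, token, ...)
```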
Binary file added docs/figs/network.png
45 changes: 45 additions & 0 deletions docs/install.md
@@ -0,0 +1,45 @@
# Installation instructions

Follow the mmdetection3d installation guide: https://mmdetection3d.readthedocs.io/en/v0.17.1/getting_started.html#installation

**a. Create a conda virtual environment and activate it.**
```shell
conda create -n cvtocc python=3.8 -y
conda activate cvtocc
```

**b. Install PyTorch, torchvision and torchaudio following the [official instructions](https://pytorch.org/).**
```shell
pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
```

**c. Install mmcv-full.**
```shell
pip install mmcv-full==1.4.0
```

**d. Install mmdet and mmseg.**
```shell
pip install mmdet==2.22.0
pip install mmsegmentation==0.14.1
```

**e. Install mmdet3d from source code.**
```shell
git clone https://github.com/open-mmlab/mmdetection3d.git
cd mmdetection3d
git checkout v0.17.1
python setup.py install
```

**f. Clone cvtocc.**
```shell
git clone git@github.com:Tsinghua-MARS-Lab/CVT-Occ.git
```

**g. Prepare pretrained models.**
```shell
cd cvtocc
mkdir ckpts

cd ckpts && wget https://github.com/zhiqi-li/storage/releases/download/v1.0/r101_dcn_fcos3d_pretrain.pth
```
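
After the steps above, you can quickly sanity-check the environment (an illustrative snippet, not a script shipped with the repo):

```python
import torch, mmcv, mmdet, mmseg, mmdet3d

print('torch     :', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('mmcv-full :', mmcv.__version__)
print('mmdet     :', mmdet.__version__)
print('mmseg     :', mmseg.__version__)
print('mmdet3d   :', mmdet3d.__version__)
```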
28 changes: 28 additions & 0 deletions docs/trainval.md
@@ -0,0 +1,28 @@
# Training & Evaluation

## Training
### Single GPU
```sh
./tools/dist_train.sh projects/configs/cvtocc/cvtocc_waymo.py 1
```

### Single machine
```sh
./tools/dist_train.sh projects/configs/cvtocc/cvtocc_waymo.py 8
```

## Evaluation
### Single GPU
```sh
./tools/dist_test.sh projects/configs/cvtocc/cvtocc_waymo.py work_dirs/cvtocc_waymo/latest.pth 1 --eval mIoU
```
### Single machine
```sh
./tools/dist_test.sh projects/configs/cvtocc/cvtocc_waymo.py work_dirs/cvtocc_waymo/latest.pth 8 --eval mIoU
```

## Saving results
### Single machine
```sh
./tools/dist_test.sh projects/configs/cvtocc/cvtocc_waymo.py work_dirs/cvtocc_waymo/latest.pth 8 --out work_dirs/cvtocc_waymo/results.pkl
```
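
To inspect the saved predictions afterward, something like the following should work (the exact structure of `results.pkl` depends on the test pipeline, so treat this as a hedged sketch):

```python
import pickle

with open('work_dirs/cvtocc_waymo/results.pkl', 'rb') as f:
    results = pickle.load(f)

print(type(results), len(results))   # typically a list with one entry per evaluated sample
print(type(results[0]))              # per-sample prediction (e.g. an occupancy array or a dict)
```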