This repository contains an end-to-end example of deploying BEVFormer with explicit quantization using NVIDIA's ModelOpt Toolkit. At the end, we report the TensorRT deployment results in terms of runtime and accuracy.
- TensorRT 10.x
- ONNX-Runtime 1.18.x
- onnx-graphsurgeon
- onnxsim
- ModelOpt toolkit 0.15.0
- BEVFormer_tensorrt
Follow the Data Preparation steps for NuScenes and CAN bus. This will prepare the full train / validation dataset.
Build docker image:
$ export TAG=tensorrt_bevformer:24.08
$ docker build -f docker/tensorrt.Dockerfile --no-cache --tag=$TAG .
A. Download the model weights from here and save them in ./models
$ wget -P ./models
B. Run docker container:
$ docker run -it --rm --gpus device=0 --network=host --shm-size 20g -v $(pwd):/mnt -v <path to data>:/workspace/BEVFormer_tensorrt/data $TAG
C. In the docker container, patch the BEVFormer_tensorrt folder and compile the plugins:
# 1. Apply patch to BEVFormer_tensorrt with changes necessary for TensorRT 10 support
$ cd /workspace/BEVFormer_tensorrt
$ git apply /mnt/bevformer_trt10.patch
# 2. Compile plugins
$ cd TensorRT/build
$ cmake .. -DCMAKE_TENSORRT_PATH=/usr && make -j$(nproc) && make install
The compiled plugin will be saved in
, which will later be used by both ModelOpt and TensorRT.
D. Export simplified ONNX model from torch:
$ cd /workspace/BEVFormer_tensorrt
$ python tools/ configs/bevformer/plugin/ /mnt/models/bevformer_tiny_epoch_24.pth --opset=13 --cuda --flag=cp2_op13
$ cp checkpoints/onnx/bevformer_tiny_epoch_24_cp2_op13.onnx /mnt/models/
$ export PLUGIN_PATH=/workspace/BEVFormer_tensorrt/TensorRT/lib/
$ python /mnt/tools/ --onnx=/mnt/models/bevformer_tiny_epoch_24_cp2_op13.onnx --trt_plugins=$PLUGIN_PATH
This will generate an ONNX file with the same name as the input ONNX file, with the suffix
if using CUDA 12.x. No such variable is needed with CUDA 11.8.
This script does the following post-processing actions:
- Automatically detect custom TRT ops in the ONNX model.
- Ensure that the custom ops are supported as a TRT plugin in ONNX-Runtime (
domain).
- Update all tensor types and shapes in the ONNX graph with
.
- Simplify model with
- Prepare the calibration data:
$ cd /workspace/BEVFormer_tensorrt
$ PYTHONPATH=$(pwd) python /mnt/tools/ configs/bevformer/plugin/ \
--onnx_path=/mnt/models/bevformer_tiny_epoch_24_cp2_op13_post_simp.onnx \
The calibration data will be saved in
. The script uses 600 calibration samples by default. See instructions in the ModelOpt toolkit for more info on generating the calibration data.
- Quantize ONNX model with calibration data:
$ python /mnt/tools/ --onnx_path=/mnt/models/bevformer_tiny_epoch_24_cp2_op13_post_simp.onnx \
--trt_plugins=$PLUGIN_PATH \
--op_types_to_exclude MatMul \
This generates an ONNX model with suffix
with Q/DQ nodes around relevant layers.
- MatMul ops are not quantized (--op_types_to_exclude MatMul). The reason is that MHA blocks, present in Transformer-based models, are currently recommended to run in FP16. Keep in mind that optimal Q/DQ node placement varies between models, so quantizing MatMul ops may be more advantageous in some cases. This is up to the user to decide.
- If you're running out of memory, you may need to add
to the beginning of that quantization command. This is only valid for CUDA 12.x. No such variable is needed with CUDA 11.8.
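To make the role of a Q/DQ node pair concrete, here is a minimal sketch of the arithmetic such a pair performs, assuming symmetric per-tensor INT8 quantization with a zero-point of 0 and the rounding and saturation behaviour of the ONNX QuantizeLinear / DequantizeLinear operators. The function names and the `amax` argument are illustrative only; in practice the scale is derived by ModelOpt from the calibration data, which this sketch does not reproduce.

```python
# Illustrative only: the numeric effect of one Q/DQ pair on scalar values.
# Assumes symmetric per-tensor INT8 quantization with zero-point 0.


def compute_scale(amax: float, num_bits: int = 8) -> float:
    """Map the range [-amax, amax] onto the signed integer range."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    return amax / qmax


def quantize(x: float, scale: float, num_bits: int = 8) -> int:
    """Q node: scale, round half to even, then saturate to [qmin, qmax]."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    q = round(x / scale)  # Python's round() rounds half to even
    return max(qmin, min(qmax, q))


def dequantize(q: int, scale: float) -> float:
    """DQ node: map the integer back to a real value."""
    return q * scale


def fake_quantize(values, amax: float):
    """Apply Q followed by DQ, as a Q/DQ pair does in the ONNX graph."""
    scale = compute_scale(amax)
    return [dequantize(quantize(v, scale), scale) for v in values]


if __name__ == "__main__":
    # Values beyond amax saturate; values inside it snap to the nearest step.
    print(fake_quantize([0.2, 1.3, 300.0], amax=127.0))
```

Layers excluded from quantization, such as the MatMul ops above, simply have no such pair around them and keep running in their original precision.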
$ trtexec --onnx=/mnt/models/bevformer_tiny_epoch_24_cp2_op13_post_simp.quant.onnx \
--saveEngine=/mnt/models/bevformer_tiny_epoch_24_cp2_op13_post_simp.quant.engine \
--staticPlugins=$PLUGIN_PATH \
Note: To deploy the quantized ONNX model on another platform or with another TensorRT version, re-compile the plugin for the required settings and rebuild the engine from the same explicitly-quantized ONNX model.
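As a rough illustration of this note, the engine build can be scripted so that the same quantized ONNX file is reused on every target. The sketch below assembles only the three trtexec flags visible in the command above; `build_trtexec_cmd` is a hypothetical helper, any further flags of the original command are not reproduced here, and the plugin path in the usage lines is a placeholder rather than a real file name.

```python
# Hypothetical helper: builds the trtexec argument list from an ONNX path.
# Only flags that appear in this README are used; add others as needed.
from __future__ import annotations

import shlex
from pathlib import PurePosixPath


def build_trtexec_cmd(onnx_path: str, plugin_path: str) -> list[str]:
    """Return a trtexec command that saves the engine next to the ONNX file."""
    onnx = PurePosixPath(onnx_path)
    engine = onnx.with_suffix(".engine")  # model.quant.onnx -> model.quant.engine
    return [
        "trtexec",
        f"--onnx={onnx}",
        f"--saveEngine={engine}",
        f"--staticPlugins={plugin_path}",
    ]


if __name__ == "__main__":
    cmd = build_trtexec_cmd(
        "/mnt/models/bevformer_tiny_epoch_24_cp2_op13_post_simp.quant.onnx",
        "/path/to/recompiled/plugin",  # placeholder: plugin built for this target
    )
    # Print the command instead of running it, so the sketch works without TensorRT.
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting concerns.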
Run evaluation script:
$ cd /workspace/BEVFormer_tensorrt
$ python tools/bevformer/ \
configs/bevformer/plugin/ \
/mnt/models/bevformer_tiny_epoch_24_cp2_op13_post_simp.quant.engine \
System: NVIDIA A40 GPU, TensorRT
BEVFormer tiny with FP16 plugins with nv_half2
| Precision | GPU Compute Time (median, ms) | Accuracy (NDS / mAP) |
|---|---|---|
| FP32 | 18.82 | NDS: 0.354, mAP: 0.252 |
| FP16 | 9.36 | NDS: 0.354, mAP: 0.251 |
| BEST (TensorRT PTQ - Implicit Quantization) | 6.20 | NDS: 0.353, mAP: 0.250 |
| QDQ_BEST (ModelOpt PTQ - Explicit Quantization) | 6.02 | NDS: 0.352, mAP: 0.251 |
BEVFormer tiny with FP16 plugins with nv_half
| Precision | GPU Compute Time (median, ms) | Accuracy (NDS / mAP) |
|---|---|---|
| FP32 | 18.80 | NDS: 0.354, mAP: 0.252 |
| FP16 | 9.81 | NDS: 0.354, mAP: 0.251 |
| BEST (TensorRT PTQ - Implicit Quantization) | 6.73 | NDS: 0.353, mAP: 0.250 |
| QDQ_BEST (ModelOpt PTQ - Explicit Quantization) | 6.54 | NDS: 0.353, mAP: 0.251 |
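The relative speedups implied by the two tables can be worked out directly from the reported median latencies. The short sketch below does that arithmetic; the dictionary names and the `speedups` helper are illustrative, and the numbers are copied from the tables above rather than measured again.

```python
# Median GPU compute times (ms) copied from the two result tables.
NV_HALF2_MS = {"FP32": 18.82, "FP16": 9.36, "BEST": 6.20, "QDQ_BEST": 6.02}
NV_HALF_MS = {"FP32": 18.80, "FP16": 9.81, "BEST": 6.73, "QDQ_BEST": 6.54}


def speedups(latencies_ms, baseline="FP32"):
    """Return each precision's speedup relative to the baseline latency."""
    base = latencies_ms[baseline]
    return {name: base / ms for name, ms in latencies_ms.items()}


if __name__ == "__main__":
    for label, table in (("nv_half2", NV_HALF2_MS), ("nv_half", NV_HALF_MS)):
        for name, factor in speedups(table).items():
            print(f"{label:9s} {name:9s} {factor:.2f}x vs FP32")
```

With these figures, explicit quantization (QDQ_BEST) comes out at roughly 3.1x over FP32 with nv_half2 plugins and roughly 2.9x with nv_half plugins, while NDS and mAP stay within 0.002 of the FP32 values in both tables.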
To reproduce the results, run:
to build/save the TensorRT engine and obtain the runtime;./
to evaluate the TensorRT engine's accuracy.