As of the v0.2.0 release, traditional post-training quantization (PTQ) degrades the mAP of YOLOv6-S from 43.4% to 41.2%. This is already a large improvement over v0.1.0, since the most quantization-sensitive layers are excluded, but it is still not ready for deployment. Meanwhile, because the reparameterization blocks behave inconsistently between training and inference, quantization-aware training (QAT) cannot be directly integrated into YOLOv6. As a remedy, we first train a single-branch network, YOLOv6-S-RepOpt, with RepOptimizer. It reaches 43.1% mAP, very close to YOLOv6-S, and we then apply our quantization strategy to YOLOv6-S-RepOpt.
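For context, "excluding" a sensitive layer here means leaving it in full precision, i.e. disabling its quantizers. A minimal sketch of how this can be done with pytorch_quantization (introduced below) might look like the following; the layer names are hypothetical placeholders, not YOLOv6's actual skip list:

```python
# Sketch: keep quantization-sensitive layers in full precision by
# disabling their quantizers. Layer names are illustrative placeholders.
from pytorch_quantization import nn as quant_nn

SENSITIVE_LAYERS = ["detect.reg_convs.0", "detect.cls_convs.0"]  # hypothetical names

def disable_sensitive_quantizers(model, sensitive_layers=SENSITIVE_LAYERS):
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            # e.g. "detect.reg_convs.0._input_quantizer" belongs to "detect.reg_convs.0"
            owner = name.rsplit(".", 1)[0]
            if owner in sensitive_layers:
                module.disable()  # this layer stays in full precision
```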
Applying post-training quantization to YOLOv6-S-RepOpt drops its mAP slightly, by 0.5%, so QAT is needed to further recover the accuracy. In addition, we apply channel-wise distillation to accelerate convergence. The final quantized model reaches 43.0% mAP.
To deploy the quantized model on typical NVIDIA GPUs (e.g. T4), we export it to the ONNX format and then use TensorRT to build a serialized engine together with the computed scale cache. The deployed model reaches 43.3% mAP, only 0.1% below the full-precision YOLOv6-S.
Our quantization strategy is built on top of pytorch_quantization, which must be installed first:
pip install --extra-index-url=https://pypi.ngc.nvidia.com --trusted-host pypi.ngc.nvidia.com nvidia-pyindex
pip install --extra-index-url=https://pypi.ngc.nvidia.com --trusted-host pypi.ngc.nvidia.com pytorch_quantization
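For reference, a model picks up quantizers once pytorch_quantization monkey-patches the standard layers, and calibration behavior is chosen through quantization descriptors. The sketch below follows NVIDIA's example defaults (histogram calibration for activations); this is an assumption for illustration, not necessarily YOLOv6's exact configuration:

```python
# Sketch: make torch.nn layers quantization-aware via pytorch_quantization.
# The calibrator choice follows NVIDIA's examples and is an assumption,
# not necessarily the exact YOLOv6 configuration.
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# Histogram calibration for activations; weights keep the library's
# default per-channel descriptor.
quant_desc_input = QuantDescriptor(calib_method="histogram")
quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc_input)
quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)

# Replace nn.Conv2d, nn.Linear, etc. with quantized versions on construction,
# so a model built afterwards contains TensorQuantizer modules.
quant_modules.initialize()
```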
First, train a YOLOv6-S-RepOpt model, or download our released checkpoint and scales. We then perform PTQ calibration to collect the ranges of activations and weights:
CUDA_VISIBLE_DEVICES=0 python tools/train.py \
--data ./data/coco.yaml \
--output-dir ./runs/opt_train_v6s_ptq \
--conf configs/repopt/yolov6s_opt_qat.py \
--quant \
--calib \
--batch 32 \
--workers 0
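Under the hood, PTQ calibration with pytorch_quantization amounts to running a few batches through the model while the quantizers collect statistics, then loading the resulting amax (range) values. A minimal sketch of that pattern, with `model` and `data_loader` as generic stand-ins for the YOLOv6-S-RepOpt model and the COCO calibration loader:

```python
# Sketch of the PTQ calibration pattern provided by pytorch_quantization.
# `model` and `data_loader` are generic stand-ins.
import torch
from pytorch_quantization import calib
from pytorch_quantization import nn as quant_nn

def calibrate(model, data_loader, num_batches=32):
    # Switch quantizers to calibration mode (collect stats, skip fake-quant).
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    with torch.no_grad():
        for i, (images, _) in enumerate(data_loader):
            model(images.cuda())
            if i + 1 >= num_batches:
                break

    # Compute and load amax (ranges), then re-enable quantization.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax("percentile", percentile=99.99)
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()
```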
Our proposed QAT strategy comes with channel-wise distillation. It loads the calibrated RepOptimizer-trained model and trains it for 10 epochs. To reproduce the result:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 \
tools/train.py \
--data ./data/coco.yaml \
--output-dir ./runs/opt_train_v6s_qat \
--conf configs/repopt/yolov6s_opt_qat.py \
--quant \
--distill \
--distill_feat \
--batch 128 \
--epochs 10 \
--workers 32 \
--teacher_model_path ./assets/yolov6s_v2_reopt_43.1.pt \
--device 0,1,2,3,4,5,6,7
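Channel-wise distillation aligns the per-channel spatial distributions of the student's feature maps with the teacher's via a KL divergence. A minimal sketch of such a loss is shown below; the temperature and the choice of which feature maps to distill are illustrative assumptions, not YOLOv6's exact settings:

```python
# Sketch of a channel-wise distillation (CWD) loss between teacher and
# student feature maps. Temperature and feature selection are assumptions.
import torch
import torch.nn.functional as F

def channel_wise_distill_loss(feat_s, feat_t, temperature=1.0):
    """feat_s, feat_t: (N, C, H, W) student / teacher feature maps."""
    n, c, h, w = feat_s.shape
    # Normalize each channel's spatial activations into a distribution.
    log_p_s = F.log_softmax(feat_s.view(n, c, -1) / temperature, dim=-1)
    p_t = F.softmax(feat_t.view(n, c, -1) / temperature, dim=-1)
    # KL divergence per channel, averaged over batch and channels.
    loss = F.kl_div(log_p_s, p_t, reduction="sum") * (temperature ** 2) / (n * c)
    return loss
```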
To export the quantized model to ONNX:
python3 qat_export.py --weights yolov6s_v2_reopt_43.1.pt --quant-weights yolov6s_v2_reopt_qat_43.0.pt --graph-opt --export-batch-size 1
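Conceptually, the export step switches pytorch_quantization to emit standard QuantizeLinear/DequantizeLinear nodes and then calls the regular ONNX exporter. A rough sketch of that mechanism follows; the file name, input shape, and the `export_qat_onnx` helper are placeholders, not the actual `qat_export.py` implementation:

```python
# Sketch of exporting a QAT model to ONNX with Q/DQ nodes.
# File name, input shape, and this helper are placeholders.
import torch
from pytorch_quantization import nn as quant_nn

def export_qat_onnx(model, onnx_path="yolov6s_qat.onnx", img_size=640):
    # Emit standard QuantizeLinear/DequantizeLinear nodes instead of
    # pytorch_quantization's custom fake-quant ops.
    quant_nn.TensorQuantizer.use_fb_fake_quant = True
    model.eval()
    dummy_input = torch.randn(1, 3, img_size, img_size, device="cuda")
    torch.onnx.export(
        model, dummy_input, onnx_path,
        opset_version=13,  # per-channel Q/DQ requires opset >= 13
        input_names=["images"],
        output_names=["outputs"],
    )
```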
To build a TensorRT engine:
trtexec --workspace=1024 --percentile=99 --streams=1 --int8 --fp16 --avgRuns=10 --onnx=yolov6s_v2_reopt_qat_43.0_bs1.sim.onnx --calib=yolov6s_v2_reopt_qat_43.0_remove_qdq_bs1_calibration_addscale.cache --saveEngine=yolov6s_v2_reopt_qat_43.0_bs1.sim.trt
Alternatively, you can build the engine directly from the released yolov6s_v2_quant.onnx and yolov6s_v2_calibration.cache.
We release our quantized and graph-optimized YOLOv6 (v0.2.0) models. The following throughput is tested with TensorRT 8.4 on an NVIDIA Tesla T4 GPU.
| Model | Size | Precision | mAP<sup>val</sup><br>0.5:0.95 | Speed<sup>T4</sup><br>trt b1 (fps) | Speed<sup>T4</sup><br>trt b32 (fps) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| YOLOv6-S RepOpt | 640 | INT8 | 43.3 | 619 | 924 |
| YOLOv6-S | 640 | FP16 | 43.4 | 377 | 541 |
| YOLOv6-T RepOpt | 640 | INT8 | 39.8 | 741 | 1167 |
| YOLOv6-T | 640 | FP16 | 40.3 | 449 | 659 |
| YOLOv6-N RepOpt | 640 | INT8 | 34.8 | 1114 | 1828 |
| YOLOv6-N | 640 | FP16 | 35.9 | 802 | 1234 |