- Windows 10 laptop
- CPU i7-11375H
- GPU RTX-3060
- Visual Studio 2017
- CUDA 11.1
- TensorRT 8.0.3.4 (unet)
- TensorRT 8.2.0.6 (detr, yolov5s, real-esrgan)
- OpenCV 3.4.5
- Create an Engine directory for the generated engine files
- Create an Int8_calib_table directory for the PTQ calibration tables
- Layer for input preprocessing (NHWC -> NCHW, BGR -> RGB, [0, 255] -> [0, 1] normalization); see the kernel sketch after this list
  - plugin_ex1.cpp (plugin sample code)
  - preprocess.hpp (plugin definition)
  - preprocess.cu (preprocessing CUDA kernel function)
  - Validation_py/Validation_preproc.py (result validation against PyTorch)
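As a rough illustration of what the fused preprocessing does, here is a minimal CUDA kernel sketch; the kernel name, launch configuration, and exact indexing are assumptions for this example, not the exact contents of preprocess.cu:

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Fuses NHWC -> NCHW layout change, BGR -> RGB channel swap, and [0, 255] -> [0, 1]
// scaling. One thread writes one output element (illustrative kernel, batch size 1).
__global__ void preprocess_kernel(const uint8_t* __restrict__ src,  // H x W x 3, BGR, uint8
                                  float* __restrict__ dst,          // 3 x H x W, RGB, float
                                  int height, int width)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int plane = height * width;
    if (idx >= 3 * plane) return;

    int c = idx / plane;   // destination channel: 0 = R, 1 = G, 2 = B
    int hw = idx % plane;  // pixel index within one channel plane
    int src_c = 2 - c;     // BGR -> RGB swap
    dst[idx] = src[hw * 3 + src_c] / 255.f;
}

// Host-side launcher; inside a TensorRT plugin, 'stream' would be the enqueue stream.
void launch_preprocess(const uint8_t* d_src, float* d_dst,
                       int height, int width, cudaStream_t stream)
{
    int total = 3 * height * width;
    int block = 256;
    int grid = (total + block - 1) / block;
    preprocess_kernel<<<grid, block, 0, stream>>>(d_src, d_dst, height, width);
}
```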
- vgg11.cpp
  - with preprocess plugin
- resnet18.cpp
  - 100 images from the COCO val2017 dataset for PTQ calibration (a calibrator sketch follows the table below)
  - All results match PyTorch
  - Comparison of average execution time over 100 iterations and GPU memory usage for one 224x224x3 image
|                        | PyTorch | TensorRT | TensorRT | TensorRT   |
| ---------------------- | ------- | -------- | -------- | ---------- |
| Precision              | FP32    | FP32     | FP16     | Int8 (PTQ) |
| Avg duration time [ms] | 4.1     | 1.7      | 0.7      | 0.6        |
| FPS [frame/sec]        | 243     | 590      | 1385     | 1577       |
| Memory [GB]            | 1.551   | 1.288    | 0.941    | 0.917      |
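For the Int8 (PTQ) column above, TensorRT needs a calibrator that feeds the 100 COCO images and caches the resulting table in Int8_calib_table. A minimal sketch based on TensorRT's IInt8EntropyCalibrator2 interface follows; the class name, single-image batches, and preprocessing details are assumptions for illustration and may differ from the repository's calibrator:

```cpp
#include "NvInfer.h"
#include <cuda_runtime.h>
#include <opencv2/opencv.hpp>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical PTQ calibrator sketch feeding calibration images one at a time.
class EntropyCalibrator : public nvinfer1::IInt8EntropyCalibrator2
{
public:
    EntropyCalibrator(const std::vector<std::string>& imagePaths, int inputH, int inputW,
                      const std::string& cacheFile)
        : mPaths(imagePaths), mH(inputH), mW(inputW), mCacheFile(cacheFile)
    {
        cudaMalloc(&mDeviceInput, 3 * mH * mW * sizeof(float));
    }
    ~EntropyCalibrator() override { cudaFree(mDeviceInput); }

    int32_t getBatchSize() const noexcept override { return 1; }

    bool getBatch(void* bindings[], const char* /*names*/[], int32_t /*nbBindings*/) noexcept override
    {
        if (mIndex >= static_cast<int>(mPaths.size())) return false;  // no more calibration data
        cv::Mat img = cv::imread(mPaths[mIndex++]);                   // BGR, HWC, uint8
        cv::resize(img, img, cv::Size(mW, mH));
        // Same preprocessing as inference: BGR -> RGB, HWC -> CHW, scale to [0, 1].
        std::vector<float> blob(3 * mH * mW);
        for (int c = 0; c < 3; ++c)
            for (int i = 0; i < mH * mW; ++i)
                blob[c * mH * mW + i] = img.data[i * 3 + (2 - c)] / 255.f;
        cudaMemcpy(mDeviceInput, blob.data(), blob.size() * sizeof(float), cudaMemcpyHostToDevice);
        bindings[0] = mDeviceInput;
        return true;
    }

    const void* readCalibrationCache(std::size_t& length) noexcept override
    {
        // Reuse an existing table from Int8_calib_table so calibration runs only once.
        mCache.clear();
        std::ifstream in(mCacheFile, std::ios::binary);
        if (in) mCache.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
        length = mCache.size();
        return mCache.empty() ? nullptr : mCache.data();
    }

    void writeCalibrationCache(const void* cache, std::size_t length) noexcept override
    {
        std::ofstream out(mCacheFile, std::ios::binary);
        out.write(static_cast<const char*>(cache), length);
    }

private:
    std::vector<std::string> mPaths;
    int mH, mW, mIndex = 0;
    std::string mCacheFile;
    void* mDeviceInput = nullptr;
    std::vector<char> mCache;
};
```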
- UNet model (unet.cpp)
  - Use TensorRT 8.0.3.4 for the UNet model (version 8.2.0.6 raises an error when building the UNet model)
  - unet_carvana_scale0.5_epoch1.pth
  - Additional preprocessing (resize & letterbox padding) with OpenCV; see the sketch after the table below
  - Postprocessing (model output to image)
  - All results match PyTorch
  - Comparison of average execution time over 100 iterations and GPU memory usage for one 512x512x3 image
|                        | PyTorch | PyTorch | TensorRT | TensorRT | TensorRT   |
| ---------------------- | ------- | ------- | -------- | -------- | ---------- |
| Precision              | FP32    | FP16    | FP32     | FP16     | Int8 (PTQ) |
| Avg duration time [ms] | 66.21   | 34.58   | 40.81    | 13.52    | 8.19       |
| FPS [frame/sec]        | 15      | 29      | 25       | 77       | 125        |
| Memory [GB]            | 3.863   | 2.677   | 1.552    | 1.367    | 1.051      |
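A minimal sketch of the resize & letterbox-padding step with OpenCV, assuming a hypothetical letterbox helper and the common gray pad value 114; the repository's preprocessing may differ in these details:

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>

// Hypothetical letterbox helper: resize while keeping aspect ratio, then pad
// to the network input size.
cv::Mat letterbox(const cv::Mat& src, int dstW, int dstH,
                  const cv::Scalar& padColor = cv::Scalar(114, 114, 114))
{
    float scale = std::min(dstW / static_cast<float>(src.cols),
                           dstH / static_cast<float>(src.rows));
    int newW = static_cast<int>(src.cols * scale);
    int newH = static_cast<int>(src.rows * scale);

    cv::Mat resized;
    cv::resize(src, resized, cv::Size(newW, newH));

    // Center the resized image and fill the remaining border with the pad color.
    int padW = dstW - newW;
    int padH = dstH - newH;
    cv::Mat dst;
    cv::copyMakeBorder(resized, dst,
                       padH / 2, padH - padH / 2,
                       padW / 2, padW - padW / 2,
                       cv::BORDER_CONSTANT, padColor);
    return dst;  // dstH x dstW x 3, BGR; layout change and normalization are done by the preprocess plugin
}
```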
- DETR model (detr_trt.cpp)
  - Additional preprocessing (mean/std normalization); see the sketch after the table below
  - Postprocessing (draw detection results on the image)
  - All results match PyTorch
  - Comparison of average execution time over 100 iterations and GPU memory usage for one 500x500x3 image
|                        | PyTorch | PyTorch | TensorRT | TensorRT | TensorRT   |
| ---------------------- | ------- | ------- | -------- | -------- | ---------- |
| Precision              | FP32    | FP16    | FP32     | FP16     | Int8 (PTQ) |
| Avg duration time [ms] | 37.03   | 30.71   | 16.40    | 6.07     | 5.30       |
| FPS [frame/sec]        | 27      | 33      | 61       | 165      | 189        |
| Memory [GB]            | 1.563   | 1.511   | 1.212    | 1.091    | 1.005      |
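A minimal sketch of the extra mean/std normalization, assuming the ImageNet statistics commonly used with DETR's torchvision backbone and a CHW float blob already scaled to [0, 1]; the helper name is hypothetical:

```cpp
#include <vector>

// Hypothetical helper: applies per-channel mean/std normalization in place to a
// CHW float blob in RGB order that has already been scaled to [0, 1].
void normalize_chw(std::vector<float>& blob, int height, int width)
{
    const float mean[3] = {0.485f, 0.456f, 0.406f};  // R, G, B (ImageNet statistics)
    const float stdv[3] = {0.229f, 0.224f, 0.225f};
    const int plane = height * width;
    for (int c = 0; c < 3; ++c)
        for (int i = 0; i < plane; ++i)
            blob[c * plane + i] = (blob[c * plane + i] - mean[c]) / stdv[c];
}
```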
- Yolov5s model (yolov5s.cpp)
  - Comparison of average execution time over 100 iterations and GPU memory usage for one 640x640x3 resized & padded image
|                        | PyTorch | TensorRT | TensorRT   |
| ---------------------- | ------- | -------- | ---------- |
| Precision              | FP32    | FP32     | Int8 (PTQ) |
| Avg duration time [ms] | 7.72    | 6.16     | 2.86       |
| FPS [frame/sec]        | 129     | 162      | 350        |
| Memory [GB]            | 1.670   | 1.359    | 0.920      |
- Real-ESRGAN model (real-esrgan.cpp)
  - RealESRGAN_x4plus.pth
  - Scale up 4x (448x640x3 -> 1792x2560x3)
  - Comparison of average execution time over 100 iterations and GPU memory usage
  - [update] RealESRGAN_x2plus model (set OUT_SCALE=2)
|                        | PyTorch | PyTorch | TensorRT | TensorRT |
| ---------------------- | ------- | ------- | -------- | -------- |
| Precision              | FP32    | FP16    | FP32     | FP16     |
| Avg duration time [ms] | 4109    | 1936    | 2139     | 737      |
| FPS [frame/sec]        | 0.24    | 0.52    | 0.47     | 1.35     |
| Memory [GB]            | 5.029   | 4.407   | 3.807    | 3.311    |
- Yolov6s model (yolov6.cpp)
  - Comparison of average execution time over 1000 iterations and GPU memory usage (with preprocessing, without NMS, 536x640x3)
|                        | PyTorch | TensorRT | TensorRT | TensorRT   |
| ---------------------- | ------- | -------- | -------- | ---------- |
| Precision              | FP32    | FP32     | FP16     | Int8 (PTQ) |
| Avg duration time [ms] | 20.7    | 10.3     | 3.54     | 2.58       |
| FPS [frame/sec]        | 48.14   | 96.21    | 282.26   | 387.89     |
| Memory [GB]            | 1.582   | 1.323    | 0.956    | 0.913      |
- Yolov7 model (yolov7.cpp)
- TRT_DLL_EX : https://github.com/yester31/TRT_DLL_EX
1. Prepare the trained model in the training framework (generate the weight file to be used in TensorRT).
2. Implement the model with the TensorRT API to match the trained model structure.
3. Extract the weights from the trained model.
4. Make sure the weights are passed appropriately to each layer of the prepared TensorRT model.
5. Build and run (see the build/serialization sketch after this list).
   - After the TensorRT model is built, the model stream is serialized and saved as an engine file.
   - Subsequent runs perform inference by loading only the engine file (if model parameters or layers are modified, re-execute from step (4)).
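A condensed sketch of the build/serialize/load flow described above, assuming the TensorRT 8 C++ API; the function names and file handling are illustrative, and the repository's sample code is more elaborate (precision flags, calibrator, bindings, and so on):

```cpp
#include "NvInfer.h"
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

using namespace nvinfer1;

// Minimal logger required by the TensorRT builder and runtime.
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
} gLogger;

// Build the network with the TensorRT API and serialize it to an engine file
// (e.g. under the Engine directory). The layer-by-layer network definition that
// consumes the extracted weights is omitted and marked with a placeholder.
void buildAndSaveEngine(const char* enginePath)
{
    IBuilder* builder = createInferBuilder(gLogger);
    const auto flags = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    INetworkDefinition* network = builder->createNetworkV2(flags);
    IBuilderConfig* config = builder->createBuilderConfig();
    // config->setFlag(BuilderFlag::kFP16);  // or kINT8 together with a calibrator for PTQ

    // ... add inputs, layers, and outputs here, passing the extracted weights ...

    IHostMemory* serialized = builder->buildSerializedNetwork(*network, *config);
    std::ofstream out(enginePath, std::ios::binary);
    out.write(static_cast<const char*>(serialized->data()), serialized->size());

    delete serialized;
    delete config;
    delete network;
    delete builder;
}

// Subsequent runs skip the build step: deserialize the engine file and run inference.
ICudaEngine* loadEngine(const char* enginePath, IRuntime* runtime)
{
    std::ifstream in(enginePath, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
    return runtime->deserializeCudaEngine(blob.data(), blob.size());
}
```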
- tensorrtx : https://github.com/wang-xinyu/tensorrtx
- unet : https://github.com/milesial/Pytorch-UNet
- detr : https://github.com/facebookresearch/detr
- yolov5 : https://github.com/ultralytics/yolov5
- real-esrgan : https://github.com/xinntao/Real-ESRGAN
- yolov6 : https://github.com/meituan/YOLOv6