The Post-processing of YOLOv4 (C++ Version)
The YOLO-v4 pipeline consists of five main steps: data input, preprocessing, inference, post-processing, and drawing. Here we focus on the last two parts, post-processing and drawing.
Python generally runs slower than compiled languages such as C and C++. Therefore, I rewrote the Python version of the YOLO-v4 post-processing in C++ to test whether the total runtime of post-processing and result drawing could be sped up.
Here we use the three output layers obtained from YOLO-v4 inference, namely conv2d_58_Conv2D_YoloRegion, conv2d_66_Conv2D_YoloRegion, and conv2d_74_Conv2D_YoloRegion, to implement the YOLO-v4 post-processing in C++.
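For reference, a minimal sketch of how the three output layers might be listed before post-processing. The shapes in the comments are assumptions based on the common 416×416 YOLO input resolution (255 = 3 anchors × (80 classes + 5)) and may differ for other input sizes:

```cpp
#include <string>
#include <vector>

// The three YoloRegion output layer names used throughout this post-processing.
// The shapes in the comments are assumed for a 416x416 input.
const std::vector<std::string> kOutputLayers = {
    "conv2d_58_Conv2D_YoloRegion",  // e.g. 1 x 255 x 13 x 13, coarsest grid
    "conv2d_66_Conv2D_YoloRegion",  // e.g. 1 x 255 x 26 x 26
    "conv2d_74_Conv2D_YoloRegion",  // e.g. 1 x 255 x 52 x 52, finest grid
};
```

The tables below compare the Python and C++ runtimes at three confidence thresholds.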
Threshold 0.99

| | Reshape | Filter | NMS | Total post-process | Drawing | Total runtime |
|---|---|---|---|---|---|---|
| Python | 1.1339 | 2.8736 | 0.0996 | 1.5921 | | |
| C++ | 2.529 | 0.002 | 7.195 | 8.128 | | |

Threshold 0.6

| | Reshape | Filter | NMS | Total post-process | Drawing | Total runtime |
|---|---|---|---|---|---|---|
| Python | 0.4694 | 3.3707 | 1.9178 | 9.7203 | 15.5413 | |
| C++ | 2.583 | 0.026 | 7.321 | | | |

Threshold 0.1

| | Reshape | Filter | NMS | Total post-process | Drawing | Total runtime |
|---|---|---|---|---|---|---|
| Python | 1.0774 | 7.1897 | 13.7035 | 22.1300 | 28.6309 | 50.7609 |
| C++ | 2.826 | 0.161 | | | | |
While testing the code, we discovered that the "transpose" call in the "reshape" section took up around half of the total post-processing runtime. Since Python's NumPy module uses BLAS and LAPACK to execute matrix, vector, and linear-algebra operations, we came up with the idea of addressing this issue with the xtensor-blas module.
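As a rough illustration of how such a measurement can be taken, here is a sketch that times the transpose with std::chrono. It assumes the pre-improvement reshape used xt::transpose to permute NCHW to NHWC, and the tensor shape is just an assumed example:

```cpp
#include <chrono>
#include <iostream>
#include <xtensor/xarray.hpp>
#include <xtensor/xbuilder.hpp>
#include <xtensor/xmanipulation.hpp>

int main() {
    // Dummy prediction tensor standing in for one YoloRegion output (assumed shape).
    xt::xarray<float> predictions = xt::zeros<float>({1, 255, 52, 52});

    auto t0 = std::chrono::steady_clock::now();
    // xt::transpose builds a lazy view; assigning it to an xarray
    // forces the actual element-by-element copy we want to time.
    xt::xarray<float> nhwc = xt::transpose(predictions, {0, 2, 3, 1});
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "transpose took "
              << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
              << " us" << std::endl;
    return 0;
}
```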
Improvement 1
Here the transpose function is replaced with the code below:
```cpp
#include <cstddef>
#include <xtensor/xarray.hpp>

// Transpose a prediction tensor from NCHW to NHWC layout with explicit loops.
xt::xarray<float> transpose(xt::xarray<float>& predictions) {
    // Target shape (N, H, W, C), built from the source (N, C, H, W) shape.
    xt::xarray<float>::shape_type shape = {predictions.shape()[0],
                                           predictions.shape()[2],
                                           predictions.shape()[3],
                                           predictions.shape()[1]};
    xt::xarray<float> new_predictions(shape);
    for (std::size_t n = 0; n < predictions.shape()[0]; n++) {
        for (std::size_t h = 0; h < predictions.shape()[2]; h++) {
            for (std::size_t w = 0; w < predictions.shape()[3]; w++) {
                for (std::size_t c = 0; c < predictions.shape()[1]; c++) {
                    // Copy each element directly into its new position.
                    new_predictions(n, h, w, c) = predictions(n, c, h, w);
                }
            }
        }
    }
    return new_predictions;
}
```
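A quick usage sketch; the tensor shape here is an assumed example for one YOLO output scale:

```cpp
#include <xtensor/xarray.hpp>
#include <xtensor/xbuilder.hpp>

// Hypothetical example: one YOLO output scale in NCHW layout.
xt::xarray<float> preds = xt::zeros<float>({1, 255, 13, 13});
xt::xarray<float> nhwc = transpose(preds);  // resulting shape: (1, 13, 13, 255)
```

Unlike a lazy transposed view, this version eagerly copies every element into a freshly allocated contiguous array, which is intended to reduce the reshape time measured above.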
| Threshold | Reshape (before) | Total runtime (before) | Reshape (after) | Total runtime (after) |
|---|---|---|---|---|
| 0.99 | 4.423 | 8.128 | | |
| 0.6 | 4.453 | 12.731 | | |
| 0.1 | 4.579 | 25.848 | | |
Install xtensor

First, install xtl:

```bash
cd /opt
git clone https://github.com/xtensor-stack/xtl.git
cd xtl
cmake -DCMAKE_INSTALL_PREFIX=/opt/xtl .
make install
```

Then, install xtensor:

```bash
cd /opt
git clone https://github.com/xtensor-stack/xtensor.git
cd xtensor
cmake -DCMAKE_INSTALL_PREFIX=/opt/xtensor .
make install
```
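Since the xtensor-blas module is mentioned above for the BLAS-backed operations, it can presumably be installed following the same pattern (it additionally requires a BLAS/LAPACK implementation such as OpenBLAS to be available on the system); the install prefix below is just an assumption mirroring the steps above:

```bash
cd /opt
git clone https://github.com/xtensor-stack/xtensor-blas.git
cd xtensor-blas
cmake -DCMAKE_INSTALL_PREFIX=/opt/xtensor-blas .
make install
```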
To run the code

First, change into the build directory, then configure and build:

```bash
cmake -DCMAKE_INSTALL_PREFIX=/opt/ ..
make
```

Then run the compiled post-process executable:

```bash
./pp obj_input.jpg
```

When the code runs successfully, the results are printed in microseconds (multiply by 0.001 to convert to milliseconds).
References

https://superfastpython.com/what-is-blas-and-lapack-in-numpy/
https://max-c.notion.site/C-Numpy-Python-NPY-efe8a325aacb43ec9827f86185220fdc