The rapid advancement of Artificial Intelligence has enabled remarkable achievements in various fields, such as image/video classification, semantic segmentation, object detection/tracking, and other applications applicable in embedded systems. However, the increasing size and complexity of deep learning models required to tackle large-scale image problems demand dedicated hardware acceleration on embedded devices for real-time processing with low latency.
In this project, we present a detailed analysis of configuring, installing, and running a Convolutional Neural Network (CNN) on the Zynq Ultrascale+ board (ZCU102). This platform combines a multiprocessor system-on-chip (MPSoC) with an FPGA, offering hardware acceleration specifically tailored for efficient execution of neural networks, minimizing the computational load on the main processor.
The main goal of this study is to evaluate the performance degradation when running a CNN model on the Deep Learning Processing Unit (DPU) of ZCU102 compared to running it on a Cloud infrastructure. For this purpose, we have chosen to use Kaggle as the Cloud platform. Through a comparative and qualitative analysis, various relevant aspects will be considered to provide an assessment of the performance on the target board ZCU102.
The Deep Learning Processor Unit (DPU) is a programmable and optimized processing unit specifically designed for implementing deep learning neural networks. Configurable using Xilinx's Vivado platform, the DPU accelerates operations of convolutional neural networks, including convolutions, pooling, and normalization. Its specialized instruction set maximizes neural network efficiency, reducing energy consumption and processing time. The DPU's hardware architecture utilizes a pipelined design, ensuring high parallelism and superior throughput compared to the CPU on the board. On-chip memory (OCM) minimizes external memory access, increasing overall system efficiency. The DPU is highly configurable and flexible, adapting to various applications and supporting widely used neural network architectures in Computer Vision, such as VGG, ResNet, GoogleNet, YOLO, SSD, MobileNet, and more. The following sections will cover the configuration and complete installation of the DPU on the board.
This project aims to demonstrate the applicability of ZCU102 in a real-world context, supporting traffic sign recognition applications through a convolutional neural network. The implemented system is capable of recognizing signs such as turn indications, speed limits, no-overtaking zones, children pedestrian crossings, and other related symbols. The goal is to enable vehicles to take appropriate precautions in response to these signals.
We will develop a convolutional neural network model using the PyTorch library. To assess the performance, the model will be initially executed on the cloud via Kaggle and subsequently deployed on the ZCU102 board. Comparative and qualitative analysis will be conducted to identify any performance differences between execution on the ZCU102 Deep Learning Processing Unit (DPU) and the Cloud.
The dataset used is the Traffic Signs Classification, consisting of over 40 classes of traffic signs and more than 50,000 images. Its breadth and realism provide the model with a comprehensive understanding of traffic signs, enhancing its recognition and classification capabilities. This dataset is acquired from the German Traffic Sign Recognition Benchmark.
You can find the complete implementation of the model on Kaggle in the GitHub repository, which contains the notebook for the project. This serve for save the necessary file "model.pth" that will be utilized for the AI Quantizer and AI Compiler
- PyTorch
- Vivado
- Vitis AI
- Xilinx Deep Learning Processing Unit (DPU)
- PetaLinux
- Kaggle Cloud Platform
To experiment with the application's performance on the board, we build two convolutional networks for the traffic sign recognition problem, which is useful, for example, in the context of autonomous driving technology. We propose the LeNet model and the VGG16 model.
The configuration before running the support flow for model execution on the board is detailed in the complete Italian ๐ฎ๐น document.
Description of the Vitis AI execution flow with PyTorch for ZCU102:
After following the described process, the folder "target_zcu102" will be generated, containing the files to be uploaded to the board, as shown in the figure:
By transferring the folder correctly to ZCU102, the application can be run on the board using the following command:
python3 app_traffic_sign.py -m CNN_zcu102.xmodel --threads (N)
By executing all possible configuration combinations between the number of cores and threads, the results are as follows:
Platform | Throughput (fps) | Accuracy (%) |
---|---|---|
Cloud (GPU P100) | LeNet: 2,685,901.64 | 0.9776 |
VGG16: 2,771,993.92 | 0.9856 | |
Host (Quantization 8-bit) | LeNet: N/A | 0.9739 |
VGG16: N/A | 0.9848 | |
ZCU102 DPU Single Core - 1 Thread | LeNet: 3,593.36 | 0.9772 |
VGG16: 333.46 | 0.9839 | |
ZCU102 DPU Multi Core - 1 Thread | LeNet: 3,559.38 | 0.9772 |
VGG16: 331.46 | 0.9839 | |
ZCU102 DPU Single Core - 4 Threads | LeNet: 5,529.12 | 0.9772 |
VGG16: 352.58 | 0.9839 | |
ZCU102 DPU Multi Core - 4 Threads | LeNet: 5,631.97 | 0.9772 |
VGG16: 486.66 | 0.9839 |
Based on the results and observations made during the project, we can draw some conclusions about the use of Zynq UltraScale+ MPSoC (ZCU102) in Edge AI applications. The accuracy achieved by the model on the DPU is approximately equal to that obtained when the model is executed in the Cloud. However, the throughput (calculated as frame/sec) is significantly lower in the DPU compared to the Cloud solution.
This leads to a trade-off decision since this application is targeted for use on board an autonomous driving vehicle. Both high accuracy and adequate throughput are crucial, as shown in the case of the LeNet model in this study. Indeed, in the case of the multi-core DPU with 4 threads, the system is capable of performing traffic sign recognition in 0.17756 ms, without considering additional constraints such as the distance at which the model can perform recognition and others.
In general, despite the quantization during the Vitis-AI workflow, the model retains a good level of accuracy in its predictions. This is a significant advantage when the goal of the Edge AI application is to be as accurate as possible without major time constraints.
However, the main limitation of this approach lies in the technological constraints imposed by Zynq UltraScale+; even with the integration of a DPU that accelerates convolution operations for neural networks, the throughput is not yet comparable to that of a Cloud solution.
ยฉ 2023 - Project developed for the Embedded Systems exam at the University of Naples, Federico II. Created exclusively for demonstrative and educational purposes. Authors: Antonio Romano - Giuseppe Riccio
- Vitisโข AI User Guide
- Documentation DPU for Zynq UltraScale+ MPSoC
- Vitisโข AI Library User Guide
- PetaLinux Tools Documentation Reference Guide
These documents were used as reference sources for the project, providing crucial information regarding the use and configuration of Xilinx resources and related development tools.