-
Dear all, I am trying to build a NN model using Brevitas and FINN. For the hardware implementation, I exported the stitched IP and ran a behavioral simulation of the IP within Vivado. In the folding config, I made each layer fully parallel (PE and SIMD both set to their maximum values) to minimize latency. Thank you in advance.
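(For context: a FINN folding configuration is a JSON file that maps node names to folding attributes, typically applied via the ApplyConfig transformation or the builder's folding_config_file option. Below is a minimal sketch of a "fully parallel" config; the node names and values are hypothetical, and the legal PE/SIMD maxima depend on each layer's matrix dimensions.)

```json
{
  "Defaults": {},
  "MatrixVectorActivation_0": {
    "PE": 64,
    "SIMD": 64
  },
  "MatrixVectorActivation_1": {
    "PE": 32,
    "SIMD": 64
  }
}
```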
-
Hi, maybe your pipeline parallelism is not fully balanced, so you have a bottleneck in a later stage of the pipeline? Normally you would want the bottleneck to be your first layer, and have all following layers match or exceed its throughput. For full unfolding of CNN layers you need to use the ConvolutionInputGenerator_rtl with the parallel_window option enabled.
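(As a hedged illustration of the suggestion above: enabling the parallel window mode of the RTL sliding-window generator in the folding config could look like the sketch below. The node name and SIMD value are hypothetical placeholders.)

```json
{
  "ConvolutionInputGenerator_rtl_0": {
    "SIMD": 3,
    "parallel_window": 1
  }
}
```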
I found a way to do this.
In principle, the default IP generated by FINN does not include this pipeline setup, so back pressure builds up when the input data arrives too frequently.
After the stitched IP was generated, I found that the temporary HLS projects are located under /tmp/.
Then I simply opened each project and added one line to the source code:
#pragma HLS PIPELINE II=1 style=frp
exported the RTL, and repeated this for all the HLS projects.
With the newly generated IP, the dataflow is fully pipelined and handles back pressure correctly.
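(For reference, here is a minimal sketch of where such a pragma sits inside an HLS top-level function. The function name, stream types, and body are hypothetical placeholders, not FINN's actual generated code.)

```cpp
#include <hls_stream.h>
#include <ap_int.h>

// Hypothetical top-level layer function; FINN's generated sources differ.
void layer_top(hls::stream<ap_uint<8>> &in0, hls::stream<ap_uint<8>> &out0) {
#pragma HLS INTERFACE axis port=in0
#pragma HLS INTERFACE axis port=out0
#pragma HLS INTERFACE ap_ctrl_none port=return
    // Free-running pipeline: II=1, stalling only on empty/full streams,
    // so back pressure propagates cleanly through the AXI streams.
#pragma HLS PIPELINE II=1 style=frp
    ap_uint<8> x = in0.read();   // blocking read: stalls if input is empty
    out0.write(x);               // placeholder passthrough computation
}
```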