
Cycle-free graph violated error in TFC for BNN-PYNQ example for w2a2 (2-bit weights, 2-bit activations) #938

Open · pkdeep opened this issue Dec 10, 2023 · Discussed in #937 · 16 comments


pkdeep commented Dec 10, 2023

Discussed in #937

Originally posted by pkdeep December 10, 2023
Hello, I am trying to run the BNN-PYNQ example notebook with the TFC network (for MNIST data) with 2-bit weights and 2-bit activations. I am getting a cycle-free graph violated error while generating the dataflow partition. Is there any workaround?
(docker image: xilinx/finn v0.9-2-gb3bdff11-dirty.xrt_202210.2.13.466_18.04-amd64-xrt)
Below is a screenshot of the error.
[Screenshot: Finn_graph_error]


Ba1tu3han commented Dec 10, 2023

Hello,

I have the same error in the same example with a custom network (#936).


pkdeep commented Dec 10, 2023 via email

iksnagreb commented:

Could you please share more details on your model, i.e., the initial onnx graph right after export and the one right before the failing transformation? Might indeed be the same issue as #936, but without seeing the graph it is not possible to tell.


pkdeep commented Dec 13, 2023

[Attachment: Model.txt]
[Screenshot: FINN_ERROR]
Hi @iksnagreb,
Please find attached images of the models: one just after the export to ONNX, and one just before the failing transformation (model.transform(CreateDataflowPartition())).
[Image: exported_model_at_start onnx]
[Image: model_just_beforefailing_transformation_dataflow]

The first image at the top is the model just after export to ONNX.
The second image is the model just before the failing transformation.
The model details are attached in the .txt file.
I used the code below for model export (the input to the model is float values between 0 and 1 with shape (1x1024)):

import onnx
import torch
import brevitas.onnx as bo

# trained model in eval mode (tr is the training wrapper object)
t1 = tr.model.eval()
# dummy input matching the expected input shape
a = torch.rand(1, 1, 1, 1024)
bo.export_finn_onnx(t1, a, "qnn_w2_a2_self.onnx")

Let me know if you require anything else.


iksnagreb commented Dec 13, 2023

Hm, for some reason your MatMul layers are not converted to the corresponding HLS layer MatrixVectorActivation, while in turn the related MultiThreshold layers are converted to standalone HLS layers Thresholding_Batch. The CreateDataflowPartition transformation, however, expects a continuous chain of purely HLS layers while your model now has an alternating chain of HLS (the Thresholding_Batch) and standard onnx (i.e., the MatMul) layers.

You are using the bnn-pynq notebooks as is, just loading the w2a2 model at the start? You changed nothing else? Then the problem is very likely that this example notebook is intended for binarized (or bipolar) neural networks. By loading the 2-bit variant, this is not the case any more for you. Thus the ConvertBipolarMatMulToXnorPopcount and consequently the InferBinaryMatrixVectorActivation transformations (in the two cells before the failing one) will not work, leaving the MatMul layers there.

You have two options now: Either stick to the binarized model to follow the example notebook as it is, or adapt the "Conversion to HLS layers" cells such that they work with the 2-bit (or even more bit) models. For the second option, I suggest you have a look at the InferQuantizedMatrixVectorActivation transformation to replace the InferBinaryMatrixVectorActivation. You might have to adapt some parts of the streamlining and pre-/post-processing as well.


pkdeep commented Dec 13, 2023

"You are using the bnn-pynq notebooks as is, just loading the w2a2 model at the start? You changed nothing else?" :: Yes

I need to use w2a2 or w2a4, as my intended final network does not give good accuracy below that, so I have to use one of those configurations.

How do I call or use "InferBinaryMatrixVectorActivation"? Any references?

(P.S.: When I was using the w1a1 configuration, I was able to run the code successfully.)

Thanks

iksnagreb commented:

Look into the first code cell of the "Conversion to HLS layers" section in the notebook. The third line should be: model = model.transform(to_hls.InferBinaryMatrixVectorActivation("decoupled")). Start by replacing this call with InferQuantizedMatrixVectorActivation and see whether the rest of the notebook works again. It might not be the only change necessary, but I expect it gets you at least through the dataflow partition.
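For reference, a minimal sketch of the adapted cell could look as follows (module paths as used in the FINN v0.9 bnn-pynq notebook; treat this as a starting point, not the definitive set of conversions needed for w2a2):

    # adapted "Conversion to HLS layers" cell for a multi-bit (w2a2) model
    import finn.transformation.fpgadataflow.convert_to_hls_layers as to_hls

    # quantized (multi-bit) MVAU inference instead of the binary/XNOR variant
    model = model.transform(to_hls.InferQuantizedMatrixVectorActivation("decoupled"))
    # the remaining conversions from the original cell stay the same, e.g.:
    model = model.transform(to_hls.InferThresholdingLayer())
    model = model.transform(to_hls.InferLabelSelectLayer())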


pkdeep commented Dec 13, 2023

I made the changes you suggested and could move forward. Thanks a lot for the help. The attached image shows the model. I'll get back to you with a detailed report tomorrow.
[Image: new]
Thanks once again.


pkdeep commented Dec 19, 2023

@iksnagreb Hi,
I have been able to complete the run and generate the PYNQ driver. I also ran the verification notebook, where my results match. Thanks for your support.
However, when I am running on the board, my results do not match. Any hint as to why that might be happening?
Also, my running_weight folder is empty; is that the expected behavior?
Any help will be greatly appreciated.
P.S.

  1. I also tried tfc_w1a1 on the board; here is my observation:
    While TopK() (i.e. the index of the max value) matches, the actual values coming out of the board and from verification are different. In the verification notebook I get float values, both negative and positive ([-1.4913709, -1.2444434, 1.0602127, -1.3267525, -1.3267525, -1.57368, -1.4090617, -1.4090617, -1.4090617, -1.7382984]), however, on the board I get only positive integers ([[28., 31., 59., 30., 30., 27., 29., 29., 29., 25.]], dtype=float32).
  2. My network output is 5 nodes and is multilabel, so more than one node can be active. I am not sure if I have to do something extra to get correct values.
    Pradeep

[Screenshot: Pynq_driver_snapshot]

iksnagreb commented:

Nice to hear you are making some progress. Yes, empty runtime weights should be expected in this case.

Regarding the mismatch when running on the board: I am just guessing, but likely the dataflow partition moved the input quantization (and maybe the output de-quantization, if you are expecting floating-point outputs) out of the hardware design, such that what is running on the device is purely integer, thus you are getting only integers back. That means, you probably have to quantize your inputs manually (and maybe de-quantize your outputs manually as well, or, alternatively compare against quantized expected outputs for verification).
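A minimal host-side sketch of what that could look like (the scale and bias values below are placeholders, not taken from your model; the real ones come from the Quant/MultiThreshold parameters of your parent graph):

    import numpy as np

    # assumed quantization parameters; read the real ones from your parent model
    in_scale, in_zero_point = 1.0 / 3.0, 0.0
    out_scale, out_bias = 0.1, -1.5

    def quantize_input(x_float):
        # float host input -> integer representation expected by the accelerator
        return np.round(x_float / in_scale + in_zero_point)

    def dequantize_output(y_int):
        # integer accelerator output -> float comparable to the software model
        return y_int * out_scale + out_bias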


pkdeep commented Dec 22, 2023

Thanks for the reply. Can you provide any example code or links that would be useful for debugging this issue? I do not know how to find out which parts of the design end up in hardware (PL) versus software (PS). Any calls that give insight into this would be very handy.

"That means, you probably have to quantize your inputs manually (and maybe de-quantize your outputs manually as well, or, alternatively compare against quantized expected outputs for verification)." -> How do I retrieve the information required to do this manually (like the mean and dynamic range)?

iksnagreb commented:

To see how FINN partitioned your model, you can have a look into the parent model after creating the dataflow partition: the original notebook saves this as build_dir+"/tfc_w1_a1_dataflow_parent.onnx" (you might have changed this). This model graph should look rather simple, containing a StreamingDataflowPartition in the center (that is the part viewed in the next cell of the example notebook). Everything inside this StreamingDataflowPartition will be placed into the hardware design, everything outside of it will not. What you have shown above is probably just the part inside of the partition.

Guessing from the last model graph you provided, it seems to be everything up to (and including) the first MultiThreshold which is not included in the hardware; this makes sense, as it corresponds to the input quantization. Normally, you would now have to figure out which Quant node (probably just the first one as well) originally corresponds to this MultiThreshold and use the quantization parameters (scale, zero-point, etc.) from there. However, it looks like you have some Mul and Add nodes (we do not care about the Reshape here) preceding the MultiThreshold, which suspiciously look like a conversion to bipolar inputs. Bipolar inputs do not really make sense for your w2a2 model. Is this still the case? You might want to check again whether your inputs and outputs (as the Mul and Add following the last MatMul look suspiciously like the reverse of the bipolar conversion) are treated correctly or whether there are still some leftovers from the binary/bipolar example in there.
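As a minimal sketch (assuming the notebook's build_dir and the default parent-model filename; adjust the path to your setup), you can inspect the partitioning like this:

    from qonnx.core.modelwrapper import ModelWrapper
    from finn.util.visualization import showInNetron

    parent = ModelWrapper(build_dir + "/tfc_w1_a1_dataflow_parent.onnx")
    # nodes outside the StreamingDataflowPartition run on the host (PS),
    # everything inside it ends up in the FPGA fabric (PL)
    for node in parent.graph.node:
        print(node.op_type, node.name)
    showInNetron(build_dir + "/tfc_w1_a1_dataflow_parent.onnx")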


pkdeep commented Dec 29, 2023

Hi @iksnagreb, the code is finally working fine on the board. I was able to push the input Quant layer inside the dataflow partition and to convert the INT output of the dataflow partition to FLOAT using the quantization values. Thanks a lot for your help.

Now, moving forward, I want to play around with the folding factors and performance enhancement. I have a few queries regarding this. If you can answer them or direct me to the relevant sections, it will be of great help:

  1. When I try to increase the PE and SIMD values, I get an error and the flow does not proceed. Generally the errors are not detailed and it is difficult to make sense of them. Is there a better way to reproduce or debug these? (Current message during synthesis below; log file attached: [Attachment: runme.log])
Finished Part Resource Summary
---------------------------------------------------------------------------------
/opt/Xilinx/Vivado/2022.2/bin/rdiArgs.sh: line 312: 49498 Killed                  "$RDI_PROG" "$@"
Parent process (pid 49498) has died. This helper process will now exit

  2. How do I get latency numbers for the implementation?
  3. I sometimes get the following error when I try to use "cybersecurity/3-build-accelerator-with-finn.ipynb", which I am not able to understand:
ERROR: [HLS 207-2163] 'bitwidth' attribute requires integer constant between 1 and 8191 inclusive (/opt/Xilinx/Vitis_HLS/2022.2/common/technology/autopilot/ap_common.h:520:35)
INFO: [HLS 207-4518] in instantiation of template class 'ssdm_int<16384, false>' requested here (/opt/Xilinx/Vitis_HLS/2022.2/common/technology/autopilot/ap_int_base.h:108:29)
INFO: [HLS 207-4518] in instantiation of template class 'ap_int_base<16384, false>' requested here (/opt/Xilinx/Vitis_HLS/2022.2/common/technology/autopilot/ap_int.h:181:18)
INFO: [HLS 207-4518] in instantiation of template class 'ap_uint<16384>' requested here (/tmp/finn_dev_pradeep/code_gen_ipgen_MatrixVectorActivation_1_ra_wykn9/top_MatrixVectorActivation_1.cpp:38:1)
ERROR: [HLS 207-2163] 'bitwidth' attribute requires integer constant between 1 and 8191 inclusive (/opt/Xilinx/Vitis_HLS/2022.2/common/technology/autopilot/ap_common.h:521:29)
ERROR: [HLS 207-2163] 'bitwidth' attribute requires integer constant between 1 and 8191 inclusive (/opt/Xilinx/Vitis_HLS/2022.2/common/technology/autopilot/ap_common.h:523:104)
ERROR: [HLS 207-2163] 'bitwidth' attribute requires integer constant between 1 and 8191 inclusive (/opt/Xilinx/Vitis_HLS/2022.2/common/technology/autopilot/ap_int.h:212:114)
ERROR: [HLS 207-3337] type 'ap_uint<256U * 32U * ap_int<2>::width>' does not provide a call operator (/home/pradeep/Desktop/finn-brevitas/finn/deps/finn-hlslib/mvau.hpp:268:20)

Thanks

fpjentzsch commented:

Hi,

  1. I'm afraid this error usually points to Vivado running out of RAM during synthesis, most likely because the design became larger/more complex with the increased parallelism.
  2. You can run this transformation to get the estimated latency per layer. The worst latency will determine the overall throughput, but estimating the actual inference latency is not as easy. For an upper bound, this transformation simply adds all latencies together to give you "critical_path_cycles". RTL simulation is usually preferred to get realistic latency figures.
    If you are using the FINN builder tool, the following step wraps these analysis transformations and dumps the results into .json log files, along with additional information like operator/parameter counts (see the sketch after this list):
    def step_generate_estimate_reports(model: ModelWrapper, cfg: DataflowBuildConfig):
  3. FINN uses AXI-Streams internally to move data around. Parallelism directly impacts the width of these streams and 8192 is the maximum width supported by Vitis HLS. Unfortunately this is a hard limit, so you will have to decrease parallelism and/or model size to make it work. If this only happens in a layer for which an alternative RTL backend exists (such as the ConvolutionInputGenerator), you might be able to avoid this limitation by switching from the HLS to the RTL backend.
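As a rough sketch of point 2, a FINN builder configuration that produces only the estimate reports could look like this (step and option names as in the FINN v0.9 build_dataflow API; file names and FPGA part are placeholders):

    import finn.builder.build_dataflow as build
    import finn.builder.build_dataflow_config as build_cfg

    cfg_estimates = build_cfg.DataflowBuildConfig(
        output_dir="output_estimates_only",
        target_fps=100000,
        synth_clk_period_ns=10.0,
        fpga_part="xc7z020clg400-1",
        steps=build_cfg.estimate_only_dataflow_steps,
        generate_outputs=[build_cfg.DataflowOutputType.ESTIMATE_REPORTS],
    )
    # writes report/estimate_layer_cycles.json and
    # report/estimate_network_performance.json (including critical_path_cycles)
    build.build_dataflow_cfg("model.onnx", cfg_estimates)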


pkdeep commented Jan 10, 2024

Hi,
Thanks a lot for the replies and help. I am able to move forward and experiment with different folding options for better performance.
Here are a few doubts I have:

  1. If only the weights of my model change, do I still need to go through the full flow, or are there other methods to update the weights?
  2. If I have sparsity in my model (a pruned model), will it help in the FINN implementation?
  3. If my model has multiple inputs (which are concatenated before different layers), how do I handle that in the FINN flow? I am attaching one example model, which has 5 inputs of different sizes. Images of both the exported ONNX model and the tidied-up model produced by FINN are attached herewith. FINN is creating dangling nodes for the other inputs. Any idea how to address this?

Thanks once again for your valuable help.
The model is defined like this:

        self.qid1 = qnn.QuantIdentity(bit_width=4, return_quant_tensor=True)
        self.qlin1 = qnn.QuantLinear(self.input_size, self.hidden2, bias=True, weight_bit_width=weight_bit_width)
        self.act1 = qnn.QuantReLU(bit_width=act_bit_width)
        self.qid2 = qnn.QuantIdentity(bit_width=4, return_quant_tensor=True)
        self.qlin2 = qnn.QuantLinear(self.hidden1, self.hidden2, bias=True, weight_bit_width=weight_bit_width)  # 256+64

    def forward(self, x1, x2, x3, x4, x5):
        x = self.qid1(x1)
        x = self.qlin1(x)
        x = self.act1(x)
        x11 = self.qid2(x2)
        x = torch.cat([x, x11], dim=1)
        x = self.qlin2(x)

[Image: step_tidy_up onnx]
[Image: random onnx]

Here are a few observations of mine that might help someone:

  1. Try to avoid using a VM. It creates more problems than it solves.
  2. Once you understand the basics, it is better to move from the notebooks to the fpga_build flow for experimenting with different architectures.
  3. Respect the 8192 limit; it saves a lot of time :)


pkdeep commented Jan 11, 2024

Hi @fpjentzsch,
An update to my last comment:
Instead of feeding multiple inputs to the model (which seems difficult to get converted to HLS), I now take a single combined input and slice it to feed the different layers, which seems to be a better option.
But when I try to streamline the model, I am still left with a few Mul and MatMul nodes which do not go away. I am attaching an image of the model, which in turn gives the cycle-free graph error when I run the dataflow transformation.
Can you help me?
[Image: partial_converted_parallel_model]
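For clarity, the single-combined-input approach described above would look roughly like this (a hypothetical sketch; attribute names and slice sizes are assumptions, not taken from the attached model):

    import torch

    def forward(self, x):  # x: (batch, in1 + in2) combined input
        x1 = x[:, : self.input_size]   # slice for the first branch
        x2 = x[:, self.input_size :]   # slice for the second branch
        a = self.act1(self.qlin1(self.qid1(x1)))
        b = self.qid2(x2)
        return self.qlin2(torch.cat([a, b], dim=1))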
