Unable to reproduce FPGA speed-up #9
Comments
Hi Kisaru,

I checked my notes on the raw experimental data we obtained for this particular test example: our FPGA kernel execution time on it was 0.30 sec instead of your 1.64 sec. I cannot find the raw CPU data, since in the paper we experimented with the whole data set; however, our highly optimized CPU kernel took 4.24 sec on this particular input, which roughly matches the results in our paper. Your software version might also run on a different CPU than our experiment environment (E5-2680v4), which could cause some discrepancy.

The huge gap in the FPGA kernel might be caused by the recent updates to the Vitis/Vivado HLS software; we haven't performed experiments on the recent versions. Could you please check whether the achieved II in the report is one? Also, what is the achieved operating frequency? Could you please post the HLS and xocc logs?

Thanks,
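For reference, the initiation interval (II) is the number of clock cycles between the starts of successive iterations of a pipelined loop, so an achieved II of one means the loop accepts a new iteration every cycle. A minimal, purely illustrative sketch of the kind of directive involved (this is not the repository's kernel code; the function, arrays, and loop bound are placeholders):

```cpp
// Illustrative only: a loop pipelined with a target initiation interval of 1.
// Vivado HLS reports the achieved II for such loops in the synthesis report; an
// achieved II greater than 1 means a new iteration cannot start every clock cycle.
void pipeline_example(const int in[1024], int out[1024]) {
    for (int i = 0; i < 1024; i++) {
#pragma HLS pipeline II=1
        out[i] = in[i] * 2;
    }
}
```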
Hi Jason,

Thank you very much for your quick reply! I have used "FPGA Developer AMI v1.6.1" with "AWS EC2 FPGA Development Kit v1.4.15a" so that I could use the same Xilinx 2018.3 toolset used in your work. Here I'm sending you the two files device_chain_kernel_vivado_hls.log and xocc_kernel-hw.log (available under kernel/hls/rpt/2018.3/kernel-hw/kernel-hw/ and kernel/hls/rpt/2018.3/kernel-hw/ respectively), which should help you identify the issue.

On a side note, GCC 4.8.5, the default compiler on the AWS F1 machine, failed to compile memory_scheduler.cpp with an error about initializing variable-sized arrays, so I had to make a few small modifications in memory_scheduler.cpp to initialize those arrays to zero with a loop instead of a brace initializer.
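A minimal sketch of the kind of change being described, with an assumed array name and element type (the actual variables in memory_scheduler.cpp may differ):

```cpp
// Illustrative only: GCC 4.8.5 rejects brace initialization of a variable-length
// array ("variable-sized object may not be initialized"), so the initializer is
// replaced by an explicit zeroing loop. The array name and type are assumptions.
void init_example(int n) {
    // Original form that fails under GCC 4.8.5:
    //   int buffer[n] = {0};
    int buffer[n];
    for (int i = 0; i < n; i++) {
        buffer[i] = 0;
    }
    (void)buffer;  // keep this standalone sketch warning-free
}
```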
However, I don't think the above change would have impacted the performance of the hardware kernel, since it is a very minor modification and the time for the memory scheduler/descheduler is not taken into account when evaluating the performance anyway. Also, the 1.64 sec I mentioned was the kernel's end-to-end time reported after a successful FPGA kernel execution, which includes the data transfer time as well (which I think is what you reported in the paper, too).

Just to be sure that I'm experimenting with the same 30,000 function calls, it would be great if you could send me the total number of anchors (or the total number of lines) available in the input dump file you used for this test.

Thank you very much again for all the help you are providing to identify this issue!

Best,
Hi Kisaru,

Thanks for the information. The problem looks more complex than I thought -- you indeed achieved II = 1. Unfortunately, Amazon is no longer sponsoring us with AWS credits, so I might not be able to rerun the experiments right away; I will try to get some credits if possible. In the meantime, there might be some factors you can take into account to understand the performance difference.
Thanks,
Hi Jason,

Thank you very much for looking into this further and trying to reproduce my issue on AWS! In response to the factors you pointed out: the 1.64 sec I reported is the sum of the transfer-in, compute, and transfer-out times (0.66 + 0.55 + 0.43 sec), which I extracted expecting that this is the timing considered in the paper. Instead, I think I should take max(transfer in, compute, transfer out) = max(0.66, 0.55, 0.43) = 0.66 sec as the timing for this dataset. However, I believe I cannot achieve this performance straight away with the current implementation, since double buffering is not already implemented in it, is that right?

Thank you!

Best,
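On the double-buffering point above: with two batches in flight, the host can overlap the data transfer of one batch with the kernel compute of another, so the steady-state time per batch approaches max(transfer in, compute, transfer out) rather than their sum. A minimal host-side sketch of that pattern follows (OpenCL C++ wrapper API; the batch layout, buffer setup, and function names are assumptions, not the testbed's actual host code):

```cpp
// Illustrative double-buffering driver, not the repository's host code.
// Assumes two kernel objects whose arguments are already bound to
// in_buf[s] / out_buf[s], device buffers large enough for each batch, and
// out_batches[i] pre-sized to the expected result size.
#include <CL/cl2.hpp>
#include <vector>

void run_double_buffered(cl::Context &ctx, cl::Device &dev,
                         cl::Kernel kernel[2],
                         cl::Buffer in_buf[2], cl::Buffer out_buf[2],
                         std::vector<std::vector<char>> &in_batches,
                         std::vector<std::vector<char>> &out_batches) {
    // One in-order queue per ping-pong slot; slot 0 and slot 1 can overlap.
    cl::CommandQueue q[2] = {cl::CommandQueue(ctx, dev),
                             cl::CommandQueue(ctx, dev)};

    for (size_t i = 0; i < in_batches.size(); ++i) {
        int s = static_cast<int>(i % 2);  // ping-pong slot for this batch
        q[s].finish();  // wait until the previous batch on this slot has drained
        q[s].enqueueWriteBuffer(in_buf[s], CL_FALSE, 0, in_batches[i].size(),
                                in_batches[i].data());            // transfer in
        q[s].enqueueTask(kernel[s]);                               // compute
        q[s].enqueueReadBuffer(out_buf[s], CL_FALSE, 0, out_batches[i].size(),
                               out_batches[i].data());             // transfer out
    }
    q[0].finish();
    q[1].finish();
}
```

Whether the transfers and compute actually overlap also depends on the platform and on the two slots using separate memory banks, which is what the next comment addresses.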
Hi Kisaru,

To get more memory bandwidth you can use multiple DDR banks: split the kernel ports across separate m_axi bundles, for example:
```cpp
void device_chain_kernel(
    block_io *anchors_0,
    block_io *returns_0,
    block_io *anchors_1,
    block_io *returns_1,
    // ...
) {
#pragma HLS interface m_axi port=anchors_0 offset=slave bundle=gmem_0
#pragma HLS interface m_axi port=returns_0 offset=slave bundle=gmem_0
#pragma HLS interface m_axi port=anchors_1 offset=slave bundle=gmem_1
#pragma HLS interface m_axi port=returns_1 offset=slave bundle=gmem_1
    // ...
    load_compute_store(anchors_0, returns_0, n, max_dist_x, max_dist_y, bw);
    load_compute_store(anchors_1, returns_1, n, max_dist_x, max_dist_y, bw);
}
```

Remember to assign the DDR banks to the bundles.
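A sketch of the kind of bank assignment this refers to, using the xocc --sp link option (the compute-unit name device_chain_kernel_1 and the bank numbers are assumptions here, and the exact --sp argument syntax varies between SDAccel/Vitis releases, so check the version you are using):

```sh
xocc --link ... \
    --sp device_chain_kernel_1.m_axi_gmem_0:bank0 \
    --sp device_chain_kernel_1.m_axi_gmem_1:bank1
```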
Thanks,
Hello!
I was able to successfully reproduce your FPGA work on an AWS F1 instance and obtain performance results to compare against pure software performance. To get the pure software performance, I created a 14-threaded software version that executes Minimap2's chaining function, mm_chain_dp (taken from the same Minimap2 version you used in your testbed), on an Intel Xeon CPU (a sketch of this kind of multi-threaded driver follows the command below). To obtain these performance results, I dumped the input/output of 30,000 function calls using the command below with the testbed you've provided with your work.
```sh
./minimap2 -ax map-pb c_elegans40x.fastq c_elegans40x.fastq --chain-dump-in in-c_elegans-30k.txt --chain-dump-out out-c_elegans-30k.txt --chain-dump-limit=30000 > /dev/null
```
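A minimal sketch of the shape of such a 14-threaded replay driver (illustrative only: the dump parsing and the actual mm_chain_dp invocation are stubbed behind the placeholder chain_one_call, and the repository's own testbed code may be organized differently):

```cpp
// Illustrative only: replay 30,000 dumped chaining calls across 14 worker threads.
#include <atomic>
#include <thread>
#include <vector>

static void chain_one_call(int call_idx) {
    // Placeholder: parse the call_idx-th dumped input and run mm_chain_dp on it.
    (void)call_idx;
}

int main() {
    const int num_calls = 30000;   // number of dumped function calls
    const int num_threads = 14;    // matches the 14-threaded comparison
    std::atomic<int> next(0);

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&]() {
            // Each worker grabs the next unprocessed call; calls are independent.
            for (int i = next.fetch_add(1); i < num_calls; i = next.fetch_add(1)) {
                chain_one_call(i);
            }
        });
    }
    for (auto &w : workers) {
        w.join();
    }
    return 0;
}
```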
The FPGA kernel time I obtained for the above input dump (in-c_elegans-30k.txt) was 1.6356766 seconds. For the same input data, my 14-threaded pure software version could complete its work in 6.649 seconds.
According to the above timing results, the speed-up over the 14-threaded software is nearly 4x. However, your publication reports a 28x speed-up against a 14-threaded software reference.
Could you please let me know if I have missed anything, or whether there are any specific considerations you took into account when doing the performance comparison?
Thank you very much!
Kind regards,
Kisaru