Skip to content

Latest commit

 

History

History
99 lines (74 loc) · 6.22 KB

README.md

File metadata and controls

99 lines (74 loc) · 6.22 KB

fpga-fft

A highly optimized streaming FFT core based on Bailey's 4-step large FFT algorithm: https://www.nas.nasa.gov/assets/pdf/techreports/1989/rnr-89-004.pdf

Data input/output are continuous with no gaps between frames.

Currently only supporting power-of-two sizes and fixed point data.

Resource usage is on par with Xilinx FFT IP core, and Fmax is up to 30% higher for common sizes.

Zynq-7000

Name Configuration Device LUTs FFs RAMB36 DSP48E1 Fmax
fft1024 24b data, 17b twiddle, rounded XC7Z010-1 1648 4087 2 16 350 MHz
fft1024_wide 32b data, 24b twiddle, rounded XC7Z010-1 2508 6096 3 32 310 MHz
fft1024_spdf_wide 32b data, 24b twiddle, rounded XC7Z010-1 3259 7101 4 32 310 MHz
fft4096 24b data, 17b twiddle, rounded XC7Z010-1 2073 5016 6 20 343 MHz

Kintex-7

Name Configuration Device LUTs FFs RAMB36 DSP48E1 Fmax
fft1024 24b data, 17b twiddle, rounded XC7K160T-1 1653 4087 2 16 458 MHz(1)
fft4096 24b data, 17b twiddle, rounded XC7K160T-1 2098 4942 6 20 458 MHz(1)
fft8192 24b data, 17b twiddle, rounded XC7K160T-1 2279 5760 14 24 458 MHz(1)
fft16k_2 24b data, 17b twiddle, rounded XC7K160T-1 2428 6563 29 28 458 MHz(1)
fft32k_wide 32b data, 24b twiddle, rounded XC7K160T-1 3982 10245 73 56 458 MHz(1)

Kintex Ultrascale

Name Configuration Device LUTs RAMB36 DSP48E1 Fmax
fft4096 24b data, 17b twiddle, rounded XCKU025-1 2071 9 20 525 MHz(1)(2)
fft8192_wide 24b data, 24b twiddle, rounded XCKU025-1 3137 16.5 48 525 MHz(1)(2)
fft32k_wide 32b data, 24b twiddle, rounded XCKU025-1 4222 76 56 501 MHz(2)

(1) Bottlenecked by block ram maximum frequency.

(2) Additional contraints are required on BRAM synthesis. See below.

(3) Fmax numbers are based on Vivado (2019.1) timing analysis with "Performance_Explore" synthesis strategy.

Architecture

The basic architecture is based on subdividing a size N = N1*N2 FFT into N2 FFTs of size N1 followed by reordering and multiplication by twiddle factors, then N1 FFTs of size N2.

block diagram

Usage

Top level VHDL code is generated by the script codegen/gen_fft.py. The VHDL sub-blocks in this repository are referenced by the generated code.

To generate a custom FFT size, edit the FFT layout definitions in codegen/gen_fft_layouts.py.

A layout definition looks like this:

fft4096 = \
	FFT4Step(4096,
		FFT4Step(64,
			FFT4Step(16, 
				fft4_scale_none,
				fft4_scale_none),
			fft4_scale_none),
		FFT4Step(64,
			FFT4Step(16, 
				fft4_scale_div_n,
				fft4_scale_div_n),
			fft4_scale_div_n),
		16); # twiddleBits

fft4_scale_none is a base FFT instance (butterfly), and FFT4Step represents the combination of two sub-FFTs to form a larger FFT of size N1*N2. See user guide for more details.

To use a generated FFT core it is necessary to generate all the twiddle ROM sizes used (twiddle ROM size is equal to N). For N <= 32 use gen_twiddle_rom_simple.py, and gen_twiddle_rom.py otherwise. On Linux the shell script gen_roms.sh will generate all common twiddle bit sizes and depths.

Data input and output order are described as an address bit permutation. The exact permutation varies by layout and can be found in the comments at the top of generated files. The do_gen_fft script will generate reorderers and natural order wrappers in addition to the FFT core.

When composing a FFT layout prefer 4-FFT butterflies over 2-FFT. Several 4-FFT butterfly implementations are provided, each with a different tradeoff between area and speed. A 2-FFT butterfly can be used for FFT sizes that are not perfect square.

Timing constraints

A few multi-cycle timing constraints are required (because some inner butterfly implementations deserialize the data and present data every 2 or 4 cycles to the butterfly implementation):

set_multicycle_path -setup -start -from [get_pins -hierarchical *fftIn_mCycle*/C] -to [get_pins -hierarchical *fftOut_mCycle*/D] 4
set_multicycle_path -hold -start -from [get_pins -hierarchical *fftIn_mCycle*/C] -to [get_pins -hierarchical *fftOut_mCycle*/D] 3

set_multicycle_path -setup -start -from [get_pins -hierarchical *fftIn_2Cycle*/C] -to [get_pins -hierarchical *fftOut_2Cycle*/D] 2
set_multicycle_path -hold -start -from [get_pins -hierarchical *fftIn_2Cycle*/C] -to [get_pins -hierarchical *fftOut_2Cycle*/D] 1

For Ultrascale parts you may also need to force the use of width expansion rather than depth expansion of BRAMs to meet timing. After running implementation look for failed timing paths that pass through a series of BRAMs, and constrain those BRAMs to disable cascading. The constraints will look like this:

set_property CASCADE_ORDER_B NONE [get_cells fft/top_core/transp/ram/ram1_reg_bram_0]
set_property CASCADE_ORDER_B NONE [get_cells fft/top_core/transp/ram/ram1_reg_bram_1]
set_property CASCADE_ORDER_B NONE [get_cells fft/top_core/transp/ram/ram1_reg_bram_2]
set_property CASCADE_ORDER_B NONE [get_cells fft/top_core/transp/ram/ram1_reg_bram_3]
set_property CASCADE_ORDER_B NONE [get_cells fft/top_core/transp/ram/ram1_reg_bram_4]
set_property CASCADE_ORDER_B NONE [get_cells fft/top_core/transp/ram/ram1_reg_bram_5]

The exact instance names will need to be adjusted based on the timing report. See https://forums.xilinx.com/t5/UltraScale-Architecture/Prevent-Block-Ram-Cascade-Chain/td-p/645310