Version: Vitis 2024.1
- Polyphase Channelizer
The polyphase channelizer [1] simultaneously down-converts a set of frequency-division multiplexed (FDM) channels carried in a single data stream using an efficient approach based on digital signal processing. Channelizers are ubiquitous in wireless communications systems. Channelizer sampling rates increase steadily as the capabilities of RF-DAC and RF-ADC technology advance, making them challenging to implement in high-speed reconfigurable devices, such as field programmable gate arrays (FPGAs). This tutorial implements a high-speed channelizer design using a combination of AI Engine and programmable logic (PL) resources in AMD Versal™ adaptive SoC devices.
The following table shows the system requirements for the polyphase channelizer. The input sampling rate is 10.5 GSPS. The design supports M=16 channels with each one supporting 10.5G / 16 = 656.25 MHz of bandwidth. The channelizer employs a polyphase technique as outlined in [1] to achieve an oversampled output at a rate of P/Q = 8/7 times the channel bandwidth, or 656.25 * 8/7 = 750 MSPS. The prototype filter used by the channelizer uses K=8 taps per phase, leading to a total of 16 x 8 = 128 taps overall.
Parameter | Value | Units |
---|---|---|
Input Sampling Rate (Fs) | 10.5 | GSPS |
# of Channels (M) | 16 | channels |
Interpolation Factor (P) | 8 | n/a |
Decimation Factor (Q) | 7 | n/a |
Channel Bandwidth | 656.25 | MHz |
Output Sampling Rate | 750 | MSPS |
# of taps per phase (K) | 8 | n/a |
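The derived values in the table follow directly from Fs, M, P, Q, and K. The short Python sketch below reproduces that arithmetic; the variable names are illustrative and are not taken from the tutorial sources.

```python
from fractions import Fraction

FS_MSPS = Fraction(10500)   # input sampling rate: 10.5 GSPS expressed in MSPS
M = 16                      # number of channels
P, Q = 8, 7                 # output oversampling ratio P/Q
K = 8                       # taps per polyphase phase

channel_bw_mhz = FS_MSPS / M                # bandwidth of each channel
output_rate_msps = channel_bw_mhz * P / Q   # oversampled output rate per channel
total_taps = M * K                          # length of the prototype filter

print(float(channel_bw_mhz))    # 656.25
print(float(output_rate_msps))  # 750.0
print(total_taps)               # 128
```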
The following figure shows a block diagram of the polyphase channelizer. The following five blocks perform the required signal processing functions:
- The Circular Buffer converts the scalar input data stream into an M-vector output format for the downstream blocks, and introduces state to manage the P/Q output oversampling. Its memory depth spans the full extent of M x K samples. Conceptually, the circular buffer operates on an M x K array, employing a "serpentine shift" to introduce S = M x Q / P samples (S = 14 for this design) to each new output block. The remaining M - S samples come from the state history.
- The Polyphase Filter implements a parallel bank of M filters across the columns of the M x K circular buffer. Each filter employs K = 8 coefficients taken from an M-phase decomposition of the channelizer prototype filter. The filter produces a single vector of M output samples.
- The Cyclic Shift Buffer removes frequency-dependent phase shifts from the downstream Inverse Discrete Fourier Transform (IDFT) outputs using a memoryless and periodically time-varying circular shift of its inputs. A finite state machine (FSM) manages the sequence of input permutations across each input block. The number of states depends on the specific oversampling ratio factors P and Q and number of channels M.
- The Inverse Fast Fourier Transform (IFFT) performs an IDFT operation on its input vector of M samples to produce a transformed vector of output samples. In the channelizer context, the IDFT performs a parallel bank of M frequency down-conversion operations. Each IDFT output represents a separate down-converted channel of bandwidth Fs / M sampled at a rate of Fs / M * P / Q samples per second.
- The output buffer prepares the output channel samples for consumption by downstream processing. It is not included in this reference design.
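The M-phase decomposition used by the Polyphase Filter can be sketched as follows. This is a minimal illustration that assumes the common convention in which phase m takes every M-th tap of the prototype filter starting at tap m; the coefficient ordering used in the tutorial's AI Engine kernels may differ, and the prototype values here are placeholders rather than real filter coefficients.

```python
M = 16  # number of channels (phases)
K = 8   # taps per phase

def polyphase_decompose(prototype, num_phases):
    """Split a prototype filter into num_phases sub-filters.

    Assumed convention: phase m holds taps h[m], h[m + M], h[m + 2M], ...
    """
    if len(prototype) % num_phases != 0:
        raise ValueError("prototype length must be a multiple of num_phases")
    return [prototype[m::num_phases] for m in range(num_phases)]

# Placeholder prototype: each tap value equals its index, to make the mapping visible.
prototype = list(range(M * K))
phases = polyphase_decompose(prototype, M)

print(len(phases), len(phases[0]))  # 16 8
print(phases[3])                    # [3, 19, 35, 51, 67, 83, 99, 115]
```

Each of the 16 sub-filters ends up with K = 8 coefficients, which matches the 128-tap prototype described above.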
The following figure shows a system model of the polyphase channelizer built in MATLAB and encapsulated in a MATLAB app (GUI). This provides a comprehensive golden model of the channelizer algorithms and illustrates the relationships between the various system parameters. The model was built to support a broader range of parameter settings than the actual Versal adaptive SoC design:
- The model supports two different input sampling rates: Fs = 10.5 GSPS and Fs = 20.5 GSPS.
- The number of channels M can be set to 16, 32, 64, or 128 using a dial.
- The output oversampling ratio P/Q may be set to 1/1, 2/1, 4/3, or 8/7 using the appropriate button.
- The number of active channels can be entered in the bottom left. This value must be less than the chosen value of M.
The model may be run by pressing the "Go" button. When this occurs, the model generates the desired number of active channels and positions them in randomly chosen carrier locations. Each signal is modeled as filtered Gaussian noise for simplicity. The model displays the impulse response of the prototype channelizer filter computed for the given system parameters in the top left plot. The bottom left plot shows this same filter in the frequency domain in red along with the actual signal to be extracted by the channelizer in blue. The top right plot shows the input spectrum to the channelizer along with the active carriers and their index labels. The bottom right plot shows the extracted channels at baseband in the time domain, where the blue signals are the channelizer inputs (delayed by the known group delay of the channelizer), and the red signals are the channelizer outputs.
This section outlines the system partitioning for the polyphase channelizer. This involves analyzing the characteristics of its five functional blocks to identify which should be implemented in AI Engines versus PL to establish a data flow with sufficient bandwidth to support the required computations.
Channelizers today can operate at sampling rates between 10 and 20 GSPS. With typical AI Engine and PL clock rates of 1 GHz and 500 MHz respectively, this implies channelizers require Super Sample Rate (SSR) operation, where several I/O samples are produced and consumed on every clock cycle. A feasible clocking strategy is based on the following:
- IFFT processing employs sizes N = 2^m and hardware solutions become overly complex unless SSR = 2^n. Here SSR = 4, 8, or 16 makes sense given M = 16 for this design.
- Hardware design is further simplified when the input sampling rate Fs contains a factor of Q=7 matching its output oversampling factor P/Q = 8/7 because the output sampling rate is then an integral number of clock cycles.
- AI Engine supports clock rates ranging from Fc = 1.0 GHz to 1.3 GHz depending on speed grade. It follows SSR = Fs/Fc ranges from 10/1.3 to 20/1.0.
A suitable clocking strategy can be identified based on these considerations. This tutorial targets a nominal Fs = 10 GSPS with SSR = 8, which gives a nominal AI Engine clock rate of Fc = 1.25 GHz. This performance can be met with a "-2M" speed grade device, with the specific clock rates chosen to satisfy the Q=7 divisibility requirement.
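The choice of SSR can be checked numerically. The sketch below computes the clock rate each candidate SSR would require at the nominal Fs = 10 GSPS and compares it against the 1.3 GHz upper end of the AI Engine clock range quoted above. The function and constant names are illustrative only.

```python
FS_GSPS = 10.0               # nominal input sampling rate targeted by the tutorial
FC_MAX_GHZ = 1.3             # upper end of the AI Engine clock range quoted above
SSR_CANDIDATES = (4, 8, 16)  # power-of-two SSR values considered for M = 16

def required_clock_ghz(fs_gsps, ssr):
    """Clock rate needed to move `ssr` samples per cycle at a rate of fs_gsps."""
    return fs_gsps / ssr

for ssr in SSR_CANDIDATES:
    fc = required_clock_ghz(FS_GSPS, ssr)
    verdict = "ok" if fc <= FC_MAX_GHZ else "too fast"
    print(f"SSR={ssr:2d}: Fc={fc:.4f} GHz ({verdict})")
```

SSR = 4 would need a 2.5 GHz clock, which is out of range, while SSR = 8 lands at the 1.25 GHz used in this design.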
The following figure shows a diagram of the M x K Circular Buffer described earlier. Each cell contains one sample "x(n)", where each sample is labelled with its time index "n". Note there are M=16 rows and K=8 columns. The diagram shows the evolution of the buffer contents over three consecutive time epochs of the buffer. The leftmost column represents the current input samples. There are M=16 samples in total. Fourteen of these labelled in red are input to the buffer over two cycles. The two samples labelled in blue represent history samples from the previous epoch.
Notice how the circular or "serpentine" shift operates on the M x K buffer. From the left to the middle, the buffer is shifted down by 14 samples. The bottom of each column is shifted around to the top of the next column to the right. Samples shifted out of the rightmost column are discarded. Notice how the red input samples "x13" and "x12" in the top two rows on the left become the blue state samples "x13" and "x12" in the bottom two rows in the middle. This is how the Circular Buffer introduces state into the filterbank processing.
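The serpentine shift is equivalent to treating the M x K buffer as one long delay line stored column by column. The sketch below models this behavior in plain Python using time indices instead of sample values. It is a behavioral illustration under that assumption, not the HLS implementation used in the tutorial, and the function names are hypothetical.

```python
M, K = 16, 8      # rows and columns of the circular buffer
P, Q = 8, 7       # output oversampling ratio P/Q
S = M * Q // P    # new samples per epoch (14)

def serpentine_shift(delay_line, new_samples):
    """Push new_samples (oldest first) into a newest-first delay line of fixed length."""
    keep = len(delay_line) - len(new_samples)
    return list(reversed(new_samples)) + delay_line[:keep]

def columns(delay_line):
    """View the flat delay line as K columns of M samples (column 0 is the newest)."""
    return [delay_line[c * M:(c + 1) * M] for c in range(K)]

# Epoch 0: time indices 13, 12, 11, ... with x13 in the top-left cell.
epoch0 = [13 - i for i in range(M * K)]
# Epoch 1: fourteen new samples x14..x27 are shifted in.
epoch1 = serpentine_shift(epoch0, list(range(14, 14 + S)))

print(columns(epoch0)[0][:2])   # [13, 12]  top two rows of the left column
print(columns(epoch1)[0][-2:])  # [13, 12]  bottom two rows after the shift
```

The samples x13 and x12 move from the top two rows to the bottom two rows of the first column, which is the state carry-over described above.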
The filterbank needs to process each row in the M x K array as a normal FIR filter. This is depicted as the green rectangle in the following figure. Notice, however, how the "state history" inside the green rectangle does not contain the normal "time-shifted" samples one usually sees within the state of an FIR filter. The sample ordering is jumbled and is unrelated over time. This cannot be implemented as a normal finite impulse response (FIR) filter in the AI Engine because the state history is not "linear". Not only the input sample, but the entire state history would have to be input to the FIR on every cycle. This is not feasible.
However, the yellow boxes reveal a solution. Note how the time indices of the samples within the yellow boxes do exhibit the desired "time-shifted" characteristic of a normal FIR filter state. On each time sample, the state contents within the yellow boxes are shifted by one sample, making room for a new one. But these yellow boxes correspond to different logical filters of the filterbank. Consequently, a workable solution may be achieved by mapping logical filters (that is, different rows in the M x K matrix) to the physical AI Engine tiles performing those filters. This mapping changes over time on a sample-by-sample basis as indicated by the following figure, and acts as a "card dealing" operation where the input samples to the desired logical filters are dealt to different physical AI Engine tiles. Inside those AI Engine tiles, the state history exhibits time-shifted state. The outputs of the physical tiles must then undergo an inverse "card dealing" pattern to assign the output samples to the proper logical filter. This "card dealing" permutation is implemented easily in the PL using routing and multiplexing logic resources.
The AI Engine supports 16 MAC/cycle with "cint16" data and "int16" coefficients. It follows that four samples of a K=8 tap filter require 4 x 8 / 16 = 2 cycles of compute. A single I/O stream delivers exactly four samples over four cycles. It follows that this design is "I/O bound" rather than "compute bound" because the compute is busy only 50% of the time. The system must process M=16 samples every two cycles. It follows that eight AI Engine tiles provide sufficient bandwidth with single-stream I/O, each tile performing the compute for two filterbank channels. Additional design details are given below.
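The tile budget for the filterbank can be reproduced with the following arithmetic sketch. All constants come from the paragraph above; the variable names are illustrative.

```python
MACS_PER_CYCLE = 16            # cint16 data x int16 coefficients
K = 8                          # taps per filter
M = 16                         # filterbank channels
SAMPLES_PER_BURST = 4          # samples processed per filter invocation
STREAM_SAMPLES_PER_CYCLE = 1   # one stream delivers four samples over four cycles

compute_cycles = SAMPLES_PER_BURST * K // MACS_PER_CYCLE     # cycles of MAC work
io_cycles = SAMPLES_PER_BURST // STREAM_SAMPLES_PER_CYCLE    # cycles to stream the data
utilization = compute_cycles / io_cycles                     # fraction of time compute is busy

# The filterbank must accept M samples every two cycles.
required_samples_per_cycle = M // 2
tiles = required_samples_per_cycle // STREAM_SAMPLES_PER_CYCLE
channels_per_tile = M // tiles

print(compute_cycles, io_cycles, utilization)  # 2 4 0.5
print(tiles, channels_per_tile)                # 8 2
```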
The cyclic shift performs no computations; it simply applies a memoryless "cyclic shift" permutation to each input M-vector. No buffering occurs between inputs. The shift amount varies according to an eight-state FSM in this design. This block is a poor fit for the AI Engine array because its stream routing is more restrictive than PL for introducing permutations, and there is no compute required to warrant it. This function is a natural fit for a "PL Data Mover" and can be implemented easily using Vitis HLS.
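The number of FSM states can be derived from M, P, and Q. The sketch below assumes the shift applied to output block n is (n x S) mod M with S = M x Q / P, which is a common formulation for oversampled polyphase channelizers; the exact shift direction and starting state used in the tutorial's HLS code are not shown here and may differ. Under this assumption the sequence repeats after M / gcd(M, S) blocks.

```python
from math import gcd

def shift_sequence(M, P, Q):
    """Return one period of cyclic-shift amounts, assuming shift(n) = (n * S) mod M."""
    S = M * Q // P             # new samples per output block
    period = M // gcd(M, S)    # number of distinct FSM states
    return [(n * S) % M for n in range(period)]

seq = shift_sequence(16, 8, 7)
print(len(seq))  # 8
print(seq)       # [0, 14, 12, 10, 8, 6, 4, 2]
```

For M = 16 and P/Q = 8/7 this gives eight states, consistent with the FSM size quoted above. For the critically sampled case P/Q = 1/1 the same function returns a single state with zero shift.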
The IDFT or IFFT must perform an M=16 point transform at the input sample rate Fs. Given the design adopts SSR = 8, it follows a complete transform must be performed once every M / SSR = 16/8 = 2 cycles. This is a very high throughput rate given the M=16 transform involves either four stages of Radix-2 butterflies (32 total) or two stages of Radix-4 butterflies (eight total). This is challenging to achieve at a sustained rate of two cycles per transform given the overhead of butterfly addressing required for FFT solutions.
In this case, computing the IDFT directly as a "matrix multiplication" provides a workable solution. For the "cint16" data types adopted in this design, the AI Engine is capable of performing a single [1x2] x [2x4] vector-matrix product "OP" per cycle. The IDFT for M=16 requires a [1x16] x [16x16] vector-matrix product, equivalent to 16 x 16 / (2 x 4) = 32 such OPs. It follows that 16 AI Engine tiles are required to implement the IDFT matrix product in two cycles.
To sustain this 100% compute efficiency, each tile must use two input streams and compute one OP every cycle without stalling. The final output tiles must deliver four samples every two cycles to meet the desired throughput. More design details are given below.
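The operation counts quoted for the IDFT can be verified with the short sketch below. It counts complex multiply-accumulates for the direct matrix form and, for comparison, the butterfly counts of the FFT alternatives mentioned earlier. The variable names are illustrative.

```python
M = 16                      # transform size
OP_ROWS, OP_COLS = 2, 4     # one OP is a [1x2] x [2x4] vector-matrix product
CYCLES_PER_TRANSFORM = 2    # M / SSR with SSR = 8

macs_per_transform = M * M                              # [1x16] x [16x16] product
macs_per_op = OP_ROWS * OP_COLS                         # complex MACs in one OP
ops_per_transform = macs_per_transform // macs_per_op   # OPs per transform
tiles = ops_per_transform // CYCLES_PER_TRANSFORM       # one OP per tile per cycle

radix2_butterflies = 4 * (M // 2)   # four stages of eight Radix-2 butterflies
radix4_butterflies = 2 * (M // 4)   # two stages of four Radix-4 butterflies

print(ops_per_transform, tiles)                # 32 16
print(radix2_butterflies, radix4_butterflies)  # 32 8
```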
The following figure shows a hardware diagram of the final polyphase channelizer design. It consists of the following elements:
- The DMA Stream Source block uses a block RAM buffer to store channelizer input samples from DDR memory sampled at Fs. These samples are played out over seven AXI streams into the channelizer design. This block is implemented in PL using HLS at 312.5 MHz.
- The Input Permute block introduces the "serpentine shift" required by the Circular Buffer plus any "card dealing" permutations as dictated by the periodic logical-to-physical channel pattern to drive the AI Engine filterbank with proper data to establish fixed state history patterns in the array. This block is implemented in PL using HLS at 312.5 MHz.
- The Filterbank is implemented as an AI Engine sub-graph using the design approach detailed below. The design uses eight tiles and has eight I/O AXI streams. The AI Engine array is clocked at 1.25 GHz.
- The Output Permute block removes the "card dealing" permutation applied for the filterbank processing so its output ordering has been restored prior to addition of the cyclic shift. This block is implemented in PL using HLS at 312.5 MHz.
- The IDFT is implemented as an AI Engine sub-graph using the design approach detailed below. The design uses 16 tiles and has eight I/O AXI streams.
- The DMA Stream Sink block uses a block RAM buffer to capture the channelizer output samples and return them to DDR memory. The block is implemented in PL using HLS at 312.5 MHz.
The following figure shows the physical layout of the AI Engine array for the polyphase channelizer design. The overall design requires 24 tiles. The IDFT uses 4 x 4 = 16 tiles and the Filterbank uses 4 x 2 = 8 tiles. A total of 22 tiles are used for buffering. The design uses 32 PLIO in total, 16 for input and 16 for output.
The following figure shows the VC1902 die layout for the polyphase channelizer and summarizes the AI Engine and PL resources needed to build the full design.
The following figure shows the software scheduling of the polyphase filterbank design. Each tile implements the filtering for two physical channels, in this case "A" and "B". The stream inputs collect four samples over four cycles, alternately for each channel. Similarly, the compute is performed alternately over two cycles for each channel. The output results are then produced alternately on the output stream over another four cycles. This loop is scheduled with II=8 to achieve the desired throughput.
The compute gaps in the following figure, together with the fact that each AI Engine tile contains not one but two I/O streams, raise the question of why this design uses eight tiles when perhaps only four are required from a compute-bound perspective. Although the AI Engine supports two input and two output streams, a VLIW hardware restriction limits their use to (i) two inputs and one output, (ii) one input and two outputs, or (iii) one input and one output. It was not feasible to schedule an II=8 loop supporting four filters in a single tile.
The following figure shows a diagram of how the "vector x matrix" multiplication form of the IDFT is vectorized and mapped to the AI Engine array of 4 x 4 = 16 tiles. The figure shows two consecutive IDFT transforms, one above the other. Recall each full transform is performed over two cycles. The operation of the design is outlined as follows:
- The design consists of a 4 x 4 array of tiles. Each tile performs two [1x2] x [2x4] operations over two cycles. Each row of tiles passes its computed outputs to the tile below in the same column using the cascade stream.
- Four samples are input on each of two input streams for each tile. The same data is broadcast to each tile in the row. For example, the orange input samples are broadcast to all tiles in the orange row, whereas the purple input samples are broadcast to all tiles in the purple row.
- Notice how the four input samples on a given stream span particular consecutive samples of a pair of transform inputs. For example, the four orange inputs on stream "ss0" contain the first two samples in the top (current) and bottom (next) input vector. Similarly, the four left-most purple samples on (unlabelled) stream "ss4" contain the 9th and 10th samples in the top and bottom input vectors.
- The array combines outputs top-to-bottom (in the diagram) using the cascade streams. The four tiles in the bottom row produce the outputs, writing four samples every four cycles on both streams in each tile. Note in the physical array, the cascade streams run horizontally left to right — the physical layout is rotated 90 degrees from the diagram in the following figure.
- Each full compute takes two cycles, with throughput sustained at that rate with 100% efficient compute in each AI Engine tile.
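The partitioning described in the list above can be modeled in software to confirm that it reproduces the full IDFT. The sketch below is one reading of that mapping, written in plain Python: each tile row is assumed to handle four consecutive input samples, each tile column four consecutive outputs, and partial sums accumulate down each column as the cascade stream would. The IDFT is left unnormalized, and the function names are hypothetical rather than taken from the AI Engine graph.

```python
import cmath

M = 16            # transform size
GRID = 4          # 4 x 4 array of tiles
BLK = M // GRID   # each tile covers 4 inputs x 4 outputs

def idft_matrix(n):
    """Unnormalized IDFT matrix W[k][m] = exp(+j*2*pi*k*m/n)."""
    return [[cmath.exp(2j * cmath.pi * k * m / n) for m in range(n)] for k in range(n)]

def idft_direct(x):
    """Reference: the full [1xM] x [MxM] vector-matrix product."""
    W = idft_matrix(len(x))
    return [sum(x[k] * W[k][m] for k in range(len(x))) for m in range(len(x))]

def idft_tiled(x):
    """Same product split over a GRID x GRID array with column-wise accumulation."""
    W = idft_matrix(M)
    y = [0j] * M
    for col in range(GRID):          # one column of tiles per group of 4 outputs
        cascade = [0j] * BLK         # running partial sums passed down the column
        for row in range(GRID):      # tile row `row` sees inputs 4*row .. 4*row+3
            for half in range(2):    # two [1x2] x [2x4] products per tile
                for i in range(2):
                    k = row * BLK + half * 2 + i
                    for j in range(BLK):
                        cascade[j] += x[k] * W[k][col * BLK + j]
        y[col * BLK:(col + 1) * BLK] = cascade
    return y

x = [complex(n, -n) for n in range(M)]   # arbitrary test vector
err = max(abs(a - b) for a, b in zip(idft_direct(x), idft_tiled(x)))
print(err < 1e-9)  # True
```

The tiled result matches the direct product to within floating-point rounding, and the loop structure uses 16 tiles x 2 products = 32 OPs per transform.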
The polyphase channelizer design can be built easily from the command line.
IMPORTANT: Before beginning the tutorial ensure you have installed Vitis™ 2024.1 software. Ensure you have downloaded the Common Images for Embedded Vitis Platforms from this link.
Set the environment variable `COMMON_IMAGE_VERSAL` to the full path where you have downloaded the Common Images. The remaining environment variables are configured in the top-level Makefile `<path-to-design>/04-Polyphase-Channelizer/Makefile`. The tutorial will build its own custom platform.
The channelizer design can be built for hardware emulation using the Makefile as follows:
[shell]% cd <path-to-design>/04-Polyphase-Channelizer
[shell]% make all TARGET=hw_emu
This will take about 90 minutes to run. The build process will generate a folder `04-Polyphase-Channelizer/package` containing all the files required for hardware emulation. This can be run as shown below. An optional `-g` can be applied to the `launch_hw_emu.sh` command to launch the Vivado waveform GUI to observe the top-level AXI signal ports in the design.
[shell]% cd <path-to-design>/04-Polyphase-Channelizer/package
[shell]% ./launch_hw_emu.sh -run-app embedded_exec.sh
The channelizer design can be built for the VCK190 board using the Makefile as follows:
[shell]% cd <path-to-design>/04-Polyphase-Channelizer
[shell]% make all TARGET=hw
The build process will generate the SD card image in the `04-Polyphase-Channelizer/package/sd_card` folder.
The Power Design Manager (PDM) is the new, next-generation power estimation platform designed to bring accurate and consistent power estimation capabilities to the largest Versal and AMD Kria™ SOM products. It is the preferred power estimation tool for the Versal product family. More information can be found on the Power Design Manager (PDM) product page and in the Power Design Manager User Guide (UG1556).
The PDM has three modes to estimate power:
- Manual Estimation Flow: All device and design parameters including device part, design resources (AI Engine, PL and PS), clocks, toggle rate, etc. are input manually into the GUI.
- Import Compilation Flow: The file generated from XPE or Vivado Report Power is imported into the PDM after compiling the design.
- Import Simulation Flow: The file generated from XPE or Vivado Report Power is imported into the PDM after simulating the design.
This example uses the Import Compilation Flow mode to perform a Vectorless Power Analysis as defined in the Vivado Design Suite User Guide: Power Analysis and Optimization (UG907). This estimate is refined by running a simulation of the AI Engine portion of the design and updating the initial estimate.
[shell]% make all power TARGET=hw
This performs the following tasks:
- Compiles the design targeting vck190.
- Runs the `vivado_xpe` Makefile target under `vitis/final`, which opens the compiled design in Vivado and runs `report_power`. The output of this step is `system_power.xpe`, which is located in the `vitis/final/build_hw/_x/link/vivado/vpl/prj` folder.
- Runs the `vitis_xpe` Makefile target under `aie/m16_ssr8`, which simulates the AI Engine portion of the design and produces a refined power estimate. The output of this step is `m16_ssr8_app.xpe`, which is located in the `aie/m16_ssr8/aiesim_xpe/` folder.
- Launch the PDM.
- Select New Project from the Start menu. The New Project dialog box opens.
- In the New Project dialog box, type a name for your project.
- In Project location, specify a directory where the project files will be stored.
- Check the Create project subdirectory checkbox.
- Select the Import XPE file checkbox and provide the path to `system_power.xpe`.
- Click Next, then click Finish.
The following screen is displayed.
In the Import XPE wizard, provide the path to the .xpe file you want to import and click OK.
The following screen is displayed.
The following table shows a comparison between power estimates in compilation versus simulation flows in the PDM.
Component | Compilation Flow: Static (W) | Compilation Flow: Dynamic (W) | Compilation Flow: Total (W) | Simulation Flow: Static (W) | Simulation Flow: Dynamic (W) | Simulation Flow: Total (W) |
---|---|---|---|---|---|---|
PL | 7.5 | 2.8 | 10.3 | 7.5 | 2.8 | 10.3 |
AI Engine | 4.8 | 4.3 | 9.1 | 4.8 | 4.2 | 9.0 |
PS+PMC | 0.2 | 1.3 | 1.5 | 0.2 | 1.3 | 1.5 |
Everything else (NoC, DDRMC, GTY, etc.) | 1.0 | 8.1 | 9.1 | 1.0 | 8.1 | 9.1 |
Total (W) | 13.5 | 16.4 | 29.9 | 13.5 | 16.3 | 29.8 |
[1] F.J. Harris et al., "Digital Receivers and Transmitters Using Polyphase Filter Banks for Wireless Communications", IEEE Transactions on Microwave Theory and Techniques, Vol. 51, No. 4, April 2003.
GitHub issues will be used for tracking requests and bugs. For questions, go to support.xilinx.com.
Copyright © 2023-2024 Advanced Micro Devices, Inc