Add experimental cuFFT support. #587

cbalint13 · 2020-11-08T18:20:21Z

Enable offload experimental FFT processing on cuda based GPUs.

Description

Proposed is a very simple patch using drop-in swap of actual main FFTW3 routines, see: cuFFTW interface
cuFFT don't support R2R so those are skipped in this PR, however they are not used (yet) in the srsLTE runtime.

Target

To improve the under 4ms computation requirement in the very tight timeframe only with limited threads is very challenging.

Evaluation

Enabling cuFFT with proposed patch works fine just as the cpu target code, no degradation or loss was experienced.
According to a simple benchmark things are slower on CUDA vs CPU having @1024pts target sizes:

Running ./cufftwf-benchmark
---------------------------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------
cu_fftwf/1024/manual_time         23298 ns        23283 ns        29583 bytes_per_second=335.33M/s items_per_second=43.9524M/s
cu_fftwf/2048/manual_time         27701 ns        27674 ns        25377 bytes_per_second=564.059M/s items_per_second=73.9324M/s
cu_fftwf/524288/manual_time     2002004 ns      1996820 ns          352 bytes_per_second=1.95117G/s items_per_second=261.882M/s
cu_fftwf/1048576/manual_time    4032475 ns      4022351 ns          175 bytes_per_second=1.9374G/s items_per_second=260.033M/s
cu_fftwf/manual_time_BigO          3.85 N          3.84 N    
cu_fftwf/manual_time_RMS              3 %             3 %    

Running ./fftw3f-benchmark
------------------------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------
fftwf/1024/manual_time          1644 ns         1669 ns       425523 bytes_per_second=4.63961G/s items_per_second=622.718M/s
fftwf/2048/manual_time          4070 ns         4094 ns       171156 bytes_per_second=3.74898G/s items_per_second=503.18M/s
fftwf/524288/manual_time    10701510 ns     10675552 ns           64 bytes_per_second=373.779M/s items_per_second=48.992M/s
fftwf/1048576/manual_time   33639549 ns     33554670 ns           20 bytes_per_second=237.815M/s items_per_second=31.1709M/s
fftwf/manual_time_BigO          0.00 N^2        0.00 N^2  
fftwf/manual_time_RMS             16 %            16 %

But being offloaded, the CPU may benefit some more free time slots on some low end SBC.
Beyond FFT, more benefits for baseband may come from Nvidia Aerial's cuPHY , targeting FEC / TurboCodes too.

@andrepuschmann , @suttonpd , @ismagom looking forward your thoughts.

Thank You !

CLAassistant · 2020-11-08T18:20:26Z

All committers have signed the CLA.

andrepuschmann

Hey Balint, thanks a lot for that. We are looking at GPU-based acceleration for a while but didn't see an actual use-case just yet Primarily, as you've pointed out, FFTs and other basic operations are super efficient on x86. That might change for coding and other more complex PHY procedures. We'll see. We also follow Aerial, but didn't try it out or have seen any benchmark results. Definitely looking forward to it. That being said, I think your PR is a good basis for possible GPU offloading using CUDA.

lib/src/phy/dft/dft_fftw.c

xavierarteaga · 2020-11-14T23:35:56Z

Hello, thanks for your experiment. I was curious and benchmarked it. There is an OFDM unit test that can be used for benchmark: lib/src/phy/dft/test/ofdm_test -N 2048 -n 100 -r 10000. The benchmark result is the ratio of the number of samples divided by the time it took to process.

For USE_CUDA=Off (i7 7800X OC 4.5GHz):

Running test for 100 PRB, 16800 RE...  [email protected] [email protected] MSE=0.000005

For USE_CUDA=On (GeForce GTX 970 WindForce 3X OC):

Running test for 100 PRB, 16800 RE...  [email protected] [email protected] MSE=0.000005

Also, the whole DL processing chain (including PDCCH and PDSCH encoding/decoding) can be tested and benchmarked lib/test/phy/phy_dl_test -p 100 -s 1000 -m 28. This test prompts the ratio of the number of bits encoded in a subframe divided by the processing time.

For USE_CUDA=Off:

lib/test/phy/phy_dl_test -p 100 -s 1000 -m 28
Finished! The UE failed decoding 0 of 1000 transport blocks.
75376000 were transmitted, 75376000 bits were received.
[Rates in Mbps] Granted  Processed
           eNb:    75.4      263.4
            UE:    75.4      102.6
BLER:   0.0%
Ok

For USE_CUDA=On:

Finished! The UE failed decoding 0 of 1000 transport blocks.
75376000 were transmitted, 75376000 bits were received.
[Rates in Mbps] Granted  Processed
           eNb:    75.4      142.1
            UE:    75.4       79.5
BLER:   0.0%
Ok

I am curious about how the cuPHY may perform.

andrepuschmann · 2021-04-23T08:03:34Z

Just a quick follow-up on this. We've decided to leave the PR for the upcoming release, simply because the user benefit isn't obvious right now. We are happy to leave the PR open and build on top of it should Nvidia decide to make cuBB available publicly. Thanks again @cbalint13 for your contribution. Those are much appreciated.

cbalint13 · 2021-05-03T20:44:22Z

Just a quick follow-up on this. We've decided to leave the PR for the upcoming release, simply because the user benefit isn't obvious right now. We are happy to leave the PR open and build on top of it should Nvidia decide to make cuBB available publicly. Thanks again @cbalint13 for your contribution. Those are much appreciated.

@andrepuschmann , @xavierarteaga , @ismagom

Actual srsRAN implementation seems to have little benefits with CUDA, however it is a encouragement for future development.

Notes on more various tests and benchmarks:

the fft/ifft buffers/batches are too small for substantial benefits for cuFFT computation (compared to large batches)
the intense GPU<->PCIe<->RAM transfers for the small buffers/batches is also a major drawback (even with DMA)
the actual schema (having small buffers/batches) would suffer same drawbacks for CUDA turbocode/ldpc routines too
on e.g. NVIDIA Jetson (GPU on same CPU+MEM bus) the intense buffer exchange is worse (up to freezing the bus).

Some possible future solutions that may enable more heterogenous computation schema:

To gain benefit the whole computation chain including data should stay on the very GPU instance (until "to-air" release).
PHY code/buffer re-organization would benefit coupling with more advanced (never seen before) mod/demod schemas like e.g. [1] [2] (might help elude current/classical "not-so-parallelizable" FEC computation).

[1] LDPC https://arxiv.org/pdf/2007.07644.pdf
[2] TurboAE https://github.com/yihanjiang/turboae

andrepuschmann reviewed Nov 9, 2020

View reviewed changes

lib/src/phy/dft/dft_fftw.c Outdated Show resolved Hide resolved

Add experimental cuFFT support.

6c24e66

cbalint13 requested a review from andrepuschmann November 11, 2020 20:01

ghost mentioned this pull request Dec 23, 2020

Make Test failed #537

Closed

cbalint13 mentioned this pull request May 3, 2021

Add downlink PAPR measurement to eNB. #659

Open

andrepuschmann force-pushed the master branch from 5662e73 to 1c6dd8c Compare May 14, 2021 18:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add experimental cuFFT support. #587

Add experimental cuFFT support. #587

cbalint13 commented Nov 8, 2020 •

edited

Loading

CLAassistant commented Nov 8, 2020 •

edited

Loading

andrepuschmann left a comment

xavierarteaga commented Nov 14, 2020

andrepuschmann commented Apr 23, 2021

cbalint13 commented May 3, 2021 •

edited

Loading

Add experimental cuFFT support. #587

Are you sure you want to change the base?

Add experimental cuFFT support. #587

Conversation

cbalint13 commented Nov 8, 2020 • edited Loading

CLAassistant commented Nov 8, 2020 • edited Loading

andrepuschmann left a comment

Choose a reason for hiding this comment

xavierarteaga commented Nov 14, 2020

andrepuschmann commented Apr 23, 2021

cbalint13 commented May 3, 2021 • edited Loading

cbalint13 commented Nov 8, 2020 •

edited

Loading

CLAassistant commented Nov 8, 2020 •

edited

Loading

cbalint13 commented May 3, 2021 •

edited

Loading