[BUG] Poor pwelch performance #887
Describe the Bug
The pwelch reduction stage performs about 10x worse than a similar hand-rolled CUDA implementation. The relevant parameters in this case are listed under To Reproduce below.
From looking at the pwelch_impl source, it looks like it is effectively 3 operations:

1. An FFT of the windowed, overlapping segments, producing X_with_overlaps
2. X_with_overlaps = conj(X_with_overlaps) * X_with_overlaps
3. Pxx = sum(mag_sq_X_with_overlaps, {0}) * norm_factor
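For reference, here is a minimal sketch of how steps 2 and 3 could be expressed as a single deferred MatX expression; the tensor names, nfft, norm_factor, and stream variables are assumptions, and this is not the code from pwelch_impl itself:

```cpp
// Sketch only: fuse the magnitude-squared (step 2) and the scaled
// reduction (step 3) into one lazy MatX expression, so no intermediate
// tensor for |X|^2 needs to be materialized.
// Assumes X_with_overlaps is a [num_segments, nfft] complex<float>
// tensor that already holds the per-segment FFTs.
auto Pxx = matx::make_tensor<float>({nfft});
(Pxx = matx::real(matx::sum(matx::conj(X_with_overlaps) * X_with_overlaps, {0})) *
       norm_factor)
    .run(stream);
```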
On my system using matx, step 2 takes about 40us and step 3 takes about 380us.
A custom CUDA kernel covering steps 2 & 3 takes about 30us.
To Reproduce
1. Create a complex<float> input tensor of size 164750
2. Call matx::pwelch(input, window, 500, 250, 65536)
Expected Behavior
Performance comparable to the hand-rolled CUDA kernel: roughly 30us for the reduction stages instead of the ~420us observed with matx.
Code Snippets
The custom CUDA kernel in this case fuses the magnitude-squared computation with the reduction and normalization in a single pass.
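Below is a minimal sketch of what such a fused kernel might look like, assuming one thread per output frequency bin and a row-major [num_segments, nfft] input layout; this is a reconstruction under those assumptions, not the reporter's exact code:

```cpp
#include <cuda/std/complex>

// Sketch of a fused |X|^2 + column-sum + normalization kernel.
// X:   [num_segments, nfft] row-major, per-segment FFT results
// Pxx: [nfft] output power spectral density
// One thread per frequency bin; consecutive threads read consecutive
// addresses within each segment row, so the loads are coalesced.
__global__ void pwelch_reduce(const cuda::std::complex<float>* __restrict__ X,
                              float* __restrict__ Pxx,
                              int num_segments, int nfft, float norm_factor) {
  int bin = blockIdx.x * blockDim.x + threadIdx.x;
  if (bin >= nfft) {
    return;
  }

  float acc = 0.0f;
  for (int seg = 0; seg < num_segments; seg++) {
    cuda::std::complex<float> v = X[(size_t)seg * nfft + bin];
    // |v|^2 = conj(v) * v, accumulated without storing an intermediate tensor
    acc += v.real() * v.real() + v.imag() * v.imag();
  }

  Pxx[bin] = acc * norm_factor;
}

// Example launch for nfft = 65536:
//   pwelch_reduce<<<(65536 + 255) / 256, 256>>>(X, Pxx, num_segments, 65536, norm);
```

With nfft = 65536 there are 65536 independent output bins, so a simple one-thread-per-bin mapping already exposes plenty of parallelism without any inter-thread reduction.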
This kernel does not cover the FFT portion of pwelch; I am not reporting any issue with the FFT portion, only with the reduction stages.

System Details (please complete the following information):
Additional Context
Magnitude-squared calculation, from the nsys profile:

Reduction + normalization (using matx), from the nsys profile:

Reduction + normalization (using the custom kernel), from the nsys profile:
Comments

Hi @deanljohnson, just to ping the issue: we haven't forgotten about these. There's a bit of prep we are doing for GTC and we haven't had time to look at them yet. We should be able to take a look soon.

Please check out PR #897.

#897 looks good to me. I checked out the branch, and the performance for the reduction stage was virtually identical to our hand-rolled implementation. I also appreciate the addition of …

Closing. Thanks @tmartin-gh!