Describe the Bug
When executing small batched FFTs that complete quickly, performance is noticeably worse than using cuFFT directly due to calls to `cudaMemGetInfo`.
To Reproduce
Steps to reproduce the behavior:
1. Allocate input and output tensors of approximate size `{16, 512}`
2. Execute `(output = ifft(input)).run(stream);` (see the sketch below)
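A minimal repro sketch along these lines (illustrative, not my exact code; the tensor setup, element type, and loop count are assumptions):

```cpp
// Minimal repro sketch -- tensor setup and loop count are assumptions.
#include "matx.h"

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // 16 batched inverse FFTs of length 512 -- small enough that
  // per-call overhead dominates the actual FFT work.
  auto input  = matx::make_tensor<cuda::std::complex<float>>({16, 512});
  auto output = matx::make_tensor<cuda::std::complex<float>>({16, 512});

  for (int i = 0; i < 1000; i++) {
    (output = matx::ifft(input)).run(stream);  // each run triggers cudaMemGetInfo
  }

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return 0;
}
```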
Expected Behavior
I would expect the performance of this to more closely mirror that of cuFFT.
System Details (please complete the following information):
OS: Rocky 9
CUDA version: CUDA 12.3
g++ version: 11.4
Additional Context
Running an nsys profile, I can see that MatX is calling `cudaMemGetInfo` on every call to `.run`.
On my machine, `cudaMemGetInfo` takes about 10 µs. That isn't a long time, and the FFT/batch sizes I am working with here are admittedly small and therefore less efficient, but 10 µs on every `.run` call would still be an unfortunate amount of overhead even if the problem size were larger (up to a point).
This does not seem to occur when cuFFT is used directly. I also tried editing the MatX source to remove the `cudaMemGetInfo` call, and the overhead disappeared (performance matched cuFFT). This makes me wonder how cuFFT's approach differs from MatX's, since cuFFT also has to allocate workspaces behind the scenes when the default API (`cufftPlan1d`/`cufftExec*`) is used.
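For reference, the direct-cuFFT path I measured against looks roughly like this (a sketch using the standard plan/exec pattern, with error checking omitted, rather than my exact benchmark code):

```cpp
// Direct cuFFT equivalent (sketch; error checking omitted).
#include <cufft.h>
#include <cuda_runtime.h>

void batched_ifft(cufftComplex* d_in, cufftComplex* d_out,
                  cudaStream_t stream, int iterations) {
  cufftHandle plan;
  cufftPlan1d(&plan, 512, CUFFT_C2C, 16);  // length-512 FFTs, batch of 16;
                                           // workspace is allocated here, once
  cufftSetStream(plan, stream);

  // No per-call cudaMemGetInfo shows up on this path in nsys.
  for (int i = 0; i < iterations; i++) {
    cufftExecC2C(plan, d_in, d_out, CUFFT_INVERSE);
  }

  cufftDestroy(plan);
}
```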
Thanks @deanljohnson. We were discussing a similar issue today and will address it soon. The reason the call is there is that we try to limit how large a batch we launch in cuFFT so that it doesn't use excessive workspace memory. We will need to revisit this decision.
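Conceptually, the check looks something like the sketch below (an illustration of the policy described above, not the actual MatX source; the helper name and the half-of-free-memory threshold are made up):

```cpp
// Conceptual illustration only -- not the actual MatX source.
#include <cuda_runtime.h>
#include <cstddef>

// Cap how many FFTs are handed to cuFFT at once so its workspace stays
// within a fraction of currently free device memory.
size_t max_cufft_batch(size_t workspace_bytes_per_batch) {
  size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);  // the ~10 µs query at issue
  size_t budget = free_bytes / 2;             // assumed policy: half of free memory
  return budget / workspace_bytes_per_batch;
}
```

One way to revisit this might be to perform the query once at plan creation (or cache its result) rather than on every `.run`, keeping the batch cap without the per-launch cost.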