
[BUG] Small FFTs have large overhead due to cudaMemGetInfo #863

Closed
deanljohnson opened this issue Feb 5, 2025 · 3 comments · Fixed by #864
deanljohnson commented Feb 5, 2025

Describe the Bug

When executing small batched FFTs that complete quickly, performance is noticeably worse than using cuFFT directly due to calls to cudaMemGetInfo.

To Reproduce
Steps to reproduce the behavior:

  1. Allocate input and output tensors of approximate size {16, 512}
  2. (output = ifft(input)).run(stream); (see the sketch below)
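A minimal sketch of the reproducer, assuming MatX's make_tensor/ifft API and cuda::std::complex<float> elements (the header name and element type are assumptions; only the {16, 512} shape and the .run(stream) call come from the steps above):

```cpp
#include <matx.h>

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Batched 1D FFT: 16 batches of length-512 signals
  auto input  = matx::make_tensor<cuda::std::complex<float>>({16, 512});
  auto output = matx::make_tensor<cuda::std::complex<float>>({16, 512});

  // Each .run() call shows a cudaMemGetInfo call in the nsys trace
  (output = matx::ifft(input)).run(stream);

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return 0;
}
```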

Expected Behavior
I would expect the performance of this to more closely mirror cuFFT.

System Details (please complete the following information):

  • OS: Rocky 9
  • CUDA version: CUDA 12.3
  • g++ version: 11.4

Additional Context

Running an nsys profile, I can see that matx is calling cudaMemGetInfo on every call to .run:

[Screenshot: nsys timeline showing a cudaMemGetInfo call on every .run() invocation]

On my machine, cudaMemGetInfo takes about 10 µs. That is not long in absolute terms, and the FFT/batch sizes I am working with here are admittedly small and therefore less efficient, but 10 µs on every .run call is still an unfortunate amount of overhead even for larger problem sizes (up to a point).

This does not seem to occur when cuFFT is used directly. I also tried editing the MatX source to remove the cudaMemGetInfo call, and the overhead disappeared (performance matched cuFFT). This makes me wonder how cuFFT's approach differs from MatX's, since cuFFT also has to allocate workspaces behind the scenes when the default API (cufftPlan1d/cufftExec*) is used.
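For reference, a rough sketch of the direct-cuFFT path being compared against (not necessarily the reporter's exact benchmark; error checking omitted). The plan, and its workspace, are created once up front, so no per-execution memory query is needed:

```cpp
#include <cufft.h>

void run_ifft(cufftComplex* d_in, cufftComplex* d_out, cudaStream_t stream) {
  // Create the batched plan once; cuFFT allocates its workspace here
  static cufftHandle plan = [] {
    cufftHandle p;
    cufftPlan1d(&p, 512, CUFFT_C2C, 16);  // length 512, batch of 16
    return p;
  }();

  cufftSetStream(plan, stream);
  cufftExecC2C(plan, d_in, d_out, CUFFT_INVERSE);
}
```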

cliffburdick (Collaborator) commented:

Thanks @deanljohnson. We were discussing a similar issue today and will address this soon. The call is there because we try to limit how large of a batch we launch in cuFFT so that it doesn't use excessive workspace memory. We will need to revisit this decision.
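Purely to illustrate the kind of check described above (this is not MatX's actual code; the budget fraction and per-batch estimate are assumptions), capping the batch count against currently-free device memory might look like:

```cpp
#include <cuda_runtime.h>
#include <cufft.h>

size_t max_batches(int fft_len, size_t requested_batches) {
  // Query free device memory -- this is the ~10 us cudaMemGetInfo call
  // observed on every .run()
  size_t free_mem = 0, total_mem = 0;
  cudaMemGetInfo(&free_mem, &total_mem);

  // Rough workspace estimate for a single transform of this length
  size_t ws_per_batch = 0;
  cufftEstimate1d(fft_len, CUFFT_C2C, 1, &ws_per_batch);

  // Arbitrary safety margin for illustration: use at most half of free memory
  size_t budget = free_mem / 2;
  size_t fit = ws_per_batch ? budget / ws_per_batch : requested_batches;
  if (fit == 0) fit = 1;
  return fit < requested_batches ? fit : requested_batches;
}
```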

cliffburdick self-assigned this Feb 5, 2025

cliffburdick (Collaborator) commented:

Fixed in #864

deanljohnson (Author) commented:

Excellent! Thanks for the quick turnaround on this change!
