Describe the Bug
When executing small batched FFTs that complete quickly, performance is noticeably worse than using cuFFT directly due to calls to `cudaMemGetInfo`.
To Reproduce
Steps to reproduce the behavior:
1. Allocate input and output tensors of approximate size `{16, 512}`
2. Execute `(output = ifft(input)).run(stream);` (see the sketch below)
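A minimal repro sketch along these lines (illustrative, not my exact code; the tensor setup, element type, and loop count are assumptions):

```cpp
// Minimal repro sketch -- tensor setup and loop count are assumptions.
#include "matx.h"

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // 16 batched inverse FFTs of length 512 -- small enough that
  // per-call overhead dominates the actual FFT work.
  auto input  = matx::make_tensor<cuda::std::complex<float>>({16, 512});
  auto output = matx::make_tensor<cuda::std::complex<float>>({16, 512});

  for (int i = 0; i < 1000; i++) {
    (output = matx::ifft(input)).run(stream);  // each run triggers cudaMemGetInfo
  }

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return 0;
}
```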
Expected Behavior
I would expect the performance of this to more closely mirror that of cuFFT.
System Details (please complete the following information):
OS: Rocky 9
CUDA version: CUDA 12.3
g++ version: 11.4
Additional Context
Running an nsys profile, I can see that MatX is calling `cudaMemGetInfo` on every call to `.run`.
On my machine, `cudaMemGetInfo` takes about 10 µs. That isn't a long time, and the FFT/batch sizes I am working with here are admittedly small and therefore less efficient, but 10 µs on every `.run` call would still be an unfortunate amount of overhead even if the problem size were larger (up to a point).
This does not seem to occur when cuFFT is used directly. I also tried editing the MatX source to remove the `cudaMemGetInfo` call, and the overhead disappeared (performance matched cuFFT). This makes me wonder how cuFFT's approach differs from MatX's, since cuFFT also has to allocate workspaces behind the scenes when the default API (`cufftPlan1d`/`cufftExec*`) is used.
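For reference, the direct-cuFFT path I measured against looks roughly like this (a sketch using the standard plan/exec pattern, with error checking omitted, rather than my exact benchmark code):

```cpp
// Direct cuFFT equivalent (sketch; error checking omitted).
#include <cufft.h>
#include <cuda_runtime.h>

void batched_ifft(cufftComplex* d_in, cufftComplex* d_out,
                  cudaStream_t stream, int iterations) {
  cufftHandle plan;
  cufftPlan1d(&plan, 512, CUFFT_C2C, 16);  // length-512 FFTs, batch of 16;
                                           // workspace is allocated here, once
  cufftSetStream(plan, stream);

  // No per-call cudaMemGetInfo shows up on this path in nsys.
  for (int i = 0; i < iterations; i++) {
    cufftExecC2C(plan, d_in, d_out, CUFFT_INVERSE);
  }

  cufftDestroy(plan);
}
```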
Thanks @deanljohnson. We were discussing a similar issue today and will address it soon. The reason the call is there is that we try to limit how large a batch we launch in cuFFT so that it doesn't use excessive workspace memory. We will need to revisit this decision.
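Conceptually, the check looks something like the sketch below (an illustration of the policy described above, not the actual MatX source; the helper name and the half-of-free-memory threshold are made up):

```cpp
// Conceptual illustration only -- not the actual MatX source.
#include <cuda_runtime.h>
#include <cstddef>

// Cap how many FFTs are handed to cuFFT at once so its workspace stays
// within a fraction of currently free device memory.
size_t max_cufft_batch(size_t workspace_bytes_per_batch) {
  size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);  // the ~10 µs query at issue
  size_t budget = free_bytes / 2;             // assumed policy: half of free memory
  return budget / workspace_bytes_per_batch;
}
```

One way to revisit this might be to perform the query once at plan creation (or cache its result) rather than on every `.run`, keeping the batch cap without the per-launch cost.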