[BUG] Poor pwelch performance #887
Describe the Bug
The pwelch reduction stage performs about 10x worse than a similar hand-rolled CUDA implementation. The relevant parameters in this case are listed under To Reproduce below.
From looking at the pwelch_impl source, it looks like it is effectively 3 operations:

1. An FFT of the windowed, overlapping segments, producing X_with_overlaps
2. X_with_overlaps = conj(X_with_overlaps) * X_with_overlaps
3. Pxx = sum(mag_sq_X_with_overlaps, {0}) * norm_factor
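For reference, here is a minimal sketch of how steps 2 and 3 could be expressed as a single deferred MatX expression; the tensor names, nfft, norm_factor, and stream variables are assumptions, and this is not the code from pwelch_impl itself:

```cpp
// Sketch only: fuse the magnitude-squared (step 2) and the scaled
// reduction (step 3) into one lazy MatX expression, so no intermediate
// tensor for |X|^2 needs to be materialized.
// Assumes X_with_overlaps is a [num_segments, nfft] complex<float>
// tensor that already holds the per-segment FFTs.
auto Pxx = matx::make_tensor<float>({nfft});
(Pxx = matx::real(matx::sum(matx::conj(X_with_overlaps) * X_with_overlaps, {0})) *
       norm_factor)
    .run(stream);
```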
On my system using matx, step 2 takes about 40us and step 3 takes about 380us.
A custom CUDA kernel covering steps 2 & 3 takes about 30us.
To Reproduce
1. Create a complex<float> input tensor of size 164750
2. Call matx::pwelch(input, window, 500, 250, 65536)
Expected Behavior
Performance comparable to the hand-rolled CUDA kernel: roughly 30us for the reduction stages instead of the ~420us observed with matx.
Code Snippets
The custom CUDA kernel in this case fuses the magnitude-squared computation with the reduction and normalization in a single pass.
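Below is a minimal sketch of what such a fused kernel might look like, assuming one thread per output frequency bin and a row-major [num_segments, nfft] input layout; this is a reconstruction under those assumptions, not the reporter's exact code:

```cpp
#include <cuda/std/complex>

// Sketch of a fused |X|^2 + column-sum + normalization kernel.
// X:   [num_segments, nfft] row-major, per-segment FFT results
// Pxx: [nfft] output power spectral density
// One thread per frequency bin; consecutive threads read consecutive
// addresses within each segment row, so the loads are coalesced.
__global__ void pwelch_reduce(const cuda::std::complex<float>* __restrict__ X,
                              float* __restrict__ Pxx,
                              int num_segments, int nfft, float norm_factor) {
  int bin = blockIdx.x * blockDim.x + threadIdx.x;
  if (bin >= nfft) {
    return;
  }

  float acc = 0.0f;
  for (int seg = 0; seg < num_segments; seg++) {
    cuda::std::complex<float> v = X[(size_t)seg * nfft + bin];
    // |v|^2 = conj(v) * v, accumulated without storing an intermediate tensor
    acc += v.real() * v.real() + v.imag() * v.imag();
  }

  Pxx[bin] = acc * norm_factor;
}

// Example launch for nfft = 65536:
//   pwelch_reduce<<<(65536 + 255) / 256, 256>>>(X, Pxx, num_segments, 65536, norm);
```

With nfft = 65536 there are 65536 independent output bins, so a simple one-thread-per-bin mapping already exposes plenty of parallelism without any inter-thread reduction.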
This kernel does not cover the FFT portion of pwelch; I am not reporting any issue with the FFT portion, only with the reduction stages.

System Details (please complete the following information):
Additional Context
Magnitude-squared calculation, from the nsys profile:

Reduction + normalization (using matx), from the nsys profile:

Reduction + normalization (using the custom kernel), from the nsys profile:
Comments

Hi @deanljohnson, just to ping the issue: we haven't forgotten about these. There's a bit of prep we are doing for GTC and we haven't had time to look at them yet. We should be able to take a look soon.

Please check out PR #897.

#897 looks good to me. I checked out the branch, and the performance for the reduction stage was virtually identical to our hand-rolled implementation. I also appreciate the addition of …

Closing. Thanks @tmartin-gh!