atomic_add slows down attention backwards due to layout conversions #4717
Comments
Had a fix for this issue and got similar perf metrics on my machine; maybe you can try whether it works for you on your machine: lijinpei@1e344f7
Storing data with the mma layout is still buggy in Triton and is something I'll be working on after all pending linear layout PRs have been merged. It's not practical because of the problem you mentioned. One reason that …
I'll take a look to see if we can get memory coalescing to happen and/or use vectorized loads.
#4971 for generating vectorized atomic_add instructions.
@bertmaher these are the numbers I get after adding vectorized atomics. I'm guessing this means that the vectorized atomics support is sufficient - let me know if you think it's worth investigating further though!
@davidberard98 great!
@Chillee noticed that using `atomic_add` in the backward of attention notably slows down the kernel; in fact, it's slower than "manually" doing `atomic_add` using inline assembly. The root cause seems to be that the layout conversion from the `#mma` layout to `#blocked` adds a lot of overhead. Interestingly, using `tl.store` (which is incorrect) also does the layout conversion but is nevertheless faster than the `atomic_add` version (possibly due to async copying from smem to gmem).

Repro at https://gist.github.com/bertmaher/e33b874f75cb82451060b88ee20b8203.
Results on my H100:
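For context, here is a minimal sketch (not the gist's repro; the kernel name, signature, and tile sizes are hypothetical) of the accumulation pattern being discussed: the dq update in the attention backward comes out of `tl.dot` in an `#mma` layout and is then written back either with `tl.atomic_add` (correct, but it pays for the `#mma` to `#blocked` conversion) or with `tl.store` (incorrect when several key blocks touch the same rows, but faster in the measurements above).

```python
import triton
import triton.language as tl


@triton.jit
def _dq_accum_sketch(ds_ptr, k_ptr, dq_ptr,
                     stride_sm, stride_sn,
                     stride_kn, stride_kd,
                     stride_qm, stride_qd,
                     BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                     BLOCK_D: tl.constexpr, USE_ATOMICS: tl.constexpr):
    # dq[m, :] += dS[m, n] @ K[n, :], accumulated over key blocks.
    # One program per (query block, key block) pair; several programs write
    # the same dq rows, which is why the correct version needs atomics.
    # Shapes are assumed to be exact multiples of the block sizes.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_d = tl.arange(0, BLOCK_D)

    ds = tl.load(ds_ptr + offs_m[:, None] * stride_sm + offs_n[None, :] * stride_sn)
    k = tl.load(k_ptr + offs_n[:, None] * stride_kn + offs_d[None, :] * stride_kd)

    # The result of tl.dot lives in an #mma layout.
    dq = tl.dot(ds, k)

    dq_ptrs = dq_ptr + offs_m[:, None] * stride_qm + offs_d[None, :] * stride_qd
    if USE_ATOMICS:
        # Correct, but the scattered atomics force the #mma -> #blocked
        # layout conversion described in this issue.
        tl.atomic_add(dq_ptrs, dq)
    else:
        # Incorrect (writes from different key blocks clobber each other),
        # yet faster; this is the tl.store variant being compared against.
        tl.store(dq_ptrs, dq)
```

A launch would use a two-dimensional grid such as `(M // BLOCK_M, N // BLOCK_N)`, with `dq` allocated as zero-initialized float32 so the atomic accumulation is well defined.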