I'm trying to apply FlashAttention to an algorithm that involves an additive attention bias. From what I understand, an additive bias (e.g., ALiBi, Attention with Linear Biases) is only supported in the Triton version, and even there, gradients are not computed for the bias. In my use case the bias is learnable, so I need gradients to flow to it during backpropagation. I have a few questions:
1. Does FlashAttention support an additive bias?
2. If not, what challenges in CUDA programming prevent implementing it?
3. Why is the gradient not computed for the bias in the Triton version?
I'm not very experienced with CUDA, so this might not make perfect sense, but would it be possible to simply tile the bias the same way Q, K, and V are tiled, load each tile from HBM into SRAM, and fuse the addition, softmax, and matrix multiplication (i.e., softmax(QK^T + B)V) into a single kernel? A minimal (non-fused) reference of what I mean is sketched below.
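For concreteness, here is a small non-fused PyTorch reference of the behavior I need (the shapes and the helper name `attention_with_bias` are just illustrative, not part of this repo): the bias is an ordinary learnable tensor, and autograd gives it a gradient. I'd want a fused kernel to produce the equivalent dbias in its backward pass.

```python
import torch

def attention_with_bias(q, k, v, bias):
    """Reference (non-fused) attention with an additive bias.

    q, k, v: (batch, heads, seq_len, head_dim)
    bias:    (batch, heads, seq_len, seq_len), learnable
    Returns softmax(q @ k^T / sqrt(d) + bias) @ v.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale + bias
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)

# Toy shapes; in a fused kernel the bias tile for the current
# (query block, key block) pair would be loaded from HBM into SRAM
# together with the corresponding K/V tile.
b, h, n, d = 2, 4, 128, 64
q = torch.randn(b, h, n, d, requires_grad=True)
k = torch.randn(b, h, n, d, requires_grad=True)
v = torch.randn(b, h, n, d, requires_grad=True)
bias = torch.zeros(b, h, n, n, requires_grad=True)  # learnable additive bias

out = attention_with_bias(q, k, v, bias)
out.sum().backward()
print(bias.grad.shape)  # torch.Size([2, 4, 128, 128]) -> the gradient does reach the bias
```

In the fused backward, I imagine dbias would just be the gradient of the pre-softmax scores accumulated per tile, but I don't know how costly the extra write of a (batch, heads, seqlen, seqlen) tensor back to HBM would be in practice.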
I believe this has been raised in previous discussions, but it doesn't seem fully resolved. I'd appreciate any guidance.
Thanks!
Hello @tridao!