I'm trying to apply FlashAttention to an algorithm that involves an additive attention bias. From what I understand, an additive bias (e.g., ALiBi, Attention with Linear Biases) is only supported in the Triton version, and even there, gradients are not computed for the bias. In my use case the bias is learnable, so I need gradients to flow to it during backpropagation. I have a few questions:
1. Does FlashAttention support an additive bias?
2. If not, what challenges in CUDA programming prevent implementing it?
3. Why is the gradient not computed for the bias in the Triton version?
I'm not very experienced with CUDA, so this might not make perfect sense, but would it be possible to simply tile the bias the same way Q, K, and V are tiled, load each tile from HBM into SRAM, and fuse the addition, softmax, and matrix multiplication (i.e., softmax(QK^T + B)V) into a single kernel? A minimal (non-fused) reference of what I mean is sketched below.
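For concreteness, here is a small non-fused PyTorch reference of the behavior I need (the shapes and the helper name `attention_with_bias` are just illustrative, not part of this repo): the bias is an ordinary learnable tensor, and autograd gives it a gradient. I'd want a fused kernel to produce the equivalent dbias in its backward pass.

```python
import torch

def attention_with_bias(q, k, v, bias):
    """Reference (non-fused) attention with an additive bias.

    q, k, v: (batch, heads, seq_len, head_dim)
    bias:    (batch, heads, seq_len, seq_len), learnable
    Returns softmax(q @ k^T / sqrt(d) + bias) @ v.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale + bias
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)

# Toy shapes; in a fused kernel the bias tile for the current
# (query block, key block) pair would be loaded from HBM into SRAM
# together with the corresponding K/V tile.
b, h, n, d = 2, 4, 128, 64
q = torch.randn(b, h, n, d, requires_grad=True)
k = torch.randn(b, h, n, d, requires_grad=True)
v = torch.randn(b, h, n, d, requires_grad=True)
bias = torch.zeros(b, h, n, n, requires_grad=True)  # learnable additive bias

out = attention_with_bias(q, k, v, bias)
out.sum().backward()
print(bias.grad.shape)  # torch.Size([2, 4, 128, 128]) -> the gradient does reach the bias
```

In the fused backward, I imagine dbias would just be the gradient of the pre-softmax scores accumulated per tile, but I don't know how costly the extra write of a (batch, heads, seqlen, seqlen) tensor back to HBM would be in practice.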
I believe this has been raised in previous discussions, but it doesn't seem fully resolved. I'd appreciate any guidance.
Thanks!
Hello @tridao!