
Extend FlashAttention Prefill with KV cache #318


Open · min-jean-cho wants to merge 21 commits into sycl-develop from minjean/extend_attention_prefill

Conversation

@min-jean-cho commented Apr 19, 2025

Moved to #331.

This PR extends FlashAttention prefill with cached KV in addition to the current KV (blue box in the figure below). Both causal and non-causal attention are supported.
[Figure: extend prefill]
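
For orientation, below is a minimal single-head reference of the computation in plain C++ (a sketch under assumed names and layouts, not the CUTLASS SYCL kernel): attention runs over the concatenation of cached and current KV, and the causal mask is shifted by the cache length so cached tokens are always visible.

#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative reference only: one head, row-major [seq, head_size] buffers,
// and len_q == len_cur (prefill queries are the current tokens). Cached KV
// occupies absolute positions [0, len_cache); current KV occupies
// [len_cache, len_cache + len_cur). Under the causal mask, query i (absolute
// position len_cache + i) may attend to all cached keys and to current keys
// at or before its own position.
void prefill_with_kv_cache(const std::vector<float>& q,       // [len_q, d]
                           const std::vector<float>& k_cache, // [len_cache, d]
                           const std::vector<float>& v_cache, // [len_cache, d]
                           const std::vector<float>& k_cur,   // [len_cur, d]
                           const std::vector<float>& v_cur,   // [len_cur, d]
                           std::vector<float>& out,           // [len_q, d]
                           int len_q, int len_cache, int len_cur, int d,
                           bool causal) {
  const float scale = 1.0f / std::sqrt(static_cast<float>(d));
  const int len_kv = len_cache + len_cur;
  std::vector<float> s(len_kv);
  for (int i = 0; i < len_q; ++i) {
    // Cached keys are never masked; the causal limit is shifted by len_cache.
    const int limit = causal ? len_cache + i + 1 : len_kv;
    float max_s = -INFINITY;
    for (int j = 0; j < limit; ++j) {
      const float* kj = (j < len_cache) ? &k_cache[j * d]
                                        : &k_cur[(j - len_cache) * d];
      float dot = 0.0f;
      for (int t = 0; t < d; ++t) dot += q[i * d + t] * kj[t];
      s[j] = dot * scale;
      max_s = std::max(max_s, s[j]);
    }
    // Numerically stable softmax over the unmasked scores.
    float denom = 0.0f;
    for (int j = 0; j < limit; ++j) {
      s[j] = std::exp(s[j] - max_s);
      denom += s[j];
    }
    for (int t = 0; t < d; ++t) {
      float acc = 0.0f;
      for (int j = 0; j < limit; ++j) {
        const float* vj = (j < len_cache) ? &v_cache[j * d]
                                          : &v_cur[(j - len_cache) * d];
        acc += s[j] * vj[t];
      }
      out[i * d + t] = acc / denom;
    }
  }
}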

@pengzhao-intel

Great job, Min Jean!

@sunjiweiswift

Referring to sglang/test/srt/test_triton_attention_kernels.py: qo_indptr, kv_indptr, kv_indices, and other parameters are also needed.
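
For context, a minimal sketch of the CSR-style indexing those parameters describe (the kv_indptr/kv_indices names follow the sglang test file; the pool layout here is an assumption for illustration):

#include <cstdint>

// Sketch of CSR-style ragged KV addressing (layout assumed): kv_indptr has
// batch_size + 1 prefix-sum entries, and entries kv_indptr[b] .. kv_indptr[b+1]-1
// of kv_indices list the token slots of batch b inside a shared KV pool.
float* kv_row(float* kv_pool,            // [num_slots, head_size], row-major
              const int32_t* kv_indptr,  // [batch_size + 1]
              const int32_t* kv_indices, // [kv_indptr[batch_size]]
              int batch_id, int token, int head_size) {
  int32_t slot = kv_indices[kv_indptr[batch_id] + token];
  return kv_pool + static_cast<int64_t>(slot) * head_size;
}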

@pengzhao-intel

> Referring to sglang/test/srt/test_triton_attention_kernels.py: qo_indptr, kv_indptr, kv_indices, and other parameters are also needed.

@sunjiweiswift Thanks for the good suggestion.

That feature means non-contiguous input for the cached KV, and the major part of the code stays the same. I suggest reviewing and merging this PR first; we can add it in a follow-up PR (or leave it only in sglang, since this is the CUTLASS example and we want to keep it as simple as possible).

@mehdi-goli
Collaborator

> Referring to sglang/test/srt/test_triton_attention_kernels.py: qo_indptr, kv_indptr, kv_indices, and other parameters are also needed.

> @sunjiweiswift Thanks for the good suggestion.
>
> That feature means non-contiguous input for the cached KV, and the major part of the code stays the same. I suggest reviewing and merging this PR first; we can add it in a follow-up PR (or leave it only in sglang, since this is the CUTLASS example and we want to keep it as simple as possible).

Yes, agreed, @pengzhao-intel. We need to keep the example as simple as possible and leave the rest in sglang. If there is any performance regression there, we would be able to offer help.

min-jean-cho force-pushed the minjean/extend_attention_prefill branch from 770669a to 72edfc0 on April 22, 2025 at 08:06
@mehdi-goli
Collaborator

@min-jean-cho I added the prefetch for the cached version, and I also fixed some indices/strides.

@min-jean-cho
Author

@mehdi-goli, looks good. Thanks for the update!

@min-jean-cho
Author

Hi @mehdi-goli, any further comments on this? Thanks.

@pengzhao-intel

@mehdi-goli @muhammad-tanvir-1211 please help approve and merge this PR; we have further work building on it :)

@pengzhao-intel left a comment

LGTM! Thanks for the contribution!

Comment on lines +337 to +338
int offset_k_cache = num_heads_kv * head_size_qk * seq_len_kv_cache;
int offset_v_cache = num_heads_kv * head_size_vo * seq_len_kv_cache;
Collaborator

Do we consider the cached key-value pairs to have the same length across all batches? My understanding is that each batch would have its own seq_len for the cached keys and values, which would mean that seq_len_kv_cache should also be of variable-length type (same as seq_len_qo and seq_len_kv). This code could give out-of-bounds accesses because it is missing a multiplication with l_coord (if we want to keep seq_len_kv_cache fixed-length), or with kv_cache_cumulative_length[l_coord] (if we change the type to variable-length).
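
To make the two fixes concrete, here is a hedged sketch of the offset computation for batch l_coord (variable names follow the quoted snippet and the comment above; the layout and the kv_cache_cumulative_length prefix-sum array are assumptions):

#include <vector>

// Sketch only. Returns the element offset of batch l_coord's cached K for
// the two layouts discussed above; V is analogous with head_size_vo.
int k_cache_offset_fixed(int l_coord, int num_heads_kv, int head_size_qk,
                         int seq_len_kv_cache) {
  // Fixed length: every batch holds exactly seq_len_kv_cache cached tokens,
  // so batch l_coord starts l_coord full caches into the buffer.
  return l_coord * num_heads_kv * head_size_qk * seq_len_kv_cache;
}

int k_cache_offset_variable(int l_coord, int num_heads_kv, int head_size_qk,
                            const std::vector<int>& kv_cache_cumulative_length) {
  // Variable length: kv_cache_cumulative_length[b] holds the total cached
  // tokens of batches 0 .. b-1, so it gives the starting token directly.
  return num_heads_kv * head_size_qk * kv_cache_cumulative_length[l_coord];
}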

@min-jean-cho
Copy link
Author

Moved to #331. As discussed offline, the without-KV-cache and with-KV-cache paths are being separated into distinct pipelines. I created a new PR rather than updating this one to keep the difference clear.
