Extend FlashAttention Prefill with KV cache #318
base: sycl-develop
Conversation
Great job, Min Jean!
Refer to sglang/test/srt/test_triton_attention_kernels.py.
@sunjiweiswift Thanks for the good suggestion. This feature takes non-contiguous input for the cached KV, and the major part of the code is the same.
Yes, I agree with you @pengzhao-intel. We need to keep the example as simple as possible and leave the rest in sglang. If there is any performance regression there, we would be able to offer help.
Co-authored-by: Mehdi Goli <[email protected]>
Force-pushed from 770669a to 72edfc0
@min-jean-cho I added the prefetch for the cached version, and I also fixed some indices/strides.
@mehdi-goli, looks good. Thanks for the update!
Hi @mehdi-goli, any further comments on this? Thanks.
@mehdi-goli @muhammad-tanvir-1211 please help approve and merge this PR; we have further work based on this :)
LGTM! Thanks for the contribution!
applications/flash_attention_v2/collective/xe_flash_attn_mma.hpp
Co-authored-by: Tadej Ciglarič <[email protected]>
int offset_k_cache = num_heads_kv * head_size_qk * seq_len_kv_cache;
int offset_v_cache = num_heads_kv * head_size_vo * seq_len_kv_cache;
Do we consider the cached key-value pairs to be the same across all batches? My understanding is that each batch would have its own seq_len for the cached keys and values, which would mean that seq_len_kv_cache would also be of Variable Length type (same as seq_len_qo and seq_len_kv). This code could potentially give out-of-bounds access because it is missing a multiplication with l_coord (if we want to keep seq_len_kv_cache fixed length), or a multiplication with kv_cache_cumulative_length[l_coord] (if we want to change the type to Variable Length). See the sketch below.
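For clarity, here is a minimal standalone sketch (plain C++, not the PR's kernel code) of the two offset schemes described in this comment. The names l_coord and kv_cache_cumulative_length follow the comment above; the concrete sizes are made up for illustration.

```cpp
#include <cstdio>
#include <vector>

int main() {
  int num_heads_kv = 8, head_size_qk = 64, head_size_vo = 64;
  int l_coord = 2;  // batch index of the current work-group (hypothetical value)

  // (a) Fixed-length cache: every batch holds seq_len_kv_cache cached tokens,
  //     so the batch offset scales the per-batch cache size by l_coord.
  int seq_len_kv_cache = 128;
  int offset_k_cache = l_coord * num_heads_kv * head_size_qk * seq_len_kv_cache;
  int offset_v_cache = l_coord * num_heads_kv * head_size_vo * seq_len_kv_cache;
  std::printf("fixed-length:    k=%d v=%d\n", offset_k_cache, offset_v_cache);

  // (b) Variable-length cache: each batch has its own cached length, so the
  //     offset uses the cumulative cached length up to this batch instead.
  std::vector<int> kv_cache_cumulative_length = {0, 100, 228, 356};
  int offset_k_cache_var =
      num_heads_kv * head_size_qk * kv_cache_cumulative_length[l_coord];
  int offset_v_cache_var =
      num_heads_kv * head_size_vo * kv_cache_cumulative_length[l_coord];
  std::printf("variable-length: k=%d v=%d\n", offset_k_cache_var, offset_v_cache_var);
  return 0;
}
```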
Moved to #331. As discussed offline, we are separating the without-KV-cache and with-KV-cache cases into separate pipelines. I created a new PR rather than updating this one to keep the difference clear.
This PR extends the FlashAttention prefill kernel with cached KV in addition to the current KV (blue box in the figure below). Both causal and non-causal attention are supported.
