
Add flash_attn_varlen_func_with_kvcache. #685

Open · wants to merge 12 commits into base: main

Conversation


@garrett4wade commented Nov 22, 2023

This PR implements a new function, flash_attn_varlen_func_with_kvcache, which behaves similarly to flash_attn_with_kvcache but accepts variable-length q/k/v inputs.

Motivation for This Feature

This enables in-flight batching (#672): during generation, when one sequence finishes with the EOS token, we place a new prompt at that position, so PAD tokens do not occupy computation bandwidth. This technique produces variable-length inputs at each generation step (e.g., [1, 1, 3, 1] if the new prompt has length 3).
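For concreteness, here is a minimal sketch (not part of this PR) of how such per-step lengths map onto the cumulative-sequence-length format that the varlen interface expects; the tensor names are illustrative only:

```python
import torch

# Per-sequence query lengths for one generation step: three sequences each
# decode one token, and one slot was just refilled with a 3-token prompt.
seqlens_q = torch.tensor([1, 1, 3, 1], dtype=torch.int32, device="cuda")

# The varlen interface takes cumulative lengths of shape (batch_size + 1,).
cu_seqlens_q = torch.nn.functional.pad(
    torch.cumsum(seqlens_q, dim=0, dtype=torch.int32), (1, 0)
)
# cu_seqlens_q == [0, 1, 2, 5, 6]
max_seqlen_q = int(seqlens_q.max())  # 3
```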

Enabling in-flight batching in flash-attn benefits LLM RLHF, which requires high generation throughput (no padding in this case) while model parameters are updated at the same time (we can't use inference libraries like vLLM because we need to synchronize parameters after each training step, which is expensive).

Usage

q/k/v are packed tensors with the batch and sequence dimensions flattened into one. q should be passed together with cu_seqlens_q and max_seqlen_q, as in flash_attn_varlen_func. k and v are optional arguments and should be passed together with cu_seqlens_k; max_seqlen_k is determined by the kv cache. The kv cache is updated in place. If k and v are not passed in, attention is computed over the kv cache only.

k_cache and v_cache still have shape (batch_size_cache, seqlen_cache, n_heads, head_dim).
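A rough usage sketch based on the description above; the exact argument names and order are my assumptions, inferred from flash_attn_with_kvcache and flash_attn_varlen_func, so check the diff for the actual signature:

```python
import torch
from flash_attn import flash_attn_varlen_func_with_kvcache  # added by this PR

device, dtype = "cuda", torch.float16
batch, nheads, headdim, cache_len = 4, 8, 64, 512

# Packed queries for the step above: lengths [1, 1, 3, 1] -> 6 tokens in total.
cu_seqlens_q = torch.tensor([0, 1, 2, 5, 6], dtype=torch.int32, device=device)
max_seqlen_q = 3
q = torch.randn(6, nheads, headdim, device=device, dtype=dtype)

# Optional new keys/values, packed the same way; if omitted, attention runs
# over the existing cache contents only.
k_new, v_new = torch.randn_like(q), torch.randn_like(q)
cu_seqlens_k = cu_seqlens_q.clone()

# The KV cache keeps the padded layout (batch_size_cache, seqlen_cache, nheads,
# headdim) and is updated in place with k_new / v_new.
k_cache = torch.zeros(batch, cache_len, nheads, headdim, device=device, dtype=dtype)
v_cache = torch.zeros_like(k_cache)
cache_seqlens = torch.tensor([100, 37, 0, 250], dtype=torch.int32, device=device)

out = flash_attn_varlen_func_with_kvcache(
    q, k_cache, v_cache,
    cu_seqlens_q=cu_seqlens_q, max_seqlen_q=max_seqlen_q,
    k=k_new, v=v_new, cu_seqlens_k=cu_seqlens_k,
    cache_seqlens=cache_seqlens, causal=True,
)
# out is packed like q: (total_q_tokens, nheads, headdim)
```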

Major Changes:

  • Adding mha_varlen_fwd_kvcache in csrc/flash_attn/flash_api.cpp. This function is similar to mha_fwd_kvcache but sets the forward params appropriately to handle variable-length inputs. block_info.h is also changed accordingly.
  • Adding flash_attn_varlen_func_with_kvcache in flash_attn/flash_attn_interface.py.
  • Adding test_flash_attn_varlen_kvcache in tests/test_flash_attn.py.

Minor Changes:

  • Add a make clean option to remove the compilation cache. I found weird bugs when recompiling the code without running make clean first.
  • Minor refactor of mha_fwd_kvcache. Set the cache length via the k_cache_seqlens attribute of the forward params instead of using cu_seqlens_k and is_seqlens_k_cumulative. Add a new attribute, cu_seqlens_knew, to distinguish it from cu_seqlens_k. block_info.h and flash.h are changed accordingly.

Limitation

This new function sets num_splits=1 by default, which may hurt performance; I currently don't know how to fix this.


Edit 2023.11.23: Appended test results.

[screenshot: test results]

@shcho1118 (Contributor)

@garrett4wade
If this PR is merged, could it be used in place of vLLM's paged attention v2 kernel?
The problem I have right now is that I can't use the split-kv kernel in the varlen funcs during the decoding phase.

@garrett4wade (Author)

@shcho1118 Hi cho, I don't think this PR can help, because I forcibly set num_splits=1 for the varlen func with kv cache, although this can potentially be fixed in the future.

@sgrigory (Contributor) commented Jan 6, 2024

> @shcho1118 Hi cho, I don't think this PR can help, because I forcibly set num_splits=1 for the varlen func with kv cache, although this can potentially be fixed in the future.

@garrett4wade @shcho1118 FYI, I'm adding split-kv to mha_varlen_fwd for decoding in #754
