Adding new ROCm Triton flash attention kernel #4
Conversation
Co-authored-by: Vinayak Gokhale <[email protected]>
encoded_softmax = None

M = torch.empty((batch, nheads_q, metadata.max_seq_len), device=q.device, dtype=torch.float32)
I think you can remove this, and the associated writeback in the kernel, for a minor speedup. M is only used during the backward pass; since we don't have a separate inference-only forward kernel, it gets allocated and written even though we don't need it for inference.
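A minimal host-side sketch of that suggestion, assuming a hypothetical `need_backward` flag (the actual gating in the PR and kernel may differ): allocate M only when a backward pass will consume it, and return None otherwise so the kernel writeback can be skipped too.

```python
import torch

def maybe_alloc_softmax_lse(q: torch.Tensor, batch: int, nheads_q: int,
                            max_seq_len: int, need_backward: bool):
    """Allocate the per-row softmax statistics buffer (M) only when the
    backward pass needs it; for inference-only forward, return None so the
    kernel can skip its writeback as well."""
    if not need_backward:
        return None
    return torch.empty((batch, nheads_q, max_seq_len),
                       device=q.device, dtype=torch.float32)
```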
import torch

from vllm.model_executor.input_metadata import InputMetadata
from vllm.model_executor.layers.attention.ops.paged_attn import (
    PagedAttentionImpl)
from vllm.model_executor.layers.attention.ops.flash_attention_triton import attention
I have some questions: can flash_attn_func and the Triton FA kernel co-exist, or should we allow users to install both at the same time in the Dockerfile?
Also, what are the steps to validate this PR?
> can flash_attn_func and the Triton FA kernel co-exist, or should we allow users to install both at the same time in the Dockerfile?

Some background:
flash_attn_func is Tri Dao's "reference" implementation of Flash Attention. It calls the CUTLASS version in the backend. If is_hip(), flash_attn_func calls the CK version in the backend instead.
The CUTLASS version, I believe, is a pip package, so one can just pip install flash-attention. I think for CK we need to install from our fork, but I'm a little less familiar with that.
Triton is completely separate and does not intersect with either the front-end flash_attn_func or the CUTLASS / CK backends. It requires a Triton install, plus the kernel, which is available in this PR.
So, yes, both can co-exist, but only one is needed.
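As a concrete illustration of "both can co-exist but only one is needed", here is a small, hedged probe that reports which front ends are importable in a given environment; the module paths mirror the ones discussed above, and nothing here is required by the PR itself.

```python
def available_flash_attn_backends():
    """Report which flash attention front ends can be imported; both may be
    installed side by side, but only one is used at runtime."""
    backends = []
    try:
        # Tri Dao's reference front end (CUTLASS on CUDA, CK on ROCm).
        from flash_attn import flash_attn_func  # noqa: F401
        backends.append("flash_attn_func")
    except ImportError:
        pass
    try:
        # The Triton kernel added by this PR.
        from vllm.model_executor.layers.attention.ops.flash_attention_triton import (  # noqa: F401
            attention)
        backends.append("triton")
    except ImportError:
        pass
    return backends

if __name__ == "__main__":
    print(available_flash_attn_backends())
```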
window_size=self.sliding_window,
alibi_slopes=self.alibi_slopes,
)
if is_hip():
Add a flag to skip this path.
Force-pushed c582158 to 0e63661

Added a flag for controlling the Triton vs. default flow. More small changes to the Dockerfile.
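A rough sketch of what that flag-controlled flow could look like in the attention forward path; the environment variable name, the forward_attention helper, and the call signatures are illustrative assumptions, not the exact code in this commit.

```python
import os

# Assumed flag name; the PR's actual flag may be spelled or wired differently.
USE_TRITON_FLASH_ATTN = os.environ.get("VLLM_USE_FLASH_ATTN_TRITON", "0") == "1"

def forward_attention(query, key, value, flash_attn_func, triton_attention,
                      sliding_window=None, alibi_slopes=None, on_rocm=True):
    """Route the prefill attention call: use the Triton kernel when running on
    ROCm with the flag set, otherwise fall back to the default flash_attn_func
    flow shown in the diff above (sliding window / ALiBi arguments included)."""
    if on_rocm and USE_TRITON_FLASH_ATTN:
        # Triton path: the kernel from this PR, passed in here as a callable.
        return triton_attention(query, key, value)
    # Default path: Tri Dao's front end (CUTLASS on CUDA, CK on ROCm).
    return flash_attn_func(query, key, value,
                           window_size=sliding_window,
                           alibi_slopes=alibi_slopes)
```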
mkdir -p libs \
    && cd libs \
    && pip uninstall -y triton \
    && git clone https://github.com/ROCmSoftwarePlatform/triton.git \
Do we need to use a specific branch for this PR?
As far as I know, just the default branch should be fine.
Closing as we merge the Triton kernel upstream.
Making this PR for a quick review before I open the main PR for upstream.
@vgokhale Added you as a co-author since vllm/model_executor/layers/attention/ops/flash_attention_triton.py is basically all ours 😆