[ROCm] Faster Custom Paged Attention kernels #12348

Open · wants to merge 10 commits into main
Conversation

@tjtanaa (Contributor) commented Jan 23, 2025

Description

This PR implements a faster Custom Paged Attention (CPA) kernel based on mfma16x16x16 instructions.
This feature is from ROCm/vllm (ROCm#372).

End-to-End Performance gain

Model: Llama-3.1-70B-Instruct
Tensor Parallelism: 1
GPU: MI300X

| CPA Version | Input length | Output length | KV-cache-dtype | Quantization | Prompt numbers | Req/s | Total Tokens/s | Output Tokens/s |
|---|---|---|---|---|---|---|---|---|
| before changes | 128 | 128 | fp8_e4m3 | fp8 | 200 | 13.05 | 3340.6 | 1670.3 |
| before changes | 128 | 256 | fp8_e4m3 | fp8 | 200 | 7.56 | 2901.31 | 1934.21 |
| before changes | 128 | 2048 | fp8_e4m3 | fp8 | 200 | 0.78 | 1698.35 | 1598.45 |
| before changes | 512 | 128 | fp8_e4m3 | fp8 | 200 | 6.44 | 4122.57 | 824.51 |
| before changes | 512 | 256 | fp8_e4m3 | fp8 | 200 | 4.48 | 3443.46 | 1147.82 |
| before changes | 512 | 2048 | fp8_e4m3 | fp8 | 200 | 0.66 | 1696.64 | 1357.31 |
| before changes | ShareGPT | | fp8_e4m3 | fp8 | 1000 | 6.22 | 2574.19 | 1234.64 |
| optimized | 128 | 128 | fp8_e4m3 | fp8 | 200 | 15.11 | 3867.75 | 1933.87 |
| optimized | 128 | 256 | fp8_e4m3 | fp8 | 200 | 9.01 | 3459.98 | 2306.65 |
| optimized | 128 | 2048 | fp8_e4m3 | fp8 | 200 | 1.2 | 2609.04 | 2455.57 |
| optimized | 512 | 128 | fp8_e4m3 | fp8 | 200 | 7.33 | 4694.05 | 938.81 |
| optimized | 512 | 256 | fp8_e4m3 | fp8 | 200 | 5.5 | 4223.29 | 1407.76 |
| optimized | 512 | 2048 | fp8_e4m3 | fp8 | 200 | 1.03 | 2648.55 | 2118.84 |
| optimized | ShareGPT | | fp8_e4m3 | fp8 | 1000 | 7.45 | 3081.14 | 1477.79 |
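
The exact benchmark script and flags used for these runs are not shown in this PR; as a rough sketch, the configuration in the table maps onto vLLM's offline API roughly as follows (model ID and prompt text are illustrative assumptions):

```python
from vllm import LLM, SamplingParams

# Illustrative setup mirroring the table: fp8 quantization, fp8_e4m3 KV cache,
# tensor parallelism of 1 (single MI300X).
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed HF model ID
    tensor_parallel_size=1,
    quantization="fp8",
    kv_cache_dtype="fp8_e4m3",
)

# 200 prompts with a fixed output length, as in the 128/128 row; ignore_eos
# forces generation to run to the full output length.
sampling_params = SamplingParams(max_tokens=128, ignore_eos=True)
outputs = llm.generate(["Hello, my name is"] * 200, sampling_params)
```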

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Jan 23, 2025
vllmellm and others added 2 commits January 23, 2025 08:50
Ported ROCm/vllm changes to upstream vLLM

This commit manually ports changes from ROCm/vllm (ROCm#372) to upstream vLLM.
The original work was done by sanyalington.

Co-authored-by: sanyalington <[email protected]>

Signed-off-by: vllmellm <[email protected]>
@tjtanaa tjtanaa force-pushed the port-rocm-cpa-credit branch 2 times, most recently from 9be5f70 to f57dcb9, on January 23, 2025 08:57
@tjtanaa tjtanaa force-pushed the port-rocm-cpa-credit branch from f57dcb9 to 4f71b54 on January 23, 2025 09:01
@tjtanaa tjtanaa changed the title [AMD] Faster Custom Paged Attention kernels [ROCm] Faster Custom Paged Attention kernels Jan 23, 2025
@tjtanaa (Contributor, Author) commented Jan 23, 2025

Regarding the API changes to paged_attention in csrc/rocm/torch_bindings.cpp: this change only affects the ROCm code path and does not interfere with the code paths of other platforms.

 rocm_ops.def(
      "paged_attention(Tensor! out, Tensor exp_sums,"
      "                Tensor max_logits, Tensor tmp_out,"
      "                Tensor query, Tensor key_cache,"
      "                Tensor value_cache, int num_kv_heads,"
      "                float scale, Tensor block_tables,"
      "                Tensor context_lens, int block_size,"
      "                int max_context_len,"
      "                Tensor? alibi_slopes,"
      "                str kv_cache_dtype,"
      "                float k_scale, float v_scale,"
      "                Tensor? fp8_out_scale,"
      "                int partition_size) -> ()");

Seeking advice on handling the variables fp8_out_scale and partition_size.

Situation: These two variables, fp8_out_scale and partition_size, have been introduced in the ROCm Custom Paged Attention, but they are not yet used by the higher-level abstractions. They are set to fp8_out_scale=None and partition_size=256. The value partition_size=256 has been found experimentally to work well on MI300.
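
For context, here is a rough sketch (not the actual backend code) of how a caller such as vllm/attention/backends/rocm_flash_attn.py would pass these two arguments under the binding shown above; the tensor variables are assumed to be prepared by the backend as usual:

```python
# Illustrative call following the paged_attention schema above; the tensor
# arguments (out, exp_sums, ..., context_lens) are assumed to have been
# allocated by the ROCm attention backend beforehand.
ops.paged_attention_rocm(
    out, exp_sums, max_logits, tmp_out,
    query, key_cache, value_cache, num_kv_heads,
    scale, block_tables, context_lens, block_size,
    max_context_len,
    alibi_slopes,
    kv_cache_dtype,
    k_scale, v_scale,
    None,  # fp8_out_scale: no fp8 output-scaling strategy wired up yet
    256,   # partition_size: experimentally tuned value for MI300
)
```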

Option 1:

  • Remove fp8_out_scale from csrc/rocm/attention.cu
  • Hard-code partition_size to 256 in csrc/rocm/attention.cu.
    This avoids changing the paged_attention API in csrc/rocm/torch_bindings.cpp

Option 2:

  • Keep the variables as is, and add a TODO so a future update remembers to introduce an fp8 output-scaling strategy for ROCm.
  • Set fp8_out_scale=None and partition_size=256 when calling ops.paged_attention_rocm in vllm/attention/backends/rocm_flash_attn.py

We have implemented Option 1.

@hongxiayang (Collaborator) commented:

@tjtanaa Please fix the DCO error:
  1. Ensure you have a local copy of your branch by checking out the pull request locally via the command line.
  2. In your local branch, run: git rebase HEAD~4 --signoff
  3. Force-push your changes to overwrite the branch: git push --force-with-lease origin port-rocm-cpa-credit

…iminate the need for additional arguments (partition_size and fp8_output_scale) in its API.

Signed-off-by: vllmellm <[email protected]>
mergify bot commented Jan 24, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tjtanaa.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 24, 2025
@mergify mergify bot removed the needs-rebase label Jan 24, 2025
… and code documentation. Updated its unit test to match the correct partition size based on the paged attention version as well as the platform type.

Signed-off-by: vllmellm <[email protected]>
@tjtanaa tjtanaa marked this pull request as ready for review January 27, 2025 12:27
@tjtanaa (Contributor, Author) commented Jan 27, 2025

> @tjtanaa Please fix the DCO error: Ensure you have a local copy of your branch by checking out the pull request locally via command line. In your local branch, run: git rebase HEAD~4 --signoff. Force push your changes to overwrite the branch: git push --force-with-lease origin port-rocm-cpa-credit

@hongxiayang We find that rebasing is difficult because we had already merged from main. While fixing the DCO we had to resolve merge conflicts twice, and doing so again would require us to retest everything. It seems there are ways to override the DCO check during merge. Could we get more input from the vLLM maintainers on this DCO issue?

@mergify mergify bot added the documentation and frontend labels Jan 27, 2025
@tjtanaa tjtanaa force-pushed the port-rocm-cpa-credit branch from e8e548c to a1a36f3 on January 28, 2025 03:08
@hongxiayang (Collaborator) left a comment:

Thanks a lot. LGTM.

@hongxiayang (Collaborator) commented:

I also verified the throughput numbers using the image built from @tjtanaa's branch.

@hongxiayang (Collaborator) commented:

@tjtanaa Can you work with @DarkLight1337 to see what else is needed in order to merge this PR?
Thanks for your effort in upstreaming this, fixing the tests, and cleaning up other spelling errors as well.

@DarkLight1337 (Member) commented:

@tlrmchlsmth @WoosukKwon can either of you take a look at this PR?

Comment on lines +148 to +149:

    global PARTITION_SIZE

@tlrmchlsmth (Collaborator) commented:

why make PARTITION_SIZE a global here? Not sure what PARTITION_SIZE does, or why it would be different on ROCm

@tjtanaa (Contributor, Author) commented:

@tlrmchlsmth This line tells Python that PARTITION_SIZE inside the test_paged_attention function refers to the module-level (global) variable. It is needed because https://github.com/vllm-project/vllm/blob/9501972249ca7dca0704bceb4308163f30999a6d/tests/kernels/test_attention.py#L217C36-L219C49 reassigns PARTITION_SIZE to the value of PARTITION_SIZE_ROCM; without the global declaration, that assignment would make Python treat PARTITION_SIZE as a local variable.

PARTITION_SIZE_ROCM is a performance-tuned hyperparameter for the ROCm custom paged attention. That is why it differs from PARTITION_SIZE on other platforms.
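
A minimal, standalone illustration of the scoping behavior described above (names and values mirror the test but are simplified and are not the actual test code):

```python
PARTITION_SIZE = 512        # default partition size used on other platforms
PARTITION_SIZE_ROCM = 256   # ROCm-tuned value (MI300)
ON_ROCM = True              # stand-in for the real platform check

def without_global():
    # The assignment below makes Python treat PARTITION_SIZE as a local
    # variable for the whole function body, so reading it when ON_ROCM is
    # False raises UnboundLocalError.
    if ON_ROCM:
        PARTITION_SIZE = PARTITION_SIZE_ROCM
    return PARTITION_SIZE

def with_global():
    global PARTITION_SIZE   # refer to the module-level variable instead
    if ON_ROCM:
        PARTITION_SIZE = PARTITION_SIZE_ROCM
    return PARTITION_SIZE

print(with_global())        # 256 when ON_ROCM is True, otherwise 512
```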

@tlrmchlsmth (Collaborator) commented:

I'll take a look tomorrow

Labels: ci/build, documentation, frontend, rocm
6 participants