[ROCm] Faster Custom Paged Attention kernels #12348
base: main
Conversation
Ported ROCm/vllm changes to upstream vLLM

This commit manually ports changes from ROCm/vllm (ROCm#372) to upstream vLLM. The original work was done by sanyalington.

Co-authored-by: sanyalington <[email protected]>
Signed-off-by: vllmellm <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Force-pushed from 9be5f70 to f57dcb9 (Compare)
Signed-off-by: sanyalington <[email protected]>
Force-pushed from f57dcb9 to 4f71b54 (Compare)
Regarding the API changes of …
Seeking advice on handling the variables …
@tjtanaa Please fix the DCO error:
…iminate the need for additional arguments (partition_size and fp8_output_scale) in its API.
Signed-off-by: vllmellm <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: vllmellm <[email protected]>
Signed-off-by: vllmellm <[email protected]>
… and code documentation. Updated its unit test to match the correct partition size based on the paged attention version as well as the platform type.
Signed-off-by: vllmellm <[email protected]>
@hongxiayang We find that rebasing is hard, as we had merged from …
Force-pushed from e8e548c to a1a36f3 (Compare)
Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Thanks a lot. LGTM.
I also verified the throughput numbers using the image built from @tjtanaa's branch.
@tjtanaa Can you work with @DarkLight1337 to see what else is needed in order to merge this PR?
@tlrmchlsmth @WoosukKwon can either of you take a look at this PR?
global PARTITION_SIZE
Why make PARTITION_SIZE a global here? I'm not sure what PARTITION_SIZE does, or why it would be different on ROCm.
@tlrmchlsmth This line tells Python that PARTITION_SIZE inside the test_paged_attention function refers to the global variable. It is needed because of https://github.com/vllm-project/vllm/blob/9501972249ca7dca0704bceb4308163f30999a6d/tests/kernels/test_attention.py#L217C36-L219C49, which reassigns PARTITION_SIZE to the value of PARTITION_SIZE_ROCM; without the global declaration, that assignment would make Python treat PARTITION_SIZE as a local variable.
PARTITION_SIZE_ROCM is a performance-tuned hyperparameter for the ROCm custom paged attention kernel, which is why it differs from PARTITION_SIZE on other platforms.
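A minimal, self-contained sketch of the scoping behavior described above (the values and function name are illustrative, not the actual test code):

```python
# Module-level defaults; the ROCm value here is illustrative only.
PARTITION_SIZE = 512
PARTITION_SIZE_ROCM = 256

def test_paged_attention_sketch(on_rocm: bool) -> int:
    # Because this function assigns to PARTITION_SIZE below, Python would
    # otherwise treat PARTITION_SIZE as a local variable for the whole
    # function body, and reading it on the non-ROCm path would raise
    # UnboundLocalError. The global declaration makes the assignment
    # update the module-level variable instead.
    global PARTITION_SIZE
    if on_rocm:
        PARTITION_SIZE = PARTITION_SIZE_ROCM
    return PARTITION_SIZE

print(test_paged_attention_sketch(on_rocm=True))   # 256
print(test_paged_attention_sketch(on_rocm=False))  # 256 (the global was updated above)
```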
I'll take a look tomorrow
Description
This PR implements a faster Custom Paged Attention (CPA) kernel based on mfma16x16x16 instructions. This feature is ported from ROCm/vllm (ROCm#372).
End-to-End Performance gain
Model: Llama-3.1-70B-Instruct
Tensor Parallelism: 1
GPU: MI300X
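As a rough sketch of how this configuration could be exercised with vLLM's offline LLM API (the model id and sampling settings are assumptions based on the setup above, not the exact benchmark script used for the numbers in this PR):

```python
from vllm import LLM, SamplingParams

# Assumed reproduction setup: Llama-3.1-70B-Instruct on a single GPU (TP=1).
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed HF model id
    tensor_parallel_size=1,
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

# Generate a few completions; throughput would be measured around calls
# like this (or with vLLM's benchmark scripts) to compare kernel versions.
outputs = llm.generate(["The capital of France is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```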