
Custom PA perf improvements #222

Merged
merged 13 commits into main on Oct 8, 2024

Conversation

@sanyalington commented Oct 4, 2024

Enable 128K context length in custom PA.
Enable custom PA to write fp8 output with scaling, and enable this perf optimization for Llama. The optimization is only applied on the ROCm custom PA path when chunked prefill is disabled and the environment variable VLLM_USE_ROCM_CUSTOM_PAGED_ATTN_FP8_OUT=1 is set.
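
A minimal sketch of how this gating could look, using the env var named in the description; the helper name, its arguments, and where it would be called are hypothetical illustrations, not the actual vLLM code:

```python
import os

# Env flag named in the PR description; everything else in this sketch is a
# hypothetical illustration of the gating, not the actual vLLM code path.
_FP8_OUT_ENABLED = os.getenv(
    "VLLM_USE_ROCM_CUSTOM_PAGED_ATTN_FP8_OUT", "0") == "1"


def use_custom_pa_fp8_output(on_rocm_custom_pa: bool,
                             chunked_prefill_enabled: bool) -> bool:
    """True when custom PA should write fp8 output with scaling.

    Per the description: ROCm custom PA path only, chunked prefill
    disabled, and the env var set to 1.
    """
    return (_FP8_OUT_ENABLED
            and on_rocm_custom_pa
            and not chunked_prefill_enabled)
```

With the flag set (e.g. `VLLM_USE_ROCM_CUSTOM_PAGED_ATTN_FP8_OUT=1`), the fp8 output path would be taken only when the other two conditions also hold.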

@shajrawi (Collaborator) left a comment:

ship it

@shajrawi merged commit b51fe69 into main Oct 8, 2024
16 of 17 checks passed