Your current environment

How would you like to use vllm

1. Model size: 14B
2. Performance: average prompt throughput 120.1 tokens/s, average generation throughput 41.9 tokens/s; running: 1 request, swapped: 0 requests, pending: 0 requests; GPU KV cache usage: 1.0%, CPU KV cache usage: 0.0%.

The generation throughput is too slow. I suspect the cause is the low GPU KV cache usage. How can I increase the GPU KV cache usage and improve the generation throughput?
Replies: 1 comment

If you're only running 1 request, your KV cache is unlikely to be filled up: KV cache usage grows with the number of concurrent sequences and their lengths, so low usage here just reflects low concurrency, not the cause of slow decoding. If you want to improve single-request generation speed, you can consider FP8 quantization or speculative decoding; rough sketches of both follow.
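As a sketch of the FP8 suggestion (assuming a vLLM build with FP8 support and a GPU that can run it; the model name below is a placeholder for whatever 14B checkpoint you are serving, not something from this thread):

```python
from vllm import LLM, SamplingParams

# Quantizing the weights (and optionally the KV cache) to FP8 cuts the bytes
# read per decoded token; single-request decoding is memory-bandwidth bound,
# so this directly raises generation throughput.
llm = LLM(
    model="Qwen/Qwen1.5-14B-Chat",  # placeholder 14B model
    quantization="fp8",             # on-the-fly FP8 weight quantization
    kv_cache_dtype="fp8",           # optional: store the KV cache in FP8 too
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```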
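And a sketch of speculative decoding with a small draft model. The exact engine arguments have changed across vLLM releases (newer versions take a `speculative_config` dict), so check the docs for your version; both model names are placeholders:

```python
from vllm import LLM, SamplingParams

# A small draft model proposes several tokens per step; the 14B target model
# verifies them in a single forward pass and keeps the matching prefix.
llm = LLM(
    model="Qwen/Qwen1.5-14B-Chat",               # placeholder target model
    speculative_model="Qwen/Qwen1.5-0.5B-Chat",  # placeholder draft model
    num_speculative_tokens=5,                    # tokens drafted per step
)

outputs = llm.generate(
    ["Explain KV caching in one paragraph."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

The speedup depends on how often the target model accepts the draft's tokens, so speculation helps most when the two models agree frequently (e.g., under greedy decoding).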