From eb6fd2b73573a8e53be90f0326a5a59f10518ce9 Mon Sep 17 00:00:00 2001
From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Date: Thu, 27 Feb 2025 11:52:18 -0800
Subject: [PATCH] update

---
 docs/source/getting_started/v1_user_guide.md | 46 ++++++++++++--------
 1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md
index fc598687c0092..a3c08cb553a6d 100644
--- a/docs/source/getting_started/v1_user_guide.md
+++ b/docs/source/getting_started/v1_user_guide.md
@@ -1,44 +1,52 @@
 # vLLM V1 User Guide
 
-## Why vLLM v1?
-Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html)
+## Why vLLM V1?
+For background, see the previous blog post: [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/V1-alpha-release.html)
 
 ## Semantic changes and deprecated features
 
 ### Logprobs
-- vLLM v1 now supports both sample logprobs and prompt logprobs, as introduced in this [PR](https://github.com/vllm-project/vllm/pull/9880).
+- vLLM V1 now supports both sample logprobs and prompt logprobs, as introduced in this [PR](https://github.com/vllm-project/vllm/pull/9880).
 - **Current Limitations**:
-  - v1 prompt logprobs do not support prefix caching.
-  - v1 logprobs are computed before logits post-processing, so penalty
+  - V1 prompt logprobs do not support prefix caching.
+  - V1 logprobs are computed before logits post-processing, so penalty
     adjustments and temperature scaling are not applied.
 - The team is actively working on implementing logprobs that include post-sampling adjustments.
 
-### Encoder-Decoder
-- vLLM v1 is currently limited to decoder-only Transformers. Please check out our
-  [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html) for a
-  more detailed list of the supported models. Encoder-decoder models support is not
-  happending soon.
+### The following features have been deprecated in V1:
 
-### List of features that are deprecated in v1
+#### Deprecated sampling features
 - best_of
 - logits_processors
 - beam_search
 
+#### Deprecated KV cache features
+- KV Cache swapping
+- KV Cache offloading
+- FP8 KV Cache
+
 ## Unsupported features
 
-### LoRA
-- LoRA works for V1 on the main branch, but its performance is inferior to that of V0.
+- **LoRA**: LoRA works for V1 on the main branch, but its performance is inferior to that
+  of V0 (a minimal usage sketch follows this list).
   The team is actively working on improving the performance [PR](https://github.com/vllm-project/vllm/pull/13096).
 
-### Spec Decode other than ngram
-- Currently, only ngram spec decode is supported in V1 after this [PR](https://github.com/vllm-project/vllm/pull/12193).
+- **Spec Decode other than ngram**: Currently, only ngram spec decode is supported in V1
+  after this [PR](https://github.com/vllm-project/vllm/pull/12193).
 
-### KV Cache Swapping & Offloading & FP8 KV Cache
-- vLLM v1 does not support KV Cache swapping, offloading, and FP8 KV Cache yet. The
-  team is working actively on it.
+- **Quantization**: When the CUDA graph is enabled, V1 defaults to the
+  piecewise CUDA graph introduced in this [PR](https://github.com/vllm-project/vllm/pull/10058); consequently, FP8 and other quantizations are not supported.
 
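+Below is a minimal sketch of LoRA usage with the offline `LLM` API (the base model
+name and the adapter path are placeholders; substitute your own):
+
+```python
+from vllm import LLM, SamplingParams
+from vllm.lora.request import LoRARequest
+
+# Placeholder base model; use any LoRA-compatible model.
+llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
+
+sampling_params = SamplingParams(temperature=0.0, max_tokens=64)
+outputs = llm.generate(
+    ["Write a SQL query that lists all users."],
+    sampling_params,
+    # LoRARequest(adapter_name, adapter_id, local_path); the path is a placeholder.
+    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql_lora_adapter"),
+)
+print(outputs[0].outputs[0].text)
+```
+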
+## Unsupported models
 
-## Unsupported Models
+Models with the `SupportsV0Only` tag in their model definition are not supported by V1.
+
+- **Pooling Models**: Pooling models are not supported in V1 yet.
+- **Encoder-Decoder**: vLLM V1 is currently limited to decoder-only Transformers.
+  Please check out our
+  [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html) for a
+  more detailed list of the supported models. Support for encoder-decoder models is
+  not expected soon.
 
 ## FAQ
 