From 904690eb16be69fc5239d82747cdf69c98e71124 Mon Sep 17 00:00:00 2001
From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Date: Tue, 25 Feb 2025 14:49:47 -0800
Subject: [PATCH 01/22] add vLLM V1 User Guide Template

Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
---
 docs/source/getting_started/v1_user_guide.md | 15 +++++++++++++++
 docs/source/index.md                         |  2 ++
 2 files changed, 17 insertions(+)
 create mode 100644 docs/source/getting_started/v1_user_guide.md

diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md
new file mode 100644
index 0000000000000..6d38dca6ba8f1
--- /dev/null
+++ b/docs/source/getting_started/v1_user_guide.md
@@ -0,0 +1,15 @@
+# vLLM V1 User Guide
+
+## Why vLLM v1?
+Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html)
+
+## Semantic changes and deprecated features
+
+
+## Unsupported features
+
+
+## Unsupported models
+
+
+## FAQ
diff --git a/docs/source/index.md b/docs/source/index.md
index d17155647f9fe..a9e29f28af55b 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -67,6 +67,8 @@ getting_started/quickstart
 getting_started/examples/examples_index
 getting_started/troubleshooting
 getting_started/faq
+getting_started/v1_user_guide
+
 :::

 % What does vLLM support?

From c294f75cca82bb7e4bcad943f1b2f3d8f673593d Mon Sep 17 00:00:00 2001
From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Date: Wed, 26 Feb 2025 17:52:32 -0800
Subject: [PATCH 02/22] wip update v1 user guide

Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
---
 docs/source/getting_started/v1_user_guide.md | 23 ++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md
index 6d38dca6ba8f1..5a15c30c90b49 100644
--- a/docs/source/getting_started/v1_user_guide.md
+++ b/docs/source/getting_started/v1_user_guide.md
@@ -6,8 +6,31 @@ Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https:
 
 ## Semantic changes and deprecated features
 
+### Logprobs
+- vLLM v1 now supports sample logprobs and prompt logprobs starting from this [PR](https://github.com/vllm-project/vllm/pull/9880);
+  however, its logprobs support still underperforms compared to v0. We are working on improving this.
+
+### Encoder-Decoder
+- vLLM v1 is currently limited to decoder-only Transformers. Please check out our [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html) for a more detailed list of the supported models.
+
 ## Unsupported features
 
+vLLM v1 does not support the following features yet.
+
+
+
+
+### FP8 KV Cache
+- This feature is available in vLLM v0 and not in v1. With v0, you can enable
+  FP8 KV
+  Cache by specifying:
+  ```--kv-cache-dtype fp8```
+### CPU Offload
+- vLLM v1 does not support CPU offload. 
vLLM v0 has the CPU offload + implementation in `vllm/worker/model_runner.py` that you can specify with + `--cpu-offload-gb 1` (1gb) + ## Unsupported models From 228337126c2d50f452a561275491d5e6a0b27555 Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Wed, 26 Feb 2025 21:57:35 -0800 Subject: [PATCH 03/22] update logprob desc Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 5a15c30c90b49..6e19eb54221dc 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -3,12 +3,12 @@ ## Why vLLM v1? Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) -## Semantic changes and deprecated features - +## Semantic changes and deprecated features ### Logprobs -- vLLM v1 now supports sample logprobs and prompt logprobs starting from this [pr](https://github.com/vllm-project/vllm/pull/9880); - however, its logprobs support still underperforms compared to v0. We are working on improving this. +- vLLM v1 now supports both sample logprobs and prompt logprobs, as introduced in this [PR](https://github.com/vllm-project/vllm/pull/9880). +- In vLLM v1, logprobs are computed before logits post-processing, so penalty adjustments and temperature scaling are not applied. +- The team is actively working on implementing logprobs that include post-sampling adjustments, incorporating both penalties and temperature scaling. ### Encoder-Decoder - vLLM v1 is currently limited to decoder-only Transformers. Please check out our [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html) for a more detailed list of the supported models. @@ -17,15 +17,12 @@ Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https: vLLM v1 does not support the following features yet. - - - - ### FP8 KV Cache - This feature is available in vLLM v0 ant not in v1. With v0, you can enable FP8 KV Cache by specifying: ```--kv-cache-dtype fp8``` + ### CPU Offload - vLLM v1 does not supports CPU offload. vLLM v0 has the CPU offload implementation in `vllm/worker/model_runner.py` that you can specify with From 1486cb8fb355f14805b3559b69bcde3df8a9afc6 Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Wed, 26 Feb 2025 22:09:04 -0800 Subject: [PATCH 04/22] update logprobs desc Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 6e19eb54221dc..5dc615a25a012 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -7,8 +7,11 @@ Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https: ### Logprobs - vLLM v1 now supports both sample logprobs and prompt logprobs, as introduced in this [PR](https://github.com/vllm-project/vllm/pull/9880). -- In vLLM v1, logprobs are computed before logits post-processing, so penalty adjustments and temperature scaling are not applied. 
-- The team is actively working on implementing logprobs that include post-sampling adjustments, incorporating both penalties and temperature scaling. +- **Current Limitations**: + - v1 prompt logprobs do not support prefix caching. + - v1 logprobs are computed before logits post-processing, so penalty + adjustments and temperature scaling are not applied. +- The team is actively working on implementing logprobs that include post-sampling adjustments. ### Encoder-Decoder - vLLM v1 is currently limited to decoder-only Transformers. Please check out our [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html) for a more detailed list of the supported models. From f9bb56341f944db7c90d041ae39281a4eb4cf1d8 Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Wed, 26 Feb 2025 22:21:03 -0800 Subject: [PATCH 05/22] update Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 24 ++++++++++---------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 5dc615a25a012..91d166495e979 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -18,18 +18,18 @@ Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https: ## Unsupported features -vLLM v1 does not support the following features yet. - -### FP8 KV Cache -- This feature is available in vLLM v0 ant not in v1. With v0, you can enable - FP8 KV - Cache by specifying: - ```--kv-cache-dtype fp8``` - -### CPU Offload -- vLLM v1 does not supports CPU offload. vLLM v0 has the CPU offload - implementation in `vllm/worker/model_runner.py` that you can specify with - `--cpu-offload-gb 1` (1gb) + +### LoRA +- LoRA works for V1 on the main branch, but its performance is inferior to that of V0. + The team is actively working on improving the performance on going [PR](https://github.com/vllm-project/vllm/pull/13096). + +### Spec decode other than ngram +- Currently, only ngram spec decode is supported in V1 after this [PR] + (https://github.com/vllm-project/vllm/pull/12193). + +### KV Cache Swapping & Offloading & FP8 +- vLLM v1 does not support KV Cache swapping, offloading, and FP8 KV Cache yet. The + team is working actively on it. ## Unsupported models From 50025508fc12c22f7d1cf6ff8d57fbeef186700d Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Wed, 26 Feb 2025 22:37:16 -0800 Subject: [PATCH 06/22] fix Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 91d166495e979..e647a5861d57e 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -14,20 +14,22 @@ Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https: - The team is actively working on implementing logprobs that include post-sampling adjustments. ### Encoder-Decoder -- vLLM v1 is currently limited to decoder-only Transformers. Please check out our [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html) for a more detailed list of the supported models. 
+- vLLM v1 is currently limited to decoder-only Transformers. Please check out our + [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html) for a + more detailed list of the supported models. Encoder-decoder models support is not + happending soon. ## Unsupported features ### LoRA - LoRA works for V1 on the main branch, but its performance is inferior to that of V0. - The team is actively working on improving the performance on going [PR](https://github.com/vllm-project/vllm/pull/13096). + The team is actively working on improving the performance [PR](https://github.com/vllm-project/vllm/pull/13096). ### Spec decode other than ngram -- Currently, only ngram spec decode is supported in V1 after this [PR] - (https://github.com/vllm-project/vllm/pull/12193). +- Currently, only ngram spec decode is supported in V1 after this [PR](https://github.com/vllm-project/vllm/pull/12193). -### KV Cache Swapping & Offloading & FP8 +### KV Cache Swapping & Offloading & FP8 KV Cache - vLLM v1 does not support KV Cache swapping, offloading, and FP8 KV Cache yet. The team is working actively on it. From b395576778c8f98a4b0315119f2d0b78249f88ba Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Wed, 26 Feb 2025 22:47:57 -0800 Subject: [PATCH 07/22] update Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index e647a5861d57e..fc598687c0092 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -19,14 +19,18 @@ Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https: more detailed list of the supported models. Encoder-decoder models support is not happending soon. -## Unsupported features +### List of features that are deprecated in v1 +- best_of +- logits_processors +- beam_search +## Unsupported features ### LoRA - LoRA works for V1 on the main branch, but its performance is inferior to that of V0. The team is actively working on improving the performance [PR](https://github.com/vllm-project/vllm/pull/13096). -### Spec decode other than ngram +### Spec Decode other than ngram - Currently, only ngram spec decode is supported in V1 after this [PR](https://github.com/vllm-project/vllm/pull/12193). ### KV Cache Swapping & Offloading & FP8 KV Cache @@ -34,7 +38,7 @@ Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https: team is working actively on it. -## Unsupported models +## Unsupported Models ## FAQ From 658957f8118edf4d8ae9f9bd51968a558669e8a3 Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Thu, 27 Feb 2025 11:52:18 -0800 Subject: [PATCH 08/22] update Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 46 ++++++++++++-------- 1 file changed, 27 insertions(+), 19 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index fc598687c0092..a3c08cb553a6d 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -1,44 +1,52 @@ # vLLM V1 User Guide -## Why vLLM v1? 
-Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) +## Why vLLM V1? +Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/V1-alpha-release.html) ## Semantic changes and deprecated features ### Logprobs -- vLLM v1 now supports both sample logprobs and prompt logprobs, as introduced in this [PR](https://github.com/vllm-project/vllm/pull/9880). +- vLLM V1 now supports both sample logprobs and prompt logprobs, as introduced in this [PR](https://github.com/vllm-project/vllm/pull/9880). - **Current Limitations**: - - v1 prompt logprobs do not support prefix caching. - - v1 logprobs are computed before logits post-processing, so penalty + - V1 prompt logprobs do not support prefix caching. + - V1 logprobs are computed before logits post-processing, so penalty adjustments and temperature scaling are not applied. - The team is actively working on implementing logprobs that include post-sampling adjustments. -### Encoder-Decoder -- vLLM v1 is currently limited to decoder-only Transformers. Please check out our - [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html) for a - more detailed list of the supported models. Encoder-decoder models support is not - happending soon. +### The following features has been deprecated in V1: -### List of features that are deprecated in v1 +#### Deprecated sampling features - best_of - logits_processors - beam_search +#### Deprecated KV Cache +- KV Cache swapping +- KV Cache offloading +- FP8 KV Cache + ## Unsupported features -### LoRA -- LoRA works for V1 on the main branch, but its performance is inferior to that of V0. +- **LoRA**: LoRA works for V1 on the main branch, but its performance is inferior to that + of V0. The team is actively working on improving the performance [PR](https://github.com/vllm-project/vllm/pull/13096). -### Spec Decode other than ngram -- Currently, only ngram spec decode is supported in V1 after this [PR](https://github.com/vllm-project/vllm/pull/12193). +- **Spec Decode other than ngram**: currently, only ngram spec decode is supported in V1 + after this [PR](https://github.com/vllm-project/vllm/pull/12193). -### KV Cache Swapping & Offloading & FP8 KV Cache -- vLLM v1 does not support KV Cache swapping, offloading, and FP8 KV Cache yet. The - team is working actively on it. +- **Quantization**: For V1, when the CUDA graph is enabled, it defaults to the + piecewise CUDA graph introduced in this [PR](https://github.com/vllm-project/vllm/pull/10058) ; consequently, FP8 and other quantizations are not supported. +## Unsupported models -## Unsupported Models +All model with `SupportsV0Only` tag in the model definition is not supported by V1. + +- **Pooling Models**: Pooling models are not supported in V1 yet. +- **Encoder-Decoder**: vLLM V1 is currently limited to decoder-only Transformers. + Please check out our + [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html) for a + more detailed list of the supported models. Encoder-decoder models support is not + happending soon. 
## FAQ From eacf90a5254c33d3fe22054009f694de323c667f Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Thu, 27 Feb 2025 11:55:13 -0800 Subject: [PATCH 09/22] Update v1_user_guide.md Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index a3c08cb553a6d..f2f9ca7ebdbd3 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -20,10 +20,9 @@ Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https: - logits_processors - beam_search -#### Deprecated KV Cache +#### Deprecated KV Cache features - KV Cache swapping - KV Cache offloading -- FP8 KV Cache ## Unsupported features @@ -37,6 +36,8 @@ Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https: - **Quantization**: For V1, when the CUDA graph is enabled, it defaults to the piecewise CUDA graph introduced in this [PR](https://github.com/vllm-project/vllm/pull/10058) ; consequently, FP8 and other quantizations are not supported. +- **FP8 KV Cache**: FP8 KV Cache is not yet supported in V1. + ## Unsupported models All model with `SupportsV0Only` tag in the model definition is not supported by V1. From 560b2677c71dc9f389af9b1bb9607877bbed743d Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Sat, 1 Mar 2025 17:59:28 -0800 Subject: [PATCH 10/22] update unsupported Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 45 ++++++++++++++------ 1 file changed, 31 insertions(+), 14 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index f2f9ca7ebdbd3..845c759c5f862 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -7,9 +7,9 @@ Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https: ### Logprobs - vLLM V1 now supports both sample logprobs and prompt logprobs, as introduced in this [PR](https://github.com/vllm-project/vllm/pull/9880). -- **Current Limitations**: +- **Current Limitations**: - V1 prompt logprobs do not support prefix caching. - - V1 logprobs are computed before logits post-processing, so penalty + - V1 logprobs are computed before logits post-processing, so penalty adjustments and temperature scaling are not applied. - The team is actively working on implementing logprobs that include post-sampling adjustments. @@ -26,28 +26,45 @@ Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https: ## Unsupported features -- **LoRA**: LoRA works for V1 on the main branch, but its performance is inferior to that +- **LoRA**: LoRA works for V1 on the main branch, but its performance is inferior to that of V0. The team is actively working on improving the performance [PR](https://github.com/vllm-project/vllm/pull/13096). -- **Spec Decode other than ngram**: currently, only ngram spec decode is supported in V1 +- **Spec Decode other than ngram**: currently, only ngram spec decode is supported in V1 after this [PR](https://github.com/vllm-project/vllm/pull/12193). 
-- **Quantization**: For V1, when the CUDA graph is enabled, it defaults to the - piecewise CUDA graph introduced in this [PR](https://github.com/vllm-project/vllm/pull/10058) ; consequently, FP8 and other quantizations are not supported. +- **Quantization**: For V1, when the CUDA graph is enabled, it defaults to the + piecewiseCUDA graphintroduced in this[PR](https://github.com/vllm-project/vllm/pull/10058); consequently,FP8 and other quantizations are not supported. - **FP8 KV Cache**: FP8 KV Cache is not yet supported in V1. -## Unsupported models -All model with `SupportsV0Only` tag in the model definition is not supported by V1. -- **Pooling Models**: Pooling models are not supported in V1 yet. -- **Encoder-Decoder**: vLLM V1 is currently limited to decoder-only Transformers. - Please check out our - [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html) for a - more detailed list of the supported models. Encoder-decoder models support is not - happending soon. +## Unsupported Models +## Unsupported Models + +vLLM V1 excludes models tagged with `SupportsV0Only` to focus on high-throughput, +decoder-only inference. The unsupported categories are: + +- **Embedding Models** + vLLM V1 does not yet include a `PoolingModelRunner`. + *Example*: `BAAI/bge-m3` + +- **Mamba Models** + Models using selective state space mechanisms (instead of standard transformer + attention) are not supported. + *Examples*: + - Pure Mamba models (e.g. `BAAI/mamba-large`) + - Hybrid Mamba-Transformer models (e.g. `ibm-ai-platform/Bamba-9B`) + +- **Encoder-Decoder Models** + vLLM V1 is optimized for decoder-only transformers. Models that require + cross-attention between separate encoder and decoder components are not supported. + *Example*: `facebook/bart-large-cnn` + +For a complete list of supported models and additional details, please refer to our +[documentation](https://docs.vllm.ai/en/latest/models/supported_models.html). +Support for encoder-decoder architectures is not planned in the near future. ## FAQ From 33d759f4a8815c2a0310a7baf73aef006a76316a Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Sat, 1 Mar 2025 18:21:45 -0800 Subject: [PATCH 11/22] address comments Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 46 +++++++++++--------- 1 file changed, 25 insertions(+), 21 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 845c759c5f862..216b09b9e6fd4 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -1,7 +1,8 @@ # vLLM V1 User Guide ## Why vLLM V1? -Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/V1-alpha-release.html) +Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) + ## Semantic changes and deprecated features @@ -15,10 +16,11 @@ Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https: ### The following features has been deprecated in V1: -#### Deprecated sampling features +#### Current deprecated sampling features +The team is working on supporting these features globally in the server. 
+ - best_of -- logits_processors -- beam_search +- per request logits processors #### Deprecated KV Cache features - KV Cache swapping @@ -38,33 +40,35 @@ Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https: - **FP8 KV Cache**: FP8 KV Cache is not yet supported in V1. +- **Structured Generation Fallback**: Only `xgrammar:no_fallback` is supported. + Details about the structured generation can be found [here](https://docs.vllm.ai/en/latest/features/structured_outputs.html). -## Unsupported Models ## Unsupported Models -vLLM V1 excludes models tagged with `SupportsV0Only` to focus on high-throughput, -decoder-only inference. The unsupported categories are: +vLLM V1 excludes models tagged with `SupportsV0Only` while we develop support for +other types. For a complete list of supported models, see our [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html). The +following categories are currently unsupported, but we plan to +support them eventually. -- **Embedding Models** - vLLM V1 does not yet include a `PoolingModelRunner`. +**Embedding Models** +- vLLM V1 does not yet include a `PoolingModelRunner` to support embedding/pooling + models. *Example*: `BAAI/bge-m3` -- **Mamba Models** - Models using selective state space mechanisms (instead of standard transformer - attention) are not supported. +**Mamba Models** +- Models using selective state-space mechanisms (instead of standard transformer attention) are not yet supported. *Examples*: - - Pure Mamba models (e.g. `BAAI/mamba-large`) - - Hybrid Mamba-Transformer models (e.g. `ibm-ai-platform/Bamba-9B`) + - Pure Mamba models (e.g., `BAAI/mamba-large`) + - Hybrid Mamba-Transformer models (e.g., `ibm-ai-platform/Bamba-9B`) -- **Encoder-Decoder Models** - vLLM V1 is optimized for decoder-only transformers. Models that require - cross-attention between separate encoder and decoder components are not supported. - *Example*: `facebook/bart-large-cnn` +**Encoder-Decoder Models** +- vLLM V1 is currently optimized for decoder-only transformers. Models requiring + cross-attention between separate encoder and decoder (e.g., + `facebook/bart-large-cnn`) are unsupported. -For a complete list of supported models and additional details, please refer to our -[documentation](https://docs.vllm.ai/en/latest/models/supported_models.html). -Support for encoder-decoder architectures is not planned in the near future. +For a complete list of supported models, see +[our documentation](https://docs.vllm.ai/en/latest/models/supported_models.html). 
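+
+If you prefer to check programmatically which architectures your installed vLLM build
+recognizes, the model registry can list them. The snippet below is only a sketch; treat
+the exact import location and method name of `ModelRegistry` as assumptions to verify
+against your version:
+
+```python
+from vllm import ModelRegistry
+
+# Print every model architecture name this vLLM installation knows about.
+for arch in sorted(ModelRegistry.get_supported_archs()):
+    print(arch)
+```
+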
## FAQ From 2c2898961e387504a982107b2905c9ec2fc3669e Mon Sep 17 00:00:00 2001 From: Jennifer Zhao Date: Sat, 1 Mar 2025 21:39:10 -0800 Subject: [PATCH 12/22] Apply suggestions from code review Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 216b09b9e6fd4..69fa605a6db8e 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -4,17 +4,15 @@ Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) -## Semantic changes and deprecated features - +## Semantic Changes and Deprecated Features ### Logprobs -- vLLM V1 now supports both sample logprobs and prompt logprobs, as introduced in this [PR](https://github.com/vllm-project/vllm/pull/9880). -- **Current Limitations**: +- vLLM V1 now supports both sample logprobs and prompt logprobs. Currently, the [implementation](https://github.com/vllm-project/vllm/pull/9880) has the following **limitations and semantic changes**: - V1 prompt logprobs do not support prefix caching. - V1 logprobs are computed before logits post-processing, so penalty adjustments and temperature scaling are not applied. - The team is actively working on implementing logprobs that include post-sampling adjustments. -### The following features has been deprecated in V1: +### Deprecated Features: #### Current deprecated sampling features The team is working on supporting these features globally in the server. @@ -66,7 +64,7 @@ support them eventually. **Encoder-Decoder Models** - vLLM V1 is currently optimized for decoder-only transformers. Models requiring cross-attention between separate encoder and decoder (e.g., - `facebook/bart-large-cnn`) are unsupported. + `facebook/bart-large-cnn`) are not yet supported. For a complete list of supported models, see [our documentation](https://docs.vllm.ai/en/latest/models/supported_models.html). From 4e978c2d34904ded918061998657d860f484c4cd Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Sun, 2 Mar 2025 02:24:42 +0000 Subject: [PATCH 13/22] address comments Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 101 ++++++++++++++----- 1 file changed, 74 insertions(+), 27 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 69fa605a6db8e..0d4d0d8a35eb1 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -1,53 +1,102 @@ # vLLM V1 User Guide ## Why vLLM V1? -Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) +vLLM V0 successfully supported a wide range of models and hardware, but as new features were developed independently, the system grew increasingly complex. This complexity made it harder to integrate new capabilities and introduced technical debt, revealing the need for a more streamlined and unified design. + +Building on V0’s success, vLLM V1 retains the stable and proven components from V0 +(such as the models, GPU kernels, and utilities). 
At the same time, it significantly +re-architects the core systems—covering the scheduler, KV cache manager, worker, +sampler, and API server—to provide a cohesive, maintainable framework that better +accommodates continued growth and innovation. + +Specifically, V1 aims to: + +- Provide a **simple, modular, and easy-to-hack codebase**. +- Ensure **high performance** with near-zero CPU overhead. +- **Combine key optimizations** into a unified architecture. +- Require **zero configs** by enabling features/optimizations by default. + +For more detailed please refer to the vLLM V1 blog post “[vLLM V1: A Major +Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html)” (published Jan 27, 2025) ## Semantic Changes and Deprecated Features + ### Logprobs -- vLLM V1 now supports both sample logprobs and prompt logprobs. Currently, the [implementation](https://github.com/vllm-project/vllm/pull/9880) has the following **limitations and semantic changes**: - - V1 prompt logprobs do not support prefix caching. - - V1 logprobs are computed before logits post-processing, so penalty - adjustments and temperature scaling are not applied. -- The team is actively working on implementing logprobs that include post-sampling adjustments. -### Deprecated Features: +[vLLM V1 introduces support for returning log probabilities (logprobs) for both +sampled tokens and the prompt.](https://github.com/vllm-project/vllm/pull/9880) +However, there are some important semantic +differences compared to V0: + +**Prompt Logprobs Without Prefix Caching** + +In vLLM V1, if you request prompt logprobs (using the `prompt_logprobs=true` flag), +prefix caching is not available. This means that if you want logprobs for the prompt, +you must disable prefix caching (e.g. by starting the server with `--no-enable-prefix-caching`). +The team is working to support prompt logprobs with caching. + +**Pre-Post-Processing Calculation** + +Logprobs in V1 are now computed immediately +after the model’s raw output (i.e. +before applying any logits post-processing such as temperature scaling or penalty +adjustments). As a result, the returned logprobs do not reflect the final adjusted +probabilities that might be used during sampling. + +In other words, if your sampling pipeline applies penalties or scaling, those +adjustments will affect token selection but not be visible in the logprobs output. -#### Current deprecated sampling features -The team is working on supporting these features globally in the server. +The team is working in progress to include these post-sampling +adjustments in future updates. -- best_of -- per request logits processors +### Deprecated Features -#### Deprecated KV Cache features -- KV Cache swapping -- KV Cache offloading +As part of the major architectural rework in vLLM V1, several legacy features have been removed to simplify the codebase and improve efficiency. -## Unsupported features +**Deprecated sampling features** + +- **best_of**: The sampling parameter best_of—which in V0 enabled + generating multiple candidate outputs per request and then selecting the best + one—has been deprecated in V1. +- **Per-Request Logits Processors**: In V0, users could pass custom + processing functions to adjust logits on a per-request basis. In vLLM V1 this + mechanism is deprecated. Instead, the design is moving toward supporting global + logits processors—a feature the team is actively working on for future releases. 
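+
+For reference, the two deprecated sampling features above were typically used in V0's
+offline API roughly as follows. This is only a sketch of the old V0 pattern (the model
+name is an arbitrary placeholder); it is not expected to work on V1:
+
+```python
+import torch
+from vllm import LLM, SamplingParams
+
+def ban_token_zero(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
+    # A toy per-request logits processor: never sample token id 0.
+    logits[0] = float("-inf")
+    return logits
+
+llm = LLM(model="facebook/opt-125m")
+params = SamplingParams(
+    n=1,
+    best_of=4,                            # deprecated in V1
+    logits_processors=[ban_token_zero],   # deprecated in V1
+    max_tokens=16,
+)
+print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
+```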
+ +**Deprecated KV Cache features** + +- KV Cache Swapping +- KV Cache Offloading + +## Unsupported or Unoptimized Features in vLLM V1 + +vLLM V1 is a major rewrite designed for improved throughput, architectural +simplicity, and enhanced distributed inference. Although many features have been +re‐implemented or optimized compared to earlier versions, some functionalities +remain either unsupported or not yet fully optimized: + +### Unoptimized Features - **LoRA**: LoRA works for V1 on the main branch, but its performance is inferior to that of V0. The team is actively working on improving the performance [PR](https://github.com/vllm-project/vllm/pull/13096). -- **Spec Decode other than ngram**: currently, only ngram spec decode is supported in V1 - after this [PR](https://github.com/vllm-project/vllm/pull/12193). +- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There + will be follow-up work to support other types of spec decode. -- **Quantization**: For V1, when the CUDA graph is enabled, it defaults to the - piecewiseCUDA graphintroduced in this[PR](https://github.com/vllm-project/vllm/pull/10058); consequently,FP8 and other quantizations are not supported. +### Unsupported Features -- **FP8 KV Cache**: FP8 KV Cache is not yet supported in V1. +- **FP8 KV Cache**: While vLLM V1 introduces new FP8 kernels for model weight quantization, support for an FP8 key–value cache is not yet available. Users must continue using FP16 (or other supported precisions) for the KV cache. -- **Structured Generation Fallback**: Only `xgrammar:no_fallback` is supported. +- **Structured Generation Fallback**: For structured output tasks, V1 currently + supports only the `xgrammar:no_fallback` mode. Details about the structured generation can be found [here](https://docs.vllm.ai/en/latest/features/structured_outputs.html). - - ## Unsupported Models vLLM V1 excludes models tagged with `SupportsV0Only` while we develop support for -other types. For a complete list of supported models, see our [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html). The -following categories are currently unsupported, but we plan to +other types. The following categories are currently unsupported, but we plan to support them eventually. **Embedding Models** @@ -68,5 +117,3 @@ support them eventually. For a complete list of supported models, see [our documentation](https://docs.vllm.ai/en/latest/models/supported_models.html). - -## FAQ From 32887cd1094676090919b4c7d5dbfe76cbd81b50 Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Sun, 2 Mar 2025 06:53:27 +0000 Subject: [PATCH 14/22] remove merged pr Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 69 ++++++++++---------- 1 file changed, 35 insertions(+), 34 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 0d4d0d8a35eb1..81c18798f9651 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -4,10 +4,10 @@ vLLM V0 successfully supported a wide range of models and hardware, but as new features were developed independently, the system grew increasingly complex. This complexity made it harder to integrate new capabilities and introduced technical debt, revealing the need for a more streamlined and unified design. 
-Building on V0’s success, vLLM V1 retains the stable and proven components from V0 -(such as the models, GPU kernels, and utilities). At the same time, it significantly -re-architects the core systems—covering the scheduler, KV cache manager, worker, -sampler, and API server—to provide a cohesive, maintainable framework that better +Building on V0’s success, vLLM V1 retains the stable and proven components from V0 +(such as the models, GPU kernels, and utilities). At the same time, it significantly +re-architects the core systems—covering the scheduler, KV cache manager, worker, +sampler, and API server—to provide a cohesive, maintainable framework that better accommodates continued growth and innovation. Specifically, V1 aims to: @@ -17,37 +17,37 @@ Specifically, V1 aims to: - **Combine key optimizations** into a unified architecture. - Require **zero configs** by enabling features/optimizations by default. -For more detailed please refer to the vLLM V1 blog post “[vLLM V1: A Major -Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html)” (published Jan 27, 2025) +For more detailed please refer to the vLLM V1 blog post [vLLM V1: A Major +Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) (published Jan 27, 2025) ## Semantic Changes and Deprecated Features ### Logprobs -[vLLM V1 introduces support for returning log probabilities (logprobs) for both -sampled tokens and the prompt.](https://github.com/vllm-project/vllm/pull/9880) -However, there are some important semantic +vLLM V1 introduces support for returning log probabilities (logprobs) for both +sampled tokens and the prompt. +However, there are some important semantic differences compared to V0: **Prompt Logprobs Without Prefix Caching** -In vLLM V1, if you request prompt logprobs (using the `prompt_logprobs=true` flag), +In vLLM V1, if you request prompt logprobs (using the `prompt_logprobs=true` flag), prefix caching is not available. This means that if you want logprobs for the prompt, you must disable prefix caching (e.g. by starting the server with `--no-enable-prefix-caching`). The team is working to support prompt logprobs with caching. **Pre-Post-Processing Calculation** -Logprobs in V1 are now computed immediately -after the model’s raw output (i.e. -before applying any logits post-processing such as temperature scaling or penalty -adjustments). As a result, the returned logprobs do not reflect the final adjusted +Logprobs in V1 are now computed immediately +after the model’s raw output (i.e. +before applying any logits post-processing such as temperature scaling or penalty +adjustments). As a result, the returned logprobs do not reflect the final adjusted probabilities that might be used during sampling. -In other words, if your sampling pipeline applies penalties or scaling, those +In other words, if your sampling pipeline applies penalties or scaling, those adjustments will affect token selection but not be visible in the logprobs output. -The team is working in progress to include these post-sampling +The team is working in progress to include these post-sampling adjustments in future updates. 
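+
+To make the interaction of these settings concrete, here is a minimal offline-inference
+sketch of requesting sample and prompt logprobs on V1. It is an illustration rather than
+a reference: the model name is an arbitrary placeholder, and the exact keyword arguments
+(`enable_prefix_caching=False` mirrors the `--no-enable-prefix-caching` flag mentioned
+above) are worth double-checking against your installed vLLM version.
+
+```python
+from vllm import LLM, SamplingParams
+
+# Prompt logprobs currently require prefix caching to be disabled on V1.
+llm = LLM(model="facebook/opt-125m", enable_prefix_caching=False)
+
+params = SamplingParams(
+    max_tokens=32,
+    temperature=0.8,
+    logprobs=5,          # top-5 logprobs for each sampled token
+    prompt_logprobs=5,   # top-5 logprobs for each prompt token
+)
+
+output = llm.generate(["The capital of France is"], params)[0]
+completion = output.outputs[0]
+print(completion.text)
+# These values come from the raw model distribution, i.e. before the
+# temperature scaling or penalty adjustments applied during sampling.
+print(completion.logprobs[0])     # logprobs for the first generated token
+print(output.prompt_logprobs[1])  # logprobs for the second prompt token
+```
+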
### Deprecated Features @@ -56,12 +56,12 @@ As part of the major architectural rework in vLLM V1, several legacy features ha **Deprecated sampling features** -- **best_of**: The sampling parameter best_of—which in V0 enabled - generating multiple candidate outputs per request and then selecting the best +- **best_of**: The sampling parameter best_of—which in V0 enabled + generating multiple candidate outputs per request and then selecting the best one—has been deprecated in V1. -- **Per-Request Logits Processors**: In V0, users could pass custom - processing functions to adjust logits on a per-request basis. In vLLM V1 this - mechanism is deprecated. Instead, the design is moving toward supporting global +- **Per-Request Logits Processors**: In V0, users could pass custom + processing functions to adjust logits on a per-request basis. In vLLM V1 this + mechanism is deprecated. Instead, the design is moving toward supporting global logits processors—a feature the team is actively working on for future releases. **Deprecated KV Cache features** @@ -71,27 +71,29 @@ As part of the major architectural rework in vLLM V1, several legacy features ha ## Unsupported or Unoptimized Features in vLLM V1 -vLLM V1 is a major rewrite designed for improved throughput, architectural -simplicity, and enhanced distributed inference. Although many features have been -re‐implemented or optimized compared to earlier versions, some functionalities +vLLM V1 is a major rewrite designed for improved throughput, architectural +simplicity, and enhanced distributed inference. Although many features have been +re‐implemented or optimized compared to earlier versions, some functionalities remain either unsupported or not yet fully optimized: ### Unoptimized Features -- **LoRA**: LoRA works for V1 on the main branch, but its performance is inferior to that - of V0. - The team is actively working on improving the performance [PR](https://github.com/vllm-project/vllm/pull/13096). +- **LoRA**: LoRA works for V1 on the main branch, but its performance is + inferior to that of V0. The team is actively working on improving its + performance +(e.g., see [PR #13096](https://github.com/vllm-project/vllm/pull/13096)). -- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There +- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There will be follow-up work to support other types of spec decode. ### Unsupported Features - **FP8 KV Cache**: While vLLM V1 introduces new FP8 kernels for model weight quantization, support for an FP8 key–value cache is not yet available. Users must continue using FP16 (or other supported precisions) for the KV cache. -- **Structured Generation Fallback**: For structured output tasks, V1 currently +- **Structured Generation Fallback**: For structured output tasks, V1 currently supports only the `xgrammar:no_fallback` mode. - Details about the structured generation can be found [here](https://docs.vllm.ai/en/latest/features/structured_outputs.html). + Details about the structured generation can be found + [here](https://docs.vllm.ai/en/latest/features/structured_outputs.html). ## Unsupported Models @@ -107,13 +109,12 @@ support them eventually. **Mamba Models** - Models using selective state-space mechanisms (instead of standard transformer attention) are not yet supported. 
*Examples*: - - Pure Mamba models (e.g., `BAAI/mamba-large`) - - Hybrid Mamba-Transformer models (e.g., `ibm-ai-platform/Bamba-9B`) + - Pure Mamba models (e.g., `BAAI/mamba-large`) + - Hybrid Mamba-Transformer models (e.g., `ibm-ai-platform/Bamba-9B`) **Encoder-Decoder Models** - vLLM V1 is currently optimized for decoder-only transformers. Models requiring cross-attention between separate encoder and decoder (e.g., `facebook/bart-large-cnn`) are not yet supported. -For a complete list of supported models, see -[our documentation](https://docs.vllm.ai/en/latest/models/supported_models.html). +For a complete list of supported models, see the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html). From fcffa3cb7e5e28399731ef246ee03c913a707f43 Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Sun, 2 Mar 2025 07:05:16 +0000 Subject: [PATCH 15/22] fix Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 81c18798f9651..88b048454b424 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -24,17 +24,16 @@ Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha- ### Logprobs -vLLM V1 introduces support for returning log probabilities (logprobs) for both -sampled tokens and the prompt. +vLLM V1 introduces support for returning logprobs and prompt logprobs. However, there are some important semantic differences compared to V0: **Prompt Logprobs Without Prefix Caching** -In vLLM V1, if you request prompt logprobs (using the `prompt_logprobs=true` flag), -prefix caching is not available. This means that if you want logprobs for the prompt, -you must disable prefix caching (e.g. by starting the server with `--no-enable-prefix-caching`). -The team is working to support prompt logprobs with caching. +In vLLM V1, if you request prompt logprobs, +prefix caching is not available. This means that if you want prompt logprobs, +you must disable prefix caching (e.g. with `--no-enable-prefix-caching`). +The team is working to support prompt logprobs with prefix caching. **Pre-Post-Processing Calculation** From 047f46d50ec1e9f645bd638f1c6cf52e3ae9c58e Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Sun, 2 Mar 2025 07:11:39 +0000 Subject: [PATCH 16/22] link pr #13361 Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 88b048454b424..8722e38990d4f 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -24,8 +24,7 @@ Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha- ### Logprobs -vLLM V1 introduces support for returning logprobs and prompt logprobs. -However, there are some important semantic +vLLM V1 supports logprobs and prompt logprobs. 
However, there are some important semantic differences compared to V0: **Prompt Logprobs Without Prefix Caching** @@ -55,13 +54,11 @@ As part of the major architectural rework in vLLM V1, several legacy features ha **Deprecated sampling features** -- **best_of**: The sampling parameter best_of—which in V0 enabled - generating multiple candidate outputs per request and then selecting the best - one—has been deprecated in V1. +- **best_of**: See details in this [PR #13361](https://github.com/vllm-project/vllm/issues/13361) - **Per-Request Logits Processors**: In V0, users could pass custom processing functions to adjust logits on a per-request basis. In vLLM V1 this - mechanism is deprecated. Instead, the design is moving toward supporting global - logits processors—a feature the team is actively working on for future releases. + is deprecated. Instead, the design is moving toward supporting global logits + processors—a feature the team is actively working on for future releases. **Deprecated KV Cache features** From d858c6a7db683e7ddbbc27fba8984bd5b1294fe4 Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Sun, 2 Mar 2025 07:16:19 +0000 Subject: [PATCH 17/22] link pr #13997 Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 8722e38990d4f..217aa3ef1436b 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -54,7 +54,8 @@ As part of the major architectural rework in vLLM V1, several legacy features ha **Deprecated sampling features** -- **best_of**: See details in this [PR #13361](https://github.com/vllm-project/vllm/issues/13361) +- **best_of**: See details here [PR #13361](https://github.com/vllm-project/vllm/issues/13361), +[PR #13997](https://github.com/vllm-project/vllm/issues/13997). - **Per-Request Logits Processors**: In V0, users could pass custom processing functions to adjust logits on a per-request basis. In vLLM V1 this is deprecated. Instead, the design is moving toward supporting global logits From 54d51cf1e5f4885fa905e815f3ab7e244e375ce2 Mon Sep 17 00:00:00 2001 From: Roger Wang Date: Sun, 2 Mar 2025 01:05:54 -0800 Subject: [PATCH 18/22] update Signed-off-by: Roger Wang --- docs/source/getting_started/v1_user_guide.md | 97 +++++++++----------- 1 file changed, 42 insertions(+), 55 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 217aa3ef1436b..35462fd053818 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -6,8 +6,8 @@ vLLM V0 successfully supported a wide range of models and hardware, but as new f Building on V0’s success, vLLM V1 retains the stable and proven components from V0 (such as the models, GPU kernels, and utilities). At the same time, it significantly -re-architects the core systems—covering the scheduler, KV cache manager, worker, -sampler, and API server—to provide a cohesive, maintainable framework that better +re-architects the core systems, covering the scheduler, KV cache manager, worker, +sampler, and API server, to provide a cohesive, maintainable framework that better accommodates continued growth and innovation. 
Specifically, V1 aims to: @@ -17,65 +17,54 @@ Specifically, V1 aims to: - **Combine key optimizations** into a unified architecture. - Require **zero configs** by enabling features/optimizations by default. -For more detailed please refer to the vLLM V1 blog post [vLLM V1: A Major -Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) (published Jan 27, 2025) +For more details, check out the vLLM V1 blog post [vLLM V1: A Major +Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) (published Jan 27, 2025). -## Semantic Changes and Deprecated Features +This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to bring V1 as the default engine version of vLLM, therefore this guide will be updated constantly as more feature get supported on vLLM V1. -### Logprobs +### Semantic Changes and Deprecated Features + +#### Logprobs vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic differences compared to V0: -**Prompt Logprobs Without Prefix Caching** - -In vLLM V1, if you request prompt logprobs, -prefix caching is not available. This means that if you want prompt logprobs, -you must disable prefix caching (e.g. with `--no-enable-prefix-caching`). -The team is working to support prompt logprobs with prefix caching. - -**Pre-Post-Processing Calculation** +**Logprobs Calculation** -Logprobs in V1 are now computed immediately -after the model’s raw output (i.e. +Logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e. before applying any logits post-processing such as temperature scaling or penalty adjustments). As a result, the returned logprobs do not reflect the final adjusted -probabilities that might be used during sampling. +probabilities used during sampling. -In other words, if your sampling pipeline applies penalties or scaling, those -adjustments will affect token selection but not be visible in the logprobs output. +Support for logprobs with post-sampling adjustments is work in progress and will be added in future updates. -The team is working in progress to include these post-sampling -adjustments in future updates. +**Prompt Logprobs with Prefix Caching** -### Deprecated Features +Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](https://github.com/vllm-project/vllm/issues/13414). -As part of the major architectural rework in vLLM V1, several legacy features have been removed to simplify the codebase and improve efficiency. +#### Deprecated Features -**Deprecated sampling features** +As part of the major architectural rework in vLLM V1, several legacy features have been deprecated. -- **best_of**: See details here [PR #13361](https://github.com/vllm-project/vllm/issues/13361), -[PR #13997](https://github.com/vllm-project/vllm/issues/13997). +**Sampling features** + +- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361). - **Per-Request Logits Processors**: In V0, users could pass custom - processing functions to adjust logits on a per-request basis. In vLLM V1 this - is deprecated. 
Instead, the design is moving toward supporting global logits - processors—a feature the team is actively working on for future releases. + processing functions to adjust logits on a per-request basis. In vLLM V1 this feature has been deprecated. Instead, the design is moving toward supporting **global logits + processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](https://github.com/vllm-project/vllm/pull/13360). -**Deprecated KV Cache features** +**KV Cache features** -- KV Cache Swapping -- KV Cache Offloading +- **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping +to handle request preemptions. -## Unsupported or Unoptimized Features in vLLM V1 +### Feature & Model Support in Progress -vLLM V1 is a major rewrite designed for improved throughput, architectural -simplicity, and enhanced distributed inference. Although many features have been -re‐implemented or optimized compared to earlier versions, some functionalities -remain either unsupported or not yet fully optimized: +Although the support for many features & models have been re‐implemented or optimized compared to V0, some remain either unsupported or not yet fully optimized on vLLM V1. -### Unoptimized Features +#### Unoptimized Features -- **LoRA**: LoRA works for V1 on the main branch, but its performance is +- **LoRA**: LoRA is functionally working on vLLM V1 but its performance is inferior to that of V0. The team is actively working on improving its performance (e.g., see [PR #13096](https://github.com/vllm-project/vllm/pull/13096)). @@ -83,7 +72,7 @@ remain either unsupported or not yet fully optimized: - **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There will be follow-up work to support other types of spec decode. -### Unsupported Features +#### Unsupported Features - **FP8 KV Cache**: While vLLM V1 introduces new FP8 kernels for model weight quantization, support for an FP8 key–value cache is not yet available. Users must continue using FP16 (or other supported precisions) for the KV cache. @@ -92,26 +81,24 @@ remain either unsupported or not yet fully optimized: Details about the structured generation can be found [here](https://docs.vllm.ai/en/latest/features/structured_outputs.html). -## Unsupported Models +#### Unsupported Models -vLLM V1 excludes models tagged with `SupportsV0Only` while we develop support for -other types. The following categories are currently unsupported, but we plan to -support them eventually. +vLLM V1 excludes model architectures with the `SupportsV0Only` protocol, and the majority fall into the following categories. V1 support for these models will be added eventually. -**Embedding Models** -- vLLM V1 does not yet include a `PoolingModelRunner` to support embedding/pooling - models. - *Example*: `BAAI/bge-m3` +**Embedding Models** +vLLM V1 does not yet include a `PoolingModelRunner` to support embedding/pooling + models (e.g, `XLMRobertaModel`). **Mamba Models** -- Models using selective state-space mechanisms (instead of standard transformer attention) are not yet supported. - *Examples*: - - Pure Mamba models (e.g., `BAAI/mamba-large`) - - Hybrid Mamba-Transformer models (e.g., `ibm-ai-platform/Bamba-9B`) +Models using selective state-space mechanisms (instead of standard transformer attention) +are not yet supported (e.g., `MambaForCausalLM`, `JambaForCausalLM`). 
**Encoder-Decoder Models** -- vLLM V1 is currently optimized for decoder-only transformers. Models requiring - cross-attention between separate encoder and decoder (e.g., - `facebook/bart-large-cnn`) are not yet supported. +vLLM V1 is currently optimized for decoder-only transformers. Models requiring + cross-attention between separate encoder and decoder are not yet supported (e.g., `BartForConditionalGeneration`, `MllamaForConditionalGeneration`). For a complete list of supported models, see the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html). + +## FAQ + +TODO From b934056322ea54ef4a067b56f948c8b1fc8653e5 Mon Sep 17 00:00:00 2001 From: Roger Wang Date: Sun, 2 Mar 2025 01:17:25 -0800 Subject: [PATCH 19/22] pre-commit Signed-off-by: Roger Wang --- docs/source/getting_started/v1_user_guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 35462fd053818..78e8daa24e6bc 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -90,7 +90,7 @@ vLLM V1 does not yet include a `PoolingModelRunner` to support embedding/pooling models (e.g, `XLMRobertaModel`). **Mamba Models** -Models using selective state-space mechanisms (instead of standard transformer attention) +Models using selective state-space mechanisms (instead of standard transformer attention) are not yet supported (e.g., `MambaForCausalLM`, `JambaForCausalLM`). **Encoder-Decoder Models** From 79549c9859e6f2726029915ca1cf9cc35cbaa06a Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Tue, 4 Mar 2025 20:47:09 +0000 Subject: [PATCH 20/22] update Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 78e8daa24e6bc..907f592209868 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -60,7 +60,7 @@ to handle request preemptions. ### Feature & Model Support in Progress -Although the support for many features & models have been re‐implemented or optimized compared to V0, some remain either unsupported or not yet fully optimized on vLLM V1. +Although we have re-implemented and partially optimized many features and models from V0 in vLLM V1, optimization work is still ongoing for some, and others remain unsupported. #### Unoptimized Features @@ -70,20 +70,21 @@ Although the support for many features & models have been re‐implemented or op (e.g., see [PR #13096](https://github.com/vllm-project/vllm/pull/13096)). - **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There - will be follow-up work to support other types of spec decode. + will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)). #### Unsupported Features - **FP8 KV Cache**: While vLLM V1 introduces new FP8 kernels for model weight quantization, support for an FP8 key–value cache is not yet available. Users must continue using FP16 (or other supported precisions) for the KV cache. - **Structured Generation Fallback**: For structured output tasks, V1 currently - supports only the `xgrammar:no_fallback` mode. 
+ supports only the `xgrammar:no_fallback` mode, meaning that it will error out if the output schema is unsupported by xgrammar. Details about the structured generation can be found [here](https://docs.vllm.ai/en/latest/features/structured_outputs.html). #### Unsupported Models -vLLM V1 excludes model architectures with the `SupportsV0Only` protocol, and the majority fall into the following categories. V1 support for these models will be added eventually. +vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol, +and the majority fall into the following categories. V1 support for these models will be added eventually. **Embedding Models** vLLM V1 does not yet include a `PoolingModelRunner` to support embedding/pooling From aa7bed4d7aa7d7bb140f6d00cd6cf3e1be82daa9 Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Tue, 4 Mar 2025 20:54:17 +0000 Subject: [PATCH 21/22] update Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 907f592209868..56e2b244383c3 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -64,6 +64,9 @@ Although we have re-implemented and partially optimized many features and models #### Unoptimized Features +These features are already supported in vLLM V1, but their optimization is still +in progress. + - **LoRA**: LoRA is functionally working on vLLM V1 but its performance is inferior to that of V0. The team is actively working on improving its performance @@ -83,7 +86,7 @@ Although we have re-implemented and partially optimized many features and models #### Unsupported Models -vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol, +vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol, and the majority fall into the following categories. V1 support for these models will be added eventually. **Embedding Models** From f80562bd756df19338fc9ab054bee7ccc432615c Mon Sep 17 00:00:00 2001 From: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Date: Tue, 4 Mar 2025 21:03:15 +0000 Subject: [PATCH 22/22] update Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> --- docs/source/getting_started/v1_user_guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md index 56e2b244383c3..5d8da2c765084 100644 --- a/docs/source/getting_started/v1_user_guide.md +++ b/docs/source/getting_started/v1_user_guide.md @@ -62,7 +62,7 @@ to handle request preemptions. Although we have re-implemented and partially optimized many features and models from V0 in vLLM V1, optimization work is still ongoing for some, and others remain unsupported. -#### Unoptimized Features +#### Features to be Optimized These features are already supported in vLLM V1, but their optimization is still in progress.