
[Doc] V1 user guide #13991

Open · wants to merge 23 commits into base: main
Conversation

Contributor
@JenZhao JenZhao commented Feb 27, 2025

This PR adds the v1 User Guide as a living document, to be updated and expanded over time.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation (Improvements or additions to documentation) label Feb 27, 2025
Signed-off-by: Jennifer Zhao <[email protected]>
#### Deprecated sampling features
- best_of
- logits_processors
- beam_search
Collaborator

I believe beam search is supported
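
For readers checking this on their side, beam search in recent vLLM releases is exposed through a standalone API rather than through `SamplingParams`. A minimal sketch, assuming the `LLM.beam_search` entry point and `BeamSearchParams`; exact prompt and output field names may differ between versions:

```python
# Hedged sketch: beam search via the standalone API rather than SamplingParams.
# Assumes LLM.beam_search and BeamSearchParams as found in recent vLLM releases.
from vllm import LLM
from vllm.sampling_params import BeamSearchParams

llm = LLM(model="facebook/opt-125m")
params = BeamSearchParams(beam_width=4, max_tokens=32)

# Prompts are passed as text-prompt dicts; each output holds the ranked beams.
outputs = llm.beam_search([{"prompt": "The capital of France is"}], params)
for beam in outputs[0].sequences:
    print(beam.cum_logprob, beam.text)
```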


#### Deprecated sampling features
- best_of
- logits_processors
Collaborator

  • "Per request logits processors are not supported"
  • Nick is working on supporting these globally in the server

Contributor Author
@JenZhao JenZhao Mar 2, 2025

addressed in the updates from @ywang96

thank you Roger
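
For context on what is being deprecated in this thread, here is a minimal sketch of the V0-era per-request hook, assuming the V0 `SamplingParams(logits_processors=...)` argument where each callable receives the generated token ids and the next-token logits:

```python
# Hedged sketch of the V0-era per-request logits processor being deprecated.
import torch
from vllm import LLM, SamplingParams

def ban_token_42(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    # Example processor: mask out an arbitrary token id before sampling.
    logits[42] = float("-inf")
    return logits

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=16, logits_processors=[ban_token_42])
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```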


- **FP8 KV Cache**: FP8 KV Cache is not yet supported in V1.

## Unsupported models
Collaborator

  • Mamba not supported yet
  • We plan to support these eventually
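
Regarding the FP8 KV Cache bullet quoted above, this is roughly how the feature is enabled on the V0 engine today. A minimal sketch, assuming the `kv_cache_dtype` engine argument:

```python
# Hedged sketch: enabling the FP8 KV cache on the V0 engine via kv_cache_dtype.
# As of this thread, this option is not yet supported when running V1.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_cache_dtype="fp8",  # explicit "fp8_e5m2" / "fp8_e4m3" variants also exist
)
```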

Collaborator
@robertgshaw2-redhat robertgshaw2-redhat left a comment

Looking good! Can you finish this off tomorrow? I want to include a link to this page in the logs.

Signed-off-by: Jennifer Zhao <[email protected]>
# vLLM V1 User Guide

## Why vLLM V1?
Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/V1-alpha-release.html)
Contributor

This link is invalid and shows a 404, please check.

Contributor

Did you mean https://blog.vllm.ai/2025/01/27/v1-alpha-release.html? Please change V1 to v1 in the link.

Signed-off-by: Jennifer Zhao <[email protected]>
Member
@ywang96 ywang96 left a comment

Left some comments - PTAL

Comment on lines 1 to 4
# vLLM V1 User Guide

## Why vLLM V1?
Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html)
Member

Worth expanding here to give a bit more of the context and background - I will help you add stuff here.

Comment on lines 19 to 21
#### Current deprecated sampling features
The team is working on supporting these features globally in the server.

Member

This is not accurate, since best_of and per-request logits processors will be deprecated completely in V1.

Contributor Author

I saw @robertgshaw2-redhat 's comments on supporting them globally

Contributor Author

thank you @ywang96 I just realized I misunderstood Rob's comment.

Comment on lines 35 to 36
- **Spec Decode other than ngram**: currently, only ngram spec decode is supported in V1
after this [PR](https://github.com/vllm-project/vllm/pull/12193).
Member

I don't think we really need to call out a PR that's already merged.

Comment on lines 38 to 39
- **Quantization**: For V1, when the CUDA graph is enabled, it defaults to the
piecewise CUDA graph introduced in this [PR](https://github.com/vllm-project/vllm/pull/10058); consequently, FP8 and other quantizations are not supported.
Member

I don't think this is completely accurate? Also there are some issues in the formatting here.
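
For reference, these are roughly the knobs the quoted text is talking about. A minimal sketch, assuming the `quantization` and `enforce_eager` engine arguments; whether a given quantization actually works with the piecewise CUDA graph on V1 is exactly what is being questioned here:

```python
# Hedged sketch of the knobs under discussion: online FP8 quantization plus the
# option to bypass CUDA graphs entirely by forcing eager execution.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",  # online dynamic FP8 quantization of an unquantized checkpoint
    enforce_eager=True,  # skip CUDA graph capture altogether, as a conservative fallback
)
```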

## Unsupported Models

vLLM V1 excludes models tagged with `SupportsV0Only` while we develop support for
other types. For a complete list of supported models, see our [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html). The
Member

Let's remove "For a complete list of supported models, see our [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html)." since you also have that at the end of the section.

Comment on lines 29 to 30
## Unsupported features

Member

Similar to my comment above for the first section: adding some context here will be helpful.

Signed-off-by: Jennifer Zhao <[email protected]>
JenZhao added 4 commits March 2, 2025 06:53
Signed-off-by: Jennifer Zhao <[email protected]>
Signed-off-by: Jennifer Zhao <[email protected]>
Signed-off-by: Jennifer Zhao <[email protected]>
Signed-off-by: Jennifer Zhao <[email protected]>
ywang96 added 2 commits March 2, 2025 01:05
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
@JenZhao JenZhao marked this pull request as ready for review March 2, 2025 20:50
Comment on lines +58 to +59
- **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
to handle request preemptions.
Collaborator

To confirm: have we decided to deprecate (i.e., no longer support) this feature? The reason I'm asking is that there are some RFCs about adding CPU KV cache offloading to V1. cc @simon-mo @WoosukKwon

Contributor Author

Swapping will be deprecated, but offloading will not.
cc @ywang96

Collaborator

Oh I see I messed them up. nvm then
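
To make the swapping-vs-offloading distinction concrete, here is a minimal sketch of the V0-era preemption knobs being deprecated here, assuming the V0 `swap_space` and `preemption_mode` engine arguments (CPU KV cache offloading, tracked in the RFCs mentioned above, is a separate feature):

```python
# Hedged sketch of V0-style KV cache swapping for preemption. On V1 the engine
# handles preemption by recomputation, so these knobs are expected to go away.
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",
    swap_space=4,            # GiB of CPU memory reserved for swapped-out KV blocks
    preemption_mode="swap",  # V0-only: swap KV cache to CPU instead of recomputing it
)
```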


Although support for many features & models has been re-implemented or optimized compared to V0, some remain either unsupported or not yet fully optimized in vLLM V1.

#### Unoptimized Features
Collaborator

Better to add an overview noting that unoptimized features are features that are already supported in V1 but that we are still working to optimize.

Comment on lines 72 to 73
- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There
will be follow-up work to support other types of spec decode.
Collaborator

Could link to the GitHub Project of v1 spec decode

- **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
to handle request preemptions.

### Feature & Model Support in Progress
Collaborator

One suggestion for this section: it might be clearer to have an overview support matrix, features vs. status, where status is one of the following:

  • Deprecated: No plan to support in v1 unless there's a strong motivation.
  • Planned: Plan to support but haven't started working on it.
  • WIP: Support is actively being worked on.
  • Functional (unoptimized): It's working but is still being optimized.
  • Optimized: It's essentially optimized, with no other work planned at the moment.

That way, in the rest of this section we can be feature-centric, describe the current status of each feature, and point to the corresponding GitHub issue/PR/project.

Contributor Author

will do

Comment on lines 79 to 82
- **Structured Generation Fallback**: For structured output tasks, V1 currently
supports only the `xgrammar:no_fallback` mode.
Details about the structured generation can be found
[here](https://docs.vllm.ai/en/latest/features/structured_outputs.html).
Collaborator

Suggested change
- **Structured Generation Fallback**: For structured output tasks, V1 currently
supports only the `xgrammar:no_fallback` mode.
Details about the structured generation can be found
[here](https://docs.vllm.ai/en/latest/features/structured_outputs.html).
- **Structured Output**: For structured output tasks, V1 currently
supports only the `xgrammar:no_fallback` mode, meaning that it will error out if the output schema is unsupported by xgrammar.
Details about the structured generation can be found
[here](https://docs.vllm.ai/en/latest/features/structured_outputs.html).
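
To make the xgrammar-only behavior concrete, here is a minimal sketch of requesting structured output, assuming the `GuidedDecodingParams` class and the `guided_decoding_backend` engine argument; with V1, a schema that xgrammar cannot handle errors out rather than falling back to another backend:

```python
# Hedged sketch: JSON-constrained generation with the xgrammar backend.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct",
          guided_decoding_backend="xgrammar")
params = SamplingParams(
    max_tokens=64,
    guided_decoding=GuidedDecodingParams(json=schema),
)
print(llm.generate(["Describe a person as JSON: "], params)[0].outputs[0].text)
```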


#### Unsupported Models

vLLM V1 excludes model architectures with the `SupportsV0Only` protocol, and the majority fall into the following categories. V1 support for these models will be added eventually.
Collaborator

Suggested change
vLLM V1 excludes model architectures with the `SupportsV0Only` protocol, and the majority fall into the following categories. V1 support for these models will be added eventually.
vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol, and the majority fall into the following categories. V1 support for these models will be added eventually.

@WoosukKwon
Collaborator

Thanks for the PR! I will take a look tomorrow (Tue).

JenZhao added 3 commits March 4, 2025 20:47
Signed-off-by: Jennifer Zhao <[email protected]>
Signed-off-by: Jennifer Zhao <[email protected]>
Signed-off-by: Jennifer Zhao <[email protected]>
Collaborator
@LiuXiaoxuanPKU LiuXiaoxuanPKU left a comment

Thanks for the guide. It's great with lots of information.

Just some concerns:

  1. At a high level, I feel it's a bit too negative? We talk a lot about features we don't support or that still need to be optimized. I feel we should highlight some optimized features, such as chunked prefill and prefix caching (see the sketch below).
  2. I feel we need to mention the scheduler (scheduling policy) somewhere: currently we batch prefill tokens and decode tokens in the same batch, and we don't prioritize prefill over decode.
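
On the points above, here is a minimal sketch of turning on the two optimized features mentioned, assuming the `enable_chunked_prefill` and `enable_prefix_caching` engine arguments; on V1 both are expected to be on by default:

```python
# Hedged sketch: the two optimized features called out above. With chunked
# prefill enabled, prefill and decode tokens can share the same batch, which is
# the scheduling behavior described in point 2.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,  # split long prefills and mix them with decodes
    enable_prefix_caching=True,   # reuse KV cache across requests sharing a prefix
)
```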

For more details, check out the vLLM V1 blog post [vLLM V1: A Major
Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) (published Jan 27, 2025).

This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to bring V1 as the default engine version of vLLM, therefore this guide will be updated constantly as more feature get supported on vLLM V1.
Collaborator

Suggested change
This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to bring V1 as the default engine version of vLLM, therefore this guide will be updated constantly as more feature get supported on vLLM V1.
This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to bring V1 as the default engine, therefore this guide will be updated constantly as more features get supported on vLLM V1.

adjustments). As a result, the returned logprobs do not reflect the final adjusted
probabilities used during sampling.

Support for logprobs with post-sampling adjustments is work in progress and will be added in future updates.
Collaborator

Suggested change
Support for logprobs with post-sampling adjustments is work in progress and will be added in future updates.
Support for logprobs with post-sampling adjustments is in progress and will be added in future updates.
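
For reference, a minimal sketch of the logprobs request this caveat applies to, assuming the standard `SamplingParams` `logprobs` / `prompt_logprobs` fields:

```python
# Hedged sketch: requesting top-k logprobs. Per the guide text above, on V1 the
# returned values do not reflect post-sampling adjustments.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=8, logprobs=5, prompt_logprobs=3)
out = llm.generate(["The quick brown fox"], params)[0]

# Each generated position carries a dict of top candidate token logprobs.
for position in out.outputs[0].logprobs:
    print(position)
```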


- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
- **Per-Request Logits Processors**: In V0, users could pass custom
processing functions to adjust logits on a per-request basis. In vLLM V1 this feature has been deprecated. Instead, the design is moving toward supporting **global logits
Collaborator

Suggested change
processing functions to adjust logits on a per-request basis. In vLLM V1 this feature has been deprecated. Instead, the design is moving toward supporting **global logits
processing functions to adjust logits on a per-request basis. In vLLM V1, this feature has been deprecated. Instead, the design is moving toward supporting **global logits
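
For context, a minimal sketch of the V0-era best_of usage that RFC #13361 deprecates, assuming the V0 `SamplingParams` `n` / `best_of` fields, where best_of samples extra candidates and returns the top n by cumulative logprob:

```python
# Hedged sketch of the deprecated V0 best_of behavior.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# Sample 4 candidate completions internally, return only the single best one.
params = SamplingParams(n=1, best_of=4, temperature=0.8, max_tokens=32)
print(llm.generate(["Write a haiku about GPUs:"], params)[0].outputs[0].text)
```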

(e.g., see [PR #13096](https://github.com/vllm-project/vllm/pull/13096)).

- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There
will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)).
Collaborator

Suggested change
will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)).
will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)). We will prioritize the support for Eagle, MTP compared to draft model based spec decode.
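
For readers who want to try the one flavor V1 does support, here is a minimal sketch of ngram speculative decoding, assuming the V0-style engine arguments `speculative_model="[ngram]"`, `num_speculative_tokens`, and `ngram_prompt_lookup_max`; these flags have been reworked into a `speculative_config` in later releases:

```python
# Hedged sketch: ngram (prompt-lookup) speculative decoding, where draft tokens
# are proposed by matching n-grams against the prompt instead of a draft model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_model="[ngram]",  # special value selecting prompt-lookup drafting
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
)
print(llm.generate(["The quick brown fox"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```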
