
[Doc] V1 user guide #13991

Open · wants to merge 23 commits into base: main
Conversation

Contributor
@JenZhao JenZhao commented Feb 27, 2025

This PR adds the v1 User Guide as a living document, to be updated and expanded over time.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation (Improvements or additions to documentation) label Feb 27, 2025
Signed-off-by: Jennifer Zhao <[email protected]>
#### Deprecated sampling features
- best_of
- logits_processors
- beam_search
Collaborator

I believe beam search is supported
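
For readers checking this on their side, beam search in recent vLLM releases is exposed through a standalone API rather than through `SamplingParams`. A minimal sketch, assuming the `LLM.beam_search` entry point and `BeamSearchParams`; exact prompt and output field names may differ between versions:

```python
# Hedged sketch: beam search via the standalone API rather than SamplingParams.
# Assumes LLM.beam_search and BeamSearchParams as found in recent vLLM releases.
from vllm import LLM
from vllm.sampling_params import BeamSearchParams

llm = LLM(model="facebook/opt-125m")
params = BeamSearchParams(beam_width=4, max_tokens=32)

# Prompts are passed as text-prompt dicts; each output holds the ranked beams.
outputs = llm.beam_search([{"prompt": "The capital of France is"}], params)
for beam in outputs[0].sequences:
    print(beam.cum_logprob, beam.text)
```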


#### Deprecated sampling features
- best_of
- logits_processors
Collaborator

  • "Per request logits processors are not supported"
  • Nick is working on supporting these globally in the server

Contributor Author
@JenZhao JenZhao Mar 2, 2025

addressed in the updates from @ywang96

thank you Roger
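
For context on what is being deprecated in this thread, here is a minimal sketch of the V0-era per-request hook, assuming the V0 `SamplingParams(logits_processors=...)` argument where each callable receives the generated token ids and the next-token logits:

```python
# Hedged sketch of the V0-era per-request logits processor being deprecated.
import torch
from vllm import LLM, SamplingParams

def ban_token_42(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    # Example processor: mask out an arbitrary token id before sampling.
    logits[42] = float("-inf")
    return logits

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=16, logits_processors=[ban_token_42])
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```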


- **FP8 KV Cache**: FP8 KV Cache is not yet supported in V1.

## Unsupported models
Collaborator

  • Mamba not supported yet
  • We plan to support these eventually
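
Regarding the FP8 KV Cache bullet quoted above, this is roughly how the feature is enabled on the V0 engine today. A minimal sketch, assuming the `kv_cache_dtype` engine argument:

```python
# Hedged sketch: enabling the FP8 KV cache on the V0 engine via kv_cache_dtype.
# As of this thread, this option is not yet supported when running V1.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_cache_dtype="fp8",  # explicit "fp8_e5m2" / "fp8_e4m3" variants also exist
)
```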

Collaborator
@robertgshaw2-redhat robertgshaw2-redhat left a comment

Looking good! Can you finish this off tomorrow? I want to include a link to this page in the logs.

Signed-off-by: Jennifer Zhao <[email protected]>
# vLLM V1 User Guide

## Why vLLM V1?
Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/V1-alpha-release.html)
Contributor

This link is invalid and shows a 404, please check.

Contributor

Did you mean https://blog.vllm.ai/2025/01/27/v1-alpha-release.html? Please change V1 to v1 in the link.

Signed-off-by: Jennifer Zhao <[email protected]>
Member
@ywang96 ywang96 left a comment

Left some comments - PTAL

Comment on lines 1 to 4
# vLLM V1 User Guide

## Why vLLM V1?
Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html)
Member

Worth expanding here to give a bit more of the context and background - I will help you add stuff here.

Comment on lines 19 to 21
#### Current deprecated sampling features
The team is working on supporting these features globally in the server.

Member

This is not accurate, since best_of and per-request logits processors will be deprecated completely in V1.

Contributor Author

I saw @robertgshaw2-redhat 's comments on supporting them globally

Contributor Author

thank you @ywang96 I just realized I misunderstood Rob's comment.

Comment on lines 35 to 36
- **Spec Decode other than ngram**: currently, only ngram spec decode is supported in V1
after this [PR](https://github.com/vllm-project/vllm/pull/12193).
Member

I don't think we really need to call out a PR that's already merged.

Comment on lines 38 to 39
- **Quantization**: For V1, when the CUDA graph is enabled, it defaults to the
piecewise CUDA graph introduced in this [PR](https://github.com/vllm-project/vllm/pull/10058); consequently, FP8 and other quantizations are not supported.
Member

I don't think this is completely accurate? Also there are some issues in the formatting here.
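
For reference, these are roughly the knobs the quoted text is talking about. A minimal sketch, assuming the `quantization` and `enforce_eager` engine arguments; whether a given quantization actually works with the piecewise CUDA graph on V1 is exactly what is being questioned here:

```python
# Hedged sketch of the knobs under discussion: online FP8 quantization plus the
# option to bypass CUDA graphs entirely by forcing eager execution.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",  # online dynamic FP8 quantization of an unquantized checkpoint
    enforce_eager=True,  # skip CUDA graph capture altogether, as a conservative fallback
)
```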

## Unsupported Models

vLLM V1 excludes models tagged with `SupportsV0Only` while we develop support for
other types. For a complete list of supported models, see our [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html). The
Member

Let's remove "For a complete list of supported models, see our [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html)." since you also have that at the end of the section.

Comment on lines 29 to 30
## Unsupported features

Member

Similar to my comment above for the first section: adding some context here will be helpful.

Signed-off-by: Jennifer Zhao <[email protected]>
JenZhao added 4 commits March 2, 2025 06:53
Signed-off-by: Jennifer Zhao <[email protected]>
Signed-off-by: Jennifer Zhao <[email protected]>
Signed-off-by: Jennifer Zhao <[email protected]>
Signed-off-by: Jennifer Zhao <[email protected]>
ywang96 added 2 commits March 2, 2025 01:05
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
@JenZhao JenZhao marked this pull request as ready for review March 2, 2025 20:50
Comment on lines +58 to +59
- **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
to handle request preemptions.
Collaborator

To confirm: have we decided to deprecate (i.e., no longer support) this feature? The reason I'm asking is that there are some RFCs about adding CPU KV cache offloading to V1. cc @simon-mo @WoosukKwon

Contributor Author

Swapping will be deprecated, but offloading will not.
cc @ywang96

Collaborator

Oh I see I messed them up. nvm then
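
To make the swapping-vs-offloading distinction concrete, here is a minimal sketch of the V0-era preemption knobs being deprecated here, assuming the V0 `swap_space` and `preemption_mode` engine arguments (CPU KV cache offloading, tracked in the RFCs mentioned above, is a separate feature):

```python
# Hedged sketch of V0-style KV cache swapping for preemption. On V1 the engine
# handles preemption by recomputation, so these knobs are expected to go away.
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",
    swap_space=4,            # GiB of CPU memory reserved for swapped-out KV blocks
    preemption_mode="swap",  # V0-only: swap KV cache to CPU instead of recomputing it
)
```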


Although support for many features & models has been re-implemented or optimized compared to V0, some remain either unsupported or not yet fully optimized in vLLM V1.

#### Unoptimized Features
Collaborator

Better to add an overview noting that unoptimized features are features that are already supported in V1 but that we are still working to optimize.

Comment on lines 72 to 73
- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There
will be follow-up work to support other types of spec decode.
Collaborator

Could link to the GitHub Project of v1 spec decode

- **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
to handle request preemptions.

### Feature & Model Support in Progress
Collaborator

One suggestion for this section: it might be clearer to have an overview support matrix, features vs. status, where status is one of the following:

  • Deprecated: No plan to support in v1 unless there's a strong motivation.
  • Planned: Plan to support but haven't started working on it.
  • WIP: Support is actively being worked on.
  • Functional (unoptimized): It's working but is still being optimized.
  • Optimized: It's essentially optimized, with no other work planned at the moment.

That way, in the rest of this section we can be feature-centric, describe the current status of each feature, and point to the corresponding GitHub issue/PR/project.

Contributor Author

will do

Comment on lines 79 to 82
- **Structured Generation Fallback**: For structured output tasks, V1 currently
supports only the `xgrammar:no_fallback` mode.
Details about the structured generation can be found
[here](https://docs.vllm.ai/en/latest/features/structured_outputs.html).
Collaborator

Suggested change
- **Structured Generation Fallback**: For structured output tasks, V1 currently
supports only the `xgrammar:no_fallback` mode.
Details about the structured generation can be found
[here](https://docs.vllm.ai/en/latest/features/structured_outputs.html).
- **Structured Output**: For structured output tasks, V1 currently
supports only the `xgrammar:no_fallback` mode, meaning that it will error out if the output schema is unsupported by xgrammar.
Details about the structured generation can be found
[here](https://docs.vllm.ai/en/latest/features/structured_outputs.html).
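
To make the xgrammar-only behavior concrete, here is a minimal sketch of requesting structured output, assuming the `GuidedDecodingParams` class and the `guided_decoding_backend` engine argument; with V1, a schema that xgrammar cannot handle errors out rather than falling back to another backend:

```python
# Hedged sketch: JSON-constrained generation with the xgrammar backend.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct",
          guided_decoding_backend="xgrammar")
params = SamplingParams(
    max_tokens=64,
    guided_decoding=GuidedDecodingParams(json=schema),
)
print(llm.generate(["Describe a person as JSON: "], params)[0].outputs[0].text)
```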


#### Unsupported Models

vLLM V1 excludes model architectures with the `SupportsV0Only` protocol, and the majority fall into the following categories. V1 support for these models will be added eventually.
Collaborator

Suggested change
vLLM V1 excludes model architectures with the `SupportsV0Only` protocol, and the majority fall into the following categories. V1 support for these models will be added eventually.
vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol, and the majority fall into the following categories. V1 support for these models will be added eventually.

@WoosukKwon
Collaborator

Thanks for the PR! I will take a look tomorrow (Tue).

JenZhao added 3 commits March 4, 2025 20:47
Signed-off-by: Jennifer Zhao <[email protected]>
Signed-off-by: Jennifer Zhao <[email protected]>
Signed-off-by: Jennifer Zhao <[email protected]>
Collaborator
@LiuXiaoxuanPKU LiuXiaoxuanPKU left a comment

Thanks for the guide. It's great with lots of information.

Just some concerns:

  1. At a high level, I feel it's a bit too negative? We talk a lot about features we don't support or that still need to be optimized. I feel we should highlight some optimized features, such as chunked prefill and prefix caching (see the sketch below).
  2. I feel we need to mention the scheduler (scheduling policy) somewhere: currently we batch prefill tokens and decode tokens in the same batch, and we don't prioritize prefill over decode.
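
On the points above, here is a minimal sketch of turning on the two optimized features mentioned, assuming the `enable_chunked_prefill` and `enable_prefix_caching` engine arguments; on V1 both are expected to be on by default:

```python
# Hedged sketch: the two optimized features called out above. With chunked
# prefill enabled, prefill and decode tokens can share the same batch, which is
# the scheduling behavior described in point 2.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,  # split long prefills and mix them with decodes
    enable_prefix_caching=True,   # reuse KV cache across requests sharing a prefix
)
```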

For more details, check out the vLLM V1 blog post [vLLM V1: A Major
Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) (published Jan 27, 2025).

This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to bring V1 as the default engine version of vLLM, therefore this guide will be updated constantly as more feature get supported on vLLM V1.
Collaborator

Suggested change
This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to bring V1 as the default engine version of vLLM, therefore this guide will be updated constantly as more feature get supported on vLLM V1.
This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to bring V1 as the default engine, therefore this guide will be updated constantly as more features get supported on vLLM V1.

adjustments). As a result, the returned logprobs do not reflect the final adjusted
probabilities used during sampling.

Support for logprobs with post-sampling adjustments is work in progress and will be added in future updates.
Collaborator

Suggested change
Support for logprobs with post-sampling adjustments is work in progress and will be added in future updates.
Support for logprobs with post-sampling adjustments is in progress and will be added in future updates.
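
For reference, a minimal sketch of the logprobs request this caveat applies to, assuming the standard `SamplingParams` `logprobs` / `prompt_logprobs` fields:

```python
# Hedged sketch: requesting top-k logprobs. Per the guide text above, on V1 the
# returned values do not reflect post-sampling adjustments.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=8, logprobs=5, prompt_logprobs=3)
out = llm.generate(["The quick brown fox"], params)[0]

# Each generated position carries a dict of top candidate token logprobs.
for position in out.outputs[0].logprobs:
    print(position)
```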


- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
- **Per-Request Logits Processors**: In V0, users could pass custom
processing functions to adjust logits on a per-request basis. In vLLM V1 this feature has been deprecated. Instead, the design is moving toward supporting **global logits
Collaborator

Suggested change
processing functions to adjust logits on a per-request basis. In vLLM V1 this feature has been deprecated. Instead, the design is moving toward supporting **global logits
processing functions to adjust logits on a per-request basis. In vLLM V1, this feature has been deprecated. Instead, the design is moving toward supporting **global logits
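
For context, a minimal sketch of the V0-era best_of usage that RFC #13361 deprecates, assuming the V0 `SamplingParams` `n` / `best_of` fields, where best_of samples extra candidates and returns the top n by cumulative logprob:

```python
# Hedged sketch of the deprecated V0 best_of behavior.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# Sample 4 candidate completions internally, return only the single best one.
params = SamplingParams(n=1, best_of=4, temperature=0.8, max_tokens=32)
print(llm.generate(["Write a haiku about GPUs:"], params)[0].outputs[0].text)
```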

(e.g., see [PR #13096](https://github.com/vllm-project/vllm/pull/13096)).

- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There
will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)).
Collaborator

Suggested change
will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)).
will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)). We will prioritize the support for Eagle, MTP compared to draft model based spec decode.
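
For readers who want to try the one flavor V1 does support, here is a minimal sketch of ngram speculative decoding, assuming the V0-style engine arguments `speculative_model="[ngram]"`, `num_speculative_tokens`, and `ngram_prompt_lookup_max`; these flags have been reworked into a `speculative_config` in later releases:

```python
# Hedged sketch: ngram (prompt-lookup) speculative decoding, where draft tokens
# are proposed by matching n-grams against the prompt instead of a draft model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_model="[ngram]",  # special value selecting prompt-lookup drafting
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
)
print(llm.generate(["The quick brown fox"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```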
