[Doc] V1 user guide #13991
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small and essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Signed-off-by: Jennifer Zhao <[email protected]>
#### Deprecated sampling features
- best_of
- logits_processors
- beam_search
I believe beam search is supported
#### Deprecated sampling features
- best_of
- logits_processors
- "Per request logits processors are not supported"
- Nick is working on supporting these globally in the server
addressed in the updates from @ywang96
thank you Roger
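For context on what is being deprecated in this thread: in V0, a per-request logits processor is simply a callable attached to `SamplingParams` that rewrites the logits at every decoding step for that request. A minimal sketch, using the V0 two-argument callable convention (V0 also accepts a three-argument variant that additionally receives the prompt token ids):

```python
import torch

from vllm import LLM, SamplingParams


def ban_token_42(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    """V0-style per-request logits processor: mask out token id 42."""
    logits[42] = float("-inf")
    return logits


llm = LLM(model="facebook/opt-125m")
# In V0 this applies only to this request; V1 moves toward global processors.
params = SamplingParams(max_tokens=16, logits_processors=[ban_token_42])
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```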
- **FP8 KV Cache**: FP8 KV Cache is not yet supported in V1.

## Unsupported models
- Mamba not supported yet
- We plan to support these eventually
Looking good! Can you finish this off tomorrow? I want to include a link to this page in the logs.
Signed-off-by: Jennifer Zhao <[email protected]>
# vLLM V1 User Guide

## Why vLLM V1?
Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/V1-alpha-release.html)
This link is invalid, showing "404", please check~
Did you mean https://blog.vllm.ai/2025/01/27/v1-alpha-release.html? Please modify `V1` to `v1` in the link.
Signed-off-by: Jennifer Zhao <[email protected]>
Left some comments - PTAL
# vLLM V1 User Guide

## Why vLLM V1?
Previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html)
Worth expanding here to give a bit more of the context and background - I will help you add stuff here.
#### Current deprecated sampling features
The team is working on supporting these features globally in the server.
This is not accurate since `best_of` and per-request logits processors will be deprecated completely on V1.
I saw @robertgshaw2-redhat 's comments on supporting them globally
thank you @ywang96 I just realized I misunderstood Rob's comment.
- **Spec Decode other than ngram**: currently, only ngram spec decode is supported in V1
after this [PR](https://github.com/vllm-project/vllm/pull/12193).
I don't think we really need to call out a PR that's already merged.
- **Quantization**: For V1, when the CUDA graph is enabled, it defaults to the
piecewiseCUDA graphintroduced in this[PR](https://github.com/vllm-project/vllm/pull/10058); consequently,FP8 and other quantizations are not supported.
I don't think this is completely accurate? Also there are some issues in the formatting here.
## Unsupported Models

vLLM V1 excludes models tagged with `SupportsV0Only` while we develop support for
other types. For a complete list of supported models, see our [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html). The
Let's remove "For a complete list of supported models, see our [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html)." since you also have that at the end of the section.
## Unsupported features
Similar to my comment above for the first section: adding some context here will be helpful.
Co-authored-by: Roger Wang <[email protected]>
Signed-off-by: Jennifer Zhao <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
- **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
to handle request preemptions.
To confirm: Have we decided to deprecate (i.e. no longer support) this feature? The reason I'm asking is there are some RFCs about adding CPU kv cache offloading to v1. cc @simon-mo @WoosukKwon
Swapping will be deprecated, but offloading will not.
cc @ywang96
Oh I see I messed them up. nvm then
Although the support for many features & models has been re-implemented or optimized compared to V0, some remain either unsupported or not yet fully optimized on vLLM V1.

#### Unoptimized Features
Better to add an overview here noting that these are features already supported in V1 that we are still working to optimize.
- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There
will be follow-up work to support other types of spec decode.
Could link to the GitHub Project of v1 spec decode
- **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
to handle request preemptions.

### Feature & Model Support in Progress
One suggestion for this section: it might be clearer to have an overview support matrix of features vs. status, where status could be one of the following:
- Deprecated: No plan to support in v1 unless there's a strong motivation.
- Planned: Plan to support, but work hasn't started yet.
- WIP: Support is in progress.
- Functional (unoptimized): It's working but still being optimized.
- Optimized: It's mostly optimized and no other work is planned at the moment.

That way, the rest of this section can be feature-centric, describing the current status of each feature and pointing to the corresponding GitHub issue/PR/Project.
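For example, such a matrix could look like the following (statuses here are illustrative only, pieced together from items already mentioned in this guide and thread):

| Feature | Status |
|---|---|
| Prefix caching | Optimized |
| Chunked prefill | Optimized |
| Spec decode (ngram) | Functional (unoptimized) |
| Spec decode (Eagle / MTP / draft model) | Planned |
| FP8 KV cache | Planned |
| Per-request logits processors | Deprecated |
| best_of | Deprecated |
| GPU <> CPU KV cache swapping | Deprecated |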
will do
- **Structured Generation Fallback**: For structured output tasks, V1 currently
supports only the `xgrammar:no_fallback` mode.
Details about the structured generation can be found
[here](https://docs.vllm.ai/en/latest/features/structured_outputs.html).
Suggested change:
Before:
- **Structured Generation Fallback**: For structured output tasks, V1 currently
supports only the `xgrammar:no_fallback` mode.
Details about the structured generation can be found
[here](https://docs.vllm.ai/en/latest/features/structured_outputs.html).
After:
- **Structured Output**: For structured output tasks, V1 currently
supports only the `xgrammar:no_fallback` mode, meaning that it will error out if the output schema is unsupported by xgrammar.
Details about the structured generation can be found
[here](https://docs.vllm.ai/en/latest/features/structured_outputs.html).
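A small usage sketch to make the `xgrammar:no_fallback` behavior concrete. This assumes the `GuidedDecodingParams` API described on the linked structured outputs page; option names may differ slightly by version:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Constrain generation to a JSON schema. With V1's `xgrammar:no_fallback`
# mode, a schema that xgrammar cannot handle raises an error instead of
# silently falling back to another backend.
person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct",
          guided_decoding_backend="xgrammar")
params = SamplingParams(
    max_tokens=64,
    guided_decoding=GuidedDecodingParams(json=person_schema),
)
outputs = llm.generate(["Describe a person as JSON."], params)
print(outputs[0].outputs[0].text)
```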
#### Unsupported Models

vLLM V1 excludes model architectures with the `SupportsV0Only` protocol, and the majority fall into the following categories. V1 support for these models will be added eventually.
Suggested change:
Before: vLLM V1 excludes model architectures with the `SupportsV0Only` protocol, and the majority fall into the following categories. V1 support for these models will be added eventually.
After: vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol, and the majority fall into the following categories. V1 support for these models will be added eventually.
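In case it helps readers of the guide, here is a rough sketch of what the tagging looks like on the model side. The import path is my assumption based on where the other `Supports*` interfaces live, and `MyLegacyModel` is purely hypothetical:

```python
import torch.nn as nn

# Assumed location: the other Supports* protocols (SupportsLoRA,
# SupportsMultiModal, ...) live in this module.
from vllm.model_executor.models.interfaces import SupportsV0Only


class MyLegacyModel(nn.Module, SupportsV0Only):
    """Hypothetical model architecture that V1 skips until support is added."""
```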
Thanks for the PR! I will take a look tmr (Tue).
Signed-off-by: Jennifer Zhao <[email protected]>
Thanks for the guide. It's great with lots of information.
Just some concerns:
- On the high level, I feel it's a bit too negative? We talk a lot about features that are unsupported or still need to be optimized. I feel we should also highlight some optimized features, such as chunked prefill and prefix caching (see the sketch below).
- I feel we need to mention the scheduler (scheduling policy) somewhere: currently we batch prefill tokens and decoding tokens in the same batch, and we don't prioritize prefill/decode.
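To make that concrete, something like the following could be shown in the guide. This is a sketch only: `enable_prefix_caching` and `enable_chunked_prefill` are the V0-era engine flags, and in V1 some of these behaviors may already be enabled by default.

```python
from vllm import LLM, SamplingParams

# Prefix caching and chunked prefill are among the features that are already
# well supported: shared prompt prefixes reuse KV cache across requests, and
# prefill and decode tokens are mixed in the same batch by the scheduler.
llm = LLM(
    model="facebook/opt-125m",
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```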
For more details, check out the vLLM V1 blog post [vLLM V1: A Major
Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) (published Jan 27, 2025).

This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to bring V1 as the default engine version of vLLM, therefore this guide will be updated constantly as more feature get supported on vLLM V1.
Suggested change:
Before: This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to bring V1 as the default engine version of vLLM, therefore this guide will be updated constantly as more feature get supported on vLLM V1.
After: This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to bring V1 as the default engine, therefore this guide will be updated constantly as more features get supported on vLLM V1.
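It might also be worth stating how users opt in or out while the rollout is in progress. A minimal sketch using the `VLLM_USE_V1` environment variable from the alpha release announcement:

```python
import os

# Must be set before vLLM creates the engine; use "0" to fall back to V0.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")
```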
adjustments). As a result, the returned logprobs do not reflect the final adjusted
probabilities used during sampling.

Support for logprobs with post-sampling adjustments is work in progress and will be added in future updates.
Suggested change:
Before: Support for logprobs with post-sampling adjustments is work in progress and will be added in future updates.
After: Support for logprobs with post-sampling adjustments is in progress and will be added in future updates.
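A short sketch of the affected API surface (the `logprobs` / `prompt_logprobs` fields of `SamplingParams`), for readers who want to see what the caveat applies to:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

# Ask for the top-5 logprobs per generated token and top-3 per prompt token.
# In V1 these are currently computed before the sampling-time adjustments
# (temperature, penalties, ...), so they do not reflect the final adjusted
# probabilities used during sampling.
params = SamplingParams(max_tokens=16, logprobs=5, prompt_logprobs=3)
outputs = llm.generate(["The capital of France is"], params)

completion = outputs[0].outputs[0]
print(completion.text)
print(completion.logprobs)  # one {token_id: Logprob} dict per generated token
```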
- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
- **Per-Request Logits Processors**: In V0, users could pass custom
processing functions to adjust logits on a per-request basis. In vLLM V1 this feature has been deprecated. Instead, the design is moving toward supporting **global logits
Suggested change:
Before: processing functions to adjust logits on a per-request basis. In vLLM V1 this feature has been deprecated. Instead, the design is moving toward supporting **global logits
After: processing functions to adjust logits on a per-request basis. In vLLM V1, this feature has been deprecated. Instead, the design is moving toward supporting **global logits
(e.g., see [PR #13096](https://github.com/vllm-project/vllm/pull/13096)).
- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There
will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)).
Suggested change:
Before: will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)).
After: will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)). We will prioritize the support for Eagle, MTP compared to draft model based spec decode.
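For reference, a sketch of what enabling the ngram path looks like today. The argument names follow the V0-era engine args and may change as V1 spec decode evolves:

```python
from vllm import LLM

# ngram speculative decoding: draft tokens are proposed by matching n-grams
# from the prompt, so no separate draft model is needed.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
)

outputs = llm.generate(["The quick brown fox"])
print(outputs[0].outputs[0].text)
```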
This PR adds the v1 User Guide as a living document, to be updated and expanded over time.