# vLLM V1 User Guide

## Why vLLM V1?
For the motivation behind V1, see the previous blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html).

## Semantic changes and deprecated features

### Logprobs
- vLLM V1 now supports both sample logprobs and prompt logprobs, as introduced in this [PR](https://github.com/vllm-project/vllm/pull/9880).
- **Current Limitations**:
  - V1 prompt logprobs do not support prefix caching.
  - V1 logprobs are computed before logits post-processing, so penalty adjustments and temperature scaling are not applied.
- The team is actively working on implementing logprobs that include post-sampling adjustments.
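
As a minimal sketch of how logprobs are requested (the small model name below is a placeholder chosen only to keep the example lightweight), both kinds can be enabled through `SamplingParams`:

```python
from vllm import LLM, SamplingParams

# Placeholder model, used purely for illustration.
llm = LLM(model="facebook/opt-125m")

# Request the top-5 logprobs for sampled tokens and for prompt tokens.
params = SamplingParams(max_tokens=32, logprobs=5, prompt_logprobs=5)

outputs = llm.generate(["The capital of France is"], params)
for output in outputs:
    # Per-token logprobs of the generated tokens.
    print(output.outputs[0].logprobs)
    # Per-token logprobs of the prompt (the first entry is None).
    print(output.prompt_logprobs)
```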

### The following features have been deprecated in V1:
#### Deprecated sampling features
- `best_of`
- `logits_processors`
- `beam_search`
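
For context, here is a hedged sketch of a V0-style request using two of these now-deprecated `SamplingParams` fields; under V1 they are no longer honored (the exact rejection or ignore behavior may vary by version):

```python
from vllm import SamplingParams

def block_token_zero(token_ids, logits):
    # V0-style per-request logits processor: mask out token id 0.
    logits[0] = float("-inf")
    return logits

# Deprecated in V1: best_of and per-request logits_processors.
params = SamplingParams(
    n=1,
    best_of=4,                             # deprecated
    logits_processors=[block_token_zero],  # deprecated
    max_tokens=32,
)
```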

#### Deprecated KV Cache features
- KV Cache swapping
- KV Cache offloading
- FP8 KV Cache

## Unsupported features

- **LoRA**: LoRA works for V1 on the main branch, but its performance is inferior to that of V0. The team is actively working on improving its performance in this [PR](https://github.com/vllm-project/vllm/pull/13096).
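
As a rough sketch of LoRA usage (the base model name and adapter path below are placeholders, not part of the guide):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder base model; enable_lora turns on LoRA adapter support.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

outputs = llm.generate(
    ["Write a SQL query that lists all users."],
    SamplingParams(max_tokens=64),
    # LoRARequest(adapter name, unique integer id, path to adapter weights)
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql-lora-adapter"),
)
print(outputs[0].outputs[0].text)
```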

- **Spec Decode other than ngram**: Currently, only ngram spec decode is supported in V1, added in this [PR](https://github.com/vllm-project/vllm/pull/12193).
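
Below is a sketch of enabling ngram (prompt-lookup) speculative decoding; the engine arguments follow the interface documented around this release and the model name is a placeholder, so treat the exact flags as assumptions:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder base model
    speculative_model="[ngram]",       # prompt-lookup (ngram) speculation
    num_speculative_tokens=5,          # propose up to 5 draft tokens per step
    ngram_prompt_lookup_max=4,         # longest ngram to match against the prompt
)

outputs = llm.generate(["The quick brown fox"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```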

- **Quantization**: For V1, when the CUDA graph is enabled, it defaults to the piecewise CUDA graph introduced in this [PR](https://github.com/vllm-project/vllm/pull/10058); consequently, FP8 and other quantizations are not supported.
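
For reference, a sketch of how FP8 quantization is typically requested; given the limitation above, this assumes the V0 engine, and the model name is a placeholder:

```python
from vllm import LLM, SamplingParams

# Placeholder model; quantization="fp8" enables online FP8 weight quantization.
# Per the note above, this path currently assumes the V0 engine rather than V1.
llm = LLM(model="meta-llama/Llama-2-7b-hf", quantization="fp8")

print(llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```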

## Unsupported Models

Models tagged with `SupportsV0Only` in the model definition are not supported by V1.

- **Pooling Models**: Pooling models are not supported in V1 yet.
- **Encoder-Decoder**: vLLM V1 is currently limited to decoder-only Transformers. Please check out our [documentation](https://docs.vllm.ai/en/latest/models/supported_models.html) for a more detailed list of the supported models. Support for encoder-decoder models is not expected soon.
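
If a model or feature on this page is not yet covered by V1, the `VLLM_USE_V1` environment variable can be used to choose the engine explicitly (a sketch; whether V1 is opt-in or the default depends on your vLLM version, and the model name is a placeholder):

```python
import os

# Select the engine explicitly: "1" opts into V1, "0" falls back to V0.
# Set this before constructing the LLM (setting it before import is safest).
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```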


## FAQ
