Squashed commit of the following:
commit 5d5071c
Author: Lucas Wilkinson <[email protected]>
Date:   Sat Feb 1 01:13:23 2025 +0000

    reduce split kv amount

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 5fe1d1d
Author: Lucas Wilkinson <[email protected]>
Date:   Sat Feb 1 00:56:45 2025 +0000

    format

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 0d66687
Author: Simon Mo <[email protected]>
Date:   Fri Jan 31 16:39:19 2025 -0800

    Update loader.py

    Co-authored-by: Michael Goin <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 5002734
Author: Lucas Wilkinson <[email protected]>
Date:   Sat Feb 1 00:14:14 2025 +0000

    simplification

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit fac827f
Merge: db2c583 44bbca7
Author: Lucas Wilkinson <[email protected]>
Date:   Sat Feb 1 00:09:36 2025 +0000

    Merge remote-tracking branch 'origin/main' into mla-fp8

commit db2c583
Author: Lucas Wilkinson <[email protected]>
Date:   Sat Feb 1 00:06:10 2025 +0000

    filter compressed tensor models better

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit e144da8
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 18:41:35 2025 -0500

    Update vllm/model_executor/model_loader/loader.py

    Co-authored-by: Simon Mo <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 1621381
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 18:41:22 2025 -0500

    Update vllm/model_executor/model_loader/loader.py

    Co-authored-by: Simon Mo <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 9829fae
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 23:40:12 2025 +0000

    misc

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 44bbca7
Author: Brian Dellabetta <[email protected]>
Date:   Fri Jan 31 17:38:48 2025 -0600

    [Doc] int4 w4a16 example (vllm-project#12585)

    Based on a request by @mgoin, with @kylesayrs we have added an example
    doc for int4 w4a16 quantization, following the pre-existing int8 w8a8
    quantization example and the example available in
    [`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py)

    FIX #n/a (no issue created)

    @kylesayrs and I have discussed a couple additional improvements for the
    quantization docs. We will revisit at a later date, possibly including:
    - A section for "choosing the correct quantization scheme/ compression
    technique"
    - Additional vision or audio calibration datasets

    ---------

    Signed-off-by: Brian Dellabetta <[email protected]>
    Co-authored-by: Michael Goin <[email protected]>

commit 60808bd
Author: Harry Mellor <[email protected]>
Date:   Fri Jan 31 23:38:35 2025 +0000

    [Doc] Improve installation signposting (vllm-project#12575)

    - Make device tab names more explicit
    - Add comprehensive list of devices to
    https://docs.vllm.ai/en/latest/getting_started/installation/index.html
    - Add `attention` blocks to the intro of all devices that don't have
    pre-built wheels/images

    ---------

    Signed-off-by: Harry Mellor <[email protected]>

commit fc54214
Author: Ryan Nguyen <[email protected]>
Date:   Fri Jan 31 18:37:30 2025 -0500

    [Feature] Fix guided decoding blocking bitmask memcpy (vllm-project#12563)

    **[Guided decoding performance optimization]** Sending the guided
    decoding bitmask in xgrammar to the GPU
    (`self.token_bitmask.to(scores.device)`) is a blocking operation that
    prevents the CPU from pre-launching the sampler kernels. The CPU waits
    until decode is complete, then copies the bitmask over. This PR changes
    the operation to an async copy by setting `non_blocking=True`.
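
    A minimal sketch of the pattern (illustrative only, not the exact vLLM code;
    tensor names and shapes are assumptions):

    ```python
    import torch

    # Keep the bitmask in pinned host memory so the host-to-device copy can be
    # truly asynchronous and overlap with subsequent kernel launches.
    token_bitmask = torch.zeros(1, 4096, dtype=torch.int32).pin_memory()
    scores = torch.randn(1, 4096, device="cuda")

    # Old (blocking): the CPU stalls on the copy before launching sampling kernels.
    # mask_gpu = token_bitmask.to(scores.device)

    # New (async): the copy is enqueued on the current CUDA stream and the CPU
    # immediately continues launching the sampling kernels.
    mask_gpu = token_bitmask.to(scores.device, non_blocking=True)
    ```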

    (Current) The CPU is blocked on a `cudaStreamSynchronize` and only
    launches the sampling kernels after the bitmask has been applied. Below is the
    Nsys profile for one decode phase from Llama 3.1 8B.

    ![image](https://github.com/user-attachments/assets/8997eae1-b822-4f52-beb8-ef19a7c6b824)

    With the optimization, this is no longer the case:

    ![image](https://github.com/user-attachments/assets/6d5ea83f-f169-4f98-a8c1-41c719b3e1e7)

    ---------

    Signed-off-by: Ryan N <[email protected]>

commit eb5741a
Author: Tyler Michael Smith <[email protected]>
Date:   Fri Jan 31 18:29:11 2025 -0500

    [Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (vllm-project#12587)

    Integrates the block-quantized kernels introduced in
    vllm-project#11868 for use in linear
    layers.

    Signed-off-by: Tyler Michael Smith <[email protected]>

commit 145c2ff
Author: Robert Shaw <[email protected]>
Date:   Fri Jan 31 18:28:47 2025 -0500

    [Bugfix] Revert MoE Triton Config Default (vllm-project#12629)

    SUMMARY:
    * previous PR for pulling in block configs also changed defaults
    (https://github.com/vllm-project/vllm/pull/11589/files) for FP8
    * this broke L4 MoE since there was not enough SHM for the default
    configuration
    * this reverts the non-block example to the default

    Signed-off-by: [email protected] <[email protected]>

commit 415f194
Author: Kevin H. Luu <[email protected]>
Date:   Fri Jan 31 13:39:36 2025 -0800

    [release] Add input step to ask for Release version (vllm-project#12631)

    Instead of having to create a new build with the release version passed
    in as an env var.

commit 4251506
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 21:26:13 2025 +0000

    fixes

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit c9d72cb
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 21:17:23 2025 +0000

    more cleanup

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 3cdd2ce
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 21:16:42 2025 +0000

    cleanup

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 89003c4
Author: Chen Zhang <[email protected]>
Date:   Sat Feb 1 05:13:04 2025 +0800

    [v1][Bugfix] Add extra_keys to block_hash for prefix caching (vllm-project#12603)

    This PR adds an extra key to the block hash, so that two blocks with the
    same tokens but different extra_keys in their parent blocks get different
    hash values. For example, it generates different hash values for the
    second block of the following two requests:
    ```python
    request1 = make_request(
        request_id=0,
        prompt_token_ids=[_ for _ in range(6)],
        mm_positions=[{
            "offset": 0,
            "length": 3
        }, {
            "offset": 3,
            "length": 3
        }],
        mm_hashes=["hash1", "hash2"],
    )
    request2 = make_request(
        request_id=1,
        prompt_token_ids=[_ for _ in range(6)],
        mm_positions=[{
            "offset": 0,
            "length": 3
        }, {
            "offset": 3,
            "length": 3
        }],
        mm_hashes=["hash3", "hash2"],
    )
    ```

    ---------

    Signed-off-by: Chen Zhang <[email protected]>

commit f51cbe0
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 21:04:22 2025 +0000

    review comments

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 3d12a04
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 20:45:14 2025 +0000

    working but messy

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 60bcef0
Author: Cody Yu <[email protected]>
Date:   Fri Jan 31 12:30:46 2025 -0800

    [Docs][V1] Prefix caching design (vllm-project#12598)

    - Create v1 design document section in docs.
    - Add prefix caching design doc.

    @WoosukKwon @ywang96

    ---------

    Signed-off-by: Cody Yu <[email protected]>

commit 847f883
Author: Cody Yu <[email protected]>
Date:   Fri Jan 31 12:30:33 2025 -0800

    [Git] Automatically sign-off commits (vllm-project#12595)

    It's very annoying when I forget to add `-s` in `git commit` to
    sign off, because I then need to `git rebase HEAD~1 --signoff` and `git
    push -f` to fix the DCO. This PR adds a hook to sign off commits
    automatically when `-s` is missing to solve this problem. The only
    change on the user side is that users now have to install 2 hooks, so
    instead of just

    ```
    pre-commit install
    ```

    Now we need to

    ```
    pre-commit install --hook-type pre-commit --hook-type commit-msg
    ```

    Note that even if users still only install the pre-commit hook, they
    won't get any error in `git commit`. Just the sign-off hook won't run.

    cc @hmellor @youkaichao

    ---------

    Signed-off-by: Cody Yu <[email protected]>

commit 325f679
Author: Robert Shaw <[email protected]>
Date:   Fri Jan 31 15:06:39 2025 -0500

    [BugFix] Fix Torch.Compile For DeepSeek (vllm-project#12594)

    Co-authored-by: simon-mo <[email protected]>

commit 548ec44
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 19:13:22 2025 +0000

    simon changes

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit a57cd3d
Merge: 076cbe5 cabaf4e
Author: simon-mo <[email protected]>
Date:   Fri Jan 31 07:52:26 2025 +0000

    Merge branch 'main' of github.com:vllm-project/vllm into mla-fp8

commit 076cbe5
Merge: 0ccbcce a1fc18c
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 23:31:41 2025 -0500

    Merge branch 'main' into mla-fp8

commit 0ccbcce
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 04:29:17 2025 +0000

    deepseek v3 support

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 645622c
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 03:08:36 2025 +0000

    cleanup

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 2d61054
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 03:03:07 2025 +0000

    cleanup

    Co-authored-by: Alexander Matveev <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit f2b2500
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 02:47:05 2025 +0000

    Fix TP > 1 cuda graphs

    Co-authored-by: Alexander Matveev <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 433322b
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 02:26:11 2025 +0000

    Revert "add cuda graph support"

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 31c34bf
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 23:06:09 2025 +0000

    ci fix

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 54ba87d
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 21:23:09 2025 +0000

    add cuda graph support

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 5afc1bf
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 20:58:53 2025 +0000

    fix mypy

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit cfb2d26
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 19:42:36 2025 +0000

    fix mypy

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 37e39f4
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 18:04:58 2025 +0000

    fix failing test

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 0881475
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 17:18:55 2025 +0000

    disable MLA for v3 for now

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 4a46014
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 11:12:48 2025 -0500

    Update vllm/attention/backends/mla/utils.py

    Co-authored-by: Tyler Michael Smith <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 09d814c
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 15:11:58 2025 +0000

    review comments

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 8bdc14a
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 14:09:46 2025 +0000

    review comments

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit d27826d
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 08:51:42 2025 -0500

    Update vllm/config.py

    Co-authored-by: Zhuohan Li <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 7487429
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 04:00:26 2025 +0000

    renaming for consistency

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 634eee6
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 03:52:59 2025 +0000

    review comments

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 31b802c
Author: Lucas Wilkinson <[email protected]>
Date:   Wed Jan 29 22:51:37 2025 -0500

    Update vllm/attention/backends/mla/utils.py

    Co-authored-by: Michael Goin <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 068e672
Author: Lucas Wilkinson <[email protected]>
Date:   Wed Jan 29 22:46:43 2025 -0500

    Update utils.py

    Co-authored-by: Michael Goin <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit f2cac91
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 03:11:43 2025 +0000

    more cleanups

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit c34e5ca
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 03:02:58 2025 +0000

    fix VLLM_MLA_PERFORM_MATRIX_ABSORPTION=0

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 27ad92c
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 02:29:40 2025 +0000

    squashed commits

    Co-authored-by: Woosuk Kwon <[email protected]>
    Co-authored-by: simon-mo <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>
mawong-amd committed Feb 1, 2025
1 parent d47b834 commit 1d8af93
Showing 51 changed files with 1,354 additions and 245 deletions.
9 changes: 7 additions & 2 deletions .buildkite/release-pipeline.yaml
@@ -56,6 +56,11 @@ steps:
env:
DOCKER_BUILDKIT: "1"

- input: "Provide Release version here"
fields:
- text: "What is the release version?"
key: "release-version"

- block: "Build CPU release image"
key: block-cpu-release-image-build
depends_on: ~
@@ -66,7 +71,7 @@
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$RELEASE_VERSION --progress plain -f Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$RELEASE_VERSION"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --progress plain -f Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
env:
DOCKER_BUILDKIT: "1"
13 changes: 13 additions & 0 deletions .pre-commit-config.yaml
@@ -85,9 +85,22 @@ repos:
entry: tools/png-lint.sh
language: script
types: [png]
- id: signoff-commit
name: Sign-off Commit
entry: bash
args:
- -c
- |
if ! grep -q "^Signed-off-by: $(git config user.name) <$(git config user.email)>" .git/COMMIT_EDITMSG; then
printf "\nSigned-off-by: $(git config user.name) <$(git config user.email)>\n" >> .git/COMMIT_EDITMSG
fi
language: system
verbose: true
stages: [commit-msg]
- id: suggestion
name: Suggestion
entry: bash -c 'echo "To bypass pre-commit hooks, add --no-verify to git commit."'
language: system
verbose: true
pass_filenames: false

1 change: 1 addition & 0 deletions csrc/ops.h
@@ -156,6 +156,7 @@ torch::Tensor ggml_mul_mat_a8(torch::Tensor W, torch::Tensor X, int64_t type,

#ifndef USE_ROCM
bool cutlass_scaled_mm_supports_fp8(int64_t cuda_device_capability);
bool cutlass_scaled_mm_supports_block_fp8(int64_t cuda_device_capability);

void cutlass_scaled_mm(torch::Tensor& out, torch::Tensor const& a,
torch::Tensor const& b, torch::Tensor const& a_scales,
8 changes: 7 additions & 1 deletion csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu
@@ -58,7 +58,13 @@ void cutlass_scaled_mm_sm90(torch::Tensor& c, torch::Tensor const& a,

vllm::cutlass_scaled_mm_blockwise_sm90_fp8(c, a, b, a_scales, b_scales);
} else {
TORCH_CHECK(false, "Unsupported scale group shapes for CUTLASS 3.x GEMM");
TORCH_CHECK(false,
"Unsupported scale group shapes for CUTLASS 3.x GEMM.\n "
"a_scale_group_shape must be [1, 128], got: [",
a_scale_group_shape[0], ", ", a_scale_group_shape[1],
"]\n"
"b_scale_group_shape must be [128, 128], got: [",
b_scale_group_shape[0], ", ", b_scale_group_shape[1], "]");
}
}

15 changes: 14 additions & 1 deletion csrc/quantization/cutlass_w8a8/scaled_mm_entry.cu
@@ -81,6 +81,19 @@ bool cutlass_scaled_mm_supports_fp8(int64_t cuda_device_capability) {
return false;
}

bool cutlass_scaled_mm_supports_block_fp8(int64_t cuda_device_capability) {
// CUTLASS block-quantized FP8 kernels need at least CUDA 12.0
// and at least SM90 (Hopper)

#if defined CUDA_VERSION
if (cuda_device_capability >= 90) {
return CUDA_VERSION >= 12000;
}
#endif

return false;
}

void cutlass_scaled_mm(torch::Tensor& c, torch::Tensor const& a,
torch::Tensor const& b, torch::Tensor const& a_scales,
torch::Tensor const& b_scales,
@@ -212,4 +225,4 @@ void cutlass_scaled_mm_azp(torch::Tensor& c, torch::Tensor const& a,
"No compiled cutlass_scaled_mm_azp for a compute capability less than "
"CUDA device capability: ",
version_num);
}
}
7 changes: 7 additions & 0 deletions csrc/torch_bindings.cpp
@@ -330,6 +330,13 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
ops.def("cutlass_scaled_mm_supports_fp8(int cuda_device_capability) -> bool");
ops.impl("cutlass_scaled_mm_supports_fp8", &cutlass_scaled_mm_supports_fp8);

// Check if cutlass scaled_mm supports block quantization (used by DeepSeekV3)
ops.def(
"cutlass_scaled_mm_supports_block_fp8(int cuda_device_capability) -> "
"bool");
ops.impl("cutlass_scaled_mm_supports_block_fp8",
         &cutlass_scaled_mm_supports_block_fp8);

// Check if cutlass sparse scaled_mm is supported for CUDA devices of the
// given capability
ops.def(
2 changes: 1 addition & 1 deletion docs/source/contributing/overview.md
@@ -26,7 +26,7 @@ Check out the [building from source](#build-from-source) documentation for detai
pip install -r requirements-dev.txt

# Linting, formatting and static type checking
pre-commit install
pre-commit install --hook-type pre-commit --hook-type commit-msg

# You can manually run pre-commit with
pre-commit run --all-files
228 changes: 228 additions & 0 deletions docs/source/design/v1/prefix_caching.md
@@ -0,0 +1,228 @@
# Automatic Prefix Caching

Prefix caching kv-cache blocks is a popular optimization in LLM inference to avoid redundant prompt computations. The core idea is simple – we cache the kv-cache blocks of processed requests, and reuse these blocks when a new request comes in with the same prefix as previous requests. Since prefix caching is almost a free lunch and won’t change model outputs, it has been widely used by many public endpoints (e.g., OpenAI, Anthropic) and most open-source LLM inference frameworks (e.g., SGLang).

While there are many ways to implement prefix caching, vLLM chooses a hash-based approach. Specifically, we hash each kv-cache block by the tokens in the block and the tokens in the prefix before the block:

```text
Block 1 Block 2 Block 3
[A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->|
Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
```

In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the block hash as `hash(tuple[components])`, where the components are:

* Parent hash value: The hash value of the parent block.
* Block tokens: A tuple of tokens in this block. The reason to include the exact tokens is to reduce potential hash collisions.
* Extra hashes: Other values required to make this block unique, such as LoRA IDs and multi-modality input hashes (see the example below).

Note 1: We only cache full blocks.

Note 2: The above hash key structure is not 100% collision free. Theoretically it’s still possible for different prefix tokens to have the same hash value, but this should be nearly impossible in practice. Of course, contributions are welcome if you have an awesome idea to eliminate collisions entirely.
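
As a rough sketch (the function and argument names here are illustrative assumptions, not vLLM's actual API), the hash of each full block can be computed by chaining the parent hash with the block's tokens and extra keys:

```python
from typing import Any, Optional, Tuple

def hash_block(parent_hash: Optional[int],
               block_tokens: Tuple[Any, ...],
               extra_keys: Tuple[Any, ...] = ()) -> int:
    # Combine the parent block's hash, the exact tokens in this block,
    # and any extra keys (LoRA ID, image hashes, ...) into a single hash.
    return hash((parent_hash, block_tokens, extra_keys))

# Chaining the hashes over the three blocks in the example above:
h1 = hash_block(None, ("A", "gentle", "breeze", "stirred"))
h2 = hash_block(h1, ("the", "leaves", "as", "children"))
h3 = hash_block(h2, ("laughed", "in", "the", "distance"))
```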

**A hashing example with multi-modality inputs**
In this example, we illustrate how prefix caching works with multi-modality inputs (e.g., images). Assume we have a request with the following messages:

```text
messages = [
{"role": "user",
"content": [
{"type": "text",
"text": "What's in this image?"
},
{"type": "image_url",
"image_url": {"url": image_url},
},
]},
]
```

It will become the following prompt:

```text
Prompt:
<s>[INST]What's in this image?\n[IMG][/INST]
Tokenized prompt:
[1, 3, 7493, 1681, 1294, 1593, 3937, 9551, 10, 4]
Prompt with placeholders (<P>):
[1, 3, 7493, 1681, 1294, 1593, 3937, 9551, <P>, <P>, ..., <P>, 4]
```

As we can see, after tokenization the `[IMG]` is replaced by a sequence of placeholder tokens, and these placeholders are replaced by image embeddings during prefill. The challenge for prefix caching to support this case is that we need to differentiate the images from the placeholders. To address this problem, we include as an extra key the image hash generated by the frontend image processor. For example, the hashes of the blocks in the above prompt would be (assuming block size 16 and 41 placeholder tokens):

```text
Block 0
Parent hash: None
Token IDs: 1, 3, 7493, 1681, 1294, 1593, 3937, 9551, <p>, ..., <p>
Extra hash: <image hash>
Block 1
Parent hash: Block 0 hash
Token IDs: <p>, ..., <p>
Extra hash: <image hash>
Block 2
Parent hash: Block 1 hash
Token IDs: <p>, ..., <p>
Extra hash: <image hash>
Block 3
Parent hash: Block 2 hash
Token IDs: <p>, ..., <p>, 4
Extra hash: <image hash>
```

In the rest of this document, we first introduce the data structure used for prefix caching in vLLM v1, followed by the prefix caching workflow of the major KV cache operations (e.g., allocate, append, free, eviction). Finally, we use an example to illustrate the end-to-end prefix caching workflow.

## Data Structure

The prefix caching in vLLM v1 is implemented in the KV cache manager. The basic building block is the “Block” data class (simplified):

```python
class KVCacheBlock:
# The block ID (immutable)
block_id: int
# The block hash (will be assigned when the block is full,
# and will be reset when the block is evicted).
block_hash: BlockHashType
# The number of requests using this block now.
ref_cnt: int

# The pointers to form a doubly linked list for the free queue.
prev_free_block: Optional["KVCacheBlock"] = None
next_free_block: Optional["KVCacheBlock"] = None
```

There are two design points to highlight:

1. We allocate all KVCacheBlock objects as a block pool when initializing the KV cache manager. This avoids Python object creation overhead and makes it easy to track all blocks at all times.
2. We introduce doubly linked list pointers directly in the KVCacheBlock, so that we can construct the free queue from the blocks themselves. This gives us two benefits:
    1. We get O(1) complexity for moving elements in the middle of the queue to the tail.
    2. We avoid introducing another Python queue (e.g., `deque`), which would wrap the elements.
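
A minimal sketch of such a free queue (names are assumptions, simplified), illustrating why the doubly linked pointers give O(1) removal from the middle:

```python
class FreeKVCacheBlockQueue:
    """Simplified sketch: a doubly linked list threaded through the
    prev_free_block/next_free_block pointers of KVCacheBlock above."""

    def __init__(self, blocks: list):
        self.head = blocks[0] if blocks else None
        self.tail = blocks[-1] if blocks else None
        for prev, nxt in zip(blocks, blocks[1:]):
            prev.next_free_block, nxt.prev_free_block = nxt, prev

    def popleft(self):
        # Pop the least recently used block from the head of the queue.
        block = self.head
        self.remove(block)
        return block

    def remove(self, block):
        # O(1) unlink, even for a block in the middle of the queue
        # (e.g., a cached block "touched" by a new request).
        if block.prev_free_block is None:
            self.head = block.next_free_block
        else:
            block.prev_free_block.next_free_block = block.next_free_block
        if block.next_free_block is None:
            self.tail = block.prev_free_block
        else:
            block.next_free_block.prev_free_block = block.prev_free_block
        block.prev_free_block = block.next_free_block = None

    def append(self, block):
        # Freed blocks are pushed to the tail of the queue.
        block.prev_free_block, block.next_free_block = self.tail, None
        if self.tail is None:
            self.head = block
        else:
            self.tail.next_free_block = block
        self.tail = block
```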

As a result, we will have the following components when the KV cache manager is initialized:

:::{image} /assets/design/v1/prefix_caching/overview.png
:alt: Component Overview
:::

* Block Pool: A list of KVCacheBlock.
* Free Block Queue: Stores only the pointers to the head and tail blocks; the blocks themselves are linked via their prev/next pointers.
* Cache blocks: Mapping from hash key to block IDs.
* Request blocks: Mapping from request ID to allocated block IDs.
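
A sketch of these components at initialization (simplified; the constructor and attribute names are assumptions for illustration):

```python
from collections import defaultdict

class KVCacheManager:
    """Illustrative sketch of the components above, not the actual class."""

    def __init__(self, num_blocks: int):
        # Block Pool: all blocks are pre-allocated up front.
        self.block_pool = [
            KVCacheBlock(block_id=i, block_hash=None, ref_cnt=0)
            for i in range(num_blocks)
        ]
        # Free Block Queue: threaded through the blocks' prev/next pointers.
        self.free_block_queue = FreeKVCacheBlockQueue(self.block_pool)
        # Cache blocks: block hash -> cached block.
        self.cached_block_hash_to_block = {}
        # Request blocks: request ID -> allocated blocks (append-only).
        self.req_to_blocks = defaultdict(list)
```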

## Operations

### Block Allocation

**New request:** Workflow for the scheduler to schedule a new request with KV cache block allocation:

1. The scheduler calls `kv_cache_manager.get_computed_blocks()` to get a sequence of blocks that have already been computed. This is done by hashing the prompt tokens in the request and looking up Cache Blocks.
2. The scheduler calls `kv_cache_manager.allocate_slots()`. It does the following steps:
    1. Compute the number of new blocks required, and return if there are not enough blocks to allocate.
    2. “Touch” the computed blocks. This increases the reference count of each computed block by one and removes the block from the free queue if it wasn’t being used by other requests, so that these computed blocks won’t be evicted. See the example in the next section for illustration.
    3. Allocate new blocks by popping blocks from the head of the free queue. If the head block is a cached block, this also “evicts” it so that no other requests can reuse it from now on.
    4. If an allocated block is already full of tokens, we immediately add it to Cache Blocks, so that the block can be reused by other requests in the same batch.
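
Roughly, in pseudocode (simplified; attribute names and the block-size handling are assumptions for illustration):

```python
def schedule_new_request(request, kv_cache_manager, block_size: int = 16):
    # Step 1: prefix-cache lookup over the full blocks of the prompt.
    computed_blocks = kv_cache_manager.get_computed_blocks(request)
    num_computed_tokens = len(computed_blocks) * block_size

    # Step 2: allocate slots for the remaining (uncached) prompt tokens.
    new_blocks = kv_cache_manager.allocate_slots(
        request,
        num_tokens=len(request.prompt_token_ids) - num_computed_tokens,
        computed_blocks=computed_blocks,
    )
    if new_blocks is None:
        # Not enough free blocks: the request cannot be scheduled this step.
        return None
    return computed_blocks + new_blocks
```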

**Running request:** Workflow for the scheduler to schedule a running request with KV cache block allocation:

1. The scheduler calls `kv_cache_manager.append_slots()`. It does the following steps:
    1. Compute the number of new blocks required, and return if there are not enough blocks to allocate.
    2. Allocate new blocks by popping blocks from the head of the free queue. If the head block is a cached block, this also “evicts” it so that no other requests can reuse it from now on.
    3. Append token IDs to the slots in the existing blocks as well as the new blocks. If a block becomes full, we add it to Cache Blocks to cache it.

**Duplicated blocks**
Assuming the block size is 4 and you send a request (Request 1) with prompt ABCDEF and decoding length 3:

```text
Prompt: [A, B, C, D, E, F]
Output: [G, H, I]
Time 0:
Tokens: [A, B, C, D, E, F, G]
Block Table: [0 (ABCD), 1 (EFG)]
Cache Blocks: 0
Time 1:
Tokens: [A, B, C, D, E, F, G, H]
Block Table: [0 (ABCD), 1 (EFGH)]
Cache Blocks: 0, 1
Time 2:
Tokens: [A, B, C, D, E, F, G, H, I]
Block Table: [0 (ABCD), 1 (EFGH), 2 (I)]
Cache Blocks: 0, 1
```

Now blocks 0 and 1 are cached, and we send the same request again (Request 2) with greedy sampling, so that it will produce exactly the same outputs as Request 1:

```text
Prompt: [A, B, C, D, E, F]
Output: [G, H, I]
Time 0:
Tokens: [A, B, C, D, E, F, G]
Block Table: [0 (ABCD), 3 (EFG)]
Cache Blocks: 0, 1
Time 1:
Tokens: [A, B, C, D, E, F, G, H]
Block Table: [0 (ABCD), 3 (EFGH)]
Cache Blocks: 0, 1, 3
```

As can be seen, block 3 is a new full block and is cached. However, it is redundant with block 1, meaning that we cached the same block twice. In v0, when we detect that block 3 is a duplicate, we free block 3 and let Request 2 use block 1 instead, so its block table becomes `[0, 1]` at Time 1. However, the block table in vLLM v1 is append-only, meaning that changing the block table from `[0, 3]` to `[0, 1]` is not allowed. As a result, we will have duplicated blocks for the hash key E-H. This duplication is eliminated when the request is freed.

### Free

When a request is finished, we free all its blocks if no other requests are using them (reference count = 0). In this example, we free Request 1 and blocks 2, 3, 4, and 8 associated with it. We can see that the freed blocks are added to the tail of the free queue in *reverse* order. This is because the last block of a request hashes more tokens and is less likely to be reused by other requests, so it should be evicted first.

:::{image} /assets/design/v1/prefix_caching/free.png
:alt: Free Queue after Free a Request
:::
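
A minimal sketch of the free path under the structures sketched above (names are assumptions):

```python
def free_request(request, kv_cache_manager):
    blocks = kv_cache_manager.req_to_blocks.pop(request.request_id)
    # Free in reverse order: the last block of a request hashes the most
    # tokens and is least likely to be reused, so it is appended first and
    # will therefore sit closer to the head of the queue and be evicted first.
    for block in reversed(blocks):
        block.ref_cnt -= 1
        if block.ref_cnt == 0:
            kv_cache_manager.free_block_queue.append(block)
```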

### Eviction (LRU)

When the head block (least recently used block) of the free queue is cached, we have to evict the block to prevent it from being used by other requests. Specifically, eviction involves the following steps:

1. Pop the block from the head of the free queue. This is the LRU block to be evicted.
2. Remove the block ID from Cache Blocks.
3. Remove the block hash.
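
Sketched with the assumed names used above:

```python
def evict_lru_block(kv_cache_manager):
    # Pop the LRU block from the head of the free queue.
    block = kv_cache_manager.free_block_queue.popleft()
    if block.block_hash is not None:
        # Evict: remove it from the cached-block map and reset its hash so
        # no future request can get a prefix-cache hit on this block.
        del kv_cache_manager.cached_block_hash_to_block[block.block_hash]
        block.block_hash = None
    return block
```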

## Example

In this example, we assume the block size is 4 (each block can cache 4 tokens), and we have 10 blocks in the KV-cache manager in total.

**Time 1: The cache is empty and a new request comes in.** We allocate 4 blocks. 3 of them are already full and cached. The fourth block is partially full with 2 of 4 tokens.

:::{image} /assets/design/v1/prefix_caching/example-time-1.png
:alt: Example Time 1
:::

**Time 3: Request 0 makes block 3 full and asks for a new block to keep decoding.** We cache block 3 and allocate block 4.

:::{image} /assets/design/v1/prefix_caching/example-time-3.png
:alt: Example Time 3
:::

**Time 4: Request 1 comes in with 14 prompt tokens, where the first 11 tokens are the same as Request 0.** We can see that only 2 blocks (8 tokens) hit the cache, because the 3rd block only matches 3 of 4 tokens.

:::{image} /assets/design/v1/prefix_caching/example-time-4.png
:alt: Example Time 4
:::

**Time 5: Request 0 is finished and freed.** Blocks 2, 3 and 4 are added to the free queue in reverse order (but blocks 2 and 3 are still cached). Blocks 0 and 1 are not added to the free queue because they are being used by Request 1.

:::{image} /assets/design/v1/prefix_caching/example-time-5.png
:alt: Example Time 5
:::

**Time 6: Request 1 is finished and freed.**

:::{image} /assets/design/v1/prefix_caching/example-time-6.png
:alt: Example Time 6
:::

**Time 7: Request 2 comes in with 33 prompt tokens, where the first 16 tokens are the same as Request 0.** Note that even though the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache-hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted).

:::{image} /assets/design/v1/prefix_caching/example-time-7.png
:alt: Example Time 7
:::
1 change: 1 addition & 0 deletions docs/source/features/quantization/index.md
@@ -12,6 +12,7 @@ supported_hardware
auto_awq
bnb
gguf
int4
int8
fp8
quantized_kvcache