Squashed commit of the following:
commit 5d5071c
Author: Lucas Wilkinson <[email protected]>
Date:   Sat Feb 1 01:13:23 2025 +0000

    reduce split kv amount

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 5fe1d1d
Author: Lucas Wilkinson <[email protected]>
Date:   Sat Feb 1 00:56:45 2025 +0000

    format

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 0d66687
Author: Simon Mo <[email protected]>
Date:   Fri Jan 31 16:39:19 2025 -0800

    Update loader.py

    Co-authored-by: Michael Goin <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 5002734
Author: Lucas Wilkinson <[email protected]>
Date:   Sat Feb 1 00:14:14 2025 +0000

    simplification

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit fac827f
Merge: db2c583 44bbca7
Author: Lucas Wilkinson <[email protected]>
Date:   Sat Feb 1 00:09:36 2025 +0000

    Merge remote-tracking branch 'origin/main' into mla-fp8

commit db2c583
Author: Lucas Wilkinson <[email protected]>
Date:   Sat Feb 1 00:06:10 2025 +0000

    filter compressed tensor models better

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit e144da8
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 18:41:35 2025 -0500

    Update vllm/model_executor/model_loader/loader.py

    Co-authored-by: Simon Mo <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 1621381
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 18:41:22 2025 -0500

    Update vllm/model_executor/model_loader/loader.py

    Co-authored-by: Simon Mo <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 9829fae
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 23:40:12 2025 +0000

    misc

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 44bbca7
Author: Brian Dellabetta <[email protected]>
Date:   Fri Jan 31 17:38:48 2025 -0600

    [Doc] int4 w4a16 example (vllm-project#12585)

    Based on a request by @mgoin, with @kylesayrs we have added an example
    doc for int4 w4a16 quantization, following the pre-existing int8 w8a8
    quantization example and the example available in
    [`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py)

    FIX #n/a (no issue created)

    @kylesayrs and I have discussed a couple additional improvements for the
    quantization docs. We will revisit at a later date, possibly including:
    - A section for "choosing the correct quantization scheme/ compression
    technique"
    - Additional vision or audio calibration datasets

    ---------

    Signed-off-by: Brian Dellabetta <[email protected]>
    Co-authored-by: Michael Goin <[email protected]>

commit 60808bd
Author: Harry Mellor <[email protected]>
Date:   Fri Jan 31 23:38:35 2025 +0000

    [Doc] Improve installation signposting (vllm-project#12575)

    - Make device tab names more explicit
    - Add comprehensive list of devices to
    https://docs.vllm.ai/en/latest/getting_started/installation/index.html
    - Add `attention` blocks to the intro of all devices that don't have
    pre-built wheels/images

    ---------

    Signed-off-by: Harry Mellor <[email protected]>

commit fc54214
Author: Ryan Nguyen <[email protected]>
Date:   Fri Jan 31 18:37:30 2025 -0500

    [Feature] Fix guided decoding blocking bitmask memcpy (vllm-project#12563)

    **[Guided decoding performance optimization]** Sending the guided
    decoding bitmask in xgrammar to the GPU
    (`self.token_bitmask.to(scores.device)`) is a blocking operation that
    prevents the CPU from pre-launching the sampler kernels. The CPU waits
    until decode is complete, then copies the bitmask over. This PR changes
    the operation to an async copy by setting `non_blocking=True`.
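
    A minimal sketch of the pattern (illustrative only, not the exact vLLM code;
    tensor names and shapes are assumptions):

    ```python
    import torch

    # Keep the bitmask in pinned host memory so the host-to-device copy can be
    # truly asynchronous and overlap with subsequent kernel launches.
    token_bitmask = torch.zeros(1, 4096, dtype=torch.int32).pin_memory()
    scores = torch.randn(1, 4096, device="cuda")

    # Old (blocking): the CPU stalls on the copy before launching sampling kernels.
    # mask_gpu = token_bitmask.to(scores.device)

    # New (async): the copy is enqueued on the current CUDA stream and the CPU
    # immediately continues launching the sampling kernels.
    mask_gpu = token_bitmask.to(scores.device, non_blocking=True)
    ```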

    (Current) The CPU is blocked on a `cudaStreamSynchronize` and only
    launches the sampling kernels after the bitmask has been applied. Below is the
    Nsys profile for one decode phase from Llama 3.1 8B.

    ![image](https://github.com/user-attachments/assets/8997eae1-b822-4f52-beb8-ef19a7c6b824)

    With the optimization, this is no longer the case:

    ![image](https://github.com/user-attachments/assets/6d5ea83f-f169-4f98-a8c1-41c719b3e1e7)

    ---------

    Signed-off-by: Ryan N <[email protected]>

commit eb5741a
Author: Tyler Michael Smith <[email protected]>
Date:   Fri Jan 31 18:29:11 2025 -0500

    [Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (vllm-project#12587)

    Integrates the block-quantized kernels introduced in
    vllm-project#11868 for use in linear
    layers.

    Signed-off-by: Tyler Michael Smith <[email protected]>

commit 145c2ff
Author: Robert Shaw <[email protected]>
Date:   Fri Jan 31 18:28:47 2025 -0500

    [Bugfix] Revert MoE Triton Config Default (vllm-project#12629)

    SUMMARY:
    * previous PR for pulling in block configs also changed defaults
    (https://github.com/vllm-project/vllm/pull/11589/files) for FP8
    * this broke L4 MoE since there was not enough SHM for the default
    configuration
    * this reverts the non-block example to the default

    Signed-off-by: [email protected] <[email protected]>

commit 415f194
Author: Kevin H. Luu <[email protected]>
Date:   Fri Jan 31 13:39:36 2025 -0800

    [release] Add input step to ask for Release version (vllm-project#12631)

    Instead of having to create a new build with the release version passed
    in as an env var.

commit 4251506
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 21:26:13 2025 +0000

    fixes

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit c9d72cb
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 21:17:23 2025 +0000

    more cleanup

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 3cdd2ce
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 21:16:42 2025 +0000

    cleanup

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 89003c4
Author: Chen Zhang <[email protected]>
Date:   Sat Feb 1 05:13:04 2025 +0800

    [v1][Bugfix] Add extra_keys to block_hash for prefix caching (vllm-project#12603)

    This PR adds an extra key to the block hash, so that two blocks with the
    same tokens but different extra_keys in their parent blocks get different
    hash values. For example, it generates different hash values for the
    second block of the following two requests:
    ```python
    request1 = make_request(
        request_id=0,
        prompt_token_ids=[_ for _ in range(6)],
        mm_positions=[{
            "offset": 0,
            "length": 3
        }, {
            "offset": 3,
            "length": 3
        }],
        mm_hashes=["hash1", "hash2"],
    )
    request2 = make_request(
        request_id=1,
        prompt_token_ids=[_ for _ in range(6)],
        mm_positions=[{
            "offset": 0,
            "length": 3
        }, {
            "offset": 3,
            "length": 3
        }],
        mm_hashes=["hash3", "hash2"],
    )
    ```

    ---------

    Signed-off-by: Chen Zhang <[email protected]>

commit f51cbe0
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 21:04:22 2025 +0000

    review comments

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 3d12a04
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 20:45:14 2025 +0000

    working but messy

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 60bcef0
Author: Cody Yu <[email protected]>
Date:   Fri Jan 31 12:30:46 2025 -0800

    [Docs][V1] Prefix caching design (vllm-project#12598)

    - Create v1 design document section in docs.
    - Add prefix caching design doc.

    @WoosukKwon @ywang96

    ---------

    Signed-off-by: Cody Yu <[email protected]>

commit 847f883
Author: Cody Yu <[email protected]>
Date:   Fri Jan 31 12:30:33 2025 -0800

    [Git] Automatically sign-off commits (vllm-project#12595)

    It's very annoying when I forget to add `-s` in `git commit` to
    sign off, because I then need to `git rebase HEAD~1 --signoff` and `git
    push -f` to fix the DCO. This PR adds a hook to sign off commits
    automatically when `-s` is missing to solve this problem. The only
    change on the user side is that users now have to install 2 hooks, so
    instead of just

    ```
    pre-commit install
    ```

    Now we need to

    ```
    pre-commit install --hook-type pre-commit --hook-type commit-msg
    ```

    Note that even if users still only install the pre-commit hook, they
    won't get any error in `git commit`. Just the sign-off hook won't run.

    cc @hmellor @youkaichao

    ---------

    Signed-off-by: Cody Yu <[email protected]>

commit 325f679
Author: Robert Shaw <[email protected]>
Date:   Fri Jan 31 15:06:39 2025 -0500

    [BugFix] Fix Torch.Compile For DeepSeek (vllm-project#12594)

    Co-authored-by: simon-mo <[email protected]>

commit 548ec44
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 19:13:22 2025 +0000

    simon changes

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit a57cd3d
Merge: 076cbe5 cabaf4e
Author: simon-mo <[email protected]>
Date:   Fri Jan 31 07:52:26 2025 +0000

    Merge branch 'main' of github.com:vllm-project/vllm into mla-fp8

commit 076cbe5
Merge: 0ccbcce a1fc18c
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 23:31:41 2025 -0500

    Merge branch 'main' into mla-fp8

commit 0ccbcce
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 04:29:17 2025 +0000

    deepseek v3 support

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 645622c
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 03:08:36 2025 +0000

    cleanup

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 2d61054
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 03:03:07 2025 +0000

    cleanup

    Co-authored-by: Alexander Matveev <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit f2b2500
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 02:47:05 2025 +0000

    Fix TP > 1 cuda graphs

    Co-authored-by: Alexander Matveev <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 433322b
Author: Lucas Wilkinson <[email protected]>
Date:   Fri Jan 31 02:26:11 2025 +0000

    Revert "add cuda graph support"

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 31c34bf
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 23:06:09 2025 +0000

    ci fix

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 54ba87d
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 21:23:09 2025 +0000

    add cuda graph support

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 5afc1bf
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 20:58:53 2025 +0000

    fix mypy

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit cfb2d26
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 19:42:36 2025 +0000

    fix mypy

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 37e39f4
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 18:04:58 2025 +0000

    fix failing test

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 0881475
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 17:18:55 2025 +0000

    disable MLA for v3 for now

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 4a46014
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 11:12:48 2025 -0500

    Update vllm/attention/backends/mla/utils.py

    Co-authored-by: Tyler Michael Smith <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 09d814c
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 15:11:58 2025 +0000

    review comments

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 8bdc14a
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 14:09:46 2025 +0000

    review comments

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit d27826d
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 08:51:42 2025 -0500

    Update vllm/config.py

    Co-authored-by: Zhuohan Li <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 7487429
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 04:00:26 2025 +0000

    renaming for consistency

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 634eee6
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 03:52:59 2025 +0000

    review comments

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 31b802c
Author: Lucas Wilkinson <[email protected]>
Date:   Wed Jan 29 22:51:37 2025 -0500

    Update vllm/attention/backends/mla/utils.py

    Co-authored-by: Michael Goin <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 068e672
Author: Lucas Wilkinson <[email protected]>
Date:   Wed Jan 29 22:46:43 2025 -0500

    Update utils.py

    Co-authored-by: Michael Goin <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>

commit f2cac91
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 03:11:43 2025 +0000

    more cleanups

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit c34e5ca
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 03:02:58 2025 +0000

    fix VLLM_MLA_PERFORM_MATRIX_ABSORPTION=0

    Signed-off-by: Lucas Wilkinson <[email protected]>

commit 27ad92c
Author: Lucas Wilkinson <[email protected]>
Date:   Thu Jan 30 02:29:40 2025 +0000

    squashed commits

    Co-authored-by: Woosuk Kwon <[email protected]>
    Co-authored-by: simon-mo <[email protected]>
    Signed-off-by: Lucas Wilkinson <[email protected]>
mawong-amd committed Feb 1, 2025
1 parent d47b834 commit 1d8af93
Showing 51 changed files with 1,354 additions and 245 deletions.
9 changes: 7 additions & 2 deletions .buildkite/release-pipeline.yaml
@@ -56,6 +56,11 @@ steps:
env:
DOCKER_BUILDKIT: "1"

- input: "Provide Release version here"
fields:
- text: "What is the release version?"
key: "release-version"

- block: "Build CPU release image"
key: block-cpu-release-image-build
depends_on: ~
@@ -66,7 +71,7 @@
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$RELEASE_VERSION --progress plain -f Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$RELEASE_VERSION"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --progress plain -f Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
env:
DOCKER_BUILDKIT: "1"
13 changes: 13 additions & 0 deletions .pre-commit-config.yaml
@@ -85,9 +85,22 @@ repos:
entry: tools/png-lint.sh
language: script
types: [png]
- id: signoff-commit
name: Sign-off Commit
entry: bash
args:
- -c
- |
if ! grep -q "^Signed-off-by: $(git config user.name) <$(git config user.email)>" .git/COMMIT_EDITMSG; then
printf "\nSigned-off-by: $(git config user.name) <$(git config user.email)>\n" >> .git/COMMIT_EDITMSG
fi
language: system
verbose: true
stages: [commit-msg]
- id: suggestion
name: Suggestion
entry: bash -c 'echo "To bypass pre-commit hooks, add --no-verify to git commit."'
language: system
verbose: true
pass_filenames: false

1 change: 1 addition & 0 deletions csrc/ops.h
@@ -156,6 +156,7 @@ torch::Tensor ggml_mul_mat_a8(torch::Tensor W, torch::Tensor X, int64_t type,

#ifndef USE_ROCM
bool cutlass_scaled_mm_supports_fp8(int64_t cuda_device_capability);
bool cutlass_scaled_mm_supports_block_fp8(int64_t cuda_device_capability);

void cutlass_scaled_mm(torch::Tensor& out, torch::Tensor const& a,
torch::Tensor const& b, torch::Tensor const& a_scales,
8 changes: 7 additions & 1 deletion csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu
@@ -58,7 +58,13 @@ void cutlass_scaled_mm_sm90(torch::Tensor& c, torch::Tensor const& a,

vllm::cutlass_scaled_mm_blockwise_sm90_fp8(c, a, b, a_scales, b_scales);
} else {
TORCH_CHECK(false, "Unsupported scale group shapes for CUTLASS 3.x GEMM");
TORCH_CHECK(false,
"Unsupported scale group shapes for CUTLASS 3.x GEMM.\n "
"a_scale_group_shape must be [1, 128], got: [",
a_scale_group_shape[0], ", ", a_scale_group_shape[1],
"]\n"
"b_scale_group_shape must be [128, 128], got: [",
b_scale_group_shape[0], ", ", b_scale_group_shape[1], "]");
}
}

15 changes: 14 additions & 1 deletion csrc/quantization/cutlass_w8a8/scaled_mm_entry.cu
@@ -81,6 +81,19 @@ bool cutlass_scaled_mm_supports_fp8(int64_t cuda_device_capability) {
return false;
}

bool cutlass_scaled_mm_supports_block_fp8(int64_t cuda_device_capability) {
// CUTLASS block-quantized FP8 kernels need at least CUDA 12.0
// and at least SM90 (Hopper)

#if defined CUDA_VERSION
if (cuda_device_capability >= 90) {
return CUDA_VERSION >= 12000;
}
#endif

return false;
}

void cutlass_scaled_mm(torch::Tensor& c, torch::Tensor const& a,
torch::Tensor const& b, torch::Tensor const& a_scales,
torch::Tensor const& b_scales,
@@ -212,4 +225,4 @@ void cutlass_scaled_mm_azp(torch::Tensor& c, torch::Tensor const& a,
"No compiled cutlass_scaled_mm_azp for a compute capability less than "
"CUDA device capability: ",
version_num);
}
}
7 changes: 7 additions & 0 deletions csrc/torch_bindings.cpp
@@ -330,6 +330,13 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
ops.def("cutlass_scaled_mm_supports_fp8(int cuda_device_capability) -> bool");
ops.impl("cutlass_scaled_mm_supports_fp8", &cutlass_scaled_mm_supports_fp8);

// Check if cutlass scaled_mm supports block quantization (used by DeepSeekV3)
ops.def(
"cutlass_scaled_mm_supports_block_fp8(int cuda_device_capability) -> "
"bool");
ops.impl("cutlass_scaled_mm_supports_block_fp8",
         &cutlass_scaled_mm_supports_block_fp8);

// Check if cutlass sparse scaled_mm is supported for CUDA devices of the
// given capability
ops.def(
2 changes: 1 addition & 1 deletion docs/source/contributing/overview.md
@@ -26,7 +26,7 @@ Check out the [building from source](#build-from-source) documentation for detai
pip install -r requirements-dev.txt

# Linting, formatting and static type checking
pre-commit install
pre-commit install --hook-type pre-commit --hook-type commit-msg

# You can manually run pre-commit with
pre-commit run --all-files
228 changes: 228 additions & 0 deletions docs/source/design/v1/prefix_caching.md
@@ -0,0 +1,228 @@
# Automatic Prefix Caching

Prefix caching kv-cache blocks is a popular optimization in LLM inference to avoid redundant prompt computations. The core idea is simple – we cache the kv-cache blocks of processed requests, and reuse these blocks when a new request comes in with the same prefix as previous requests. Since prefix caching is almost a free lunch and won’t change model outputs, it has been widely used by many public endpoints (e.g., OpenAI, Anthropic) and most open-source LLM inference frameworks (e.g., SGLang).

While there are many ways to implement prefix caching, vLLM chooses a hash-based approach. Specifically, we hash each kv-cache block by the tokens in the block and the tokens in the prefix before the block:

```text
Block 1 Block 2 Block 3
[A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->|
Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
```

In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the block hash as `hash(tuple[components])`, where the components are:

* Parent hash value: The hash value of the parent block.
* Block tokens: A tuple of tokens in this block. The reason to include the exact tokens is to reduce potential hash collisions.
* Extra hashes: Other values required to make this block unique, such as LoRA IDs and multi-modality input hashes (see the example below).

Note 1: We only cache full blocks.

Note 2: The above hash key structure is not 100% collision free. Theoretically it’s still possible for different prefix tokens to have the same hash value, but this should be nearly impossible in practice. Of course, contributions are welcome if you have an awesome idea to eliminate collisions entirely.
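
As a rough sketch (the function and argument names here are illustrative assumptions, not vLLM's actual API), the hash of each full block can be computed by chaining the parent hash with the block's tokens and extra keys:

```python
from typing import Any, Optional, Tuple

def hash_block(parent_hash: Optional[int],
               block_tokens: Tuple[Any, ...],
               extra_keys: Tuple[Any, ...] = ()) -> int:
    # Combine the parent block's hash, the exact tokens in this block,
    # and any extra keys (LoRA ID, image hashes, ...) into a single hash.
    return hash((parent_hash, block_tokens, extra_keys))

# Chaining the hashes over the three blocks in the example above:
h1 = hash_block(None, ("A", "gentle", "breeze", "stirred"))
h2 = hash_block(h1, ("the", "leaves", "as", "children"))
h3 = hash_block(h2, ("laughed", "in", "the", "distance"))
```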

**A hashing example with multi-modality inputs**
In this example, we illustrate how prefix caching works with multi-modality inputs (e.g., images). Assume we have a request with the following messages:

```text
messages = [
{"role": "user",
"content": [
{"type": "text",
"text": "What's in this image?"
},
{"type": "image_url",
"image_url": {"url": image_url},
},
]},
]
```

It will become the following prompt:

```text
Prompt:
<s>[INST]What's in this image?\n[IMG][/INST]
Tokenized prompt:
[1, 3, 7493, 1681, 1294, 1593, 3937, 9551, 10, 4]
Prompt with placeholders (<P>):
[1, 3, 7493, 1681, 1294, 1593, 3937, 9551, <P>, <P>, ..., <P>, 4]
```

As we can see, after tokenization the `[IMG]` is replaced by a sequence of placeholder tokens, and these placeholders are replaced by image embeddings during prefill. The challenge for prefix caching to support this case is that we need to differentiate the images from the placeholders. To address this problem, we include as an extra key the image hash generated by the frontend image processor. For example, the hashes of the blocks in the above prompt would be (assuming block size 16 and 41 placeholder tokens):

```text
Block 0
Parent hash: None
Token IDs: 1, 3, 7493, 1681, 1294, 1593, 3937, 9551, <p>, ..., <p>
Extra hash: <image hash>
Block 1
Parent hash: Block 0 hash
Token IDs: <p>, ..., <p>
Extra hash: <image hash>
Block 2
Parent hash: Block 1 hash
Token IDs: <p>, ..., <p>
Extra hash: <image hash>
Block 3
Parent hash: Block 2 hash
Token IDs: <p>, ..., <p>, 4
Extra hash: <image hash>
```

In the rest of this document, we first introduce the data structure used for prefix caching in vLLM v1, followed by the prefix caching workflow of the major KV cache operations (e.g., allocate, append, free, eviction). Finally, we use an example to illustrate the end-to-end prefix caching workflow.

## Data Structure

The prefix caching in vLLM v1 is implemented in the KV cache manager. The basic building block is the “Block” data class (simplified):

```python
class KVCacheBlock:
# The block ID (immutable)
block_id: int
# The block hash (will be assigned when the block is full,
# and will be reset when the block is evicted).
block_hash: BlockHashType
# The number of requests using this block now.
ref_cnt: int

# The pointers to form a doubly linked list for the free queue.
prev_free_block: Optional["KVCacheBlock"] = None
next_free_block: Optional["KVCacheBlock"] = None
```

There are two design points to highlight:

1. We allocate all KVCacheBlock objects as a block pool when initializing the KV cache manager. This avoids Python object creation overhead and makes it easy to track all blocks at all times.
2. We introduce doubly linked list pointers directly in the KVCacheBlock, so that we can construct the free queue from the blocks themselves. This gives us two benefits:
    1. We get O(1) complexity for moving elements in the middle of the queue to the tail.
    2. We avoid introducing another Python queue (e.g., `deque`), which would wrap the elements.
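
A minimal sketch of such a free queue (names are assumptions, simplified), illustrating why the doubly linked pointers give O(1) removal from the middle:

```python
class FreeKVCacheBlockQueue:
    """Simplified sketch: a doubly linked list threaded through the
    prev_free_block/next_free_block pointers of KVCacheBlock above."""

    def __init__(self, blocks: list):
        self.head = blocks[0] if blocks else None
        self.tail = blocks[-1] if blocks else None
        for prev, nxt in zip(blocks, blocks[1:]):
            prev.next_free_block, nxt.prev_free_block = nxt, prev

    def popleft(self):
        # Pop the least recently used block from the head of the queue.
        block = self.head
        self.remove(block)
        return block

    def remove(self, block):
        # O(1) unlink, even for a block in the middle of the queue
        # (e.g., a cached block "touched" by a new request).
        if block.prev_free_block is None:
            self.head = block.next_free_block
        else:
            block.prev_free_block.next_free_block = block.next_free_block
        if block.next_free_block is None:
            self.tail = block.prev_free_block
        else:
            block.next_free_block.prev_free_block = block.prev_free_block
        block.prev_free_block = block.next_free_block = None

    def append(self, block):
        # Freed blocks are pushed to the tail of the queue.
        block.prev_free_block, block.next_free_block = self.tail, None
        if self.tail is None:
            self.head = block
        else:
            self.tail.next_free_block = block
        self.tail = block
```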

As a result, we will have the following components when the KV cache manager is initialized:

:::{image} /assets/design/v1/prefix_caching/overview.png
:alt: Component Overview
:::

* Block Pool: A list of KVCacheBlock.
* Free Block Queue: Stores only the pointers to the head and tail blocks; the blocks themselves are linked via their prev/next pointers.
* Cache blocks: Mapping from hash key to block IDs.
* Request blocks: Mapping from request ID to allocated block IDs.
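
A sketch of these components at initialization (simplified; the constructor and attribute names are assumptions for illustration):

```python
from collections import defaultdict

class KVCacheManager:
    """Illustrative sketch of the components above, not the actual class."""

    def __init__(self, num_blocks: int):
        # Block Pool: all blocks are pre-allocated up front.
        self.block_pool = [
            KVCacheBlock(block_id=i, block_hash=None, ref_cnt=0)
            for i in range(num_blocks)
        ]
        # Free Block Queue: threaded through the blocks' prev/next pointers.
        self.free_block_queue = FreeKVCacheBlockQueue(self.block_pool)
        # Cache blocks: block hash -> cached block.
        self.cached_block_hash_to_block = {}
        # Request blocks: request ID -> allocated blocks (append-only).
        self.req_to_blocks = defaultdict(list)
```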

## Operations

### Block Allocation

**New request:** Workflow for the scheduler to schedule a new request with KV cache block allocation:

1. The scheduler calls `kv_cache_manager.get_computed_blocks()` to get a sequence of blocks that have already been computed. This is done by hashing the prompt tokens in the request and looking up Cache Blocks.
2. The scheduler calls `kv_cache_manager.allocate_slots()`. It does the following steps:
    1. Compute the number of new blocks required, and return if there are not enough blocks to allocate.
    2. “Touch” the computed blocks. This increases the reference count of each computed block by one and removes the block from the free queue if it wasn’t being used by other requests, so that these computed blocks won’t be evicted. See the example in the next section for illustration.
    3. Allocate new blocks by popping blocks from the head of the free queue. If the head block is a cached block, this also “evicts” it so that no other requests can reuse it from now on.
    4. If an allocated block is already full of tokens, we immediately add it to Cache Blocks, so that the block can be reused by other requests in the same batch.
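
Roughly, in pseudocode (simplified; attribute names and the block-size handling are assumptions for illustration):

```python
def schedule_new_request(request, kv_cache_manager, block_size: int = 16):
    # Step 1: prefix-cache lookup over the full blocks of the prompt.
    computed_blocks = kv_cache_manager.get_computed_blocks(request)
    num_computed_tokens = len(computed_blocks) * block_size

    # Step 2: allocate slots for the remaining (uncached) prompt tokens.
    new_blocks = kv_cache_manager.allocate_slots(
        request,
        num_tokens=len(request.prompt_token_ids) - num_computed_tokens,
        computed_blocks=computed_blocks,
    )
    if new_blocks is None:
        # Not enough free blocks: the request cannot be scheduled this step.
        return None
    return computed_blocks + new_blocks
```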

**Running request:** Workflow for the scheduler to schedule a running request with KV cache block allocation:

1. The scheduler calls `kv_cache_manager.append_slots()`. It does the following steps:
    1. Compute the number of new blocks required, and return if there are not enough blocks to allocate.
    2. Allocate new blocks by popping blocks from the head of the free queue. If the head block is a cached block, this also “evicts” it so that no other requests can reuse it from now on.
    3. Append token IDs to the slots in the existing blocks as well as the new blocks. If a block becomes full, we add it to Cache Blocks to cache it.

**Duplicated blocks**
Assuming the block size is 4 and you send a request (Request 1) with prompt ABCDEF and decoding length 3:

```text
Prompt: [A, B, C, D, E, F]
Output: [G, H, I]
Time 0:
Tokens: [A, B, C, D, E, F, G]
Block Table: [0 (ABCD), 1 (EFG)]
Cache Blocks: 0
Time 1:
Tokens: [A, B, C, D, E, F, G, H]
Block Table: [0 (ABCD), 1 (EFGH)]
Cache Blocks: 0, 1
Time 2:
Tokens: [A, B, C, D, E, F, G, H, I]
Block Table: [0 (ABCD), 1 (EFGH), 2 (I)]
Cache Blocks: 0, 1
```

Now blocks 0 and 1 are cached, and we send the same request again (Request 2) with greedy sampling, so that it will produce exactly the same outputs as Request 1:

```text
Prompt: [A, B, C, D, E, F]
Output: [G, H, I]
Time 0:
Tokens: [A, B, C, D, E, F, G]
Block Table: [0 (ABCD), 3 (EFG)]
Cache Blocks: 0, 1
Time 1:
Tokens: [A, B, C, D, E, F, G, H]
Block Table: [0 (ABCD), 3 (EFGH)]
Cache Blocks: 0, 1, 3
```

As can be seen, block 3 is a new full block and is cached. However, it is redundant with block 1, meaning that we cached the same block twice. In v0, when we detect that block 3 is a duplicate, we free block 3 and let Request 2 use block 1 instead, so its block table becomes `[0, 1]` at Time 1. However, the block table in vLLM v1 is append-only, meaning that changing the block table from `[0, 3]` to `[0, 1]` is not allowed. As a result, we will have duplicated blocks for the hash key E-H. This duplication is eliminated when the request is freed.

### Free

When a request is finished, we free all its blocks if no other requests are using them (reference count = 0). In this example, we free Request 1 and blocks 2, 3, 4, and 8 associated with it. We can see that the freed blocks are added to the tail of the free queue in *reverse* order. This is because the last block of a request hashes more tokens and is less likely to be reused by other requests, so it should be evicted first.

:::{image} /assets/design/v1/prefix_caching/free.png
:alt: Free Queue after Free a Request
:::
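
A minimal sketch of the free path under the structures sketched above (names are assumptions):

```python
def free_request(request, kv_cache_manager):
    blocks = kv_cache_manager.req_to_blocks.pop(request.request_id)
    # Free in reverse order: the last block of a request hashes the most
    # tokens and is least likely to be reused, so it is appended first and
    # will therefore sit closer to the head of the queue and be evicted first.
    for block in reversed(blocks):
        block.ref_cnt -= 1
        if block.ref_cnt == 0:
            kv_cache_manager.free_block_queue.append(block)
```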

### Eviction (LRU)

When the head block (least recently used block) of the free queue is cached, we have to evict the block to prevent it from being used by other requests. Specifically, eviction involves the following steps:

1. Pop the block from the head of the free queue. This is the LRU block to be evicted.
2. Remove the block ID from Cache Blocks.
3. Remove the block hash.
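
Sketched with the assumed names used above:

```python
def evict_lru_block(kv_cache_manager):
    # Pop the LRU block from the head of the free queue.
    block = kv_cache_manager.free_block_queue.popleft()
    if block.block_hash is not None:
        # Evict: remove it from the cached-block map and reset its hash so
        # no future request can get a prefix-cache hit on this block.
        del kv_cache_manager.cached_block_hash_to_block[block.block_hash]
        block.block_hash = None
    return block
```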

## Example

In this example, we assume the block size is 4 (each block can cache 4 tokens), and we have 10 blocks in the KV-cache manager in total.

**Time 1: The cache is empty and a new request comes in.** We allocate 4 blocks. 3 of them are already full and cached. The fourth block is partially full with 2 of 4 tokens.

:::{image} /assets/design/v1/prefix_caching/example-time-1.png
:alt: Example Time 1
:::

**Time 3: Request 0 makes block 3 full and asks for a new block to keep decoding.** We cache block 3 and allocate block 4.

:::{image} /assets/design/v1/prefix_caching/example-time-3.png
:alt: Example Time 3
:::

**Time 4: Request 1 comes in with 14 prompt tokens, where the first 11 tokens are the same as Request 0.** We can see that only 2 blocks (8 tokens) hit the cache, because the 3rd block only matches 3 of 4 tokens.

:::{image} /assets/design/v1/prefix_caching/example-time-4.png
:alt: Example Time 4
:::

**Time 5: Request 0 is finished and freed.** Blocks 2, 3 and 4 are added to the free queue in reverse order (but blocks 2 and 3 are still cached). Blocks 0 and 1 are not added to the free queue because they are being used by Request 1.

:::{image} /assets/design/v1/prefix_caching/example-time-5.png
:alt: Example Time 5
:::

**Time 6: Request 1 is finished and freed.**

:::{image} /assets/design/v1/prefix_caching/example-time-6.png
:alt: Example Time 6
:::

**Time 7: Request 2 comes in with 33 prompt tokens, where the first 16 tokens are the same as Request 0.** Note that even though the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache-hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted).

:::{image} /assets/design/v1/prefix_caching/example-time-7.png
:alt: Example Time 7
:::
1 change: 1 addition & 0 deletions docs/source/features/quantization/index.md
@@ -12,6 +12,7 @@ supported_hardware
auto_awq
bnb
gguf
int4
int8
fp8
quantized_kvcache