[Feature][Frontend] Add KVTransferParams for disaggregated prefill feature #12957
base: main
Conversation
Signed-off-by: Shangming Cai <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs. Once the PR is approved and ready to go, your PR reviewer(s) can run full CI to test the changes comprehensively before merging.
Force-pushed from 6cdcc91 to a45736e.
@KuntaiDu Hello, this is the initial PR for the XpYd design as we discussed before the Chinese New Year. I think it is ready for review :)
Force-pushed from 70db613 to 1f194db.
@youkaichao I have verified that the distributed-tests-4-gpus CI failure can be solved by #12959. However, the image created by docker-build-image does not include the latest fix, so re-triggering CI without a rebase does not resolve a failure that existed on main when this PR was created but was fixed later. Also, ever since DCO was introduced, I have found that rebasing has become difficult. Is there a better way to rebase or merge without tripping DCO, so that we can fix CI failures introduced by stale main-branch code that happened to exist when this PR was created?
@@ -240,6 +240,11 @@ class ChatCompletionRequest(OpenAIBaseModel):
    parallel_tool_calls: Optional[bool] = False
    user: Optional[str] = None

    # kv transfer params
    prefix_prompt_ids: Optional[List[int]] = None
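To illustrate how a caller (such as the global proxy) might use the new field, here is a hedged sketch of a request payload; the model name and token ids are placeholders, not values from the PR:

```python
# Illustrative payload a global proxy might send to a vLLM server once the
# field above is in place. "example-model" and the token ids are placeholders.
payload = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Hello"}],
    # token ids of an already-prefilled prefix (hypothetical values)
    "prefix_prompt_ids": [1, 15043, 29991],
}
```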
Since it is `List[int]`, it seems to require the orchestrator to run per-request tokenization. Will that limit the scalability of the orchestrator?
In practice, if we were only implementing the XpYd scheme, the orchestrator would only need to handle request routing between p-nodes and d-nodes, along with managing the keys for KV cache transmission. However, to ensure compatibility with prefix caching features in the future and to let the proxy maintain a prefix tree constructed from token ids for more content-aware disaggregated-prefill scheduling, this parameter is introduced to support broader KV cache transfer options and scheduling scenarios. In fact, it could be absent in the first XpYd implementation, since we assume `prefix_prompt_ids` should essentially match the `prompt_token_ids` in disaggregated prefill scenarios.
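The expectation that `prefix_prompt_ids` is a leading slice of (and usually equal to) the prompt's token ids could be checked with a small helper. This is a hypothetical sketch, not code from the PR:

```python
from typing import List, Optional

def prefix_matches_prompt(prompt_token_ids: List[int],
                          prefix_prompt_ids: Optional[List[int]]) -> bool:
    """Hypothetical check: in disaggregated prefill, prefix_prompt_ids
    should be a leading slice of the prompt's token ids (often the whole
    prompt). None means no prefix information was supplied."""
    if prefix_prompt_ids is None:
        return True
    return prompt_token_ids[:len(prefix_prompt_ids)] == prefix_prompt_ids
```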
    priority=request.priority,
)
# Note(shangming): v1 does not support KV transfer yet.
if isinstance(self.engine_client, AsyncLLM):
Let me discuss with other people and see how to deal with `KVTransferParams` in v1.
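A guard along the lines of the `isinstance` check in the diff could reject KV transfer params on the v1 engine; here `AsyncLLM` is a stub standing in for the real class, and the function name is hypothetical:

```python
class AsyncLLM:
    """Stand-in for the v1 engine client class (not imported from vllm)."""

def reject_kv_transfer_on_v1(engine_client: object,
                             kv_transfer_params: object) -> None:
    """Raise if KV transfer params are supplied to the v1 engine,
    mirroring the isinstance check discussed above (sketch only)."""
    if kv_transfer_params is not None and isinstance(engine_client, AsyncLLM):
        raise ValueError("KV transfer is not supported by the v1 engine yet.")
```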
@@ -0,0 +1,71 @@
# SPDX-License-Identifier: Apache-2.0
Just want to learn more about the rationale behind isolating `KVTransferParams` in its own file instead of adding it to an existing one.
Good question. Initially, I planned to add it to `sampling_params.py`, but since the KV transfer process is unrelated to sampling, combining them might make the code harder to understand. I referenced the `pooling_params.py` implementation and used a separate file for this class to improve readability. Regarding the location: placing it in the root directory (like the other params) works, but moving it to the `vllm/distributed/kv_transfer` directory is also an option. Let me know which approach you prefer!
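For reference, a standalone params file in the spirit of `pooling_params.py` might look roughly like this; only `prefix_prompt_ids` appears in the diff, and the `clone` helper is an assumption about the shape such a class might take:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class KVTransferParams:
    """Parameters for KV cache transfer in disaggregated prefill.
    Sketch only: fields beyond prefix_prompt_ids and the clone() helper
    are assumptions, not the PR's actual API."""
    prefix_prompt_ids: Optional[List[int]] = None

    def clone(self) -> "KVTransferParams":
        """Copy the list so per-request mutation cannot leak across requests."""
        ids = self.prefix_prompt_ids
        return KVTransferParams(
            prefix_prompt_ids=None if ids is None else list(ids))
```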
This pull request has merge conflicts that must be resolved before it can be merged.
This is the initial PR to implement XpYd as described in the design doc, which has gone through several rounds of discussions in the #feat-prefill-disaggregation slack channel and reached a preliminary consensus.
This PR mainly includes the changes related to `KVTransferParams`. In the XpYd implementation, we will use this parameter to coordinate metadata among all vLLM instances (prefill nodes and decode nodes), the KVStore Server, and the global proxy. With this param, the global proxy can attach the disaggregated-prefill metadata to the `CompletionRequest` or `ChatCompletionRequest`, and all kv-transfer-related modules can access it through the `model_input` in `model_runner.execute_model` and use these params to interact with the KVStore Server at the right place under the current 1P1D implementation.

Based on this initial PR, the Mooncake team will implement the `KVStoreConnector` in the next PR, which utilizes `KVTransferParams` and supports several third-party key-value stores acting as the `KVLookupBuffer`, such as MooncakeStore, Valkey, LMCache, etc. With `KVStoreConnector`, we can decouple the connections between the prefill nodes and the decode nodes, making the entire XpYd implementation simpler and more stable.

The metadata required in `KVTransferParams` will be provided by a global proxy (or we can call it the XpYd orchestration mechanism). Therefore, a global proxy interface will also be added in the next PR. The Mooncake team will implement a simple version based on round-robin. In the production stage, service providers and users can also implement corresponding global proxy strategies according to their needs.

CC list: @KuntaiDu, @youkaichao, @james0zan, @alogfans, @stmatengss, @zeroorhero, @pizhenwei