[Feature][Frontend] Add KVTransferParams for disaggregated prefill feature #12957
base: main
Conversation
Signed-off-by: Shangming Cai <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs. Once the PR is approved and ready to go, your PR reviewer(s) can run full CI to test the changes comprehensively before merging.
Force-pushed from 6cdcc91 to a45736e.
@KuntaiDu Hello, this is the initial PR for the XpYd design as we discussed before the Chinese New Year. I think it is ready for review :)
Force-pushed from 70db613 to 1f194db.
@youkaichao I have verified that the distributed-tests-4-gpus CI failure can be solved by #12959. However, the image created by docker-build-image does not include the latest fix, so re-triggering CI without a rebase does not resolve a failure that existed on main when this PR was created but was fixed later. Also, ever since DCO was introduced, I have found that rebasing has become difficult. Is there a better way to rebase or merge without tripping DCO, so that we can fix CI failures introduced by stale main-branch code that happened to exist when this PR was created?
@@ -240,6 +240,11 @@ class ChatCompletionRequest(OpenAIBaseModel):
    parallel_tool_calls: Optional[bool] = False
    user: Optional[str] = None

    # kv transfer params
    prefix_prompt_ids: Optional[List[int]] = None
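To illustrate how a caller (such as the global proxy) might use the new field, here is a hedged sketch of a request payload; the model name and token ids are placeholders, not values from the PR:

```python
# Illustrative payload a global proxy might send to a vLLM server once the
# field above is in place. "example-model" and the token ids are placeholders.
payload = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Hello"}],
    # token ids of an already-prefilled prefix (hypothetical values)
    "prefix_prompt_ids": [1, 15043, 29991],
}
```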
Since it is `List[int]`, it seems to require the orchestrator to run per-request tokenization. Will that limit the scalability of the orchestrator?
In practice, if we were only implementing the XpYd scheme, the orchestrator would only need to handle request routing between p-nodes and d-nodes, along with managing the keys for KV cache transmission. However, to ensure compatibility with prefix caching features in the future and to let the proxy maintain a prefix tree constructed from token ids for more content-aware disaggregated-prefill scheduling, this parameter is introduced to support broader KV cache transfer options and scheduling scenarios. In fact, it could be absent in the first XpYd implementation, since we assume `prefix_prompt_ids` should essentially match the `prompt_token_ids` in disaggregated prefill scenarios.
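The expectation that `prefix_prompt_ids` is a leading slice of (and usually equal to) the prompt's token ids could be checked with a small helper. This is a hypothetical sketch, not code from the PR:

```python
from typing import List, Optional

def prefix_matches_prompt(prompt_token_ids: List[int],
                          prefix_prompt_ids: Optional[List[int]]) -> bool:
    """Hypothetical check: in disaggregated prefill, prefix_prompt_ids
    should be a leading slice of the prompt's token ids (often the whole
    prompt). None means no prefix information was supplied."""
    if prefix_prompt_ids is None:
        return True
    return prompt_token_ids[:len(prefix_prompt_ids)] == prefix_prompt_ids
```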
    priority=request.priority,
)
# Note(shangming): v1 does not support KV transfer yet.
if isinstance(self.engine_client, AsyncLLM):
Let me discuss with other people and see how to deal with `KVTransferParams` in v1.
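A guard along the lines of the `isinstance` check in the diff could reject KV transfer params on the v1 engine; here `AsyncLLM` is a stub standing in for the real class, and the function name is hypothetical:

```python
class AsyncLLM:
    """Stand-in for the v1 engine client class (not imported from vllm)."""

def reject_kv_transfer_on_v1(engine_client: object,
                             kv_transfer_params: object) -> None:
    """Raise if KV transfer params are supplied to the v1 engine,
    mirroring the isinstance check discussed above (sketch only)."""
    if kv_transfer_params is not None and isinstance(engine_client, AsyncLLM):
        raise ValueError("KV transfer is not supported by the v1 engine yet.")
```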
@@ -0,0 +1,71 @@
# SPDX-License-Identifier: Apache-2.0
Just want to learn more about the rationale behind isolating `KVTransferParams` in its own file instead of adding it to an existing one.
Good question. Initially, I planned to add it to `sampling_params.py`, but since the KV transfer process is unrelated to sampling, combining them might make the code harder to understand. I referenced the `pooling_params.py` implementation and used a separate file for this class to improve readability. Regarding the location: placing it in the root directory (like the other params) works, but moving it to the `vllm/distributed/kv_transfer` directory is also an option. Let me know which approach you prefer!
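For reference, a standalone params file in the spirit of `pooling_params.py` might look roughly like this; only `prefix_prompt_ids` appears in the diff, and the `clone` helper is an assumption about the shape such a class might take:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class KVTransferParams:
    """Parameters for KV cache transfer in disaggregated prefill.
    Sketch only: fields beyond prefix_prompt_ids and the clone() helper
    are assumptions, not the PR's actual API."""
    prefix_prompt_ids: Optional[List[int]] = None

    def clone(self) -> "KVTransferParams":
        """Copy the list so per-request mutation cannot leak across requests."""
        ids = self.prefix_prompt_ids
        return KVTransferParams(
            prefix_prompt_ids=None if ids is None else list(ids))
```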
This pull request has merge conflicts that must be resolved before it can be merged.
This is the initial PR to implement XpYd as described in the design doc, which has gone through several rounds of discussions in the #feat-prefill-disaggregation slack channel and reached a preliminary consensus.
This PR mainly includes the changes related to `KVTransferParams`. In the XpYd implementation, we will use this parameter to coordinate metadata among all vLLM instances (prefill nodes and decode nodes), the KVStore Server, and the global proxy. With this param, the global proxy can attach the disaggregated-prefill metadata to the `CompletionRequest` or `ChatCompletionRequest`, and all kv-transfer-related modules can access it through the `model_input` in `model_runner.execute_model` and use these params to interact with the KVStore Server at the right place under the current 1P1D implementation.

Based on this initial PR, the Mooncake team will implement the `KVStoreConnector` in the next PR, which utilizes `KVTransferParams` and supports several third-party key-value stores acting as the `KVLookupBuffer`, such as MooncakeStore, Valkey, LMCache, etc. With `KVStoreConnector`, we can decouple the connections between the prefill nodes and the decode nodes, making the entire XpYd implementation simpler and more stable.

The metadata required in `KVTransferParams` will be provided by a global proxy (or we can call it the XpYd orchestration mechanism). Therefore, a global proxy interface will also be added in the next PR. The Mooncake team will implement a simple version based on round-robin. In the production stage, service providers and users can also implement corresponding global proxy strategies according to their needs.

CC list: @KuntaiDu, @youkaichao, @james0zan, @alogfans, @stmatengss, @zeroorhero, @pizhenwei