
[Feature] Are there plans to implement a prefill-decode split inference architecture? #1080

Closed
CSEEduanyu opened this issue Aug 13, 2024 · 2 comments

Comments

CSEEduanyu commented Aug 13, 2024

Motivation

Related work includes:

1. https://github.com/LLMServe/DistServe/tree/main
2. vllm-project/vllm#2809
3. Mooncake has shown that separating prefill and decode can improve throughput and significantly reduce costs for online services (a rough conceptual sketch of the idea follows below).

Are there any plans to support this?
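For context, below is a minimal conceptual sketch of what a prefill/decode split looks like. It is not tied to any particular serving framework's API; the `PrefillWorker`, `DecodeWorker`, and `KVCache` names are hypothetical. One worker runs the compute-bound prompt pass and produces the KV cache, which is then handed off to a separate worker that runs the memory-bandwidth-bound token-by-token decode.

```python
# Conceptual sketch of prefill/decode disaggregation (hypothetical names, no real model).
from dataclasses import dataclass


@dataclass
class KVCache:
    # In a real system this would hold per-layer key/value tensors;
    # here we only track the tokens it was built from.
    tokens: list


class PrefillWorker:
    def prefill(self, prompt_tokens: list) -> KVCache:
        # Attend over the whole prompt in one pass (compute-bound),
        # then hand the resulting KV cache off to a decode worker.
        return KVCache(tokens=list(prompt_tokens))


class DecodeWorker:
    def decode(self, cache: KVCache, max_new_tokens: int) -> list:
        # Generate one token at a time (memory-bandwidth-bound),
        # reusing and extending the transferred cache.
        generated = []
        for step in range(max_new_tokens):
            next_token = f"<tok{step}>"  # placeholder for a model forward pass
            cache.tokens.append(next_token)
            generated.append(next_token)
        return generated


if __name__ == "__main__":
    prompt = ["Hello", ",", "world"]
    cache = PrefillWorker().prefill(prompt)    # would run on a prefill node
    tokens = DecodeWorker().decode(cache, 4)   # would run on a separate decode node
    print(tokens)
```

In a real deployment the KV cache transfer between the two worker pools is the main engineering challenge; this sketch only illustrates the division of roles.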

Related resources

No response

zhyncs (Member) commented Aug 13, 2024

The feature you mentioned depends heavily on business requirements and real-world serving scenarios. It's a promising direction, as shown by projects such as https://github.com/kvcache-ai/Mooncake, and @Jeffwan might also be interested in it. While it's on our roadmap #634, given its priority and implementation complexity, the short-term ROI is not very high, so we'll keep tracking its progress. Thanks for your interest!

zhyncs (Member) commented Sep 22, 2024

We plan to implement this feature in Q4, so please stay tuned. Ref: #1487

zhyncs closed this as completed on Sep 22, 2024