The feature you mentioned depends heavily on business requirements and real-world scenarios. It's a promising area for improvement (see https://github.com/kvcache-ai/Mooncake), and @Jeffwan might also be interested in it. While it's on our roadmap (#634), its priority and implementation complexity mean the short-term ROI is limited, so for now we'll keep tracking its progress. Thanks for your interest!
Motivation
Related work includes:
1. https://github.com/LLMServe/DistServe/tree/main
2. vllm-project/vllm#2809
3. Mooncake has shown that separating the prefill and decode phases can improve throughput and yield significant cost savings for online services. Are there any plans to do this?
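To make the request concrete, here is a minimal toy sketch of the prefill/decode disaggregation idea referenced above. All names are hypothetical and the tensors are stand-ins; this is not the Mooncake or DistServe API, just an illustration of why the two phases can run on separate workers with a KV-cache transfer between them.

```python
# Toy sketch of prefill/decode disaggregation (illustrative only).

def prefill(prompt_tokens):
    # Compute-bound phase: process the whole prompt once and produce
    # the KV cache plus the first generated token.
    kv_cache = [(t, t) for t in prompt_tokens]  # stand-in for real K/V tensors
    first_token = len(prompt_tokens)            # stand-in for model output
    return kv_cache, first_token

def decode(kv_cache, first_token, max_new_tokens):
    # Memory-bound phase: generate one token at a time, appending to the
    # transferred KV cache. In a disaggregated setup this runs on a
    # separate decode worker, so the two phases don't contend for GPU time.
    out = [first_token]
    for _ in range(max_new_tokens - 1):
        nxt = out[-1] + 1                       # stand-in for sampling
        kv_cache.append((nxt, nxt))
        out.append(nxt)
    return out

# The prefill worker handles the prompt, then ships the KV cache
# (e.g. over RDMA in real systems) to a decode worker.
cache, tok = prefill([10, 11, 12])
print(decode(cache, tok, 4))  # [3, 4, 5, 6]
```

The throughput win comes from batching the compute-bound prefill and memory-bound decode independently, each on hardware sized for its bottleneck.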
Related resources
No response