Can vLLM serve clients using multiple model instances? #239
-
Based on the examples, vLLM launches a server with a single model instance. Can vLLM serve clients using multiple model instances? With multiple model instances, the server could dispatch requests to different instances to reduce overhead.
Replies: 2 comments 3 replies
-
Right now vLLM is a serving engine for a single model. You can start multiple vLLM server replicas and use a custom load balancer (e.g., an nginx load balancer). Also feel free to check out FastChat and other multi-model frontends (e.g., aviary); vLLM can act as a model worker for these libraries to support multi-replica serving.
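
As a minimal sketch of the replica approach, here is one way to round-robin requests across several independently launched vLLM OpenAI-compatible servers from the client side (in practice you would more likely put nginx or another load balancer in front). The ports, model name, and prompt below are placeholders, and the launch command in the comment assumes the OpenAI-compatible entrypoint (`python -m vllm.entrypoints.openai.api_server`):

```python
import itertools
import json
import urllib.request

# Assumed replica addresses: two vLLM OpenAI-compatible servers started
# separately, e.g.
#   python -m vllm.entrypoints.openai.api_server --model <model> --port 8000
#   python -m vllm.entrypoints.openai.api_server --model <model> --port 8001
# The ports here are placeholders.
REPLICAS = ["http://localhost:8000", "http://localhost:8001"]
next_replica = itertools.cycle(REPLICAS)

def complete(prompt: str) -> dict:
    """Send a completion request to the next replica in round-robin order."""
    base_url = next(next_replica)
    payload = json.dumps({
        "model": "facebook/opt-125m",  # placeholder model name
        "prompt": prompt,
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    for i in range(4):
        result = complete(f"Request {i}: Hello, my name is")
        print(result["choices"][0]["text"])
```

A dedicated load balancer gives you health checks, retries, and weighted routing on top of this; the client-side cycle is only meant to illustrate that each replica is just a normal HTTP endpoint.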
-
Is this still the case? If so, why does the API support a model parameter if the intent is not to host multiple models?