Let's take a look at the first part:

```python
for i in range(times):
    response = pipe((prompt, image), gen_config=gen_config)
```

Here `pipe()` is `VLAsyncEngine.__call__()`, which is synchronous. This API is called 200 times in your example, meaning the engine processes the 200 requests one by one, sequentially. That is very time consuming.

For the second part:

```python
response = pipe([(query, image)] * times, gen_config=gen_config)
```

a list of requests is passed to the API in a single call. Despite the call being synchronous, internally the engine processes these requests concurrently, which is why it is significantly faster than the first scenario.
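If you want to measure the difference yourself, here is a minimal timing sketch of the two call styles. The model name, image URL, and `times = 200` are placeholders modeled on the usual lmdeploy examples; substitute your own.

```python
from time import perf_counter

from lmdeploy import GenerationConfig, pipeline
from lmdeploy.vl import load_image

# Placeholders -- swap in your own model, image, and prompt.
pipe = pipeline('OpenGVLab/InternVL2-8B')
gen_config = GenerationConfig(max_new_tokens=128)
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
query = 'Describe this image.'
times = 200

# First scenario: one synchronous call per request, so the engine
# handles the 200 requests strictly one after another.
start = perf_counter()
for _ in range(times):
    response = pipe((query, image), gen_config=gen_config)
print(f'sequential: {perf_counter() - start:.1f} s')

# Second scenario: a single call with a list of requests, which the
# engine batches internally and decodes concurrently.
start = perf_counter()
responses = pipe([(query, image)] * times, gen_config=gen_config)
print(f'batched:    {perf_counter() - start:.1f} s')
```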
More precisely, the vision model inference and the LLM model inference operate asynchronously. The vision engine uses the upstream model repository to handle images without applying additional optimizations. The LLM engine, on the other hand, employs continuous batching to process requests simultaneously, and extensive optimization effort has been invested in it, enabling it to deliver excellent performance under heavy traffic.
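In case the term is unfamiliar, here is a toy sketch of the continuous-batching idea: new requests join the running batch at every decode step, and finished requests free their slots immediately, instead of the whole batch draining before the next one starts. This models only the scheduling concept, not lmdeploy's actual engine internals.

```python
import asyncio
import random

async def engine_loop(queue: asyncio.Queue) -> None:
    """One decode step per iteration; requests join and leave mid-flight."""
    active: list[dict] = []
    while True:
        # Admit newly arrived requests before the next decode step.
        while not queue.empty():
            active.append(queue.get_nowait())
        # Advance every active request by one token.
        for req in active:
            req['generated'] += 1
        # Finished requests leave; their batch slots free up immediately.
        for req in [r for r in active if r['generated'] >= r['max_tokens']]:
            active.remove(req)
            req['done'].set()
        await asyncio.sleep(0.001)  # yield so callers can enqueue more work

async def submit(queue: asyncio.Queue, i: int) -> None:
    req = {'generated': 0,
           'max_tokens': random.randint(5, 20),
           'done': asyncio.Event()}
    await queue.put(req)
    await req['done'].wait()
    print(f'request {i} done after {req["generated"]} tokens')

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    engine = asyncio.create_task(engine_loop(queue))
    # Requests arrive independently but share the same running batch.
    await asyncio.gather(*(submit(queue, i) for i in range(8)))
    engine.cancel()

asyncio.run(main())
```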