
How does multimodal batch inference work? #2034

Answered by lvhan028
bonre asked this question in Q&A

Let's take a look at the first part:

```python
for i in range(times):
    response = pipe((prompt, image), gen_config=gen_config)
```

Here, pipe() is VLAsyncEngine.__call__(), which is synchronous. In your example, this API is called 200 times, so the engine processes the 200 requests one by one, sequentially, which is very time-consuming.
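For context, a self-contained version of that sequential pattern might look like the sketch below, assuming the lmdeploy pipeline API; the model name, prompt, and image URL are placeholders, not taken from the question:

```python
from lmdeploy import pipeline, GenerationConfig
from lmdeploy.vl import load_image

# placeholder model and inputs -- substitute your own
pipe = pipeline('OpenGVLab/InternVL2-8B')
gen_config = GenerationConfig(max_new_tokens=128)
image = load_image('https://example.com/demo.jpg')
prompt = 'describe this image'

times = 200
# each pipe() call blocks until its request finishes,
# so the 200 requests run strictly one after another
for i in range(times):
    response = pipe((prompt, image), gen_config=gen_config)
```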

For the second part,

```python
response = pipe([(query, image)] * times, gen_config=gen_config)
```

A list of requests is passed to the API. Although the call is still synchronous, internally the engine processes these requests concurrently, which is why it is so much faster than the first scenario.
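Under the same assumptions as the sketch above, the batched variant is a single call, and timing the two patterns side by side makes the difference visible:

```python
import time

# reuses pipe, gen_config and image from the sequential sketch
query = 'describe this image'
times = 200

start = time.perf_counter()
# one synchronous call; internally the engine batches the
# vision encoding and schedules all 200 LLM requests together
responses = pipe([(query, image)] * times, gen_config=gen_config)
print(f'{len(responses)} responses in {time.perf_counter() - start:.1f}s')
```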

More precisely, the vision model inference and the LLM model inference operate…
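As a purely conceptual illustration of how vision-model and LLM inference could overlap across requests (not the engine's actual code; vision_encode and llm_generate below are made-up stand-ins), a producer/consumer pipeline captures the idea:

```python
import asyncio

async def vision_encode(req):
    # stand-in for the vision tower producing image embeddings
    await asyncio.sleep(0.01)
    return f'embeddings({req})'

async def llm_generate(emb):
    # stand-in for LLM prefill + decode on those embeddings
    await asyncio.sleep(0.05)
    return f'text from {emb}'

async def serve(requests):
    queue: asyncio.Queue = asyncio.Queue()

    async def producer():
        # encode images one by one, feeding the LLM stage
        for req in requests:
            await queue.put(await vision_encode(req))
        await queue.put(None)  # sentinel: no more work

    async def consumer():
        # decode request i while request i+1 is being encoded
        outputs = []
        while (emb := await queue.get()) is not None:
            outputs.append(await llm_generate(emb))
        return outputs

    prod = asyncio.create_task(producer())
    outputs = await consumer()
    await prod
    return outputs

print(asyncio.run(serve(range(4))))
```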
