Let's take a look at the first part:

```python
for i in range(times):
    response = pipe((prompt, image), gen_config=gen_config)
```

Here `pipe()` is `VLAsyncEngine.__call__()`, which is synchronous. This API is called 200 times in your example, meaning the engine processes the 200 requests one by one, sequentially. That is very time consuming.

For the second part:

```python
response = pipe([(query, image)] * times, gen_config=gen_config)
```

a list of requests is passed to the API in a single call. Despite the call being synchronous, internally the engine processes these requests concurrently, which is why it is significantly faster than the first scenario.
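If you want to measure the difference yourself, here is a minimal timing sketch of the two call styles. The model name, image URL, and `times = 200` are placeholders modeled on the usual lmdeploy examples; substitute your own.

```python
from time import perf_counter

from lmdeploy import GenerationConfig, pipeline
from lmdeploy.vl import load_image

# Placeholders -- swap in your own model, image, and prompt.
pipe = pipeline('OpenGVLab/InternVL2-8B')
gen_config = GenerationConfig(max_new_tokens=128)
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
query = 'Describe this image.'
times = 200

# First scenario: one synchronous call per request, so the engine
# handles the 200 requests strictly one after another.
start = perf_counter()
for _ in range(times):
    response = pipe((query, image), gen_config=gen_config)
print(f'sequential: {perf_counter() - start:.1f} s')

# Second scenario: a single call with a list of requests, which the
# engine batches internally and decodes concurrently.
start = perf_counter()
responses = pipe([(query, image)] * times, gen_config=gen_config)
print(f'batched:    {perf_counter() - start:.1f} s')
```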
More precisely, the vision model inference and the LLM model inference operate asynchronously. The vision engine uses the upstream model repository to handle images without applying additional optimizations. The LLM engine, on the other hand, employs continuous batching to process requests simultaneously, and extensive optimization effort has been invested in it, enabling it to deliver excellent performance under heavy traffic.
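In case the term is unfamiliar, here is a toy sketch of the continuous-batching idea: new requests join the running batch at every decode step, and finished requests free their slots immediately, instead of the whole batch draining before the next one starts. This models only the scheduling concept, not lmdeploy's actual engine internals.

```python
import asyncio
import random

async def engine_loop(queue: asyncio.Queue) -> None:
    """One decode step per iteration; requests join and leave mid-flight."""
    active: list[dict] = []
    while True:
        # Admit newly arrived requests before the next decode step.
        while not queue.empty():
            active.append(queue.get_nowait())
        # Advance every active request by one token.
        for req in active:
            req['generated'] += 1
        # Finished requests leave; their batch slots free up immediately.
        for req in [r for r in active if r['generated'] >= r['max_tokens']]:
            active.remove(req)
            req['done'].set()
        await asyncio.sleep(0.001)  # yield so callers can enqueue more work

async def submit(queue: asyncio.Queue, i: int) -> None:
    req = {'generated': 0,
           'max_tokens': random.randint(5, 20),
           'done': asyncio.Event()}
    await queue.put(req)
    await req['done'].wait()
    print(f'request {i} done after {req["generated"]} tokens')

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    engine = asyncio.create_task(engine_loop(queue))
    # Requests arrive independently but share the same running batch.
    await asyncio.gather(*(submit(queue, i) for i in range(8)))
    engine.cancel()

asyncio.run(main())
```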