Can TensorRT-LLM Handle High Levels of Concurrent Requests? #2514
Labels
bug
Something isn't working
Investigating
triaged
Issue has been triaged by maintainers
Triton Backend
System Info
NVIDIA A100 80GB
Running on nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 image
Who can help?
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
When I send 1 request, inference takes about 3 seconds, but when I send 10 requests concurrently, it takes about 30 seconds in total.
I have tried to set up parallel request handling, but I don't know where the issue lies. Could you help me? A minimal sketch of the kind of test I am running is shown below.
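For reference, this is roughly how the concurrency test can be reproduced. It is a minimal sketch, assuming the Triton HTTP generate endpoint is reachable on localhost:8000 and the model is served as the usual `ensemble` model with `text_input`/`max_tokens` inputs; the URL, model name, and payload fields are assumptions and should be adjusted to match the actual deployment.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Assumed endpoint and payload; adjust the model name and input fields
# to match your TensorRT-LLM Triton backend deployment.
URL = "http://localhost:8000/v2/models/ensemble/generate"
PAYLOAD = {"text_input": "Hello, my name is", "max_tokens": 128}


def one_request(_: int) -> float:
    """Send a single generate request and return its latency in seconds."""
    start = time.perf_counter()
    resp = requests.post(URL, json=PAYLOAD, timeout=300)
    resp.raise_for_status()
    return time.perf_counter() - start


def run(concurrency: int) -> None:
    """Fire `concurrency` requests at once and report total wall time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(one_request, range(concurrency)))
    total = time.perf_counter() - start
    print(f"{concurrency} concurrent requests: total {total:.1f}s, "
          f"per-request latencies {[f'{t:.1f}' for t in latencies]}")


if __name__ == "__main__":
    run(1)   # observed: ~3 s
    run(10)  # observed: ~30 s, i.e. roughly linear scaling
```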
Expected behavior
Concurrent requests should be processed in parallel, so sending 10 requests should not take roughly 10 times as long as a single request.
Actual behavior
When I send 1 request, inference takes about 3 seconds; when I send 10 requests concurrently, it takes about 30 seconds, i.e. latency scales roughly linearly with the number of requests.
Additional notes