I am currently developing a Retrieval-Augmented Generation (RAG) based GPT application using the Mistral-Nemo-Instruct 12B model, which I have loaded on Triton Inference Server and am serving as a TensorRT-LLM engine. However, I am facing an issue where the TensorRT-LLM API sometimes takes an unusually long time to respond. During this period it repeatedly prints the log messages shown below, GPU utilization spikes to 98%, and memory usage keeps climbing, resulting in what appears to be a memory leak.
GPU: H100
TensorRT-LLM: 0.15.0
TensorRT: 10.6.0
No. of GPUs: 2
Max token limit: 100000
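For reference, requests are sent to the server roughly like this. This is a minimal sketch assuming the default `ensemble` model and the `text_input`/`max_tokens`/`stream` tensor names from the tensorrtllm_backend examples; my actual model repository may differ. The `stop`/`streaming` booleans that the log lines below mention are optional inputs that can be supplied explicitly:

```python
import numpy as np
import tritonclient.http as httpclient

# Sketch of the client call; model and tensor names follow the default
# tensorrtllm_backend "ensemble" example and may differ in a real setup.
client = httpclient.InferenceServerClient(url="localhost:8000")

text = np.array([["<RAG prompt with retrieved context>"]], dtype=object)
max_tokens = np.array([[512]], dtype=np.int32)
# The backend logs "user did not not provide stop/streaming input" when
# these optional booleans are omitted; "stream" can be set explicitly.
stream = np.array([[False]], dtype=bool)

inputs = [
    httpclient.InferInput("text_input", [1, 1], "BYTES"),
    httpclient.InferInput("max_tokens", [1, 1], "INT32"),
    httpclient.InferInput("stream", [1, 1], "BOOL"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)
inputs[2].set_data_from_numpy(stream)

result = client.infer("ensemble", inputs)
print(result.as_numpy("text_output"))
```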
```
I0123 04:45:44.967579 120 utils.cc:316] "ModelInstanceState::getRequestBooleanInputTensor: user did not not provide stop input for the request"
I0123 04:45:44.967651 120 utils.cc:316] "ModelInstanceState::getRequestBooleanInputTensor: user did not not provide streaming input for the request"
I0123 04:45:45.034982 120 model_instance_state.cc:1239] "{"Active Request Count":1,"Iteration Counter":458,"Max Request Count":8,"Runtime CPU Memory Usage":17252,"Runtime GPU Memory Usage":16178570883,"Runtime Pinned Memory Usage":549454004,"Timestamp":"01-23-2025 04:45:44.972719","Context Requests":1,"Generation Requests":0,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":5392,"Free KV cache blocks":2865,"Max KV cache blocks":2950,"Tokens per KV cache block":64,"Used KV cache blocks":85,"Reused KV cache blocks":0,"KV cache transfer time":0.000000,"Request count":0}"
I0123 04:45:45.135107 120 model_instance_state.cc:1239] "{"Active Request Count":1,"Iteration Counter":459,"Max Request Count":8,"Runtime CPU Memory Usage":17252,"Runtime GPU Memory Usage":16178570883,"Runtime Pinned Memory Usage":549454004,"Timestamp":"01-23-2025 04:45:45.114354","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":2865,"Max KV cache blocks":2950,"Tokens per KV cache block":64,"Used KV cache blocks":85,"Reused KV cache blocks":0,"KV cache transfer time":0.000000,"Request count":0}"
I0123 04:45:45.135153 120 model_instance_state.cc:1239] "{"Active Request Count":1,"Iteration Counter":460,"Max Request Count":8,"Runtime CPU Memory Usage":17252,"Runtime GPU Memory Usage":16178570883,"Runtime Pinned Memory Usage":549454004,"Timestamp":"01-23-2025 04:45:45.119673","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":2865,"Max KV cache blocks":2950,"Tokens per KV cache block":64,"Used KV cache blocks":85,"Reused KV cache blocks":0,"KV cache transfer time":0.000000,"Request count":0}"
I0123 04:45:45.135173 120 model_instance_state.cc:1239] "{"Active Request Count":1,"Iteration Counter":461,"Max Request Count":8,"Runtime CPU Memory Usage":17252,"Runtime GPU Memory Usage":16178570883,"Runtime Pinned Memory Usage":549454004,"Timestamp":"01-23-2025 04:45:45.125423","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":2865,"Max KV cache blocks":2950,"Tokens per KV cache block":64,"Used KV cache blocks":85,"Reused KV cache blocks":0,"KV cache transfer time":0.000000,"Request count":0}"
I0123 04:45:45.135189 120 model_instance_state.cc:1239] "{"Active Request Count":1,"Iteration Counter":462,"Max Request Count":8,"Runtime CPU Memory Usage":17252,"Runtime GPU Memory Usage":16178570883,"Runtime Pinned Memory Usage":549454004,"Timestamp":"01-23-2025 04:45:45.131276","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":2865,"Max KV cache blocks":2950,"Tokens per KV cache block":64,"Used KV cache blocks":85,"Reused KV cache blocks":0,"KV cache transfer time":0.000000,"Request count":0}"
```
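To check whether the stall correlates with KV-cache exhaustion or a runaway iteration counter, the repeated per-iteration statistics lines can be parsed. This is a rough helper of my own (not part of TensorRT-LLM or Triton) that assumes the exact log format shown above, piped in on stdin:

```python
import json
import re
import sys

# Each stats line wraps a JSON object after "model_instance_state.cc:<line>]";
# extract it and print the fields most relevant to the stall.
pattern = re.compile(r'model_instance_state\.cc:\d+\] "(\{.*\})"')

for line in sys.stdin:
    match = pattern.search(line)
    if not match:
        continue
    stats = json.loads(match.group(1))
    print(
        f"iter={stats['Iteration Counter']} "
        f"active={stats['Active Request Count']} "
        f"free_kv_blocks={stats['Free KV cache blocks']} "
        f"used_kv_blocks={stats['Used KV cache blocks']} "
        f"gpu_mem={stats['Runtime GPU Memory Usage']}"
    )
```

In the excerpt above the KV cache is barely used (85 of 2950 blocks), so cache exhaustion does not appear to explain the stall in this case.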