I am currently developing a Retrieval-Augmented Generation (RAG) based GPT application using the Mistral-Nemo-Instruct 12B model, which I have loaded on Triton Inference Server and am serving as a TensorRT-LLM engine. However, I am facing an issue where the TensorRT-LLM API sometimes takes an unusually long time to respond. During this period it repeatedly prints the log messages shown below, GPU utilization spikes to 98%, and memory usage keeps climbing, resulting in what appears to be a memory leak.
GPU: H100
TensorRT-LLM: 0.15.0
TensorRT: 10.6.0
No. of GPUs: 2
Max token limit: 100000
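For reference, requests are sent to the server roughly like this. This is a minimal sketch assuming the default `ensemble` model and the `text_input`/`max_tokens`/`stream` tensor names from the tensorrtllm_backend examples; my actual model repository may differ. The `stop`/`streaming` booleans that the log lines below mention are optional inputs that can be supplied explicitly:

```python
import numpy as np
import tritonclient.http as httpclient

# Sketch of the client call; model and tensor names follow the default
# tensorrtllm_backend "ensemble" example and may differ in a real setup.
client = httpclient.InferenceServerClient(url="localhost:8000")

text = np.array([["<RAG prompt with retrieved context>"]], dtype=object)
max_tokens = np.array([[512]], dtype=np.int32)
# The backend logs "user did not not provide stop/streaming input" when
# these optional booleans are omitted; "stream" can be set explicitly.
stream = np.array([[False]], dtype=bool)

inputs = [
    httpclient.InferInput("text_input", [1, 1], "BYTES"),
    httpclient.InferInput("max_tokens", [1, 1], "INT32"),
    httpclient.InferInput("stream", [1, 1], "BOOL"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)
inputs[2].set_data_from_numpy(stream)

result = client.infer("ensemble", inputs)
print(result.as_numpy("text_output"))
```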
```
I0123 04:45:44.967579 120 utils.cc:316] "ModelInstanceState::getRequestBooleanInputTensor: user did not not provide stop input for the request"
I0123 04:45:44.967651 120 utils.cc:316] "ModelInstanceState::getRequestBooleanInputTensor: user did not not provide streaming input for the request"
I0123 04:45:45.034982 120 model_instance_state.cc:1239] "{"Active Request Count":1,"Iteration Counter":458,"Max Request Count":8,"Runtime CPU Memory Usage":17252,"Runtime GPU Memory Usage":16178570883,"Runtime Pinned Memory Usage":549454004,"Timestamp":"01-23-2025 04:45:44.972719","Context Requests":1,"Generation Requests":0,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":5392,"Free KV cache blocks":2865,"Max KV cache blocks":2950,"Tokens per KV cache block":64,"Used KV cache blocks":85,"Reused KV cache blocks":0,"KV cache transfer time":0.000000,"Request count":0}"
I0123 04:45:45.135107 120 model_instance_state.cc:1239] "{"Active Request Count":1,"Iteration Counter":459,"Max Request Count":8,"Runtime CPU Memory Usage":17252,"Runtime GPU Memory Usage":16178570883,"Runtime Pinned Memory Usage":549454004,"Timestamp":"01-23-2025 04:45:45.114354","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":2865,"Max KV cache blocks":2950,"Tokens per KV cache block":64,"Used KV cache blocks":85,"Reused KV cache blocks":0,"KV cache transfer time":0.000000,"Request count":0}"
I0123 04:45:45.135153 120 model_instance_state.cc:1239] "{"Active Request Count":1,"Iteration Counter":460,"Max Request Count":8,"Runtime CPU Memory Usage":17252,"Runtime GPU Memory Usage":16178570883,"Runtime Pinned Memory Usage":549454004,"Timestamp":"01-23-2025 04:45:45.119673","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":2865,"Max KV cache blocks":2950,"Tokens per KV cache block":64,"Used KV cache blocks":85,"Reused KV cache blocks":0,"KV cache transfer time":0.000000,"Request count":0}"
I0123 04:45:45.135173 120 model_instance_state.cc:1239] "{"Active Request Count":1,"Iteration Counter":461,"Max Request Count":8,"Runtime CPU Memory Usage":17252,"Runtime GPU Memory Usage":16178570883,"Runtime Pinned Memory Usage":549454004,"Timestamp":"01-23-2025 04:45:45.125423","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":2865,"Max KV cache blocks":2950,"Tokens per KV cache block":64,"Used KV cache blocks":85,"Reused KV cache blocks":0,"KV cache transfer time":0.000000,"Request count":0}"
I0123 04:45:45.135189 120 model_instance_state.cc:1239] "{"Active Request Count":1,"Iteration Counter":462,"Max Request Count":8,"Runtime CPU Memory Usage":17252,"Runtime GPU Memory Usage":16178570883,"Runtime Pinned Memory Usage":549454004,"Timestamp":"01-23-2025 04:45:45.131276","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":2865,"Max KV cache blocks":2950,"Tokens per KV cache block":64,"Used KV cache blocks":85,"Reused KV cache blocks":0,"KV cache transfer time":0.000000,"Request count":0}"
```
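To check whether the stall correlates with KV-cache exhaustion or a runaway iteration counter, the repeated per-iteration statistics lines can be parsed. This is a rough helper of my own (not part of TensorRT-LLM or Triton) that assumes the exact log format shown above, piped in on stdin:

```python
import json
import re
import sys

# Each stats line wraps a JSON object after "model_instance_state.cc:<line>]";
# extract it and print the fields most relevant to the stall.
pattern = re.compile(r'model_instance_state\.cc:\d+\] "(\{.*\})"')

for line in sys.stdin:
    match = pattern.search(line)
    if not match:
        continue
    stats = json.loads(match.group(1))
    print(
        f"iter={stats['Iteration Counter']} "
        f"active={stats['Active Request Count']} "
        f"free_kv_blocks={stats['Free KV cache blocks']} "
        f"used_kv_blocks={stats['Used KV cache blocks']} "
        f"gpu_mem={stats['Runtime GPU Memory Usage']}"
    )
```

In the excerpt above the KV cache is barely used (85 of 2950 blocks), so cache exhaustion does not appear to explain the stall in this case.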