Memory Leak: TensorRT-LLM process #2716

Open
payingguest opened this issue Jan 24, 2025 · 0 comments

I am developing a Retrieval-Augmented Generation (RAG) application on top of the Mistral-Nemo-Instruct 12B model, which I have loaded into Triton Inference Server and serve as a TensorRT-LLM engine. I am facing an issue where the TensorRT-LLM API sometimes takes an unusually long time to respond. During this period it repeatedly prints the messages shown below, GPU utilization spikes to 98%, and GPU memory appears to leak.
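
To check whether device memory is actually growing without bound, rather than plateauing at a high-water mark, this is a minimal monitoring sketch (my own, not part of the server) that can run next to Triton. It assumes the nvidia-ml-py (pynvml) package is installed; the device index and polling interval are arbitrary choices.

```python
# Minimal GPU memory monitor (assumes nvidia-ml-py / pynvml is installed).
# Polls device 0 once a second and prints used memory and utilization, so a
# monotonically growing "used" value can be told apart from a steady plateau.
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # device index is an assumption

try:
    while True:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"used={mem.used / 2**20:.0f} MiB "
              f"total={mem.total / 2**20:.0f} MiB "
              f"gpu_util={util.gpu}%")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```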

GPU: H100
TensorRT-LLM: 0.15.0
TensorRT: 10.6.0
No. of GPUs: 2
Max_token_limit: 100000
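
The utils.cc lines in the log below note that the request did not provide the optional `stop` and `streaming` boolean inputs, so the backend falls back to its defaults. In case it matters for reproducing the behavior, here is a hedged sketch of how those two tensors can be attached explicitly with the Python tritonclient gRPC API; the model name in the usage note is hypothetical, and the rest of the request (input_ids, input_lengths, request_output_len, and so on) is assumed to be built elsewhere.

```python
# Sketch: explicitly attach the optional "stop" and "streaming" BOOL inputs
# that the utils.cc log lines report as missing. Assumes tritonclient[grpc]
# is installed; all other required tensors are assumed to be built elsewhere.
import numpy as np
import tritonclient.grpc as grpcclient


def boolean_control_inputs(stop: bool = False, streaming: bool = False):
    """Build the two optional BOOL tensors, shaped [1, 1] for batch size 1."""
    tensors = []
    for name, value in (("stop", stop), ("streaming", streaming)):
        t = grpcclient.InferInput(name, [1, 1], "BOOL")
        t.set_data_from_numpy(np.array([[value]], dtype=bool))
        tensors.append(t)
    return tensors


# Usage (model name and helper are hypothetical):
# client = grpcclient.InferenceServerClient("localhost:8001")
# inputs = build_other_inputs() + boolean_control_inputs(streaming=False)
# result = client.infer(model_name="tensorrt_llm", inputs=inputs)
```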

I0123 04:45:44.967579 120 utils.cc:316] "ModelInstanceState::getRequestBooleanInputTensor: user did not not provide stop input for the request"
I0123 04:45:44.967651 120 utils.cc:316] "ModelInstanceState::getRequestBooleanInputTensor: user did not not provide streaming input for the request"
I0123 04:45:45.034982 120 model_instance_state.cc:1239] "{"Active Request Count":1,"Iteration Counter":458,"Max Request Count":8,"Runtime CPU Memory Usage":17252,"Runtime GPU Memory Usage":16178570883,"Runtime Pinned Memory Usage":549454004,"Timestamp":"01-23-2025 04:45:44.972719","Context Requests":1,"Generation Requests":0,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":5392,"Free KV cache blocks":2865,"Max KV cache blocks":2950,"Tokens per KV cache block":64,"Used KV cache blocks":85,"Reused KV cache blocks":0,"KV cache transfer time":0.000000,"Request count":0}"
I0123 04:45:45.135107 120 model_instance_state.cc:1239] "{"Active Request Count":1,"Iteration Counter":459,"Max Request Count":8,"Runtime CPU Memory Usage":17252,"Runtime GPU Memory Usage":16178570883,"Runtime Pinned Memory Usage":549454004,"Timestamp":"01-23-2025 04:45:45.114354","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":2865,"Max KV cache blocks":2950,"Tokens per KV cache block":64,"Used KV cache blocks":85,"Reused KV cache blocks":0,"KV cache transfer time":0.000000,"Request count":0}"
I0123 04:45:45.135153 120 model_instance_state.cc:1239] "{"Active Request Count":1,"Iteration Counter":460,"Max Request Count":8,"Runtime CPU Memory Usage":17252,"Runtime GPU Memory Usage":16178570883,"Runtime Pinned Memory Usage":549454004,"Timestamp":"01-23-2025 04:45:45.119673","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":2865,"Max KV cache blocks":2950,"Tokens per KV cache block":64,"Used KV cache blocks":85,"Reused KV cache blocks":0,"KV cache transfer time":0.000000,"Request count":0}"
I0123 04:45:45.135173 120 model_instance_state.cc:1239] "{"Active Request Count":1,"Iteration Counter":461,"Max Request Count":8,"Runtime CPU Memory Usage":17252,"Runtime GPU Memory Usage":16178570883,"Runtime Pinned Memory Usage":549454004,"Timestamp":"01-23-2025 04:45:45.125423","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":2865,"Max KV cache blocks":2950,"Tokens per KV cache block":64,"Used KV cache blocks":85,"Reused KV cache blocks":0,"KV cache transfer time":0.000000,"Request count":0}"
I0123 04:45:45.135189 120 model_instance_state.cc:1239] "{"Active Request Count":1,"Iteration Counter":462,"Max Request Count":8,"Runtime CPU Memory Usage":17252,"Runtime GPU Memory Usage":16178570883,"Runtime Pinned Memory Usage":549454004,"Timestamp":"01-23-2025 04:45:45.131276","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":2865,"Max KV cache blocks":2950,"Tokens per KV cache block":64,"Used KV cache blocks":85,"Reused KV cache blocks":0,"KV cache transfer time":0.000000,"Request count":0}"
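
Each model_instance_state.cc line above embeds a JSON stats object per iteration. To tell a genuine leak from a long-running generation loop, it helps to extract those objects over a longer window and see whether "Runtime GPU Memory Usage" or "Used KV cache blocks" actually grows across iterations. A rough parsing sketch, assuming the line format shown above (with the inner quotes unescaped):

```python
# Sketch: extract the per-iteration JSON stats that model_instance_state.cc
# prints (as in the log above) and track the fields most relevant to a leak.
# The log file path is passed as argv[1]; the format is taken from the snippet.
import json
import sys


def iter_stats(path):
    """Yield the embedded JSON object from each stats log line."""
    with open(path) as f:
        for line in f:
            if "model_instance_state.cc" not in line:
                continue
            start, end = line.find("{"), line.rfind("}")
            if start == -1 or end == -1:
                continue
            yield json.loads(line[start:end + 1])


if __name__ == "__main__":
    for s in iter_stats(sys.argv[1]):
        print(s["Iteration Counter"],
              s["Runtime GPU Memory Usage"],
              s["Used KV cache blocks"],
              s["Free KV cache blocks"])
```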
