Input length limitation (8192) despite model supporting 32k context window #2717
Comments
TensorRT-LLM has a maximum number of tokens that can be processed at a time (the maxNumTokens value reported in your log). You have two ways of overcoming this limitation:
1. Rebuild the engine with a larger max_num_tokens (and max_input_len) so the whole prompt fits in a single forward pass.
2. Enable chunked context, so the runtime splits a long prompt into chunks that each fit within max_num_tokens; this requires paged context attention (paged KV cache).
NB: Paged kv cache also enables you to use enable_kv_cache_reuse, which is a very nice feature to have.
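For reference, a minimal sketch of turning on KV-cache block reuse through the LLM Python API. The KvCacheConfig class, its enable_block_reuse field, and the import path are assumptions based on the llmapi documentation and may differ between TensorRT-LLM versions:

```python
# Sketch only: KvCacheConfig / enable_block_reuse are assumed to be the
# LLM-API counterparts of the engine-level enable_kv_cache_reuse option.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # HF checkpoint used in this issue
    kv_cache_config=KvCacheConfig(enable_block_reuse=True),
)
```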
Thank you so much for your detailed explanation! However, I'm currently initializing the LLM directly with the HF model (I found this approach more convenient since it eliminates two additional steps) rather than explicitly doing the model weight conversion and engine building steps. Is there a solution available for this scenario?
I am not familiar with using TensorRT-LLM with this API, but I presume that it does a trtllm-build step under the hood. You can see the relevant options for this API here: https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html. It looks like you can probably implement both the approaches you outlined.
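As a concrete illustration of the first approach via the LLM API, here is a minimal sketch that raises the build-time token budget through build_config. BuildConfig and its max_input_len / max_num_tokens fields are taken from the linked reference, but exact names and defaults may vary by version, and 32768 is just an example value matching the model's context window:

```python
# Sketch only: raise the engine's token limits so prompts longer than the
# default maxNumTokens (8192) can be processed in a single pass.
from tensorrt_llm import LLM, BuildConfig, SamplingParams

build_config = BuildConfig(
    max_input_len=32768,   # longest single prompt, in tokens
    max_num_tokens=32768,  # tokens scheduled per forward pass (default 8192)
)

# Build directly from the Hugging Face checkpoint, as in the original setup.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", build_config=build_config)

outputs = llm.generate(["<a very long prompt>"], SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```

The second approach (chunked context) instead keeps max_num_tokens small and lets the runtime process the prompt in chunks, which requires paged context attention to be enabled at build time.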
I'm running into similar issues. How would I enable chunked context with trtllm-serve?
You know it is correct when the start-up logs contain the following line:
When using the TensorRT-LLM backend, make sure your config.pbtxt contains something like:
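(A sketch of the relevant entry, assuming the standard tensorrtllm_backend parameter name enable_chunked_context; check the config.pbtxt shipped with your backend version for the exact key.)

```
parameters: {
  key: "enable_chunked_context"
  value: {
    string_value: "true"
  }
}
```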
It can be found in
In the LLM Python API it probably corresponds to the option
Description
I'm experiencing an input length limitation issue when running Qwen2.5-7B-instruct model with TensorRT-LLM on a single H100 GPU. While the model inherently supports a 32k context window, the system throws an error when the input length exceeds 8192 tokens.
Environment
Code used to initialize the model
Error Log
I noticed the line [TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192 in the logs, which appears to be a configuration parameter limiting the maximum number of tokens processed at a time. I'm wondering why this value is set to 8192 when the model should support 32k tokens.