Description
I'm experiencing an input length limitation when running the Qwen2.5-7B-instruct model with TensorRT-LLM on a single H100 GPU. Although the model supports a 32k context window, the runtime throws an error as soon as the input length exceeds 8192 tokens.
Environment
- Model: Qwen2.5-7B-instruct
- Framework: TensorRT-LLM (version 0.16.0)
- Hardware: Single H100 GPU
- Maximum supported model context length: 32k tokens
- Current limitation: 8192 tokens
Code used to initialize the model
from tensorrt_llm import LLM

llm = LLM(
    model=model_name_or_path,                  # local path to the HF checkpoint
    tensor_parallel_size=len(available_gpus),  # 1 on a single H100
    trust_remote_code=True,
)
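My understanding from the LLM API docs is that maxNumTokens is fixed when the engine is built, so I would expect something along these lines to raise it. This is only a sketch: I am assuming BuildConfig exposes max_input_len, max_seq_len, and max_num_tokens in 0.16.0, and I have not verified that it builds a working engine.

from tensorrt_llm import LLM, BuildConfig

# Assumed field names; please correct me if max_num_tokens is not the right knob.
build_config = BuildConfig(
    max_input_len=32768,   # allow prompts up to the full 32k context
    max_seq_len=32768,     # prompt plus generated tokens
    max_num_tokens=32768,  # raise the 8192 default token budget seen in the logs
)

llm = LLM(
    model=model_name_or_path,
    tensor_parallel_size=len(available_gpus),
    trust_remote_code=True,
    build_config=build_config,
)

I realize a larger max_num_tokens also increases activation memory, so perhaps keeping it at 8192 and enabling context chunking is the intended approach instead?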
Error Log
Loading Model: [1/2] Loading HF model to memory
230it [00:09, 25.32it/s]
Time: 9.662s
Loading Model: [2/2] Building TRT-LLM engine
Time: 112.563s
Loading model done.
Total latency: 122.225s
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 14553 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1272.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MS] Running engine with multi stream info
[TensorRT-LLM][INFO] [MS] Number of aux streams is 1
[TensorRT-LLM][INFO] [MS] Number of total worker streams is 2
[TensorRT-LLM][INFO] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 14541 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.66 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 5.77 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.11 GiB, available: 53.90 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 14192
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 48.51 GiB for max tokens in paged KV cache (908288).
Evaluating responses...
[TensorRT-LLM][ERROR] Encountered an error when fetching new request: Prompt length (12438) exceeds maximum input length (8192). Set log level to info and check TRTGptModel logs for how maximum input length is set (/home/jenkins/agent/workspace/LLM/helpers/Build-x86_64/llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:482)
1 0x7f2312258d0d /inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/public/env/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x868d0d) [0x7f2312258d0d]
2 0x7f23145b722d tensorrt_llm::executor::Executor::Impl::executionLoop() + 1021
3 0x7f271ffee5c0 /inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/public/env/trt_llm/lib/python3.10/site-packages/torch/lib/libtorch.so(+0x145c0) [0x7f271ffee5c0]
4 0x7f272e00bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f272e00bac3]
5 0x7f272e09da40 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7f272e09da40]
I noticed in the logs that TRTGptModel maxNumTokens is set to 8192, and that maxInputLen is then derived as min(maxSequenceLen - 1, maxNumTokens) = min(32767, 8192) = 8192, so the prompt limit appears to come from maxNumTokens rather than from the model's context length. Why does this value default to 8192 when the model supports 32k tokens, and what is the recommended way to raise it (or to enable context chunking) when building the engine through the LLM API?
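For completeness, the error is triggered by an ordinary generate call once the tokenized prompt exceeds 8192 tokens. A rough minimal repro, where long_prompt is a placeholder for any input that tokenizes to around 12k tokens:

from tensorrt_llm import SamplingParams

# long_prompt is a placeholder: any text that tokenizes to more than 8192 tokens.
outputs = llm.generate([long_prompt], SamplingParams(max_tokens=256))
for out in outputs:
    print(out.outputs[0].text)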