
Input length limitation (8192) despite model supporting 32k context window #2717

Open
HuangZhen02 opened this issue Jan 24, 2025 · 6 comments

@HuangZhen02

Description

I'm experiencing an input length limitation when running the Qwen2.5-7B-instruct model with TensorRT-LLM on a single H100 GPU. While the model inherently supports a 32k context window, the system throws an error as soon as the input length exceeds 8192 tokens.

Environment

  • Model: Qwen2.5-7B-instruct
  • Framework: TensorRT-LLM (version 0.16.0)
  • Hardware: Single H100 GPU
  • Maximum supported model context length: 32k tokens
  • Current limitation: 8192 tokens

Code used to initialize the model

    from tensorrt_llm import LLM  # LLM API entry point

    llm = LLM(
        model=model_name_or_path,  # HF model local path
        tensor_parallel_size=len(available_gpus),
        trust_remote_code=True,
    )

Error Log

Loading Model: [1/2]    Loading HF model to memory
230it [00:09, 25.32it/s]
Time: 9.662s
Loading Model: [2/2]    Building TRT-LLM engine
Time: 112.563s
Loading model done.
Total latency: 122.225s
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 14553 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1272.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MS] Running engine with multi stream info
[TensorRT-LLM][INFO] [MS] Number of aux streams is 1
[TensorRT-LLM][INFO] [MS] Number of total worker streams is 2
[TensorRT-LLM][INFO] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 14541 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.66 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 5.77 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.11 GiB, available: 53.90 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 14192
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 48.51 GiB for max tokens in paged KV cache (908288).
Evaluating responses...
[TensorRT-LLM][ERROR] Encountered an error when fetching new request: Prompt length (12438) exceeds maximum input length (8192). Set log level to info and check TRTGptModel logs for how maximum input length is set (/home/jenkins/agent/workspace/LLM/helpers/Build-x86_64/llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:482)
1       0x7f2312258d0d /inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/public/env/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x868d0d) [0x7f2312258d0d]
2       0x7f23145b722d tensorrt_llm::executor::Executor::Impl::executionLoop() + 1021
3       0x7f271ffee5c0 /inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/public/env/trt_llm/lib/python3.10/site-packages/torch/lib/libtorch.so(+0x145c0) [0x7f271ffee5c0]
4       0x7f272e00bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f272e00bac3]
5       0x7f272e09da40 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7f272e09da40]

I noticed in the logs that TRTGptModel maxNumTokens is set to 8192, which appears to be the configuration parameter limiting the maximum input length. I'm wondering why this value is 8192 when the model should support 32k tokens.

@mathijshenquet

TensorRT-LLM has a maximum number of tokens that can be processed at a time. You have two ways of overcoming this limitation:

  1. Use the --max_num_tokens parameter of trtllm-build. This will, however, lead to larger memory consumption for intermediate buffers.
  2. Enable enable_chunked_context in the model config, which allows the engine to process your input in chunks. This requires building the engine with --kv_cache_type paged (I also recommend --use_paged_context_fmha enable), and unfortunately you can't use it together with KV cache quantization.

NB: A paged KV cache also lets you enable enable_kv_cache_reuse, which is a very nice feature to have.

@HuangZhen02 (Author)

Thank you so much for your detailed explanation! However, I'm currently initializing the LLM directly from the HF model (I found this approach more convenient since it skips two extra steps) rather than explicitly running the weight-conversion and engine-building steps. Is there a solution for this scenario?

@mathijshenquet
Copy link

I am not familiar with using TensorRT-LLM through this API, but I presume it runs a trtllm-build step under the hood.

You can find the relevant options for this API here: https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html

It looks like you can implement both of the approaches outlined above through it.
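
For example, the first approach might be expressed through the LLM API roughly as in the sketch below. This is only a sketch under the assumption that the LLM constructor accepts a build_config argument and that BuildConfig exposes max_num_tokens / max_input_len / max_seq_len; check the reference linked above for the exact names in your version.

    # Sketch (assumptions noted above): raise the engine's token budget at build time
    # instead of accepting the default maxNumTokens of 8192.
    from tensorrt_llm import LLM, BuildConfig

    build_config = BuildConfig(
        max_input_len=32768,   # longest prompt the engine should accept
        max_seq_len=32768,     # prompt plus generated tokens
        max_num_tokens=32768,  # lifts the 8192 limit, at the cost of larger intermediate buffers
    )

    llm = LLM(
        model=model_name_or_path,  # HF model local path, as in the original snippet
        build_config=build_config,
        trust_remote_code=True,
    )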

@trystan-s
Copy link

I'm running into a similar issue: even after building with --kv_cache_type paged and --use_paged_context_fmha enable, I'm still locked at 8192 with trtllm-serve.

How would I enable chunked context with trtllm-serve?

[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (131072) * 40
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled

@mathijshenquet commented Jan 27, 2025

You'll know it is configured correctly when the start-up logs contain the following line:

[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 131071  = maxSequenceLen - 1 since chunked context is enabled

When using the Triton TensorRT-LLM backend:

Make sure your config.pbtxt contains something like:

parameters: {
  key: "enable_chunked_context"
  value: {
    string_value: "true"
  }
}

The file can be found at triton_model_repo/tensorrt_llm/config.pbtxt. In the examples, these files are generated with a bash script and ./tools/fill_template.py.

@mathijshenquet

In the LLM Python API this probably corresponds to the option enable_chunked_prefill.
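
Under that assumption, the chunked-context route might look roughly like the sketch below; enable_chunked_prefill is taken from the comment above and is not verified against a specific release, and the paged-KV-cache requirement mentioned earlier in the thread still applies.

    # Sketch (assumption: enable_chunked_prefill is the LLM-API counterpart of
    # enable_chunked_context): process long prompts in chunks at runtime instead of
    # raising max_num_tokens at build time.
    from tensorrt_llm import LLM

    llm = LLM(
        model=model_name_or_path,     # HF model local path
        enable_chunked_prefill=True,  # split a 12k-token prompt into chunks within the 8192-token budget
        trust_remote_code=True,
    )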
