Description
System Info
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37/37 [02:24<00:00, 3.90s/it]
Starting from v4.46, the logits model output will have the same type as the model (except at train time, where it will always be FP32)
calibrating model: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 512/512 [1:51:15<00:00, 13.04s/it]
Weights loaded. Total time: 00:04:54
Weights loaded. Total time: 00:04:48
Weights loaded. Total time: 00:04:46
Weights loaded. Total time: 00:04:41
Total time of converting checkpoints: 02:16:55
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[01/15/2025-05:42:02] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set gemm_plugin to float16.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set nccl_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set lora_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set moe_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set context_fmha to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set remove_input_padding to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set reduce_fusion to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set user_buffer to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set tokens_per_block to 64.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set multiple_profiles to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set paged_state to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set streamingllm to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set use_fused_mlp to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.seq_length = 8192
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen2
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = False
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
(identical warning repeated once per parameter; 159 duplicate lines elided)
[01/15/2025-05:42:02] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:42:02] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:42:02] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:42:02] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[01/15/2025-05:42:02] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:42:03] [TRT] [I] [MemUsageChange] Init CUDA: CPU -14, GPU +0, now: CPU 231, GPU 423 (MiB)
[01/15/2025-05:42:18] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +2038, GPU +374, now: CPU 2404, GPU 797 (MiB)
[01/15/2025-05:42:19] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:42:20] [TRT-LLM] [I] Total time of constructing network from module object 17.962120056152344 seconds
[01/15/2025-05:42:20] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:42:20] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:42:20] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:42:20] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:42:25] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:42:25] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:42:26] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:42:34] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:42:34] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:44:06] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:44:06] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:44:06] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:44:06] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:44:07] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 433.139ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:44:07] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:44:08] [TRT] [I] Total Weights Memory: 3251073536 bytes
[01/15/2025-05:44:08] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:44:51] [TRT] [I] Engine generation completed in 146.584 seconds.
[01/15/2025-05:44:51] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:44:54] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:02:33
[01/15/2025-05:44:54] [TRT] [I] Serialized 27 bytes of code generator cache.
[01/15/2025-05:44:54] [TRT] [I] Serialized 125277 bytes of compilation cache.
[01/15/2025-05:44:54] [TRT] [I] Serialized 22 timing cache entries
[01/15/2025-05:44:54] [TRT-LLM] [I] Timing cache serialized to model.cache
[01/15/2025-05:44:54] [TRT-LLM] [I] Build phase peak memory: 26320.83 MB, children: 13.00 MB
[01/15/2025-05:44:54] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank0.engine...
[01/15/2025-05:45:00] [TRT-LLM] [I] Engine serialized. Total time: 00:00:06
[01/15/2025-05:45:00] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
... (identical warnings elided)
[01/15/2025-05:45:00] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:45:00] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:45:00] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:45:00] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:45:00] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:45:00] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[01/15/2025-05:45:00] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:45:00] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:45:01] [TRT-LLM] [I] Total time of constructing network from module object 0.8860244750976562 seconds
[01/15/2025-05:45:01] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:45:01] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:45:01] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:45:01] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:45:05] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:45:05] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:45:06] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:45:14] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:45:14] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:46:42] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:46:42] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:46:42] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:46:42] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:46:43] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 423.463ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:46:43] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:46:44] [TRT] [I] Total Weights Memory: 3251043840 bytes
[01/15/2025-05:46:44] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:46:44] [TRT] [I] Engine generation completed in 98.7165 seconds.
[01/15/2025-05:46:44] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:46:46] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:01:45
[01/15/2025-05:46:46] [TRT-LLM] [I] Build phase peak memory: 46100.56 MB, children: 13.00 MB
[01/15/2025-05:46:47] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank1.engine...
[01/15/2025-05:46:53] [TRT-LLM] [I] Engine serialized. Total time: 00:00:06
[01/15/2025-05:46:53] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
... (identical warnings elided)
[01/15/2025-05:46:53] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:46:53] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:46:53] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:46:53] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:46:53] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:46:53] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:46:53] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[01/15/2025-05:46:53] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:46:54] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Total time of constructing network from module object 0.9638781547546387 seconds
[01/15/2025-05:46:54] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:46:54] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:46:54] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:46:54] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:46:58] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:46:58] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:46:59] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:47:07] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:47:08] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:48:36] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:48:36] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:48:36] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:48:36] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:48:36] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 432.537ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:48:36] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:48:37] [TRT] [I] Total Weights Memory: 3251043840 bytes
[01/15/2025-05:48:37] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:48:37] [TRT] [I] Engine generation completed in 98.9362 seconds.
[01/15/2025-05:48:37] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:48:40] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:01:45
[01/15/2025-05:48:40] [TRT-LLM] [I] Build phase peak memory: 46105.67 MB, children: 13.00 MB
[01/15/2025-05:48:41] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank2.engine...
[01/15/2025-05:48:47] [TRT-LLM] [I] Engine serialized. Total time: 00:00:06
[01/15/2025-05:48:47] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
... (identical warnings elided)
[01/15/2025-05:48:47] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:48:47] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:48:47] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:48:47] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:48:47] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[01/15/2025-05:48:47] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:48:47] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:48:48] [TRT-LLM] [I] Total time of constructing network from module object 0.9600269794464111 seconds
[01/15/2025-05:48:48] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:48:48] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:48:48] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:48:48] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:48:52] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:48:52] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:48:52] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:49:00] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:49:00] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:50:29] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:50:29] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:50:29] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:50:29] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:50:29] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 434.772ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:50:29] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:50:30] [TRT] [I] Total Weights Memory: 3251043840 bytes
[01/15/2025-05:50:30] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:50:30] [TRT] [I] Engine generation completed in 98.2933 seconds.
[01/15/2025-05:50:30] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:50:33] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:01:44
[01/15/2025-05:50:33] [TRT-LLM] [I] Build phase peak memory: 46105.16 MB, children: 13.00 MB
[01/15/2025-05:50:33] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank3.engine...
[01/15/2025-05:50:39] [TRT-LLM] [I] Engine serialized. Total time: 00:00:05
[01/15/2025-05:50:40] [TRT-LLM] [I] Total time of building all engines: 00:08:38
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:18] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 0
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:19] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 2
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:19] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:19] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 3
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 1
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 2
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 1
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 3
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] Rank 2 is using GPU 2
[TensorRT-LLM][INFO] Rank 3 is using GPU 3
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
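The `maxInputLen` line above can be verified with the formula the log itself states; this is arithmetic on the logged values only:

```python
# Reproducing the maxInputLen derivation reported by TRTGptModel above.
max_sequence_len = 32768   # "TRTGptModel maxSequenceLen: 32768"
max_num_tokens = 8192      # "TRTGptModel maxNumTokens: 8192"

# With context FMHA and packed input enabled, TensorRT-LLM caps the
# input length at min(maxSequenceLen - 1, maxNumTokens):
max_input_len = min(max_sequence_len - 1, max_num_tokens)
print(max_input_len)  # 8192, matching the log
```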
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 2
[TensorRT-LLM][INFO] Detecting local TP group for rank 3
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 2
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
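The paged KV-cache numbers in the log are internally consistent, which suggests the cache itself was sized correctly (again, arithmetic on the logged values only):

```python
# Cross-checking the paged KV-cache numbers from the log above.
tokens_per_block = 64     # "Number of tokens per block: 64."
primary_blocks = 13663    # "Number of blocks in KV cache primary pool: 13663"
max_seq_len = 32768       # "TRTGptModel maxSequenceLen: 32768"

# Maximum tokens the paged KV cache can hold:
max_kv_tokens = primary_blocks * tokens_per_block
print(max_kv_tokens)      # 874432, matching "(874432)" in the log

# Pages needed for one full-length sequence:
pages_per_sequence = max_seq_len // tokens_per_block
print(pages_per_sequence) # 512, matching "Max KV cache pages per sequence: 512"
```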
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.803145170211792 sec
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.803199291229248 sec
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.803199529647827 sec
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.80239987373352 sec
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
推荐10个北京好玩的景区 (Recommend 10 fun scenic spots in Beijing)<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: ""
Who can help?
I upgraded from 0.12.0 to 0.16.0, but running Qwen with SmoothQuant INT8 still fails: the model returns an empty output at inference time. Please take a look.
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
python3 convert_checkpoint.py --model_dir /opt/model/Qwen2-72B-Instruct \
    --output_dir ./tllm_checkpoint_4gpu_sq_int8_batch_8 \
    --dtype float16 \
    --smoothquant 0.5 \
    --per_token \
    --per_channel \
    --tp_size 4

trtllm-build --checkpoint_dir ./tllm_checkpoint_4gpu_sq_int8_batch_8 \
    --output_dir ./int8_sq_bin_0113_batch8_20000_tp4/ \
    --gemm_plugin float16

mpirun -n 4 --allow-run-as-root \
    python3 ../run.py --input_text "推荐10个北京好玩的景区" \
    --max_output_len=500 \
    --tokenizer_dir /opt/model/Qwen2-72B-Instruct/ \
    --engine_dir=/opt/model/TensorRT-LLM-0.16.0/examples/qwen/int8_sq_bin_0113_batch8_20000_tp4/
Expected behavior
The model should generate a non-empty completion for the prompt.
Actual behavior
The output is an empty string ('').
Additional notes
None.