Starting from v4.46, the logits model output will have the same type as the model (except at train time, where it will always be FP32)
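The line above is a Hugging Face Transformers notice: from v4.46 the `logits` output keeps the model's dtype at inference time instead of always being upcast to FP32. Downstream code that silently relied on FP32 logits can lose precision in half precision. A minimal sketch of the failure mode and the fix, using a NumPy half-precision array as a stand-in for model logits:

```python
import numpy as np

# Stand-in for logits from a half-precision model (post v4.46 they stay
# float16 at inference time instead of being upcast to float32).
logits = np.array([10.0, 10.001, 9.999], dtype=np.float16)

# float16 spacing near 10 is ~0.0078, so the 0.001 gaps collapse:
assert logits[0] == logits[1] == logits[2]

# Downstream code that assumed FP32 logits should upcast explicitly
# before numerically sensitive post-processing:
logits32 = logits.astype(np.float32)
probs = np.exp(logits32 - logits32.max())
probs /= probs.sum()
assert abs(probs.sum() - 1.0) < 1e-6
```

With actual Transformers outputs the same idea applies: call `.float()` on `outputs.logits` before softmax/log-prob computations if FP32 behavior is required.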
calibrating model: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 512/512 [1:51:15<00:00, 13.04s/it]
Weights loaded. Total time: 00:04:54
Weights loaded. Total time: 00:04:48
Weights loaded. Total time: 00:04:46
Weights loaded. Total time: 00:04:41
Total time of converting checkpoints: 02:16:55
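The conversion-phase timings above are internally consistent; a quick accounting, with all values copied from the log lines:

```python
# Wall-clock accounting for the checkpoint-conversion phase.
calib_s = 1 * 3600 + 51 * 60 + 15                # calibration: 1:51:15 over 512 batches
per_batch = calib_s / 512                        # matches the reported 13.04 s/it
loads_s = [4*60+54, 4*60+48, 4*60+46, 4*60+41]   # four "Weights loaded" passes
total_s = 2 * 3600 + 16 * 60 + 55                # "Total time of converting checkpoints"

accounted = calib_s + sum(loads_s)               # 7824 s = 2:10:24
overhead_s = total_s - accounted                 # remaining quantization/export overhead

assert round(per_batch, 2) == 13.04
assert overhead_s > 0                            # calibration dominates the total
```

Calibration alone is roughly 81% of the 2:16:55 total; the four per-rank weight-loading passes add about 19 minutes, leaving ~6.5 minutes of other overhead.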
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[01/15/2025-05:42:02] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set gemm_plugin to float16.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set nccl_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set lora_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set moe_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set context_fmha to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set remove_input_padding to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set reduce_fusion to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set user_buffer to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set tokens_per_block to 64.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set multiple_profiles to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set paged_state to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set streamingllm to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set use_fused_mlp to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.seq_length = 8192
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen2
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = False
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
...
[01/15/2025-05:42:02] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:42:02] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:42:02] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:42:02] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[01/15/2025-05:42:02] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:42:03] [TRT] [I] [MemUsageChange] Init CUDA: CPU -14, GPU +0, now: CPU 231, GPU 423 (MiB)
[01/15/2025-05:42:18] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +2038, GPU +374, now: CPU 2404, GPU 797 (MiB)
[01/15/2025-05:42:19] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:42:20] [TRT-LLM] [I] Total time of constructing network from module object 17.962120056152344 seconds
[01/15/2025-05:42:20] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:42:20] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:42:20] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:42:20] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:42:25] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:42:25] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:42:26] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:42:34] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:42:34] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:44:06] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:44:06] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:44:06] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:44:06] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:44:07] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 433.139ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:44:07] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:44:08] [TRT] [I] Total Weights Memory: 3251073536 bytes
[01/15/2025-05:44:08] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:44:51] [TRT] [I] Engine generation completed in 146.584 seconds.
[01/15/2025-05:44:51] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:44:54] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:02:33
[01/15/2025-05:44:54] [TRT] [I] Serialized 27 bytes of code generator cache.
[01/15/2025-05:44:54] [TRT] [I] Serialized 125277 bytes of compilation cache.
[01/15/2025-05:44:54] [TRT] [I] Serialized 22 timing cache entries
[01/15/2025-05:44:54] [TRT-LLM] [I] Timing cache serialized to model.cache
[01/15/2025-05:44:54] [TRT-LLM] [I] Build phase peak memory: 26320.83 MB, children: 13.00 MB
[01/15/2025-05:44:54] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank0.engine...
[01/15/2025-05:45:00] [TRT-LLM] [I] Engine serialized. Total time: 00:00:06
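The rank-0 memory figures above mix units (bytes for activation and weights memory, MiB for the allocator peak). A quick normalization, values copied from the log:

```python
# Memory figures copied from the rank-0 build log above.
activation_bytes = 1_501_992_960   # "Total Activation Memory"
weights_bytes = 3_251_073_536      # "Total Weights Memory"

MiB, GiB = 1024 ** 2, 1024 ** 3

# The 3101 MiB GPU allocator peak is essentially the weights alone:
assert round(weights_bytes / MiB) == 3100

activation_gib = activation_bytes / GiB    # ~1.40 GiB of activation/scratch space
weights_gib = weights_bytes / GiB          # ~3.03 GiB of INT8 weights per rank

# Across the 4 tensor-parallel ranks the engines hold roughly 12.1 GiB of
# weights in total (ranks 1-3 report marginally fewer weight bytes than rank 0).
total_weights_gib = 4 * weights_gib
assert 12.0 < total_weights_gib < 12.2
```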
[01/15/2025-05:45:00] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
...
[01/15/2025-05:45:00] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:45:00] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:45:00] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:45:00] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:45:00] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:45:00] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[01/15/2025-05:45:00] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:45:00] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:45:01] [TRT-LLM] [I] Total time of constructing network from module object 0.8860244750976562 seconds
[01/15/2025-05:45:01] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:45:01] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:45:01] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:45:01] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:45:05] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:45:05] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:45:06] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:45:14] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:45:14] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:46:42] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:46:42] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:46:42] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:46:42] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:46:43] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 423.463ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:46:43] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:46:44] [TRT] [I] Total Weights Memory: 3251043840 bytes
[01/15/2025-05:46:44] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:46:44] [TRT] [I] Engine generation completed in 98.7165 seconds.
[01/15/2025-05:46:44] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:46:46] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:01:45
[01/15/2025-05:46:46] [TRT-LLM] [I] Build phase peak memory: 46100.56 MB, children: 13.00 MB
[01/15/2025-05:46:47] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank1.engine...
[01/15/2025-05:46:53] [TRT-LLM] [I] Engine serialized. Total time: 00:00:06
[01/15/2025-05:46:53] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
...
[01/15/2025-05:46:53] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:46:53] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:46:53] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:46:53] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:46:53] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:46:53] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:46:53] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[01/15/2025-05:46:53] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:46:54] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Total time of constructing network from module object 0.9638781547546387 seconds
[01/15/2025-05:46:54] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:46:54] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:46:54] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:46:54] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:46:58] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:46:58] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:46:59] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:47:07] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:47:08] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:48:36] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:48:36] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:48:36] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:48:36] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:48:36] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 432.537ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:48:36] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:48:37] [TRT] [I] Total Weights Memory: 3251043840 bytes
[01/15/2025-05:48:37] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:48:37] [TRT] [I] Engine generation completed in 98.9362 seconds.
[01/15/2025-05:48:37] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:48:40] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:01:45
[01/15/2025-05:48:40] [TRT-LLM] [I] Build phase peak memory: 46105.67 MB, children: 13.00 MB
[01/15/2025-05:48:41] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank2.engine...
[01/15/2025-05:48:47] [TRT-LLM] [I] Engine serialized. Total time: 00:00:06
[01/15/2025-05:48:47] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
...
[01/15/2025-05:48:47] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:48:47] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:48:47] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:48:47] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:48:47] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[01/15/2025-05:48:47] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:48:47] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:48:48] [TRT-LLM] [I] Total time of constructing network from module object 0.9600269794464111 seconds
[01/15/2025-05:48:48] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:48:48] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:48:48] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:48:48] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:48:52] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:48:52] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:48:52] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:49:00] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:49:00] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:50:29] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:50:29] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:50:29] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:50:29] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:50:29] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 434.772ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:50:29] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:50:30] [TRT] [I] Total Weights Memory: 3251043840 bytes
[01/15/2025-05:50:30] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:50:30] [TRT] [I] Engine generation completed in 98.2933 seconds.
[01/15/2025-05:50:30] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:50:33] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:01:44
[01/15/2025-05:50:33] [TRT-LLM] [I] Build phase peak memory: 46105.16 MB, children: 13.00 MB
[01/15/2025-05:50:33] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank3.engine...
[01/15/2025-05:50:39] [TRT-LLM] [I] Engine serialized. Total time: 00:00:05
[01/15/2025-05:50:40] [TRT-LLM] [I] Total time of building all engines: 00:08:38
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:18] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 0
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:19] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 2
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:19] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:19] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 3
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 1
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 2
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 1
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 3
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] Rank 2 is using GPU 2
[TensorRT-LLM][INFO] Rank 3 is using GPU 3
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
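The derived limits in the TRTGptModel blocks above follow directly from the build settings (deduced max_seq_len 32768, max_num_tokens 8192, tokens_per_block 64). A quick sanity check reproducing the formulas the log prints:

```python
# Sanity-check the derived TRTGptModel limits printed in the log above.
max_sequence_len = 32768   # deduced max_seq_len
max_num_tokens = 8192      # maxNumTokens from the build
tokens_per_block = 64      # paged KV cache block size

# With context FMHA and packed input enabled, the log states:
#   maxInputLen = min(maxSequenceLen - 1, maxNumTokens)
max_input_len = min(max_sequence_len - 1, max_num_tokens)
print(max_input_len)  # 8192, matching "maxInputLen: 8192"

# Max KV cache pages per sequence = maxSequenceLen / tokens_per_block
max_pages_per_sequence = max_sequence_len // tokens_per_block
print(max_pages_per_sequence)  # 512, matching "Max KV cache pages per sequence: 512"
```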
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 2
[TensorRT-LLM][INFO] Detecting local TP group for rank 3
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 2
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
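The paged KV cache numbers above are internally consistent. The check below assumes a Qwen2-72B-style shape (80 layers, 8 KV heads, head_dim 128, FP16 cache, TP=4); only the block counts come from the log itself, so treat the model-shape constants as assumptions.

```python
# Cross-check the paged KV cache figures reported in the log above.
num_blocks = 13663          # "Number of blocks in KV cache primary pool"
tokens_per_block = 64       # "Number of tokens per block"

max_tokens = num_blocks * tokens_per_block
print(max_tokens)  # 874432, matching "max tokens in paged KV cache (874432)"

num_layers = 80             # consistent with "mMaxAttentionWindowSize: (32768) * 80"
kv_heads_per_rank = 8 // 4  # assumed: 8 KV heads split across TP=4 ranks
head_dim = 128              # assumed
bytes_per_elem = 2          # FP16 KV cache

# K and V each store kv_heads_per_rank * head_dim elements per layer per token.
bytes_per_token = 2 * num_layers * kv_heads_per_rank * head_dim * bytes_per_elem
total_gib = max_tokens * bytes_per_token / 2**30
print(round(total_gib, 2))  # 66.71, matching the 66.71 GiB allocation per rank
```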
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.803145170211792 sec
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.803199291229248 sec
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.803199529647827 sec
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.80239987373352 sec
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
推荐10个北京好玩的景区<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: ""
(The prompt asks the model to recommend 10 fun tourist attractions in Beijing; the model returns an empty string.)
Who can help?
I upgraded from 0.12.0 to 0.16.0, but running Qwen with SmoothQuant INT8 still fails in the same way: the engine produces an empty output at runtime. Could someone please look into this?
Information
The official example scripts
My own modified scripts
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
System Info
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37/37 [02:24<00:00, 3.90s/it]
Starting from v4.46, the logits model output will have the same type as the model (except at train time, where it will always be FP32)
calibrating model: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 512/512 [1:51:15<00:00, 13.04s/it]
Weights loaded. Total time: 00:04:54
Weights loaded. Total time: 00:04:48
Weights loaded. Total time: 00:04:46
Weights loaded. Total time: 00:04:41
Total time of converting checkpoints: 02:16:55
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[01/15/2025-05:42:02] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set gemm_plugin to float16.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set nccl_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set lora_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set moe_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set context_fmha to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set remove_input_padding to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set reduce_fusion to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set user_buffer to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set tokens_per_block to 64.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set multiple_profiles to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set paged_state to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set streamingllm to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set use_fused_mlp to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.seq_length = 8192
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen2
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = False
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
(the DataType warning above repeats 39 more times; identical repeats omitted)
[01/15/2025-05:42:02] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:42:02] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:42:02] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:42:02] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[01/15/2025-05:42:02] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:42:03] [TRT] [I] [MemUsageChange] Init CUDA: CPU -14, GPU +0, now: CPU 231, GPU 423 (MiB)
[01/15/2025-05:42:18] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +2038, GPU +374, now: CPU 2404, GPU 797 (MiB)
[01/15/2025-05:42:19] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:42:20] [TRT-LLM] [I] Total time of constructing network from module object 17.962120056152344 seconds
[01/15/2025-05:42:20] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:42:20] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:42:20] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:42:20] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:42:25] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:42:25] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:42:26] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:42:34] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:42:34] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:44:06] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:44:06] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:44:06] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:44:06] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:44:07] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 433.139ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:44:07] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:44:08] [TRT] [I] Total Weights Memory: 3251073536 bytes
[01/15/2025-05:44:08] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:44:51] [TRT] [I] Engine generation completed in 146.584 seconds.
[01/15/2025-05:44:51] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:44:54] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:02:33
[01/15/2025-05:44:54] [TRT] [I] Serialized 27 bytes of code generator cache.
[01/15/2025-05:44:54] [TRT] [I] Serialized 125277 bytes of compilation cache.
[01/15/2025-05:44:54] [TRT] [I] Serialized 22 timing cache entries
[01/15/2025-05:44:54] [TRT-LLM] [I] Timing cache serialized to model.cache
[01/15/2025-05:44:54] [TRT-LLM] [I] Build phase peak memory: 26320.83 MB, children: 13.00 MB
[01/15/2025-05:44:54] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank0.engine...
[01/15/2025-05:45:00] [TRT-LLM] [I] Engine serialized. Total time: 00:00:06
[01/15/2025-05:45:00] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
... (repeated DataType warnings omitted)
[01/15/2025-05:45:00] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:45:00] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:45:00] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:45:00] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:45:00] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:45:00] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[01/15/2025-05:45:00] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:45:00] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:45:01] [TRT-LLM] [I] Total time of constructing network from module object 0.8860244750976562 seconds
[01/15/2025-05:45:01] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:45:01] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:45:01] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:45:01] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:45:05] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:45:05] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:45:06] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:45:14] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:45:14] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:46:42] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:46:42] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:46:42] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:46:42] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:46:43] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 423.463ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:46:43] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:46:44] [TRT] [I] Total Weights Memory: 3251043840 bytes
[01/15/2025-05:46:44] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:46:44] [TRT] [I] Engine generation completed in 98.7165 seconds.
[01/15/2025-05:46:44] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:46:46] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:01:45
[01/15/2025-05:46:46] [TRT-LLM] [I] Build phase peak memory: 46100.56 MB, children: 13.00 MB
[01/15/2025-05:46:47] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank1.engine...
[01/15/2025-05:46:53] [TRT-LLM] [I] Engine serialized. Total time: 00:00:06
[01/15/2025-05:46:53] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
... (repeated DataType warnings omitted)
[01/15/2025-05:46:53] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:46:53] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:46:53] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:46:53] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:46:53] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:46:53] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:46:53] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[01/15/2025-05:46:53] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:46:54] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Total time of constructing network from module object 0.9638781547546387 seconds
[01/15/2025-05:46:54] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:46:54] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:46:54] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:46:54] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:46:58] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:46:58] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:46:59] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:47:07] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:47:08] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:48:36] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:48:36] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:48:36] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:48:36] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:48:36] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 432.537ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:48:36] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:48:37] [TRT] [I] Total Weights Memory: 3251043840 bytes
[01/15/2025-05:48:37] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:48:37] [TRT] [I] Engine generation completed in 98.9362 seconds.
[01/15/2025-05:48:37] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:48:40] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:01:45
[01/15/2025-05:48:40] [TRT-LLM] [I] Build phase peak memory: 46105.67 MB, children: 13.00 MB
[01/15/2025-05:48:41] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank2.engine...
[01/15/2025-05:48:47] [TRT-LLM] [I] Engine serialized. Total time: 00:00:06
[01/15/2025-05:48:47] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
... (repeated DataType warnings omitted)
[01/15/2025-05:48:47] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:48:47] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:48:47] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:48:47] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:48:47] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[01/15/2025-05:48:47] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:48:47] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:48:48] [TRT-LLM] [I] Total time of constructing network from module object 0.9600269794464111 seconds
[01/15/2025-05:48:48] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:48:48] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:48:48] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:48:48] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:48:52] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:48:52] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:48:52] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:49:00] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:49:00] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:50:29] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:50:29] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:50:29] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:50:29] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:50:29] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 434.772ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:50:29] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:50:30] [TRT] [I] Total Weights Memory: 3251043840 bytes
[01/15/2025-05:50:30] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:50:30] [TRT] [I] Engine generation completed in 98.2933 seconds.
[01/15/2025-05:50:30] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:50:33] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:01:44
[01/15/2025-05:50:33] [TRT-LLM] [I] Build phase peak memory: 46105.16 MB, children: 13.00 MB
[01/15/2025-05:50:33] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank3.engine...
[01/15/2025-05:50:39] [TRT-LLM] [I] Engine serialized. Total time: 00:00:05
[01/15/2025-05:50:40] [TRT-LLM] [I] Total time of building all engines: 00:08:38
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:18] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 0
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:19] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 2
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:19] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:19] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 3
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 1
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 2
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 1
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 3
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] Rank 2 is using GPU 2
[TensorRT-LLM][INFO] Rank 3 is using GPU 3
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
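The reported maxInputLen is not a separate setting here; it follows from the formula the log itself states for context FMHA with packed input. A quick check in plain Python, using the values copied from the log lines above:

```python
# Values reported in the TRTGptModel log lines above.
max_sequence_len = 32768
max_num_tokens = 8192

# With context FMHA and packed input enabled:
# maxInputLen = min(maxSequenceLen - 1, maxNumTokens)
max_input_len = min(max_sequence_len - 1, max_num_tokens)
print(max_input_len)  # 8192, matching "TRTGptModel maxInputLen: 8192"
```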
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 2
[TensorRT-LLM][INFO] Detecting local TP group for rank 3
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 2
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
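The KV-cache figures in the log are internally consistent; a quick arithmetic check (plain Python, values copied from the log above):

```python
# KV-cache sizing figures reported in the log above (per rank).
blocks_primary = 13663    # "Number of blocks in KV cache primary pool"
tokens_per_block = 64     # "Number of tokens per block"
max_seq_len = 32768       # deduced max_seq_len

max_tokens = blocks_primary * tokens_per_block
pages_per_seq = max_seq_len // tokens_per_block

print(max_tokens)      # 874432, matching "max tokens in paged KV cache (874432)"
print(pages_per_seq)   # 512, matching "Max KV cache pages per sequence: 512"
```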
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.803145170211792 sec
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.803199291229248 sec
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.803199529647827 sec
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.80239987373352 sec
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
推荐10个北京好玩的景区<|im_end|>
<|im_start|>assistant
"
(The prompt asks, in Chinese: "Recommend 10 fun scenic spots in Beijing.")
Output [Text 0 Beam 0]: ""
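When scripting regression checks against this kind of run, the empty generation can be detected from the `Output [Text i Beam j]: "..."` lines that examples/run.py prints. A minimal sketch (a hypothetical helper, not part of TensorRT-LLM):

```python
import re

def empty_outputs(log_text: str) -> list[tuple[int, int]]:
    """Return (text_idx, beam_idx) pairs whose generated text is empty,
    parsed from 'Output [Text i Beam j]: "..."' lines in a run.py log.
    (Hypothetical helper for scripting checks; not part of TensorRT-LLM.)"""
    hits = []
    for m in re.finditer(r'Output \[Text (\d+) Beam (\d+)\]: "(.*?)"',
                         log_text, re.S):
        if m.group(3) == "":
            hits.append((int(m.group(1)), int(m.group(2))))
    return hits

log = 'Output [Text 0 Beam 0]: ""'
print(empty_outputs(log))  # [(0, 0)]
```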
Who can help?
I upgraded TensorRT-LLM from 0.12.0 to 0.16.0 but still hit this issue when running Qwen2-72B with SmoothQuant INT8: the engine builds and runs, but produces an empty output. Please take a look.
Information

Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
python3 convert_checkpoint.py --model_dir /opt/model/Qwen2-72B-Instruct \
    --output_dir ./tllm_checkpoint_4gpu_sq_int8_batch_8 \
    --dtype float16 --smoothquant 0.5 --per_token --per_channel --tp_size 4

trtllm-build --checkpoint_dir ./tllm_checkpoint_4gpu_sq_int8_batch_8 \
    --output_dir ./int8_sq_bin_0113_batch8_20000_tp4/ \
    --gemm_plugin float16

mpirun -n 4 --allow-run-as-root \
    python3 ../run.py --input_text "推荐10个北京好玩的景区" \
    --max_output_len=500 \
    --tokenizer_dir /opt/model/Qwen2-72B-Instruct/ \
    --engine_dir=/opt/model/TensorRT-LLM-0.16.0/examples/qwen/int8_sq_bin_0113_batch8_20000_tp4/
Expected behavior
A non-empty generation for the prompt.

Actual behavior
The output is an empty string ('').

Additional notes
None.