0.16.0 Qwen2-72B-Instruct SQ error #2693

Open
gy0514020329 opened this issue Jan 15, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@gy0514020329
System Info

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37/37 [02:24<00:00, 3.90s/it]

Starting from v4.46, the logits model output will have the same type as the model (except at train time, where it will always be FP32)
calibrating model: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 512/512 [1:51:15<00:00, 13.04s/it]
Weights loaded. Total time: 00:04:54
Weights loaded. Total time: 00:04:48
Weights loaded. Total time: 00:04:46
Weights loaded. Total time: 00:04:41
Total time of converting checkpoints: 02:16:55

[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[01/15/2025-05:42:02] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set gemm_plugin to float16.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set nccl_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set lora_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set moe_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set context_fmha to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set remove_input_padding to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set reduce_fusion to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set user_buffer to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set tokens_per_block to 64.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set multiple_profiles to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set paged_state to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set streamingllm to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set use_fused_mlp to True.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.seq_length = 8192
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen2
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
[01/15/2025-05:42:02] [TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = False
[01/15/2025-05:42:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
(... the warning above repeats identically for every quantized parameter; ~160 duplicate lines elided ...)
[01/15/2025-05:42:02] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:42:02] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:42:02] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:42:02] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:42:02] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:42:02] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[01/15/2025-05:42:02] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:42:03] [TRT] [I] [MemUsageChange] Init CUDA: CPU -14, GPU +0, now: CPU 231, GPU 423 (MiB)
[01/15/2025-05:42:18] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +2038, GPU +374, now: CPU 2404, GPU 797 (MiB)
[01/15/2025-05:42:19] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:42:19] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:42:20] [TRT-LLM] [I] Total time of constructing network from module object 17.962120056152344 seconds
[01/15/2025-05:42:20] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:42:20] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:42:20] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:42:20] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:42:25] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:42:25] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:42:26] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:42:34] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:42:34] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:44:06] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:44:06] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:44:06] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:44:06] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:44:07] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 433.139ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:44:07] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:44:08] [TRT] [I] Total Weights Memory: 3251073536 bytes
[01/15/2025-05:44:08] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:44:51] [TRT] [I] Engine generation completed in 146.584 seconds.
[01/15/2025-05:44:51] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:44:54] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:02:33
[01/15/2025-05:44:54] [TRT] [I] Serialized 27 bytes of code generator cache.
[01/15/2025-05:44:54] [TRT] [I] Serialized 125277 bytes of compilation cache.
[01/15/2025-05:44:54] [TRT] [I] Serialized 22 timing cache entries
[01/15/2025-05:44:54] [TRT-LLM] [I] Timing cache serialized to model.cache
[01/15/2025-05:44:54] [TRT-LLM] [I] Build phase peak memory: 26320.83 MB, children: 13.00 MB
[01/15/2025-05:44:54] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank0.engine...
[01/15/2025-05:45:00] [TRT-LLM] [I] Engine serialized. Total time: 00:00:06
[01/15/2025-05:45:00] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
(... repeated DataType.HALF/DataType.FLOAT warnings elided ...)
[01/15/2025-05:45:00] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:45:00] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:45:00] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:45:00] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:45:00] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:45:00] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[01/15/2025-05:45:00] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:45:00] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:45:00] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:45:01] [TRT-LLM] [I] Total time of constructing network from module object 0.8860244750976562 seconds
[01/15/2025-05:45:01] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:45:01] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:45:01] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:45:01] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:45:05] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:45:05] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:45:06] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:45:14] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:45:14] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:46:42] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:46:42] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:46:42] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:46:42] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:46:43] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 423.463ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:46:43] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:46:44] [TRT] [I] Total Weights Memory: 3251043840 bytes
[01/15/2025-05:46:44] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:46:44] [TRT] [I] Engine generation completed in 98.7165 seconds.
[01/15/2025-05:46:44] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:46:46] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:01:45
[01/15/2025-05:46:46] [TRT-LLM] [I] Build phase peak memory: 46100.56 MB, children: 13.00 MB
[01/15/2025-05:46:47] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank1.engine...
[01/15/2025-05:46:53] [TRT-LLM] [I] Engine serialized. Total time: 00:00:06
[01/15/2025-05:46:53] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
(... repeated DataType.HALF/DataType.FLOAT warnings elided ...)
[01/15/2025-05:46:53] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[01/15/2025-05:46:53] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:46:53] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:46:53] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:46:53] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:46:53] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:46:53] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[01/15/2025-05:46:53] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:46:54] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:46:54] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:46:54] [TRT-LLM] [I] Total time of constructing network from module object 0.9638781547546387 seconds
[01/15/2025-05:46:54] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:46:54] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:46:54] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:46:54] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:46:58] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:46:58] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:46:59] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:47:07] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:47:08] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:48:36] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:48:36] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:48:36] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:48:36] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:48:36] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 432.537ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:48:36] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:48:37] [TRT] [I] Total Weights Memory: 3251043840 bytes
[01/15/2025-05:48:37] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:48:37] [TRT] [I] Engine generation completed in 98.9362 seconds.
[01/15/2025-05:48:37] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:48:40] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:01:45
[01/15/2025-05:48:40] [TRT-LLM] [I] Build phase peak memory: 46105.67 MB, children: 13.00 MB
[01/15/2025-05:48:41] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank2.engine...
[01/15/2025-05:48:47] [TRT-LLM] [I] Engine serialized. Total time: 00:00:06
[01/15/2025-05:48:47] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
(... repeated DataType.HALF/DataType.FLOAT warnings elided ...)
[01/15/2025-05:48:47] [TRT-LLM] [I] Set dtype to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set paged_kv_cache to True.
[01/15/2025-05:48:47] [TRT-LLM] [W] Overriding paged_state to False
[01/15/2025-05:48:47] [TRT-LLM] [I] Set paged_state to False.
[01/15/2025-05:48:47] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
[01/15/2025-05:48:47] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[01/15/2025-05:48:47] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[01/15/2025-05:48:47] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[01/15/2025-05:48:47] [TRT-LLM] [I] Set nccl_plugin to float16.
[01/15/2025-05:48:48] [TRT-LLM] [I] Total time of constructing network from module object 0.9600269794464111 seconds
[01/15/2025-05:48:48] [TRT-LLM] [I] Total optimization profiles added: 1
[01/15/2025-05:48:48] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[01/15/2025-05:48:48] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[01/15/2025-05:48:48] [TRT] [W] Unused Input: position_ids
[01/15/2025-05:48:52] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[01/15/2025-05:48:52] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/15/2025-05:48:52] [TRT] [I] Compiler backend is used during engine build.
[01/15/2025-05:49:00] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/15/2025-05:49:00] [TRT] [I] Detected 18 inputs and 1 output network tensors.
[01/15/2025-05:50:29] [TRT] [I] Total Host Persistent Memory: 547120 bytes
[01/15/2025-05:50:29] [TRT] [I] Total Device Persistent Memory: 0 bytes
[01/15/2025-05:50:29] [TRT] [I] Max Scratch Memory: 268468224 bytes
[01/15/2025-05:50:29] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2181 steps to complete.
[01/15/2025-05:50:29] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 434.772ms to assign 21 blocks to 2181 nodes requiring 1501994496 bytes.
[01/15/2025-05:50:29] [TRT] [I] Total Activation Memory: 1501992960 bytes
[01/15/2025-05:50:30] [TRT] [I] Total Weights Memory: 3251043840 bytes
[01/15/2025-05:50:30] [TRT] [I] Compiler backend is used during engine execution.
[01/15/2025-05:50:30] [TRT] [I] Engine generation completed in 98.2933 seconds.
[01/15/2025-05:50:30] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3101 MiB
[01/15/2025-05:50:33] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:01:44
[01/15/2025-05:50:33] [TRT-LLM] [I] Build phase peak memory: 46105.16 MB, children: 13.00 MB
[01/15/2025-05:50:33] [TRT-LLM] [I] Serializing engine to ./int8_sq_bin_0113_batch8_20000_tp4/rank3.engine...
[01/15/2025-05:50:39] [TRT-LLM] [I] Engine serialized. Total time: 00:00:05
[01/15/2025-05:50:40] [TRT-LLM] [I] Total time of building all engines: 00:08:38

[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:18] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 0
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:19] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 2
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:19] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/15/2025-05:52:19] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 3
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 1
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 2
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 1
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 3
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] Rank 2 is using GPU 2
[TensorRT-LLM][INFO] Rank 3 is using GPU 3
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3122 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 2
[TensorRT-LLM][INFO] Detecting local TP group for rank 3
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 2
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1432.41 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.88 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 74.12 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 13663
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.71 GiB for max tokens in paged KV cache (874432).
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.803145170211792 sec
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.803199291229248 sec
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.803199529647827 sec
[01/15/2025-05:52:33] [TRT-LLM] [I] Load engine takes: 13.80239987373352 sec
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
推荐10个北京好玩的景区<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: ""

Who can help?

I upgraded from 0.12.0 to 0.16.0 but still hit the same problem when running Qwen2-72B-Instruct with SmoothQuant INT8: checkpoint conversion and the engine build both complete, yet generation returns an empty output at runtime. Please take a look.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

python3 convert_checkpoint.py --model_dir /opt/model/Qwen2-72B-Instruct \
    --output_dir ./tllm_checkpoint_4gpu_sq_int8_batch_8 \
    --dtype float16 \
    --smoothquant 0.5 \
    --per_token \
    --per_channel \
    --tp_size 4

trtllm-build --checkpoint_dir ./tllm_checkpoint_4gpu_sq_int8_batch_8 \
    --output_dir ./int8_sq_bin_0113_batch8_20000_tp4/ \
    --gemm_plugin float16

mpirun -n 4 --allow-run-as-root \
    python3 ../run.py --input_text "推荐10个北京好玩的景区" \
    --max_output_len=500 \
    --tokenizer_dir /opt/model/Qwen2-72B-Instruct/ \
    --engine_dir=/opt/model/TensorRT-LLM-0.16.0/examples/qwen/int8_sq_bin_0113_batch8_20000_tp4/
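
To isolate whether the empty output comes from run.py's argument handling or from the engine itself, a minimal Python sketch along the lines of what examples/run.py does with the ModelRunner API may help. This is an unverified sketch: the paths are carried over from the commands above, and using tokenizer.apply_chat_template / tokenizer.eos_token_id for Qwen2's chat format and end token is an assumption. Launch it under the same mpirun -n 4 command so each rank loads its engine shard.

# minimal_repro.py -- sketch based on the ModelRunner usage in examples/run.py
import torch
from transformers import AutoTokenizer

import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner

engine_dir = "/opt/model/TensorRT-LLM-0.16.0/examples/qwen/int8_sq_bin_0113_batch8_20000_tp4/"
tokenizer = AutoTokenizer.from_pretrained("/opt/model/Qwen2-72B-Instruct/")

# Build the same chat-formatted prompt shown in the log above.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "推荐10个北京好玩的景区"},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

runner = ModelRunner.from_dir(engine_dir=engine_dir, rank=tensorrt_llm.mpi_rank())
outputs = runner.generate(
    batch_input_ids=[torch.tensor(input_ids, dtype=torch.int32)],
    max_new_tokens=500,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,
    return_dict=True,
)
if tensorrt_llm.mpi_rank() == 0:
    # Inspect the raw ids: if the engine emits end_id as the very first new
    # token, the decoded text is empty even though generation "succeeded".
    new_ids = outputs["output_ids"][0, 0, len(input_ids):].tolist()
    print("first new token ids:", new_ids[:20])
    print("decoded:", tokenizer.decode(new_ids, skip_special_tokens=True))

If the first new token id already equals end_id, the engine itself is emitting EOS immediately, which would point at the SmoothQuant-quantized weights rather than anything in the runner or tokenizer setup.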

Expected behavior

A non-empty generation for the prompt.

Actual behavior

run.py prints an empty string: Output [Text 0 Beam 0]: ""

Additional notes

None.

@gy0514020329 gy0514020329 added the bug Something isn't working label Jan 15, 2025