hi @krishnanpooja, we don't test the latest TRT-LLM functionality on the T4 platform, as we removed its support in the 0.14 release. So there's no guarantee that your case will run successfully on T4.
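For anyone hitting the same wall: one way to check up front whether a GPU clears the post-0.14 bar is to compare its compute capability against SM 8.0 (Ampere). A minimal sketch with a hand-made lookup table (both the table and the SM 8.0 cutoff are assumptions for illustration, not an official NVIDIA or TRT-LLM API):

```python
# Sketch: compare a GPU's compute capability (SM version) against an
# assumed minimum of SM 8.0 (Ampere) for TensorRT-LLM >= 0.14.
# The table is a small hand-made subset, not queried from hardware.
SM_BY_GPU = {
    "Tesla T4": (7, 5),  # Turing
    "A100": (8, 0),      # Ampere
    "L4": (8, 9),        # Ada Lovelace
    "H100": (9, 0),      # Hopper
}

MIN_SUPPORTED_SM = (8, 0)  # assumption: pre-Ampere dropped since 0.14

def is_supported(gpu_name: str) -> bool:
    """Return True if the GPU's SM version meets the assumed minimum."""
    cap = SM_BY_GPU.get(gpu_name)
    return cap is not None and cap >= MIN_SUPPORTED_SM

print(is_supported("Tesla T4"))  # False: T4 is SM 7.5, below Ampere
print(is_supported("A100"))      # True: SM 8.0
```

On a live machine the capability can instead be read from `nvidia-smi` or `torch.cuda.get_device_capability()`; the lookup table here just keeps the sketch self-contained.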
System Info
- GPU: 4 x NVIDIA Tesla T4
- NVIDIA-SMI 470.256.02, Driver Version: 470.256.02, CUDA Version: 12.1
Who can help?
@byshiue, can you please guide me on how I can resolve this error?
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
python modeling/convert_checkpoint.py --model_dir ${{inputs.model_path}} \
    --output_dir ${{outputs.output_dir}}/tllm_checkpoint_1gpu_mistral \
    --dtype float16 \
    --tp_size 4
echo "converted checkpoint"
export CUDA_MODULE_LOADING=LAZY
echo $CUDA_MODULE_LOADING
trtllm-build --checkpoint_dir ${{outputs.output_dir}}/tllm_checkpoint_1gpu_mistral \
    --output_dir ${{outputs.output_dir}}/tllm_checkpoint_1gpu_mistral/trt_build_output \
    --gemm_plugin auto \
    --max_input_len 8000 \
    --context_fmha enable
Expected behavior
I am able to load the model on the T4s using the vLLM engine. My expectation is that it should work with TensorRT-LLM as well.
Actual behavior
*** Process received signal ***
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] Signal: Aborted (6)
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] Signal code: (-6)
[ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x15299ac49420]
[ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x15299a92c00b]
[ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x15299a90b859]
[ 3] /opt/conda/envs/ptca/bin/../lib/libstdc++.so.6(+0xb135a)[0x15290c29435a]
[ 4] /opt/conda/envs/ptca/bin/../lib/libstdc++.so.6(+0xb03b9)[0x15290c2933b9]
[ 5] /opt/conda/envs/ptca/bin/../lib/libstdc++.so.6(__gxx_personality_v0+0x87)[0x15290c293ae7]
[ 6] /opt/conda/envs/ptca/bin/../lib/libgcc_s.so.1(+0x111e4)[0x1529976221e4]
[ 7] /opt/conda/envs/ptca/bin/../lib/libgcc_s.so.1(_Unwind_Resume+0x12e)[0x152997622c1e]
[ 8] /opt/conda/envs/ptca/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7891f4)[0x1527237051f4]
[ 9] /opt/conda/envs/ptca/lib/python3.10/site-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins24GPTAttentionPluginCommon10initializeEv+0x275)[0x1525ea662375]
[10] /opt/conda/envs/ptca/lib/python3.10/site-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZNK12tensorrt_llm7plugins24GPTAttentionPluginCommon9cloneImplINS0_18GPTAttentionPluginEEEPT_v+0x333)[0x1525ea6a9313]
[11] /opt/conda/envs/ptca/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xb7a5a7)[0x1528fe3865a7]
[12] /opt/conda/envs/ptca/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa8188e)[0x1528fe28d88e]
[13] /opt/conda/envs/ptca/lib/python3.10/site-packages/tensorrt_bindings/tensorrt.so(+0xfe062)[0x15284e5d2062]
[14] /opt/conda/envs/ptca/lib/python3.10/site-packages/tensorrt_bindings/tensorrt.so(+0x4c91e)[0x15284e52091e]
[15] /opt/conda/envs/ptca/bin/python[0x4fcaf7]
[16] /opt/conda/envs/ptca/bin/python(_PyObject_MakeTpCall+0x25b)[0x4f657b]
[17] /opt/conda/envs/ptca/bin/python[0x50861f]
[18] /opt/conda/envs/ptca/bin/python(_PyEval_EvalFrameDefault+0x4b2c)[0x4f1e9c]
[19] /opt/conda/envs/ptca/bin/python(_PyFunction_Vectorcall+0x6f)[0x4fcf3f]
[20] /opt/conda/envs/ptca/bin/python(PyObject_Call+0xb8)[0x508cd8]
[21] /opt/conda/envs/ptca/bin/python(_PyEval_EvalFrameDefault+0x2de4)[0x4f0154]
[22] /opt/conda/envs/ptca/bin/python(_PyFunction_Vectorcall+0x6f)[0x4fcf3f]
[23] /opt/conda/envs/ptca/bin/python(PyObject_Call+0xb8)[0x508cd8]
[24] /opt/conda/envs/ptca/bin/python(_PyEval_EvalFrameDefault+0x2de4)[0x4f0154]
[25] /opt/conda/envs/ptca/bin/python[0x50832e]
[26] /opt/conda/envs/ptca/bin/python(PyObject_Call+0xb8)[0x508cd8]
[27] /opt/conda/envs/ptca/bin/python(_PyEval_EvalFrameDefault+0x2de4)[0x4f0154]
[28] /opt/conda/envs/ptca/bin/python(_PyFunction_Vectorcall+0x6f)[0x4fcf3f]
[29] /opt/conda/envs/ptca/bin/python(_PyObject_FastCallDictTstate+0x17d)[0x4f5afd]
*** End of error message ***
/bin/bash: line 5: 279 Aborted (core dumped) trtllm-build --checkpoint_dir output_dir/tllm_checkpoint_1gpu_mistral --output_dir output_dir/tllm_checkpoint_1gpu_mistral/trt_build_output --gemm_plugin auto --max_input_len 8000 --context_fmha enable
I also tried --context_fmha disable; that run failed with a memory error instead.
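For context on the memory error, here is a back-of-envelope estimate of per-GPU memory for this build. The model dimensions (32 layers, 8 KV heads, head_dim 128, ~7.2B params, fp16) are assumed Mistral-7B-like values for illustration, not figures taken from this issue:

```python
# Back-of-envelope memory sketch per T4 (16 GiB card) for this build.
# Assumed Mistral-7B-like shape: 32 layers, 8 KV heads, head_dim 128,
# ~7.2e9 params, fp16 weights -- illustrative numbers only.
params, dtype_bytes, tp_size = 7.2e9, 2, 4
weights_gib_per_gpu = params * dtype_bytes / tp_size / 2**30  # ~3.35 GiB

layers, kv_heads, head_dim = 32, 8, 128
tokens = 8000  # matches --max_input_len in the build command above
# K and V caches, per token: 2 * layers * kv_heads * head_dim * bytes
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
kv_cache_gib = tokens * bytes_per_token / 2**30  # one sequence, before TP

print(f"weights/GPU: {weights_gib_per_gpu:.2f} GiB")
print(f"KV cache per 8000-token sequence: {kv_cache_gib:.2f} GiB")
```

Under these assumptions the weights shard and a single full-length KV cache fit comfortably in 16 GiB, so any OOM would likely come from build-time workspace, activation buffers, or batching rather than the KV cache alone; the sketch is only meant to bound the obvious terms.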
Additional notes
I am using convert_checkpoint.py (from the llama example) and then the trtllm-build command.