
SIGABRT while trying to build trtllm engine for biomistral model on T4 #2619

Open
2 of 4 tasks
krishnanpooja opened this issue Dec 24, 2024 · 1 comment
Labels
triaged Issue has been triaged by maintainers

Comments

@krishnanpooja

System Info

  • GPU Name: 4 x NVIDIA Tesla T4
  • TensorRT-LLM version: 0.15.0.dev2024111200
  • CUDA version: 12.1 (NVIDIA-SMI 470.256.02, Driver Version 470.256.02)

Who can help?

@byshiue, could you please guide me on how to resolve this error?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

python modeling/convert_checkpoint.py --model_dir ${{inputs.model_path}} \
    --output_dir ${{outputs.output_dir}}/tllm_checkpoint_1gpu_mistral \
    --dtype float16 \
    --tp_size 4

echo "converted checkpoint"

export CUDA_MODULE_LOADING=LAZY

echo $CUDA_MODULE_LOADING

trtllm-build --checkpoint_dir ${{outputs.output_dir}}/tllm_checkpoint_1gpu_mistral \
    --output_dir ${{outputs.output_dir}}/tllm_checkpoint_1gpu_mistral/trt_build_output \
    --gemm_plugin auto \
    --max_input_len 8000 \
    --context_fmha enable

Expected behavior

I am able to load the model on a T4 using the vLLM engine, so I expected it to work with TensorRT-LLM as well.

Actual behavior

  • Error:

*** Process received signal ***
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] Signal: Aborted (6)
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] Signal code: (-6)
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x15299ac49420]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x15299a92c00b]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x15299a90b859]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [ 3] /opt/conda/envs/ptca/bin/../lib/libstdc++.so.6(+0xb135a)[0x15290c29435a]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [ 4] /opt/conda/envs/ptca/bin/../lib/libstdc++.so.6(+0xb03b9)[0x15290c2933b9]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [ 5] /opt/conda/envs/ptca/bin/../lib/libstdc++.so.6(__gxx_personality_v0+0x87)[0x15290c293ae7]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [ 6] /opt/conda/envs/ptca/bin/../lib/libgcc_s.so.1(+0x111e4)[0x1529976221e4]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [ 7] /opt/conda/envs/ptca/bin/../lib/libgcc_s.so.1(_Unwind_Resume+0x12e)[0x152997622c1e]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [ 8] /opt/conda/envs/ptca/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7891f4)[0x1527237051f4]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [ 9] /opt/conda/envs/ptca/lib/python3.10/site-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins24GPTAttentionPluginCommon10initializeEv+0x275)[0x1525ea662375]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [10] /opt/conda/envs/ptca/lib/python3.10/site-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZNK12tensorrt_llm7plugins24GPTAttentionPluginCommon9cloneImplINS0_18GPTAttentionPluginEEEPT_v+0x333)[0x1525ea6a9313]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [11] /opt/conda/envs/ptca/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xb7a5a7)[0x1528fe3865a7]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [12] /opt/conda/envs/ptca/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa8188e)[0x1528fe28d88e]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [13] /opt/conda/envs/ptca/lib/python3.10/site-packages/tensorrt_bindings/tensorrt.so(+0xfe062)[0x15284e5d2062]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [14] /opt/conda/envs/ptca/lib/python3.10/site-packages/tensorrt_bindings/tensorrt.so(+0x4c91e)[0x15284e52091e]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [15] /opt/conda/envs/ptca/bin/python[0x4fcaf7]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [16] /opt/conda/envs/ptca/bin/python(_PyObject_MakeTpCall+0x25b)[0x4f657b]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [17] /opt/conda/envs/ptca/bin/python[0x50861f]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [18] /opt/conda/envs/ptca/bin/python(_PyEval_EvalFrameDefault+0x4b2c)[0x4f1e9c]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [19] /opt/conda/envs/ptca/bin/python(_PyFunction_Vectorcall+0x6f)[0x4fcf3f]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [20] /opt/conda/envs/ptca/bin/python(PyObject_Call+0xb8)[0x508cd8]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [21] /opt/conda/envs/ptca/bin/python(_PyEval_EvalFrameDefault+0x2de4)[0x4f0154]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [22] /opt/conda/envs/ptca/bin/python(_PyFunction_Vectorcall+0x6f)[0x4fcf3f]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [23] /opt/conda/envs/ptca/bin/python(PyObject_Call+0xb8)[0x508cd8]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [24] /opt/conda/envs/ptca/bin/python(_PyEval_EvalFrameDefault+0x2de4)[0x4f0154]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [25] /opt/conda/envs/ptca/bin/python[0x50832e]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [26] /opt/conda/envs/ptca/bin/python(PyObject_Call+0xb8)[0x508cd8]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [27] /opt/conda/envs/ptca/bin/python(_PyEval_EvalFrameDefault+0x2de4)[0x4f0154]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [28] /opt/conda/envs/ptca/bin/python(_PyFunction_Vectorcall+0x6f)[0x4fcf3f]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] [29] /opt/conda/envs/ptca/bin/python(_PyObject_FastCallDictTstate+0x17d)[0x4f5afd]
[37f2f5cd59bd4d738ff60b08585e947b000000:00279] *** End of error message ***
/bin/bash: line 5: 279 Aborted (core dumped) trtllm-build --checkpoint_dir output_dir/tllm_checkpoint_1gpu_mistral --output_dir output_dir/tllm_checkpoint_1gpu_mistral/trt_build_output --gemm_plugin auto --max_input_len 8000 --context_fmha enable

I also tried --context_fmha disable, but that resulted in a memory error.

additional notes

I am using convert_checkpoint.py (from the LLaMA example) followed by the trtllm-build command.

@krishnanpooja krishnanpooja added the bug Something isn't working label Dec 24, 2024
@nv-guomingz
Collaborator

Hi @krishnanpooja, we don't test the latest TRT-LLM functionality on the T4 platform, as we removed T4 support in the 0.14 release. So there's no guarantee that your case will run successfully on a T4.
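For readers hitting the same wall: the T4 is a Turing GPU (compute capability 7.5), and the maintainer's comment indicates newer TRT-LLM releases no longer target it. A minimal sketch of a pre-flight check, assuming (this threshold is my assumption, not an official value) that Ampere-class compute capability 8.0 is the floor for recent releases:

```python
# Hypothetical pre-flight gate before running trtllm-build, based on the
# maintainer's note that T4 support was dropped after the 0.14 release.
# MIN_SUPPORTED_CC is an assumed value, not taken from official docs.
MIN_SUPPORTED_CC = (8, 0)  # Ampere and newer (assumption)


def trtllm_supports(cc: tuple) -> bool:
    """Return True if a GPU with compute capability `cc` is assumed supported.

    Tuples compare element-wise, so (7, 5) < (8, 0) as intended.
    """
    return cc >= MIN_SUPPORTED_CC


# A Tesla T4 reports compute capability 7.5:
print(trtllm_supports((7, 5)))  # False
print(trtllm_supports((8, 0)))  # True
```

In practice you would obtain the capability at runtime (e.g. from `nvidia-smi --query-gpu=compute_cap --format=csv` or `torch.cuda.get_device_capability()`) and either skip the build or pin an older TRT-LLM release that still supported Turing.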

@nv-guomingz nv-guomingz added triaged Issue has been triaged by maintainers and removed bug Something isn't working labels Dec 24, 2024