Confusion about versions of TensorRT, TensorRT-LLM and tensorrtllm_backend #7439

bhavin192 · 2024-07-11T13:17:34Z

bhavin192
Jul 11, 2024

I'm writing a blog post which first builds the engine using nvidia/cuda:12.4.0-devel-ubuntu22.04 image and then runs it with Triton.

Building the engine

I'm installing required dependencies and then tensorrt_llm with:

pip3 install tensorrt_llm==0.10.0 -U --extra-index-url https://pypi.nvidia.com

This pulls TensorRT version 10.0.1:

# pip freeze | grep -i tensor    
safetensors==0.4.3
tensorrt==10.0.1
tensorrt-cu12==10.2.0.post1
tensorrt-cu12-bindings==10.2.0.post1
tensorrt-cu12-libs==10.2.0.post1
tensorrt-llm==0.10.0

# dpkg -la | grep -i tensor

Running it with Triton

Now when I try to run this engine with Triton image nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3, it fails with this error:

[TensorRT-LLM][ERROR] 1: [stdArchiveReader.cpp::stdArchiveReaderInitCommon::47] Error Code 1: Serialization (Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed.Version tag does not match. Note: Current Version: 237, Serialized Engine Version: 238)

But if we check the TensorRT and tensorrt_llm versions installed in this image, they match what I had during engine build process.

# pip freeze | grep -i tensorrt
graphsurgeon @ file:///workspace/TensorRT-8.6.3.1/graphsurgeon/graphsurgeon-0.4.6-py2.py3-none-any.whl#sha256=0fbadaefbbe6e9920b9f814ae961c4a279be602812edf3ed7fb9cc6f8f4809fe
tensorrt @ file:///usr/local/tensorrt/python/tensorrt-10.0.1-cp310-none-linux_x86_64.whl#sha256=059ffaf7d7140c6cb11e6b0cf90212508c72eadf28b39a3964248c86630ac873
tensorrt_llm @ file:///opt/tritonserver/backends/tensorrtllm/tensorrt_llm-0.10.0-cp310-cp310-linux_x86_64.whl#sha256=23cd9c355c4022676ee94432efe0cce1af533a21d8f361dd5defcd0967d03266
torch-tensorrt @ file:///opt/pytorch/torch_tensorrt/dist/torch_tensorrt-2.3.0a0-cp310-cp310-linux_x86_64.whl#sha256=905a9df8a2a360ac719ed8ff72c14fb6e3f807e60f5652025ad382ef08d009b7
uff @ file:///workspace/TensorRT-8.6.3.1/uff/uff-0.6.9-py2.py3-none-any.whl#sha256=618a3f812d491f0d3c4f2e38b99e03217ca37b206db14cee079f2bf681eb4fe3

# dpkg -la | grep -i tensorrt

More confusing part is the release notes and support matrix which lists completely different version of TensorRT for 24.06 image of tritonserver.
https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-24-06.html and https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html says TensorRT 10.1.0.27, but we have 10.0.1.6 instead in the image.

Building the engine within Triton

When I built the engine with the same Triton image, it just worked without any issues. I'm aware that, that's the recommended way to do this, i.e. use the same image which is going to be used to run Triton. I'm going to add that note in my blog post, but I still wanted to understand what's happening.

Questions

So questions I'm trying to find answers to (sorry the list turned out to be a bit bigger than I though):

Because tritonserver:yy.mm-trtllm-python-py3 are special images with tensorrt_llm backend in them, the versions of TensorRT and TensorRT-LLM in this image can differ from what is written in support matrix, is it correct?
Despite having same version of TensorRT and TensorRT-LLM while compiling the engine and in the Triton image, why it fails to load the model? Is there anything else which needs to be of the same version?
What's this Serialized Engine Version, does it depend on something else than TensorRT and TensorRT-LLM version?
What's the correct way to find TensorRT and TensorRT-LLM version present in a Triton image, what I used above (checking pip and dpkg) is correct?

bhavin192
Jul 15, 2024
Author

Hello @tanmayv25 @kaiyux sorry for tagging you folks directly. After posting the discussion here I realized that tensorrtllm_backend repository asks to create an issue in case of any questions or doubts.
Guessing you can answer these questions and this discussions space might not be monitored by you folks, I thought of tagging you here.
Also do consider tagging other folks in case you are not the correct ones to tag.

2 replies

bhavin192 Aug 1, 2024
Author

@krishung5 mind checking this discussion? Your comment #7374 (comment) on the other issue answers question 1 and 4 I guess. Do you know what's going on regarding 2 and 3?

krishung5 Aug 2, 2024
Collaborator

Hi @bhavin192,

Yes, that's correct.
The nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 was builit based off nvcr.io/nvidia/pytorch:24.03-py3, with TRT 10.0.1. You'd have to make sure the container you use to build engines has the same dependency stacks to avoid any version mismatch. I see that the TRT version was 8.6.3.1 from the pip freeze | grep -i tensorrt comment. You can refer to the build steps here and check out the dockerfile to see how we uninstall the previous TRT and install the TRT with the version we want within the container.
Mostly it's related to TRT version mismatch.
Yes, I would just do pip3 list for the tensorrt-llm package

root@35de844641ff:/opt/tritonserver# pip3 list | grep -i tensorrt-llm
tensorrt-llm             0.11.0

For the TRT version, I would just check

root@35de844641ff:/opt/tritonserver# ls -l /usr/local/tensorrt/lib/
total 1639292
lrwxrwxrwx 1 root root         20 May 29 18:13 libnvinfer.so -> libnvinfer.so.10.1.0
lrwxrwxrwx 1 root root         20 May 29 18:13 libnvinfer.so.10 -> libnvinfer.so.10.1.0
-rwxr-xr-x 1 root root  238531744 May 29 18:12 libnvinfer.so.10.1.0
-rwxr-xr-x 1 root root 1366003328 May 29 18:13 libnvinfer_builder_resource.so.10.1.0
lrwxrwxrwx 1 root root         29 May 29 18:10 libnvinfer_dispatch.so -> libnvinfer_dispatch.so.10.1.0
lrwxrwxrwx 1 root root         29 May 29 18:10 libnvinfer_dispatch.so.10 -> libnvinfer_dispatch.so.10.1.0
-rwxr-xr-x 1 root root     988096 May 29 18:10 libnvinfer_dispatch.so.10.1.0
lrwxrwxrwx 1 root root         25 May 29 18:09 libnvinfer_lean.so -> libnvinfer_lean.so.10.1.0
lrwxrwxrwx 1 root root         25 May 29 18:09 libnvinfer_lean.so.10 -> libnvinfer_lean.so.10.1.0
-rwxr-xr-x 1 root root   34789240 May 29 18:09 libnvinfer_lean.so.10.1.0
lrwxrwxrwx 1 root root         27 May 29 18:14 libnvinfer_plugin.so -> libnvinfer_plugin.so.10.1.0
lrwxrwxrwx 1 root root         27 May 29 18:14 libnvinfer_plugin.so.10 -> libnvinfer_plugin.so.10.1.0
-rwxr-xr-x 1 root root   33690088 May 29 18:14 libnvinfer_plugin.so.10.1.0
lrwxrwxrwx 1 root root         30 May 29 18:14 libnvinfer_vc_plugin.so -> libnvinfer_vc_plugin.so.10.1.0
lrwxrwxrwx 1 root root         30 May 29 18:14 libnvinfer_vc_plugin.so.10 -> libnvinfer_vc_plugin.so.10.1.0
-rwxr-xr-x 1 root root    1004224 May 29 18:14 libnvinfer_vc_plugin.so.10.1.0
lrwxrwxrwx 1 root root         21 May 29 18:13 libnvonnxparser.so -> libnvonnxparser.so.10
lrwxrwxrwx 1 root root         25 May 29 18:13 libnvonnxparser.so.10 -> libnvonnxparser.so.10.1.0
-rwxr-xr-x 1 root root    3597776 May 29 18:09 libnvonnxparser.so.10.1.0
drwxr-xr-x 2 root root       4096 May 29 18:14 stubs

We install the specific TRT version directly instead of using dpkg, so it might not show up there. We started to document the TRT version used in the TRT-LLM container in the release note as my previous comment mentioned, so could just check there moving forward.

datdo-msft · 2024-07-17T18:26:04Z

datdo-msft
Jul 17, 2024

Hi @bhavin192 , when you said that you "built the engine with the same Triton image, it just worked without any issues" -- does this mean that you ran convert_checkpoint.py AND trtllm-build with the same Triton image? Asking to see if it matters because I ran convert_checkpoint.py with nvidia/cuda:12.1.0-devel-ubuntu22.04 and that took a long time that I don't want to repeat again -- so was wondering if I could just run trtllm-build with the same Triton image.

2 replies

bhavin192 Jul 17, 2024
Author

I'm using quantize.py in my case, which also takes care of converting the weights to TensorRT-LLM checkpoints. I didn't have to run that again within the Triton image. I just ran trtllm-build inside Triton image.

I'm not sure if the TensorRT-LLM checkpoint format also comes under some version compatibility constraints or not. Though nothing is mentioned about it in the docs at least: https://nvidia.github.io/TensorRT-LLM/architecture/checkpoint.html

datdo-msft Jul 17, 2024

Cool, thank you for the prompt response!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Confusion about versions of TensorRT, TensorRT-LLM and tensorrtllm_backend #7439

{{title}}

Replies: 2 comments 4 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Confusion about versions of TensorRT, TensorRT-LLM and tensorrtllm_backend #7439

bhavin192 Jul 11, 2024

Building the engine

Running it with Triton

Building the engine within Triton

Questions

Replies: 2 comments · 4 replies

bhavin192 Jul 15, 2024 Author

bhavin192 Aug 1, 2024 Author

krishung5 Aug 2, 2024 Collaborator

datdo-msft Jul 17, 2024

bhavin192 Jul 17, 2024 Author

datdo-msft Jul 17, 2024

bhavin192
Jul 11, 2024

Replies: 2 comments 4 replies

bhavin192
Jul 15, 2024
Author

bhavin192 Aug 1, 2024
Author

krishung5 Aug 2, 2024
Collaborator

datdo-msft
Jul 17, 2024

bhavin192 Jul 17, 2024
Author