Failed TensorRT-LLM Benchmark #2694

Closed
1 of 4 tasks
maulikmadhavi opened this issue Jan 15, 2025 · 8 comments
Labels: bug (Something isn't working)

maulikmadhavi commented Jan 15, 2025

System Info

  • CPU architecture: x86_64 (Linux node 6.5.0-25-generic )
  • CPU/Host memory size: 503GiB
  • GPU properties
    • GPU name: H100
    • GPU memory size: 80GB
  • Libraries
    • TensorRT-LLM branch or tag: v0.16.0
    • Container used:
  • NVIDIA driver version: 535.161.07
  • OS: Ubuntu 22.04

Who can help?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Follow the steps from the performance benchmarking documentation:

  1. Generate synthetic data:
python benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-2-7b-hf token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 3000 > /tmp/synthetic_128_128.txt
  2. Build the model:
trtllm-bench --model meta-llama/Llama-2-7b-hf build --dataset /tmp/synthetic_128_128.txt --quantization FP8
  3. Run the benchmark:
trtllm-bench --model meta-llama/Llama-2-7b-hf throughput --dataset /tmp/synthetic_128_128.txt --engine_dir /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1

Expected behavior

When running the container built via make (make -C docker release_run), the benchmark completes successfully:

[01/15/2025-01:15:18] [TRT-LLM] [I] Stopping response parsing.                                                                                                                                                            
[01/15/2025-01:15:18] [TRT-LLM] [I] Collecting last responses before shutdown.                                                                                                                                            
[01/15/2025-01:15:18] [TRT-LLM] [I] Completed request parsing.                                                                                                                                                            
[01/15/2025-01:15:18] [TRT-LLM] [I] Parsing stopped.                                                                                                                                                                      
[01/15/2025-01:15:18] [TRT-LLM] [I] Request generator successfully joined.                                                                                                                                                
[01/15/2025-01:15:18] [TRT-LLM] [I] Statistics process successfully joined.                                                                                                                                               
[01/15/2025-01:15:18] [TRT-LLM] [I]                                                                                                                                                                                       
                                                                                                                                                                                                                          
===========================================================                                                                                                                                                               
= ENGINE DETAILS                                                                                                                                                                                                          
===========================================================                                                                                                                                                               
Model:                  meta-llama/Llama-2-7b-hf                                                                                                                                                                          
Engine Directory:       /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1                                                                                                                                                           
TensorRT-LLM Version:   0.16.0                                                                               
Dtype:                  float16                                                                              
KV Cache Dtype:         FP8                                                                                  
Quantization:           FP8                                                                                  
Max Sequence Length:    256                                                                                  

===========================================================                                                  
= WORLD + RUNTIME INFORMATION                                                                                
===========================================================                                                  
TP Size:                1                                                                                    
PP Size:                1                                                                                    
Max Runtime Batch Size: 1280                                                                                 
Max Runtime Tokens:     2304                                                                                 
Scheduling Policy:      Guaranteed No Evict                                                                  
KV Memory Percentage:   90.00%                                                                               
Issue Rate (req/sec):   2.8149E+13                                                                           

===========================================================                                                  
= PERFORMANCE OVERVIEW                                                                                       
===========================================================                                                  
Number of requests:             3000                                                                         
Average Input Length (tokens):  128.0000                                                                     
Average Output Length (tokens): 128.0000                                                                     
Token Throughput (tokens/sec):  12067.8672                                                                   
Request Throughput (req/sec):   94.2802                                                                      
Total Latency (ms):             31820.0387                                                                   

===========================================================   

Actual behavior

When running the container directly with docker run --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=device=1 -it -p 8000:8000 -v <path-to-TensorRT-LLM/>:/app/ 89fg611dcfd, the benchmark fails:

[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[01/15/2025-10:23:14] [TRT-LLM] [I] Preparing to run throughput benchmark...
[01/15/2025-10:23:14] [TRT-LLM] [I] Setting up benchmarker and infrastructure.
[01/15/2025-10:23:14] [TRT-LLM] [I] Initializing Throughput Benchmark. [rate=-1 req/s]
[01/15/2025-10:23:14] [TRT-LLM] [I] Ready to start benchmark.
[01/15/2025-10:23:14] [TRT-LLM] [I] Initializing Executor.
[TensorRT-LLM][WARNING] Setting cudaGraphCacheSize to a value greater than 0 without enabling cudaGraphMode has no effect.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/bin/executorWorker: error while loading shared libraries: libnvinfer_plugin_tensorrt_llm.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

It stays stuck here for quite a long time; upon Ctrl+C:

A request has timed out and will therefore fail:

  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------

Aborted!
--------------------------------------------------------------------------
(null) detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[40030,2],0]
  Exit code:    127
--------------------------------------------------------------------------

A request has timed out and will therefore fail:

  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------

Aborted!
--------------------------------------------------------------------------
(null) detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[40030,2],0]
  Exit code:    127
--------------------------------------------------------------------------

Additional notes

=> I need to map a port and a directory to save time and avoid repeated HF model downloads, for example:
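A sketch of the kind of invocation meant here (the angle-bracket path is a placeholder; the image id is the one from the report above):

# Map port 8000 and mount the local HF cache so models are not re-downloaded every run.
docker run --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    --gpus=device=1 -it -p 8000:8000 \
    -v <local-hf-dir>:/root/.cache/huggingface \
    89fg611dcfd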

Thanks

maulikmadhavi added the bug (Something isn't working) label Jan 15, 2025
@MartinMarciniszyn (Collaborator)

@FrankD412, could you please take a look at this?

@FrankD412 (Collaborator)

@maulikmadhavi -- I'm taking a look to see if I can reproduce this.

@FrankD412 (Collaborator)

@maulikmadhavi -- I think the issue here is that you're mounting your code repository here: <path-to-TensorRT-LLM/>:/app/. When I mimicked your command with the mounted code, it failed with the same error you experienced.

root@bb5f4949b9df:/workspace# trtllm-bench --model meta-llama/Llama-2-7b-hf throughput --dataset /tmp/synthetic_128_128.txt --engine_dir /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[01/28/2025-03:27:01] [TRT-LLM] [I] Preparing to run throughput benchmark...
[01/28/2025-03:27:02] [TRT-LLM] [I] Setting up benchmarker and infrastructure.
[01/28/2025-03:27:02] [TRT-LLM] [I] Initializing Throughput Benchmark. [rate=-1 req/s]
[01/28/2025-03:27:02] [TRT-LLM] [I] Ready to start benchmark.
[01/28/2025-03:27:02] [TRT-LLM] [I] Initializing Executor.
[TensorRT-LLM][WARNING] Setting cudaGraphCacheSize to a value greater than 0 without enabling cudaGraphMode has no effect.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/bin/executorWorker: error while loading shared libraries: libnvinfer_plugin_tensorrt_llm.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

The container built with the make command already contains TensorRT-LLM, so when you mount the repository over it, something in the library linking breaks. I wasn't able to reproduce your exact issue using just the container. I tried with both make and docker run --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=device=1 -it -p 8000:8000 docker.io/tensorrt_llm/release:latest right after running make -C docker release_build as you listed above, and both worked. Mind giving that a go?
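If it helps to confirm the shadowing, a quick check along these lines (a sketch; the executorWorker path is copied from the error message above) shows which tensorrt_llm Python resolves and whether the plugin library is found:

# Which tensorrt_llm package is actually imported -- the mounted repo or the installed wheel?
python3 -c "import tensorrt_llm; print(tensorrt_llm.__file__)"
# Does the worker binary resolve the plugin library?
ldd /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bin/executorWorker | grep libnvinfer_plugin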

FrankD412 self-assigned this Jan 28, 2025
@sbaby171

I don't typically use the Docker images, but I have seen the same issue. I simply update my LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=~/trt-tarball/TensorRT-10.7.0.23/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/mnt/storage/VENV-TRTLLM/lib/python3.10/site-packages/tensorrt_llm/libs:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/mnt/storage/VENV-TRTLLM/lib/python3.10/site-packages/nvidia/nccl/lib:$LD_LIBRARY_PATH
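If your install put the library elsewhere, locating it first and prepending that directory works the same way (a sketch; paths vary by environment, and <directory-containing-the-library> is a placeholder):

# Find the plugin library that executorWorker failed to load, then prepend its directory.
find / -name 'libnvinfer_plugin_tensorrt_llm.so*' 2>/dev/null
export LD_LIBRARY_PATH=<directory-containing-the-library>:$LD_LIBRARY_PATH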

maulikmadhavi (Author) commented Jan 29, 2025

@FrankD412 Thanks for your test.

  • Yes, it works without mounting the local dir => docker run --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=device=1 -it -p 8000:8000 docker.io/tensorrt_llm/release:latest.
  • With the local dir mounted, it fails.

@maulikmadhavi (Author)

@sbaby171 Thanks for sharing your env path settings. I agree that a dockerless setup is often convenient.

@maulikmadhavi (Author)

Hi @FrankD412

The purpose of mounting a directory is to save TRT engines built with different parameters. It works with docker run --rm --ulimit memlock=-1 --ulimit stack=67108864 --gpus=device=0 -it -p 8000:8000 -v <local-tmp-dir>:/tmp/ -v <local-hf-dir>:/root/.cache/huggingface 89fg611dcfd
I found the benchmark script works with the following (see the consolidated command after this list):

  • mapping <local-tmp-dir> to /tmp [to save TRT engines]
  • mapping <local-hf-dir> to /root/.cache/huggingface [to reuse the local HF cache and avoid re-downloading].
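For readability, the same working invocation broken across lines (<local-tmp-dir> and <local-hf-dir> are placeholders):

# Persist engines in /tmp and reuse the local HF cache across container runs.
docker run --rm --ulimit memlock=-1 --ulimit stack=67108864 \
    --gpus=device=0 -it -p 8000:8000 \
    -v <local-tmp-dir>:/tmp/ \
    -v <local-hf-dir>:/root/.cache/huggingface \
    89fg611dcfd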

I believe the failure is caused by a conflict between the scripts in my local TensorRT-LLM repo and the tensorrt_llm installation inside the Docker image.

Thanks

@FrankD412 (Collaborator)

@maulikmadhavi,

Glad you found a solution. Normally, I do something similar. You could also map another directory for <local-tmp-dir> and then use the --workspace option of trtllm-bench, which would store the engines in your mounted directory. For the cache, I also like to set -e HUGGINGFACE_HUB_CACHE=<docker-cache> -v <local-hf-dir>:<docker-cache>. These tweaks could give you a little more flexibility if you need it.
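Put together, that setup might look like the following (a sketch, not a verified command: <docker-cache>, <local-hf-dir>, and <local-engine-dir> are placeholders, and the exact placement of --workspace may differ):

# Point the HF cache and the trtllm-bench workspace at mounted directories.
docker run --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    --gpus=device=0 -it -p 8000:8000 \
    -e HUGGINGFACE_HUB_CACHE=<docker-cache> \
    -v <local-hf-dir>:<docker-cache> \
    -v <local-engine-dir>:/workspace/engines \
    docker.io/tensorrt_llm/release:latest
# Inside the container, engines then land in the mounted workspace:
trtllm-bench --model meta-llama/Llama-2-7b-hf --workspace /workspace/engines build --dataset /tmp/synthetic_128_128.txt --quantization FP8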
