Failed TensorRT-LLM Benchmark #2694

Closed
1 of 4 tasks
maulikmadhavi opened this issue Jan 15, 2025 · 8 comments
Labels: bug (Something isn't working)

maulikmadhavi commented Jan 15, 2025

System Info

  • CPU architecture: x86_64 (Linux node 6.5.0-25-generic )
  • CPU/Host memory size: 503GiB
  • GPU properties
    • GPU name: H100
    • GPU memory size: 80GB
  • Libraries
    • TensorRT-LLM branch or tag: v0.16.0
    • Container used:
  • NVIDIA driver version: 535.161.07
  • OS: Ubuntu 22.04

Who can help?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Follow the steps from the performance benchmarking documentation:

  1. Generate synthetic data:
python benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-2-7b-hf token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 3000 > /tmp/synthetic_128_128.txt
  2. Build the model:
trtllm-bench --model meta-llama/Llama-2-7b-hf build --dataset /tmp/synthetic_128_128.txt --quantization FP8
  3. Run the benchmark:
trtllm-bench --model meta-llama/Llama-2-7b-hf throughput --dataset /tmp/synthetic_128_128.txt --engine_dir /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1

Expected behavior

When running the container built via make (make -C docker release_run), the benchmark completes successfully:

[01/15/2025-01:15:18] [TRT-LLM] [I] Stopping response parsing.                                                                                                                                                            
[01/15/2025-01:15:18] [TRT-LLM] [I] Collecting last responses before shutdown.                                                                                                                                            
[01/15/2025-01:15:18] [TRT-LLM] [I] Completed request parsing.                                                                                                                                                            
[01/15/2025-01:15:18] [TRT-LLM] [I] Parsing stopped.                                                                                                                                                                      
[01/15/2025-01:15:18] [TRT-LLM] [I] Request generator successfully joined.                                                                                                                                                
[01/15/2025-01:15:18] [TRT-LLM] [I] Statistics process successfully joined.                                                                                                                                               
[01/15/2025-01:15:18] [TRT-LLM] [I]                                                                                                                                                                                       
                                                                                                                                                                                                                          
===========================================================                                                                                                                                                               
= ENGINE DETAILS                                                                                                                                                                                                          
===========================================================                                                                                                                                                               
Model:                  meta-llama/Llama-2-7b-hf                                                                                                                                                                          
Engine Directory:       /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1                                                                                                                                                           
TensorRT-LLM Version:   0.16.0                                                                               
Dtype:                  float16                                                                              
KV Cache Dtype:         FP8                                                                                  
Quantization:           FP8                                                                                  
Max Sequence Length:    256                                                                                  

===========================================================                                                  
= WORLD + RUNTIME INFORMATION                                                                                
===========================================================                                                  
TP Size:                1                                                                                    
PP Size:                1                                                                                    
Max Runtime Batch Size: 1280                                                                                 
Max Runtime Tokens:     2304                                                                                 
Scheduling Policy:      Guaranteed No Evict                                                                  
KV Memory Percentage:   90.00%                                                                               
Issue Rate (req/sec):   2.8149E+13                                                                           

===========================================================                                                  
= PERFORMANCE OVERVIEW                                                                                       
===========================================================                                                  
Number of requests:             3000                                                                         
Average Input Length (tokens):  128.0000                                                                     
Average Output Length (tokens): 128.0000                                                                     
Token Throughput (tokens/sec):  12067.8672                                                                   
Request Throughput (req/sec):   94.2802                                                                      
Total Latency (ms):             31820.0387                                                                   

===========================================================   

Actual behavior

When running the container directly with docker run --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=device=1 -it -p 8000:8000 -v <path-to-TensorRT-LLM/>:/app/ 89fg611dcfd, the benchmark fails:

[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[01/15/2025-10:23:14] [TRT-LLM] [I] Preparing to run throughput benchmark...
[01/15/2025-10:23:14] [TRT-LLM] [I] Setting up benchmarker and infrastructure.
[01/15/2025-10:23:14] [TRT-LLM] [I] Initializing Throughput Benchmark. [rate=-1 req/s]
[01/15/2025-10:23:14] [TRT-LLM] [I] Ready to start benchmark.
[01/15/2025-10:23:14] [TRT-LLM] [I] Initializing Executor.
[TensorRT-LLM][WARNING] Setting cudaGraphCacheSize to a value greater than 0 without enabling cudaGraphMode has no effect.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/bin/executorWorker: error while loading shared libraries: libnvinfer_plugin_tensorrt_llm.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

It stays stuck here for quite a long time; upon Ctrl+C:

A request has timed out and will therefore fail:

  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------

Aborted!
--------------------------------------------------------------------------
(null) detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[40030,2],0]
  Exit code:    127
--------------------------------------------------------------------------

A request has timed out and will therefore fail:

  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------

Aborted!
--------------------------------------------------------------------------
(null) detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[40030,2],0]
  Exit code:    127
--------------------------------------------------------------------------

Additional notes

=> I need to map a port and a directory to save time and avoid repeated HF model downloads, for example:
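A sketch of the kind of invocation meant here (the angle-bracket path is a placeholder; the image id is the one from the report above):

# Map port 8000 and mount the local HF cache so models are not re-downloaded every run.
docker run --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    --gpus=device=1 -it -p 8000:8000 \
    -v <local-hf-dir>:/root/.cache/huggingface \
    89fg611dcfd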

Thanks

maulikmadhavi added the bug (Something isn't working) label Jan 15, 2025
@MartinMarciniszyn (Collaborator)

@FrankD412, could you please take a look at this?

@FrankD412 (Collaborator)

@maulikmadhavi -- I'm taking a look to see if I can reproduce this.

@FrankD412 (Collaborator)

@maulikmadhavi -- I think the issue here is that you're mounting your code repository here: <path-to-TensorRT-LLM/>:/app/. When I mimicked your command with the mounted code, it failed with the same error you experienced.

root@bb5f4949b9df:/workspace# trtllm-bench --model meta-llama/Llama-2-7b-hf throughput --dataset /tmp/synthetic_128_128.txt --engine_dir /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[01/28/2025-03:27:01] [TRT-LLM] [I] Preparing to run throughput benchmark...
[01/28/2025-03:27:02] [TRT-LLM] [I] Setting up benchmarker and infrastructure.
[01/28/2025-03:27:02] [TRT-LLM] [I] Initializing Throughput Benchmark. [rate=-1 req/s]
[01/28/2025-03:27:02] [TRT-LLM] [I] Ready to start benchmark.
[01/28/2025-03:27:02] [TRT-LLM] [I] Initializing Executor.
[TensorRT-LLM][WARNING] Setting cudaGraphCacheSize to a value greater than 0 without enabling cudaGraphMode has no effect.
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/bin/executorWorker: error while loading shared libraries: libnvinfer_plugin_tensorrt_llm.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

The container built with the make command already contains TensorRT-LLM, so when you mount the repository over it, something in the library linking breaks. I wasn't able to reproduce your exact issue using just the container. I tried with both make and docker run --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=device=1 -it -p 8000:8000 docker.io/tensorrt_llm/release:latest right after running make -C docker release_build as you listed above, and both worked. Mind giving that a go?
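If it helps to confirm the shadowing, a quick check along these lines (a sketch; the executorWorker path is copied from the error message above) shows which tensorrt_llm Python resolves and whether the plugin library is found:

# Which tensorrt_llm package is actually imported -- the mounted repo or the installed wheel?
python3 -c "import tensorrt_llm; print(tensorrt_llm.__file__)"
# Does the worker binary resolve the plugin library?
ldd /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bin/executorWorker | grep libnvinfer_plugin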

FrankD412 self-assigned this Jan 28, 2025
@sbaby171

I don't typically use the Docker images, but I have seen the same issue. I simply update my LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=~/trt-tarball/TensorRT-10.7.0.23/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/mnt/storage/VENV-TRTLLM/lib/python3.10/site-packages/tensorrt_llm/libs:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/mnt/storage/VENV-TRTLLM/lib/python3.10/site-packages/nvidia/nccl/lib:$LD_LIBRARY_PATH
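If your install put the library elsewhere, locating it first and prepending that directory works the same way (a sketch; paths vary by environment, and <directory-containing-the-library> is a placeholder):

# Find the plugin library that executorWorker failed to load, then prepend its directory.
find / -name 'libnvinfer_plugin_tensorrt_llm.so*' 2>/dev/null
export LD_LIBRARY_PATH=<directory-containing-the-library>:$LD_LIBRARY_PATH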

maulikmadhavi (Author) commented Jan 29, 2025

@FrankD412 Thanks for your test.

  • Yes, it works without mounting the local dir => docker run --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=device=1 -it -p 8000:8000 docker.io/tensorrt_llm/release:latest.
  • With the local dir mounted, it fails.

@maulikmadhavi (Author)

@sbaby171 Thanks for sharing your env path settings. I agree that a dockerless setup is often convenient.

@maulikmadhavi (Author)

Hi @FrankD412

The purpose of mounting a directory is to save TRT engines built with different parameters. It works with docker run --rm --ulimit memlock=-1 --ulimit stack=67108864 --gpus=device=0 -it -p 8000:8000 -v <local-tmp-dir>:/tmp/ -v <local-hf-dir>:/root/.cache/huggingface 89fg611dcfd
I found the benchmark script works with the following (see the consolidated command after this list):

  • mapping <local-tmp-dir> to /tmp [to save TRT engines]
  • mapping <local-hf-dir> to /root/.cache/huggingface [to reuse the local HF cache and avoid re-downloading].
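For readability, the same working invocation broken across lines (<local-tmp-dir> and <local-hf-dir> are placeholders):

# Persist engines in /tmp and reuse the local HF cache across container runs.
docker run --rm --ulimit memlock=-1 --ulimit stack=67108864 \
    --gpus=device=0 -it -p 8000:8000 \
    -v <local-tmp-dir>:/tmp/ \
    -v <local-hf-dir>:/root/.cache/huggingface \
    89fg611dcfd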

I believe the failure is caused by a conflict between the scripts in my local TensorRT-LLM repo and the tensorrt_llm installation inside the Docker image.

Thanks

@FrankD412 (Collaborator)

@maulikmadhavi,

Glad you found a solution. Normally, I do something similar. You could also map another directory for <local-tmp-dir> and then use the --workspace option of trtllm-bench, which would store the engines in your mounted directory. For the cache, I also like to set -e HUGGINGFACE_HUB_CACHE=<docker-cache> -v <local-hf-dir>:<docker-cache>. These tweaks could give you a little more flexibility if you need it.
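Put together, that setup might look like the following (a sketch, not a verified command: <docker-cache>, <local-hf-dir>, and <local-engine-dir> are placeholders, and the exact placement of --workspace may differ):

# Point the HF cache and the trtllm-bench workspace at mounted directories.
docker run --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    --gpus=device=0 -it -p 8000:8000 \
    -e HUGGINGFACE_HUB_CACHE=<docker-cache> \
    -v <local-hf-dir>:<docker-cache> \
    -v <local-engine-dir>:/workspace/engines \
    docker.io/tensorrt_llm/release:latest
# Inside the container, engines then land in the mounted workspace:
trtllm-bench --model meta-llama/Llama-2-7b-hf --workspace /workspace/engines build --dataset /tmp/synthetic_128_128.txt --quantization FP8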
