
TensorRT-LLM pipeline parallelism is broken #259

Open

asesorov opened this issue Sep 12, 2024 · 12 comments

@asesorov (Contributor) commented Sep 12, 2024

Problem Description

When trying to use pipeline parallelism in tensorrt-llm on 2+ NVIDIA GPUs, I encounter AssertionError: Expected but not provided tensors:{'transformer.vocab_embedding.weight'}. I tried other models, but the error is the same.

Environment

  • GPUs: 2xNVIDIA RTX 4090
  • Docker: optimum-nvidia (0.1.0b7, latest available)
  • Optimum Benchmark version: 0.4.0

Optimum Benchmark configuration

defaults:
  - benchmark
  - backend: tensorrt-llm
  - scenario: inference
  - launcher: process
  - _self_

name: trt_llama

launcher:
  device_isolation: true
  device_isolation_action: warn

backend:
  device: cuda
  dtype: bfloat16
  device_ids: 0,1
  max_prompt_length: 1024
  max_batch_size: 16
  max_new_tokens: 100
  tp: 1
  pp: 2
  world_size: 2
  gpus_per_node: 2
  model: IlyaGusev/saiga_llama3_8b

scenario:
  latency: true
  memory: false
  energy: false
  input_shapes:
    batch_size: 1
    sequence_length: 128
  generate_kwargs:
    max_new_tokens: 100
    min_new_tokens: 100

Logs

With mpirun:
trt-llm_2gpus_pp_mpirun_n2.log

Without mpirun:
trt-llm_2gpus_pp.log

Preview of the error:

AssertionError: Expected but not provided tensors:{'transformer.vocab_embedding.weight'}

@IlyasMoutawwakil (Member)

Hello, does this same configuration work for you outside of the context of optimum-benchmark?
Also, how did you launch your benchmarks? You mentioned mpirun, but I'm not sure that's needed to run distributed trt-llm.

@asesorov (Contributor, Author) commented Sep 23, 2024

@IlyasMoutawwakil when running without mpi, I get RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: mpiSize == tp * pp (/src/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:90)

Here's the sample configuration I use:

defaults:
  - benchmark
  - backend: tensorrt-llm
  - scenario: inference
  - launcher: process
  - _self_

name: trt_llama

launcher:
  device_isolation: true
  device_isolation_action: warn

backend:
  device: cuda
  dtype: bfloat16
  device_ids: 0,1
  max_prompt_length: 1024
  max_batch_size: 16
  tp: 2
  pp: 1
  world_size: 2
  gpus_per_node: 2
  model: TinyLlama/TinyLlama-1.1B-Chat-v1.0

scenario:
  latency: true
  memory: false
  energy: false
  input_shapes:
    batch_size: 1
    sequence_length: 128
  generate_kwargs:
    max_new_tokens: 100
    min_new_tokens: 100

And here is the command line which successfully launches the benchmark: mpirun -n 2 --allow-run-as-root optimum-benchmark --config-dir /mnt/host --config-name trt_llama_2gpus. However, without mpi I'm getting the mpiSize == tp * pp assertion error. Please tell me if I'm doing something wrong. Thank you in advance.
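
For reference, a minimal sketch of the check this assertion corresponds to (the numbers mirror the configs in this thread; the actual check lives in TensorRT-LLM's worldConfig.cpp):

# Illustration only: TensorRT-LLM's runtime asserts mpiSize == tp * pp
# (cpp/tensorrt_llm/runtime/worldConfig.cpp:90).
# Launching with `mpirun -n 2` gives mpiSize = 2, matching tp=2, pp=1;
# launching the same config without mpirun gives mpiSize = 1 and trips the assert.
tp, pp = 2, 1
for mpi_size in (1, 2):
    status = "OK" if mpi_size == tp * pp else "AssertionError: mpiSize == tp * pp"
    print(f"mpiSize={mpi_size}, tp*pp={tp * pp} -> {status}")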

@IlyasMoutawwakil (Member)

Will investigate this. I remember launching distributed (tp) trt-llm without mpirun, but it's been a while now.

@IlyasMoutawwakil (Member)

I was able to run trt-llm with tp and pp without the mpirun runner; I believe that's only needed for multi-node.
Both configs are being tested as part of the CI with TinyLlama.

@asesorov (Contributor, Author)

Very strange: I just tried to reproduce the CLI tests on my machine using the optimum-nvidia:latest container, and still got the same error:
test-cli:logging_utils.py:63 RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: mpiSize == tp * pp (/src/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:90)

In logs, I see that world_size is indeed 1:
[PYTEST-PROCESS][2024-09-24 07:23:01][test-cli][INFO] - [TensorRT-LLM][INFO] MPI size: 1, rank: 0

Here are my steps:

  1. docker run -it --rm --gpus all -e HF_HOME=/mnt/storage -e CUDA_VISIBLE_DEVICES=0,1 --name optimum-nvidia docker.io/huggingface/optimum-nvidia:latest
  2. pip install optimum-benchmark[tensorrt-llm]
  3. git clone https://github.com/huggingface/optimum-benchmark.git && cd optimum-benchmark/
  4. pip install -e .[testing]
  5. FORCE_SEQUENTIAL=1 pytest tests/test_cli.py -x -s -k "cli and cuda and tensorrt_llm and (tp or pp)"

@asesorov (Contributor, Author)

Sorry, I double-checked the logs and figured out that I was using pre-built engines from single-GPU runs 🤦‍♂️
Nevertheless, I still see this line after a successful run: [TensorRT-LLM][INFO] MPI size: 1, rank: 0
And in nvidia-smi I see that only 1 of the 2 GPUs is used during the CLI tests.

@asesorov (Contributor, Author)

Also, I see this in the GitHub CI log (e.g. https://github.com/huggingface/optimum-benchmark/actions/runs/11008321942/job/30565746560):

[PYTEST-PROCESS][2024-09-24 06:41:41][test-cli][INFO] - [TensorRT-LLM][INFO] MPI size: 1, rank: 0
[PYTEST-PROCESS][2024-09-24 06:41:42][test-cli][INFO] - [TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.

@IlyasMoutawwakil (Member)

In my "local" tests (on an A100) I see equal usage on both GPUs until the KV cache starts being allocated; that's when one GPU uses more than the other (and almost gets saturated). I guess that's weird, but it sounds like an issue in tensorrt-llm. I also don't get [TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available. locally; that warning points to an issue with the communication topology, as explained in NVIDIA/TensorRT-LLM#1487 (comment). I'm running "locally" on a DGX machine with SXM4, so it makes sense that p2p is supported.
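
If it helps to compare topologies, a quick check with PyTorch (assuming torch is installed in the container; this is a diagnostic sketch, not part of optimum-benchmark) reports whether P2P access is possible between the two devices:

import torch

# Prints whether direct peer-to-peer access is available between GPU 0 and GPU 1.
# On an SXM/NVLink box this should print True both ways; on a machine that logs
# "Device 0 peer access Device 1 is not available" it should print False.
for a, b in [(0, 1), (1, 0)]:
    print(f"GPU {a} -> GPU {b}: p2p={torch.cuda.can_device_access_peer(a, b)}")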

I also checked the optimum-nvidia code and it's using the LLM helper class at:
https://github.com/huggingface/optimum-nvidia/blob/main/src/optimum/nvidia/runtime.py
This API uses MPIPoolSession when mpirun is not used for launching: https://github.com/NVIDIA/TensorRT-LLM/blob/a65dba7aaf7e2d8bb0120eea8f8f04deff145d6a/tensorrt_llm/hlapi/llm.py#L126-L132
The class is better documented in the examples: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llm-api
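
For context, the linked llm-api examples use roughly the following shape; note that the import path and parameter names (LLM, SamplingParams, tensor_parallel_size) come from newer TensorRT-LLM releases and are assumptions relative to the 0.9.0dev hlapi shipped in the optimum-nvidia image:

# Sketch of the high-level LLM API from the linked examples; names are assumptions
# for the exact version discussed here. When launched without mpirun, the helper
# is supposed to spawn its own MPI pool for tp/pp > 1.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    tensor_parallel_size=2,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)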

@IlyasMoutawwakil (Member)

Tell me if this makes sense; I admit it is weird and confusing that the logs show MPI size as 1.

@asesorov (Contributor, Author)

Tried on another machine with different GPUs, and I still see the same usage pattern (one GPU is used and, as you said, almost saturated, while the other is idle):
[screenshot: nvidia-smi output showing one GPU near full utilization and the other idle]

Additionally, the metrics for single-GPU and for TP on 2 GPUs are identical (with the 4090 and TinyLlama, throughput is always around 350 tokens/s). Indeed, it seems like a trt-llm issue. Can you tell me whether it is possible to smoothly upgrade TensorRT-LLM from 0.9.0dev (used in the optimum-nvidia image) to a newer version to try it?
Also, when I used mpirun I (expectedly) saw two throughput results which were a bit different from each other - is it correct to sum these results to get the overall throughput?
And thank you for your help.

@IlyasMoutawwakil (Member) commented Sep 24, 2024

No, it's actually wrong to sum throughputs with TP or PP. These two strategies split the model, not the data, so in the case of TP the tensors are split and only half of the computation is performed on each GPU, but you can't have different inputs on each process (unlike DP). That's why batch_size=1 works with TP and PP, while the minimum batch size with DP is 2.
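
A small sketch of that aggregation rule (the function and variable names are illustrative, not optimum-benchmark internals): with TP/PP every rank cooperates on the same requests, so the end-to-end throughput is what a single rank reports, whereas with DP each rank handles its own shard of the batch and per-rank throughputs add up.

# Illustrative only -- not optimum-benchmark code.
def aggregate_throughput(per_rank_tokens_per_s, strategy):
    if strategy in ("tp", "pp"):
        # ranks work on the same batch: report one rank's number
        return per_rank_tokens_per_s[0]
    if strategy == "dp":
        # ranks work on different data: per-rank throughputs add up
        return sum(per_rank_tokens_per_s)
    raise ValueError(f"unknown strategy: {strategy}")

print(aggregate_throughput([350, 350], "tp"))  # 350 tokens/s, not 700
print(aggregate_throughput([350, 350], "dp"))  # 700 tokens/s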

It makes sense to me that TP gives only as much perf as a single GPU here; in fact, I'm surprised it even reaches that, as it's a strategy optimized for compute-bound problems (big weights + prefill = big matmuls) at the cost of a bit of communication overhead.

@IlyasMoutawwakil (Member)

@asesorov I can also easily implement an MPIrun launcher to verify these results. Will ping you in a PR.
