
TensorRT-LLM pipeline parallelism is broken #259

Open

asesorov opened this issue Sep 12, 2024 · 12 comments

@asesorov (Contributor) commented Sep 12, 2024

Problem Description

When trying to use pipeline parallelism in tensorrt-llm on 2+ NVIDIA GPUs, I encounter AssertionError: Expected but not provided tensors:{'transformer.vocab_embedding.weight'}. I tried other models, but the error is the same.

Environment

  • GPUs: 2xNVIDIA RTX 4090
  • Docker: optimum-nvidia (0.1.0b7, latest available)
  • Optimum Benchmark version: 0.4.0

Optimum Benchmark configuration

defaults:
  - benchmark
  - backend: tensorrt-llm
  - scenario: inference
  - launcher: process
  - _self_

name: trt_llama

launcher:
  device_isolation: true
  device_isolation_action: warn

backend:
  device: cuda
  dtype: bfloat16
  device_ids: 0,1
  max_prompt_length: 1024
  max_batch_size: 16
  max_new_tokens: 100
  tp: 1
  pp: 2
  world_size: 2
  gpus_per_node: 2
  model: IlyaGusev/saiga_llama3_8b

scenario:
  latency: true
  memory: false
  energy: false
  input_shapes:
    batch_size: 1
    sequence_length: 128
  generate_kwargs:
    max_new_tokens: 100
    min_new_tokens: 100

Logs

With mpirun:
trt-llm_2gpus_pp_mpirun_n2.log

Without mpirun:
trt-llm_2gpus_pp.log

Preview of the error:

AssertionError: Expected but not provided tensors:{'transformer.vocab_embedding.weight'}

@IlyasMoutawwakil (Member)

Hello, does this same configuration work for you outside of the context of optimum-benchmark?
Also, how did you launch your benchmarks? You mentioned mpirun, but I'm not sure that's needed to run distributed trt-llm.

@asesorov (Contributor, Author) commented Sep 23, 2024

@IlyasMoutawwakil when running without mpi, I get RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: mpiSize == tp * pp (/src/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:90)

Here's the sample configuration I use:

defaults:
  - benchmark
  - backend: tensorrt-llm
  - scenario: inference
  - launcher: process
  - _self_

name: trt_llama

launcher:
  device_isolation: true
  device_isolation_action: warn

backend:
  device: cuda
  dtype: bfloat16
  device_ids: 0,1
  max_prompt_length: 1024
  max_batch_size: 16
  tp: 2
  pp: 1
  world_size: 2
  gpus_per_node: 2
  model: TinyLlama/TinyLlama-1.1B-Chat-v1.0

scenario:
  latency: true
  memory: false
  energy: false
  input_shapes:
    batch_size: 1
    sequence_length: 128
  generate_kwargs:
    max_new_tokens: 100
    min_new_tokens: 100

And here is the command line which successfully launches the benchmark: mpirun -n 2 --allow-run-as-root optimum-benchmark --config-dir /mnt/host --config-name trt_llama_2gpus. However, without mpi I'm getting the mpiSize == tp * pp assertion error. Please tell me if I'm doing something wrong. Thank you in advance.
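
For reference, a minimal sketch of the check this assertion corresponds to (the numbers mirror the configs in this thread; the actual check lives in TensorRT-LLM's worldConfig.cpp):

# Illustration only: TensorRT-LLM's runtime asserts mpiSize == tp * pp
# (cpp/tensorrt_llm/runtime/worldConfig.cpp:90).
# Launching with `mpirun -n 2` gives mpiSize = 2, matching tp=2, pp=1;
# launching the same config without mpirun gives mpiSize = 1 and trips the assert.
tp, pp = 2, 1
for mpi_size in (1, 2):
    status = "OK" if mpi_size == tp * pp else "AssertionError: mpiSize == tp * pp"
    print(f"mpiSize={mpi_size}, tp*pp={tp * pp} -> {status}")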

@IlyasMoutawwakil (Member)

Will investigate this. I remember launching distributed (tp) trt-llm without mpirun, but it's been a while now.

@IlyasMoutawwakil (Member)

I was able to run trt-llm with tp and pp without the mpirun runner; I believe that's only needed for multi-node.
Both configs are being tested as part of the CI with TinyLlama.

@asesorov (Contributor, Author)

Very strange: I just tried to reproduce the CLI tests on my machine using the optimum-nvidia:latest container, and still got the same error:
test-cli:logging_utils.py:63 RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: mpiSize == tp * pp (/src/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:90)

In logs, I see that world_size is indeed 1:
[PYTEST-PROCESS][2024-09-24 07:23:01][test-cli][INFO] - [TensorRT-LLM][INFO] MPI size: 1, rank: 0

Here are my steps:

  1. docker run -it --rm --gpus all -e HF_HOME=/mnt/storage -e CUDA_VISIBLE_DEVICES=0,1 --name optimum-nvidia docker.io/huggingface/optimum-nvidia:latest
  2. pip install optimum-benchmark[tensorrt-llm]
  3. git clone https://github.com/huggingface/optimum-benchmark.git && cd optimum-benchmark/
  4. pip install -e .[testing]
  5. FORCE_SEQUENTIAL=1 pytest tests/test_cli.py -x -s -k "cli and cuda and tensorrt_llm and (tp or pp)"

@asesorov (Contributor, Author)

Sorry, I double-checked the logs and figured out that I was using pre-built engines from single-GPU runs 🤦‍♂️
Nevertheless, I still see this line after a successful run: [TensorRT-LLM][INFO] MPI size: 1, rank: 0
And in nvidia-smi I see that only 1 of the 2 GPUs is used during the CLI tests.

@asesorov (Contributor, Author)

Also, I see this in the GitHub CI log (e.g. https://github.com/huggingface/optimum-benchmark/actions/runs/11008321942/job/30565746560):

[PYTEST-PROCESS][2024-09-24 06:41:41][test-cli][INFO] - [TensorRT-LLM][INFO] MPI size: 1, rank: 0
[PYTEST-PROCESS][2024-09-24 06:41:42][test-cli][INFO] - [TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.

@IlyasMoutawwakil (Member)

In my "local" tests (on an A100) I see equal usage on both GPUs until the KV cache starts being allocated; that's when one GPU uses more than the other (and almost gets saturated). I guess that's weird, but it sounds like an issue in tensorrt-llm. I also don't get [TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available. locally; that warning points to an issue with the communication topology, as explained in NVIDIA/TensorRT-LLM#1487 (comment). I'm running "locally" on a DGX machine with SXM4, so it makes sense that p2p is supported.
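
If it helps to compare topologies, a quick check with PyTorch (assuming torch is installed in the container; this is a diagnostic sketch, not part of optimum-benchmark) reports whether P2P access is possible between the two devices:

import torch

# Prints whether direct peer-to-peer access is available between GPU 0 and GPU 1.
# On an SXM/NVLink box this should print True both ways; on a machine that logs
# "Device 0 peer access Device 1 is not available" it should print False.
for a, b in [(0, 1), (1, 0)]:
    print(f"GPU {a} -> GPU {b}: p2p={torch.cuda.can_device_access_peer(a, b)}")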

I also checked the optimum-nvidia code and it's using the LLM helper class at:
https://github.com/huggingface/optimum-nvidia/blob/main/src/optimum/nvidia/runtime.py
This API uses MPIPoolSession when mpirun is not used for launching: https://github.com/NVIDIA/TensorRT-LLM/blob/a65dba7aaf7e2d8bb0120eea8f8f04deff145d6a/tensorrt_llm/hlapi/llm.py#L126-L132
The class is better documented in the examples: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llm-api
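
For context, the linked llm-api examples use roughly the following shape; note that the import path and parameter names (LLM, SamplingParams, tensor_parallel_size) come from newer TensorRT-LLM releases and are assumptions relative to the 0.9.0dev hlapi shipped in the optimum-nvidia image:

# Sketch of the high-level LLM API from the linked examples; names are assumptions
# for the exact version discussed here. When launched without mpirun, the helper
# is supposed to spawn its own MPI pool for tp/pp > 1.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    tensor_parallel_size=2,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)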

@IlyasMoutawwakil (Member)

Tell me if this makes sense; I admit it is weird and confusing that the logs show MPI size as 1.

@asesorov (Contributor, Author)

Tried on another machine with different GPUs, and I still see the same usage pattern (one GPU is used and, as you said, almost saturated, while the other is idle):
[screenshot: nvidia-smi output showing one GPU near full utilization and the other idle]

Additionally, the metrics for single-GPU and for TP on 2 GPUs are identical (with the 4090 and TinyLlama, throughput is always around 350 tokens/s). Indeed, it seems like a trt-llm issue. Can you tell me whether it is possible to smoothly upgrade TensorRT-LLM from 0.9.0dev (used in the optimum-nvidia image) to a newer version to try it?
Also, when I used mpirun I (expectedly) saw two throughput results which were a bit different from each other - is it correct to sum these results to get the overall throughput?
And thank you for your help.

@IlyasMoutawwakil (Member) commented Sep 24, 2024

No, it's actually wrong to sum throughputs with TP or PP. These two strategies split the model, not the data, so in the case of TP the tensors are split and only half of the computation is performed on each GPU, but you can't have different inputs on each process (unlike DP). That's why batch_size=1 works with TP and PP, while the minimum batch size with DP is 2.
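
A small sketch of that aggregation rule (the function and variable names are illustrative, not optimum-benchmark internals): with TP/PP every rank cooperates on the same requests, so the end-to-end throughput is what a single rank reports, whereas with DP each rank handles its own shard of the batch and per-rank throughputs add up.

# Illustrative only -- not optimum-benchmark code.
def aggregate_throughput(per_rank_tokens_per_s, strategy):
    if strategy in ("tp", "pp"):
        # ranks work on the same batch: report one rank's number
        return per_rank_tokens_per_s[0]
    if strategy == "dp":
        # ranks work on different data: per-rank throughputs add up
        return sum(per_rank_tokens_per_s)
    raise ValueError(f"unknown strategy: {strategy}")

print(aggregate_throughput([350, 350], "tp"))  # 350 tokens/s, not 700
print(aggregate_throughput([350, 350], "dp"))  # 700 tokens/s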

It makes sense to me that TP gives only as much perf as a single GPU here; in fact, I'm surprised it even reaches that, as it's a strategy optimized for compute-bound problems (big weights + prefill = big matmuls) at the cost of a bit of communication overhead.

@IlyasMoutawwakil (Member)

@asesorov I can also easily implement an MPIrun launcher to verify these results. Will ping you in a PR.
