
Inference freezes when running llama example with pp>2 #1118

Open

@JamesLYan

Hi,
I am trying to run the example script provided for the llama model, for inference only. Since the repository is going through a migration with a lot of changes, I went back and installed the stable v0.2.0 release. Everything works fine until I try to run the example script with CPU initialization on more than 2 pipeline stages. I am running on a server with 8 NVIDIA L4 GPUs. With pp = 2 it works perfectly, but as soon as I run the same script with pp > 2, after the model is initialized all of the other GPUs sit at 0% utilization according to nvidia-smi, the GPU ranked 1 stays at 100% utilization, and the entire inference process freezes. Has anyone seen similar issues? Or is there a quick fix I can try?
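
For reference, this is roughly what I am running: a condensed sketch of the v0.2.0 llama example as I remember it. The exact helper names (`pipeline`, `annotate_split_points`, `SplitPoint`, `PipelineStage`), the argument order, and the model id are written from memory and may be slightly off after the migration, so treat this as a sketch of the setup rather than the exact script.

```python
# Launched with one process per pipeline stage, e.g. pp = 4:
#   torchrun --nproc-per-node 4 pippy_llama.py
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Import paths follow the v0.2.0 example as I remember it; may differ post-migration.
from pippy import pipeline, annotate_split_points, SplitPoint, PipelineStage

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

# CPU initialization: the full model is materialized in host memory first.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", low_cpu_mem_usage=True
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(["Hello, my name is"], return_tensors="pt").to(device)

# Split the decoder layers evenly across the world_size pipeline stages.
layers_per_rank = model.config.num_hidden_layers // world_size
annotate_split_points(
    model,
    {f"model.layers.{i * layers_per_rank}": SplitPoint.BEGINNING
     for i in range(1, world_size)},
)
pipe = pipeline(model, world_size, example_args=(inputs["input_ids"],))

# Each rank owns one stage and moves only its shard to its GPU.
torch.distributed.init_process_group(rank=rank, world_size=world_size)
stage = PipelineStage(pipe, rank, device=device)

# Rank 0 feeds the input; the last rank returns logits.
# With pp = 2 this call returns; with pp > 2 it hangs here,
# with GPU rank 1 at 100% utilization and the rest idle.
output = stage(inputs["input_ids"] if rank == 0 else None)
if output is not None:
    print(tokenizer.batch_decode(torch.argmax(output[0][:, -1, :], dim=-1)))
```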

NVCC and CUDA version: 12.1
torch version: 2.4.0.dev20240521+cu118
