
Inference freezes when running llama example with pp>2 #1118

Open

@JamesLYan

Hi,
I am trying to run the example script provided for the llama model, for inference only. Since the repository is going through a migration with a lot of changes, I went back and installed the stable v0.2.0 release. Everything works fine until I try to run the example script with CPU initialization on more than 2 pipeline stages. I am running on a server with 8 NVIDIA L4 GPUs. With pp = 2 it works perfectly, but as soon as I run the same script with pp > 2, after the model is initialized all of the other GPUs sit at 0% utilization according to nvidia-smi, the GPU ranked 1 stays at 100% utilization, and the entire inference process freezes. Has anyone seen similar issues? Or is there a quick fix I can try?
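
For reference, this is roughly what I am running: a condensed sketch of the v0.2.0 llama example as I remember it. The exact helper names (`pipeline`, `annotate_split_points`, `SplitPoint`, `PipelineStage`), the argument order, and the model id are written from memory and may be slightly off after the migration, so treat this as a sketch of the setup rather than the exact script.

```python
# Launched with one process per pipeline stage, e.g. pp = 4:
#   torchrun --nproc-per-node 4 pippy_llama.py
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Import paths follow the v0.2.0 example as I remember it; may differ post-migration.
from pippy import pipeline, annotate_split_points, SplitPoint, PipelineStage

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

# CPU initialization: the full model is materialized in host memory first.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", low_cpu_mem_usage=True
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(["Hello, my name is"], return_tensors="pt").to(device)

# Split the decoder layers evenly across the world_size pipeline stages.
layers_per_rank = model.config.num_hidden_layers // world_size
annotate_split_points(
    model,
    {f"model.layers.{i * layers_per_rank}": SplitPoint.BEGINNING
     for i in range(1, world_size)},
)
pipe = pipeline(model, world_size, example_args=(inputs["input_ids"],))

# Each rank owns one stage and moves only its shard to its GPU.
torch.distributed.init_process_group(rank=rank, world_size=world_size)
stage = PipelineStage(pipe, rank, device=device)

# Rank 0 feeds the input; the last rank returns logits.
# With pp = 2 this call returns; with pp > 2 it hangs here,
# with GPU rank 1 at 100% utilization and the rest idle.
output = stage(inputs["input_ids"] if rank == 0 else None)
if output is not None:
    print(tokenizer.batch_decode(torch.argmax(output[0][:, -1, :], dim=-1)))
```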

NVCC and CUDA version: 12.1
torch version: 2.4.0.dev20240521+cu118
