Description
Hi,
I am trying to run the example llama inference script. Since the repository is going through a migration with a lot of changes, I went back and installed the stable v0.2.0 release. Everything works fine until I try to run the example script with CPU initialization on more than 2 pipeline stages. I am running on a server with 8 NVIDIA L4 GPUs. With pp = 2 it works perfectly, but as soon as I run the same script with pp greater than 2, then after the model is initialized all GPUs except rank 1 show 0% utilization in the nvidia-smi output, the GPU with rank 1 sits at 100% utilization, and the entire inference process freezes. Has anyone seen similar issues? Or is there a quick fix I can try?
NVCC and CUDA version: 12.1
torch version: 2.4.0.dev20240521+cu118
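In case it helps with debugging, here is a minimal sketch of what I plan to add near the top of the example script to capture per-rank stack traces once the freeze happens (the signal choice and the timeout value are my own assumptions, not something from the repo):

```python
# Minimal sketch (my own addition, not from the example script): dump every
# rank's Python stack so I can see which call each process is blocked in.
import faulthandler
import os
import signal

# `kill -USR1 <pid>` on a stuck rank prints all thread stacks to stderr.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Also dump automatically (and exit) if a rank makes no progress for
# 10 minutes; the timeout value is arbitrary.
faulthandler.dump_traceback_later(timeout=600, exit=True)

# RANK is set by the torchrun-style launcher in my setup.
print(f"faulthandler armed on rank {os.environ.get('RANK', '?')}", flush=True)
```

I will also rerun with `NCCL_DEBUG=INFO` and `TORCH_DISTRIBUTED_DEBUG=DETAIL` set and can attach the resulting logs if that would help narrow down which collective the hang happens in.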