MPI Not Working in Quark #1281
https://github.com/QuarkContainer/Quark/blob/gpu-multiprocessing/test/multiprocess_torchminimal.py also failed. The test program is in branch gpu-multiprocessing; just run it.
The repro can be simplified as below:

sudo docker run -it --runtime=quark_d --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/pytorch:24.01-py3 bash -c "mpirun -np 2 --allow-run-as-root ls"

Since the crash is in /opt/hpcx/ompi/lib/libopen-pal.so, I downloaded that library and ran objdump on it to get the assembly below. It crashed at the last line:

b7701: 48 8b 7c 24 08 mov 0x8(%rsp),%rdi

The code at b7718 looks very much like the following source code, but I can't map the assembly back to the C source :-( It might be related to open-mpi/hwloc#525. The issue seems related to cpuid with ax=4, cx=0. Following is a test that proves the issue can be avoided when that is disabled.
Sounds like making Ray work is easier. To set up an env with Ray and reproduce, follow this commit.
Their code does check for the zero value, so the
more verbose logging on
For the record, I tested it. However, both quark and gvisor fail to run any program using it. @chengchen666, could you test running the same?
@chengchen666 I added some workarounds in the branch GPUVirtMPI. To get MPI runnable, we also need to use the following command line:

mpirun --host localhost,localhost -np 2 --allow-run-as-root python3 /Quark/test/pytorch_minimal.py
GPUVirtMPI branch works for me. Thanks |
@chengchen666 I tried the Ray issue in the GPUVirtNew branch and can't repro the "Address" issue. The log is as below:

root@brad-MS-7D46:/var/log/quark# rm quark.log; docker run -it --runtime=quark_d -v /home/brad/rust/Quark/test:/test --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 rayllm bash -c "python3 /test/multiprocess_torchminimal.py"
=============
@chengchen666 the mpirun bug fix has been merged in commit #1298. Now we don't need to add "--host localhost,localhost" to the mpirun command line.
Issue log:
sudo docker run -it --runtime=quark_d -v /home/Quark:/Quark --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/pytorch:24.01-py3 bash
To reproduce:
mpirun -np 2 --allow-run-as-root python3 /Quark/test/pytorch_minimal.py
Then change the code of pytorch_minimal.py, switching the device from GPU to CPU, to make sure the CPU version of the program works first.