
MPI Not Working in Quark #1281

Open · chengchen666 opened this issue May 23, 2024 · 11 comments

chengchen666 (Collaborator) commented May 23, 2024

Issue log:

root@5b2cd4b2aca7:/cchen/Quark/test# mpirun -np 2  --allow-run-as-root python3 pytorch_minimal.py
[5b2cd4b2aca7:00372] opal_ifinit: ioctl(SIOCGIFADDR) failed with errno=19
[5b2cd4b2aca7:00372] *** Process received signal ***
[5b2cd4b2aca7:00372] Signal: Floating point exception (8)
[5b2cd4b2aca7:00372] Signal code: Integer divide-by-zero (1)
[5b2cd4b2aca7:00372] Failing at address: 0x2afa05c7472e
[5b2cd4b2aca7:00372] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x2afa05e42520]
[5b2cd4b2aca7:00372] [ 1] /opt/hpcx/ompi/lib/libopen-pal.so.40(+0xb772e)[0x2afa05c7472e]
[5b2cd4b2aca7:00372] [ 2] /opt/hpcx/ompi/lib/libopen-pal.so.40(+0xb85d6)[0x2afa05c755d6]
[5b2cd4b2aca7:00372] [ 3] /opt/hpcx/ompi/lib/libopen-pal.so.40(+0xb8b0e)[0x2afa05c75b0e]
[5b2cd4b2aca7:00372] [ 4] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_hwloc201_hwloc_topology_load+0xdb)[0x2afa05c866eb]
[5b2cd4b2aca7:00372] [ 5] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_hwloc_base_get_topology+0x1116)[0x2afa05c53e66]
[5b2cd4b2aca7:00372] [ 6] /opt/hpcx/ompi/lib/openmpi/mca_ess_hnp.so(+0x6686)[0x2afa0605f686]
[5b2cd4b2aca7:00372] [ 7] /opt/hpcx/ompi/lib/libopen-rte.so.40(orte_init+0x2b8)[0x2afa05b94fd8]
[5b2cd4b2aca7:00372] [ 8] /opt/hpcx/ompi/lib/libopen-rte.so.40(orte_submit_init+0x8e5)[0x2afa05b45535]
[5b2cd4b2aca7:00372] [ 9] mpirun(+0x13a3)[0x5565f56533a3]
[5b2cd4b2aca7:00372] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x2afa05e29d90]
[5b2cd4b2aca7:00372] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x2afa05e29e40]
[5b2cd4b2aca7:00372] [12] mpirun(+0x11f5)[0x5565f56531f5]
[5b2cd4b2aca7:00372] *** End of error message ***
Floating point exception

sudo docker run -it --runtime=quark_d -v /home/Quark:/Quark --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/pytorch:24.01-py3 bash

To reproduce:
mpirun -np 2 --allow-run-as-root python3 /Quark/test/pytorch_minimal.py
Then change the code of pytorch_minimal.py to switch the device from GPU to CPU, to confirm that the CPU-only version of the program works first:

# device = torch.device("cuda:0")
device = torch.device("cpu")
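
For context, a minimal CPU-only script along these lines could look like the sketch below (hypothetical; the actual Quark/test/pytorch_minimal.py may differ). OMPI_COMM_WORLD_RANK is the rank variable Open MPI exports to each process launched by mpirun.

# Hypothetical CPU-only sketch, not the actual Quark/test/pytorch_minimal.py.
import os

import torch

# device = torch.device("cuda:0")
device = torch.device("cpu")  # run on CPU first to rule out GPU-side problems

# Open MPI exports the rank of each process started by mpirun.
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))

x = torch.randn(4, 4, device=device)
y = x @ x.t()
print(f"rank {rank}: result sum = {y.sum().item()}")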
QuarkContainer self-assigned this May 23, 2024
chengchen666 (Collaborator, Author) commented:

https://github.com/QuarkContainer/Quark/blob/gpu-multiprocessing/test/multiprocess_torchminimal.py also fails. The test program is in the gpu-multiprocessing branch. Running python3 multiprocess_torchminimal.py launches two CPU programs via Ray, so it requires an image with Ray preinstalled.

QuarkContainer (Owner) commented May 25, 2024

The repro could be simplified as below.

sudo docker run -it --runtime=quark_d --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/pytorch:24.01-py3 bash -c "mpirun -np 2 --allow-run-as-root ls"

As the crash is in /opt/hpcx/ompi/lib/libopen-pal.so, I downloaded it and ran objdump on it to get the assembly code below. It crashes at the last line.

b7701: 48 8b 7c 24 08 mov 0x8(%rsp),%rdi
b7706: 8b 77 20 mov 0x20(%rdi),%esi
b7709: 85 f6 test %esi,%esi
b770b: 75 69 jne b7776 <look_proc.isra.0+0x326>
b770d: 83 c1 01 add $0x1,%ecx
b7710: 41 89 4f 2c mov %ecx,0x2c(%r15)
b7714: 85 db test %ebx,%ebx
b7716: 75 2a jne b7742 <look_proc.isra.0+0x2f2>
b7718: c1 e8 1a shr $0x1a,%eax
b771b: 31 d2 xor %edx,%edx
b771d: 8d 48 01 lea 0x1(%rax),%ecx
b7720: 8b 44 24 1c mov 0x1c(%rsp),%eax
b7724: f7 f1 div %ecx
b7726: 31 d2 xor %edx,%edx
b7728: 89 c1 mov %eax,%ecx
b772a: 8b 44 24 10 mov 0x10(%rsp),%eax
b772e: f7 f1 div %ecx

The code at b7718 looks very much like the following source code:
https://github.com/open-mpi/hwloc/blob/63a8288d31a1baf67a909466aba9a022c78ca7b1/hwloc/topology-x86.c#L727

But I can't map the assembly to the C source code :-(

It might be related to open-mpi/hwloc#525.

The issue appears to be related to cpuid ax=4, cx=0. The following is a test which shows that when that leaf is disabled, the issue can be avoided.

e65898c
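
To make the crash site easier to read, here is a rough Python model of the two divisions at b7724 and b772e; the field names are assumptions based on the shift-by-26-plus-one pattern (they are not taken from the hwloc source, which I could not map exactly).

# Rough model of the division sequence at b7718..b772e. Field names are
# guesses; only the arithmetic mirrors the assembly above.
def look_proc_divisions(leaf4_eax: int, max_log_proc: int, log_proc_id: int) -> int:
    # b7718/b771d: shr $26 then +1 -> a "max cores" style value from cpuid leaf 4
    max_nbcores = (leaf4_eax >> 26) + 1
    # b7720/b7724: first div -> threads per core
    max_nbthreads = max_log_proc // max_nbcores
    # If the emulated cpuid makes max_log_proc smaller than max_nbcores (or zero),
    # max_nbthreads is 0 and the second division below fails. In the C code this
    # is an integer divide-by-zero (SIGFPE), matching the crash; in this Python
    # model it raises ZeroDivisionError instead.
    return log_proc_id // max_nbthreads  # b772a/b772e: second div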

chengchen666 (Collaborator, Author) commented:

> https://github.com/QuarkContainer/Quark/blob/gpu-multiprocessing/test/multiprocess_torchminimal.py also fails. The test program is in the gpu-multiprocessing branch. Running python3 multiprocess_torchminimal.py launches two CPU programs via Ray, so it requires an image with Ray preinstalled.

It sounds like making Ray work is easier. To set up an environment with Ray and reproduce, follow this commit: https://github.com/QuarkContainer/Quark/commit/a711f35b6706f004cc0f579a09926af821e84131

shrik3 (Collaborator) commented May 25, 2024

> The code at b7718 looks very much like the following source code: https://github.com/open-mpi/hwloc/blob/63a8288d31a1baf67a909466aba9a022c78ca7b1/hwloc/topology-x86.c#L727
>
> But I can't map the assembly to the C source code :-(
>
> It might be related to open-mpi/hwloc#525.
>
> The issue appears to be related to cpuid ax=4, cx=0. The following is a test which shows that when that leaf is disabled, the issue can be avoided.
>
> e65898c

Their code does check for a zero value, so the div %ecx at b772e should not happen... I'm confused...

https://github.com/open-mpi/hwloc/blob/63a8288d31a1baf67a909466aba9a022c78ca7b1/hwloc/topology-x86.c#L730

shrik3 (Collaborator) commented May 26, 2024

More verbose logging for mpirun:

mpirun -n 3 --prtemca rmaps_base_verbose 10 --display alloc --output tag ls

shrik3 (Collaborator) commented May 26, 2024

For the record, I tested the mfisherman/openmpi container; its mpirun (perhaps a newer library version than the PyTorch one?) doesn't hit the divide-by-zero condition.

However, both Quark and gVisor fail to run any program using mpirun, because the CPU topology is not correctly detected (see #1291). There is no easy fix at the moment.

@chengchen666 could you test running the same mpirun command with --map-by slot:oversubscribe? This works for me (not the correct way, but at least no error). @QuarkContainer has a workaround for the divide-by-zero condition.
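
For example, combining it with the repro command from earlier in this thread (same script path as above):

mpirun -np 2 --allow-run-as-root --map-by slot:oversubscribe python3 /Quark/test/pytorch_minimal.py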

QuarkContainer (Owner) commented:

@chengchen666 I added some workarounds in the GPUVirtMPI branch.

To make MPI runnable, we also need to use the following command line:

mpirun --host localhost,localhost -np 2 --allow-run-as-root python3 /Quark/test/pytorch_minimal.py

chengchen666 (Collaborator, Author) commented:

The GPUVirtMPI branch works for me. Thanks!

QuarkContainer (Owner) commented:

@chengchen666 I tried the Ray issue in the GPUVirtNew branch and can't repro the "Address" issue. The log is below.

root@brad-MS-7D46:/var/log/quark# rm quark.log; docker run -it --runtime=quark_d -v /home/brad/rust/Quark/test:/test --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 rayllm bash -c "python3 /test/multiprocess_torchminimal.py"

=============
== PyTorch ==

NVIDIA Release 23.09 (build 69180607)
PyTorch Version 2.1.0a0+32f93b1

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .

2024-05-28 13:20:33,634 INFO worker.py:1749 -- Started a local Ray instance.
Traceback (most recent call last):
File "/test/multiprocess_torchminimal.py", line 13, in
@ray.remote()
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 3431, in remote
assert len(args) == 0 and len(kwargs) > 0, ray_option_utils.remote_args_error_string
AssertionError: The @ray.remote decorator must be applied either with no arguments and no parentheses, for example '@ray.remote', or it must be applied using some of the arguments in the list ['max_calls', 'max_retries', 'num_cpus', 'num_returns', 'object_store_memory', 'retry_exceptions', '_generator_backpressure_num_objects', 'concurrency_groups', 'lifetime', 'max_concurrency', 'max_restarts', 'max_task_retries', 'max_pending_calls', 'namespace', 'get_if_exists', 'accelerator_type', 'memory', 'name', 'num_gpus', 'placement_group', 'placement_group_bundle_index', 'placement_group_capture_child_tasks', 'resources', 'runtime_env', 'scheduling_strategy', '_metadata', 'enable_task_events'], for example '@ray.remote(num_returns=2, resources={"CustomResource": 1})'.
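
Side note on that traceback: the assertion fires because line 13 applies @ray.remote with empty parentheses. Per the error message, the decorator has to be used either bare or with at least one keyword argument, roughly like this:

import ray

ray.init()

@ray.remote              # bare form: no arguments, no parentheses
def f():
    return 1

@ray.remote(num_cpus=1)  # or with at least one accepted keyword argument
def g():
    return 2

print(ray.get([f.remote(), g.remote()]))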

QuarkContainer (Owner) commented:

@chengchen666 the mpirun bug fix has been merged in #1298.

Now we don't need to add "--host localhost,localhost" to the mpirun command line.
