Open
Description
Issue log:
root@5b2cd4b2aca7:/cchen/Quark/test# mpirun -np 2 --allow-run-as-root python3 pytorch_minimal.py
[5b2cd4b2aca7:00372] opal_ifinit: ioctl(SIOCGIFADDR) failed with errno=19
[5b2cd4b2aca7:00372] *** Process received signal ***
[5b2cd4b2aca7:00372] Signal: Floating point exception (8)
[5b2cd4b2aca7:00372] Signal code: Integer divide-by-zero (1)
[5b2cd4b2aca7:00372] Failing at address: 0x2afa05c7472e
[5b2cd4b2aca7:00372] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x2afa05e42520]
[5b2cd4b2aca7:00372] [ 1] /opt/hpcx/ompi/lib/libopen-pal.so.40(+0xb772e)[0x2afa05c7472e]
[5b2cd4b2aca7:00372] [ 2] /opt/hpcx/ompi/lib/libopen-pal.so.40(+0xb85d6)[0x2afa05c755d6]
[5b2cd4b2aca7:00372] [ 3] /opt/hpcx/ompi/lib/libopen-pal.so.40(+0xb8b0e)[0x2afa05c75b0e]
[5b2cd4b2aca7:00372] [ 4] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_hwloc201_hwloc_topology_load+0xdb)[0x2afa05c866eb]
[5b2cd4b2aca7:00372] [ 5] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_hwloc_base_get_topology+0x1116)[0x2afa05c53e66]
[5b2cd4b2aca7:00372] [ 6] /opt/hpcx/ompi/lib/openmpi/mca_ess_hnp.so(+0x6686)[0x2afa0605f686]
[5b2cd4b2aca7:00372] [ 7] /opt/hpcx/ompi/lib/libopen-rte.so.40(orte_init+0x2b8)[0x2afa05b94fd8]
[5b2cd4b2aca7:00372] [ 8] /opt/hpcx/ompi/lib/libopen-rte.so.40(orte_submit_init+0x8e5)[0x2afa05b45535]
[5b2cd4b2aca7:00372] [ 9] mpirun(+0x13a3)[0x5565f56533a3]
[5b2cd4b2aca7:00372] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x2afa05e29d90]
[5b2cd4b2aca7:00372] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x2afa05e29e40]
[5b2cd4b2aca7:00372] [12] mpirun(+0x11f5)[0x5565f56531f5]
[5b2cd4b2aca7:00372] *** End of error message ***
Floating point exception
To reproduce, start the container with the Quark runtime:
sudo docker run -it --runtime=quark_d -v /home/Quark:/Quark --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/pytorch:24.01-py3 bash
Then, inside the container, run:
mpirun -np 2 --allow-run-as-root python3 /Quark/test/pytorch_minimal.py
Before running, edit pytorch_minimal.py to switch the device from GPU to CPU, so that the CPU-only version of the program is confirmed to work first:
# device = torch.device("cuda:0")
device = torch.device("cpu")
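To avoid editing the script each time the device changes, the choice could be read from the environment instead. The following is only a sketch: `pick_device_name` and the `QUARK_TEST_DEVICE` variable are hypothetical names introduced here for illustration, and nothing in pytorch_minimal.py is known to read them.

```python
import os


def pick_device_name(default: str = "cpu") -> str:
    """Return the device string to hand to torch.device().

    QUARK_TEST_DEVICE is a hypothetical environment variable used only
    in this sketch; it falls back to the CPU when unset.
    """
    return os.environ.get("QUARK_TEST_DEVICE", default)


if __name__ == "__main__":
    # Print the device name that would be selected in this environment.
    print(pick_device_name())
```

In the script, the two device lines above would then become `device = torch.device(pick_device_name())`, with the CPU as the default and the GPU selected by exporting `QUARK_TEST_DEVICE=cuda:0` before calling mpirun.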