Closed
Description
I believe changes over the last week may have introduced issues in spawn support. Two successive runs of the mpi4py test suite both failed at the same point. From the traceback, it looks like the issue happens while the children run MPI_Init_thread.
https://github.com/mpi4py/mpi4py-testing/runs/7703615156?check_suite_focus=true#step:17:1365
Traceback from link above
testArgsOnlyAtRootMultiple (test_spawn.TestSpawnSelf) ... [fv-az292-337:164868] *** Process received signal ***
[fv-az292-337:164868] Signal: Segmentation fault (11)
[fv-az292-337:164868] Signal code: Address not mapped (1)
[fv-az292-337:164868] Failing at address: 0x55a66b9ee180
[fv-az292-337:164868] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fdecf9c8090]
[fv-az292-337:164868] [ 1] /usr/local/lib/libopen-pal.so.0(+0xc8fc4)[0x7fdecebcffc4]
[fv-az292-337:164868] [ 2] /usr/local/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x45)[0x7fdecebd1733]
[fv-az292-337:164868] [ 3] /usr/local/lib/libopen-pal.so.0(+0xca9ab)[0x7fdecebd19ab]
[fv-az292-337:164868] [ 4] /usr/local/lib/libopen-pal.so.0(+0xcacab)[0x7fdecebd1cab]
[fv-az292-337:164868] [ 5] /usr/local/lib/libopen-pal.so.0(opal_progress+0x43)[0x7fdeceb3bd6f]
[fv-az292-337:164868] [ 6] /usr/local/lib/libopen-pal.so.0(ompi_sync_wait_mt+0x1ef)[0x7fdecebf1d3f]
[fv-az292-337:164868] [ 7] /usr/local/lib/libmpi.so.0(+0xa813e)[0x7fdececec13e]
[fv-az292-337:164868] [ 8] /usr/local/lib/libmpi.so.0(ompi_request_default_wait+0x2b)[0x7fdececec385]
[fv-az292-337:164868] [ 9] /usr/local/lib/libmpi.so.0(ompi_coll_base_bcast_intra_generic+0x760)[0x7fdecedd304b]
[fv-az292-337:164868] [10] /usr/local/lib/libmpi.so.0(ompi_coll_base_bcast_intra_pipeline+0x1a3)[0x7fdecedd3551]
[fv-az292-337:164868] [11] /usr/local/lib/libmpi.so.0(ompi_coll_tuned_bcast_intra_do_this+0x126)[0x7fdecee0bd76]
[fv-az292-337:164868] [12] /usr/local/lib/libmpi.so.0(ompi_coll_tuned_bcast_intra_dec_fixed+0x43c)[0x7fdecee02832]
[fv-az292-337:164868] [13] /usr/local/lib/libmpi.so.0(ompi_dpm_connect_accept+0x8a8)[0x7fdececbf3b7]
[fv-az292-337:164868] [14] /usr/local/lib/libmpi.so.0(ompi_dpm_dyn_init+0xd6)[0x7fdececccb28]
[fv-az292-337:164868] [15] /usr/local/lib/libmpi.so.0(ompi_mpi_init+0x837)[0x7fdececeeece]
[fv-az292-337:164868] [16] /usr/local/lib/libmpi.so.0(PMPI_Init_thread+0xdd)[0x7fdeced59548]
[fv-az292-337:164868] [17] /opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0x33f67)[0x7fdecf14af67]
[fv-az292-337:164868] [18] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(PyModule_ExecDef+0x73)[0x7fdecfdcc0c3]
[fv-az292-337:164868] [19] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x274460)[0x7fdecfdfa460]
[fv-az292-337:164868] [20] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x19745e)[0x7fdecfd1d45e]
[fv-az292-337:164868] [21] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(PyObject_Call+0x8e)[0x7fdecfceeffe]
[fv-az292-337:164868] [22] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x630b)[0x7fdecfd6bddb]
[fv-az292-337:164868] [23] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] [24] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x5021)[0x7fdecfd6aaf1]
[fv-az292-337:164868] [25] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] [26] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x773)[0x7fdecfd66243]
[fv-az292-337:164868] [27] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] [28] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x33e)[0x7fdecfd65e0e]
[fv-az292-337:164868] [29] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] *** End of error message ***
[fv-az292-337:164866] OPAL ERROR: Server not available in file dpm/dpm.c at line 403
[fv-az292-337:164855] OPAL ERROR: Server not available in file dpm/dpm.c at line 403
ERROR
[fv-az292-337:164867] OPAL ERROR: Server not available in file dpm/dpm.c at line 403
testCommSpawn (test_spawn.TestSpawnSelf) ... [fv-az292-337:00000] *** An error occurred in MPI_Init_thread
[fv-az292-337:00000] *** reported by process [1431306243,1]
[fv-az292-337:00000] *** on a NULL communicator
[fv-az292-337:00000] *** Unknown error
[fv-az292-337:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fv-az292-337:00000] *** and MPI will try to terminate your MPI job as well)
ok
testCommSpawnMultiple (test_spawn.TestSpawnSelf) ... 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
Extra bits from valgrind (local run with debug build)
==1494211== Conditional jump or move depends on uninitialised value(s)
==1494211== at 0x167996FC: pmix_bfrops_base_value_unload (bfrop_base_fns.c:409)
==1494211== by 0x16798687: PMIx_Value_unload (bfrop_base_fns.c:54)
==1494211== by 0x16065326: ompi_dpm_connect_accept (dpm.c:423)
==1494211== by 0x160CCFD5: PMPI_Comm_spawn_multiple (comm_spawn_multiple.c:199)
==1494211== by 0x15F1DD69: __pyx_pf_6mpi4py_3MPI_9Intracomm_38Spawn_multiple (MPI.c:149745)
==1494211== by 0x15F1D6F8: __pyx_pw_6mpi4py_3MPI_9Intracomm_39Spawn_multiple (MPI.c:149423)
==1494211== by 0x4991160: cfunction_call (methodobject.c:543)
==1494211== by 0x498D262: _PyObject_MakeTpCall (call.c:215)
==1494211== by 0x498C590: UnknownInlinedFun (abstract.h:112)
==1494211== by 0x498C590: UnknownInlinedFun (abstract.h:99)
==1494211== by 0x498C590: UnknownInlinedFun (abstract.h:123)
==1494211== by 0x498C590: call_function (ceval.c:5869)
==1494211== by 0x4985D92: _PyEval_EvalFrameDefault (ceval.c:4231)
==1494211== by 0x49838D2: UnknownInlinedFun (pycore_ceval.h:46)
==1494211== by 0x49838D2: _PyEval_Vector (ceval.c:5065)
==1494211== by 0x4998FD7: UnknownInlinedFun (call.c:342)
==1494211== by 0x4998FD7: UnknownInlinedFun (abstract.h:114)
==1494211== by 0x4998FD7: method_vectorcall (classobject.c:53)
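For anyone trying to reproduce this outside the full test suite, here is a rough sketch of a standalone script that exercises the same path: the parent calls `Spawn_multiple`, and each child initializes MPI on importing `mpi4py.MPI`. It is not copied from `test_spawn.py`. The helper name `build_spawn_spec`, the `--spawn` flag, the child count, and the child code are all assumptions made for illustration, and it may or may not hit the same crash.

```python
import sys

# Child program (an assumption, not the test suite's actual child script).
# Importing mpi4py.MPI is what calls MPI_Init_thread in each spawned child,
# which is where the traceback above shows the segfault.
CHILD_CODE = (
    "from mpi4py import MPI; "
    "parent = MPI.Comm.Get_parent(); "
    "parent.Barrier(); "
    "parent.Disconnect()"
)


def build_spawn_spec(count=3, maxprocs=1):
    """Build the (commands, args, maxprocs) lists passed to Spawn_multiple.

    Hypothetical helper: `count` application entries, each running the
    current Python interpreter on CHILD_CODE with `maxprocs` processes.
    """
    commands = [sys.executable] * count
    args = [["-c", CHILD_CODE] for _ in range(count)]
    procs = [maxprocs] * count
    return commands, args, procs


def main():
    # Imported lazily so the helper above can be used without MPI installed.
    from mpi4py import MPI

    commands, args, procs = build_spawn_spec()
    # Spawn from COMM_SELF, as the failing tests are in TestSpawnSelf.
    child = MPI.COMM_SELF.Spawn_multiple(commands, args, procs, root=0)
    child.Barrier()
    child.Disconnect()


# Only spawn when asked explicitly, so loading the file has no side effects.
if __name__ == "__main__" and "--spawn" in sys.argv[1:]:
    main()
```

Saved as, say, `repro.py`, it would be run as `mpiexec -n 1 python repro.py --spawn` against the same Open MPI build.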