mpi4py: Regression in spawn tests #10631

Closed
@dalcinl


I believe changes over the last week may have introduced issues in spawn support. Two successive runs of the mpi4py test suite failed at the same point. From the traceback, it looks like the failure happens while the children run MPI_Init_thread.

https://github.com/mpi4py/mpi4py-testing/runs/7703615156?check_suite_focus=true#step:17:1365
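For anyone trying this locally, here is a minimal sketch of the parent/child pattern that spawn tests of this kind exercise. It is not the code from test_spawn; the names `CHILD_CODE`, `child_command`, and `spawn_children` are made up for illustration, and it assumes an mpi4py installation built against the Open MPI build under test.

```python
import sys

# Code each child runs. Importing mpi4py.MPI initializes MPI in the child,
# which is the step where this issue reports the failure (MPI_Init_thread).
CHILD_CODE = "\n".join([
    "from mpi4py import MPI",
    "parent = MPI.Comm.Get_parent()",
    "parent.Barrier()",
    "parent.Disconnect()",
])


def child_command():
    """Return (executable, args) used to launch each child interpreter."""
    return sys.executable, ["-c", CHILD_CODE]


def spawn_children(maxprocs=2):
    """Spawn children from COMM_SELF, synchronize once, then disconnect."""
    # Imported lazily so child_command() can be used without an MPI runtime.
    from mpi4py import MPI

    command, args = child_command()
    child = MPI.COMM_SELF.Spawn(command, args=args, maxprocs=maxprocs)
    remote_size = child.Get_remote_size()  # number of spawned children
    child.Barrier()      # matches parent.Barrier() in the children
    child.Disconnect()   # matches parent.Disconnect() in the children
    return remote_size
```

Calling `spawn_children(2)` from a Python session may or may not trigger the same crash; whether it does likely depends on the Open MPI build and on how the process was launched.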

Traceback from link above
testArgsOnlyAtRootMultiple (test_spawn.TestSpawnSelf) ... [fv-az292-337:164868] *** Process received signal ***
[fv-az292-337:164868] Signal: Segmentation fault (11)
[fv-az292-337:164868] Signal code: Address not mapped (1)
[fv-az292-337:164868] Failing at address: 0x55a66b9ee180
[fv-az292-337:164868] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fdecf9c8090]
[fv-az292-337:164868] [ 1] /usr/local/lib/libopen-pal.so.0(+0xc8fc4)[0x7fdecebcffc4]
[fv-az292-337:164868] [ 2] /usr/local/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x45)[0x7fdecebd1733]
[fv-az292-337:164868] [ 3] /usr/local/lib/libopen-pal.so.0(+0xca9ab)[0x7fdecebd19ab]
[fv-az292-337:164868] [ 4] /usr/local/lib/libopen-pal.so.0(+0xcacab)[0x7fdecebd1cab]
[fv-az292-337:164868] [ 5] /usr/local/lib/libopen-pal.so.0(opal_progress+0x43)[0x7fdeceb3bd6f]
[fv-az292-337:164868] [ 6] /usr/local/lib/libopen-pal.so.0(ompi_sync_wait_mt+0x1ef)[0x7fdecebf1d3f]
[fv-az292-337:164868] [ 7] /usr/local/lib/libmpi.so.0(+0xa813e)[0x7fdececec13e]
[fv-az292-337:164868] [ 8] /usr/local/lib/libmpi.so.0(ompi_request_default_wait+0x2b)[0x7fdececec385]
[fv-az292-337:164868] [ 9] /usr/local/lib/libmpi.so.0(ompi_coll_base_bcast_intra_generic+0x760)[0x7fdecedd304b]
[fv-az292-337:164868] [10] /usr/local/lib/libmpi.so.0(ompi_coll_base_bcast_intra_pipeline+0x1a3)[0x7fdecedd3551]
[fv-az292-337:164868] [11] /usr/local/lib/libmpi.so.0(ompi_coll_tuned_bcast_intra_do_this+0x126)[0x7fdecee0bd76]
[fv-az292-337:164868] [12] /usr/local/lib/libmpi.so.0(ompi_coll_tuned_bcast_intra_dec_fixed+0x43c)[0x7fdecee02832]
[fv-az292-337:164868] [13] /usr/local/lib/libmpi.so.0(ompi_dpm_connect_accept+0x8a8)[0x7fdececbf3b7]
[fv-az292-337:164868] [14] /usr/local/lib/libmpi.so.0(ompi_dpm_dyn_init+0xd6)[0x7fdececccb28]
[fv-az292-337:164868] [15] /usr/local/lib/libmpi.so.0(ompi_mpi_init+0x837)[0x7fdececeeece]
[fv-az292-337:164868] [16] /usr/local/lib/libmpi.so.0(PMPI_Init_thread+0xdd)[0x7fdeced59548]
[fv-az292-337:164868] [17] /opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0x33f67)[0x7fdecf14af67]
[fv-az292-337:164868] [18] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(PyModule_ExecDef+0x73)[0x7fdecfdcc0c3]
[fv-az292-337:164868] [19] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x274460)[0x7fdecfdfa460]
[fv-az292-337:164868] [20] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x19745e)[0x7fdecfd1d45e]
[fv-az292-337:164868] [21] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(PyObject_Call+0x8e)[0x7fdecfceeffe]
[fv-az292-337:164868] [22] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x630b)[0x7fdecfd6bddb]
[fv-az292-337:164868] [23] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] [24] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x5021)[0x7fdecfd6aaf1]
[fv-az292-337:164868] [25] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] [26] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x773)[0x7fdecfd66243]
[fv-az292-337:164868] [27] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] [28] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x33e)[0x7fdecfd65e0e]
[fv-az292-337:164868] [29] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] *** End of error message ***
[fv-az292-337:164866] OPAL ERROR: Server not available in file dpm/dpm.c at line 403
[fv-az292-337:164855] OPAL ERROR: Server not available in file dpm/dpm.c at line 403
ERROR
[fv-az292-337:164867] OPAL ERROR: Server not available in file dpm/dpm.c at line 403
testCommSpawn (test_spawn.TestSpawnSelf) ... [fv-az292-337:00000] *** An error occurred in MPI_Init_thread
[fv-az292-337:00000] *** reported by process [1431306243,1]
[fv-az292-337:00000] *** on a NULL communicator
[fv-az292-337:00000] *** Unknown error
[fv-az292-337:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fv-az292-337:00000] ***    and MPI will try to terminate your MPI job as well)
ok
testCommSpawnMultiple (test_spawn.TestSpawnSelf) ... 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
Extra bits from valgrind (local run with debug build)
==1494211== Conditional jump or move depends on uninitialised value(s)
==1494211==    at 0x167996FC: pmix_bfrops_base_value_unload (bfrop_base_fns.c:409)
==1494211==    by 0x16798687: PMIx_Value_unload (bfrop_base_fns.c:54)
==1494211==    by 0x16065326: ompi_dpm_connect_accept (dpm.c:423)
==1494211==    by 0x160CCFD5: PMPI_Comm_spawn_multiple (comm_spawn_multiple.c:199)
==1494211==    by 0x15F1DD69: __pyx_pf_6mpi4py_3MPI_9Intracomm_38Spawn_multiple (MPI.c:149745)
==1494211==    by 0x15F1D6F8: __pyx_pw_6mpi4py_3MPI_9Intracomm_39Spawn_multiple (MPI.c:149423)
==1494211==    by 0x4991160: cfunction_call (methodobject.c:543)
==1494211==    by 0x498D262: _PyObject_MakeTpCall (call.c:215)
==1494211==    by 0x498C590: UnknownInlinedFun (abstract.h:112)
==1494211==    by 0x498C590: UnknownInlinedFun (abstract.h:99)
==1494211==    by 0x498C590: UnknownInlinedFun (abstract.h:123)
==1494211==    by 0x498C590: call_function (ceval.c:5869)
==1494211==    by 0x4985D92: _PyEval_EvalFrameDefault (ceval.c:4231)
==1494211==    by 0x49838D2: UnknownInlinedFun (pycore_ceval.h:46)
==1494211==    by 0x49838D2: _PyEval_Vector (ceval.c:5065)
==1494211==    by 0x4998FD7: UnknownInlinedFun (call.c:342)
==1494211==    by 0x4998FD7: UnknownInlinedFun (abstract.h:114)
==1494211==    by 0x4998FD7: method_vectorcall (classobject.c:53)
