
Openmpi bugs #31

Closed
wants to merge 63 commits into from

Conversation

alexandrebouchard
Member

Potentially a new bug that showed up on Saturday after all tests were reintroduced. Maybe the same as JuliaParallel/MPI.jl#725, but not sure yet.

alexandrebouchard and others added 30 commits March 6, 2023 09:29
A crash seemed to occur in CI during garbage collection:

julia:3364 terminated with signal 11 at PC=7f997aaec971 SP=7f9936ccb970.  Backtrace:
/opt/hostedtoolcache/julia/1.8.5/x64/bin/../lib/julia/libjulia-internal.so.1(ijl_gc_safepoint+0x11)[0x7f997aaec971]

This looks like a crash in garbage-collection code (which would be consistent with the fact that it only shows up in the allocation-heavy tests, i.e. the Turing ones).

So even though the finalizer checks before calling free, maybe there is something flaky there.

#30 (comment)
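
For reference, a minimal sketch of the guarded-finalizer pattern under discussion (illustrative names only, not the package's actual implementation): the finalizer checks that MPI is still usable before freeing the wrapped handle.

using MPI

mutable struct WrappedComm
    comm::MPI.Comm
    function WrappedComm(comm::MPI.Comm)
        w = new(comm)
        # Freeing an MPI handle after MPI_Finalize (or mid-shutdown) can
        # segfault, so the finalizer guards the call to free.
        finalizer(w) do obj
            if MPI.Initialized() && !MPI.Finalized()
                MPI.free(obj.comm)
            end
        end
        return w
    end
end
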
@alexandrebouchard
Member Author

This might be related: JuliaPlots/Plots.jl#3583

Possibly the problem arises when JLLWrappers prepares the mpiexec environment:

https://github.com/JuliaPackaging/JLLWrappers.jl/blob/e5b7b192484c3b9fe51dbed03df269945f8c7594/src/products/executable_generators.jl#L22
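
A quick way to probe this hypothesis (a sketch only; it assumes the do-block wrapper that JLLWrappers generates for executable products, illustrated here with OpenMPI_jll) is to compare the process ENV inside the wrapper with the ENV outside it:

using OpenMPI_jll

before = Dict(ENV)                 # snapshot before entering the wrapper
OpenMPI_jll.mpiexec() do cmd
    inside = Dict(ENV)             # ENV as adjusted by the JLL wrapper
    @show cmd
    @show setdiff(keys(inside), keys(before))  # variables added by the wrapper
    @show cmd.env                  # newer JLLWrappers attach additions to the Cmd instead
end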

@alexandrebouchard
Member Author

Close inspection of a failing runtests.jl run against a system MPI revealed the following:

  • the following line works if placed at the very beginning of runtests.jl, but not later on in the script:
run(`/usr/local/bin/mpiexec ls`)
  • printing ENV reveals that it gets updated somewhere in between, with the following added values (a sketch of how to produce this diff is at the end of this comment):
IPATH_NO_BACKTRACE=1
HFI_NO_BACKTRACE=1
OMPI_MCA_ess=singleton
ORTE_SCHIZO_DETECTION=ORTE
OMPI_MCA_orte_launch=1
PMIX_NAMESPACE=1092091905
PMIX_RANK=0
PMIX_SERVER_URI3=1092091904.0;tcp4://127.0.0.1:53361
PMIX_SERVER_URI2=1092091904.0;tcp4://127.0.0.1:53361
PMIX_SERVER_URI21=1092091904.0;tcp4://127.0.0.1:53361
PMIX_SECURITY_MODE=native
PMIX_PTL_MODULE=tcp,usock
PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC
PMIX_GDS_MODULE=ds21,ds12,hash
PMIX_SERVER_TMPDIR=/tmp/ompi.korolev.501/pid.38533
PMIX_SYSTEM_TMPDIR=/tmp
PMIX_DSTORE_21_BASE_PATH=/tmp/ompi.korolev.501/pid.38533/pmix_dstor_ds21_38533
PMIX_DSTORE_ESH_BASE_PATH=/tmp/ompi.korolev.501/pid.38533/pmix_dstor_ds12_38533
PMIX_HOSTNAME=korolev.local
PMIX_VERSION=3.2.4rc1
OMPI_MCA_orte_precondition_transports=f14ed8193d52a2bd-56a807c4118762de
OMPI_MCA_pmix=^s1,s2,cray,isolated
PMIX_MCA_mca_base_component_show_load_errors=1
PMIX_MCA_ptl=tcp,usock
PMIX_MCA_psec=native
PMIX_MCA_gds=ds21,ds12,hash
OMPI_MCA_orte_ess_num_procs=1
OMPI_APP_CTX_NUM_PROCS=1

As soon as these additional environment variables are set, calling mpiexec no longer works (it fails with an uninformative pipeline_error).

Frustratingly, it is not at all clear what makes that modification. Probably some hack in MPIPreferences, JLLWrappers, or something else?
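
For the record, roughly how the diff above can be produced (a sketch; the constant name is arbitrary):

# at the very top of runtests.jl
const ENV_AT_START = Dict(ENV)

# ... later, just before the failing mpiexec call
added = setdiff(keys(ENV), keys(ENV_AT_START))
for k in sort!(collect(added))
    println(k, "=", ENV[k])
end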

@miguelbiron
Collaborator

That is very interesting! How did you figure this out!? Can you highlight the extra ENV variables that appear?

@alexandrebouchard
Member Author

The ones included are just the additional ones.
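
If those extra variables are indeed what breaks the launcher, one thing worth trying (a sketch, not verified) is to invoke mpiexec with a copy of the environment that strips the OMPI_*/PMIX_*/ORTE_* additions:

clean_env = filter(kv -> !startswith(kv.first, "OMPI_") &&
                         !startswith(kv.first, "PMIX_") &&
                         !startswith(kv.first, "ORTE_"), Dict(ENV))
run(setenv(`/usr/local/bin/mpiexec ls`, clean_env))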

@miguelbiron
Collaborator

Yikes...

@alexandrebouchard
Member Author

Those issues were later fixed in #30.
