mpiexecjl doesn't handle juliaup with non-default channel #857

Open
giordano opened this issue Aug 6, 2024 · 3 comments · Fixed by #858

@giordano
Member

giordano commented Aug 6, 2024

I have a system where the test introduced in #834 is failing:

MPIPreferences:
  binary:  OpenMPI_jll
  abi:     OpenMPI

Package versions
  MPI.jl:             0.20.20
  MPIPreferences.jl:  0.1.11
  OpenMPI_jll:        4.1.6+0

Library information:
  libmpi:  /home/cceamgi/.julia/artifacts/58dcf187642cdfbafb3581993ca3d8de565acc78/lib/libmpi.so
  libmpi dlpath:  /home/cceamgi/.julia/artifacts/58dcf187642cdfbafb3581993ca3d8de565acc78/lib/libmpi.so
  MPI version:  3.1.0
  Library version:
    Open MPI v4.1.6, package: Open MPI [email protected] Distribution, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023
Hello world, I am rank 3 of 4
Hello world, I am rank 2 of 4
Hello world, I am rank 0 of 4
Hello world, I am rank 1 of 4
mpiexecjl: Test Failed at /home/cceamgi/.julia/packages/MPI/is7GN/test/mpiexecjl.jl:41
  Expression: p.exitcode == exit_code
   Evaluated: 1 == 10

I need to investigate what's wrong with this. For the record, this isn't specific to OpenMPI_jll: I see the same with MPICH_jll. I wonder if the problem is the shell; here /bin/sh is

$ /bin/sh --version
GNU bash, version 5.1.8(1)-release (aarch64-redhat-linux-gnu)
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
@giordano giordano added the bug label Aug 6, 2024
@giordano
Member Author

giordano commented Aug 6, 2024

Ah, the problem is that Julia doesn't start at all. I can see errors like

ERROR: Unable to load dependent library /data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12
Message:/data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12: undefined symbol: unw_ensure_tls
ERROR: Unable to load dependent library /data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12
ERROR: Unable to load dependent library /data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12
Message:/data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12: undefined symbol: unw_ensure_tls
ERROR: Unable to load dependent library /data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12
Message:/data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12: undefined symbol: unw_ensure_tls
Message:/data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12: undefined symbol: unw_ensure_tls

@giordano
Member Author

On a different system I'm seeing the same outside of tests with Julia nightly:

$ ~/.julia/bin/mpiexecjl -np 1 --project julia +nightly -e ''
ERROR: Unable to load dependent library /home/mose/.julia/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12
Message:/home/mose/.julia/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12: undefined symbol: unw_ensure_tls
┌ Error: The MPI process failed
│   proc = Process(setenv(`/home/mose/.julia/artifacts/62773cea33514bc12f48f228effadcb2ead6184a/bin/mpiexec -np 1 julia +nightly -e ''`,[...]), ProcessExited(1))
└ @ Main none:7

I suspect this is a real issue with Julia v1.12.

@giordano giordano reopened this Aug 24, 2024
@giordano
Member Author

giordano commented Aug 24, 2024

Ah, I understand the issue now, and I understand why JULIA_BINDIR solved the issue in #858. TL;DR: the issue arises with mpiexecjl when using juliaup with a channel different from the default one.

In

MPI.jl/bin/mpiexecjl

Lines 54 to 58 in 780aaa0

if [ -n "${JULIA_BINDIR}" ]; then
    JULIA_CMD="${JULIA_BINDIR}/julia"
else
    JULIA_CMD="julia"
fi
we run julia assuming it's in PATH (unless JULIA_BINDIR is set). But if I try to run mpiexecjl ... julia +nightly, we enter the script

MPI.jl/bin/mpiexecjl

Lines 61 to 70 in 780aaa0

SCRIPT='
using MPI
ENV["JULIA_PROJECT"] = dirname(Base.active_project())
proc = run(pipeline(`$(mpiexec()) $(ARGS)`; stdout, stderr); wait=false)
wait(proc)
if !iszero(proc.exitcode)
    @error "The MPI process failed" proc
end
exit(proc.exitcode)
'
with the default juliaup channel, which sets up LD_LIBRARY_PATH for that version of Julia. This breaks down when we then try to start the other julia process: if that's a different version of Julia, we end up mixing libraries from different Julia versions. This also explains why we don't have problems here in CI: we don't use juliaup (let alone mix different channels).
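
For anyone hitting this in the meantime, here is a rough sketch of the JULIA_BINDIR workaround, so that the wrapper script and the MPI ranks use the same Julia. The helper name channel_bindir and the JULIA_LAUNCHER override are made up for illustration and are not part of mpiexecjl; it assumes the juliaup launcher is the julia found in PATH and relies on Sys.BINDIR to report where the channel's binaries live.

```shell
#!/bin/sh
# Hypothetical helper (not part of mpiexecjl): print the bin directory of the
# Julia installation behind a given juliaup channel.
# JULIA_LAUNCHER is an assumed override for the launcher; it defaults to the
# `julia` found in PATH, which is expected to be the juliaup launcher.
channel_bindir() {
    "${JULIA_LAUNCHER:-julia}" "+$1" --startup-file=no -e 'print(Sys.BINDIR)'
}

# Intended use (commented out, since it needs juliaup and MPI.jl installed):
#   JULIA_BINDIR="$(channel_bindir nightly)" mpiexecjl -np 1 --project julia +nightly -e ''
```

With JULIA_BINDIR set this way, the JULIA_CMD branch quoted above picks the nightly binary for the wrapper script as well, so the environment it sets up should match the Julia that the ranks start.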

I'm really not sure we have a good solution for this besides setting JULIA_BINDIR 🤔 Should we parse julia +channel specially in the script to deal with this? That'd complicate argument parsing quite a bit.
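
To give an idea of what parsing julia +channel could look like, here is a minimal sketch in POSIX sh. The function name find_julia_channel is hypothetical and nothing like it exists in mpiexecjl today. It only recognises a bare julia word immediately followed by +channel; it would miss a full path such as /usr/bin/julia, and it could misfire if some mpiexec option happens to take the literal value julia, which hints at how the argument parsing gets complicated.

```shell
#!/bin/sh
# Hypothetical helper (not part of mpiexecjl): scan the arguments for the word
# "julia" immediately followed by "+<channel>" and print the channel name.
# Returns 0 if a channel was found, 1 otherwise (printing nothing).
find_julia_channel() {
    prev=""
    for arg in "$@"; do
        if [ "$prev" = "julia" ]; then
            case "$arg" in
                # "+" followed by at least one character, e.g. "+nightly"
                +?*) printf '%s\n' "${arg#+}"; return 0 ;;
            esac
        fi
        prev="$arg"
    done
    return 1
}
```

The script could then use the detected channel to start its own Julia process (for example by running julia "+$channel" instead of plain julia), but that part is left out here because it depends on how the rest of the script is structured.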

@giordano giordano changed the title from "mpiexecjl exit code test failing" to "mpiexecjl doesn't handle juliaup with non-default channel" Sep 7, 2024