MPI runs fail with Intel v2023.0.0 #141

Open
dominic-chang opened this issue Sep 28, 2023 · 7 comments
Comments

@dominic-chang
Contributor

Running the MPI example:

using Pigeons
result = pigeons(
    target = toy_mvn_target(100), 
    checkpoint = true, 
    on = ChildProcess(
            n_local_mpi_processes = 4))

with OpenMPI v4.1.5 on Intel v2023.0.0 results in the following error:

ERROR: ERROR: LoadError: LoadError: AssertionError: all(1 .≤ to_global_indices .≤ e.load.n_global_indices)
Stacktrace:
  [1] transmit!(e::Pigeons.Entangler, source_data::Vector{Int64}, to_global_indices::Vector{Int64}, write_received_data_here::Vector{Int64})
    @ Pigeons ~/.julia/packages/Pigeons/nM8Nq/src/mpi_utils/Entangler.jl:136
  [2] transmit
    @ ~/.julia/packages/Pigeons/nM8Nq/src/mpi_utils/Entangler.jl:100 [inlined]
  [3] permuted_get(p::Pigeons.PermutedDistributedArray{Int64}, indices::Vector{Int64})
    @ Pigeons ~/.julia/packages/Pigeons/nM8Nq/src/mpi_utils/PermutedDistributedArray.jl:75
  [4] swap!(pair_swapper::Vector{Pigeons.ScaledPrecisionNormalLogPotential}, replicas::EntangledReplicas{...}, swap_graph::Pigeons.OddEven)
    @ Pigeons ~/.julia/packages/Pigeons/nM8Nq/src/swap/swap.jl:84
  [5] communicate!
    @ ~/.julia/packages/Pigeons/nM8Nq/src/pt/pigeons.jl:68 [inlined]
  [6] macro expansion
    @ ~/.julia/packages/Pigeons/nM8Nq/src/pt/pigeons.jl:51 [inlined]
  [7] macro expansion
    @ ./timing.jl:501 [inlined]
  [8] run_one_round!(pt::PT{...})
    @ Pigeons ~/.julia/packages/Pigeons/nM8Nq/src/pt/pigeons.jl:49
  [9] pigeons(pt::PT{...})
    @ Pigeons ~/.julia/packages/Pigeons/nM8Nq/src/pt/pigeons.jl:18
 [10] top-level scope
    @ ~/bamextension/results/all/2023-09-28-09-02-05-Gdtupdpb/.launch_script.jl:15
in expression starting at /n/home06/dochang/bamextension/results/all/2023-09-28-09-02-05-Gdtupdpb/.launch_script.jl:15
@miguelbiron
Collaborator

Hi -- can you post the output of versioninfo() please?

@dominic-chang
Contributor Author

Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 4 on 16 virtual cores
Environment:
  LD_LIBRARY_PATH = /n/sw/helmod-rocky8/apps/Comp/intel/23.2.0-fasrc01/openmpi/4.1.5-fasrc03/lib64:/n/sw/intel-oneapi-2023.2/tbb/2023.2.0/lib/intel64:/n/sw/intel-oneapi-2023.2/mkl/2023.2.0/lib/intel64:/n/sw/intel-oneapi-2023.2/compiler/2023.2.0/linux/compiler/lib/intel64:/n/sw/intel-oneapi-2023.2/compiler/2023.2.0/linux/lib:/n/sw/helmod-rocky8/apps/Core/gcc/13.2.0-fasrc01/lib64:/n/sw/helmod-rocky8/apps/Core/mpc/1.3.1-fasrc02/lib64:/n/sw/helmod-rocky8/apps/Core/mpfr/4.2.1-fasrc01/lib64:/n/sw/helmod-rocky8/apps/Core/gmp/6.3.0-fasrc01/lib64:/usr/local/lib:/n/sw/helmod-rocky8/apps/Core/cuda/12.2.0-fasrc01/cuda/extras/CUPTI/lib64:/n/sw/helmod-rocky8/apps/Core/cuda/12.2.0-fasrc01/cuda/lib64:/n/sw/helmod-rocky8/apps/Core/cuda/12.2.0-fasrc01/cuda/lib::
  JULIA_NUM_THREADS = 4

@miguelbiron
Collaborator

Thank you. The first thing to try here is to avoid using the system MPI and see whether the example runs. You can do this by not running setup_mpi; if you already did, you can manually delete the LocalPreferences.toml file that was created where your Project.toml lives. Let us know how it goes.
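
For what it's worth, here is a minimal sketch of switching back to the Julia-provided MPI binaries, assuming MPIPreferences.jl is available in the active environment; this rewrites LocalPreferences.toml rather than deleting it:

using MPIPreferences

# Point MPI.jl back at the bundled (JLL) MPI implementation instead of the
# system MPI. This updates LocalPreferences.toml next to the active Project.toml.
MPIPreferences.use_jll_binary()

# Restart Julia afterwards for the preference change to take effect.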

@dominic-chang
Contributor Author

Thanks. I ended up using the gcc compiler instead. This resolved the segfault issue that I was having.

@dominic-chang
Contributor Author

dominic-chang commented Oct 8, 2023

I was having an issue with OpenMPI and another dependency, so I ended up switching back to the Intel compiler. This time I am using Intel MPI v2021.10.0. I deleted the LocalPreferences.toml and ran pigeons without setup_mpi, which resolved the segmentation fault, but the run still failed immediately. Here are the contents of info/stderr.txt:

 match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:91): unrecognized argument merge-stderr-to-stdout
 HYD_arg_parse_array (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:128): argument matching returned error
 mpiexec_get_parameters (../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1359): error parsing input array
 main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1893): error parsing parameters

I am running this example from the tutorials:

mpi_run = pigeons(
    target = toy_mvn_target(1000000), 
    n_chains = 1000,
    checkpoint = true,
    on = MPI(
        n_mpi_processes = 1000,
        n_threads = 1))

I think this error occurs because -output-filename and -merge-stderr-to-stdout are not flags recognized by Intel's version of mpiexec. The submission script executes correctly if I replace these flags with their Intel counterparts.

@alexandrebouchard
Member

Let me know what the corresponding flags are; that should be the basis of a relatively simple patch. The only missing piece is whether there is a robust way to "detect" that mpiexec is Intel's.
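
One possible approach for the detection piece (a sketch only; the helper name and the reliance on mpiexec accepting --version are assumptions, not something Pigeons currently does) is to parse the launcher's version banner:

# Hypothetical helper, not part of Pigeons: guess whether the active mpiexec
# is Intel MPI by parsing its version banner. Assumes the launcher accepts a
# --version flag and prints vendor information there.
function mpiexec_is_intel(mpiexec_cmd::Cmd = `mpiexec`)
    banner = try
        read(`$mpiexec_cmd --version`, String)
    catch
        return false   # could not run the launcher; assume it is not Intel MPI
    end
    return occursin(r"Intel\(R\) MPI"i, banner)
end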

@dominic-chang
Contributor Author

Hi, sorry for taking so long to get around to this. Here's a link to a pull request with the proper flags.

After this change, an example execution that works on the Purdue Anvil cluster is

settings = Pigeons.MPISettings(;
    submission_system = :slurm,
    add_to_submission = [
        "#SBATCH -p wholenode",
    ],
    environment_modules = ["intel/19.0.5.281", "impi/2019.5.281"]
)
Pigeons.setup_mpi(settings)

pt = Pigeons.pigeons(
    target = toy_mvn_target(10),
    record = [traces, round_trip, Pigeons.timing_extrema],
    checkpoint = true,
    n_chains = 200,
    on = Pigeons.MPIProcesses(
        n_mpi_processes = 100,
        walltime = "0-01:00:00",
        n_threads = 1,
        mpiexec_args = `--mpi=pmi2`
    ),
    n_rounds = 10
)
