Crash in parallel simulations when flushing particle data on JUWELS Booster #2542

Open
MaxThevenet opened this issue Nov 10, 2021 · 12 comments
Labels: bug (Something isn't working) · component: diagnostics (all types of outputs) · component: openPMD (openPMD I/O) · component: third party (Changes in WarpX that reflect a change in a third-party library) · machine / system (Machine or system-specific issue)

Comments

@MaxThevenet (Member)

A production 3D simulation with openPMD output crashes at the first particle flush when running in parallel. This input file is a reproducer showing the problem on a simplified setup. It is executed with the following submission script on the JUWELS Booster. The CMake command and output can be found here, and I used the following profile file. The crash produced the following files: error.txt and Backtrace.

Note: the same run on a V100-equipped cluster, with the following CMake output, ran successfully.

@MaxThevenet added labels: bug, component: openPMD, component: diagnostics (Nov 10, 2021)
@ax3l added labels: component: third party, machine / system (Nov 11, 2021)
@ax3l (Member) commented Nov 11, 2021

Thanks for the detailed report!

As discussed on Slack, we see this problem only on Juwels so far and the same run works on other clusters.

The backtrace indicates that the problem originates straight out of the MPI-I/O layer (ROMIO). That's a bit curious, because by default OpenMPI uses OMPIO as its I/O implementation instead of ROMIO, so something seems to be non-default on Juwels. I see that your profile, sourced in the submission script, has

# change the MPI-IO backend in OpenMPI from OMPIO to ROMIO (experimental)
#export OMPI_MCA_io=romio321

in it before the srun call. Let's comment this line out to make sure the default OMPIO implementation is used. OMPIO is pretty buggy itself, but I reported/fixed a series of HDF5 I/O related bugs in the past, and the OpenMPI 4.1.1 on Juwels should contain all those fixes: openPMD/openPMD-api#446
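
To double-check which io component would then be selected at runtime, something like this should work (assuming ompi_info from the loaded OpenMPI 4.1.1 module is on the PATH):

# list the io framework components known to this OpenMPI build
ompi_info --param io all --level 9
# shows whether an io override is still set in the environment
echo $OMPI_MCA_io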

Another thing we discussed is to ask the cluster support for the newest version of HDF5 in the 1.10 series, i.e., providing the 1.10.8 release instead of HDF5 1.10.6. Cluster support could also run a few tests, e.g., with hdf5-iotest and ior, to check whether the MPI-I/O layer and HDF5 implementation are generally in working condition.
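
To confirm which HDF5 release is actually loaded in the build environment, something along these lines should work (h5pcc is only present if a parallel HDF5 module provides the compiler wrappers):

# version of the currently loaded HDF5 module (module list prints to stderr)
module list 2>&1 | grep -i hdf5
# version string reported by the parallel HDF5 compiler wrapper, if available
h5pcc -showconfig | grep -i 'HDF5 Version'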

@ax3l (Member) commented Nov 11, 2021

If this works, then we should remove the hint to change this in our docs:
https://github.com/ECP-WarpX/WarpX/blob/development/Docs/source/install/hpc/juwels.rst

I think we used this temporarily to work around another earlier issue on Juwels.

@ax3l (Member) commented Nov 11, 2021

OMPIO still errors out as well (test from Maxence):

[jwb0065.juwels:28280] mca_sharedfp_sm_file_open: Error, unable to open file for mmap: /tmp/ompi.jwb0065.juwels.17674/jf.0/4086174241/openpmd_00010.h5_cid-5-28280.sm
[jwb0065.juwels:28279] mca_sharedfp_sm_file_open: Error, unable to open file for mmap: /tmp/ompi.jwb0065.juwels.17674/jf.0/4086174241/openpmd_00010.h5_cid-5-28280.sm
[jwb0065.juwels:28277] mca_sharedfp_sm_file_open: Error, unable to open file for mmap: /tmp/ompi.jwb0065.juwels.17674/jf.0/4086174241/openpmd_00010.h5_cid-5-28280.sm
[jwb0065.juwels:28278] mca_sharedfp_sm_file_open: Error, unable to open file for mmap: /tmp/ompi.jwb0065.juwels.17674/jf.0/4086174241/openpmd_00010.h5_cid-5-28280.sm
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 2:
  #000: H5F.c line 444 in H5Fcreate(): unable to create file
    major: File accessibilty
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 1:
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 0:
  #000: H5F.c line 444 in H5Fcreate(): unable to create file
  #000: H5F.c line 444 in H5Fcreate(): unable to create file
    major: File accessibilty
    minor: Unable to open file
    major: File accessibilty
    minor: Unable to open file
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 3:
  #000: H5F.c line 444 in H5Fcreate(): unable to create file
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 1498 in H5F_open(): unable to open file: time = Thu Nov 11 18:41:15 2021
, name = 'diags/injection/openpmd_00010.h5', tent_flags = 13
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 1498 in H5F_open(): unable to open file: time = Thu Nov 11 18:41:15 2021
, name = 'diags/injection/openpmd_00010.h5', tent_flags = 13

@ax3l (Member) commented Nov 11, 2021

Let's see if we can work around this via:

export OMPI_MCA_io=ompio
export HDF5_USE_FILE_LOCKING=FALSE

Update: same error.

@ax3l (Member) commented Nov 11, 2021

The first errors:

mca_sharedfp_sm_file_open: Error, unable to open file for mmap

point to a ulimit issue: open-mpi/ompi#4336

Update: ulimit -n returns 524288 (pretty good), and the above-linked issue is fixed in OpenMPI 4.1.1.
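
For completeness, the limit is worth checking from inside an allocation as well, since compute nodes may be configured differently from the login node:

# open-file limit as seen by a process started by the batch system
srun -N 1 -n 1 bash -c 'ulimit -n'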

@ax3l (Member) commented Nov 11, 2021

So the OMPIO failure here seems to point to a problem opening temporary files for mmap (memory-mapped file access) on Juwels:
https://github.com/open-mpi/ompi/blob/v4.1.1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c#L129-L141

This seems to be part of the OpenMPI sharedfp framework, so it likely has its own controls that we can try:
https://www.open-mpi.org/faq/?category=ompio#sharedfp-parametesrs

For more exhaustive tuning of I/O parameters, we recommend the utilization of the Open Tool for Parameter Optimization (OTPO), a tool specifically designed to explore the MCA parameter space of Open MPI.

That tool might be something for @AndiH? :)
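
Independent of OTPO, the available knobs can be listed directly (again assuming ompi_info from the loaded OpenMPI 4.1.1 module):

# dump all MCA parameters of the sharedfp framework and its components (sm, lockedfile, individual)
ompi_info --param sharedfp all --level 9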

@ax3l (Member) commented Nov 11, 2021

I asked in open-mpi/ompi#9656 about additional --mca options that we could try in order to modify or skip the sharedfp framework component.

@damianam commented Nov 11, 2021

@AndiH brought me here. I am probably the guy you want to talk to if you have problems on the JUWELS Booster and/or the MPIs there.

Bear with me, as I am not an OpenMPI expert. We explicitly disable the ompio framework via the mpi-settings environment module:

$ ml show mpi-settings | grep setenv
setenv("EBROOTOPENMPIMINSETTINGS","/p/software/juwelsbooster/stages/2020/software/OpenMPI-settings/4.1CUDA")
setenv("EBVERSIONOPENMPIMINSETTINGS","4.1")
setenv("EBDEVELOPENMPIMINSETTINGS","/p/software/juwelsbooster/stages/2020/software/OpenMPI-settings/4.1CUDA/easybuild/MPI_settings-OpenMPI-4.1-mpi-settings-CUDA-easybuild-devel")
setenv("SLURM_MPI_TYPE","pspmix")
setenv("UCX_TLS","rc_x,cuda_ipc,gdr_copy,self,sm,cuda_copy")
setenv("UCX_MEMTYPE_CACHE","n")
setenv("UCX_MAX_RNDV_RAILS","1")
setenv("OMPI_MCA_mca_base_component_show_load_errors","1")
setenv("OMPI_MCA_mpi_param_check","1")
setenv("OMPI_MCA_mpi_show_handle_leaks","1")
setenv("OMPI_MCA_mpi_warn_on_fork","1")
setenv("OMPI_MCA_btl","^uct,openib")
setenv("OMPI_MCA_btl_openib_allow_ib","1")
setenv("OMPI_MCA_bml_r2_show_unreach_errors","0")
setenv("OMPI_MCA_coll","^ml")
setenv("OMPI_MCA_coll_hcoll_enable","1")
setenv("OMPI_MCA_coll_hcoll_np","0")
setenv("OMPI_MCA_pml","ucx")
setenv("OMPI_MCA_osc","^rdma")
setenv("OMPI_MCA_opal_abort_print_stack","1")
setenv("OMPI_MCA_opal_set_max_sys_limits","1")
setenv("OMPI_MCA_opal_event_include","epoll")
setenv("OMPI_MCA_btl_openib_warn_default_gid_prefix","0")
setenv("OMPI_MCA_io","romio321")

If you actively enable ompio, you are exploring uncharted territory for us. Regardless of that, it seems like the issue pops up when using the sm component of the sharedfp framework. Did you try disabling that component?

export OMPI_MCA_sharedfp="^sm"

Alternatively, enable the other components (lockedfile or individual) exclusively.
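
To check which sharedfp component actually gets selected at runtime, you could also raise the framework's verbosity (this follows OpenMPI's generic <framework>_base_verbose naming; I have not verified it on the Booster):

# print component selection details for the sharedfp framework
export OMPI_MCA_sharedfp_base_verbose=100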

Not claiming that this is a fix, but it could be a step forward. Are you by any chance doing IO from GPU buffers? I wonder if that could play a role.

Regarding the OTPO tool: my understanding is that this is a user-space tool that everyone can use to tweak OpenMPI for their particular cases (i.e., no admin intervention is necessary to benefit from it). I would be interested in learning more about it, but, realistically, the chances that I can take a deep dive are slim.

I typically ignore GitHub notifications nowadays, since I am not actively involved in GitHub projects anymore. I will keep an eye on this one for the next couple of days, but feel free to ping me via other channels if I don't react.

@ax3l (Member) commented Nov 18, 2021

Thank you @damianam and thanks for chiming in!

@MaxThevenet can you try this?

export OMPI_MCA_sharedfp="^sm"

And @damianam, you say we should try

export OMPI_MCA_sharedfp="lockedfile"

and

export OMPI_MCA_sharedfp="individual"

as alternative strategies?

Are you by any chance doing IO from GPU buffers? I wonder if that could play a role.

We absolutely are at the moment, yes. #2097

Is your MPI GPU-aware?
If not, we could also experiment with adding --mca mpi_leave_pinned 0 to mpiexec/srun; I saw pinning race issues in I/O with PIConGPU in the past.
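
Since these jobs are launched via srun rather than mpiexec, the same parameter can also be set as an environment variable (the srun line below is only an example invocation):

# MCA parameters can be passed via OMPI_MCA_<name> environment variables
export OMPI_MCA_mpi_leave_pinned=0
srun -n 4 --cpu_bind=sockets $HOME/src/warpx/build/bin/warpx.3d.MPI.CUDA.DP.OPMD inputs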

@AndiH commented Nov 18, 2021

Our MPI is CUDA-aware, yes; you can see it in the UCX_TLS variable that @damianam grepped above.

@MaxThevenet (Member, Author) commented Nov 18, 2021

Thanks for looking into it! I tried

export OMPI_MCA_sharedfp="^sm"
srun -n 4 --cpu_bind=sockets $HOME/src/warpx/build/bin/warpx.3d.MPI.CUDA.DP.OPMD inputs &> output_sm.txt
export OMPI_MCA_sharedfp="lockedfile"
srun -n 4 --cpu_bind=sockets $HOME/src/warpx/build/bin/warpx.3d.MPI.CUDA.DP.OPMD inputs &> output_lf.txt
export OMPI_MCA_sharedfp="individual"
srun -n 4 --cpu_bind=sockets $HOME/src/warpx/build/bin/warpx.3d.MPI.CUDA.DP.OPMD inputs &> output_in.txt

But all runs failed with similar errors.

@MaxThevenet (Member, Author)

If it helps, I can provide instructions to install the code etc. to make a simple and quick reproducer (although the main files are already in the issue description). For now, I found a workaround so users can keep going: use ADIOS2 output rather than HDF5. I installed ADIOS2 from source in my $HOME, and this seems to be working well.
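
In case it helps others, the change only involves the openPMD backend selection in the WarpX inputs file (shown below for a hypothetical diagnostic named diag1) and pointing CMake to the self-built ADIOS2 (the install path is only an example):

# inputs file: write openPMD data through ADIOS2 (.bp files) instead of HDF5 (.h5)
diag1.format = openpmd
diag1.openpmd_backend = bp

# before running cmake: let it find the ADIOS2 installed under $HOME
export CMAKE_PREFIX_PATH=$HOME/sw/adios2:$CMAKE_PREFIX_PATH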
