Crash in parallel simulations when flushing particle data on JUWELS Booster #2542

Open
MaxThevenet opened this issue Nov 10, 2021 · 12 comments
Labels: bug (Something isn't working) · component: diagnostics (all types of outputs) · component: openPMD (openPMD I/O) · component: third party (Changes in WarpX that reflect a change in a third-party library) · machine / system (Machine or system-specific issue)

Comments

@MaxThevenet (Member)

A production 3D simulation with openPMD output crashes at the first particle flush when running in parallel. This input file is a reproducer showing the problem on a simplified setup. It is executed with the following submission script on the JUWELS Booster. The CMake command and output can be found here, and I used the following profile file. The crash produced the following files: error.txt and Backtrace.

Note: the same run on a V100-equipped cluster, with the following CMake output, ran successfully.

@MaxThevenet added labels: bug, component: openPMD, component: diagnostics (Nov 10, 2021)
@ax3l added labels: component: third party, machine / system (Nov 11, 2021)
@ax3l (Member) commented Nov 11, 2021

Thanks for the detailed report!

As discussed on Slack, we see this problem only on Juwels so far and the same run works on other clusters.

The backtrace indicates that the problem originates straight out of the MPI-I/O layer (ROMIO). That's a bit curious, because by default OpenMPI uses OMPIO as its I/O implementation instead of ROMIO, so something seems to be non-default on Juwels. I see that your profile, sourced in the submission script, has

# change the MPI-IO backend in OpenMPI from OMPIO to ROMIO (experimental)
#export OMPI_MCA_io=romio321

in it before the srun call. Let's comment this line out to make sure the default OMPIO implementation is used. OMPIO is pretty buggy itself, but I reported/fixed a series of HDF5 I/O related bugs in the past, and the OpenMPI 4.1.1 on Juwels should contain all those fixes: openPMD/openPMD-api#446
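
To double-check which io component would then be selected at runtime, something like this should work (assuming ompi_info from the loaded OpenMPI 4.1.1 module is on the PATH):

# list the io framework components known to this OpenMPI build
ompi_info --param io all --level 9
# shows whether an io override is still set in the environment
echo $OMPI_MCA_io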

Another thing we discussed is to ask the cluster support for the newest version of HDF5 in the 1.10 series, i.e., providing the 1.10.8 release instead of HDF5 1.10.6. Cluster support could also run a few tests, e.g., with hdf5-iotest and ior, to check whether the MPI-I/O layer and HDF5 implementation are generally in working condition.
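
To confirm which HDF5 release is actually loaded in the build environment, something along these lines should work (h5pcc is only present if a parallel HDF5 module provides the compiler wrappers):

# version of the currently loaded HDF5 module (module list prints to stderr)
module list 2>&1 | grep -i hdf5
# version string reported by the parallel HDF5 compiler wrapper, if available
h5pcc -showconfig | grep -i 'HDF5 Version'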

@ax3l (Member) commented Nov 11, 2021

If this works, then we should remove the hint to change this in our docs:
https://github.com/ECP-WarpX/WarpX/blob/development/Docs/source/install/hpc/juwels.rst

I think we used this temporarily to work around another earlier issue on Juwels.

@ax3l (Member) commented Nov 11, 2021

OMPIO still errors out as well (test from Maxence):

[jwb0065.juwels:28280] mca_sharedfp_sm_file_open: Error, unable to open file for mmap: /tmp/ompi.jwb0065.juwels.17674/jf.0/4086174241/openpmd_00010.h5_cid-5-28280.sm
[jwb0065.juwels:28279] mca_sharedfp_sm_file_open: Error, unable to open file for mmap: /tmp/ompi.jwb0065.juwels.17674/jf.0/4086174241/openpmd_00010.h5_cid-5-28280.sm
[jwb0065.juwels:28277] mca_sharedfp_sm_file_open: Error, unable to open file for mmap: /tmp/ompi.jwb0065.juwels.17674/jf.0/4086174241/openpmd_00010.h5_cid-5-28280.sm
[jwb0065.juwels:28278] mca_sharedfp_sm_file_open: Error, unable to open file for mmap: /tmp/ompi.jwb0065.juwels.17674/jf.0/4086174241/openpmd_00010.h5_cid-5-28280.sm
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 2:
  #000: H5F.c line 444 in H5Fcreate(): unable to create file
    major: File accessibilty
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 1:
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 0:
  #000: H5F.c line 444 in H5Fcreate(): unable to create file
  #000: H5F.c line 444 in H5Fcreate(): unable to create file
    major: File accessibilty
    minor: Unable to open file
    major: File accessibilty
    minor: Unable to open file
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 3:
  #000: H5F.c line 444 in H5Fcreate(): unable to create file
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 1498 in H5F_open(): unable to open file: time = Thu Nov 11 18:41:15 2021
, name = 'diags/injection/openpmd_00010.h5', tent_flags = 13
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 1498 in H5F_open(): unable to open file: time = Thu Nov 11 18:41:15 2021
, name = 'diags/injection/openpmd_00010.h5', tent_flags = 13

@ax3l (Member) commented Nov 11, 2021

Let's see if we can work around this via:

export OMPI_MCA_io=ompio
export HDF5_USE_FILE_LOCKING=FALSE

Update: same error.

@ax3l (Member) commented Nov 11, 2021

The first errors:

mca_sharedfp_sm_file_open: Error, unable to open file for mmap

point to a ulimit issue: open-mpi/ompi#4336

Update: ulimit -n returns 524288 (pretty good), and the above-linked issue is fixed in OpenMPI 4.1.1.
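
For completeness, the limit is worth checking from inside an allocation as well, since compute nodes may be configured differently from the login node:

# open-file limit as seen by a process started by the batch system
srun -N 1 -n 1 bash -c 'ulimit -n'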

@ax3l (Member) commented Nov 11, 2021

So the OMPIO failure here seems to point to a problem opening temporary files for mmap (memory-mapped file access) on Juwels:
https://github.com/open-mpi/ompi/blob/v4.1.1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c#L129-L141

This seems to be part of the OpenMPI sharedfp framework, so it likely has its own controls that we can try:
https://www.open-mpi.org/faq/?category=ompio#sharedfp-parametesrs

For more exhaustive tuning of I/O parameters, we recommend the utilization of the Open Tool for Parameter Optimization (OTPO), a tool specifically designed to explore the MCA parameter space of Open MPI.

That tool might be something for @AndiH? :)
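
Independent of OTPO, the available knobs can be listed directly (again assuming ompi_info from the loaded OpenMPI 4.1.1 module):

# dump all MCA parameters of the sharedfp framework and its components (sm, lockedfile, individual)
ompi_info --param sharedfp all --level 9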

@ax3l (Member) commented Nov 11, 2021

I asked in open-mpi/ompi#9656 about additional --mca options that we could try in order to modify or skip the sharedfp framework component.

@damianam commented Nov 11, 2021

@AndiH brought me here. I am probably the guy you want to talk to if you have problems on the JUWELS Booster and/or the MPIs there.

Bear with me, as I am not an OpenMPI expert. We explicitly disable the ompio framework via the mpi-settings environment module:

$ ml show mpi-settings | grep setenv
setenv("EBROOTOPENMPIMINSETTINGS","/p/software/juwelsbooster/stages/2020/software/OpenMPI-settings/4.1CUDA")
setenv("EBVERSIONOPENMPIMINSETTINGS","4.1")
setenv("EBDEVELOPENMPIMINSETTINGS","/p/software/juwelsbooster/stages/2020/software/OpenMPI-settings/4.1CUDA/easybuild/MPI_settings-OpenMPI-4.1-mpi-settings-CUDA-easybuild-devel")
setenv("SLURM_MPI_TYPE","pspmix")
setenv("UCX_TLS","rc_x,cuda_ipc,gdr_copy,self,sm,cuda_copy")
setenv("UCX_MEMTYPE_CACHE","n")
setenv("UCX_MAX_RNDV_RAILS","1")
setenv("OMPI_MCA_mca_base_component_show_load_errors","1")
setenv("OMPI_MCA_mpi_param_check","1")
setenv("OMPI_MCA_mpi_show_handle_leaks","1")
setenv("OMPI_MCA_mpi_warn_on_fork","1")
setenv("OMPI_MCA_btl","^uct,openib")
setenv("OMPI_MCA_btl_openib_allow_ib","1")
setenv("OMPI_MCA_bml_r2_show_unreach_errors","0")
setenv("OMPI_MCA_coll","^ml")
setenv("OMPI_MCA_coll_hcoll_enable","1")
setenv("OMPI_MCA_coll_hcoll_np","0")
setenv("OMPI_MCA_pml","ucx")
setenv("OMPI_MCA_osc","^rdma")
setenv("OMPI_MCA_opal_abort_print_stack","1")
setenv("OMPI_MCA_opal_set_max_sys_limits","1")
setenv("OMPI_MCA_opal_event_include","epoll")
setenv("OMPI_MCA_btl_openib_warn_default_gid_prefix","0")
setenv("OMPI_MCA_io","romio321")

If you actively enable ompio, you are exploring uncharted territory for us. Regardless of that, it seems like the issue pops up when using the sm component of the sharedfp framework. Did you try disabling that component?

export OMPI_MCA_sharedfp="^sm"

Alternatively, enable the other components (lockedfile or individual) exclusively.
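
To check which sharedfp component actually gets selected at runtime, you could also raise the framework's verbosity (this follows OpenMPI's generic <framework>_base_verbose naming; I have not verified it on the Booster):

# print component selection details for the sharedfp framework
export OMPI_MCA_sharedfp_base_verbose=100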

Not claiming that this is a fix, but it could be a step forward. Are you by any chance doing IO from GPU buffers? I wonder if that could play a role.

Regarding the OTPO tool: my understanding is that this is a user-space tool that everyone can use to tweak OpenMPI for their particular cases (i.e., no admin intervention is necessary to benefit from it). I would be interested in learning more about it, but, realistically, the chances that I can take a deep dive are slim.

I typically ignore GitHub notifications nowadays, since I am not actively involved in GitHub projects anymore. I will keep an eye on this one for the next couple of days, but feel free to ping me via other channels if I don't react.

@ax3l (Member) commented Nov 18, 2021

Thank you @damianam and thanks for chiming in!

@MaxThevenet can you try this?

export OMPI_MCA_sharedfp="^sm"

And @damianam, you say we should try

export OMPI_MCA_sharedfp="lockedfile"

and

export OMPI_MCA_sharedfp="individual"

as alternative strategies?

Are you by any chance doing IO from GPU buffers? I wonder if that could play a role.

We absolutely are at the moment, yes. #2097

Is your MPI GPU-aware?
If not, we could also experiment with adding --mca mpi_leave_pinned 0 to mpiexec/srun; I saw pinning race issues in I/O with PIConGPU in the past.
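
Since these jobs are launched via srun rather than mpiexec, the same parameter can also be set as an environment variable (the srun line below is only an example invocation):

# MCA parameters can be passed via OMPI_MCA_<name> environment variables
export OMPI_MCA_mpi_leave_pinned=0
srun -n 4 --cpu_bind=sockets $HOME/src/warpx/build/bin/warpx.3d.MPI.CUDA.DP.OPMD inputs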

@AndiH commented Nov 18, 2021

Our MPI is CUDA-aware, yes; you can see it in the UCX_TLS variable that @damianam grepped above.

@MaxThevenet (Member, Author) commented Nov 18, 2021

Thanks for looking into it! I tried

export OMPI_MCA_sharedfp="^sm"
srun -n 4 --cpu_bind=sockets $HOME/src/warpx/build/bin/warpx.3d.MPI.CUDA.DP.OPMD inputs &> output_sm.txt
export OMPI_MCA_sharedfp="lockedfile"
srun -n 4 --cpu_bind=sockets $HOME/src/warpx/build/bin/warpx.3d.MPI.CUDA.DP.OPMD inputs &> output_lf.txt
export OMPI_MCA_sharedfp="individual"
srun -n 4 --cpu_bind=sockets $HOME/src/warpx/build/bin/warpx.3d.MPI.CUDA.DP.OPMD inputs &> output_in.txt

But all runs failed with similar errors.

@MaxThevenet (Member, Author)

If it helps, I can provide instructions to install the code etc. to make a simple and quick reproducer (although the main files are already in the issue description). For now, I found a workaround so users can keep going: use ADIOS2 output rather than HDF5. I installed ADIOS2 from source in my $HOME, and this seems to be working well.
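
In case it helps others, the change only involves the openPMD backend selection in the WarpX inputs file (shown below for a hypothetical diagnostic named diag1) and pointing CMake to the self-built ADIOS2 (the install path is only an example):

# inputs file: write openPMD data through ADIOS2 (.bp files) instead of HDF5 (.h5)
diag1.format = openpmd
diag1.openpmd_backend = bp

# before running cmake: let it find the ADIOS2 installed under $HOME
export CMAKE_PREFIX_PATH=$HOME/sw/adios2:$CMAKE_PREFIX_PATH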
