Crash in parallel simulations when flushing particle data on JUWELS Booster #2542
Comments
Thanks for the detailed report! As discussed on Slack, we see this problem only on Juwels so far, and the same run works on other clusters. The backtrace indicates that the problem originates straight out of the MPI-I/O layer (ROMIO). That's a bit curious, because by default OpenMPI uses OMPIO as its I/O implementation instead of ROMIO, so something seems to be non-default on Juwels. I see that your profile, sourced in the submission script, contains the commented-out lines `# change the MPI-IO backend in OpenMPI from OMPIO to ROMIO (experimental)` and `#export OMPI_MCA_io=romio321`.
Another thing we discussed is to ask the cluster support for the newest version of HDF5 in the 1.10 series, i.e., HDF5 1.10.8 instead of 1.10.6. Cluster support could also run a few tests.
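For reference, here is a quick way to see which MPI-I/O backends the installed OpenMPI actually provides and to pin one explicitly. This is only a sketch; the grep pattern matches typical `ompi_info` output, and how the profile is sourced on JUWELS is an assumption.

```bash
# List the MPI-I/O components compiled into this OpenMPI (typically ompio and romio321).
ompi_info | grep -i "MCA io"

# Pin the backend for a job, mirroring the commented-out profile lines above:
export OMPI_MCA_io=romio321   # ROMIO backend (the "experimental" hint from the profile)
# export OMPI_MCA_io=ompio    # OpenMPI's default OMPIO backend
```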
If this works, then we should remove the hint to change this from our docs; I think we used it temporarily to work around an earlier issue on Juwels.
OMPIO still errors with the test from Maxence.
Let's see if we can work around this via:
export OMPI_MCA_io=ompio
export HDF5_USE_FILE_LOCKING=FALSE
Update: same error.
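A small sketch of how one could double-check that these variables actually reach the compute-node ranks (the task count is arbitrary; the variable names are the ones from the comment above):

```bash
export OMPI_MCA_io=ompio
export HDF5_USE_FILE_LOCKING=FALSE

# Print both variables from every rank; login-shell settings do not always
# propagate to batch/compute environments as expected.
srun --ntasks=4 bash -c 'echo "rank $SLURM_PROCID: io=$OMPI_MCA_io locking=$HDF5_USE_FILE_LOCKING"'
```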
The first errors point to a …
Update: …
So the OMPIO problem here seems to point to a problem opening temporary files on Juwels. This seems to be part of the OpenMPI OTPO tool.
That tool might be something for @AndiH? :)
I asked about additional …
@AndiH brought me here. I am probably the guy you want to talk to if you have problems on the JUWELS Booster and/or the MPIs there. Bear with me, I am not an OpenMPI expert. We explicitly disable the …
If you actively enable it, …
Alternatively, enable the other components (`lockedfile`, `individual`). Not claiming that is a fix, but it could be a step forward.
Are you by any chance doing I/O from GPU buffers? I wonder if that could play a role.
Regarding the OTPO tool: my understanding is that this is a user-space tool that everyone can use to tweak OpenMPI for their particular cases (i.e., no admin intervention necessary to benefit from it). I would be interested in learning more about it, but realistically the chances that I can take a deep dive are slim.
I typically ignore GitHub notifications nowadays, since I am not actively involved in GitHub projects anymore. I will keep an eye on this one for the next couple of days, but feel free to ping me via other channels if I don't react.
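One way to inspect which sharedfp components the installed OpenMPI offers and how they are configured (a sketch; the exact output format varies between OpenMPI versions):

```bash
# Coarse overview: which sharedfp components exist in this build.
ompi_info | grep -i "MCA sharedfp"

# Detailed parameters of the sharedfp framework (sm, lockedfile, individual, ...).
ompi_info --param sharedfp all --level 9
```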
Thank you @damianam, and thanks for chiming in! @MaxThevenet, can you try this: `export OMPI_MCA_sharedfp="^sm"`? And @damianam, you say we should try `export OMPI_MCA_sharedfp="lockedfile"` and `export OMPI_MCA_sharedfp="individual"` as alternative strategies?
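Spelled out, the three variants would be set like this in the job environment (only one active per run; the comments reflect my reading of what each selects):

```bash
export OMPI_MCA_sharedfp="^sm"          # exclude the shared-memory component ("^" = everything but)
# export OMPI_MCA_sharedfp="lockedfile" # force the lockedfile component (relies on file locking)
# export OMPI_MCA_sharedfp="individual" # force the individual component
```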
We absolutely are doing I/O from GPU buffers at the moment, yes (#2097). Is your MPI GPU-aware?
Our MPI is CUDA-aware, yes; you can see it in the …
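If useful, a quick runtime check for CUDA-awareness of the OpenMPI build (a sketch using the standard OpenMPI build-info key):

```bash
# Prints "mpi_built_with_cuda_support:value:true" for a CUDA-aware build.
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
```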
Thanks for looking into it! I tried the suggested settings, but all runs failed with similar errors.
If it helps, I can provide instructions to install the code etc. to make a simple and quick reproducer (although the main files are already in the issue description). For now, I found a workaround so users can keep going: use ADIOS2 output rather than HDF5. I installed ADIOS2 from source in my $HOME, and this seems to be working well.
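For anyone else needing the workaround, switching the openPMD diagnostics from the HDF5 backend to ADIOS2 looks roughly like this in the inputs file. This is a sketch: `diag1` and the `inputs_reproducer` file name are placeholders, and the parameter names follow current WarpX openPMD diagnostics conventions, which may differ between versions.

```bash
# Append (or edit) the diagnostics options in the reproducer inputs file.
cat >> inputs_reproducer <<'EOF'
diag1.format = openpmd
diag1.openpmd_backend = bp   # ADIOS2 (.bp) output instead of the crashing HDF5 path
EOF
```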
A production 3D simulation with openPMD output crashes at the first particle flush when running in parallel. This input file is a reproducer showing the problem in a simplified setup. It is executed with the following submission script on the JUWELS Booster. The CMake command and output can be found here, and I used the following profile file. The crash produced the following files: error.txt and Backtrace.
Note: the same run on a V100-equipped cluster (with the following CMake output) ran successfully.
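For completeness, a minimal sketch of the kind of JUWELS Booster submission script used here; the account, resource numbers, profile path, and binary and inputs names are all assumptions, not the actual attached script.

```bash
#!/bin/bash
#SBATCH --account=<budget>          # placeholder project/budget
#SBATCH --partition=booster
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --time=00:30:00

source "$HOME/warpx.profile"        # hypothetical profile that loads modules and (optional) MCA settings
srun ./warpx.3d inputs_reproducer   # hypothetical executable and inputs file names
```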