OMPIO: Options to disable/change sharedfp MCA? #9656
Comments
@ax3l thank you for reporting the issue. A couple of quick thoughts: OMPIO is actually able to execute without shared file pointers as long as that functionality is not used by the application. What happens here, however, is that a sharedfp component reports 'I can be used', but its file open operation then fails for whatever reason. Would you have a chance to check why the file in the /tmp directory could not be created? Since the sm component is chosen, I would expect that this was a single-node job; is that correct? If we cannot fix the file open problem in /tmp, one option to work around it would be to disqualify all sharedfp components (at least to confirm that this resolves the issue), e.g.:
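A sketch of that workaround, assuming the standard MCA component-exclusion syntax; the exact list of sharedfp components in a given build is an assumption and can be checked with ompi_info, and ./my_app stands in for the actual application:

```shell
# Exclude only the shared-memory (sm) sharedfp component:
mpirun --mca sharedfp ^sm ./my_app

# Or exclude all sharedfp components to take the framework out of the picture
# (assumed component list for this build: individual, lockedfile, sm):
mpirun --mca sharedfp ^individual,lockedfile,sm ./my_app

# Equivalent environment-variable form, e.g. inside a batch script:
export OMPI_MCA_sharedfp=^sm
```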
Thank you so much for the feedback @edgargabriel!
@damianam can you comment on this? :)
The error report we got in WarpX by @MaxThevenet was definitely a single-node job, yes:
Thanks, we will try those next! Another piece of information that might help here: the problem seems to appear when we pass GPU unified memory addresses (pointing to device global memory) to MPI-I/O. This usually works fine; I am just raising it here to give a complete picture [ref]. On the system side, @damianam has now posted the full Open MPI settings used on JUWELS Booster.
I don't know why the file could not be created, but I have a few pointers in this regard. The compute nodes are diskless; everything, including /tmp, lives in memory. If by some chance you run too close to the memory limit and then try to allocate a large file in /tmp, that allocation can fail. As a next step I would suggest disabling the sharedfp sm component and seeing whether the problem goes away.
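To test this hypothesis, a small sketch of checks that could be run from an interactive job on an affected compute node (standard Linux tools only; the probe file name is made up):

```shell
# /tmp on a diskless node is backed by memory, so check both sides:
df -h /tmp    # free space in the in-memory file system
free -h       # overall memory pressure on the node

# Can a small file actually be created where sharedfp/sm wants to put it?
touch /tmp/sharedfp_probe && rm /tmp/sharedfp_probe \
  && echo "/tmp is writable" \
  || echo "creating files in /tmp fails"
```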
@damianam thank you for the explanation. The file that the sharedfp sm component creates is very small, just an offset value (i.e. 8 bytes, although it will probably occupy an entire block at the file-system level). I was thinking about adding a check to the component query function that tries to open a file itself; if that does not work for whatever reason, the component would disqualify itself.
That sounds like a sensible thing to do. Besides that, are we positive that the path to the file exists?
I want to mention that I have encountered the exact same issue with a different code (MGLET) running on JURECA-DC, also at Jülich (JSC). The MPI is Open MPI 4.1.1. In my case I am able to mitigate the issue by changing an MCA setting. In summary, with MGLET I have two I/O-related errors with Open MPI: the present one, which occurs with 'ompio' on a single node, and issue #7795, which occurs when using 'romio' on several nodes.
…rations try to actually open the sharedfp/sm file during the query operation to ensure that the component can actually run. This is based on some reports on the mailing list that the sharedfp/sm operation causes problems in certain circumstances.
Fixes issue open-mpi#9656
Signed-off-by: Edgar Gabriel <[email protected]>
…rations try to actually open the sharedfp/sm file during the query operation to ensure that the component can actually run. This is based on some reports on the mailing list that the sharedfp/sm operation causes problems in certain circumstances.
Fixes issue open-mpi#9656
Signed-off-by: Edgar Gabriel <[email protected]>
(cherry picked from commit d8464d2)
Background information
What version of Open MPI are you using?
4.1.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Via the Jülich (JSC) sysadmins (likely using EasyBuild).
Please describe the system on which you are running
Details of the problem
We are writing parallel HDF5 files on the Jülich (JSC) JUWELS Booster cluster.
With the OMPIO backend, we observe write issues of the form:
The only Google result for the error message points to #4336, but that issue should already be fixed in the version we use (Open MPI 4.1.1).
Following the source,
https://github.com/open-mpi/ompi/blob/v4.1.1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c#L129-L141
the problem seems to be related to the sharedfp framework, and I read in the FAQ that it has a couple of requirements on the file system capabilities, which might not be fulfilled on JUWELS Booster (I am speculating as a system user here). I was wondering if I could set options in sharedfp to modify those requirements or skip sharedfp altogether, but I could not find any. We already tried export HDF5_USE_FILE_LOCKING=FALSE on the HDF5 side, but that had no effect. Can you recommend any mca options we could try? We also tried using the ROMIO backend instead of OMPIO, but that has problems of its own on that system.
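For reference, a hedged sketch of how such options could be explored, in addition to excluding sharedfp components as suggested in the comments above (component names are assumptions for the Open MPI 4.1.x series and should be verified with ompi_info on the actual installation; ./write_hdf5 stands in for the real application):

```shell
# List the sharedfp components present in this build and their MCA parameters:
ompi_info --param sharedfp all --level 9

# Switch the MPI-I/O layer from OMPIO to ROMIO for one run
# (the ROMIO component is assumed to be named romio321 in the 4.1.x series):
mpirun --mca io romio321 ./write_hdf5
```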
Attn & X-Ref
X-ref: ECP-WarpX/WarpX#2542 (comment)
cc @MaxThevenet @AndiH
help maybe @ggouaillardet ? :)