
Disable hdf5 chunking by default #3919

Conversation

@franzpoeschel (Contributor) commented Nov 19, 2021

We specify export OMPI_MCA_io=^ompio in some templates, and that somehow makes HDF5 segfault when chunking is enabled. Chunking is only a recent addition in the openPMD-api, so the error probably did not come up too often so far, but we should clarify this before merging this PR.
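
For context, a minimal sketch of the template line in question; excluding the ompio component makes Open MPI's MPI-IO layer fall back to ROMIO, which is the code path visible in the backtrace further down (mca_io_romio321.so):

# Line as used in the affected batch templates (quoted from this PR description);
# the "^" prefix excludes the ompio MCA component, so MPI-IO falls back to ROMIO.
export OMPI_MCA_io=^ompio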

TODO:

  • Verify that this fixes the crashes that I saw on Hemera
  • Clarify if this is the solution we want for this error or if it can be fixed in another way

@franzpoeschel (Contributor, Author) commented Nov 22, 2021

Some fixes were needed, but I could now verify that this PR solves the issue that I saw on Hemera and Summit.

If someone really needs chunking, it should still be possible by overriding this option in --openPMD.json and selecting a working MPI-IO backend. I will verify this and will probably need to add documentation for it.

@franzpoeschel (Contributor, Author) commented

I've now verified that chunking can be re-enabled by setting --openPMD.json '{"hdf5": {"dataset": {"chunks": "auto"}}}' and have added documentation for this.
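
For reference, a minimal sketch of such an invocation; everything besides the --openPMD.json option and its value is illustrative and not prescribed by this PR:

# Sketch: re-enable HDF5 chunking for a single run via the openPMD plugin's JSON option.
# The placeholder <other options> stands for the rest of the usual command line.
picongpu <other options> --openPMD.json '{"hdf5": {"dataset": {"chunks": "auto"}}}'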

@psychocoderHPC added the bug label (a bug in the project's code) on Nov 23, 2021
@franzpoeschel (Contributor, Author) commented

For the moment, this PR is ready for review.
Note that this disables HDF5 chunking by default for all workflows.
Alternatively, we could add an environment variable that disables chunking only in those templates where export OMPI_MCA_io=^ompio is specified, and cross our fingers that it causes no trouble anywhere else. The issue that I observed definitely happened inside ROMIO:

[kepler020:22879:0:22879] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:  22880) ====
 0  /trinity/shared/pkg/mpi/ucx/1.10.0/gcc/7.3.0/lib/libucs.so.0(ucs_handle_error+0xe4) [0x2aaad4474504]
 1  /trinity/shared/pkg/mpi/ucx/1.10.0/gcc/7.3.0/lib/libucs.so.0(+0x2782c) [0x2aaad447482c]
 2  /trinity/shared/pkg/mpi/ucx/1.10.0/gcc/7.3.0/lib/libucs.so.0(+0x27a94) [0x2aaad4474a94]
 3  /trinity/shared/pkg/mpi/openmpi/4.0.4-cuda112//gcc/7.3.0/lib/openmpi/mca_io_romio321.so(ADIOI_Flatten+0x5ab) [0x2aad89320d1b]
 4  /trinity/shared/pkg/mpi/openmpi/4.0.4-cuda112//gcc/7.3.0/lib/openmpi/mca_io_romio321.so(ADIOI_Flatten_datatype+0xe1) [0x2aad89322901]
 5  /trinity/shared/pkg/mpi/openmpi/4.0.4-cuda112//gcc/7.3.0/lib/openmpi/mca_io_romio321.so(ADIO_Set_view+0x20d) [0x2aad8931863d]
 6  /trinity/shared/pkg/mpi/openmpi/4.0.4-cuda112//gcc/7.3.0/lib/openmpi/mca_io_romio321.so(mca_io_romio_dist_MPI_File_set_view+0x2c6) [0x2aad892fe0d6]
 7  /trinity/shared/pkg/mpi/openmpi/4.0.4-cuda112//gcc/7.3.0/lib/openmpi/mca_io_romio321.so(mca_io_romio321_file_set_view+0xdc) [0x2aad892f74dc]
 8  /trinity/shared/pkg/mpi/openmpi/4.0.4-cuda112/gcc/7.3.0/lib/libmpi.so.40(MPI_File_set_view+0x11c) [0x2aaaaad46fac]
 9  /home/poesch58/pic_env/local/lib/libhdf5.so.200(+0x131f5c) [0x2aaaaee4bf5c]
10  /home/poesch58/pic_env/local/lib/libhdf5.so.200(H5FD_write+0xc9) [0x2aaaaee48e59]
11  /home/poesch58/pic_env/local/lib/libhdf5.so.200(H5F__accum_write+0x1c4) [0x2aaaaee260b4]
12  /home/poesch58/pic_env/local/lib/libhdf5.so.200(H5PB_write+0x7da) [0x2aaaaef2763a]
13  /home/poesch58/pic_env/local/lib/libhdf5.so.200(H5F_shared_block_write+0xa7) [0x2aaaaee31187]
14  /home/poesch58/pic_env/local/lib/libhdf5.so.200(H5D__chunk_allocate+0x1a3b) [0x2aaaaedd936b]
15  /home/poesch58/pic_env/local/lib/libhdf5.so.200(+0xccf43) [0x2aaaaede6f43]
16  /home/poesch58/pic_env/local/lib/libhdf5.so.200(H5D__alloc_storage+0x1cf) [0x2aaaaedecfef]
17  /home/poesch58/pic_env/local/lib/libhdf5.so.200(H5D__layout_oh_create+0x335) [0x2aaaaedf3785]
18  /home/poesch58/pic_env/local/lib/libhdf5.so.200(H5D__create+0x870) [0x2aaaaede87b0]
19  /home/poesch58/pic_env/local/lib/libhdf5.so.200(+0xe23d9) [0x2aaaaedfc3d9]
20  /home/poesch58/pic_env/local/lib/libhdf5.so.200(H5O_obj_create+0x9a) [0x2aaaaeedf82a]
21  /home/poesch58/pic_env/local/lib/libhdf5.so.200(+0x18d47c) [0x2aaaaeea747c]
22  /home/poesch58/pic_env/local/lib/libhdf5.so.200(+0x1627a5) [0x2aaaaee7c7a5]
23  /home/poesch58/pic_env/local/lib/libhdf5.so.200(H5G_traverse+0xd2) [0x2aaaaee7cc52]
24  /home/poesch58/pic_env/local/lib/libhdf5.so.200(+0x18928d) [0x2aaaaeea328d]
25  /home/poesch58/pic_env/local/lib/libhdf5.so.200(H5L_link_object+0x36) [0x2aaaaeea8956]
26  /home/poesch58/pic_env/local/lib/libhdf5.so.200(H5D__create_named+0x5a) [0x2aaaaede7eca]
27  /home/poesch58/pic_env/local/lib/libhdf5.so.200(H5VL__native_dataset_create+0xd8) [0x2aaaaf006828]
28  /home/poesch58/pic_env/local/lib/libhdf5.so.200(+0x2d6456) [0x2aaaaeff0456]
29  /home/poesch58/pic_env/local/lib/libhdf5.so.200(H5VL_dataset_create+0xda) [0x2aaaaeff5a8a]
30  /home/poesch58/pic_env/local/lib/libhdf5.so.200(H5Dcreate2+0x11b) [0x2aaaaedc6deb]
31  /home/poesch58/pic_env/local/lib64/libopenPMD.so(_ZN7openPMD17HDF5IOHandlerImpl13createDatasetEPNS_8WritableERKNS_9ParameterILNS_9OperationE9EEE+0x7f0) [0x2aaaac776be0]
32  /home/poesch58/pic_env/local/lib64/libopenPMD.so(_ZN7openPMD21AbstractIOHandlerImpl5flushEv+0x15b) [0x2aaaac7795ab]
33  /home/poesch58/pic_env/local/lib64/libopenPMD.so(_ZN7openPMD21ParallelHDF5IOHandler5flushEv+0x11) [0x2aaaac77ea71]
34  /home/poesch58/pic_env/local/lib64/libopenPMD.so(_ZN7openPMD15SeriesInterface14flushGorVBasedESt17_Rb_tree_iteratorISt4pairIKmNS_9IterationEEES6_+0x462) [0x2aaaac7171f2]
35  /home/poesch58/pic_env/local/lib64/libopenPMD.so(_ZN7openPMD15SeriesInterface10flush_implESt17_Rb_tree_iteratorISt4pairIKmNS_9IterationEEES6_NS_10FlushLevelEb+0x60) [0x2aaaac7173e0]
36  /home/poesch58/pic_env/local/lib64/libopenPMD.so(_ZN7openPMD15SeriesInterface5flushEv+0x34) [0x2aaaac7174c4]
37  /home/poesch58/pic_env/scratch/pic_run_fileIO_4_v6/input/bin/picongpu() [0x637e8d]
38  /home/poesch58/pic_env/scratch/pic_r==== backtrace (tid:  22878) ====

@psychocoderHPC previously approved these changes Nov 24, 2021

@psychocoderHPC (Member) commented

Clarify if this is the solution we want for this error or if it can be fixed in another way

IMO your solution is fine. To be fair, we do not really care whether HDF5 uses chunking or not. On most systems, we prefer to use ADIOS2 anyway.

@franzpoeschel (Contributor, Author) commented

Clarify if this is the solution we want for this error or if it can be fixed in another way

IMO your solution is fine. To be fair, we do not really care whether HDF5 uses chunking or not. On most systems, we prefer to use ADIOS2 anyway.

I thought so too, so I have no issue with simply disabling it and avoiding trouble down the road. There might be performance implications, but we never used chunking in the past either, and it is still available as an opt-in.

@psychocoderHPC (Member) commented

@franzpoeschel Could you please resolve the merge conflict? I will set the CI to skip recompiling everything, because the conflict is in the documentation only.

@psychocoderHPC added the CI:no-compile label (CI is skipping compile/runtime tests but runs PICMI tests) on Nov 25, 2021
@franzpoeschel (Contributor, Author) commented

@psychocoderHPC done

@psychocoderHPC merged commit 2e63df5 into ComputationalRadiationPhysics:dev on Nov 26, 2021
@ax3l (Member) commented Apr 4, 2022

To make this more reproducible, can you please confirm the following:

  • system: Hemera
  • MPI flavor: OpenMPI (any vendor patches, e.g., in OpenMPI or libfabric et al.?)
  • OpenMPI version: 4.0.4
  • HDF5 version: ?

Why did you switch, in modern OpenMPI versions, from OpenMPI's own MPI-IO implementation (OMPIO) to ROMIO? I opened a couple of issues with OpenMPI on GitHub in the past, and they do not seem to recommend doing so (see my reports and the discussion with JSC on JUWELS Booster, for instance).
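
For context, the component selection this question refers to can be steered either through the environment or per invocation; both forms are standard Open MPI MCA usage, and which I/O components are actually available depends on how Open MPI was built:

# Exclude OMPIO globally (as in the affected templates); Open MPI then falls back to ROMIO:
export OMPI_MCA_io=^ompio
# Equivalent per-invocation form (the executable name is illustrative):
mpirun --mca io ^ompio ./picongpu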

@PrometheusPi (Member) commented

@psychocoderHPC and @franzpoeschel, I think @ax3l's question is addressed to you.

@franzpoeschel (Contributor, Author) commented

I should rerun these tests anyway, because there is a chance that the crash is related to the issue fixed by openPMD/openPMD-api#1239. I will do so and post more detailed results.
