Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix for MPI non-blocking collectives (issue #17) #32

Open
wants to merge 30 commits into
base: develop
Choose a base branch
from

Conversation

antonl321
Copy link

This branch addresses the issue of random failures observed when using MPL collectives in non-blocking mode.

It is assumed that the cause of these failures is the missing displacements arrays references when MPL_wait is called if MPL_alltoallv or any other other collective operation uses the option to created the displacement arrays inside the MPL wrapper.

Solution description:

  • when mpl_alltoallv is used in non-blocking the displacement arrays are stored in a linked list together with the MPI_request value and number of mpi ranks.
  • when MPL_wait is called the linked list is inspected, if a node has the MPI_request value equal to the argument value then that node is removed from he linked list.

Implementation:

  • A new fortran module was created to store the linked list node and the operation routines.
  • a node contains the MPI request, nproc and two allocatable arrays for the send and receive arrays.
  • the link list is private, it is operated with the list_manager class that append a new node at the beginning of communication and deletes it at end of communication.
  • the list manager allocates the displacement arrays and returns pointer to them.
  • the MPL_preamble routine was modified to use pointers that point to various displacement array: local, arguments or to the linked list according to the input pattern.

This version needs more work but I think that a checkpoint on the implementation design is in order before polishing the details.

@FussyDuck
Copy link

FussyDuck commented Jan 8, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ antonl321
❌ Lucian Anton


Lucian Anton seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@wdeconinck wdeconinck changed the base branch from main to develop January 9, 2025 09:08
@wdeconinck wdeconinck force-pushed the bug/syla_develop_ialltoallv.issue17 branch from 2168149 to b5cd669 Compare January 9, 2025 09:21
Copy link
Collaborator

@wdeconinck wdeconinck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @antonl321 thanks for your submission! It will be great to have this working.
I have not yet reviewed the actual changes to mpl_alltoallv using the linked_list, but rather the CMake and the testing.
You could already address my suggestions, and if you can contact me offline if you need clarification. Thanks! 🙏 😃

CMakeLists.txt Outdated
Comment on lines 28 to 30
if( CMAKE_Fortran_COMPILER_ID MATCHES "Cray")
add_link_options("-homp")
endif()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be needed or needs to be implemented using the OpenMP::OpenMP_Fortran target depending on the content of HAVE_OMP variable. I thought this was already working.

build.sh Outdated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably you did not mean to commit this file, it should not be committed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have tow auxiliary files: build.sh and env.sh to be removed when all is clear.

env.sh Outdated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably you did not mean to commit this file, it should not be committed.

if( HAVE_MPI AND MPIEXEC )
set( LAUNCH ${MPIEXEC} ${MPIEXEC_NUMPROC_FLAG} 1 )
set( LAUNCH ${MPIEXEC} ${MPI_OPTS} ${MPIEXEC_NUMPROC_FLAG} 1 )
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ecbuild_add_test we cater for something similar but name it MPI_ARGS instead. Perhaps you can use the following snippet instead.
https://github.com/ecmwf/ecbuild/blob/develop/cmake/ecbuild_add_test.cmake#L427-L432

COMMAND ${CMAKE_COMMAND}
"-DEXECUTABLE=$<TARGET_FILE:fiat-test-mpl-alltoallv-no-output>"
"-DLAUNCH=${LAUNCH2}"
-P ${CMAKE_CURRENT_SOURCE_DIR}/test_mpl_alltoallv_no_output.cmake )
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should be able to just use "ecbuild_add_test" with the MPI option. There's no need to parse the output of the test to detect failure. When the test itself crashes and exits with non-zero due to a failed assertion it should be failing automatically. The "_no_output" suffix is not needed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just used the pattern used by other tests.

Copy link
Collaborator

@wdeconinck wdeconinck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another round of review and some requested changes:

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should not be committed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should not be committed

@@ -0,0 +1,233 @@
module linked_list_mod
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for safety, I'd rename this to mpl_linked_list_mod


end subroutine

#define FAIL(msg) call fail_impl(msg,__FILE__,__LINE__)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is likely the cause for the need of -free-line-length-none flag.
I prefer not to use the flag as it is compiler specific and reduces portability to other platforms.
Please, replace __FILE__ with "alltoallv.F90" and remove the flag.

Comment on lines +162 to +172
ecbuild_add_executable(
TARGET alltoallv
SOURCES alltoallv.F90
LIBS fiat
FFLAGS $<$<Fortran_COMPILER_ID:GNU>:-ffree-line-length-none>
LINKER_LANGUAGE Fortran
NOINSTALL )

ecbuild_add_test( TARGET fiat_test_alltoallv
COMMAND alltoallv
MPI 2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified:

Suggested change
ecbuild_add_executable(
TARGET alltoallv
SOURCES alltoallv.F90
LIBS fiat
FFLAGS $<$<Fortran_COMPILER_ID:GNU>:-ffree-line-length-none>
LINKER_LANGUAGE Fortran
NOINSTALL )
ecbuild_add_test( TARGET fiat_test_alltoallv
COMMAND alltoallv
MPI 2)
ecbuild_add_test(
TARGET fiat_test_alltoallv
SOURCES alltoallv.F90
LIBS fiat
LINKER_LANGUAGE Fortran
MPI 2 )

Comment on lines +178 to +187
ecbuild_add_executable(
TARGET linked_list
SOURCES linked_list.F90
LIBS fiat
LINKER_LANGUAGE Fortran
NOINSTALL )

ecbuild_add_test( TARGET fiat_test_linked_list
COMMAND linked_list)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified:

Suggested change
ecbuild_add_executable(
TARGET linked_list
SOURCES linked_list.F90
LIBS fiat
LINKER_LANGUAGE Fortran
NOINSTALL )
ecbuild_add_test( TARGET fiat_test_linked_list
COMMAND linked_list)
ecbuild_add_test(
TARGET fiat_test_mpl_linked_list
SOURCES linked_list.F90
LIBS fiat
LINKER_LANGUAGE Fortran )

@@ -67,6 +67,7 @@ MODULE MPL_WAIT_MOD
USE MPL_MPIF
USE MPL_DATA_MODULE
USE MPL_MESSAGE_MOD
USE LINKED_LIST_MOD
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
USE LINKED_LIST_MOD
USE MPL_LINKED_LIST_MOD

@@ -72,6 +72,7 @@ MODULE MPL_ALLTOALLV_MOD
USE MPL_STATS_MOD
USE MPL_WAIT_MOD
USE YOMMPLSTATS
USE LINKED_LIST_MOD
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
USE LINKED_LIST_MOD
USE MPL_LINKED_LIST_MOD

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants