-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix for MPI non-blocking collectives (issue #17) #32
base: develop
Are you sure you want to change the base?
fix for MPI non-blocking collectives (issue #17) #32
Conversation
…he displacements data
… tests, extended to all data types
Lucian Anton seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
…he displacements data
… tests, extended to all data types
2168149
to
b5cd669
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @antonl321 thanks for your submission! It will be great to have this working.
I have not yet reviewed the actual changes to mpl_alltoallv using the linked_list, but rather the CMake and the testing.
You could already address my suggestions, and if you can contact me offline if you need clarification. Thanks! 🙏 😃
CMakeLists.txt
Outdated
if( CMAKE_Fortran_COMPILER_ID MATCHES "Cray") | ||
add_link_options("-homp") | ||
endif() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be needed or needs to be implemented using the OpenMP::OpenMP_Fortran target depending on the content of HAVE_OMP variable. I thought this was already working.
build.sh
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably you did not mean to commit this file, it should not be committed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have tow auxiliary files: build.sh and env.sh to be removed when all is clear.
env.sh
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably you did not mean to commit this file, it should not be committed.
if( HAVE_MPI AND MPIEXEC ) | ||
set( LAUNCH ${MPIEXEC} ${MPIEXEC_NUMPROC_FLAG} 1 ) | ||
set( LAUNCH ${MPIEXEC} ${MPI_OPTS} ${MPIEXEC_NUMPROC_FLAG} 1 ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ecbuild_add_test we cater for something similar but name it MPI_ARGS instead. Perhaps you can use the following snippet instead.
https://github.com/ecmwf/ecbuild/blob/develop/cmake/ecbuild_add_test.cmake#L427-L432
tests/CMakeLists.txt
Outdated
COMMAND ${CMAKE_COMMAND} | ||
"-DEXECUTABLE=$<TARGET_FILE:fiat-test-mpl-alltoallv-no-output>" | ||
"-DLAUNCH=${LAUNCH2}" | ||
-P ${CMAKE_CURRENT_SOURCE_DIR}/test_mpl_alltoallv_no_output.cmake ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should be able to just use "ecbuild_add_test" with the MPI option. There's no need to parse the output of the test to detect failure. When the test itself crashes and exits with non-zero due to a failed assertion it should be failing automatically. The "_no_output" suffix is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just used the pattern used by other tests.
…after testing with nproc > 2
…after testing with nproc > 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another round of review and some requested changes:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file should not be committed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file should not be committed
@@ -0,0 +1,233 @@ | |||
module linked_list_mod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for safety, I'd rename this to mpl_linked_list_mod
|
||
end subroutine | ||
|
||
#define FAIL(msg) call fail_impl(msg,__FILE__,__LINE__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is likely the cause for the need of -free-line-length-none
flag.
I prefer not to use the flag as it is compiler specific and reduces portability to other platforms.
Please, replace __FILE__
with "alltoallv.F90" and remove the flag.
ecbuild_add_executable( | ||
TARGET alltoallv | ||
SOURCES alltoallv.F90 | ||
LIBS fiat | ||
FFLAGS $<$<Fortran_COMPILER_ID:GNU>:-ffree-line-length-none> | ||
LINKER_LANGUAGE Fortran | ||
NOINSTALL ) | ||
|
||
ecbuild_add_test( TARGET fiat_test_alltoallv | ||
COMMAND alltoallv | ||
MPI 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified:
ecbuild_add_executable( | |
TARGET alltoallv | |
SOURCES alltoallv.F90 | |
LIBS fiat | |
FFLAGS $<$<Fortran_COMPILER_ID:GNU>:-ffree-line-length-none> | |
LINKER_LANGUAGE Fortran | |
NOINSTALL ) | |
ecbuild_add_test( TARGET fiat_test_alltoallv | |
COMMAND alltoallv | |
MPI 2) | |
ecbuild_add_test( | |
TARGET fiat_test_alltoallv | |
SOURCES alltoallv.F90 | |
LIBS fiat | |
LINKER_LANGUAGE Fortran | |
MPI 2 ) |
ecbuild_add_executable( | ||
TARGET linked_list | ||
SOURCES linked_list.F90 | ||
LIBS fiat | ||
LINKER_LANGUAGE Fortran | ||
NOINSTALL ) | ||
|
||
ecbuild_add_test( TARGET fiat_test_linked_list | ||
COMMAND linked_list) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified:
ecbuild_add_executable( | |
TARGET linked_list | |
SOURCES linked_list.F90 | |
LIBS fiat | |
LINKER_LANGUAGE Fortran | |
NOINSTALL ) | |
ecbuild_add_test( TARGET fiat_test_linked_list | |
COMMAND linked_list) | |
ecbuild_add_test( | |
TARGET fiat_test_mpl_linked_list | |
SOURCES linked_list.F90 | |
LIBS fiat | |
LINKER_LANGUAGE Fortran ) |
@@ -67,6 +67,7 @@ MODULE MPL_WAIT_MOD | |||
USE MPL_MPIF | |||
USE MPL_DATA_MODULE | |||
USE MPL_MESSAGE_MOD | |||
USE LINKED_LIST_MOD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
USE LINKED_LIST_MOD | |
USE MPL_LINKED_LIST_MOD |
@@ -72,6 +72,7 @@ MODULE MPL_ALLTOALLV_MOD | |||
USE MPL_STATS_MOD | |||
USE MPL_WAIT_MOD | |||
USE YOMMPLSTATS | |||
USE LINKED_LIST_MOD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
USE LINKED_LIST_MOD | |
USE MPL_LINKED_LIST_MOD |
This branch addresses the issue of random failures observed when using MPL collectives in non-blocking mode.
It is assumed that the cause of these failures is the missing displacements arrays references when MPL_wait is called if MPL_alltoallv or any other other collective operation uses the option to created the displacement arrays inside the MPL wrapper.
Solution description:
Implementation:
This version needs more work but I think that a checkpoint on the implementation design is in order before polishing the details.