part_persist: fix start_all() #12972
base: main
Conversation
I understand you mostly moved code around, but as I read through it I noticed several issues with the code that could be revised by the author.
/**
 * This is a helper function that frees a request. This requires ompi_part_persist.lock be held before calling.
 */
__opal_attribute_always_inline__ static inline int
What's the point of having this function as __always_inline__ when its single use is protected by a mutex and the function itself has many calls to release objects and memory?
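For illustration only, a minimal self-contained sketch of the plainer alternative the comment suggests (the function name free_request_locked and its body are placeholders, not the actual Open MPI helper): an ordinary static function is enough here, and the compiler can make its own inlining decision.

#include <stdlib.h>

/* Placeholder helper, not the real Open MPI code: ordinary `static`, no
 * forced-inline attribute. The caller is expected to hold the module lock. */
static int free_request_locked(void *request)
{
    free(request);   /* stands in for the many release calls in the real helper */
    return 0;
}

int main(void)
{
    void *req = malloc(16);
    return free_request_locked(req);
}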
} else {
    request->req_ompi.req_status.MPI_SOURCE = request->req_comm->c_my_rank;
}
request->req_ompi.req_complete_cb = NULL;
This is not the correct behavior for requests. Setting req_complete_cb to NULL here will prevent ompi_request_complete from properly calling the callback associated with the request. This means that some of the collective components that use this functionality could not use partitioned communications (I'm not saying they do, just that this breakage of consistency will be a pain to debug in the future).
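To make the consistency concern concrete, here is a simplified, self-contained sketch of the completion-callback pattern being described (the request_t type and all function names are made up for illustration and are not the Open MPI definitions): clearing the callback field before completion leaves the completion routine with nothing to invoke.

#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for the request/callback pattern; these are
 * illustrative types, not the Open MPI structures. */
typedef struct request request_t;
typedef int (*complete_cb_t)(request_t *req);

struct request {
    complete_cb_t complete_cb;   /* registered by e.g. a collective component */
    int           complete;
};

/* The completion routine is expected to fire the registered callback. */
static void request_complete(request_t *req)
{
    req->complete = 1;
    if (NULL != req->complete_cb) {
        req->complete_cb(req);   /* never runs if the field was cleared earlier */
    }
}

static int my_coll_cb(request_t *req)
{
    (void)req;
    printf("collective notified of completion\n");
    return 0;
}

int main(void)
{
    request_t req = { .complete_cb = my_coll_cb, .complete = 0 };

    /* Pattern from the diff: clearing the callback before completion
     * silently drops the notification. */
    req.complete_cb = NULL;
    request_complete(&req);      /* prints nothing */

    /* Leaving the callback in place lets the completion path deliver it. */
    req.complete_cb = my_coll_cb;
    req.complete    = 0;
    request_complete(&req);      /* prints the message */
    return 0;
}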
size_t i;

/* prevent re-entry, */
int block_entry = opal_atomic_add_fetch_32(&(ompi_part_persist.block_entry), 1);
Two atomic operations for each failed access to progress, and a lock for the successful case? Why not use a try_lock?
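A self-contained sketch of the two re-entry guards being compared, written with plain C11 atomics and pthreads rather than the OPAL wrappers (function names are illustrative): the counter variant pays an atomic increment plus an atomic decrement on every losing call, while a trylock pays a single operation either way.

#include <pthread.h>
#include <stdatomic.h>

static atomic_int      block_entry   = 0;
static pthread_mutex_t progress_lock = PTHREAD_MUTEX_INITIALIZER;

/* Counter-based guard, as in the diff above: the contended path performs
 * an increment and then a decrement just to find out it lost. */
int progress_with_counter(void)
{
    if (atomic_fetch_add(&block_entry, 1) + 1 > 1) {
        atomic_fetch_sub(&block_entry, 1);
        return 0;
    }
    /* ... actual progress work ... */
    atomic_fetch_sub(&block_entry, 1);
    return 0;
}

/* Trylock-based guard the reviewer suggests: one operation either way,
 * and the winner keeps the lock for the duration of the work. */
int progress_with_trylock(void)
{
    if (0 != pthread_mutex_trylock(&progress_lock)) {
        return 0;   /* another thread is already progressing */
    }
    /* ... actual progress work ... */
    pthread_mutex_unlock(&progress_lock);
    return 0;
}

int main(void)
{
    progress_with_counter();
    return progress_with_trylock();
}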
return OMPI_SUCCESS;
}

OPAL_LIST_FOREACH(current, ompi_part_persist.progress_list, mca_part_persist_list_t) {
Everything before this FOREACH should be done only once per process, and checking for this should definitely not be part of each progress call!
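As a rough sketch of the suggested structure (function names are made up, not the part_persist symbols): the one-time setup moves out of the progress path entirely, so each progress call starts directly at the request loop.

/* Illustrative split: one-time setup runs from component/module
 * initialization, and the progress function only walks the requests. */
int part_setup_once(void)
{
    /* create worker resources, allocate the progress list, ... */
    return 0;
}

int part_progress(void)
{
    /* walk the list of outstanding partitioned requests here,
     * with no initialization checks on the hot path */
    return 0;
}

int main(void)
{
    part_setup_once();   /* called exactly once, at init time */
    part_progress();     /* called repeatedly by the progress engine */
    return 0;
}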
if(false == req->first_send)
{
    if(MCA_PART_PERSIST_REQUEST_PSEND == req->req_type) {
        req->done_count = 0;
done_count is set to zero in all cases, so it can be moved out of the branches.
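The hoisting being asked for, shown on a simplified stand-in type (not the real mca_part_persist_request_t):

/* Simplified request type for illustration only. */
struct part_req { int req_type; int done_count; };
enum { REQ_PSEND, REQ_PRECV };

void start_request(struct part_req *req)
{
    req->done_count = 0;              /* identical in both branches, so hoisted */
    if (REQ_PSEND == req->req_type) {
        /* send-specific setup */
    } else {
        /* receive-specific setup */
    }
}

int main(void)
{
    struct part_req r = { REQ_PSEND, 5 };
    start_request(&r);
    return r.done_count;              /* exits 0: the counter was reset */
}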
Please try rebasing on top of main to see if the mpi4py error goes away.
Force-pushed from c31050b to 6a7aeb2
@bosilca: FWIW I did split this PR into two commits:
@hppritcha rebasing did get rid of the mpi4py error.
use a separate loop index for the innermost loop.

Thanks Jonas Harlacher for bringing this to our attention.

Fixes open-mpi#12969

Signed-off-by: Gilles Gouaillardet <[email protected]>
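For readers unfamiliar with the bug class this commit fixes, a small self-contained illustration (the data and variable names are invented, not the part_persist code): reusing the outer index inside the inner loop corrupts the outer counter, while giving each loop its own index restores the intended iteration.

#include <stdio.h>

int main(void)
{
    int work[3][2] = { {1, 2}, {3, 4}, {5, 6} };

    /* Buggy shape described by the commit (kept commented out): the inner
     * loop reuses i, so it clobbers the outer counter and the outer loop
     * terminates after a single pass.
     *
     *   for (int i = 0; i < 3; ++i)
     *       for (i = 0; i < 2; ++i)
     *           printf("%d ", work[i][i]);
     */

    /* Fixed shape: each loop level gets its own index. */
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 2; ++j) {
            printf("%d ", work[i][j]);
        }
    }
    printf("\n");
    return 0;
}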
move subroutines from header to source files.

Many subroutines are only used by a single source file, so move them out of the header file. This will make it easier for debuggers since setting a breakpoint in a header file is not always correctly supported.

Signed-off-by: Gilles Gouaillardet <[email protected]>
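A minimal sketch of the refactoring pattern this commit applies (file and function names here are hypothetical, not the actual part_persist files): the header keeps only the prototypes other translation units need, and helpers used by a single source file become static in that file, which also gives debuggers a concrete line to set a breakpoint on.

/* part_persist_foo.h (hypothetical name): keep only what other files need. */
int part_persist_start(void *request);

/* part_persist_foo.c: a helper used only in this file becomes static here. */
static int release_request_locked(void *request)
{
    (void)request;   /* release resources; caller holds the module lock */
    return 0;
}

int part_persist_start(void *request)
{
    return release_request_locked(request);
}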
Force-pushed from 6a7aeb2 to bbbe62f
No description provided.