Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of the current state of MPI Continuations proposal [WIP] #9502

Draft
wants to merge 68 commits into
base: main
Choose a base branch
from

Conversation

devreal
Copy link
Contributor

@devreal devreal commented Oct 9, 2021

This PR adds an implementation of the MPI Continuations proposal as extension to Open MPI. This proposal is current under discussion in the MPI hybrid and accelerator working group. By integrating it as an extension into Open MPI, I am hoping to provide an up-to-date implementation for the community to experiment with. It reflects the current state of the proposal and will be kept in sync with the evolving proposal.

The implementation is mostly confined to the extension itself, with the exception of hooks in request test and wait functions used to allow polling on a continuation request to complete outstanding continuations. This is needed for the implementation of the "mpi_continue_poll_only" info key (see the description below).

Note: this PR is currently WIP as it relies on a workaround to define OMPI_HAVE_MPI_EXT_CONTINUE, which is used to disable the hooks in request test/wait functions if the extension is not enabled. The underlying issue is that the extensions integration currently does not support including the mpiext.h header in implementation files (needed to disable the hooks mentioned above).

Overview of MPI Continuations

Continuations provide a mechanism for attaching callbacks to outstanding operation requests. A call to MPIX_Continue takes a request, a function pointer, a user-provided data pointer, and a status object (or MPI_STATUS_IGNORE), along with a continuation request and attaches the continuation to the operation request:

MPI_Request req;
// status object has to remain valid until the callback is invoked
MPI_Status *status = malloc(sizeof(MPI_Status));
char *buf = ...;
MPI_Irecv(buf, ..., MPI_ANY_SOURCE, ... &req);
MPIX_Continue(&req, &complete_cb, buf, status, cont_req);
assert(req == MPI_REQUEST_NULL);

The ownership of non-persistent requests is returned to MPI and the pointer to the request will be set to MPI_REQUEST_NULL. The callback is passed the status pointer and the user-provided data pointer:

void complete_cb(MPI_Status *status, void *user_data) {
  printf("Send completed\n");
  char *buf = (char*)user_data;
  process_msg(buf, status->MPI_SOURCE);
  free(buf); // free the send buffer
  free(status);    // free the status
}

The status has to remain valid until the invocation of the callback and is set according to the operation before the callback is invoked.

The continuation is registered with the provided continuation request. The continuation request is a request allocated using MPIX_Continue_init:

MPIX_Continue_init(info, &cont_req);

Continuation requests may be used to test/wait for completion of all continuations registered with it using MPI_Test/Wait. Supported info keys are:

  • "mpi_continue_poll_only": if true, only execute continuations when MPI_Test/Wait is called on the continuation request. If false, continuations may be executed at any time inside a call into MPI (inside a callback registered with opal_progress in the implementation; default: false).
  • "mpi_continue_enqueue_complete": if false, the continuation is executed immediately if the operations are already complete when MPIX_Continue is called. Execution is deferred otherwise (default: false)
  • "mpi_continue_max_poll": the maximum number of continuations to execute when calling MPI_Test on the continuation request (default: -1, meaning unlimited)

A continuation may in turn be attached to a continuation request, in which case it will be executed once all continuations registered with the continuation request have completed.

In addition to MPIX_Continue, the proposal also includes MPIX_Continueall which attaches a continuation to a set of requests such that the continuation is executed once all operations have completed.

Signed-off-by: Joseph Schuchart [email protected]
Signed-off-by: George Bosilca [email protected]

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the impact of the continuations on the latency ? What if we mix normal requests and continuations in the same app ?

ompi/mpiext/continue/c/continuation.c Outdated Show resolved Hide resolved
ompi/mpiext/continue/c/continuation.c Outdated Show resolved Hide resolved
ompi/mpiext/continue/c/continuation.c Show resolved Hide resolved
ompi/mpiext/continue/c/continuation.c Outdated Show resolved Hide resolved
ompi/mpiext/continue/c/continuation.c Outdated Show resolved Hide resolved
ompi/mpiext/continue/c/continuation.c Outdated Show resolved Hide resolved
ompi/mpiext/continue/c/continuation.c Outdated Show resolved Hide resolved
@devreal
Copy link
Contributor Author

devreal commented Oct 18, 2021

I pushed a bunch of changes. The biggest change is that a continuation request is now derived from ompi_request_t, which shrunk ompi_request_cont_data_t considerably and cuts down on the pointer chasing. The latter is now only used to link operation requests to the continuation object and the user-provided status object.

@devreal
Copy link
Contributor Author

devreal commented Oct 18, 2021

Oh, and I will squash everything down to one commit once we're done...

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitively much cleaner.


# This example can always build, so we just execute $1 if it was
# requested.
AS_IF([test "$ENABLE_continue" = "1" || \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These IFs looks similar enough to be merged.

ompi/request/req_test.c Show resolved Hide resolved
@devreal devreal force-pushed the mpi-continue-master branch from 2f49fa2 to fa149c2 Compare November 9, 2021 04:40
@devreal
Copy link
Contributor Author

devreal commented Nov 9, 2021

@bosilca I pushed a few changes:

  1. Now using the new OMPI_MPIEXT_*_POST_CONFIG hook. I hope this is how it is intended to be used.
  2. I had to make sure that the state of continuation requests is not changed to inactive in test/wait. They are always active and never explicitly (re)started by the user.
  3. Continuations that are poll-only (not executed as part of global progress) will now only be executed by the thread testing/waiting for their completion. We don't merge the local list into the global list anymore but instead allow the waiting thread to execute them if not blocked on the sync. Otherwise they will be executed at the end of the wait, once all continuations are eligible for execution. I think this change makes sense as this avoids other threads from being disturbed by poll-only continuations.

int ompi_continue_progress_request(ompi_request_t *cont_req);

/**
* Attach a continuation to a set of operations represented by \c requests.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this \c ?


/**
* Register the provided continuation request to be included in the
* global progress loop (used while a thread is waiting for the contnuation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo. contnuation


opal_atomic_unlock(&cont_req->cont_lock);

if (NULL == thread_progress_list) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this global object not protected for concurrent accesses ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a thread_local variable that stores the continuation requests the current thread is waiting on

@devreal devreal force-pushed the mpi-continue-master branch from 0fffb9c to e014717 Compare April 18, 2022 13:22
@devreal devreal marked this pull request as draft October 6, 2022 15:23
@github-actions
Copy link

github-actions bot commented Nov 3, 2022

Hello! The Git Commit Checker CI bot found a few problems with this PR:

d80d3ca: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@devreal devreal force-pushed the mpi-continue-master branch from 03c36a5 to e2979db Compare November 22, 2022 15:04
@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

f3babb5: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

5 similar comments
@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

f3babb5: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

f3babb5: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

f3babb5: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

f3babb5: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

f3babb5: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@devreal devreal force-pushed the mpi-continue-master branch from df79651 to c8c0437 Compare March 2, 2023 21:40
@github-actions
Copy link

github-actions bot commented Mar 2, 2023

Hello! The Git Commit Checker CI bot found a few problems with this PR:

f3babb5: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

3 similar comments
@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

f3babb5: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions
Copy link

github-actions bot commented Sep 7, 2023

Hello! The Git Commit Checker CI bot found a few problems with this PR:

f3babb5: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

Copy link

github-actions bot commented Feb 6, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

f3babb5: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@devreal devreal force-pushed the mpi-continue-master branch from a02d677 to b6440c0 Compare April 2, 2024 19:44
Copy link

github-actions bot commented Apr 2, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

a32e7e3: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

2 similar comments
Copy link

github-actions bot commented Apr 2, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

a32e7e3: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

Copy link

github-actions bot commented Apr 2, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

a32e7e3: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

devreal added 25 commits July 18, 2024 09:54
Explicit global progress instead of global progress
when progressing a request, to avoid recursive progress calls.

Signed-off-by: Joseph Schuchart <[email protected]>
Also: always store continuations directly in poll-only CRs.

Signed-off-by: Joseph Schuchart <[email protected]>
- Swap flags and max_poll parameter in MPI_Continue_init
- Swap cont_req and info parameter in MPI_Continue_init
- Add new flags and remove support for info keys mpi_continue_poll_only,
  mpi_continue_max_poll, and mpi_continue_enqueue_complete

Signed-off-by: Joseph Schuchart <[email protected]>
This was a temporary bandaid to fix the C11 sequentially consistent store...

Signed-off-by: Joseph Schuchart <[email protected]>
Continuation requests may be marked as complete before another
continuation is registered. In that case, we should mark
the continuation request as pending again, unless a sync object
has been registered already.

Signed-off-by: Joseph Schuchart <[email protected]>
The cont->cont_req field was reset *after* it was returned to the freelist...

Signed-off-by: Joseph Schuchart <[email protected]>
We must protect the activation of the completion of a continuation request
to ensure that we don't miss any updates.

Signed-off-by: Joseph Schuchart <[email protected]>
Runnable continuations should be added to cont_complete_defer_list if
the continuation request is inactive, so that it can be enqueued
for execution once the request is enabled.

Signed-off-by: Joseph Schuchart <[email protected]>
This was a temporary idea that has been removed again.

Signed-off-by: Joseph Schuchart <[email protected]>
MPIX_CONT_PERSISTENT will not be part of the first proposal and has
been removed.

Signed-off-by: Joseph Schuchart <[email protected]>
@devreal devreal force-pushed the mpi-continue-master branch from 790a7b6 to 68b887c Compare July 18, 2024 13:54
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

d77ca1c: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

It is not sufficient to just set one value, we need proper initialization
through OMPI_REQUEST_INIT.

Signed-off-by: Joseph Schuchart <[email protected]>
Copy link

github-actions bot commented Aug 8, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

d77ca1c: Remove re-iteration of continuation requests in te...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@JiakunYan
Copy link

I found something that may need more clarification.

The continuation proposal says

A continuation request completes once the callbacks of all continuations registered with it have completed execution.
After initialization or completion, a continuation request has to be started to enable the execution of registered continuations.

It seems to me that the MPI runtime should set the status of the continuation request to INACTIVE when the last continuation registered with it completes its execution. However, what I see in this PR is that the code will set the status to INACTIVE only when MPI_Test on the continuation request finds it completed.

This difference matters because it decides whether the following code logic is correct: if my application never tested (and then re-started) the continuation request (imagining I have some magic way to keep making progress on the runtime e.g., testing other regular requests), could I keep attaching new continues to it and assume they always get executed without delay?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants