mtl/ofi: avoid accessing request object after completion callback(restart ci) #12175

Merged
wenduwan merged 1 commit into open-mpi:main from mt_completion_callback
Dec 21, 2023

Conversation

wenduwan
Contributor

The completion callback can potentially invalidate the request object, so it is not safe to access the object afterwards.

@wenduwan
Contributor Author

Testing in AWS internal CI.

@@ -154,9 +154,9 @@ ompi_mtl_ofi_context_progress(int ctxt_id)
                 ret = ofi_req->event_callback(&ompi_mtl_ofi_wc[i], ofi_req);
                 if (OMPI_SUCCESS != ret) {
                     opal_output(0,
-                                "%s:%d: Error returned by request (type: %d) event callback: %zd.\n"
+                                "%s:%d: Error returned by request event callback: %zd.\n"
Member

could you cache the req type before invoking the event_callback and use that in the error message? might help a little with debugging.
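The suggestion above can be sketched roughly as follows. This is a minimal, standalone illustration rather than the actual Open MPI change: `request_t`, `req_type_t`, `failing_callback`, `progress_one`, and `demo` are hypothetical stand-ins, and the callback frees the request only to make the hazard concrete. The point is that the type is read into a local before the callback runs, so the error message never touches the request afterwards.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the real Open MPI types. */
typedef enum { REQ_SEND = 0, REQ_RECV = 1 } req_type_t;

typedef struct request {
    req_type_t type;
    int (*event_callback)(struct request *req);
} request_t;

/* A callback that reports failure and releases the request, so the
 * caller must treat the pointer as dangling once it returns. */
static int failing_callback(request_t *req)
{
    free(req);
    return -1;
}

/* Invoke the callback; on error, log using only values cached beforehand.
 * Returns the cached type and stores the callback's result in *ret_out. */
static int progress_one(request_t *req, int *ret_out)
{
    assert(NULL != req && NULL != ret_out);

    req_type_t cached_type = req->type;  /* read while req is still valid */
    int ret = req->event_callback(req);  /* req may be invalidated in here */

    if (0 != ret) {
        /* Only cached_type and ret are used; req is not dereferenced. */
        fprintf(stderr,
                "%s:%d: Error returned by request (type: %d) event callback: %d.\n",
                __FILE__, __LINE__, (int) cached_type, ret);
    }
    *ret_out = ret;
    return (int) cached_type;
}

/* Small driver: returns 0 if the cached type and error code are as expected. */
static int demo(void)
{
    request_t *req = malloc(sizeof(*req));
    if (NULL == req) {
        return -2;
    }
    req->type = REQ_RECV;
    req->event_callback = failing_callback;

    int ret = 0;
    int type = progress_one(req, &ret);
    return (REQ_RECV == type && -1 == ret) ? 0 : 1;
}
```

Calling `demo()` exercises the error path; the request is already freed by the time the message is printed, yet the type is still reported because it was copied out first.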

Contributor Author

Ah that's smart. Let me do that...

Contributor Author

Updated. Running our CI again...

@wenduwan wenduwan force-pushed the mt_completion_callback branch 2 times, most recently from 27a9516 to c90f848 Compare December 21, 2023 16:55
@wenduwan wenduwan changed the title from "mtl/ofi: avoid accessing request object after completion callback" to "mtl/ofi: avoid accessing request object after completion callback(restart ci)" Dec 21, 2023
@wenduwan
Contributor Author

nvidia ci vomited...

Run /start clean
~/CI /github/workspace/ompi/ompi
Cleaning evironment
Error from server (NotFound): pods "ompi-ci-1" not found
Error: Process completed with exit code 1.

@hppritcha
Member

rerunning it doesn't seem to help.

@wenduwan
Contributor Author

It appears that the backend server is not available. I tried to restart the ci for #12182

Request completion callback function can potentially invalidate the
request object. We should avoid accessing the object afterwards.

Signed-off-by: Wenduo Wang <[email protected]>
@wenduwan wenduwan force-pushed the mt_completion_callback branch from c90f848 to 6d79aae Compare December 21, 2023 17:49
@wenduwan
Contributor Author

Finally worked... Our internal CI also passed.

Merging...

@wenduwan wenduwan merged commit fd6fe3d into open-mpi:main Dec 21, 2023
10 checks passed
@wenduwan wenduwan deleted the mt_completion_callback branch December 21, 2023 18:59