-
Notifications
You must be signed in to change notification settings - Fork 382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/shm: Add unmap_region function #10364
Conversation
bot:aws:retest |
ad41c78
to
5c3e25b
Compare
@shijin-aws whats the AWS failure? |
Lots of failures for Open MPI RMA test, one example is
|
I can try reproducing locally to get more logs. But it seems to me the segfaults happened in some locking steps |
I can test it locally. We dont test OMPI in our CI so those tests weren't run on our end. |
Does any provider other than efa can access shm inside Open MPI run now? Open MPI cannot run with shm itself as it doesn't support FI_REMOTE_COMM |
I have a "hack" for OMPI to force it to run shm without needing another provider. I just remove taht dependency and then run tests that dont need it. Im pretty sure this bug is all related so fixing one of those tests will likely fix most/all of them |
Thanks. The interesting thing is that the error only happens for our inter-node test, in this case we only touch shm during the control interface operations (av_insert, ep open, mr reg), no data transfer |
do you have a stack trace with the av_insert path? I removed the lock inside of map_to_region and pushed it to be outside. I only moved the locks in prov/shm. I think I need to take a look at where it is called in prov/efa. Thats likely the bug |
The bug is triggered at the MPI_Finalize stage, I think it's in the clean up stage |
I wonder if there is a path we dont test in Intel CI that removes nodes from the rb tree after I already remove them with the updated call. Ill take a look there |
@shijin-aws I have been unsuccessful in recreating the same bug locally. Do you know which call is causing the segfault? I assume its in smr_ep_close. I also found a missed lock in smr_ep.c:227 needs to be locked and unlocked afterwards |
@zachdworkin Is this new push just for debugging? |
its to fix the miss of the lock in smr_ep.c:227 and for debugging |
@zachdworkin I will get back to you today on this |
@zachdworkin The error disappears for your latest push
|
excellent! I suspect that the missing lock was the issue. I will re-push without the prints. Thanks! |
Your latest push failed in build time
|
@shijin-aws can you share the aws failure? |
bot:aws:retest |
@zachdworkin This time the segfaults happened again
|
3c33060
to
b46fbbb
Compare
@zachdworkin this time it's an efa unit test failure
Backtrace is
|
16ad4a3
to
9ba7aa3
Compare
@shijin-aws can you share the latest failure? I fixed the previous one |
It's the same segfault in MPI run again ....
|
Can you run this again with debug logging? I can't reproduce this issue with my ompi workaround patch |
Backtrace with debug build
|
The segfault happens when accessing pid in the peer_region
Is |
2000bbe
to
82aa3c1
Compare
This function is mainly for the niche case where on progress_connreq a peer is added to the map with its region needing to be mapped, and then after mapping it, it's discovered that the newly mapped peer's process died. In this case we need to unmap them and free any resources that were opened for communicating with them. Remove lock from map_to_region and unmap_region functions and require lock acquirement before calling those functions. This is necessary because on av removal path, map will be double locked if the functions also process locking the map. The map_to_region function is updated to mirror this policy. Signed-off-by: Zach Dworkin <[email protected]>
This function is mainly for the niche case where on progress_connreq a peer is added to the map with its region needing to be mapped, and then after mapping it, it's discovered that the newly mapped peer's process died. In this case we need to unmap them and free any resources that were opened for communicating with them.
With the creation of this function we can rework smr_map_del to use it as common code. This requires changes to smr_av.c where smr_map_del is called. smr_map_del is now an iterable function. This is to optimize smr_map_cleanup to use ofi_rbmap_foreach to only cleanup peers that exist.