-
Notifications
You must be signed in to change notification settings - Fork 382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
man/fi_peer, prov/shm,efa: update peer context usage and providers using old usage #10377
Conversation
@shijin-aws I updated efa here, but let me know if I missed something. I can also squash the second and third patches into on because technically the shm change will break efa until that third patch is added. I kept it separate for easier review. I can squash when you confirm it looks good and passes |
AWS CI failure is related, seems all EFA single node test failed. No detailed logs yet, will try to reproduce |
bot:aws:retest |
@@ -1304,10 +1309,22 @@ static int efa_rdm_ep_ctrl(struct fid *fid, int command, void *arg) | |||
* shared memory region. | |||
*/ | |||
if (ep->shm_ep) { | |||
peer_srx_context.srx = util_get_peer_srx(ep->peer_srx_ep); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does shm_peer_srx
do the same thing as ep->peer_srx_ep
? We created ep->peer_srx_ep
via util_ep_srx_context
which allocates the owner resources for srx. I need to read your man page change more carefully, but it doesn't make sense to me that we have 2 owner srx resources here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 2 resources here - the peer srx (allocated by efa and imported to shm) and then shm has a fid_ep that it uses to track any shm resources (it doesn't need any but it can). I think that second one I actually implemented incorrectly because shm doesn't return a fid anymore. It's not necessary for the shm case but it's part of the definition so that's definitely a miss and maybe why it's failing (because the fi_close at the end will actually segfault)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shijin-aws I fixed the issues I knew would cause a failure but it looks like it's still grumpy. I'm guessing I missed something in the message flow - can you share the failure?
@aingerson I get a segfault when running
|
@aingerson the latest push fixed the earlier segfaults, but we have some efa-specific unit test failure, I trying to get to the bottom |
prov/efa/src/rdm/efa_rdm_ep_fiops.c
Outdated
@@ -987,6 +987,13 @@ static int efa_rdm_ep_close(struct fid *fid) | |||
} | |||
|
|||
if (efa_rdm_ep->shm_ep) { | |||
ret = fi_close(&efa_rdm_ep->shm_srx->fid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have other places that can close shm resources, including https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/rdm/efa_rdm_ep_fiops.c#L1048-L1059
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And our unit tests have tons of places that replies on closing shm ep hackily as fi_close
: https://github.com/ofiwg/libfabric/blob/main/prov/efa/test/efa_unit_test_ep.c#L126-L131. These tests are all failed by this change. I think we need to modify our unit tests that doesn't close shm ep explicitly in the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shijin-aws Oh that is funky. I didn't realize that. I added a ref count in the shm srx import to make sure it gets closed properly even though there isn't really anything to be closed for shm so I think dropping this close will cause failures in the regular case. I think you probably at minimum need a wrapper around that hack to close both the shm ep and the shm srx. Do you think that would be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am preparing a patch to fix that. We can use fi_setopt(.. FI_OPT_SHARED_MEMORY_PERMMITED, false)
to disable shm in the test. Will let you know when it's ready
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But generally it is annoying that provider needs to manage more resources for peer providers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but I think we need to match all the peer resources to the same flow - the peer cq and peer counter are the same (efa needs to close the shm cq and counters). I think it needs to happen and this one was just missed so it's annoying to add it in later. That would be a perk of the link provider, to remove that handling from efa
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened a PR to clean up our tests: #10406. we need to rebase your PR after #10406 is merged and then make sure shm resources are cleaned correctly in https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/rdm/efa_rdm_ep_fiops.c#L1048-L1059 as well.
man/fi_peer.3.md
Outdated
When importing any shared fabric object into a peer, the owner will create a | ||
separate fid_peer_* for each peer provider it intends to import into. The owner | ||
will pass this unique fid_peer_* into each peer through the context parameter of | ||
the init call for the resource (ie fi_cq_open, fi_srx_context, fi_cntr_open, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ie
=> i.e.
/* shm srx fid (shm-owned) */ | ||
struct fid_ep *shm_srx; | ||
/* shm peer_srx (efa-owned) */ | ||
struct fid_peer_srx *shm_peer_srx; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still not convinced that we need to have 3 peer srx related resources in efa ep, adding these 2, there are
- peer_srx_ep
- shm_srx
- shm_peer_srx
If we want to make an analogy to peer cq and cntr
efa_cq has only 1 peer cq related field: shm_cq,
it opens the shm_cq via
shm_cq_attr.flags |= FI_PEER;
peer_cq_context.size = sizeof(peer_cq_context);
peer_cq_context.cq = cq->util_cq.peer_cq;
fi_cq_open(efa_domain->shm_domain, &shm_cq_attr,
&cq->shm_cq, &peer_cq_context);
I believe shm_cq
plays a role similar to shm_srx
here, util_cq.peer_cq
plays the role of shm_peer_srx
here.
If we make an analogy between util_cq
and util_srx
, shouldn't we make shm_peer_srx
to be part of util_srx
, as util_srx.peer_srx
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shijin-aws I think it depends on the use case of efa (I don't really know how efa uses the srx on the efa side). The issue (and difference between the CQ use case) is the peer_ops. shm needs to be able to set the peer_srx->peer_ops to its own ops. Every peer that has a separate set of peer_ops needs a separate peer_srx for dedicated use for that peer. If in efa you access all the util_srx functions directly and not through the owner_ops (as if you were a peer) then you're right, you don't need a separate peer_srx - you could put it in the util_srx and have that one be dedicated for shm. However, if in efa you are calling get_msg(), etc, you have to pass in a unique peer_srx that has the efa peer_ops for the util_srx to call later. I think you likely need two (one for efa and one for shm) because I think you just go directly through the util_generic_recvmsg call which directly initiates the start_msg() flow.
For the CQ case, there are no peer_ops, so the peers can technically share the same peer_cq because the only purpose is to call the owner_ops which won't change from peer to peer.
So for the three resources you're talking about, I think each one is needed for a slightly different purpose:
- peer_srx_ep -> peer_srx for the efa-owned srx, for use with efa (owner_ops = util_srx_ops, peer_ops = efa_peer_ops)
- shm_srx -> shm fid for srx (shm doesn't need it but in case it did need anything extra, this is to track it)
- shm_peer_srx -> peer_srx for the efa-owned srx, for use with shm (owner_ops = util_srx_ops, peer_ops = shm_peer_ops)
@aingerson still a lot of efa unit test failure in AWS CI, will get back you later today |
I hit a segmentation fault around this line inside
|
I figure it out. EFA provider create shm_ep during fi_endpoint, but only create srx resources during fi_enable. Our unit tests have tests that only create/close ep without enable. Your |
I reworked your efa change: shijin-aws@57749f6 please cherry-pick |
Add clarification in the man page indicating that the owner is responsible for creating unique fi_peer_*_contexts for each peer and that the peers are only allowed to set the peer ops of that context. Signed-off-by: Alexia Ingerson <[email protected]>
The peer API has been updated to specify that the owner must allocate the peer's fid_peer_srx. The shm implementation was allocating its own internal fid_peer_srx. This updates the shm implementation to assume it has a unique fid_peer_srx and updates the imported fid_peer_srx peer_ops, saving a pointer to the fid_peer_srx instead of the internal fid_ep which required a wrapper function to get back to the fid_peer_srx It also returns an internal fid_ep for the created srx which is used to close the srx by the owner. Even though shm doesn't need anything attached to the internal fid_ep, it is there for consistency and to track the domain reference counting for errors. This patch also moves the srx specific functions into smr_domain where they belong Signed-off-by: Alexia Ingerson <[email protected]>
The previous definition of the peer API didn't specify who allocated the second peer structure (the one referenced by the peer). The shm implementation was choosing to duplicate the imported srx and set it internally. The new definition specifies that the owner handle the duplication of the peer resource which is then imported into the peer to just set. Shm has been updated accordingly but efa needs to be updated to create a second peer_srx and set the fields to the original one for the peer to reference the owner_ops correctly. This also adds a missing fi_close for the shm srx resource Signed-off-by: Alexia Ingerson <[email protected]> Signed-off-by: Shi Jin <[email protected]>
@shijin-aws Thanks for fixing, Shi! Looks like everything is passing. I think this is finally good to go know. Feel free to merge if that looks ok to you |
The man pages had no wording describing who owned the shared peer resource when using the peer API. Because of this, the util cq and counter implementation were using owner-allocated fid_peer_* resources and the shm srx implementation was using a peer-allocated resource. This patch set updates the man pages to add consistency to the resource allocation flow, specifying that the owner is responsible for allocating the resource and making sure peers use different fids. It also updates the shm implementation to adhere to this flow and updates efa's use of shm as a peer in this way as well.