man/fi_peer, prov/shm,efa: update peer context usage and providers using old usage #10377

aingerson · 2024-09-11T21:18:40Z

The man pages had no wording describing who owned the shared peer resource when using the peer API. Because of this, the util cq and counter implementation were using owner-allocated fid_peer_* resources and the shm srx implementation was using a peer-allocated resource. This patch set updates the man pages to add consistency to the resource allocation flow, specifying that the owner is responsible for allocating the resource and making sure peers use different fids. It also updates the shm implementation to adhere to this flow and updates efa's use of shm as a peer in this way as well.

aingerson · 2024-09-11T21:20:37Z

@shijin-aws I updated efa here, but let me know if I missed something. I can also squash the second and third patches into on because technically the shm change will break efa until that third patch is added. I kept it separate for easier review. I can squash when you confirm it looks good and passes

shijin-aws · 2024-09-12T16:39:20Z

AWS CI failure is related, seems all EFA single node test failed. No detailed logs yet, will try to reproduce

shijin-aws · 2024-09-12T22:25:52Z

bot:aws:retest

shijin-aws · 2024-09-12T22:52:07Z

prov/efa/src/rdm/efa_rdm_ep_fiops.c

@@ -1304,10 +1309,22 @@ static int efa_rdm_ep_ctrl(struct fid *fid, int command, void *arg)
 		 * shared memory region.
 		 */
 		if (ep->shm_ep) {
-			peer_srx_context.srx = util_get_peer_srx(ep->peer_srx_ep);


Does shm_peer_srx do the same thing as ep->peer_srx_ep? We created ep->peer_srx_ep via util_ep_srx_context which allocates the owner resources for srx. I need to read your man page change more carefully, but it doesn't make sense to me that we have 2 owner srx resources here

There are 2 resources here - the peer srx (allocated by efa and imported to shm) and then shm has a fid_ep that it uses to track any shm resources (it doesn't need any but it can). I think that second one I actually implemented incorrectly because shm doesn't return a fid anymore. It's not necessary for the shm case but it's part of the definition so that's definitely a miss and maybe why it's failing (because the fi_close at the end will actually segfault)

@shijin-aws I fixed the issues I knew would cause a failure but it looks like it's still grumpy. I'm guessing I missed something in the message flow - can you share the failure?

shijin-aws · 2024-09-18T23:32:17Z

@aingerson I get a segfault when running fi_rdm_pingpong -p efa -E on single node, the backtrace is

Core was generated by `fi_rdm_pingpong -p efa -E'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fb4ea3b03ca in smr_av_insert (av_fid=0x555b85edd9d0, addr=0x7ffd3979cd10, count=1, fi_addr=0x7fb482e39258, flags=2199023255552, context=0x0)
    at prov/shm/src/smr_av.c:175
175                             smr_ep->srx->owner_ops->foreach_unspec_addr(smr_ep->srx,
(gdb) 
(gdb) bt
#0  0x00007fb4ea3b03ca in smr_av_insert (av_fid=0x555b85edd9d0, addr=0x7ffd3979cd10, count=1, fi_addr=0x7fb482e39258, flags=2199023255552, context=0x0)
    at prov/shm/src/smr_av.c:175
#1  0x00007fb4ea315ff6 in fi_av_insert (av=0x555b85edd9d0, addr=0x7ffd3979cd10, count=1, fi_addr=0x7fb482e39258, flags=2199023255552, context=0x0)
    at ./include/rdma/fi_domain.h:509
#2  0x00007fb4ea31970f in efa_conn_rdm_init (av=0x555b85eda2a0, conn=0x7fb482e39220, insert_shm_av=true) at prov/efa/src/efa_av.c:319
#3  0x00007fb4ea31bc3e in efa_conn_alloc (av=0x555b85eda2a0, raw_addr=0x7ffd3979d020, flags=0, context=0x0, insert_shm_av=true) at prov/efa/src/efa_av.c:487
#4  0x00007fb4ea31d4f4 in efa_av_insert_one (av=0x555b85eda2a0, addr=0x7ffd3979d020, fi_addr=0x7ffd3979cfa8, flags=0, context=0x0, insert_shm_av=true)
    at prov/efa/src/efa_av.c:625
#5  0x00007fb4ea31d6ed in efa_av_insert (av_fid=0x555b85eda2e8, addr=0x7ffd3979d020, count=1, fi_addr=0x555b85c8f3d0 <remote_fi_addr>, flags=0, context=0x0)
    at prov/efa/src/efa_av.c:667
#6  0x0000555b85c7bef8 in fi_av_insert (context=0x0, flags=0, fi_addr=0x555b85c8f3d0 <remote_fi_addr>, count=1, addr=0x7ffd3979d020, av=0x555b85eda2e8)
    at /home/ubuntu/PortaFiducia/build/libraries/libfabric/main/install/libfabric/include/rdma/fi_domain.h:509
#7  ft_av_insert (av=av@entry=0x555b85eda2e8, addr=addr@entry=0x7ffd3979d020, count=count@entry=1, fi_addr=fi_addr@entry=0x555b85c8f3d0 <remote_fi_addr>, 
    flags=flags@entry=0, context=context@entry=0x0) at common/shared.c:1516
#8  0x0000555b85c84bef in ft_exchange_addresses_oob (av_ptr=0x555b85eda2e8, ep_ptr=<optimized out>, remote_addr=0x555b85c8f3d0 <remote_fi_addr>)
    at common/shared.c:1555
#9  0x0000555b85c84da5 in ft_init_av_dst_addr (av_ptr=0x555b85eda2e8, ep_ptr=0x555b85eda520, remote_addr=remote_addr@entry=0x555b85c8f3d0 <remote_fi_addr>)
    at common/shared.c:1574
#10 0x0000555b85c84f74 in ft_init_av () at common/shared.c:1531
#11 0x0000555b85c85859 in ft_init_fabric () at common/shared.c:1336
#12 0x0000555b85c789b7 in run () at benchmarks/rdm_pingpong.c:109
#13 main (argc=4, argv=0x7ffd3979da08) at benchmarks/rdm_pingpong.c:109

shijin-aws · 2024-09-19T22:51:49Z

@aingerson the latest push fixed the earlier segfaults, but we have some efa-specific unit test failure, I trying to get to the bottom

shijin-aws · 2024-09-19T23:00:22Z

prov/efa/src/rdm/efa_rdm_ep_fiops.c

@@ -987,6 +987,13 @@ static int efa_rdm_ep_close(struct fid *fid)
 	}

 	if (efa_rdm_ep->shm_ep) {
+		ret = fi_close(&efa_rdm_ep->shm_srx->fid);


We have other places that can close shm resources, including https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/rdm/efa_rdm_ep_fiops.c#L1048-L1059

And our unit tests have tons of places that replies on closing shm ep hackily as fi_close: https://github.com/ofiwg/libfabric/blob/main/prov/efa/test/efa_unit_test_ep.c#L126-L131. These tests are all failed by this change. I think we need to modify our unit tests that doesn't close shm ep explicitly in the test.

@shijin-aws Oh that is funky. I didn't realize that. I added a ref count in the shm srx import to make sure it gets closed properly even though there isn't really anything to be closed for shm so I think dropping this close will cause failures in the regular case. I think you probably at minimum need a wrapper around that hack to close both the shm ep and the shm srx. Do you think that would be enough?

I am preparing a patch to fix that. We can use fi_setopt(.. FI_OPT_SHARED_MEMORY_PERMMITED, false) to disable shm in the test. Will let you know when it's ready

Sweet, thank you!

But generally it is annoying that provider needs to manage more resources for peer providers

I agree, but I think we need to match all the peer resources to the same flow - the peer cq and peer counter are the same (efa needs to close the shm cq and counters). I think it needs to happen and this one was just missed so it's annoying to add it in later. That would be a perk of the link provider, to remove that handling from efa

I opened a PR to clean up our tests: #10406. we need to rebase your PR after #10406 is merged and then make sure shm resources are cleaned correctly in https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/rdm/efa_rdm_ep_fiops.c#L1048-L1059 as well.

shijin-aws · 2024-09-24T04:20:09Z

man/fi_peer.3.md

+When importing any shared fabric object into a peer, the owner will create a
+separate fid_peer_* for each peer provider it intends to import into.  The owner
+will pass this unique fid_peer_* into each peer through the context parameter of
+the init call for the resource (ie fi_cq_open, fi_srx_context, fi_cntr_open,


ie => i.e.

shijin-aws · 2024-09-24T05:08:00Z

prov/efa/src/rdm/efa_rdm_ep.h

+	/* shm srx fid (shm-owned) */
+	struct fid_ep *shm_srx;
+	/* shm peer_srx (efa-owned) */
+	struct fid_peer_srx *shm_peer_srx;


I am still not convinced that we need to have 3 peer srx related resources in efa ep, adding these 2, there are

peer_srx_ep

shm_srx

shm_peer_srx

If we want to make an analogy to peer cq and cntr

efa_cq has only 1 peer cq related field: shm_cq,
it opens the shm_cq via

shm_cq_attr.flags |= FI_PEER; peer_cq_context.size = sizeof(peer_cq_context); peer_cq_context.cq = cq->util_cq.peer_cq; fi_cq_open(efa_domain->shm_domain, &shm_cq_attr, &cq->shm_cq, &peer_cq_context);

I believe shm_cq plays a role similar to shm_srx here, util_cq.peer_cq plays the role of shm_peer_srx here.

If we make an analogy between util_cq and util_srx, shouldn't we make shm_peer_srx to be part of util_srx, as util_srx.peer_srx ?

@shijin-aws I think it depends on the use case of efa (I don't really know how efa uses the srx on the efa side). The issue (and difference between the CQ use case) is the peer_ops. shm needs to be able to set the peer_srx->peer_ops to its own ops. Every peer that has a separate set of peer_ops needs a separate peer_srx for dedicated use for that peer. If in efa you access all the util_srx functions directly and not through the owner_ops (as if you were a peer) then you're right, you don't need a separate peer_srx - you could put it in the util_srx and have that one be dedicated for shm. However, if in efa you are calling get_msg(), etc, you have to pass in a unique peer_srx that has the efa peer_ops for the util_srx to call later. I think you likely need two (one for efa and one for shm) because I think you just go directly through the util_generic_recvmsg call which directly initiates the start_msg() flow.
For the CQ case, there are no peer_ops, so the peers can technically share the same peer_cq because the only purpose is to call the owner_ops which won't change from peer to peer.
So for the three resources you're talking about, I think each one is needed for a slightly different purpose:

peer_srx_ep -> peer_srx for the efa-owned srx, for use with efa (owner_ops = util_srx_ops, peer_ops = efa_peer_ops)

shm_srx -> shm fid for srx (shm doesn't need it but in case it did need anything extra, this is to track it)

shm_peer_srx -> peer_srx for the efa-owned srx, for use with shm (owner_ops = util_srx_ops, peer_ops = shm_peer_ops)

shijin-aws · 2024-10-04T19:02:00Z

@aingerson still a lot of efa unit test failure in AWS CI, will get back you later today

shijin-aws · 2024-10-08T00:10:50Z

I hit a segmentation fault around this line inside efa_rdm_ep_close

if (efa_rdm_ep->shm_ep) {
		ret = fi_close(&efa_rdm_ep->shm_srx->fid);

shijin-aws · 2024-10-08T00:31:23Z

I figure it out. EFA provider create shm_ep during fi_endpoint, but only create srx resources during fi_enable. Our unit tests have tests that only create/close ep without enable. Your fi_close shm_srx and shm_peer_srx should explicitly check efa_rdm_ep->shm_srx and efa_rdm_ep->shm_peer_srx before doing it.

shijin-aws · 2024-10-08T00:49:22Z

I reworked your efa change: shijin-aws@57749f6 please cherry-pick

Add clarification in the man page indicating that the owner is responsible for creating unique fi_peer_*_contexts for each peer and that the peers are only allowed to set the peer ops of that context. Signed-off-by: Alexia Ingerson <[email protected]>

The peer API has been updated to specify that the owner must allocate the peer's fid_peer_srx. The shm implementation was allocating its own internal fid_peer_srx. This updates the shm implementation to assume it has a unique fid_peer_srx and updates the imported fid_peer_srx peer_ops, saving a pointer to the fid_peer_srx instead of the internal fid_ep which required a wrapper function to get back to the fid_peer_srx It also returns an internal fid_ep for the created srx which is used to close the srx by the owner. Even though shm doesn't need anything attached to the internal fid_ep, it is there for consistency and to track the domain reference counting for errors. This patch also moves the srx specific functions into smr_domain where they belong Signed-off-by: Alexia Ingerson <[email protected]>

The previous definition of the peer API didn't specify who allocated the second peer structure (the one referenced by the peer). The shm implementation was choosing to duplicate the imported srx and set it internally. The new definition specifies that the owner handle the duplication of the peer resource which is then imported into the peer to just set. Shm has been updated accordingly but efa needs to be updated to create a second peer_srx and set the fields to the original one for the peer to reference the owner_ops correctly. This also adds a missing fi_close for the shm srx resource Signed-off-by: Alexia Ingerson <[email protected]> Signed-off-by: Shi Jin <[email protected]>

aingerson · 2024-10-10T22:37:48Z

@shijin-aws Thanks for fixing, Shi! Looks like everything is passing. I think this is finally good to go know. Feel free to merge if that looks ok to you

aingerson requested review from shijin-aws and amirshehataornl September 11, 2024 21:18

shijin-aws reviewed Sep 12, 2024

View reviewed changes

aingerson force-pushed the peer branch from 4efd08f to d6ee067 Compare September 17, 2024 19:54

aingerson force-pushed the peer branch from d6ee067 to e200ca3 Compare September 19, 2024 18:30

shijin-aws reviewed Sep 19, 2024

View reviewed changes

shijin-aws reviewed Sep 24, 2024

View reviewed changes

aingerson force-pushed the peer branch from e200ca3 to 76855a9 Compare October 4, 2024 04:41

aingerson force-pushed the peer branch from 76855a9 to 7ce7ee5 Compare October 10, 2024 16:04

aingerson added 2 commits October 10, 2024 09:05

aingerson force-pushed the peer branch from 7ce7ee5 to b55d2d2 Compare October 10, 2024 16:05

shijin-aws approved these changes Oct 11, 2024

View reviewed changes

shijin-aws merged commit 527d320 into ofiwg:main Oct 11, 2024
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

man/fi_peer, prov/shm,efa: update peer context usage and providers using old usage #10377

man/fi_peer, prov/shm,efa: update peer context usage and providers using old usage #10377

aingerson commented Sep 11, 2024

aingerson commented Sep 11, 2024

shijin-aws commented Sep 12, 2024

shijin-aws commented Sep 12, 2024

shijin-aws Sep 12, 2024 •

edited

Loading

aingerson Sep 12, 2024

aingerson Sep 18, 2024

shijin-aws commented Sep 18, 2024

shijin-aws commented Sep 19, 2024 •

edited

Loading

shijin-aws Sep 19, 2024

shijin-aws Sep 19, 2024

aingerson Sep 20, 2024

shijin-aws Sep 20, 2024 •

edited

Loading

aingerson Sep 20, 2024

shijin-aws Sep 20, 2024

aingerson Sep 20, 2024

shijin-aws Sep 23, 2024

shijin-aws Sep 24, 2024

shijin-aws Sep 24, 2024

aingerson Sep 27, 2024

shijin-aws commented Oct 4, 2024 •

edited

Loading

shijin-aws commented Oct 8, 2024

shijin-aws commented Oct 8, 2024

shijin-aws commented Oct 8, 2024

aingerson commented Oct 10, 2024

man/fi_peer, prov/shm,efa: update peer context usage and providers using old usage #10377

man/fi_peer, prov/shm,efa: update peer context usage and providers using old usage #10377

Conversation

aingerson commented Sep 11, 2024

aingerson commented Sep 11, 2024

shijin-aws commented Sep 12, 2024

shijin-aws commented Sep 12, 2024

shijin-aws Sep 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shijin-aws commented Sep 18, 2024

shijin-aws commented Sep 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shijin-aws Sep 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shijin-aws commented Oct 4, 2024 • edited Loading

shijin-aws commented Oct 8, 2024

shijin-aws commented Oct 8, 2024

shijin-aws commented Oct 8, 2024

aingerson commented Oct 10, 2024

shijin-aws Sep 12, 2024 •

edited

Loading

shijin-aws commented Sep 19, 2024 •

edited

Loading

shijin-aws Sep 20, 2024 •

edited

Loading

shijin-aws commented Oct 4, 2024 •

edited

Loading