Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

man/fi_peer, prov/shm,efa: update peer context usage and providers using old usage #10377

Merged
merged 3 commits into from
Oct 11, 2024

Conversation

aingerson
Copy link
Contributor

The man pages had no wording describing who owned the shared peer resource when using the peer API. Because of this, the util cq and counter implementation were using owner-allocated fid_peer_* resources and the shm srx implementation was using a peer-allocated resource. This patch set updates the man pages to add consistency to the resource allocation flow, specifying that the owner is responsible for allocating the resource and making sure peers use different fids. It also updates the shm implementation to adhere to this flow and updates efa's use of shm as a peer in this way as well.

@aingerson
Copy link
Contributor Author

@shijin-aws I updated efa here, but let me know if I missed something. I can also squash the second and third patches into on because technically the shm change will break efa until that third patch is added. I kept it separate for easier review. I can squash when you confirm it looks good and passes

@shijin-aws
Copy link
Contributor

AWS CI failure is related, seems all EFA single node test failed. No detailed logs yet, will try to reproduce

@shijin-aws
Copy link
Contributor

bot:aws:retest

@@ -1304,10 +1309,22 @@ static int efa_rdm_ep_ctrl(struct fid *fid, int command, void *arg)
* shared memory region.
*/
if (ep->shm_ep) {
peer_srx_context.srx = util_get_peer_srx(ep->peer_srx_ep);
Copy link
Contributor

@shijin-aws shijin-aws Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does shm_peer_srx do the same thing as ep->peer_srx_ep? We created ep->peer_srx_ep via util_ep_srx_context which allocates the owner resources for srx. I need to read your man page change more carefully, but it doesn't make sense to me that we have 2 owner srx resources here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 resources here - the peer srx (allocated by efa and imported to shm) and then shm has a fid_ep that it uses to track any shm resources (it doesn't need any but it can). I think that second one I actually implemented incorrectly because shm doesn't return a fid anymore. It's not necessary for the shm case but it's part of the definition so that's definitely a miss and maybe why it's failing (because the fi_close at the end will actually segfault)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shijin-aws I fixed the issues I knew would cause a failure but it looks like it's still grumpy. I'm guessing I missed something in the message flow - can you share the failure?

@shijin-aws
Copy link
Contributor

@aingerson I get a segfault when running fi_rdm_pingpong -p efa -E on single node, the backtrace is

Core was generated by `fi_rdm_pingpong -p efa -E'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fb4ea3b03ca in smr_av_insert (av_fid=0x555b85edd9d0, addr=0x7ffd3979cd10, count=1, fi_addr=0x7fb482e39258, flags=2199023255552, context=0x0)
    at prov/shm/src/smr_av.c:175
175                             smr_ep->srx->owner_ops->foreach_unspec_addr(smr_ep->srx,
(gdb) 
(gdb) bt
#0  0x00007fb4ea3b03ca in smr_av_insert (av_fid=0x555b85edd9d0, addr=0x7ffd3979cd10, count=1, fi_addr=0x7fb482e39258, flags=2199023255552, context=0x0)
    at prov/shm/src/smr_av.c:175
#1  0x00007fb4ea315ff6 in fi_av_insert (av=0x555b85edd9d0, addr=0x7ffd3979cd10, count=1, fi_addr=0x7fb482e39258, flags=2199023255552, context=0x0)
    at ./include/rdma/fi_domain.h:509
#2  0x00007fb4ea31970f in efa_conn_rdm_init (av=0x555b85eda2a0, conn=0x7fb482e39220, insert_shm_av=true) at prov/efa/src/efa_av.c:319
#3  0x00007fb4ea31bc3e in efa_conn_alloc (av=0x555b85eda2a0, raw_addr=0x7ffd3979d020, flags=0, context=0x0, insert_shm_av=true) at prov/efa/src/efa_av.c:487
#4  0x00007fb4ea31d4f4 in efa_av_insert_one (av=0x555b85eda2a0, addr=0x7ffd3979d020, fi_addr=0x7ffd3979cfa8, flags=0, context=0x0, insert_shm_av=true)
    at prov/efa/src/efa_av.c:625
#5  0x00007fb4ea31d6ed in efa_av_insert (av_fid=0x555b85eda2e8, addr=0x7ffd3979d020, count=1, fi_addr=0x555b85c8f3d0 <remote_fi_addr>, flags=0, context=0x0)
    at prov/efa/src/efa_av.c:667
#6  0x0000555b85c7bef8 in fi_av_insert (context=0x0, flags=0, fi_addr=0x555b85c8f3d0 <remote_fi_addr>, count=1, addr=0x7ffd3979d020, av=0x555b85eda2e8)
    at /home/ubuntu/PortaFiducia/build/libraries/libfabric/main/install/libfabric/include/rdma/fi_domain.h:509
#7  ft_av_insert (av=av@entry=0x555b85eda2e8, addr=addr@entry=0x7ffd3979d020, count=count@entry=1, fi_addr=fi_addr@entry=0x555b85c8f3d0 <remote_fi_addr>, 
    flags=flags@entry=0, context=context@entry=0x0) at common/shared.c:1516
#8  0x0000555b85c84bef in ft_exchange_addresses_oob (av_ptr=0x555b85eda2e8, ep_ptr=<optimized out>, remote_addr=0x555b85c8f3d0 <remote_fi_addr>)
    at common/shared.c:1555
#9  0x0000555b85c84da5 in ft_init_av_dst_addr (av_ptr=0x555b85eda2e8, ep_ptr=0x555b85eda520, remote_addr=remote_addr@entry=0x555b85c8f3d0 <remote_fi_addr>)
    at common/shared.c:1574
#10 0x0000555b85c84f74 in ft_init_av () at common/shared.c:1531
#11 0x0000555b85c85859 in ft_init_fabric () at common/shared.c:1336
#12 0x0000555b85c789b7 in run () at benchmarks/rdm_pingpong.c:109
#13 main (argc=4, argv=0x7ffd3979da08) at benchmarks/rdm_pingpong.c:109

@shijin-aws
Copy link
Contributor

shijin-aws commented Sep 19, 2024

@aingerson the latest push fixed the earlier segfaults, but we have some efa-specific unit test failure, I trying to get to the bottom

@@ -987,6 +987,13 @@ static int efa_rdm_ep_close(struct fid *fid)
}

if (efa_rdm_ep->shm_ep) {
ret = fi_close(&efa_rdm_ep->shm_srx->fid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have other places that can close shm resources, including https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/rdm/efa_rdm_ep_fiops.c#L1048-L1059

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And our unit tests have tons of places that replies on closing shm ep hackily as fi_close: https://github.com/ofiwg/libfabric/blob/main/prov/efa/test/efa_unit_test_ep.c#L126-L131. These tests are all failed by this change. I think we need to modify our unit tests that doesn't close shm ep explicitly in the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shijin-aws Oh that is funky. I didn't realize that. I added a ref count in the shm srx import to make sure it gets closed properly even though there isn't really anything to be closed for shm so I think dropping this close will cause failures in the regular case. I think you probably at minimum need a wrapper around that hack to close both the shm ep and the shm srx. Do you think that would be enough?

Copy link
Contributor

@shijin-aws shijin-aws Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am preparing a patch to fix that. We can use fi_setopt(.. FI_OPT_SHARED_MEMORY_PERMMITED, false) to disable shm in the test. Will let you know when it's ready

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet, thank you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But generally it is annoying that provider needs to manage more resources for peer providers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but I think we need to match all the peer resources to the same flow - the peer cq and peer counter are the same (efa needs to close the shm cq and counters). I think it needs to happen and this one was just missed so it's annoying to add it in later. That would be a perk of the link provider, to remove that handling from efa

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened a PR to clean up our tests: #10406. we need to rebase your PR after #10406 is merged and then make sure shm resources are cleaned correctly in https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/rdm/efa_rdm_ep_fiops.c#L1048-L1059 as well.

man/fi_peer.3.md Outdated
When importing any shared fabric object into a peer, the owner will create a
separate fid_peer_* for each peer provider it intends to import into. The owner
will pass this unique fid_peer_* into each peer through the context parameter of
the init call for the resource (ie fi_cq_open, fi_srx_context, fi_cntr_open,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ie => i.e.

/* shm srx fid (shm-owned) */
struct fid_ep *shm_srx;
/* shm peer_srx (efa-owned) */
struct fid_peer_srx *shm_peer_srx;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still not convinced that we need to have 3 peer srx related resources in efa ep, adding these 2, there are

  • peer_srx_ep
  • shm_srx
  • shm_peer_srx

If we want to make an analogy to peer cq and cntr

efa_cq has only 1 peer cq related field: shm_cq,
it opens the shm_cq via


shm_cq_attr.flags |= FI_PEER;
peer_cq_context.size = sizeof(peer_cq_context);
peer_cq_context.cq = cq->util_cq.peer_cq;
fi_cq_open(efa_domain->shm_domain, &shm_cq_attr,
				 &cq->shm_cq, &peer_cq_context);

I believe shm_cq plays a role similar to shm_srx here, util_cq.peer_cq plays the role of shm_peer_srx here.

If we make an analogy between util_cq and util_srx, shouldn't we make shm_peer_srx to be part of util_srx, as util_srx.peer_srx ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shijin-aws I think it depends on the use case of efa (I don't really know how efa uses the srx on the efa side). The issue (and difference between the CQ use case) is the peer_ops. shm needs to be able to set the peer_srx->peer_ops to its own ops. Every peer that has a separate set of peer_ops needs a separate peer_srx for dedicated use for that peer. If in efa you access all the util_srx functions directly and not through the owner_ops (as if you were a peer) then you're right, you don't need a separate peer_srx - you could put it in the util_srx and have that one be dedicated for shm. However, if in efa you are calling get_msg(), etc, you have to pass in a unique peer_srx that has the efa peer_ops for the util_srx to call later. I think you likely need two (one for efa and one for shm) because I think you just go directly through the util_generic_recvmsg call which directly initiates the start_msg() flow.
For the CQ case, there are no peer_ops, so the peers can technically share the same peer_cq because the only purpose is to call the owner_ops which won't change from peer to peer.
So for the three resources you're talking about, I think each one is needed for a slightly different purpose:

  • peer_srx_ep -> peer_srx for the efa-owned srx, for use with efa (owner_ops = util_srx_ops, peer_ops = efa_peer_ops)
  • shm_srx -> shm fid for srx (shm doesn't need it but in case it did need anything extra, this is to track it)
  • shm_peer_srx -> peer_srx for the efa-owned srx, for use with shm (owner_ops = util_srx_ops, peer_ops = shm_peer_ops)

@shijin-aws
Copy link
Contributor

shijin-aws commented Oct 4, 2024

@aingerson still a lot of efa unit test failure in AWS CI, will get back you later today

@shijin-aws
Copy link
Contributor

I hit a segmentation fault around this line inside efa_rdm_ep_close

if (efa_rdm_ep->shm_ep) {
		ret = fi_close(&efa_rdm_ep->shm_srx->fid);

@shijin-aws
Copy link
Contributor

I figure it out. EFA provider create shm_ep during fi_endpoint, but only create srx resources during fi_enable. Our unit tests have tests that only create/close ep without enable. Your fi_close shm_srx and shm_peer_srx should explicitly check efa_rdm_ep->shm_srx and efa_rdm_ep->shm_peer_srx before doing it.

@shijin-aws
Copy link
Contributor

I reworked your efa change: shijin-aws@57749f6 please cherry-pick

Add clarification in the man page indicating that the owner is
responsible for creating unique fi_peer_*_contexts for each peer
and that the peers are only allowed to set the peer ops of that
context.

Signed-off-by: Alexia Ingerson <[email protected]>
The peer API has been updated to specify that the owner must allocate
the peer's fid_peer_srx. The shm implementation was allocating its
own internal fid_peer_srx.
This updates the shm implementation to assume it has a unique
fid_peer_srx and updates the imported fid_peer_srx peer_ops, saving
a pointer to the fid_peer_srx instead of the internal fid_ep which
required a wrapper function to get back to the fid_peer_srx

It also returns an internal fid_ep for the created srx which is used
to close the srx by the owner. Even though shm doesn't need anything
attached to the internal fid_ep, it is there for consistency and to
track the domain reference counting for errors.

This patch also moves the srx specific functions into smr_domain
where they belong

Signed-off-by: Alexia Ingerson <[email protected]>
The previous definition of the peer API didn't specify who allocated the
second peer structure (the one referenced by the peer). The shm implementation
was choosing to duplicate the imported srx and set it internally. The new
definition specifies that the owner handle the duplication of the peer resource
which is then imported into the peer to just set. Shm has been updated accordingly
but efa needs to be updated to create a second peer_srx and set the fields to the
original one for the peer to reference the owner_ops correctly.

This also adds a missing fi_close for the shm srx resource

Signed-off-by: Alexia Ingerson <[email protected]>
Signed-off-by: Shi Jin <[email protected]>
@aingerson
Copy link
Contributor Author

@shijin-aws Thanks for fixing, Shi! Looks like everything is passing. I think this is finally good to go know. Feel free to merge if that looks ok to you

@shijin-aws shijin-aws merged commit 527d320 into ofiwg:main Oct 11, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants