Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/efa: Use READ NACK protocol if P2P is not available for RDMA read #10363

Merged
merged 3 commits into from
Sep 11, 2024

Conversation

jiaxiyan
Copy link
Contributor

@jiaxiyan jiaxiyan commented Sep 6, 2024

p2p is required for RDMA read because the user buffer has to be registered to efa device.
When p2p is not available, receiver should send a NACK packet to make sender fall back to long CTS protocol to avoid RDMA read.

@jiaxiyan jiaxiyan requested a review from a team September 6, 2024 21:58
@shijin-aws
Copy link
Contributor

Can you also update https://github.com/ofiwg/libfabric/blob/main/prov/efa/docs/efa_rdm_protocol_v4.md to explain this feature as part of READ NACK protocol?

The emulated long-read write subprotocol uses RDMA read to emulate
a write operation. When p2p is not available, sender should fall
back to emulated long CTS write protocol.
Also change the log level to info.

Signed-off-by: Jessie Yang <[email protected]>
@shijin-aws
Copy link
Contributor

Use long CTS protocol

I prefer to change it to Use READ NACK protocol

@jiaxiyan jiaxiyan changed the title prov/efa: Use long CTS protocol if P2P is not available for RDMA read prov/efa: Use READ NACK protocol if P2P is not available for RDMA read Sep 9, 2024
if (err) {
if (err == -FI_ENOMR) {
if (efa_rdm_peer_support_read_nack(peer))
/* Only set the flag here. The NACK
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where did runtread post read nack pkt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See line 919

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we are only posting read nack pkt for runt read when we get all pkts that were runt. So we cannot post read nack here. This does make the refactor a bit less useful...

Copy link
Contributor Author

@jiaxiyan jiaxiyan Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emulated long-read write did not handle MR exhaustion. The refactor will check for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we don't need efa_rdm_rxe_map_insert for runt read either.

We don't need efa_rdm_rxe_map_insert for runt read because it is a multi-req protocol and we already inserted it earlier.

Emulated long-read write did not handle MR exhaustion. The refactor will check for that.

Yeah, I agree. I was just thinking how to make efa_rdm_pke_post_remote_read_or_nack cover the nack pkt post for all pkt types. It seems we cannot do that for runting read

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with documenting in efa_rdm_pke_post_remote_read_or_nack that why we don't need map insert and nack post for runting read.

if (OFI_UNLIKELY(err)) {
EFA_WARN(FI_LOG_CQ,
"RDMA post read or queue failed.\n");
efa_base_ep_write_eq_error(&ep->base_ep, err, FI_EFA_ERR_RDMA_READ_POST);
efa_rdm_rxe_release(rxe);
efa_rdm_pke_release_rx(pkt_entry);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we know why we were releasing pkt entry here again since we already released in L562

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like a double free to me so I remove it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, seems a bug

int p2p_avail;

pkt_type = efa_rdm_pke_get_base_hdr(pkt_entry)->type;
p2p_avail = efa_rdm_ep_use_p2p(ep, rxe->desc[0]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe? efa_rdm_ep_use_p2p can return -FI_ENOSYS right?

/* Only set the flag for runting read. The NACK
* packet is sent after all runting read
* RTM packets have been received */
if (!efa_rdm_pkt_type_is_runtread(pkt_type)) {
Copy link
Contributor

@shijin-aws shijin-aws Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

efa_rdm_rxe_map_insert is a protocol implementation detail, and only works for pkt type that has msg_id. So we shouldn't check the op code here, but check the pkt type. Since msg_id is only present in rtm pkt type, it should be something like efa_rdm_pkt_type_is_rtm

For cleaner logic, can we also refactor the code block to

if (efa_rdm_pkt_type_is_runtread(pkt_type)) /* do nothing for runting read */
   return 0;

if (efa_rdm_pkt_type_is_rtm(pkt_type))
   efa_rdm_rxe_map_insert(&ep->rxe_map, pkt_entry, rxe);

err = efa_rdm_ope_post_send_or_queue(rxe, EFA_RDM_READ_NACK_PKT);

Comment on lines 108 to 110
efa_rdm_pke_post_remote_read_or_nack(struct efa_rdm_ep *ep,
struct efa_rdm_pke *pkt_entry,
struct efa_rdm_ope *rxe)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Aligning with spaces (don't care personally). A few other places as well.

when the receiver is unable to register a memory region for the RDMA read operation.
Failure to register the memory region is typically because of a hardware limitation.
when the receiver is unable to register a memory region for the RDMA read operation,
typically because of a hardware limitation, or when P2P support is unavailable for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"p2p support is unavailable" is also a hardware restriction

RDMA read requires p2p. Use READ NACK protocol when p2p is
not available.
Also change the log level to info.

Signed-off-by: Jessie Yang <[email protected]>
@shijin-aws
Copy link
Contributor

bot:aws:retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants