-
Notifications
You must be signed in to change notification settings - Fork 434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT/GGA: Support is_connected API #10403
Conversation
src/uct/ib/mlx5/gga/gga_mlx5.c
Outdated
uint32_t addr_qp = 0; | ||
uct_gga_mlx5_ep_address_t *ep_addr; | ||
|
||
if (params->field_mask & UCT_EP_IS_CONNECTED_FIELD_EP_ADDR) { | ||
ep_addr = (uct_gga_mlx5_ep_address_t*)params->ep_addr; | ||
addr_qp = uct_ib_unpack_uint24(ep_addr->qp_num); | ||
} | ||
|
||
return uct_rc_mlx5_ep_is_connected_to_addr(tl_ep, params, addr_qp) && | ||
uct_base_ep_is_connected(tl_ep, params); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's align with other transports which use uct_base_ep_is_connected
:
uint32_t addr_qp = 0; | |
uct_gga_mlx5_ep_address_t *ep_addr; | |
if (params->field_mask & UCT_EP_IS_CONNECTED_FIELD_EP_ADDR) { | |
ep_addr = (uct_gga_mlx5_ep_address_t*)params->ep_addr; | |
addr_qp = uct_ib_unpack_uint24(ep_addr->qp_num); | |
} | |
return uct_rc_mlx5_ep_is_connected_to_addr(tl_ep, params, addr_qp) && | |
uct_base_ep_is_connected(tl_ep, params); | |
uct_gga_mlx5_ep_address_t *ep_addr; | |
uint32_t addr_qp; | |
if (!(params->field_mask & UCT_EP_IS_CONNECTED_FIELD_EP_ADDR)) { | |
return 0; | |
} | |
if (!uct_base_ep_is_connected(tl_ep, params)) { | |
return 0; | |
} | |
ep_addr = (uct_gga_mlx5_ep_address_t*)params->ep_addr; | |
addr_qp = uct_ib_unpack_uint24(ep_addr->qp_num); | |
return uct_rc_mlx5_ep_is_connected_to_addr(tl_ep, params, addr_qp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ep_addr is optional, so cannot use:
if (!(params->field_mask & UCT_EP_IS_CONNECTED_FIELD_EP_ADDR)) {
return 0;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is not addr_qp == 0
means false
for is_connected
state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it means we skip the comparison between addr_qp and qp_num (and only compare device address)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
src/uct/ib/mlx5/gga/gga_mlx5.c
Outdated
if (params->field_mask & UCT_EP_IS_CONNECTED_FIELD_EP_ADDR) { | ||
ep_addr = (uct_gga_mlx5_ep_address_t*)params->ep_addr; | ||
addr_qp = uct_ib_unpack_uint24(ep_addr->qp_num); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is some code duplication with uct_rc_mlx5_base_ep_is_connected
to eliminate it we can create new parent struct for uct_gga_mlx5_ep_address_t and uct_rc_mlx5_ep_address_t - that will contain only qp_num. This struct can be called uct_rc_mlx5_base_ep_address_t.
uct_rc_mlx5_base_ep_is_connected will cast params->ep_addr to uct_rc_mlx5_base_ep_address_t*, if the flag UCT_EP_IS_CONNECTED_FIELD_EP_ADDR is specified.
uct_rc_mlx5_base_ep_address_t super; | ||
uint8_t flags; | ||
uct_ib_uint24_t flush_rkey; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like braking of backward compatibility...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there currently a scenario where gga transport is selected by default by UCP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it can be selected for rma_bw
lane
src/uct/ib/mlx5/gga/gga_mlx5.c
Outdated
const uct_ep_is_connected_params_t *params) | ||
{ | ||
return uct_rc_mlx5_base_ep_is_connected(tl_ep, params) && | ||
uct_base_ep_is_connected(tl_ep, params); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why need uct_base_ep_is_connected? seems rc_mlx5 does not need it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to compare iface address (only required for GGA)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both will compare IB device address which is redundant
can we reuse common code with uct_gga_mlx5_iface_is_reachable_v2 and compare iface address directly?
BTW @evgeny-leksikov @amaslenn the gga transport does not currently implement returning info string from uct_gga_mlx5_iface_is_reachable_v2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW @evgeny-leksikov @amaslenn the gga transport does not currently implement returning info string from uct_gga_mlx5_iface_is_reachable_v2
thanks, I'll add
src/uct/ib/mlx5/gga/gga_mlx5.c
Outdated
return uct_rc_mlx5_base_ep_is_connected(tl_ep, params) && | ||
uct_base_ep_is_connected(tl_ep, params); | ||
uct_gga_mlx5_iface_is_same_device(tl_ep->iface, iface_addr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: swap order (call uct_gga_mlx5_iface_is_same_device first)
src/uct/ib/mlx5/gga/gga_mlx5.c
Outdated
const uct_iface_is_reachable_params_t *params) | ||
uct_gga_mlx5_iface_is_same_device(const uct_iface_h tl_iface, | ||
const uct_iface_addr_t *iface_addr, | ||
const uct_iface_is_reachable_params_t *params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uct_iface_is_reachable_params_t
looks redundant here, maybe just const char* str, size_t length
and use ucs_snprintf_safe
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we drop uct_iface_is_reachable_params_t
then there will be code duplication:
char *info_str_buf = UCS_PARAM_VALUE(UCT_IFACE_IS_REACHABLE_FIELD,
params, info_string, INFO_STRING,
NULL);
size_t info_str_buf_len = UCS_PARAM_VALUE(UCT_IFACE_IS_REACHABLE_FIELD,
params, info_string_length,
INFO_STRING_LENGTH, 0);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can to move most of logic of this function up, to uct_gga_mlx5_iface_is_reachable_v2().
So uct_gga_mlx5_iface_is_same_device() will get only tl_iface and iface_addr, and return 0/1. It will contain only the logic in lines 627-629. Then no need to define dummy params inside uct_gga_mlx5_ep_is_connected()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there will still be code duplication:
if (gga_addr == NULL) {
uct_iface_fill_info_str_buf(params, "iface address is empty");
return 0;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uct_iface_fill_info_str_buf(params, "iface address is empty"); should be only in uct_gga_mlx5_iface_is_reachable_v2; uct_gga_mlx5_ep_is_connected can just check if (gga_addr == NULL)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (gga_addr == NULL)
will be the duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it's small enough (1 line) duplication and it can avoid extra initialization of dummy struct so it's worth it
src/uct/ib/mlx5/gga/gga_mlx5.c
Outdated
if (gga_addr->be_sys_image_guid != | ||
device->dev_attr.orig_attr.sys_image_guid) { | ||
local_guid = device->dev_attr.orig_attr.sys_image_guid; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (gga_addr->be_sys_image_guid != | |
device->dev_attr.orig_attr.sys_image_guid) { | |
local_guid = device->dev_attr.orig_attr.sys_image_guid; | |
local_guid = device->dev_attr.orig_attr.sys_image_guid; | |
if (gga_addr->be_sys_image_guid != local_guid) { |
src/uct/ib/mlx5/gga/gga_mlx5.c
Outdated
const uct_iface_is_reachable_params_t *params) | ||
uct_gga_mlx5_iface_is_same_device(const uct_iface_h tl_iface, | ||
const uct_iface_addr_t *iface_addr, | ||
const uct_iface_is_reachable_params_t *params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can to move most of logic of this function up, to uct_gga_mlx5_iface_is_reachable_v2().
So uct_gga_mlx5_iface_is_same_device() will get only tl_iface and iface_addr, and return 0/1. It will contain only the logic in lines 627-629. Then no need to define dummy params inside uct_gga_mlx5_ep_is_connected()
@shasson5 pls also note the merge conflicts |
@yosefe please review |
What?
Support is_connected API
Why?
Allow GGA to work properly during EP reconfiguration