Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/verbs: add param for setting RDMA CM ToS #10360

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jakemoroni
Copy link

@jakemoroni jakemoroni commented Sep 4, 2024

Opens:

  • I am unable to test on Windows. I added a no-op handler for rdma_set_option so it should build and have no effect if no ToS is set. If a ToS is set, then it should result in a warning.
  • I'm not quite sure about the calls in verbs_domain_xrc.c. They may not all be necessary.
  • The valid range for RDMA CM ToS is 0-255, but the new FI_VERBS_TOS param takes a range of -1 through 255, with -1 being a special value to indicate "no ToS set" (default), which allows the call to rdma_set_option to be completely skipped. My original plan was to just have 0 be the default and always call rdma_set_option but this would introduce a behavioral change on systems where the default is non-0 and would add extra overhead in scenarios where the ToS is a don't care.

By default, if the RDMA CM application does not explicitly set a ToS, then the system default is used. This system default can be changed via the default_roce_tos configfs param, but this is global to the system and is somewhat cumbersome to deal with.

This change introduces a new environment param: FI_VERBS_TOS

This param can be used to explicitly set the ToS via the rdma_set_option API. If the ToS is set, then the system default is ignored and the new value is used instead.

This allows for multiple concurrent workloads to use different ToS values.

Valid range is -1 through 255. If unset or set to -1, then the call to rdma_set_option is omitted, thus preserving existing behavior.

Copy link
Contributor

@j-xiong j-xiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general. Just some nitpicks.

prov/verbs/src/verbs_domain_xrc.c Outdated Show resolved Hide resolved
prov/verbs/src/verbs_ofi.h Outdated Show resolved Hide resolved
prov/verbs/src/verbs_ofi.h Outdated Show resolved Hide resolved
@j-xiong
Copy link
Contributor

j-xiong commented Sep 6, 2024

Please add a description of the new variable to the verbs provider man page. You can also drop the "RFC" part from the title.

@jakemoroni jakemoroni changed the title RFC: prov/verbs: add param for setting RDMA CM ToS prov/verbs: add param for setting RDMA CM ToS Sep 6, 2024
@jakemoroni
Copy link
Author

Please add a description of the new variable to the verbs provider man page. You can also drop the "RFC" part from the title.

Done. Also rebased and retested.

By default, if the RDMA CM application does not explicitly
set a ToS, then the system default is used. This system default
can be changed via the default_roce_tos configfs param, but this
is global to the system and is somewhat cumbersome to deal with.

This change introduces a new environment param: FI_VERBS_TOS

This param can be used to explicitly set the ToS via the
rdma_set_option API. If the ToS is set, then the system default
is ignored and the new value is used instead.

This allows for multiple concurrent workloads to use different
ToS values.

Valid range is -1 through 255. If unset or set to -1, then
the call to rdma_set_option is omitted, thus preserving existing
behavior.

Not supported/tested on Windows.

Signed-off-by: Jacob Moroni <[email protected]>
@shefty
Copy link
Member

shefty commented Sep 10, 2024

The fi_tx_attr includes a tclass value that can be used to control the service level. It supports DSCP via helper functions. Can this be used instead?

@a-szegel
Copy link
Contributor

bot:aws:retest

@jakemoroni
Copy link
Author

The fi_tx_attr includes a tclass value that can be used to control the service level. It supports DSCP via helper functions. Can this be used instead?

I do think this is a good idea. I have some questions regarding how to map these libfabric tclass values to an appropriate "TOS" value though.

The RDMA CM API, like to the regular socket API, allows you to specify an 8 bit "TOS" value. However, the definition of this octet has changed over the years and seems to have converged on two incompatible encodings:

  • RFC 1349 TOS (legacy)
  • DSCP

For either encoding, the value is set using the same API (DSCP values need to be left shifted by 2 first) and there's unfortunately no way to indicate the encoding. I assume that the RDMA CM TOS is semantically similar to the regular socket TOS in this regard.

DSCP is the more modern encoding, but some places still assume that this value is an RFC 1349 TOS. For example, Linux assumes that the value specified in a setsockopt(IP_TOS) call on an IPv4 socket is an RFC 1349 TOS, so if you provide a DSCP value here, it will be mapped to a totally incorrect user priority (EF maps to a lower priority than AF12, etc.).

With a regular IPv4 socket, this can be worked around by just setting the priority afterwards with setsockopt(SO_PRIORITY), but this is not possible with RDMA CM.

I'm wondering if it makes sense to support a few options here:

  • An env param like FI_VERBS_TOS which can be set to: off (default), tos, dscp, or some value in the range of 0x00-0xFF

    • off: No TOS is set at all; RDMA CM default (configurable via sysfs) is used
    • tos: The libfabric tclass gets mapped to a RFC 1349 style TOS value
    • dscp: The libfabric tclass gets mapped to a DSCP value
    • 0x00-0xFF: The raw value is set directly

    ** For both the tos and dscp values, if the libfabric tclass value has the DSCP bit set (FI_TC_DSCP), then value is left shifted by 2 and set directly.

WDYT?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants