-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCP/PROTO: UCX_PROTO_INFO=used option #10395
base: master
Are you sure you want to change the base?
Conversation
@@ -241,7 +278,8 @@ ucp_proto_select_elem_info(ucp_worker_h worker, | |||
/* Print contents */ | |||
ucp_proto_table_row_separator(strb, col_width, 3); | |||
ucs_array_for_each(row_elem, &table) { | |||
ucs_string_buffer_appendf(strb, info_row_fmt, col_width[0], | |||
ucs_string_buffer_appendf(strb, info_row_fmt, row_elem->counter_str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the meaning of these numbers will not be obvious to users. Please add the value to a separate column with the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed the existing design - as you can see there are no really named columns, e.g. for message range. The column name is actually just the config name. So if I add a new named column just for selections count - that would look ridiculous IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps, we should add a row with column names: size range / protocol / resources / use count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the added value of this change.
I mean, to me it seems this is a debug feature meant to be used mostly by UCX developers who know what to look for. This probably explains why these column names were not introduced so far - there is no much benefit I guess.
On the other hand - it would also increase the output size - and this is exactly the issue I'm trying to solve with this commit.
src/ucp/proto/proto.h
Outdated
@@ -141,6 +141,9 @@ typedef struct { | |||
|
|||
/* Map of used lanes */ | |||
ucp_lane_map_t lane_map; | |||
|
|||
/* Selections count */ | |||
size_t selections; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe unsigned is enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@@ -108,7 +109,11 @@ ucp_proto_select_lookup(ucp_worker_h worker, ucp_proto_select_t *proto_select, | |||
proto_select->cache.value = select_elem; | |||
} | |||
|
|||
return ucp_proto_select_thresholds_search(select_elem, msg_length); | |||
thresh = ucp_proto_select_thresholds_search(select_elem, msg_length); | |||
#ifdef ENABLE_DEBUG_DATA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need it? Does it mean that used option will not work in the release mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, it won't work in release mode
The only reason to have ENABLE_DEBUG_DATA is to save memory and a bit of CPU cycles in the hot path (this increment).
I don't have strong opinion on whether it should work in release mode, I thought it's mostly used by UCX engineers, so can always be built in debug mode. Please share your thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo, we need it in release mode. We ask customers to provide this output quite often. Initial protocol selection is not a fast-path, or am i missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feature is not about initial protocol selection, it counts all the protocol selections during send execution, so in fast-path. So we add an increment operation to the fast-path, which is not a big deal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see would it be possible to to increase only on slow path? Then you would not need a counter, but rather a boolean indicating whether the protocol was selected at least once
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, it is possible, but it would reduce the added value IMO.
One of the ideas that we discussed on the meeting was the ability to show protocols whose selection count is above some limit. This would be useful to not display stats for auxilliary / init messages.
But look, maybe we can use a combined approach:
- We use counter, not boolean, memory overhead is the same
- When ENABLE_DEBUG_DATA is set, we collect precise stats (so does not harm release)
- When in release mode, we resort to storing 1 in this counter in the slow path (so we have some stats in release)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, unfortunately it's not that easy with Release mode..
The problem is that short protocol initialization triggers protocol selection with message size 0:
https://github.com/openucx/ucx/pull/10395/files#diff-b729ff9730727dbf69aa2e2f32eab876f0d1f68803a2b7714a911fe056909004R730
ucp_proto_select_short_init -> ucp_proto_select_lookup(..., 0)
This is a problem, because we don't want to take into account this "probing" selection. If we count it, then we basically display all the protocols, not actually selected ones. If we don't count it - then we can't detect selection of short protocols, because ucp_proto_select_lookup_slow
will not be called anymore..
In DEBUG mode I just revert this probing selection:
#ifdef ENABLE_DEBUG_DATA
/* Revert stats counter for probe operation */
--((ucp_proto_threshold_elem_t *)thresh)->proto_config.selections;
#endif
But I don't see a good solution for Release mode...
So now I'm thinking that we should probably just enable this feature for all modes (without checking ENABLE_DEBUG_DATA)
@@ -2878,6 +2893,7 @@ static void ucp_worker_destroy_eps(ucp_worker_h worker, | |||
void ucp_worker_destroy(ucp_worker_h worker) | |||
{ | |||
ucs_debug("destroy worker %p", worker); | |||
ucp_worker_trace_configs(worker); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems we display the data only during working destroy, but what if we have a long running application and we want to show the info without waiting for app exit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This scenario is not really covered by the initial impl, as it does not seem to be a mainstream use case.
Here are 2 ideas what we can do:
- Display all the protocols as before with
UCX_PRORO_INFO=y
option - Implement periodic timer (with configurable interval, 10s default) in worker using
timerfd_create
andtimerfd_settime
, that would triggerucp_worker_trace_configs
.
Personally I believe it's not really worth the effort. But if you think it's useful, can be done maybe in the next PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we print the info after the count reaches a certain threshold?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In short - I think that may not be always reliable, and requires operations on fast-path
Currently the idea is that we support this feature in 2 modes:
- Debug mode (
ENABLE_DEBUG_DATA
) - when we get precise statistics. Requires adding increment operation to the fast path. - Release mode - do not use counter, but only detect used protocols in the slow path. This way we don't introduce delays to the fast-path.
Btw, what's your opinion on that? Should have have these 2 modes of operation, or we just enable this feature always?
Basically in release mode the counter is never incremented above 1, so we cannot rely on this.
And in debug mode this would require to introduce one more if branch to the fast-path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we need to try making the feature available also in release mode without adding overhead when it is not enabled
What?
Currently when UCX_PROTO_INFO is enabled, it shows tons of output with details of all the selected protocols, most of which is never used by the application. The idea is to introduce a new option to the config variable ("used"), to limit display to only actually used protocols.
There is already a wildcard option may improve the output, but still it's not enough, because you must know in advance what to look for. On the contrary, displaying used protocols is showing the exact choice made by UCX without guesses, and it also shows the amount of selections per threshold
Why?
Easy of debugging and diagnostics
How?
When
ENABLE_DEBUG_DATA
is enabled, it displays the exact amount of selections per threshold