cmd: extend default timeout when needed in flux overlay errors
Problem: On systems with ~10K nodes, `flux overlay errors` sometimes
reports "Connection timed out" for some ranks for which RPCs are
issued on the first iteration. The problem seems to be that the
timeout starts immediately when `flux_future_then(3)` is called, but
for large systems with a flat TBON the program may not re-enter the
reactor for >0.5s due to the size of the initial payload.

While one solution would be to delay sending _any_ RPCs until the
first time the check watcher is called, this unnecessarily extends
the runtime of the program by at least the initial payload processing
time. Instead, scale the timeout for large systems (>2K nodes) by
the size of the system, such that 10K-node systems get a roughly 2.5s
timeout, which seems to be a safe value.
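
As a rough illustration of the scaling rule described above, here is a minimal sketch. The helper name `scaled_timeout` is hypothetical and does not appear in the commit, which instead multiplies the existing `default_timeout` in place. The threshold (2048) and divisor (2000) are taken from the diff.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the commit): return the timeout
 * that results from applying the commit's scaling rule to a base
 * timeout for an instance of the given size.
 *
 * Instances of 2048 ranks or fewer keep the base timeout. Larger
 * instances get base * (size / 2000.), mirroring the expression
 * used in the diff.
 */
double scaled_timeout (double base, uint32_t size)
{
    if (size > 2048)
        return base * (size / 2000.);
    return base;
}
```

With the 0.5s default, `scaled_timeout (0.5, 10000)` gives 2.5s, matching the figure quoted above, while `scaled_timeout (0.5, 2048)` leaves the timeout at 0.5s. Because the divisor (2000) is slightly smaller than the threshold (2048), the first scaled value is a little above the base rather than equal to it.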

Note that a long timeout is not as much of a problem as in previous
versions of the program where overlay.health RPCs were sent serially,
since the longer timeout can now happen in parallel.
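
To make the serial-versus-parallel point concrete, the sketch below uses a deliberately simplified model: every unresponsive rank costs exactly one full timeout, and payload processing time is ignored. Both function names are hypothetical and are not taken from the program.

```c
/* Simplified model, for illustration only: worst-case time spent
 * waiting on unresponsive ranks when each RPC is sent only after
 * the previous one has completed or timed out.
 */
double worst_case_wait_serial (double timeout, unsigned int unresponsive)
{
    return timeout * unresponsive;
}

/* Same model, but with all RPCs outstanding at once: the timeouts
 * overlap, so the wait is bounded by a single timeout period.
 */
double worst_case_wait_parallel (double timeout, unsigned int unresponsive)
{
    return unresponsive > 0 ? timeout : 0.0;
}
```

Under this model, ten unresponsive ranks with a 2.5s timeout would cost 25s serially but about 2.5s when the RPCs are in flight concurrently.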
grondo committed Jan 31, 2025
1 parent 5c3ee62 commit d0bf6e6
Showing 1 changed file with 10 additions and 0 deletions.
10 changes: 10 additions & 0 deletions src/cmd/builtin/overlay.c
@@ -1089,8 +1089,18 @@ static int subcmd_errors (optparse_t *p, int ac, char *av[])
{
    flux_t *h = builtin_get_flux_handle (p);
    struct overlay_errors *ctx = NULL;
    uint32_t size;
    double timeout;

    /* On large systems with a flat tbon, the default timeout of 0.5s
     * may not be long enough because the program may spend a long time
     * simply processing the initial JSON response payload for rank 0, and
     * meanwhile the then timeout is ticking. Therefore, if the current
     * instance is large, scale the timeout by the instance size:
     */
    if (flux_get_size (h, &size) == 0 && size > 2048)
        default_timeout *= (size / 2000.); /* ~2.5s timeout on 10K nodes */

    timeout = optparse_get_duration (p, "timeout", default_timeout);
    if (timeout == 0)
        timeout = -1.0; // disabled