cmd: extend default timeout when needed in flux overlay errors
Problem: On systems with ~10K nodes, `flux overlay errors` sometimes
reports "Connection timed out" for some ranks for which RPCs are
issued on the first iteration. The problem seems to be that the
timeout starts immediately when `flux_future_then(3)` is called, but
for large systems with a flat TBON the program may not re-enter the
reactor for >0.5s due to the size of the initial payload.

While one solution would be to delay sending _any_ RPCs until the
first time the check watcher is called, this unnecessarily extends
the runtime of the program by at least the initial payload processing
time. Instead, scale the timeout for large systems (>2K nodes) by
the size of the system, such that 10K node systems get a timeout of
roughly 2.5s, which seems to be a safe value.

Note that a long timeout is not as much of a problem as in previous
versions of the program where overlay.health RPCs were sent serially,
since the longer timeout can now happen in parallel.
grondo committed Jan 31, 2025
1 parent 3942fa2 commit 2e38408
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions src/cmd/builtin/overlay.c
@@ -1090,8 +1090,17 @@ static int subcmd_errors (optparse_t *p, int ac, char *av[])
{
    flux_t *h = builtin_get_flux_handle (p);
    struct overlay_errors *ctx = NULL;
    uint32_t size;
    double timeout;

    /* On large systems with a flat tbon, the default 0.5s timeout may be
     * too short because processing the initial JSON response for rank 0
     * can take longer than that. To address this, increase the timeout
     * based on the instance size:
     */
    if (flux_get_size (h, &size) == 0 && size > 2048)
        default_timeout *= (size / 2000.); /* ~2.5s timeout on 10K nodes */

    timeout = optparse_get_duration (p, "timeout", default_timeout);
    if (timeout == 0)
        timeout = -1.0; // disabled
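
The scaling rule in this hunk can be checked in isolation. The following is a minimal sketch, not code from overlay.c: `scale_timeout` is a hypothetical helper introduced only for illustration, and it takes the base timeout and instance size as arguments instead of modifying `default_timeout` and calling `flux_get_size()`.

```c
#include <stdint.h>

/* Hypothetical helper (not part of overlay.c): return the timeout to use
 * for an instance of `size` ranks, given a base timeout in seconds.
 * Mirrors the rule in the hunk above: sizes of 2048 or fewer keep the
 * base timeout, larger sizes multiply it by size / 2000.
 */
static double scale_timeout (double base, uint32_t size)
{
    if (size > 2048)
        return base * (size / 2000.);
    return base;
}
```

With the 0.5s default, `scale_timeout (0.5, 10000)` gives 2.5s, matching the figure in the commit message, while any size up to 2048 leaves the timeout at 0.5s. Because the threshold is 2048 and the divisor is 2000, the timeout steps up slightly (to about 0.51s) as soon as scaling begins at 2049 ranks.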