cmd: extend default timeout when needed in flux overlay errors

Problem: On systems with ~10K nodes, `flux overlay errors` sometimes reports "Connection timed out" for some of the ranks whose RPCs are issued on the first iteration. The cause seems to be that the timeout starts as soon as `flux_future_then(3)` is called, but on large systems with a flat TBON the program may not re-enter the reactor for more than the default 0.5s timeout because of the size of the initial payload.

One solution would be to delay sending _any_ RPCs until the first time the check watcher is called, but that unnecessarily extends the runtime of the program by at least the initial payload processing time.

Instead, scale the timeout on large systems (>2K nodes) in proportion to the size of the system, so that a 10K node system gets a roughly 2.5s timeout, which seems to be a safe value. Note that a long timeout is less of a problem than in previous versions of the program, where overlay.health RPCs were sent serially, since the longer timeouts can now elapse in parallel.
flux-framework · Jan 31, 2025 · 2e38408
1 parent 3942fa2
Showing 1 changed file with 9 additions and 0 deletions.
diff --git a/src/cmd/builtin/overlay.c b/src/cmd/builtin/overlay.c
@@ -1090,8 +1090,17 @@ static int subcmd_errors (optparse_t *p, int ac, char *av[])
 {
     flux_t *h = builtin_get_flux_handle (p);
     struct overlay_errors *ctx = NULL;
+    uint32_t size;
     double timeout;
 
+    /* On large systems with a flat tbon, the default 0.5s timeout may be
+     * too short because processing the initial JSON response for rank 0
+     * can take longer than that. To address this, increase the timeout
+     * based on the instance size:
+     */
+    if (flux_get_size (h, &size) == 0 && size > 2048)
+        default_timeout *= (size / 2000.); /* ~2.5s timeout on 10K nodes */
+
     timeout = optparse_get_duration (p, "timeout", default_timeout);
     if (timeout == 0)
         timeout = -1.0; // disabled