Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
cmd: extend default timeout when needed in
flux overlay errors
Problem: On systems with ~10K nodes, `flux overlay errors` sometimes reports "Connection timed out" for some ranks for which RPCs are issued on the first iteration. The problem seems to be that the timeout starts immediately when `flux_future_then(3)` is called, but for large systems with a flat TBON the program may not re-enter the reactor for >0.5s due to the size of the initial payload. While one solution would be to delay sending _any_ RPCs until the first time the check watcher is called, this unnecessarily extends the runtime of the program by at least the initial payload processing time. Instead, scale the timeout for large systems (>2K nodes) by the size of system, such that 10K node systems get a roughly 2.5s timeout, which seems to be a safe value. Note that a long timeout is not as much of a problem as in previous versions of the program where overlay.health RPCs were sent serially, since the longer timeout can now happen in parallel.
- Loading branch information