1000 elcap nodes are shown as "lost connection" on rank 0 but compute node thinks it is still connected #6626

garlick · 2025-02-11T23:53:48Z

Problem: about 1000 nodes of el cap got into a state where the rank 0 broker thought they were disconnected, but from the point of view of the nodes, the rank 0 broker was just unresponsive.

Specifically, on rank 0

[root@elcap1:conf.d]# flux overlay status |grep elcap1124
├─ 820 elcap1124: lost lost connection

but on the node, everything seemed OK except flux commands that needed to contact rank 0 would hang.

The actual TCP connection to the node appeared to be in connected state. Here is elcap12119 (another node in that state):

tcp        0 138488 eelcap12119:46878       eelcap1:8050            ESTABLISHED 188356/broker

and the same connection on elcap1

tcp   246536      0 eelcap1:8050            eelcap12119:46878       ESTABLISHED 3180474/broker
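
For reading the two lines above: in netstat's TCP listing the second and third columns are Recv-Q and Send-Q. A minimal Python sketch that pulls those columns out of the lines as pasted (the `TcpQueues` type and `parse_netstat_line` helper are made up for illustration):

```python
from typing import NamedTuple


class TcpQueues(NamedTuple):
    recv_q: int  # bytes received by the kernel but not yet read by the application
    send_q: int  # bytes sent by the application but not yet acknowledged by the peer
    local: str
    remote: str
    state: str


def parse_netstat_line(line: str) -> TcpQueues:
    """Parse one netstat TCP line: Proto Recv-Q Send-Q Local Foreign State ..."""
    fields = line.split()
    return TcpQueues(int(fields[1]), int(fields[2]), fields[3], fields[4], fields[5])


compute = parse_netstat_line(
    "tcp        0 138488 eelcap12119:46878       eelcap1:8050            ESTABLISHED 188356/broker"
)
rank0 = parse_netstat_line(
    "tcp   246536      0 eelcap1:8050            eelcap12119:46878       ESTABLISHED 3180474/broker"
)
print(compute.send_q, rank0.recv_q)
```

Read this way, the compute node has 138488 bytes queued for sending, and rank 0 has 246536 received bytes that its broker has not yet read.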

Stopping flux on the compute node ran into the systemd timeout, but problems immediately went away upon restart.

garlick · 2025-02-12T00:01:50Z

Note that the "lost connection" status indicates that a send on rank 0 failed with EHOSTUNREACH.

Because we have ZMQ_ROUTER_MANDATORY set[1], if the destination UUID is unknown (presumably because it disconnected), a zmq_send() fails with EHOSTUNREACH. We catch that and set SUBTREE_STATUS_LOST with the error "lost connection".

[1] https://libzmq.readthedocs.io/en/latest/zmq_setsockopt.html
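
The failure path described above can be modeled in a few lines. This is a toy Python sketch, not libzmq or flux-core code: `ToyRouter`, `send_to_child`, the status dictionary, and the initial "ok" state are invented for illustration. It assumes only what the comment states: with mandatory routing, a send to an unknown peer identity fails with EHOSTUNREACH, and the caller reacts by marking the subtree lost.

```python
import errno
import os


class ToyRouter:
    """Stand-in for a ROUTER socket with mandatory routing enabled (hypothetical)."""

    def __init__(self):
        self.peers = {}  # uuid -> list of queued messages

    def connect_peer(self, uuid):
        self.peers[uuid] = []

    def drop_peer(self, uuid):
        self.peers.pop(uuid, None)

    def send(self, uuid, msg):
        # Mandatory routing: an unknown destination is an error, not a silent drop.
        if uuid not in self.peers:
            raise OSError(errno.EHOSTUNREACH, os.strerror(errno.EHOSTUNREACH))
        self.peers[uuid].append(msg)


def send_to_child(router, status, uuid, msg):
    """Send msg; on EHOSTUNREACH record the child as lost."""
    try:
        router.send(uuid, msg)
    except OSError as e:
        if e.errno != errno.EHOSTUNREACH:
            raise
        status[uuid] = ("lost", "lost connection")


router = ToyRouter()
status = {"child-a": ("ok", "")}  # placeholder initial state
router.connect_peer("child-a")
send_to_child(router, status, "child-a", b"ping")  # delivered
router.drop_peer("child-a")                        # router forgets the peer
send_to_child(router, status, "child-a", b"ping")  # fails, child marked lost
print(status["child-a"])
```

Note that in this model the child is never told anything. Only the sender's view changes, which matches the one-sided symptom reported here.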

garlick · 2025-02-13T16:11:21Z

Hmm, we have tuned TCP_USER_TIMEOUT (see tcp(7)) to 2m on these systems

[tbon]
tcp_user_timeout = "2m"

Given that we have untransmitted data in the socket queue on both ends I would have expected the kernel to have forcibly closed these connections.
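
For context, TCP_USER_TIMEOUT is a per-socket option whose value is in milliseconds (tcp(7)). A minimal Python sketch of what a 2m setting amounts to at the socket level, assuming Linux. `fsd_to_ms` is a hypothetical helper that handles only simple `<number><unit>` strings; it is not flux's duration parser, and this is not how the broker applies the option.

```python
import socket

# Hypothetical helper: convert a simple duration such as "2m" to milliseconds.
_UNITS_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def fsd_to_ms(text: str) -> int:
    for suffix in ("ms", "s", "m", "h", "d"):  # test "ms" before "s" and "m"
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * _UNITS_MS[suffix])
    return int(float(text) * 1000)  # bare number: treated as seconds (assumption)


timeout_ms = fsd_to_ms("2m")

# TCP_USER_TIMEOUT is Linux-specific; Python only defines the constant there.
opt = getattr(socket, "TCP_USER_TIMEOUT", None)
if opt is not None:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.IPPROTO_TCP, opt, timeout_ms)
        applied = s.getsockopt(socket.IPPROTO_TCP, opt)
        print(applied)
```

Per tcp(7), a value greater than zero is the maximum time that transmitted data may remain unacknowledged, or buffered data may remain untransmitted because of a zero window, before TCP forcibly closes the connection.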

grondo · 2025-02-23T15:19:29Z

This issue occurred again today on elcap. Again ~1000 nodes affected.

garlick · 2025-02-23T18:39:29Z

Not addressing the root cause of this (which we have yet to understand), but as a stopgap, what if we add an optional heartbeat timeout?

Right now the heartbeat module is only loaded on rank 0. It publishes a heartbeat.pulse message at a configurable interval. What if we made a small change to load that module on all ranks? On rank 0 it would only publish the heartbeat. On follower ranks it would subscribe to it and could implement a configurable timeout.

It could then take some drastic action if the heartbeat is not received for some period of time.
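
The follower-side idea could look roughly like the following. This is a Python sketch of the timeout logic only, not the heartbeat module; `HeartbeatWatchdog`, its method names, the 300 s timeout, and the injected clock are all hypothetical.

```python
import time


class HeartbeatWatchdog:
    """Track the last heartbeat.pulse seen and report when it is overdue."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_pulse = clock()  # start the timer at construction

    def on_pulse(self):
        # Call from the heartbeat.pulse subscription callback.
        self.last_pulse = self.clock()

    def expired(self):
        # Call periodically; True means no pulse for longer than timeout_s.
        return self.clock() - self.last_pulse > self.timeout_s


# Simulated clock so the example runs instantly.
now = [0.0]
wd = HeartbeatWatchdog(timeout_s=300.0, clock=lambda: now[0])
now[0] = 120.0
wd.on_pulse()
now[0] = 400.0
early = wd.expired()  # 280 s since the last pulse
now[0] = 421.0
late = wd.expired()   # 301 s since the last pulse
print(early, late)
```

What the "drastic action" on expiry should be is deliberately left out of the sketch, as it is in the comment.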

garlick · 2025-02-24T23:04:56Z

The thoughts I had on this one are

the above heartbeat timeout probably makes sense to have as an option regardless
maybe there is a way to get zmq debugging enabled all the time without log spam. For example, create the monitor socket but don't connect to it? Or connect to it and keep the logs in a small circular buffer until logging is enabled?
I was vaguely wondering whether the short zmq socket queue depths on el cap could somehow cause deadlock at the zmq level, but that probably does not make sense
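
The circular-buffer idea from the second point could be sketched like this in Python. `EventRing` and the event strings are invented for illustration, and nothing here touches the zmq monitor API.

```python
from collections import deque


class EventRing:
    """Keep the last N events in memory; emit them only once logging is enabled."""

    def __init__(self, capacity, sink=print):
        self.buf = deque(maxlen=capacity)  # oldest entries drop off automatically
        self.sink = sink
        self.enabled = False

    def record(self, event):
        if self.enabled:
            self.sink(event)
        else:
            self.buf.append(event)

    def enable(self):
        # Flush the retained history first, then log live from here on.
        self.enabled = True
        while self.buf:
            self.sink(self.buf.popleft())


out = []
ring = EventRing(capacity=3, sink=out.append)
for i in range(5):
    ring.record(f"event {i}")
ring.enable()
ring.record("event 5")
print(out)
```

With a capacity of 3, the two oldest events are discarded silently, so memory stays bounded and nothing is logged until someone asks for it.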

Regarding housekeeping and drain state

In "WIP: run housekeeping, prolog, epilog in the flux systemd instance" (#6662), drain logic will have to move to the management node. In this failure mode, the management node sees a disconnect, so after that change, the node should get drained.
If we get the compute node unstuck with a heartbeat timeout, the broker would restart and any RPCs from the housekeeping script to the broker should fail and allow housekeeping to move on.
If the broker restarts but the housekeeping unit continues to run, "avoid scheduling jobs on compute nodes that are not cleaned up" (#6616) should prevent the scheduler from running jobs on it until it is done.

garlick mentioned this issue Mar 3, 2025

heartbeat: add optional timeout #6679

Merged
