1000 elcap nodes are shown as "lost connection" on rank 0 but compute node thinks it is still connected #6626
Comments
Note that the "lost connection" status indicates that a send on rank 0 failed with EHOSTUNREACH. Because we have ZMQ_ROUTER_MANDATORY set[1], if the destination UUID is unknown (presumably because it disconnected), a
[1] https://libzmq.readthedocs.io/en/latest/zmq_setsockopt.html
Hmm, we have tuned TCP_USER_TIMEOUT (see tcp(7)) to 2m on these systems:
    [tbon]
    tcp_user_timeout = "2m"
Given that we have untransmitted data in the socket queue on both ends, I would have expected the kernel to have forcibly closed these connections.
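For reference, here is a minimal sketch of what that setting means at the socket level. Per tcp(7), TCP_USER_TIMEOUT is given in milliseconds and bounds how long transmitted data may remain unacknowledged before the kernel forcibly closes the connection. This is not how the broker applies the option; `duration_to_ms` and `set_user_timeout` are hypothetical helpers written for illustration, and the option is Linux-specific.

```python
import socket

# Linux value of TCP_USER_TIMEOUT from <linux/tcp.h>, used as a fallback
# when the Python build does not export the constant.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)

_UNITS_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}


def duration_to_ms(text):
    """Convert a duration such as "2m", "30s" or "500ms" to milliseconds.

    Hypothetical helper: it only approximates the duration syntax used in
    the config above and is not the broker's parser.
    """
    # Check "ms" before "s" so that "500ms" is not parsed as seconds.
    for suffix in ("ms", "s", "m", "h"):
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * _UNITS_MS[suffix])
    return int(float(text) * 1000)  # bare number: assume seconds


def set_user_timeout(sock, text):
    """Set TCP_USER_TIMEOUT on sock and return the value the kernel reports."""
    ms = duration_to_ms(text)
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, ms)
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)
```

Reading the option back with `getsockopt` on a live connection is one way to confirm that the 2m value was actually applied to the socket in question.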
This issue occurred again today on elcap. Again, ~1000 nodes were affected.
Not addressing the root cause of this (which we have yet to understand), but as a stopgap, what if we add an optional heartbeat timeout? Right now the heartbeat module is only loaded on rank 0. It publishes a It could then take some drastic action if the heartbeat is not received for some period of time.
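A rough sketch of the timeout logic being proposed, assuming some callback fires each time a heartbeat arrives. `HeartbeatWatchdog` and its methods are hypothetical names for illustration, not an existing broker API, and what the "drastic action" should be is left open here.

```python
import time


class HeartbeatWatchdog:
    """Track how long it has been since the last heartbeat was seen.

    Hypothetical helper for illustration only.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested
        self.last_seen = clock()

    def pulse(self):
        """Record that a heartbeat was just received."""
        self.last_seen = self.clock()

    def expired(self):
        """Return True if no heartbeat arrived within timeout_s seconds."""
        return self.clock() - self.last_seen > self.timeout_s
```

A node would call `pulse()` from its heartbeat handler and poll `expired()` from a periodic timer. Using a monotonic clock keeps the check immune to wall-clock adjustments.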
The thoughts I had on this one are
Regarding housekeeping and drain state
Problem: about 1000 nodes of el cap got into a state where the rank 0 broker thought they were disconnected, but from the point of view of the nodes, the rank 0 broker was just unresponsive.
Specifically, on rank 0
but on the node, everything seemed OK except flux commands that needed to contact rank 0 would hang.
The actual TCP connection to the node appeared to be in connected state. Here is elcap12119 (another node in that state):
and the same connection on elcap1
Stopping flux on the compute node ran into the systemd timeout, but problems immediately went away upon restart.