heartbeat: add optional timeout #6679

Merged
merged 11 commits into from
Mar 5, 2025
3 changes: 2 additions & 1 deletion doc/Makefile.am
@@ -363,7 +363,8 @@ MAN5_FILES_PRIMARY = \
man5/flux-config-ingest.5 \
man5/flux-config-kvs.5 \
man5/flux-config-policy.5 \
-man5/flux-config-queues.5
+man5/flux-config-queues.5 \
+man5/flux-config-heartbeat.5


MAN7_FILES = $(MAN7_FILES_PRIMARY)
94 changes: 94 additions & 0 deletions doc/man5/flux-config-heartbeat.rst
@@ -0,0 +1,94 @@
========================
flux-config-heartbeat(5)
========================


DESCRIPTION
===========

The Flux heartbeat service publishes periodic ``heartbeat.pulse`` messages
from the leader broker for synchronization. Follower brokers subscribe
to these messages and may optionally force a disconnect from their overlay
network parent when they are missed for a configurable period.

The ``heartbeat`` table may be used to tune the heartbeat service. It may
contain the following keys:

KEYS
====

period
   (optional) The interval (in RFC 23 Flux Standard Duration format) between
   the publication of heartbeat messages. Default: *2s*.

timeout
   (optional) The period (in RFC 23 Flux Standard Duration format) after
   which a follower broker will forcibly disconnect from its overlay network
   parent if it hasn't received a heartbeat message. Set to *0* or *infinity*
   to disable. Default: *5m*.

warn_thresh
   (optional) The number of missed heartbeat periods after which a warning
   message will be logged. Default: 3.

EXAMPLE
=======

::

   [heartbeat]
   period = "5s"
   timeout = "1m"
   warn_thresh = 3
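
As a rough illustration of how the three keys relate, the following Python sketch converts duration strings to seconds and classifies the time elapsed since the last heartbeat. It is not taken from the Flux sources: ``parse_fsd`` and ``heartbeat_status`` are hypothetical helpers, the parser covers only an assumed subset of RFC 23, and the exact comparisons (for instance, whether the warning fires at or after ``warn_thresh`` periods) are assumptions.

```python
import math

# Unit suffixes assumed for this sketch (a subset of RFC 23 FSD).
_UNITS = {"ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}


def parse_fsd(text):
    """Convert a duration string such as "5s" or "1m" to seconds.

    Hypothetical helper for illustration, not the Flux parser.
    """
    text = text.strip()
    if text in ("infinity", "inf"):
        return math.inf
    # Check the two-character suffix "ms" before the one-character ones.
    for suffix in ("ms", "s", "m", "h", "d"):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * _UNITS[suffix]
    return float(text)  # a bare number is taken as seconds


def heartbeat_status(elapsed, period, timeout, warn_thresh):
    """Classify the time since the last heartbeat (all values in seconds).

    Returns "disconnect", "warn", or "ok". The boundary conditions are
    assumptions made for this sketch, not Flux behavior.
    """
    # A timeout of 0 or infinity disables the forced disconnect.
    timeout_enabled = 0 < timeout < math.inf
    if timeout_enabled and elapsed >= timeout:
        return "disconnect"
    if elapsed >= warn_thresh * period:
        return "warn"
    return "ok"


if __name__ == "__main__":
    period = parse_fsd("5s")
    timeout = parse_fsd("1m")
    for elapsed in (4, 20, 90):
        print(elapsed, heartbeat_status(elapsed, period, timeout, 3))
```

Under these assumptions and the example values above, a follower that has gone 20 seconds without a heartbeat is classified as ``warn`` and one that has gone 90 seconds as ``disconnect``; with the timeout disabled, the 90 second case only warns.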

USE CASES
=========

Heartbeats may be used to synchronize Flux activities across brokers to
reduce the operating system jitter that affects some sensitive bulk-synchronous
applications. :man3:`flux_sync_create` provides a way to invoke work that
is synchronized with the heartbeat.

.. note::
   The efficacy of heartbeats to mitigate noise is limited by the propagation
   delay of published messages through the tree based overlay network; however,
   this may be reduced in the future with a side channel transport such as
   TCP multicast, hardware collectives, or quantum entanglement.

Contributor

A future quantum entanglement method for synchronization should definitely get an entry on our flashy presentation!

Contributor

Also: probably add to commit message that this commit adds a USE CASES section as well as documenting the heartbeat timeout configuration.

Member Author

It says "future" so I figured what the hell :-)

The use case section seemed slightly incomplete to me. Does this change help or hurt? (Gah, I've read it too many times now)

@@ -61,13 +61,20 @@ loses the TCP connection to its overlay parent without a shutdown (for example,
 if the parent crashes or there is a network partition and TCP times out),
 ZeroMQ tries indefinitely to re-establish the connection without informing
 the broker.  The child broker remains in RUN state with any upstream RPCs
-blocked.
-
-A heartbeat timeout forces the broker to fail those RPCs and shut down its
-overlay subtree and exit.  In a system instance, systemd restarts the broker
-which will remain in JOIN state until the parent returns to service.
-Otherwise, the outcome depends on whether or not the broker is one of
-the *critical ranks* described in :man7:`flux-broker-attributes`.
+blocked until the parent returns to service, after which it is forced to
+disconnect and shut down, which causes the RPCs to fail.  A heartbeat timeout
+forces the broker to "fail fast", with the same net effect, but arriving
+at a steady state sooner.
+
+The effect of a follower broker shutdown depends on its role.  If it is
+not a leaf node, the effect applies to its entire subtree.  In a system
+instance, systemd restarts brokers that shut down this way.  Upon restart,
+the brokers remain in JOIN state until the parent returns to service.
+The heartbeat service is not loaded until after the parent connection is
+established, so heartbeat timeouts do not apply in this phase.  In a user
+allocation where brokers are not restarted, the outcome depends on whether
+or not the broker is one of the *critical ranks* described in
+:man7:`flux-broker-attributes`.
 
 RESOURCES
 =========

Or maybe I've gone off the rails adding stuff that shouldn't be in a config man page :-)

Contributor

I thought the use case was really useful. I only mentioned it because the commit message only called out the addition of the timeout documentation.

The update above seems good to me!


The heartbeat timeout may be used to work around a peculiarity of ZeroMQ,
the software layer underpinning the overlay network. When a Flux broker
loses the TCP connection to its overlay parent without a shutdown (for example,
if the parent crashes or there is a network partition and TCP times out),
ZeroMQ tries indefinitely to re-establish the connection without informing
the broker. The child broker remains in RUN state with any upstream RPCs
blocked until the parent returns to service, after which it is forced to
disconnect and shut down, which causes the RPCs to fail. A heartbeat timeout
forces the broker to "fail fast", with the same net effect, but arriving
at a steady state sooner.

The effect of a follower broker shutdown depends on its role. If it is
not a leaf node, the effect applies to its entire subtree. In a system
instance, systemd restarts brokers that shut down this way. Upon restart,
the brokers remain in JOIN state until the parent returns to service.
The heartbeat service is not loaded until after the parent connection is
established, so heartbeat timeouts do not apply in this phase. In a user
allocation where brokers are not restarted, the outcome depends on whether
or not the broker is one of the *critical ranks* described in
:man7:`flux-broker-attributes`.

RESOURCES
=========

.. include:: common/resources.rst


FLUX RFC
========

:doc:`rfc:spec_23`


SEE ALSO
========

:man5:`flux-config`
1 change: 1 addition & 0 deletions doc/manpages.py
@@ -358,6 +358,7 @@
('man5/flux-config-queues', 'flux-config-queues', 'configure Flux job queues', [author], 5),
('man5/flux-config-job-manager', 'flux-config-job-manager', 'configure Flux job manager service', [author], 5),
('man5/flux-config-kvs', 'flux-config-kvs', 'configure Flux kvs service', [author], 5),
+('man5/flux-config-heartbeat', 'flux-config-heartbeat', 'configure Flux heartbeat service', [author], 5),
('man7/flux-broker-attributes', 'flux-broker-attributes', 'overview Flux broker attributes', [author], 7),
('man7/flux-jobtap-plugins', 'flux-jobtap-plugins', 'overview Flux jobtap plugin API', [author], 7),
('man7/flux-environment', 'flux-environment', 'Flux environment overview', [author], 7),
2 changes: 1 addition & 1 deletion etc/rc1
@@ -86,7 +86,7 @@ fi

modload all job-ingest
modload 0 job-exec
-modload 0 heartbeat
+modload all heartbeat

Contributor

The statement in the commit message "the new heartbeat timer requires it to be loaded on all ranks" confused me. Is it actually a future commit that will require it to be loaded on all ranks, or should the message read something like "is supported on all ranks"?

Member Author

I'll reword. I meant the future commit.


core_dir=$(cd ${0%/*} && pwd -P)
all_dirs=$core_dir${FLUX_RC_EXTRA:+":$FLUX_RC_EXTRA"}
2 changes: 1 addition & 1 deletion etc/rc3
@@ -27,7 +27,7 @@ for rcdir in $all_dirs; do
done
done

-modrm 0 heartbeat
+modrm all heartbeat
modrm 0 sched-simple
modrm all resource
modrm 0 job-exec
40 changes: 36 additions & 4 deletions src/broker/overlay.c
@@ -1059,6 +1059,16 @@
log_tracker_error (ov->h, msg, errno);
}

+static void parent_disconnect (struct overlay *ov)
+{
+    if (ov->parent.zsock) {
+        (void)zmq_disconnect (ov->parent.zsock, ov->parent.uri);
+        ov->parent.offline = true;
+        rpc_track_purge (ov->parent.tracker, fail_parent_rpc, ov);
+        overlay_monitor_notify (ov, FLUX_NODEID_ANY);
+    }
+}

static void parent_cb (flux_reactor_t *r,
flux_watcher_t *w,
int revents,
@@ -1115,10 +1125,7 @@
"%s (rank %lu) sent disconnect control message",
flux_get_hostbyrank (ov->h, ov->parent.rank),
(unsigned long)ov->parent.rank);
-(void)zmq_disconnect (ov->parent.zsock, ov->parent.uri);
-ov->parent.offline = true;
-rpc_track_purge (ov->parent.tracker, fail_parent_rpc, ov);
-overlay_monitor_notify (ov, FLUX_NODEID_ANY);
+parent_disconnect (ov);
}
else
logdrop (ov, OVERLAY_UPSTREAM, msg, "unknown control type");
@@ -1940,6 +1947,25 @@
flux_log_error (h, "error responding to overlay.disconnect-subtree");
}

+/* Log a message then force the parent to disconnect.
+ */
+static void overlay_disconnect_parent_cb (flux_t *h,
+                                          flux_msg_handler_t *mh,
+                                          const flux_msg_t *msg,
+                                          void *arg)
+{
+    struct overlay *ov = arg;
+    const char *reason;
+
+    if (flux_request_unpack (msg, NULL, "{s:s}", "reason", &reason) < 0)
+        goto error;

Codecov (codecov/patch): added line #L1961 of src/broker/overlay.c was not covered by tests.

+    flux_log (h, LOG_CRIT, "disconnecting: %s", reason);
+    parent_disconnect (ov);
+    return;
+error:
+    flux_log_error (h, "overlay.disconnect-parent error");

Codecov (codecov/patch): added lines #L1965 - L1966 of src/broker/overlay.c were not covered by tests.

+}

static void overlay_trace_cb (flux_t *h,
flux_msg_handler_t *mh,
const flux_msg_t *msg,
@@ -2457,6 +2483,12 @@
overlay_disconnect_subtree_cb,
0
},
+    {
+        FLUX_MSGTYPE_REQUEST,
+        "overlay.disconnect-parent",
+        overlay_disconnect_parent_cb,
+        0
+    },
{
FLUX_MSGTYPE_REQUEST,
"overlay.goodbye",