Heartbeats #40

BenediktBurger · 2023-02-02T11:27:04Z

Let us discuss heartbeats in this issue.
Related to #4

Accepted ideas:

Every Message received is a heartbeat.

Some ideas to get heartbeats:

Reply to every non-empty message with an empty one.
Hope to get a message before expiration. If a connection expires for the first time, send a status request; if it expires a second time, drop it.
Components shall send heartbeats at a certain rate, and so shall Coordinators.

bilderbuchi · 2023-02-04T21:07:28Z

Do we want to differentiate between STATUS requests made out of "interest" in the status and those made to check for life signs (HEALTHCHECK? POKE?)? If yes, receiving a HEALTHCHECK would signal to a Component that it is not sending heartbeats often enough and that it should increase its heartbeat frequency.

bilderbuchi · 2023-02-04T21:09:42Z

We will need a way to configure/coordinate desired heartbeat intervals.
For later maybe: Probably some random jitter is a good idea so that we don't get "bursts" of heartbeats over our connections around every interval. There are known algorithms to deal with that.
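
A minimal sketch of the jitter idea, assuming a simple uniform jitter around the configured interval. The function name `jittered_interval` and the `jitter_fraction` parameter are hypothetical and not part of any agreed design; other jitter schemes would work as well.

```python
import random


def jittered_interval(base_interval, jitter_fraction=0.1, rng=random):
    """Return base_interval shifted randomly by up to +/- jitter_fraction.

    Hypothetical helper: each peer would wait this long before sending its
    next heartbeat, so that heartbeats of many peers do not line up.
    """
    if not 0 <= jitter_fraction < 1:
        raise ValueError("jitter_fraction must be in [0, 1)")
    offset = rng.uniform(-jitter_fraction, jitter_fraction)
    return base_interval * (1 + offset)


# Example: five consecutive waiting times around a 1 s heartbeat interval.
waits = [jittered_interval(1.0) for _ in range(5)]
```

With a `jitter_fraction` of 0.1, every waiting time stays within 10 % of the base interval, so the average heartbeat rate is unchanged.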

bilderbuchi · 2023-02-04T21:14:16Z

I was previously lamenting that we lose message/response symmetry if we answer every message with a heartbeat.
However, I think I found a way to deal with this conceptually: Heartbeats never "count" when tallying messages. With that idea I'm more comfortable with the idea of replying to every nonempty message with an empty one.
Heartbeats also maybe won't be shown in sequence diagrams not specifically concerned with heartbeat mechanics, to not clutter up everything.

BenediktBurger · 2023-02-05T17:51:18Z

We could call it "Ping", if you desire a life sign.

If we have that fixed rule (respond with an empty message), we can leave heartbeats out of the sequence diagrams, as they carry no content and are easy to understand.

> We will need a way to configure/coordinate desired heartbeat intervals.

I like the idea in the Zmq manual, which states that the clients (not the server) know their connection and heartbeat requirements best.
So my proposal is that we have some lax heartbeat interval on the Coordinator (around 1 to 10 seconds), which is also the default for Components.
Whoever needs a higher heartbeat rate may send heartbeats more often (and thus receive replies as well).

BenediktBurger · 2023-02-06T06:26:26Z

An additional question: how empty shall an empty message be?
As heartbeats are just between directly connected peers (that is, a Coordinator and a Component or another Coordinator), the empty message could consist of a single empty frame, without sender or recipient information: "||".

The Component understands: My connection peer (Coordinator) is still alive.
The Coordinator receives the connection identity and knows that the peer at the end of that connection is alive.

bilderbuchi · 2023-02-06T16:23:14Z

"PING": sounds good.
Clients determine frequency: sounds good.

Empty frame heartbeat: I agree with your reasoning. Can we do logging well in that case? (It might be convenient to log heartbeats at debug level, for example)

bilderbuchi · 2023-02-06T16:27:16Z

I guess the exception in the message format for heartbeats does not unduly complicate things?

BenediktBurger · 2023-02-08T08:18:37Z

> I guess the exception in the message format for heartbeats does not unduly complicate things?

You add a first check in the Coordinator:

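# The first frame is the connection identity of the sender.
identity, *message = socket.recv_multipart()
handle_heartbeat(identity)  # every received message counts as a heartbeat
if message == [b""]:
    return  # pure heartbeat (a single empty frame): nothing more to do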

In a Component you have:

while True:
    msg = socket.recv_multipart()
    if msg != [b""]:  # skip heartbeats (a single empty frame)
        break  # msg now holds a real message to handle

> Empty frame heartbeat: I agree with your reasoning. Can we do logging well in that case? (It might be convenient to log heartbeats at debug level, for example)

The Coordinator can log the identity (which has an entry in the directory) and a Component can log that it received a heartbeat from its Coordinator.

Another idea.

We drop that automatic response (which adds complexity in the message routing, as you have to decide, whether you will send an answer later, maybe an error message, or directly respond).

The idea:

A response is only sent on a PING request.
If the answer is not an empty message but a full-fledged message with the command PONG, you could also PING some Component via Coordinators (e.g. from N1.CA to N2.CB).
If a connection expires for the first time, a PING request is sent (from Component to Coordinator or from Coordinator to the corresponding Component). If it expires a second time, the connection is signed out.

BenediktBurger · 2023-02-09T14:06:32Z

> If the answer is not an empty message, but a full fledged message with command PONG, you could PING some Component via Coordinators (e.g. from N1.CA to N2.CB) as well.

Instead of PONG, we could use a contentless message (e.g. V|Recipient|Sender|H, in contrast to an empty message ||).
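
To illustrate the difference between an empty message and a contentless one, here is a small sketch. It assumes that each of the four parts in V|Recipient|Sender|H is one frame and that any payload would follow as additional frames; that frame layout, the function name `classify` and the example frame values are assumptions for illustration only.

```python
def classify(frames):
    """Classify a received multipart message (frames after any identity frame).

    Assumed layout (hypothetical): version, recipient, sender, header,
    followed by zero or more payload frames.
    """
    if frames == [b""]:
        return "empty"        # "||": heartbeat between directly connected peers
    if len(frames) == 4:
        return "contentless"  # header frames only, can be routed like any message
    if len(frames) > 4:
        return "content"      # header frames plus payload
    raise ValueError("malformed message")


print(classify([b""]))                                 # empty
print(classify([b"\x00", b"N2.CB", b"N1.CA", b"hdr"]))  # contentless
```

Because the contentless message still carries recipient and sender, it could travel across Coordinators, which the single empty frame cannot.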

BenediktBurger · 2023-02-10T09:07:24Z

A Coordinator expects a heartbeat interval of 700 ms. A Component uses a heartbeat interval of 1000 ms. Thus, after a while, the Coordinator will send a PING, which the Component answers. If there is no change, the same situation will recur.
If, on the other hand, the Component says "I received a PING, so my heartbeat interval must be too long for the Coordinator's comfort, let's drop it by (e.g.) 10%", both will align naturally over time/a couple of iterations and avoid further PINGs.
It's always the recipient of the PING that should adjust the interval, because the sender does not have a clear signal, as it does not know if the PING target is actually still alive (but the PING recipient knows that).

by @bilderbuchi in https://github.com/pymeasure/leco-protocol/pull/38/files#r1101914525
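
The alignment described in this quote can be worked through numerically. The sketch below is a simplification: it assumes the Component receives exactly one PING per round for as long as its interval is longer than what the Coordinator expects. The function name and the `reduction` parameter are hypothetical.

```python
def pings_until_aligned(component_interval, coordinator_interval, reduction=0.10):
    """Count the PINGs needed until the Component's interval is short enough.

    On every PING the Component shortens its heartbeat interval by
    `reduction` (10 % by default, as in the example above).
    """
    if component_interval <= 0 or coordinator_interval <= 0:
        raise ValueError("intervals must be positive")
    if not 0 < reduction < 1:
        raise ValueError("reduction must be in (0, 1)")
    pings = 0
    while component_interval > coordinator_interval:
        pings += 1
        component_interval *= 1 - reduction
    return pings, component_interval


pings, interval = pings_until_aligned(1000, 700)
print(pings, round(interval, 1))  # 4 656.1
```

With the numbers from the example (1000 ms against 700 ms), the interval goes 1000, 900, 810, 729, 656.1 ms, so the PINGs stop after the fourth one.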

suggestion: After not having received a heartbeat from a peer for a chosen heartbeat interval, a Component should and a Coordinator shall send a PING and wait for a PONG response for 3 heartbeat intervals before considering a connection dead.

by @bilderbuchi in https://github.com/pymeasure/leco-protocol/pull/38/files#r1100683377

BenediktBurger · 2023-02-10T09:15:29Z

Regarding the 3 heartbeat_intervals:

I'd separate the "heartbeat_interval" (time between two heartbeats) and the "expiration_time", the time after which you get suspicious that the other side is dead. Obviously, the second value should be larger than the first value. Lastly, we have a "heartbeat_check_interval", the time between two checks of whether a connection has expired.

Did you mean that the expiration_time is 3 times the heartbeat_interval, @bilderbuchi?

My idea is:

now = get_current_time()
last_message = get_time_of_last_message_of_Component()
if now > last_message + expiration_time + heartbeat_check_interval:
    delete_connection()
elif now > last_message + expiration_time:
    send_ping()

bilderbuchi · 2023-02-10T12:38:28Z

Isn't that too complicated? I'd rather not configure/keep track of 3 different time intervals, to be honest.

Assuming that your code snippet is part of an event loop, what about:

# "pinged_at" is 0 in the beginning
now = get_current_time()
last_message = get_time_of_last_message_of_Component()
pinged_since_last_contact = pinged_at > last_message
# this _accurately_ tolerates one heartbeat interval after the ping:
if pinged_since_last_contact and (now > pinged_at + heartbeat_interval):
    delete_connection()
if now > last_message + heartbeat_interval and not pinged_since_last_contact:
    send_ping()
    pinged_at = now

I'd say forget about the 3 intervals. I suspect I based this on a now outdated notion of the reconnection process. What I wanted to achieve is that the heartbeat-ping mechanism leaves enough time, after one Component realises that the Coordinator is dead (restarting), for all Components to reconnect before failure is declared. That could make for a smoother reconnection process.
Let's revisit that when we have the first iteration done.

BenediktBurger · 2023-02-10T12:48:30Z

pinged_at + heartbeat_interval == last_message + 2 * heartbeat_interval, so we do not have to store that additional information.

Basically you set all three variables to the same value?

I would not set the expiration time to the heartbeat_interval: maybe the heartbeat will arrive immediately afterwards.
We could set the expiration_time to the heartbeat_interval times some factor (1.1 for a small margin, or 2), so: now > last_message + heartbeat_interval * factor (the factor is still to be defined).

This reduces to my original proposal, but defining the variables in terms of heartbeat_interval:

if now > last_message + 3 * heartbeat_interval:
    delete_connection()
elif now > last_message + 2 * heartbeat_interval:
    send_ping()

With the heartbeat_check_interval, my intent was to avoid checking for expiration between pinging and the second expiration. However, we can do that check as often as we send heartbeats.

bilderbuchi · 2023-02-10T13:19:23Z

> pinged_at + heartbeat_interval == last_message + 2 * heartbeat_interval

No, that's not accurate. A timeline assuming 1000 ms heartbeat interval (with exaggerated delays for demonstration):

t=0: last message received
1000: first warning gate, Coordinator is now allowed to become suspicious
1500: Coordinator comes around to checking, realises that the Component has been silent for too long, sends a PING
2000: 2 heartbeat intervals have passed since the last message, but only half an interval since the PING. Using your logic, the connection will now be severed, but my logic gives a grace period of one heartbeat.
2500: Now one heartbeat interval has elapsed since the PING; according to my proposed logic, the connection is now severed.
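
The timeline above can be replayed with a small, self-contained version of the ping-relative logic. This is only a sketch: `PeerState`, `check_peer`, the returned action names and the chosen check times are hypothetical, and the strict `>` comparison from the snippet is kept, so the drop happens just after 2500 ms rather than exactly at it.

```python
from dataclasses import dataclass


@dataclass
class PeerState:
    last_message: float = 0.0  # time the last message from the peer arrived
    pinged_at: float = 0.0     # time of our last PING; 0 means "never pinged"


def check_peer(state, now, heartbeat_interval):
    """Return "ok", "ping", "wait" or "drop" for one pass of the event loop."""
    pinged_since_last_contact = state.pinged_at > state.last_message
    if pinged_since_last_contact:
        # The grace period is measured from the PING, not from the last message.
        if now > state.pinged_at + heartbeat_interval:
            return "drop"
        return "wait"
    if now > state.last_message + heartbeat_interval:
        state.pinged_at = now
        return "ping"
    return "ok"


state = PeerState()
actions = [(t, check_peer(state, t, 1000)) for t in (500, 1500, 2000, 2501)]
print(actions)
# [(500, 'ok'), (1500, 'ping'), (2000, 'wait'), (2501, 'drop')]
```

A rule based on last_message + 2 * heartbeat_interval would already drop the connection at 2000 ms in this scenario, which is the difference the timeline points out.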

bklebel · 2023-02-10T14:04:00Z

I like @bilderbuchi's last proposal of the general protocol, except that I would, in general, be a bit more lenient with the heartbeats and only sever the connection once a few heartbeat intervals have elapsed: not after just one ping sent without a pong response, but after maybe 2 or 3.
Alternatively, I would give the Components the option, similar to what @bmoneke suggested, to announce (maybe in the SIGNIN message) what their preferred heartbeat interval would be. Maybe a particular Component knows that it will be unresponsive for longer than the default heartbeat interval, and it does not have the option to just reduce its own heartbeat interval to accommodate the other side wanting to go faster.
So, we could have a default heartbeat interval X, but a Component might say that heartbeats should not be expected from it more often than every Y seconds, where Y may well be larger than X.

In the #35 Status handling, we could flesh out a status level where "the Coordinator becomes suspicious, but does not yet sever the connection outright".
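
The per-Component interval suggested in this comment could be sketched as follows. Everything here is hypothetical: the class name `HeartbeatDirectory`, the `sign_in` signature and the default value are placeholders, and how the preferred interval would actually be carried in a SIGNIN message is not decided in this thread.

```python
DEFAULT_HEARTBEAT_INTERVAL = 1.0  # seconds; placeholder for the default "X"


class HeartbeatDirectory:
    """Keeps the heartbeat interval the Coordinator expects from each Component."""

    def __init__(self, default_interval=DEFAULT_HEARTBEAT_INTERVAL):
        self.default_interval = default_interval
        self._announced = {}

    def sign_in(self, name, preferred_interval=None):
        """Register a Component, optionally with its announced interval "Y"."""
        if preferred_interval is None:
            self._announced.pop(name, None)  # fall back to the default
        elif preferred_interval <= 0:
            raise ValueError("preferred_interval must be positive")
        else:
            self._announced[name] = preferred_interval

    def expected_interval(self, name):
        return self._announced.get(name, self.default_interval)


directory = HeartbeatDirectory()
directory.sign_in("fast_component")
directory.sign_in("slow_component", preferred_interval=5.0)
```

The Coordinator would then compare the silence of each Component against `expected_interval(name)` instead of one global value.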

BenediktBurger · 2023-02-10T14:25:39Z

> 2000: 2 heartbeat intervals have passed since the last message, but only half an interval since the PING. Using your logic, the connection will now be severed, but my logic gives a grace period of one heartbeat

This depends on how often you check the expiration. I proposed to use the heartbeat interval in that message.

BenediktBurger · 2023-02-10T14:40:36Z

Maybe we should start with the goals: What do we want to achieve with the heartbeat?

fast resolution?
detect a problem at some point?
leave it open for the user?

@bklebel mentioned slow-reacting Components, which influences the Actor design:

I thought about single-threaded actors, which handle one message at a time. If the handling takes time (due to device communication), there won't be any heartbeats or any responses to ping requests.
The advantage is that the actors are simple in design: read a message, handle it, respond, repeat.

Another proposal (from the CERN middleware paper) assumes a thread dedicated to message handling.
This ensures regular heartbeats, fast responses to pings, etc.
It also requires a queue (in the English sense) for commands and some way to store the message information during handling.

If we assume the first, you can still write an actor of the second type (for example for a very slow instrument).
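
A rough sketch of the second actor type, assuming plain strings as messages and no real network or device. The class `QueuedActor` and its methods are hypothetical; in a real actor, `on_message` would be called from the thread that reads the socket.

```python
import queue
import threading


class QueuedActor:
    """Answers PING immediately and hands commands to a worker thread."""

    def __init__(self, handle_command):
        self._commands = queue.Queue()
        self._handle_command = handle_command  # possibly slow device access
        self.results = []
        self._worker = threading.Thread(target=self._work, daemon=True)
        self._worker.start()

    def on_message(self, message):
        """Called by the message-handling thread; never blocks on the device."""
        if message == "PING":
            return "PONG"
        self._commands.put(message)
        return None  # the response follows once the command has been handled

    def _work(self):
        while True:
            command = self._commands.get()
            self.results.append(self._handle_command(command))
            self._commands.task_done()

    def wait_idle(self):
        """Block until all queued commands have been handled."""
        self._commands.join()


actor = QueuedActor(lambda command: command.upper())
reply = actor.on_message("PING")   # answered right away
actor.on_message("measure")        # queued for the worker
actor.wait_idle()
```

The single-threaded design needs none of this machinery, which is the simplicity argument made above; the price is that a long device call delays every heartbeat and PING response.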

BenediktBurger · 2023-02-10T14:47:04Z

We could leave it up to the user and define it generally:

After expiration_time1 you send a ping
Expiration_time2 after the last message or after the ping (TBD), you may cut the connection

Then the users can set their intervals suiting their setup.

As we have the ping message, it is no problem if the heartbeat rate is too low in one Component, as the other side can ping.

The expiration time just has to be longer than the time the slowest Component connected to that Coordinator needs to respond.

bilderbuchi · 2023-02-11T08:20:49Z

> I like @bilderbuchi's last proposal of the general protocol, except that I would, in general, be a bit more lenient with the heartbeats and only sever the connection once a few heartbeat intervals have elapsed: not after just one ping sent without a pong response, but after maybe 2 or 3.

> Alternatively, I would give the Components the option, similar to what @bmoneke suggested, to announce (maybe in the SIGNIN message) what their preferred heartbeat interval would be. Maybe a particular Component knows that it will be unresponsive for longer than the default heartbeat interval, and it does not have the option to just reduce its own heartbeat interval to accommodate the other side wanting to go faster. So, we could have a default heartbeat interval X, but a Component might say that heartbeats should not be expected from it more often than every Y seconds, where Y may well be larger than X.

> In the #35 Status handling, we could flesh out a status level where "the Coordinator becomes suspicious, but does not yet sever the connection outright".

👍 to all these points.

Also, I caution against too much configurability/parameters -- in this first iteration, we can still keep the logic simple, and not yet account for all use cases and scenarios. Later, we can adjust that and expand with more options if needed.

bklebel · 2023-02-13T09:25:18Z

> Also, I caution against too much configurability/parameters -- in this first iteration, we can still keep the logic simple, and not yet account for all use cases and scenarios. Later, we can adjust that and expand with more options if needed.

I fully agree. We could start with a default heartbeat interval for all and introduce "personalised" heartbeat intervals later. If we are lenient enough with severing connections, we should not run into trouble too quickly; we will detect failures a bit more slowly, but that would be fine with me.

> slow reacting Components, which influences the Actor design

That will be another story altogether; I don't yet have a clear picture of how that will go. I would, however, be careful with having a dedicated thread to send heartbeats: this might lead to heartbeats being sent even though the thread which deals with hardware connections died and was not restarted, so we would get inaccurate information. This could be prevented in a good implementation, but if it can be excluded by design, that might be better.

BenediktBurger added the labels distributed_ops (Aspects of a distributed operation, networked or on a node), discussion-needed (A solution still needs to be determined), and messages (Concerns the message format) on Feb 2, 2023

BenediktBurger mentioned this issue Feb 2, 2023

Control protocol transport layer. #38

Merged

5 tasks

BenediktBurger linked a pull request Feb 2, 2023 that will close this issue

Control protocol transport layer. #38

Merged

5 tasks

BenediktBurger added this to the zmq routing of Control Messages version 0 milestone Feb 8, 2023

BenediktBurger removed a link to a pull request Feb 14, 2023

Control protocol transport layer. #38

Merged

5 tasks
