Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heartbeats #40

Open
BenediktBurger opened this issue Feb 2, 2023 · 20 comments
Open

Heartbeats #40

BenediktBurger opened this issue Feb 2, 2023 · 20 comments
Labels
discussion-needed A solution still needs to be determined distributed_ops Aspects of a distributed operation, networked or on a node messages Concerns the message format

Comments

@BenediktBurger
Copy link
Member

BenediktBurger commented Feb 2, 2023

Let us discuss on heartbeats in this issue.
Related to #4

Accepted ideas:

  • Every Message received is a heartbeat.

Some ideas to get heartbeats:

  1. Reply to every not empty message with an empty one
  2. hope to get a message before expiration. If a connection expires for the first time, send a status request, if it expires a second time, drop it.
  3. Components shall send heartbeats at a certain rate, equally Coordinators.
@BenediktBurger BenediktBurger added distributed_ops Aspects of a distributed operation, networked or on a node discussion-needed A solution still needs to be determined messages Concerns the message format labels Feb 2, 2023
@BenediktBurger BenediktBurger linked a pull request Feb 2, 2023 that will close this issue
5 tasks
@bilderbuchi
Copy link
Member

Do we want to differentiate between STATUS requests out of "interest" for the status, and because of checking for life signs (HEALTHCHECK? POKE?) If yes, receiving a HEALTHCHECK would signal to a Component that it is not sending heartbeats often enough, and it should increase the hearbeat frequency.

@bilderbuchi
Copy link
Member

We will need a way to configure/coordinate desired heartbeat intervals.
For later maybe: Probably some random jitter is a good idea so that we don't get "bursts" of heartbeats over our connections around every interval. There are known algorithms to deal with that.

@bilderbuchi
Copy link
Member

I was previously lamenting that we lose message/response symmetry if we answer every message with a heartbeat.
However, I think I found a way to deal with this conceptually: Heartbeats never "count" when tallying messages. With that idea I'm more comfortable with the idea of replying to every nonempty message with an empty one.
Heartbeats also maybe won't be shown in sequence diagrams not specifically concerned with heartbeat mechanics, to not clutter up everything.

@BenediktBurger
Copy link
Member Author

We could call it "Ping", if you desire a life sign.

If we have that fixed rule (respond empty), we can leave them out as they do not contain content and are easy to understand.

We will need a way to configure/coordinate desired heartbeat intervals.

I like the idea of the Zmq manual, which states, that the clients (not the server) know best their connection and heartbeat requirements.
So my proposal is, that we have some lax heartbeat interval on the Coordinator (around 1 to 10 seconds), which is also the default of Components.
Whoever needs a higher interval, may send heartbeats more often (and acquire thus replies as well).

@BenediktBurger
Copy link
Member Author

An additional question: how empty shall an empty message be?
As heartbeats are just between directly connected peers (that is a Coordinator and a Component or another Coordinator), the empty message could consist in a single, empty frame, without sender or recipient information: "||".

The Component understands: My connection peer (Coordinator) is still alive.
The Coordinator receives the connection identity and knows, that the peer at the end of that connection is alive.

@bilderbuchi
Copy link
Member

"PING": sounds good.
Clients determine frequency: sounds good.

Empty frame heartbeat: I agree with your reasoning. Can we do logging well in that case? (It might be convenient to log heartbeats at debug level, for example)

@bilderbuchi
Copy link
Member

I guess the exception in the message format for heartbeats does not unduly complicate things?

@BenediktBurger
Copy link
Member Author

I guess the exception in the message format for heartbeats does not unduly complicate things?

You add a first check in the Coordinator:

identity, message = socket.recv_multipart()
handle_heartbeat(identity)
if message == [b""]:
    return

In a Component you have:

while True:
    msg = socket.recv_multipart()
    if msg != [b""]:
        break

Empty frame heartbeat: I agree with your reasoning. Can we do logging well in that case? (It might be convenient to log heartbeats at debug level, for example)

The Coordinator can log the identity (which has an entry in the directory) and a Component can log that it received a heartbeat of its Coordinator.

Another idea.

We drop that automatic response (which adds complexity in the message routing, as you have to decide, whether you will send an answer later, maybe an error message, or directly respond).

The idea:

  • A response is only sent on PING request.
  • If the answer is not an empty message, but a full fledged message with command PONG, you could PING some Component via Coordinators (e.g. from N1.CA to N2.CB) as well.
  • If a connection expires for the first time, a PING request is sent (from Component to Coordinator or Coordinator to corresponding Component), if it expires a second time, the connection is signed out.

@BenediktBurger
Copy link
Member Author

If the answer is not an empty message, but a full fledged message with command PONG, you could PING some Component via Coordinators (e.g. from N1.CA to N2.CB) as well.

Instead of PONG, we could use a contentless message (e.g. V|Recipient|Sender|H, in contrast to an empty message ||).

@BenediktBurger
Copy link
Member Author

A Coordinator expects a heartbeat interval of 700 ms. A Component uses a heartbeat interval of 1000 ms. Thus, after a while, the component will send a PING, which the Component answers. If there is no change, the same situation will recur again.
If, on the other hand the Component says "I received a PING, so my heartbeat interval must be too long for the Coordinator's comfort, let's drop it by (e.g.) 10%", both will align naturally over time/a couple of iterations and avoid further PINGs.
It's always the recipient of the PING that should adjust the interval, because the sender does not have a clear signal, as it does not know if the PING target is actually still alive (but the PING recipient knows that).

by @bilderbuchi in https://github.com/pymeasure/leco-protocol/pull/38/files#r1101914525

suggestion: After not having received a heartbeat from a peer for a chosen heartbeat interval, a Component should and a Coordinator shall send a PING and wait for a PONG response for 3 heartbeat intervals before considering a connection dead.

by @bilderbuchi in https://github.com/pymeasure/leco-protocol/pull/38/files#r1100683377

@BenediktBurger
Copy link
Member Author

Regarding the 3 hearbeat_intervals:

I'd separate the "heartbeat_interval" (time between two heartbeats) and the "expiration_time", time when you get suspicious, that the other side is dead. Obviously, the second value should be larger the first value. Lastly we have a "heartbeat_check_interval", the time between two checks, whether a connection expired or not.

Did you mean, that the expiration_time is 3 times heartbeat_interval, @bilderbuchi ?

my idea is:

now = get_current_time()
last_message = get_time_of_last_message_of_Component()
if now > last_message + expiration_time + heartbeat_check_interval:
    delete_connection()
elif now > last_message + expiration_time:
    send_ping()

@bilderbuchi
Copy link
Member

Isn't that too complicated? I'd rather not configure/keep track of 3 different time intervals, to be honest.

Assuming that your code snippet is part of an event loop, what about:

# "pinged_at" is 0 in the beginning
now = get_current_time()
last_message = get_time_of_last_message_of_Component()
pinged_since_last_contact = pinged_at > last_message
# this _accurately_ tolerates one heartbeat interval after the ping:
if pinged_since_last_contact and (now > pinged_at + heartbeat_interval):
        delete_connection()
if now > last_message + heartbeat_interval and not pinged_since_last_contact:
        send_ping()
        pinged_at = now

I'd say forget about the 3 intervals. I suspect I based this on an now outdated notion of the reconnection process. What I wanted to achieve is that the heartbeat-ping mechanism leaves enough time between one Component realising the Coordinator is dead (restarting) for all Components being reconnected, before declaring failure. That could make for a smoother reconnection process.
Let's revisit that when we have the first iteration done.

@BenediktBurger
Copy link
Member Author

pinged_at + heartbeat_interval == last_message + 2 heartbeat_interval, so we do not have to store that additional information.

Basically you set all three variables to the same value?

I would not set the expiration time to the heartbeat_interval: Maybe the heartbeat will arive immediately later.
We could set the expiration_time to some factor (1.1 for not a lot or 2), so : now > last_message + heartbeat_interval * factor (factor is going to be defined).

This reduces to my original proposal, but defining the variables in terms of heartbeat_interval:

if now > last_message + 3 * heartbeat_interval:
    delelete_connection()
elif now > last_message + 2 * heartbeat_interval:
    send_ping()

with the heartbeat_check_interval I did not want to check the heartbeats expiration between pinging and expiring again. However, we can to that check as often as the heartbeats.

@bilderbuchi
Copy link
Member

bilderbuchi commented Feb 10, 2023

pinged_at + heartbeat_interval == last_message + 2 heartbeat_interval

no, that's not accurate. A timeline assuming 1000 ms heartbeat interval (with exaggerated delays for demonstration):

  1. t=0: last message received
  2. 1000: first warning gate, coordinator is now allowed to become suspicious
  3. 1500: Coordinator comes around to checking, realises that Component is silent too long, sends a PING
  4. 2000: 2 heartbeat intervals have passed since the last message, but only half an interval since the PING. Using your logic, the connection will now be severed, but my logic gives a grace period of one heartbeat
  5. 2500: Now one heartbeat has elapsed since the PING, according to my proposed logic, now the connection is severed.

@bklebel
Copy link
Collaborator

bklebel commented Feb 10, 2023

I like @bilderbuchi's last proposal of the general protocol, except that I would, in general, be a bit more lenient with the heartbeats, and only sever the connection in case a few heartbeat intervals have elapsed, not just one ping sent without a pong response, but maybe 2 or 3.
Alternatively, I would give the Components the option, similar to what @bmoneke suggested, to announce (maybe in the SIGNIN message), what their preferred heartbeat interval would be - maybe a particular Component knows that it will be unresponsive for longer intervals than the default heartbeat interval, and it does not have the option to just reduce its own heartbeat interval to accommodate the other side wanting to go faster, and thus speed up.
So, we could have default heartbeat interval X, but a Component might say that heartbeats should not be expected to come from it more often than every Y second, which may well be larger than X.

In the #35 Status handling, we could flesh out a status level where "the Coordinator becomes suspicious, but does not yet sever the connection outright".

@BenediktBurger
Copy link
Member Author

  1. 2000: 2 heartbeat intervals have passed since the last message, but only half an interval since the PING. Using your logic, the connection will now be severed, but my logic gives a grace period of one heartbeat

This depends no, how often you check the expiration. I proposed to use the heartbeat interval in that message.

@BenediktBurger
Copy link
Member Author

Maybe we should start with the goals: What do we want to achieve with the heartbeat?

  • fast resolution?
  • detect a problem at some point?
  • leave it open for the user?

@bklebel mentioned slow reacting Components, which influences the Actor design:

I thought about single threaded actors, which handle a message at a time. If the handling takes time (due to device communication), there won't be any heartbeats nor any responses to ping requests.
The advantage is, that the actors are simple in design: read a message, handle it, respond, repeat.

Another proposal (from the cern middle ware paper) assumes a thread dedicated to message handling.
This assures regular heartbeats, fast responses to pings etc.
It requires also a queue (in the English sense) for commands and some way to store the a message information during handling etc.

If we assume the first, you can still write a actor of the second type (for example for a very slow instrument etc).

@BenediktBurger
Copy link
Member Author

We could leave it up to the user and define it generally:

  1. After expiration_time1 you send a ping
  2. Expiration_time2 after the last message or after the ping (TBD), you may cut the connection

Then the users can set their intervals suiting their setup.

As we have the ping message, it is no problem, if the heartbeat rate is too low in one Component, as the other side can ping.

Just the expiration time has to be larger than the slowest component connected to that Coordinator.

@bilderbuchi
Copy link
Member

I like @bilderbuchi's last proposal of the general protocol, except that I would, in general, be a bit more lenient with the heartbeats, and only sever the connection in case a few heartbeat intervals have elapsed, not just one ping sent without a pong response, but maybe 2 or 3.

Alternatively, I would give the Components the option, similar to what @bmoneke suggested, to announce (maybe in the SIGNIN message), what their preferred heartbeat interval would be - maybe a particular Component knows that it will be unresponsive for longer intervals than the default heartbeat interval, and it does not have the option to just reduce its own heartbeat interval to accommodate the other side wanting to go faster, and thus speed up. So, we could have default heartbeat interval X, but a Component might say that heartbeats should not be expected to come from it more often than every Y second, which may well be larger than X.

In the #35 Status handling, we could flesh out a status level where "the Coordinator becomes suspicious, but does not yet sever the connection outright".

👍 to all these points.

Also, I caution against too much configurability/parameters -- in this first iteration, we can still keep the logic simple, and not yet account for all use cases and scenarios. Later, we can adjust that and expand with more options if needed.

@bklebel
Copy link
Collaborator

bklebel commented Feb 13, 2023

Also, I caution against too much configurability/parameters -- in this first iteration, we can still keep the logic simple, and not yet account for all use cases and scenarios. Later, we can adjust that and expand with more options if needed.

I fully agree. We could start with a default heartbeat interval for all, and introduce "personalised" heartbeat intervals later - if we are lenient enough with severing connections, we should not run into troubles too quickly, although we will detect failures a bit slower, would be fine with me.

slow reacting Components, which influences the Actor design

That will be another story altogether, I don't yet have a clear picture of how that will go. I would however be careful with having a dedicated thread to send heartbeats, this might lead to heartbeats being sent even though the thread which deals with hardware connections died and was not restarted, so we get inaccurate information. This could be prevented in a good implementation, but if it can be excluded by design, that might be better.

@BenediktBurger BenediktBurger removed a link to a pull request Feb 14, 2023
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion-needed A solution still needs to be determined distributed_ops Aspects of a distributed operation, networked or on a node messages Concerns the message format
Projects
None yet
Development

No branches or pull requests

3 participants