Heartbeats #40
Comments
Do we want to differentiate between |
We will need a way to configure/coordinate desired heartbeat intervals. |
I was previously lamenting that we lose message/response symmetry if we answer every message with a heartbeat. |
We could call it "Ping", if you desire a life sign. If we have that fixed rule (respond empty), we can leave them out as they do not contain content and are easy to understand.
I like the idea from the Zmq manual, which states that the clients (not the server) know their connection and heartbeat requirements best. |
An additional question: how empty shall an empty message be? The Component understands: My connection peer (Coordinator) is still alive. |
"PING": sounds good. Empty frame heartbeat: I agree with your reasoning. Can we do logging well in that case? (It might be convenient to log heartbeats at debug level, for example) |
I guess the exception in the message format for heartbeats does not unduly complicate things? |
You add a first check in the Coordinator:
    # Assumes the first frame is the sender identity (as with a zmq ROUTER socket),
    # so the remaining frames are collected into the list "message".
    identity, *message = socket.recv_multipart()
    handle_heartbeat(identity)
    if message == [b""]:
        return
In a Component you have:
    while True:
        msg = socket.recv_multipart()
        # An empty frame is a heartbeat; anything else is a real message.
        if msg != [b""]:
            break
The Coordinator can log the identity (which has an entry in the directory) and a Component can log that it received a heartbeat from its Coordinator.
Another idea: We drop that automatic response (which adds complexity in the message routing, as you have to decide whether you will send an answer later, maybe an error message, or respond directly). The idea:
|
Instead of PONG, we could use a contentless message (e.g. |
|
Regarding the 3 heartbeat_intervals: I'd separate the "heartbeat_interval" (time between two heartbeats) and the "expiration_time", the time after which you get suspicious that the other side is dead. Obviously, the second value should be larger than the first value. Lastly we have a "heartbeat_check_interval", the time between two checks of whether a connection has expired. Did you mean that the expiration_time is 3 times heartbeat_interval, @bilderbuchi? My idea is:
    now = get_current_time()
    last_message = get_time_of_last_message_of_Component()
    if now > last_message + expiration_time + heartbeat_check_interval:
        delete_connection()
    elif now > last_message + expiration_time:
        send_ping()
|
Isn't that too complicated? I'd rather not configure/keep track of 3 different time intervals, to be honest. Assuming that your code snippet is part of an event loop, what about:
    # "pinged_at" is 0 in the beginning
    now = get_current_time()
    last_message = get_time_of_last_message_of_Component()
    pinged_since_last_contact = pinged_at > last_message
    # this _accurately_ tolerates one heartbeat interval after the ping:
    if pinged_since_last_contact and (now > pinged_at + heartbeat_interval):
        delete_connection()
    if now > last_message + heartbeat_interval and not pinged_since_last_contact:
        send_ping()
        pinged_at = now
I'd say forget about the 3 intervals. I suspect I based this on a now outdated notion of the reconnection process. What I wanted to achieve is that the heartbeat-ping mechanism leaves enough time, after one Component realises the Coordinator is dead (restarting), for all Components to reconnect before failure is declared. That could make for a smoother reconnection process. |
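For reference, a minimal runnable sketch of the ping-then-expire logic from the snippet above. `PeerState` and `check_peer` are hypothetical names chosen for illustration, not an agreed API, and times are assumed to be plain floats in seconds. The function only returns the decision; sending the ping or deleting the connection is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class PeerState:
    """Bookkeeping for one connected Component (hypothetical helper)."""
    last_message: float = 0.0  # time of the last message from the peer
    pinged_at: float = 0.0     # time of our last ping; 0 means "never pinged"


def check_peer(peer: PeerState, now: float, heartbeat_interval: float) -> str:
    """Return "expire", "ping", or "ok" for one pass of the event loop."""
    pinged_since_last_contact = peer.pinged_at > peer.last_message
    if pinged_since_last_contact:
        # A ping is outstanding: allow one full interval for the answer.
        if now > peer.pinged_at + heartbeat_interval:
            return "expire"
        return "ok"
    if now > peer.last_message + heartbeat_interval:
        peer.pinged_at = now  # remember when we asked for a life sign
        return "ping"
    return "ok"
```

Any message from the peer updates `last_message`, which makes `pinged_since_last_contact` false again and so clears the outstanding ping.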
Basically you set all three variables to the same value? I would not set the expiration time to the heartbeat_interval: maybe the heartbeat will arrive immediately afterwards. This reduces to my original proposal, but defining the variables in terms of heartbeat_interval:
    if now > last_message + 3 * heartbeat_interval:
        delete_connection()
    elif now > last_message + 2 * heartbeat_interval:
        send_ping()
With the heartbeat_check_interval I did not want to check the heartbeat expiration again between pinging and expiring. However, we can do that check as often as the heartbeats. |
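A small runnable version of the 2x/3x rule above, as a sketch only. `heartbeat_action` is a hypothetical name, and the function returns the decision as a string instead of calling `send_ping()` or `delete_connection()` directly, so that it can be tried without a socket.

```python
def heartbeat_action(now: float, last_message: float, heartbeat_interval: float) -> str:
    """Decide what to do about a peer, based on how long it has been silent."""
    silence = now - last_message
    if silence > 3 * heartbeat_interval:
        return "delete"  # silent for more than three intervals
    if silence > 2 * heartbeat_interval:
        return "ping"    # silent for more than two intervals: ask for a life sign
    return "ok"
```

With this rule a silent peer is pinged once more than two intervals have passed and dropped once more than three have passed, so a ping gets roughly one interval to be answered.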
No, that's not accurate. A timeline assuming a 1000 ms heartbeat interval (with exaggerated delays for demonstration):
|
I like @bilderbuchi's last proposal of the general protocol, except that I would, in general, be a bit more lenient with the heartbeats and sever the connection only after a few heartbeat intervals have elapsed: not after a single ping without a pong response, but after maybe 2 or 3. In the #35 Status handling, we could flesh out a status level where "the Coordinator becomes suspicious, but does not yet sever the connection outright". |
This depends on how often you check the expiration. I proposed to use the heartbeat interval in that message. |
Maybe we should start with the goals: What do we want to achieve with the heartbeat?
@bklebel mentioned slow-reacting Components, which influences the Actor design: I thought about single-threaded actors, which handle one message at a time. If the handling takes time (due to device communication), there won't be any heartbeats nor any responses to ping requests. Another proposal (from the CERN middleware paper) assumes a thread dedicated to message handling. If we assume the first, you can still write an actor of the second type (for example for a very slow instrument etc). |
We could leave it up to the user and define it generally:
Then the users can set their intervals suiting their setup. As we have the ping message, it is no problem, if the heartbeat rate is too low in one Component, as the other side can ping. Just the expiration time has to be larger than the slowest component connected to that Coordinator. |
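The constraint in the last sentence can be checked mechanically. A minimal sketch, assuming a Coordinator knows the heartbeat interval of each connected Component as a plain mapping; `too_slow_components` and the Component names in the usage note are made up for illustration.

```python
def too_slow_components(expiration_time: float, heartbeat_intervals: dict) -> list:
    """Return the names of Components whose heartbeat interval is not
    shorter than the Coordinator's expiration time."""
    return sorted(
        name
        for name, interval in heartbeat_intervals.items()
        if interval >= expiration_time
    )
```

For example, with intervals `{"camera": 1.0, "cryostat": 5.0}` and an expiration time of 3.0 seconds, only `"cryostat"` is reported. Such a Component would rely on answering pings in time instead of on its own heartbeat rate.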
👍 to all these points. Also, I caution against too much configurability/parameters -- in this first iteration, we can still keep the logic simple, and not yet account for all use cases and scenarios. Later, we can adjust that and expand with more options if needed. |
I fully agree. We could start with a default heartbeat interval for all, and introduce "personalised" heartbeat intervals later. If we are lenient enough with severing connections, we should not run into trouble too quickly. We will detect failures a bit more slowly, but that would be fine with me.
That will be another story altogether, I don't yet have a clear picture of how that will go. I would however be careful with having a dedicated thread to send heartbeats, this might lead to heartbeats being sent even though the thread which deals with hardware connections died and was not restarted, so we get inaccurate information. This could be prevented in a good implementation, but if it can be excluded by design, that might be better. |
Let us discuss heartbeats in this issue.
Related to #4
Accepted ideas:
Some ideas to get heartbeats: