-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
.nack handling discussion #52
Comments
SWIM protocol has 3 different types of communications:
Above communications can produce 3 different types of responses:
PingWhen ping happens, there can be 2 outcomes:
Indirect PingWhen indirect ping happens, there can be 2 outcomes:
Now, indirect pinger did its roundtrip with a ping target and responds to a ping origin node. PingRequestWhen pingRequest happens, there can be 3 outcomes:
Here’s some justification on why LHM component decides to either act or not to act on certain types of messages:
Generally, I found helpful to think about Lifegurad extensions as a tool to prevent false positive failure detection. The reason LHM component doesn't act on indirect ping's |
Thanks a lot for the writeup, that's all correct and sounds good. First let's clear out the things we're definitely agreeing on:
right, no need for the intermediary to ever bump it's LHM because of someone else trying to pingReq -- it may be the origin or the target failing, and unlikely us 👍
That's a great observation and thanks for talking this one over yesterday, agree this sounds good, esp the gradual coming back to a good level, rather than jumping back quickly because we may be an indirect node for many requests.
Also agreed, a What's missing in my opinion though is: "ack after nack" handling. Which gets somewhat SWIMNIO specific but I guess the instance may need to help here a bit. 4 peers, A (ping request origin), B and C intermediaries, and D "maybe unreachable" node.
...
...
Long story short: We'd need to modify sending indirect pings, in such way that:
A similar dance has to be done in receiving nacks; the nack must not remove the handler completely; it must mark it as such "if the timeout triggers, don't change the LHM" which we today do implicitly by the nack simply removing the callback/cancelling the timer such that the timeout never happens and we today never run that +1 code. We'd need to keep the callback slot after a nack around until timeout or ack, and then handle ack normally, but handle the timeout as "we know we had a nack, so don't +1 the LHM". This can be done soon though, but we don't have to rush on it right now. |
We discussed this a bunch with @avolokhov, hope he can drop his notes here for future reference.
Thanks a lot for digging into this 🙇♂️
The text was updated successfully, but these errors were encountered: