This repository has been archived by the owner on Mar 4, 2024. It is now read-only.
As discussed in this blogpost and with @Bathtor, partial network failures can prevent Raft from making any progress because no stable leader can be established.
Initially, I thought our Ballot Leader Election was not affected because we only include our own ballot, and not the largest seen ballot, in the HeartbeatReply. However, I overlooked that we gossip the largest seen ballot, `max_ballot`, in the HeartbeatRequests, and that each receiver updates its own `max_ballot` to the largest value it has seen. This can cause a liveness issue in a similar manner.
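To make the gossip step concrete, here is a minimal sketch of the update that happens on receiving a HeartbeatRequest. The `Ballot` type, its field names, and the function name `on_heartbeat_request` are hypothetical and not taken from the codebase. The lexicographic ordering (round number first, then pid) is an assumption inferred from the ballots in the example below.

```python
from typing import NamedTuple


class Ballot(NamedTuple):
    """Hypothetical ballot <n, pid>; tuples compare n first, then pid."""
    n: int
    pid: int


def on_heartbeat_request(own_max_ballot: Ballot, gossiped_max_ballot: Ballot) -> Ballot:
    # The receiver adopts the gossiped value whenever it is larger,
    # even if it has never heard from the owner of that ballot directly.
    return max(own_max_ballot, gossiped_max_ballot)


if __name__ == "__main__":
    # p3 still holds <0, 3> and learns about <1, 1> through p2.
    print(on_heartbeat_request(Ballot(0, 3), Ballot(1, 1)))
```

The point of the sketch is that the receiver's `max_ballot` can be raised by a node it cannot reach, which is what the example relies on.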
Consider the following example:
1. p1, p2, and p3 are initially all connected to each other, and p3 becomes leader with <0, 3>.
2. p1 and p3 get disconnected, so our connectivity becomes p1-p2-p3.
3. As p1 does not receive a HeartbeatReply from p3, it increments its ballot to <1, 1> and sends it to p2.
4. p2 sends a HeartbeatRequest with `max_ballot = <1, 1>` to p3.
5. Upon getting that HeartbeatRequest from p2, p3 sets its own `max_ballot` to <1, 1>, but it will never actually get a HeartbeatReply from p1. In `check_leader()`, p3 will see that the `top_ballot` in the HeartbeatReplies is less than its `max_ballot` and will therefore increment its ballot to <2, 3>, as seen here.
The same sequence from step 4 onwards then repeats with the roles swapped, this time with `max_ballot = <2, 3>` gossiped from p2 to p1. p1 will not get a HeartbeatReply from p3 and will increment its ballot again. We will never establish a stable leader.
If we do not include `max_ballot` in the HeartbeatRequest, then we are fine even at p1-p2-p3. Again, p1 increments to <1, 1>, but p2 will now NOT gossip it to p3 (as we have removed `max_ballot` from HeartbeatRequest). Hence, p3 will just continue to get a HeartbeatReply of <0, 2> from p2 and will therefore still think it is the leader when it has actually been overtaken. That's fine, because p1 and p2 still form a majority that sees <1, 1> as the highest ballot.
Removing `max_ballot` from HeartbeatRequest does not break anything in the normal case.
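The scenario can be reproduced with a small round-based simulation. This is only a sketch under simplifying assumptions, not the actual implementation: rounds are synchronous, every message in a round carries start-of-round values, the majority check is omitted (every node in the example still reaches a majority), and the names `Node`, `simulate`, and `gossip_max_ballot` are made up for illustration. Only the `check_leader()` rule (bump the ballot when `top_ballot < max_ballot`) follows the description above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

Ballot = Tuple[int, int]  # <n, pid>, compared lexicographically


@dataclass
class Node:
    pid: int
    ballot: Ballot
    max_ballot: Ballot
    leader: Optional[int]


def simulate(links: Iterable[Tuple[int, int]], gossip_max_ballot: bool,
             rounds: int = 20) -> Tuple[Dict[int, Node], int]:
    """Run heartbeat rounds; return final nodes and p2's leader-change count."""
    # Start from step 1: p3 has been elected with ballot <0, 3>.
    nodes = {pid: Node(pid, (0, pid), (0, 3), 3) for pid in (1, 2, 3)}
    peers: Dict[int, set] = {pid: set() for pid in nodes}
    for a, b in links:
        peers[a].add(b)
        peers[b].add(a)

    leader_changes_at_p2 = 0
    for _ in range(rounds):
        # Snapshot what each node sends during this round.
        sent_max = {pid: n.max_ballot for pid, n in nodes.items()}
        sent_ballot = {pid: n.ballot for pid, n in nodes.items()}
        p2_leader_before = nodes[2].leader

        # HeartbeatRequest handling: adopt the gossiped max_ballot if enabled.
        if gossip_max_ballot:
            for pid, node in nodes.items():
                for peer in peers[pid]:
                    node.max_ballot = max(node.max_ballot, sent_max[peer])

        # check_leader(): compare the best ballot actually heard (own ballot
        # plus HeartbeatReplies from reachable peers) against max_ballot.
        for pid, node in nodes.items():
            heard = [sent_ballot[p] for p in peers[pid]] + [sent_ballot[pid]]
            top_ballot = max(heard)
            if top_ballot < node.max_ballot:
                node.ballot = (node.max_ballot[0] + 1, pid)
                node.leader = None
            elif top_ballot > node.max_ballot:
                node.max_ballot = top_ballot
                node.leader = top_ballot[1]

        if nodes[2].leader != p2_leader_before:
            leader_changes_at_p2 += 1
    return nodes, leader_changes_at_p2


if __name__ == "__main__":
    chain = [(1, 2), (2, 3)]  # p1-p2-p3
    for gossip in (True, False):
        final, changes = simulate(chain, gossip)
        ballots = {pid: n.ballot for pid, n in final.items()}
        print(f"gossip={gossip}: ballots={ballots}, p2 leader changes={changes}")
```

In this model, with gossip enabled on p1-p2-p3, the ballots of p1 and p3 keep leapfrogging each other and p2 keeps switching leader. With gossip disabled, p1 and p2 settle on p1 while p3 stays at <0, 3>. With all three links present, nothing changes in either mode.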