This repository has been archived by the owner on Mar 4, 2024. It is now read-only.

Remove the max_ballot field from HeartbeatRequest #20

Open
haraldng opened this issue Feb 17, 2021 · 0 comments
Labels
bug Something isn't working

Comments

@haraldng
Collaborator

haraldng commented Feb 17, 2021

As discussed in this blogpost and with @Bathtor, partial failures could prevent Raft from making any progress because no stable leader can be established.

Initially, I thought our Ballot Leader Election was not affected because we only include our own ballot, and not the largest seen ballot, in the HeartbeatReply. However, I overlooked that we gossip the largest seen ballot, max_ballot, in the HeartbeatRequests and update our own max_ballot to the largest value seen. As a result, we could run into a similar liveness issue.
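To make the gossip rule concrete, here is a minimal sketch of what a receiver does with the max_ballot carried in a HeartbeatRequest. It assumes a ballot <n, pid> is ordered by n first and pid second, modelled as a Python tuple; the function name `on_heartbeat_request` is hypothetical and not taken from the codebase.

```python
from typing import Tuple

# Assumption: a ballot <n, pid> compares by n first, then by pid.
Ballot = Tuple[int, int]


def on_heartbeat_request(local_max: Ballot, gossiped_max: Ballot) -> Ballot:
    """Return the receiver's new max_ballot after adopting the larger value."""
    return max(local_max, gossiped_max)


# A leader with ballot <0, 3> receives a request gossiping <1, 1>.
# Its max_ballot jumps to <1, 1> even though it never heard from p1 directly.
new_max_ballot = on_heartbeat_request((0, 3), (1, 1))
```

The point of the sketch is that the receiver learns about a ballot it can never see in any HeartbeatReply, which is what makes the rule risky under partial connectivity.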

Consider the following example:

  1. p1, p2, p3 are initially all connected to each other and p3 became leader with <0, 3>.
  2. p1 and p3 get disconnected so our connectivity becomes: p1-p2-p3.
  3. As p1 does not receive a HeartbeatReply from p3, it increments its ballot to <1, 1> and sends it to p2.
  4. p2 sends a HeartbeatRequest with max_ballot = <1, 1> to p3.
  5. Upon getting that HeartbeatRequest from p2, p3 sets its own max_ballot to <1, 1> but will never actually get the HeartbeatReply from p1. In check_leader(), p3 will realise that the top_ballot in the HeartbeatReplies is less than its max_ballot and will therefore increment its ballot to <2, 3>, as seen here.

Then the same sequence from step 4 onwards occurs again, but this time with max_ballot = <2, 3> gossiped from p2 to p1. p1 will not get the HeartbeatReply from p3 and will increment its ballot again. We will never establish a stable leader.

If we do not include the max_ballot in the HeartbeatRequest, then we are fine even with the p1-p2-p3 connectivity. Again, p1 increments to <1, 1>, but p2 will now NOT gossip it to p3 (as we have removed max_ballot from HeartbeatRequest). Hence, p3 will just continue to get a HeartbeatReply of <0, 2> from p2 and will therefore still think it is the leader when it has actually been overtaken, but that's fine.
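The difference between the two variants can be reproduced with a small round-based simulation. This is only a sketch under simplifying assumptions, not the actual implementation: rounds are synchronous, majority checks are omitted, ballots are (n, pid) tuples, and `Process`, `simulate` and `gossip_max_ballot` are hypothetical names. Only `check_leader`, `max_ballot` and `top_ballot` mirror names used in this issue.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Ballot = Tuple[int, int]  # assumed ordering: by n first, then pid


@dataclass
class Process:
    pid: int
    ballot: Ballot
    max_ballot: Ballot
    leader: Optional[Ballot]

    def check_leader(self, replies: List[Ballot]) -> None:
        # top_ballot is the largest ballot among the replies and our own.
        top_ballot = max(replies + [self.ballot])
        if top_ballot < self.max_ballot:
            # We know of a larger ballot but cannot hear from its owner.
            self.ballot = (self.max_ballot[0] + 1, self.pid)
            self.leader = None
        elif self.leader != top_ballot:
            self.max_ballot = top_ballot
            self.leader = top_ballot


def simulate(gossip_max_ballot: bool, rounds: int = 20) -> List[Ballot]:
    """Run the p1-p2-p3 topology; return the largest ballot after each round."""
    # Everyone starts out agreeing that p3 leads with <0, 3>.
    procs = {pid: Process(pid, (0, pid), (0, 3), (0, 3)) for pid in (1, 2, 3)}
    links = {1: [2], 2: [1, 3], 3: [2]}  # p1 and p3 are disconnected
    history = []
    for _ in range(rounds):
        # Messages carry the sender's state from the start of the round.
        snapshot = {pid: (p.ballot, p.max_ballot) for pid, p in procs.items()}
        for pid, p in procs.items():
            if gossip_max_ballot:
                # HeartbeatRequest carries the peer's max_ballot.
                for peer in links[pid]:
                    p.max_ballot = max(p.max_ballot, snapshot[peer][1])
            # HeartbeatReply carries only the peer's own ballot.
            replies = [snapshot[peer][0] for peer in links[pid]]
            p.check_leader(replies)
        history.append(max(p.ballot for p in procs.values()))
    return history


with_gossip = simulate(gossip_max_ballot=True)
without_gossip = simulate(gossip_max_ballot=False)
```

In this model, `without_gossip` settles at <1, 1> after the first round and never changes again, while `with_gossip` keeps climbing as p1 and p3 take turns outbidding each other through p2.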

Removing max_ballot from HeartbeatRequest does not break anything in the normal case.

@haraldng haraldng added the bug Something isn't working label Feb 17, 2021