Test: local cluster of 6 nodes, leader elected + 5 followers failover fails #3
Comments
On another run it's a different case: a leader gets elected, but its state is not properly propagated to the other nodes:
Basically it boils down to the fact that all the followers become candidates at the same time and lock their vote on themselves.
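Simultaneous candidacy like this is what Raft's randomized election timeouts are meant to make rare. A minimal sketch of the idea follows; `randomElectionTimeout` and the 150-300 ms range are illustrative assumptions, not names or values taken from this project.

```javascript
// Hypothetical helper, not part of the library under test.
// Picks an election timeout uniformly from [minMs, maxMs).
// `random` is optional and only exists so the function can be tested.
function randomElectionTimeout(minMs, maxMs, random) {
  var rnd = random || Math.random;
  return minMs + Math.floor(rnd() * (maxMs - minMs));
}

// Each follower draws its own timeout, so they rarely all expire together
// and the first one to time out can usually collect votes from the rest.
var timeouts = [];
for (var i = 0; i < 5; i++) {
  timeouts.push(randomElectionTimeout(150, 300));
}
console.log(timeouts);
```

The timeout would need to be redrawn every time a node starts or restarts its election timer, otherwise two nodes that collided once would keep colliding.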
@pgte the whole voting and re-election part really feels broken. To reproduce what I'm experiencing, simply kill the leader during the tests; in about 70% of cases madness starts. I've partially remedied this by adjusting voting so that nodes actually vote for other nodes when the request term is higher than the local election term.

This is the updated local vote counter. It accounts for late responses from earlier terms by counting only votes granted in the current term:

    function onBroadcastResponse(err, args) {
      if (args) {
        var currentTerm = self.node.currentTerm();
        // count the vote only if it was granted for the term we are in now
        if (args.voteGranted && currentTerm === args.term) {
          votedForMe++;
          verifyMajority();
        }
      }
    }

And in state.js:

    S.onRequestVote = function onRequestVote(args, cb) {
      var self = this;
      var state = this.node.commonState.persisted;
      var currentTerm = state.currentTerm;
      var electionTerm = args.term;

      // reject requests from stale terms
      if (electionTerm < currentTerm) {
        return callback(false);
      }

      // update to the latest term and forget any vote cast in the old one
      if (electionTerm > currentTerm) {
        state.currentTerm = currentTerm = electionTerm;
        state.votedFor = false;
      }

      if (!state.votedFor || state.votedFor == args.candidateId) {
        var lastLog = state.log.last();
        if (lastLog && lastLog.term < args.lastLogTerm) {
          callback(true);
        }
        else if (args.lastLogIndex >= state.log.length()) {
          callback(true);
        }
        else {
          callback(false);
        }
      }
      else {
        callback(false);
      }

      function callback(grant) {
        if (grant) {
          state.votedFor = args.candidateId;
          self.node.emit('vote granted', state.votedFor);
        }
        cb(null, { term: currentTerm, voteGranted: grant });
      }
    };

One thing I'm not sure of is how updating the local term without appending a log entry affects the node's state, but I believe this should be fine.

Anyway, even after these changes there are more problems.

In extremely rare cases I've seen 2 leaders pop up at the same time (a state snapshot is taken every 200 ms). After the 2 leaders have been elected they start installing their snapshots, even on the other leader node, and we end up with all nodes as followers and no leader.
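Regarding the log comparison in onRequestVote above: as written, it can grant a vote when the candidate's last index is large enough even though its last term is older than the local one. The Raft paper's "at least as up-to-date" rule compares last terms first and falls back to the index only when the terms are equal. A sketch of that rule, with purely illustrative parameter names that are not part of this project's API:

```javascript
// Sketch of Raft's "at least as up-to-date" check (Raft paper, section 5.4.1).
// Assumes 1-based log indexes, with term 0 / index 0 meaning an empty log.
function candidateLogIsUpToDate(localLastTerm, localLastIndex, candLastTerm, candLastIndex) {
  if (candLastTerm !== localLastTerm) {
    // different last terms: the log with the later term wins outright
    return candLastTerm > localLastTerm;
  }
  // same last term: the longer (or equal) log wins
  return candLastIndex >= localLastIndex;
}

console.log(candidateLogIsUpToDate(3, 1, 2, 5)); // older term, longer log
console.log(candidateLogIsUpToDate(2, 5, 2, 5)); // identical logs
```

Whether this project's `state.log.length()` and `lastLogIndex` use the same index base would need to be checked before adopting a comparison like this.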
Often the peer's outgoing request queue is stuck, which I believe is the root of the problem. The callback is not invoked for a specific request, so the leader cannot propagate its election to the candidates, and they stay stuck requesting votes. The other case is that every candidate has actually stopped requesting votes, because its outgoing pipe is also "stuck". I'm looking into why the callback is not getting called (one thing I've noticed is that there is no listener for the connection close event, but that shouldn't matter during local tests). I will try to trace it back to the receiving node and see where it gets stuck there.
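One generic way to keep a queue from stalling on a lost reply is to guarantee that every request callback fires exactly once, either with the real response or with a timeout error. The sketch below is an assumption about how that could look; `withTimeout` and `sendThatNeverReplies` are hypothetical names, not functions from this project.

```javascript
// Hypothetical wrapper: returns a callback that invokes `cb` exactly once,
// either with the real arguments or with a timeout error after `ms`.
function withTimeout(cb, ms) {
  var done = false;
  var timer = setTimeout(function () {
    if (done) return;
    done = true;
    cb(new Error('request timed out after ' + ms + ' ms'));
  }, ms);
  return function () {
    if (done) return; // late or duplicate reply: ignore it
    done = true;
    clearTimeout(timer);
    cb.apply(null, arguments);
  };
}

// Stand-in for a transport whose reply gets lost.
function sendThatNeverReplies(request, callback) { /* reply is dropped */ }

sendThatNeverReplies({ type: 'RequestVote' }, withTimeout(function (err) {
  // the queue can move on to the next request instead of waiting forever
  console.log(err ? err.message : 'got a reply');
}, 50));
```

This only unblocks the sender; it does not explain why the reply went missing, so tracing the receiving side is still needed.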
Basically I create a local cluster: first I boot a leader, then the other nodes, and then the leader joins the other nodes.
After that I kill the leader using .close() - and then it's madness: the 5 followers become candidates, but none of them manages to get the votes even after more than 10 seconds.
Here are some logs