locks_leader:call/2 timeouts after joining new node #30
A race condition is possible, since this does not happen every time.
Sorry for not digging into this yet, but I've been a bit swamped. One thing to check when this happens is whether the locks themselves are consistent (e.g. by checking ets:tab2list(locks_server_locks) on each node).
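Something along these lines can be used to collect that table from every connected node for comparison (a quick sketch; locks_server_locks is the table mentioned above, while the module and function names are made up for illustration):

```erlang
%% Sketch: dump the locks_server_locks ETS table on the local node and on
%% every connected node, so the entries can be compared for consistency.
-module(locks_check).
-export([dump_locks/0]).

dump_locks() ->
    [{N, rpc:call(N, ets, tab2list, [locks_server_locks])}
     || N <- [node() | nodes()]].
```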
This happened recently in a group of 3 nodes; the result of ets:tab2list(locks_server_locks) is below:

…
Thanks, that's interesting (and sorry for missing your reply for so long).
Not sure if the above solves anything, but it closes a possible loophole and adds test cases. I have not yet been able to reproduce the problem.
This project https://github.com/ten0s/locks-test reproduces the issue. All nodes are started with '-connect_all false'. A new node is about to join a connected cluster that already has a leader. It starts locks_leader without connecting to the others and becomes a leader in its own separate cluster. A bit later the nodes ping each other, a netsplit happens, and sometimes some nodes hang at https://github.com/uwiger/locks/blob/master/src/locks_leader.erl#L510. If the new node connects to the others before starting locks_leader (https://github.com/ten0s/locks-test/blob/master/src/test.erl#L73), then everything works. The above fix doesn't change anything.
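For reference, the working variant boils down to something like this (a rough sketch, not the actual test.erl code; KnownNodes, start_candidate/3 and the start_link arity are illustrative and should be checked against the locks_leader API):

```erlang
%% Sketch of the workaround: connect the new node to the existing cluster
%% *before* starting the locks_leader candidate, so it never elects itself
%% in a one-node cluster of its own.
start_candidate(KnownNodes, Module, Arg) ->
    %% Ping the already-running nodes first; nodes that are down are ignored.
    [net_adm:ping(N) || N <- KnownNodes],
    %% Only now start the leader candidate (arity/arguments are illustrative).
    locks_leader:start_link(Module, Arg).
```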
FYI, I've been running your test program. The thing that seems to happen, at least some of the time, is that a node detects that it's no longer the leader and broadcasts a …

One problem I found from visual inspection is that the …

The good news is that the core locking algorithm doesn't seem to be at fault. I will have to attend to other things for a while, but will get back to this. Thanks for the interesting test case, and apologies for the delay in responding.
Before pushing, I let the …
I upgraded the locks version in https://github.com/ten0s/locks-test. AFAIS, the previous version c9b585a hangs within 30 secs, …
Thanks for being so tenacious! :)
I've traced the issue in more detail and found one stable pattern:

node_A: set node_A a leader
…

The key moment is that the am_leader message from node_A is sent before the am_leader message from node_B, but the messages are received by node_C in the "reverse" order. This is entirely possible, and it completely breaks the logic implemented in locks_leader. I couldn't find any easy way to fix locks_leader. The locks app handles all of this asynchronous-communication imbalance and does the job pretty well, but locks_leader rolled it back a bit... Could it be possible to use the internal locks info to deduce who the leader is, instead of passing the leadership around via announcement messages, which builds an extra layer of async communication?
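The reordering itself is allowed by Erlang's messaging guarantees: ordering is only preserved per sender/receiver pair, so node_C may legitimately see the two announcements in either order. A toy illustration of that point (nothing here is locks-specific; the am_leader tuples are just stand-ins):

```erlang
%% Toy illustration: two senders (think node_A and node_B) each send one
%% message to the same receiver (node_C).  Erlang only guarantees ordering
%% between a single sender and a single receiver, so across runs the
%% receiver may see {am_leader, a} and {am_leader, b} in either order.
reorder_demo() ->
    C = spawn(fun() ->
                      receive First -> io:format("first:  ~p~n", [First]) end,
                      receive Second -> io:format("second: ~p~n", [Second]) end
              end),
    spawn(fun() -> C ! {am_leader, a} end),
    spawn(fun() -> C ! {am_leader, b} end),
    ok.
```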
Yes, the …
A possibility, albeit clumsy, might be to have a timeout on the safe_loop, where the …
Currently …
Regarding timeouts - this is what we do as a workaround in the app using locks_leader: …
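Roughly, such a wrapper could look like this (a hypothetical sketch, not our actual code; only locks_leader:call/2 is from the library, everything else is plain Erlang):

```erlang
%% Sketch: wrap locks_leader:call/2 so the caller gets a plain
%% {error, Reason} tuple instead of an exit when the leader does not
%% answer within Timeout (or when the call itself exits).
call_with_timeout(Leader, Request, Timeout) ->
    {Pid, Ref} = spawn_monitor(fun() ->
                                       exit({ok, locks_leader:call(Leader, Request)})
                               end),
    receive
        {'DOWN', Ref, process, Pid, {ok, Reply}} -> {ok, Reply};
        {'DOWN', Ref, process, Pid, Reason}      -> {error, Reason}
    after Timeout ->
            erlang:demonitor(Ref, [flush]),
            exit(Pid, kill),
            {error, timeout}
    end.
```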
It might be possible to detect accidental reordering by saving a limited number of …

I don't have time to experiment with it right now, but will try to get around to it soon.
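One generic shape that idea could take (purely an illustrative sketch, not anything that exists in locks: it assumes announcements carry a monotonically increasing counter, which may or may not be available):

```erlang
%% Sketch of the "remember recent announcements" idea.  Assumes each
%% announcement is an {am_leader, Leader, Vsn} tuple with a monotonically
%% increasing Vsn (an assumption, not the real locks_leader format).
-module(ann_filter).
-export([new/0, accept/2]).

new() -> #{latest => 0, seen => []}.

%% Accept an announcement only if it is newer than the newest one seen;
%% keep a bounded history of the accepted ones.
accept({am_leader, Leader, Vsn}, #{latest := Latest, seen := Seen} = S)
  when Vsn > Latest ->
    {accept, S#{latest := Vsn, seen := lists:sublist([{Vsn, Leader} | Seen], 10)}};
accept({am_leader, _Leader, _Vsn}, S) ->
    {stale, S}.
```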
Please note PR #42.
Trying to use the locks app (master branch, c9b585a), I got an interesting failure in the scenario described below.
There were 4 nodes alive and connected to each other - A, B, C and D. Node D was the leader.
At some point a new node E was started; it discovered the other running nodes and connected to them.
Before node E had even connected to the other nodes, it decided it was a leader.
Once node E connected to the other nodes, it sent its leadership info to them. For each of the 3 non-leaders A, B and C, node E's locks_leader callback elected(State, Election, Pid) was called with the "Pid" of the "joined" node's (A, B or C) process. In turn, the locks_leader callback surrendered(State, Synch, Election) was called on nodes A, B and C.
When the new leader E connected to the old leader D, a netsplit happened. Node D won: its locks_leader callback elected(State, Election, undefined) was called, and all the other nodes (A, B, C and E) were notified via the callback surrendered(State, Synch, Election), so node E was informed that it was no longer the leader.
Since then, all locks_leader:call/2 calls made on nodes A, B and C ended up with a timeout. The same call made on D and E worked as usual with no errors. So it seems that the internal state of the locks_leader on the "passive" nodes A, B and C was compromised by the fighting leaders D and E...
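For context, elected/3 and surrendered/3 are callbacks of a locks_leader callback module. A stripped-down skeleton, with the signatures taken from the scenario above and the return shapes being an assumption (check them against the locks documentation), looks roughly like this:

```erlang
%% Rough skeleton of a locks_leader callback module (other required
%% callbacks omitted; return shapes are assumed, not verified).
-module(my_leader_cb).
-behaviour(locks_leader).
-export([init/1, elected/3, surrendered/3]).

init(Arg) ->
    {ok, #{arg => Arg}}.

%% Runs on the node that holds the leadership.  Pid is the pid of a single
%% joining candidate, or undefined when the whole cluster is addressed.
elected(State, _Election, _Pid) ->
    Sync = State,                %% data handed to the non-leaders
    {ok, Sync, State}.           %% assumed return convention

%% Runs on the nodes that lost (or gave up) the leadership.
surrendered(State, _Sync, _Election) ->
    {ok, State}.
```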