-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error handling and recovery for DC failures #422
Comments
@shamouda Since you've been working on adding redundancy and fault tolerance to DCs, maybe you can comment if you also observed this kind of problem and if your work will fix this. |
Peter,
I read Matthew’s description differently. He talks about replicas, which makes me thing that each node is emulating a full DC. The three nodes represent three DCs. Doesn’t this look a lot like the known issue with lack of anti-entropy?
Marc
… Le 27 mai 2020 à 12h02, Peter Zeller ***@***.***> a écrit :
When a node within a DC temporarily fails for about 1 minute the DC does not recover automatically after the node comes back up.
The cluster needs to be joined again manually.
One would expect that with one node down the DC would still be able to serve requests for keys not stored on the failed node. Apparently that is not the case.
This was reported by Matthew on Slack, full report below. I have not yet tried to reproduce it on my machine.
Hi, we have antidote running on three machines, all directly connected to each other via a dedicated interface (so 2 interfaces per machine, with 3 total wires). It behaves correctly in the absence of failures; however, there is an issue when we test bringing down interfaces between the nodes. With 3 nodes in a cluster, bringing down the interfaces of Node 1, one at a time, causes an asymmetric connection between the other two connected nodes. The behavior we are witnessing is that updates are not replicated in both directions. Node 2 can send updates that are replicated to Node 3, but not vice versa. Neither node 2 nor node 3 have had their interfaces touched, and their dedicated link remains healthy.
If we take down the interfaces on Node 1 all at once the cluster stays healthy. We are thinking that this could be because between when Node 1 loses connection to Node 2 and when Node 1 loses connection to Node 3, Node 1 is reporting Node 2's “failure” to Node 3, causing Nodes 1 and 3 to believe they are a majority partition. Then when Node 1 loses connection to Node 3, Node 3 believes it is alone. What is surprising to us is that Node 2's updates continue to reach Node 3 in this scenario, but not the reverse.
Have we hit upon the correct diagnosis for our strange behavior? If we have, do you folks know how we can resolve this network state ?
We're bringing down the connection within a single datacenter, and restoring it after about 1 minute (just long enough for timeouts to fire)
we do need to manually resubscribe to the restored node when it returns if we keep it down long enough, but that's not what worries us
what worries us is that two unrelated nodes experience communication interruption after we take one node down. all nodes are in the same DC.
we had assumed that the only possible effect of restricting communication to a single node in the DC is that the remaining healthy members would unsubscribe from that node. we did not anticipate that this could cause healthy nodes to unsubscribe from each other.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub <#422>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABS23H7WX6TGCZHSEVMKCJLRTTQK7ANCNFSM4NL7KSDA>.
|
Actually I had gotten confused myself about our setup. Marc is right, we have each node representing its own DC. |
We are using the native (Erlang) API to connect each DC via RPC. Not one of the clients that Peter asked about on Slack that use |
Thanks for the clarification. Maybe you can try if the problem is fixed with the changes from pull request #421. You can either compile the branch yourself or use the Docker image |
Unfortunately those changes did not change the behavior we are seeing. |
When a DC temporarily fails for about 1 minute the other DCs also fail to communicate.
After the failing DC has restarted the DCs needs to be joined again manually.
This was reported by Matthew on Slack, full report below. I have not yet tried to reproduce it on my machine.
The text was updated successfully, but these errors were encountered: