
Add permdown to topologies #14

Open
michalmuskala opened this issue May 1, 2018 · 8 comments

Comments

@michalmuskala
Contributor

This should send a named_permdown message to the sync_named subscribers once the topology detects a permdown of a node.
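
A rough sketch of what a subscriber might do with such a message, for illustration only - the exact message shapes below are assumptions, not the final API:

```elixir
# Hypothetical subscriber - the {:named_down, name} and
# {:named_permdown, name} shapes are assumptions, not Firenest's API.
defmodule MySubscriber do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def init(_opts), do: {:ok, %{}}

  # the node went down but may come back - keep its data around
  def handle_info({:named_down, name}, state) do
    {:noreply, Map.put(state, name, :down)}
  end

  # the topology decided the node is gone for good - its ephemeral
  # data can be safely discarded
  def handle_info({:named_permdown, name}, state) do
    {:noreply, Map.delete(state, name)}
  end
end
```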

@christianjgreen

@michalmuskala Will we be using logic similar to Phoenix PubSub's, i.e. a timer keeping track of how long a node has been down to determine whether it's permdown?

@michalmuskala
Contributor Author

We've discussed this today. The way permdown detection should work in the Erlang topology is:

  • once another node with the same name but a different version comes up, the old one should be considered permdown. The permdown event needs to be emitted strictly before the up event for the new node.
  • if the node doesn't come back up within some configurable timeout, it's considered permdown. The documentation should make it clear that it might nonetheless come back later, because networks are unreliable. This is similar to the current strategy in Phoenix.Tracker. (A rough sketch of both rules follows below.)
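
For illustration, a sketch of how a topology process might apply these two rules, assuming nodes are identified as {name, version} pairs and using a hypothetical notify/2 helper in place of the real broadcast to sync_named subscribers:

```elixir
defmodule PermdownSketch do
  @moduledoc """
  Sketch only - not Firenest internals. `notify/2` stands in for
  broadcasting to sync_named subscribers; `state.down` maps node
  names to the versions currently considered down.
  """

  # Rule 1: a node with the same name but a new version comes up.
  # Emit permdown for the old instance strictly before the up event.
  # (The sketch only covers the case where the old instance was
  # already seen as down.)
  def handle_up({name, new_version} = node, state) do
    state =
      case Map.fetch(state.down, name) do
        {:ok, old_version} when old_version != new_version ->
          notify(state, {:named_permdown, {name, old_version}})
          %{state | down: Map.delete(state.down, name)}

        _ ->
          state
      end

    notify(state, {:named_up, node})
    state
  end

  # Rule 2: the node stayed down past the configurable timeout
  # (a timer started when the down event was observed).
  def handle_permdown_timeout({name, _version} = node, state) do
    notify(state, {:named_permdown, node})
    %{state | down: Map.delete(state.down, name)}
  end

  # placeholder: in a real topology this would message subscribers
  defp notify(_state, _event), do: :ok
end
```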

@bitwalker

I'm a little surprised by the first point - that a node with the same name but different version would be considered permanently down. In many consistency models, Raft for example, a configuration is permanent until the cluster is instructed to change that configuration, and so nodes with the same name/address (one's choice) can go down for long periods of time, come back up, and still be considered a member of the cluster. In the model you've expressed though, an implementation of Raft would not be able to use the permdown event for anything useful, or rather, it would just ignore it.

Perhaps it is not important for it to be generally applicable to higher-level consistency models, but I do wonder if that implies that the concept of a permdown event is not particularly useful, since it doesn't always mean what one thinks it means. Or put another way, the concept of being permanently down is in many cases determined by application rules, not the network topology. I could see the concept of permdown based on a timeout being useful, as it would allow a Raft implementation to generate automatic configuration changes based on that event (whereas nodedown is just not reliable for that), but I don't think versioning is as useful.

I'm also curious how this potentially impacts systems being upgraded, i.e. why would a cluster want to treat a node as permanently down if it can potentially come up with the same state as the old version (such as in cases where the node commits to a persistent log).

Perhaps I'm thinking at the wrong level here, in which case definitely let me know, but since I'm working on stuff that would potentially leverage Firenest, I figured I would chime in with my thoughts on what you've proposed.

@christianjgreen

I may be interpreting this incorrectly, but I believe @michalmuskala is making the same point as you, @bitwalker. If a new node comes up with a different version, the consistency model should mark the node with the previous version as permdown.

I'm also curious how this potentially impacts systems being upgraded, i.e. why would a cluster want to treat a node as permanently down if it can potentially come up with the same state as the old version (such as in cases where the node commits to a persistent log).

I believe this is precisely why this feature should be implemented. By setting a timeout for what one considers permdown, you avoid waiting indefinitely for a node to come back. All of this assumes the same model as Phoenix PubSub, however.

@michalmuskala
Contributor Author

To clarify - by version I mean that in Firenest.Topology a node is identified right now by a tuple consisting of a regular node name and a unique version generated during startup.

The version of a node is different each time the node starts, so a restarted node always comes back with a new version. The only way for a node to go through a down/up cycle with the same version is a network issue - a node can't resurrect with the same version.

It's entirely reasonable for something built on top of Firenest.Topology to logically unify the same node across different versions; as you suggested, @bitwalker, a Raft implementation might do exactly that. It's still useful to handle permdown in many cases when you know the ephemeral data on another node is gone, because the node is gone.
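
To make the identity scheme concrete (purely illustrative - the actual representation and version-generation scheme inside Firenest.Topology may differ):

```elixir
# A node is identified by {name, version}, where the version is
# generated once per boot, so a restarted node never reuses an
# identity (the generation scheme below is just one possibility).
version = System.system_time(:millisecond)
id = {node(), version}

# Two sightings with the same name but different versions are two
# distinct logical nodes; only a network blip can bring back the
# exact same {name, version} pair.
same_instance? = fn {name1, v1}, {name2, v2} ->
  name1 == name2 and v1 == v2
end

IO.inspect(id, label: "node identity")
IO.inspect(same_instance?.(id, id), label: "same instance?")
```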

@bitwalker

@ArThien The reason why it doesn't matter how long a node is down in Raft (for example) is that when a node comes back, it is caught up by the leader regardless of how long it was gone; while it is down it does not participate in elections, but it is still counted in the cluster for purposes of determining quorum sizes. In many consistency models it is explicitly bad to automatically retire nodes, since you can't control how long a partition lasts, and you must be able to ensure that a split-brain situation cannot occur - if you automatically retire a node based on a timeout, you can potentially end up with a quorum on both sides of a partition based on the new cluster size as seen by each partition.

It's still useful to handle permdown in many cases when you know the ephemeral data on another node is gone, because the node is gone.

I think that's reasonable - I guess my point was more about whether that really belongs in Firenest versus the application layer, because that is where the rules around what constitutes permdown really are defined. That said, if it is a feature that consumers of Firenest opt into, rather than having to opt out of, then I think it is much more useful (since you can explicitly decide whether Firenest's rules around permdown are useful for your application, or decide to provide your own, but be able to surface your own permdown using the same message).

@josevalim
Member

A permdown will only be delivered after a down, and it is just a notification - the meaning is always added at the application layer. If a system does not care about the topology's definition of permdown, it can simply ignore it.

The alternative, indeed, is to keep this definition at the application level. One benefit is that it would not need to be implemented for every new topology. The downside is that we may need to make the node name a bit less opaque (I'm not sure if that is possible today, though). Maybe this is indeed best defined as a feature of the SyncedServer.

@keathley

keathley commented Nov 6, 2018

Just wanted to add some clarity around the Raft use case. Here's a single scenario that I think will help me explain my point of view.

Let's say we have a cluster of 5 nodes: A, B, C, D, E. Due to operator error or some sort of egregious fault, the cluster is partitioned into a group of 3 (A, B, C) and a group of 2 (D, E). During this time the group of 3 maintains a majority, so it can continue servicing requests. The error is not transient, so after the timeout, nodes on either side of the split receive permdown events for the nodes on the other side.

At this point there isn't a safe operation that we can take to automatically heal the cluster. For instance, we could issue a configuration change message between D and E so that they can start receiving traffic. But doing that effectively splits the operator's cluster forever, with no way for them to reconcile their state. From that perspective, our Raft library won't ever try to use these events to take action, because they are inherently dangerous and could cause lost writes.
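
To make the quorum arithmetic explicit (my own illustration, nothing Firenest-specific):

```elixir
# Majority quorum for a Raft cluster of n nodes.
quorum = fn n -> div(n, 2) + 1 end

quorum.(5) #=> 3 - only {A, B, C} can commit during the partition
quorum.(2) #=> 2 - if {D, E} reconfigured itself down to 2 nodes
           #      after the permdown timeout, it could also commit,
           #      and the two sides would diverge (split brain)
```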

In the above scenario the correct decision is to empower the operator to choose how they want to heal their cluster. They might choose to issue cluster change commands directly, attempt to repair the partition, etc. For them to take these actions, we need to send alerts to operators when we see issues like this. So while we probably couldn't ever safely use permdown to drive any cluster reconfiguration, we could definitely use it for triggering alarms to operators.

I'm not sure that's a use case that makes sense, or how this would affect the underlying design decisions, but I thought I would provide some clarity here.
