Could not discover and join another node. Proceeding as a standalone node - wait a minute![Suggestion] #13257

SimonUnge · 2025-02-12T23:31:20Z

RabbitMQ series

4.0.x

Operating system (distribution) used

*nix

How is RabbitMQ deployed?

Community Docker image

What would you like to suggest for a future version of RabbitMQ?

Hey,

Let's say we have a cluster of 3 nodes, A, B and C, that is up and running, and clustered with peer discovery.
For whatever unfortunate reason, Node A is restarted with its state wiped, i.e. a 'fresh' start for Node A. Node A then tries to join the configured cluster, in its eyes, for the first time. Nodes B and C disagree, as they think Node A is already a member.
Node A will retry, by default, 30 times. If it keeps failing, it will proceed as a standalone node. Let's throw some fuel on the fire and put a load balancer in front of these 3 nodes, which only sees 3 nodes that are up and running and assumes they are clustered. To end up here there of course needs to be a series of unfortunate events, plus failures in external monitoring systems to catch the issue. But still.

I wonder if there has been any thought about a configuration option that prevents a node from starting if it fails to join a configured cluster using peer discovery: after the 30 (default) retries, the node would stop instead of starting at all, to avoid a situation where we have 2 clusters running instead of 1?
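
A minimal sketch of the behaviour being suggested, written in Python rather than RabbitMQ's Erlang. It is not RabbitMQ code: `boot_node`, `try_join`, `fail_if_unjoined` and `JoinFailed` are hypothetical names used only for illustration, and the delay between attempts is left out. The default of 30 attempts is the figure quoted above.

```python
from typing import Callable


class JoinFailed(Exception):
    """Raised when the node refuses to boot because it never joined a peer."""


def boot_node(
    try_join: Callable[[], bool],
    retry_limit: int = 30,           # default quoted in the post above
    fail_if_unjoined: bool = False,  # hypothetical opt-in switch
) -> str:
    """Try to join the configured cluster, then decide how to proceed."""
    for _attempt in range(retry_limit):
        if try_join():
            return "clustered"
    # All attempts failed. Current behaviour: carry on alone.
    # Suggested opt-in behaviour: stop the node instead.
    if fail_if_unjoined:
        raise JoinFailed(f"could not join a peer after {retry_limit} attempts")
    return "standalone"
```

With `fail_if_unjoined=False` the function models what happens today (a second, one-node cluster appears); with `True` the node fails loudly, so an orchestrator or monitoring system can notice it.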

michaelklishin · 2025-02-13T01:56:18Z

Maintainer

@SimonUnge it was discussed. There wasn't much opposition to it, but then the focus shifted to #13050 and this particular change went nowhere.

SimonUnge · 2025-02-13T17:24:24Z

Author

@michaelklishin Oh ok, interesting. So, basically the default would now be unlimited retries? This might actually be 'good enough', as one should have mechanisms that notice that a node fails to come up, and the logs would give us the heads up on why.

If my understanding is correct, I think we can close this one as answered :)

10 replies

mkuratczyk Feb 16, 2025
Maintainer

No objection from me. I don't know the details of some of the other peer discovery plugins so I'm not 100% sure it'd be a good idea to expose it as a generic option but perhaps yes.

michaelklishin Feb 16, 2025
Maintainer

I don't think there would be any mechanism-specific issues. We target the case where a node could not join the cluster to begin with after all :)

SimonUnge Feb 18, 2025
Author

Yeah, and as an 'opt-in' option only... So, I'll go ahead and add the schema version of it and see how that PR sits with you later!

mkuratczyk Mar 17, 2025
Maintainer

We discussed this today and agree that:

- starting as an independent node after a certain number of attempts is never the right thing to do
- automated "forget the previous incarnation and let the node rejoin" behaviour is reasonable and in fact already implemented for Khepri: "m2k_cluster_sync: Handle the case where a clustered node lost its state" (rabbitmq/khepri_mnesia_migration#16)

Moving this behaviour from the Mnesia->Khepri migration code into RabbitMQ directly, in a way that covers both Mnesia and Khepri-enabled clusters, would be the best approach. Right now it's the Khepri migration code (which also executes on Khepri startup) that reconfigures the remote nodes, effectively making them forget the node and allowing it to join as a new node.
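
The "forget the previous incarnation and let the node rejoin" idea can be pictured with a toy model. This is only a sketch in Python, not how khepri_mnesia_migration or RabbitMQ implement it: the `Cluster` class, the `incarnation` identifier and the `forget_stale` flag are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Node name -> incarnation id that the existing members last saw.
    # The incarnation id is a made-up stand-in for "the state the node had".
    members: dict[str, str] = field(default_factory=dict)

    def join(self, name: str, incarnation: str, forget_stale: bool) -> bool:
        """Return True if the node is accepted as a member."""
        known = self.members.get(name)
        if known is None or known == incarnation:
            # Unknown node, or the same incarnation coming back: accept.
            self.members[name] = incarnation
            return True
        # Same name, different incarnation: the node lost its state.
        if forget_stale:
            # Drop the stale record so the node can join as a new member.
            del self.members[name]
            self.members[name] = incarnation
            return True
        # Without forgetting, the join is refused on every retry.
        return False
```

For example, with members `{"A": "old", "B": "b1", "C": "c1"}`, a wiped node A that presents `"new"` is refused when `forget_stale=False` and accepted when `forget_stale=True`.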

Answer selected by michaelklishin

SimonUnge Mar 18, 2025
Author

Perfect, I will take a stab at it very soon! (unless already worked on)

mkuratczyk Mar 18, 2025
Maintainer

We are not working on this.
