Quorum queue setup on a datacenter with exactly two availability zones (racks) #11877

dev4342345235 · 2024-08-01T12:58:53Z

dev4342345235
Aug 1, 2024

Hi,

we would like to set up a RabbitMQ cluster using quorum queues in a single datacenter consisting of exactly two availability zones (racks).

Unfortunately, it is not clear to us from the RabbitMQ documentation whether this is a supported scenario. For example, RackAwareness mentions three racks, i.e. an odd number of racks. quorum-requirements mentions that the cluster will not work if the majority of nodes fails, which could happen with only two racks and an odd number of RabbitMQ nodes.

At the moment it seems that it is not possible to set up a reliable quorum cluster spanning two racks.

Could you therefore give us some insight into what we could do?

Some ideas we were thinking about, but are not sure if they will work:

With 3 quorum nodes, add a fourth node to the rack that hosts only one quorum node. If the other rack fails, add this "waiting node" to the quorum group. But how do we ensure that quorum is always reached on 2 nodes sitting in different racks?
Use an even number of RabbitMQ nodes as the quorum and thereby "force" RabbitMQ to sync to all nodes in all racks. We are not sure if an even number is supported.

Edit: To clarify my question: my point is about handling the failure of a whole rack. All nodes on that rack will immediately become unavailable. If that rack hosted 2 of 3 RabbitMQ nodes, we would have only one node left, which is, as far as I understand, not a working condition for QQ.

The examples in the RabbitMQ blog (see link above) use an odd number of racks, which would avoid that issue, but at the moment we only have two racks.

Best
Chris

michaelklishin · 2024-08-01T14:25:24Z

michaelklishin
Aug 1, 2024
Maintainer

@dev4342345235 it is perfectly possible to set up a reliable quorum queue that has three replicas, which means three cluster nodes. They can use two racks if that's your limit for any reason.

Two replica QQs are not supported. A clear majority is very important for any practical Raft-based system, and there is no such thing in a two-replica quorum queue, for fairly obvious reasons.
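The majority argument can be made concrete with a little arithmetic. The following is a minimal sketch, not RabbitMQ code: the function names are made up for illustration, and it only assumes the usual Raft rule that more than half of the replicas must be reachable.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that is strictly more than half."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority is still reachable."""
    return replicas - majority(replicas)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"replicas={n} majority={majority(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

Under this rule, two replicas need both members online, so they tolerate no failure at all, and a fourth replica tolerates no more failures than three do.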

2 replies

dev4342345235 Aug 1, 2024
Author

Sorry for unmarking this as the answer, but I still do not get the point about availability in case of a rack failure. Please see my additional comment.

Best
Chris

dev4342345235 Aug 1, 2024
Author

I have updated my original question to be more clear about my specific point.

Best
Chris

michaelklishin · 2024-08-01T14:26:58Z

michaelklishin
Aug 1, 2024
Maintainer

Quorum queues will not be aware of racks. You need to make sure every rack has a RabbitMQ node deployed to it, and that there are three nodes total. Nothing beyond that. Exactly the same limitation and solution applies to streams.

0 replies

dev4342345235 · 2024-08-01T14:47:49Z

dev4342345235
Aug 1, 2024
Author

Thank you for your fast response!

Good to hear, but I still don't get how the cluster is able to survive if the rack hosting the majority of the nodes goes down. Please take a look at the diagrams from the RabbitMQ website:

With only two racks we would have a scenario as depicted in the first picture:

The documentation recommends splitting the three nodes across three racks:

So my question is: how could we achieve availability with two racks when the rack hosting the majority of the nodes fails? Do we need to go up to 5 nodes, with 3 nodes on Rack-A and 2 on Rack-B? But then we would still lose the majority of the nodes if Rack-A fails. From my understanding, only 2 node failures are acceptable if 5 nodes are used in total.

Best
Chris

12 replies

michaelklishin Aug 1, 2024
Maintainer

With two racks, assuming an entire rack can fail, you cannot. You need to use three or assume that the risk of a rack failure is not important enough compared to host/node failure.

Two replica QQs and streams is an explicitly unsupported configuration (of course, you can extend a QQ or stream to just two replicas but it won't offer much in terms of availability).
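Why two racks can never cover a full rack failure can be checked by brute force. This is a hedged sketch, not anything RabbitMQ ships: the function names are hypothetical, a layout is just a tuple of replica counts per rack, and the only assumption is that a strict majority of replicas must survive.

```python
from itertools import product


def survives_any_rack_loss(layout: tuple[int, ...]) -> bool:
    """True if a majority of replicas remains after losing any one rack."""
    total = sum(layout)
    needed = total // 2 + 1
    return all(total - lost >= needed for lost in layout)


def layouts(replicas: int, racks: int) -> list[tuple[int, ...]]:
    """Every way to place `replicas` on `racks`, at least one per rack."""
    return [
        combo
        for combo in product(range(1, replicas + 1), repeat=racks)
        if sum(combo) == replicas
    ]


def any_safe_layout(replicas: int, racks: int) -> bool:
    """True if at least one layout survives the loss of any single rack."""
    return any(survives_any_rack_loss(l) for l in layouts(replicas, racks))


if __name__ == "__main__":
    for replicas, racks in ((3, 2), (5, 2), (3, 3), (5, 3)):
        print(replicas, racks, any_safe_layout(replicas, racks))
```

With two racks, one rack always holds at least half of the replicas, so losing it leaves no majority regardless of the replica count. With three racks, one replica per rack is already enough.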

Answer selected by dev4342345235

dev4342345235 Aug 1, 2024
Author

Ok, thank you for your super fast support here. :)

I was already afraid that with only two racks we would have an issue. As far as I understand you, the only way to go now is to convince the customer to add an additional rack if they want to be prepared for a full rack failure and still have a running cluster.

Best
Chris

michaelklishin Aug 1, 2024
Maintainer

Like I said, full rack failure is a condition that many environments can consider to be too improbable to spend extra money on addressing. This is the majority of RabbitMQ clusters in the wild.

When that's not the case, then yes, three racks is the minimum since that's the minimal replication factor supported by modern replicated data types in RabbitMQ.

kamilzzz Jan 10, 2025

What about having, for example, a 5-node cluster with nodes spread across 3 racks?
As far as I can see, there is no option to configure quorum queue members to be spread across racks/AZs/physical hosts.

In that case, how do I ensure that the members of a specific quorum queue do not end up located on only 2 racks? That would have the same end result: one rack going down makes the queue unavailable (even though we had 3 racks, the quorum members were located on only 2 of them, with one rack having 1 member and the other having 2).
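The concern can at least be checked for a given placement. The sketch below is purely illustrative: the node names, the `rack_of` mapping, and the function name are hypothetical, and it does not query RabbitMQ. It only reports which racks, if lost, would leave a queue's members without a majority.

```python
from collections import Counter


def racks_that_break_quorum(members: list[str],
                            rack_of: dict[str, str]) -> list[str]:
    """Racks whose loss leaves fewer than a majority of the queue's members."""
    per_rack = Counter(rack_of[node] for node in members)
    needed = len(members) // 2 + 1
    return sorted(
        rack for rack, count in per_rack.items()
        if len(members) - count < needed
    )


# Hypothetical 5-node cluster spread across 3 racks.
RACK_OF = {
    "rabbit@n1": "A", "rabbit@n2": "A",
    "rabbit@n3": "B", "rabbit@n4": "B",
    "rabbit@n5": "C",
}

if __name__ == "__main__":
    # Two members share rack A, so losing rack A breaks the majority.
    print(racks_that_break_quorum(["rabbit@n1", "rabbit@n2", "rabbit@n3"], RACK_OF))
    # One member per rack: no single rack loss breaks the majority.
    print(racks_that_break_quorum(["rabbit@n1", "rabbit@n3", "rabbit@n5"], RACK_OF))
```

A placement with an empty result has at most a minority of members on any one rack, which is the property the question asks for.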

dev4342345235 Jan 10, 2025
Author

Not being a RabbitMQ expert myself, I would assume that in your case the queue is distributed to at least 3 of the 5 nodes of the cluster, so one rack with 2 nodes could fail, leaving one surviving node on a different rack storing the data... But a RabbitMQ expert should definitely comment on that.
