QQ grow to a target quorum cluster size #13873

Open · wants to merge 5 commits into base: main

Conversation

@Ayanda-D (Contributor) commented May 8, 2025

Proposed Changes

Hello team 👋

The following changes are a proposed extension to the rabbitmq-queues grow command. Currently this command only grows a set of queues matching --queue-pattern to a single target node per execution, so it must be run multiple times to restore a quorum queue's replicas back to N nodes (which can also be expensive due to the repeated RPCs from the temporary CLI execution node). As an extension, we'd like to allow growing queues to a specified number of nodes N, i.e. a target quorum cluster size, letting the broker carry out all of the grow processing across multiple nodes in a single CLI execution. The command signature becomes:

rabbitmq-queues grow <node | quorum_cluster_size> .... 

When quorum queues get into a bad state, e.g. become leaderless (#12701, #13101), and shrink operations (#12427) down to a single node are applied as a rescue attempt, quickly growing the affected queues back to N nodes (a target quorum cluster size) becomes a critical requirement.

Please take a look. Thanks!

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

@deadtrickster (Contributor)

Hi, what happens when the add-member call errors in the middle of the loop?

@michaelklishin (Collaborator) left a comment


@Ayanda-D from the early days of Ra and QQs we decided, by design, not to add N members at a time. Some other Raft implementations explicitly prohibit such parallel membership changes, or the logic of the spec becomes really hard to follow at times.

So at best we can do this after a pause of say, 10 seconds minimum. And ideally after learning that the previous membership change has completed.

@Ayanda-D (Contributor, Author) commented May 9, 2025

Hi folks, thanks for taking a look.

@michaelklishin I agree "parallel membership changes" would be tricky and hard to follow here. Everything is still sequential. We retain the exact same behaviour as grow <node>; the main extension for grow N is that the loop simply runs N times, executing grow <node-N> where <node-N> is chosen on behalf of the CLI caller (something users currently have to do manually over RPC; we just want the broker to do this for us for faster queue recovery). For each matched queue, if QuorumClusterSize has already been reached, there's no need to grow and the loop continues to the next queue matched by --queue-pattern (the rest of the behaviour is unchanged). After the loop runs N times, results are sent back to the CLI caller just as in the current grow implementation.

@deadtrickster if a grow fails in the middle of the loop, the behaviour of the current grow <node> loop is retained: the failure is added to the list of results, which are returned and formatted by the CLI grow command callback.

Functionality-wise not much has changed: we are just running the same grow command N times (or TargetQuorumClusterSize times), with an available node chosen and attempted on each run on behalf of the caller. The lists of successes and failures are returned to the caller as per current behaviour. It might be slower than growing in parallel, but I agree keeping this sequential is safe. We are still seeing these issues and need this option for faster recovery of the broken queues.
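To make the sequential behaviour described above concrete, here is a rough sketch in Python (the actual implementation is Erlang; all function names, data shapes, and the node-selection policy here are made-up illustrations, not the real RabbitMQ API):

```python
def grow_to_target_size(queues, cluster_nodes, target_size, grow_one):
    """Hypothetical sketch: run the single-node grow step repeatedly,
    sequentially, until each queue reaches target_size replicas.

    `grow_one(queue, node)` stands in for the existing `grow <node>`
    operation and returns True on success, False on failure.
    """
    results = []
    for q in queues:
        # If the target quorum cluster size is already reached,
        # skip this queue and continue with the next matched queue.
        while len(q["members"]) < target_size:
            # Choose an available node that does not yet host a replica.
            candidates = [n for n in cluster_nodes if n not in q["members"]]
            if not candidates:
                results.append((q["name"], "error", "no_nodes_available"))
                break
            node = candidates[0]
            if grow_one(q, node):
                q["members"].append(node)
                results.append((q["name"], "ok", node))
            else:
                # A mid-loop failure is recorded in the results, as in the
                # current grow behaviour, and we move on to the next queue.
                results.append((q["name"], "error", node))
                break
    return results
```

The list of per-step successes and failures is then returned to the CLI caller for formatting, mirroring the existing single-node grow flow.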

@michaelklishin (Collaborator) commented May 9, 2025

@Ayanda-D per discussion with @kjnilsson, we suggest doing the following before adding more members (replicas).

We need to make sure that every QQ member is a voter. That means the members have caught up enough for it to be genuinely safe to proceed. If that's not yet the case, the algorithm should skip such queues; the process can then be retried for them later.

Worst case the operator will have to re-run the grow command.

We should also do that before we start the process (if we don't do so already).

That variation of this change we are willing to accept.
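The precondition suggested above could be sketched as follows (hypothetical Python, not actual RabbitMQ/Ra code; the member/voter data shape is an assumption for illustration):

```python
def partition_queues_by_voter_status(queues):
    """Hypothetical sketch: only queues whose existing members are all
    voters are safe to grow; the rest are skipped so the operator can
    re-run the grow command once those members have caught up."""
    safe, skipped = [], []
    for q in queues:
        if all(m["voter"] for m in q["members"]):
            safe.append(q)
        else:
            skipped.append(q)
    return safe, skipped
```

Running this check both before starting and before each subsequent member addition would ensure no membership change begins while a previous one is still settling.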
