Description
Is your feature request related to a problem? Please describe.
The existing issue is the lack of a supported way to quiesce a cluster so backups can be performed. The second part is the lack of a scale-down operation, which is needed to replace the cluster during a restore.
Backup issue (using Kasten v6.0.11 and a custom Kanister blueprint):
- Have a 4-node RabbitMQ cluster managed by the operator
- The StatefulSet created by the operator has a readiness probe that checks a TCP socket on the AMQP port
- The operator does not support scaling down
Backup Idea:
This idea will cause the entire cluster to stop responding to client requests. That is acceptable at this point (it is really the client's responsibility to retry anyway).
I also want to accomplish this without the operator redeploying the entire cluster in response to changes (that would make the backup procedure too long).
- Put all nodes of the cluster into maintenance mode (this is all controlled by Kanister blueprint)
- On each node run `rabbitmq-upgrade --timeout 10 drain`
- All nodes will eventually be pseudo-up (each pod will show as not Ready, but the failing readiness probe will not cause the pod to be restarted)
- At this point, we have all nodes in a "quiesced" state (no client connections, no listeners working, messages are static/stable).
- In theory, we can now snapshot the underlying storage with all Rabbit configs and messages
- After the snapshot, run the corresponding revive on each node (`rabbitmq-upgrade revive`) to return the cluster to service
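The drain loop above can be sketched as a small shell script driven by `kubectl exec`. The namespace, StatefulSet name, and container name below are assumptions (the RabbitMQ Cluster Operator typically names the StatefulSet `<cluster>-server`); the commands are printed rather than executed so they can be reviewed first.

```shell
# Assumed names: namespace "rabbitmq", StatefulSet "rabbitmq-server",
# container "rabbitmq", 4 replicas.
NS=rabbitmq
STS=rabbitmq-server
REPLICAS=4

# Build the per-node drain commands; remove the 'echo' wrapper to run for real.
CMDS=$(for i in $(seq 0 $((REPLICAS - 1))); do
  echo "kubectl -n $NS exec ${STS}-${i} -c rabbitmq -- rabbitmq-upgrade --timeout 10 drain"
done)
printf '%s\n' "$CMDS"
```

A Kanister blueprint would run essentially the same loop from its backup phase, before triggering the storage snapshot.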
Idea Issue:
- Kasten will not back up a workload (StatefulSet, pod, container) that is not in a Ready state
- Because we put the RabbitMQ nodes into maintenance mode, the pods and the StatefulSet are not Ready
- So I figured I would be smart and temporarily modify the readiness probe on the StatefulSet
- This does not work, as the operator kicks in and reverts that setting. Even if I could override the readiness probe via the operator, that would require the cluster to be redeployed (we do not want that)
- I cannot modify the readiness probe on the pod directly either, as Kubernetes does not allow that while the pod is owned by a StatefulSet
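For reference, the attempted workaround was a JSON patch of this shape (names are assumptions; the path targets the first container's probe). As described above it does not help, because the operator reconciles the StatefulSet back; the command is printed rather than executed.

```shell
# Remove the readiness probe from the first container of the StatefulSet.
# The operator reverts this change on its next reconcile, so it is shown
# only to document the dead end.
PATCH='[{"op":"remove","path":"/spec/template/spec/containers/0/readinessProbe"}]'
CMD="kubectl -n rabbitmq patch statefulset rabbitmq-server --type=json -p '$PATCH'"
echo "$CMD"
```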
Restore Issue:
As mentioned, I currently use Kasten to back up my K8s workloads. Inherently, when Kasten restores to an existing workload with a PV attached, it scales the workload down so it can remove/replace the PV with the backed-up data.
- The operator does not support scale-down
- So Kasten cannot restore the RabbitMQ cluster, as it cannot remove the existing PVs (Kasten just loops and eventually times out/fails)
- I can bypass this by creating a Kanister execution hook (blueprint) that deletes the entire existing RabbitMQ cluster. Kasten can then replace the cluster, since the objects no longer exist
Ideas??
Any ideas on how this logic to quiesce a RabbitMQ cluster for backup might be accomplished?