
Backup Restore rabbit cluster managed by operator #1491

Closed as not planned
@brandtwinchell

Description


Is your feature request related to a problem? Please describe.
The first issue is being able to quiesce a cluster in order to perform backups. The second is being able to perform a scale-down operation so the cluster can be replaced during a restore.

Backup issue (using Kasten v6.0.11 and a custom Kanister blueprint):

  • Have a 4-node RabbitMQ cluster managed by the operator
    • the StatefulSet created by the operator has a readiness probe that checks a TCP socket on the AMQP port (inspection sketch below)
    • the operator does not support scale-down
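
For reference, the probe can be inspected on the operator-generated StatefulSet. A minimal sketch, assuming a RabbitmqCluster named `my-rabbit` in the `rabbitmq` namespace (placeholder names); the operator names the StatefulSet `<cluster-name>-server`:

```sh
# Print the readiness probe the operator sets on the rabbitmq container;
# it should show a tcpSocket check against the amqp port.
kubectl -n rabbitmq get statefulset my-rabbit-server \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="rabbitmq")].readinessProbe}'
```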

Backup Idea:
This idea will cause the entire cluster to stop responding to client requests. That is acceptable at this point (it is really the client's responsibility to retry anyway).
I also want to accomplish this without the operator trying to re-deploy the entire cluster because of the changes (that would make the backup procedure too long).

  • Put all nodes of the cluster into maintenance mode (this is all orchestrated by the Kanister blueprint)
    • On each node run rabbitmq-upgrade --timeout 10 drain (see the sketch after this list)
      • All nodes will eventually be pseudo-up (the pod shows as not Ready, but a failing readiness probe does not cause the pod to be restarted)
    • At this point, we have all nodes in a "quiesced" state (no client connections, no listeners running, messages are static/stable).
      • In theory, we can now snapshot the underlying storage with all Rabbit configs and messages
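
A minimal sketch of that drain step, assuming the same placeholder names (a `my-rabbit` cluster in the `rabbitmq` namespace); the operator creates the pods as `my-rabbit-server-0` through `my-rabbit-server-3` with a container named `rabbitmq`:

```sh
# Put every node into maintenance mode: drain closes client connections,
# suspends listeners, and hands off queue leadership.
for i in 0 1 2 3; do
  kubectl -n rabbitmq exec my-rabbit-server-$i -c rabbitmq -- \
    rabbitmq-upgrade --timeout 10 drain
done

# After the storage snapshot completes, nodes can be taken out of
# maintenance mode again with: rabbitmq-upgrade revive
```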

Idea Issue:

  • Kasten will not back up a workload (StatefulSet, pod, container) that is not in a ready state
    • Because we put the Rabbit nodes into maintenance mode, the pods and the StatefulSet are not in a ready state
    • So I figured I would be smart and temporarily modify the readiness probe on the StatefulSet (see the sketch after this list).
      • This does not work, as the operator kicks in and reverts the change. Even if I could override the readiness probe via the operator, that would require the cluster to be redeployed (we do not want that)
      • I cannot modify the readiness probe on the pod either, as Kubernetes does not allow changing it on a running pod that is part of a StatefulSet
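
For illustration, this is roughly the patch I attempted (a sketch, same placeholder names; the probe value shown is just a trivial always-succeeding check):

```sh
# Try to swap the readiness probe on the server StatefulSet for a no-op.
kubectl -n rabbitmq patch statefulset my-rabbit-server --type=json -p '[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/readinessProbe",
   "value": {"exec": {"command": ["true"]}}}
]'
# The cluster operator owns this StatefulSet, so its next reconcile
# reverts the probe and the change never sticks.
```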

Restore Issue:
As mentioned, I currently use Kasten to back up my K8s workloads. When Kasten restores to an existing workload with a PV attached, it scales the workload down so it can remove/replace the PV with the backed-up data.

  • The operator does not support scale-down
    • So Kasten cannot restore the Rabbit cluster, as it cannot remove the existing PVs (Kasten just loops and eventually times out/fails)
      • I can bypass this by creating a Kanister execution hook (blueprint) that deletes the entire existing Rabbit cluster. Kasten can then recreate the cluster, as the objects no longer exist (see the sketch below)
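
The "delete the cluster first" workaround is roughly the following (a sketch, placeholder names again): deleting the RabbitmqCluster custom resource makes the operator tear down the StatefulSet and pods, after which Kasten can recreate everything from the backup.

```sh
# Remove the existing cluster so the restore never has to scale it down.
kubectl -n rabbitmq delete rabbitmqcluster my-rabbit

# Assumption: the PVCs created from the volumeClaimTemplates are left behind
# and may also need to be removed before Kasten restores the volumes.
kubectl -n rabbitmq delete pvc -l app.kubernetes.io/name=my-rabbit
```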

Ideas?
Any ideas on how this approach to quiescing a Rabbit cluster for backup might be accomplished?

Labels: closed-stale (Issue or PR closed due to long period of inactivity), stale (Issue or PR with long period of inactivity)
