
[v2.8] BRO not backing up namespaces created for FleetWorkspaces #482

Closed
Daemonslayer2048 opened this issue Jun 26, 2024 · 19 comments

@Daemonslayer2048

Daemonslayer2048 commented Jun 26, 2024

Rancher Server Setup

  • Rancher version: 2.8.1
  • Installation option (Docker install/Helm Chart): Helm Install
  • Kubernetes Version and Engine: v1.25.4+rke2r1

SURE-8919

Describe the bug
When using FleetWorkspaces in Rancher, a new namespace is created for each workspace. Because this namespace is not backed up, attempting to restore on a new cluster fails, as the restore process does not recreate the namespace.

To Reproduce
Steps to reproduce the behavior:

  1. Create a fresh cluster
  2. Create FleetWorkspaces (See additional context below)
  3. Install the backup operator and take a backup (see the example manifests after this list)
  4. Delete cluster
  5. Restore Rancher on totally new cluster
  6. Observe restore failure
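
For reference, steps 3 and 5 can be driven with Backup and Restore manifests along these lines (a minimal sketch of the operator's CRDs; the resource names and backup filename are placeholders, and prune: false follows the documented migration flow):

# Step 3: take a backup using the operator's default ResourceSet
apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: fleet-workspace-backup
spec:
  resourceSetName: rancher-resource-set

---
# Step 5: restore on the new cluster from the generated backup file
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-migration
spec:
  backupFilename: <name-of-backup-file>.tar.gz   # replace with the actual file produced in step 3
  prune: false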

Expected behavior
I would expect one of two things to happen:

  1. Rancher restore creates the namespace as needed to allow the FleetWorkspace to be repopulated
  2. Rancher restore skips creating FleetWorkspaces so they do not prevent the restore from completing
    Option one is preferable, but option two will at least prevent end users from getting stuck.

Screenshots
Not needed

Additional context
Sample Fleet config

---
apiVersion: management.cattle.io/v3
kind: FleetWorkspace
metadata:
  name: enterprise
 
---
apiVersion: management.cattle.io/v3
kind: FleetWorkspace
metadata:
  name: edge

---
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: infra
  namespace: enterprise
spec:
  selector:
    matchExpressions: []
    matchLabels:
      infra: "true"
---
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: apps
  namespace: edge
spec:
  selector:
    matchExpressions: []
    matchLabels:
      apps: "true"
@ericpromislow
Collaborator

Is this a dup of the other recent failure to save fleet secrets?

@Daemonslayer2048
Author

Do you happen to know what issue number that would be?

@ericpromislow
Collaborator

rancher/rancher#44033

@dsmithbauer

@ericpromislow I do not believe so. This ticket specifically addresses the problem where restoring a Rancher backup from the Backup Operator fails if additional Fleet workspaces (besides fleet-default) are used, because the Fleet workspace namespaces are not stored in the backup and therefore not recreated. In the ticket you mention, however, we may also want to consider backing up secrets in any Fleet namespace (not just fleet-default). Otherwise, using the Rancher Backup Operator to restore a backup of a Rancher MCM cluster that uses Fleet to deploy services to downstream clusters does not work well and fails without manual intervention (Fleet workspaces and secrets are not recreated).

@mallardduck
Member

@Daemonslayer2048 - Can you clarify a few things for us:

  • When you tried this out, did the FleetWorkspace you created get backed up (is it in the backup file)?
  • If yes, did an error occur during the restore because of that resource that caused it to fail restoring other resources?
  • If yes, what other resources specifically failed to restore due to the FleetWorkspace failing to restore?

@jbiers jbiers self-assigned this Jul 24, 2024
@jbiers
Member

jbiers commented Jul 25, 2024

@Daemonslayer2048 - During the triage process for this issue we were not able to reproduce the error. The FleetWorkspaces are being backed up and, once restored into a new cluster, seemed to successfully recreate their namespaces and the resources in them.

If you can, share the error message you get when trying to restore the backup so we can investigate further. Also, make sure the default ResourceSet in your local cluster is unedited. The questions asked by @mallardduck are also important to help us understand your case.

@jbiers
Member

jbiers commented Jul 25, 2024

Also, be sure you are following the migration instructions step by step as detailed in the docs here.

@Daemonslayer2048
Author

I believe some things may be getting mixed up. @mallardduck, when you attempted to reproduce, did you do an etcd backup and restore, or just attempt a restore via Rancher Backup? If you use an etcd backup and restore, the namespace does get created, so that is not the issue here.

@jbiers I did follow those docs exactly; the issue does not show up unless the ClusterGroups are namespaced.

@mallardduck
Member

@Daemonslayer2048 - No, I'm not referring to etcd backups (snapshots). The Rancher Backup tool and this code base do not interact with etcd snapshots, so we do not use them during issue triage.

As mentioned, it would be helpful if you could provide us with the additional context requested in these questions:

  • When you tried this out, did the FleetWorkspace you created get backed up (is it in the backup file)?
  • If yes, did an error occur during the restore because of that resource that caused it to fail restoring other resources?
  • If yes, what other resources specifically failed to restore due to the FleetWorkspace failing to restore?

Having this context would be very helpful to better understand the issue you're reporting here and the root cause behind it. Our team is very interested in helping sort this out, whether it is caused by a bug in Rancher Backups or potentially somewhere in Fleet, but first we need a better understanding of the report to identify the root cause.


As @jbiers mentioned, Rancher Backups already captures FleetWorkspace resources, and upon restore the Fleet controller seems to correctly redeploy a matching Namespace for the FleetWorkspace. The wording of this case seemed to imply that was the root issue; however, I believe I now see what you're intending to report and would like you to confirm it.

Rather than the issue being with FleetWorkspace restores themselves, the issue is that restoring the ClusterGroup requires the Namespace it lives in to exist first. However, if we restore a FleetWorkspace and rely on the Fleet controller to recreate the Namespace, it's possible we attempt to restore a ClusterGroup into a Namespace that doesn't yet exist. Is this getting warmer or colder?

If this is warmer, then that's where the additional context I requested in the points above could have helped clarify. If we could review the logs from the Rancher Backup container during the restore, the sequence of events and root error would likely be more obvious. Granted, if I've now reached the correct understanding, those logs may not be necessary.

@jbiers
Member

jbiers commented Aug 2, 2024

@Daemonslayer2048 I have managed to reproduce your issue, at least partially. Here are the steps I followed and the observed behavior:

  • Spin up a cluster with Rancher and the Backup tool installed.
  • Create a FleetWorkspace in it of name enterprise, which automatically creates the enterprise namespace.
apiVersion: management.cattle.io/v3
kind: FleetWorkspace
metadata:
  name: enterprise
  • Create a ClusterGroup in the enterprise namespace.
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: infra
  namespace: enterprise
spec:
  selector:
    matchExpressions: []
    matchLabels:
      infra: "true"
  • Take a backup.
  • Restore the backup in a new cluster.

The observed behavior in this case was indeed a failure in the Restore resource: error restoring namespaced resources, check logs for exact error. However, the operator automatically starts a retry process in which the previously restored resources are kept and only the failed ones are retried. As such, creating the ClusterGroup eventually succeeds once the namespace has been created; it just takes a while longer.

What we suspect could be happening in your case is that in a production cluster with a lot to be restored, the retry process could indeed take a while, and it might be a matter of waiting longer until it completes. Can you try and see if you observe the same behavior in a lighter sandbox cluster?

Also, if you are a paying Rancher subscriber then a case could be opened with Rancher Support where they can formally file an issue and securely collect logs and debug data.

@dsmithbauer

Also, if you are a paying Rancher subscriber then a case could be opened with Rancher Support where they can formally file an issue and securely collect logs and debug data.

We are. That's how this came about. We actually worked with support and through professional consulting. They, in turn, opened this issue. If there is a more expedient way for us to work through this, please let me know.

@mallardduck
Member

Appreciate the reply @dsmithbauer - we'll follow up with @Daemonslayer2048 about creating an internal issue.

As mentioned in @jbiers' latest reply, our replication attempts found that FleetWorkspaces are restored eventually. However, due to how Rancher Backups works, they may not be restored on the "first pass" and instead fall into the "retry process" that @jbiers explained. As such, if your backup has a lot of resources, the time it takes for the restore to reach this point could be longer than expected.

However, the resource is eventually restored and the namespace is subsequently recreated too. If there are other, more specific issues or failures after this, they are not something we've observed, and we would need more specific details to replicate them.

So good next steps might be to attempt this in a lighter sandbox cluster, and to also collect logs/info from the affected cluster. The first option is something you can try anytime; however, to securely share logs and other information, please work with your Support team, who can then share logs from the support system with us via an internal ticket (which, as mentioned, we will ask @Daemonslayer2048 to create).

@Daemonslayer2048
Author

Hey @mallardduck, I just made an internal ticket on our side minutes beforehand.

@manno
Member

manno commented Sep 10, 2024

It's my opinion that backup/restore doesn't support custom workspaces at all. I added a PR to back up the resource and its namespace.

To test this manually, one could install the backup-restore operator and update its ResourceSet resource. That resource has a list of selectors; adding the selectors from the PR should be sufficient.

As a workaround for this issue, adding just the workspace's namespace directly should allow the backup to continue.

- apiVersion: "v1"
  kindsRegexp: "^namespaces$"
  resourceNameRegexp: "^enterpriseworkspace$"
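
For anyone applying that workaround, the entry goes into the resourceSelectors list of the operator's ResourceSet (named rancher-resource-set in a default install). A minimal sketch, reusing the namespace regexp from the snippet above:

apiVersion: resources.cattle.io/v1
kind: ResourceSet
metadata:
  name: rancher-resource-set
resourceSelectors:
  # ... keep the existing selectors, then append:
  - apiVersion: "v1"
    kindsRegexp: "^namespaces$"
    resourceNameRegexp: "^enterpriseworkspace$"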

@mallardduck mallardduck changed the title from "Rancher restore fails with FleetWorkspaces" to "BRO not backing up namespaces created for FleetWorkspaces" Sep 10, 2024
@mallardduck
Member

mallardduck commented Sep 10, 2024

@manno - Does the enterpriseworkspace namespace get the label applied when the controller creates it? If so, I think we can start backing up the namespaces Fleet creates via the rule here:

- apiVersion: "v1"
  kindsRegexp: "^namespaces$"
  labelSelectors:
    matchExpressions:
      - key: "app.kubernetes.io/managed-by"
        operator: "In"
        values: ["rancher"]

If not, something similar to this label selector, but more explicit, could be added.

Edit: just tested - yes it is. PR to add: #577
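
For reference, the confirmation above means a workspace namespace created by the controller carries the label the selector matches on, roughly like this (illustrative; other labels and fields omitted):

apiVersion: v1
kind: Namespace
metadata:
  name: enterprise
  labels:
    app.kubernetes.io/managed-by: rancher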

@mallardduck
Member

mallardduck commented Sep 10, 2024

Fix (to capture namespaces) merged into main, and PRs for backports ready to merge after freeze is over. Leaving this as blocked status for now - once current freeze is over we will merge and move to QA.

@mallardduck mallardduck self-assigned this Sep 10, 2024
@mallardduck mallardduck added this to the v2.8.9 milestone Sep 23, 2024
@mallardduck
Member

/forwardport v2.9.3

@mallardduck mallardduck changed the title from "BRO not backing up namespaces created for FleetWorkspaces" to "[v2.8] BRO not backing up namespaces created for FleetWorkspaces" Sep 23, 2024
@mallardduck
Member

mallardduck commented Sep 23, 2024

For QA use - this issue will track this PR: #579
This release will have Charts usable for QA: https://github.com/rancher/backup-restore-operator/releases/tag/v4.0.4-rc.1

@mallardduck mallardduck added the fleet (Related to fleet integration) label Sep 23, 2024
@nickwsuse nickwsuse self-assigned this Sep 26, 2024
@nickwsuse

Verified on v2.8-head ID: 16f72cb && Rancher Backups v103.0.4+up4.0.4-rc.2

No errors were recorded in the restore logs for any of the following tests, and the workspaces were successfully restored.

Fresh Install

  1. Rancher Backups resourceSet shows updated regex for fleet workspaces ✅
  2. In-Place restore after creating Fleet workspaces and deleting a cluster ✅
  3. RKE1 local with RKE1 downstream cluster ✅
  4. RKE1 local with RKE2 downstream cluster ✅
  5. RKE1 local with RKE1 downstream cluster migrated to RKE2 local ✅
  6. RKE1 local with RKE2 downstream cluster migrated to RKE2 local ✅
  7. RKE2 local with RKE1 downstream cluster migrated to RKE2 local ✅
  8. RKE2 local with RKE2 downstream cluster migrated to RKE2 local ✅

Upgrade

  1. Rancher Backups resourceSet shows updated regex for fleet workspaces ✅
  2. In-Place restore after creating Fleet workspaces and deleting a cluster ✅
  3. RKE1 local with RKE1 downstream cluster ✅
  4. RKE1 local with RKE2 downstream cluster ✅
  5. RKE1 local with RKE1 downstream cluster migrated to RKE2 local ✅
  6. RKE1 local with RKE2 downstream cluster migrated to RKE2 local ✅
  7. RKE2 local with RKE1 downstream cluster migrated to RKE2 local ✅
  8. RKE2 local with RKE2 downstream cluster migrated to RKE2 local ✅
