[v2.8] BRO not backing up namespaces created for FleetWorkspaces #482
Comments
Is this a dup of the other recent failure to save fleet secrets?
Do you happen to know what issue number that would be?
@ericpromislow I do not believe so. This ticket specifically addresses the problem where restoring a Rancher backup from the Backup Operator fails if additional Fleet workspaces (besides fleet-default) are used, because the Fleet workspace namespaces are not stored in the backup and therefore not recreated. In the ticket you mention, however, we may also want to consider backing up secrets in any Fleet namespace (not just fleet-default). Otherwise, using the Rancher Backup Operator to restore a Rancher MCM cluster backup where Fleet is used to deploy services to downstream clusters does not work well and will fail without manual intervention (it will not create Fleet workspaces and secrets).
@Daemonslayer2048 - Can you clarify a few things for us:
@Daemonslayer2048 - During the triage process for this issue we were not able to reproduce the error. If you can, share with us the error message you get when trying to restore the backup and we can investigate it further. Also, make sure you have the default ResourceSet unedited in your local cluster. The questions asked by @mallardduck are also important to help us understand your case.
Also, be sure you are following the migration instructions step by step as detailed in the docs here.
I believe some things may be getting mixed up. @mallardduck, when you attempted to reproduce, did you do an etcd backup and restore, or just attempt a restore via the Rancher Backup? If you use an etcd backup and restore, the namespace does get created, so that is not the issue here. @jbiers I did follow those docs exactly; the issue does not show up unless the ClusterGroups are namespaced.
@Daemonslayer2048 - No, I'm not referring to etcd backups (snapshots). We do not interact with etcd snapshots in the Rancher Backup tool or our code base here, so we do not use or interact with them during issue triage. As mentioned, it would be helpful if you could provide us with the additional context requested in these questions:
Having this context would be very helpful to better understand the issue you're reporting here and the root cause behind it. Our team is very interested in helping sort this out, whether this is caused by a bug in Rancher Backups or potentially somewhere in Fleet. First we will need a better understanding of the report here to identify the root cause. As @jbiers mentioned, Rancher Backups already captures the FleetWorkspace resources themselves, so the issue may not actually be with those but rather with the namespaces they create. If that is warmer, then that's where the additional context I requested in those points above could have served to clarify. If we could review the logs from the Rancher Backup container during the restore, it's possible the sequence of events and the root error would be more obvious. Granted, if I've reached the correct understanding now, those logs may not be necessary.
@Daemonslayer2048 I have managed to reproduce your issue, or at least partially. Here are the steps I followed and the observed behavior:
The observed behavior in this case was indeed a failure in the Backup resource. What we suppose could be happening in your case is that, in a production cluster with a lot to be restored, the retry process could indeed take a while, and it might be a matter of waiting longer until it is completed. Can you try and see if you observe the same behavior in a lighter sandbox cluster? Also, if you are a paying Rancher subscriber, a case could be opened with Rancher Support, where they can formally file an issue and securely collect logs and debug data.
We are. That's how this came about. We actually worked with support and through professional consulting. They, in turn, opened this issue. If there is a more expedient way for us to work through this, please let me know.
Appreciate the reply @dsmithbauer - we'll follow up with @Daemonslayer2048 about creating an internal issue. As mentioned in @jbiers' latest reply, our replication attempts found that the restore can fail and retry at first; however, eventually the resource is restored and then subsequently the namespace is recreated too. If there are other, more specific issues/failures after this, then it's not something we've observed, and in that case we would need more specific details to replicate it. So good next steps might be to attempt this in a lighter sandbox cluster, and also to collect logs/info from the affected cluster. The first option is something you can try anytime; however, to securely share logs and other information you'll want to interface with your Support team, who can then securely share logs from the support system with us via an internal ticket (which, as mentioned, we will ask @Daemonslayer2048 to create).
Hey @mallardduck, I just made an internal ticket on our side minutes beforehand.
It's my opinion that backup/restore doesn't support custom workspaces at all. I added a PR to back up the resource and its namespace. To test this manually, one could install the backup-restore operator and update its ResourceSet resource. That resource has a list of selectors; adding the selectors from the PR should be sufficient. As a workaround for this issue, adding just the workspace's namespace directly should allow the backup to continue.
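For context, here is a rough, hedged sketch of the kind of selector additions being described, not the exact contents of the PR. The ResourceSet name (rancher-resource-set), the regex values, and the example workspace name custom-workspace are assumptions to adapt to your installation; check the linked PR for the authoritative selectors.

```yaml
# Hedged sketch only: extra resourceSelectors appended to the operator's
# default ResourceSet so that FleetWorkspace objects and their backing
# namespaces are included in backups. Names and regexes are placeholders.
apiVersion: resources.cattle.io/v1
kind: ResourceSet
metadata:
  name: rancher-resource-set   # assumed default name; verify in your cluster
resourceSelectors:
  # ...existing selectors shipped with the chart remain unchanged...
  # Capture the FleetWorkspace custom resources themselves.
  - apiVersion: management.cattle.io/v3
    kindsRegexp: "^fleetworkspaces$"
  # Workaround described above: explicitly include the namespace that backs
  # a custom workspace (placeholder name "custom-workspace").
  - apiVersion: v1
    kindsRegexp: "^namespaces$"
    resourceNames:
      - "custom-workspace"
```

The last selector on its own corresponds to the workaround of adding just the workspace's namespace directly.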
@manno -
Edit: just tested - yes it is. PR to add: #577
Fix (to capture namespaces) merged into main, and PRs for backports ready to merge after the freeze is over. Leaving this in blocked status for now - once the current freeze is over we will merge and move to QA.
/forwardport v2.9.3
For QA use - this issue will track this PR: #579
Verified on v2.8-head ID:
Rancher Server Setup
SURE-8919
Describe the bug
When using FleetWorkspaces in Rancher, a new namespace is created for each workspace. Because of this, if a user attempts to restore on a new cluster, the restore process will fail, as it will not create that namespace.
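To make the behavior concrete, a minimal, hypothetical FleetWorkspace like the one below (the name is a placeholder) causes Rancher to create a namespace of the same name, and it is that namespace which, per this report, the restore does not recreate on a fresh cluster.

```yaml
# Illustrative placeholder: creating this FleetWorkspace leads Rancher to
# create a namespace named "custom-workspace"; that namespace is what the
# restore reportedly fails to recreate on the target cluster.
apiVersion: management.cattle.io/v3
kind: FleetWorkspace
metadata:
  name: custom-workspace
```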
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I would expect one of two things to happen:
Option one is preferable, but option two will at least prevent end users from getting stuck.
Screenshots
Not needed
Additional context
Sample Fleet config
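The original sample config is not included in this extract. As a stand-in, here is a minimal, hypothetical sketch of the kind of setup involved: a ClusterGroup and a GitRepo living in a custom workspace's namespace. All names, labels, and the repository URL are placeholders, not the reporter's actual values.

```yaml
# Placeholder sketch only; not the reporter's actual configuration.
# A ClusterGroup defined inside the custom workspace namespace
# (i.e. a "namespaced ClusterGroup" as mentioned above).
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: example-group
  namespace: custom-workspace
spec:
  selector:
    matchLabels:
      env: example
---
# A GitRepo deployed through that workspace, targeting the group above.
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: example-repo
  namespace: custom-workspace
spec:
  repo: https://example.com/org/repo.git
  branch: main
  paths:
    - manifests
  targets:
    - clusterGroup: example-group
```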