OADP-2410: Workaround for restic restore failure due to changed PSA policy

anarnold97 · anarnold97 · commit c9e78b57b326 · 2023-09-11T12:38:08.000+01:00
diff --git a/backup_and_restore/application_backup_and_restore/backing_up_and_restoring/backing-up-applications.adoc b/backup_and_restore/application_backup_and_restore/backing_up_and_restoring/backing-up-applications.adoc
@@ -2,6 +2,7 @@
 [id="backing-up-applications"]
 = Backing up applications
 include::_attributes/common-attributes.adoc[]
+include::_attributes/attributes-openshift-dedicated.adoc[]
 :context: backing-up-applications
 
 toc::[]
@@ -31,6 +32,15 @@ You can schedule backups by creating a `Schedule` CR instead of a `Backup` CR. S
 include::modules/oadp-creating-backup-cr.adoc[leveloffset=+1]
 include::modules/oadp-backing-up-pvs-csi.adoc[leveloffset=+1]
 include::modules/oadp-backing-up-applications-restic.adoc[leveloffset=+1]
+
+.Known issues
+
+{ocp} 4.14 enforces a pod security admission (PSA) policy that can prevent pods from becoming ready during a Restic restore process.
+
+This issue is resolved in the OADP 1.1.6 and OADP 1.2.2 releases. It is recommended that you upgrade to one of these releases.
+
+For more information, see xref:../../../backup_and_restore/application_backup_and_restore/troubleshooting.adoc#oadp-restic-restore-failing-psa-policy_oadp-troubleshooting[Restic restore partially failing on OCP 4.14 due to changed PSA policy].
+
 include::modules/oadp-using-data-mover-for-csi-snapshots.adoc[leveloffset=+1]
 
 [id="oadp-12-data-mover-ceph"]
diff --git a/backup_and_restore/application_backup_and_restore/troubleshooting.adoc b/backup_and_restore/application_backup_and_restore/troubleshooting.adoc
@@ -2,6 +2,7 @@
 [id="troubleshooting"]
 = Troubleshooting
 include::_attributes/common-attributes.adoc[]
+include::_attributes/attributes-openshift-dedicated.adoc[]
 :context: oadp-troubleshooting
 :namespace: openshift-adp
 :local-product: OADP
@@ -87,6 +88,7 @@ include::modules/oadp-item-restore-timeouts.adoc[leveloffset=+2]
 include::modules/oadp-item-backup-timeouts.adoc[leveloffset=+2]
 include::modules/oadp-backup-restore-cr-issues.adoc[leveloffset=+1]
 include::modules/oadp-restic-issues.adoc[leveloffset=+1]
+include::modules/oadp-restic-restore-failing-psa-policy.adoc[leveloffset=+2]
 
 include::modules/migration-using-must-gather.adoc[leveloffset=+1]
 include::modules/oadp-monitoring.adoc[leveloffset=+1]
diff --git a/modules/oadp-backing-up-applications-restic.adoc b/modules/oadp-backing-up-applications-restic.adoc
@@ -39,3 +39,5 @@ spec:
 ...
 ----
 <1> Add `defaultVolumesToRestic: true` to the `spec` block.
+
+
diff --git a/modules/oadp-backup-restore-cr-issues.adoc b/modules/oadp-backup-restore-cr-issues.adoc
@@ -55,12 +55,12 @@ You do not need to clean up the backup location because a `Backup` CR in progres
 [id="backup-cr-remains-partiallyfailed_{context}"]
 == Backup CR status remains in PartiallyFailed
 
-The status of a `Backup` CR without Restic in use remains in the `PartiallyFailed` phase and does not complete. A snapshot of the affiliated PVC is not created. 
+The status of a `Backup` CR without Restic in use remains in the `PartiallyFailed` phase and does not complete. A snapshot of the affiliated PVC is not created.
 
 .Cause
 
 If the backup is created based on the CSI snapshot class, but the label is missing, CSI snapshot plugin fails to create a snapshot. As a result, the `Velero` pod logs an error similar to the following:
-+
+
 [source,text]
 ----
 time="2023-02-17T16:33:13Z" level=error msg="Error backing up item" backup=openshift-adp/user1-backup-check5 error="error executing custom action (groupResource=persistentvolumeclaims, namespace=busy1, name=pvc1-user1): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass ocs-storagecluster-ceph-rbd: failed to get volumesnapshotclass for provisioner openshift-storage.rbd.csi.ceph.com, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=busybox-79799557b5-vprq
@@ -84,4 +84,4 @@ $ oc delete backup <backup> -n openshift-adp
 $ oc label volumesnapshotclass/<snapclass_name> velero.io/csi-volumesnapshot-class=true
 ----
 
-. Create a new `Backup` CR.
+. Create a new `Backup` CR.
diff --git a/modules/oadp-restic-restore-failing-psa-policy.adoc b/modules/oadp-restic-restore-failing-psa-policy.adoc
@@ -0,0 +1,74 @@
+:_content-type: PROCEDURE
+[id="oadp-restic-restore-failing-psa-policy_{context}"]
+= Restic restore partially failing on OCP 4.14 due to changed PSA policy
+
+{ocp} 4.14 enforces a Pod Security Admission (PSA) policy that can hinder the readiness of pods during a Restic restore process. 
+
+If a `SecurityContextConstraints` (SCC) resource is not found when a pod is created, and the PSA policy on the pod is not set up to meet the required standards, pod admission is denied. 
+
+This issue arises from the order in which Velero restores resources: pods can be restored before the SCC resources that they depend on.
+
+.Sample error
+[source,text]
+----
+\"level=error\" in line#2273: time=\"2023-06-12T06:50:04Z\" 
+level=error msg=\"error restoring mysql-869f9f44f6-tp5lv: pods\\\
+"mysql-869f9f44f6-tp5lv\\\" is forbidden: violates PodSecurity\\\
+"restricted:v1.24\\\": privileged (container \\\"mysql\\\
+" must not set securityContext.privileged=true), 
+allowPrivilegeEscalation != false (containers \\\
+"restic-wait\\\", \\\"mysql\\\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \\\
+"restic-wait\\\", \\\"mysql\\\" must set securityContext.capabilities.drop=[\\\"ALL\\\"]), seccompProfile (pod or containers \\\
+"restic-wait\\\", \\\"mysql\\\" must set securityContext.seccompProfile.type to \\\
+"RuntimeDefault\\\" or \\\"Localhost\\\")\" logSource=\"/remote-source/velero/app/pkg/restore/restore.go:1388\" restore=openshift-adp/todolist-backup-0780518c-08ed-11ee-805c-0a580a80e92c\n 
+velero container contains \"level=error\" in line#2447: time=\"2023-06-12T06:50:05Z\" 
+level=error msg=\"Namespace todolist-mariadb, 
+resource restore error: error restoring pods/todolist-mariadb/mysql-869f9f44f6-tp5lv: pods \\\
+"mysql-869f9f44f6-tp5lv\\\" is forbidden: violates PodSecurity \\\"restricted:v1.24\\\": privileged (container \\\
+"mysql\\\" must not set securityContext.privileged=true), 
+allowPrivilegeEscalation != false (containers \\\
+"restic-wait\\\",\\\"mysql\\\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \\\
+"restic-wait\\\", \\\"mysql\\\" must set securityContext.capabilities.drop=[\\\"ALL\\\"]), seccompProfile (pod or containers \\\
+"restic-wait\\\", \\\"mysql\\\" must set securityContext.seccompProfile.type to \\\
+"RuntimeDefault\\\" or \\\"Localhost\\\")\" 
+logSource=\"/remote-source/velero/app/pkg/controller/restore_controller.go:510\" 
+restore=openshift-adp/todolist-backup-0780518c-08ed-11ee-805c-0a580a80e92c\n]",
+----
+
+.Solution
+
+. In your DPA custom resource (CR), check the `restore-resource-priorities` field for the Velero server, and set it if necessary, so that `securitycontextconstraints` is listed before `pods` in the list of resources:
++
+[source,terminal]
+----
+$ oc get dpa -o yaml
+----
++
+.Example DPA CR
+[source,yaml]
+----
+# ... 
+configuration:
+  restic:
+    enable: true
+  velero:
+    args:
+      restore-resource-priorities: 'securitycontextconstraints,customresourcedefinitions,namespaces,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,-,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io' <1>
+    defaultPlugins:
+    - gcp
+    - openshift
+----
+<1> If you have an existing restore resource priority list, ensure you combine that existing list with the complete list.
+
+. Ensure that the application pods meet the required security standards, as described in link:https://access.redhat.com/solutions/7002730[Fixing PodSecurity Admission warnings for deployments], to prevent deployment warnings. If the application does not meet the security standards, an error can occur regardless of the SCC.
+
+[NOTE]
+====
+This solution is a temporary workaround. Discussions about how to address the issue are ongoing.
+====
+
+
+[role="_additional-resources"]
+.Additional resources
+
+* link:https://access.redhat.com/solutions/7002730[Fixing PodSecurity Admission warnings for deployments]
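
Review note (not part of the diff): the ordering check in step 1 of the solution can be sketched as a small shell helper. This is a minimal sketch, not part of OADP or Velero. The function name `scc_before_pods` is hypothetical, and it assumes that you copy the comma-separated `restore-resource-priorities` value by hand from the `oc get dpa -o yaml` output, with no spaces around the entries.

```shell
#!/bin/sh
# Hypothetical helper (assumption, not an OADP or Velero command):
# succeeds only when "securitycontextconstraints" appears before the
# first "pods" entry in a comma-separated priority list.
scc_before_pods() {
  # Wrap the list in commas so every entry can be matched as ",name,".
  list=",$1,"

  # Fail if "pods" is not in the list at all.
  case "$list" in
    *,pods,*) ;;
    *) return 1 ;;
  esac

  # Keep only the entries that come before the first "pods" entry.
  before_pods="${list%%,pods,*},"

  # Succeed only if the SCC resource is among those earlier entries.
  case "$before_pods" in
    *,securitycontextconstraints,*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example with a shortened, illustrative priority list.
priorities='securitycontextconstraints,customresourcedefinitions,namespaces,pods,services'
if scc_before_pods "$priorities"; then
  echo "securitycontextconstraints is restored before pods"
else
  echo "securitycontextconstraints is missing or listed after pods"
fi
```

The helper only checks the relative order of the two entries. It does not validate the rest of the list or change the DPA CR.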
