Velero restore doesn't create the AWS EBS volume from snapshot #8372
-
Hello, I use AWS EKS 1.31, Velero 1.14.1, and velero-plugin-for-aws 1.10.0. I created a backup using Velero and everything seemed to be fine (the backup files were uploaded into the S3 bucket and the volume snapshots were created) - until I tried to restore the backup. The strange behavior occurs when I want to restore it: the namespace, the pod definition, the PVC, and the PV are created. But the pod remains in ContainerCreating, because the volume is not created in AWS EC2 from the snapshot - the specific volume ID mentioned by Velero cannot be found. Below you can see the relevant outputs.
The PV and PVC are bound. The error from the restored pod:
The volume vol-0c5a4dda9327a30ef should be the volume used by the PV, but it is not created in AWS from the existing snapshot (the error is correct when it says that the volume cannot be found):
The storageClass:
I don't see any other errors in the velero log.
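One way to double-check that the volume really was never created is to query EC2 directly, using the volume ID and region from above:

# Fails with InvalidVolume.NotFound when the volume does not exist
aws ec2 describe-volumes --volume-ids vol-0c5a4dda9327a30ef --region eu-west-1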
-
For a CSI backup, the restored volume should actually be sourced from a VolumeSnapshot, which carries a snapshotHandle, not the volumeHandle seen here.
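A quick way to see the difference is to compare the PV's CSI volumeHandle with the snapshot handles recorded on the CSI VolumeSnapshotContents (a sketch; <pv-name> is a placeholder):

# The PV's CSI volume handle (a vol-... ID for EBS)
kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}'

# The snapshot handles (snap-... IDs) behind the CSI VolumeSnapshotContents
kubectl get volumesnapshotcontent -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.snapshotHandle}{"\n"}{end}'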
-
Thank you for the answer. I think I missed something. To understand correctly: is the AWS plugin not sufficient? The snapshot is created using only velero-plugin-for-aws; only the volume restore from the snapshot is not done. Edit: below is my current values.yaml file for velero (I'll add EnableCSI: true under the configuration section).

image:
repository: velero/velero
tag: v1.14.1
configuration:
backupStorageLocation:
- name: default
provider: aws
bucket: my-bucket-velero
config:
region: eu-west-1
volumeSnapshotLocation:
- name: aws
provider: aws
config:
region: eu-west-1
namespace: management
credentials:
useSecret: false
backupSyncPeriod: 60m
fsBackupTimeout: 1h
initContainers:
- name: velero-plugin-for-aws
image: velero/velero-plugin-for-aws:v1.10.0
volumeMounts:
- mountPath: /target
name: plugins
serviceAccount:
server:
create: true
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::111111111111:role/my-cluster-velero-role"
metrics:
enabled: true
scrapeInterval: 30s
scrapeTimeout: 10s
podAnnotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8085"
prometheus.io/path: "/metrics"
serviceMonitor:
enabled: true
additionalLabels:
release: kube-prometheus
prometheusRule:
enabled: false
schedules:
mybackup:
disabled: false
schedule: "12 02 * * *"
template:
ttl: "24h"
includeClusterResources: true
includedNamespaces:
- "*"
snapshotVolumes: true
deployNodeAgent: true
nodeAgent:
podVolumePath: /var/lib/kubelet/pods
#privileged: false
tolerations:
- effect: "NoSchedule"
operator: "Exists"
configMaps:
fs-restore-action-config:
labels:
velero.io/plugin-config: ""
velero.io/pod-volume-restore: RestoreItemAction
data:
image: velero/velero-restore-helper:v1.10.2
kubectl:
image:
repository: docker.io/bitnami/kubectl
upgradeCRDs: true
-
@Bobses Your PVC is using a CSI storage class, right? For backing up a CSI PVC you need to set the EnableCSI feature flag.
-
No, I don't have the EKS CSI snapshot controller addon installed - I'm going to install it. So, let's summarize:
I'll be back with the result.
-
I redeployed velero with EnableCSI set to true:

configuration:
  features:
    EnableCSI: true

I installed the AWS CSI snapshot controller for EKS and created a new IAM role for it with the following policy (Terraform installation):

{
"Statement": [
{
"Action": [
"ec2:CreateSnapshot",
"ec2:DeleteSnapshot",
"ec2:DescribeSnapshots"
],
"Effect": "Allow",
"Resource": "*"
}
],
"Version": "2012-10-17"
}

I created a VolumeSnapshotClass:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: ebs-csi-snapshot-class
labels:
velero.io/csi-volumesnapshot-class: "true"
driver: ebs.csi.aws.com
deletionPolicy: Delete

I created a new Velero backup:

apiVersion: velero.io/v1
kind: Backup
metadata:
name: backup2
namespace: management
annotations:
velero.io/csi-volumesnapshot-class_ebs.csi.aws.com: "ebs-csi-snapshot-class"
spec:
includedNamespaces:
- test-ebs

Unfortunately, from the description of backup2 it looks like the CSI snapshot path is not used and the volume snapshot was taken by velero-plugin-for-aws instead - and, of course, the volume is not recreated from that snapshot when I run the restore.

velero backup describe backup2 -n management
Name: backup2
Namespace: management
Labels: velero.io/storage-location=default
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"velero.io/v1","kind":"Backup","metadata":{"annotations":{"velero.io/csi-volumesnapshot-class_ebs.csi.aws.com":"ebs-csi-snapshot-class"},"name":"backup2","namespace":"management"},"spec":{"includedNamespaces":["test-ebs"]}}
velero.io/csi-volumesnapshot-class_ebs.csi.aws.com=ebs-csi-snapshot-class
velero.io/resource-timeout=10m0s
velero.io/source-cluster-k8s-gitversion=v1.31.2-eks-7f9249a
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=31
Phase: Completed
Namespaces:
Included: test-ebs
Excluded: <none>
Resources:
Included: *
Excluded: <none>
Cluster-scoped: auto
Label selector: <none>
Or label selector: <none>
Storage Location: default
Velero-Native Snapshot PVs: auto
Snapshot Move Data: false
Data Mover: velero
TTL: 720h0m0s
CSISnapshotTimeout: 10m0s
ItemOperationTimeout: 4h0m0s
Hooks: <none>
Backup Format Version: 1.1.0
Started: 2024-11-06 11:16:10 +0200 EET
Completed: 2024-11-06 11:16:13 +0200 EET
Expiration: 2024-12-06 11:16:10 +0200 EET
Total items to be backed up: 14
Items backed up: 14
Backup Volumes:
Velero-Native Snapshots:
ebs-pv-test: specify --details for more information
CSI Snapshots: <none included>
Pod Volume Backups: <none included>
HooksAttempted: 0
HooksFailed: 0

Where am I wrong?
-
I found the error: it is not

configuration:
  features:
    EnableCSI: true

but

configuration:
  features: EnableCSI

Now the volumes are recreated from snapshots - I tested this by deleting a full namespace with PVs, PVCs, deployments, a statefulset, etc. - but I see a warning:

It seems that it does not affect the velero restore, but I want to know how it can be solved. From my point of view, CSI should be enabled by default during installation. Another issue: I'm not able to get a full velero backup because I have PVs using the efs.csi.aws.com driver. I tried to exclude all EFS resources (see the code below), but with no success. Any idea here, please?
-
The restored EBS PVCs seem to be OK. As I said, I encountered the issue with the EFS PVCs - I have a lot of them in the cluster. On the other hand, we use AWS Backup for backing up the EFS. The solution I found: patching all PVCs created with the efs.csi.aws.com driver and adding the velero.io/exclude-from-backup=true label (see the sketch below). Thus, I was able to get an almost full velero backup of the cluster (except for the definitions of the excluded PVCs). Do you have another suggestion for getting a full backup of the k8s cluster when there are a lot of EFS PVCs?
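Roughly, the labelling looked like this (a sketch; it assumes every EFS-backed PV was provisioned by the efs.csi.aws.com driver, and <name>/<namespace> are placeholders):

# List the claims bound to PVs provisioned by the EFS CSI driver
kubectl get pv -o jsonpath='{range .items[?(@.spec.csi.driver=="efs.csi.aws.com")]}{.spec.claimRef.namespace}{"/"}{.spec.claimRef.name}{"\n"}{end}'

# For each namespace/name printed above, exclude the PVC from velero backups
kubectl label pvc <name> -n <namespace> velero.io/exclude-from-backup=true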
-
That sounds good, but the volumePolicy for excluding a specific storage class cannot be created from the helm chart's values.yaml file (at least I don't see any section where it can be added). Anyway, I am going to try the volumePolicy solution. Thanks for the suggestion. Later edit: unfortunately, the volumePolicy doesn't work:
version: v1
volumePolicies:
- conditions:
csi:
driver: aws.efs.csi.driver
storageClass:
- efs-sc-immediate
action:
type: skip
velero backup describe backup1 -n management
Name: backup1
Namespace: management
Labels: velero.io/storage-location=default
Annotations: velero.io/resource-timeout=10m0s
velero.io/source-cluster-k8s-gitversion=v1.31.2-eks-7f9249a
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=31
Phase: PartiallyFailed (run `velero backup logs backup1` for more information)
Resource policies:
Type: configmap
Name: exclude-efs
Errors:
Velero: message: /Timed out awaiting reconciliation of volumesnapshot ......................
name: /.................... message: /Error backing up item error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=........., name=.........................): rpc error: code = Unknown desc = failed to get volumesnapshot ............../.....................................: client rate limiter Wait returned an error: context deadline exceeded
message: /Timed out awaiting reconciliation of volumesnapshot ......................
name: /.................... message: /Error backing up item error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=........., name=.........................): rpc error: code = Unknown desc = failed to get volumesnapshot ............../.....................................: client rate limiter Wait returned an error: context deadline exceeded
Cluster: <none>
Namespaces: <none>
Namespaces:
Included: *
Excluded: <none>
Resources:
Included: *
Excluded: <none>
Cluster-scoped: auto
Label selector: <none>
Or label selector: <none>
Storage Location: default
Velero-Native Snapshot PVs: auto
Snapshot Move Data: false
Data Mover: velero
TTL: 720h0m0s
CSISnapshotTimeout: 10m0s
ItemOperationTimeout: 4h0m0s
Hooks: <none>
Backup Format Version: 1.1.0
Started: 2024-11-07 11:30:45 +0200 EET
Completed: 2024-11-07 11:52:32 +0200 EET
Expiration: 2024-12-07 11:30:45 +0200 EET
Total items to be backed up: 5049
Items backed up: 5049
Backup Item Operations: 5 of 5 completed successfully, 0 failed (specify --details for more information)
Backup Volumes:
Velero-Native Snapshots: <none included>
CSI Snapshots:
management/kube-prometheus-grafana:
Snapshot: included, specify --details for more information
management/prometheus-kube-prometheus-kube-prome-prometheus-db-prometheus-kube-prometheus-kube-prome-prometheus-0:
Snapshot: included, specify --details for more information
rmq/persistence-rabbitmq-cluster-server-0:
Snapshot: included, specify --details for more information
test-ebs/ebs-claim-test:
Snapshot: included, specify --details for more information
vcluster/data-vcluster-etcd-0:
Snapshot: included, specify --details for more information
Pod Volume Backups: <none included>
HooksAttempted: 0
HooksFailed: 0
-
Yes, you are right, this was the solution: I put the correct name of the AWS EFS driver in the configmap and the backup is now fine, without the PVCs from the EFS storage class. It was my mistake; I copied and pasted from your site without verifying the driver's name (I had just replaced ebs with efs). It would be great if there were a possibility to create this configmap from the helm chart's values.yaml file. I will mark your last answer as the solution, even though I had multiple questions in the same post. I'll come back if I encounter problems with this topic's issues, or open a new topic for other issues. Thank you again!
driver: aws.efs.csi.driver
That seems like a typo. Per the AWS code, the driver name should be
efs.csi.aws.com
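For reference, the corrected resource policy would then look something like this (a sketch mirroring the configmap from this thread, with only the driver name fixed):

version: v1
volumePolicies:
  - conditions:
      csi:
        driver: efs.csi.aws.com
      storageClass:
        - efs-sc-immediate
    action:
      type: skip

The configmap can then be recreated in the Velero namespace used in this thread, e.g. kubectl create configmap exclude-efs -n management --from-file=exclude-efs.yaml (the file name is an assumption).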