Velero restore doesn't create the AWS EBS volume from snapshot #8372
-
Hello, I use AWS EKS 1.31, Velero 1.14.1, and velero-plugin-for-aws 1.10.0. I created a backup using Velero and everything seemed to be fine (the backup files were uploaded into the S3 bucket and the volume snapshots were created) - until I tried to restore the backup. The strange behavior occurs when I want to restore it: the namespace, the pod definition, the PVC, and the PV are created. But the pod remains in ContainerCreating, because the volume is not created in AWS EC2 from the snapshot - the specific volume ID mentioned by Velero cannot be found. Below you can see the relevant outputs.
The PV and PVC are bound. The error from the restored pod:
The volume vol-0c5a4dda9327a30ef should be the volume used by the PV, but it is not created in AWS from the existing snapshot (the error is correct when it says that the volume cannot be found):
The storageClass:
I don't see any other errors in the velero log.
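One way to double-check that the volume really was never created is to query EC2 directly, using the volume ID and region from above:

# Fails with InvalidVolume.NotFound when the volume does not exist
aws ec2 describe-volumes --volume-ids vol-0c5a4dda9327a30ef --region eu-west-1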
-
For a CSI backup, the restored volume should actually be sourced from a VolumeSnapshot, which carries a snapshotHandle, not the volumeHandle seen here.
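A quick way to see the difference is to compare the PV's CSI volumeHandle with the snapshot handles recorded on the CSI VolumeSnapshotContents (a sketch; <pv-name> is a placeholder):

# The PV's CSI volume handle (a vol-... ID for EBS)
kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}'

# The snapshot handles (snap-... IDs) behind the CSI VolumeSnapshotContents
kubectl get volumesnapshotcontent -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.snapshotHandle}{"\n"}{end}'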
-
Thank you for the answer. I think I missed something. To understand correctly: is the AWS plugin not sufficient? The snapshot is created using only velero-plugin-for-aws; only the volume restore from the snapshot is not done. Edit: below is my current values.yaml file for velero (I'll add EnableCSI: true under the configuration section).

image:
repository: velero/velero
tag: v1.14.1
configuration:
backupStorageLocation:
- name: default
provider: aws
bucket: my-bucket-velero
config:
region: eu-west-1
volumeSnapshotLocation:
- name: aws
provider: aws
config:
region: eu-west-1
namespace: management
credentials:
useSecret: false
backupSyncPeriod: 60m
fsBackupTimeout: 1h
initContainers:
- name: velero-plugin-for-aws
image: velero/velero-plugin-for-aws:v1.10.0
volumeMounts:
- mountPath: /target
name: plugins
serviceAccount:
server:
create: true
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::111111111111:role/my-cluster-velero-role"
metrics:
enabled: true
scrapeInterval: 30s
scrapeTimeout: 10s
podAnnotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8085"
prometheus.io/path: "/metrics"
serviceMonitor:
enabled: true
additionalLabels:
release: kube-prometheus
prometheusRule:
enabled: false
schedules:
mybackup:
disabled: false
schedule: "12 02 * * *"
template:
ttl: "24h"
includeClusterResources: true
includedNamespaces:
- "*"
snapshotVolumes: true
deployNodeAgent: true
nodeAgent:
podVolumePath: /var/lib/kubelet/pods
#privileged: false
tolerations:
- effect: "NoSchedule"
operator: "Exists"
configMaps:
fs-restore-action-config:
labels:
velero.io/plugin-config: ""
velero.io/pod-volume-restore: RestoreItemAction
data:
image: velero/velero-restore-helper:v1.10.2
kubectl:
image:
repository: docker.io/bitnami/kubectl
upgradeCRDs: true
-
@Bobses Your PVC is using a CSI storage class, right? For backing up a CSI PVC you need to set the EnableCSI feature flag.
-
No, I don't have the EKS CSI snapshot controller addon installed - I'm going to install it. So, let's summarize:
I'll be back with the result.
-
I redeployed velero with EnableCSI set to true:

configuration:
  features:
    EnableCSI: true

I installed the AWS CSI snapshot controller for EKS and created a new IAM role for it with the following policy (Terraform installation):

{
"Statement": [
{
"Action": [
"ec2:CreateSnapshot",
"ec2:DeleteSnapshot",
"ec2:DescribeSnapshots"
],
"Effect": "Allow",
"Resource": "*"
}
],
"Version": "2012-10-17"
}

I created a VolumeSnapshotClass:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: ebs-csi-snapshot-class
labels:
velero.io/csi-volumesnapshot-class: "true"
driver: ebs.csi.aws.com
deletionPolicy: Delete

I created a new Velero backup:

apiVersion: velero.io/v1
kind: Backup
metadata:
name: backup2
namespace: management
annotations:
velero.io/csi-volumesnapshot-class_ebs.csi.aws.com: "ebs-csi-snapshot-class"
spec:
includedNamespaces:
- test-ebs

Unfortunately, from the description of backup2 it looks like the CSI snapshot path is not used and the volume snapshot was taken by velero-plugin-for-aws instead - and, of course, the volume is not recreated from that snapshot when I run the restore.

velero backup describe backup2 -n management
Name: backup2
Namespace: management
Labels: velero.io/storage-location=default
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"velero.io/v1","kind":"Backup","metadata":{"annotations":{"velero.io/csi-volumesnapshot-class_ebs.csi.aws.com":"ebs-csi-snapshot-class"},"name":"backup2","namespace":"management"},"spec":{"includedNamespaces":["test-ebs"]}}
velero.io/csi-volumesnapshot-class_ebs.csi.aws.com=ebs-csi-snapshot-class
velero.io/resource-timeout=10m0s
velero.io/source-cluster-k8s-gitversion=v1.31.2-eks-7f9249a
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=31
Phase: Completed
Namespaces:
Included: test-ebs
Excluded: <none>
Resources:
Included: *
Excluded: <none>
Cluster-scoped: auto
Label selector: <none>
Or label selector: <none>
Storage Location: default
Velero-Native Snapshot PVs: auto
Snapshot Move Data: false
Data Mover: velero
TTL: 720h0m0s
CSISnapshotTimeout: 10m0s
ItemOperationTimeout: 4h0m0s
Hooks: <none>
Backup Format Version: 1.1.0
Started: 2024-11-06 11:16:10 +0200 EET
Completed: 2024-11-06 11:16:13 +0200 EET
Expiration: 2024-12-06 11:16:10 +0200 EET
Total items to be backed up: 14
Items backed up: 14
Backup Volumes:
Velero-Native Snapshots:
ebs-pv-test: specify --details for more information
CSI Snapshots: <none included>
Pod Volume Backups: <none included>
HooksAttempted: 0
HooksFailed: 0

Where am I wrong?
-
I found the error: it is not

configuration:
  features:
    EnableCSI: true

but

configuration:
  features: EnableCSI

Now the volumes are recreated from snapshots - I tested this by deleting a full namespace with PVs, PVCs, deployments, a statefulset, etc. - but I see a warning:

It seems that it does not affect the velero restore, but I want to know how it can be solved. From my point of view, CSI should be enabled by default during installation. Another issue: I'm not able to get a full velero backup because I have PVs using the efs.csi.aws.com driver. I tried to exclude all EFS resources (see the code below), but with no success. Any idea here, please?
-
The restored EBS PVCs seem to be OK. As I said, I encountered the issue with the EFS PVCs - I have a lot of them in the cluster. On the other hand, we use AWS Backup for backing up the EFS. The solution I found: patching all PVCs created with the efs.csi.aws.com driver and adding the velero.io/exclude-from-backup=true label (see the sketch below). Thus, I was able to get an almost full velero backup of the cluster (except for the definitions of the excluded PVCs). Do you have another suggestion for getting a full backup of the k8s cluster when there are a lot of EFS PVCs?
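Roughly, the labelling looked like this (a sketch; it assumes every EFS-backed PV was provisioned by the efs.csi.aws.com driver, and <name>/<namespace> are placeholders):

# List the claims bound to PVs provisioned by the EFS CSI driver
kubectl get pv -o jsonpath='{range .items[?(@.spec.csi.driver=="efs.csi.aws.com")]}{.spec.claimRef.namespace}{"/"}{.spec.claimRef.name}{"\n"}{end}'

# For each namespace/name printed above, exclude the PVC from velero backups
kubectl label pvc <name> -n <namespace> velero.io/exclude-from-backup=true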
-
That sounds good, but the volumePolicy for excluding a specific storage class cannot be created from the helm chart's values.yaml file (at least I don't see any section where it can be added). Anyway, I am going to try the volumePolicy solution. Thanks for the suggestion. Later edit: unfortunately, the volumePolicy doesn't work:
version: v1
volumePolicies:
- conditions:
csi:
driver: aws.efs.csi.driver
storageClass:
- efs-sc-immediate
action:
type: skip
velero backup describe backup1 -n management
Name: backup1
Namespace: management
Labels: velero.io/storage-location=default
Annotations: velero.io/resource-timeout=10m0s
velero.io/source-cluster-k8s-gitversion=v1.31.2-eks-7f9249a
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=31
Phase: PartiallyFailed (run `velero backup logs backup1` for more information)
Resource policies:
Type: configmap
Name: exclude-efs
Errors:
Velero: message: /Timed out awaiting reconciliation of volumesnapshot ......................
name: /.................... message: /Error backing up item error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=........., name=.........................): rpc error: code = Unknown desc = failed to get volumesnapshot ............../.....................................: client rate limiter Wait returned an error: context deadline exceeded
message: /Timed out awaiting reconciliation of volumesnapshot ......................
name: /.................... message: /Error backing up item error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=........., name=.........................): rpc error: code = Unknown desc = failed to get volumesnapshot ............../.....................................: client rate limiter Wait returned an error: context deadline exceeded
Cluster: <none>
Namespaces: <none>
Namespaces:
Included: *
Excluded: <none>
Resources:
Included: *
Excluded: <none>
Cluster-scoped: auto
Label selector: <none>
Or label selector: <none>
Storage Location: default
Velero-Native Snapshot PVs: auto
Snapshot Move Data: false
Data Mover: velero
TTL: 720h0m0s
CSISnapshotTimeout: 10m0s
ItemOperationTimeout: 4h0m0s
Hooks: <none>
Backup Format Version: 1.1.0
Started: 2024-11-07 11:30:45 +0200 EET
Completed: 2024-11-07 11:52:32 +0200 EET
Expiration: 2024-12-07 11:30:45 +0200 EET
Total items to be backed up: 5049
Items backed up: 5049
Backup Item Operations: 5 of 5 completed successfully, 0 failed (specify --details for more information)
Backup Volumes:
Velero-Native Snapshots: <none included>
CSI Snapshots:
management/kube-prometheus-grafana:
Snapshot: included, specify --details for more information
management/prometheus-kube-prometheus-kube-prome-prometheus-db-prometheus-kube-prometheus-kube-prome-prometheus-0:
Snapshot: included, specify --details for more information
rmq/persistence-rabbitmq-cluster-server-0:
Snapshot: included, specify --details for more information
test-ebs/ebs-claim-test:
Snapshot: included, specify --details for more information
vcluster/data-vcluster-etcd-0:
Snapshot: included, specify --details for more information
Pod Volume Backups: <none included>
HooksAttempted: 0
HooksFailed: 0
-
Yes, you are right, this was the solution: I put the correct name of the AWS EFS driver in the configmap and the backup is now fine, without the PVCs from the EFS storage class. It was my mistake; I copied and pasted from your site without verifying the driver's name (I had just replaced ebs with efs). It would be great if there were a possibility to create this configmap from the helm chart's values.yaml file. I will mark your last answer as the solution, even though I had multiple questions in the same post. I'll come back if I encounter problems with this topic's issues, or open a new topic for other issues. Thank you again!
driver: aws.efs.csi.driver
That seems like a typo. Per the AWS code, the driver name should be
efs.csi.aws.com
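For reference, the corrected resource policy would then look something like this (a sketch mirroring the configmap from this thread, with only the driver name fixed):

version: v1
volumePolicies:
  - conditions:
      csi:
        driver: efs.csi.aws.com
      storageClass:
        - efs-sc-immediate
    action:
      type: skip

The configmap can then be recreated in the Velero namespace used in this thread, e.g. kubectl create configmap exclude-efs -n management --from-file=exclude-efs.yaml (the file name is an assumption).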