Explore using Velero for backups of Kubeflow user-namespaces #1097

Open
kimwnasptd opened this issue Oct 2, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@kimwnasptd
Contributor

kimwnasptd commented Oct 2, 2024

Context

As part of being able to do reversible upgrades (#1096), we need to be able to make a copy of:

  1. User Profile CRs
  2. The contents of the namespaces (Notebooks, ISVCs, PVCs and their contents, what else?)

One popular project for this purpose is Velero, which can use both the public cloud's infrastructure (e.g. the EBS CSI driver for backing up all PersistentVolumes in AWS) and File System Backup for vanilla K8s clusters (e.g. on top of OpenStack).

What needs to get done

  1. Deploy Velero alongside a CKF/KF cluster
  2. Try to make a backup of the cluster and restore

Definition of Done

  1. We have a list of commands for doing backup/restore
  2. We have a list of limitations and issues with using Velero
@kimwnasptd kimwnasptd added the enhancement New feature or request label Oct 2, 2024

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6345.

This message was autogenerated

@kimwnasptd
Contributor Author

kimwnasptd commented Oct 2, 2024

For setting up the necessary infra and CLIs I used the following instructions:

Install the Velero binary

wget https://github.com/vmware-tanzu/velero/releases/download/v1.14.1/velero-v1.14.1-linux-amd64.tar.gz
tar -xsvf velero-v1.14.1-linux-amd64.tar.gz
mv velero-v1.14.1-linux-amd64/velero ~/.local/bin

rm -rf velero-v1.14.1-linux-amd64
rm -rf velero-v1.14.1-linux-amd64.tar.gz

Setup S3
One option here is to use a local KF installation (MicroK8s, KinD etc.) and a local S3 like RadosGW from Ceph/MicroCeph. This setup works, but Velero cannot back up PersistentVolumes that are backed by hostPath storage, which is what KinD and MicroK8s use.

For RadosGW

sudo snap install microceph
sudo microceph cluster bootstrap
sudo ceph -s

sudo microceph disk add loop,4G,3
sudo ceph -s

sudo microceph enable rgw

# Use a dedicated variable so the shell's own $USER is not overwritten
RGW_USER=kimwnasptd
sudo radosgw-admin user create --uid="$RGW_USER" --display-name="$RGW_USER"
# sudo radosgw-admin user list

ACCESS_KEY=$(sudo radosgw-admin user info \
        --uid "$RGW_USER" \
        | jq -r ".keys[0].access_key")

SECRET_KEY=$(sudo radosgw-admin user info \
        --uid "$RGW_USER" \
        | jq -r ".keys[0].secret_key")

mc alias set radosgw "http://$(hostname)" "$ACCESS_KEY" "$SECRET_KEY"

# Buckets
mc mb radosgw/velero
mc ls radosgw

For AWS S3, you can follow Velero's instructions for setting up the IAM user and policies that let the Velero ServiceAccount talk to an AWS S3 bucket: https://github.com/vmware-tanzu/velero-plugin-for-aws#setup

Velero credentials

cat <<EOF > "./credentials-velero"
[default]
aws_access_key_id = $ACCESS_KEY
aws_secret_access_key = $SECRET_KEY
EOF
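One thing to watch for in the step above: `jq -r` prints the literal string `null` when a field is missing, so a failed `radosgw-admin user info` call can silently produce an unusable credentials file. A minimal sketch of a guard, where `write_velero_credentials` is a hypothetical helper name and not part of Velero:

```shell
#!/usr/bin/env bash
# Sketch: write the Velero credentials file only if both keys look usable.
# write_velero_credentials is a hypothetical helper, not a Velero command.
write_velero_credentials() {
    local access_key="$1" secret_key="$2" out="${3:-./credentials-velero}"
    local value

    # jq -r prints "null" for a missing field, so treat that like an empty value.
    for value in "$access_key" "$secret_key"; do
        if [ -z "$value" ] || [ "$value" = "null" ]; then
            echo "refusing to write $out: missing access or secret key" >&2
            return 1
        fi
    done

    # The file holds a secret, so create it readable by the owner only.
    ( umask 077
      printf '[default]\naws_access_key_id = %s\naws_secret_access_key = %s\n' \
          "$access_key" "$secret_key" > "$out" )
}

# Usage, with the variables from the RadosGW step:
# write_velero_credentials "$ACCESS_KEY" "$SECRET_KEY"
```

The output format is the same `[default]` profile as the heredoc above, so the file can be passed to `--secret-file` unchanged.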

@kimwnasptd
Contributor Author

Installing

Note that we can perform all the management of Velero (installation, operations) via kubectl and by manipulating K8s resources (Backup, Restore). The CLI is sugar on top of having to interact with K8s.

For installing we can use the following command (RadosGW)

BACKUP_BUCKET=velero  # the bucket created earlier with "mc mb radosgw/velero"
velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.10.0 \
    --bucket $BACKUP_BUCKET \
    --secret-file ./credentials-velero \
    --use-volume-snapshots=false \
    --use-node-agent \
    --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://$(hostname),publicUrl=http://$(hostname)

Pay close attention to the --use-node-agent flag. It enables the File System Backup functionality of Velero. Under the hood it installs a DaemonSet, which is responsible for backing up the PV filesystems (via the PVCs mounted to Pods). How exactly this works needs more investigation.

We can get a list of all the manifests that get applied by the velero CLI by setting --dry-run -o yaml

Lastly, the CLI helps with the creation of a BackupStorageLocation, which configures where the backups will be stored, and a VolumeSnapshotLocation, which configures where PV snapshots should be stored (only for supported providers, e.g. AWS, Azure etc.).
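As an illustration of the BackupStorageLocation mentioned above, here is a sketch of a manifest that mirrors the `--backup-location-config` values from the RadosGW install command. It assumes Velero runs in the `velero` namespace and that the location is named `default`; the `S3_URL` variable and the output file name are placeholders introduced for this example. The output of `velero install --dry-run -o yaml` remains the authoritative reference.

```shell
#!/usr/bin/env bash
# Sketch: a BackupStorageLocation equivalent to the install flags used above.
# S3_URL and the file name are placeholders for this example.
S3_URL="${S3_URL:-http://$(uname -n)}"      # same host as $(hostname) in the install command
BACKUP_BUCKET="${BACKUP_BUCKET:-velero}"    # bucket created with "mc mb radosgw/velero"

cat > backup-storage-location.yaml <<EOF
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: ${BACKUP_BUCKET}
  config:
    region: minio
    s3ForcePathStyle: "true"
    s3Url: ${S3_URL}
    publicUrl: ${S3_URL}
EOF

# Review the file, then apply it and check the reported phase:
# kubectl apply -f backup-storage-location.yaml
# kubectl -n velero get backupstoragelocations
```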

@kimwnasptd
Contributor Author

Next up, I explored how we can make backups of Profile CRs and the contents of the namespaces.

I managed to do this with the following command:

BACKUP_NAME=backup-$(date '+%Y-%m-%d--%H-%M')
velero backup create \
	$BACKUP_NAME \
	--include-namespaces kubeflow-user-example-com \
	--include-cluster-scoped-resources profiles.kubeflow.org
	# --default-volumes-to-fs-backup

velero backup describe $BACKUP_NAME --details

The above backs up all Profile CRs and, in this case, the kubeflow-user-example-com namespace explicitly. We can also select the namespaces to back up based on labels.

The --default-volumes-to-fs-backup flag enables the opt-out approach, which takes a File System Backup of all PVCs of running Pods.

There are 2 approaches we can take:

  1. Use the --default-volumes-to-fs-backup flag if we are not on a supported provider. The limitation is that this only works for PVCs mounted by running Pods
  2. Rely on the provider for taking volume snapshots and restoring them

With my current understanding, the default can be the provider's volume snapshots, with File System Backup as the fallback when the provider is not supported. We will need to investigate further when we might prefer one over the other.
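Since Velero operations can also be driven through custom resources, the backup command above can be sketched as a Backup manifest. This assumes Velero is installed in the `velero` namespace and is recent enough to support the cluster-scoped resource filter (the thread uses v1.14.1). `USER_NAMESPACE` and the output file name are placeholders introduced for this example.

```shell
#!/usr/bin/env bash
# Sketch: declarative counterpart of the "velero backup create" command above.
# USER_NAMESPACE and the file name are placeholders for this example.
BACKUP_NAME="${BACKUP_NAME:-backup-$(date '+%Y-%m-%d--%H-%M')}"
USER_NAMESPACE="${USER_NAMESPACE:-kubeflow-user-example-com}"

cat > "${BACKUP_NAME}.yaml" <<EOF
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: ${BACKUP_NAME}
  namespace: velero
spec:
  includedNamespaces:
    - ${USER_NAMESPACE}
  includedClusterScopedResources:
    - profiles.kubeflow.org
  # Approach 1 (opt-out File System Backup). Set to false, or drop the field,
  # when relying on the provider's volume snapshots instead (approach 2).
  defaultVolumesToFsBackup: true
EOF

# Review the file, then apply it and watch the phase:
# kubectl apply -f "${BACKUP_NAME}.yaml"
# kubectl -n velero get backup "${BACKUP_NAME}" -o jsonpath='{.status.phase}'
```

Keeping the choice between the two approaches in a single field makes it easy to template the manifest per environment.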

@kimwnasptd
Contributor Author

Lastly, I then tried to restore the Profile and the kubeflow-user-example-com namespace with the following command:

velero restore create --from-backup $BACKUP_NAME

This will result in a Restore CR that points to the Backup CR created in the previous step. When doing the restore, though, I noticed the following 2 issues:

  1. On AWS, the directories in the PV/PVC were restored with their contents, but the files at the root level of the volume were empty
  2. When restoring, the namespace no longer contains ownerReferences to the Profile (see vmware-tanzu/velero#4707, "restore with object metadata.ownerreference and metadata.finalizer")

The second one can be resolved by manually re-adding the ownerReferences that point to the Profile after the restore. But we will also need to iron out the restore process, and what it does when a namespace already exists.

On the already-existing case, I had seen some messages in the Velero server logs like the following:

could not restore, Service "test" already exists. Warning: the in-cluster version is different than the backed-up version
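The message above matches Velero's default behaviour of leaving existing objects untouched during a restore. One knob that may be worth testing is the restore's existing-resource policy, which can ask Velero to update objects that already exist. A sketch of a Restore manifest using it, assuming Velero in the `velero` namespace; `backup-example`, `RESTORE_NAME` and the file name are placeholders, and whether `update` is the right policy for Kubeflow namespaces is untested here.

```shell
#!/usr/bin/env bash
# Sketch: a Restore CR that asks Velero to update objects that already exist.
# backup-example, RESTORE_NAME and the file name are placeholders.
BACKUP_NAME="${BACKUP_NAME:-backup-example}"
RESTORE_NAME="${RESTORE_NAME:-restore-${BACKUP_NAME}}"

cat > "${RESTORE_NAME}.yaml" <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: ${RESTORE_NAME}
  namespace: velero
spec:
  backupName: ${BACKUP_NAME}
  # "none" (the default) skips existing objects; "update" tries to patch them.
  existingResourcePolicy: update
EOF

# Review the file, then apply it:
# kubectl apply -f "${RESTORE_NAME}.yaml"
#
# CLI equivalent:
# velero restore create --from-backup "$BACKUP_NAME" --existing-resource-policy update
```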

@kimwnasptd
Contributor Author

To summarise until this point:

  • I managed to use Velero to make a backup of a Profile and its namespace
  • Velero can use either File System Backup or provider-specific mechanisms for taking a backup of PersistentVolumes and PVCs
  • Velero installation and operations can all be done via plain manifests and interacting with CustomResources
  • There are some small rough edges, as described in the previous comment
  • We would need some further investigation on
    • Understanding the Velero components and their functionality (what does the node-agent DaemonSet do?)
    • Solidifying the story of when to use File System Backup and when to rely on providers
    • The inner workings of how the Restore and Backup CRs work, especially when some objects might already exist and what happens with ownerReferences
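For the ownerReferences point, the manual fix after a restore can be sketched as a merge patch on the namespace. This assumes the Profile and its namespace share a name, that the Profile is served as `kubeflow.org/v1`, and that the original reference had `controller` and `blockOwnerDeletion` set to true; all three should be checked against a healthy cluster first. `owner_ref_patch` is a hypothetical helper name. The UID has to be read from the restored Profile, since a restored object gets a new UID.

```shell
#!/usr/bin/env bash
# Sketch: build a JSON merge patch that re-attaches a namespace to its Profile.
# owner_ref_patch is a hypothetical helper; the apiVersion and the two boolean
# fields are assumptions, verify them on a cluster that was not restored.
owner_ref_patch() {
    local profile_name="$1" profile_uid="$2"
    printf '{"metadata":{"ownerReferences":[{"apiVersion":"kubeflow.org/v1","kind":"Profile","name":"%s","uid":"%s","controller":true,"blockOwnerDeletion":true}]}}\n' \
        "$profile_name" "$profile_uid"
}

# Usage against a cluster, after the restore has finished:
# NS=kubeflow-user-example-com
# PROFILE_UID=$(kubectl get profiles.kubeflow.org "$NS" -o jsonpath='{.metadata.uid}')
# kubectl patch namespace "$NS" --type merge -p "$(owner_ref_patch "$NS" "$PROFILE_UID")"
```

Note that a merge patch replaces the whole ownerReferences list, which is fine for a namespace that has lost all of its references but would overwrite any that are still present.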
