Demo Scripts

This repository contains scripts to demonstrate Cedana features.

Prerequisites

  • A Kubernetes cluster with Cedana installed

  • A GPU node with NVIDIA GPUs

  • Ubuntu 22.04 (these instructions assume it)
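
If the cluster is already up, a quick sanity check (this assumes kubectl is configured and the NVIDIA driver is installed; step 1 below covers the driver install):

# confirm the GPUs are visible on the node
nvidia-smi
# confirm the node has joined the cluster
kubectl get nodes -o wide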

Set up a k8s cluster

  1. Install the GPU drivers and CUDA toolkit. 12.4.1 is the latest version we support as of writing.
# set up the NVIDIA drivers and CUDA + toolkit
# use the link below for the runfile or deb installers on Ubuntu:
# https://developer.nvidia.com/cuda-12-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04
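
For the deb route, NVIDIA's network-repo install for CUDA 12.4 on Ubuntu 22.04 looks roughly like the following; treat it as a sketch and prefer the exact commands from the archive page above, since package names can drift:

# add NVIDIA's CUDA apt repository (keyring package per NVIDIA's instructions)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# install the toolkit and a matching driver
sudo apt-get install -y cuda-toolkit-12-4 cuda-drivers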
  2. Set up library paths for CRIU. (This should be fixed in the future; a ticket exists on Linear. We set up external mounts that follow links, and the affected paths can vary depending on the system and package version.)
# ensure /usr/lib/x86_64-linux-gnu/libelf.so.1
# and /usr/lib/x86_64-linux-gnu/libz.so.1
# are not symlinks

# copy the actual file over to the above paths to create real duplicates
# eg: cp /usr/lib/x86_64-linux-gnu/libelf.so.2 /usr/lib/x86_64-linux-gnu/libelf.so.1
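
A minimal sketch that automates this for both libraries (it assumes the two paths above are the only affected ones on your system):

# replace each symlink with a copy of the file it points to
for lib in /usr/lib/x86_64-linux-gnu/libelf.so.1 /usr/lib/x86_64-linux-gnu/libz.so.1; do
  if [ -L "$lib" ]; then
    real="$(readlink -f "$lib")"  # resolve the symlink to the actual file
    sudo rm "$lib"
    sudo cp "$real" "$lib"        # duplicate the real file at the old path
  fi
done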
  3. Set up K3s on the cluster (we will set it up as the root user to avoid permission issues):
# install k3sup
curl -sLS https://get.k3sup.dev | sh
sudo install k3sup /usr/local/bin/

# install k3s
k3sup install --local
# you can also pick the channel for the k8s version, and docker as the
# container runtime, but neither is needed for this demo
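
k3sup writes a kubeconfig into the current directory by default; a quick way to confirm the node is Ready:

# point kubectl at the kubeconfig k3sup just wrote
export KUBECONFIG=$(pwd)/kubeconfig
kubectl get nodes
# or use the kubectl bundled with k3s
sudo k3s kubectl get nodes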
  4. Set up Cedana using the Helm chart, with the images below (the changes haven't been merged into main at the time of writing):
# install the helm chart with
cedana/cedana-helper-test:fix-gpu-restore-runc
# and
cedana/cedana-controller-test:fix-gpu-runc-restore
# also ensure:
imagePullPolicy: Always
# (this is a temporary fix that lets us update the images easily; if the images
# are working, no further setup is needed)
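
A hedged sketch of what the install could look like; the repo URL, chart name, and values keys here are assumptions, so substitute the ones from the actual Cedana chart:

# hypothetical repo URL, chart name, and values keys -- check the cedana
# helm chart for the real ones
helm repo add cedana https://cedana.github.io/cedana-helm-charts
helm install cedana cedana/cedana \
  --namespace cedana-system --create-namespace \
  --set helperImage=cedana/cedana-helper-test:fix-gpu-restore-runc \
  --set controllerImage=cedana/cedana-controller-test:fix-gpu-runc-restore \
  --set imagePullPolicy=Always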

4.5. Additionally:

# on k3s, after the helper pod and cedana are installed,
# restart your k3s instance to make the cedana runtime available:
systemctl restart k3s
# this reloads our runtime configs, which the helm chart we just installed
# updated (ideally this would happen automatically, but that requires changes
# to how we run background services so it can happen without unwanted
# disruptions or issues)

# also ensure all pods on k3s restart properly; restart them again if they are
# stuck in an Unknown state
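
For example, with kubectl:

# list all pods and their states after the restart
kubectl get pods -A
# force-restart a pod stuck in an Unknown state (fill in the real name and namespace)
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force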
  5. Set up the demo scripts:
# ssh into the node
ssh root@<node-ip>
# clone the cedana-demo-scripts
git clone https://github.com/cedana/cedana-demo-scripts.git
cd cedana-demo-scripts

Usage of the demo scripts

  1. Select a workload type to checkpoint and restore.
# Currently, we support two workload types:
# - Default: a CUDA throughput test that stresses the GPU and measures its throughput
# - Complex: a heavier workload that mixes CPU and GPU work

# To select a workload type, export COMPLEX=1 for the complex workload or
# COMPLEX="" for the default one

export COMPLEX="" # default (also chosen when COMPLEX is unset)
export COMPLEX=1  # complex
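
Illustratively, the selection boils down to an empty/non-empty check on COMPLEX, along these lines (a sketch, not the scripts' actual code):

if [ -n "${COMPLEX}" ]; then
  echo "running the complex (mixed CPU + GPU) workload"
else
  echo "running the default CUDA throughput workload"
fi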
  2. Set up the checkpoint; this starts the checkpoint pod and container:
# setup the checkpoint
./cedana checkpointsetup
# or
./cedana cs
  3. Perform the checkpoint; the script creates the checkpoint, then deletes the checkpointed container and pod to save space:
# perform the checkpoint
./cedana checkpoint
# or
./cedana c
  4. Perform the restore; the script restores the checkpoint directly into a new pod:
# perform the restoration
./cedana restore
# or
./cedana r
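
To watch the restored pod come up (plain kubectl, not part of the scripts):

kubectl get pods -w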
  5. To get the logs of the checkpoint or restore (we forward both to the same fd):
./cedana cl
# or
./cedana rl
  6. Lastly, to get the logs of the cedana daemon:
./cedana dlog