Level 2 - Automation
Automating checkpoint/migrate/restore!
Now that you've completed a simple manual checkpoint/restore, from one node to another, we can make things interesting via automations.
Cedana provides two basic automations to start with, applied as labels in your workload's YAML:
cedana.ai/node-restore: "true"
Configures your workload to be automatically re-provisioned on a new node by Kubernetes, restoring from the last checkpoint taken of the workload.
Currently, our controller listens for three types of triggers: node deletion, eviction (drain), and cordon.
cedana.ai/heartbeat-checkpoint: "60"
Configures the frequency of automatic background checkpoints (in seconds).
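Because these are ordinary Kubernetes labels, you can also attach them to an already-running pod without editing its manifest. A minimal sketch, assuming a pod named my-workload (hypothetical) and a running Cedana controller in the cluster:

```shell
# Enable automatic restore onto a new node for an existing pod
# (the pod name "my-workload" is a placeholder)
kubectl label pod my-workload cedana.ai/node-restore=true

# Take a background checkpoint every 60 seconds
kubectl label pod my-workload cedana.ai/heartbeat-checkpoint=60
```

Labels set this way are equivalent to declaring them in the manifest, as shown in the sample below.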
The best way to try out the automation is to test it for yourself!
Taking it for a spin
Let's use one of our Kubernetes samples for this walkthrough:
apiVersion: v1
kind: Pod
metadata:
  generateName: cuda-vector-add-
  namespace: default
  labels:
    cedana.ai/node-restore: "true"
    cedana.ai/heartbeat-checkpoint: "120"
spec:
  runtimeClassName: cedana # required for GPU C/R support (use nvidia for native)
  containers:
    - name: cuda-vector-add
      image: cedana/cedana-samples:cuda
      args:
        - -c
        - gpu_smr/vector_add
      resources:
        limits:
          nvidia.com/gpu: 1
Notice the labels added as described above! If you let the heartbeat checkpoints run in the background (and validate that they exist via the UI), the easiest way to test a restore is to cordon, drain, or delete the node.
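Since the sample manifest uses generateName rather than a fixed metadata.name, deploy it with kubectl create (kubectl apply requires a fixed name). The filename below is a placeholder for wherever you saved the manifest:

```shell
# generateName requires `create`; `apply` needs a fixed metadata.name
kubectl create -f cuda-vector-add.yaml

# Watch the pod come up (its name will carry a random suffix)
kubectl get pods -l cedana.ai/node-restore=true -w
```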
To cordon a node via kubectl:
kubectl cordon <my-node>
where <my-node> is the node the workload is currently running on. Cedana works with Kubernetes to schedule a restore on the next available node that matches the original workload's node requirements.
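You can trigger the other two restore paths the same way, by draining or deleting the node (again substituting your own node name for <my-node>):

```shell
# Evict workloads from the node (triggers restore via eviction/drain)
kubectl drain <my-node> --ignore-daemonsets --delete-emptydir-data

# Or remove the node from the cluster entirely (triggers restore via node deletion)
kubectl delete node <my-node>
```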
Deleting the node via the EC2 console (as shown in the video below) takes some time to propagate, so try deleting from k9s or kubectl to see an instant restore!
See the video below for a demonstration of automatic restores on node termination on EKS from the EC2 console.