Level 2 - Automation
Automating checkpoint/migrate/restore!
Now that you've completed a simple manual checkpoint/restore, from one node to another, we can make things interesting via automations.
Cedana provides two basic automations to start with, applied as labels in your workload's YAML:
cedana.ai/node-restore: "true"
Configures your workload to be automatically re-provisioned on a new node by Kubernetes, restoring from the last checkpoint taken of the workload.
Currently, our controller listens for three types of triggers: node deletion, eviction (drain), and cordon.
cedana.ai/heartbeat-checkpoint: "60"
Configures the frequency of automatic background checkpoints (in seconds).
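Because these are ordinary Kubernetes labels, you can also attach them to an already-running pod without editing its manifest. A minimal sketch, assuming a pod named my-workload (hypothetical) and a running Cedana controller in the cluster:

```shell
# Enable automatic restore onto a new node for an existing pod
# (the pod name "my-workload" is a placeholder)
kubectl label pod my-workload cedana.ai/node-restore=true

# Take a background checkpoint every 60 seconds
kubectl label pod my-workload cedana.ai/heartbeat-checkpoint=60
```

Labels set this way are equivalent to declaring them in the manifest, as shown in the sample below.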
The best way to try out the automation is to test it for yourself!
Taking it for a spin
Let's use one of our Kubernetes samples for this walkthrough:
apiVersion: v1
kind: Pod
metadata:
  generateName: cuda-vector-add-
  namespace: default
  labels:
    cedana.ai/node-restore: "true"
    cedana.ai/heartbeat-checkpoint: "120"
spec:
  runtimeClassName: cedana # required for GPU C/R support (use nvidia for native)
  containers:
    - name: cuda-vector-add
      image: cedana/cedana-samples:cuda
      args:
        - -c
        - gpu_smr/vector_add
      resources:
        limits:
          nvidia.com/gpu: 1
Notice the labels added as described above! If you let the heartbeat checkpoints run in the background (and validate that they exist via the UI), the easiest way to test a restore is to cordon, drain, or delete the node.
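Since the sample manifest uses generateName rather than a fixed metadata.name, deploy it with kubectl create (kubectl apply requires a fixed name). The filename below is a placeholder for wherever you saved the manifest:

```shell
# generateName requires `create`; `apply` needs a fixed metadata.name
kubectl create -f cuda-vector-add.yaml

# Watch the pod come up (its name will carry a random suffix)
kubectl get pods -l cedana.ai/node-restore=true -w
```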
To cordon a node via kubectl:
kubectl cordon <my-node>
where <my-node> is the node the workload is currently running on. Cedana works with Kubernetes to schedule a restore on the next available node that matches the original workload's node requirements.
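You can trigger the other two restore paths the same way, by draining or deleting the node (again substituting your own node name for <my-node>):

```shell
# Evict workloads from the node (triggers restore via eviction/drain)
kubectl drain <my-node> --ignore-daemonsets --delete-emptydir-data

# Or remove the node from the cluster entirely (triggers restore via node deletion)
kubectl delete node <my-node>
```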
Deleting the node via the EC2 console (as shown in the video below) takes some time to propagate, so try deleting from k9s or kubectl to see an instant restore!
See the video below for a demonstration of automatic restores on node termination on EKS from the EC2 console.