Level 2 - Automation

Automating checkpoint/migrate/restore!

Now that you've completed a simple manual checkpoint/restore from one node to another, we can make things more interesting with automations.

Cedana gives you two basic automations to start with, applied as labels in your workload's YAML:

  • cedana.ai/node-restore: "true"

    • Configures your workload to be automatically re-provisioned on a new node by Kubernetes, using the last checkpoint taken for the workload.

    • Currently, our controller listens for three types of triggers: node deletion, eviction (drain), and cordon.

  • cedana.ai/heartbeat-checkpoint: "60"

    • Configures the interval (in seconds) between automatic, background checkpoints. Label values must be quoted strings.
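Both labels can also be attached to a workload that is already running, without editing its manifest. A minimal sketch using kubectl (the pod name `my-workload` is a placeholder; note that label values must be strings, so the heartbeat interval is passed as "60"):

```shell
# Add both Cedana automation labels to an existing pod.
# "my-workload" is a hypothetical pod name - substitute your own.
kubectl label pod my-workload \
  cedana.ai/node-restore="true" \
  cedana.ai/heartbeat-checkpoint="60"
```

Re-running the command with a different value requires the `--overwrite` flag, since kubectl refuses to change an existing label by default.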

The best way to try out the automation is to test it for yourself!

Taking it for a spin

Let's use one of our Kubernetes samples for this walkthrough:

apiVersion: v1
kind: Pod
metadata:
  generateName: cuda-vector-add-
  namespace: default
  labels:
    cedana.ai/node-restore: "true"
    cedana.ai/heartbeat-checkpoint: "120"
spec:
  runtimeClassName: cedana # required for GPU C/R support (use nvidia for native)
  containers:
    - name: cuda-vector-add
      image: cedana/cedana-samples:cuda
      args:
        - -c
        - gpu_smr/vector_add
      resources:
        limits:
          nvidia.com/gpu: 1
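To submit this sample, save the manifest to a file and create it. One detail worth noting: because the manifest uses generateName rather than a fixed name, `kubectl create` must be used (`kubectl apply` requires metadata.name). The filename below is a placeholder:

```shell
# Submit the sample manifest (generateName requires create, not apply).
kubectl create -f cuda-vector-add.yaml

# Confirm the pod is running, and note which node it was scheduled on.
kubectl get pods -l cedana.ai/node-restore=true -o wide
```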

Notice the labels added, as described above. Allow a few heartbeat checkpoints to occur in the background (and validate that they exist via the UI); then the easiest way to test the automation is to cordon, drain, or delete the node.

To cordon a node via kubectl:

kubectl cordon <my-node>

where <my-node> is the node the workload is currently running on. Cedana will work with Kubernetes to schedule a restore on the next available node that matches the node requirements of the original workload.
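Alternatively, you can exercise the eviction trigger by draining the node, then watch the workload come back on a different node. A sketch (the drain flags shown are the standard ones needed on most clusters; adjust for yours):

```shell
# Drain the node to evict the workload and trigger a restore elsewhere.
kubectl drain <my-node> --ignore-daemonsets --delete-emptydir-data

# Watch the pod get re-provisioned on another node from its last checkpoint.
kubectl get pods -l cedana.ai/node-restore=true -o wide --watch
```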

See the video below for a demonstration showing automatic restores on node termination on EKS from the EC2 console.
