Quickstart

Begin checkpointing, migrating, and restoring stateful workloads in Kubernetes in under 5 minutes!

For CPU workloads, no additional configuration is required. With our controller and helpers running on your nodes, you can start by deploying this sample stateful reinforcement-learning job, which runs Stable-Baselines3 (https://github.com/DLR-RM/stable-baselines3).

Deployment

# test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cedana-sample-ppo-sb3
  labels:
    app: cedana-sample-ppo-sb3
    cedana.ai/node-restore: "true"
    cedana.ai/heartbeat-checkpoint: "120"
spec:
  restartPolicy: Never
  containers:
  - name: cedana-sample-container
    image: "cedana/cedana-samples:latest"
    command: ["python3", "/app/cpu_smr/rl/ppo_sb3.py"]
    resources:
      requests:
        cpu: "1"  
      limits:
        cpu: "1"  
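Assuming kubectl is pointed at a cluster where the Cedana controller and helpers are installed, a minimal deployment flow might look like the following sketch (the manifest above is assumed to be saved as test-pod.yaml):

```shell
# Deploy the sample workload (manifest from above, saved as test-pod.yaml).
kubectl apply -f test-pod.yaml

# Wait for the Pod to come up, then follow its logs.
kubectl wait --for=condition=Ready pod/cedana-sample-ppo-sb3 --timeout=120s
kubectl logs -f cedana-sample-ppo-sb3
```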

This looks like a normal Pod, with the exception of two additional configurable labels that our controller can act on:

  1. cedana.ai/node-restore: "true", which ensures that this Pod always resumes from the last taken checkpoint. This gets triggered on:

    1. Node drain (via kubectl drain)

    2. Node cordoning (via kubectl cordon)

    3. Node deletion. This can be triggered in a few ways, including deletion through a dashboard (e.g. the EC2 console in the case of EKS, or deleting a node in k9s). How long the deletion takes to propagate, however, depends on the control plane.

  2. cedana.ai/heartbeat-checkpoint: "120", which enables a heartbeat checkpoint policy for the workload. Note that the value must be quoted, since Kubernetes label values are strings.

Note that for the behavior above, it might make more sense to use a different kind such as Jobs, where we have a tighter and more natural integration. See Managing Kubernetes Jobs for more information.

The above path also depends on checkpoints being taken, which you can do either manually via our UI or by setting policies.

Checkpoint

If a heartbeat policy isn't active for your workload, you can checkpoint manually via our UI. See Using the Cedana Platform for more information. If you'd like to do so programmatically, see API.

Restore

If you have the cedana.ai/node-restore label set to "true" for the workload, a restore can be triggered by draining, cordoning, or deleting the node. Alternatively, you can browse the checkpoints and restore manually via the Cedana Platform or the API.
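For instance, any of the following standard node operations should trigger a restore for a labeled Pod (the node name below is a placeholder for the node currently hosting the workload):

```shell
# Any of these node operations triggers a checkpoint-based restore for
# Pods labeled cedana.ai/node-restore: "true".
kubectl drain <node-name> --ignore-daemonsets
# or:
kubectl cordon <node-name>
# or:
kubectl delete node <node-name>
```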

If the Pod belongs to another CRD, you're best off using one of the methods described in Managing Kubernetes Jobs.

If you've made it this far, congratulations! You've successfully used Cedana to move a stateful workload between nodes and have it pick up work where it left off. Take a look at the panel on the left to see more examples (like GPU Save, Migrate & Resume (SMR) on Kubernetes)!
