C/R Quickstart

Start checkpointing, migrating, and restoring stateful workloads in Kubernetes in under 5 minutes!

For CPU workloads, no additional configuration is required. With our controller and helpers running on your nodes, you can start by deploying this sample stateful reinforcement-learning job (running Stable Baselines 3: https://github.com/DLR-RM/stable-baselines3).

Deployment

# test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cedana-sample-ppo-sb3
  labels:
    app: cedana-sample-ppo-sb3
    cedana.ai/node-restore: "true"
    cedana.ai/heartbeat-checkpoint: "120"
spec:
  restartPolicy: Never
  containers:
  - name: cedana-sample-container
    image: "cedana/cedana-samples:latest"
    command: ["python3", "/app/cpu_smr/rl/ppo_sb3.py"]
    resources:
      requests:
        cpu: "1"  
      limits:
        cpu: "1"  
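With the manifest above saved as `test-pod.yaml`, deploying it follows the usual `kubectl` workflow. A minimal sketch (assuming the Cedana controller and helpers are already installed on your nodes; the pod name comes from the manifest above):

```shell
# Deploy the sample PPO job
kubectl apply -f test-pod.yaml

# Wait for the pod to be scheduled and start training
kubectl wait --for=condition=Ready pod/cedana-sample-ppo-sb3 --timeout=120s

# Confirm the Cedana labels the controller acts on are present
kubectl get pod cedana-sample-ppo-sb3 --show-labels
```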

This looks like a normal Pod, with the exception of two additional configurable labels that our controller can act on:

  1. cedana.ai/node-restore: "true", which ensures that this Pod always resumes from the last taken checkpoint. A restore is triggered on:

    1. Node drain (via kubectl drain)

    2. Node cordoning (via kubectl cordon)

    3. Node deletion. This can be triggered in a few ways, including deleting the node through a dashboard (e.g. the EC2 console in the case of EKS, or deleting a node in k9s). How long the deletion takes to propagate depends on the control plane, however.

  2. cedana.ai/heartbeat-checkpoint: n, which takes a heartbeat checkpoint in the background every n seconds. Use your best judgement when setting this duration, considering the size of the workload.

Checkpoint

If heartbeat checkpointing isn't turned on, you can manually checkpoint via our UI. See Using the Cedana Platform for more information. If you'd like to do so programmatically via an API, see API.

Restore

If you have the cedana.ai/node-restore label set to "true" for the workload, a restore can be triggered by draining, cordoning, or deleting the node. Alternatively, you can access the checkpoints and manually restore them individually via the Cedana Platform or API.
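The node-drain path can be sketched as follows (the node name below is a placeholder; substitute the node your pod is scheduled on):

```shell
# Find the node the workload is currently running on
kubectl get pod cedana-sample-ppo-sb3 -o wide

# Drain that node; with the node-restore label set, the controller
# checkpoints the pod and resumes it from the latest checkpoint elsewhere
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Verify the pod is running again on a different node
kubectl get pod cedana-sample-ppo-sb3 -o wide
```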

Example

Below is an example of this workflow:

If you've made it this far, congratulations! You've successfully used Cedana to move a stateful workload between nodes and have it pick up work where it left off. Take a look at the panel on the left to see more examples (like GPU Save, Migrate & Resume (SMR) on Kubernetes)!
