Quickstart
Start checkpointing, migrating, and restoring stateful workloads in Kubernetes in under 5 minutes!
For CPU workloads, no additional configuration is required. With our controller and helpers running on your nodes, you can start by deploying this sample stateful reinforcement learning job, which runs Stable Baselines 3 (https://github.com/DLR-RM/stable-baselines3).
Deployment
```yaml
# test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cedana-sample-ppo-sb3
  labels:
    app: cedana-sample-ppo-sb3
    cedana.ai/node-restore: "true"
    cedana.ai/heartbeat-checkpoint: "120"  # label values must be strings
spec:
  restartPolicy: Never
  containers:
    - name: cedana-sample-container
      image: "cedana/cedana-samples:latest"
      command: ["python3", "/app/cpu_smr/rl/ppo_sb3.py"]
      resources:
        requests:
          cpu: "1"
        limits:
          cpu: "1"
```
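Assuming the manifest is saved as `test-pod.yaml` (the filename here is just an example), deploying it is a standard `kubectl apply`:

```shell
# Create the pod; the Cedana controller acts on the labels in its metadata
kubectl apply -f test-pod.yaml

# Confirm the pod is running and note which node it was scheduled on
kubectl get pod cedana-sample-ppo-sb3 -o wide
```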
This looks like a normal Pod with the exception of two additional configurable labels that our controller can act on:
- `cedana.ai/node-restore: "true"`, which ensures that this Pod will always resume from the last taken checkpoint. A restore gets triggered on:
  - Node drain (via `kubectl drain`)
  - Node cordoning (via `kubectl cordon`)
  - Node deletion. This can be triggered in a few ways, including deleting the node through a dashboard (e.g. the EC2 dashboard in the case of EKS) or deleting a node in k9s. How long the deletion takes to propagate depends on the control plane, however.
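As a sketch of the first trigger, once a checkpoint exists you can drain the node the pod landed on and watch the workload move. (The pod name matches the sample manifest above; everything else is standard `kubectl`.)

```shell
# Find the node currently hosting the sample pod
NODE=$(kubectl get pod cedana-sample-ppo-sb3 -o jsonpath='{.spec.nodeName}')

# Drain it. --force is needed because this is a bare Pod with no owning
# workload controller; with cedana.ai/node-restore: "true", the Cedana
# controller restores the pod on another node from the last checkpoint.
kubectl drain "$NODE" --ignore-daemonsets --force

# Watch the pod come back up on a different node
kubectl get pod cedana-sample-ppo-sb3 -o wide -w
```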
The above path also depends on checkpoints being taken, which you can either do manually via our UI or by setting policies.
Checkpoint
If a heartbeat policy isn't active for your workload, you can manually checkpoint via our UI. See Using the Cedana Platform for more information. If you'd like to programmatically do so via an API, see API.
Restore
If you have the cedana.ai/node-restore label set to true for the workload, a restore can be triggered by draining, cordoning or deleting the node. Alternatively, you can access the checkpoints and manually restore individually via the Cedana Platform or API.
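For example, cordoning can trigger the same path without fully evicting everything else on the node; per the label behavior described above, the controller treats the cordon as a migration trigger. (`<node-name>` is a placeholder for your node.)

```shell
# Mark the node unschedulable; the Cedana controller migrates the
# labeled pod off it, restoring from the last checkpoint
kubectl cordon <node-name>

# Once the pod is running elsewhere, return the node to service
kubectl uncordon <node-name>
```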
If the pod is managed by another CRD, you're best off using one of the methods described in Managing Kubernetes Jobs.
Example
Below you can find an example of this workflow:
If you've made it this far, congratulations! You've successfully used Cedana to move a stateful workload between nodes and have it pick up work where it left off. Take a look at the panel on the left to see more examples (like GPU Save, Migrate & Resume (SMR) on Kubernetes)!