C/R Quickstart
Start checkpointing, migrating, and restoring stateful workloads in Kubernetes in under 5 minutes!
For CPU workloads, no additional configuration is required. With our controller and helpers running on your nodes, you can start by deploying this sample stateful reinforcement learning job, which runs Stable Baselines 3 (https://github.com/DLR-RM/stable-baselines3).
Deployment
# test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cedana-sample-ppo-sb3
  labels:
    app: cedana-sample-ppo-sb3
    cedana.ai/node-restore: "true"
    # Label values must be strings, so the interval is quoted.
    cedana.ai/heartbeat-checkpoint: "120"
spec:
  restartPolicy: Never
  containers:
    - name: cedana-sample-container
      image: "cedana/cedana-samples:latest"
      command: ["python3", "/app/cpu_smr/rl/ppo_sb3.py"]
      resources:
        requests:
          cpu: "1"
        limits:
          cpu: "1"
This looks like a normal Pod, except for two additional configurable labels that our controller acts on:
- cedana.ai/node-restore: "true" ensures that this Pod always resumes from the most recent checkpoint. A restore is triggered on:
  - Node drain (via kubectl drain)
  - Node cordoning (via kubectl cordon)
  - Node deletion. This can be triggered in several ways, including through a dashboard (e.g. the EC2 dashboard in the case of EKS, or deleting a node in k9s). How long the deletion takes to propagate depends on the control plane.
- cedana.ai/heartbeat-checkpoint: n takes a heartbeat checkpoint in the background every n seconds. Use your best judgment when setting this interval, considering the size of the workload.
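With the manifest above saved as test-pod.yaml, deployment can be sketched as a small shell helper. This is a minimal sketch, not an official Cedana script: the function name deploy_sample and the 300-second timeout are assumptions chosen for illustration, and it presumes kubectl already points at a cluster where the controller and helpers are running.

```shell
# Hypothetical helper: deploy the sample Pod and show where it landed.
deploy_sample() {
  manifest="${1:-test-pod.yaml}"
  # Create (or update) the Pod from the manifest.
  kubectl apply -f "$manifest" || return 1
  # Block until the Pod reports Ready; the timeout is an arbitrary choice.
  kubectl wait --for=condition=Ready pod/cedana-sample-ppo-sb3 --timeout=300s || return 1
  # Print the node assignment and the labels the API server accepted.
  kubectl get pod cedana-sample-ppo-sb3 -o wide --show-labels
}
```

Running deploy_sample with no arguments uses test-pod.yaml from the current directory. The final command should list both cedana.ai labels alongside the node name, which is useful to note before triggering a restore.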
Checkpoint
If heartbeat checkpointing isn't enabled, you can checkpoint manually via our UI. See Using the Cedana Platform for more information. To checkpoint programmatically, see API.
Restore
If the cedana.ai/node-restore label is set to "true" for the workload, a restore can be triggered by draining, cordoning, or deleting the node. Alternatively, you can access the checkpoints and restore them individually via the Cedana Platform or API.
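As an illustration of the cordon-based trigger described above, the following sketch looks up the node hosting a Pod and cordons it. The helper names pod_node and trigger_restore are hypothetical; only standard kubectl subcommands are used, and the restore itself is assumed to be carried out by the controller once the node is cordoned.

```shell
# Hypothetical helper: print the name of the node currently hosting a Pod.
pod_node() {
  kubectl get pod "$1" -o jsonpath='{.spec.nodeName}'
}

# Hypothetical helper: cordon the node hosting the Pod. Per the text above,
# this triggers a restore for Pods labeled cedana.ai/node-restore: "true".
trigger_restore() {
  node="$(pod_node "$1")"
  # Stop early if the Pod has not been scheduled onto a node.
  [ -n "$node" ] || { echo "pod $1 is not scheduled on any node" >&2; return 1; }
  kubectl cordon "$node"
}
```

For the sample workload, trigger_restore cedana-sample-ppo-sb3 cordons its node. Afterwards, kubectl get pods -o wide --watch shows where the Pod resumes, and kubectl uncordon with the node name makes the original node schedulable again.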
Example
Below you can find an example of this workflow:
If you've made it this far, congratulations! You've successfully used Cedana to move a stateful workload between nodes and have it pick up work where it left off. Take a look at the panel on the left to see more examples (like GPU Save, Migrate & Resume (SMR) on Kubernetes)!