
Manual Checkpoint/Restore

Start checkpointing, migrating, and restoring stateful workloads in Kubernetes in under 5 minutes!

For CPU workloads, no additional configuration is required. With Cedana running on your cluster, you can start by deploying the sample below, a stateful reinforcement learning job that trains PPO with Stable Baselines3.

Deploy

# test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cedana-sample-ppo-sb3
  labels:
    app: cedana-sample-ppo-sb3
spec:
  restartPolicy: Never
  containers:
    - name: cedana-sample-container
      image: "cedana/cedana-samples:latest"
      command: ["python3", "/app/cpu_smr/rl/ppo_sb3.py"]
      resources:
        requests:
          cpu: "1"
        limits:
          cpu: "1"

Note that for automation, such as resuming after a failure, a Kubernetes Job is usually a better fit than a bare Pod. See checkpoint/restoring Jobs for more information.

Deploy this pod to your cluster with kubectl, for example by saving the manifest above as test-pod.yaml and applying it.
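The deploy step can be sketched as follows, assuming the manifest above is saved as test-pod.yaml, kubectl is configured for the cluster where Cedana is running, and the current namespace is acceptable. These are assumptions, not details stated on this page, and the 300-second timeout is an arbitrary choice.

```shell
# Assumption: the manifest above is saved as test-pod.yaml in the current directory.
kubectl apply -f test-pod.yaml

# Block until the pod reports Ready (the timeout value is an arbitrary choice).
kubectl wait --for=condition=Ready pod/cedana-sample-ppo-sb3 --timeout=300s

# Show the pod's status and the node it was scheduled on.
kubectl get pod cedana-sample-ppo-sb3 -o wide
```

Noting the node name here makes it easier to tell later whether a restore landed on a different node.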

Checkpoint

You can either create a heartbeat policy to automatically checkpoint at regular intervals, or you can manually checkpoint this pod on the Pods Page.

Restore

You can manually restore the workload on the Checkpoints Page.
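To see where a restored pod landed, plain kubectl is enough. The sketch below assumes the restored pod appears in the current namespace. Its name is not stated on this page, so the RESTORED_POD value is a placeholder to fill in from the listing.

```shell
# List pods with their node assignments; compare the NODE column with the
# node the original pod ran on.
kubectl get pods -o wide

# Placeholder: set this to the restored pod's name from the listing above.
RESTORED_POD="replace-with-restored-pod-name"

# Follow whatever the restored workload writes to its log.
kubectl logs -f "$RESTORED_POD"
```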

Automated restores are currently best performed within the context of the Kubernetes lifecycle, where we integrate cleanly. For example, if you're using a workload kind (e.g. a Kubernetes Deployment) that automatically reschedules the pod on node failure or eviction, the rescheduled pod restores from the latest checkpoint instead of starting from scratch. See checkpoint/restoring Jobs for more information.
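As an illustration of a kind that recreates its pod, the sample could be wrapped in a Kubernetes Job. The sketch below is plain Kubernetes and only writes a manifest file. The Job name, the file name test-job.yaml, and the backoffLimit value are hypothetical, and any Cedana-specific settings a Job might need are not covered on this page (see checkpoint/restoring Jobs).

```shell
# Write a hypothetical Job manifest that wraps the same sample container.
# Names and backoffLimit are illustrative assumptions, not Cedana requirements.
cat > test-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: cedana-sample-ppo-sb3-job
spec:
  backoffLimit: 4          # how many times Kubernetes retries a failed pod
  template:
    metadata:
      labels:
        app: cedana-sample-ppo-sb3
    spec:
      restartPolicy: Never # the Job controller creates a new pod on failure
      containers:
        - name: cedana-sample-container
          image: "cedana/cedana-samples:latest"
          command: ["python3", "/app/cpu_smr/rl/ppo_sb3.py"]
          resources:
            requests:
              cpu: "1"
            limits:
              cpu: "1"
EOF
```

Once the file looks right, it can be applied with kubectl apply -f test-job.yaml.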

Example

Below you can find an example of this workflow: [embedded example not included]

If you've made it this far, congratulations! You've successfully used Cedana to move a stateful workload between nodes and have it pick up work where it left off.

Take a look at the left sidebar to see more examples, such as GPU Save/Migrate/Resume on Kubernetes.
