Checkpoint/Restore

Start checkpointing, migrating, and restoring stateful workloads in Kubernetes in under 5 minutes!

For CPU workloads, no additional configuration is required. With Cedana running on your cluster, you can start by deploying this sample stateful reinforcement learning job (running Stable Baselines 3).

Deploy

# test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cedana-sample-ppo-sb3
  labels:
    app: cedana-sample-ppo-sb3
spec:
  restartPolicy: Never
  containers:
    - name: cedana-sample-container
      image: "cedana/cedana-samples:latest"
      command: ["python3", "/app/cpu_smr/rl/ppo_sb3.py"]
      resources:
        requests:
          cpu: "1"
        limits:
          cpu: "1"

Note that for any sort of automation (such as resume on failure), it might make more sense to use Kubernetes Jobs. See checkpoint/restoring Jobs for more information.

Deploy this pod to your cluster using:
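A minimal sketch of the deploy step, assuming the manifest above is saved as `test-pod.yaml` (the name in its leading comment) and that `kubectl` already points at the cluster where Cedana is running. The `MANIFEST` and `POD_NAME` variables and the optional readiness wait are illustrative additions, not part of the Cedana documentation:

```shell
MANIFEST="test-pod.yaml"           # assumed file name, taken from the comment in the manifest
POD_NAME="cedana-sample-ppo-sb3"   # metadata.name from the manifest

# Create the pod from the manifest.
kubectl apply -f "$MANIFEST" || echo "apply failed: check kubectl and cluster access" >&2

# Optional: block until the pod reports Ready (up to 5 minutes).
kubectl wait --for=condition=Ready "pod/$POD_NAME" --timeout=300s \
  || echo "pod is not Ready yet" >&2
```

Once the pod is running, `kubectl logs -f cedana-sample-ppo-sb3` follows the training output, which gives you a reference point to compare against after a restore.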

Checkpoint

You can either create a heartbeat policy to checkpoint automatically at regular intervals, or manually checkpoint this pod on the Pods Page.

Restore

You can manually restore the workload on the Checkpoints Page.
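To check where the workload is running, one option is to list the pod together with its node before the checkpoint and again after the restore. This is a rough sketch using standard `kubectl` commands only. It assumes the restored pod keeps the `app=cedana-sample-ppo-sb3` label from the sample manifest, which may not hold in every setup, and the `APP_LABEL` variable is illustrative:

```shell
APP_LABEL="app=cedana-sample-ppo-sb3"   # label from the sample manifest (assumed to survive a restore)

# -o wide adds a NODE column; compare it before the checkpoint and after the restore.
kubectl get pods -l "$APP_LABEL" -o wide \
  || echo "kubectl could not list pods: check cluster access" >&2

# Show the most recent output to compare with what the job printed before the checkpoint.
kubectl logs -l "$APP_LABEL" --tail=20 \
  || echo "kubectl could not fetch logs" >&2
```

If the label is not preserved, substituting the restored pod's name in `kubectl get pod <name> -o wide` gives the same information.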

Automated restores currently work best within the Kubernetes lifecycle, where we integrate cleanly. For example, if you're using a resource kind (e.g. a Kubernetes Deployment) that automatically reschedules the pod on node failure or eviction, the rescheduled pod restores from the latest checkpoint instead of starting from scratch. See checkpoint/restoring Jobs for more information.

Example

Below you can find an example of this workflow:

If you've made it this far, congratulations! You've successfully used Cedana to move a stateful workload between nodes and have it pick up work where it left off.

Take a look at the left sidebar to see more examples, such as GPU Save/Migrate/Resume on Kubernetes.
