Quickstart

Start checkpointing, migrating, and restoring stateful workloads in Kubernetes in under 5 minutes!

For CPU workloads, no additional configuration is required. With our controller and helpers running on your nodes, you can start by deploying this sample stateful Reinforcement Learning job, which runs Stable Baselines 3 (https://github.com/DLR-RM/stable-baselines3).

Deployment

# test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cedana-sample-ppo-sb3
  labels:
    app: cedana-sample-ppo-sb3
spec:
  restartPolicy: Never
  containers:
  - name: cedana-sample-container
    image: "cedana/cedana-samples:latest"
    command: ["python3", "/app/cpu_smr/rl/ppo_sb3.py"]
    resources:
      requests:
        cpu: "1"
      limits:
        cpu: "1"

This looks just like a normal Pod! There is no additional configuration required. From here, you can take checkpoints via our UI or by setting policies.

Note that for automation (where the workload is resumed on failure), it may make more sense to use a different kind, such as a Job, where we have a tighter and more natural integration. See Managing Kubernetes Jobs for more information.
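As a rough illustration of what moving from a Pod to a Job could look like, the manifest below wraps the same sample container in a standard Kubernetes Job. This is only a sketch: the filename, the Job name, and the backoffLimit value are hypothetical choices made for this example, and it assumes that, as with the Pod above, no Cedana-specific fields are needed. Managing Kubernetes Jobs remains the authoritative description of the integration.

```yaml
# test-job.yaml (hypothetical filename, chosen for this sketch)
apiVersion: batch/v1
kind: Job
metadata:
  name: cedana-sample-ppo-sb3-job   # hypothetical name
spec:
  # Illustrative value: how many times Kubernetes retries a failed pod
  # before marking the Job as failed.
  backoffLimit: 4
  template:
    metadata:
      labels:
        app: cedana-sample-ppo-sb3
    spec:
      # With Never, the Job controller creates a replacement pod on failure
      # rather than restarting the container in place.
      restartPolicy: Never
      containers:
      - name: cedana-sample-container
        image: "cedana/cedana-samples:latest"
        command: ["python3", "/app/cpu_smr/rl/ppo_sb3.py"]
        resources:
          requests:
            cpu: "1"
          limits:
            cpu: "1"
```

The pod template is identical to the Pod spec above; only the Job wrapper and its retry setting are new.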

Checkpoint

If a heartbeat policy isn't active for your workload, you can manually checkpoint via our UI.

Restore

You can manually restore the workload from the Checkpoints page on the Cedana platform.

Automated restores are currently best performed within the context of the Kubernetes lifecycle, where we integrate cleanly. For example, if you're using a kind that automatically reschedules the pod on node failure or eviction, the rescheduled pod picks up from the latest checkpoint instead of starting from scratch. More details on this method can be found in Managing Kubernetes Jobs.

Example

Below you can find an example of this workflow:

If you've made it this far, congratulations! You've successfully used Cedana to move a stateful workload between nodes and have it pick up work where it left off. Take a look at the panel on the left to see more examples (like GPU Save, Migrate & Resume (SMR) on Kubernetes)!
