Cedana Docs

GPU Save, Migrate & Resume (SMR) on Kubernetes


Last updated 23 days ago


SMR of GPU workloads in Kubernetes is still experimental! Some of the manual steps involved in setting up nodes are planned to be automated and smoothed out soon.

Unsupported Configurations

  • Ubuntu 24.04: While technically supported, we cannot guarantee functionality. Ubuntu 24.04 ships a newer libelf that must link against glibc 2.38+, and because we use the system libelf inside containers, this imposes a minimum glibc requirement on container images; many container images may not run correctly as a result.

  • CUDA versions outside 12.0–12.8: We officially support CUDA versions 12.0 through 12.8; anything outside that range is unsupported.

  • glibc versions <2.34: We do not support glibc versions lower than 2.34. While we plan to transition to static binaries for some components, glibc 2.34 will remain the minimum for the short term. Additionally, our Kubernetes systems currently use CRIU 4.0, which also requires at least glibc 2.34.
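Given the glibc floor above, it can help to check each node before installing. A minimal sketch of such a check, assuming a glibc-based distro where `ldd` is available:

```shell
# Print the node's glibc version; Cedana requires >= 2.34.
glibc_version=$(ldd --version | head -n1 | grep -oE '[0-9]+\.[0-9]+' | tail -n1)
echo "glibc: $glibc_version"

# Compare against the 2.34 minimum using version sort.
if [ "$(printf '%s\n' 2.34 "$glibc_version" | sort -V | head -n1)" = "2.34" ]; then
  echo "glibc OK"
else
  echo "glibc too old for Cedana"
fi
```

Run this on each node (or in a privileged debug pod) rather than on your workstation, since it is the node's glibc that matters.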

Prerequisites

Before installing Cedana using the Helm chart, ensure that the following are installed:

  • NVIDIA base drivers and CUDA drivers

Install the Helm chart onto the cluster

git clone https://github.com/cedana/cedana-helm-charts.git
cd cedana-helm-charts
# copy the chart's values.yaml to cedana-values.yaml and set the auth token and Cedana URL for your cluster
helm install cedana ./cedana-helm --create-namespace -n cedana-namespace -f cedana-values.yaml

Follow the instructions in the Cedana Cluster Installation guide, and ensure Cedana is set up before proceeding further.

Verify that the Cedana helper pod logs indicate a valid CUDA version and display the message "GPU Enabled".

Running a Container with CUDA

Once everything is set up, you can run a container with CUDA support. Make sure to set the CEDANA_GPU environment variable in the container spec:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-pod
  namespace: default
spec:
  runtimeClassName: cedana
  containers:
    - name: cuda-container-test
      # note: it will also work fine if you have older cuda versions
      image: cedana/cedana-example:inference
      tty: true
      env:
        - name: CEDANA_GPU
          value: "1"  # Value is irrelevant, but it must be set to enable GPU support

Notes:

  • Using tty: true is recommended to avoid I/O buffering issues.

  • The cedana runtimeClass is installed by the Helm chart, which also updates the node's internal containerd config to use the new shim for this runtime.
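For reference, the runtimeClass wiring is standard Kubernetes. A minimal sketch of what such a RuntimeClass object looks like (the handler name `cedana` must match the runtime the chart registers in containerd; treat this as an illustration, not the chart's exact manifest):

```yaml
# Illustrative only: the Helm chart creates an equivalent object for you.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: cedana   # referenced by pods via spec.runtimeClassName
# "handler" must match the runtime name registered in the node's containerd config.
handler: cedana
```

Pods that omit `runtimeClassName: cedana` keep running under the default runtime and are untouched by Cedana.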

Performing a Save

You have two options for performing a save.
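One of these is the helper daemon's HTTP API. As a sketch, the payload below mirrors the shape of the documented restore call; the `/checkpoint` path in the commented-out request is an assumption symmetric to `/restore`, so confirm it against your deployment before relying on it. All names here are the example pod's; substitute your own.

```shell
# Hypothetical names for illustration; substitute your own pod/container.
ROOT="/run/containerd/runc/k8s.io"
CONTAINER=cuda-container-test
SANDBOX=cuda-pod
NAMESPACE=default
CHECKPOINT_PATH=/tmp/ckpt-test   # where the checkpoint lands on the node's filesystem

# Build the request payload; it uses the same shape as the restore call.
PAYLOAD=$(cat <<EOF
{
  "checkpoint_data": {
    "container_name": "$CONTAINER",
    "sandbox_name": "$SANDBOX",
    "namespace": "$NAMESPACE",
    "checkpoint_path": "$CHECKPOINT_PATH",
    "root": "$ROOT"
  }
}
EOF
)
echo "$PAYLOAD"

# POST it to the helper daemon; the /checkpoint path is an assumption
# mirroring the documented /restore endpoint:
# curl -X POST -H "Content-Type: application/json" -d "$PAYLOAD" http://localhost:1324/checkpoint
```

After a successful save, the checkpoint should appear under `CHECKPOINT_PATH` on the node that ran the pod, which is the path you later pass to the restore call.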

Performing a Resume

Resumes require first creating a new container from the same image, but with its root PID in a sleep, so that we can replace it. We plan to improve this workflow soon, but until then it remains a requirement.

apiVersion: v1
kind: Pod
metadata:
  name: cuda-pod-restore
  namespace: default
spec:
  runtimeClassName: cedana
  containers:
    - name: cuda-container-test
      # note: it will also work fine if you have older cuda versions
      image: swarnimcedana/cuda-vectoradd:cuda-12.4.1
      args: ["sh", "-c", "sleep infinity"]
      tty: true
      env:
        - name: CEDANA_GPU
          value: "1"  # Value is irrelevant, but it must be set to enable GPU support

After the restore pod is set up and running, you can attempt the restore, which should resume from a previously taken checkpoint:

# ensure pod is created and setup
$ kubectl create -f cuda-pod-restore.yaml

$ export ROOT="/run/containerd/runc/k8s.io"

# setup variables from above information
$ export RESTORE_CONTAINER=cuda-container-test
$ export RESTORE_SANDBOX=cuda-pod-restore
$ export NAMESPACE=default

# path to store checkpoint on node's local filesystem
$ export CHECKPOINT_PATH=/tmp/ckpt-test

# now you can try running the restore
curl -X POST -H "Content-Type: application/json" -d '{
  "checkpoint_data": {
    "container_name": "'$RESTORE_CONTAINER'",
    "sandbox_name": "'$RESTORE_SANDBOX'",
    "namespace": "'$NAMESPACE'",
    "checkpoint_path": "'$CHECKPOINT_PATH'",
    "root": "'$ROOT'"
  }
}' http://localhost:1324/restore

With these steps completed, you should be able to leverage Cedana GPU support within your Kubernetes containers.
