GPU Save, Migrate & Resume (SMR) on Kubernetes

SMR of GPU workloads in Kubernetes is still experimental!

Unsupported Configurations

  • Ubuntu 24.04: While technically supported, we cannot guarantee functionality due to its newer libelf version, which requires linking against libc 2.38 or newer. Because we use the system libelf inside containers, this enforces the same minimum libc version requirement on container images, so most containers may not operate correctly on this system.

  • CUDA Versions > 12.8: We officially support CUDA versions 12.0 through 12.8.

  • glibc Versions < 2.34: We do not support glibc versions lower than 2.34. While we plan to transition to static binaries for some components, a minimum of glibc 2.34 will still be required in the short term. Additionally, our Kubernetes systems currently use CRIU 4.0, which also requires at least glibc 2.34.
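
If you are unsure whether a node meets this requirement, one way to check its glibc version is with an ephemeral debug pod. This is a sketch; the node name and debug image are placeholders:

# Prints the node's glibc version; Cedana requires glibc 2.34 or newer
kubectl debug node/<node-name> -it --image=ubuntu:22.04 -- chroot /host ldd --version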

Prerequisites

Before installing Cedana using the Helm chart, ensure that the following are installed:

  • NVIDIA base drivers and CUDA drivers

    • The recommended way to set up a Kubernetes cluster running Ubuntu 22.04+ nodes is to use the NVIDIA k8s operator.

  • At least 10 GiB of shared memory in /dev/shm. You can auto-configure your nodes for this using our Helm chart, either by enabling it under Additional Configuration or by passing --set shmConfig.enabled=true during the Helm install (see the sketch after this list).

  • Follow the instructions in Kubernetes Setup, and ensure you have Cedana set up before proceeding any further.
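
For reference, enabling the shared-memory configuration at install time looks roughly like the following. This is a sketch: the release name, chart reference, and namespace are placeholders and should match whatever you used in Kubernetes Setup; only the --set flag is specific to the shared-memory option.

# Placeholder release name, chart reference, and namespace
helm install cedana <cedana-helm-chart> \
  --namespace cedana-system --create-namespace \
  --set shmConfig.enabled=true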

Running a Container with CUDA

Once everything is set up, you can run a GPU container with the runtimeClassName set to cedana! This nets you performance gains (see Performance of Cedana's GPU Interception) and the ability to checkpoint/restore GPU containers.
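
Since the cedana runtime class is created by the Helm chart, you can confirm it is present before deploying anything:

# Should list the cedana RuntimeClass if the Helm install succeeded
kubectl get runtimeclass cedana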

To give this a try, use any sample from our https://github.com/cedana/cedana-samples repo, or use the example below.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-vector-add
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cuda-vector-add
  template:
    metadata:
      labels:
        app: cuda-vector-add
    spec:
      runtimeClassName: cedana # required for GPU C/R support
      containers:
      - name: cuda-vector-add
        image: cedana/cedana-samples:latest
        args:
          - -c
          - gpu_smr/vector_add
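
A typical way to deploy and verify the example looks like this; the filename is a placeholder for wherever you saved the manifest above:

# Save the manifest above as cuda-vector-add.yaml (placeholder name), then apply it
kubectl apply -f cuda-vector-add.yaml

# Confirm the pod is running and inspect its output
kubectl get pods -l app=cuda-vector-add -o wide
kubectl logs deployment/cuda-vector-add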

Notes

  • The runtimeClass cedana is installed using the helm chart.

  • For automated checkpoint/restore and heartbeat checkpoints, check out C/R Quickstart.

Checkpoint/Restore

Works in the exact same way as described in C/R Quickstart! You can choose to have policy-based restores or manual restores via the Cedana Platform or API.

Check out this video to see a GPU inference workload moved between nodes:

The migration in the video is cordon-based, but you can do the same with draining or deleting the original node.
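
As a rough sketch of that cordon-based flow (assuming automated checkpoint/restore is already configured as described in C/R Quickstart; the node name is a placeholder):

# Cordon the node currently running the workload so nothing new schedules there
kubectl cordon <source-node>

# Delete the pod; the Deployment recreates it on another schedulable node,
# where Cedana restores it from the latest checkpoint per your C/R policy
kubectl delete pod -l app=cuda-vector-add

# Watch the replacement pod come up on a different node
kubectl get pods -l app=cuda-vector-add -o wide -w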
