GPU Save, Migrate & Resume (SMR) on Kubernetes
Unsupported Configurations
Ubuntu 24.04: While technically supported, we cannot guarantee functionality due to its newer libelf version, which requires linking against glibc 2.38+. Because we use the system libelf inside containers, this enforces a minimum glibc version on container images, and most containers may not operate correctly on this system as a result.
CUDA Versions > 12.8: We officially support CUDA versions 12.0 through 12.8.
glibc Versions < 2.34: We do not support glibc versions lower than 2.34. While we plan to transition to static binaries for some components, a minimum of glibc 2.34 will still be required for the short term. Additionally, our Kubernetes systems currently use CRIU 4.0, which also mandates at least glibc 2.34.
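To check what a node is running against these constraints, standard commands are enough (note that nvidia-smi reports the highest CUDA version the installed driver supports):

ldd --version | head -n1   # glibc version
nvidia-smi                 # NVIDIA driver and supported CUDA version
cat /etc/os-release        # OS release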
Prerequisites
Before installing Cedana using the Helm chart, ensure that the following are in place:
NVIDIA base drivers and CUDA drivers. The recommended way to set up a Kubernetes cluster running Ubuntu 22.04+ nodes is to use the NVIDIA GPU Operator, found here.
At least 10 GiB of shared memory in /dev/shm. You can auto-configure your nodes for this using our Helm chart, by enabling it in Additional Configuration or by passing --set shmConfig.enabled=true during the helm install (see the sketch after this list).
Follow the instructions in Kubernetes Setup, and ensure you have Cedana set up before proceeding any further.
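As a minimal sketch of the install step: the release name, chart reference, and namespace below are placeholders (use the values from Kubernetes Setup); only the shmConfig flag comes from this page.

helm install cedana <cedana-chart-reference> \
  --namespace <cedana-namespace> \
  --set shmConfig.enabled=true   # auto-configures nodes with at least 10 GiB in /dev/shm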
Running a Container with CUDA
Once everything is set up, you can run a GPU container with the runtimeClassName set to cedana! This nets you performance gains (see Performance of Cedana's GPU Interception) and the ability to checkpoint/restore GPU containers.
To give this a try, use any sample in our https://github.com/cedana/cedana-samples repo, or use the example below.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-vector-add
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cuda-vector-add
  template:
    metadata:
      labels:
        app: cuda-vector-add
    spec:
      runtimeClassName: cedana # required for GPU C/R support
      containers:
        - name: cuda-vector-add
          image: cedana/cedana-samples:latest
          args:
            - -c
            - gpu_smr/vector_add
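To run it, save the manifest and apply it with kubectl; the filename below is just a placeholder:

kubectl apply -f cuda-vector-add.yaml
kubectl get pods -l app=cuda-vector-add   # wait for the pod to be Running
kubectl logs deployment/cuda-vector-add   # output of the vector_add sample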
Notes
The runtimeClass cedana is installed by the Helm chart.
For automated checkpoint/restore and heartbeat checkpoints, check out C/R Quickstart.
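You can confirm the runtime class is present on your cluster with standard kubectl:

kubectl get runtimeclass cedana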
Checkpoint/Restore
Works in the exact same way as described in C/R Quickstart! You can choose to have policy-based restores or manual restores via the Cedana Platform or API.
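Since a restore may land the pod on a different node, one way to observe a migration from the command line (using only standard kubectl, with the sample deployment above) is to watch the pod's node assignment:

kubectl get pods -l app=cuda-vector-add -o wide -w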
Check out this video to see a GPU inference workload moved between nodes.