GPU Save, Migrate & Resume (SMR) on Kubernetes
Ubuntu 24.04: While technically supported, we cannot guarantee functionality because it ships a newer libelf version that requires linking against libc 2.38+. Because we use the system libelf inside the containers, this enforces a minimum libc version on container images, so most containers may not operate correctly on this system.
CUDA Versions >12.4: We officially support CUDA versions 12.0 through 12.8.
glibc Versions <2.34: We do not support glibc versions lower than 2.34. While we plan to transition to static binaries for some components, glibc 2.34 or newer will still be required in the short term. Additionally, our Kubernetes systems currently use CRIU 4.0, which also requires at least glibc 2.34.
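As a quick way to check the glibc requirement above on a node or inside a container image, the following is a minimal sketch using Python's standard `platform.libc_ver()`. The helper names `parse_version` and `glibc_supported` are illustrative and not part of Cedana; the 2.34 minimum is taken from the note above.

```python
import platform
import re

MIN_GLIBC = (2, 34)  # minimum glibc version stated in the compatibility notes


def parse_version(text):
    """Parse the leading 'major.minor' of a version string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def glibc_supported(version_text, minimum=MIN_GLIBC):
    """Return True if version_text is at least the minimum (2.34 by default)."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    # libc_ver() inspects the running Python binary; it may return empty
    # strings on non-glibc systems (e.g. musl) or statically linked builds.
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        print("could not detect glibc on this system")
    else:
        status = "ok" if glibc_supported(version) else "too old"
        print(f"glibc {version}: {status}")
```

Running this inside a candidate container image gives an early signal before attempting a checkpoint; it only reports the glibc the Python interpreter is linked against.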
Before installing Cedana using the Helm chart, ensure that the following are installed:
NVIDIA base drivers and CUDA drivers
Follow the instructions in , and ensure you have Cedana set up before proceeding further.
Verify that the Cedana helper pod logs indicate a valid CUDA version and display the message "GPU Enabled".
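The log check above can be scripted. The sketch below assumes `kubectl` is on the PATH and that you pass the helper pod's namespace and name yourself; both are placeholders here, since the actual values depend on your installation. Only the "GPU Enabled" message comes from this guide.

```python
import subprocess
import sys


def gpu_enabled(log_text):
    """Return True if any log line contains the 'GPU Enabled' message."""
    return any("GPU Enabled" in line for line in log_text.splitlines())


def fetch_logs(namespace, pod):
    """Fetch pod logs with `kubectl logs`; namespace and pod are caller-supplied."""
    result = subprocess.run(
        ["kubectl", "logs", "-n", namespace, pod],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_gpu.py <namespace> <helper-pod-name>
    logs = fetch_logs(sys.argv[1], sys.argv[2])
    print("GPU Enabled found" if gpu_enabled(logs) else "GPU Enabled NOT found")
```

This only confirms the message is present; the CUDA version reported in the logs should still be checked by eye against the supported range.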
Once everything is set up, you can run a container with CUDA support. Make sure to set the CEDANA_GPU environment variable in the container spec.
Using tty: true is recommended to avoid I/O buffering issues.
The runtimeClass cedana is installed by the Helm chart, and the internal containerd config is updated to use the new shim for that runtime.
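The three settings described above (the CEDANA_GPU environment variable, tty: true, and the cedana runtime class) can be combined in one pod manifest. The following is a minimal sketch that builds such a manifest as JSON, which `kubectl apply -f -` accepts. The pod name and image are placeholders, and the value "1" for CEDANA_GPU is an assumption: this guide names the variable but does not state its expected value.

```python
import json


def gpu_pod_manifest(name, image, cedana_gpu_value="1"):
    """Build a Pod manifest using the cedana runtime class.

    cedana_gpu_value defaults to "1", which is an assumed value.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Runtime class installed by the Helm chart
            "runtimeClassName": "cedana",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Recommended to avoid I/O buffering issues
                    "tty": True,
                    "env": [{"name": "CEDANA_GPU", "value": cedana_gpu_value}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; substitute your own CUDA workload.
    manifest = gpu_pod_manifest("cuda-demo", "example.com/cuda-app:latest")
    print(json.dumps(manifest, indent=2))
```

The output can be piped straight to the cluster, for example `python make_pod.py | kubectl apply -f -`. GPU resource requests and any other scheduling fields your cluster needs are left out of this sketch.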
You have two options to perform
Resuming currently requires first creating a new container from the same image, with its root process running sleep so that Cedana can replace it with the restored process. We plan to improve this workflow soon, but until then this step remains a requirement.
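One way to produce such a placeholder pod is to derive it from the original pod's manifest and override only the container command. The sketch below does this on a manifest held as a Python dict. The function name `restore_placeholder` and the use of `sleep infinity` are assumptions (the image must provide a `sleep` that accepts `infinity`); this guide does not specify which other fields must match the original pod.

```python
import copy
import json


def restore_placeholder(original, new_name, sleep_cmd=("sleep", "infinity")):
    """Copy a Pod manifest, rename it, and make every container's root process sleep."""
    pod = copy.deepcopy(original)  # leave the caller's manifest untouched
    pod["metadata"]["name"] = new_name
    for container in pod["spec"]["containers"]:
        container["command"] = list(sleep_cmd)  # same image, root process sleeps
        container.pop("args", None)  # drop arguments meant for the original entrypoint
    return pod


if __name__ == "__main__":
    # Placeholder manifest standing in for the pod that was checkpointed.
    original = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "cuda-demo"},
        "spec": {
            "runtimeClassName": "cedana",
            "containers": [
                {"name": "cuda-demo", "image": "example.com/cuda-app:latest", "tty": True}
            ],
        },
    }
    print(json.dumps(restore_placeholder(original, "cuda-demo-restore"), indent=2))
```

Deriving the placeholder from the original manifest keeps the image, runtime class, and environment identical by construction, which seems the safest default until the workflow changes.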
After the restore pod is set up and running, you can attempt the restore, which should resume from a previously taken checkpoint:
With these steps completed, you should be able to leverage Cedana GPU support within your Kubernetes containers.