Experimental GPU SMR on Kubernetes
SMR of GPU workloads in Kubernetes is still experimental! Some of the jank involved in setting up nodes is planned to be automated/smoothed out soon.
Unsupported Configurations
Ubuntu 24.04: While technically supported, we cannot guarantee functionality due to its use of a newer `libelf` version, which requires linking against glibc 2.38+. Most containers may not operate correctly on this system as a result, since we use the system `libelf` inside the containers, which enforces a minimum libc version requirement on container images.

CUDA Versions: We officially support CUDA up to version 12.4. Newer versions may work, but we have not conducted thorough testing. Due to the nature of our APIs, it can be challenging to determine whether issues arise from version mismatches or other factors.
glibc Versions: We do not support glibc versions lower than 2.35. While we plan to transition to static binaries for some components, a minimum of glibc 2.31 will still be required. Additionally, our Kubernetes systems currently use CRIU 4.0, which also mandates at least glibc 2.35.
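To check a node's glibc version before installing:

```shell
# Print the node's glibc version (ldd ships with glibc, so its version matches)
ldd --version | head -n1
```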
Prerequisites
GPU Support: Ensure that the CUDA toolkit and NVIDIA drivers are properly installed. Verify that the NVIDIA libraries are in your `PATH` and registered with `ldconfig`.

Note: Most setups should work if installed via system packages. Consider rebooting or sourcing `$HOME/.profile`.
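A quick way to sanity-check the above (`libcuda` is the library name usually shipped with the NVIDIA driver; adjust if your install differs):

```shell
# Check that the driver utilities are on PATH
command -v nvidia-smi >/dev/null && echo "nvidia-smi found in PATH" || echo "nvidia-smi missing from PATH"

# Check that the dynamic linker knows about the NVIDIA libraries
ldconfig -p | grep -qi libcuda && echo "libcuda registered with ldconfig" || echo "libcuda not in ldconfig cache"
```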
Container Runtime: Ensure your setup uses `runc`. We currently do not support `crun` or other runtimes.

Installing Cedana: You can now install Cedana via the Helm chart (see instructions) on the node you just set up. The Cedana DaemonSet will handle the installation of necessary packages and enable GPU support if `nvidia-smi` is found in the `PATH`. The DaemonSet will also download and install any additional packages and libraries required.
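One way to confirm which runtime containerd is configured with (this assumes the default config location):

```shell
# Show runtime-related settings from containerd's config, if present
grep -n 'default_runtime_name\|runtime_type' /etc/containerd/config.toml 2>/dev/null \
  || echo "no containerd config found at /etc/containerd/config.toml"
```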
Download the Shim File: Retrieve the shim file using the command below:
Make sure to set the correct permissions:
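For example, assuming the shim was saved to `/usr/local/bin` (the exact binary name below is an assumption; match it to the file you downloaded):

```shell
# Assumed install path and binary name; adjust to match the downloaded shim
SHIM=/usr/local/bin/containerd-shim-runc-v2-cedana
[ -f "$SHIM" ] && chmod 755 "$SHIM" || echo "shim not found at $SHIM"
```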
You have two options for integrating the shim:
Add a Separate Runtime: Configure it in `containerd`'s `config.toml` (usually found at `/etc/containerd/config.toml`). Check the `journalctl` logs for `containerd` (the config call arguments) to make sure you are modifying the file that is actually loaded.

Replace the Existing Binary: This is the simplest approach for quick setup, though it may disrupt some containers. We recommend this for initial experiments.
Making Containerd Config Changes
If adding a new runtime:
Name the new runtime after the binary you saved in `/usr/local/bin`.
Set this new runtime as the default and specify the runtime type as `io.containerd.runc.v2-cedana`.
Restart the `containerd` service.
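The steps above map onto a `config.toml` fragment along these lines (a sketch only: the runtime name `cedana` and the CRI plugin section names assume containerd's v2 CRI config layout; adjust them to your file):

```toml
# /etc/containerd/config.toml (fragment; section names assume the v2 CRI layout)
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "cedana"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.cedana]
  runtime_type = "io.containerd.runc.v2-cedana"
```

After editing, restart the service, e.g. `sudo systemctl restart containerd`.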
Usage Flow
You can now start new workloads, which will have `LD_PRELOAD` and the required mounts configured automatically. Due to the forced `LD_PRELOAD` and mounts, there may be compatibility issues with some containers. We recommend using this new runtime as the default only for experimental purposes.

For other workloads, use the `runtime:` label in the pod spec configuration.
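For instance, if the runtime is exposed through Kubernetes' standard RuntimeClass mechanism, selecting it per workload might look like this (the class name `cedana` and the image tag are assumptions):

```yaml
# Pod spec fragment (sketch; assumes a RuntimeClass named "cedana" exists)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  runtimeClassName: cedana
  containers:
    - name: app
      image: nvidia/cuda:12.4.0-base-ubuntu22.04
```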