Checkpoint/restore with GPUs
Checkpoint/restore with GPUs is currently only supported for NVIDIA GPUs.
Prerequisites
Create an account with Cedana to get access to the GPU plugin. See authentication.
Set the Cedana URL and authentication token in the configuration.
Install a GPU plugin.
Option 1: GPU Plugin
The GPU plugin is Cedana's proprietary plugin for high-performance GPU checkpoint/restore. It supports multi-process and multi-node workloads. If it is unavailable to you, see option 2.
sudo cedana plugin install gpu
Option 2: CRIU CUDA Plugin
The CRIU CUDA plugin (CRIUgpu) is developed by the CRIU community and uses the NVIDIA CUDA checkpoint utility under the hood.
sudo cedana plugin install criu/cuda
Ensure the daemon is running. See installation.
Run a health check to confirm the plugin is ready. See health checks.
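The two install options above can be combined into a single step. The following is a minimal sketch, not an official script: it tries the Cedana GPU plugin first and falls back to the CRIU CUDA plugin. It assumes that `cedana plugin install` exits with a non-zero status on failure. The `CEDANA` variable and the `install_gpu_plugin` function are hypothetical names introduced only for this example.

```shell
#!/bin/sh
# Command prefix used to invoke the CLI. The override variable is an
# assumption of this sketch, handy where sudo is not needed.
CEDANA="${CEDANA:-sudo cedana}"

# Try the Cedana GPU plugin first, then fall back to the CRIU CUDA plugin.
# Installer output goes to stderr so that stdout carries only the name of
# the plugin that was installed.
# ASSUMPTION: `plugin install` returns non-zero when installation fails.
install_gpu_plugin() {
  if $CEDANA plugin install gpu >&2; then
    echo "gpu"
  elif $CEDANA plugin install criu/cuda >&2; then
    echo "criu/cuda"
  else
    echo "no GPU plugin could be installed" >&2
    return 1
  fi
}
```

Calling `install_gpu_plugin` prints the name of the plugin that ended up installed, which can then be confirmed with a health check.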
For a performance comparison between the two plugins, see "Cedana vs. CRIUgpu for GPU Checkpoint/Restore".
Cedana GPU | 452 | 570 | ✅ | ✅ | amd64, arm64
CRIU CUDA | 570 | 570 | ✅ | ❌ | amd64
Usage (GPU plugin)
Single process
You may clone the cedana-samples repository for some example GPU workloads.
Run a process with GPU support:
Checkpoint:
Restore:
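As a rough illustration of the run, checkpoint, and restore sequence, the sketch below prints the three commands for a given job ID instead of executing them. The subcommands `run process`, `dump job`, and `restore job`, as well as the `--gpu-enabled` and `--jid` flags, are assumptions that this page does not confirm; check them against `cedana --help` before use. The `gpu_workflow` function is a hypothetical helper.

```shell
#!/bin/sh
# Print (do not run) the commands for a single-process GPU workflow.
# ASSUMPTION: the subcommand and flag names below are not confirmed by
# this page; verify them against the CLI help of your Cedana version.
gpu_workflow() {
  job_id="$1"
  shift
  # 1. Start the workload as a managed job with GPU support.
  printf '%s\n' "cedana run process --gpu-enabled --jid $job_id -- $*"
  # 2. Checkpoint the running job.
  printf '%s\n' "cedana dump job $job_id"
  # 3. Restore the job from its latest checkpoint.
  printf '%s\n' "cedana restore job $job_id"
}
```

For example, `gpu_workflow my-job ./vector_add` prints the three commands in order so they can be reviewed before being run by hand. Here `./vector_add` stands in for any GPU binary, such as one built from cedana-samples.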
Multi-process/node
For multi-process/node workloads, specify the --gpu-freeze-type option during dump. If the workload is multi-process or multi-node and uses NCCL, set the option to nccl.
You can then restore as usual. You may also set the default GPU freeze type in the configuration.
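The dump step described above might be scripted as follows. This is a sketch under stated assumptions: the `--gpu-freeze-type` option and the `nccl` value come from this page, while the `dump job` subcommand form and the `dump_command` helper are assumptions made for illustration. The helper only builds the command line, so it can be inspected before anything runs.

```shell
#!/bin/sh
# Build the dump command for a job, adding --gpu-freeze-type only when a
# freeze type is given (for example "nccl" for NCCL workloads).
# ASSUMPTION: `cedana dump job <id>` is the dump subcommand form.
dump_command() {
  job_id="$1"
  freeze_type="${2:-}"
  if [ -n "$freeze_type" ]; then
    echo "cedana dump job $job_id --gpu-freeze-type $freeze_type"
  else
    # No freeze type given: rely on the default from the configuration.
    echo "cedana dump job $job_id"
  fi
}
```

Omitting the second argument leaves the flag out, which matches the case where the default GPU freeze type is already set in the configuration.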
Usage (CRIU CUDA plugin)
Single process
Checkpoint and restore exactly as you would for CPU workloads. See checkpoint/restore basics.
Multi-process/node
Multi-process/node workloads are currently not supported by the CRIU CUDA plugin. Use the Cedana GPU plugin for these workloads.