Checkpoint/restore with GPUs
Checkpoint/restore with GPUs is currently only supported for NVIDIA GPUs.
Prerequisites
Create an account with Cedana to get access to the GPU plugin. See authentication.
Set the Cedana URL & authentication token in the configuration.
Install a GPU plugin.
Option 1: GPU Plugin
The GPU plugin is Cedana's proprietary plugin for high-performance GPU checkpoint/restore. It supports multi-process and multi-node workloads. If it is unavailable to you, use Option 2.
sudo cedana plugin install gpu
Option 2: CRIU CUDA Plugin
The CRIU CUDA plugin (CRIUgpu) is developed by the CRIU community and uses the NVIDIA CUDA checkpoint utility under the hood.
sudo cedana plugin install criu/cuda
Ensure the daemon is running (see installation).
Run a health check to confirm the plugin is ready (see health checks).
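The fallback from Option 1 to Option 2 can be sketched as a small shell function. This is only an illustration: it assumes that `cedana plugin install` exits with a non-zero status when a plugin cannot be installed, and the `CEDANA` variable and the `install_gpu_plugin` name are hypothetical helpers, not part of the Cedana CLI.

```shell
#!/bin/sh
# CEDANA is a hypothetical override for how the CLI is invoked (useful for dry runs).
: "${CEDANA:=sudo cedana}"

# Try the Cedana GPU plugin first, then fall back to the CRIU CUDA plugin.
# Assumption: `plugin install` returns non-zero when the plugin is unavailable.
install_gpu_plugin() {
    if $CEDANA plugin install gpu; then
        echo "installed: gpu"
    elif $CEDANA plugin install criu/cuda; then
        echo "installed: criu/cuda"
    else
        echo "no GPU plugin could be installed" >&2
        return 1
    fi
}

# Usage (not executed here): install_gpu_plugin
```

After either plugin is installed, the remaining prerequisite steps (daemon running, health check) still apply.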
Check out Cedana vs. CRIU CUDA for GPU Checkpoint/Restore for a performance comparison between the two plugins.
Cedana GPU | 452 | 570 | ✅ | ✅ | amd64, arm64
CRIU CUDA | 570 | 570 | ✅ | ❌ | amd64
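If the numeric values in the comparison above are read as minimum NVIDIA driver versions (an assumption, since the column headers are not shown here), a quick pre-check of the installed driver might look like the sketch below. The `driver_at_least` helper is hypothetical, and the threshold of 570 is used purely as an example value.

```shell
#!/bin/sh
# Succeeds when the major component of a driver version string
# (for example 570.86.15) is at least the given minimum.
driver_at_least() {
    version="$1"
    minimum="$2"
    major="${version%%.*}"
    # Reject empty or non-numeric major components.
    case "$major" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$major" -ge "$minimum" ]
}

# Query the installed driver version when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)"
    if driver_at_least "$installed" 570; then
        echo "driver $installed meets the assumed minimum of 570"
    else
        echo "driver '$installed' does not meet the assumed minimum of 570"
    fi
else
    echo "nvidia-smi not found; skipping driver check"
fi
```

Only the major version is compared, which keeps the check simple but ignores minor releases.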
Usage (GPU plugin)
Single process
NOTE: Cedana GPU checkpoint/restore is only possible for managed processes/containers, i.e., those that are spawned using cedana run --gpu-enabled or managed using cedana manage --gpu-enabled (see managed process/container).
You may clone the cedana-samples repository for some example GPU workloads.
Run a process with GPU support:
cedana run process --attach --gpu-enabled --jid <job_id> -- cedana-samples/gpu_smr/vector_add
Checkpoint:
cedana dump job <job_id>
Restore:
cedana restore job --attach <job_id>
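The run, checkpoint, and restore steps above can be collected into a few shell helpers. This is a sketch rather than an official workflow: `DRY_RUN`, `run_cmd`, `gpu_run`, `gpu_dump`, `gpu_restore`, and the job ID `demo-job` are hypothetical names introduced for illustration. Only the `cedana` commands themselves come from the steps above. By default the script just prints the commands it would run.

```shell
#!/bin/sh
# DRY_RUN is a hypothetical switch for this sketch:
# 1 prints each command, anything else executes it.
DRY_RUN="${DRY_RUN:-1}"

run_cmd() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start a managed GPU job; the workload command follows the job ID.
gpu_run() {
    jid="$1"
    shift
    run_cmd cedana run process --attach --gpu-enabled --jid "$jid" -- "$@"
}

# Checkpoint a job by ID.
gpu_dump() {
    run_cmd cedana dump job "$1"
}

# Restore a job by ID and attach to it.
gpu_restore() {
    run_cmd cedana restore job --attach "$1"
}

gpu_run demo-job cedana-samples/gpu_smr/vector_add
gpu_dump demo-job
gpu_restore demo-job
```

Outside of dry-run mode, an attached job presumably stays in the foreground, so the dump would typically be issued from a second terminal while the job is running.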
Multi-process/node
For multi-process/node workloads, specify the --gpu-freeze-type option during dump. If the workload is multi-process or multi-node and uses NCCL, use the nccl option.
cedana dump job <job_id> --gpu-freeze-type nccl
You can then restore as usual. You may also set the default GPU freeze type in the configuration.
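As a small illustration of how the freeze type changes the dump command, the hypothetical helper below builds the command line for a job, adding --gpu-freeze-type only when a value is given. The function name `dump_cmd` and the job ID `train-job` are assumptions for this sketch. The only freeze type value taken from this page is nccl.

```shell
#!/bin/sh
# Print the dump command for a job; the second argument (optional)
# is the GPU freeze type, for example "nccl" for NCCL workloads.
dump_cmd() {
    jid="$1"
    freeze_type="${2:-}"
    if [ -n "$freeze_type" ]; then
        echo "cedana dump job $jid --gpu-freeze-type $freeze_type"
    else
        echo "cedana dump job $jid"
    fi
}

dump_cmd train-job nccl
dump_cmd train-job
```

If a default freeze type is set in the configuration, the shorter form without the flag may be sufficient.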
Usage (CRIU CUDA plugin)
Single process
You can checkpoint and restore exactly as you would for CPU workloads. See checkpoint/restore basics.
Multi-process/node
This is currently not supported. You should use the Cedana GPU plugin for multi-process/node workloads.
For all available CLI options, see CLI reference. You can also interact with the daemon directly through gRPC; see API reference.