Checkpoint/restore with GPUs
Checkpoint/restore with GPUs is currently only supported for NVIDIA GPUs.
Create an account with Cedana to get access to the GPU plugin. See authentication.
Set the Cedana URL & authentication token in the configuration.
Install a GPU plugin.
Option 1: GPU Plugin
The GPU plugin is Cedana's proprietary plugin for high-performance GPU checkpoint/restore, with support for multi-process/node workloads. If it is unavailable to you, see option 2.
Option 2: CRIU CUDA Plugin
The CRIU CUDA plugin (CRIUgpu) is developed by the CRIU community and uses the NVIDIA CUDA checkpoint utility under the hood.
Ensure the daemon is running (see installation).
Run a health check to ensure the plugin is ready (see health checks).
A performance comparison between the two plugins is also available.
Cedana GPU | 452 | 570 | ✅ | ✅ | amd64, arm64
CRIU CUDA | 570 | 570 | ✅ | ❌ | amd64
NOTE: Cedana GPU checkpoint/restore is only possible for managed processes/containers, i.e., those that are spawned using cedana run --gpu-enabled or managed using cedana manage --gpu-enabled (see managed process/container).
You may clone the cedana-samples repository for some example GPU workloads.
Run a process with GPU support:
Checkpoint:
Restore:
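The run, checkpoint, and restore steps above can be sketched as a small dry-run script. Only `cedana run --gpu-enabled` and the `dump` operation are named on this page. The `process` and `job` subcommands, the `--jid` flag, the job ID `gpu-demo`, and the workload path `./my_gpu_app` are assumptions for illustration, so check the CLI reference for the exact syntax before running anything for real.

```shell
#!/bin/sh
# Dry-run sketch of a GPU run -> checkpoint -> restore cycle.
# ASSUMPTIONS (not confirmed by this page): the `process` and `job`
# subcommands, the --jid flag, and the default values below.
JID="${JID:-gpu-demo}"               # hypothetical job ID
WORKLOAD="${WORKLOAD:-./my_gpu_app}" # hypothetical GPU workload

# Print the command by default; set DRY_RUN=0 to actually invoke cedana.
cedana_cmd() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "cedana $*"
    else
        cedana "$@"
    fi
}

# Spawn the workload as a managed, GPU-enabled job.
run_gpu_job() { cedana_cmd run process --gpu-enabled --jid "$JID" -- "$WORKLOAD"; }

# Checkpoint (dump) the managed job.
checkpoint_job() { cedana_cmd dump job "$JID"; }

# Restore the job from its checkpoint.
restore_job() { cedana_cmd restore job "$JID"; }
```

With the default `DRY_RUN=1`, calling `run_gpu_job`, `checkpoint_job`, and `restore_job` in order only prints the commands, which makes it easy to compare them against the CLI reference first.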
For multi-process/node workloads, you just need to specify the --gpu-freeze-type option during dump. If the workload is multi-process/multi-node and uses NCCL, use the nccl option.
You can then restore as usual. You may also set the default GPU freeze type in the configuration.
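As a rough illustration of passing the freeze type at dump time, the sketch below builds the dump command for a given job. The `--gpu-freeze-type` flag and the `nccl` value come from this page. The `dump job` subcommand form and the job ID `train-job` are assumptions, so treat this as a template and confirm the syntax in the CLI reference.

```shell
#!/bin/sh
# Dry-run sketch: dump a multi-process/node job with an explicit GPU freeze type.
# ASSUMPTIONS (not confirmed by this page): the `dump job <id>` form and
# the example job ID used in the usage note.

# Print the command by default; set DRY_RUN=0 to actually invoke cedana.
cedana_cmd() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "cedana $*"
    else
        cedana "$@"
    fi
}

# $1 = job ID, $2 = freeze type (for example, nccl for NCCL workloads)
dump_with_freeze_type() {
    cedana_cmd dump job "$1" --gpu-freeze-type "$2"
}
```

For example, `dump_with_freeze_type train-job nccl` prints the dump command for a hypothetical NCCL training job named `train-job`.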
With the CRIU CUDA plugin, you can checkpoint/restore just as you do for CPU workloads. See checkpoint/restore basics.
Multi-process/node workloads are currently not supported with the CRIU CUDA plugin. Use the Cedana GPU plugin for these workloads.
For all available CLI options, see CLI reference. You can also interact with the daemon directly through gRPC; see API reference.