Checkpoint/restore with GPUs
Checkpoint/restore with GPUs is currently only supported for NVIDIA GPUs.
Create an account with Cedana to get access to the GPU plugin. See .
Set the Cedana URL & authentication token in the .
Install a GPU plugin.
Option 1: GPU Plugin
The GPU plugin is Cedana's proprietary plugin for high-performance GPU checkpoint/restore. It supports multi-process and multi-node workloads. If it is unavailable to you, use option 2.
Option 2: CRIU CUDA Plugin
The CRIU CUDA plugin (CRIUgpu) is developed by the CRIU community and uses the under the hood.
Ensure the daemon is running, see .
Do a health check to ensure the plugin is ready, see .
Check out for a performance comparison between the two plugins.
Cedana GPU | 452 | 570 | ✅ | ✅ | amd64, arm64
CRIU CUDA | 570 | 570 | ✅ | ❌ | amd64
Run a process with GPU support:
Checkpoint:
Restore:
This is currently not supported. You should use the Cedana GPU plugin for multi-process/node workloads.
NOTE: Cedana GPU checkpoint/restore is only possible for managed processes/containers, i.e., those that are spawned using cedana run --gpu-enabled or managed using cedana manage --gpu-enabled (see ).
You may clone the for some example GPU workloads.
For multi-process/node workloads, specify the --gpu-type option during run. If the workload is multi-process/multi-node and uses NCCL, use the nccl option.
You can then checkpoint/restore as usual. You may also set the default GPU multi-process type in the .
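As a rough illustration of how the flags mentioned above fit together, the sketch below only assembles and prints a command line; it does not invoke Cedana. The helper name build_run_cmd is hypothetical, the placement of the workload arguments after the flags is an assumption, and passing nccl as the value of --gpu-type is inferred from the text rather than taken from the CLI reference. Check the CLI reference for the exact syntax before running anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of Cedana): prints a run command for a
# GPU-enabled workload instead of executing it.
#   $1      GPU multi-process type (e.g. nccl), or "" for none
#   $2...   placeholder for the subcommand/workload arguments (assumed position)
build_run_cmd() {
  gpu_type="$1"
  shift
  if [ -n "$gpu_type" ]; then
    # --gpu-enabled and --gpu-type are the flags named in this page
    echo "cedana run --gpu-enabled --gpu-type $gpu_type $*"
  else
    echo "cedana run --gpu-enabled $*"
  fi
}

# Example: a multi-process workload that uses the nccl option
build_run_cmd nccl "<workload>"
# prints: cedana run --gpu-enabled --gpu-type nccl <workload>
```

Printing the command first makes it easy to compare against the CLI reference before starting a managed process.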
You can checkpoint/restore normally as you do for CPU workloads. See .
For all available CLI options, see . You can also interact with the daemon directly through gRPC, see .