Checkpoint/restore with GPUs
Checkpoint/restore with GPUs is currently only supported for NVIDIA GPUs.
Create an account with Cedana to get access to the GPU plugin. See authentication.
Set the Cedana URL and authentication token in the configuration.
Install a GPU plugin.
Option 1: GPU Plugin
The GPU plugin is Cedana's proprietary plugin for high-performance GPU checkpoint/restore. If it is unavailable to you, use Option 2.
Minimum NVIDIA driver version: 452 (API 11.8)
Maximum NVIDIA driver version: 550 (API 12.4). Newer drivers are unstable and may not work.
Minimum CRIU version: 3.0
Option 2: CRIU CUDA Plugin
Minimum NVIDIA driver version: 570 (API 12.8)
Minimum CRIU version: 4.0
See the performance comparison between the two plugins.
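The driver ranges listed for the two options can be checked mechanically. The following is a minimal sketch, not part of the Cedana tooling: `compatible_plugins` is a hypothetical helper name, the labels it prints are illustrative, and it compares only the major driver version against the bounds stated above. It treats drivers between 551 and 569 as unsupported, since the page describes drivers newer than 550 as unstable for Option 1. It does not check the minimum CRIU version.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Cedana command): report which plugin option's
# documented driver range includes the given NVIDIA driver version.
compatible_plugins() {
  local major="${1%%.*}"            # "550.54.14" -> "550"
  case "$major" in
    ''|*[!0-9]*) echo "invalid"; return 1 ;;   # reject empty or non-numeric input
  esac
  local result="none"
  # Option 1 (GPU plugin): driver 452 through 550
  if [ "$major" -ge 452 ] && [ "$major" -le 550 ]; then
    result="gpu-plugin"
  fi
  # Option 2 (CRIU CUDA plugin): driver 570 or newer
  if [ "$major" -ge 570 ]; then
    result="criu-cuda-plugin"
  fi
  echo "$result"
}

# Example usage on a machine with an NVIDIA driver installed:
#   compatible_plugins "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
```

The helper only reflects the version bounds on this page; the health check remains the authoritative way to confirm that a plugin is ready.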
Ensure the daemon is running (see installation).
Run a health check to confirm the plugin is ready (see health checks).
NOTE: Cedana GPU checkpoint/restore is only possible for managed processes/containers, i.e., those that are spawned using cedana run --gpu-enabled (see managed process/container).
You may clone the cedana-samples repository for some example GPU workloads.
Run a process with GPU support:
Checkpoint:
Restore:
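The three steps above might look like the following sketch. Only `cedana run --gpu-enabled` is confirmed by this page; the `process`, `dump job`, and `restore job` subcommands and the `--jid` flag are assumptions about the CLI, and the job ID and workload path are placeholders. Check the CLI reference before relying on them. The `CEDANA` variable is a hypothetical convenience so the commands can be previewed with `CEDANA=echo` without a daemon.

```shell
#!/usr/bin/env bash
# Sketch of a managed GPU job lifecycle.
# ASSUMPTION: `run process`, `dump job`, `restore job`, and `--jid` are not
# confirmed by this page; only `cedana run --gpu-enabled` is.
CEDANA="${CEDANA:-cedana}"                  # set CEDANA=echo for a dry run
JID="${JID:-gpu-demo}"                      # placeholder job ID
WORKLOAD="${WORKLOAD:-./my_gpu_workload}"   # placeholder GPU binary

# Start the workload as a managed job with GPU support.
gpu_run()        { "$CEDANA" run process --gpu-enabled --jid "$JID" -- "$WORKLOAD"; }

# Checkpoint the running job by its job ID.
gpu_checkpoint() { "$CEDANA" dump job "$JID"; }

# Restore the job from its latest checkpoint.
gpu_restore()    { "$CEDANA" restore job "$JID"; }
```

Calling `gpu_run`, `gpu_checkpoint`, and `gpu_restore` in that order mirrors the run, checkpoint, restore sequence; a workload from the cedana-samples repository can be substituted for the placeholder path.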
You can checkpoint and restore GPU workloads the same way as CPU workloads. See checkpoint/restore basics.
For all available CLI options, see CLI reference. You can also interact with the daemon directly through gRPC; see API reference.