Checkpoint/restore with GPUs

Checkpoint/restore with GPUs is currently only supported for NVIDIA GPUs.

Prerequisites

Create an account with Cedana to get access to the GPU plugin. See authentication.
Set the Cedana URL & authentication token in the configuration.
Install a GPU plugin.

Option 1: GPU Plugin
The GPU plugin is Cedana's proprietary plugin for high-performance GPU checkpoint/restore. It supports multi-process and multi-node workloads. If it is unavailable to you, use option 2.
```
sudo cedana plugin install gpu
```
Option 2: CRIU CUDA Plugin
The CRIU CUDA plugin (CRIUgpu) is developed by the CRIU community and uses the NVIDIA CUDA checkpoint utility under the hood.
```
sudo cedana plugin install criu/cuda
```

Ensure the daemon is running (see installation).
Run a health check to confirm the plugin is ready (see health checks).

Check out Cedana vs. CRIU CUDA for GPU Checkpoint/Restore for a performance comparison between the two plugins.

| Plugin | Min driver | Max driver | Multi-GPU | Multi-process | Arch |
| --- | --- | --- | --- | --- | --- |
| Cedana GPU | 452 | 570 | ✅ | ✅ | amd64, arm64 |
| CRIU CUDA | 570 | | ✅ | ❌ | amd64 |
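To check a host against the Min driver and Max driver values listed above, a small helper along these lines may be useful. It is only a sketch: the function names are hypothetical, and comparing just the major version number is an assumption of this example, not something Cedana documents.

```shell
#!/usr/bin/env bash
# Sketch: test whether an NVIDIA driver version falls inside a supported range.
# Assumption: only the major version (the part before the first dot) matters.

driver_major() {
  # "570.86.15" -> "570"
  printf '%s\n' "${1%%.*}"
}

driver_in_range() {
  # usage: driver_in_range <driver_version> <min_major> <max_major>
  local major min="$2" max="$3"
  major="$(driver_major "$1")"
  # Return 2 if the major version is empty or not a plain number.
  case "$major" in
    ''|*[!0-9]*) return 2 ;;
  esac
  [ "$major" -ge "$min" ] && [ "$major" -le "$max" ]
}

# Example usage (requires nvidia-smi on the host):
# version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
# if driver_in_range "$version" 452 570; then echo "within range"; else echo "outside range"; fi
```

The bounds 452 and 570 in the usage comment are the Cedana GPU values from the table. Substitute the bounds for whichever plugin you installed.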

Usage (GPU plugin)

Single process

NOTE: Cedana GPU checkpoint/restore is only possible for managed processes/containers, i.e., those that are spawned using cedana run --gpu-enabled or managed using cedana manage --gpu-enabled (see managed process/container).

You may clone the cedana-samples repository for some example GPU workloads.
Run a process with GPU support:

```
cedana run process --attach --gpu-enabled --jid <job_id> -- cedana-samples/gpu_smr/vector_add
```

Checkpoint:

```
cedana dump job <job_id>
```

Restore:

```
cedana restore job --attach <job_id>
```
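The three steps above can be wrapped in small shell functions so that the same job ID is reused throughout. The following is a dry-run sketch: by default it only prints each command. `DRY_RUN`, `JOB_ID`, and the function names are hypothetical helpers for this example, not Cedana settings. Since the run and restore commands use --attach, you would likely issue the dump from a second terminal.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the single-process flow (run, dump, restore).
# DRY_RUN and JOB_ID are illustrative names, not part of Cedana.
JOB_ID="${JOB_ID:-vector-add-demo}"

run_step() {
  # Always print the command; execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

start_job() {
  run_step cedana run process --attach --gpu-enabled --jid "$1" -- cedana-samples/gpu_smr/vector_add
}

checkpoint_job() {
  run_step cedana dump job "$1"
}

restore_job() {
  run_step cedana restore job --attach "$1"
}

# Example usage (prints the commands; set DRY_RUN=0 to execute them):
# start_job "$JOB_ID"
# checkpoint_job "$JOB_ID"
# restore_job "$JOB_ID"
```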

Multi-process/node

For multi-process/node workloads, specify the --gpu-freeze-type option during dump. If the workload uses NCCL, set the option to nccl.

```
cedana dump job <job_id> --gpu-freeze-type nccl
```

You can then restore as usual. You may also set the default GPU freeze type in the configuration.
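If the same script handles both single-process and multi-process jobs, the dump command can be assembled conditionally. A minimal sketch follows, in which `dump_command` is a hypothetical helper that only prints the command line. `nccl` is the only freeze type named on this page, so no other values are assumed.

```shell
#!/usr/bin/env bash
# Sketch: build the dump command, adding --gpu-freeze-type only when requested.
# dump_command is an illustrative helper, not part of the Cedana CLI.

dump_command() {
  # usage: dump_command <job_id> [freeze_type]
  local job_id="$1" freeze_type="${2:-}"
  local cmd="cedana dump job $job_id"
  if [ -n "$freeze_type" ]; then
    cmd="$cmd --gpu-freeze-type $freeze_type"
  fi
  printf '%s\n' "$cmd"
}

# Example usage:
# dump_command my-job        # single-process job
# dump_command my-job nccl   # multi-process/node job using NCCL
```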

Usage (CRIU CUDA plugin)

Single process

You can checkpoint/restore normally as you do for CPU workloads. See checkpoint/restore basics.

Multi-process/node

This is currently not supported. You should use the Cedana GPU plugin for multi-process/node workloads.

For all available CLI options, see CLI reference. You can also interact with the daemon directly through gRPC; see API reference.


