Checkpoint/restore with GPUs

Prerequisites

  1. Create an account with Cedana to get access to the GPU plugin (see authentication).

  2. Set the Cedana URL and authentication token in the configuration.

  3. Install a GPU plugin.

  • Option 1: GPU Plugin

    The GPU plugin is Cedana's proprietary plugin for high-performance GPU checkpoint/restore, with support for multi-process/node workloads. If it is not available to you, use Option 2.

    sudo cedana plugin install gpu

  • Option 2: CRIU CUDA Plugin

    The CRIU CUDA plugin (CRIUgpu) is developed by the CRIU community and uses the NVIDIA CUDA checkpoint utility under the hood.

    sudo cedana plugin install criu/cuda

  4. Ensure the daemon is running (see installation).

  5. Run a health check to confirm the plugin is ready (see health checks).
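Step 3 can be scripted so that Option 2 is tried only when Option 1 fails. The following is a minimal sketch that uses only the two install commands shown above. It assumes that `cedana plugin install` exits with a non-zero status when a plugin is unavailable; the `DRY_RUN` variable and the `run` and `install_gpu_plugin` helpers are illustrative names, not part of Cedana. By default it only prints the command it would run.

```shell
#!/bin/sh
# Sketch: install the Cedana GPU plugin, falling back to the CRIU CUDA plugin.
# DRY_RUN, run and install_gpu_plugin are illustrative, not part of Cedana.

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Try Option 1 first; try Option 2 only if Option 1 fails.
# Assumption: a failed install returns a non-zero exit status.
install_gpu_plugin() {
    run sudo cedana plugin install gpu || run sudo cedana plugin install criu/cuda
}

install_gpu_plugin
```

Set `DRY_RUN=0` to execute the commands instead of printing them.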

Check out Cedana vs. CRIUgpu for GPU Checkpoint/Restore for a performance comparison between the two plugins.

Plugin       Min driver  Max driver  Multi-GPU  Multi-process  Arch
Cedana GPU   452         570                                   amd64, arm64
CRIU CUDA    570         570                                   amd64
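The driver ranges above can be checked against the installed NVIDIA driver before choosing a plugin. The sketch below assumes that the table values are driver major versions and that both bounds are inclusive; `supported_plugins` is an illustrative helper, not a Cedana command. It prints the plugin names as they are passed to `cedana plugin install`.

```shell
#!/bin/sh
# Sketch: list the plugins whose driver range includes a given NVIDIA driver
# major version. supported_plugins is an illustrative helper.
# Assumption: table values are major versions and the bounds are inclusive.

supported_plugins() {
    major=$1
    # Reject empty or non-numeric input.
    case $major in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Cedana GPU: min driver 452, max driver 570
    if [ "$major" -ge 452 ] && [ "$major" -le 570 ]; then
        echo "gpu"
    fi
    # CRIU CUDA: min driver 570, max driver 570
    if [ "$major" -ge 570 ] && [ "$major" -le 570 ]; then
        echo "criu/cuda"
    fi
}

# If nvidia-smi is available, read the driver version of the first GPU
# and keep only the part before the first dot.
if command -v nvidia-smi >/dev/null 2>&1; then
    version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    supported_plugins "${version%%.*}" || echo "could not determine driver version" >&2
fi
```

An empty result means that neither plugin lists the installed driver as supported.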

Usage (GPU plugin)

Single process

Cedana GPU checkpoint/restore is only possible for managed processes/containers, i.e., those that are spawned using cedana run --gpu-enabled or managed using cedana manage --gpu-enabled (see managed process/container).

  1. Optionally, clone the cedana-samples repository for example GPU workloads.

  2. Run a process with GPU support:

    cedana run process --attach --gpu-enabled --jid <job_id> -- cedana-samples/gpu_smr/vector_add

  3. Checkpoint:

    cedana dump job <job_id>

  4. Restore:

    cedana restore job --attach <job_id>
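The run, checkpoint, and restore commands above all share one job ID, so it can help to generate them from a single value. The sketch below only prints the commands for a given ID and workload; `gpu_cycle_commands` and the example ID `vector-add-1` are illustrative. Since `--attach` presumably keeps the process in the foreground, the checkpoint would typically be issued from a second terminal.

```shell
#!/bin/sh
# Sketch: print the run/dump/restore commands for one job ID.
# gpu_cycle_commands and the example job ID are illustrative.

gpu_cycle_commands() {
    job_id=$1
    workload=$2
    echo "cedana run process --attach --gpu-enabled --jid $job_id -- $workload"
    echo "cedana dump job $job_id"
    echo "cedana restore job --attach $job_id"
}

gpu_cycle_commands vector-add-1 cedana-samples/gpu_smr/vector_add
```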

Multi-process/node

For multi-process/node workloads, specify the --gpu-freeze-type option when dumping. If the workload is multi-process or multi-node and uses NCCL, set it to nccl.

cedana dump job <job_id> --gpu-freeze-type nccl

You can then restore as usual. You may also set the default GPU freeze type in the configuration.
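A small wrapper can add the freeze type only when one is needed, so the same helper serves single-process and NCCL workloads. This is a sketch that prints the dump command rather than running it; `dump_command` and the job ID `my-job` are illustrative, and `nccl` is the only freeze type value taken from the text above.

```shell
#!/bin/sh
# Sketch: build the dump command, adding --gpu-freeze-type only when given.
# dump_command and the job ID "my-job" are illustrative.

dump_command() {
    job_id=$1
    freeze_type=${2:-}
    if [ -n "$freeze_type" ]; then
        echo "cedana dump job $job_id --gpu-freeze-type $freeze_type"
    else
        echo "cedana dump job $job_id"
    fi
}

dump_command my-job nccl   # multi-process/node workload using NCCL
dump_command my-job        # single-process workload
```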

Usage (CRIU CUDA plugin)

Single process

Checkpoint/restore works the same way as for CPU workloads. See checkpoint/restore basics.

Multi-process/node

This is currently not supported. You should use the Cedana GPU plugin for multi-process/node workloads.

For all available CLI options, see CLI reference. You can also interact with the daemon directly through gRPC; see API reference.
