Checkpoint/restore with GPUs

Checkpoint/restore with GPUs is currently only supported for NVIDIA GPUs.

Prerequisites

  1. Create a Cedana account to get access to the GPU plugin. See authentication.

  2. Set the Cedana URL and authentication token in the configuration.

  3. Install a GPU plugin.

  • Option 1: GPU Plugin

    The GPU plugin is Cedana's proprietary plugin for high-performance GPU checkpoint/restore, with support for multi-process and multi-node workloads. If it is unavailable to you, use option 2.

    sudo cedana plugin install gpu

  • Option 2: CRIU CUDA Plugin

    The CRIU CUDA plugin (CRIUgpu) is developed by the CRIU community and uses the NVIDIA CUDA checkpoint utility under the hood.

    sudo cedana plugin install criu/cuda

  4. Ensure the daemon is running. See installation.

  5. Run a health check to ensure the plugin is ready. See health checks.
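
The two install options above can be tried in order from a script. The following is a minimal sketch, not an official installer: it assumes that `cedana plugin install` exits with a non-zero status when a plugin is unavailable, and the `CEDANA` variable and `install_gpu_plugin` function are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# CEDANA is a hypothetical override hook, e.g. CEDANA="echo cedana" for a dry run.
CEDANA=${CEDANA:-sudo cedana}

# Try the Cedana GPU plugin first, then fall back to the CRIU CUDA plugin.
# Assumption: a failed install returns a non-zero exit status.
install_gpu_plugin() {
    if $CEDANA plugin install gpu; then
        # Option 1 succeeded.
        echo "installed: gpu"
    elif $CEDANA plugin install criu/cuda; then
        # Option 1 failed, option 2 succeeded.
        echo "installed: criu/cuda"
    else
        echo "no GPU plugin could be installed" >&2
        return 1
    fi
}
```

Calling `install_gpu_plugin` after sourcing the script reports which plugin ended up installed, which tells you which of the two usage sections applies.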

Check out Cedana vs. CRIU CUDA for GPU Checkpoint/Restore for a performance comparison between the two plugins.

Plugin        Min driver   Max driver   Multi-GPU   Multi-process   Arch
Cedana GPU    452          570                      yes             amd64, arm64
CRIU CUDA     570          570                      no              amd64
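
To check a machine against the driver and architecture limits listed above, a small helper can compare the installed driver version with those bounds. This is a sketch under two assumptions: that the min and max values are inclusive bounds on the driver's major version, and that the architecture is given in the `amd64`/`arm64` naming used above. The function name `supported_plugins` is hypothetical.

```shell
#!/bin/sh
# supported_plugins DRIVER_VERSION ARCH
# Prints the plugin names (as passed to `cedana plugin install`) that the
# compatibility limits above allow, or "none".
supported_plugins() {
    major=${1%%.*}   # "570.86.15" -> "570"
    arch=$2
    out=""
    # Cedana GPU: driver 452 to 570 (assumed inclusive), amd64 or arm64.
    if [ "$major" -ge 452 ] && [ "$major" -le 570 ]; then
        case $arch in
            amd64|arm64) out="gpu" ;;
        esac
    fi
    # CRIU CUDA: driver 570 only, amd64 only.
    if [ "$major" -eq 570 ] && [ "$arch" = amd64 ]; then
        out="${out:+$out }criu/cuda"
    fi
    echo "${out:-none}"
}
```

On a GPU host, the driver version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1` and passed as the first argument, for example `supported_plugins 570.86.15 amd64`.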

Usage (GPU plugin)

Single process

NOTE: Cedana GPU checkpoint/restore is only possible for managed processes/containers, i.e., those that are spawned using cedana run --gpu-enabled or managed using cedana manage --gpu-enabled (see managed process/container).

  1. Optionally, clone the cedana-samples repository for example GPU workloads.

  2. Run a process with GPU support:

     cedana run process --attach --gpu-enabled --jid <job_id> -- cedana-samples/gpu_smr/vector_add

  3. Checkpoint:

     cedana dump job <job_id>

  4. Restore:

     cedana restore job --attach <job_id>
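
Because the same job ID appears in the run, dump, and restore commands, it can help to generate all three from one place. The helper below only prints the commands from the steps above with the ID substituted; it does not run them. The function name `gpu_cycle_cmds` and the job ID `vecadd-1` are hypothetical.

```shell
#!/bin/sh
# gpu_cycle_cmds JOB_ID BINARY
# Prints the run, dump, and restore commands for one checkpoint/restore cycle.
gpu_cycle_cmds() {
    jid=$1
    bin=$2
    echo "cedana run process --attach --gpu-enabled --jid $jid -- $bin"
    echo "cedana dump job $jid"
    echo "cedana restore job --attach $jid"
}

# Example: print the cycle for the vector_add sample.
gpu_cycle_cmds vecadd-1 cedana-samples/gpu_smr/vector_add
```

Since the run command uses --attach, it is likely to hold the terminal while the process is running, so the dump command would typically be issued from a second terminal.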

Multi-process/node

For multi-process/node workloads, specify the --gpu-freeze-type option during dump. If the workload uses NCCL, set it to nccl.

cedana dump job <job_id> --gpu-freeze-type nccl

You can then restore as usual. You may also set the default GPU freeze type in the configuration.
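
A script that checkpoints both single-process and NCCL-based jobs can choose the dump command per job. The sketch below covers only the two cases described on this page (no freeze type, or nccl); other freeze type values are not listed here and are deliberately left out. The function name `dump_cmd` and its yes/no argument are hypothetical.

```shell
#!/bin/sh
# dump_cmd JOB_ID [USES_NCCL]
# Prints the dump command for a job. USES_NCCL is "yes" or "no" (default "no").
dump_cmd() {
    jid=$1
    uses_nccl=${2:-no}
    if [ "$uses_nccl" = yes ]; then
        # Multi-process/multi-node workload using NCCL.
        echo "cedana dump job $jid --gpu-freeze-type nccl"
    else
        # Single-process workload: no freeze type needed.
        echo "cedana dump job $jid"
    fi
}
```

For example, `dump_cmd train-1 yes` prints the NCCL variant, and `dump_cmd train-1` prints the plain dump command.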

Usage (CRIU CUDA plugin)

Single process

Checkpoint/restore works the same way as for CPU workloads. See checkpoint/restore basics.

Multi-process/node

This is currently not supported. You should use the Cedana GPU plugin for multi-process/node workloads.

For all available CLI options, see the CLI reference. You can also interact with the daemon directly through gRPC; see the API reference.
