Cedana
Cedana Daemon
Checkpoint/restore with GPUs


Last updated 8 hours ago


Checkpoint/restore with GPUs is currently only supported for NVIDIA GPUs.

Prerequisites

  1. Create an account with Cedana to get access to the GPU plugin. See authentication.

  2. Set the Cedana URL & authentication token in the configuration.

  3. Install a GPU plugin.

     • Option 1: GPU Plugin

       The GPU plugin is Cedana's proprietary plugin for high-performance GPU checkpoint/restore that supports multi-process/node. If it is unavailable to you, use option 2.

           sudo cedana plugin install gpu

     • Option 2: CRIU CUDA Plugin

       The CRIU CUDA plugin (CRIUgpu) is developed by the CRIU community and uses the NVIDIA CUDA checkpoint utility under the hood.

           sudo cedana plugin install criu/cuda

  4. Ensure the daemon is running; see installation.

  5. Do a health check to ensure the plugin is ready; see health checks.
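The plugin installation and health check steps above can be sketched as a small script. This is a minimal sketch, not an official installer: it assumes that `cedana daemon check` (listed in the CLI reference) is the health-check entry point and that both commands need `sudo`. The `DRY_RUN` and `PLUGIN` variables and the `run` and `setup_gpu_plugin` helpers are hypothetical names used only for illustration. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to execute them (requires an installed, configured cedana).
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Install one of the two plugins named in this guide, then health-check.
# Assumption: `cedana daemon check` performs the health check.
setup_gpu_plugin() {
  local plugin="${1:-gpu}"
  case "$plugin" in
    gpu|criu/cuda) ;;
    *) echo "unknown plugin: $plugin" >&2; return 1 ;;
  esac
  run sudo cedana plugin install "$plugin" || return 1
  run sudo cedana daemon check
}

setup_gpu_plugin "${PLUGIN:-gpu}"
```

Running it with `PLUGIN=criu/cuda` selects option 2 instead of the Cedana GPU plugin.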

Check out Cedana vs. CRIU CUDA for GPU Checkpoint/Restore for a performance comparison between the two plugins.

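| Plugin     | Min driver | Max driver | Multi-GPU | Multi-process | Arch         |
|------------|------------|------------|-----------|---------------|--------------|
| Cedana GPU | 452        | 570        | ✅        | ✅            | amd64, arm64 |
| CRIU CUDA  | 570        | 570        | ✅        | ❌            | amd64        |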
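The support table can be encoded as a quick compatibility check. The sketch below treats the driver numbers as NVIDIA driver major versions, which is an assumption, and `supported_plugins` is a hypothetical helper rather than part of the Cedana CLI. It prints every plugin whose row in the table matches the given driver, architecture, and multi-process requirement.

```shell
#!/usr/bin/env bash
# Hypothetical helper encoding the support table.
# Args: <driver major version> <arch: amd64|arm64> <multi-process: yes|no>
# Prints one matching plugin name per line; returns 1 if none match.
supported_plugins() {
  local driver="$1" arch="$2" multi="${3:-no}" status=1

  # Cedana GPU: drivers 452 to 570, amd64 or arm64, multi-process supported.
  if [ "$driver" -ge 452 ] && [ "$driver" -le 570 ]; then
    case "$arch" in
      amd64|arm64) echo "gpu"; status=0 ;;
    esac
  fi

  # CRIU CUDA: driver 570 only, amd64 only, no multi-process support.
  if [ "$driver" -eq 570 ] && [ "$arch" = "amd64" ] && [ "$multi" = "no" ]; then
    echo "criu/cuda"
    status=0
  fi

  return "$status"
}

supported_plugins 570 amd64 no
```

On a machine with the NVIDIA driver installed, the major version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1 | cut -d. -f1`.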

Usage (GPU plugin)

Single process

NOTE: Cedana GPU checkpoint/restore is only possible for managed processes/containers, i.e., those that are spawned using cedana run --gpu-enabled or managed using cedana manage --gpu-enabled (see managed process/container).

  1. You may clone the cedana-samples repository for some example GPU workloads.

  2. Run a process with GPU support:

         cedana run process --attach --gpu-enabled --jid <job_id> -- cedana-samples/gpu_smr/vector_add

  3. Checkpoint:

         cedana dump job <job_id>

  4. Restore:

         cedana restore job --attach <job_id>
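The run, checkpoint, and restore steps above can be chained in one script. This is a sketch under stated assumptions: it omits `--attach` from `cedana run process` on the assumption that the job then runs in the background so the script can continue, and it dumps immediately, whereas a real run would likely wait until the workload reaches the state you want to capture. `DRY_RUN`, `JID`, `run`, and `gpu_checkpoint_restore` are hypothetical names. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: gpu_checkpoint_restore <job_id> <command> [args...]
gpu_checkpoint_restore() {
  if [ "$#" -lt 2 ]; then
    echo "usage: gpu_checkpoint_restore <job_id> <command> [args...]" >&2
    return 2
  fi
  local jid="$1"
  shift
  # Start a managed, GPU-enabled job (assumed to run in the background
  # when --attach is not given).
  run cedana run process --gpu-enabled --jid "$jid" -- "$@" || return 1
  # Checkpoint the job by its ID.
  run cedana dump job "$jid" || return 1
  # Restore it and attach to its output.
  run cedana restore job --attach "$jid"
}

gpu_checkpoint_restore "${JID:-demo-job}" cedana-samples/gpu_smr/vector_add
```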

Multi-process/node

For multi-process/node workloads, you only need to specify the --gpu-freeze-type option during dump. If the workload is multi-process/multi-node and uses NCCL, use the nccl option.

cedana dump job <job_id> --gpu-freeze-type nccl

You can then restore as usual. You may also set the default GPU freeze type in the configuration.
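A wrapper can make the freeze type optional, so the same call works for single-process and NCCL workloads. This is a minimal sketch: `dump_job`, `run`, and `DRY_RUN` are hypothetical names, and `nccl` is the only freeze type value taken from this guide. By default the commands are printed rather than executed.

```shell
#!/usr/bin/env bash
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: dump_job <job_id> [gpu_freeze_type]
# Adds --gpu-freeze-type only when a freeze type is given.
dump_job() {
  local jid="${1:-}" freeze="${2:-}"
  if [ -z "$jid" ]; then
    echo "usage: dump_job <job_id> [gpu_freeze_type]" >&2
    return 2
  fi
  if [ -n "$freeze" ]; then
    run cedana dump job "$jid" --gpu-freeze-type "$freeze"
  else
    run cedana dump job "$jid"
  fi
}

dump_job demo-job nccl
```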

Usage (CRIU CUDA plugin)

Single process

You can checkpoint/restore as you would for CPU workloads. See checkpoint/restore basics.

Multi-process/node

This is currently not supported. You should use the Cedana GPU plugin for multi-process/node workloads.

For all available CLI options, see CLI reference. Interacting with the daemon directly is also possible through gRPC; see API reference.

Cedana vs. CRIU CUDA for GPU Checkpoint/Restore