Cedana
Cedana Daemon
Checkpoint/restore with GPUs


Last updated 9 days ago


Checkpoint/restore with GPUs is currently only supported for NVIDIA GPUs.

Prerequisites

  1. Create an account with Cedana to get access to the GPU plugin. See authentication.

  2. Set the Cedana URL and authentication token in the configuration.

  3. Install a GPU plugin.

  • Option 1: GPU Plugin

    The GPU plugin is Cedana's proprietary plugin for high-performance GPU checkpoint/restore, with support for multi-process/node workloads. If it is unavailable to you, use option 2.

    sudo cedana plugin install gpu
  • Option 2: CRIU CUDA Plugin

    The CRIU CUDA plugin (CRIUgpu) is developed by the CRIU community and uses the NVIDIA CUDA checkpoint utility under the hood.

    sudo cedana plugin install criu/cuda
  4. Ensure the daemon is running; see installation.

  5. Run a health check to ensure the plugin is ready; see health checks.
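The plugin installation and health check steps above can be sketched as a small shell helper. This is a minimal sketch, not an official script: `run_step` and `setup_gpu_plugin` are hypothetical names, and it assumes that `cedana daemon check` (listed in the CLI reference) is the health check meant here and that it may need `sudo` depending on your setup. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Print the command when DRY_RUN=1 (the default); execute it otherwise.
# run_step and DRY_RUN are hypothetical helpers, not part of Cedana.
run_step() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1: plugin name as used in this guide, "gpu" or "criu/cuda" (defaults to "gpu").
setup_gpu_plugin() {
  plugin="${1:-gpu}"
  case "$plugin" in
    gpu|criu/cuda) ;;
    *) echo "unknown plugin: $plugin" >&2; return 1 ;;
  esac
  # Install command taken from this guide.
  run_step sudo cedana plugin install "$plugin" || return 1
  # Assumption: the daemon is already running and this subcommand performs
  # the health check; prefix with sudo if your setup requires it.
  run_step cedana daemon check
}
```

Calling `setup_gpu_plugin criu/cuda` in the default dry-run mode prints the two commands it would run, which makes it easy to review them before executing anything with `DRY_RUN=0`.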

Check out Cedana vs. CRIU CUDA for GPU Checkpoint/Restore for a performance comparison between the two plugins.

Plugin        Min driver   Max driver   Multi-GPU   Multi-process   Arch
Cedana GPU    452          570          ✅          ✅              amd64, arm64
CRIU CUDA     570          570          ✅          ❌              amd64
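To check an installed driver against the table above, a helper along these lines may be useful. It is a sketch under two assumptions that the table does not state explicitly: that the listed numbers are driver major versions, and that both bounds are inclusive. `driver_supported` is a hypothetical name, not a Cedana command.

```shell
# Hypothetical helper: succeed if a driver's major version falls within the
# range listed in the table for the given plugin ("gpu" or "criu/cuda").
driver_supported() {
  plugin="$1"
  major="${2%%.*}"   # "535.104.05" -> "535"
  case "$plugin" in
    gpu)       min=452; max=570 ;;
    criu/cuda) min=570; max=570 ;;
    *) echo "unknown plugin: $plugin" >&2; return 2 ;;
  esac
  # Assumption: bounds are inclusive and refer to the major version only.
  [ "$major" -ge "$min" ] && [ "$major" -le "$max" ]
}

# Usage on a machine with an NVIDIA driver installed:
#   version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   driver_supported gpu "$version" && echo "driver is in the Cedana GPU plugin range"
```

The function only compares version numbers; it does not verify GPU architecture or CPU architecture support.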

Usage (GPU plugin)

Single process

  1. Run a process with GPU support:

    cedana run process --attach --gpu-enabled --jid <job_id> -- cedana-samples/gpu_smr/vector_add

  2. Checkpoint:

    cedana dump job <job_id>

  3. Restore:

    cedana restore job --attach <job_id>
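The run, checkpoint, and restore steps above can be chained in a script. The following is a minimal sketch: `run_step` and `gpu_checkpoint_cycle` are hypothetical names, and it assumes that omitting `--attach` from `cedana run process` lets the script continue to the checkpoint step rather than staying attached to the job. It prints the commands by default; set `DRY_RUN=0` to execute them.

```shell
# Print the command when DRY_RUN=1 (the default); execute it otherwise.
# run_step and DRY_RUN are hypothetical helpers, not part of Cedana.
run_step() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: gpu_checkpoint_cycle <job_id> <command> [args...]
gpu_checkpoint_cycle() {
  if [ "$#" -lt 2 ]; then
    echo "usage: gpu_checkpoint_cycle <job_id> <command> [args...]" >&2
    return 1
  fi
  jid="$1"
  shift
  # Assumption: without --attach the job runs detached from this shell.
  run_step cedana run process --gpu-enabled --jid "$jid" -- "$@" || return 1
  # Checkpoint and restore commands are taken from the steps in this guide.
  run_step cedana dump job "$jid" || return 1
  run_step cedana restore job --attach "$jid"
}
```

For example, `gpu_checkpoint_cycle demo cedana-samples/gpu_smr/vector_add` prints the three commands for a job with the ID `demo`. In a real run you would likely wait for the workload to reach an interesting state before the checkpoint step.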

Multi-process/node

cedana run process --gpu-enabled --gpu-type nccl ...

Usage (CRIU CUDA plugin)

Single process

Multi-process/node

This is currently not supported. You should use the Cedana GPU plugin for multi-process/node workloads.

NOTE: Cedana GPU checkpoint/restore is only possible for managed processes/containers, i.e., those that are spawned using cedana run --gpu-enabled or managed using cedana manage --gpu-enabled (see managed process/container).

You may clone the cedana-samples repository for some example GPU workloads.

For multi-process/node workloads, specify the --gpu-type option when running the job. If the workload is multi-process/multi-node and uses NCCL, use the nccl option.

You can then checkpoint/restore as usual. You may also set the default GPU multi-process type in the configuration.

You can checkpoint/restore just as you do for CPU workloads. See checkpoint/restore basics.

For all available CLI options, see the CLI reference. You can also interact with the daemon directly through gRPC; see the API reference.
