Checkpoint/restore with GPUs

Prerequisites

  1. Create a Cedana account to get access to the GPU plugin (see authentication).

  2. Set the Cedana URL and authentication token in the configuration.

  3. Install the GPU plugin:

sudo cedana plugin install gpu

  4. Ensure the daemon is running (see installation).

  5. Run a health check to confirm the plugin is ready (see health checks).
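
Before working through these steps, it can help to confirm that the cedana binary is on your PATH. The following is a minimal sketch: the helper name have_cmd is hypothetical and not part of Cedana, and it only checks that the binary exists. It does not replace the daemon or plugin health check.

```shell
#!/bin/sh
# Hypothetical helper (not part of Cedana): succeeds if a command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd cedana; then
    echo "cedana found at: $(command -v cedana)"
else
    echo "cedana not found on PATH; see installation" >&2
fi
```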

Usage

NOTE: GPU checkpoint/restore is only possible for managed processes/containers, i.e., those that are spawned using cedana run (see managed process/container).

  1. Optionally, clone the cedana-samples repository to get some example GPU workloads.

  2. Run a process with GPU support:

cedana run process --attach --gpu-enabled --jid <job_id> -- cedana-samples/gpu_smr/vector_add

  3. Checkpoint:

cedana dump job <job_id>

  4. Restore:

cedana restore job --attach <job_id>
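
The run, checkpoint, and restore steps all refer to the same job ID, so it can be convenient to generate the three commands from a single value. The sketch below only prints the commands from the steps above and does not execute them. The function name gpu_cr_commands and the default job ID vector-add-demo are placeholders chosen for illustration, not Cedana names. Because the run command uses --attach, the checkpoint would presumably be issued from a second terminal while the job is running.

```shell
#!/bin/sh
# Sketch only: prints the run/dump/restore commands for one job ID.
# JOB_ID and WORKLOAD defaults are illustrative placeholders.
JOB_ID="${JOB_ID:-vector-add-demo}"
WORKLOAD="${WORKLOAD:-cedana-samples/gpu_smr/vector_add}"

gpu_cr_commands() {
    # $1: job ID shared by all three steps
    # $2: path to the GPU workload to run
    printf 'cedana run process --attach --gpu-enabled --jid %s -- %s\n' "$1" "$2"
    printf 'cedana dump job %s\n' "$1"
    printf 'cedana restore job --attach %s\n' "$1"
}

gpu_cr_commands "$JOB_ID" "$WORKLOAD"
```

Copy each printed line into a terminal in order. Keeping the job ID in one variable avoids checkpointing or restoring a different job than the one that was started.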

For all available CLI options, see the CLI reference. You can also interact with the daemon directly through gRPC (see the API reference).
