GPU Save/Migrate/Resume
GPU Checkpointing
Cedana can also checkpoint and restore applications running on NVIDIA GPUs. You'll need to start and manage the workload with our daemon, which can be found here.
Currently, the supported CUDA driver API versions range from 11.8 (minimum) to 12.4 (maximum). The binaries are backwards compatible with older CUDA versions.
You can check the CUDA driver API version by running nvidia-smi and looking at the CUDA Version field.
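As a rough illustration, the version check can be scripted. The sketch below assumes that the nvidia-smi header contains a field of the form "CUDA Version: X.Y"; the function names (cuda_to_int, cuda_supported, parse_cuda_version) are hypothetical helpers, not part of Cedana.

```shell
#!/bin/sh
# Supported CUDA driver API range, as stated in the text above.
MIN_CUDA="11.8"
MAX_CUDA="12.4"

# Convert "major.minor" into one integer (major * 1000 + minor)
# so that versions can be compared numerically.
cuda_to_int() {
    major=${1%%.*}
    case "$1" in
        *.*) minor=${1#*.}; minor=${minor%%.*} ;;
        *)   minor=0 ;;
    esac
    echo $(( major * 1000 + minor ))
}

# Succeeds when the given version lies within [MIN_CUDA, MAX_CUDA].
cuda_supported() {
    v=$(cuda_to_int "$1")
    [ "$v" -ge "$(cuda_to_int "$MIN_CUDA")" ] && [ "$v" -le "$(cuda_to_int "$MAX_CUDA")" ]
}

# Read nvidia-smi output on stdin and print the value of the CUDA Version field.
parse_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

if command -v nvidia-smi >/dev/null 2>&1; then
    version=$(nvidia-smi | parse_cuda_version)
    if [ -z "$version" ]; then
        echo "Could not find a CUDA Version field in the nvidia-smi output"
    elif cuda_supported "$version"; then
        echo "CUDA $version is within the supported range ($MIN_CUDA to $MAX_CUDA)"
    else
        echo "CUDA $version is outside the supported range ($MIN_CUDA to $MAX_CUDA)"
    fi
else
    echo "nvidia-smi not found; cannot determine the CUDA driver API version"
fi
```

The range limits are kept in two variables at the top so they can be adjusted if the supported versions change.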
PyTorch 2.5.1 is also currently unsupported; we recommend downgrading to 2.4.1 or earlier.
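A quick way to see whether an installed PyTorch falls within the recommendation is sketched below. It assumes PyTorch is importable from python3 and treats any version above 2.4.1 as outside the recommendation; torch_to_int and torch_recommended are hypothetical helper names.

```shell
#!/bin/sh
# Highest PyTorch version recommended in the text above.
MAX_TORCH="2.4.1"

# Convert "major.minor.patch[+local]" into a comparable integer.
torch_to_int() {
    v=${1%%+*}                 # drop a local suffix such as "+cu121"
    old_ifs=$IFS
    IFS=.
    set -- $v                  # split on dots into $1 $2 $3
    IFS=$old_ifs
    major=${1%%[!0-9]*}        # keep only the leading digits of each part
    minor=${2%%[!0-9]*}
    patch=${3%%[!0-9]*}
    echo $(( ${major:-0} * 1000000 + ${minor:-0} * 1000 + ${patch:-0} ))
}

# Succeeds when the given version is at or below MAX_TORCH.
torch_recommended() {
    [ "$(torch_to_int "$1")" -le "$(torch_to_int "$MAX_TORCH")" ]
}

if command -v python3 >/dev/null 2>&1 &&
   version=$(python3 -c 'import torch; print(torch.__version__)' 2>/dev/null); then
    if torch_recommended "$version"; then
        echo "PyTorch $version is at or below $MAX_TORCH"
    else
        echo "PyTorch $version is above $MAX_TORCH; consider downgrading"
    fi
else
    echo "PyTorch is not importable from python3; skipping the check"
fi
```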
Quickstart and Setup
To get started quickly, clone the repo and run the ./build-start-daemon.sh script, which starts the cedana daemon in the background. On managed and Kubernetes clusters, the daemon is set up automatically as part of the cedana-attach endpoint. To enable GPU checkpointing support, pass the --gpu flag to ./build-start-daemon.sh.
You can omit the --systemctl flag if you're not running on a systemctl-enabled Linux machine or would rather have the daemon simply run in the background. Logs (if run with --systemctl) are forwarded to /var/log/cedana-daemon.log and are also accessible from journalctl.
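The flag choices described above can be sketched as a small helper that assembles the script invocation. The flags --gpu and --systemctl come from this section; their order, the daemon_args helper, and the decision to detect systemctl automatically are assumptions made for illustration. The sketch only prints the command rather than running it.

```shell
#!/bin/sh
# Build the argument list for build-start-daemon.sh.
# $1: "yes" to enable GPU checkpointing support.
# $2: "yes" to pass --systemctl (only sensible on systemctl-enabled machines).
daemon_args() {
    args=""
    if [ "$1" = "yes" ]; then args="$args --gpu"; fi
    if [ "$2" = "yes" ]; then args="$args --systemctl"; fi
    echo "${args# }"
}

# Use --systemctl only when systemctl is actually available on this machine.
if command -v systemctl >/dev/null 2>&1; then
    use_systemctl=yes
else
    use_systemctl=no
fi

# Print the resulting command so it can be reviewed before running it.
echo "./build-start-daemon.sh $(daemon_args yes "$use_systemctl")"
```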
Alternatively, you can start the daemon directly by running sudo -E cedana daemon start --gpu-enabled & from a shell. To communicate with our system, you'll need to set CEDANA_URL and CEDANA_AUTH_TOKEN. To get the auth token, follow the authentication steps in Authentication.
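Because sudo -E preserves the caller's environment, both variables need to be exported in the shell that launches the daemon. A minimal pre-flight check might look like the sketch below; missing_cedana_env is a hypothetical helper, and the placeholder values in the comments are assumptions to be replaced with the URL from your Cedana contact and your own token.

```shell
#!/bin/sh
# Print the names of any required variables that are unset or empty.
missing_cedana_env() {
    missing=""
    [ -n "${CEDANA_URL:-}" ] || missing="$missing CEDANA_URL"
    [ -n "${CEDANA_AUTH_TOKEN:-}" ] || missing="$missing CEDANA_AUTH_TOKEN"
    echo "${missing# }"
}

# Placeholder values (assumptions); uncomment and fill in real values:
# export CEDANA_URL="<url-from-your-cedana-contact>"
# export CEDANA_AUTH_TOKEN="<your-auth-token>"

missing=$(missing_cedana_env)
if [ -n "$missing" ]; then
    echo "Set these variables before starting the daemon: $missing"
else
    echo "Environment looks complete; the daemon can be started with:"
    echo "  sudo -E cedana daemon start --gpu-enabled &"
fi
```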
NOTE: During startup, the daemon checks for our GPU checkpointing binaries and files on your system and, if they are not present, downloads them from the URL set in CEDANA_URL. Ask your point of contact at Cedana for a valid CEDANA_URL.
To checkpoint or restore GPU applications, you'll need to start the task or job from cedana.
stdout and stderr are written to /var/log/cedana-output.log by default. To change the destination, pass the -l flag with a path of your choice to the exec command.
Checkpoint/Restore
You can keep track of the running process by checking its PID or by using cedana ps, which also keeps track of all checkpoints taken for that job ID. Checkpointing the process creates a GPU checkpoint of it. If the -d flag is omitted from the checkpoint command, the checkpoint is placed in /tmp.
To restore the process from the checkpoint, you can run:
Logs for the newly spawned process (from the checkpoint) are directed to /var/log/cedana-output-TIMESTAMP.log.