GPU Checkpointing

Cedana can also checkpoint and restore applications running on NVIDIA GPUs. You'll need to start and manage the workload with our daemon, which can be found here.

The following CUDA and accompanying library versions are supported:

CUDA Version	cuDNN Version
11.8	8.x
12.1	8.x
12.4	9.x

Quickstart and Setup

To get started quickly, clone the repo and run the ./build-start-daemon.sh script, which starts the cedana daemon in the background. On managed and Kubernetes clusters, the daemon is set up automatically as part of the cedana-attach endpoint. To enable GPU checkpointing support, pass the --gpu flag to ./build-start-daemon.sh:

export CEDANA_GPU_ENABLED=true 
./build-start-daemon.sh --systemctl --gpu

You can omit the --systemctl flag if your Linux machine does not use systemd, or if you would simply rather the daemon run in the background. When run with --systemctl, logs are forwarded to /var/log/cedana-daemon.log and are also accessible from journalctl.
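As a rough sketch of the choice described above, the helper below picks the script arguments depending on whether systemd is running. It assumes the conventional check for a systemd-booted machine (the /run/systemd/system directory exists). daemon_start_args is a hypothetical name, not part of the cedana tooling.

```shell
# Hypothetical helper: print the arguments for build-start-daemon.sh.
# The optional first argument overrides the directory that is probed,
# which is mainly useful for testing.
daemon_start_args() {
    systemd_dir="${1:-/run/systemd/system}"
    if [ -d "$systemd_dir" ]; then
        # systemd is running, so let it manage the daemon
        echo "--systemctl --gpu"
    else
        # no systemd, so the script runs the daemon in the background itself
        echo "--gpu"
    fi
}

# Usage (from the repository root):
#   export CEDANA_GPU_ENABLED=true
#   ./build-start-daemon.sh $(daemon_start_args)
```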

Alternatively, you can start the daemon directly by running sudo -E cedana daemon start --gpu-enabled & (the -E flag makes sudo preserve your environment variables). To communicate with our system, you'll need to set CEDANA_URL and CEDANA_AUTH_TOKEN. To get the auth token, follow the steps in Authentication.

NOTE: During startup, the daemon checks for our GPU checkpointing binaries and supporting files on your system and, if they are not present, downloads them from the URL set in CEDANA_URL. Ask your point of contact at Cedana for a valid CEDANA_URL.
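Because the daemon depends on these variables at startup, it can help to confirm they are set before launching it. The following is a minimal sketch; check_cedana_env is a hypothetical helper, not part of the cedana CLI.

```shell
# Hypothetical helper: fail early if a required variable is unset or empty.
check_cedana_env() {
    missing=0
    for name in CEDANA_URL CEDANA_AUTH_TOKEN; do
        # Look up the variable whose name is stored in $name (POSIX sh compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_cedana_env && sudo -E cedana daemon start --gpu-enabled &
```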

By default, the cedana daemon downloads GPU binaries built against CUDA 11.8. If your system uses CUDA 12.1 or 12.4, override this by passing the --cuda flag when you start the daemon:

sudo -E cedana daemon start --gpu-enabled --cuda 12.4 &
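If you'd rather not hard-code the version, one possible approach is to derive the flag from the installed toolkit. This sketch assumes nvcc is on your PATH and that the version it reports is the one your workload uses. Both helper names are hypothetical, and only the versions listed in the table above are accepted.

```shell
# Hypothetical helper: extract "MAJOR.MINOR" from `nvcc --version` output on stdin.
parse_nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: map a CUDA version to the daemon's --cuda argument.
cuda_flag_for() {
    case "$1" in
        11.8) ;;                          # default build, no flag needed
        12.1|12.4) echo "--cuda $1" ;;
        *) echo "unsupported CUDA version: $1" >&2; return 1 ;;
    esac
}

# Usage ($flags is left unquoted on purpose so it splits into two words):
#   version=$(nvcc --version | parse_nvcc_release)
#   flags=$(cuda_flag_for "$version") && sudo -E cedana daemon start --gpu-enabled $flags &
```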

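To checkpoint and restore a GPU application, you'll need to start the task or job through cedana. In the example below, the -i flag gives the job an ID (llm_inference), which the dump and restore commands later refer to: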

cedana exec "python3 llm_inference.py" -i llm_inference --gpu-enabled

By default, stdout and stderr are written to /var/log/cedana-output.log. To change this, pass the -l flag with a path of your choice to the exec command.

Checkpoint/Restore

You can keep track of the running process by checking on its PID or by using cedana ps, which also keeps track of all checkpoints taken for that job ID. To checkpoint the process, run:

cedana dump job llm_inference -d DIRECTORY --gpu-enabled

This creates a GPU checkpoint of the process. If -d is omitted, the checkpoint is placed in /tmp.
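Since /tmp is cleared on reboot on many systems, you may prefer to pass an explicit directory. The sketch below creates a fresh, timestamped directory per checkpoint. checkpoint_dir_for and the default base path $HOME/cedana-checkpoints are illustrative assumptions, not cedana conventions.

```shell
# Hypothetical helper: create and print a timestamped checkpoint directory.
#   $1 = job ID, $2 = optional base directory
checkpoint_dir_for() {
    job="$1"
    base="${2:-$HOME/cedana-checkpoints}"
    dir="$base/$job/$(date +%Y%m%dT%H%M%S)"
    # Print the path only if the directory was created successfully.
    mkdir -p "$dir" && echo "$dir"
}

# Usage:
#   cedana dump job llm_inference -d "$(checkpoint_dir_for llm_inference)" --gpu-enabled
```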

To restore the process from the checkpoint, you can run:

cedana restore job llm_inference

Logs for the newly spawned process (from the checkpoint) are directed to /var/log/cedana-output-TIMESTAMP.log.
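Because the timestamp in the file name changes with every restore, a small helper can locate the most recent log. This is a sketch that assumes only the file name pattern given above and picks the newest file by modification time. latest_restore_log is a hypothetical name.

```shell
# Hypothetical helper: print the most recently modified restore log.
#   $1 = optional log directory (defaults to /var/log)
latest_restore_log() {
    log_dir="${1:-/var/log}"
    # ls -t sorts newest first; errors are silenced when nothing matches.
    ls -t "$log_dir"/cedana-output-*.log 2>/dev/null | head -n 1
}

# Usage:
#   tail -f "$(latest_restore_log)"
```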