GPU Save/Migrate/Resume

GPU Checkpointing

Cedana can also checkpoint and restore applications running on NVIDIA GPUs. You'll need to start and manage the workload with the Cedana daemon.

The maximum supported CUDA driver API version is currently 12.4; the minimum is 11.8. The binaries are backwards compatible with older CUDA versions.

You can check the CUDA driver API version by running nvidia-smi and looking at the CUDA Version field.
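This check can also be scripted. The banner line below is a fabricated sample of nvidia-smi's first output row; on a real machine you would pipe nvidia-smi itself through the same filter.

```shell
# Sample banner line as printed by nvidia-smi (values are illustrative):
header="| NVIDIA-SMI 550.54.14    Driver Version: 550.54.14    CUDA Version: 12.4     |"

# Pull out just the CUDA driver API version number.
# On a live machine: nvidia-smi | grep -oE 'CUDA Version: [0-9]+\.[0-9]+'
cuda_version=$(echo "$header" | grep -oE 'CUDA Version: [0-9]+\.[0-9]+' | awk -F': ' '{print $2}')
echo "$cuda_version"
```

Compare the printed value against the supported range (11.8 through 12.4) before attempting a GPU checkpoint.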

PyTorch 2.5.1 is also currently unsupported; we recommend downgrading to 2.4.1 or below.

Quickstart and Setup

To get started quickly, clone the repo and run the ./build-start-daemon.sh script, which starts the cedana daemon in the background. On managed and Kubernetes clusters, the daemon is set up automatically as part of the cedana-attach endpoint. To enable GPU checkpointing support, pass the --gpu flag to ./build-start-daemon.sh:

export CEDANA_GPU_ENABLED=true 
./build-start-daemon.sh --systemctl --gpu

You can omit the --systemctl flag if you're not running on a systemd-enabled Linux machine or would rather run the daemon in the background. Logs (if run with --systemctl) are forwarded to /var/log/cedana-daemon.log and are also accessible via journalctl.

Alternatively, you can start the daemon with sudo -E cedana daemon start --gpu-enabled &. To communicate with our system, you'll need to set CEDANA_URL and CEDANA_AUTH_TOKEN. To get the auth token, you should follow the authentication steps in Authentication.

NOTE: During startup, the daemon checks for our GPU-checkpointing binaries and files on your system and, if they are not present, downloads them from the URL set in CEDANA_URL. Ask your point of contact at Cedana for a valid CEDANA_URL.
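Putting the setup steps above together, a minimal environment for starting the daemon manually might look like the following; both values are placeholders for credentials that Cedana provides.

```shell
# Placeholder values; substitute the URL and auth token from your
# Cedana point of contact (see Authentication).
export CEDANA_URL="<your-cedana-url>"
export CEDANA_AUTH_TOKEN="<your-auth-token>"

# -E preserves the exported variables across sudo:
sudo -E cedana daemon start --gpu-enabled &
```

The -E flag matters here: without it, sudo resets the environment and the daemon will not see CEDANA_URL or CEDANA_AUTH_TOKEN.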

To checkpoint/restore GPU applications, you'll need to start the task or job through cedana:

cedana exec "python3 llm_inference.py" -i llm_inference --gpu-enabled 

stdout and stderr are automatically written to /var/log/cedana-output.log; you can change this path by passing the -l flag to the exec command.

Checkpoint/Restore

You can keep track of the running process by checking its PID or by using cedana ps, which also lists all checkpoints taken for that job ID. To checkpoint the process, run:

cedana dump job llm_inference -d DIRECTORY --gpu-enabled

This creates a GPU checkpoint of the process. If -d is omitted, the checkpoint is placed in /tmp.

To restore the process from the checkpoint, you can run:

cedana restore job llm_inference

Logs for the newly spawned process (from the checkpoint) are directed to /var/log/cedana-output-TIMESTAMP.log.
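The full cycle described above can be sketched as one short session; the script name, job ID, and checkpoint directory are illustrative and assume a daemon already running with GPU support enabled.

```shell
# Start the GPU workload under cedana's management:
cedana exec "python3 llm_inference.py" -i llm_inference --gpu-enabled

# Confirm the job is running and view its checkpoints:
cedana ps

# Take a GPU checkpoint into a directory of your choice:
cedana dump job llm_inference -d /tmp/ckpts --gpu-enabled

# Later (or on another machine with the checkpoint available),
# resume the process from that checkpoint:
cedana restore job llm_inference
```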
