Quickstart
This guide provides a brief overview of the basic commands for checkpointing and restoring SLURM jobs using Cedana.
Checkpointing a Job
To save the current state of a running SLURM job, use the dump command. This creates a checkpoint of the specified job so that it can be restored later.
Replace <job_id> with the ID of the SLURM job you want to checkpoint, and <checkpoint_dir> with the directory where the checkpoint should be written.
cedana dump slurm <job_id> --dir <checkpoint_dir>
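As a minimal sketch, the dump command can be wrapped in a small shell function that validates its arguments first. The function name checkpoint_job and the example path are hypothetical, and creating the directory up front is a precaution rather than a documented requirement.

```shell
#!/usr/bin/env bash
# Sketch: checkpoint a running SLURM job by ID.
# checkpoint_job is a hypothetical helper, not part of the Cedana CLI.

checkpoint_job() {
    local job_id="$1"
    local checkpoint_dir="$2"

    if [ -z "$job_id" ] || [ -z "$checkpoint_dir" ]; then
        echo "usage: checkpoint_job <job_id> <checkpoint_dir>" >&2
        return 2
    fi

    # Assumption: make sure the target directory exists before dumping into it.
    mkdir -p "$checkpoint_dir" || return 1

    # The dump command from this guide.
    cedana dump slurm "$job_id" --dir "$checkpoint_dir"
}

# Example call (hypothetical job ID and path):
#   checkpoint_job 12345 "/scratch/$USER/checkpoints/12345"
```

If you do not know the job ID, squeue -u "$USER" lists your queued and running jobs along with their IDs.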
Restoring a Job
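To queue a previously checkpointed job, use the restore command. This submits a new job to the SLURM queue that resumes execution from the last saved checkpoint.
Replace <job_id> with the ID of the checkpointed job, and <checkpoint_path> with the path to the saved checkpoint.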
cedana restore slurm <job_id> --path <checkpoint_path>
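The restore step can be sketched the same way. The helper name restore_job is hypothetical, and the existence check on the checkpoint path is an added safeguard that this guide does not require.

```shell
#!/usr/bin/env bash
# Sketch: queue a restore of a previously checkpointed SLURM job.
# restore_job is a hypothetical helper, not part of the Cedana CLI.

restore_job() {
    local job_id="$1"
    local checkpoint_path="$2"

    if [ -z "$job_id" ] || [ -z "$checkpoint_path" ]; then
        echo "usage: restore_job <job_id> <checkpoint_path>" >&2
        return 2
    fi

    # Assumption: fail early if the checkpoint path does not exist.
    if [ ! -e "$checkpoint_path" ]; then
        echo "checkpoint not found: $checkpoint_path" >&2
        return 1
    fi

    # The restore command from this guide.
    cedana restore slurm "$job_id" --path "$checkpoint_path"
}

# Example call (hypothetical job ID and path):
#   restore_job 12345 "/scratch/$USER/checkpoints/12345"
```

Because the restore is submitted as a new job, squeue -u "$USER" should show it in the queue after the command returns.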
Automatic Restore on Preemption
For jobs running in a preemptive queue, you can configure them to be automatically checkpointed and restored if they are suspended. To enable this feature, add the --restore-on-suspend flag when submitting your job with sbatch.
Here is an example of how to submit a job with this flag enabled:
sbatch --restore-on-suspend job.sh
When a higher-priority job preempts this one, SLURM will suspend it, Cedana will automatically create a checkpoint, and then requeue the job to be restored once resources become available.
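For illustration, the sketch below writes a minimal job.sh and submits it with the flag enabled. The job name, partition name, time limit, workload, and both helper function names are placeholders and assumptions. Combining the flag with sbatch's standard --parsable option, which prints only the job ID, has not been verified against Cedana's documentation.

```shell
#!/usr/bin/env bash
# Sketch: create a minimal batch script and submit it with restore-on-suspend.
# write_job_script and submit_with_restore are hypothetical helpers.

write_job_script() {
    # All #SBATCH values and the workload below are placeholders.
    cat > "$1" <<'EOF'
#!/bin/bash
#SBATCH --job-name=cedana-demo
#SBATCH --partition=preempt
#SBATCH --time=01:00:00
#SBATCH --output=%x-%j.out

srun ./my_workload
EOF
}

submit_with_restore() {
    # --parsable makes sbatch print the job ID instead of a full sentence,
    # which is convenient for later dump or restore commands.
    sbatch --parsable --restore-on-suspend "$1"
}

# Example:
#   write_job_script job.sh
#   job_id=$(submit_with_restore job.sh)
#   echo "submitted job $job_id"
```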