Quickstart

This guide covers the basic commands for checkpointing and restoring SLURM jobs with Cedana.

Checkpointing a Job

To save the current state of a running SLURM job, use the dump command. It creates a checkpoint of the specified job so that the job can be restored later.

In the command below, replace <job_id> with the ID of the SLURM job you want to checkpoint and <checkpoint_dir> with the directory where the checkpoint should be written.

cedana dump slurm <job_id> --dir <checkpoint_dir>
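
If you know a job's name but not its ID, the lookup and the dump can be combined. The following is a minimal sketch, not part of Cedana: checkpoint_by_name is a hypothetical helper, and it assumes that squeue and cedana are on your PATH and that the first running job with the given name is the one you want.

```shell
#!/usr/bin/env bash
# Sketch: checkpoint a running job identified by name instead of by ID.
# checkpoint_by_name is a hypothetical helper, not part of Cedana.

checkpoint_by_name() {
    local job_name="$1"
    local checkpoint_dir="$2"
    local job_id

    # squeue options: -h drops the header, -u filters by user,
    # -n filters by job name, -t RUNNING keeps only running jobs,
    # -o '%i' prints just the job ID. Take the first match.
    job_id="$(squeue -h -u "$(id -un)" -n "$job_name" -t RUNNING -o '%i' | head -n 1)"

    if [ -z "$job_id" ]; then
        echo "no running job named '$job_name'" >&2
        return 1
    fi

    cedana dump slurm "$job_id" --dir "$checkpoint_dir"
}
```

For example, checkpoint_by_name train /scratch/ckpt/train would checkpoint your running job named train. Both the job name and the directory here are placeholders.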

Restoring a Job

To requeue a previously checkpointed job, use the restore command. It submits a new job to the SLURM queue that resumes execution from the last saved checkpoint. Replace <checkpoint_path> with the path to that checkpoint.

cedana restore slurm <job_id> --path <checkpoint_path>
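
A mistyped path is easier to catch before anything is submitted to the queue. The sketch below wraps the restore command in a small check. restore_checked is a hypothetical helper, not part of Cedana, and it assumes that the checkpoint lives on a filesystem visible from the host where you run it.

```shell
#!/usr/bin/env bash
# Sketch: restore a job only if the checkpoint path is present.
# restore_checked is a hypothetical helper, not part of Cedana.

restore_checked() {
    local job_id="$1"
    local checkpoint_path="$2"

    # Assumption: the checkpoint is visible from this host
    # (for example, on a shared filesystem).
    if [ ! -e "$checkpoint_path" ]; then
        echo "checkpoint not found: $checkpoint_path" >&2
        return 1
    fi

    cedana restore slurm "$job_id" --path "$checkpoint_path"
}
```

After a successful call, squeue -u "$(id -un)" lists your pending and running jobs, which is one way to confirm that the restored job was queued.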

Automatic Restore on Preemption

Jobs running in a preemptible queue can be configured to be checkpointed and restored automatically when they are suspended. To enable this, add the --restore-on-suspend flag when submitting the job with sbatch.

For example:

sbatch --restore-on-suspend job.sh

When a higher-priority job preempts this one, SLURM suspends it. Cedana then automatically creates a checkpoint and requeues the job so that it is restored once resources become available.
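
It is often useful to record the job ID at submission time, for instance to checkpoint the job manually later. The sketch below does this with sbatch's standard --parsable option. submit_with_restore is a hypothetical helper, and the sketch assumes that --restore-on-suspend can be combined with --parsable on your cluster.

```shell
#!/usr/bin/env bash
# Sketch: submit a job with automatic restore enabled and print its job ID.
# submit_with_restore is a hypothetical helper, not part of Cedana or SLURM.

submit_with_restore() {
    local script="$1"
    local out

    # --parsable makes sbatch print "jobid" or "jobid;cluster" instead of
    # the usual "Submitted batch job <jobid>" message.
    out="$(sbatch --parsable --restore-on-suspend "$script")" || return 1

    # Drop the optional ";cluster" suffix so only the job ID is printed.
    echo "${out%%;*}"
}
```

A call such as job_id="$(submit_with_restore job.sh)" stores the ID in a shell variable that can then be passed to other commands.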
