Checkpoint/Restore

Start checkpointing, migrating, and restoring stateful workloads in SLURM in under 5 minutes!

No additional configuration is required for either CPU or GPU workloads. Below is an example of a simple CPU workload that counts from 0 to 600, sleeping for 1 second after each count.

#!/bin/bash
#SBATCH --job-name=hello_world          # Job name
#SBATCH --output=hello_world.out        # Standard output log
#SBATCH --error=hello_world.err         # Standard error log
#SBATCH --time=00:15:00                 # Time limit (hh:mm:ss); leaves headroom for the ~601 s loop
#SBATCH --nodes=1                       # Run on 1 node
#SBATCH --ntasks=1                      # Run 1 task

echo "Starting counter job on $(hostname)..."

# Loop from 0 to 600
for i in {0..600}
do
    echo "Counter: $i"
    sleep 1
done

echo "Job finished successfully."
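
To try the example, the script can be submitted with the standard SLURM client tools. The following is a minimal sketch, not a required workflow: it assumes the script above is saved as hello_world.sh (a hypothetical filename) and that sbatch and squeue are on your PATH. The helper name submit_counter_job is also only illustrative.

```shell
# Assumption: the job script above is saved as hello_world.sh (hypothetical name).
submit_counter_job() {
    local script job_id
    script="${1:-hello_world.sh}"
    # --parsable makes sbatch print only the job ID (optionally followed by ";cluster").
    job_id="$(sbatch --parsable "$script")" || return 1
    job_id="${job_id%%;*}"
    echo "Submitted job $job_id"
    # Show the job's current state in the queue.
    squeue --jobs="$job_id"
}

# Submit only when the SLURM client tools and the script are actually present.
if command -v sbatch >/dev/null 2>&1 && [ -f hello_world.sh ]; then
    submit_counter_job hello_world.sh
else
    echo "sbatch or hello_world.sh not found; skipping submission"
fi
```

Once the job is running, its progress appears in hello_world.out, as set by the --output directive in the script.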

Checkpoint

To checkpoint a job in SLURM, navigate to the SLURM Jobs Page, select the job, and checkpoint it.

Restore

To restore a job in SLURM, navigate to the SLURM Checkpoints Page, select the checkpoint, and restore it.
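
One way to confirm that a restore resumed the counter rather than restarting it is to inspect the counter values in the job's output. The sketch below rests on an assumption: that the restored job keeps appending to hello_world.out. Where restored output actually lands may differ in your setup. The helper name check_counter_log is hypothetical.

```shell
# check_counter_log FILE
# Reads "Counter: N" lines and reports whether N increases by exactly 1 each time.
# A drop back to 0 would suggest the job restarted from scratch instead of resuming.
check_counter_log() {
    awk '
        /^Counter: [0-9]+$/ {
            n = $2 + 0
            if (seen && n != prev + 1) {
                printf "gap or restart: %d followed by %d\n", prev, n
                bad = 1
            }
            prev = n
            seen = 1
        }
        END {
            if (!seen) { print "no counter lines found"; exit 2 }
            if (bad) exit 1
            printf "contiguous up to %d\n", prev
        }
    ' "$1"
}
```

Running `check_counter_log hello_world.out` after the restored job has produced some output returns exit status 0 when every counter value follows its predecessor, 1 when it finds a gap or a reset, and 2 when the file contains no counter lines.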
