
Quickstart

This guide provides a brief overview of the basic commands for checkpointing and restoring SLURM jobs using Cedana.

For both CPU and GPU workloads, no additional configuration is required. Here we demonstrate Cedana checkpoint and restore on a simple counter script that prints one number per second.

#!/bin/bash
#SBATCH --job-name=hello_world       # Job name
#SBATCH --output=hello_world.out     # Standard output log
#SBATCH --error=hello_world.err      # Standard error log
#SBATCH --time=00:15:00              # Time limit (hh:mm:ss); the loop below needs just over 10 minutes
#SBATCH --nodes=1                    # Run on 1 node
#SBATCH --ntasks=1                   # Run 1 task

echo "Starting counter job on $(hostname)..."

# Count from 0 to 600, one value per second (601 iterations)
for i in {0..600}
do
    echo "Counter: $i"
    sleep 1
done

echo "Job finished successfully."
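
Before a job can be checkpointed it has to be submitted and running. The following is a minimal sketch of one way to submit the script above with standard Slurm tooling. It assumes the script was saved as hello_world.sh (a file name chosen here for illustration) and that sbatch is on the PATH; submit_job is a hypothetical helper name and is not part of Cedana.

```shell
#!/bin/bash
# Hypothetical helper (not part of Cedana): submit a batch script and
# print only the numeric job ID.
# `sbatch --parsable` prints "jobid" or "jobid;cluster" instead of the
# usual "Submitted batch job ..." sentence.
submit_job() {
    local script="$1"
    local out
    out=$(sbatch --parsable "$script") || return 1
    # Drop the optional ";cluster" suffix so only the job ID remains.
    echo "${out%%;*}"
}
```

For example, job_id=$(submit_job hello_world.sh) stores the job ID, squeue -j "$job_id" shows the job's state, and tail -f hello_world.out follows the counter as it is written.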

Checkpointing a job

To checkpoint a SLURM job, open Cedana Manager > SLURM Jobs, select the job, and checkpoint it.

Restoring a job

To restore a SLURM job, open Cedana Manager > SLURM Checkpoints, select the checkpoint, and restore it.
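
A quick way to confirm that a restore resumed the counter rather than starting over is to inspect the job's output log. The sketch below assumes the restored job keeps appending to hello_world.out, which may not hold in every setup; last_counter and counter_is_monotonic are hypothetical helper names introduced here for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the last counter value found in an output log.
last_counter() {
    local log_file="${1:-hello_world.out}"
    awk '/^Counter: [0-9]+$/ { last = $2 }
         END { if (last != "") print last }' "$log_file"
}

# Hypothetical helper: succeed (exit 0) if the counter never moves
# backwards in the log, i.e. the job did not start over at 0.
counter_is_monotonic() {
    local log_file="${1:-hello_world.out}"
    awk '/^Counter: [0-9]+$/ {
             if (seen && $2 + 0 < prev) bad = 1
             prev = $2 + 0
             seen = 1
         }
         END { exit (bad ? 1 : 0) }' "$log_file"
}
```

After a restore, last_counter hello_world.out should keep increasing past the value seen at checkpoint time, and counter_is_monotonic hello_world.out should succeed if the count never dropped back to 0.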
