
Automation

Automating checkpoint/restore with SLURM

In most cases, end users will not checkpoint or restore workloads manually through the UI. Most end users are researchers, not the sysadmins who have access to the UI.

Cedana automates this so that the user does not have to think about the underlying checkpointing at all. Cedana intercepts the preemption signal, which triggers a checkpoint. It then automatically requeues a restore into the same queue the job was originally submitted to.
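The intercept, checkpoint, requeue sequence described above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not Cedana's implementation: the `checkpoint` and `requeue` callables are hypothetical placeholders, and the sketch assumes the scheduler delivers `SIGTERM` ahead of preemption (which depends on how preemption and grace time are configured on the cluster). `SLURM_JOB_ID` is the environment variable SLURM sets inside a running job.

```python
import os
import signal


def make_preemption_handler(checkpoint, requeue):
    """Build a signal handler that checkpoints first, then requeues a restore.

    Both arguments are hypothetical hooks supplied by the caller:
      checkpoint(job_id) -> reference to the saved checkpoint
      requeue(job_id, checkpoint_ref) -> schedules a restore of that checkpoint
    """
    def handler(signum, frame):
        # SLURM exports the job ID into the job's environment.
        job_id = os.environ.get("SLURM_JOB_ID", "unknown")
        # The checkpoint must complete before the restore is requeued.
        checkpoint_ref = checkpoint(job_id)
        requeue(job_id, checkpoint_ref)
    return handler


def install_preemption_handler(checkpoint, requeue, signum=signal.SIGTERM):
    """Register the handler for the assumed preemption signal and return it."""
    handler = make_preemption_handler(checkpoint, requeue)
    signal.signal(signum, handler)
    return handler
```

A job wrapper would call `install_preemption_handler(...)` once at startup with its own hooks. The ordering inside the handler is the important part: the restore is only queued once a checkpoint reference exists.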

This video shows how the automation works without touching the UI:

Planned features that will extend this automation:

  • Notification via email to users on successful checkpoint/restore

  • Restoring with the same jobID as was initially submitted

  • Support for modifying restore properties (changing assigned SLURM resources)

  • Support for restoring into a different queue
