
Installation

Start checkpointing, migrating, and restoring stateful workloads in SLURM in under 5 minutes!

You can also deploy fully self-hosted, with zero limitations on where you can store your checkpoints! Check out configuration. If you have any questions, please reach out to us at founders@cedana.ai.

You can install Cedana on a SLURM node in 3 ways:

The web installer will automatically install the latest stable version of Cedana and all plugins required for SLURM support with sane defaults.

Install

export CEDANA_URL=https://myorg.cedana.ai/v1
export CEDANA_AUTH_TOKEN=your_auth_token
export CEDANA_CLUSTER_ID=your_cluster_id

curl -fsSL "${CEDANA_URL}/install/slurm" -H "Authorization: Bearer ${CEDANA_AUTH_TOKEN}" | sudo -E bash -s -- --node-role <node-role>
  • Register a new cluster through Cedana Dashboard.

  • Use ?version=x.y.z query parameter to install a specific version.

  • Use ?build=alpha&version=feat/my-branch to install an alpha build from a branch.

  • Use --node-role controller on controller nodes, --node-role worker on worker nodes, and --node-role login on login (submission) nodes. --controller, --worker, and --login are accepted shorthands in scripts/install-release.sh.
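The query parameters described above can be combined with the install endpoint as in the following sketch. The `install_url` helper is hypothetical and not part of Cedana; it only assembles the URL strings from the bullets so the variants are easy to compare.

```shell
# Hypothetical helper (not part of Cedana): build the installer URL
# from an optional version and an optional build channel.
install_url() {
  iu_base="$1"
  iu_version="${2:-}"
  iu_build="${3:-}"
  iu_query=""
  if [ -n "$iu_build" ]; then
    iu_query="build=${iu_build}"
  fi
  if [ -n "$iu_version" ]; then
    iu_query="${iu_query:+${iu_query}&}version=${iu_version}"
  fi
  printf '%s\n' "${iu_base}/install/slurm${iu_query:+?${iu_query}}"
}

# Example base URL; replace with your own organization URL.
base="https://myorg.cedana.ai/v1"

install_url "$base"                           # latest stable
install_url "$base" "x.y.z"                   # pinned version
install_url "$base" "feat/my-branch" "alpha"  # alpha build from a branch
```

Quote the resulting URL when passing it to `curl`, since the `&` in the alpha-build form would otherwise be interpreted by the shell.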

Configure

For changes in configuration, follow instructions on Cedana Daemon configuration.

After you have made changes to the configuration, simply run the installer again to update and restart Cedana on the node.

And you're all set! Check out Manual Checkpoint/Restore to test it out. The sections below cover alternative methods for installing Cedana SLURM.

Install using Cedana

You can also install SLURM support directly using Cedana, if you have Cedana already installed.

Install

First, install Cedana by following instructions on Cedana Daemon installation.

Then, install the SLURM plugin and run the setup:

  • Register a new cluster through Cedana Dashboard.

  • Use --node-role controller on controller nodes, --node-role worker on worker nodes, and --node-role login on login (submission) nodes. --controller, --worker, and --login are accepted shorthands in scripts/install-release.sh.

This should set up everything required. If you prefer to set things up manually, follow the next section.

Install (manual)

For deployments that require installing the plugin files manually, you can download the files directly.

First, install Cedana by following instructions on Cedana Daemon installation.

To get the Cedana SLURM plugin:

For alpha builds:

This will download the cedana-slurm binary to /usr/local/bin and the SLURM plugin files to /usr/local/lib. Remember to replace the slurm-25-11-5-1 above with the SLURM version your cluster is running.
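The tag mentioned above appears to follow the pattern `slurm-<major>-<minor>-<patch>-<release>`. The sketch below derives it from a dotted version string. The helper name `slurm_tag` and the default release suffix of `1` are assumptions for illustration, so check the result against the builds that are actually published.

```shell
# Hypothetical helper: convert a dotted SLURM version (e.g. 25.11.5)
# into the dash-separated tag form (e.g. slurm-25-11-5-1).
# The trailing release number defaults to 1, which is an assumption.
slurm_tag() {
  st_version="$1"
  st_release="${2:-1}"
  printf 'slurm-%s-%s\n' "$(printf '%s' "$st_version" | tr '.' '-')" "$st_release"
}

slurm_tag 25.11.5   # prints slurm-25-11-5-1
```

On a node with SLURM installed, `sinfo --version` reports the dotted version to feed into the helper.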

To install the files, copy them to the required directories:

Update /etc/slurm/plugstack.conf to include spank_cedana.so:
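As a minimal sketch of an idempotent update, the snippet below appends a SPANK plugin line only when the file does not already reference it. It assumes the plugin lives at `/usr/local/lib/spank_cedana.so` and should be loaded as `required`; both are assumptions, so adjust the path and the `required`/`optional` keyword to your deployment. The `add_spank_plugin` helper is hypothetical.

```shell
# Hypothetical helper: append a SPANK plugin line to a plugstack file
# only if no existing line already references the plugin path.
add_spank_plugin() {
  asp_conf="$1"
  asp_plugin="$2"
  touch "$asp_conf"
  if ! grep -qF "$asp_plugin" "$asp_conf"; then
    printf 'required %s\n' "$asp_plugin" >> "$asp_conf"
  fi
}

# Demonstrated on a scratch file; point it at /etc/slurm/plugstack.conf
# (as root) once the line has been verified.
demo_conf="$(mktemp)"
add_spank_plugin "$demo_conf" /usr/local/lib/spank_cedana.so
add_spank_plugin "$demo_conf" /usr/local/lib/spank_cedana.so   # second call is a no-op
cat "$demo_conf"
rm -f "$demo_conf"
```

Checking before appending keeps repeated runs from stacking duplicate plugin entries.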

Update /etc/slurm/slurm.conf to include the plugins:

Reload slurmctld and slurmd with:

On the database node (slurmdbd), start cedana-slurm:

Or, if you are using systemd, create the service file:

Configure

For changes in Cedana configuration, follow instructions on Cedana Daemon configuration.

Privileged mode (root)

In privileged mode, checkpoint and restore are performed as root. Privileged mode requires no additional configuration.

Unprivileged mode (user)

In unprivileged mode, checkpoint and restore are performed as the job's user, i.e., under the UID of the SLURM job. This configuration is useful when root privileges are restricted for security purposes. For example, NFS with root_squash requires unprivileged mode.

To enable unprivileged mode, set Slurm.Unprivileged to true in the Cedana Daemon configuration, or apply the following:

In addition, the cedana-slurm, cedana, and criu binaries must have the required capabilities for users to perform checkpoint and restore.
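One way to inspect which file capabilities are currently set is `getcap` from libcap. The sketch below only reports what is present and grants nothing; the exact capability set Cedana requires is not listed here. The `report_caps` helper is hypothetical.

```shell
# Hypothetical helper: report file capabilities for each named binary,
# or note that the binary (or getcap itself) is not available.
report_caps() {
  for rc_name in "$@"; do
    rc_path="$(command -v "$rc_name" 2>/dev/null || true)"
    if [ -z "$rc_path" ]; then
      echo "$rc_name: not found in PATH"
    elif ! command -v getcap >/dev/null 2>&1; then
      echo "$rc_name: getcap not installed"
    else
      rc_caps="$(getcap "$rc_path" 2>/dev/null || true)"
      echo "$rc_name: ${rc_caps:-no file capabilities set}"
    fi
  done
}

report_caps cedana-slurm cedana criu
```

Running this on each worker node before switching to unprivileged mode shows quickly whether any of the three binaries is missing its capabilities.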

Build from source

Check make help for available build targets.

Build all binaries:

By default, the binaries will be built using the cedana/cedana-slurm:build docker image.

These binaries cannot be used on their own. You need to install Cedana to use them.

First, install Cedana by following instructions on Cedana Daemon installation. Then, change into the build directory and install the SLURM plugin:

You need to be in the build directory for the cedana slurm setup command to work, as it needs to find the binaries you just built.

You're all set up! Let's checkpoint some workloads. Continue to Checkpoint/restore to get started.

Uninstall

To remove Cedana SLURM completely, run on all nodes:
