Installation
Start checkpointing, migrating, and restoring stateful workloads in SLURM in under 5 minutes!
To use Cedana in SLURM, you need to be registered with us! Reach out to founders@cedana.ai to get set up with an organization.
You can also deploy fully self-hosted, with zero limitations on where you can store your checkpoints! Check out configuration. If you have any questions, please reach out to us at founders@cedana.ai.
You can install Cedana on a SLURM node in 3 ways:
[Option 1] Install from web (recommended)
[Option 2] Install using Cedana
[Option 3] Build from source
These steps must be performed on all SLURM controller and compute nodes in your cluster!
Install from web (recommended)
The web installer will automatically install the latest stable version of Cedana and all plugins required for SLURM support with sane defaults.
Install
Check Authentication for more details on how to get an authentication token.
export CEDANA_URL=https://myorg.cedana.ai/v1
export CEDANA_AUTH_TOKEN=your_auth_token
export CEDANA_CLUSTER_ID=your_cluster_id
curl -fsSL "${CEDANA_URL}/install/slurm" -H "Authorization: Bearer ${CEDANA_AUTH_TOKEN}" | sudo -E bash -s -- --node-role <node-role>
Register a new cluster through the Cedana Dashboard.
Use the ?version=x.y.z query parameter to install a specific version.
Use ?build=alpha&version=feat/my-branch to install an alpha build from a branch.
Use --node-role controller on controller nodes, --node-role worker on worker nodes, and --node-role login on login (submission) nodes. --controller, --worker, and --login are accepted shorthands in scripts/install-release.sh.
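As a small illustration of the version pin described above, the sketch below builds the installer URL with an optional ?version= query parameter. The helper name build_install_url and the version number 1.2.3 are hypothetical and used only for illustration; the /install/slurm path and the query parameter come from the instructions above.

```shell
# Hypothetical helper (not part of Cedana): print the installer URL,
# appending ?version=<x.y.z> only when a version is given.
# The function body runs in a subshell so its variables stay local.
build_install_url() (
  base_url="${1%/}"          # drop one trailing slash, if any
  version="${2:-}"           # optional version pin
  url="${base_url}/install/slurm"
  if [ -n "$version" ]; then
    url="${url}?version=${version}"
  fi
  printf '%s\n' "$url"
)

# Example usage (not executed here; 1.2.3 is a placeholder version):
#   curl -fsSL "$(build_install_url "$CEDANA_URL" 1.2.3)" \
#     -H "Authorization: Bearer ${CEDANA_AUTH_TOKEN}" \
#     | sudo -E bash -s -- --node-role worker
```

Keeping the URL construction in one place makes it easier to pin the same version on every node in the cluster.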
Configure
To change the configuration, follow the instructions in Cedana Daemon configuration.
After you have made changes to the configuration, simply run the installer again to update and restart Cedana on the node.
And you're all set! Check out Manual Checkpoint/Restore to test it out. The sections below cover alternative ways to install Cedana SLURM.
Install using Cedana
If you already have Cedana installed, you can install SLURM support directly through it.
Install
First, install Cedana by following instructions on Cedana Daemon installation.
Check Authentication for more details on how to get an authentication token.
Then, install the slurm plugin and run the setup:
Register a new cluster through Cedana Dashboard.
Use --node-role controller on controller nodes, --node-role worker on worker nodes, and --node-role login on login (submission) nodes. --controller, --worker, and --login are accepted shorthands in scripts/install-release.sh.
This should set up everything required. If you wish to set up manually, follow the next section.
Install (manual)
For deployments that require installing the plugin files manually, you can download the files directly.
First, install Cedana by following instructions on Cedana Daemon installation.
Check Authentication for more details on how to get an authentication token.
To get the Cedana SLURM plugin:
For alpha builds:
This will download the cedana-slurm binary to /usr/local/bin and the SLURM plugin files to /usr/local/lib. Remember to replace the slurm-25-11-5-1 above with the SLURM version your cluster is running.
To install, copy the files to the required directories:
Update /etc/slurm/plugstack.conf to include spank_cedana.so:
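A minimal sketch of this step, under stated assumptions: Slurm's plugstack.conf accepts lines of the form "required <plugin path>" or "optional <plugin path>". The plugin path /usr/local/lib/spank_cedana.so and the choice of "required" are assumptions here, so use the location and keyword that match your installation. The helper ensure_line is hypothetical; it appends a line only when an identical one is not already present, so it is safe to re-run.

```shell
# Hypothetical helper (not part of Cedana or Slurm): append LINE to FILE
# unless an identical line already exists. Runs in a subshell so its
# variables stay local.
ensure_line() (
  file="$1"
  line="$2"
  touch "$file"
  # -x matches whole lines, -F treats the pattern as a fixed string.
  if ! grep -qxF -- "$line" "$file"; then
    # If the file does not end with a newline, add one first so the
    # new entry starts on its own line.
    if [ -n "$(tail -c1 "$file")" ]; then
      printf '\n' >> "$file"
    fi
    printf '%s\n' "$line" >> "$file"
  fi
)

# Example usage as root (path and keyword are assumptions):
#   ensure_line /etc/slurm/plugstack.conf "required /usr/local/lib/spank_cedana.so"
```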
Update /etc/slurm/slurm.conf to include the plugins:
Reload slurmctld and slurmd with:
On the database node (slurmdbd), start cedana-slurm:
Or, if you are using systemd, create the service file:
Configure
To change the Cedana configuration, follow the instructions in Cedana Daemon configuration.
Privileged mode (root)
In privileged mode, checkpoint and restore run as root. Privileged mode requires no additional configuration.
Unprivileged mode (user)
In unprivileged mode, checkpoint and restore run as the job's user, i.e., the UID of the SLURM job performs the checkpoint and restore. This mode is useful when root's privileges are restricted for security reasons. For example, NFS with root_squash requires unprivileged mode.
To enable unprivileged mode, set Slurm.Unprivileged to true in the Cedana Daemon configuration. Otherwise, just do this:
In addition, the cedana-slurm, cedana, and criu binaries must have the required capabilities for users to perform checkpoint and restore.
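The exact capabilities required are not listed here, so the sketch below only inspects what is currently set rather than changing anything. It assumes the getcap tool from the libcap utilities is installed and that the three binaries are on the PATH; the helper name show_caps is hypothetical.

```shell
# Hypothetical helper: for each binary name, print its resolved path and
# any file capabilities reported by getcap. Runs in a subshell so its
# variables stay local.
show_caps() (
  for name in "$@"; do
    # command -v fails when the name cannot be resolved on the PATH.
    path="$(command -v "$name" 2>/dev/null)" || {
      echo "$name: not found in PATH"
      continue
    }
    echo "$name -> $path"
    # getcap prints nothing when no file capabilities are set.
    getcap "$path"
  done
)

show_caps cedana-slurm cedana criu
```

Running this on each compute node before enabling unprivileged mode gives a quick view of which binaries still need capabilities assigned.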
Build from source
Check make help for available build targets.
Build all binaries:
By default, the binaries will be built using the cedana/cedana-slurm:build docker image.
These binaries are useless on their own. You need to install Cedana to use them.
First, install Cedana by following instructions on Cedana Daemon installation. Then, install the slurm plugin after changing into the build directory:
You need to be in the build directory for the cedana slurm setup command to work, as it needs to find the binaries you just built.
Check Authentication for more details on how to get an authentication token.
You're all set up! Let's checkpoint some workloads. Continue to Checkpoint/restore to get started.
Uninstall
To remove Cedana SLURM completely, run on all nodes: