# Configuration

This document describes the configurable parameters of the Cedana Helm chart. For the latest defaults, see [values.yaml](https://github.com/cedana/cedana-helm-charts/blob/main/cedana-helm/values.yaml).

## Global Settings

These settings control the overall behavior of the deployment.

| Parameter                 | Description                                                                                                                                                                                                                                                  | Default         |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------- |
| `nameOverride`            | Overrides the name of the chart.                                                                                                                                                                                                                             | `cedana`        |
| `fullnameOverride`        | Overrides the full name of the release.                                                                                                                                                                                                                      | `cedana`        |
| `installKueue`            | If set to `true`, Kueue will be installed. **Note:** The Kueue CRDs must be applied before enabling this option. You can apply them with `kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.1/manifests.yaml`. | `false`         |
| `kubernetesClusterDomain` | The Kubernetes cluster domain.                                                                                                                                                                                                                               | `cluster.local` |
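
As a minimal sketch, a values override that enables Kueue might look like the following. It assumes these parameters sit at the top level of the values file, as the table suggests; the file name `my-values.yaml` is hypothetical.

```yaml
# my-values.yaml (hypothetical file name)
# Apply the Kueue CRDs first, as noted in the table above:
#   kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.1/manifests.yaml
installKueue: true
kubernetesClusterDomain: cluster.local   # the documented default, shown for completeness
```

A file like this can be passed to Helm with the `-f` flag at install or upgrade time.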

## Cedana Configuration (`config`)

This section contains the core configuration for the Cedana platform.

### Authentication & Platform

| Parameter     | Description                                                                                                            | Default |
| ------------- | ---------------------------------------------------------------------------------------------------------------------- | ------- |
| `authToken`   | Your authentication token for the Cedana platform.                                                                     | `""`    |
| `url`         | The URL for the Cedana API.                                                                                            | `""`    |
| `clusterId`   | A unique ID for your cluster (you can generate one on the [Clusters Page](https://ui.cedana.com/monitoring/clusters)). | `""`    |
| `sqsQueueUrl` | The SQS queue URL for communication with Cedana.                                                                       | `""`    |
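
A possible override for these settings is sketched below. It assumes the parameters are nested under the `config` key, as the section heading indicates; every value in angle brackets is a placeholder to be replaced with your own.

```yaml
config:
  authToken: "<your-auth-token>"   # placeholder: your Cedana authentication token
  url: "<cedana-api-url>"          # placeholder: the Cedana API URL
  clusterId: "<your-cluster-id>"   # placeholder: ID generated on the Clusters Page
```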

### Daemon Configuration

| Parameter  | Description                                                                                                                                                                                                                                                                | Default        |
| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------- |
| `address`  | The address and port to bind the daemon on. Must be `0.0.0.0` for external accessibility.                                                                                                                                                                                  | `0.0.0.0:8080` |
| `protocol` | The protocol used by the daemon service. Options: `tcp` (TCP socket), `unix` (Unix socket), `vsock` (VSock for VM and hypervisor communication, useful for supporting VM-based migrations). **Note:** Other protocols may not be supported properly; change with caution.  | `tcp`          |

### Checkpoint Storage & Streaming

| Parameter               | Description                                                                                                                                                                                                                                                                                                                                                                                    | Default |
| ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| `checkpointDir`         | Specify path to directory for storing checkpoints. Options: `cedana://<path>` for Cedana-managed global storage (recommended), `s3://<bucket>/<path>` for your S3 storage, `<path>` for node-local storage (not recommended, as it won't be accessible across nodes).                                                                                                                          | `/tmp`  |
| `checkpointStreams`     | Specify the number of parallel streams to use for streaming checkpoint/restore operations. `0` means no streaming. `n > 0` means n parallel streams (or number of pipes) to use. Streaming ensures a low footprint by using memory efficiently with no intermediate disk space required, although performance may be slightly better or worse depending on disk and network read/write speeds. | `0`     |
| `checkpointCompression` | The compression algorithm for checkpoints. Options: `none`, `tar`, `lz4`, `gzip`, `zlib`.                                                                                                                                                                                                                                                                                                      | `none`  |
| `checkpointAsync`       | Whether to perform checkpoint compression and upload asynchronously. Recommended when anticipating large checkpoints, which could take too long to stream directly to a bucket. Note that a restore is only possible after the upload completes, so there may be a delay before a workload can be restored even though the checkpoint shows as complete.                                       | `false` |
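
To illustrate how these options combine, here is a sketch that turns on streaming with Cedana-managed storage. The stream count of `4` and the choice of `lz4` are illustrative only, and the nesting under `config` is assumed from the section heading. Whether every compression option works together with streaming is not stated in this document, so verify the combination against `values.yaml` before relying on it.

```yaml
config:
  checkpointDir: "cedana://<path>"   # Cedana-managed global storage; replace <path>
  checkpointStreams: 4               # n > 0 enables streaming with n parallel streams
  checkpointCompression: lz4         # one of: none, tar, lz4, gzip, zlib
  checkpointAsync: false             # keep compression/upload synchronous
```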

### GPU Configuration

| Parameter                  | Description                                                                                                                                 | Default                                       |
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------- |
| `gpuPoolSize`              | Number of GPU controllers to keep warm. Improves GPU workload startup/restore time but uses more memory.                                    | `0`                                           |
| `gpuShmSize`               | The shared memory size for GPU workloads. 8 GiB is enough for most workloads. Reduce if memory constrained or running small workloads only. | `8589934592` (8 GiB)                          |
| `gpuLdLibPath`             | Additional `LD_LIBRARY_PATH` to look for CUDA libraries.                                                                                    | `/run/nvidia/driver/usr/lib/x86_64-linux-gnu` |
| `gpuSkipNvidiaRuntimeHook` | Whether to skip adding the nvidia-container-runtime hook when starting GPU workloads.                                                       | `false`                                       |
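
Because `gpuShmSize` is expressed in bytes, a smaller value has to be converted from GiB by hand. The sketch below assumes the value is given as a plain integer under `config`, mirroring the documented default; the pool size of `1` is illustrative.

```yaml
config:
  gpuPoolSize: 1           # keep one GPU controller warm (illustrative value)
  gpuShmSize: 4294967296   # 4 GiB = 4 * 1024 * 1024 * 1024 bytes (default is 8 GiB)
```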

### Plugin Versions

| Parameter                         | Description                                                                                                                                                          | Default   |
| --------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------- |
| `pluginsBuilds`                   | Which plugin builds to use. If set to `release`, any release version can be specified for each plugin. If set to `alpha`, specify a branch name instead.             | `alpha`   |
| `pluginsNativeVersion`            | The version of the native plugin to use.                                                                                                                             | `latest`  |
| `pluginsCriuVersion`              | The version of the CRIU plugin to use.                                                                                                                               | `7/merge` |
| `pluginsContainerdRuntimeVersion` | The version of the containerd runtime plugin to use.                                                                                                                 | `v0.7.1`  |
| `pluginsGpuVersion`               | The version of the GPU plugin to use.                                                                                                                                | `v0.7.0`  |
| `pluginsStreamerVersion`          | The version of the streamer plugin to use.                                                                                                                           | `v0.0.8`  |
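
A sketch of pinning plugins to release builds could look like this. It reuses the version strings listed as defaults above and assumes they are valid release versions; the native and CRIU plugins are left out because their documented defaults (`latest`, `7/merge`) do not look like release tags.

```yaml
config:
  pluginsBuilds: release
  pluginsContainerdRuntimeVersion: v0.7.1   # version taken from the defaults above
  pluginsGpuVersion: v0.7.0                 # version taken from the defaults above
  pluginsStreamerVersion: v0.0.8            # version taken from the defaults above
```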

### Observability

| Parameter   | Description                | Default |
| ----------- | -------------------------- | ------- |
| `profiling` | Enable profiling.          | `true`  |
| `metrics`   | Enable metrics collection. | `true`  |
| `logLevel`  | The logging level.         | `info`  |

### AWS Configuration

| Parameter            | Description                                                                           | Default |
| -------------------- | ------------------------------------------------------------------------------------- | ------- |
| `awsAccessKeyId`     | AWS access key ID if using S3 storage (if `s3://<bucket>/<path>` in `checkpointDir`). | `""`    |
| `awsSecretAccessKey` | AWS secret access key if using S3 storage.                                            | `""`    |
| `awsRegion`          | AWS region if using S3 storage (uses default region if not set).                      | `""`    |
| `awsEndpoint`        | AWS endpoint if using S3-compatible storage.                                          | `""`    |
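
For S3-backed checkpoints, the AWS parameters are used together with an `s3://` value for `checkpointDir`. The following is a sketch under the assumption that all of these keys live under `config`; the region is illustrative and the angle-bracket values are placeholders.

```yaml
config:
  checkpointDir: "s3://<bucket>/<path>"      # placeholder bucket and path
  awsAccessKeyId: "<access-key-id>"          # placeholder
  awsSecretAccessKey: "<secret-access-key>"  # placeholder
  awsRegion: us-east-1                       # illustrative; default region is used if unset
  # awsEndpoint: "<s3-compatible-endpoint>"  # only needed for S3-compatible storage
```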

### Custom Secrets

| Parameter           | Description                                    | Default                              |
| ------------------- | ---------------------------------------------- | ------------------------------------ |
| `preExistingSecret` | Uncomment to use a custom pre-existing secret. | `cedana-secret-user` (commented out) |
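
A minimal sketch of referencing an existing secret is shown below. It assumes the parameter is nested under `config` and takes the secret's name; the keys that the secret must contain are not documented here, so check `values.yaml` and the chart templates for the expected layout.

```yaml
config:
  preExistingSecret: cedana-secret-user   # name of a Secret that already exists in the release namespace
```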

## Host Configuration (`hostConfig`)

Configuration for host-level settings.

| Parameter           | Description                                                                                                          | Default                           |
| ------------------- | -------------------------------------------------------------------------------------------------------------------- | --------------------------------- |
| `containerdAddress` | Path to containerd socket.                                                                                           | `/run/containerd/containerd.sock` |
| `disableIoUring`    | Set to `true` to disable io_uring in the Linux kernel. Cedana does not support io_uring-based checkpoint restores.   | `true`                            |

### Shared Memory Configuration (`hostConfig.shmConfig`)

Optional configuration to increase `/dev/shm` size on nodes, which is useful for workloads requiring large shared memory.

| Parameter | Description                                           | Default |
| --------- | ----------------------------------------------------- | ------- |
| `enabled` | Set to `true` to enable `/dev/shm` size increase.     | `false` |
| `size`    | Size to set for `/dev/shm` (e.g., `10G`, `20G`).      | `10G`   |
| `minSize` | Minimum size to trigger remount (e.g., `10G`, `20G`). | `10G`   |
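
As an example, enabling the `/dev/shm` increase could be sketched as follows. The sizes are illustrative and use the same format as the examples in the table.

```yaml
hostConfig:
  shmConfig:
    enabled: true
    size: 20G      # size to set for /dev/shm
    minSize: 10G   # documented default, shown for completeness
```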

## Daemon Helper (`daemonHelper`)

Configuration for the `daemon-helper` DaemonSet.

### Service

| Parameter             | Description                                      | Default |
| --------------------- | ------------------------------------------------ | ------- |
| `service.annotations` | Annotations to add to the daemon helper service. | `{}`    |

### Image

| Parameter               | Description                                    | Default                |
| ----------------------- | ---------------------------------------------- | ---------------------- |
| `image.repository`      | The repository for the image.                  | `cedana/cedana-helper` |
| `image.tag`             | The tag for the image.                         | `v0.9.284`             |
| `image.digest`          | The digest for the image (ignores tag if set). | `""`                   |
| `image.imagePullPolicy` | The image pull policy.                         | `IfNotPresent`         |

### Update Strategy

| Parameter                       | Description                                                                   | Default |
| ------------------------------- | ----------------------------------------------------------------------------- | ------- |
| `updateStrategy.maxSurge`       | The maximum number of pods that can be created over the desired number.       | `0`     |
| `updateStrategy.maxUnavailable` | The maximum number of pods that can be unavailable during the update process. | `99999` |

### Scheduling

| Parameter      | Description                                   | Default                                                                |
| -------------- | --------------------------------------------- | ---------------------------------------------------------------------- |
| `tolerations`  | Tolerations for the daemon helper pods.       | Allows scheduling on nodes with `cedana.ai/not-ready:NoSchedule` taint |
| `affinity`     | Affinity settings for the daemon helper pods. | `{}`                                                                   |
| `nodeSelector` | Node selector for the daemon helper pods.     | `{}`                                                                   |
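
The sketch below shows what a scheduling override for the daemon helper might look like. The toleration approximates the documented default using standard Kubernetes toleration fields; the exact entry in `values.yaml` may differ (for example in its `operator`). The node selector uses the well-known `kubernetes.io/os` label purely as an illustration.

```yaml
daemonHelper:
  tolerations:
    # Approximation of the default: tolerate the cedana.ai/not-ready:NoSchedule taint
    - key: cedana.ai/not-ready
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    kubernetes.io/os: linux   # illustrative: restrict the DaemonSet to Linux nodes
```

Helm replaces list values rather than merging them, so a custom `tolerations` list should repeat any default entries you still need.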

## Service Account (`serviceAccount`)

Configuration for the Kubernetes Service Account.

| Parameter     | Description                                                                                                               | Default                     |
| ------------- | ------------------------------------------------------------------------------------------------------------------------- | --------------------------- |
| `create`      | Specifies whether a service account should be created.                                                                    | `true`                      |
| `automount`   | Automatically mount a ServiceAccount's API credentials.                                                                   | `true`                      |
| `annotations` | Annotations to add to the service account.                                                                                | `{}`                        |
| `name`        | The name of the service account to use. If unset and `create` is `true`, a name is generated using the fullname template. | `cedana-controller-manager` |
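
To reuse a service account that is managed outside the chart, a sketch along these lines may work. The account name `my-existing-sa` is hypothetical, and the account must already exist with the permissions the chart expects.

```yaml
serviceAccount:
  create: false          # do not create a new service account
  name: my-existing-sa   # hypothetical name of a pre-existing service account
```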

## Controller Manager (`controllerManager`)

Configuration for the `cedana-controller-manager`.

### Autoscaling

| Parameter                                    | Description                                        | Default |
| -------------------------------------------- | -------------------------------------------------- | ------- |
| `autoscaling.enabled`                        | Enable autoscaling for the controller manager.     | `false` |
| `autoscaling.replicaCount`                   | The number of replicas for the controller manager. | `1`     |
| `autoscaling.deploymentRevisionHistoryLimit` | The number of old ReplicaSets to retain.           | `10`    |

### Service

| Parameter             | Description                                     | Default                                           |
| --------------------- | ----------------------------------------------- | ------------------------------------------------- |
| `service.annotations` | Annotations for the controller manager service. | `{}`                                              |
| `service.ports`       | The ports for the controller manager service.   | `[{protocol: TCP, port: 1324, targetPort: 1324}]` |

### Manager Configuration

| Parameter                          | Description                                                                                                                                                       | Default                                                                                      |
| ---------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- |
| `manager.podAnnotations`           | Annotations for the controller manager pods.                                                                                                                      | `{}`                                                                                         |
| `manager.args`                     | Arguments for the controller manager container.                                                                                                                   | `[--health-probe-bind-address=:8081, --metrics-bind-address=127.0.0.1:8080, --leader-elect]` |
| `manager.containerSecurityContext` | The security context for the manager container. The controller does not require any privileges.                                                                   | `allowPrivilegeEscalation: false`, `capabilities: { drop: [ALL] }`                           |
| `manager.image.repository`         | The repository for the `cedana-controller` image.                                                                                                                 | `cedana/cedana-controller`                                                                   |
| `manager.image.tag`                | The tag for the `cedana-controller` image.                                                                                                                        | `v0.6.2`                                                                                     |
| `manager.image.digest`             | The digest for the image (ignores tag if set).                                                                                                                    | `""`                                                                                         |
| `manager.image.imagePullPolicy`    | The image pull policy.                                                                                                                                            | `IfNotPresent`                                                                               |
| `manager.resources`                | Resource limits and requests for the manager container. Empty by default to minimize resource usage on demo/test deployments. Uncomment or add custom limits.     | `{}`                                                                                         |

### RBAC Proxy

| Parameter        | Description                                                                                                                                                          | Default |
| ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| `rbac.resources` | Resource limits and requests for the RBAC proxy container. Empty by default to minimize resource usage on demo/test deployments. Uncomment or add custom limits.     | `{}`    |
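
For production clusters, explicit requests and limits can be set for both containers. The numbers in this sketch are illustrative placeholders rather than recommendations, and the nesting under `controllerManager` is assumed from the section headings.

```yaml
controllerManager:
  manager:
    resources:
      requests:
        cpu: 100m       # illustrative
        memory: 128Mi   # illustrative
      limits:
        cpu: 500m       # illustrative
        memory: 256Mi   # illustrative
  rbac:
    resources:
      requests:
        cpu: 50m        # illustrative
        memory: 64Mi    # illustrative
      limits:
        cpu: 200m       # illustrative
        memory: 128Mi   # illustrative
```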

### Scheduling

| Parameter      | Description                                                                                                                                                                                                                                                                          | Default                                                                |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------- |
| `tolerations`  | Tolerations for the controller manager pods. Allows scheduling on nodes with `cedana.ai/not-ready:NoSchedule` taint. To deploy the controller on a dedicated node, run `kubectl taint node <node-name> dedicated=cedana-manager:NoSchedule` and uncomment the additional toleration. | Allows scheduling on nodes with `cedana.ai/not-ready:NoSchedule` taint |
| `affinity`     | Affinity settings for the controller manager pods. To schedule the controller pod only on labeled nodes, run `kubectl label nodes <node-name> dedicated=cedana-manager` and uncomment the `nodeAffinity` configuration.                                                              | `{}`                                                                   |
| `nodeSelector` | Node selector for the controller manager pods.                                                                                                                                                                                                                                       | `{}`                                                                   |
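
The dedicated-node setup described in the table can be sketched as follows, assuming the node has been tainted with `dedicated=cedana-manager:NoSchedule` and labeled `dedicated=cedana-manager` using the commands above. The entries are written with standard Kubernetes toleration and node affinity fields and approximate the commented-out sections of `values.yaml`, which may differ in detail.

```yaml
controllerManager:
  tolerations:
    # Approximation of the default toleration
    - key: cedana.ai/not-ready
      operator: Exists
      effect: NoSchedule
    # Additional toleration for the dedicated=cedana-manager:NoSchedule taint
    - key: dedicated
      operator: Equal
      value: cedana-manager
      effect: NoSchedule
  affinity:
    nodeAffinity:
      # Only schedule on nodes labeled dedicated=cedana-manager
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: dedicated
                operator: In
                values:
                  - cedana-manager
```

The toleration lets the pod run on the tainted node, while the node affinity keeps it off every other node; both are needed for a truly dedicated placement.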

## Metrics Service (`metricsService`)

Configuration for the metrics service.

| Parameter | Description                        | Default                                                         |
| --------- | ---------------------------------- | --------------------------------------------------------------- |
| `ports`   | The ports for the metrics service. | `[{name: https, port: 8443, protocol: TCP, targetPort: https}]` |
| `type`    | The type of the metrics service.   | `ClusterIP`                                                     |
