# Configuration

This document outlines the configurable parameters for the Cedana Helm chart. For up-to-date configuration, see [values.yaml](https://github.com/cedana/cedana-helm-charts/blob/main/cedana-helm/values.yaml).

## Global Settings

These settings control the overall behavior of the deployment.

| Parameter                 | Description                                                                                                                                                                                                                                                  | Default         |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------- |
| `nameOverride`            | Overrides the name of the chart.                                                                                                                                                                                                                             | `cedana`        |
| `fullnameOverride`        | Overrides the full name of the release.                                                                                                                                                                                                                      | `cedana`        |
| `installKueue`            | If set to `true`, Kueue will be installed. **Note:** The Kueue CRDs must be applied before enabling this option. You can apply them with `kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.1/manifests.yaml`. | `false`         |
| `kubernetesClusterDomain` | The Kubernetes cluster domain.                                                                                                                                                                                                                               | `cluster.local` |
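
As a rough illustration, the global settings above could be overridden in a custom values file like the one below. The file name `my-values.yaml` is hypothetical, and the sketch assumes these keys sit at the top level of `values.yaml`, as the table suggests.

```yaml
# my-values.yaml (hypothetical file name)
# Assumes the Kueue CRDs have already been applied, as noted in the table.
installKueue: true
# Only change this if your cluster uses a non-default DNS domain.
kubernetesClusterDomain: cluster.local
```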

## Cedana Configuration (`config`)

This section contains the core configuration for the Cedana platform.

### Authentication & Platform

| Parameter     | Description                                                                                                            | Default |
| ------------- | ---------------------------------------------------------------------------------------------------------------------- | ------- |
| `authToken`   | Your authentication token for the Cedana platform.                                                                     | `""`    |
| `url`         | The URL for the Cedana API.                                                                                            | `""`    |
| `clusterId`   | A unique ID for your cluster (you can generate one on the [Clusters Page](https://ui.cedana.com/monitoring/clusters)). | `""`    |
| `sqsQueueUrl` | The SQS queue URL for communication with Cedana.                                                                       | `""`    |
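
A minimal sketch of the authentication block, assuming these parameters nest under the top-level `config` key as the section heading indicates. All values are placeholders to replace with your own.

```yaml
config:
  authToken: "<your-auth-token>"   # placeholder, not a real token
  url: "<cedana-api-url>"          # placeholder for the Cedana API URL
  clusterId: "<your-cluster-id>"   # placeholder, generated on the Clusters Page
```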

### Daemon Configuration

| Parameter  | Description                                                                                                                                                                                                                                                                | Default        |
| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------- |
| `address`  | The address and port the daemon binds to. The host must be `0.0.0.0` for external accessibility.                                                                                                                                                                             | `0.0.0.0:8080` |
| `protocol` | Protocol for the daemon service. Options: `tcp` (TCP socket), `unix` (Unix socket), `vsock` (VSock for VM and hypervisor communication, useful for supporting VM-based migrations). **Note:** Other protocols might not be supported properly; update with caution.        | `tcp`          |

### Checkpoint Storage & Streaming

| Parameter               | Description                                                                                                                                                                                                                                                                                                                                                                                    | Default |
| ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| `checkpointDir`         | Path to the directory used for storing checkpoints. Options: `cedana://<path>` for Cedana-managed global storage (recommended), `s3://<bucket>/<path>` for your S3 storage, `<path>` for node-local storage (not recommended, as it won't be accessible across nodes).                                                                                                                         | `/tmp`  |
| `checkpointStreams`     | Specify the number of parallel streams to use for streaming checkpoint/restore operations. `0` means no streaming. `n > 0` means n parallel streams (or number of pipes) to use. Streaming ensures a low footprint by using memory efficiently with no intermediate disk space required, although performance may be slightly better or worse depending on disk and network read/write speeds. | `0`     |
| `checkpointCompression` | The compression algorithm for checkpoints. Options: `none`, `tar`, `lz4`, `gzip`, `zlib`.                                                                                                                                                                                                                                                                                                      | `none`  |
| `checkpointAsync`       | Specify whether to perform checkpoint compression/upload asynchronously. Recommended if anticipating large checkpoints, which could take too long to stream directly to a bucket. Note that restore will only be possible after upload is complete, so there might be a delay before a workload can be restored even though the checkpoint shows as complete.                                  | `false` |
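
The options above might be combined as in the following sketch, which assumes the keys nest under `config`. The path `my-checkpoints` and the stream count of `4` are illustrative choices, not recommendations from the chart.

```yaml
config:
  # Cedana-managed global storage; "my-checkpoints" is a hypothetical path.
  checkpointDir: "cedana://my-checkpoints"
  # 0 disables streaming; any n > 0 uses n parallel streams.
  checkpointStreams: 4
  # One of: none, tar, lz4, gzip, zlib.
  checkpointCompression: lz4
  # Keep synchronous so a checkpoint is restorable as soon as it completes.
  checkpointAsync: false
```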

### GPU Configuration

| Parameter                  | Description                                                                                                                                 | Default                                       |
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------- |
| `gpuPoolSize`              | Number of GPU controllers to keep warm. Improves GPU workload startup/restore time but uses more memory.                                    | `0`                                           |
| `gpuShmSize`               | The shared memory size for GPU workloads. 8 GiB is enough for most workloads. Reduce if memory constrained or running small workloads only. | `8589934592` (8 GiB)                          |
| `gpuLdLibPath`             | Additional `LD_LIBRARY_PATH` to look for CUDA libraries.                                                                                    | `/run/nvidia/driver/usr/lib/x86_64-linux-gnu` |
| `gpuSkipNvidiaRuntimeHook` | Whether to skip adding the nvidia-container-runtime hook when starting GPU workloads.                                                       | `false`                                       |
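
For a memory-constrained node, the GPU settings could be tuned along these lines. This is a sketch assuming the keys nest under `config`; the specific values are examples only.

```yaml
config:
  # Keep one GPU controller warm to speed up startup/restore.
  gpuPoolSize: 1
  # 4 GiB expressed in bytes (4 * 1024^3), half of the default.
  gpuShmSize: 4294967296
```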

### Plugin Versions

| Parameter                         | Description                                                                                                                                                          | Default   |
| --------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------- |
| `pluginsBuilds`                   | The plugin build type to use. If set to `release`, then any release version for the plugins can be specified. If set to `alpha`, then specify the branch name.    | `alpha`   |
| `pluginsNativeVersion`            | The version of the native plugin to use.                                                                                                                             | `latest`  |
| `pluginsCriuVersion`              | The version of the CRIU plugin to use.                                                                                                                               | `7/merge` |
| `pluginsContainerdRuntimeVersion` | The version of the containerd runtime plugin to use.                                                                                                                 | `v0.7.1`  |
| `pluginsGpuVersion`               | The version of the GPU plugin to use.                                                                                                                                | `v0.7.0`  |
| `pluginsStreamerVersion`          | The version of the streamer plugin to use.                                                                                                                           | `v0.0.8`  |

### Observability

| Parameter   | Description                | Default |
| ----------- | -------------------------- | ------- |
| `profiling` | Enable profiling.          | `true`  |
| `metrics`   | Enable metrics collection. | `true`  |
| `logLevel`  | The logging level.         | `info`  |

### AWS Configuration

| Parameter            | Description                                                                           | Default |
| -------------------- | ------------------------------------------------------------------------------------- | ------- |
| `awsAccessKeyId`     | AWS access key ID if using S3 storage (if `s3://<bucket>/<path>` in `checkpointDir`). | `""`    |
| `awsSecretAccessKey` | AWS secret access key if using S3 storage.                                            | `""`    |
| `awsRegion`          | AWS region if using S3 storage (uses default region if not set).                      | `""`    |
| `awsEndpoint`        | AWS endpoint if using S3-compatible storage.                                          | `""`    |
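
When checkpoints go to your own S3 bucket, the AWS settings pair with `checkpointDir` roughly as follows. The bucket name, path, and region are hypothetical, the credentials are placeholders, and the sketch assumes the keys nest under `config`.

```yaml
config:
  # Hypothetical bucket and path.
  checkpointDir: "s3://my-bucket/checkpoints"
  awsAccessKeyId: "<aws-access-key-id>"          # placeholder
  awsSecretAccessKey: "<aws-secret-access-key>"  # placeholder
  awsRegion: "us-east-1"                         # example region
  # Set only for S3-compatible storage; left empty for AWS S3.
  awsEndpoint: ""
```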

### Custom Secrets

| Parameter           | Description                                    | Default                              |
| ------------------- | ---------------------------------------------- | ------------------------------------ |
| `preExistingSecret` | Uncomment to use a custom pre-existing secret. | `cedana-secret-user` (commented out) |

## Host Configuration (`hostConfig`)

Configuration for host-level settings.

| Parameter           | Description                                                                                                          | Default                           |
| ------------------- | -------------------------------------------------------------------------------------------------------------------- | --------------------------------- |
| `containerdAddress` | Path to containerd socket.                                                                                           | `/run/containerd/containerd.sock` |
| `disableIoUring`    | Set to `true` to disable the Linux kernel's io_uring. Cedana does not support io_uring-based checkpoint restores.    | `true`                            |

### Shared Memory Configuration (`hostConfig.shmConfig`)

Optional configuration to increase `/dev/shm` size on nodes, which is useful for workloads requiring large shared memory.

| Parameter | Description                                           | Default |
| --------- | ----------------------------------------------------- | ------- |
| `enabled` | Set to `true` to enable `/dev/shm` size increase.     | `false` |
| `size`    | Size to set for `/dev/shm` (e.g., `10G`, `20G`).      | `10G`   |
| `minSize` | Minimum size to trigger remount (e.g., `10G`, `20G`). | `10G`   |
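
As an example, enabling a larger `/dev/shm` could look like this. The sizes are illustrative; the key nesting follows the `hostConfig.shmConfig` path given in the heading.

```yaml
hostConfig:
  shmConfig:
    enabled: true
    # Target size for /dev/shm on each node.
    size: 20G
    # Minimum size to trigger remount, per the table above.
    minSize: 10G
```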

## Daemon Helper (`daemonHelper`)

Configuration for the `daemon-helper` DaemonSet.

### Service

| Parameter             | Description                                      | Default |
| --------------------- | ------------------------------------------------ | ------- |
| `service.annotations` | Annotations to add to the daemon helper service. | `{}`    |

### Image

| Parameter               | Description                                    | Default                |
| ----------------------- | ---------------------------------------------- | ---------------------- |
| `image.repository`      | The repository for the image.                  | `cedana/cedana-helper` |
| `image.tag`             | The tag for the image.                         | `v0.9.284`             |
| `image.digest`          | The digest for the image (ignores tag if set). | `""`                   |
| `image.imagePullPolicy` | The image pull policy.                         | `IfNotPresent`         |

### Update Strategy

| Parameter                       | Description                                                                   | Default |
| ------------------------------- | ----------------------------------------------------------------------------- | ------- |
| `updateStrategy.maxSurge`       | The maximum number of pods that can be created over the desired number.       | `0`     |
| `updateStrategy.maxUnavailable` | The maximum number of pods that can be unavailable during the update process. | `99999` |

### Scheduling

| Parameter      | Description                                                                                                                                                                                      | Default                                                                |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------- |
| `tolerations`  | Tolerations for the daemon helper pods. Cedana uses the `cedana.ai/not-ready:NoSchedule` taint while the node is bootstrapping, then removes it once the helper and health-check pods are Ready. | Allows scheduling on nodes with `cedana.ai/not-ready:NoSchedule` taint |
| `affinity`     | Affinity settings for the daemon helper pods.                                                                                                                                                    | `{}`                                                                   |
| `nodeSelector` | Node selector for the daemon helper pods.                                                                                                                                                        | `{}`                                                                   |
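
A sketch of custom scheduling for the helper pods, assuming these keys nest under `daemonHelper`. Helm replaces list values rather than merging them, so an override of `tolerations` should repeat the `cedana.ai/not-ready` toleration. The `operator: Exists` form and the extra `nvidia.com/gpu` taint are assumptions for illustration, not values taken from the chart.

```yaml
daemonHelper:
  tolerations:
    # Keeps helper pods schedulable while the node is bootstrapping.
    - key: cedana.ai/not-ready
      operator: Exists
      effect: NoSchedule
    # Hypothetical extra toleration for GPU nodes tainted by the operator.
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    # Standard Kubernetes node label; restricts the helper to Linux nodes.
    kubernetes.io/os: linux
```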

## Service Account (`serviceAccount`)

Configuration for the Kubernetes Service Account.

| Parameter     | Description                                                                                                               | Default                     |
| ------------- | ------------------------------------------------------------------------------------------------------------------------- | --------------------------- |
| `create`      | Specifies whether a service account should be created.                                                                    | `true`                      |
| `automount`   | Whether to automatically mount the ServiceAccount's API credentials.                                                      | `true`                      |
| `annotations` | Annotations to add to the service account.                                                                                | `{}`                        |
| `name`        | The name of the service account to use. If not set and `create` is true, a name is generated using the fullname template. | `cedana-controller-manager` |

## Controller Manager (`controllerManager`)

Configuration for the `cedana-controller-manager`.

### Autoscaling

| Parameter                                    | Description                                        | Default |
| -------------------------------------------- | -------------------------------------------------- | ------- |
| `autoscaling.enabled`                        | Enable autoscaling for the controller manager.     | `false` |
| `autoscaling.replicaCount`                   | The number of replicas for the controller manager. | `1`     |
| `autoscaling.deploymentRevisionHistoryLimit` | The number of old ReplicaSets to retain.           | `10`    |

### Service

| Parameter             | Description                                     | Default                                           |
| --------------------- | ----------------------------------------------- | ------------------------------------------------- |
| `service.annotations` | Annotations for the controller manager service. | `{}`                                              |
| `service.ports`       | The ports for the controller manager service.   | `[{protocol: TCP, port: 1324, targetPort: 1324}]` |

### Manager Configuration

| Parameter                          | Description                                                                                                                                                       | Default                                                                                      |
| ---------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- |
| `manager.podAnnotations`           | Annotations for the controller manager pods.                                                                                                                      | `{}`                                                                                         |
| `manager.args`                     | Arguments for the controller manager container.                                                                                                                   | `[--health-probe-bind-address=:8081, --metrics-bind-address=127.0.0.1:8080, --leader-elect]` |
| `manager.containerSecurityContext` | The security context for the manager container. The controller doesn't require any privileges.                                                                    | `allowPrivilegeEscalation: false`, `capabilities: { drop: [ALL] }`                           |
| `manager.image.repository`         | The repository for the `cedana-controller` image.                                                                                                                 | `cedana/cedana-controller`                                                                   |
| `manager.image.tag`                | The tag for the `cedana-controller` image.                                                                                                                        | `v0.6.2`                                                                                     |
| `manager.image.digest`             | The digest for the image (ignores tag if set).                                                                                                                    | `""`                                                                                         |
| `manager.image.imagePullPolicy`    | The image pull policy.                                                                                                                                            | `IfNotPresent`                                                                               |
| `manager.resources`                | Resource limits and requests for the manager container. Empty to ensure minimal resource usage on demo/test deployments. Uncomment or add custom resource limits. | `{}`                                                                                         |
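
For a production deployment, `manager.resources` could be filled in using the standard Kubernetes requests/limits format. The numbers below are illustrative placeholders, not tested recommendations, and the sketch assumes the key nests under `controllerManager`.

```yaml
controllerManager:
  manager:
    resources:
      requests:
        cpu: 100m       # example value
        memory: 128Mi   # example value
      limits:
        cpu: 500m       # example value
        memory: 256Mi   # example value
```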

### RBAC Proxy

| Parameter        | Description                                                                                                                                                          | Default |
| ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| `rbac.resources` | Resource limits and requests for the RBAC proxy container. Empty to ensure minimal resource usage on demo/test deployments. Uncomment or add custom resource limits. | `{}`    |

### Scheduling

| Parameter      | Description                                                                                                                                                                                                                                                                                                                                                           | Default                                                                |
| -------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| `tolerations`  | Tolerations for the controller manager pods. Cedana uses the `cedana.ai/not-ready:NoSchedule` taint while the node is bootstrapping, then removes it once the helper and health-check pods are Ready. To deploy the controller on a dedicated node, run `kubectl taint node <node-name> dedicated=cedana-manager:NoSchedule` and uncomment the additional toleration. | Allows scheduling on nodes with `cedana.ai/not-ready:NoSchedule` taint |
| `affinity`     | Affinity settings for the controller manager pods. To schedule the controller pod only on labeled nodes, run `kubectl label nodes <node-name> dedicated=cedana-manager` and uncomment the `nodeAffinity` configuration.                                                                                                                                               | `{}`                                                                   |
| `nodeSelector` | Node selector for the controller manager pods.                                                                                                                                                                                                                                                                                                                        | `{}`                                                                   |
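
The dedicated-node setup described above might translate into values like the following. This is a sketch using standard Kubernetes toleration and node affinity syntax, assuming the keys nest under `controllerManager`; the exact commented-out blocks in `values.yaml` may differ.

```yaml
controllerManager:
  tolerations:
    # Default toleration for the bootstrapping taint (operator is assumed).
    - key: cedana.ai/not-ready
      operator: Exists
      effect: NoSchedule
    # Matches: kubectl taint node <node-name> dedicated=cedana-manager:NoSchedule
    - key: dedicated
      operator: Equal
      value: cedana-manager
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              # Matches: kubectl label nodes <node-name> dedicated=cedana-manager
              - key: dedicated
                operator: In
                values:
                  - cedana-manager
```

The toleration lets the controller run on the tainted node, while the node affinity keeps it from being scheduled anywhere else.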

## Metrics Service (`metricsService`)

Configuration for the metrics service.

| Parameter | Description                        | Default                                                         |
| --------- | ---------------------------------- | --------------------------------------------------------------- |
| `ports`   | The ports for the metrics service. | `[{name: https, port: 8443, protocol: TCP, targetPort: https}]` |
| `type`    | The type of the metrics service.   | `ClusterIP`                                                     |


