Configuration

This document outlines the configurable parameters for the Cedana Helm chart. For the most up-to-date configuration, see values.yaml.

Global Settings

These settings control the overall behavior of the deployment.

| Parameter | Description | Default |
| --- | --- | --- |
| nameOverride | Overrides the name of the chart. | cedana |
| fullnameOverride | Overrides the full name of the release. | cedana |
| installKueue | If set to true, Kueue will be installed. Note: The Kueue CRDs must be applied before enabling this option. You can apply them with kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.1/manifests.yaml. | false |
| kubernetesClusterDomain | The Kubernetes cluster domain. | cluster.local |
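
As a rough illustration, the global settings above might be overridden in a custom values file like the following. This is a sketch that assumes these keys sit at the top level of values.yaml; the file name my-values.yaml is hypothetical.

```yaml
# my-values.yaml (hypothetical file name)
# Assumes the global settings are top-level keys in values.yaml.
nameOverride: cedana
fullnameOverride: cedana
# Enable only after the Kueue CRDs have been applied to the cluster.
installKueue: true
kubernetesClusterDomain: cluster.local
```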

Cedana Configuration (config)

This section contains the core configuration for the Cedana platform.

Authentication & Platform

| Parameter | Description | Default |
| --- | --- | --- |
| authToken | Your authentication token for the Cedana platform. | "" |
| url | The URL for the Cedana API. | "" |
| clusterId | A unique ID for your cluster (you can generate one on the Clusters page). | "" |
| sqsQueueUrl | The SQS queue URL for communication with Cedana. | "" |
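
A minimal sketch of the authentication settings, assuming they nest under the config key as the section heading suggests. All values in angle brackets are placeholders to replace with your own.

```yaml
# Assumes these parameters live under the top-level "config" key.
config:
  authToken: "<your-auth-token>"   # placeholder
  url: "<cedana-api-url>"          # placeholder
  clusterId: "<your-cluster-id>"   # placeholder, e.g. generated on the Clusters page
  sqsQueueUrl: ""                  # left at the documented default
```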

Daemon Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| address | The address and port the daemon binds to. The host must be 0.0.0.0 for the daemon to be reachable externally. | 0.0.0.0:8080 |
| protocol | The protocol to use for the daemon service. Options: tcp (TCP socket), unix (Unix socket), vsock (VSock for VM and hypervisor communication, useful for supporting VM-based migrations). Note: Other protocols might not be fully supported; change this setting with caution. | tcp |

Checkpoint Storage & Streaming

| Parameter | Description | Default |
| --- | --- | --- |
| checkpointDir | Path to the directory for storing checkpoints. Options: cedana://<path> for Cedana-managed global storage (recommended), s3://<bucket>/<path> for your own S3 storage, <path> for node-local storage (not recommended, as it won't be accessible across nodes). | /tmp |
| checkpointStreams | The number of parallel streams to use for streaming checkpoint/restore operations. 0 disables streaming. n > 0 uses n parallel streams (pipes). Streaming keeps the footprint low by using memory efficiently and requiring no intermediate disk space, although performance may be slightly better or worse depending on disk and network read/write speeds. | 0 |
| checkpointCompression | The compression algorithm for checkpoints. Options: none, tar, lz4, gzip, zlib. | none |
| checkpointAsync | Whether to perform checkpoint compression and upload asynchronously. Recommended if you anticipate large checkpoints, which could take too long to stream directly to a bucket. Note that a restore is only possible after the upload is complete, so there might be a delay before a workload can be restored even though the checkpoint shows as complete. | false |
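
The checkpoint options above could be combined along these lines. This is only a sketch: it assumes the keys nest under config, and the stream count of 4 is an illustrative value rather than a recommendation.

```yaml
# Assumes these parameters live under the top-level "config" key.
config:
  # Cedana-managed global storage; replace <path> with your own path.
  checkpointDir: "cedana://<path>"
  # 0 disables streaming; 4 is an arbitrary example of n > 0 parallel streams.
  checkpointStreams: 4
  # One of: none, tar, lz4, gzip, zlib.
  checkpointCompression: none
  # When true, a restore is only possible once the upload has finished.
  checkpointAsync: false
```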

GPU Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| gpuPoolSize | Number of GPU controllers to keep warm. Improves GPU workload startup/restore time but uses more memory. | 0 |
| gpuShmSize | The shared memory size, in bytes, for GPU workloads. 8 GiB is enough for most workloads. Reduce if memory constrained or running small workloads only. | 8589934592 (8 GiB) |
| gpuLdLibPath | Additional LD_LIBRARY_PATH entries to search for CUDA libraries. | /run/nvidia/driver/usr/lib/x86_64-linux-gnu |
| gpuSkipNvidiaRuntimeHook | Whether to skip adding the nvidia-container-runtime hook when starting GPU workloads. | false |
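
For a memory-constrained node, the GPU settings might look like the sketch below. It assumes the keys nest under config and that gpuShmSize is given in bytes, as the default suggests; the pool size of 1 and the 4 GiB figure are illustrative. Check values.yaml for whether the chart expects the size as a number or a quoted string.

```yaml
# Assumes these parameters live under the top-level "config" key.
config:
  # Keep one GPU controller warm (illustrative; the default is 0).
  gpuPoolSize: 1
  # 4 GiB expressed in bytes: 4 * 1024 * 1024 * 1024 = 4294967296.
  gpuShmSize: 4294967296
  gpuLdLibPath: /run/nvidia/driver/usr/lib/x86_64-linux-gnu
  gpuSkipNvidiaRuntimeHook: false
```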

Plugin Versions

| Parameter | Description | Default |
| --- | --- | --- |
| pluginsBuilds | Which plugin builds to use. If set to release, any release version can be specified for the plugins. If set to alpha, specify a branch name instead. | alpha |
| pluginsNativeVersion | The version of the native plugin to use. | latest |
| pluginsCriuVersion | The version of the CRIU plugin to use. | 7/merge |
| pluginsContainerdRuntimeVersion | The version of the containerd runtime plugin to use. | v0.7.1 |
| pluginsGpuVersion | The version of the GPU plugin to use. | v0.7.0 |
| pluginsStreamerVersion | The version of the streamer plugin to use. | v0.0.8 |
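
A sketch of pinning plugins to release builds, assuming the keys nest under config. The version strings are copied from the defaults listed above; confirm that they exist as published releases before relying on them.

```yaml
# Assumes these parameters live under the top-level "config" key.
config:
  # "release" means the version fields below name release versions.
  pluginsBuilds: release
  pluginsContainerdRuntimeVersion: v0.7.1
  pluginsGpuVersion: v0.7.0
  pluginsStreamerVersion: v0.0.8
```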

Observability

| Parameter | Description | Default |
| --- | --- | --- |
| profiling | Enable profiling. | true |
| metrics | Enable metrics collection. | true |
| logLevel | The logging level. | info |

AWS Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| awsAccessKeyId | AWS access key ID for S3 storage (used when checkpointDir is set to s3://<bucket>/<path>). | "" |
| awsSecretAccessKey | AWS secret access key for S3 storage. | "" |
| awsRegion | AWS region for S3 storage (the default region is used if not set). | "" |
| awsEndpoint | AWS endpoint, for use with S3-compatible storage. | "" |
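
Putting the AWS parameters together with checkpointDir, an S3-backed setup might be sketched as follows. It assumes the keys nest under config; the region is an example and everything in angle brackets is a placeholder.

```yaml
# Assumes these parameters live under the top-level "config" key.
config:
  # S3 storage is selected through the checkpointDir scheme.
  checkpointDir: "s3://<bucket>/<path>"
  awsAccessKeyId: "<access-key-id>"          # placeholder
  awsSecretAccessKey: "<secret-access-key>"  # placeholder
  awsRegion: "us-east-1"                     # example region
  # Set only when targeting S3-compatible storage.
  awsEndpoint: ""
```

Keeping credentials in a plain values file is rarely desirable; the pre-existing secret option described under Custom Secrets may be a better fit.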

Custom Secrets

| Parameter | Description | Default |
| --- | --- | --- |
| preExistingSecret | Uncomment to use a custom pre-existing secret. | cedana-secret-user (commented out) |
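
Uncommented, the setting might read as below. This sketch assumes preExistingSecret nests under config like the other parameters in this section; the keys the secret itself must contain are not listed here, so check values.yaml or the chart templates for them.

```yaml
# Assumes this parameter lives under the top-level "config" key.
config:
  # Name of a Secret that already exists in the release namespace.
  preExistingSecret: cedana-secret-user
```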

Host Configuration (hostConfig)

Configuration for host-level settings.

| Parameter | Description | Default |
| --- | --- | --- |
| containerdAddress | Path to the containerd socket. | /run/containerd/containerd.sock |
| disableIoUring | Set to true to disable the Linux kernel's io_uring option. Cedana does not support io_uring-based checkpoint/restore. | true |

Shared Memory Configuration (hostConfig.shmConfig)

Optional configuration to increase /dev/shm size on nodes, which is useful for workloads requiring large shared memory.

| Parameter | Description | Default |
| --- | --- | --- |
| enabled | Set to true to enable /dev/shm size increase. | false |
| size | Size to set for /dev/shm (e.g., 10G, 20G). | 10G |
| minSize | Minimum size to trigger remount (e.g., 10G, 20G). | 10G |
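
The host-level settings, including the optional /dev/shm increase, could be sketched like this. The nesting follows the hostConfig and hostConfig.shmConfig section names; the 20G size is an illustrative value taken from the examples above.

```yaml
hostConfig:
  containerdAddress: /run/containerd/containerd.sock
  disableIoUring: true
  shmConfig:
    # Opt in to resizing /dev/shm on nodes.
    enabled: true
    # Target size for /dev/shm (illustrative).
    size: 20G
    # Documented default for the remount threshold.
    minSize: 10G
```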

Daemon Helper (daemonHelper)

Configuration for the daemon-helper DaemonSet.

Service

| Parameter | Description | Default |
| --- | --- | --- |
| service.annotations | Annotations to add to the daemon helper service. | {} |

Image

| Parameter | Description | Default |
| --- | --- | --- |
| image.repository | The repository for the image. | cedana/cedana-helper |
| image.tag | The tag for the image. | v0.9.284 |
| image.digest | The digest for the image. If set, the tag is ignored. | "" |
| image.imagePullPolicy | The image pull policy. | IfNotPresent |

Update Strategy

| Parameter | Description | Default |
| --- | --- | --- |
| updateStrategy.maxSurge | The maximum number of pods that can be created over the desired number. | 0 |
| updateStrategy.maxUnavailable | The maximum number of pods that can be unavailable during the update process. | 99999 |

Scheduling

| Parameter | Description | Default |
| --- | --- | --- |
| tolerations | Tolerations for the daemon helper pods. | Allows scheduling on nodes with cedana.ai/not-ready:NoSchedule taint |
| affinity | Affinity settings for the daemon helper pods. | {} |
| nodeSelector | Node selector for the daemon helper pods. | {} |
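
A sketch of a daemon helper override that pins the image and slows the rollout to one node at a time. The nesting follows the daemonHelper section name and the dotted parameter paths above; the maxUnavailable value of 1 and the node selector are illustrative choices, not chart defaults.

```yaml
daemonHelper:
  image:
    repository: cedana/cedana-helper
    tag: v0.9.284
    # Leave empty to use the tag; a digest, if set, takes precedence.
    digest: ""
    imagePullPolicy: IfNotPresent
  updateStrategy:
    maxSurge: 0
    # Update one pod at a time instead of the default 99999.
    # Kubernetes does not allow maxUnavailable to be 0 when maxSurge is 0.
    maxUnavailable: 1
  nodeSelector:
    # Well-known Kubernetes node label, used here only as an example.
    kubernetes.io/os: linux
```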

Service Account (serviceAccount)

Configuration for the Kubernetes Service Account.

| Parameter | Description | Default |
| --- | --- | --- |
| create | Specifies whether a service account should be created. | true |
| automount | Automatically mount a ServiceAccount's API credentials. | true |
| annotations | Annotations to add to the service account. | {} |
| name | The name of the service account to use. If not set and create is true, a name is generated using the fullname template. | cedana-controller-manager |
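
The service account settings might be overridden as in the sketch below. The annotation shown is the one Amazon EKS uses to bind an IAM role to a service account; it is included purely as an example of the annotations field, and the role ARN is a placeholder.

```yaml
serviceAccount:
  create: true
  automount: true
  annotations:
    # Example annotation only; replace the placeholders or remove the entry.
    eks.amazonaws.com/role-arn: "arn:aws:iam::<account-id>:role/<role-name>"
  name: cedana-controller-manager
```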

Controller Manager (controllerManager)

Configuration for the cedana-controller-manager.

Autoscaling

| Parameter | Description | Default |
| --- | --- | --- |
| autoscaling.enabled | Enable autoscaling for the controller manager. | false |
| autoscaling.replicaCount | The number of replicas for the controller manager. | 1 |
| autoscaling.deploymentRevisionHistoryLimit | The number of old ReplicaSets to retain. | 10 |

Service

| Parameter | Description | Default |
| --- | --- | --- |
| service.annotations | Annotations for the controller manager service. | {} |
| service.ports | The ports for the controller manager service. | [{protocol: TCP, port: 1324, targetPort: 1324}] |

Manager Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| manager.podAnnotations | Annotations for the controller manager pods. | {} |
| manager.args | Arguments for the controller manager container. | [--health-probe-bind-address=:8081, --metrics-bind-address=127.0.0.1:8080, --leader-elect] |
| manager.containerSecurityContext | The security context for the manager container. The controller does not require any privileges. | allowPrivilegeEscalation: false, capabilities: { drop: [ALL] } |
| manager.image.repository | The repository for the cedana-controller image. | cedana/cedana-controller |
| manager.image.tag | The tag for the cedana-controller image. | v0.6.2 |
| manager.image.digest | The digest for the image. If set, the tag is ignored. | "" |
| manager.image.imagePullPolicy | The image pull policy. | IfNotPresent |
| manager.resources | Resource limits and requests for the manager container. Empty by default to keep resource usage minimal on demo/test deployments. Uncomment the example or add custom resource limits. | {} |
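
Custom resource requests and limits for the manager container could be set roughly as follows. The nesting follows the controllerManager section name and the manager.resources path; the CPU and memory figures are illustrative and not sizing guidance from Cedana.

```yaml
controllerManager:
  manager:
    resources:
      # Illustrative numbers only; tune for your cluster.
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi
```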

RBAC Proxy

| Parameter | Description | Default |
| --- | --- | --- |
| rbac.resources | Resource limits and requests for the RBAC proxy container. Empty by default to keep resource usage minimal on demo/test deployments. Uncomment the example or add custom resource limits. | {} |

Scheduling

| Parameter | Description | Default |
| --- | --- | --- |
| tolerations | Tolerations for the controller manager pods. Allows scheduling on nodes with cedana.ai/not-ready:NoSchedule taint. To deploy the controller on a dedicated node, run kubectl taint node <node-name> dedicated=cedana-manager:NoSchedule and uncomment the additional toleration. | Allows scheduling on nodes with cedana.ai/not-ready:NoSchedule taint |
| affinity | Affinity settings for the controller manager pods. To allow scheduling of controller pod on labeled nodes only, use kubectl label nodes <node-name> dedicated=cedana-manager and uncomment the nodeAffinity configuration. | {} |
| nodeSelector | Node selector for the controller manager pods. | {} |
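
The dedicated-node setup described above might translate into values like the following. This is a sketch using standard Kubernetes toleration and node affinity syntax; the exact entries shipped (commented out) in values.yaml may differ, and the Exists operator on the first toleration is an assumption.

```yaml
controllerManager:
  tolerations:
    # Tolerate the cedana.ai/not-ready:NoSchedule taint (operator assumed).
    - key: cedana.ai/not-ready
      operator: Exists
      effect: NoSchedule
    # Matches: kubectl taint node <node-name> dedicated=cedana-manager:NoSchedule
    - key: dedicated
      operator: Equal
      value: cedana-manager
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              # Matches: kubectl label nodes <node-name> dedicated=cedana-manager
              - key: dedicated
                operator: In
                values:
                  - cedana-manager
```

The toleration only permits scheduling onto the tainted node, while the node affinity restricts the controller to labeled nodes, so both are needed to keep it on the dedicated node.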

Metrics Service (metricsService)

Configuration for the metrics service.

| Parameter | Description | Default |
| --- | --- | --- |
| ports | The ports for the metrics service. | [{name: https, port: 8443, protocol: TCP, targetPort: https}] |
| type | The type of the metrics service. | ClusterIP |
