Configuration
This document outlines the configurable parameters for the Cedana Helm chart. For up-to-date configuration, see values.yaml.
Global Settings
These settings control the overall behavior of the deployment.

| Parameter | Description | Default |
| --- | --- | --- |
| `nameOverride` | Overrides the name of the chart. | `cedana` |
| `fullnameOverride` | Overrides the full name of the release. | `cedana` |
| `installKueue` | If set to `true`, Kueue will be installed. Note: the Kueue CRDs must be applied before enabling this option. You can apply them with `kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.1/manifests.yaml`. | `false` |
| `kubernetesClusterDomain` | The Kubernetes cluster domain. | `cluster.local` |

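
As a rough illustration, the global settings above could be overridden in a custom values file passed to Helm with `-f`. The file name `my-values.yaml` is hypothetical; the keys and the Kueue manifest URL are the ones documented above.

```yaml
# my-values.yaml (hypothetical file name)
# Apply the Kueue CRDs before enabling installKueue, as noted above:
#   kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.1/manifests.yaml
nameOverride: cedana
fullnameOverride: cedana
installKueue: true                      # default is false
kubernetesClusterDomain: cluster.local  # change only if your cluster uses a custom domain
```
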
Cedana Configuration (config)
This section contains the core configuration for the Cedana platform.
Authentication & Platform

| Parameter | Description | Default |
| --- | --- | --- |
| `authToken` | Your authentication token for the Cedana platform. | `""` |
| `url` | The URL for the Cedana API. | `""` |
| `sqsQueueUrl` | The SQS queue URL for communication with Cedana. | `""` |

Daemon Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| `address` | The address and port the daemon binds to. The host must be `0.0.0.0` for external accessibility. | `0.0.0.0:8080` |
| `protocol` | The protocol to use for the daemon service. Options: `tcp` (TCP socket), `unix` (Unix socket), `vsock` (VSock for VM and hypervisor communication, useful for supporting VM-based migrations). Note: other protocols might not be supported properly; update with caution. | `tcp` |

Checkpoint Storage & Streaming

| Parameter | Description | Default |
| --- | --- | --- |
| `checkpointDir` | Path to the directory for storing checkpoints. Options: `cedana://<path>` for Cedana-managed global storage (recommended), `s3://<bucket>/<path>` for your own S3 storage, `<path>` for node-local storage (not recommended, as it won't be accessible across nodes). | `/tmp` |
| `checkpointStreams` | The number of parallel streams to use for streaming checkpoint/restore operations. `0` means no streaming; `n > 0` means `n` parallel streams (pipes). Streaming keeps the footprint low by using memory efficiently with no intermediate disk space required, although performance may be slightly better or worse depending on disk and network read/write speeds. | `0` |
| `checkpointCompression` | The compression algorithm for checkpoints. Options: `none`, `tar`, `lz4`, `gzip`, `zlib`. | `none` |
| `checkpointAsync` | Whether to perform checkpoint compression/upload asynchronously. Recommended if you anticipate large checkpoints, which could take too long to stream directly to a bucket. Note that restore is only possible after the upload is complete, so there might be a delay before a workload can be restored even though the checkpoint shows as complete. | `false` |

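
The following is a minimal sketch of how the checkpoint options might be combined for S3-backed storage. It assumes these keys nest under `config`, as the section heading suggests; the stream count and region are illustrative values, and the bucket, path, and credentials are placeholders. The `aws*` keys are described under AWS Configuration. Whether every combination of streaming, compression, and async upload is supported is not stated in this document, so treat this as a starting point and verify against values.yaml.

```yaml
config:
  checkpointDir: "s3://<bucket>/<path>"      # placeholder; replace with your bucket and path
  checkpointStreams: 4                       # illustrative; 0 (default) disables streaming
  checkpointCompression: lz4                 # one of: none, tar, lz4, gzip, zlib
  checkpointAsync: true                      # restore is only possible once the upload completes
  awsAccessKeyId: "<access-key-id>"          # placeholder
  awsSecretAccessKey: "<secret-access-key>"  # placeholder
  awsRegion: "us-east-1"                     # example region; the default region is used if unset
```
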
GPU Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| `gpuPoolSize` | Number of GPU controllers to keep warm. Improves GPU workload startup/restore time but uses more memory. | `0` |
| `gpuShmSize` | The shared memory size for GPU workloads, in bytes. 8 GiB is enough for most workloads. Reduce if memory constrained or running small workloads only. | `8589934592` (8 GiB) |
| `gpuLdLibPath` | Additional `LD_LIBRARY_PATH` to search for CUDA libraries. | `/run/nvidia/driver/usr/lib/x86_64-linux-gnu` |
| `gpuSkipNvidiaRuntimeHook` | Whether to skip adding the nvidia-container-runtime hook when starting GPU workloads. | `false` |

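
Because `gpuShmSize` is given in bytes, it can help to spell out the arithmetic when changing it. The sketch below halves the shared memory to 4 GiB and keeps two GPU controllers warm; both values are illustrative, and the nesting under `config` is assumed from the section heading.

```yaml
config:
  gpuPoolSize: 2          # illustrative; default is 0
  gpuShmSize: 4294967296  # 4 GiB = 4 * 1024 * 1024 * 1024 bytes; default is 8589934592 (8 GiB)
```
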
Plugin Versions

| Parameter | Description | Default |
| --- | --- | --- |
| `pluginsBuilds` | Specifies the plugin builds to use. If set to `release`, any release version of the plugins can be specified. If set to `alpha`, specify a branch name instead. | `alpha` |
| `pluginsNativeVersion` | The version of the native plugin to use. | `latest` |
| `pluginsCriuVersion` | The version of the CRIU plugin to use. | `7/merge` |
| `pluginsContainerdRuntimeVersion` | The version of the containerd runtime plugin to use. | `v0.7.1` |
| `pluginsGpuVersion` | The version of the GPU plugin to use. | `v0.7.0` |
| `pluginsStreamerVersion` | The version of the streamer plugin to use. | `v0.0.8` |

Observability

| Parameter | Description | Default |
| --- | --- | --- |
| `profiling` | Enable profiling. | `true` |
| `metrics` | Enable metrics collection. | `true` |
| `logLevel` | The logging level. | `info` |

AWS Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| `awsAccessKeyId` | AWS access key ID, if using S3 storage (that is, if `checkpointDir` is set to `s3://<bucket>/<path>`). | `""` |
| `awsSecretAccessKey` | AWS secret access key, if using S3 storage. | `""` |
| `awsRegion` | AWS region, if using S3 storage (uses the default region if not set). | `""` |
| `awsEndpoint` | AWS endpoint, if using S3-compatible storage. | `""` |

Custom Secrets

| Parameter | Description | Default |
| --- | --- | --- |
| `preExistingSecret` | Uncomment to use a custom pre-existing secret. | `cedana-secret-user` (commented out) |

Host Configuration (hostConfig)
Configuration for host-level settings.

| Parameter | Description | Default |
| --- | --- | --- |
| `containerdAddress` | Path to the containerd socket. | `/run/containerd/containerd.sock` |
| `disableIoUring` | Set to `true` to disable the Linux kernel's io_uring option. Cedana does not support io_uring-based checkpoint restores. | `true` |

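
Some Kubernetes distributions place the containerd socket in a non-default location; k3s, for example, uses `/run/k3s/containerd/containerd.sock`. The sketch below shows how `containerdAddress` might be overridden in that case. Whether Cedana supports a given distribution is not covered in this document.

```yaml
hostConfig:
  containerdAddress: /run/k3s/containerd/containerd.sock  # k3s socket path; default is /run/containerd/containerd.sock
  disableIoUring: true                                     # keep the default; io_uring-based restores are unsupported
```
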
Shared Memory Configuration (hostConfig.shmConfig)
Optional configuration to increase /dev/shm size on nodes, which is useful for workloads requiring large shared memory.
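
| Parameter | Description | Default |
| --- | --- | --- |
| `enabled` | Set to `true` to enable the `/dev/shm` size increase. | `false` |
| `size` | Size to set for `/dev/shm` (e.g., `10G`, `20G`). | `10G` |
| `minSize` | Minimum size to trigger a remount (e.g., `10G`, `20G`). | `10G` |
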
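
A minimal sketch of enabling the `/dev/shm` increase, assuming `shmConfig` nests under `hostConfig` as the section heading indicates. The 20G size is an illustrative value taken from the examples above.

```yaml
hostConfig:
  shmConfig:
    enabled: true  # default is false
    size: 20G      # size to set for /dev/shm
    minSize: 10G   # minimum size to trigger a remount (default)
```
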
Daemon Helper (daemonHelper)
Configuration for the daemon-helper DaemonSet.
Service

| Parameter | Description | Default |
| --- | --- | --- |
| `service.annotations` | Annotations to add to the daemon helper service. | `{}` |

Image

| Parameter | Description | Default |
| --- | --- | --- |
| `image.repository` | The repository for the image. | `cedana/cedana-helper` |
| `image.tag` | The tag for the image. | `v0.9.284` |
| `image.digest` | The digest for the image (the tag is ignored if set). | `""` |
| `image.imagePullPolicy` | The image pull policy. | `IfNotPresent` |

Update Strategy

| Parameter | Description | Default |
| --- | --- | --- |
| `updateStrategy.maxSurge` | The maximum number of pods that can be created over the desired number. | `0` |
| `updateStrategy.maxUnavailable` | The maximum number of pods that can be unavailable during the update process. | `99999` |

Scheduling

| Parameter | Description | Default |
| --- | --- | --- |
| `tolerations` | Tolerations for the daemon helper pods. | Allows scheduling on nodes with the `cedana.ai/not-ready:NoSchedule` taint |
| `affinity` | Affinity settings for the daemon helper pods. | `{}` |
| `nodeSelector` | Node selector for the daemon helper pods. | `{}` |

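
As an illustration, the default toleration described above might be written out as follows, together with a node selector that restricts the daemon helper to labeled nodes. The `operator: Exists` form is an assumption (the chart may express the toleration differently), and the `example.com/cedana` label is hypothetical.

```yaml
daemonHelper:
  tolerations:
    - key: cedana.ai/not-ready
      operator: Exists          # assumption; check values.yaml for the chart's exact form
      effect: NoSchedule
  nodeSelector:
    example.com/cedana: "enabled"  # hypothetical label; use one that exists on your nodes
```

Setting `tolerations` in a custom values file replaces the chart's default list, so the default toleration is repeated here to keep it in effect.
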
Service Account (serviceAccount)
Configuration for the Kubernetes Service Account.

| Parameter | Description | Default |
| --- | --- | --- |
| `create` | Specifies whether a service account should be created. | `true` |
| `automount` | Automatically mount the ServiceAccount's API credentials. | `true` |
| `annotations` | Annotations to add to the service account. | `{}` |
| `name` | The name of the service account to use. If not set and `create` is `true`, a name is generated using the fullname template. | `cedana-controller-manager` |

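
If a service account is managed outside the chart, the settings above could be combined roughly as follows. The account name `my-existing-sa` is hypothetical, and this sketch assumes the account already exists in the release namespace with the permissions the chart would otherwise grant.

```yaml
serviceAccount:
  create: false         # do not create a service account
  automount: true       # still mount the API credentials
  name: my-existing-sa  # hypothetical; must already exist in the namespace
```
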
Controller Manager (controllerManager)
Configuration for the cedana-controller-manager.
Autoscaling

| Parameter | Description | Default |
| --- | --- | --- |
| `autoscaling.enabled` | Enable autoscaling for the controller manager. | `false` |
| `autoscaling.replicaCount` | The number of replicas for the controller manager. | `1` |
| `autoscaling.deploymentRevisionHistoryLimit` | The number of old ReplicaSets to retain. | `10` |

Service

| Parameter | Description | Default |
| --- | --- | --- |
| `service.annotations` | Annotations for the controller manager service. | `{}` |
| `service.ports` | The ports for the controller manager service. | `[{protocol: TCP, port: 1324, targetPort: 1324}]` |

Manager Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| `manager.podAnnotations` | Annotations for the controller manager pods. | `{}` |
| `manager.args` | Arguments for the controller manager container. | `[--health-probe-bind-address=:8081, --metrics-bind-address=127.0.0.1:8080, --leader-elect]` |
| `manager.containerSecurityContext` | The security context for the manager container. The controller doesn't require any privileges. | `allowPrivilegeEscalation: false, capabilities: { drop: [ALL] }` |
| `manager.image.repository` | The repository for the cedana-controller image. | `cedana/cedana-controller` |
| `manager.image.tag` | The tag for the cedana-controller image. | `v0.6.2` |
| `manager.image.digest` | The digest for the image (the tag is ignored if set). | `""` |
| `manager.image.imagePullPolicy` | The image pull policy. | `IfNotPresent` |
| `manager.resources` | Resource limits and requests for the manager container. Left empty to ensure minimal resource usage on demo/test deployments. Uncomment or add custom resource limits. | `{}` |

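
For deployments beyond demo or test use, `manager.resources` accepts the standard Kubernetes requests/limits structure. The numbers below are illustrative placeholders, not recommendations from the chart authors; size them from observed usage.

```yaml
controllerManager:
  manager:
    resources:
      requests:
        cpu: 100m      # illustrative value
        memory: 128Mi  # illustrative value
      limits:
        cpu: 500m      # illustrative value
        memory: 256Mi  # illustrative value
```
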
RBAC Proxy

| Parameter | Description | Default |
| --- | --- | --- |
| `rbac.resources` | Resource limits and requests for the RBAC proxy container. Left empty to ensure minimal resource usage on demo/test deployments. Uncomment or add custom resource limits. | `{}` |

Scheduling

| Parameter | Description | Default |
| --- | --- | --- |
| `tolerations` | Tolerations for the controller manager pods. To deploy the controller on a dedicated node, run `kubectl taint node <node-name> dedicated=cedana-manager:NoSchedule` and uncomment the additional toleration. | Allows scheduling on nodes with the `cedana.ai/not-ready:NoSchedule` taint |
| `affinity` | Affinity settings for the controller manager pods. To schedule the controller pod on labeled nodes only, run `kubectl label nodes <node-name> dedicated=cedana-manager` and uncomment the `nodeAffinity` configuration. | `{}` |
| `nodeSelector` | Node selector for the controller manager pods. | `{}` |

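
The dedicated-node setup described above can be sketched as follows, assuming the node has been tainted and labeled with the two `kubectl` commands given. The structure follows the standard Kubernetes toleration and node affinity schema; the exact commented-out blocks in values.yaml may differ, and the `operator: Exists` form of the first toleration is an assumption.

```yaml
controllerManager:
  tolerations:
    - key: cedana.ai/not-ready
      operator: Exists           # assumption; check values.yaml for the chart's exact form
      effect: NoSchedule
    - key: dedicated             # matches: kubectl taint node <node-name> dedicated=cedana-manager:NoSchedule
      operator: Equal
      value: cedana-manager
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: dedicated   # matches: kubectl label nodes <node-name> dedicated=cedana-manager
                operator: In
                values:
                  - cedana-manager
```

The toleration only permits scheduling onto the tainted node; the node affinity is what restricts the controller to it.
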
Metrics Service (metricsService)
Configuration for the metrics service.

| Parameter | Description | Default |
| --- | --- | --- |
| `ports` | The ports for the metrics service. | `[{name: https, port: 8443, protocol: TCP, targetPort: https}]` |
| `type` | The type of the metrics service. | `ClusterIP` |
