Storage
There are two aspects to storage we cover here - where we checkpoint to, and what we checkpoint.
Where we checkpoint to
Simple: anywhere that's POSIX compliant. Our Helm charts come configured to write to local storage at first, so there are no issues out of the box, but while setting up Cedana you can configure a folder that's mounted in from a NAS, or even an S3-compatible bucket. Here's an example:
--set config.awsAccessKeyId=$AWS_ACCESS_KEY_ID \
--set config.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY \
--set config.awsRegion=$PROVIDER_REGION \
--set config.awsEndpoint="https://storage.eu-north1.nebius.cloud"
Note that in the above case, it's not sending to S3 but an S3-compatible bucket hosted on another cloud provider (Nebius).
What else we checkpoint
While we can move the state of your process (including both CPU and GPU state), open files need some careful consideration.
Given this, we currently have three ways of dealing with files that are being written to (a restored or migrated process expects each file to be exactly the same size, so that it can pick up where it left off). There are two scenarios we recommend, one for each of two filesystem-writing regimes:
files <~ 1GB: Default Behavior
files >> 1GB: Volume snapshotting
Each scenario is described below.
Default Behavior
If you're writing files into the rootfs of the container, we just take them with us on the snapshot! We perform a diff of the filesystem and add it to our process/system-level checkpoint. This includes JIT-compiled files, intermediate files created during install, and more.
However, when files become very large (>> 1GB) and persistence is necessary (if the files themselves are outputs of the run, as in physical simulations, for example), we recommend using volumes.
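As a rough way to tell which regime a workload falls into, you can measure how much data it writes. The sketch below is only an illustration: `suggest_strategy` is a hypothetical helper, not part of Cedana, and the hard 1 GB cutoff is an assumption standing in for the approximate thresholds above.

```shell
# Hypothetical helper (not part of Cedana): suggest a checkpointing approach
# from the total size of a directory the workload writes into.
suggest_strategy() {
  ss_dir="$1"
  # Assumed cutoff of ~1 GB, expressed in KiB because `du -k` reports KiB.
  ss_threshold_kb=$((1024 * 1024))
  # `du -sk` prints "<size-in-KiB><TAB><path>"; keep only the size.
  ss_size_kb=$(du -sk "$ss_dir" | cut -f1)
  if [ "$ss_size_kb" -lt "$ss_threshold_kb" ]; then
    echo "default"
  else
    echo "volume-snapshot"
  fi
}
```

Calling `suggest_strategy` on the directory your job writes to prints either `default` or `volume-snapshot`. Treat the result as a hint only, since the boundary between the two regimes is approximate.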
Volume Snapshotting
This is still a very early work in progress! Please reach out to us if you're planning on using this.
The alternate method requires coordination with your CSI driver. We take advantage of the snapshotting primitives already present in Kubernetes (https://kubernetes.io/docs/concepts/storage/volume-snapshots/) and take a reference to these with us, so when we restore, we restore from a Kubernetes Volume Snapshot.
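For orientation, this is roughly what the underlying Kubernetes primitives look like when used by hand. It is a sketch of the standard VolumeSnapshot API, not of what Cedana creates on your behalf, and every name starting with `example-` is a hypothetical placeholder.

```shell
# Write two illustrative manifests to a file. All "example-" names are
# hypothetical; substitute your own PVC, snapshot class, and storage class.
cat > volume-snapshot-example.yaml <<'EOF'
# Snapshot of an existing PVC, taken through the CSI driver.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: example-snapshot
spec:
  volumeSnapshotClassName: example-snapclass
  source:
    persistentVolumeClaimName: example-data
---
# New PVC whose contents are restored from that snapshot.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data-restored
spec:
  storageClassName: example-storageclass
  dataSource:
    name: example-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF

# Show what would be submitted to the cluster.
cat volume-snapshot-example.yaml
```

On a cluster with a snapshot-capable CSI driver and the snapshot CRDs installed, `kubectl apply -f volume-snapshot-example.yaml` would create both objects.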
Persistent Volume Claims
If your workload uses PersistentVolumeClaim-backed volumes, Cedana also checkpoints the PVC objects that are attached to the pod.
During checkpointing:
Cedana collects each PVC referenced by the pod.
If the PVC is bound to a PV, Cedana patches that PV to use the Retain reclaim policy when needed.
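If you want to see what this corresponds to at the Kubernetes level, the manual equivalent of switching a PV to the Retain reclaim policy is a `kubectl patch`. This is a sketch of the standard kubectl command rather than Cedana's internal code path, and `example-pv` is a hypothetical PV name.

```shell
# Hypothetical PV name; substitute the PV your PVC is bound to.
PV_NAME="example-pv"
# Patch body that switches the PV's reclaim policy to Retain.
RETAIN_PATCH='{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# Only talk to the cluster when kubectl is installed and the PV exists;
# otherwise just print the command that would be run.
if command -v kubectl >/dev/null 2>&1 && kubectl get pv "$PV_NAME" >/dev/null 2>&1; then
  kubectl patch pv "$PV_NAME" -p "$RETAIN_PATCH"
else
  echo "would run: kubectl patch pv $PV_NAME -p '$RETAIN_PATCH'"
fi
```

With Retain set, deleting the PVC leaves the PV and its data in place instead of deleting the underlying volume.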
During restore:
Cedana recreates missing PVCs for the restored pod.
If the original PVC was bound to a specific PV and that PV is still available, Cedana tries to reclaim it.
If the PV cannot be reclaimed or is no longer present, Cedana falls back to dynamic provisioning by clearing the explicit volumeName.
This keeps volume-backed workloads restorable without requiring you to manually rewire storage objects after a checkpoint.
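The difference between the two restore paths can be illustrated with a pair of PVC manifests. This is a minimal sketch under assumed names (`example-data`, `example-pv`), not the objects Cedana actually writes: the first claim is pinned to one PV through `volumeName`, and the second is the same claim with that field cleared.

```shell
# A PVC pinned to a specific PV via spec.volumeName (names are hypothetical).
cat > pvc-pinned.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  volumeName: example-pv
EOF

# Fallback variant: drop the volumeName line so the claim is no longer tied
# to one PV and can be satisfied by dynamic provisioning.
sed '/^  volumeName:/d' pvc-pinned.yaml > pvc-dynamic.yaml

# Show the one-line difference between the two claims.
# diff exits non-zero when files differ, so ignore its status here.
diff pvc-pinned.yaml pvc-dynamic.yaml || true
```

Because the sketch sets no `storageClassName`, dynamic provisioning of the second claim would rely on the cluster having a default StorageClass.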