
Storage

There are two aspects to storage we cover here - where we checkpoint to, and what we checkpoint.

Where we checkpoint to

Simple: anywhere that's POSIX compliant. Our Helm charts come configured to write to local storage by default, so things work out of the box, but while setting up Cedana you can configure a folder mounted from a NAS, or even an S3-compatible bucket. Here's an example:

--set config.awsAccessKeyId=$AWS_ACCESS_KEY_ID \
--set config.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY \
--set config.awsRegion=$PROVIDER_REGION \
--set config.awsEndpoint="https://storage.eu-north1.nebius.cloud"

Note that in the above case, we're not sending to S3 itself but to an S3-compatible bucket hosted on another cloud provider (Nebius).
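For context, here's what those flags might look like as part of a full install command. This is a hedged sketch: the release name, chart reference, and namespace below are placeholders, not Cedana's actual chart coordinates, so substitute your own values.

```shell
# Placeholder release name, chart, and namespace -- replace with your own.
# The config.aws* values point checkpoints at an S3-compatible bucket.
helm upgrade --install cedana cedana/cedana \
  --namespace cedana-system \
  --create-namespace \
  --set config.awsAccessKeyId=$AWS_ACCESS_KEY_ID \
  --set config.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY \
  --set config.awsRegion=$PROVIDER_REGION \
  --set config.awsEndpoint="https://storage.eu-north1.nebius.cloud"
```

Leaving the `config.aws*` flags off keeps the default local-storage behavior.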

What else we checkpoint

While we can move the state of your process (including both CPU and GPU state), open files require some careful consideration.

A restored or migrated process expects its open files to have the exact same size and contents so it can pick up where it left off. Given this, we currently deal with files that are being written to in two ways, one for each of two filesystem-writing regimes:

  • files <~ 1GB: Default Behavior

  • files >> 1GB: Volume snapshotting

Each scenario is described below.

Default Behavior

If you're writing files into the rootfs of the container, we just take them with us in the snapshot! We perform a diff of the filesystem and add it to our process/system-level checkpoint. This includes JIT-compiled files, intermediate files created during installation, and more.

However, when files become very large (>> 1 GB) and persistence is necessary (for example, when the files themselves are outputs of the run, as in physical simulations), we recommend using volumes.

Volume Snapshotting

Note: This is still a very early work in progress! Please reach out to us if you're planning on using this.

The alternative method requires coordination with your CSI driver. We take advantage of the snapshotting primitives already present in Kubernetes (https://kubernetes.io/docs/concepts/storage/volume-snapshots/) and carry a reference to these snapshots with us, so when we restore, we restore from a Kubernetes VolumeSnapshot.
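To make the moving parts concrete, here's a rough sketch of what the Kubernetes side of this looks like. The names (`csi-snapclass`, `data-pvc`, `data-restore`) are hypothetical, and this is the generic VolumeSnapshot workflow from the Kubernetes docs linked above, not Cedana's internal implementation: a snapshot is taken of the volume the workload writes to, and a new PVC is provisioned from that snapshot for the restored pod to mount.

```shell
# Hypothetical names throughout; your cluster needs a CSI driver with
# snapshot support and the VolumeSnapshot CRDs installed.
cat <<'EOF' | kubectl apply -f -
# Snapshot the volume the running workload is writing to.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: data-pvc
---
# A new PVC provisioned from that snapshot, which the restored
# or migrated pod mounts in place of the original volume.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-restore
spec:
  dataSource:
    name: data-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
```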
