Managing Storage

While we can move the state of your process (both CPU and GPU state), open files need some careful consideration.

A restored or migrated process expects each file it was writing to have exactly the same size, so it can pick up where it left off. Given this, we currently have two ways of dealing with files that are being written to, one for each of two filesystem-writing regimes:

  • files <~ 1GB: Default Behavior

  • files >> 1GB: Volume snapshotting

Each scenario is described below.
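As a rough illustration of the split above, the sketch below picks a regime from a file size. The hard 1 GB cutoff, the function name `recommend_strategy`, and the returned labels are assumptions made for this example only; the guidance in this page is deliberately approximate ("<~" and ">>"), so treat the threshold as a starting point rather than a rule.

```python
# Hypothetical helper: not part of any shipped API.
ONE_GB = 1024 ** 3  # assumed cutoff; the documented boundary is only approximate


def recommend_strategy(file_size_bytes: int, threshold_bytes: int = ONE_GB) -> str:
    """Return which storage regime this page would suggest for a file of this size."""
    if file_size_bytes < 0:
        raise ValueError("file size cannot be negative")
    # Small files ride along with the checkpoint; large ones belong on a volume.
    if file_size_bytes <= threshold_bytes:
        return "default"
    return "volume-snapshot"
```

For files close to the boundary, whether the output must persist beyond the run matters more than the exact size.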

Default Behavior

If you're writing files into the rootfs of the container, we just take them with us on the snapshot! We perform a diff of the filesystem and add it to our process/system-level checkpoint. This includes JIT-compiled files, intermediate files created during install, and more.
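To make the idea of a filesystem diff concrete, here is a minimal sketch that compares two views of a directory tree and reports what was added, removed, or modified. It is an illustration only and not how the checkpoint is actually implemented: the function names `snapshot_tree` and `diff_trees` are hypothetical, and detecting changes by size and modification time is an assumption chosen for simplicity.

```python
import os


def snapshot_tree(root):
    """Map each file's path (relative to root) to a (size, mtime_ns) pair."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)
            except OSError:
                continue  # file vanished between listing and stat
            state[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return state


def diff_trees(before, after):
    """Compare two snapshot_tree() results and list the changed paths."""
    return {
        "added": sorted(p for p in after if p not in before),
        "removed": sorted(p for p in before if p not in after),
        "modified": sorted(
            p for p in after if p in before and after[p] != before[p]
        ),
    }
```

Calling `snapshot_tree` once at container start and once at checkpoint time, then passing both results to `diff_trees`, yields the set of files a diff-based checkpoint would need to carry.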

However, when files become very large (>> 1GB) and persistence is necessary (for example, when the files themselves are outputs of the run, as in physical simulations), we recommend using volumes.

Volume Snapshotting

This is still a very early work in progress! Please reach out to us if you're planning on using this.

This alternative method requires coordination with your CSI driver. We take advantage of the volume snapshotting primitives already present in Kubernetes (https://kubernetes.io/docs/concepts/storage/volume-snapshots/) and take a reference to the snapshot with us, so when we restore, we restore from a Kubernetes Volume Snapshot.
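For orientation, the sketch below builds the two standard Kubernetes objects involved in this flow: a VolumeSnapshot taken from an existing PersistentVolumeClaim, and a new PersistentVolumeClaim that restores from that snapshot through `dataSource`. The field names follow the upstream Kubernetes snapshot API; the helper functions and every resource name, class name, and size shown are hypothetical placeholders. This shows the general primitive, not the exact objects we create on your behalf.

```python
import json


def volume_snapshot_manifest(name, pvc_name, snapshot_class, namespace="default"):
    """Build a VolumeSnapshot that captures an existing PVC."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


def restored_pvc_manifest(name, snapshot_name, storage_class, size, namespace="default"):
    """Build a PVC whose contents are provisioned from a VolumeSnapshot."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "storageClassName": storage_class,
            "dataSource": {
                "name": snapshot_name,
                "kind": "VolumeSnapshot",
                "apiGroup": "snapshot.storage.k8s.io",
            },
            "accessModes": ["ReadWriteOnce"],  # assumed access mode
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # All names below are placeholders for illustration.
    snap = volume_snapshot_manifest("sim-output-snap", "sim-output", "example-snapclass")
    pvc = restored_pvc_manifest("sim-output-restored", "sim-output-snap", "example-sc", "100Gi")
    print(json.dumps(snap, indent=2))
    print(json.dumps(pvc, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, provided your cluster has the snapshot CRDs installed and a CSI driver that supports snapshots.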
