Managing Storage
We can move the full state of your process, including both CPU and GPU state, but open files need careful consideration.
A restored or migrated process expects each file it was writing to have exactly the same size as it had at checkpoint time, so that it can pick up where it left off. We currently recommend two approaches, one for each of two filesystem-writing regimes:
Files up to roughly 1 GB: Default Behavior
Files much larger than 1 GB: Volume Snapshotting
Each scenario is described below.
Default Behavior
If you're writing files into the rootfs of the container, we just take them with us in the snapshot! We perform a diff of the filesystem and add it to our process/system-level checkpoint. This includes JIT-compiled files, intermediate files created during installation, and more.
However, when files grow much larger than 1 GB and must persist (for example, when the files are themselves outputs of the run, as in physical simulations), we recommend using volumes.
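To get a feel for how much data a rootfs diff would carry, the sketch below compares two scans of a directory and totals the size of the added or modified files. It is only an illustration of the idea, not how the checkpoint itself computes its diff: the function names `scan`, `diff`, and `diff_size` and the `LARGE_DIFF_BYTES` threshold are hypothetical, and comparing size and modification time is an assumed heuristic.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical threshold mirroring the ~1 GB guidance; not a product setting.
LARGE_DIFF_BYTES = 1 << 30


def scan(root):
    """Map each file path (relative to root) to its (size, mtime_ns)."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = Path(dirpath) / name
            st = path.lstat()  # lstat so broken symlinks do not raise
            state[str(path.relative_to(root))] = (st.st_size, st.st_mtime_ns)
    return state


def diff(before, after):
    """Return (changed, removed): files added or modified, and files deleted."""
    changed = sorted(p for p, meta in after.items() if before.get(p) != meta)
    removed = sorted(p for p in before if p not in after)
    return changed, removed


def diff_size(after, changed):
    """Total size in bytes of the added or modified files."""
    return sum(after[p][0] for p in changed)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        Path(root, "model.cache").write_bytes(b"a" * 100)
        baseline = scan(root)

        # Simulate the workload writing into the container filesystem.
        Path(root, "model.cache").write_bytes(b"a" * 250)
        Path(root, "build").mkdir()
        Path(root, "build", "kernel.o").write_bytes(b"b" * 40)

        current = scan(root)
        changed, removed = diff(baseline, current)
        total = diff_size(current, changed)
        print("changed:", changed, "removed:", removed, "bytes:", total)
        if total > LARGE_DIFF_BYTES:
            print("diff is large; consider writing these files to a volume")
```

Running a check like this against the directories your workload writes to can help decide which of the two regimes applies before you rely on the default behavior.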
Volume Snapshotting
This alternative requires coordination with your CSI driver. We take advantage of the volume snapshotting primitives already present in Kubernetes (https://kubernetes.io/docs/concepts/storage/volume-snapshots/) and keep a reference to the resulting snapshots alongside the checkpoint, so when we restore, we restore from a Kubernetes Volume Snapshot.
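For readers unfamiliar with the Kubernetes objects involved, here is a minimal sketch that builds a `VolumeSnapshot` manifest and a `PersistentVolumeClaim` that restores from it, using the fields defined by the upstream `snapshot.storage.k8s.io/v1` API. It shows the shape of the primitives only, not the objects our tooling actually creates: the helper functions, resource names, class names, access mode, and size are all hypothetical, and your cluster needs a CSI driver and a VolumeSnapshotClass that support snapshots.

```python
import json


def volume_snapshot(name, pvc_name, snapshot_class, namespace="default"):
    """Manifest for a snapshot of an existing PVC (hypothetical helper)."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


def pvc_from_snapshot(name, snapshot_name, storage_class, size,
                      namespace="default"):
    """Manifest for a new PVC pre-populated from a snapshot (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "storageClassName": storage_class,
            # dataSource tells the CSI driver to provision from the snapshot.
            "dataSource": {
                "name": snapshot_name,
                "kind": "VolumeSnapshot",
                "apiGroup": "snapshot.storage.k8s.io",
            },
            "accessModes": ["ReadWriteOnce"],  # assumed access mode
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # All names below are placeholders for illustration.
    snap = volume_snapshot("sim-output-snap", "sim-output", "csi-snapclass")
    restored = pvc_from_snapshot(
        "sim-output-restored", "sim-output-snap", "csi-storage", "50Gi"
    )
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(snap, indent=2))
    print(json.dumps(restored, indent=2))
```

The restored claim must live in the same namespace as the snapshot it references and request at least the snapshot's size.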