
Automation

Adding automation to checkpoint/restore in Kubernetes

Cedana offers two distinct types of automation to help you get the most out of checkpoint/restore: policies and integration into Kubernetes.

Policies

Through the dashboard, you can enable and configure policies that automate common tasks. One example is heartbeat checkpointing, which acts like a cron job and takes a checkpoint every n minutes.
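As a rough illustration of the heartbeat idea, the check below decides whether a checkpoint is due once n minutes have passed since the last one. The function name `heartbeat_due` and its parameters are hypothetical and are not part of Cedana's API; in practice this is configured through the dashboard.

```python
from datetime import datetime, timedelta


def heartbeat_due(last_checkpoint: datetime, now: datetime, interval_minutes: int) -> bool:
    """Return True once at least `interval_minutes` have passed since the last checkpoint."""
    return now - last_checkpoint >= timedelta(minutes=interval_minutes)


# Hypothetical example: a 10-minute heartbeat, last checkpoint taken at 12:00.
last = datetime(2024, 1, 1, 12, 0)
due_early = heartbeat_due(last, datetime(2024, 1, 1, 12, 9), 10)     # 9 minutes elapsed
due_on_time = heartbeat_due(last, datetime(2024, 1, 1, 12, 10), 10)  # 10 minutes elapsed
```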

Policies in Cedana operate through a Trigger -> Filter -> Action mechanism, allowing users to build complex automations. For example, you could build:

  • a policy that automatically fires a checkpoint if a pod exceeds 80% memory utilization in a specific namespace or deployment,

  • a policy that deletes checkpoints past a certain age,

  • a policy that takes a checkpoint when a webhook is received.
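To make the Trigger -> Filter -> Action flow concrete, here is a minimal sketch of how such a policy could be modeled. It is not Cedana's policy engine: the `Policy` class, the event fields (`memory_utilization`, `namespace`, `pod`), and the `training` namespace are all assumptions made for illustration, and the action only returns a description of what would happen.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Event = Dict[str, object]  # hypothetical event shape, e.g. a pod metrics sample


@dataclass
class Policy:
    name: str
    trigger: Callable[[Event], bool]  # did something happen that might need action?
    filter: Callable[[Event], bool]   # does the event match the scope of this policy?
    action: Callable[[Event], str]    # what to do; here it only describes the action

    def evaluate(self, event: Event) -> Optional[str]:
        """Run the action only if both the trigger and the filter accept the event."""
        if self.trigger(event) and self.filter(event):
            return self.action(event)
        return None


# Hypothetical policy: checkpoint pods above 80% memory in the "training" namespace.
memory_policy = Policy(
    name="checkpoint-on-high-memory",
    trigger=lambda e: e["memory_utilization"] > 0.80,
    filter=lambda e: e["namespace"] == "training",
    action=lambda e: f"checkpoint {e['namespace']}/{e['pod']}",
)

events = [
    {"pod": "worker-0", "namespace": "training", "memory_utilization": 0.91},
    {"pod": "worker-1", "namespace": "training", "memory_utilization": 0.42},
    {"pod": "web-0", "namespace": "frontend", "memory_utilization": 0.95},
]
results = [memory_policy.evaluate(e) for e in events]
```

Only the first event passes both stages: the second never fires the trigger, and the third fires it but is rejected by the namespace filter.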

We aim for the policy engine to be extensible, so we value feedback and feature requests! However, policies work best when integrated into Kubernetes, as the next section illustrates.

Integration into Kubernetes

The second type of automation is our deep integration into Kubernetes. You are likely to use tools such as Armada, Kueue, KServe, Dynamo, etc., which come with their own opinions about container lifecycle, management, preemption, and more.

Because we work seamlessly with the scheduler, Cedana automatically applies the checkpoint if a pod that previously had a checkpoint taken auto-scales. For example, if you use a tool like Karpenter, Cedana will automatically restore from a checkpoint when the underlying node is preempted and a new node is spun up, instead of starting from scratch.
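The decision described above can be sketched as a simple lookup: if a workload has a checkpoint, restore from it; otherwise start from scratch. This is only an illustration under assumptions. The `Checkpoint` type, the in-memory `store`, the workload names, and the choice of the most recent checkpoint are hypothetical and do not reflect how Cedana tracks or selects checkpoints internally.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Checkpoint:
    checkpoint_id: str
    created_at: int  # seconds since epoch


def latest_checkpoint(workload: str, store: Dict[str, List[Checkpoint]]) -> Optional[Checkpoint]:
    """Return the most recent checkpoint for a workload, or None if it has none."""
    candidates = store.get(workload, [])
    return max(candidates, key=lambda c: c.created_at, default=None)


def plan_startup(workload: str, store: Dict[str, List[Checkpoint]]) -> str:
    """Describe how a rescheduled pod should start: restore if possible, else cold start."""
    checkpoint = latest_checkpoint(workload, store)
    if checkpoint is None:
        return f"cold start {workload}"
    return f"restore {workload} from {checkpoint.checkpoint_id}"


# Hypothetical state: one workload has two checkpoints, the other has none.
store = {
    "trainer": [Checkpoint("ckpt-a", 1000), Checkpoint("ckpt-b", 2000)],
}
plan_with_checkpoint = plan_startup("trainer", store)
plan_without_checkpoint = plan_startup("batch-job", store)
```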

Our goal is to be as painless as possible to integrate into your existing tooling, and to simply work with schedulers like the ones listed above. The best demonstration of this can be found in Jobs, which can be applied more or less universally to Kubernetes kinds.

See our Managed Dynamo for an example with a more complex Kubernetes deployment.
