Cedana
Cedana Docs

API

The Kubernetes controller exposes API endpoints, so once it is set up you can use the service to perform checkpoint and restore.

Checkpoint/Restore - REST Service

The Cedana REST Service provides a REST API for checkpointing and restoring containers in your Kubernetes cluster. The API runs alongside the Cedana Controller. Below are curl commands illustrating the schema of the API. All of them use the in-cluster IP of the cedanacontroller pod. To checkpoint and restore from outside the cluster, you can expose the pod and create an external IP address with a Kubernetes Service, or forward port 1324 to your local machine with kubectl port-forward:

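# Name of the controller pod (any pod in the namespace whose name contains "manager")
export CEDANA_CONTROLLER=$(kubectl get pods -n "$CEDANA_NAMESPACE" | grep manager | awk '{print $1}')
# Helm release name; it is "cedana" if you installed with: helm install "cedana" $CHART_PATH
export HELM_INSTALL_NAME="cedana"

# Forward local port 1324 to port 1324 on the controller
kubectl port-forward "$HELM_INSTALL_NAME-cedana-helm-manager" -n "$CEDANA_NAMESPACE" 1324:1324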
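
The grep manager lookup above can return more than one name, or a pod that is not running yet. The following is a minimal sketch of a stricter selection, not part of Cedana itself. It assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...) and that the controller pod name contains "manager", as in the command above. The helper name pick_manager_pod is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# the name of the first pod that contains "manager" and is in Running state.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
pick_manager_pod() {
  # NR > 1 skips the header row; $1 is NAME, $3 is STATUS.
  awk 'NR > 1 && $1 ~ /manager/ && $3 == "Running" { print $1; exit }'
}

# Possible usage (requires access to the cluster):
#   export CEDANA_CONTROLLER=$(kubectl get pods -n "$CEDANA_NAMESPACE" | pick_manager_pod)
#   kubectl port-forward "pod/$CEDANA_CONTROLLER" -n "$CEDANA_NAMESPACE" 1324:1324
```

If the helper prints nothing, no matching pod is running yet, and the port-forward should not be attempted.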

Known Limitations and Pitfalls

  • Checkpoint/restore of the io_uring API is not currently supported. Consider checkpointing before the process creates and sets up any urings.

  • The checkpointing services do not automatically detect the environment and change their behaviour. For CRIO and Rootfs, for example, use the separately provided API endpoints.

  • Restore requires that the pod being restored into is not active. This generally means you should put the restore pod to sleep with a custom command and args, for example: while true; do sleep infinity; done;

  • Cedana does not work on machine images (such as AMIs) that are read-only. For example, the GKE optimized images won't work.
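
The sleeping restore pod described in the list above could be set up roughly as follows. This is a sketch under assumptions rather than the project's recommended manifest: the helper name restore_pod_manifest, the pod name and the image are hypothetical, and the image is assumed to provide /bin/sh and a sleep that accepts "infinity".

```shell
# Hypothetical helper: prints a Pod manifest whose single container idles
# forever instead of starting the application, so the pod is not active
# when a restore targets it.
#   $1 = pod (and container) name, $2 = container image
restore_pod_manifest() {
  name="$1"
  image="$2"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  containers:
    - name: ${name}
      image: ${image}
      command: ["/bin/sh", "-c"]
      args: ["while true; do sleep infinity; done;"]
EOF
}

# Possible usage (requires access to the cluster; names are placeholders):
#   restore_pod_manifest restore-target busybox:1.36 | kubectl apply -n "$CEDANA_NAMESPACE" -f -
```

Overriding command and args this way replaces the image's own entrypoint, which is what keeps the application from starting before the restore.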

Last updated 2 months ago
