Level 1 - Simple GPU C/R

A walkthrough of checkpointing and restoring a GPU workload.

Creating a Workload

You can run any GPU workload, but for ease of use, we've created some samples you can use! You can either deploy from the UI (from the library):

or by navigating to our samples repo and deploying one of the many (frequently updated!) YAML manifests: https://github.com/cedana/cedana-samples/tree/main/kubernetes.
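If you prefer the command line, the samples can also be deployed with kubectl. A minimal sketch, assuming kubectl is configured against your cluster; the manifest name is a placeholder, so pick any YAML from the repo's kubernetes/ directory:

```shell
# Clone the samples repo (frequently updated, so pull for the latest).
git clone https://github.com/cedana/cedana-samples.git

# Deploy one of the GPU samples; replace <sample>.yaml with any
# manifest from the kubernetes/ directory.
kubectl apply -f cedana-samples/kubernetes/<sample>.yaml
```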

Use k9s, kubectl, or the UI (Monitoring -> Pods) to check the status of the job.
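From the command line, a quick way to follow the job coming up (the pod name below is a placeholder):

```shell
# Watch pods until the workload reaches Running and passes its
# readiness probe.
kubectl get pods -w

# Inspect a specific pod for events (image pulls, probe failures, etc.).
kubectl describe pod <workload-pod>
```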

Checkpointing

You can perform a manual checkpoint through the UI:

Note the start time here: it reflects the inference cold start, since this workload has a readiness probe, as mentioned in the hint above.

Restoring

Once the checkpoint is complete (you can also verify this by checking the cedana-helper logs on the node the workload was running on), you can restore it manually by navigating to the Checkpoints page.
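One way to check the helper logs from the command line. This is a sketch: it assumes the cedana-helper pods run in a dedicated namespace (shown here as cedana-system; adjust to your install), and the pod names are placeholders:

```shell
# Find the node the workload was running on.
NODE=$(kubectl get pod <workload-pod> -o jsonpath='{.spec.nodeName}')

# List the cedana-helper pod scheduled on that node.
kubectl get pods -n cedana-system -o wide \
  --field-selector spec.nodeName="$NODE"

# Follow its logs to confirm the checkpoint completed.
kubectl logs -n cedana-system <cedana-helper-pod> -f
```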

You can also choose which cluster to restore onto - demonstrating cross-cluster C/R!

Once restored, you can navigate back to the pod view to see all restored workloads with the restore tag:

And just like that, the workload's been restored! A couple of things are worth noticing here:

  • The restored pod is running on another instance: xxx-224.ec2 vs xxx-163.ec2.

  • The restored pod starts about twice as fast as the original cold start. You can see even larger gains depending on your infrastructure: this example pushed/pulled checkpoints to a Google Cloud Storage bucket, but a quick win (explored further in Level 3 - Customization) is to switch to your own S3 bucket.

Here's a quick video summarizing the above:

From here, you can experiment with more workloads from our library, or move on to Level 2 - Automation.
