Managing Kubernetes Jobs
The right question is: what schedulers don't we support?
Cedana can checkpoint, migrate, and restore arbitrary Kubernetes pods. Things become more complicated when you bring your own CRDs, use the standard Kubernetes Job, or use a scheduler like Armada, Kueue, or Volcano.
Fortunately, Cedana has a very simple integration that makes working with most of these systems as seamless as possible.
Simply add the CEDANA_CHECKPOINT env to the spec of the container you're trying to checkpoint, and you're set!
Here's an example YAML manifest for a standard Kubernetes Job:
apiVersion: batch/v1
kind: Job
metadata:
  name: cuda-vector-add-job
  namespace: default
spec:
  backoffLimit: 1
  completions: 1
  parallelism: 1
  completionMode: NonIndexed
  manualSelector: false
  podReplacementPolicy: TerminatingOrFailed
  template:
    metadata:
      labels:
        job-name: cuda-vector-add-job
    spec:
      restartPolicy: Never
      runtimeClassName: cedana
      priorityClassName: indiv-priority
      volumes:
        - name: repo-data
          emptyDir: {}
      initContainers:
        - name: clone-repo
          image: alpine/git:latest
          command:
            - sh
            - -c
            - |
              echo "Init ran at $(date)" >> /workspace/init-log.txt
              git clone https://github.com/mirror/busybox.git /workspace || echo "Clone failed"
          volumeMounts:
            - name: repo-data
              mountPath: /workspace
      containers:
        - name: cuda-vector-add
          image: cedana/cedana-samples:cuda
          command:
            - /bin/sh
            - -c
            - |
              gpu_smr/vector_add
          env:
            - name: CEDANA_CHECKPOINT
              value: job-preemption-test-3
          resources:
            limits:
              nvidia.com/gpu: 1 # request one GPU
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: repo-data
              mountPath: /workspace
This comes with an initContainer and a volumeMount, which we have no issues supporting. Notice the env for the container we do wish to checkpoint/migrate/resume:
env:
  - name: CEDANA_CHECKPOINT
    value: job-preemption-test-3
Once applied, every subsequent application of that YAML will start the container from the latest snapshot taken for that ID.
This lines up with how Kubernetes Jobs work under the hood: when a pod fails before the Job is complete, the Job restarts it from the same template. So if the job fails for whatever reason (preemption, node failure, etc.), it'll pick up from where it left off, as long as a checkpoint has been taken with a unique ID.
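The same env should carry over to a Job handed to one of the schedulers mentioned above. Here's a minimal sketch for Kueue, assuming a LocalQueue already exists in the namespace. The Job name, the queue name gpu-queue, and the checkpoint ID are hypothetical placeholders, and we're assuming a Kueue-admitted Job behaves like the standard Job shown earlier; the kueue.x-k8s.io/queue-name label and suspend field are standard Kueue/Kubernetes usage.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: cuda-vector-add-kueue            # hypothetical name
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue # hypothetical LocalQueue; must exist in this namespace
spec:
  suspend: true                          # Kueue unsuspends the Job once it is admitted
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      runtimeClassName: cedana           # same runtime class as the standard Job example
      containers:
        - name: cuda-vector-add
          image: cedana/cedana-samples:cuda
          command:
            - /bin/sh
            - -c
            - gpu_smr/vector_add
          env:
            - name: CEDANA_CHECKPOINT
              value: kueue-vector-add-1  # hypothetical ID; keep it unique per workload
          resources:
            limits:
              nvidia.com/gpu: 1
```

Only the container carrying CEDANA_CHECKPOINT is marked for checkpointing; everything else in the manifest is ordinary Job and Kueue configuration.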