Managing Kubernetes Jobs

The right question is: what schedulers don't we support?

While Cedana can checkpoint/migrate/restore arbitrary Kubernetes pods, things get more complicated if you bring your own CRDs, use the standard Kubernetes Job, or use a scheduler like Armada, Kueue, or Volcano.

Fortunately, Cedana has a simple integration that makes working with most of these systems as seamless as possible.

Simply add the CEDANA_CHECKPOINT environment variable to the spec of the container you want to checkpoint, and you're set!

Here's an example YAML for a standard Kubernetes Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: cuda-vector-add-job
  namespace: default
spec:
  backoffLimit: 1
  completions: 1
  parallelism: 1
  completionMode: NonIndexed
  manualSelector: false
  podReplacementPolicy: TerminatingOrFailed
  template:
    metadata:
      labels:
        job-name: cuda-vector-add-job
    spec:
      restartPolicy: Never
      runtimeClassName: cedana
      priorityClassName: indiv-priority
      volumes:
        - name: repo-data
          emptyDir: {}
      initContainers:
        - name: clone-repo
          image: alpine/git:latest
          command:
            - sh
            - -c
            - |
              echo "Init ran at $(date)" >> /workspace/init-log.txt
              git clone https://github.com/mirror/busybox.git /workspace || echo "Clone failed"
          volumeMounts:
            - name: repo-data
              mountPath: /workspace
      containers:
        - name: cuda-vector-add
          image: cedana/cedana-samples:cuda
          command:
            - /bin/sh
            - -c
            - |
              gpu_smr/vector_add
          env: 
            - name: CEDANA_CHECKPOINT 
              value: job-preemption-test-3
          resources:
            limits:
              nvidia.com/gpu: 1   # request one GPU
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: repo-data
              mountPath: /workspace

This Job comes with an initContainer and a volumeMount, both of which we have no issues supporting. Notice the env for the container we wish to checkpoint/migrate/resume:

env:
  - name: CEDANA_CHECKPOINT
    value: job-preemption-test-3

Once applied, every subsequent application of that YAML will start the container from the latest snapshot taken for that ID.
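
For example, the re-run flow might look like the following (a minimal sketch; the file name cuda-vector-add-job.yaml is a placeholder, and it assumes a checkpoint has already been taken for the ID job-preemption-test-3):

# First run: starts fresh, since no snapshot exists yet for this ID
kubectl apply -f cuda-vector-add-job.yaml

# ... a checkpoint is taken for job-preemption-test-3 ...

# Remove the finished (or failed) Job object
kubectl delete job cuda-vector-add-job

# Second run: the container starts from the latest snapshot
# associated with CEDANA_CHECKPOINT=job-preemption-test-3
kubectl apply -f cuda-vector-add-job.yaml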

Since this is how Kubernetes Jobs work under the hood (the Job controller replaces pods until the job completes), a job that fails for any reason (preemption, node failure, etc.) will pick up from where it left off, as long as a checkpoint has been taken with a unique ID.
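
As a rough sketch of that failure path (pod names are illustrative, and it again assumes a checkpoint exists for job-preemption-test-3), you can simulate a preemption or node failure by deleting the running pod and watching the Job controller bring up a replacement:

# Find the running pod created by the Job
kubectl get pods -l job-name=cuda-vector-add-job

# Simulate preemption / node failure by deleting the pod
kubectl delete pod <pod-name-from-previous-command>

# The Job controller creates a replacement pod; because CEDANA_CHECKPOINT
# is set, it resumes from the latest snapshot for job-preemption-test-3
# instead of starting over
kubectl get pods -l job-name=cuda-vector-add-job -w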
