LLaMA Inference GPU Save, Migrate & Resume (SMR)

Performing an SMR of a running LLaMA inference task using HuggingFace weights.


Save, Migrate and Resume (SMR) has enormous value for inference workloads, especially when you're sensitive to cold starts. Because a save collects pretty much everything sitting in VRAM at the time, you can bypass all the initialization time and model optimizations that PyTorch or other inference engines (like vLLM or transformers-inference) do, and start inference workloads significantly faster.

Here's some benchmarking data to support that, comparing time to first token (TTFT) for a native start against restoring from a Cedana checkpoint.

Cedana is consistently faster, and the advantage generally grows with model size, which tracks with the amount of work an inference engine has to do to prepare weights for inference.
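To make the comparison concrete, here's a rough sketch of how a "native start" TTFT measurement can be taken with the transformers streaming API. The model ID, prompt, and timing harness are illustrative, not the exact benchmark setup behind the numbers above.

# Illustrative TTFT measurement for a native start: time from process start
# until the first generated token arrives. Not the exact benchmark harness.
import time
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

start = time.perf_counter()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype="auto",
).cuda()

inputs = tokenizer("some prompt", return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

# Generate in a background thread so we can time the first streamed token.
Thread(
    target=model.generate,
    kwargs={**inputs, "max_new_tokens": 16, "streamer": streamer},
).start()

next(iter(streamer))  # blocks until the first token has been decoded
print(f"TTFT (native start): {time.perf_counter() - start:.2f}s")

For a restore, the equivalent measurement is simply the time from issuing the restore until the workload prints its next output, since the model is already loaded and optimized in the saved VRAM state.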

Setup

Follow the instructions in installation to get started with Cedana locally. Optionally, take a look at checkpoint/restore with GPUs to get an idea of how to use it.

Running LLaMA 8B

A simple Python script that uses Llama-3.1-8B would look like:

#!/usr/bin/env python3

from transformers import AutoModelForCausalLM, AutoTokenizer


# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype="auto",
)
model.cuda()

# Simple inference loop; in a real workload the prompt would come from user input or a request
while True:
    user_input = "some prompt"

    # Tokenize input
    inputs = tokenizer(user_input, return_tensors="pt").to(model.device)

    # Generate tokens
    tokens = model.generate(
        **inputs,
        max_new_tokens=64,
        temperature=0.70,
        top_p=0.95,
        do_sample=True,
    )

    output = tokenizer.decode(tokens[0], skip_special_tokens=True)
    print(f"Generated Output:\n{output}")

This will download the weights from HuggingFace the first time, so make sure the weights download correctly by first running the script on its own (with python3 llama.py). You can find more workloads you can test with here!
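If you'd rather cache the weights ahead of time (for example, when baking a node image), a small pre-fetch step with the huggingface_hub library works. The snippet below is a sketch and assumes the huggingface_hub package is installed and your account has access to the gated meta-llama repo.

# Optional pre-fetch sketch: cache the model weights locally before running
# the workload under Cedana. Assumes `pip install huggingface_hub` and that
# your HuggingFace account has access to the gated meta-llama repo.
from huggingface_hub import snapshot_download

# Downloads into the local HuggingFace cache (~/.cache/huggingface by default),
# which transformers will reuse when the script loads the model.
snapshot_download("meta-llama/Llama-3.1-8B")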

Save

Start your workload with:

cedana run process -ga -j llama_inference -- python3 -u llama.py 

To perform a save, it's as simple as:

cedana dump job llama_inference --compression=none

Use cedana dump job --help to see additional options (such as different compression schemes). You can also use cedana ps to inspect the running workload.

Resume

To restore from a previously taken save:

cedana restore job llama_inference -a

You can restore from the same save indefinitely, so you can think of the checkpoint file as storing the model weights (as they're represented in VRAM) together with all of the CPU state that coordinates execution.

A pretty powerful pattern some customers use is treating a save as a stand-in for the model weights entirely: instead of loading model weights, you just resume from the file, everywhere you need inference!
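As a rough illustration of that pattern, the sketch below drives the same CLI commands shown on this page from Python. The job name and script path match the examples above; nothing beyond those documented commands is assumed, and the waiting step is a crude placeholder.

# Sketch only: orchestrating the documented SMR commands with subprocess.
import subprocess

JOB = "llama_inference"

# 1. Start the workload under Cedana (same command as in the Save section).
#    Popen is used because the inference loop runs indefinitely.
subprocess.Popen(
    ["cedana", "run", "process", "-ga", "-j", JOB, "--", "python3", "-u", "llama.py"]
)

input("Press Enter once the model has loaded and is generating output...")

# 2. Take a save of the running job.
subprocess.run(["cedana", "dump", "job", JOB, "--compression=none"], check=True)

# 3. Resume from that save. This can be repeated wherever inference is needed,
#    standing in for loading the weights from scratch.
subprocess.Popen(["cedana", "restore", "job", JOB, "-a"])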
