LLaMA Inference GPU Save, Migrate & Resume (SMR)

Performing an SMR of a running LLaMA inference task using HuggingFace weights.

Save, Migrate and Resume (SMR) has enormous value for inference workloads, especially when you're sensitive to cold starts. Because a save captures pretty much everything sitting in VRAM at the time, a resume can bypass the initialization and model optimization work that PyTorch or other inference engines (like vLLM or transformers-inference) do at startup, so inference workloads start significantly faster.

Here's some benchmarking data to support that, which compares time to first token (TTFT) for a native start vs restoring from our checkpoint:

Cedana is consistently faster, and the advantage generally grows for larger models, which tracks with the amount of work an inference engine needs to do to prepare weights for inference.
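TTFT here means the wall-clock time from starting a workload until its first generated token appears. A minimal sketch of that measurement is below. The names `time_to_first_token` and `fake_token_stream` are hypothetical, the fake stream stands in for a real engine, and this is not the harness used for the benchmark above.

```python
import time
from typing import Callable, Iterable, Tuple


def time_to_first_token(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, float]:
    """Return the first item of `stream` and the seconds spent waiting for it."""
    start = clock()
    first = next(iter(stream))  # blocks until the first token is produced
    return first, clock() - start


def fake_token_stream(startup_delay_s: float):
    """Hypothetical stand-in for an engine: sleep to mimic startup, then yield tokens."""
    time.sleep(startup_delay_s)
    yield "Hello"
    yield " world"


if __name__ == "__main__":
    token, ttft = time_to_first_token(fake_token_stream(0.05))
    print(f"first token {token!r} after {ttft:.3f}s")
```

For a fair comparison, the timer should start at the same point in both cases: when the process is launched for a native start, and when the restore is issued for a checkpoint.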

Setup

Follow the instructions in installation to get started with Cedana locally. Optionally, take a look at checkpoint/restore with GPUs to get an idea of how to use it.

Running LLaMA 8B

A simple Python script that uses llama-3.1-8b would look like:

#!/usr/bin/env python3

from transformers import AutoModelForCausalLM, AutoTokenizer


# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype="auto",
)
model.cuda()

while True:
    user_input = "some prompt"

    # Tokenize input
    inputs = tokenizer(user_input, return_tensors="pt").to(model.device)

    # Generate tokens
    tokens = model.generate(
        **inputs,
        max_new_tokens=64,
        temperature=0.70,
        top_p=0.95,
        do_sample=True,
    )

    output = tokenizer.decode(tokens[0], skip_special_tokens=True)
    print(f"Generated Output:\n{output}")

This will download weights from HuggingFace the first time, so we recommend running the script on its own first (with python3 llama_inference.py) to make sure the weights download correctly. You can find more workloads you can test with here!
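If you'd like to check whether the weights are already on disk before starting the workload under Cedana, a rough sketch follows. It assumes the default Hugging Face hub cache layout (`models--<org>--<name>/snapshots/...` under `~/.cache/huggingface/hub`, overridable with `HF_HOME` or `HF_HUB_CACHE`). The helper names are hypothetical, and the check only confirms that some snapshot file exists, not that the download is complete.

```python
import os
from pathlib import Path
from typing import Optional


def hub_cache_root() -> Path:
    """Locate the hub cache, honoring HF_HUB_CACHE and HF_HOME if set."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"])
    default_home = str(Path.home() / ".cache" / "huggingface")
    return Path(os.environ.get("HF_HOME", default_home)) / "hub"


def weights_look_cached(repo_id: str, cache_root: Optional[Path] = None) -> bool:
    """True if at least one file exists under the repo's snapshots directory."""
    root = cache_root if cache_root is not None else hub_cache_root()
    # Assumed layout: "meta-llama/Llama-3.1-8B" -> "models--meta-llama--Llama-3.1-8B"
    snapshots = root / ("models--" + repo_id.replace("/", "--")) / "snapshots"
    if not snapshots.is_dir():
        return False
    return any(p.is_file() or p.is_symlink() for p in snapshots.rglob("*"))


if __name__ == "__main__":
    print(weights_look_cached("meta-llama/Llama-3.1-8B"))
```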

Save

Start your workload with:

cedana run process -ga -j llama_inference -- python3 -u llama_inference.py

Performing a save is as simple as:

cedana dump job llama_inference --compression=none

Use cedana dump job --help to see additional options (such as different compression schemes). You can also use cedana ps to inspect the running workload.

Resume

To restore from a previously taken save:

cedana restore job llama_inference -a

You can restore from the same save indefinitely, so you can think of the file as storing the model weights (as they're represented in VRAM) along with all the CPU state that coordinates execution.

A pretty powerful pattern some customers use is treating a save as a proxy for the model weights entirely: instead of loading model weights, you just resume from the file, everywhere you need inference!
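If you want to drive the save and resume steps from a script rather than by hand, one way to sketch it is to wrap the commands shown on this page with Python's subprocess module. The wrapper functions below are hypothetical helpers, not part of Cedana. They use only the subcommands and flags that appear above, and they default to a dry run that returns the command line without executing it.

```python
import shlex
import subprocess
from typing import List

JOB_ID = "llama_inference"  # job name used in the commands above


def run_cmd(job_id: str, workload: List[str]) -> List[str]:
    """Command to start a workload as a named Cedana job."""
    return ["cedana", "run", "process", "-ga", "-j", job_id, "--", *workload]


def dump_cmd(job_id: str) -> List[str]:
    """Command to save the running job."""
    return ["cedana", "dump", "job", job_id, "--compression=none"]


def restore_cmd(job_id: str) -> List[str]:
    """Command to resume the job from its save."""
    return ["cedana", "restore", "job", job_id, "-a"]


def execute(cmd: List[str], dry_run: bool = True) -> str:
    """Return the shell-quoted command line; run it only when dry_run is False."""
    line = shlex.join(cmd)
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return line


if __name__ == "__main__":
    for cmd in (dump_cmd(JOB_ID), restore_cmd(JOB_ID)):
        print(execute(cmd))  # dry run: prints the commands without running them
```

Keeping the command construction separate from execution makes it easy to log or review exactly what will run on each machine before passing dry_run=False.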
