# LLaMA Inference GPU Save, Migrate & Resume (SMR)

Save, Migrate and Resume (SMR) is especially valuable for inference workloads that are sensitive to cold starts. A save captures pretty much everything sitting in VRAM at that moment, so a resume bypasses the initialization and model optimizations that PyTorch or other inference engines (like vLLM or transformers-inference) perform at startup, and inference begins significantly sooner.

Here's some benchmarking data to support that, comparing time to first token (TTFT) for a native start against restoring from our checkpoint:

<figure><img src="/files/XXRcWdkAQCkKdGeKABtu" alt=""><figcaption></figcaption></figure>

Cedana is consistently faster, and the advantage generally grows with model size, which tracks with the amount of work an inference engine has to do to prepare weights for inference.
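If you want to run a similar comparison on your own hardware, one rough approach is to time how long a token stream takes to yield its first item. This is only a sketch and not how the benchmark above was produced: `time_to_first_token` and `fake_stream` are hypothetical names, and the stream stands in for whatever iterator of tokens your inference engine exposes.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def time_to_first_token(stream: Iterable[T]) -> Tuple[float, T, Iterator[T]]:
    """Return (seconds until first item, first item, iterator over the rest)."""
    start = time.perf_counter()
    iterator = iter(stream)
    first = next(iterator)  # blocks until the first token is available
    elapsed = time.perf_counter() - start
    return elapsed, first, iterator


def fake_stream(delay_s: float):
    """Hypothetical stand-in for a model: waits, then yields tokens."""
    time.sleep(delay_s)  # simulates model load + prefill before the first token
    yield "Hello"
    yield " world"


if __name__ == "__main__":
    ttft, first, rest = time_to_first_token(fake_stream(0.05))
    print(f"TTFT: {ttft:.3f}s, first token: {first!r}, rest: {list(rest)!r}")
```

For an end-to-end number, start the clock before launching or restoring the process rather than inside it, so that process startup and model loading are included in the measurement.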

## Setup

Follow the instructions in [installation](/daemon/get-started/installation.md) to get started with Cedana locally. Optionally, take a look at [checkpoint/restore with GPUs](broken://spaces/Su8hW4oAhjiIohf3AFfl/pages/ieK7Li9lPJmOK4Jjfpem) to get an idea of how to use it.

## Running LLaMA 8B

A simple Python script that uses `llama-3.1-8b` would look like:

```python
#!/usr/bin/env python3

from transformers import AutoModelForCausalLM, AutoTokenizer


# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype="auto",
)
model.cuda()

# Generate in a loop so the process keeps running and can be saved at any point
while True:
    user_input = "some prompt"

    # Tokenize input
    inputs = tokenizer(user_input, return_tensors="pt").to(model.device)

    # Generate tokens
    tokens = model.generate(
        **inputs,
        max_new_tokens=64,
        temperature=0.70,
        top_p=0.95,
        do_sample=True,
    )

    output = tokenizer.decode(tokens[0], skip_special_tokens=True)
    print(f"Generated Output:\n{output}")

```

This downloads weights from Hugging Face the first time it runs, so make sure the weights download correctly by running the script on its own first (with `python3 llama_inference.py`). You can find more workloads to test with [here](https://github.com/cedana/cedana-samples/)!

## Save

Start your workload with:

```bash
cedana run process -ga -j llama_inference -- python3 -u llama_inference.py
```

Saving the running job is as simple as:

```bash
cedana dump job llama_inference --compression=none
```

Use `cedana dump job --help` to see additional options (such as different compression schemes). You can also use `cedana ps` to inspect the running workload.
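If you would rather drive these steps from a script, the commands above can be assembled in Python. This is a minimal sketch: `run_command`, `dump_command` and `execute` are hypothetical helpers, the flags are copied verbatim from the commands in this guide, and the default is a dry run that only prints what would be executed.

```python
import shlex
import subprocess
from typing import List, Sequence


def run_command(job_id: str, workload: Sequence[str]) -> List[str]:
    """Build the `cedana run process` invocation shown in this guide."""
    return ["cedana", "run", "process", "-ga", "-j", job_id, "--", *workload]


def dump_command(job_id: str, compression: str = "none") -> List[str]:
    """Build the `cedana dump job` invocation shown in this guide."""
    return ["cedana", "dump", "job", job_id, f"--compression={compression}"]


def execute(cmd: Sequence[str], dry_run: bool = True) -> int:
    """Print the command; only run it when dry_run is False."""
    print(shlex.join(cmd))
    if dry_run:
        return 0
    return subprocess.run(list(cmd), check=False).returncode


if __name__ == "__main__":
    job = "llama_inference"
    execute(run_command(job, ["python3", "-u", "llama_inference.py"]))
    execute(dump_command(job))
```

Keeping command construction separate from execution makes it easy to log or review exactly what will run before passing `dry_run=False`.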

## Resume

To restore from a previously taken save:

```bash
cedana restore job llama_inference -a
```

You can restore from the same save indefinitely, so you can think of the file as storing the model weights (as they're represented in VRAM) together with everything in CPU state that coordinates execution.

A pretty powerful pattern some customers use is treating a save as a proxy for the model weights entirely: instead of loading model weights, you just resume from the file everywhere you need inference!
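One way to picture the pattern is a launcher that tries to resume from a save first and only falls back to a cold start if that fails. The sketch below is illustrative rather than a Cedana-provided API: `start_inference` and the `runner` callback are hypothetical, and it assumes that `cedana restore` exits with a non-zero status when it cannot restore and that the job ID can be reused for a fresh run.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def start_inference(job_id: str, cold_start: Sequence[str], runner: Runner = _run) -> str:
    """Prefer resuming from a save; fall back to starting the workload fresh.

    Assumption: a non-zero exit code from `cedana restore` means no usable save.
    """
    restore = ["cedana", "restore", "job", job_id, "-a"]
    if runner(restore) == 0:
        return "restored"
    fallback = ["cedana", "run", "process", "-ga", "-j", job_id, "--", *cold_start]
    if runner(fallback) == 0:
        return "cold-started"
    raise RuntimeError(f"could not start job {job_id!r}")


if __name__ == "__main__":
    # Fake runner so the sketch can be tried without Cedana installed:
    # it pretends the restore fails and the cold start succeeds.
    def fake_runner(cmd: Sequence[str]) -> int:
        print("would run:", " ".join(cmd))
        return 1 if "restore" in cmd else 0

    print(start_inference("llama_inference", ["python3", "-u", "llama_inference.py"], fake_runner))
```

Injecting the runner keeps the decision logic testable without a GPU; in a real deployment you would likely also want to distinguish "no save found" from other restore failures before falling back.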

