LLaMA Inference GPU SMR
Performing an SMR of a running LLaMA inference task using HuggingFace weights.
SMR has a ton of value for running inference workloads, especially when you're sensitive to cold starts. Because a save captures pretty much everything sitting in VRAM at the time, restoring lets you bypass all the initialization time and model optimizations that PyTorch or other inference engines (like vLLM or transformers-inference) perform, and start inference workloads significantly faster.
Here's some benchmarking data to support that, comparing time to first token (TTFT) for a native start versus restoring from our checkpoint:
Cedana is consistently faster, and the advantage generally grows with model size, which tracks with the amount of work an inference engine needs to do to prepare weights for inference.
Follow the instructions in authentication, and then in GPU setup to set up an instance with Cedana.
A simple Python script that uses llama-3.1-8b would look like:
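(This is a minimal sketch using the transformers library; the exact model ID, prompt, and generation settings are illustrative.)

```python
# llama_inference.py
# Minimal LLaMA 3.1 8B inference with HuggingFace transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative; use any LLaMA repo you have access to

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16).to("cuda")
model.eval()

prompt = "Explain GPU checkpointing in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate in a loop so there's a long-running process to save and restore from.
while True:
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```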
This will download weights from HuggingFace the first time, so make sure the weights download correctly by first running the script on its own (with python3 llama_inference.py).
Start your workload with:
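(A sketch, assuming a job started through the Cedana CLI with GPU interception enabled; the job ID is arbitrary, and exact subcommands and flags can vary by version, so check the CLI reference or cedana --help.)

```bash
# Illustrative: run the script as a Cedana-managed job with GPU support.
# Subcommand names and flags may differ by version.
cedana run process --jid llama-inference --gpu-enabled -- python3 llama_inference.py
```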
To perform a save, it's as simple as:
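(Again a sketch; the checkpoint directory and flags are illustrative.)

```bash
# Illustrative: checkpoint the running job, including everything resident in VRAM.
cedana dump job llama-inference -d /tmp/ckpt
```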
To restore from a previously taken save:
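(Illustrative; the job ID matches the one used above.)

```bash
# Illustrative: restore the job from its previous save and resume generation.
cedana restore job llama-inference
```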
You can restore from the same save indefinitely, so you can think of the checkpoint file as storing the model weights (as they're represented in VRAM) along with all of the CPU state that coordinates execution.
A pretty powerful way some customers are using us is treating a save as a stand-in for the model weights entirely: instead of loading model weights, you just resume from the file, everywhere you need inference!
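As a sketch of that pattern (the job ID and hostname are illustrative, and exact CLI syntax may vary), the same save can be restored wherever you need a warm inference process, instead of re-downloading and re-loading weights on each node:

```bash
# Illustrative: fan out warm replicas from a single save instead of loading weights per node.
cedana restore job llama-inference                    # replica on this node
ssh gpu-node-2 cedana restore job llama-inference     # replica on another node, assuming it can read the same save
```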