# Cedana + Dynamo for Inference at Scale

#### Why Cedana + Dynamo?

NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) is a state-of-the-art framework for serving LLM inference at scale. It accelerates inference workloads through advanced optimization techniques and dynamic scheduling.

Dynamo's architecture is built on disaggregating the prefill and decode phases, so each can be scaled independently on the hardware best suited to its compute or memory profile.

Its core components include:

* a Smart Router that uses radix tree indexing to track KV-cache locality across the cluster, ensuring requests land on workers that already hold the relevant context
* the KV Block Manager (KVBM), which offloads large caches across a storage hierarchy of host DRAM, SSDs, and network storage
* the NVIDIA Inference Transfer Library (NIXL), which enables zero-copy, asynchronous KV-cache migration at near-memory speeds

And more! See the [Core Capabilities](https://github.com/ai-dynamo/dynamo?tab=readme-ov-file#core-capabilities) section of the Dynamo README for the full feature list.
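To make the KV-cache-aware routing idea concrete, here is a minimal sketch of prefix-based worker selection. It is not Dynamo's implementation: it uses a plain (uncompressed) trie over fixed-size token blocks rather than a true radix tree, it ignores worker load, and `PrefixRouter`, `record`, `route`, and the worker names are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One cached token block; `workers` are the workers holding it."""
    children: dict = field(default_factory=dict)
    workers: set = field(default_factory=set)


class PrefixRouter:
    """Hypothetical router: send a request to the worker that already
    caches the longest prefix of its tokens."""

    def __init__(self, block_size=4):
        self.root = Node()
        self.block_size = block_size

    def _blocks(self, tokens):
        # Only complete blocks are indexed; a trailing partial block is ignored.
        usable = len(tokens) - len(tokens) % self.block_size
        return [tuple(tokens[i:i + self.block_size])
                for i in range(0, usable, self.block_size)]

    def record(self, worker, tokens):
        """Note that `worker` now holds KV blocks for this token sequence."""
        node = self.root
        for block in self._blocks(tokens):
            node = node.children.setdefault(block, Node())
            node.workers.add(worker)

    def route(self, tokens, default):
        """Return (worker, matched_blocks); fall back to `default` on no match."""
        node, best, depth = self.root, default, 0
        for i, block in enumerate(self._blocks(tokens), start=1):
            node = node.children.get(block)
            if node is None or not node.workers:
                break
            # Deterministic tie-break; a real router would also weigh load.
            best, depth = min(node.workers), i
        return best, depth


router = PrefixRouter(block_size=2)
router.record("worker-a", [1, 2, 3, 4, 5, 6])
router.record("worker-b", [1, 2, 9, 9])
print(router.route([1, 2, 3, 4, 7, 8], default="worker-c"))
```

The request shares two blocks with what `worker-a` has cached, so it is routed there and only the unmatched suffix would need prefill.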

What makes Dynamo especially compelling is that it is a fast-moving project that tracks developments in research and in the field, bringing frontier-lab-level performance to anyone with access to GPU clusters, whether in datacenters or in the cloud.

However, some challenges remain when running a tool like Dynamo, and an integration with Cedana solves them:

* **Cold Start Times:** Cold starts don't go away, even with infrastructure optimizations like those Dynamo provides, because the model weights still need to be downloaded and loaded into GPU memory for each worker or set of workers. Initialization for large models like Llama-3-70B can exceed 80 seconds. Cedana solves this by restoring "ready-to-serve" Dynamo workers from checkpoints, enabling new replicas to come online 10x faster than a native cold start.
* **The Recompute Tax:** In long-context agentic sessions, losing a decode worker node means losing its entire in-memory KV cache. To resume, a new node must pay a "recompute tax" by re-running the expensive prefill phase, which can take 10+ seconds and waste significant GPU cycles. Cedana provides system-level Save, Migrate, and Resume (SMR) capabilities that transparently capture the exact state of GPU and CPU memory, allowing generation to resume from the very next token without re-prefilling the context or regenerating the KV cache.
* **Resilience:** Dynamo's stateful nature makes it difficult to run reliably on high-volatility, low-cost infrastructure like spot nodes. Cedana integrates with cloud-native schedulers to monitor preemption notices and live-migrate active inference tasks to healthy nodes before the preempted nodes are terminated.
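As a rough back-of-the-envelope illustration of the first two points, the sketch below plugs the figures quoted above (an 80-second cold start, a 10-second prefill, a 10x faster restore) into a simple recovery-time model. The function names are hypothetical, and the model assumes that the 10x speedup applies to the full 80-second initialization and that a restored worker carries its KV cache, so no prefill is repeated. Real numbers will vary with model size, storage, and context length.

```python
def native_recovery_s(cold_start_s: float, prefill_s: float) -> float:
    """Replace a lost decode worker from scratch: cold start, then re-prefill."""
    return cold_start_s + prefill_s


def checkpoint_recovery_s(cold_start_s: float, restore_speedup: float) -> float:
    """Restore a checkpointed worker; the KV cache comes back with it (assumed)."""
    return cold_start_s / restore_speedup


native = native_recovery_s(cold_start_s=80.0, prefill_s=10.0)
restored = checkpoint_recovery_s(cold_start_s=80.0, restore_speedup=10.0)
print(f"native: {native:.0f}s, restored: {restored:.0f}s")
```

Under these assumptions, a session interrupted mid-generation waits about 90 seconds on the native path versus about 8 seconds when restored from a checkpoint.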


