Cedana + Dynamo for Inference at Scale
Why Cedana + Dynamo?
NVIDIA Dynamo offers state-of-the-art inference serving at scale, designed to maximize the performance of LLM inference. By combining advanced optimization techniques with dynamic scheduling, Dynamo significantly accelerates inference workloads.
Dynamo's architecture is built on the modular disaggregation of the prefill and decode phases, allowing each to be scaled independently on hardware best suited for their specific compute or memory profiles.
Its core components include:
- a Smart Router that uses Radix Tree indexing to track KV-cache locality across the cluster, ensuring requests land on workers that already hold relevant context;
- the KV Block Manager (KVBM), which offloads large caches across a storage hierarchy of host DRAM, SSDs, and network storage;
- the NVIDIA Inference Transfer Library (NIXL), which enables zero-copy, asynchronous KV-cache migration at near-memory speeds.
And more! See the Core Capabilities section of the Dynamo docs for the full feature list.
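To make the Smart Router's job concrete, KV-cache-aware routing at its simplest means sending a request to the worker whose cached token prefix overlaps most with the incoming prompt. A minimal sketch (all names here are hypothetical simplifications; Dynamo's real router indexes KV blocks with a Radix Tree across the whole cluster and also weighs load):

```python
# Sketch: route a request to the worker with the longest reusable
# cached token prefix. Hypothetical simplification of KV-cache-aware
# routing; not Dynamo's actual Smart Router implementation.

def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the shared leading token sequence between a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens: list[int], worker_caches: dict[str, list[int]]) -> str:
    # Pick the worker whose cache overlaps most with the request;
    # a production router would break ties by current load.
    return max(worker_caches,
               key=lambda w: common_prefix_len(request_tokens, worker_caches[w]))

caches = {
    "worker-a": [1, 2, 3, 4],  # holds KV cache for tokens 1..4
    "worker-b": [1, 2, 9],     # diverges after two tokens
}
print(route([1, 2, 3, 4, 5], caches))  # -> worker-a
```

The payoff of landing on `worker-a` here is that four tokens of prefill are skipped entirely; at cluster scale, that locality is what the Radix Tree index preserves.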
What makes Dynamo especially compelling is that it's a fast-moving project: it tracks developments in research and in the field to bring frontier-lab-level performance to anyone with access to GPU clusters, both in datacenters and in the cloud.
However, some challenges remain even with a tool like Dynamo, and these are exactly what an integration with Cedana solves:
Cold Start Times: Cold starts don't disappear even with the infrastructure optimizations Dynamo provides, since model weights still need to be downloaded and loaded into GPU memory for each worker or set of workers. Initialization for large models like Llama-3-70B can exceed 80 seconds. Cedana solves this by restoring "ready-to-serve" Dynamo workers from checkpoints, enabling new replicas to come online 10x faster than a native cold start.
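Back-of-envelope arithmetic shows why cold starts are dominated by moving weights, not by serving logic (the bandwidth figures below are illustrative assumptions, not measurements):

```python
# Rough cold-start arithmetic for a Llama-3-70B-class model in bf16.
# Bandwidths are assumed for illustration only.
params = 70e9
bytes_per_param = 2                       # bf16
weight_bytes = params * bytes_per_param   # ~140 GB of weights

download_bps = 2.5e9   # assumed object-store throughput, bytes/s
h2d_bps = 25e9         # assumed host-to-GPU copy throughput, bytes/s

download_s = weight_bytes / download_bps  # 56.0 s just to fetch weights
load_s = weight_bytes / h2d_bps           # ~5.6 s to copy into GPU memory
print(f"download ~{download_s:.0f}s + GPU load ~{load_s:.0f}s")
```

Even under these generous assumptions, a native cold start spends most of a minute on weight movement alone, which is the portion a checkpoint restore sidesteps.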
The Recompute Tax: In long-context agentic sessions, losing a decode worker node means the total loss of its in-memory KV cache. To resume, a new node must pay a "recompute tax": re-running the expensive prefill phase, which can take 10+ seconds and wastes significant GPU cycles. Cedana provides system-level Save, Migrate, and Resume (SMR) capabilities that transparently capture the exact state of the GPU and CPU memory, allowing generation to resume from the very next token without re-prefilling the context (or regenerating the KV cache!).
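The size of the recompute tax scales linearly with context length, which is why long agentic sessions feel it most. A quick illustration (the throughput figure is an assumed number, not a benchmark):

```python
# Illustrative "recompute tax": time to re-run prefill over a long
# context after a decode worker's in-memory KV cache is lost.
context_tokens = 128_000
prefill_tok_per_s = 10_000   # assumed aggregate prefill throughput
recompute_s = context_tokens / prefill_tok_per_s
print(f"re-prefill cost: ~{recompute_s:.1f}s of redundant GPU work")
```

With checkpoint-based restore, that ~13 seconds of redundant prefill per failure is avoided because the KV cache itself is part of the captured state.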
Resilience: Dynamo's stateful nature makes it difficult to run reliably on high-volatility, low-cost infrastructure like spot nodes. Cedana integrates with cloud-native schedulers to monitor preemption notices and live-migrate active inference tasks to healthy nodes before they are terminated.