Cedana vs. CRIU CUDA for GPU Checkpoint/Restore
We benchmark our SOTA GPU checkpoint/restore solution against CRIU’s CUDA plugin that’s based on NVIDIA’s CUDA Driver functionality
Until 2024, checkpoint/restore of GPU workloads was only possible through some form of interception, either at the CUDA runtime API or driver API level. In April 2024, NVIDIA released native support for checkpoint/restore on their GPUs. A recent paper, CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads, benchmarks the CRIU CUDA plugin, which uses NVIDIA's CUDA checkpoint utility to enable checkpoint/restore of GPU-enabled processes through CRIU. The paper highlights valid issues with traditional GPU checkpoint/restore solutions that are based on interception.
At Cedana, we've been developing our own GPU checkpoint/restore solution through GPU virtualization. In this article, we compare our solution with the new CRIU CUDA plugin and also highlight the unique benefits enabled through virtualization & interception.
You can try out both our interception-based GPU checkpoint/restore and CRIU CUDA checkpoint/restore using the Cedana daemon. See the daemon documentation for more info.
We benchmarked our GPU checkpoint/restore solution against CRIU CUDA checkpoint/restore on model inferencing across popular large language models (LLMs), such as GPT2, LLaMa 3, StableLM, Gemma 2, and more. Our customers generally cared about the following metrics, which we collected data on:
Warm checkpoint time: Time it takes to checkpoint warmed-up inference jobs, i.e. when the model is fully loaded onto the GPU.
Cold start time: Time it takes to cold-start inferencing (also known as Time To First Token).
Save migrate resume: Total time taken to checkpoint, migrate across the network, and restore on another machine.
vLLM throughput: Runtime throughput performance as measured in this vLLM benchmark.
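For context, cold start here spans everything from launching (or restoring) the inference server to the first generated token. Below is a minimal sketch of how such a reading could be taken, assuming an OpenAI-compatible endpoint; the launch command, URL, and polling logic are illustrative assumptions, not the harness used for these numbers.

```python
import subprocess
import time
import requests  # any HTTP client will do


def cold_start_time(launch_cmd, url="http://localhost:8000/v1/completions",
                    model="gpt2"):
    """Cold start / time-to-first-token: from launching the inference server
    (or restoring a checkpoint of it) until the first token comes back."""
    start = time.perf_counter()
    server = subprocess.Popen(launch_cmd)  # e.g. ["vllm", "serve", model] (illustrative)
    try:
        while True:
            try:
                r = requests.post(url, json={"model": model, "prompt": "Hello",
                                             "max_tokens": 1}, timeout=30)
                if r.ok:
                    return time.perf_counter() - start
            except requests.RequestException:
                pass  # server not accepting connections yet
            time.sleep(0.1)
    finally:
        server.terminate()
```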
Setup information (environment and machine specs) can be found at the bottom of the page.
Each reading in the benchmarks below is calculated from a minimum of 3 samples.
For this benchmark, we also include native cold start time as a baseline (or, time to first token when running natively).
Below, you can see results for smaller models (GPT2). Leftmost bars (gray) are native cold starts (without checkpoint/restore), middle bars (light blue) are cold starts using CRIU CUDA, and rightmost bars (blue) are cold starts using Cedana.
For larger models, we observe similar results. Cedana cold starts (blue) are the fastest. CRIU CUDA cold starts for these larger models are even slower than native cold starts. For the models gemma-2-9b and stablelm-2-12b, the CRIU CUDA checkpoint failed for every sample, so there is no reading to show. Update: this is a known CRIU issue that was recently fixed, but the fix was unreleased when these readings were taken.
Below, you can see the results for smaller models (GPT2). Left bars (dark gray) are CRIU CUDA checkpoint times, right bars (light gray) are Cedana checkpoint times.
There is also an overlay plot in the same figure, comparing checkpoint throughputs (values on the right y-axis). The topmost line (gray) is the maximum possible throughput (disk write speed), the middle line (blue) is the throughput observed when using Cedana, and the lowermost line (dark blue) is the throughput observed when using CRIU CUDA.
You can see that Cedana's throughput (~0.6 GiB/s) is close to the maximum possible throughput (0.68 GiB/s), which means Cedana spends most of its time writing to disk/network; a faster disk or network would therefore translate directly into faster checkpoint times. CRIU CUDA's throughput, on the other hand, appears to increase with model size, which suggests it is not saturating the available disk/network bandwidth because it is busy doing other work.
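Checkpoint throughput here is simply the checkpoint image size divided by the checkpoint wall-clock time, and the disk write speed puts a hard floor under checkpoint time. A back-of-the-envelope sketch with a hypothetical image size and timing (only the disk write speed comes from the setup table at the bottom of the page):

```python
# Hypothetical numbers for illustration; only disk_write_gib_s comes from the setup table.
disk_write_gib_s = 0.67          # max disk write throughput (see setup table below)
image_size_gib = 12.0            # example checkpoint image size (hypothetical)
observed_checkpoint_s = 20.0     # example measured checkpoint time (hypothetical)

effective_throughput = image_size_gib / observed_checkpoint_s   # GiB/s actually achieved
io_bound_floor_s = image_size_gib / disk_write_gib_s            # best case if purely I/O-bound

print(f"effective throughput: {effective_throughput:.2f} GiB/s")
print(f"I/O-bound lower bound on checkpoint time: {io_bound_floor_s:.1f} s")
# The closer the effective throughput is to disk_write_gib_s, the less time is spent
# on work other than writing the image out.
```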
For larger models, we observe similar results. The above observations on throughput still hold. Just like for cold start time, we are missing CRIU CUDA results for the gemma-2-9b and stablelm-2-12b models, as CRIU CUDA failed to checkpoint these models in every sample.
Cedana checkpoint times are the fastest across the board.
Below, you can see results for smaller models (GPT2). Left bars (shades of gray) are CRIU CUDA save migrate resume times, right bars (shades of blue) are Cedana save migrate resume times. The stacks, from bottom to top, denote save, migrate, and resume, respectively.
For larger models, we observe similar results. Just like for cold start time, we are missing CRIU CUDA results for the gemma-2-9b and stablelm-2-12b models, as CRIU CUDA failed to checkpoint these models in every sample.
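To build intuition for the save migrate resume totals, a rough bandwidth-bound model treats each stage as streaming the full checkpoint image through the slowest link it touches. The image size below is a hypothetical example; the bandwidths are taken from the setup table at the bottom of the page.

```python
# Rough model of save -> migrate -> resume, assuming each stage is bandwidth-bound.
image_gib = 12.0              # hypothetical checkpoint image size
save_bw = 0.67                # GiB/s, local disk write (setup table)
migrate_bw = 500 / 1024       # GiB/s, mock internet speed of 500 MiB/s (setup table)
resume_bw = 7.41              # GiB/s, disk read on the target (assumed identical hardware)

total_s = image_gib / save_bw + image_gib / migrate_bw + image_gib / resume_bw
print(f"bandwidth-bound estimate: {total_s:.0f} s")
# Real runs add process freeze/quiesce, GPU state capture, and restore work on top.
```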
Below are the results from running this benchmark on the LLaMa 3.1 8B model, comparing the runtime throughput of Cedana (which uses GPU interception) against native.
| Metric | Native | Cedana |
| --- | --- | --- |
| Requests/s | 11.69 | 13.85 |
| Total tokens/s | 4833.43 | 5725.69 |
| Output tokens/s | 2318.23 | 2746.18 |
Runtime throughput when running with Cedana is about 11% faster. This is surprising, because using GPU interception for enabling GPU checkpoint/restore typically introduces some overhead to runtime performance. However, this overhead has been mitigated by optimizations that are only possible due to the asynchronous design of our GPU virtual machine (internally referred to as the GPU controller). This design allows the GPU controller to make decisions before actually executing CUDA driver API calls intercepted during operation. For example, it can combine multiple driver API calls or cancel redundant ones, effectively working like a just-in-time compiler (JIT).
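To make the idea concrete, below is a deliberately simplified, hypothetical sketch of the pattern: intercepted driver calls are buffered and optimized as a batch before being issued. It illustrates the JIT-like coalescing described above, not Cedana's actual GPU controller.

```python
from dataclasses import dataclass


@dataclass
class Call:
    name: str     # intercepted CUDA driver API call, e.g. "cuMemcpyHtoD"
    dst: int = 0  # destination handle/pointer (illustrative)


class CallQueue:
    """Hypothetical async queue: intercepted calls are buffered, then optimized
    as a batch before being submitted to the real driver."""

    def __init__(self):
        self.pending = []

    def intercept(self, call: Call):
        self.pending.append(call)  # return control to the application immediately

    def flush(self):
        # Example optimization (simplified: assumes no intervening reads): if two
        # host-to-device copies target the same destination, only the last one matters.
        latest = {}
        for i, c in enumerate(self.pending):
            if c.name == "cuMemcpyHtoD":
                latest[c.dst] = i
        optimized = [c for i, c in enumerate(self.pending)
                     if c.name != "cuMemcpyHtoD" or latest[c.dst] == i]
        self.pending.clear()
        return optimized  # these would now be issued to the driver
```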
Cedana version: v0.9.240-35-g9bd7886e
Cedana GPU plugin version: v0.4.7-9-g1062be1
CRIU version: v4.0
The benchmarks were run on the following machine:
| Spec | Value |
| --- | --- |
| CPU cache (L1 data) | 1.9 MiB (30 instances) |
| CPU cache (L2) | 15 MiB (30 instances) |
| CPU cache (L3) | 480 MiB (30 instances) |
| CPU cores | 30 |
| CPU memory (DRAM) | 216.26 GiB |
| CPU model | AMD EPYC 7J13 64-Core Processor |
| CPU threads/core | 1 |
| Disk read speed | 7.410 GiB/s |
| Disk write speed | 0.672 GiB/s |
| GPU API | 12.8 |
| GPU SM clock speed (max) | 1410 MHz |
| GPU compute capability | 8.0 |
| GPU driver | 570.124.04 |
| GPU memory (total) | 40960 MiB |
| GPU memory clock speed (max) | 1215 MHz |
| GPU model | NVIDIA A100-SXM4-40GB |
| Mock internet speed | 500 MiB/s |
The above performance results make a strong case for why we continue to advance our interception-based GPU checkpoint/restore solution. As highlighted in the CRIUgpu paper, there are several challenges to testing and maintaining an interception-based solution, but we believe it's worth it.
Checkpoint/restore performance – As shown in the benchmarks above, our checkpoint/restore solution is much faster than CRIU CUDA. GPU interception provides enhanced visibility into program execution, enabling optimization decisions like identifying exactly which memory regions need checkpointing and determining which CUDA calls to replay or skip during restore. This confers additional benefits, such as incremental checkpointing (see the sketch at the end of this section).
Runtime performance – GPU interception enables our GPU controller to function like a JIT compiler, allowing real-time optimizations such as merging multiple CUDA graph API calls, cancelling redundancies, etc. Even with our current minimal set of optimizations, we’ve observed a 10-12% runtime performance improvement on the vLLM throughput benchmarks. There’s substantial untapped potential that can only be unlocked through this level of control.
Multi-GPU / Multi-node scaling – For workloads distributed across multiple GPUs—whether on the same host or across different hosts—transparent checkpoint/restore requires synchronization that is only possible through knowledge of the GPU state. CRIU CUDA currently does not support NCCL-accelerated workloads. While application-aware C/R is always an option, achieving true transparency without modifying the application requires this approach.
Beyond checkpoint/restore – GPU interception enables far more than just checkpoint/restore. It lays the groundwork for automatic GPU failover, live GPU scaling, live GPU migration, and more.
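As an example of what interception-level visibility enables, here is a hypothetical sketch of incremental checkpointing driven by intercepted writes: only GPU buffers touched since the last checkpoint get re-saved. It illustrates the idea referenced in the first point above, not our implementation.

```python
class IncrementalCheckpointer:
    """Hypothetical dirty-region tracker: interception tells us which GPU
    buffers were written since the last checkpoint, so only those are saved."""

    def __init__(self):
        self.dirty = set()  # handles of buffers modified since the last save

    def on_intercepted_write(self, buffer_handle: int):
        # Called by the interception layer whenever a kernel launch or memcpy
        # is known to write to this buffer.
        self.dirty.add(buffer_handle)

    def checkpoint(self, read_buffer, write_out):
        # read_buffer(handle) -> bytes and write_out(handle, bytes) are supplied
        # by the surrounding system (illustrative callbacks).
        for handle in self.dirty:
            write_out(handle, read_buffer(handle))
        saved = len(self.dirty)
        self.dirty.clear()  # everything is clean until the next intercepted write
        return saved
```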