For AI makers and artists relying on scalable inference, the dreaded “cold start” is a productivity killer. When demand spikes on Kubernetes, systems often allocate GPUs only to leave them idle for minutes while models load. This latency creates a bottleneck where the infrastructure cannot serve requests fast enough, risking service-level agreements during traffic surges.
The delay isn’t just a single step; it is a sequence of heavy lifting. A single-GPU vLLM workload must pull container images, load weights into GPU memory, warm up CUDA kernels, compile graphs, and register with service discovery. NVIDIA’s AI research team has countered this with NVIDIA Dynamo Snapshot, a system that saves and restores the entire state of an inference worker to bypass these initialization steps.
Understanding the mechanics of CRIU and cuda-checkpoint
To freeze an inference worker’s state, the system must serialize two distinct environments. The device state lives on the GPU, encompassing CUDA contexts, streams, and memory mappings invisible to the host. The tool cuda-checkpoint dumps this GPU state into the process’s CPU memory. The host state comprises CPU memory, threads, and file descriptors, which CRIU (Checkpoint/Restore in Userspace) serializes to disk by walking the Linux kernel’s process tree.
The restoration process reverses this order. First, CRIU loads the host process tree from distributed storage like NFS or SMB. Next, cuda-checkpoint transfers the previously saved GPU state from CPU memory onto the new GPUs. Because CRIU acts as a freeze-and-thaw mechanism, execution resumes at the exact instruction where the snapshot occurred, unaware of the interruption. Consequently, any external coordination required before saving or after loading must be managed by an orchestrator or specific workload hooks.
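The ordering described above can be sketched as two short command sequences. This is only an illustration: the flags follow the publicly documented `cuda-checkpoint` and `criu` command lines, the helper names are hypothetical, and the article does not show how the snapshot agent actually invokes these tools.

```python
from typing import List


def checkpoint_commands(pid: int, images_dir: str) -> List[List[str]]:
    """Checkpoint order: GPU state first, then the host process tree."""
    return [
        # Move the GPU state into the process's CPU memory.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Serialize the host process tree (now holding the GPU state) to disk.
        ["criu", "dump", "--shell-job", "--images-dir", images_dir,
         "--tree", str(pid)],
    ]


def restore_commands(pid: int, images_dir: str) -> List[List[str]]:
    """Restore order is the reverse: host process tree first, then GPU state."""
    return [
        ["criu", "restore", "--shell-job", "--restore-detached",
         "--images-dir", images_dir],
        # Copy the saved GPU state from CPU memory back onto the GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in checkpoint_commands(1234, "/ckpt") + restore_commands(1234, "/ckpt"):
        print(" ".join(cmd))
```

The functions only build argument lists, so the ordering can be inspected without root privileges or a GPU.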
How Dynamo Snapshot operates on Kubernetes
Kubernetes workloads run inside containers within pods. Since checkpoints reference the container’s writable filesystem, the entire process tree and filesystem must be saved together at the container level. NVIDIA deploys a privileged DaemonSet named snapshot-agent, installed via a Helm chart. This agent runs on every node and manages checkpoints for runc-managed containers without modifying the runtime itself.
During a checkpoint, the agent waits for the workload’s readiness probe, then invokes the tools from the host to write artifacts to shared storage. It also handles files created locally within the container’s overlay filesystem. Upon restoration, the agent spins up a lightweight placeholder pod, restores the filesystem, and loads the checkpoint into the new namespaces. This DaemonSet approach offers three advantages over native Kubernetes support: it is fully portable across clouds, allows tighter performance tuning of CRIU, and keeps artifacts in flexible storage rather than embedding them in OCI images.
The quiesce and resume pattern
A Dynamo worker starts in two phases. First, engine initialization loads weights and compiles graphs, making the worker “warm” but invisible to the outside world. Second, distributed runtime startup connects the worker to the control plane and registers with the discovery backend. If a snapshot were taken after this second phase, active TCP connections would prevent CRIU from capturing the state.
The solution involves signal files. The worker writes a “ready for checkpoint” signal after initialization but before connecting to the control plane. It then enters a polling loop waiting for a “restore complete” signal. When the agent saves the state, the worker pauses. Upon restoration, execution resumes inside the polling loop, detects the signal file, and proceeds with runtime initialization seamlessly. This pattern is also essential for future multi-GPU and multi-node checkpoints, where changing pod IPs would break established TCP connections.
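A minimal sketch of the worker side of this handshake, assuming two hypothetical file names (`ready-for-checkpoint` and `restore-complete`) in a directory shared with the agent. The real signal paths and the surrounding Dynamo code are not given in the article.

```python
import time
from pathlib import Path
from typing import Optional


def quiesce_until_restored(signal_dir: Path,
                           poll_interval: float = 0.05,
                           timeout: Optional[float] = None) -> bool:
    """Announce readiness, then poll until the restore signal appears.

    Returns True once the restore signal is seen, False on timeout.
    File names are illustrative assumptions, not Dynamo's actual paths.
    """
    ready = signal_dir / "ready-for-checkpoint"
    restored = signal_dir / "restore-complete"

    # Engine initialization is done and no control-plane connections exist yet.
    ready.touch()

    deadline = None if timeout is None else time.monotonic() + timeout
    # The checkpoint freezes the process somewhere inside this loop;
    # after restore, execution continues here and sees the signal file.
    while not restored.exists():
        if deadline is not None and time.monotonic() > deadline:
            return False
        time.sleep(poll_interval)
    return True
```

Only after this function returns `True` would the worker connect to the control plane and register with discovery, so no TCP connections are open at the moment of the snapshot.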
Optimization 1: KV Cache Unmap and Release
Inference engines typically allocate a large buffer for the KV cache after loading weights and graphs. Since the snapshot is taken before any requests are served, the cache holds no useful data and does not need to be saved. However, its virtual address must remain stable because it is baked into the captured CUDA graphs.
The fix involves allocating the KV cache via the CUDA Virtual Memory Management API (cuMemCreate and cuMemMap). The system then frees the underlying physical memory using cuMemUnmap and cuMemRelease, while keeping the virtual address range intact. This feature is native in vLLM via sleep() and wake_up(), and in SGLang via torch_memory_saver.
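The key invariant is that the virtual address survives while the physical backing does not. The toy model below tracks the two separately to make that visible. It makes no CUDA calls; the class and method names are hypothetical, and the comments only indicate which driver API each step stands in for.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyVmm:
    """Bookkeeping-only model of virtual ranges and their physical backing."""
    next_va: int = 0x7000_0000_0000
    reserved: Dict[int, int] = field(default_factory=dict)  # va -> size
    backing: Dict[int, Optional[bytearray]] = field(default_factory=dict)

    def reserve(self, size: int) -> int:
        """Stand-in for reserving a virtual address range."""
        va = self.next_va
        self.next_va += size
        self.reserved[va] = size
        self.backing[va] = None
        return va

    def map(self, va: int) -> None:
        """Stand-in for cuMemCreate + cuMemMap: attach physical memory."""
        self.backing[va] = bytearray(self.reserved[va])

    def unmap_and_release(self, va: int) -> None:
        """Stand-in for cuMemUnmap + cuMemRelease: drop physical memory only."""
        self.backing[va] = None

    def resident_bytes(self) -> int:
        """Bytes that a checkpoint would have to save."""
        return sum(len(b) for b in self.backing.values() if b is not None)


vmm = ToyVmm()
weights = vmm.reserve(4096)
vmm.map(weights)
kv_cache = vmm.reserve(65536)
vmm.map(kv_cache)

before = vmm.resident_bytes()      # weights + KV cache
vmm.unmap_and_release(kv_cache)    # done just before the checkpoint
after = vmm.resident_bytes()       # weights only
vmm.map(kv_cache)                  # after restore: same address, fresh memory
```

Because `kv_cache` keeps the same address across the unmap and the later remap, anything that recorded that address, such as a captured CUDA graph, remains valid.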
For the Qwen3-0.6B model on a B200 GPU, this reduces the artifact size from approximately 190 GiB to just 6 GiB. The savings are most significant for smaller models relative to the GPU size.
Optimization 2: Speeding up CRIU memory restore
Even with smaller artifacts, the upstream CRIU restore time remains a bottleneck. For large models, the restore duration can exceed the cold-start time, negating the benefits of checkpointing.
2.1 — Parallel memfd restore: Tools like vLLM’s memory saver move weight-tagged GPU allocations into pinned CPU shadow buffers. Inside the Linux kernel, these appear as memfds (anonymous, RAM-backed files). Upstream CRIU restores these buffers one by one. The optimized version enumerates the unique shared memory objects and uses a thread pool to restore them in parallel, utilizing available storage bandwidth and CPU cores.
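The idea can be illustrated in a few lines: deduplicate the objects to restore, then read them concurrently with a thread pool. This is a user-space sketch over ordinary files with hypothetical function names; the actual change lives inside CRIU and operates on memfd-backed shared memory.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable


def restore_one(path: str) -> bytes:
    """Read one saved buffer from storage."""
    with open(path, "rb") as f:
        return f.read()


def restore_parallel(paths: Iterable[str], workers: int = 8) -> Dict[str, bytes]:
    """Restore each unique buffer once, several at a time."""
    # Several mappings may reference the same shared object; restore it once.
    unique = sorted(set(paths))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its data.
        return dict(zip(unique, pool.map(restore_one, unique)))
```

File reads release Python's global interpreter lock, so the threads overlap their I/O and can use more of the available storage bandwidth than a one-by-one loop.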
2.2 — Linux native AIO for anonymous memory: The original CRIU uses a synchronous preadv loop with only one read in flight at a time. The replacement uses Linux native Asynchronous I/O (AIO), submitting a batch of iocbs via io_submit and maintaining a sliding window of up to 128 concurrent reads. As completions arrive via io_getevents, new requests backfill the window. Where supported, both memory types use O_DIRECT to avoid page cache pressure.
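The sliding-window behavior can be sketched without kernel AIO. Python has no standard binding for `io_submit`, so the example below swaps in a thread pool issuing `os.pread` calls; it shows the same window-and-backfill logic, not the kernel interface CRIU uses. The function name and chunk size are assumptions.

```python
import os
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def windowed_read(path: str, chunk_size: int = 1 << 20, window: int = 128) -> bytes:
    """Read a file with at most `window` chunk reads in flight at once."""
    size = os.path.getsize(path)
    out = bytearray(size)
    offsets = iter(range(0, size, chunk_size))
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=window) as pool:
            def submit(off: int):
                return pool.submit(os.pread, fd, min(chunk_size, size - off), off)

            # Fill the initial window.
            in_flight = {}
            for off in offsets:
                in_flight[submit(off)] = off
                if len(in_flight) >= window:
                    break

            # As each read completes, copy it into place and backfill the window.
            while in_flight:
                done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
                for fut in done:
                    off = in_flight.pop(fut)
                    data = fut.result()  # assumes full reads on a regular file
                    out[off:off + len(data)] = data
                    nxt = next(offsets, None)
                    if nxt is not None:
                        in_flight[submit(nxt)] = nxt
    finally:
        os.close(fd)
    return bytes(out)
```

With `window=1` this degenerates to the one-read-at-a-time loop that the article attributes to upstream CRIU; larger windows keep the storage device busy while earlier reads are still completing.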
Performance gains vary by storage backend; on filesystems lacking O_DIRECT support, such as some NFS deployments, the system falls back to buffered I/O with sequential readahead, reducing the speedup.
| Model | Checkpoint Size | CRIU (upstream) | CRIU (AIO) | CRIU (AIO + parallel memfd) | Speedup | SOL* |
|---|---|---|---|---|---|---|
| Qwen3-0.6B | 6.2 GiB | 6.8 s | 2.9 s | 2.4 s | 2.8× | 0.95 s |
| Qwen3-8B | 26 GiB | 24 s | 11 s | 4.7 s | 5.1× | 1.8 s |
| gpt-oss-120b | 129 GiB | 119 s | 54 s | 15 s | 7.9× | 11 s |
*SOL (speed of light) is the theoretical minimum restore time given the available storage bandwidth; restore time cannot drop below this floor.
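The Speedup column can be reproduced directly from the table, and the same numbers give a rough view of effective read throughput and of the remaining gap to SOL. The snippet below is a quick check using only values copied from the table; the derived throughput and SOL ratio are simple divisions, not figures reported by NVIDIA.

```python
# (checkpoint GiB, upstream s, AIO + parallel memfd s, SOL s), from the table
ROWS = {
    "Qwen3-0.6B": (6.2, 6.8, 2.4, 0.95),
    "Qwen3-8B": (26.0, 24.0, 4.7, 1.8),
    "gpt-oss-120b": (129.0, 119.0, 15.0, 11.0),
}


def summarize(size_gib: float, upstream_s: float,
              optimized_s: float, sol_s: float) -> dict:
    return {
        # Upstream restore time divided by optimized restore time.
        "speedup": round(upstream_s / optimized_s, 1),
        # Checkpoint size divided by optimized restore time.
        "gib_per_s": round(size_gib / optimized_s, 1),
        # How many times slower than the speed-of-light floor.
        "times_sol": round(optimized_s / sol_s, 1),
    }


for model, row in ROWS.items():
    print(model, summarize(*row))
```

The computed speedups match the table (2.8×, 5.1×, 7.9×), and the largest model sits closest to its SOL floor.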
At this stage, CRIU restore time is near its limit, but end-to-end restore is still dominated by the serial bottleneck of moving weights sequentially from storage to host memory and then to the GPU.
Optimization 3: GPU Memory Service (GMS)
The final optimization, GPU Memory Service (GMS), addresses the remaining serial bottleneck where GPU memory must be populated before inference can begin. By decoupling the storage of model weights from the GPU memory allocation, GMS allows the system to pre-populate GPU memory from persistent storage in parallel with the restore process, further reducing the time required to bring an inference worker online.
Key takeaways
- Cut idle GPU time: Dynamo Snapshot allows Kubernetes inference workloads to skip the multi-minute cold-start sequence by restoring a saved state, so GPUs spend far less time idle when demand spikes.
- Massive size reduction: By unmapping and releasing the KV cache buffer before checkpointing, artifact size drops from approximately 190 GiB to about 6 GiB for Qwen3-0.6B on a B200 GPU, significantly lowering storage costs.
- Parallel restore speeds: Optimizations like parallel memfd restoration and Linux native AIO accelerate CRIU restore by 2.8× to 7.9× in the reported measurements, bringing large model startups closer to theoretical storage limits.