vLLM x Novita AI: PegaFlow for Production-Grade External KV Cache
vLLM and Novita AI collaborate on PegaFlow, which externalizes the KV cache into a standalone service with a three-level cache hierarchy. In production, vLLM starts 2.15x faster and throughput rises by up to 72%.
Key Points
- KV cache is externalized from the inference process into a standalone, long-lived service
- Adopts a three-level cache hierarchy: host memory, RDMA remote memory, and SSD
- Enables multiple engines and models to share the same cache pool with fault isolation
- In production, vLLM startup is 2.15x faster, with throughput gains up to 72%
Analysis
Why: The Need to 'Externalize' the KV Cache
For any engineer who has deployed large model inference services, the KV cache is both familiar and expensive. It holds the attention keys and values computed for tokens that have already been processed, which lets autoregressive generation avoid recomputing them at every step. In traditional designs, however, this cache, which can consume hundreds of GB of RAM, is tightly coupled to the inference engine process (e.g., a vLLM worker). If the engine process restarts because of a crash, a rolling upgrade, or a model switch, the entire cache pool is lost at once. Re-allocating and re-warming it takes significant time and causes service interruptions or performance fluctuations. In production environments, this coupling means high operational costs and fragile failure domains. PegaFlow was created to address this pain point: it decouples the KV cache lifecycle from individual inference processes and turns the cache into persistent, shareable service infrastructure.
Breakdown: PegaFlow's Core Architecture and How It Works
At its core, PegaFlow is a standalone daemon, written in Rust, that runs on each machine and takes over the KV cache resources previously managed by vLLM workers. Its key innovation lies in building a three-level cache hierarchy:
- Pinned Host Memory: The fastest tier, storing the hottest KV data.
- RDMA-accessible Remote Memory: Leverages high-speed networks (like the 8x400Gbps NICs in the tests) to share cache across cluster nodes, enabling cross-node cache pooling.
- SSD: Serves as the largest-capacity, slowest persistent tier for cold data or overflow buffering.
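The three tiers above can be pictured as a fastest-first lookup with demotion on eviction. The following is a minimal sketch under stated assumptions, not PegaFlow's implementation: the class names, the LRU policy, the promote-on-hit behavior, and the tiny capacities are all illustrative choices, and plain Python dictionaries stand in for pinned memory, RDMA, and SSD.

```python
from collections import OrderedDict
from typing import Optional


class Tier:
    """One cache level with a fixed block capacity and LRU eviction (assumed policy)."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self.blocks: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)  # mark as most recently used
        return self.blocks[key]

    def put(self, key: str, value: bytes) -> Optional[tuple]:
        """Insert a block; return the evicted (key, value) pair, if any."""
        self.blocks[key] = value
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            return self.blocks.popitem(last=False)  # evict least recently used
        return None


class TieredKVCache:
    """Searches tiers fastest-first, promotes hits to the top, demotes evictions downward."""

    def __init__(self, tiers: list) -> None:
        self.tiers = tiers

    def put(self, key: str, value: bytes, level: int = 0) -> None:
        evicted = self.tiers[level].put(key, value)
        # Demote the evicted block one level down; drop it past the last tier.
        if evicted is not None and level + 1 < len(self.tiers):
            self.put(evicted[0], evicted[1], level + 1)

    def get(self, key: str) -> Optional[tuple]:
        """Return (tier_name, value) for the tier that served the hit, or None on a miss."""
        for level, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                if level > 0:
                    del tier.blocks[key]
                    self.put(key, value)  # promote the block to the fastest tier
                return tier.name, value
        return None


# Hypothetical capacities, counted in blocks rather than bytes.
cache = TieredKVCache([Tier("host", 2), Tier("rdma", 2), Tier("ssd", 4)])
for i in range(5):
    cache.put(f"block-{i}", bytes([i]))
```

After the five inserts, the newest blocks sit in the host tier and the oldest has been pushed down to the SSD tier; reading that oldest block once pulls it back into host memory. A real system would also account for transfer cost and block size when deciding what to promote.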
vLLM workers communicate with the local PegaFlow service via CUDA IPC (for high-speed data transfer) and gRPC (for control commands). This design brings several key advantages:
- Process Fault Isolation: vLLM processes can crash, restart, or upgrade at any time while the PegaFlow service remains alive, preserving the valuable cache data. The isolation also holds in the other direction: a PegaFlow restart does not have to take the vLLM processes down with it.
- Resource Pooling and Sharing: Multiple vLLM instances on the same host (potentially running different models or parallel configurations) can share a single cache pool managed by PegaFlow, greatly improving memory utilization. The article's tests show that eight Qwen3-8B instances sharing one cache pool achieved 56% higher throughput than each having an isolated pool.
- Faster Startup: Since the massive cache pool is pre-held by the always-on PegaFlow service, vLLM processes skip the time-consuming memory allocation during startup. In a test with a 500 GiB cache, vLLM's ready time dropped from 71.4 seconds to 33.2 seconds—a 2.15x improvement. This is crucial for services requiring rapid elastic scaling.
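As a quick check of the startup figures quoted above, the arithmetic can be reproduced directly. The two timings come from the article; the variable names are only illustrative.

```python
baseline_s = 71.4  # vLLM ready time before the change, as reported
pegaflow_s = 33.2  # ready time with the cache pool pre-held by PegaFlow

speedup = baseline_s / pegaflow_s
saved_s = baseline_s - pegaflow_s
reduction_pct = 100 * saved_s / baseline_s

print(f"{speedup:.2f}x faster, {saved_s:.1f} s saved ({reduction_pct:.1f}% less time)")
# prints: 2.15x faster, 38.2 s saved (53.5% less time)
```

Stated either way, the result is the same: a 2.15x speedup corresponds to cutting ready time by a little more than half.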
Trend Insight: From 'Built-in Engine' to 'Infrastructure'
PegaFlow's implementation reveals a clear evolutionary trend in the LLM inference stack: the infrastructuralization of key resources. Just as databases separated storage from compute and Kubernetes decoupled applications from underlying machines, the most expensive runtime asset in LLM inference—the KV cache—is transitioning from an 'internal state' of the engine to an 'external service' at the platform level.
This is not just a technical architecture optimization but a shift in operational paradigms. It allows cache management (e.g., capacity planning, data migration, cross-model reuse) to be handled independently of the inference engine, paving the way for more stable, efficient, and manageable large-scale inference services. The article also describes an optimization for DeepSeek-V3.2 MLA (Multi-head Latent Attention, an attention mechanism that needs special handling under tensor parallelism): by avoiding redundant storage of the same logical KV data on every TP rank, it achieved a 72% throughput increase. This suggests that a specialized, external cache service can perform deep optimizations for specific model architectures, something that is harder to do inside a general-purpose inference engine.
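To make the idea of avoiding redundant per-rank storage concrete, here is a small sketch of a block store that keeps one physical copy of a payload no matter how many TP ranks submit it. This is an assumption-laden illustration: content hashing with SHA-256, the `DedupBlockStore` class, and the block sizes are invented for the example, and the article does not say how PegaFlow detects or removes the duplication.

```python
import hashlib


class DedupBlockStore:
    """Stores each distinct block payload once, however many TP ranks submit it."""

    def __init__(self) -> None:
        self.payloads: dict = {}  # content hash -> block bytes
        self.refs: dict = {}      # (tp_rank, block_id) -> content hash

    def put(self, tp_rank: int, block_id: str, payload: bytes) -> None:
        digest = hashlib.sha256(payload).hexdigest()
        self.payloads.setdefault(digest, payload)  # keep only the first copy
        self.refs[(tp_rank, block_id)] = digest

    def get(self, tp_rank: int, block_id: str) -> bytes:
        return self.payloads[self.refs[(tp_rank, block_id)]]

    def stored_bytes(self) -> int:
        return sum(len(p) for p in self.payloads.values())


# Hypothetical sizes chosen only for the illustration.
TP_SIZE = 8
NUM_BLOCKS = 10
BLOCK_BYTES = 4096

store = DedupBlockStore()
for block in range(NUM_BLOCKS):
    latent = bytes([block]) * BLOCK_BYTES  # identical logical KV on every rank
    for rank in range(TP_SIZE):
        store.put(rank, f"blk-{block}", latent)

naive_bytes = TP_SIZE * NUM_BLOCKS * BLOCK_BYTES  # one copy per rank
print(naive_bytes // store.stored_bytes())  # storage ratio, naive vs deduplicated
```

In this toy setup every rank writes the same bytes, so the deduplicated store holds one eighth of what per-rank storage would. Each rank can still read its blocks through its own `(tp_rank, block_id)` reference.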
Practical Value and Counter-Intuitive Insights
For teams using or evaluating vLLM, PegaFlow offers a plug-and-play solution. It integrates via the standard kv_transfer_config interface, so there is no need to modify vLLM source code or maintain a long-lived fork, which significantly lowers the adoption barrier. Two questions are worth asking right away: Is your service plagued by cache invalidation during engine updates? Are you running multiple model instances on the same host and wasting memory? If so, an external cache service might be a direction worth exploring.
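A configuration passed through kv_transfer_config might look roughly like the sketch below, which builds the JSON value and the matching launch command. The connector name `PegaFlowConnector` is a placeholder, since the article does not give the real one; take the actual name and any extra options from the PegaFlow documentation. The `kv_connector` and `kv_role` fields and the `--kv-transfer-config` flag are part of vLLM's existing interface, and the model ID is just an example.

```python
import json
import shlex

# HYPOTHETICAL connector name: a placeholder, not confirmed by the article.
kv_transfer_config = {
    "kv_connector": "PegaFlowConnector",  # assumption, replace with the real name
    "kv_role": "kv_both",                 # this engine both saves and loads KV
}

# Build the launch command without running it.
command = [
    "vllm", "serve", "Qwen/Qwen3-8B",
    "--kv-transfer-config", json.dumps(kv_transfer_config),
]
print(shlex.join(command))
```

Because the integration point is a configuration value rather than a code change, switching an existing deployment over or back is a matter of editing the launch arguments.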
A potentially counter-intuitive point is that externalizing the cache can actually improve overall performance. Intuitively, in-process access should be the fastest. However, PegaFlow achieves efficient inter-process data transfer via CUDA IPC, while gaining larger system-wide throughput benefits from resource pooling and from eliminating redundant storage (e.g., duplicate KV across TP ranks). This is a reminder that in complex distributed system design, a local optimum (the fastest in-process access) is not the same as the global optimum (the highest system throughput).
Analysis generated by BitByAI