Serving Agentic Workloads at Scale with vLLM x Mooncake

vLLM Blog · Toolchain · Advanced · Impact: 8/10

By integrating Mooncake's distributed KV cache store, vLLM removes the bottleneck of recomputing long-context prefixes in AI agent workloads, reporting 3.8x higher throughput and 46x lower time-to-first-token (TTFT).

Key Points

  • Unique structure of agentic workloads: long-context, multi-turn interactions where each turn adds little new content but requires reprocessing a large prefix (the analyzed traces show a 94.2% cache hit rate)
  • Limitations of local caching: limited capacity prone to eviction, and no cross-instance sharing, leading to recomputation during load migration
  • Core solution: building a distributed KV cache pool based on Mooncake Store to enable cache sharing across vLLM instances
  • Remarkable performance gains: on real agentic traces, 3.8x higher throughput, 46x lower TTFT, and 8.6x lower end-to-end latency

Analysis

Why We Must Talk About Agent Inference Efficiency Now

With the rise of LLM agents like Claude Code and OpenClaw, inference workloads are undergoing a fundamental shift. These agents are no longer simple chatbots but autonomous systems capable of planning, reasoning, and acting toward complex goals, a trend Jensen Huang highlighted in his GTC 2026 keynote. This shift introduces new engineering challenges, because agentic workloads have a distinctive structure: long-horizon, multi-turn loops that alternate between "reasoning steps" (processing context and producing intermediate thoughts) and "action steps" (issuing tool calls and receiving external outputs).

The vLLM team's analysis of traces from Codex and GPT-5.4 on the SWE-bench Pro dataset reveals a striking pattern: by turn 30, context length grows to roughly 80K tokens, with the longest contexts exceeding 180K tokens. Yet each turn typically introduces only a few hundred to a few thousand new tokens; the rest is a prefix the model has already "seen." The dataset shows an average input-to-output token ratio of 131:1 and a cache hit rate of 94.2%. If these prefixes can be served from cache, almost all of the prefill compute is avoided, and the per-turn prefill cost shrinks to roughly the new delta.
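To make the arithmetic concrete, here is a minimal sketch that compares the total prefill work for one session with and without prefix caching. The session shape (30 turns, 2,500 new tokens per turn) is a hypothetical chosen to land near the roughly 80K-token context mentioned above; it is not taken from the traces, and the function name is illustrative.

```python
def prefill_tokens(context_lengths, cached):
    """Total tokens that must be prefilled across all turns of one session.

    context_lengths[i] is the input length at turn i. With a perfect prefix
    cache, only the tokens added since the previous turn are computed.
    """
    total = 0
    prev = 0
    for n in context_lengths:
        total += (n - prev) if cached else n
        prev = n
    return total


# Hypothetical session: context grows by 2,500 tokens per turn for 30 turns.
lengths = [2500 * (turn + 1) for turn in range(30)]

no_cache = prefill_tokens(lengths, cached=False)
with_cache = prefill_tokens(lengths, cached=True)
hit_rate = 1 - with_cache / no_cache

print(f"without cache: {no_cache:,} tokens prefilled")
print(f"with cache:    {with_cache:,} tokens prefilled")
print(f"implied cache hit rate: {hit_rate:.1%}")
```

Under these assumptions the cached session prefills 75,000 tokens instead of 1,162,500, an implied hit rate of about 93.5%, which is in the same range as the 94.2% reported for the real traces.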

Why Local Caching Fails and How Distributed Cache Pools Solve It

The problem is that traditional local KV cache offloading (to CPU DRAM or disk) hits two major bottlenecks in agentic scenarios:

  1. Limited Capacity and Eviction: A 100K-token context can occupy gigabytes of storage (e.g., ~3.8 GB for Kimi-2.5 FP8 KV caches). On a busy instance serving many long-running sessions, these large prefix caches can quickly saturate local capacity and trigger cache eviction, leading to cache misses.
  2. Cross-Instance Misses: For load balancing, a router may not always schedule the next turn of a session on the same vLLM instance. If the session migrates to a different instance, that instance has never seen the prefix and must recompute it from scratch, causing significant computational waste.
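The capacity pressure in the first point can be estimated with simple arithmetic. The sketch below derives a per-token KV footprint from the ~3.8 GB per 100K tokens figure quoted above and asks how many long sessions fit in a given offload tier. The 512 GB DRAM budget and the 80K-token session size are assumptions for illustration only.

```python
# Figure quoted in the text: ~3.8 GB of KV cache for a 100K-token context.
KV_BYTES_PER_100K_TOKENS = 3.8e9


def kv_bytes_per_token():
    """Approximate KV cache footprint of a single token, in bytes."""
    return KV_BYTES_PER_100K_TOKENS / 100_000


def sessions_that_fit(capacity_gb, context_tokens):
    """How many sessions of the given context length fit in capacity_gb."""
    per_session = kv_bytes_per_token() * context_tokens
    return int(capacity_gb * 1e9 // per_session)


# Assumed local offload budget of 512 GB and 80K-token sessions.
print(f"{kv_bytes_per_token() / 1e3:.0f} KB per token")
print(f"{sessions_that_fit(512, 80_000)} sessions of 80K tokens fit in 512 GB")
```

With these numbers a single node holds on the order of 170 such sessions before eviction starts, which is why pooling capacity across nodes matters for busy deployments.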

Therefore, the key insight is: for agentic workloads, we can no longer treat an inference service as a set of isolated vLLM replicas. Instances need to share a distributed KV cache pool that provides both larger aggregate capacity and cross-instance cache hits.

vLLM's solution is deep integration with Mooncake, an open-source, high-performance library for KV cache transfer and distributed storage. vLLM had already adopted Mooncake for prefill-decode (PD) disaggregation via the MooncakeConnector. The team now goes a step further and builds a distributed KV cache pool on Mooncake Store. Each vLLM instance embeds a Mooncake client, and all instances share a cluster-wide Mooncake Store whose global metadata is managed by a Mooncake master. KV cache blocks produced by any instance can then be fetched quickly by the others, which addresses both the capacity problem and the sharing problem.
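The effect of sharing can be illustrated with a toy model. This is not Mooncake's or vLLM's API: a plain Python set stands in for the store, the class and function names are hypothetical, and the chained block hashing is a simplified version of the general prefix-caching idea. The sketch routes one growing session round-robin across two instances and counts how many tokens each setup has to prefill.

```python
import hashlib

BLOCK = 16  # tokens per KV block (illustrative value)


def block_keys(tokens):
    """Chain-hash full blocks so each key identifies the whole prefix so far."""
    keys = []
    h = hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK
    for i in range(0, full, BLOCK):
        h.update(repr(tokens[i:i + BLOCK]).encode())
        keys.append(h.hexdigest())
    return keys


class Instance:
    """Stand-in for one serving replica with access to a block store."""

    def __init__(self, store):
        self.store = store  # set of block keys: private, or shared pool

    def prefill(self, tokens):
        """Return how many tokens had to be computed for this request."""
        keys = block_keys(tokens)
        hit = 0
        for key in keys:
            if key not in self.store:
                break
            hit += 1
        self.store.update(keys[hit:])  # publish newly computed blocks
        return len(tokens) - hit * BLOCK


def run(shared, turns=6, new_tokens=64):
    pool = set()
    instances = [Instance(pool if shared else set()) for _ in range(2)]
    tokens, computed = [], 0
    for turn in range(turns):
        start = turn * new_tokens
        tokens = tokens + list(range(start, start + new_tokens))
        computed += instances[turn % 2].prefill(tokens)  # round-robin router
    return computed


print("isolated replicas:", run(shared=False), "tokens prefilled")
print("shared pool:      ", run(shared=True), "tokens prefilled")
```

In this toy run the isolated replicas prefill 704 tokens while the shared pool prefills 384, exactly the sum of the per-turn deltas. A real deployment also pays transfer cost to fetch remote blocks, which this sketch ignores.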

Trend Insight: From Stateless Inference to Stateful Services

This work reveals a deeper trend: AI inference infrastructure is shifting from handling stateless, independent requests to managing stateful, long-lived sessions. An agent's "memory," its ever-growing context, becomes the core asset and primary cost center of the service. The underlying system therefore needs to manage, share, and efficiently migrate that state.

Microservice architectures evolved from stateless containers to managing persistent state and caches (such as Redis clusters), and LLM inference services are now following the same path. The distributed KV cache pool is the "inference Redis" of this era: it abstracts away the heterogeneity of the underlying GPU instances and provides a unified, high-speed, high-capacity "memory" for upper-layer agent applications.

Practical Value and Counter-Intuitive Insights

For developers and architects, this means that when designing and evaluating LLM inference services, "cache hit rate" and "cross-instance cache sharing capability" must be elevated to the same level of importance as throughput and latency. When choosing an inference framework, you should look beyond single-node performance and examine its distributed state management capabilities. This collaboration between vLLM and Mooncake sets a benchmark for the industry: future inference engines must be "cache-aware" and "cluster-aware."

A counter-intuitive point is that although an agent's context may span hundreds of thousands of tokens, the actual computational increment per turn is minimal. The performance bottleneck is not the speed of decoding new tokens but the repeated prefill of massive historical prefixes. By moving that bottleneck from "computation" to "storage and transmission," this work achieves order-of-magnitude performance improvements. The lesson for AI system optimization is that identifying which layer holds the bottleneck, and restructuring the system around it, can matter more than tuning compute kernels.

Analysis generated by BitByAI

Originally from vLLM Blog
