Serving Agentic Workloads at Scale with vLLM x Mooncake

vLLM integrates Mooncake's distributed KV cache to solve the bottleneck of recomputing long context prefixes in agentic workloads, achieving a 3.8x throughput increase and a 46x reduction in time-to-first-token.

AI Agents · Inference Optimization · Distributed Systems · Large Language Models · Caching

KEY POINTS

The core characteristic of agentic workloads is ultra-long context with 94.2% KV cache reuse across multi-turn dialogues
Local caching is limited by capacity and cross-instance misses, becoming the main bottleneck for scaled services
Mooncake Store provides a cross-node distributed KV cache pool, enabling cache sharing and linear scalability
Integration achieves 3.8x throughput, 46x TTFT reduction, and 8.6x end-to-end latency improvement on real agentic traces

ANALYSIS

The Shift: When AI Agents Learn to 'Run Marathons'

You've likely noticed AI agents like Claude Code and OpenClaw becoming increasingly powerful. They're no longer simple Q&A chatbots but autonomous systems capable of planning, reasoning, and executing complex tasks. This transformation poses entirely new challenges for underlying inference services. Traditional architectures were designed for short conversations, but agents operate differently: they engage in long-horizon, multi-turn cycles, alternating between 'reasoning steps' (processing context and generating intermediate thoughts) and 'action steps' (making tool calls and receiving external outputs).

The vLLM team analyzed traces from Codex and GPT-5.4 on the SWE-bench Pro dataset and found a striking pattern: by turn 30, context length grows to roughly 80K tokens, with the longest contexts exceeding 180K tokens. Yet each turn typically introduces only a few hundred to a few thousand new tokens. On average, 94.2% of each turn's input consists of prefixes the model has already 'seen' (e.g., system prompts, skills/memory, historical dialogue), and the input-to-output token ratio reaches 131:1. If these prefixes can be cached, the true cost of each inference turn shrinks to processing the small delta of new content. The problem is that existing local caching solutions (such as offloading to CPU memory or disk) fall short for agentic workloads.
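To make the size of that delta concrete, here is a minimal back-of-the-envelope sketch. It assumes the average 94.2% reuse figure applies to a single 80K-token turn, which the source does not state for any specific turn, and the function name is purely illustrative.

```python
def prefill_tokens(context_len: int, reuse_fraction: float) -> tuple[int, int]:
    """Return (tokens to prefill with no cache, tokens to prefill on a full prefix hit)."""
    cached = round(context_len * reuse_fraction)
    return context_len, context_len - cached


# Assumption: an 80K-token turn that sees the average 94.2% prefix reuse.
no_cache, with_cache = prefill_tokens(80_000, 0.942)
print(f"without prefix cache: {no_cache} tokens to prefill")
print(f"with full prefix hit: {with_cache} tokens to prefill")
print(f"prefill work avoided: {1 - with_cache / no_cache:.1%}")
```

The sketch counts tokens only. Actual latency savings depend on the model, the hardware, and the cost of loading cached KV blocks.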

The Breakdown: Local Cache 'Ceilings' and the Rise of Distributed 'Cache Pools'

Local caching faces two critical limitations. First is capacity and eviction. A 100K-token context can consume several GB of storage (e.g., Kimi-2.5 FP8 KV cache occupies about 3.8GB). When serving many long-running sessions simultaneously, these massive prefix caches quickly saturate local capacity, leading to frequent cache eviction and plummeting hit rates. Second is cross-instance misses. For load balancing, a router may not schedule the next turn of a session to the same vLLM instance. Once a session migrates to a new instance, that instance has never seen the prefix and must recompute it from scratch—a costly process.
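The capacity ceiling can be estimated with simple arithmetic. The sketch below derives a per-token footprint from the 3.8GB per 100K tokens figure above, assuming decimal gigabytes and a footprint that scales linearly with token count. The 64 GB offload budget and the 80K-token session length are hypothetical inputs, not numbers from the source.

```python
GB = 10**9  # assumption: decimal gigabytes

# Derived from the figure above: about 3.8 GB for a 100K-token context.
BYTES_PER_TOKEN = 38_000  # 3.8e9 bytes / 100_000 tokens


def session_kv_bytes(tokens: int) -> int:
    """KV cache footprint of one session, assuming linear growth per token."""
    return tokens * BYTES_PER_TOKEN


def sessions_that_fit(budget_bytes: int, tokens_per_session: int) -> int:
    """How many full session prefixes fit in a cache budget before eviction starts."""
    return budget_bytes // session_kv_bytes(tokens_per_session)


# Hypothetical: a 64 GB local offload budget, sessions at 80K tokens each.
print(session_kv_bytes(80_000) / GB, "GB per session")
print(sessions_that_fit(64 * GB, 80_000), "sessions fit before eviction")
```

Under these assumptions a single node holds only a couple of dozen long sessions, which is why eviction sets in quickly once many agents run concurrently.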

The key insight is that we can no longer treat inference services as isolated vLLM replicas. For agentic workloads, instances need to share a distributed KV cache pool that provides both larger aggregate capacity and cross-instance cache hits. This is precisely where vLLM's integration with Mooncake Store comes in. Mooncake is an open-source, high-performance library for KV cache transfer and distributed storage. vLLM already uses Mooncake via MooncakeConnector for prefill-decode (PD) disaggregation. The integration now goes further by building a distributed KV cache pool on Mooncake Store. The architecture consists of a master server that manages metadata and clients that run on the GPU nodes and transfer KV blocks among themselves over high-speed RDMA networks, collectively forming one large shared cache pool.
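The effect of cross-instance misses can be illustrated with a toy simulation. This is not Mooncake's or vLLM's API: the block size, the chained block hashing, the round-robin router, and every name below are assumptions made for illustration, and plain Python sets stand in for the cache stores.

```python
import hashlib

BLOCK = 16  # tokens per KV block (hypothetical value)


def block_keys(tokens: list[int]) -> list[str]:
    """Chain-hash full blocks so each key identifies the whole prefix up to that block."""
    keys, h = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK
    for i in range(0, full, BLOCK):
        h.update(repr(tokens[i:i + BLOCK]).encode())
        keys.append(h.hexdigest())
    return keys


def simulate(turns: list[list[int]], num_instances: int, shared: bool) -> float:
    """Return the fraction of input tokens served from cache over all turns."""
    pool: set[str] = set()                              # one shared pool
    local = [set() for _ in range(num_instances)]       # one cache per instance
    hit = total = 0
    for turn, tokens in enumerate(turns):
        # A round-robin router sends each turn to the next instance.
        store = pool if shared else local[turn % num_instances]
        keys = block_keys(tokens)
        matched = 0
        for key in keys:                                # longest cached prefix
            if key not in store:
                break
            matched += 1
        hit += matched * BLOCK
        total += len(tokens)
        store.update(keys)                              # cache this turn's blocks
    return hit / total


# One session whose context grows by 200 tokens per turn.
turns = [list(range(1000 + 200 * t)) for t in range(10)]
print("local caches, 2 instances:", round(simulate(turns, 2, shared=False), 3))
print("shared pool,  2 instances:", round(simulate(turns, 2, shared=True), 3))
```

With isolated caches, each instance only ever sees every other turn, so it can reuse at most the prefix from two turns earlier. With the shared pool, every turn reuses the full prefix from the previous turn regardless of which instance serves it.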

Trend Insight: From 'Stateless Inference' to 'Stateful Services'

This reveals a deeper trend: AI inference services are evolving from stateless request-response models to stateful, session-aware service paradigms. Agents need memory, and that memory is embodied in ever-growing KV caches. Managing the cost and efficiency of this 'state' (i.e., KV caches) will be key to determining whether agent services can scale and operate economically. The Mooncake-vLLM integration essentially creates an external, shared 'working memory' system for AI agents—similar to how humans rely not just on their brains but also on notebooks, whiteboards, and other external memory aids to maintain continuity and efficiency during complex projects.

Practical Value: What This Means for Developers and Architects

For developers and architects building or considering deploying AI agents, this progress offers clear guidance. First, when designing agent systems, treat context length and cache efficiency as core optimization metrics, not just single-inference latency. Second, when selecting an inference framework, evaluate whether it supports cross-instance cache sharing. A framework lacking this ability may face severe performance degradation and cost surges under agent workloads. The vLLM-Mooncake solution demonstrates near-linear scalability (tested up to 60 GPUs), meaning total agent service throughput grows almost in proportion to the number of GPU nodes you add, which gives a clear scaling path for handling user growth.

Counterintuitive Insight: What Does a 131:1 Ratio Really Mean?

One striking statistic that is easy to overlook is the 131:1 input-to-output token ratio. It strongly suggests that in agentic workloads, computing resources are consumed primarily by 'reading' and 'recalling' rather than 'generating'. Traditional inference optimization often focuses on accelerating decoding (token generation), but for agents, the more critical question is how to 're-read' historical context efficiently and cheaply. This overturns our conventional understanding of where inference service bottlenecks lie, shifting the optimization focus from the output side to the input side, specifically the reuse efficiency of long historical prefixes. The vLLM-Mooncake solution targets exactly this core tension, using distributed caching to reduce the cost of 're-reading' to near zero and thereby achieving order-of-magnitude performance improvements.
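The ratio can be turned into token shares with a few lines of arithmetic. This sketch assumes the 131:1 ratio and the 94.2% average reuse describe the same aggregate workload, which the source does not spell out. It counts tokens only, so it says nothing about the differing per-token cost of prefill and decode.

```python
from fractions import Fraction

ratio = Fraction(131, 1)      # input tokens per output token
reuse = Fraction(942, 1000)   # average share of input that is a reusable prefix

# Share of all processed tokens that are input rather than generated output.
input_share = ratio / (ratio + 1)

# Input tokens per output token that still need computing after a full prefix hit.
uncached_input_per_output = ratio * (1 - reuse)

print(f"input share of all tokens: {float(input_share):.2%}")
print(f"uncached input tokens per output token: {float(uncached_input_per_output):.3f}")
```

Even with perfect prefix reuse, several new input tokens are still read for every token generated, so the input side stays dominant; caching simply removes the repeated part.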

Originally from vLLM Blog · Analyzed by BitByAI