
vLLM Tops the Artificial Analysis Leaderboard

vLLM Blog · Toolchain · Advanced · Impact: 8/10

The open-source inference engine vLLM outperforms all proprietary competitors in multiple frontier model inference benchmarks, thanks to deep kernel fusion optimizations tailored to each model's specific bottlenecks.

Key Points

  • vLLM achieves top-tier inference performance on models like DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B
  • The performance advantage stems from aggressively reducing GPU kernel launch overhead, which dominates at low batch sizes
  • All optimization code is open-source or being merged, challenging the assumption that 'best performance requires proprietary stacks'
  • This work lays the foundation for supporting next-generation models (e.g., DeepSeek V4)

Analysis

The Catalyst: A Leaderboard Result That Defies Conventions

A recent inference benchmark published by DigitalOcean has sent ripples through the AI infrastructure community. The results show the open-source inference engine vLLM outperforming all proprietary inference service providers on three frontier open-weight models: DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B. On DeepSeek V3.2, vLLM achieved an output throughput of 230 tokens per second (TPS) per user, more than four times what most providers reported. On Qwen 3.5 397B, it ranked first among all 12 providers measured, with a time-to-first-token (TTFT) under 1 second for 10,000-token prompts. The result matters because it directly challenges a deeply held assumption in production environments: that the best inference performance requires a closed, proprietary technology stack. It shows that, on the same NVIDIA Blackwell Ultra hardware, a community-built open-source engine can deliver top-tier performance.
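The two metrics quoted above can be made concrete with a small sketch. It assumes a common convention (TTFT is the delay from sending the request to the first output token; per-user output speed counts tokens after the first one), which may differ from the exact methodology behind the leaderboard. The function names and timestamps are hypothetical.

```python
def ttft_seconds(request_sent, token_times):
    """Time-to-first-token: delay between sending the request and the first output token."""
    if not token_times:
        raise ValueError("no tokens were received")
    return token_times[0] - request_sent


def output_tps(token_times):
    """Output tokens per second for one request, measured from the first token onward.

    Assumption: the first token is excluded so that prompt processing time
    (already captured by TTFT) does not skew the decode speed.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a rate")
    return (len(token_times) - 1) / (token_times[-1] - token_times[0])


# Hypothetical arrival timestamps (in seconds) for a three-token response.
sent_at = 10.0
arrivals = [10.5, 10.75, 11.0]
print(ttft_seconds(sent_at, arrivals))  # 0.5
print(output_tps(arrivals))             # 4.0
```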

Deconstruction: Where Does the Performance Come From? Surgical Kernel Fusion

The vLLM team didn’t use magic; instead, they performed extremely fine-grained optimizations targeting the specific bottlenecks of each model. The core idea is kernel fusion: combining multiple small operations (like normalization, rotary embeddings, and quantization) that would otherwise each launch a separate GPU kernel into fewer, more efficient ones. This drastically reduces GPU kernel launch overhead, which matters most at low batch sizes (i.e., few concurrent requests), where each kernel does little work and the fixed launch cost makes up a large share of the total time.
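As a rough illustration of what fusion means, the sketch below runs the same three steps twice in plain Python: once as three separate passes (each standing in for one kernel launch, with an intermediate buffer between them) and once as a single pass. It is not vLLM code. The single rotation angle is a simplification of rotary embeddings, and the 8-bit integer rounding is only a stand-in for FP8 quantization; all function names are hypothetical.

```python
import math


def rms_norm(x, eps=1e-6):
    """Pass 1: divide every element by the root-mean-square of the vector."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def rotate_pairs(x, theta):
    """Pass 2: rotate adjacent pairs by one angle (simplified rotary embedding)."""
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for i in range(0, len(x), 2):
        a, b = x[i], x[i + 1]
        out.extend((a * c - b * s, a * s + b * c))
    return out


def quantize(x, scale):
    """Pass 3: scale, round, and clamp to [-127, 127] (integer stand-in for FP8)."""
    return [max(-127, min(127, round(v / scale))) for v in x]


def unfused(x, theta, scale):
    """Three separate passes, each producing a full intermediate list."""
    return quantize(rotate_pairs(rms_norm(x), theta), scale)


def fused(x, theta, scale, eps=1e-6):
    """One reduction for the RMS, then a single loop that normalizes,
    rotates, and quantizes each pair without intermediate lists."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for i in range(0, len(x), 2):
        a, b = x[i] / rms, x[i + 1] / rms
        for r in (a * c - b * s, a * s + b * c):
            out.append(max(-127, min(127, round(r / scale))))
    return out


x = [0.5, -1.25, 2.0, 3.5, -0.75, 1.0]  # even length, arbitrary values
print(unfused(x, 0.3, 0.05) == fused(x, 0.3, 0.05))  # True
```

The outputs match because both versions perform the same arithmetic in the same order; on a GPU the gain would come from fewer launches and fewer round trips through memory, not from doing different math.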

Take DeepSeek V3.2 as an example. Each of its Transformer layers originally launched about 33 separate GPU kernels. The operations themselves execute quickly (in microseconds), so the fixed launch cost accumulated across all those kernels became the primary performance bottleneck. vLLM’s solution was to fuse multiple operations along the attention path (such as Q/KV normalization, rotary embeddings, and FP8 quantization) into just 2 kernels, reducing the per-layer kernel count from about 33 to about 10. This single change delivered a 1.28× speedup at batch size 1. The team also developed a new router GEMM kernel and a TopK kernel tailored to this model, further boosting performance. These optimizations apply to the current version and also form the foundation for supporting the next-generation DeepSeek V4.
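Why the kernel count matters at batch size 1 can be seen in a toy cost model: per-layer time is a fixed launch cost per kernel plus the actual compute. The sketch below assumes total compute is unchanged by fusion, and the launch and compute figures are hypothetical placeholders, not measurements from the vLLM work. It is not meant to reproduce the reported 1.28× figure.

```python
def layer_time_us(num_kernels, launch_us, compute_us):
    """Toy model: per-layer time = fixed launch cost per kernel + total compute."""
    return num_kernels * launch_us + compute_us


def fusion_speedup(kernels_before, kernels_after, launch_us, compute_us):
    """Speedup from launching fewer kernels, assuming compute time is unchanged."""
    before = layer_time_us(kernels_before, launch_us, compute_us)
    after = layer_time_us(kernels_after, launch_us, compute_us)
    return before / after


# Kernel counts come from the article (about 33 -> about 10 per layer).
# The launch cost and compute time below are made-up illustrative values.
for launch_us in (0.0, 2.0, 5.0):
    s = fusion_speedup(33, 10, launch_us=launch_us, compute_us=100.0)
    print(f"launch={launch_us} us -> speedup {s:.2f}x")
```

In this model the speedup grows as launch cost rises relative to compute, which matches the article's point that fusion pays off most when each kernel has little work to do.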

For MiniMax-M2.5, in addition to kernel fusion, the team trained a custom EAGLE3 speculative decoding draft model using the open-source TorchSpec and vLLM, enabling high-acceptance-rate speculative decoding to further increase throughput. For Qwen 3.5 397B, the optimizations focused on specific fusions in its attention and normalization paths.
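To see why a high acceptance rate matters, here is a generic sketch of speculative decoding's draft-and-verify step and its expected yield. It is not EAGLE3 or vLLM's implementation: it assumes greedy verification, and the yield formula assumes each draft token is accepted independently with the same probability. All names are hypothetical.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Accept the longest draft prefix that matches the target model's choices.

    target_tokens holds the target's greedy choice at each draft position plus
    one extra position, so it must be one element longer than draft_tokens.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must be one longer than draft_tokens")
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            accepted.append(t)  # first mismatch: keep the target's token and stop
            return accepted
        accepted.append(d)
    accepted.append(target_tokens[-1])  # all drafts accepted: bonus token
    return accepted


def expected_tokens_per_step(accept_prob, draft_len):
    """Expected tokens per verification step: 1 + p + p^2 + ... + p^draft_len."""
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if accept_prob == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


print(verify_greedy([1, 2, 3], [1, 2, 9, 4]))  # [1, 2, 9]
print(expected_tokens_per_step(0.5, 2))        # 1.75
```

Each verification step costs roughly one forward pass of the large model, so the more tokens it yields on average, the higher the per-user throughput.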

Trend Insight: Open Source Is Becoming the Core Engine of AI Infrastructure Innovation

vLLM’s victory reveals a deeper trend: in the critical infrastructure layer of AI inference, open-source projects are transitioning from "followers" to "leaders." In the past, it was often assumed that the most cutting-edge optimization techniques were hidden within the private codebases of large corporations. However, vLLM demonstrates that through open collaboration, the community can rapidly absorb, integrate, and innovate upon the latest optimization techniques (like kernel fusion and speculative decoding) and democratize them across the ecosystem at an unprecedented pace. All these optimizations are either already open-source or being merged into the main branch, meaning any developer can access world-class inference performance for free. This is reshaping the competitive landscape of AI infrastructure, shifting the focus from "competing on proprietary technology" to "competing on open-source collaboration and engineering depth."

Practical Value and Counter-Intuitive Insights

For AI practitioners, especially engineers responsible for model deployment, this case study offers direct actionable insights. First, it clearly identifies the key performance factor in low-batch-size scenarios—GPU kernel launch overhead. If your use case involves real-time interactive applications (like chatbots) with low concurrency, focusing on optimizations like kernel fusion is more effective than simply throwing more compute at the problem. Second, when choosing an inference framework, open-source solutions like vLLM have now demonstrated performance that rivals or even surpasses proprietary services, making them a top candidate for evaluation.

A potentially counter-intuitive point: many people assume performance optimization is "black magic" that only hardware vendors or dedicated teams at top-tier corporations can practice. vLLM’s work shows that it is mostly systematic engineering grounded in a deep understanding of model architecture and hardware characteristics. The open-source model enables the rapid dissemination and reuse of such optimization knowledge. For example, the fusion work done for DeepSeek V3.2 forms the foundation for supporting V4. This lowers the barrier for the entire industry to access cutting-edge technology.

Analysis generated by BitByAI

Originally from vLLM Blog

