vLLM Tops the Artificial Analysis Leaderboard

The open-source inference engine vLLM has outperformed every proprietary competitor measured in serving several frontier open-weight models. Its core optimization techniques, such as operator fusion, are publicly available, which shows how much open source can deliver in AI inference.

AI Inference Large Language Models Open Source Performance Optimization Developer Tools

KEY POINTS

vLLM ranked first in inference performance for models like DeepSeek V3.2 and Qwen 3.5 397B, with per-user output throughput on DeepSeek V3.2 more than 4x what most other providers reported.
The key to performance breakthroughs is optimization techniques like 'operator fusion', which consolidates dozens of GPU kernel launches into a few, drastically reducing overhead.
All optimization code is open-source or being merged to the main branch, challenging the industry assumption that 'best inference performance requires a proprietary stack'.
vLLM's success demonstrates that deep, model-specific optimizations (for architectures such as MoE and linear attention) are central to boosting inference efficiency.

ANALYSIS

The Cause: A Test Result That Challenges Industry Assumptions

A recent benchmark by Artificial Analysis delivered a surprising result: the top-performing deployments for three frontier open-weight models—DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B—all used the open-source inference engine vLLM. For DeepSeek V3.2, it achieved a per-user output throughput of 230 TPS, over 4 times higher than what most other providers reported. For Qwen 3.5 397B, it ranked first among all 12 providers measured, with a time-to-first-token (TTFT) under 1 second for 10,000-token prompts.

This matters because it directly challenges a deep-seated assumption in AI production environments: that achieving top-tier inference performance requires a closed-source, proprietary technology stack. This time, on identical NVIDIA Blackwell Ultra hardware, a community-built open-source engine outperformed all competitors. Crucially, the underlying optimizations are not locked away in private forks; they are all public or being merged into the vLLM main branch.
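To put the 230 TPS figure in perspective, the short sketch below converts a per-user output throughput into a per-token latency and a total generation time. It assumes that "per-user output throughput" means tokens streamed per second to a single request once output has started; the 500-token response length and the 4x-slower baseline are hypothetical values chosen only for illustration, not measurements from the benchmark.

```python
def inter_token_latency_ms(tokens_per_second: float) -> float:
    """Average gap between consecutive output tokens, in milliseconds."""
    return 1000.0 / tokens_per_second


def generation_time_s(num_output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response of the given length, ignoring time-to-first-token."""
    return num_output_tokens / tokens_per_second


if __name__ == "__main__":
    reported_tps = 230.0               # figure quoted for DeepSeek V3.2
    baseline_tps = reported_tps / 4.0  # hypothetical provider that is 4x slower
    response_tokens = 500              # hypothetical response length

    for label, tps in (("reported", reported_tps), ("baseline", baseline_tps)):
        print(
            f"{label}: {inter_token_latency_ms(tps):.2f} ms/token, "
            f"{generation_time_s(response_tokens, tps):.2f} s for {response_tokens} tokens"
        )
```

At this speed a user waits a few seconds for a long answer rather than tens of seconds, which is why per-user throughput is the metric interactive applications tend to care about.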

The Breakdown: The "Three Axes" of Performance Gains

The vLLM team didn’t use magic. Their work focused on precise, "surgical" optimizations targeting the specific bottlenecks of different models. We can understand their core techniques through three key actions:

Operator Fusion (solving the "launch overhead" problem): This is the most critical technique. Imagine GPU processing as an assembly line. Traditionally, small operations like "normalization" and "rotary embedding" are separate stations. The worker (GPU) has to stop and pick up a new task sheet (kernel launch) for each station. This "pickup" time (fixed overhead) can dominate the total time, especially with small batch sizes. vLLM’s optimization packs these small operations into a few "combined stations." For DeepSeek V3.2, they reduced the per-layer kernel launches from about 33 to about 10. This alone provided a 1.28x speedup at batch size 1. It’s like consolidating dozens of short workstations into a few longer ones, drastically reducing waiting and scheduling time.
Custom "Draft Models" (boosting speculative decoding efficiency): For MiniMax-M2.5, they employed speculative decoding, in which a small model quickly guesses a few tokens and the main model then verifies them. The key was training a highly customized "draft model" specifically for MiniMax-M2.5 using the open-source TorchSpec framework and live hidden states generated by vLLM. This is akin to training a stenographer who deeply understands a particular writer’s thinking and vocabulary, leading to a very high guess acceptance rate and overall acceleration.
Model-Architecture-Level Optimization (tuning down to the model’s core): They performed deep customizations to the model’s attention mechanisms and normalization paths. For example, they wrote specialized fused kernels for Qwen 3.5’s linear attention path and for MiniMax-M2.5’s non-standard attention normalization (where Q and K variances are computed after tensor-parallel reduction). This goes beyond generic optimization; it reaches into the core of the model architecture.
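The launch-overhead argument for operator fusion can be made concrete with a toy cost model. The sketch below assumes that every kernel launch pays the same fixed overhead and that fusion leaves the actual compute time unchanged; both are simplifications (real fused kernels also save memory traffic), and the model is an illustration, not the vLLM team's own analysis. The only inputs taken from the article are the launch counts (about 33 and about 10) and the 1.28x speedup at batch size 1.

```python
def layer_time(num_launches: float, launch_overhead: float, compute_time: float) -> float:
    """Toy model: total layer time = fixed cost per launch + useful compute."""
    return num_launches * launch_overhead + compute_time


def fusion_speedup(launches_before: float, launches_after: float,
                   launch_overhead: float, compute_time: float) -> float:
    """Speedup from reducing launch count while compute time stays the same."""
    before = layer_time(launches_before, launch_overhead, compute_time)
    after = layer_time(launches_after, launch_overhead, compute_time)
    return before / after


def implied_compute_in_overheads(launches_before: float, launches_after: float,
                                 speedup: float) -> float:
    """Solve (before + c) / (after + c) = speedup for c.

    The result c is the compute time per layer, measured in units of one
    launch overhead, that this toy model needs to reproduce the speedup.
    """
    return (launches_before - speedup * launches_after) / (speedup - 1.0)


if __name__ == "__main__":
    c = implied_compute_in_overheads(33, 10, 1.28)
    print(f"implied compute per layer: {c:.1f} launch overheads")

    # Scaling the compute term is a crude stand-in for larger batch sizes:
    # the fixed launch cost is amortized, so fusion helps less.
    for scale in (1, 4, 16):
        s = fusion_speedup(33, 10, 1.0, c * scale)
        print(f"compute x{scale}: speedup {s:.3f}")
```

Under these assumptions, the benefit of fusion shrinks as the compute term grows, which matches the article's point that launch overhead matters most at small batch sizes.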

Trend Insight: Open Source is Defining the "Industry Standard" for AI Inference

vLLM’s top ranking reveals a deeper trend beyond performance numbers: the "industry standard" for AI inference is being defined by the open-source community, not monopolized by closed-source giants.

In the past, it was assumed that closed-source companies had more resources for the low-level optimization needed to squeeze every last bit of performance out of the hardware. But vLLM shows that a vibrant open-source community with strong engineering capability can match or even exceed that. Its optimizations are transparent, reproducible, and shareable across the entire ecosystem. The article notes that the optimizations done for DeepSeek V3.2 now form the foundation for supporting the next-generation DeepSeek V4. This ecosystem effect of "optimize once, benefit across generations" is hard for closed-source, proprietary solutions to match.

Practical Value: What Does This Mean for Developers and Teams?

Re-evaluate Technology Choices: If your team is deploying large model services, especially with open-weight models, vLLM should be your primary evaluation candidate for an inference engine. It proves that open-source solutions are fully capable of providing top-tier production performance, and you may no longer need to pay high licensing fees for so-called "closed-source optimizations" or be locked into a specific cloud vendor.
Focus on Low-Level Technologies like Operator Fusion: For technically ambitious engineers, understanding principles like operator fusion, CUDA graphs, and speculative decoding becomes more valuable. vLLM’s success shows that future performance competition will increasingly occur at these low-level optimization layers.
Embrace the "Leverage Effect" of the Open-Source Ecosystem: Choosing an active open-source project like vLLM means your system’s performance can automatically improve with the community’s rapid iteration. When the community finishes optimizing for a new model (like Qwen 3.5), you can gain those benefits at almost zero cost.
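For engineers who want a concrete starting point on speculative decoding, the draft-then-verify loop can be sketched in a few lines. This is a minimal greedy variant under stated assumptions: `target_model` and `draft_model` are hypothetical deterministic toy functions over integer tokens, not real LLMs, and nothing here is vLLM's or TorchSpec's API. A production engine verifies all drafted tokens in one batched forward pass of the main model and handles sampled (non-greedy) decoding with a probabilistic acceptance rule; the toy calls the target once per position purely for readability.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_model(ctx: List[int]) -> int:
    """Hypothetical 'main model': a deterministic next-token rule."""
    return (3 * ctx[-1] + len(ctx)) % 17


def draft_model(ctx: List[int]) -> int:
    """Hypothetical 'draft model': agrees with the target except at every 4th position."""
    guess = target_model(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 17


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[int], float]:
    """Greedy speculative decoding; returns (tokens, fraction of drafted tokens accepted)."""
    tokens = list(prompt)
    accepted = drafted = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens autoregressively.
        draft_ctx = list(tokens)
        proposals = []
        for _ in range(k):
            proposals.append(draft(draft_ctx))
            draft_ctx.append(proposals[-1])
        drafted += k

        # 2. The target checks each proposal in order and stops at the first mismatch.
        ctx = list(tokens)
        for proposal in proposals:
            expected = target(ctx)
            if proposal == expected:
                accepted += 1
                ctx.append(proposal)
            else:
                ctx.append(expected)  # keep the target's own token instead
                break
        else:
            ctx.append(target(ctx))   # all proposals accepted: one bonus token
        tokens = ctx
    return tokens[: len(prompt) + max_new_tokens], accepted / drafted


if __name__ == "__main__":
    prompt = [1, 2, 3]
    out, rate = speculative_decode(target_model, draft_model, prompt, max_new_tokens=20)
    assert out == greedy_decode(target_model, prompt, 20)  # same output as plain decoding
    print(f"acceptance rate: {rate:.2f}")
```

The assertion in the usage block highlights the property that makes the technique safe in the greedy case: the output is identical to decoding with the main model alone, so a better draft model only changes how many tokens each verification step yields, not what is generated.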

Counterintuitive / Surprising Angle

One point that is easy to overlook: peak performance often comes from "non-generic" deep customization. vLLM did not win by relying on a single "universal" optimization path. Instead, its victory stems from crafting specialized optimizations for each of the architecturally diverse models, such as DeepSeek, MiniMax, and Qwen. This reminds us that in the AI infrastructure space, there is often a trade-off between generality and peak performance. True competitiveness may lie in the ability to provide the most in-depth "tailored" optimization for the most mainstream and important model architectures.

Originally from vLLM Blog · Analyzed by BitByAI