Tag: Inference Optimization (11 articles)

Beyond One Model: Fusion in vLLM Semantic Router

vLLM Semantic Router introduces Fusion, a routing primitive in which a panel of models produces independent answers, a judge model analyzes them, and a single response is synthesized, making model composition a first-class serving pattern.

vLLM Blog · Jun 16, 2026

Accelerating Laguna XS.2 Inference with vLLM, Speculators, and LLM Compressor

Poolside's 33B-parameter agentic coding model, Laguna XS.2, achieves 2-3x inference speedup without quality loss through native vLLM integration, DFlash speculative decoding, and LLM Compressor quantization.

vLLM Blog · May 28, 2026

Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models

NVIDIA's new diffusion language models generate tokens in parallel and refine them iteratively, potentially breaking the latency limits of traditional autoregressive models and enabling self-correction.

Hugging Face Blog · May 23, 2026

Unlocking asynchronicity in continuous batching

Hugging Face identifies alternating CPU/GPU waits as a bottleneck in continuous batching and shows how running the CPU and GPU workloads asynchronously can yield a free 24% throughput boost.

Hugging Face Blog · May 14, 2026

DeepSeek-V4: a million-token context that agents can actually use

DeepSeek-V4 makes million-token context windows practically usable for long-running AI agents by dramatically cutting inference costs and memory usage through its novel hybrid attention architecture.

Hugging Face Blog · Apr 24, 2026

DeepSeek V4 in vLLM: Efficient Long-context Attention

DeepSeek V4 achieves efficient million-token long-context inference on vLLM through innovative KV cache compression and sparse attention mechanisms, marking a new era for long-text processing.

vLLM Blog

Elastic Expert Parallelism in vLLM

vLLM introduces Elastic Expert Parallelism (Elastic EP), enabling runtime scaling of MoE inference deployments by adding or removing GPU workers without restarts, adapting to demand fluctuations and laying the groundwork for fault-tolerant serving.

vLLM Blog

Serving Agentic Workloads at Scale with vLLM x Mooncake

vLLM integrates Mooncake's distributed KV cache to solve the bottleneck of recomputing long context prefixes in agentic workloads, achieving a 3.8x throughput increase and a 46x reduction in time-to-first-token.

vLLM Blog

Speculators v0.5.0: DFlash Support and Online Training

The Speculators v0.5.0 release introduces the DFlash algorithm for speculative decoding, which generates draft tokens in a single forward pass, significantly reducing inference latency, and unifies online and offline training workflows.

vLLM Blog

The State of FP8 KV-Cache and Attention Quantization in vLLM

vLLM uses FP8 quantization for the KV cache to halve memory usage and double throughput for long-context inference while maintaining accuracy, though some performance pitfalls require care.

vLLM Blog

Which tokens does a hybrid model predict better?

Hybrid models significantly outperform pure Transformers in semantic understanding and dynamic context tracking, but lag in verbatim repetition, revealing a clear architectural division of labor.

Hugging Face Blog