Accelerating Laguna XS.2 Inference with vLLM, Speculators, and LLM Compressor
Poolside's 33B-parameter agentic coding model, Laguna XS.2, achieves a 2-3x inference speedup without quality loss through native vLLM integration, DFlash speculative decoding, and LLM Compressor quantization.
vLLM Blog · May 28, 2026
EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec
The EAGLE team, in collaboration with vLLM and TorchSpec, releases EAGLE 3.1, which significantly improves speculative decoding robustness and acceptance length in long-context and varied chat scenarios by addressing the 'attention drift' problem.
vLLM Blog · May 26, 2026
Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models
NVIDIA's new diffusion language models generate tokens in parallel and refine them iteratively, potentially breaking the latency limits of traditional autoregressive models and enabling self-correction.
Hugging Face Blog · May 23, 2026
vLLM x Novita AI: PegaFlow for Production-Grade External KV Cache
vLLM and Novita AI collaborate on PegaFlow, which externalizes the KV cache into a standalone service with a three-level cache hierarchy, doubling startup speed and significantly raising throughput.
vLLM Blog · May 18, 2026
Unlocking asynchronicity in continuous batching
Hugging Face identifies a bottleneck in continuous batching, where the CPU and GPU take turns waiting on each other, and shows that running their workloads asynchronously can yield a free 24% throughput boost.
Hugging Face Blog · May 14, 2026
A First Comprehensive Study of TurboQuant: Accuracy and Performance
A large-scale benchmark by the vLLM team reveals that while TurboQuant's extreme low-bit compression saves memory, it significantly degrades inference speed and accuracy, making FP8 quantization the current best balance.
vLLM Blog · May 11, 2026
vLLM Tops the Artificial Analysis Leaderboard
The open-source inference engine vLLM outperforms all proprietary competitors on multiple frontier-model inference benchmarks, thanks to deep kernel fusion optimizations tailored to each model's specific bottlenecks.
vLLM Blog · May 11, 2026
DeepSeek-V4: a million-token context that agents can actually use
DeepSeek-V4 makes million-token context windows practically usable for long-running AI agents by dramatically cutting inference costs and memory usage through its novel hybrid attention architecture.
Hugging Face Blog · Apr 24, 2026
DeepSeek V4 in vLLM: Efficient Long-context Attention
vLLM announces support for DeepSeek V4 models, featuring a novel attention mechanism that tackles the core challenges of memory and computational cost in million-token long-context inference.
vLLM Blog · Apr 24, 2026
The State of FP8 KV-Cache and Attention Quantization in vLLM
vLLM's comprehensive testing reveals that FP8 KV-cache quantization can significantly reduce memory usage and decoding costs under specific conditions, but it introduces critical accuracy and performance pitfalls in certain models and scenarios, so it should be adopted with care.
vLLM Blog · Apr 22, 2026