Tag: vLLM (5 articles)

vLLM V0 to V1: Correctness Before Corrections in RL

ServiceNow AI discovered that subtle differences in vLLM V1's inference engine could crash RL training, and restored stability by fixing four critical backend issues.

Hugging Face Blog · May 7, 2026

A First Comprehensive Study of TurboQuant: Accuracy and Performance

A comprehensive benchmark by the vLLM team reveals that TurboQuant generally underperforms FP8 quantization and is only potentially viable for extreme memory-constrained edge deployments.

vLLM Blog ·

Elastic Expert Parallelism in vLLM

vLLM introduces Elastic Expert Parallelism (Elastic EP), enabling runtime scaling of MoE inference deployments by adding or removing GPU workers without restarts, adapting to demand fluctuations and laying the groundwork for fault-tolerant serving.

vLLM Blog ·

Speculators v0.5.0: DFlash Support and Online Training

The Speculators v0.5.0 release introduces the DFlash algorithm for speculative decoding, which generates draft tokens in a single forward pass, significantly reducing inference latency, and unifies online and offline training workflows.

vLLM Blog ·

The State of FP8 KV-Cache and Attention Quantization in vLLM

vLLM uses FP8 quantization for KV cache to halve memory usage and double throughput for long-context inference while maintaining accuracy, though specific performance pitfalls need attention.

vLLM Blog ·