vLLM V0 to V1: Correctness Before Corrections in RL

The Catalyst: Why a Routine Upgrade Triggered Training Collapse

ServiceNow AI's PipelineRL system uses vLLM as its inference engine to generate rollout data for reinforcement learning. When the team upgraded the engine from vLLM V0 to V1 (a major rewrite), previously stable training curves began to diverge catastrophically. This was not ordinary performance fluctuation: clip rate, KL divergence, entropy, and rewards all went out of control. The episode exposed a deeper truth: in a complex AI training stack, subtle implementation differences in the inference engine can be amplified, butterfly-effect style, until they destroy the entire training run.

Deconstructing the Issue: Four Technical Culprits Behind "Training Instability"

The team sorted the possible causes into three layers: semantic mismatch, inference-path mismatch, and objective mismatch. Wisely, they investigated the first two layers (backend behavior) before suspecting the RL algorithm itself. They pinpointed four specific fixes:

1. Logprob Semantics: By default, V1 returned logprobs computed from the raw model outputs (before post-processing such as temperature scaling or penalties), while the trainer expected logprobs from the processed distribution the sampler actually draws from. A single setting, logprobs-mode=processed_logprobs, removed the mean offset.
2. Runtime Defaults: V1's defaults for caching, scheduling, and request handling differed from V0's, so the same prompts followed different inference paths and picked up hidden discrepancies.
3. In-flight Weight-Update Path: In RL training, the trainer periodically updates the model weights and syncs them to the inference engine. V1's weight-update mechanism was flawed, so the policy held by the engine fell out of sync with the trainer's.
4. The Final fp32 Projection Layer: The lm_head layer, which projects hidden states onto the vocabulary, might have been computed at a different precision (e.g., fp16) in V1. Using fp32 is often critical for numerical stability in logprob calculations.

Trend Insight: AI Engineering Enters a "Micro-Discrepancy Sensitive" Era

This incident marks a new phase in AI system engineering. Early on, the focus was on "making it work." Now, as tech stacks mature and grow more complex (online RL, distributed training), consistency becomes the core challenge. The inference engine is no longer just a black box that generates text; every output, logprobs included, is part of the training optimization target. Any subtle difference in implementation (default parameters, numerical precision, processing order) gets amplified by the optimization algorithm and leads to unpredictable outcomes. This suggests that future competition in AI infrastructure will be not just about speed and throughput, but also about determinism, reproducibility, and cross-version consistency.

Practical Value: Lessons for Developers and Teams

1. Upgrades Require Caution; Benchmarks Are Lifelines: Before upgrading any core component, especially an inference engine, establish rigorous, quantifiable benchmarks. Don't just test inference speed; test the consistency of downstream metrics such as RL training stability.
2. The "Backend-First" Troubleshooting Principle: When a system misbehaves, first assume a discrepancy in the underlying implementation or configuration (a backend behavior problem) rather than assuming the algorithm or objective function needs adjusting. This saves immense time otherwise wasted on futile "parameter tuning."
3. Focus on "Interface Contracts": The interfaces between system components (e.g., the logprob format the trainer expects) need clear, strict definitions and validation. The root cause here was essentially that vLLM V1 inadvertently changed its "implicit contract" with the trainer.
4. Never Neglect Numerical Precision: In the final layers of the model (like the lm_head), insisting on higher-precision computation such as fp32 is often a necessary cost of avoiding numerical instability and keeping training stable.

Counterintuitive Insight

The most counterintuitive point is this: an engine upgrade aimed at improving inference performance hit its biggest pitfall not in slower speeds or memory overflows, but in changing the "meaning" of the numbers it outputs, thereby invalidating the mathematical assumptions behind the optimization formulas in the layer above. The reminder is that in modern AI systems, software engineering correctness must precede algorithmic "corrections." Before adjusting the RL objective to fit a new engine, first make sure the new engine's behavior is mathematically equivalent to the old one's. This insistence on low-level determinism is a key step in AI's journey from the lab to reliable production environments.
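
To make the logprob-semantics mismatch from the first fix concrete, here is a minimal sketch of the kind of equivalence check the closing paragraph argues for. It does not call vLLM; the logits, the temperature value, and the helper names (raw_logprobs, processed_logprobs, mean_abs_gap) are all hypothetical and chosen only for illustration. The idea is simply that logprobs taken before temperature scaling describe a different distribution from the one the sampler draws from, so a trainer that assumes the processed distribution sees a systematic gap.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def raw_logprobs(logits):
    """Logprobs of the unmodified model output (temperature ignored)."""
    return log_softmax(logits)


def processed_logprobs(logits, temperature):
    """Logprobs of the temperature-scaled distribution the sampler uses."""
    return log_softmax([x / temperature for x in logits])


def mean_abs_gap(engine_lp, trainer_lp):
    """Mean absolute difference between two logprob vectors."""
    return sum(abs(a - b) for a, b in zip(engine_lp, trainer_lp)) / len(engine_lp)


# Hypothetical logits for one token position, and an assumed sampling temperature.
logits = [2.0, 1.0, 0.5, -1.0]
temperature = 0.7

# The trainer is assumed to score tokens under the processed distribution.
trainer_view = processed_logprobs(logits, temperature)

gap_raw = mean_abs_gap(raw_logprobs(logits), trainer_view)
gap_processed = mean_abs_gap(processed_logprobs(logits, temperature), trainer_view)

print(f"engine returns raw logprobs:       mean |gap| = {gap_raw:.4f}")
print(f"engine returns processed logprobs: mean |gap| = {gap_processed:.4f}")
```

A check along these lines, run on real rollouts before and after an engine upgrade, would turn the "implicit contract" into an explicit one: a gap well above numerical noise points to a backend semantics problem rather than to the RL objective.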