Fault-Tolerant Design - Tag

Elastic Expert Parallelism in vLLM

vLLM introduces Elastic Expert Parallelism (Elastic EP), which scales Mixture-of-Experts (MoE) inference deployments at runtime by adding or removing GPU workers without restarts. This lets a deployment adapt to demand fluctuations and lays the groundwork for fault-tolerant serving.

vLLM Blog

Tag: Fault-Tolerant Design (1 article)