MoE Models · Tag

Elastic Expert Parallelism in vLLM

vLLM introduces Elastic Expert Parallelism (Elastic EP), which scales Mixture-of-Experts (MoE) inference deployments at runtime by adding or removing GPU workers without a restart. This lets a deployment adapt to fluctuating demand and lays the groundwork for fault-tolerant serving.

vLLM Blog

Tag: MoE Models (1 article)
