EMO: Pretraining mixture of experts for emergent modularity

The Dilemma of "Bulky" LLMs and the Promise of MoE

Today's frontier large language models, which can reach trillions of parameters, are like all-powerful "giants." In practice, however, we usually need them only for specific tasks such as coding, mathematical reasoning, or answering medical questions. Invoking the entire giant every time is as inefficient as moving a whole cow just to get a glass of milk: it incurs massive computational cost and memory overhead.

Mixture-of-Experts (MoE) models should be the ideal solution. Their design resembles a committee of experts: the model contains many smaller "expert" networks, and for each input token only a few of the most relevant experts are activated. In theory, for a coding task, you would only need to load the "coding experts." The reality has been far from this ideal.

Why Traditional MoE Experts Are "Inseparable"

The problem is that experts in existing MoE models don't specialize along the domains we envision (e.g., math, biology, code). Research shows they tend to specialize in very low-level linguistic patterns, such as the usage of a specific preposition or punctuation mark. This means even a simple sentence might require activating multiple experts scattered across the model to handle its different tokens. Consequently, you cannot reliably extract a "math expert" subset to solve math problems, because these experts aren't "pure": generating a math solution still calls on experts spread across the whole model, so restricting the model to a subset leads to severe performance degradation. It's like assembling a "medical team" in which one member is an expert on commas and another on the word "the," leaving the team incapable of performing surgery on its own.

Trend Insight: From "Predefined" to "Emergent" Modularity

Previous solutions, like BTX or our FlexOlmo, attempted to route tokens based on predefined domain labels during pretraining.
But this approach has fundamental flaws. First, labeling massive pretraining corpora with clear, unambiguous domains is prohibitively expensive and difficult. Second, it imposes human bias on how the model organizes itself, limiting its ability to discover better structures on its own. Most critically, the predefined framework breaks when new domains or capabilities appear at inference time.

EMO's core breakthrough is that it allows modular structure to emerge naturally during pretraining. Through a new training objective, it encourages the model to organize its experts into coherent, independently usable functional groups without any human-defined domain labels. This reveals a deeper trend: AI architecture design is shifting from "humans imposing structure" to "guiding models to discover good structures autonomously." We are no longer the model's "planner" but its "coach," setting goals (like modularity) and letting the model figure out how to achieve them.

Practical Value and a Counter-Intuitive Insight

The practical value of this work is substantial. It means that when deploying a large MoE model trained the way EMO is, we can load only a small subset of the expert modules a downstream task requires (e.g., 12.5% of the experts), like building with Lego blocks, and still achieve performance close to the full model. This could drastically reduce inference cost and memory usage, making it possible to run complex tasks on resource-constrained devices such as phones and edge devices. For developers, the model transforms from a "black box" into a "composable toolkit."

A potentially counter-intuitive point: more parameters don't necessarily mean a more cumbersome model. Through "emergent modularity" like EMO's, a model with a large total parameter count (14 billion) but few active parameters (1 billion) can have lower effective compute per token and, when only the needed expert subset is loaded, a smaller memory footprint than a smaller, structurally "rigid" dense model, while delivering superior performance.
This overturns the simplistic notion that "more parameters always equal higher cost," pointing to sparsity and modularity as a key path to achieving both high performance and high efficiency.

In summary, EMO is not just a new model but a new philosophy for model construction. It evolves large models from "an indivisible giant" into "a flexibly assembled team of experts," potentially solving the critical "last-mile" challenges of cost and efficiency in deploying large models.
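
To make the idea of extracting an expert subset concrete, here is a minimal sketch in plain Python. It is not EMO's training objective or its actual selection procedure; the function names (top_k, select_expert_subset, coverage), the 16-expert router, and the toy logits are all assumptions made for illustration. The sketch picks the experts most often chosen on a handful of calibration tokens, then measures what fraction of the full model's routing decisions that subset would cover.

```python
import math
from collections import Counter


def top_k(logits, k, allowed=None):
    """Return the indices of the k highest-scoring experts.

    If `allowed` is given, only those expert indices are considered.
    """
    candidates = range(len(logits)) if allowed is None else allowed
    return sorted(candidates, key=lambda e: logits[e], reverse=True)[:k]


def select_expert_subset(token_logits, k, keep_fraction):
    """Hypothetical selection rule: keep the experts routed to most often."""
    counts = Counter()
    for logits in token_logits:
        counts.update(top_k(logits, k))
    n_experts = len(token_logits[0])
    n_keep = max(k, math.ceil(keep_fraction * n_experts))
    return sorted(e for e, _ in counts.most_common(n_keep))


def coverage(token_logits, k, subset):
    """Fraction of full-model top-k routing decisions that fall in `subset`."""
    allowed = set(subset)
    hits = total = 0
    for logits in token_logits:
        chosen = top_k(logits, k)
        hits += sum(e in allowed for e in chosen)
        total += len(chosen)
    return hits / total


def toy_logits(first, second, n_experts=16):
    """Toy router output: one token strongly prefers two experts."""
    logits = [0.0] * n_experts
    logits[first], logits[second] = 2.0, 1.5
    return logits


# "Modular" router: every token from this domain prefers experts 3 and 7.
modular = [toy_logits(3, 7) for _ in range(8)]
# "Scattered" router: each token prefers a different pair of experts.
scattered = [toy_logits(2 * t, 2 * t + 1) for t in range(8)]

modular_subset = select_expert_subset(modular, k=2, keep_fraction=0.125)
scattered_subset = select_expert_subset(scattered, k=2, keep_fraction=0.125)

print(modular_subset)                            # [3, 7]
print(coverage(modular, 2, modular_subset))      # 1.0
print(coverage(scattered, 2, scattered_subset))  # 0.125
```

With the toy modular router, 2 of 16 experts (12.5%) cover every routing decision. With the scattered router, the same budget covers only 12.5% of them, which is the failure mode described above for traditional MoE models.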