MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated, but also over how much of each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for runtime width determination, including a lightweight test-time training mechanism that learns how to map router confidence/probabilities to expert widths under a fixed budget. Experiments on GPT models trained on OpenWebText demonstrate that MoSE matches or improves upon standard MoE at full width and consistently shifts the Pareto frontier for accuracy vs. cost, achieving comparable performance with significantly fewer FLOPs.


💡 Research Summary

Mixture‑of‑Experts (MoE) has become a cornerstone for scaling large language models because it activates only a small subset of experts for each token, keeping per‑token computation low while allowing the overall parameter count to grow. However, once an expert is selected by the router, it is always executed at full capacity. This creates a coarse‑grained trade‑off: the model can only switch between different numbers of experts, but cannot adjust the amount of work each expert performs. Consequently, the accuracy‑compute curve of standard MoE exhibits large discontinuities.
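The coarse trade-off described above can be made concrete with a toy cost model (hypothetical normalized numbers, not figures from the paper): under top-k routing over full-width experts, per-token compute can only move in steps of one whole expert, whereas per-expert widths allow intermediate points.

```python
# A small illustration of the coarse accuracy-compute trade-off in standard
# MoE versus per-expert widths (hypothetical normalized costs).

EXPERT_FLOPS = 1.0                  # cost of one full-width expert (normalized)

def moe_cost(k):
    """Per-token cost with top-k routing over full-width experts."""
    return k * EXPERT_FLOPS

def mose_cost(widths):
    """Per-token cost when each active expert runs at its own width
    (FFN cost scales linearly with the sliced hidden dimension)."""
    return sum(w * EXPERT_FLOPS for w in widths)

print([moe_cost(k) for k in (1, 2, 3)])          # discrete steps: [1.0, 2.0, 3.0]
print(mose_cost((1.0, 0.5)))                     # intermediate point: 1.5
```

Standard MoE can only reach the integer cost levels; running one expert at full width and another at half width lands between them.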

The paper introduces MoSE (Mixture of Slimmable Experts), an architecture that equips every expert with a slimmable (nested) structure. Each expert is implemented as a transformer feed‑forward network (FFN) with an intermediate hidden dimension that is typically four times the model dimension. By defining a set of width multipliers w∈A (e.g., {0.25, 0.5, 0.75, 1.0}), the intermediate dimension can be sliced, yielding a family of sub‑networks that share parameters. At inference time the router still selects a sparse set of experts (top‑k), but an additional decision determines the execution width for each active expert. This adds a second axis of conditional computation: “which experts” and “how much of each expert”.
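The nested structure described above can be sketched as follows. This is a minimal NumPy illustration (assumed shapes and names, not the paper's implementation): a width multiplier slices the first w-fraction of the FFN's intermediate units, so all sub-networks share the same parameters and the output dimension is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
d_hidden = 4 * d_model                      # full intermediate dimension (4x model dim)
W_in = rng.standard_normal((d_model, d_hidden))
W_out = rng.standard_normal((d_hidden, d_model))

def slimmable_ffn(x, width):
    """Run the expert FFN at a fractional width in (0, 1]."""
    h = int(round(width * d_hidden))        # number of active hidden units
    z = np.maximum(x @ W_in[:, :h], 0.0)    # ReLU on the sliced up-projection
    return z @ W_out[:h, :]                 # sliced down-projection, same output dim

x = rng.standard_normal((2, d_model))       # a batch of 2 token vectors
for w in (0.25, 0.5, 0.75, 1.0):            # width multipliers from the set A
    print(w, slimmable_ffn(x, w).shape)     # output shape stays (2, d_model)
```

Because the w=0.25 sub-network's parameters are a prefix of the w=1.0 network's, a single checkpoint serves every width; only the runtime slice changes.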

Training MoSE requires that experts perform well across multiple widths while preserving the stability of sparse routing. The authors adopt a simple multi‑width schedule: for each mini‑batch they run the model twice—once at the maximum width w_max and once at a randomly sampled width w ∈ A—and combine the resulting losses with the standard MoE training objectives, so that every nested sub‑network is trained without destabilizing the router.
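The schedule just described can be sketched as a single training step. This is a hedged sketch: `forward_loss` is a hypothetical stand-in for the model's forward pass plus language-modeling loss, and the real recipe also retains the usual MoE auxiliary objectives (e.g., load balancing).

```python
import numpy as np

# Sketch of the multi-width training schedule: each mini-batch is run once
# at the maximum width and once at a randomly sampled width from the
# multiplier set, and the two losses are combined.

rng = np.random.default_rng(0)
WIDTHS = (0.25, 0.5, 0.75, 1.0)             # width multiplier set A
W_MAX = max(WIDTHS)

def forward_loss(batch, width):
    # Placeholder: a real implementation runs the MoSE model at `width`
    # and returns the language-modeling loss for this batch.
    return float(np.mean(batch) / width)

def multi_width_step(batch):
    loss_full = forward_loss(batch, W_MAX)      # pass 1: maximum width
    w = WIDTHS[rng.integers(len(WIDTHS))]       # pass 2: randomly sampled width
    loss_sampled = forward_loss(batch, w)
    return loss_full + loss_sampled             # combined loss to backpropagate

batch = np.abs(rng.standard_normal(16))         # dummy mini-batch
print(multi_width_step(batch))
```

Running the full-width pass every step anchors the strongest sub-network, while the sampled pass spreads gradient signal across the narrower slices.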

