DirMoE: Dirichlet-routed Mixture of Experts
Mixture-of-Experts (MoE) models have demonstrated exceptional performance in large-scale language models. Existing routers typically rely on the non-differentiable Top-$k$+Softmax operation, which limits their performance and scalability. We argue that two distinct decisions, which experts to activate and how to distribute contributions among the chosen experts, are conflated in standard Top-$k$+Softmax. We introduce Dirichlet-Routed MoE (DirMoE), a novel end-to-end differentiable routing mechanism built on a Dirichlet variational autoencoder framework. This design disentangles the two routing decisions: expert selection, modeled by a Bernoulli component, and expert contribution among the chosen experts, handled by a Dirichlet component. The entire forward pass remains fully differentiable through a Gumbel-Sigmoid relaxation for expert selection and implicit reparameterization for the Dirichlet distribution. Our training objective, a variational ELBO, includes a direct sparsity penalty that precisely controls the expected number of active experts, alongside a schedule for key hyperparameters that guides the model from an exploratory to a definitive routing state. Moreover, the DirMoE router matches or exceeds existing routing methods while improving expert specialization.
💡 Research Summary
Mixture‑of‑Experts (MoE) layers have become a cornerstone for scaling language models, but their performance hinges on the quality of the router that decides which experts to activate and how much each contributes. The dominant approach, Top‑k + Softmax, intertwines a discrete selection step with a continuous probability allocation. This coupling creates two major drawbacks: (1) the discrete Top‑k operation blocks gradients, forcing the use of surrogate tricks such as straight‑through estimators, temperature annealing, or auxiliary load‑balancing losses; (2) the single Softmax distribution entangles expert selection with contribution, making it hard to interpret load distribution and to control sparsity precisely.
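The coupling described above can be made concrete with a minimal sketch of standard Top‑k + Softmax routing. The function name and shapes below are illustrative, not taken from any particular implementation; the point is that the same score vector drives both the hard, gradient‑blocking selection and the continuous contribution weights.

```python
import numpy as np

def topk_softmax_route(logits, k):
    """Standard Top-k + Softmax routing sketch: hard-select the k largest
    router logits (a discrete, non-differentiable step), then renormalise
    only the selected logits with a softmax. Selection and contribution
    are both read off the same scores, which is the coupling DirMoE
    aims to remove."""
    idx = np.argsort(logits)[-k:]              # discrete Top-k: blocks gradients
    gates = np.zeros_like(logits)
    shifted = logits[idx] - logits[idx].max()  # numerically stable softmax
    gates[idx] = np.exp(shifted) / np.exp(shifted).sum()
    return gates, idx
```

The resulting gate vector is sparse by construction (only `k` nonzero entries), but the sparsity level is fixed rather than learned, and the `argsort` step is exactly where surrogate tricks like straight‑through estimators become necessary.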
DirMoE (Dirichlet‑Routed Mixture of Experts) addresses both issues by factorising the routing distribution into a spike (binary expert mask) and a slab (probability simplex over the active experts). The spike is modelled as a Bernoulli vector z with per‑expert activation probabilities π_i(x). To keep the entire pipeline differentiable, the authors employ a Gumbel‑Sigmoid (also known as Binary‑Concrete) relaxation:
$$\tilde{z}_i = \sigma\!\left(\frac{\log \pi_i(x) - \log\bigl(1-\pi_i(x)\bigr) + \log u_i - \log(1-u_i)}{\tau}\right), \qquad u_i \sim \mathrm{Uniform}(0,1),$$

where $\tau > 0$ is a temperature parameter; as $\tau \to 0$, the relaxed mask $\tilde{z}_i$ approaches a hard Bernoulli sample with success probability $\pi_i(x)$.
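The relaxation can be sketched in a few lines of NumPy. This is an illustrative sample-only sketch under assumed conventions (logit-space inputs, a `tau` temperature argument), not the paper's implementation: adding Logistic(0, 1) noise to the per‑expert logits and squashing with a temperature‑scaled sigmoid yields a soft mask that every downstream computation can differentiate through.

```python
import numpy as np

def gumbel_sigmoid(logits, tau, rng):
    """Sample a relaxed Bernoulli mask via the Gumbel-Sigmoid
    (Binary-Concrete) trick. `logits` holds log(pi / (1 - pi)) per
    expert; lower `tau` pushes samples toward hard 0/1 values."""
    u = rng.uniform(1e-8, 1.0 - 1e-8, size=np.shape(logits))
    logistic_noise = np.log(u) - np.log1p(-u)  # Logistic(0, 1) sample
    return 1.0 / (1.0 + np.exp(-(np.asarray(logits) + logistic_noise) / tau))
```

Because the output lies strictly inside $(0, 1)$, the mask stays differentiable during training; a hyperparameter schedule that anneals `tau` downward recovers increasingly discrete expert selections.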