Grassmannian Mixture-of-Experts: Concentration-Controlled Routing on Subspace Manifolds
Mixture-of-Experts models rely on learned routers to assign tokens to experts, yet standard softmax gating provides no principled mechanism to control the tradeoff between sparsity and utilization. We propose Grassmannian MoE (GrMoE), a routing framework that operates on the Grassmannian manifold of subspaces, where gating weights arise from the concentration parameters of Matrix Bingham distributions. This construction yields a single, interpretable knob, the concentration matrix $\Lambda$, that continuously controls routing entropy, replacing discrete top-$k$ selection with a smooth, geometrically principled sparsity mechanism. We further develop an amortized variational inference procedure for posterior routing distributions, enabling uncertainty-aware expert assignment that naturally resists expert collapse. We prove tight bounds relating the Bingham concentration spectrum to routing entropy, expected top-$k$ mass, and the probability of expert collapse (which decays exponentially), establishing the first formal theory of concentration-controlled sparsity. Across synthetic routing tasks and MoE language models at three scales (350M parameters with 8 experts, 1.3B with 16 experts, and 2.7B with 32 experts), GrMoE achieves 0% routing collapse across all seeds, comparable or better perplexity with 15-30% better load balance, and a smooth, monotonic relationship between concentration and effective sparsity that enables post-hoc sparsity tuning without retraining. Token-level analysis reveals that experts learn heterogeneous concentration values that correlate with linguistic specialization, yielding interpretable routing behavior.
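For context (a gloss, not taken from the paper): the matrix Bingham distribution over orthonormal frames is commonly written, up to an intractable normalizing constant, as

$$
p(U \mid A, \Lambda) \;\propto\; \operatorname{etr}\!\left(\Lambda\, U^\top A U\right), \qquad U^\top U = I_k,
$$

where $\operatorname{etr}(\cdot) = \exp(\operatorname{tr}(\cdot))$. Under this parameterization the eigenvalues of $\Lambda$ act as concentration parameters: larger magnitudes concentrate probability mass near the dominant subspaces of $A$, which is presumably the mechanism by which the single knob $\Lambda$ trades routing entropy against sparsity.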
💡 Research Summary
Mixture‑of‑Experts (MoE) models achieve remarkable scaling by activating only a subset of experts for each input token. However, current routing mechanisms—typically a linear projection followed by a softmax and a hard top‑k selection—suffer from three persistent problems: (1) expert collapse, where a few experts dominate the load while others remain under‑trained; (2) training instability caused by the discontinuous top‑k operation; and (3) the lack of a principled, controllable sparsity knob, forcing practitioners to retrain separate models for different compute budgets.
The authors propose Grassmannian MoE (GrMoE), a routing framework that treats each expert as a $k_r$-dimensional subspace of the representation space and routes tokens based on their alignment with these subspaces. Formally, each expert $e$ is represented by an orthonormal frame $U_e \in \mathbb{R}^{d \times k_r}$, defining the subspace $\operatorname{span}(U_e)$, a point on the Grassmannian $\mathrm{Gr}(k_r, d)$.
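As a concrete illustration of subspace-alignment routing, here is a minimal sketch; the function names, the squared-projection gating form, and the scalar concentration `lam` are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def orthonormal_frame(d, k, rng):
    """Random d x k orthonormal frame via QR (a point on the Stiefel manifold)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q[:, :k]

def grassmann_route(x, frames, lam=1.0):
    """Gate a token over experts by subspace alignment.

    The score for expert e is ||U_e^T x||^2, the squared norm of the
    projection of x onto span(U_e). Here `lam` plays the role of a
    scalar concentration parameter: larger lam yields sparser gates,
    lam -> 0 yields uniform routing.
    """
    scores = np.array([np.linalg.norm(U.T @ x) ** 2 for U in frames])
    logits = lam * scores
    gates = np.exp(logits - logits.max())  # stable softmax
    return gates / gates.sum()

rng = np.random.default_rng(0)
d, k_r, n_experts = 32, 4, 8
frames = [orthonormal_frame(d, k_r, rng) for _ in range(n_experts)]
x = rng.standard_normal(d)
gates = grassmann_route(x, frames, lam=5.0)
```

Because the gates are a smooth function of `lam`, effective sparsity can be tuned after training by adjusting the concentration alone, in contrast to hard top-$k$ selection.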