Mixture-of-Experts under Finite-Rate Gating: Communication--Generalization Trade-offs

Reading time: 5 minutes

📝 Original Info

  • Title: Mixture-of-Experts under Finite-Rate Gating: Communication–Generalization Trade-offs
  • ArXiv ID: 2602.15091
  • Date: 2026-02-16
  • Authors: A. Khalesi – Assistant Professor, Institut Polytechnique des Sciences Avancées (IPSA) & LINCS Lab, Paris, France (ali.khalesi@ipsa.fr); M. R. Deylam Salehi – IEEE Member, Nice, France (reza.deylam@ieee.org) (no other authors are identified in the paper)

📝 Abstract

Mixture-of-Experts (MoE) architectures decompose prediction tasks into specialized expert sub-networks selected by a gating mechanism. This letter adopts a communication-theoretic view of MoE gating, modeling the gate as a stochastic channel operating under a finite information rate. Within an information-theoretic learning framework, we specialize a mutual-information generalization bound and develop a rate-distortion characterization $D(R_g)$ of finite-rate gating, where $R_g := I(X; T)$, yielding (under a standard empirical rate-distortion optimality condition) $\mathbb{E}[R(W)] \le D(R_g) + \delta_m + \sqrt{(2/m)\, I(S; W)}$. The analysis yields capacity-aware limits for communication-constrained MoE systems, and numerical simulations on synthetic multi-expert models empirically confirm the predicted trade-offs between gating rate, expressivity, and generalization.
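To make the setting concrete, the following toy sketch shows an MoE forward pass with a softmax gate and hard (top-1) routing, so the routing variable $T$ is a discrete function of the input $X$. This is an illustrative assumption, not the paper's experimental model; all parameter names and shapes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy MoE: K linear experts plus a softmax gate with top-1 (hard) routing.
K, d = 4, 3
W_gate = rng.normal(size=(d, K))     # gate parameters (hypothetical)
W_experts = rng.normal(size=(K, d))  # one linear expert per row (hypothetical)

def moe_predict(X):
    probs = softmax(X @ W_gate)      # gate posterior over experts given x
    t = probs.argmax(axis=-1)        # discrete routing decision T
    # Sparse activation: each input is served only by its selected expert.
    y = np.einsum('nd,nd->n', X, W_experts[t])
    return y, t

X = rng.normal(size=(8, d))
y_hat, routes = moe_predict(X)
print(routes)  # which expert handled each input
```

Hard routing is what makes the channel view natural: the gate emits at most $\log_2 K$ bits per input, and any noise or compression on that decision reduces the usable rate $I(X;T)$.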

💡 Deep Analysis

📄 Full Content

MoE models [1], [2] combine multiple expert predictors via a gating network that assigns probabilistic weights or discrete routing decisions. This modular structure enables specialization, scalability, and sparse activation in large architectures such as the Switch Transformer [3], where only a small subset of experts is active per input. Despite their practical success, the theoretical understanding of MoE systems, particularly of generalization under finite communication resources, remains limited. Classical analyses [4] gave Rademacher-based bounds that depend additively on the gate complexity and the sum of expert complexities, scaling linearly with the number of experts. More recently, Akretche et al. [5] introduced local differential privacy (LDP) regularization on the gate, obtaining tighter PAC-Bayesian and Rademacher bounds with logarithmic dependence on the number of experts. However, even in this setting, the gate is treated as a stochastic selector rather than an information-constrained communication process.

The proposed communication-generalization framework for MoE, beyond its theoretical significance in distributed learning, has a natural interpretation in aeronautical and aerospace communication systems [6], [7]. Modern aircraft, satellites, and unmanned aerial vehicles (UAVs) increasingly rely on distributed sensing and computation, where multiple onboard or remote modules act as experts processing heterogeneous sensor streams under stringent bandwidth, latency, and reliability constraints. The MoE gate parallels the routing logic in such systems, deciding which local estimator or control unit should communicate with the flight computer or ground segment. In this context, the mutual-information rate constraints derived here quantify the performance degradation (in estimation, navigation, or control) induced by limited communication capacity, extending data-rate theorems [8], [9] to learned, data-driven aerospace decision systems.

Novelty and Contribution: From a communication-theoretic perspective, this work develops an explicit communication-generalization trade-off for MoE models by reinterpreting the gating mechanism as a finite-capacity stochastic channel. We quantify the gating pathway by its mutual-information rate $R_g = I(X; T)$, which limits how much information about the input can reach the expert bank and thus controls expressivity and statistical robustness. Building on this view, we introduce a rate-distortion formulation of the gating problem and show that the best achievable prediction risk at a given gating rate is characterized by a rate-distortion function $D(R_g)$, leading to the bound $\mathbb{E}[R(W)] \le D(R_g) + \delta_m + \sqrt{(2/m)\, I(S; W)}$, where $R_g$ enters through $D(R_g)$. We further consider the practically relevant case where gate decisions are conveyed over a physical link of capacity $C$; when the gate is trained/regularized to operate near this budget (so that $I(X; T) \approx C$), the bound specializes to $D(C)$, making the dependence on $C$ explicit and connecting MoE gating to classical notions such as channel capacity and data-rate limitations.
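The behavior of the bound's right-hand side can be sketched numerically. The distortion value, slack, and mutual-information budget below are hypothetical placeholder numbers chosen only to show the scaling: the sample-dependent term decays as $1/\sqrt{m}$, leaving the rate-limited floor $D(R_g) + \delta_m$.

```python
import numpy as np

def generalization_bound(D_Rg, delta_m, I_SW, m):
    """Right-hand side of E[R(W)] <= D(R_g) + delta_m + sqrt((2/m) I(S;W)).

    D_Rg   : distortion at the operating gating rate R_g = I(X;T)
    delta_m: empirical rate-distortion optimality slack
    I_SW   : mutual information between the sample S and the weights W
    m      : number of training samples
    """
    return D_Rg + delta_m + np.sqrt((2.0 / m) * I_SW)

# Hypothetical numbers: as m grows, the bound tightens toward D(R_g) + delta_m.
for m in (100, 1000, 10000):
    print(m, generalization_bound(D_Rg=0.10, delta_m=0.01, I_SW=5.0, m=m))
```

Note how the gating rate acts only through the floor $D(R_g)$: no amount of extra data removes the distortion incurred by a rate-limited gate.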

Information-theoretic generalization theory [10], [11], communication-limited learning [12], [13], and recent MoE risk bounds [5] are well established, and several works interpret specialization and routing through information-theoretic lenses. To the best of our knowledge, however, no prior MoE risk bound models the gate explicitly as a rate-limited channel, with the achievable risk expressed through a rate-distortion function in which the gating rate $I(X; T)$ appears as an explicit design parameter. Related information-theoretic perspectives on specialization and hierarchical decision systems have also been studied, e.g., via mutual-information principles and online learning in hierarchical architectures [14]–[16]. Our contribution is to place MoE architectures within a unified rate-distortion and capacity framework, where the gating rate $I(X; T)$ acts as a system-level communication constraint (in practice upper-bounded by the available link capacity) shaping both expressivity and generalization, and where privacy or compression mechanisms can be interpreted as additional noisy-channel layers. The theoretical results are supported by numerical simulations on multi-expert MoE models and on a binary symmetric one-bit gating scenario, which empirically illustrate the predicted trade-offs between gating rate, sample size, and generalization performance.

Organization: Section II formalizes MoE models as stochastic communication systems. Section III applies information-theoretic generalization bounds to MoE gating. Section IV introduces a rate-distortion formulation of the gating mechanism and establishes a


This content is AI-processed based on open access ArXiv data.
