Fast Model Selection and Stable Optimization for Softmax-Gated Multinomial-Logistic Mixture of Experts Models
Mixture-of-Experts (MoE) architectures combine specialized predictors through a learned gate and are effective across regression and classification, but for classification with softmax multinomial-logistic gating, rigorous guarantees for stable maximum-likelihood training and principled model selection remain limited. We address both issues in the full-data (batch) regime. First, we derive a batch minorization-maximization (MM) algorithm for softmax-gated multinomial-logistic MoE using an explicit quadratic minorizer, yielding coordinate-wise closed-form updates that guarantee monotone ascent of the objective and global convergence to a stationary point (in the standard MM sense), avoiding approximate M-steps common in EM-type implementations. Second, we prove finite-sample rates for conditional density estimation and parameter recovery, and we adapt dendrograms of mixing measures to the classification setting to obtain a sweep-free selector of the number of experts that achieves near-parametric optimal rates after merging redundant fitted atoms. Experiments on biological protein–protein interaction prediction validate the full pipeline, delivering improved accuracy and better-calibrated probabilities than strong statistical and machine-learning baselines.
💡 Research Summary
This paper tackles two longstanding challenges in softmax‑gated multinomial‑logistic Mixture‑of‑Experts (SGMLMoE) models: (1) stable batch maximum‑likelihood training and (2) principled selection of the number of experts.
The authors first derive a batch minorization‑maximization (MM) algorithm that constructs an explicit quadratic minorizer of the observed‑data log‑likelihood. By introducing fixed curvature matrices Bₙ,ᴷ and Bₙ,ᴹ that bound the Hessian, they obtain closed‑form coordinate‑wise updates for both the gating parameters (w) and the expert parameters (v). Each MM iteration exactly maximizes the surrogate, guaranteeing monotone ascent of the observed‑data log‑likelihood and global convergence to a stationary point, without the inner‑loop optimizations required by traditional EM. Identifiability issues inherent to the softmax gate and the multinomial‑logistic experts are resolved by fixing the last gating vector and the last class's parameters to zero, eliminating translation invariance.
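The flavor of such a closed‑form MM step can be illustrated on the multinomial‑logistic piece alone. The sketch below is a simplified stand‑in, not the paper's exact Bₙ,ᴷ/Bₙ,ᴹ construction: it uses Böhning's classical fixed curvature bound ½(I − 11ᵀ/K) ⊗ XᵀX as the quadratic minorizer, which already gives a monotone Newton‑like update with the last class's parameters pinned to zero for identifiability.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def loglik(X, Y, W):
    """Observed-data multinomial log-likelihood."""
    P = softmax(X @ W)
    return float((Y * np.log(P + 1e-12)).sum())

def mm_multinomial_step(X, Y, W):
    """One MM ascent step for the multinomial-logistic log-likelihood.

    X: (n, d) features; Y: (n, K) one-hot labels; W: (d, K) weights with the
    last column fixed at zero (identifiability). Boehning's fixed bound
    B = 0.5 * (I_{K-1} - 11^T/K) kron X^T X dominates the negative Hessian,
    so the quadratic surrogate is a minorizer and the closed-form step below
    can never decrease the log-likelihood.
    """
    n, d = X.shape
    K = Y.shape[1]
    P = softmax(X @ W)                        # (n, K) current class probabilities
    G = X.T @ (Y - P)                         # (d, K) score (gradient) wrt W
    C = 0.5 * (np.eye(K - 1) - np.ones((K - 1, K - 1)) / K)
    B = np.kron(C, X.T @ X)                   # fixed curvature, computed once in practice
    g = G[:, :K - 1].reshape(-1, order="F")   # stack the K-1 free columns
    step = np.linalg.solve(B + 1e-8 * np.eye(B.shape[0]), g)
    W_new = W.copy()
    W_new[:, :K - 1] += step.reshape(d, K - 1, order="F")
    return W_new                              # last column stays zero
```

Because B does not depend on W, it can be factorized once and reused across iterations, which is part of what makes MM steps of this kind cheap compared with a full inner Newton solve.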
On the statistical side, the paper establishes finite‑sample convergence rates: the conditional density estimator converges in total variation at the parametric rate O(N⁻¹ᐟ²), and the parameter vectors (w, v) achieve the same rate in the L₂ norm. When the model is over‑specified (K > K₀), several fitted experts cluster around each true component, leading to slow convergence along non‑identifiable directions. To address this, the authors introduce a Voronoi‑type loss that respects the partition induced by the gating network and represent the fitted experts as atoms of a mixing measure. They then construct a hierarchical merging path (a dendrogram) from pairwise distances (e.g., KL divergence or the Voronoi loss) between experts. By weighing the increase in merging loss against the corresponding drop in log‑likelihood, they devise a Dendrogram Selection Criterion (DSC) that selects the number of experts K̂ from a single over‑specified fit, without sweeping over models of different sizes. Theoretical analysis shows that, after redundant atoms are merged, the estimator attains near‑parametric rates and the selector is consistent.
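The merging path is easy to sketch. The toy version below substitutes plain Euclidean distance between expert atoms for the paper's KL/Voronoi losses, and a largest‑gap rule on merge costs for the full DSC; both substitutions are simplifications for illustration only.

```python
import numpy as np

def merge_path(atoms, weights):
    """Greedy hierarchical merging of fitted expert atoms (dendrogram path).

    atoms: (K, d) fitted expert parameter vectors; weights: (K,) mixing weights.
    At each level the closest pair (Euclidean distance, standing in for the
    paper's KL/Voronoi loss) is merged into its weight-averaged atom.
    Returns the (atoms, weights) at every level plus the cost of each merge.
    """
    atoms = np.asarray(atoms, float).copy()
    weights = np.asarray(weights, float).copy()
    levels, costs = [(atoms.copy(), weights.copy())], []
    while len(atoms) > 1:
        K = len(atoms)
        D = np.linalg.norm(atoms[:, None] - atoms[None, :], axis=-1)
        D[np.diag_indices(K)] = np.inf
        i, j = np.unravel_index(np.argmin(D), D.shape)
        costs.append(float(D[i, j]))             # price paid by this merge
        w = weights[i] + weights[j]
        merged = (weights[i] * atoms[i] + weights[j] * atoms[j]) / w
        keep = [k for k in range(K) if k not in (i, j)]
        atoms = np.vstack([atoms[keep], merged[None, :]])
        weights = np.append(weights[keep], w)
        levels.append((atoms.copy(), weights.copy()))
    return levels, costs

def select_K(costs):
    """Largest-gap stand-in for the DSC: redundant atoms merge cheaply, so stop
    just before the first merge whose cost jumps sharply."""
    costs = np.asarray(costs)
    if len(costs) < 2:
        return len(costs) + 1
    first_expensive = int(np.argmax(np.diff(costs))) + 1  # index of costly merge
    return len(costs) + 1 - first_expensive               # experts left before it
```

The key property mirrored here is that only a single over‑specified fit is needed: the whole path from K down to 1 experts is obtained by cheap pairwise merges, never by refitting.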
Empirically, the method is evaluated on a large protein‑protein interaction (PPI) dataset with multiple classes. An over‑specified model with K = 8 (true K₀ ≈ 3) is fitted using the batch MM algorithm; the dendrogram procedure correctly recovers K̂ = 3. Compared against EM‑based MoE, deep neural classifiers (MLP, Transformer), and Bayesian merge‑truncate‑merge approaches, the proposed pipeline achieves 2–4 % higher accuracy and AUC, and markedly better calibration (lower Brier score and Expected Calibration Error). Training time is reduced by roughly 30 % relative to EM, and model selection requires only a single fit followed by dendrogram construction.
In summary, the paper delivers a mathematically rigorous, computationally efficient solution for SGMLMoE: a closed‑form batch MM optimizer with guaranteed monotonic ascent and convergence, and a sweep‑free, statistically sound model‑selection mechanism based on dendrogram merging. These contributions advance both the theory and practice of expert‑based classifiers, offering a robust tool for high‑dimensional, multi‑class problems.