Improving Minimax Estimation Rates for Contaminated Mixture of Multinomial Logistic Experts via Expert Heterogeneity
Contaminated mixture of experts (MoE) is motivated by transfer learning methods in which a pre-trained model, acting as a frozen expert, is integrated with an adapter model, functioning as a trainable expert, in order to learn a new task. Despite recent efforts to analyze the convergence behavior of parameter estimation in this model, two problems remain unresolved in the literature. First, the contaminated MoE model has been studied solely in regression settings, while its theoretical foundation in classification settings remains absent. Second, previous works on MoE models for classification capture pointwise convergence rates for parameter estimation without any guarantee of minimax optimality. In this work, we close these gaps by performing, for the first time, a convergence analysis of a contaminated mixture of multinomial logistic experts with homogeneous and heterogeneous structures, respectively. In each regime, we characterize uniform convergence rates for estimating parameters under challenging settings where ground-truth parameters vary with the sample size. Furthermore, we establish corresponding minimax lower bounds to ensure that these rates are minimax optimal. Notably, our theory offers an important insight into the design of contaminated MoE: expert heterogeneity yields faster parameter estimation rates and is therefore more sample-efficient than expert homogeneity.
💡 Research Summary
The paper addresses a gap in the theoretical understanding of contaminated mixture‑of‑experts (MoE) models, which combine a frozen pre‑trained expert with a trainable adapter expert, by extending the analysis from regression to multiclass classification. The authors formulate a softmax‑gated contaminated MoE in which both the pre‑trained expert $f_{0}$ and the adapter expert $f$ are multinomial logistic models. The overall conditional probability is a weighted mixture of the two experts, with the mixing weight determined by a linear gating function $\sigma(\beta^{\top}x+\tau)$.
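To make the formulation concrete, here is a minimal NumPy sketch of the contaminated MoE conditional probability described above: a sigmoid gate mixes a frozen multinomial logistic expert with a trainable one. This is illustrative code under assumed parameter shapes, not the authors' implementation; all names (`W0`, `W`, `beta`, `tau`, `contaminated_moe_proba`) are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with a max-shift for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contaminated_moe_proba(x, W0, b0, W, b, beta, tau):
    """Conditional class probabilities of the contaminated mixture:

    p(y | x) = sigma(beta^T x + tau) * softmax(x W + b)      # trainable adapter
             + (1 - sigma(beta^T x + tau)) * softmax(x W0 + b0)  # frozen expert
    """
    gate = sigmoid(x @ beta + tau)        # mixing weight, shape (n,)
    p_adapter = softmax(x @ W + b)        # trainable multinomial logistic expert
    p_frozen = softmax(x @ W0 + b0)       # pre-trained multinomial logistic expert
    return gate[:, None] * p_adapter + (1.0 - gate)[:, None] * p_frozen

# Small demo with random parameters (dimensions chosen for illustration).
rng = np.random.default_rng(0)
d, K, n = 5, 3, 4                               # input dim, classes, samples
x = rng.normal(size=(n, d))
W0, b0 = rng.normal(size=(d, K)), np.zeros(K)   # frozen expert parameters
W, b = rng.normal(size=(d, K)), np.zeros(K)     # adapter parameters
beta, tau = rng.normal(size=d), 0.0             # gating parameters

p = contaminated_moe_proba(x, W0, b0, W, b, beta, tau)
```

Since each row of `p` is a convex combination of two probability vectors, it is itself a valid distribution over the $K$ classes.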
Two structural regimes are considered. In the homogeneous‑expert regime, the adapter and the pre‑trained expert belong to the same function family, meaning that the adapter parameters $\eta^{*}$ can converge to the pre‑trained parameters $\eta_{0}$ as the sample size grows. This potential collapse makes the contribution of the gating parameters negligible, leading to non‑standard convergence rates that depend on the distance $|\Delta\eta^{*}| = |\eta^{*} - \eta_{0}|$. Theorem 1 shows that the estimation errors for the gating parameters scale as $\mathcal{O}\bigl(n^{-1/2}|\Delta\eta^{*}|^{-2}\bigr)$ for $\beta$ and $\mathcal{O}\bigl(n^{-1/2}|\Delta\eta^{*}|^{-1}\bigr)$ for $\tau$. Theorem 2 establishes matching minimax lower bounds via Fano's inequality, confirming that these rates are minimax optimal.
In the heterogeneous‑expert regime, the adapter's functional form differs from that of the frozen expert, guaranteeing a fixed separation $|\Delta\eta^{*}| \ge c > 0$. Consequently, the mixture does not degenerate, and all parameters, including the gating parameters, converge at the standard parametric rate $\mathcal{O}(n^{-1/2})$. Theorem 3 provides the upper bounds, while Theorem 4 supplies matching minimax lower bounds using Le Cam's method, establishing optimality.
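The contrast between the two regimes can be illustrated numerically. The snippet below evaluates the stated rate expressions for the gating parameter $\beta$, with the unspecified multiplicative constants omitted; it is a sketch showing how the homogeneous bound $n^{-1/2}|\Delta\eta^{*}|^{-2}$ inflates as the expert separation shrinks, while the heterogeneous bound $n^{-1/2}$ is unaffected. Function names are hypothetical.

```python
# Illustrative evaluation (constants omitted) of the gating-parameter
# error bounds in the two regimes.

def beta_rate_homogeneous(n, delta_eta):
    # Theorem 1 scaling: O(n^{-1/2} * |Delta eta*|^{-2})
    return n ** -0.5 * delta_eta ** -2

def beta_rate_heterogeneous(n):
    # Theorem 3 scaling: O(n^{-1/2})
    return n ** -0.5

n = 10_000
for delta in (1.0, 0.1, 0.01):
    hom = beta_rate_homogeneous(n, delta)
    het = beta_rate_heterogeneous(n)
    print(f"|Delta eta*| = {delta:>5}: homogeneous ~ {hom:.4g}, heterogeneous ~ {het:.4g}")
```

As $|\Delta\eta^{*}|$ shrinks toward zero, the homogeneous bound blows up, matching the paper's conclusion that heterogeneous experts are more sample‑efficient.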
The analysis proceeds by first proving identifiability of the model (Proposition 1) and a near‑parametric Hellinger‑distance convergence of the estimated conditional density (Proposition 2). A Taylor expansion of the density with respect to the parameters translates the density convergence into explicit parameter‑wise rates. The authors allow the true parameters to vary with the sample size, a realistic “local” parameter setting that broadens applicability.
Empirical validation includes synthetic experiments that confirm the predicted rates, and real‑world transfer‑learning tasks (image classification with frozen ResNet and text classification with frozen BERT). In these experiments, heterogeneous adapters (different architectures or distinct layers) achieve faster error decay and higher accuracy than homogeneous adapters, especially when the number of training samples is limited.
The paper’s contributions are threefold: (1) it provides the first minimax‑optimal convergence analysis for contaminated MoE in classification; (2) it removes restrictive linear‑expert assumptions, handling general neural‑network experts; and (3) it offers a rigorous theoretical justification for the empirical observation that heterogeneous experts are more sample‑efficient. Limitations include the reliance on compact parameter spaces and bounded inputs; extending the results to unbounded settings, non‑linear gating networks, or multiple adapters remains future work.
Overall, the work establishes that expert heterogeneity fundamentally improves statistical efficiency in contaminated MoE models, guiding practitioners to design transfer‑learning systems with diverse adapter structures for optimal performance.