A Theoretical Framework for Modular Learning of Robust Generative Models


📝 Original Info

  • Title: A Theoretical Framework for Modular Learning of Robust Generative Models
  • ArXiv ID: 2602.17554
  • Date: 2026-02-19
  • Authors: **Not specified in the provided source. (See the original paper for author names and affiliations.)**

📝 Abstract

Training large-scale generative models is resource-intensive and relies heavily on heuristic dataset weighting. We address two fundamental questions: can we train Large Language Models (LLMs) modularly, combining small, domain-specific experts to match monolithic performance, and can we do so robustly for any data mixture, eliminating heuristic tuning? We present a theoretical framework for modular generative modeling in which a set of pre-trained experts is combined via a gating mechanism. We define the space of normalized gating functions, $G_{1}$, and formulate the problem as a minimax game to find a single robust gate that minimizes divergence to the worst-case data mixture. We prove the existence of such a robust gate using Kakutani's fixed-point theorem and show that modularity acts as a strong regularizer, with generalization bounds scaling with the lightweight gate's complexity. Furthermore, we prove that this modular approach can theoretically outperform models retrained on the aggregate data, with the gap characterized by the Jensen-Shannon divergence. Finally, we introduce a scalable Stochastic Primal-Dual algorithm and a Structural Distillation method for efficient inference. Empirical results on synthetic and real-world datasets confirm that our modular architecture effectively mitigates gradient conflict and can robustly outperform monolithic baselines.

💡 Deep Analysis

📄 Full Content

Training large-scale generative models, such as Large Language Models (LLMs), is notoriously expensive and often impractical to repeat for every new dataset [Brown et al., 2020, Hoffmann et al., 2022]. The computational cost and environmental footprint of these dense models have raised significant sustainability concerns [Strubell et al., 2019, Schwartz et al., 2020]. This monolithic paradigm faces two critical challenges. First, sustainability and adaptability: can we train LLMs modularly, learning small, accurate models on individual domains (e.g., math, coding) and combining them to match a giant model? If so, training becomes dramatically cheaper and greener; updates require training only a new module and the lightweight combiner, avoiding catastrophic forgetting [Kirkpatrick et al., 2017, Parisi et al., 2019] and enabling the efficient reuse of pretrained experts [Pfeiffer et al., 2023]. In the future, privacy regulations could also restrict access to certain data domains; in that case, smaller models trained by the data owners could constitute the only viable path to those data. Second, robustness: standard training relies on heuristic importance weights across datasets [Gao et al., 2020, Touvron et al., 2023] or static optimization targets [Xie et al., 2023], and often fails when test distributions differ from training assumptions [Koh et al., 2021]. Can we build a modular LLM that is accurate for any mixture of datasets, eliminating heuristic weighting entirely?

We provide an affirmative answer to both questions, offering the first rigorous game-theoretic framework for robust modularity. Unlike heuristic approaches such as simple parameter averaging (Model Soups) [Wortsman et al., 2022], task arithmetic [Ilharco et al., 2022], or standard Mixture of Experts layers that rely on auxiliary load-balancing losses [Shazeer et al., 2017, Fedus et al., 2022], we seek a single system that is robust to any arbitrary mixture of the underlying source distributions. We propose a gated solution, $\pi_g(x) = \sum_k g(x, k)\,\pi_k(x)$, where an adaptive gate dynamically reweights frozen experts. Our goal is to find a robust gate $g^*$ that minimizes the divergence to the worst-case data mixture, akin to Distributionally Robust Optimization (DRO) [Sagawa et al., 2020].
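To make the gated construction concrete, here is a minimal numerical sketch of $\pi_g(x) = \sum_k g(x, k)\,\pi_k(x)$: frozen expert densities combined by a normalized, input-dependent gate. The Gaussian "experts" and the distance-based softmax gate are illustrative assumptions, not the paper's actual models.

```python
import numpy as np

def gated_mixture(x, experts, gate):
    """Combine frozen expert densities pi_k via a normalized gate g(x, k).

    experts: list of callables pi_k(x) -> likelihood of x under expert k
    gate:    callable x -> weight vector over experts, summing to 1
             (i.e., the gate lives in the normalized space G_1)
    """
    weights = gate(x)                        # g(x, k) for each expert k
    probs = np.array([pi(x) for pi in experts])
    return float(weights @ probs)            # pi_g(x) = sum_k g(x,k) pi_k(x)

# Toy experts: two unit-variance Gaussians (purely illustrative).
def gauss(mu, sigma):
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

experts = [gauss(-2.0, 1.0), gauss(2.0, 1.0)]

def softmax_gate(x):
    # A simple hand-crafted gate: favor the expert whose mode is closer to x.
    scores = np.array([-abs(x + 2.0), -abs(x - 2.0)])
    e = np.exp(scores - scores.max())
    return e / e.sum()

p = gated_mixture(0.5, experts, softmax_gate)  # density of the gated mixture at x = 0.5
```

Because the gate outputs a point on the simplex for every $x$, $\pi_g$ is a valid density whenever the experts are, which is the property the normalized gate space $G_1$ encodes.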

Contributions. Our main contributions are:

  1. Theoretical Framework: We define the normalized gate space $G_1$ and formulate robustness as a minimax game. We prove the existence of a robust gate using Kakutani’s fixed-point theorem, establishing a stable upper bound on the worst-case risk (Theorem 3).

  2. Generalization Bounds: We derive bounds showing that sample complexity scales with the lightweight gate's complexity and the expert coincidence norm $C_\Pi$, rather than with the massive expert parameter count.
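The minimax game behind these contributions, $\min_g \max_{\lambda \in \Delta} \sum_k \lambda_k L_k(g)$, can be illustrated with a deterministic toy version of the paper's Stochastic Primal-Dual idea: gradient descent on the primal (gate) parameters against exponentiated-gradient ascent on the adversarial mixture weights. The scalar parameter and quadratic per-domain losses below are assumptions for illustration only.

```python
import numpy as np

def primal_dual(losses, theta0, lr_theta=0.1, lr_lam=0.5, steps=200):
    """Toy primal-dual loop for min_theta max_{lam in simplex} sum_k lam_k L_k(theta).

    losses: list of callables L_k(theta) -> (value, gradient), one per domain k.
    Returns the final primal parameter and the adversarial mixture weights.
    """
    theta = np.asarray(theta0, dtype=float)
    lam = np.full(len(losses), 1.0 / len(losses))    # dual: mixture weights on the simplex
    for _ in range(steps):
        vals, grads = zip(*(L(theta) for L in losses))
        vals = np.array(vals)
        # Dual ascent: exponentiated gradient keeps lam a valid mixture.
        lam = lam * np.exp(lr_lam * vals)
        lam /= lam.sum()
        # Primal descent on the lam-weighted (worst-case-leaning) loss.
        theta -= lr_theta * sum(l * g for l, g in zip(lam, grads))
    return theta, lam

# Illustrative per-domain losses: L_k(theta) = (theta - c_k)^2.
centers = [-1.0, 0.0, 2.0]
losses = [lambda th, c=c: ((th - c) ** 2, 2 * (th - c)) for c in centers]
theta_star, lam_star = primal_dual(losses, theta0=0.0)
```

The dual player concentrates weight on the currently worst domain, so the primal player is driven toward a parameter that hedges across all domains rather than the average-case optimum, which is the robustness the framework formalizes.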

counterpart. It is also not clear how this construction scales beyond two models, as it may require a quadratic number of pairwise cross-attention connections. Complementary to this is model stitching [Jiang and Li, 2024], where pre-trained blocks from disparate models, such as BERT and GPT, are integrated directly.

Similarly, recent frameworks like StitchLLM [Hu et al., 2025] dynamically route requests across stitched blocks-for instance, feeding the lower layers of one model into the upper layers of another-to optimize the trade-off between latency and accuracy. Crucially, neither approach provides theoretical analysis or guarantees for the resulting composed model. In contrast, our approach preserves experts as black boxes and offers strong theoretical guarantees for a gating mechanism robust to worst-case distribution mixtures.

Theoretical Routing and Learning to Defer. Our problem shares conceptual similarities with routing in learning to defer, where a learner chooses between predicting or deferring to experts. Foundational work by Cortes, DeSalvo, and Mohri [2016a, 2024] and Mohri, Andor, Choi, Collins, Mao, and Zhong [2024] established the theory of learning with rejection in binary classification. This line of work was significantly expanded to multi-class settings and predictor-rejector frameworks by Mao et al. [2023, 2024a,b,c,d,e, 2025], DeSalvo et al. [2025], and Mao [2025]. Our approach diverges from this literature in three key aspects. First, unlike standard routing, which performs a hard selection of a single expert, our gated framework induces a distribution over base models. Second, rather than optimizing for average-case performance, we address robustness against adversarial distribution mixtures. Finally, while computational cost is a primary consideration in standard model routing, our current framework focuses purely on statistical performance guarantees.

Modular Marketplaces and Ecosystems. Beyond functional integration, the rise of LLMs has spurred interest in the economic dynamics of modular systems. Bhawalkar et al. [2025] analyze “modular marketplaces” from a game-theoretic perspective, focusing on price equilibria where module owners act strategically to maxi


This content is AI-processed based on open access ArXiv data.
