Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training

Reading time: 5 minutes
...

📝 Original Info

  • Title: Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training
  • ArXiv ID: 2512.13996
  • Date: 2025-12-16
  • Authors: Can Jin, Hongwu Peng, Mingcan Xiang, Qixin Zhang, Xiangchi Yuan, Amit Hasan, Ohiremen Dibua, Yifan Gong, Yan Kang, Dimitris N. Metaxas

📝 Abstract

Sparse Mixture-of-Experts (MoE) architectures effectively scale model capacity by activating only a subset of experts for each input token. However, the standard Top-k routing strategy imposes a uniform sparsity pattern that ignores the varying difficulty of tokens. While Top-p routing offers a flexible alternative, existing implementations typically rely on a fixed global probability threshold, which results in uncontrolled computational costs and sensitivity to hyperparameter selection. In this paper, we propose DTop-p MoE, a sparsity-controllable dynamic Top-p routing mechanism. To resolve the challenge of optimizing a non-differentiable threshold, we utilize a Proportional-Integral (PI) Controller that dynamically adjusts the probability threshold to align the running activated-expert sparsity with a specified target. Furthermore, we introduce a dynamic routing normalization mechanism that adapts layer-wise routing logits, allowing different layers to learn distinct expert-selection patterns while utilizing a global probability threshold. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that DTop-p consistently outperforms both Top-k and fixed-threshold Top-p baselines. Our analysis confirms that DTop-p maintains precise control over the number of activated experts while adaptively allocating resources across different tokens and layers. Furthermore, DTop-p exhibits strong scaling properties with respect to expert granularity, expert capacity, model size, and dataset size, offering a robust framework for large-scale MoE pre-training.


📄 Full Content

Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training

Can Jin1∗, Hongwu Peng2∗, Mingcan Xiang2,3, Qixin Zhang4, Xiangchi Yuan5, Amit Hasan2, Ohiremen Dibua2, Yifan Gong2, Yan Kang2†, Dimitris N. Metaxas1†
1Rutgers University, 2Adobe Research, 3UMass Amherst, 4Nanyang Technological University, 5Georgia Institute of Technology
∗Equal Contribution, †Equal Advising. Correspondence to: Can Jin. Work done during an internship at Adobe Research.
arXiv:2512.13996v1 [cs.AI] 16 Dec 2025

1 Introduction

The recent development of large foundation models (LFMs) [19, 32, 38, 39, 44] has been driven largely by the observation that scaling model capacity reliably improves performance across modalities and tasks [28]. However, simply increasing model size quickly encounters computational and memory constraints. Consequently, Sparse Mixture-of-Experts (MoE) architectures have become a primary component of state-of-the-art systems [14, 29, 48, 50, 53, 54]. By activating only a small subset of experts for each token, MoE models scale parameter count and training throughput simultaneously, often outperforming dense counterparts while remaining trainable in parallel [12, 37].

Most existing MoE systems rely on Top-k routing: for every token at every layer, the router selects exactly k experts [12, 14, 37, 53]. While computationally efficient and hardware-friendly, this design enforces a uniform sparsity pattern across tokens and layers. In practice, however, different tokens exhibit distinct routing requirements: complex or rare tokens may benefit from more experts, while simple or redundant tokens may require fewer [23, 27]. To address this, recent works explore Top-p routing, which selects experts whose cumulative routing probabilities exceed a fixed threshold, thereby allowing different tokens to activate varying numbers of experts [23, 55].
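To make the routing rule concrete, the following is a minimal PyTorch sketch of threshold-based expert selection as described above; the function name, tensor shapes, and the exact tie-handling are illustrative assumptions rather than the paper's implementation.

```python
import torch

def top_p_expert_mask(router_logits: torch.Tensor, p: float) -> torch.Tensor:
    """Select, per token, the smallest set of experts whose cumulative
    routing probability reaches the threshold p (names/shapes illustrative)."""
    probs = torch.softmax(router_logits, dim=-1)                  # (tokens, experts)
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
    cum_probs = torch.cumsum(sorted_probs, dim=-1)
    # Keep an expert if the probability mass accumulated *before* it is still
    # below p; this always keeps the top expert and includes the expert that
    # first crosses the threshold.
    keep_sorted = (cum_probs - sorted_probs) < p
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask.scatter_(-1, sorted_idx, keep_sorted)
    return mask
```

A token whose router distribution is sharply peaked ends up activating a single expert, while a token with a flat distribution activates several; this is the token-adaptive behavior that a fixed Top-k cannot express.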
Yet, current Top-p variants face two primary limitations: (i) they typically utilize a global, fixed probability threshold, which is unlikely to remain optimal throughout the entire training process; and (ii) the number of activated experts is uncontrolled, which presents significant challenges for large-scale pre-training where computational budgets must be strictly managed. In this paper, we first demonstrate that a fixed-threshold Top-p MoE yields only marginal gains over Top-k routing when activating a similar number of parameters. Furthermore, its performance and computational cost are highly sensitive to the selection of p, and it introduces uncontrolled sparsity, resulting in unstable computational costs. This observation motivates the development of a sparsity-controllable dynamic Top-p routing mechanism.

The primary challenge lies in the fact that the probability threshold does not naturally receive gradients; it serves only to binarize the expert mask (0/1), so treating it as a standard learnable parameter is ineffective. To overcome this, we draw inspiration from Proportional–Integral (PI) control in classical control theory and treat the global sparsity as a target signal [1, 62]. We estimate the current activated-expert sparsity from the router and update the probability threshold via a PI controller. In doing so, we (i) render the threshold effectively learnable, (ii) maintain the overall activated-expert ratio close to a user-specified budget across tokens and layers.
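The control loop described above can be sketched as follows; this is an illustrative PI update under assumed gains, clipping range, and EMA smoothing (the class name and hyperparameter values are assumptions for the sketch, not the paper's settings).

```python
class ThresholdPIController:
    """Illustrative PI controller that nudges the Top-p threshold so the running
    activated-expert ratio tracks a target budget. Gains, clipping range, and
    EMA factor are assumed values, not taken from the paper."""

    def __init__(self, target_ratio, kp=0.05, ki=0.005, p_init=0.5, ema=0.99):
        self.target = target_ratio         # e.g. 8 experts out of 64 -> 0.125
        self.kp, self.ki = kp, ki          # proportional / integral gains
        self.p = p_init                    # current global probability threshold
        self.integral = 0.0                # accumulated sparsity error
        self.ema = ema
        self.running_ratio = target_ratio  # EMA of the observed activation ratio

    def update(self, activated_ratio: float) -> float:
        """Call once per training step with the measured activated-expert ratio."""
        # Smooth the noisy per-batch measurement before computing the error.
        self.running_ratio = self.ema * self.running_ratio + (1.0 - self.ema) * activated_ratio
        error = self.running_ratio - self.target   # > 0 means too many experts active
        self.integral += error
        # A larger p activates more experts, so subtract the PI correction:
        # overshooting the budget pushes the threshold down, undershooting pushes it up.
        self.p -= self.kp * error + self.ki * self.integral
        self.p = min(max(self.p, 0.01), 0.99)      # keep the threshold in a valid range
        return self.p


# Hypothetical wiring into a routing step (reuses the Top-p sketch above):
# controller = ThresholdPIController(target_ratio=8 / 64)
# mask = top_p_expert_mask(router_logits, controller.p)
# controller.update(mask.float().mean().item())
```

The integral term is what keeps the long-run activation ratio pinned to the budget even when the proportional term alone would leave a steady-state offset.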

Reference

This content is AI-processed based on open access ArXiv data.
