📝 Original Info
- Title: Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training
- ArXiv ID: 2512.13996
- Date: 2025-12-16
- Authors: Can Jin, Hongwu Peng, Mingcan Xiang, Qixin Zhang, Xiangchi Yuan, Amit Hasan, Ohiremen Dibua, Yifan Gong, Yan Kang, Dimitris N. Metaxas
📝 Abstract
Sparse Mixture-of-Experts (MoE) architectures effectively scale model capacity by activating only a subset of experts for each input token. However, the standard Top-k routing strategy imposes a uniform sparsity pattern that ignores the varying difficulty of tokens. While Top-p routing offers a flexible alternative, existing implementations typically rely on a fixed global probability threshold, which results in uncontrolled computational costs and sensitivity to hyperparameter selection. In this paper, we propose DTop-p MoE, a sparsity-controllable dynamic Top-p routing mechanism. To resolve the challenge of optimizing a non-differentiable threshold, we utilize a Proportional-Integral (PI) Controller that dynamically adjusts the probability threshold to align the running activated-expert sparsity with a specified target. Furthermore, we introduce a dynamic routing normalization mechanism that adapts layer-wise routing logits, allowing different layers to learn distinct expert-selection patterns while utilizing a global probability threshold. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that DTop-p consistently outperforms both Top-k and fixed-threshold Top-p baselines. Our analysis confirms that DTop-p maintains precise control over the number of activated experts while adaptively allocating resources across different tokens and layers. Furthermore, DTop-p exhibits strong scaling properties with respect to expert granularity, expert capacity, model size, and dataset size, offering a robust framework for large-scale MoE pre-training.
📄 Full Content
Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training

Can Jin1∗, Hongwu Peng2∗, Mingcan Xiang2,3, Qixin Zhang4, Xiangchi Yuan5, Amit Hasan2, Ohiremen Dibua2, Yifan Gong2, Yan Kang2†, Dimitris N. Metaxas1†

1Rutgers University, 2Adobe Research, 3UMass Amherst, 4Nanyang Technological University, 5Georgia Institute of Technology

∗Equal Contribution, †Equal Advising. Correspondence to: Can Jin. Work done during an internship at Adobe Research.
Abstract

Sparse Mixture-of-Experts (MoE) architectures effectively scale model capacity by activating only a subset of experts for each input token. However, the standard Top-k routing strategy imposes a uniform sparsity pattern that ignores the varying difficulty of tokens. While Top-p routing offers a flexible alternative, existing implementations typically rely on a fixed global probability threshold, which results in uncontrolled computational costs and sensitivity to hyperparameter selection. In this paper, we propose DTop-p MoE, a sparsity-controllable dynamic Top-p routing mechanism. To resolve the challenge of optimizing a non-differentiable threshold, we utilize a Proportional-Integral (PI) controller that dynamically adjusts the probability threshold to align the running activated-expert sparsity with a specified target. Furthermore, we introduce a dynamic routing normalization mechanism that adapts layer-wise routing logits, allowing different layers to learn distinct expert-selection patterns while utilizing a global probability threshold. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that DTop-p consistently outperforms both Top-k and fixed-threshold Top-p baselines. Our analysis confirms that DTop-p maintains precise control over the number of activated experts while adaptively allocating resources across different tokens and layers. Furthermore, DTop-p exhibits strong scaling properties with respect to expert granularity, expert capacity, model size, and dataset size, offering a robust framework for large-scale MoE pre-training.
1 Introduction
The recent development of large foundation models (LFMs) [19, 32, 38, 39, 44] has been driven largely by the observation that scaling model capacity reliably improves performance across modalities and tasks [28]. However, simply increasing model size quickly encounters computational and memory constraints. Consequently, Sparse Mixture-of-Experts (MoE) architectures have become a primary component of state-of-the-art systems [14, 29, 48, 50, 53, 54]. By activating only a small subset of experts for each token, MoE models scale parameter count and training throughput simultaneously, often outperforming dense counterparts while remaining trainable in parallel [12, 37].
Most existing MoE systems rely on Top-k routing: for every token at every layer, the router selects exactly k experts [12, 14, 37, 53]. While computationally efficient and hardware-friendly, this design enforces a uniform sparsity pattern across tokens and layers. In practice, however, different tokens exhibit distinct routing requirements: complex or rare tokens may benefit from more experts, while simple or redundant tokens may require fewer [23, 27]. To address this, recent works explore Top-p routing, which selects experts whose cumulative routing probabilities exceed a fixed threshold, thereby allowing different tokens to activate varying numbers of experts [23, 55]. Yet, current Top-p variants face two primary limitations: (i) they typically utilize a global, fixed probability threshold, which is unlikely to remain optimal throughout the entire training process; and (ii) the number of activated experts is uncontrolled, which presents significant challenges for large-scale pre-training where computational budgets must be strictly managed.
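To make the contrast concrete, the sketch below illustrates Top-k versus fixed-threshold Top-p selection for a single token's router output. It is a minimal NumPy illustration, not the paper's implementation: the function names, the softmax router, and the example logits are assumptions, but the selection rules follow the descriptions above (exactly k experts vs. the smallest set whose cumulative routing probability exceeds p).

```python
import numpy as np

def topk_experts(logits: np.ndarray, k: int) -> np.ndarray:
    """Standard Top-k routing: always activate exactly k experts."""
    return np.argsort(logits)[::-1][:k]

def topp_experts(logits: np.ndarray, p: float) -> np.ndarray:
    """Fixed-threshold Top-p routing: activate the smallest prefix of experts
    (sorted by routing probability) whose cumulative probability reaches p."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax over experts
    order = np.argsort(probs)[::-1]               # experts by descending probability
    csum = np.cumsum(probs[order])
    n_active = min(int(np.searchsorted(csum, p)) + 1, len(order))
    return order[:n_active]

# A "confident" token concentrates routing mass on few experts;
# an "uncertain" token spreads it nearly uniformly.
rng = np.random.default_rng(0)
confident = np.array([4.0, 1.0, 0.5, 0.2, 0.1, 0.0, -0.5, -1.0])
uncertain = rng.normal(scale=0.3, size=8)

for name, logits in [("confident", confident), ("uncertain", uncertain)]:
    print(f"{name}: top-k (k=2) activates {len(topk_experts(logits, k=2))} experts, "
          f"top-p (p=0.7) activates {len(topp_experts(logits, p=0.7))} experts")
```

Under this rule, the confident token activates a single expert while the uncertain token activates most of them, which is exactly the uncontrolled per-token compute that motivates the sparsity-controllable mechanism below.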
In this paper, we first demonstrate that a fixed-threshold Top-p MoE yields only marginal gains over Top-k routing when activating a similar number of parameters. Moreover, its performance and computational cost are highly sensitive to the choice of p, and the resulting sparsity is uncontrolled, leading to unstable computational costs. This observation motivates the development of a sparsity-controllable dynamic Top-p routing mechanism. The primary challenge lies in the fact that the probability threshold does not naturally receive gradients; it serves only to binarize the expert mask (0/1), so treating it as a standard learnable parameter is ineffective. To overcome this, we draw inspiration from Proportional-Integral (PI) control in classical control theory and treat the global sparsity as a target signal [1, 62]. We estimate the current activated-expert sparsity from the router and update the probability threshold via a PI controller. In doing so, we (i) render the threshold effectively learnable, (ii) maintain the overall activated-expert ratio close to a user-specified budget across tokens and
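As a rough illustration of the control loop described above, the sketch below uses a PI controller to nudge the global threshold p so that a running estimate of the activated-expert ratio tracks a user-specified budget. It is a hedged sketch rather than the authors' implementation: the class name, gains, EMA decay, initial threshold, and clipping bounds are all assumptions.

```python
import numpy as np

class ThresholdPIController:
    """Minimal PI controller that adjusts the Top-p threshold so that the
    running activated-expert ratio tracks a target sparsity budget.
    Gains, EMA decay, and clipping bounds are illustrative assumptions."""

    def __init__(self, target_ratio: float, kp: float = 0.05, ki: float = 0.005,
                 ema_decay: float = 0.99, p_init: float = 0.6):
        self.target = target_ratio        # desired fraction of activated experts
        self.kp, self.ki = kp, ki         # proportional / integral gains
        self.ema_decay = ema_decay
        self.running_ratio = target_ratio # running estimate of observed sparsity
        self.integral = 0.0               # accumulated error for the integral term
        self.p = p_init                   # current global probability threshold

    def update(self, batch_activation_ratio: float) -> float:
        # Exponential moving average of the observed activated-expert ratio.
        self.running_ratio = (self.ema_decay * self.running_ratio
                              + (1.0 - self.ema_decay) * batch_activation_ratio)
        # Positive error => too many experts activated => lower the threshold p,
        # since a smaller cumulative-probability target selects fewer experts.
        error = self.running_ratio - self.target
        self.integral += error
        self.p = float(np.clip(self.p - self.kp * error - self.ki * self.integral,
                               0.05, 0.999))
        return self.p
```

Here `batch_activation_ratio` would be the measured fraction of activated (token, expert) pairs in the current training step, and the returned p would feed the Top-p selection at the next step; the exact controller form, gain settings, and update schedule used in the paper may differ.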