Dynamic Expert Sharing: Decoupling Memory from Parallelism in Mixture-of-Experts Diffusion LLMs
Among parallel decoding paradigms, diffusion large language models (dLLMs) have emerged as a promising candidate that balances generation quality and throughput. However, their integration with Mixture-of-Experts (MoE) architectures is constrained by an expert explosion: as the number of tokens generated in parallel increases, the number of distinct experts activated grows nearly linearly. The resulting memory traffic pushes inference into a memory-bound regime, negating the efficiency gains of both MoE and parallel decoding. To address this challenge, we propose Dynamic Expert Sharing (DES), a technique that shifts MoE optimization from token-centric pruning and expert-skipping methods to sequence-level coreset selection. To maximize expert reuse, DES identifies a compact, high-utility set of experts that serves an entire parallel decoding block. We introduce two selection strategies: (1) Intra-Sequence Sharing (DES-Seq), which adapts optimal allocation to the sequence level, and (2) Saliency-Aware Voting (DES-Vote), a mechanism that lets tokens collectively elect a coreset based on aggregated router weights. Extensive experiments on MoE dLLMs demonstrate that DES reduces unique expert activations by over 55% and latency by up to 38%, while retaining 99% of vanilla accuracy, effectively decoupling memory overhead from the degree of parallelism.
💡 Research Summary
The paper tackles a critical bottleneck that arises when combining diffusion large language models (dLLMs) with Mixture‑of‑Experts (MoE) architectures under parallel decoding. While dLLMs generate multiple tokens simultaneously, MoE traditionally routes each token independently to its top‑K experts. As the parallel block size N grows, the union of all selected experts expands almost linearly—a phenomenon the authors call “expert explosion.” This leads to a dominant memory‑bound regime where the cost of fetching unique expert weights from high‑bandwidth memory (HBM) to on‑chip SRAM (denoted by b·|∪ₙSₙ|) dwarfs the actual compute cost (a·N·K). Existing token‑centric optimizations such as expert skipping or pruning reduce per‑token FLOPs but do not address the global unique‑expert load, leaving the memory bottleneck untouched.
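The expert-explosion effect described above can be illustrated with a small simulation. The sketch below is not the paper's code: it assumes a generic MoE router with Gaussian logits and simply counts how many distinct experts a block of N independently routed tokens touches, i.e. the |∪ₙSₙ| term that drives the memory cost.

```python
import numpy as np

def unique_experts_activated(n_tokens, n_experts=64, top_k=4, seed=0):
    """Route each token independently to its top-K experts and count
    the distinct experts the whole parallel block activates (|union S_n|)."""
    rng = np.random.default_rng(seed)
    # Illustrative stand-in for router logits; real routers are learned.
    logits = rng.normal(size=(n_tokens, n_experts))
    topk = np.argpartition(logits, -top_k, axis=1)[:, -top_k:]
    return len(np.unique(topk))

# The union grows quickly with block size N, while the per-token compute
# term (a * N * K) is the only thing token-centric methods reduce.
for n in (1, 16, 32, 64):
    print(n, unique_experts_activated(n))
```

With uncorrelated routing, the unique-expert count approaches the full expert pool as N grows, which is exactly the memory-bound regime the paper identifies.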
To break this coupling, the authors propose Dynamic Expert Sharing (DES), a paradigm shift from token‑level routing to sequence‑level “coreset” selection. A coreset is a compact subset C of experts that is dynamically identified at runtime based on aggregated routing information I (e.g., router logits or hidden states). All tokens in the parallel block are then forced to choose their top‑K experts exclusively from C, thereby reducing the weight‑fetching cost to b·|C| while keeping the compute term unchanged. The optimization problem is formalized as minimizing |C| subject to an accuracy constraint A(C) ≥ A_base − ε, where ε is a small tolerable drop.
Two concrete strategies instantiate DES:
- DES‑Seq (Intra‑Sequence Sharing) – simply takes the union of the experts each token would have selected independently and uses that union as the coreset. This approach is straightforward and automatically captures any overlap, but may retain unnecessary experts.
- DES‑Vote (Saliency‑Aware Voting) – aggregates the router logits across all tokens, computes a weighted average, and then selects the most "salient" experts based on this consensus. Tokens then vote for experts with the highest aggregated scores, effectively pruning the coreset to the most globally important experts. This method leverages the semantic coherence of tokens generated together and typically yields a smaller C than DES‑Seq.
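The two strategies can be sketched side by side. This is a minimal illustration, not the authors' implementation: the exact aggregation function Φ and the coreset budget rule are the paper's; the softmax-mean saliency and the fixed `budget` parameter used here are assumptions.

```python
import numpy as np

def coreset_seq(router_logits, top_k):
    """DES-Seq sketch: the coreset is the union of the experts each
    token would have picked independently (overlap is captured for free)."""
    topk = np.argpartition(router_logits, -top_k, axis=1)[:, -top_k:]
    return np.unique(topk)

def coreset_vote(router_logits, budget):
    """DES-Vote sketch (assumed aggregation): average the per-token
    router softmax weights and keep the `budget` highest-scoring experts."""
    shifted = router_logits - router_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    saliency = probs.mean(axis=0)          # one aggregated "vote" per expert
    return np.sort(np.argsort(saliency)[-budget:])
```

Because DES-Vote fixes the coreset size directly, it can produce a strictly smaller C than the union taken by DES-Seq, matching the Pareto behavior reported in the experiments.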
Algorithmically, DES proceeds in two stages: (i) a sequence‑level consensus step that computes C via Φ(I), and (ii) a constrained local routing step where each token’s Top‑K selection is restricted to C, followed by renormalization using the model’s activation function σ. The authors provide a greedy approximation for Φ that balances coreset size and saliency, and they prove that the latency bound becomes L_MoE ≤ b·|C| + a·N·K, explicitly decoupling memory traffic from parallelism.
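The second stage, constrained local routing, can be sketched as follows. Assumptions are flagged in the comments: σ is taken to be a softmax over the surviving top-K logits, and the coreset is assumed to contain at least K experts; the paper's greedy Φ is not reproduced here.

```python
import numpy as np

def route_within_coreset(router_logits, coreset, top_k):
    """Constrained local routing: mask logits outside the coreset C,
    take each token's top-K inside C, and renormalize the weights
    (softmax here stands in for the model's activation sigma)."""
    assert len(coreset) >= top_k, "coreset must contain at least K experts"
    masked = np.full_like(router_logits, -np.inf)
    masked[:, coreset] = router_logits[:, coreset]   # experts outside C excluded
    topk_idx = np.argpartition(masked, -top_k, axis=1)[:, -top_k:]
    topk_logits = np.take_along_axis(masked, topk_idx, axis=1)
    w = np.exp(topk_logits - topk_logits.max(axis=1, keepdims=True))
    weights = w / w.sum(axis=1, keepdims=True)       # renormalize over survivors
    return topk_idx, weights
```

Every token's K experts now come from C, so the weight-fetching term in the latency bound depends on |C| rather than on the union of N independent selections.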
Empirical evaluation is performed on two state‑of‑the‑art MoE dLLM families (LLaDA‑MoE and LLaDA2.0‑mini) across four benchmarks: HumanEval, MBPP, MATH500, and GSM8K. Experiments vary the parallel block length (16, 32, 64 tokens) and compare vanilla MoE, dynamic expert skipping, and the two DES variants. Results show:
- Unique expert activations drop by over 55% on average.
- MoE kernel latency decreases by up to 38% for larger blocks.
- Accuracy remains within 1% of the vanilla baseline (≈99% of original performance).
- DES‑Vote consistently pushes the Pareto frontier further, achieving the same accuracy with fewer experts than DES‑Seq.
A detailed latency breakdown confirms that the MoE feed‑forward network dominates overall time, and that reducing the number of distinct experts directly translates into lower HBM‑SRAM traffic. The authors also demonstrate that the coreset size can be tuned (via a hyper‑parameter α controlling the fraction of top‑saliency experts) to trade off latency against a negligible accuracy loss, offering flexibility for different deployment constraints.
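The latency/accuracy trade-off controlled by α can be made concrete with the paper's cost model, L_MoE ≤ b·|C| + a·N·K. The constants `b` and `a` below are purely illustrative stand-ins for the per-expert memory cost and per-token-per-expert compute cost; the mapping of α to a coreset size of ⌈α·E⌉ experts is an assumption consistent with "fraction of top-saliency experts".

```python
import math

def moe_latency_bound(coreset_size, n_tokens, top_k, b=1.0, a=0.01):
    """Latency bound from the paper: memory term b*|C| + compute term a*N*K.
    b and a are illustrative constants, not measured hardware values."""
    return b * coreset_size + a * n_tokens * top_k

# Shrinking the coreset via alpha lowers only the memory term; the
# compute term is fixed by the block size N and the routing fan-out K.
n_experts, n_tokens, top_k = 64, 32, 4
for alpha in (1.0, 0.5, 0.25):
    c = max(top_k, math.ceil(alpha * n_experts))
    print(f"alpha={alpha:.2f}  |C|={c}  bound={moe_latency_bound(c, n_tokens, top_k):.2f}")
```

Under this model the bound shrinks monotonically with α, which is the flexibility knob the authors expose for different deployment constraints.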
In the discussion, the paper emphasizes that DES is orthogonal to offline pruning, quantization, or weight‑sharing techniques; it can be combined with them for further gains. Moreover, while the study focuses on diffusion‑based LLMs, the underlying insight—that parallel tokens share contextual information and thus can share experts—should extend to other parallel decoding paradigms, including non‑diffusion AR models that employ block‑wise generation.
Limitations are acknowledged: DES relies on the quality of the router’s logits; poorly calibrated routers could produce suboptimal coresets. The current implementation uses a greedy heuristic for coreset selection; more sophisticated optimization (e.g., submodular maximization or meta‑learning of Φ) could yield better trade‑offs. Finally, scaling DES to multi‑GPU or distributed settings raises questions about how to synchronize coreset decisions across devices, an avenue for future work.
In conclusion, Dynamic Expert Sharing introduces a practical, sequence‑level expert selection mechanism that effectively decouples memory traffic from the degree of parallelism in MoE‑augmented diffusion LLMs. By dramatically cutting the number of unique expert weights fetched per decoding step while preserving generation quality, DES paves the way for truly scalable, high‑throughput LLM inference on modern hardware.