LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models


Mixture of experts (MoE) architectures have become a cornerstone for scaling up and are a key component in most large language models such as GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE remains severely constrained by the prohibitive computational costs of training and evaluation, leaving large-scale studies out of reach for most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research that supports both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analytical tools for probing routing and expert dynamics. Leveraging this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, demonstrating how subtle changes in router initialization shape early expert utilization; and (iii) training regime differences, revealing how sparse upcycling and full pretraining exhibit distinct routing patterns and stability profiles. By lowering the barrier to entry and standardizing evaluation, along with our comprehensive analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovations. GitHub: https://github.com/Fsoft-AIC/LibMoE.


💡 Research Summary

The paper presents LibMoE, a unified, open‑source framework designed to lower the entry barrier for research on Mixture‑of‑Experts (MoE) models, especially in the context of large language models (LLMs). The authors identify that despite MoE’s proven scalability—evidenced by its inclusion in models such as GPT‑OSS, DeepSeek‑V3, Llama‑4, and Gemini‑2.5—systematic study has been hampered by the massive computational resources required for pre‑training and evaluation. LibMoE addresses this gap by providing (1) a standardized implementation of seven recent sparsely‑gated MoE (SMoE) algorithms, (2) consistent training pipelines for both full pre‑training and the “Sparse‑Upcycling” regime, and (3) a suite of transparent analytical tools that record token‑level routing decisions, expert utilization, load‑balancing metrics, routing entropy, and inter‑expert correlations.
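To make the analytical side concrete, the sketch below computes two of the statistics the paper tracks, routing entropy and per-expert utilization, directly from router logits. This is an illustrative numpy reconstruction, not LibMoE's actual API; the function name and signature are hypothetical.

```python
import numpy as np

def routing_metrics(logits: np.ndarray, top_k: int = 2):
    """Illustrative routing statistics from raw router logits.

    logits: (num_tokens, num_experts) array of router scores.
    Returns (mean routing entropy in nats, per-expert utilization
    under top-k selection). Hypothetical helper, not LibMoE's API.
    """
    # Softmax over the expert dimension for each token.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

    # Routing entropy: high values mean diverse/uncertain routing,
    # low values mean the router has specialized.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1).mean()

    # Top-k expert selection; utilization is the fraction of routed
    # slots each expert receives (uniform = perfectly balanced load).
    top = np.argsort(-probs, axis=-1)[:, :top_k]
    counts = np.bincount(top.ravel(), minlength=logits.shape[1])
    utilization = counts / counts.sum()
    return float(entropy), utilization
```

With uniform logits the entropy equals log(num_experts), the maximum; the entropy decline the paper observes during full pre-training corresponds to this value dropping as routing distributions sharpen.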

Framework Design
LibMoE’s architecture separates the router and expert modules, allowing researchers to plug in new routing strategies or expert configurations without rewriting the whole training loop. The framework supports model sizes from 0.15 B to 5.67 B parameters and runs on modest hardware (4 × H100 GPUs) with training times ranging from 6 to 44 hours. The “Sparse‑Upcycling” workflow first trains a dense vision‑language model (LLaVA) on a relatively small dataset (≈1.4 B tokens), then converts selected dense feed‑forward networks into MoE layers, thereby avoiding the costly full‑scale pre‑training phase while still enabling evaluation of MoE routing algorithms on top of a strong pretrained backbone.
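The upcycling step described above can be sketched in a few lines: the dense FFN's weights are replicated into each expert, and a small freshly initialized router is attached. This is a minimal numpy illustration of the general sparse-upcycling recipe; the function name, arguments, and `router_std` default are assumptions, not LibMoE's actual interface.

```python
import numpy as np

def upcycle_dense_ffn(w_in, w_out, num_experts, router_std=0.02, seed=0):
    """Sketch of sparse upcycling (hypothetical helper).

    w_in:  (d_model, d_ff) dense FFN input projection.
    w_out: (d_ff, d_model) dense FFN output projection.
    Returns `num_experts` identical expert weight pairs plus a
    freshly initialized (d_model, num_experts) router matrix.
    """
    rng = np.random.default_rng(seed)
    # Every expert starts as an exact copy of the dense FFN, so the
    # upcycled model initially matches the dense checkpoint.
    experts = [(w_in.copy(), w_out.copy()) for _ in range(num_experts)]
    # The router is the only newly initialized component.
    d_model = w_in.shape[0]
    router = rng.normal(0.0, router_std, size=(d_model, num_experts))
    return experts, router
```

Because all experts start identical, early routing decisions are driven almost entirely by the new router's initialization, which is one reason the paper's initialization study matters for this regime.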

Empirical Study
Using the unified pipeline, the authors conduct a large‑scale empirical analysis across three dimensions:

  1. Routing Dynamics – They measure routing entropy as a proxy for expert diversity and task specialization. In full pre‑training, entropy steadily declines, indicating that experts gradually specialize. In Sparse‑Upcycling, entropy remains higher and routing decisions fluctuate more, reflecting the delayed discovery of useful experts when starting from a dense checkpoint.

  2. Lightweight Initialization – By initializing router weights with a scaled normal distribution rather than a standard small‑variance Gaussian, early‑stage load balancing improves dramatically. This reduces the “rich‑get‑richer” effect where a few experts dominate early training, leading to more stable convergence and faster entropy reduction.

  3. Training Regime Differences – The study shows that both regimes are sensitive to the top‑K value and router initialization, but Sparse‑Upcycling exhibits larger routing variance and slower load‑balancing convergence. Performance gains over dense baselines under the limited‑resource setting are modest (≈0.2–0.5 BLEU, 0.3–0.6 ROUGE), suggesting that MoE’s full potential still relies on massive data and compute. However, the analytical metrics reveal clear improvements in expert utilization efficiency and routing stability.
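The load-balancing behavior discussed in points 2 and 3 is commonly quantified with the Switch-Transformer-style auxiliary loss, which penalizes the "rich-get-richer" concentration of tokens on a few experts. The sketch below shows that standard formulation; whether LibMoE uses exactly this loss is an assumption for illustration.

```python
import numpy as np

def load_balancing_loss(probs, assignments, num_experts):
    """Switch-Transformer-style auxiliary load-balancing loss:

        L = E * sum_e f_e * P_e

    where f_e is the fraction of tokens dispatched to expert e and
    P_e is the mean router probability assigned to expert e.
    The minimum value 1.0 is attained when routing is perfectly
    uniform across experts; concentration on few experts raises it.
    """
    # f_e: empirical dispatch fraction per expert (top-1 assignments).
    f = np.bincount(assignments, minlength=num_experts) / len(assignments)
    # P_e: mean router probability per expert.
    P = probs.mean(axis=0)
    return num_experts * float((f * P).sum())
```

Tracking this value over training makes the regime differences visible: under sparse upcycling it converges to its minimum more slowly, matching the larger routing variance the study reports.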

Open‑Source Release
All code, pretrained checkpoints, and visualization scripts are released on GitHub (https://github.com/Fsoft-AIC/LibMoE). The repository includes detailed documentation, reproducible baselines, and notebooks that replicate the routing‑entropy and load‑balancing analyses.

Impact and Future Work
LibMoE provides a reproducible, extensible platform that democratizes MoE research, enabling groups without large GPU clusters to conduct systematic comparisons of routing algorithms, initialization schemes, and up‑cycling strategies. The authors anticipate that the framework will facilitate deeper investigations into dynamic K‑selection, expert sharing mechanisms, and entropy‑driven routing optimization, ultimately accelerating the translation of MoE advances from academic prototypes to production‑grade LLMs.

