MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts

Reading time: 5 minutes

📝 Original Info

  • Title: MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts
  • ArXiv ID: 2511.21089
  • Date: 2025-11-26
  • Authors: Not listed in the source data.

📝 Abstract

Large Language Models (LLMs) are predominantly deployed as dense transformers, where every parameter in every feed-forward block is activated for every token. While architecturally simple, this is computationally inefficient, since inference costs scale linearly with parameter count. Recent upcycling methods such as MoEfication, CMoE, ToMoE, and MoORE reveal that much of the useful computation lives in sparse, semi-modular substructures inside dense feed-forward networks, but these approaches typically rely on clustering, activation profiling, singular value decomposition, or custom routing that requires calibration data. This paper introduces MLPMoE (MLP Mixture-of-Experts), a training-free, deterministic transformation that restructures the dense MLP in transformer blocks into a static, high-cardinality mixture of experts. The transformation uses simple tensor slicing and summation, reinterpreting the algebra of tensor parallelism as a topological conversion rather than a distributed training pattern. We further introduce Fractal Fade (differential branch sparsity) and Compensated Pruning (variance-preserving branch reduction) as lightweight mechanisms for structured sparsity. On Qwen2.5-0.5B-Instruct and DeepSeek-R1-Distill-Llama-8B, the zero-shot MLPMoE transform changes a proxy perplexity metric by less than 0.05 percent while keeping the parameter count effectively constant. On the 8B model, differential sparsity removes about 20 percent of MLP parameters while keeping perplexity within about 2 percent of the dense baseline. The method operates entirely post hoc on existing checkpoints and does not require gradients, calibration sets, or router training. Code is available at https://gist.github.com/iwallarm/fc2ef1eddf226ca7814f9e5e2ae9bad1

📄 Full Content

Dense transformers currently dominate the landscape of Large Language Models (LLMs). In these architectures, every neuron in every feed-forward block is evaluated for each token. While scaling laws indicate that larger dense models generally perform better, they incur prohibitive costs regarding FLOPs, energy consumption, and latency.

Mixture-of-Experts (MoE) architectures address this inefficiency by introducing conditional computation: only a subset of experts are activated per token, decoupling model capacity (total parameters) from active computation (inference cost). Systems ranging from the Switch Transformer to recent upcycling frameworks have demonstrated the efficacy of this approach when models are explicitly trained for it. Concurrently, analysis of dense FFNs has revealed two critical properties:

  1. Activation Sparsity: For the majority of tokens, only a small fraction of neurons in a given FFN produce significant activations.

We propose MLPMoE, a topology that splits MLP weights into branches to create structural experts. This method is immediate, training-free, and modular.

Several methodologies aim to convert dense models into modular, sparse architectures to preserve pretraining investments:

MoEfication [Zhang et al., 2021] partitions FFN neurons into expert groups via clustering and activates only a subset of groups per token. ToMoE [Gao et al., 2024] frames the conversion as dynamic structural pruning, selecting top-k neuron groups per token.

MoORE [Shen et al., 2025] employs SVD to decompose weights into low-rank “micro-experts.” Unlike these methods, MLPMoE avoids clustering or SVD, relying instead on pure structural decomposition via contiguous slicing.

Distributed training frameworks like Megatron-LM [Narayanan et al., 2021] utilize tensor parallelism, splitting FFN weight matrices across devices and summing the partial results. MLPMoE performs the same split-and-sum algebra within a single process, reinterpreting this identity to generate structural experts per MLP without altering the underlying function.
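
In generic two-layer FFN notation (introduced here only for illustration, with $\phi$ an element-wise activation), the split-and-sum identity being reused is

$$W_2\,\phi(W_1 x) \;=\; \sum_{b=1}^{B} W_2^{(b)}\,\phi\big(W_1^{(b)} x\big),$$

where $W_1$ is sliced row-wise into blocks $W_1^{(b)}$ and $W_2$ is sliced column-wise into the matching blocks $W_2^{(b)}$. The identity holds because $\phi$ acts element-wise and the intermediate dimension is only a summation index in the second matrix product.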

Consider a standard transformer MLP with gate and up projections:

$$\mathrm{MLP}(x) \;=\; W_{\text{down}}\Big(\mathrm{SiLU}\big(W_{\text{gate}}\,x\big)\odot\big(W_{\text{up}}\,x\big)\Big),$$

with $W_{\text{gate}}, W_{\text{up}} \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ and $W_{\text{down}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$.

MLPMoE partitions the intermediate dimension $d_{\text{ff}}$ into $B$ contiguous slices $S_1, \dots, S_B$ such that $\sum_{b=1}^{B} |S_b| = d_{\text{ff}}$. For each branch $b$, we define sub-matrices $W_{\text{gate}}^{(b)}$, $W_{\text{up}}^{(b)}$, and $W_{\text{down}}^{(b)}$ by slicing the original weights along this dimension.

Each branch operates as an independent sub-MLP, equipped with a learnable scalar gate $g_b$ (initialized to 1.0):

$$f_b(x) \;=\; W_{\text{down}}^{(b)}\Big(\mathrm{SiLU}\big(W_{\text{gate}}^{(b)}\,x\big)\odot\big(W_{\text{up}}^{(b)}\,x\big)\Big).$$

The output of the MLPMoE layer is the summation of these branches:

$$y \;=\; \sum_{b=1}^{B} g_b\, f_b(x).$$

At initialization (all $g_b = 1$), this topology is mathematically equivalent to the original dense FFN.
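
As a concrete illustration, the following is a minimal PyTorch sketch of this decomposition, assuming a Llama/Qwen2-style SwiGLU MLP; the class names, equal-size chunking, and equivalence check are illustrative rather than the authors' exact implementation.

```python
# Minimal sketch of the MLPMoE branch decomposition (illustrative names; assumes
# a SwiGLU MLP with gate/up/down projections as in Llama/Qwen2-style blocks).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseSwiGLUMLP(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class MLPMoEBranches(nn.Module):
    """Static branch experts obtained by slicing the intermediate dimension."""

    def __init__(self, dense, num_branches: int):
        super().__init__()
        d_ff = dense.gate_proj.out_features
        d_model = dense.gate_proj.in_features
        self.gates = nn.Parameter(torch.ones(num_branches))  # scalar gate per branch
        self.branches = nn.ModuleList()
        for idx in torch.arange(d_ff).chunk(num_branches):    # contiguous slices
            branch = nn.ModuleDict({
                "gate": nn.Linear(d_model, len(idx), bias=False),
                "up": nn.Linear(d_model, len(idx), bias=False),
                "down": nn.Linear(len(idx), d_model, bias=False),
            })
            # Copy the corresponding rows/columns of the dense weights.
            branch["gate"].weight.data = dense.gate_proj.weight.data[idx].clone()
            branch["up"].weight.data = dense.up_proj.weight.data[idx].clone()
            branch["down"].weight.data = dense.down_proj.weight.data[:, idx].clone()
            self.branches.append(branch)

    def forward(self, x):
        out = 0
        for g, b in zip(self.gates, self.branches):
            out = out + g * b["down"](F.silu(b["gate"](x)) * b["up"](x))
        return out


# Equivalence check at initialization (all gates equal to 1.0).
dense = DenseSwiGLUMLP(d_model=64, d_ff=256)
moe = MLPMoEBranches(dense, num_branches=8)
x = torch.randn(2, 5, 64)
print(torch.allclose(dense(x), moe(x), atol=1e-5))  # expected: True
```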

To exploit latent sparsity without training, we introduce Fractal Fade. We treat the branches as a spectrum where the first branch serves as a dense backbone, and subsequent branches are increasingly sparsified.

For a branch index $b \in \{1, \dots, B\}$ in a layer with $B$ branches, the sparsity ratio $r_b$ is defined to grow with $b$, starting from $r_1 = 0$ for the dense backbone branch and increasing toward a maximum ratio for the final branch.

Weights in $W_{\text{gate}}^{(b)}$ and $W_{\text{up}}^{(b)}$ falling below the quantile threshold defined by $r_b$ are zeroed out. This retains the model's core capabilities in the early branches while reducing the parameter footprint of later branches.
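
A sketch of such a schedule is shown below; the linear ramp and the default maximum ratio are assumptions, since the text only specifies that the first branch stays dense and later branches are increasingly sparsified. The resulting ratio is then used as the quantile for magnitude pruning, as in the snippet shown later.

```python
# Illustrative per-branch sparsity schedule for Fractal Fade.
# Assumption: a linear ramp from a dense first branch up to r_max for the last one.
def branch_sparsity_ratios(num_branches: int, r_max: float = 0.5) -> list:
    if num_branches == 1:
        return [0.0]
    return [r_max * b / (num_branches - 1) for b in range(num_branches)]


# e.g. 8 branches -> [0.0, 0.071, 0.143, ..., 0.5]; branch 0 is the dense backbone.
print(branch_sparsity_ratios(8))
```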

We can statically prune entire branches to reduce computational load. To retain only the first $K$ of the $B$ branches while approximating the original output statistics, we rescale the remaining branches through their scalar gates $g_b$.

This heuristic preserves the output variance magnitude, stabilizing perplexity after aggressive structural pruning.
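
A minimal sketch of this step, assuming the illustrative MLPMoEBranches module from above and a $\sqrt{B/K}$ compensation factor (the exact factor used in the paper may differ):

```python
# Compensated pruning sketch: keep the first K branches and rescale their gates.
# Assumption: sqrt(B / K) scaling to roughly preserve the variance of the summed output.
import math
import torch
import torch.nn as nn


@torch.no_grad()
def compensated_prune(moe_layer, keep: int):
    total = len(moe_layer.branches)
    scale = math.sqrt(total / keep)
    moe_layer.branches = moe_layer.branches[:keep]                    # drop trailing branches
    moe_layer.gates = nn.Parameter(moe_layer.gates.data[:keep] * scale)
```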

The following implementation converts Hugging Face transformers (specifically Qwen and Llama architectures) to MLPMoE. It includes logic for differential sparsity and compensated pruning.

The full executable script is available in the Supplementary Gist.

```python
# Magnitude pruning based on quantile
for proj in [branch.gate, branch.up]:
    w = proj.weight.data
    w_abs = torch.abs(w).float()
    threshold = torch.quantile(w_abs, r)
    w.mul_(w_abs > threshold)
```
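
To show how such a conversion could be wired into an existing checkpoint, here is a minimal sketch that reuses the illustrative MLPMoEBranches module from above; it assumes an HF Llama/Qwen2-style model whose decoder layers expose an `mlp` submodule with `gate_proj`, `up_proj`, and `down_proj`, and is not the authors' script.

```python
# Wiring sketch: swap every dense MLP in a Hugging Face checkpoint for branch experts.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.float32
)
with torch.no_grad():
    for layer in model.model.layers:
        layer.mlp = MLPMoEBranches(layer.mlp, num_branches=16)  # MLPMoE-All-16 style
```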

We evaluated MLPMoE on two instruction-tuned models:

  1. Qwen2.5-0.5B-Instruct
  2. DeepSeek-R1-Distill-Llama-8B

Configuration:

  • Branches ($B$): 4 to 32 per MLP.

Variants:

  • Dense-Original: Baseline.
  • MLPMoE-All-16: Full conversion, no sparsity.
  • MLPMoE-DiffSparsity: Fractal Fade applied.

Metrics: Proxy Perplexity (evaluated on a synthetic text mixture for health-checking), Total/Non-zero Parameters, and Generation Time (end-to-end wall clock).
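
For reference, a minimal sketch of how such a proxy perplexity health check could be computed; the helper below and the use of mean token-level cross-entropy are assumptions, and the paper's synthetic text mixture is not reproduced here.

```python
# Proxy perplexity sketch: exp of the mean cross-entropy loss over a few texts.
import math
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")


@torch.no_grad()
def proxy_perplexity(model, texts):
    losses = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        loss = model(input_ids=ids, labels=ids).loss  # shifted LM cross-entropy
        losses.append(loss.item())
    return math.exp(sum(losses) / len(losses))
```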

Table 1 summarizes the results for the 0.5B model. The conversion to MLPMoE incurs a negligible PPL increase (+0.0005). Differential sparsity removes 18% of parameters but increases PPL by roughly 13%, suggesting that smaller models are less robust to naive zero-shot pruning. The 8B model, in contrast, exhibits high robustness:

  • The zero-shot conversion slightly improves proxy perplexity (consistent with sampling noise).
  • Differential sparsity prunes 1.59B parameters (approx. 20%) while maintaining PPL within 2.3% of the baseline.
  • Generation time currently increases due to the lack of optimized sparse kernels; all branches are computed regardless of sparsity.

Structural Decomposability: MLPMoE demonstrates that dense FFNs are structurally decomposable into branch experts via tensor slicing without retraining. The stability of the 8B model under roughly 20 percent MLP parameter removal suggests that larger dense checkpoints tolerate this zero-shot restructuring considerably better than the 0.5B model.


This content is AI-processed based on open access ArXiv data.
