VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse


The rapid scaling of Large Language Models (LLMs) has achieved remarkable performance, but it also leads to prohibitive memory costs. Existing parameter-efficient approaches such as pruning and quantization mainly compress pretrained models without enhancing architectural capacity, thereby hitting the representational ceiling of the base model. In this work, we propose VersatileFFN, a novel feed-forward network (FFN) that enables flexible reuse of parameters in both width and depth dimensions within a fixed parameter budget. Inspired by the dual-process theory of cognition, VersatileFFN comprises two adaptive pathways: a width-versatile path that generates a mixture of sub-experts from a single shared FFN, mimicking sparse expert routing without increasing parameters, and a depth-versatile path that recursively applies the same FFN to emulate deeper processing for complex tokens. A difficulty-aware gating dynamically balances the two pathways, steering “easy” tokens through the efficient width-wise route and allocating deeper iterative refinement to “hard” tokens. Crucially, both pathways reuse the same parameters, so all additional capacity comes from computation rather than memory. Experiments across diverse benchmarks and model scales demonstrate the effectiveness of the method. The code is available at https://github.com/huawei-noah/noah-research/tree/master/VersatileFFN.


💡 Research Summary

VersatileFFN tackles the growing memory burden of large language models (LLMs) by redesigning the feed‑forward network (FFN) component to reuse the same parameters along both width and depth dimensions. The architecture consists of two complementary pathways that share a single set of projection and output weight matrices (W_proj, W_out).

Width‑Versatile Path – Inspired by mixture‑of‑experts (MoE), this path creates N “virtual experts” by slicing the hidden dimension of the shared FFN into non‑overlapping sub‑spaces using a strided indexing scheme. A learnable router (W_g) computes gating logits for each token, selects the top‑K experts, and combines their outputs with softmax‑normalized gating probabilities. Because the experts are merely views of the same weight tensors, no additional parameters are introduced, yet the model gains the specialization benefits of MoE without the usual memory overhead.
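The width-wise routing described above can be sketched in a few lines of NumPy. This is a simplified single-token illustration, not the paper's implementation: the shapes, the ReLU activation, and the exact strided slicing are assumptions, but it shows the key idea that each "expert" is only a strided view into the one shared pair of weight matrices.

```python
import numpy as np

def width_versatile_ffn(x, W_proj, W_out, W_g, num_experts=4, top_k=2):
    """Sketch of the width-versatile path for one token (hypothetical shapes).

    x:      (d_model,)              token representation
    W_proj: (d_model, d_ff)         shared up-projection
    W_out:  (d_ff, d_model)         shared down-projection
    W_g:    (d_model, num_experts)  router weights
    """
    d_ff = W_proj.shape[1]
    # Strided indexing: expert i owns hidden units i, i+N, i+2N, ...
    expert_idx = [np.arange(i, d_ff, num_experts) for i in range(num_experts)]

    # Router: pick top-k experts, normalize their gates with a softmax.
    logits = x @ W_g
    top = np.argsort(logits)[-top_k:]
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()

    # Each "expert" is just a strided view of the shared weight tensors,
    # so no new parameters are introduced.
    y = np.zeros_like(x)
    for g, e in zip(gates, top):
        idx = expert_idx[e]
        h = np.maximum(x @ W_proj[:, idx], 0.0)  # sub-expert FFN (ReLU assumed)
        y += g * (h @ W_out[idx, :])
    return y
```

A batched implementation would vectorize the expert loop, but the memory point is already visible here: `W_proj[:, idx]` and `W_out[idx, :]` are slices, not copies of new parameters.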

Depth‑Versatile Path – This path applies the full shared FFN recursively up to a maximum of L_max iterations. A lightweight head (W_loop) predicts, for each token, a distribution over possible iteration counts. The distribution is made differentiable via Gumbel‑Softmax relaxation and a straight‑through estimator; during training the model samples from the relaxed distribution, while at inference time it selects the argmax count and executes exactly that many recursions. The final depth‑wise output is a probability‑weighted sum of the intermediate states, allowing tokens that require more reasoning to receive additional processing steps.
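The inference-time behavior of this path can be sketched as follows. The Gumbel-Softmax machinery only matters during training; at inference the controller picks the argmax iteration count. Renormalizing the depth probabilities over the executed iterations is an assumption of this sketch, as is the shape of `W_loop`.

```python
import numpy as np

def depth_versatile_ffn(x, ffn, W_loop):
    """Inference-time sketch of the depth-versatile path for one token.

    ffn:    the shared FFN, applied recursively (any callable)
    W_loop: (d_model, L_max) lightweight head giving loop-count logits
    """
    # Distribution over possible iteration counts 1..L_max.
    logits = x @ W_loop
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    n_iters = int(np.argmax(probs)) + 1  # argmax count at inference

    # Apply the same FFN recursively, keeping each intermediate state.
    states = []
    h = x
    for _ in range(n_iters):
        h = ffn(h)
        states.append(h)

    # Probability-weighted sum of the intermediate states (weights
    # renormalized over the executed depths -- a sketch-level assumption).
    w = probs[:n_iters] / probs[:n_iters].sum()
    return sum(wi * si for wi, si in zip(w, states))
```

During training, `np.argmax` would be replaced by a Gumbel-Softmax sample with a straight-through estimator so gradients can flow into `W_loop`.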

Difficulty‑Aware Fusion – The expected loop count from the depth‑versatile controller serves as a proxy for token difficulty. A gating coefficient λ ∈ [0, 1], derived from this difficulty estimate, dynamically blends the outputs of the two pathways: "easy" tokens are served mainly by the efficient width‑wise route, while "hard" tokens receive greater weight on the deeper iterative refinement. Since both pathways share the same FFN parameters, all of the added capacity comes from extra computation rather than extra memory.

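The fusion step reduces to a convex combination of the two pathway outputs. The mapping from expected loop count to λ below is a hypothetical linear normalization for illustration; the paper's exact gating function may differ.

```python
import numpy as np

def difficulty_aware_fusion(y_width, y_depth, expected_loops, L_max=4):
    """Sketch: blend the two pathway outputs with a difficulty gate.

    expected_loops: E[iterations] from the depth controller, in [1, L_max],
                    used as a proxy for token difficulty.
    """
    # Hypothetical normalization: lambda = 0 for the easiest tokens
    # (1 expected loop), lambda = 1 for the hardest (L_max loops).
    lam = np.clip((expected_loops - 1.0) / (L_max - 1.0), 0.0, 1.0)
    return (1.0 - lam) * y_width + lam * y_depth
```

With `expected_loops = 1` the token is served entirely by the width-wise route; at `L_max` it relies entirely on the recursive depth-wise output.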