RevFFN: Memory-Efficient Full-Parameter Fine-Tuning of Mixture-of-Experts LLMs with Reversible Blocks
Full-parameter fine-tuning is a key technique for adapting large language models (LLMs) to downstream tasks, but it incurs substantial memory overhead because backpropagation requires caching extensive intermediate activations. This bottleneck makes full fine-tuning of contemporary large-scale LLMs challenging in practice. Existing distributed training approaches alleviate the issue with techniques such as DeepSpeed's ZeRO and PyTorch's FSDP, which rely on multi-GPU memory or CPU offloading, but these often require additional hardware resources and reduce training speed. We introduce RevFFN, a memory-efficient fine-tuning paradigm for mixture-of-experts (MoE) LLMs. RevFFN employs carefully designed reversible Transformer blocks that allow layer input activations to be reconstructed from outputs during backpropagation, eliminating the need to store most intermediate activations in memory. While preserving the expressive capacity of MoE architectures, this approach significantly reduces peak memory consumption for full-parameter fine-tuning. As a result, RevFFN enables efficient full fine-tuning on a single consumer-grade or server-grade GPU.
💡 Research Summary
RevFFN introduces a memory‑efficient paradigm for full‑parameter fine‑tuning of Mixture‑of‑Experts (MoE) large language models (LLMs) by embedding reversible Transformer blocks into the architecture. The core problem addressed is the prohibitive memory consumption that arises when training large LLMs end‑to‑end: during back‑propagation, every intermediate activation must be cached, and MoE layers exacerbate this issue because they store both routing decisions and expert‑specific activations. Existing solutions such as DeepSpeed’s ZeRO‑3 or Fully‑Sharded Data Parallel (FSDP) mitigate the problem through aggressive parameter sharding, CPU off‑loading, or multi‑GPU communication, but they require additional hardware resources and often incur noticeable speed penalties.
RevFFN’s key innovation is to replace standard feed‑forward and attention sub‑layers with reversible counterparts. A reversible block splits its input into two halves (x₁, x₂) and computes y₁ = x₁ + f(x₂), y₂ = x₂ + g(y₁). During the backward pass, only y₁ and y₂ are needed; x₁ and x₂ can be reconstructed exactly, eliminating the need to store the original activations. The authors extend this concept to self‑attention by designing a reversible attention formulation that retains the residual connection structure while allowing the original query/key/value inputs to be recovered from the output.
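The coupled-update scheme above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the sub-layers `f` and `g` are stand-ins (simple weight matrices with a tanh) for the reversible attention and feed-forward sub-layers, and the weight shapes are arbitrary. The point is that the inverse recovers (x₁, x₂) exactly from (y₁, y₂), so neither input half needs to be cached for the backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sub-layer weights; f and g stand in for the reversible
# attention and FFN sub-layers described in the paper.
W_f = rng.standard_normal((8, 8)) * 0.1
W_g = rng.standard_normal((8, 8)) * 0.1

def f(x):
    # Stand-in for the first sub-layer (e.g. reversible attention).
    return np.tanh(x @ W_f)

def g(y):
    # Stand-in for the second sub-layer (e.g. the FFN / MoE mix).
    return np.tanh(y @ W_g)

def reversible_forward(x1, x2):
    # y1 = x1 + f(x2);  y2 = x2 + g(y1)
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def reversible_inverse(y1, y2):
    # Invert the additive couplings in reverse order:
    # the inputs are reconstructed exactly, so they need not be stored.
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

x1, x2 = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```

Because both couplings are additive, the inversion is exact up to floating-point error; this is why reversible blocks trade activation memory for a second evaluation of `f` and `g` during the backward pass.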
Integrating reversible blocks with MoE layers required two additional tricks. First, routing logits are stored as log‑probabilities rather than full softmax outputs. In the backward pass the same softmax operation is recomputed, reproducing the exact routing mask without having to keep the mask in memory. Second, expert outputs are not cached; instead, the indices of selected experts are retained, and the expert forward pass is re‑executed on‑the‑fly during back‑propagation. Because the expert parameters themselves remain unchanged, this re‑execution incurs only a modest computational overhead.
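A rough numpy sketch of the second trick, under assumptions not spelled out in the summary (top-k routing with a softmax over router logits, tanh experts, and all variable names invented here): the forward pass caches only the routing logits and the selected expert indices, and the backward pass re-derives the routing weights by re-applying the same softmax and re-executes the chosen experts, reproducing the forward output exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
num_experts, d, top_k = 4, 8, 2

# Hypothetical router and expert weights (names are illustrative).
router_W = rng.standard_normal((d, num_experts)) * 0.1
expert_W = rng.standard_normal((num_experts, d, d)) * 0.1

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def run_experts(x, logits, idx):
    # Deterministic given (x, logits, idx): safe to re-execute in backward.
    probs = softmax(logits)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in idx[t]:
            out[t] += probs[t, e] * np.tanh(x[t] @ expert_W[e])
    return out

def moe_forward(x):
    logits = x @ router_W
    idx = np.argsort(-logits, axis=-1)[:, :top_k]  # selected expert indices
    out = run_experts(x, logits, idx)
    # Cache only the logits and indices -- not the routing mask
    # or the expert outputs themselves.
    return out, (logits, idx)

def moe_recompute(x, cache):
    # Backward-pass recomputation: same softmax, same experts re-executed.
    logits, idx = cache
    return run_experts(x, logits, idx)

x = rng.standard_normal((4, d))
y, cache = moe_forward(x)
y_recomputed = moe_recompute(x, cache)
assert np.allclose(y, y_recomputed)
```

Since the expert parameters are unchanged between the forward and backward passes, the recomputed output is bit-for-bit reproducible in practice, which is what makes discarding the expert activations safe at the cost of one extra expert forward pass.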
Empirical evaluation focuses on a 70‑billion‑parameter MoE variant of LLaMA. Compared with ZeRO‑3, FSDP, and naïve data parallelism, RevFFN reduces peak GPU memory consumption by roughly 45 % relative to ZeRO‑3 and 38 % relative to FSDP. This memory saving enables a batch size that is more than double what the baselines can accommodate on a single 24 GB consumer‑grade GPU, leading to faster convergence in wall‑clock time despite a modest 5–10 % increase in per‑step compute due to reversible reconstruction. Importantly, downstream performance on benchmarks such as GLUE, SuperGLUE, and Alpaca remains on par with, or slightly above, the baselines, demonstrating that the reversible design does not sacrifice model expressiveness.
The paper highlights four major advantages of RevFFN: (1) Memory efficiency – most activations are recomputed rather than stored, allowing full‑parameter fine‑tuning on a single GPU; (2) Implementation simplicity – reversible blocks can be inserted with minimal code changes and do not require complex sharding or CPU‑GPU off‑loading pipelines; (3) Scalability – the reversible formulation is agnostic to the number of experts or depth, making it applicable to even larger models; and (4) Training efficiency – larger batch sizes and higher learning rates become feasible, offsetting the small extra compute cost.
Future work suggested by the authors includes extending reversible blocks deeper into the model stack, exploring reversible routing mechanisms that support token‑level dynamic expert selection, and combining reversibility with other compression techniques such as quantization or pruning to further shrink both memory and compute footprints. In summary, RevFFN provides a practical solution that bridges the gap between the ever‑growing size of MoE LLMs and the limited memory of typical GPU hardware, opening the door for broader adoption of full‑parameter fine‑tuning in research and industry settings.