Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
Reinforcement learning (RL) yields substantial improvements in the downstream task performance of large language models (LLMs) and in their alignment with human values. Surprisingly, such large gains result from updating only a small subnetwork comprising just 5 percent to 30 percent of the parameters, with the rest effectively unchanged. We refer to this phenomenon as parameter update sparsity induced by RL. It is observed across all 7 widely used RL algorithms (e.g., PPO, GRPO, DPO) and all 10 LLMs from different families in our experiments. This sparsity is intrinsic and occurs without any explicit sparsity-promoting regularization or architectural constraints. Finetuning the subnetwork alone recovers the test accuracy and, remarkably, produces a model nearly identical to the one obtained via full finetuning. The subnetworks from different random seeds, training data, and even RL algorithms show substantially greater overlap than expected by chance. Our analysis suggests that this sparsity is not due to updating only a subset of layers; instead, nearly all parameter matrices receive similarly sparse updates. Moreover, the updates to almost all parameter matrices are nearly full-rank, suggesting RL updates a small subset of parameters that nevertheless span almost the full subspaces that the parameter matrices can represent. We conjecture that this update sparsity is primarily attributable to training on data that is near the policy distribution; techniques that encourage the policy to remain close to the pretrained model, such as KL regularization and gradient clipping, have limited impact.
💡 Research Summary
The paper “Reinforcement Learning Finetunes Small Subnetworks in Large Language Models” investigates a striking phenomenon: during the reinforcement learning (RL) fine‑tuning stage of large language models (LLMs), only a small fraction of the model’s parameters—typically 5 % to 30 %—actually change. The authors term this “parameter update sparsity induced by RL.” They demonstrate that the effect is robust across seven widely used RL algorithms (PPO, GRPO, ORPO, KTO, DPO, SimPO, PRIME) and ten LLMs from different families (Llama‑3, DeepSeek, Eurus, Tulu, etc.). No explicit sparsity‑promoting regularization, architectural constraint, or pruning technique is employed; the sparsity emerges naturally.
Key empirical findings:
- High Update Sparsity – Across all models, 68 %–96 % of parameters remain exactly unchanged after RL fine‑tuning, whereas supervised fine‑tuning (SFT) produces dense updates (only 6 %–15 % sparsity).
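The notion of update sparsity here is simply the fraction of parameters left exactly unchanged between the pretrained and finetuned checkpoints. A minimal NumPy sketch of that measurement (the dict-of-arrays representation and the `tol` parameter are illustrative assumptions, not the paper's code):

```python
import numpy as np

def update_sparsity(theta_pre, theta_post, tol=0.0):
    """Fraction of parameters left exactly unchanged by finetuning.

    theta_pre / theta_post: dicts mapping parameter names to arrays.
    tol=0.0 checks for exact equality; a small tolerance could be used
    to absorb floating-point noise if needed.
    """
    unchanged = total = 0
    for name, pre in theta_pre.items():
        post = theta_post[name]
        unchanged += int(np.sum(np.abs(post - pre) <= tol))
        total += pre.size
    return unchanged / total

# Toy illustration: an update that touches roughly 20% of one matrix.
rng = np.random.default_rng(0)
pre = {"w": rng.standard_normal((10, 10))}
mask = rng.random((10, 10)) < 0.2          # ~20% of entries updated
post = {"w": pre["w"] + mask * 0.01}
print(update_sparsity(pre, post))           # close to 0.8
```

Exact equality is the natural check here because untouched parameters are bit-for-bit identical, not merely close.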
- Uniform Distribution Across Layers – Layer‑wise analysis shows that almost every transformer layer receives a similarly sparse set of updates; the only exception is LayerNorm, which is almost never touched. Thus, the sparsity is not confined to a few layers or modules.
- Full‑Rank Updates – Despite being sparse, the updates to each weight matrix are nearly full rank (average rank > 99 % of the maximum). This indicates that a small subset of parameters can span almost the entire subspace that the matrix could represent, contrasting with low‑rank adaptation methods such as LoRA.
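Sparsity and low rank are independent properties: a matrix can be almost entirely zero and still have full rank. A small self-contained illustration (not from the paper) places one random nonzero entry per row at a distinct column, giving a permutation-patterned update with only n of n² entries nonzero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
delta = np.zeros((n, n))
cols = rng.permutation(n)                    # one distinct column per row
delta[np.arange(n), cols] = rng.standard_normal(n)

sparsity = float(np.mean(delta == 0))        # fraction of zero entries
rank = int(np.linalg.matrix_rank(delta))
print(f"sparsity={sparsity:.3f}, rank={rank}/{n}")
```

This is the sense in which RL's sparse updates differ from LoRA: LoRA constrains the update to a low-dimensional subspace (low rank, dense), whereas the observed RL updates are sparse yet span nearly the full space.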
- Subnetworks Are Sufficient – The authors construct a binary mask m that marks parameters changed by full RL fine‑tuning. They then re‑train the model from the pretrained checkpoint, but at each step multiply the gradient by m, thereby updating only the identified subnetwork. Experiments on two very different RL algorithms (DPO – an off‑policy method with implicit rewards, and PRIME – an on‑policy method with learned reward models) show that the “subnetwork‑only” model (θ_sub) matches the fully fine‑tuned model (θ_full) both in downstream test performance and in raw parameter values (differences < 10⁻⁵). This goes beyond the classic Lottery Ticket Hypothesis, which only guarantees performance recovery, by demonstrating near‑exact parameter recovery.
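The masked-retraining procedure described above amounts to multiplying each gradient by the binary mask m before the optimizer step, so parameters outside the subnetwork never move. A minimal NumPy sketch with plain SGD (the paper applies the same masking inside standard RL training loops; the shapes and learning rate here are illustrative):

```python
import numpy as np

def masked_sgd_step(theta, grad, mask, lr=1e-2):
    """One gradient step restricted to the identified subnetwork.

    mask is the binary matrix m marking parameters changed by full RL
    finetuning; multiplying the gradient by m freezes everything else.
    """
    return theta - lr * (mask * grad)

rng = np.random.default_rng(0)
theta = rng.standard_normal((4, 4))
mask = (rng.random((4, 4)) < 0.25).astype(float)   # ~25% trainable
grad = rng.standard_normal((4, 4))

theta_new = masked_sgd_step(theta, grad, mask)
# Parameters outside the subnetwork are bit-for-bit unchanged.
print(np.array_equal(theta_new[mask == 0], theta[mask == 0]))  # True
```

Because the masked gradient is exactly zero outside the subnetwork, the frozen parameters stay identical to the pretrained values, which is what makes the near-exact parameter recovery (θ_sub ≈ θ_full) a meaningful comparison.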
Why does this sparsity arise? The authors hypothesize that training on data that is close to the current policy distribution (i.e., in‑distribution data) is the primary driver. They conduct ablations varying KL‑regularization, gradient clipping, and on‑ vs. off‑policy training; none of these substantially affect sparsity. When RL is performed on data sampled from the evolving policy (on‑policy) or when SFT is first applied to the same data that will later be used for RL, the updates remain sparse. Conversely, SFT on a distribution that differs from the policy leads to dense updates.
Implications:
- Efficiency: Since > 70 % of parameters stay untouched, RL fine‑tuning can be made far more memory‑ and compute‑efficient by explicitly freezing the inert parameters or by designing algorithms that discover and exploit the subnetwork during training.
- Theoretical Insight: The finding suggests that pretrained LLMs already contain latent “winning tickets” that are naturally selected when the policy needs only modest adjustments. The updates being full‑rank despite sparsity implies that RL does not restrict learning to a low‑dimensional subspace but rather picks a sparse set of weights that collectively span the full expressive capacity of each layer.
- Future Directions: The work opens avenues for dynamic masking, subnetwork‑aware optimizers, and hybrid approaches that combine the benefits of LoRA (parameter‑efficient adaptation) with the naturally emerging sparsity of RL. Moreover, a deeper theoretical analysis of why in‑distribution data leads to such sparsity could inform the design of more stable and sample‑efficient RL‑based alignment pipelines.
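One concrete way to realize the efficiency implication above is to store only the sparse delta on top of the frozen pretrained weights, rather than a full updated checkpoint. A hypothetical sketch of that idea (the helper names and layout are assumptions for illustration):

```python
import numpy as np

def compress_update(pre, post):
    """Keep only the indices and values of the sparse RL update."""
    delta = post - pre
    idx = np.flatnonzero(delta)
    return idx, delta.ravel()[idx]

def apply_update(pre, idx, vals):
    """Reconstruct the finetuned weights from pretrained + sparse delta."""
    post = pre.copy().ravel()
    post[idx] += vals
    return post.reshape(pre.shape)

rng = np.random.default_rng(0)
pre = rng.standard_normal((100, 100))
mask = rng.random((100, 100)) < 0.2           # ~20% of entries updated
post = pre + mask * rng.standard_normal((100, 100))

idx, vals = compress_update(pre, post)
restored = apply_update(pre, idx, vals)
print(np.allclose(restored, post), len(idx) / pre.size)
```

At the 68 %–96 % sparsity levels reported in the paper, such a delta representation would shrink per-checkpoint storage several-fold, though realizing compute savings during training would additionally require the subnetwork to be known (or discovered) in advance.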
In summary, the paper reveals that RL fine‑tuning of LLMs is fundamentally a sparse optimization problem: a small, consistently active subnetwork is responsible for the entire performance gain, while the majority of the model remains inert. This insight challenges the prevailing practice of full‑model RL fine‑tuning, suggests substantial computational savings, and provides a fresh perspective on the interplay between pretrained knowledge and policy‑level adaptation.