Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs
Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced via supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence shows that RL fine-tuning improves the capability of LLMs beyond what SFT alone achieves. However, the mechanisms by which RL fine-tuning enhances LLMs with distinct intrinsic characteristics remain underexplored. In this study, we draw inspiration from prior work on edge attribution patching (EAP) to investigate the internal differences of LLMs before and after RL fine-tuning. Our analysis across multiple model families and mathematical datasets reveals two robust effects of online RL post-training: (i) an overall increase in average activation intensity, indicating that more internal pathways are engaged and their signals become stronger, and (ii) greater diversity in activation patterns, reflected by higher entropy and less concentrated edge distributions. These changes suggest that RL reshapes information flow to be both more redundant and more flexible, which may explain its advantage in mathematical generalization. Notably, models fine-tuned with Direct Preference Optimization (DPO) deviate from these trends, exhibiting substantially weaker or inconsistent internal changes compared to PPO- and GRPO-based training. Together, our findings provide a unified view of how RL fine-tuning systematically alters the internal circuitry of LLMs and highlight the methodological distinctions between online RL and preference-based approaches. Our code is open source at https://github.com/tsinghua-fib-lab/llm_rl_probing_analysis.
💡 Research Summary
This paper investigates how reinforcement‑learning (RL)‑based fine‑tuning reshapes the internal circuitry of large language models (LLMs). While prior work has shown that RL post‑training (e.g., PPO, GRPO) improves external performance on reasoning‑heavy tasks, the mechanistic reasons behind these gains have remained opaque. The authors adopt the Edge Attribution Patching (EAP) framework, which treats each residual connection in a Transformer as a directed edge in a graph, and estimates the importance of every edge by a first‑order Taylor approximation of the loss change caused by ablating that edge: I_EAP ≈ −⟨∇_H L, O⟩. This gradient‑based estimator can be computed for all edges in a single forward‑backward pass, making it feasible for 7‑billion‑parameter models.
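The estimator above can be sketched in a few lines. This is an illustrative toy, not the paper's released code: `O` stands for the activation an edge carries and `grad_H` for the gradient of the loss at the edge's destination, both made up here as random vectors.

```python
import numpy as np

# Toy sketch of the EAP first-order estimator: the loss change from
# ablating an edge is approximated by -<grad_H L, O>, where O is the
# activation the edge carries and grad_H L is the gradient of the loss
# with respect to the edge's destination input H.
rng = np.random.default_rng(0)

d_model = 8
O = rng.normal(size=d_model)        # activation carried by the edge
grad_H = rng.normal(size=d_model)   # dL/dH at the edge's destination

# First-order Taylor estimate of the loss change if the edge is removed.
I_eap = -np.dot(grad_H, O)
print(float(I_eap))
```

The key efficiency property is that one backward pass produces `grad_H` for every edge destination simultaneously, so importance scores for all edges come from a single forward-plus-backward computation rather than one ablation run per edge.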
The experimental pipeline proceeds as follows: (1) a pair of models, one fine‑tuned only with supervised data (SFT) and one further fine‑tuned with RL, are prompted on the same mathematical questions; only questions correctly answered by both are kept. (2) Token sequences are truncated to a common length α·T̄ (a fixed fraction α of the mean generation length T̄) to control for length‑related bias. (3) Self‑entropy (the model's cross‑entropy with respect to its own generated tokens) is computed on the truncated outputs, defining the loss L used for attribution. (4) I_EAP values are collected for every residual edge, and two summary statistics are derived: average activation intensity (mean I_EAP) and activation diversity (entropy of the I_EAP distribution).
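The two summary statistics in step (4) can be sketched as follows. The normalization choice here (absolute importance values rescaled into a probability distribution before taking entropy) is an assumption for illustration; the function name is made up.

```python
import numpy as np

def activation_stats(i_eap):
    """Mean activation intensity and entropy-based diversity of
    a vector of per-edge importance scores (illustrative sketch)."""
    mag = np.abs(np.asarray(i_eap, dtype=float))
    intensity = mag.mean()                      # average activation intensity
    p = mag / mag.sum()                         # normalize to a distribution
    diversity = -(p * np.log(p + 1e-12)).sum()  # entropy in nats
    return intensity, diversity

# A distribution concentrated on one edge has low entropy;
# an even spread across edges is maximal.
concentrated = [10.0, 0.1, 0.1, 0.1]
uniform = [1.0, 1.0, 1.0, 1.0]

_, h_conc = activation_stats(concentrated)
_, h_unif = activation_stats(uniform)
print(h_conc < h_unif)  # True: more evenly spread -> higher diversity
```

Under this reading, the paper's "activation diversity grows" finding means the edge-importance distribution of the RL model moves toward the `uniform` case relative to its SFT counterpart.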
Four model families are examined: DeepSeek‑Math, Mistral, Distilled‑Qwen, and Qwen2.5, each with a base SFT version and an RL‑fine‑tuned version. RL algorithms include PPO, Group‑Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Datasets consist of GSM8K, MATH, and MathInstruct, covering a broad spectrum of mathematical problem‑solving tasks.
Key findings:
- Activation intensity rises – PPO‑ and GRPO‑fine‑tuned models exhibit a 10‑30 % increase in mean I_EAP compared with their SFT counterparts, indicating that more residual pathways are actively contributing and that their signals become stronger.
- Activation diversity grows – Entropy of the edge‑importance distribution consistently increases (Δentropy ≈ 0.1–0.3), showing that the internal flow becomes less concentrated on a few “core circuits” and more evenly spread across the network.
- DPO behaves differently – Models fine‑tuned with DPO do not show systematic increases in either intensity or diversity; in some cases the metrics even decline. This suggests that DPO’s preference‑comparison reward signal induces more localized updates, failing to reshape the global information‑flow topology.
- Correlation with external performance – Across all model families, higher activation intensity and diversity correlate with 2‑5 percentage‑point gains in mathematical accuracy, especially in deeper layers and across many attention heads. The relationship holds even after controlling for model size and dataset.
The authors interpret these results as evidence that RL fine‑tuning does not merely adjust weights but reorganizes the usage of the residual network. By engaging a larger set of pathways (redundancy) and distributing activation more uniformly (flexibility), the model gains robustness and better generalization on complex reasoning tasks. The contrast between online RL (PPO/GRPO) and DPO highlights how the choice of RL algorithm fundamentally influences internal circuit dynamics.
Limitations are acknowledged: the study focuses on mathematical reasoning, so generalization to code generation, dialogue, or multimodal tasks remains to be tested. EAP relies on a linear approximation; highly nonlinear interactions may be under‑captured. Future work could involve higher‑order attribution, explicit edge‑ablation experiments, or interventions that deliberately modify edge importance to probe causality.
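The linear-approximation caveat can be made concrete with a toy quadratic loss, where the gap between the exact ablation effect and the first-order EAP estimate is exactly the second-order term the approximation drops. The setup below (a quadratic loss with curvature matrix `A`) is a hypothetical illustration, not an experiment from the paper.

```python
import numpy as np

# For L(h) = 0.5 * h @ A @ h, compare the exact loss change from
# removing an edge's contribution O against the first-order EAP
# estimate -<grad, O>.
rng = np.random.default_rng(1)
d = 4
A = 2.0 * np.eye(d)                 # simple positive-definite curvature
h = rng.normal(size=d)              # downstream input H (includes O)
O = 0.5 * rng.normal(size=d)        # contribution carried by one edge

loss = lambda x: 0.5 * x @ A @ x
grad = A @ h                        # dL/dH at the clean point

est = -grad @ O                     # first-order EAP estimate
exact = loss(h - O) - loss(h)       # true effect of ablating the edge

# The gap is precisely the second-order term 0.5 * O @ A @ O that
# EAP ignores; for strongly nonlinear circuits this term can dominate.
print(np.isclose(exact - est, 0.5 * O @ A @ O))
```

This is why the explicit edge-ablation experiments the authors propose as future work are a natural complement: they measure `exact` directly instead of `est`.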
In summary, this work provides the first large‑scale, quantitative analysis of how RL‑based fine‑tuning reshapes the internal residual circuitry of LLMs. It demonstrates that online RL methods systematically increase both the strength and the diversity of internal activations, offering a mechanistic explanation for their superior downstream performance, while also revealing that preference‑based methods like DPO may not induce the same structural changes. The open‑source code and datasets enable the community to extend this line of inquiry to other domains and to explore more nuanced RL alignment strategies.