Adapter Merging Reactivates Latent Reasoning Traces: A Mechanism Analysis

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv source.

Large language models fine-tuned via a two-stage pipeline (domain adaptation followed by instruction alignment) can exhibit non-trivial interference after adapter merging, including the re-emergence of explicit reasoning traces under strict decoding. We study this phenomenon in medical LLM settings using lightweight, reproducible measurements of trace leakage and instruction-following behavior. Beyond marker-based proxies, we introduce a marker-forbidden, answer-only evaluation and define a correctness-based direction that does not rely on surface markers; a rank-1 logit-space intervention along this direction modulates decision distributions and improves multiple-choice accuracy beyond random-direction controls at sufficiently large intervention strength. We further provide layer-wise geometric evidence that domain and instruction adapters induce partially misaligned update directions, and present a proof-of-concept geometry-aware merge that can reduce leakage and/or improve accuracy in a toy setting. Our results characterize boundary conditions of trace leakage and provide practical diagnostics and interventions for safer adapter merging.


💡 Research Summary

This paper investigates a striking interference phenomenon that arises when modular adapters—specifically a domain‑adaptive pre‑training (DAPT) adapter and a supervised‑fine‑tuning (SFT) instruction adapter—are merged in large language models (LLMs). Although the SFT adapter is explicitly trained to suppress reasoning traces (e.g., "<think>" tags, "Step 1:", etc.), the authors find that after merging, these latent reasoning markers re‑appear even under strict prompts that forbid any internal reasoning. The study is conducted primarily in a medical‑LLM setting but is shown to generalize across eight model families and nine prompting/decoding configurations.

Experimental pipeline
Separate LoRA adapters (rank 16, scaling 32) are trained on the same base model (Qwen‑3‑14B) using medical pre‑training data for DAPT and instruction‑tuning data for SFT. The merged weights are formed by a simple linear interpolation, W_merge = W_base + α·ΔW_DAPT + (1 − α)·ΔW_SFT, with α swept from 0.0 to 1.0. The authors evaluate the merged models under three system prompts—plain, "nothink‑soft", and "nothink‑strict"—and two decoding presets (deterministic vs. creative). As α increases, the strict prompt increasingly yields reasoning markers, while instruction‑following scores drop, producing a non‑monotonic trade‑off that cannot be explained by simple additive effects.
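The interpolation rule above can be sketched directly. This is a minimal NumPy illustration of the merge on a single weight matrix; the function name and tensor shapes are illustrative, not the paper's actual code:

```python
import numpy as np

def merge_adapters(w_base, dw_dapt, dw_sft, alpha):
    """Linearly interpolate two adapter weight deltas onto a base weight:

        W_merge = W_base + alpha * dW_DAPT + (1 - alpha) * dW_SFT

    alpha = 0.0 recovers the pure SFT adapter, alpha = 1.0 the pure DAPT one.
    """
    return w_base + alpha * dw_dapt + (1.0 - alpha) * dw_sft

# Sweep alpha over the same grid the paper uses (0.0 to 1.0)
w_base = np.zeros((4, 4))
dw_dapt = np.ones((4, 4))
dw_sft = -np.ones((4, 4))
merged = [merge_adapters(w_base, dw_dapt, dw_sft, a)
          for a in np.linspace(0.0, 1.0, 11)]
```

In a real pipeline this interpolation would be applied per layer to the materialized LoRA deltas (B·A scaled by α_lora/r) before loading the merged weights for inference.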

Reproducibility and universality
The phenomenon persists across a second set of adapters (v2) and different random seeds, ruling out implementation bugs. A cross‑model probing study shows that even models not marketed as “thinking” (e.g., Llama‑3, Qwen‑2.5) exhibit a latent “trace‑associated subspace” that becomes active when a DAPT adapter is merged, confirming that the effect is a property of the pretrained representation rather than a peculiarity of a single architecture.

Mechanistic localization
Three complementary analyses pinpoint the source of the conflict:

  1. Centered Kernel Alignment (CKA) – Comparing hidden states under “plain” (reasoning allowed) and “strict” (reasoning suppressed) conditions reveals that early layers remain highly similar (CKA > 0.98) but the final 6‑10 transformer layers diverge sharply, indicating that the interference lives in the output stages.

  2. Principal Component Analysis (PCA) – Applying PCA to the layer‑wise difference vectors Δₗ = Xₗ − Yₗ shows that the first principal component (PC1) explains a large fraction of variance (≈60‑70 %) precisely in those late layers, suggesting a low‑rank (≈1‑2‑dimensional) subspace drives the behavior.

  3. Linear probing – Training linear classifiers on each layer to predict the presence of a reasoning token yields AUC scores that rise sharply in the same late layers, confirming that the identified subspace encodes explicit trace signals.

Together, these results establish a concrete, low‑dimensional, late‑layer subspace that is activated by the DAPT adapter and suppressed by the SFT adapter.
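The layer similarity measurement in step 1 can be computed with the standard linear‑CKA formula. The sketch below assumes activation matrices of shape (samples, hidden_dim) and feature centering; the paper's exact CKA variant (kernel choice, debiasing) may differ:

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (n_samples, dim).

    Values near 1.0 indicate the two representations match up to a linear
    transform; late-layer divergence shows up as a sharp drop in CKA.
    """
    x = x - x.mean(axis=0)  # center each feature over the sample axis
    y = y - y.mean(axis=0)
    num = np.linalg.norm(x.T @ y, "fro") ** 2
    den = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return num / den
```

Comparing plain-condition hidden states against strict-condition hidden states layer by layer with this function would reproduce the qualitative pattern reported: CKA near 1.0 in early layers, falling in the final layers.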

Causal intervention
To move beyond correlation, the authors implement a logit‑space intervention inside the vLLM inference engine. They compute a unit direction u that aligns with the trace‑associated subspace (estimated as the top PC of the logit difference between reasoning‑active and reasoning‑suppressed states) and modify logits at each generation step as:

 z′ = z − γ (uᵀz) u

where γ ∈ {0,1,2,3,5}. This projection removes the identified low‑rank component from the entire vocabulary distribution, rather than simply masking specific tokens. As γ grows, strict‑fail rates for “thinking” models (e.g., DeepSeek‑R1‑Distill) drop from >80 % to <8 %, while overall accuracy remains stable or improves slightly. Non‑thinking models show minimal change, consistent with their lower baseline leak rates.
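The projection itself is a one‑line rank‑1 edit of the logit vector. A minimal NumPy sketch follows; in vLLM this would be wired in as a per‑step logits hook, which is not shown here:

```python
import numpy as np

def suppress_direction(z, u, gamma):
    """Remove gamma times the component of logit vector z along direction u:

        z' = z - gamma * (u . z) * u

    gamma = 0 leaves z unchanged; gamma = 1 is a full orthogonal projection
    that zeroes the component of z along u.
    """
    u = u / np.linalg.norm(u)  # ensure u is unit-norm
    return z - gamma * np.dot(u, z) * u
```

Because the edit acts on the full vocabulary distribution at once, it differs from simply masking specific marker tokens: every token whose logit has a component along u is shifted.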

Marker‑free correctness direction
Recognizing that surface markers could be a stylistic artifact, the authors also define a “correctness‑based” direction u_corr using only answer‑only labels (correct vs. incorrect) under a marker‑forbidden decoding regime. Applying the same rank‑1 logit suppression along u_corr yields a statistically significant shift in the decision distribution (reduced entropy, higher MCQ accuracy) compared with a random‑direction control. This demonstrates that the intervention framework can target decision‑relevant axes even when no explicit reasoning tokens are present, though it does not claim that a single linear direction fully governs reasoning correctness.
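As a sketch of how such a direction could be estimated from answer‑only labels, the snippet below uses a simple mean‑difference estimator over per‑example logit vectors. The paper's construction (e.g., a top principal component of paired differences) may differ in detail, but both yield a single unit direction separating correct from incorrect answer states:

```python
import numpy as np

def correctness_direction(logits_correct, logits_incorrect):
    """Estimate a unit 'correctness' direction in logit space.

    Inputs are arrays of shape (n_examples, vocab_dim) collected under a
    marker-forbidden decoding regime, split by whether the final answer was
    correct. Returns the normalized difference of class means.
    """
    diff = logits_correct.mean(axis=0) - logits_incorrect.mean(axis=0)
    return diff / np.linalg.norm(diff)
```

The resulting u_corr can then be passed to the same rank‑1 suppression routine, which is what allows the intervention framework to target decision‑relevant axes without any surface reasoning markers.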

Conclusions and implications
The paper provides the first systematic evidence that adapter merging can break the assumed linear composability of PEFT methods. The conflict originates from a low‑dimensional, late‑layer subspace that encodes latent reasoning traces; DAPT adapters reactivate it, while SFT adapters suppress it. Simple logit‑space projections that nullify this direction effectively mitigate leakage and can even improve task performance. The findings suggest that future modular fine‑tuning should incorporate subspace‑aware alignment or geometry‑aware merging strategies rather than naïve weight averaging, especially in safety‑critical domains such as medicine.

