Directional Reasoning Trajectory Change (DRTC): Identifying Critical Trace Segments in Reasoning Models
Understanding how language models carry out long-horizon reasoning remains an open challenge. Existing interpretability methods often highlight tokens correlated with an answer, but rarely reveal where consequential reasoning turns occur, which earlier context triggers them under causal intervention, or whether highlighted text actually steers the rollout. We introduce Directional Reasoning Trajectory Change (DRTC), a process-causal method that (i) detects pivot decision points via uncertainty and distribution-shift signals and (ii) applies receiver-side interventions that preserve the realized continuation without resampling while blocking information flow from selected earlier chunks only at a pivot. DRTC measures how each intervention redirects the log-probability trajectory relative to the realized rollout direction, yielding signed per-chunk attributions; we also compute logit-space curvature changes and curvature signatures as a complementary geometric diagnostic. Across four reasoning models, influence is sharply concentrated (Gini approximately 0.50-0.58, top-5% mass approximately 0.23-0.28), and learned pivots induce stronger effects than matched random spans. In a 500-problem MATH scaling study with R1-Distill-Qwen-1.5B, learned spans continue to outperform matched random spans (median Delta=0.409, 355/500 positive; p=2.3e-21), and curvature-impact co-localizes with DRTC within traces as a diagnostic. We benchmark against gradient- and perturbation-based chunk attributions and show graded outcome linkage: under embedding-interpolation edits, top-ranked DRTC chunks reduce teacher-forced gold-answer log-probability more than strict position-matched random chunks on a stability-filtered subset. Overall, DRTC provides a causally grounded view of how specific context elements steer on-policy reasoning trajectories.
💡 Research Summary
The paper introduces Directional Reasoning Trajectory Change (DRTC), a novel process‑causal interpretability framework designed to pinpoint the exact context segments that steer a language model’s long‑horizon reasoning. Existing attribution methods typically highlight tokens correlated with the final answer, but they do not reveal where a model changes its line of thought, which earlier text triggers that change, or whether the highlighted text truly influences the rollout. DRTC addresses these gaps through four tightly coupled components.
First, it discovers “pivot” decision points within a single on‑policy generation by scoring each token with a weighted sum of three uncertainty‑related signals: entropy, the top‑2 probability margin, and a Jensen‑Shannon divergence measuring distribution shift before and after the token. The top‑K pivots (default K = 8) are selected under a minimum spacing constraint, and a softmax over their raw scores yields normalized pivot importance weights u_k.
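The pivot-discovery step above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the function names, the signal weights `alpha`/`beta`/`gamma`, and the `min_gap` value are assumptions (the paper specifies K = 8 and a minimum spacing constraint but not its size), and the top-2 margin is inverted so that small margins score as high uncertainty.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability vectors."""
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pivot_scores(probs, alpha=1.0, beta=1.0, gamma=1.0):
    """probs: (T, V) next-token distributions along one on-policy trace.
    Weighted sum of entropy, inverted top-2 margin, and JS shift vs. the
    previous step's distribution."""
    T = probs.shape[0]
    scores = np.zeros(T)
    for t in range(T):
        p = probs[t]
        entropy = -np.sum(p * np.log(p + 1e-12))
        top2 = np.sort(p)[-2:]
        margin = top2[1] - top2[0]          # small margin => high uncertainty
        shift = js_divergence(probs[t - 1], p) if t > 0 else 0.0
        scores[t] = alpha * entropy + beta * (1.0 - margin) + gamma * shift
    return scores

def select_pivots(scores, k=8, min_gap=4):
    """Greedy top-K selection under a minimum-spacing constraint, plus a
    softmax over the selected raw scores -> importance weights u_k."""
    order = np.argsort(scores)[::-1]
    pivots = []
    for t in order:
        if all(abs(int(t) - s) >= min_gap for s in pivots):
            pivots.append(int(t))
        if len(pivots) == k:
            break
    raw = np.array([scores[t] for t in pivots])
    u = np.exp(raw - raw.max())
    return pivots, u / u.sum()
```

Greedy selection over the score ranking is one simple way to enforce the spacing constraint; the paper does not specify the exact selection procedure.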
Second, for each pivot τ_k the method performs a receiver‑side intervention that blocks information flow from a candidate earlier chunk c_i (a fixed‑stride token block, typically 16 tokens) only at that pivot. Concretely, the attention scores from the pivot query to all keys belonging to c_i are set to –∞ across every transformer layer and head, effectively zeroing the attention mass while leaving the token embeddings, hidden states, and the rest of the computation untouched. This yields a pivot‑local counterfactual that preserves the realized continuation, avoiding the confounding effects of full resampling.
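The masking operation at the heart of the intervention can be illustrated on a single head's pre-softmax score matrix. This is a toy NumPy sketch under stated assumptions (in practice this would be applied via hooks across every layer and head of the model); the function name and shapes are illustrative. The key property it demonstrates: only the pivot query's row changes, so all other positions keep their baseline attention.

```python
import numpy as np

def intervene_at_pivot(scores, pivot_t, chunk_slice):
    """scores: (T_q, T_k) pre-softmax attention scores for one head.
    Block information flow from the chunk's keys only at the pivot query
    by setting those scores to -inf; every other row is untouched.
    Assumes each row retains at least one unmasked key."""
    out = scores.copy()
    out[pivot_t, chunk_slice] = -np.inf
    # Row-wise softmax; exp(-inf) == 0, so masked keys get zero mass
    # and the remaining attention renormalizes over unmasked positions.
    m = np.max(out, axis=-1, keepdims=True)
    w = np.exp(out - m)
    return w / w.sum(axis=-1, keepdims=True)
```

Because the realized continuation is kept fixed (no resampling), the only change propagated downstream is the missing attention contribution from c_i at the pivot.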
Third, two quantitative measures are computed for each (pivot, chunk) pair. The "screening effect" a_{k,i} is the drop in the log‑probability of the baseline top token at τ_k when c_i is masked; after robust scaling and a sigmoid transformation, it becomes a relevance gate w_{k,i} ∈ (0, 1). The second measure is directional: it captures how the intervention redirects the log‑probability trajectory relative to the realized rollout direction, and, aggregated over pivots with the importance weights u_k and gated by w_{k,i}, it yields the signed per‑chunk attributions reported in the paper.