Probing the Trajectories of Reasoning Traces in Large Language Models

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Large language models (LLMs) increasingly solve difficult problems by producing “reasoning traces” before emitting a final response. However, it remains unclear how accuracy and decision commitment evolve along a reasoning trajectory, and whether intermediate trace segments provide answer-relevant information beyond generic length or stylistic effects. Here, we propose a protocol to systematically probe the trajectories of reasoning traces in LLMs by 1) generating a model’s reasoning trace, 2) truncating it at fixed token-percentiles, and 3) injecting each partial trace back into the model (or a different model) to measure the induced distribution over answer choices via next-token probabilities. We apply this protocol to the open-source Qwen3-4B/-8B/-14B and gpt-oss-20b/-120b models across the multiple-choice GPQA Diamond and MMLU-Pro benchmarks. We find that accuracy and decision commitment consistently increase as the percentage of provided reasoning tokens grows. These gains are primarily driven by relevant content in the model generation rather than context length or generic “reasoning style” effects. Stronger models often backtrack successfully from incorrect partial traces, but when forced to answer immediately they frequently remain anchored to the weaker model’s incorrect response. More broadly, we show that trajectory probing provides diagnostics for efficient and safer deployment of reasoning models, as the measurements can inform practical trace-handling and monitoring policies that improve reliability without assuming intermediate tokens are inherently faithful explanations.


💡 Research Summary

The paper introduces a systematic “trajectory probing” protocol for analyzing the evolution of reasoning traces generated by large language models (LLMs) on multiple‑choice tasks. The protocol consists of three steps: (1) generate a full chain‑of‑thought (CoT) trace by prompting the model with a system instruction that forces a “thinking” mode (e.g., dedicated thinking tags); (2) slice the trace into deciles based on token count (10 % increments), producing partial prefixes r(d) for d ∈ {0, 10, …, 100}; (3) construct a probing prompt that contains the system instruction, the original question, the partial prefix, and an early‑stopping suffix that forces the model to output a single answer token. For each decile, the model’s next‑token distribution over the answer alphabet Y is recorded, yielding p(y | x, r(d)). From these distributions the authors compute per‑decile accuracy, the probability mass assigned to the eventual answer (decision commitment), non‑choice probability (mass on tokens outside Y), and flip rate (how often the argmax answer changes between successive deciles).
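The truncation and metric steps of the protocol can be sketched in a few lines. This is a minimal, model-free illustration: `decile_prefixes` and `probe_metrics` are hypothetical helper names, `trace_tokens` stands in for a tokenized reasoning trace, and the real protocol would obtain the distribution `p` from a model's next-token probabilities rather than from a dictionary.

```python
def decile_prefixes(trace_tokens):
    """Slice a trace into partial prefixes r(d) for d in {0, 10, ..., 100} percent."""
    n = len(trace_tokens)
    return {d: trace_tokens[:round(n * d / 100)] for d in range(0, 101, 10)}

def probe_metrics(p, answer_alphabet, final_answer):
    """Per-decile metrics from a next-token distribution p: token -> probability.

    - argmax:     the currently preferred answer choice
    - commitment: probability mass on the eventual final answer
    - non_choice: mass assigned to tokens outside the answer alphabet Y
    """
    choice_mass = sum(p.get(y, 0.0) for y in answer_alphabet)
    return {
        "argmax": max(answer_alphabet, key=lambda y: p.get(y, 0.0)),
        "commitment": p.get(final_answer, 0.0),
        "non_choice": 1.0 - choice_mass,
    }
```

The per-decile flip rate described above would then be the fraction of questions whose `argmax` changes between consecutive deciles.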

The authors apply this protocol to two challenging benchmarks—GPQA Diamond (198 four‑option questions) and MMLU‑Pro (12 032 ten‑option questions)—using the open‑source Qwen3‑4B/‑8B/‑14B and gpt‑oss‑20b/‑120b models. Across all models and both datasets, accuracy and decision commitment increase monotonically with the proportion of reasoning tokens supplied. Larger models consistently achieve higher absolute accuracy, and the steepest gains occur in the later deciles (60‑100 %). An exception is Qwen3‑8B, whose accuracy drops sharply from decile 90 to 100 due to frequent generation of “\boxed{}” formatting, which biases the model toward answer “A”.

To disentangle the contribution of pure context length from the semantic content of the partial traces, three length‑matched controls are introduced: (i) Random control—replace the prefix with random tokens of the same length; (ii) Swap control—replace the prefix with a trace from a different question of identical length, preserving trace form but breaking content alignment; (iii) Shuffle control—permute the tokens of the original prefix, preserving token identity but destroying sequential semantics. The Random control yields near‑baseline accuracy, confirming that length alone does not drive performance gains. The Swap control shows flat or even decreasing accuracy, indicating that misaligned reasoning provides no benefit and can be detrimental. The Shuffle control yields modest improvements (up to ~10 % over baseline), suggesting that while token identity contributes some signal, the ordered, instance‑specific semantic flow is the dominant factor.
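The three length-matched controls above can be sketched as simple token-level transformations. The function names are hypothetical, and a seeded `random.Random` is used so the sketch is deterministic; the real experiments would apply these to tokenized traces before re-injection.

```python
import random

def random_control(prefix, vocab, rng):
    """Replace the prefix with random tokens of the same length (length-only control)."""
    return [rng.choice(vocab) for _ in prefix]

def swap_control(prefix, other_trace):
    """Replace the prefix with a same-length slice of another question's trace
    (preserves trace form, breaks content alignment)."""
    return other_trace[:len(prefix)]

def shuffle_control(prefix, rng):
    """Permute the prefix tokens (preserves token identity, destroys order)."""
    shuffled = list(prefix)
    rng.shuffle(shuffled)
    return shuffled
```

Comparing per-decile accuracy under each control against the original prefixes is what isolates length, form, token identity, and ordered semantics as separate factors.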

The protocol also supports cross‑model experiments. The authors examine “weak‑to‑strong” injections, where a weaker model’s incorrect partial trace is fed to a stronger model. Two evaluation modes are considered: (a) Answer‑now—force the stronger model to answer immediately after the injected prefix; (b) Free‑continuation—allow the stronger model to continue reasoning before answering. The “rescue rate” (probability that the stronger model answers correctly despite the weaker model’s mistake) and “anchoring rate” (probability that it repeats the weaker model’s wrong answer) are computed. Results show that in the Free‑continuation mode, stronger models can recover from incorrect prefixes about 30‑45 % of the time, whereas in the Answer‑now mode they often remain anchored to the erroneous reasoning, especially for harder questions. This demonstrates that allowing additional reasoning steps after receiving a flawed prefix is crucial for mitigating error propagation.

Key insights from the study are: (1) The performance gains observed when more reasoning tokens are supplied are primarily driven by the semantic content of those tokens, not merely by increased context length or generic “reasoning style”. (2) Model scale amplifies the benefit of deeper reasoning, but over‑thinking can still occur, leading to “lost” trajectories where correct answers become incorrect in later deciles. (3) Decision confidence rises for both correct and incorrect answers, meaning longer reasoning can produce highly confident wrong predictions. (4) Model‑specific anomalies (e.g., Qwen3‑8B’s formatting bias) can be uncovered only through fine‑grained per‑decile analysis. (5) Cross‑model trace reuse is a double‑edged sword: stronger models can sometimes backtrack and correct a weak model’s mistake, but they can also become anchored, highlighting the need for careful policy design around trace handling.

The authors argue that trajectory probing offers a practical diagnostic toolkit for deploying reasoning‑enhanced LLMs safely and efficiently. By quantifying the marginal utility of each additional reasoning token and identifying failure modes (over‑thinking, anchoring, model‑specific biases), developers can devise trace‑handling policies—such as early stopping thresholds, confidence‑based monitoring, or selective continuation—that improve reliability without assuming that every intermediate token is a faithful explanation. This work thus bridges the gap between empirical observations of chain‑of‑thought benefits and actionable guidelines for trustworthy, compute‑efficient LLM deployment.
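As one illustration of such a trace-handling policy, an early-stopping rule could cut generation once per-decile decision commitment has stabilized above a threshold. This is a hedged sketch, not the paper's method: the function name, threshold, and patience parameter are assumptions layered on top of the commitment metric the paper defines.

```python
def early_stop_decile(commitments, threshold=0.9, patience=2):
    """Return the first decile at which commitment p(final answer | x, r(d))
    has stayed >= `threshold` for `patience` consecutive deciles, else None.

    commitments: mapping decile -> commitment, e.g. {0: 0.2, 10: 0.5, ...}.
    """
    run = 0
    for d in sorted(commitments):
        run = run + 1 if commitments[d] >= threshold else 0
        if run >= patience:
            return d
    return None
```

Because the paper shows confidence also rises for wrong answers, such a policy would need to be paired with monitoring (e.g., flip-rate checks) rather than used on commitment alone.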

