
📝 Original Info

  • ArXiv ID: 2512.19995

📝 Abstract

Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analyze, Explore, Implement, and Verify. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.

📄 Full Content

Large language models (LLMs) have demonstrated remarkable performance on complex reasoning tasks (OpenAI, 2024c; Marjanović et al., 2025; Comanici et al., 2025; Qwen Team, 2025a), particularly when generating explicit chains of thought (CoT) (Wei et al., 2023). Consequently, evaluation of reasoning models has predominantly focused on outcome-oriented metrics such as accuracy, solution length, or aggregate token counts (Lightman et al., 2023; Jiang et al., 2025a). While these measures are effective for comparing final performance, they provide limited insight into how these models organize their reasoning traces and how different reasoning behaviors emerge across models.

It is still unclear which parts of a generated chain-of-thought correspond to problem understanding, exploration, execution, or verification, making it difficult to interpret model behavior beyond surface statistics. This opacity is particularly salient when studying “overthinking” (Chen et al., 2025b; Fan et al., 2025; Kumar et al., 2025), where longer or more elaborate reasoning does not necessarily translate into improved correctness (Feng et al., 2025b), yet the underlying thinking dynamics and structure are hard to characterize systematically. Reasoning behaviors that are often discussed conceptually, e.g., abstract planning versus concrete execution or open-ended exploration versus evaluative checking, lack rigorous formulation and quantitative comparison across models.

To better interpret such reasoning traces, a natural question is whether they contain meaningful structure at an intermediate level of abstraction beyond individual tokens. Motivated by prior work (Li et al., 2025c) that introduced Schoenfeld’s Episode Theory (Schoenfeld, 1985) as a framework for characterizing problem-solving behaviors, we adopt episode-level representations as an inductive lens for analyzing LLM reasoning traces. Episode Theory conceptualizes problem solving in terms of functional episodes, thereby providing an interpretable intermediate-scale representation that bridges low-level token statistics and high-level reasoning intents. A condensed example is shown in Figure 1.

While earlier work (Li et al., 2025c) first applied this theory to reasoning models, its scope is limited to one model and one dataset. Thus, it remains unclear whether such an episode-level abstraction can reveal systematic, reproducible structure in LLM reasoning at scale. In this work, we build upon this foundation and extend episode-level annotation to a broader, comparative setting, using it to examine what principal patterns emerge when diverse reasoning traces are viewed through a theory-grounded abstraction. Our study is organized around three research questions. RQ1: Do reasoning traces exhibit consistent linguistic and dynamic structure when viewed through episode-level representations? RQ2: How do reasoning dynamics differ across models and reasoning styles, as indicated by episode sequencing and transition patterns? RQ3: Can we utilize this framework to analyze the correlation of thinking patterns with final correctness, and the structural differences between overthinking and efficient models?

To address these questions, we apply ThinkARM (Anatomy of Reasoning in Models), an episode-level annotation framework grounded in Schoenfeld’s Episode Theory (Harskamp and Suhre, 2007), to a large-scale analysis of mathematical reasoning traces from a diverse set of LLMs. Concretely, we curate a corpus of 410,991 sentences generated by 15 models solving 100 problems from a subset of Omni-MATH (Gao et al., 2024). To support reliable automated analysis, we construct a human-verified gold set of 7,067 sentences, evaluate multiple state-of-the-art LLMs as automatic annotators, and finally select GPT-5 (OpenAI, 2025b) for full-scale episode labeling due to its strongest agreement with human annotations. This pipeline enables consistent, sentence-level episode annotation at scale and sets up a controlled setting for comparing reasoning dynamics across models and methods.

Contributions. We extend a cognitive science-inspired episode annotation framework to an automatic, scalable, sentence-level representation that supports large-scale analysis of reasoning traces and conduct a systematic study of reasoning dynamics across a diverse set of LLMs. Moreover, we demonstrate the practical utility of episode-level analysis through two diagnostic case studies. Our main findings are as follows:

1. When reasoning traces are analyzed at the episode level, a functional progression from abstract reasoning to concrete execution, and finally to evaluative control, consistently emerges. Episodes associated with analysis and exploration use more abstract, conceptual language and decrease steadily as reasoning progresses, while execution-oriented episodes dominate the middle of the trace through sustained concrete operations. In contrast, verification-related episodes are characterized by evaluative and meta-level language and increase toward the end of the reasoning process.

2. Comparing reasoning and non-reasoning models, the difference is not merely how many tokens they generate, but how reasoning is structured. Non-reasoning models allocate most of their response trace to execution, with episode transitions largely following a feed-forward pattern toward implementation. In contrast, reasoning models distribute effort across analysis, exploration, execution, and verification, and exhibit frequent iterative Explore-Monitor/Verify loops.

3. Through our correctness-oriented case study, we find that exploration reflects uncertainty and serves as a critical branching point: correct solutions more often route exploration into monitoring or re-analysis, whereas incorrect solutions tend to continue execution or terminate prematurely after exploration.

4. Through our efficiency-oriented case study, we find that different efficient reasoning methods selectively suppress evaluation-oriented episodes and feedback loops, leading to varying degrees of divergence from the reasoning patterns of the base model. Episode-level analysis thus reveals which episodes can be removed to gain efficiency, beyond token-level pruning.

Together, these findings make explicit a range of intermediate-scale reasoning behaviors that are often discussed intuitively but rarely characterized structurally.

In this section, we formalize our analytical methodology. We first ground our approach in cognitive science theory in mathematical problem solving and then detail the implementation of ThinkARM, a scalable, automated pipeline to quantify the reasoning dynamics of LLMs with high precision.

To scientifically dissect the reasoning traces of LLMs, we ground our framework in Schoenfeld’s Episode Theory (Schoenfeld, 1985), a seminal framework in cognitive science originally developed to decode the black box of human mathematical problem solving. Schoenfeld’s theory is descriptive, derived from the rigorous analysis of hundreds of hours of videotaped “think-aloud” protocols. By observing students and mathematicians tackling mathematical problems, Schoenfeld found that performance is not distinguished by superior domain knowledge alone, but by the dynamic regulation of that knowledge. The theory frames problem solving as a temporally ordered sequence of “episodes” that reveal the solver’s evolving goal structure and metacognitive decisions (more discussion of mathematical problem solving in cognitive science can be found in Appendix A). Schoenfeld’s framework ultimately consists of seven episodes: the original six (Read, Analyze, Plan, Implement, Explore, and Verify) plus a later-added Monitor episode that specifically captures the solver’s metacognitive behaviors. Episode Theory has since become a foundational analytic lens in mathematics-education research and in studies of human reasoning, offering a fine-grained vocabulary for tracing cognitive control and strategy shifts (Harskamp and Suhre, 2007). Li et al. (2025c) first applied this theory to annotate reasoning traces of LLMs with thinking behaviors. However, their work is limited to single-task scenarios and single-model case studies and does not provide a systematic analysis of the reasoning dynamics of diverse LLMs.

While Schoenfeld’s original taxonomy captures key functional stages of human problem solving, it does not include an explicit state for structural convergence. In the context of LLM reasoning, where models are trained to produce a final answer in a prescribed format, this distinction becomes practically important. We therefore extend the taxonomy by introducing an Answer episode, resulting in eight episodes, which allows us to explicitly identify when the model commits to producing a final solution and to analyze convergence behavior separately from verification or monitoring, as shown in Figure 1.

Prior work (Li et al., 2025c) explored episode-based annotation using hierarchical schemes that combine paragraph-level and sentence-level labels. In this study, we adopt sentence-level annotation only, as it provides a uniform granularity that facilitates large-scale aggregation, transition analysis, and comparison across models. This design choice reflects a trade-off toward scalability and analytical simplicity, rather than a claim about the relative merits of different annotation granularities.
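As a concrete illustration (our own sketch, not part of the paper's released artifacts), the eight-episode label set and a sentence-level annotation record could be represented as follows:

```python
from dataclasses import dataclass
from enum import Enum

class Episode(Enum):
    # Schoenfeld's original six episodes
    READ = "Read"
    ANALYZE = "Analyze"
    PLAN = "Plan"
    IMPLEMENT = "Implement"
    EXPLORE = "Explore"
    VERIFY = "Verify"
    # Later-added metacognitive episode
    MONITOR = "Monitor"
    # Episode added in this work to mark answer commitment
    ANSWER = "Answer"

@dataclass
class AnnotatedSentence:
    text: str           # the sentence from the reasoning trace
    episode: Episode    # sentence-level episode label
    token_count: int    # number of tokens in the sentence
```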

To construct a diverse and representative dataset for our analysis, we sample problems from Omni-MATH using its domain annotations. Specifically, we stratify the full benchmark by domain and select problems from each group proportionally, aiming to preserve domain coverage while maintaining diversity in problem types and difficulty. This procedure yields a subset of 100 problems spanning a broad range of mathematical topics.

We then collect reasoning traces from 15 widely used LLMs (the complete list of models is in Appendix B) solving each problem, resulting in 1,500 responses and a total of 410,991 sentences. The collected traces include both thinking and answering tokens for native reasoning models and their distilled variants, and only answering tokens for standard instruction-following baselines and proprietary reasoning models without accessible thinking tokens. This diverse model coverage enables comparative analysis across model families and reasoning paradigms.

To support the reliable evaluation of automated episode annotation, we construct a human-verified gold set. From the 100 sampled problems, we select 9 representative problems following the same domain-based grouping strategy and manually annotate all resulting reasoning traces using the refined episode taxonomy and guidebook (Appendix E). This process produces 7,067 annotated sentences, which we use as the gold standard for testing the effectiveness of the automatic annotators and selecting the annotation model used in our large-scale analysis.

In ThinkARM, we utilize LLMs for automatic annotation. During the automatic annotation process, we provide the detailed annotation guidebook, which consists of the definition and examples of each episode, along with the previous context and previously annotated categories. Furthermore, when annotating each sentence, we ask models not only to generate the annotated category but also to generate a justification for each annotation to ensure its reliability (the detailed ThinkARM framework is in Appendix D and the annotation prompt in Appendix F).

For the annotation model selection, we evaluate the performance of several state-of-the-art models, including GPT-4.1 (OpenAI, 2025a), GPT-5 (OpenAI, 2025b), Gemini-2.5-Flash (Comanici et al., 2025), and Gemini-2.5-Pro, on the gold standard annotated traces. The performance of the annotation models is shown in Table 1, reporting accuracy and kappa scores on the traces from reasoning and non-reasoning models. Based on this performance, we select GPT-5 for the full-scale annotation and utilize its results for further analysis.

With ThinkARM, we now move beyond surface-level performance metrics to empirically analyze the cognitive dynamics of current LLMs.

Our analysis asks whether different functional modes of reasoning, often discussed intuitively, manifest as separable patterns in language. Figure 2 visualizes the most frequent tokens associated with each episode and shows that episodes occupy distinct lexical regions, suggesting that ThinkARM captures meaningful behavioral differences rather than superficial variation.

Analyze vs. Implement: A clear separation emerges between abstract reasoning and concrete execution. Language associated with Analyze emphasizes conceptual and structural elements of the problem (e.g., “coprime”, “boundary”), reflecting the construction and manipulation of an abstract problem representation. In contrast, Implement is dominated by procedural and symbol-level language (e.g., variable names and concrete values), indicating step-by-step execution of a chosen approach. Thus, ThinkARM captures a genuine shift from conceptual thinking to concrete implementation in reasoning behavior.

Verify vs. Monitor: Two qualitatively different forms of reflection also emerge. Verify is characterized by decisive evaluative language (e.g., “wrong”, “helpful”), indicating explicit checking of logical correctness or validity. Monitor, by contrast, is associated with meta-level expressions of uncertainty or progress tracking (e.g., “confusing”, “messy”), reflecting awareness of the state of the reasoning process rather than evaluation of a specific step. The lexical separation of these two behaviors supports treating verification and monitoring as distinct reasoning modes, which becomes important in the later temporal and structural analyses.

Plan vs. Explore: Forward-looking reasoning also exhibits two separable modes. Plan is dominated by directive verbs and goal-oriented language (e.g., “calculate”, “formalize”), signaling commitment to a concrete strategy. In contrast, Explore is marked by tentative and hypothesis-driven expressions (e.g., “suspect”, “maybe”), reflecting open-ended search over possible solution paths. This separation highlights exploration as a distinct behavioral mode characterized by uncertainty, rather than merely an early stage of execution.

By normalizing each reasoning trace to a 0-100% progress scale (details in Appendix B), we analyze how episode frequencies evolve over the generation (Figure 3). A consistent coarse-grained temporal organization, captured by ThinkARM, emerges across reasoning models. We refer to this pattern as the cognitive heartbeat of machine reasoning.

Early-stage scaffolding: Episodes associated with initial scaffolding (Read, Analyze, Plan, Explore) exhibit distinct decay patterns. Read drops sharply after the beginning, while Analyze and Plan decay more gradually, indicating that structural reasoning persists beyond the first few steps rather than being confined to a fixed planning prefix. Explore is typically front-loaded as well, consistent with early hypothesis search that narrows as execution proceeds.

Mid-stage execution: Implement exhibits a characteristic bell-shaped trajectory, peaking in the middle of the trace. This pattern suggests that concrete execution occupies the longest continuous region of the reasoning process, providing the backbone onto which other behaviors (e.g., exploration and verification) attach. The model shifts from high-level strategizing to mechanical execution (symbolic manipulation and calculation) as the core engine of the response.

Late-stage convergence: Verify increases steadily over progress, and Monitor follows a U-shaped profile with elevated mass near the beginning and end. Together, these trends indicate that evaluative and process-regulatory behaviors become more prominent as generation approaches completion, rather than appearing only as a final check. Finally, Answer remains near zero for most of the trace and rises sharply near the end, reflecting that answer commitment is concentrated in the terminal stage.

Table 2 reports the episode-level allocation of generated tokens. We group models into: (i) open-source reasoning models where full reasoning traces are available, (ii) standard instruction-following models evaluated on their direct responses, and (iii) proprietary reasoning models where only the final responses are observable.

From Doing to Thinking: A clear allocation gap emerges between reasoning and non-reasoning models. Standard instruction-following models allocate the majority of tokens to Implement, with minimal mass assigned to Explore, Verify, or Monitor. In contrast, reasoning models exhibit a more balanced profile, allocating substantially more budget to Analyze and Explore, and maintaining nontrivial allocation to Verify. This suggests that the distinguishing factor is not merely response length, but how the generation budget is distributed across reasoning behaviors.

The Nature of Non-reasoning Responses: For proprietary reasoning models where only the final formulated answers are accessible, the episode allocation from the observable outputs is much closer to the non-reasoning group than to open-trace reasoning models. In other words, the externally visible responses exhibit limited allocation to exploration- and verification-associated episodes.

Distillation Preserves Structure: Within the distilled series, we observe that episode allocation profiles remain similar across model sizes. R1-Distill-Qwen-1.5B exhibits an episode distribution close to its teacher DeepSeek-R1 despite large differences in parameter count. This indicates that distillation can transfer not only answers but also episode-level reasoning structure, as reflected in allocation patterns.

Beyond marginal episode allocations, episode transitions characterize how reasoning behaviors interact with each other. We convert each trace into a symbolic episode sequence (e.g., Read → Plan → Implement) and use an information-theoretic criterion to identify distinctive local structures (details in Appendix B). Concretely, we compute the Mutual Information (MI) between the presence of episode N-grams and the source group, and extract the most discriminative patterns. Formally, the MI between an episode N-gram pattern p and the source group variable g is defined as

MI(p; g) = Σ_p Σ_g P(p, g) log [ P(p, g) / (P(p) P(g)) ],

where p indicates the presence of a given episode N-gram in a trace, and g denotes the source group.

After obtaining the top-k discriminative patterns, we use conditional probabilities to attribute each pattern to a class.
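The attribution step can be computed from per-trace presence counts; the following Python sketch is illustrative (the function name and data layout are ours, not the authors'):

```python
from collections import Counter

def pattern_attribution(sequences_by_group, pattern):
    """Estimate P(group | pattern present) for an episode n-gram `pattern`
    (a tuple of episode labels), counting presence at most once per trace."""
    n = len(pattern)
    present = Counter()  # group -> number of traces containing the pattern
    for group, sequences in sequences_by_group.items():
        for seq in sequences:
            ngrams = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
            if pattern in ngrams:
                present[group] += 1
    total = sum(present.values())
    if total == 0:
        return {}
    # Conditional probability of each group given that the pattern is present
    return {group: present[group] / total for group in sequences_by_group}
```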

Reasoning vs. Answer Tokens: Comparing the reasoning portion of a reasoning model to the final formulated answer tokens, we find that the most discriminative patterns are short feedback loops involving Explore, Monitor, and Verify (Table 3). In particular, Exp-Mon and Mon-Exp rank among the top patterns, indicating frequent alternation between exploration and process monitoring during the reasoning trace. Patterns such as Ver-Exp further suggest that verification is often followed by renewed exploration rather than immediate convergence. In contrast, the final answer tokens are characterized by more feed-forward transitions with limited recurrence of these evaluative loops.

Reasoning vs. Non-Reasoning Models: When comparing reasoning traces to responses from standard instruction-following models, we observe an overlapping set of discriminative patterns: non-reasoning responses are dominated by feed-forward transitions, where exploration-monitoring-verification loops are much less prevalent. This indicates that the gap between reasoning and non-reasoning models is reflected not only in how much time is allocated to different behaviors, but also in the presence of recurrent transitions that interleave exploration with evaluation.

To demonstrate the utility of our framework, we conduct a correctness-oriented case study examining how episode-level patterns are associated with solution correctness. Using a stratified sample of 500 reasoning traces from 5 representative open-source reasoning models, we formulate a binary classification setting that predicts answer correctness from quantitative cognitive features extracted from episode annotations (details in Appendix C).

Methodology We map each reasoning trace to a feature vector consisting of three components: (i) global statistics (total token count, thinking-token count, and thinking-token ratio), (ii) episode intensities (token ratios for each of the eight episodes), and (iii) transition features (the flattened 8 × 8 episode transition matrix capturing frequencies of state shifts). Then, we fit a Lasso-regularized logistic regression model to predict the correctness of a reasoning trace:

min_w −Σ_i [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ] + λ‖w‖₁,

where ŷ = σ(w⊤x), σ(·) denotes the sigmoid function, and the ℓ1 penalty encourages sparsity in the coefficient vector. Table 4 reports the top features ranked by coefficient magnitude (β). Here, each coefficient β corresponds to a weight in w for a specific episode-level feature. A positive (negative) β indicates that higher values of the feature are associated with an increased (decreased) likelihood of a correct answer under this model, and the magnitude of β reflects its relative importance among the selected features.

The most predictive positive features highlight how exploratory behavior is resolved during successful reasoning. In particular, Explore → Monitor and Explore → Analyze rank among the strongest positive coefficients, suggesting that correct solutions tend to route exploratory uncertainty into meta-level monitoring and renewed conceptual analysis. Similarly, Monitor → Analyze indicates that monitoring is often followed by additional analysis rather than immediate execution, consistent with an uncertainty-to-reasoning redirection pattern. Verification-related transitions also appear as informative signals at different points in the trace. Read → Verify and Answer → Verify suggest that traces associated with correctness more frequently include explicit checking both near the beginning and near the end.

On the negative side, a higher Explore ratio is a strong risk indicator, suggesting that sustained exploration that is not resolved into monitoring or renewed analysis is associated with incorrect outcomes.

Efficient reasoning methods for LLMs are often evaluated primarily through surface metrics such as response length, leaving it unclear which reasoning behaviors are being altered. In this case study, we use ThinkARM to characterize how different efficiency paradigms reshape episode-level dynamics. Specifically, we compare a baseline model (R1-Distill-Qwen-1.5B) against three representative strategies: (1) L1 (Aggarwal and Welleck, 2025a), (2) ThinkPrune (Hou et al., 2025), and (3) the method proposed by Arora and Zanette (2025).

Table 5 summarizes how episode-level token allocation changes for these methods. Compared to the baseline, L1 and ThinkPrune substantially reduce the budget assigned to evaluative and conceptual episodes (e.g., Verify and Analyze) and shift the profile toward a more Implement-heavy allocation. In contrast, the method of Arora and Zanette (2025) maintains an allocation profile closer to the baseline, retaining a substantial Verify and Analyze budget. These patterns suggest that efficiency methods are not uniformly compressive: they can induce qualitatively different redistributions of reasoning behaviors.

To further localize which transition-level structures are most affected, Table 6 reports the most distinctive episode N-grams suppressed by each efficiency strategy relative to the baseline, computed via Mutual Information. L1 exhibits high divergence scores (peaking at 0.37), particularly for loop-like patterns such as N-V-N (Analyze→Verify→Analyze) and V-N-V, indicating that recurrent verification loops are strongly attenuated. ThinkPrune shows a similar but generally weaker suppression of such loop structures. By contrast, the method of Arora and Zanette (2025) yields much lower divergence scores (peaking around 0.10), suggesting that it preserves more of the baseline transition topology while still reducing overall cost.

Overall, this case study illustrates that efficiency is not behaviorally neutral: strategies that achieve similar surface-level reductions in length can correspond to markedly different changes in episode-level allocation and transition structure.

In this work, we introduce ThinkARM as an inductive, intermediate-scale framework that abstracts reasoning traces into functional reasoning steps grounded in cognitive theory. Through large-scale empirical analysis, we show that this abstraction makes previously opaque reasoning structures explicit, revealing consistent temporal organization and interpretable diagnostic patterns related to correctness and efficiency. Our framework provides a principled lens for analyzing, comparing, and diagnosing reasoning behavior in modern language models.

Large-scale episode annotation relies on an automatic annotator, which may introduce labeling noise despite strong agreement with human annotations on a verified gold set. Moreover, our experiments focus primarily on mathematical problem solving; extending episode-level analysis to other domains and reasoning settings remains an important direction for future work.

Analyzing human problem solving has evolved from broad cognitive taxonomies to domain-specific frameworks. Early models like Bloom’s Taxonomy (Krathwohl, 2002) categorized cognitive processes hierarchically but failed to capture the iterative nature of mathematical problem solving. Similarly, instructional frameworks such as Pólya’s (1945) four-phase model and subsequent refinements by Mason et al. (2010) and Yeo and Yeap (2010) provided valuable pedagogical scaffolding but lacked the fine-grained operationalization required for rigorous empirical annotation. Even more detailed models like that of Greenes (1995), while emphasizing metacognition, remained too sequential to effectively code the nonlinear dynamics often observed in real-world reasoning tasks. Schoenfeld’s (1985) Episode Theory offers a robust, empirically validated scheme for coding behaviors into distinct episodes such as Reading, Analysis, Exploration, and Verification. Unlike prescriptive models, Schoenfeld’s framework explicitly captures the strategic decisions and metacognitive control, or lack thereof, that determine problem-solving success (Kuzle, 2013). This granular focus on cognitive transitions makes it particularly suitable for analyzing AI-generated reasoning, which often struggles with self-regulation. Consequently, Schoenfeld’s model provides the necessary precision to systematically annotate and evaluate the complex, often non-linear reasoning traces investigated in this study. Li et al. (2025c) first applied this framework to analyze the reasoning process of large language models. However, they did not conduct a systematic fine-grained analysis of the episode-level patterns of annotated reasoning traces or compare episode-level patterns between different models.

The surge in LLM development has catalyzed substantial efforts to bolster their reasoning skills (Ahn et al., 2024; Besta et al., 2025; Chen et al., 2025a). Most research has centered on enhancing these abilities via post-training methods. For instance, reinforcement learning has been utilized to steer models towards superior reasoning strategies (Shao et al., 2024; Xiong et al., 2025; Cui et al., 2025). Furthermore, instruction tuning using rigorously selected, high-quality datasets has proven effective in boosting performance (Li et al., 2024b,a; Ye et al., 2025; Muennighoff et al., 2025; Li et al., 2025a). Moreover, Snell et al. (2024) and OpenAI (2024a) have shown that scaling up test-time compute can also improve the reasoning capabilities of models. Yet, while these models demonstrate remarkable gains on mathematical reasoning benchmarks, there remains a notable gap in systematically understanding and quantifying how these enhancements alter model behavior.

Overthinking (Chen et al., 2025b; Fan et al., 2025) is a well-documented issue of reasoning models: they tend to produce excessive intermediate steps or consider unnecessary details, which can lead to inefficient reasoning. To address this issue, several methods (Sui et al., 2025; Feng et al., 2025a) have been proposed to encourage reasoning models to reason more efficiently. Specifically, L1 (Aggarwal and Welleck, 2025a) proposes a simple reinforcement learning method that optimizes for accuracy and adherence to user-specified length constraints. ThinkPrune (Hou et al., 2025) offers a simple solution that continuously trains long-thinking LLMs via reinforcement learning with an added token limit. Arora and Zanette (2025) train reasoning models to dynamically allocate inference-time compute based on task complexity. Further works (Fang et al., 2025; Liu et al., 2025; Xiang et al., 2025; Li et al., 2025b) have been proposed to encourage efficient reasoning. However, very little work explicitly analyzes how these methods differ in behavior or episode-level patterns. Our work is the first to systematically analyze the episode-level patterns of these methods in comparison to baseline reasoning models.

Chain-of-thought (CoT) (Wei et al., 2023) can significantly elicit the reasoning capabilities of models. Various works have studied different properties of CoT, including bias (Wu et al., 2025), faithfulness (Lanham et al., 2023), and redundancy (Chen et al., 2025b). Specifically, for o1-like reasoning models that have particularly long reasoning traces and showcase more System-II-level reasoning abilities, efforts have been made to uncover the structural patterns of CoT (Jiang et al., 2025b). Bogdan et al. (2025) introduced a black-box method that measures each sentence’s counterfactual importance to understand sentence-level importance in long CoT chains. Feng et al. (2025b) introduced a graph view of CoT to extract structure and identified an effective statistic that correlates with the correctness of the reasoning trace. Li et al. (2025c) and Kargupta et al. (2025) incorporated theories from cognitive psychology to label the episodes of CoT in analogy to human problem-solving processes. Our work, built upon Li et al. (2025c), is the first to systematically analyze the statistical patterns of reasoning traces grounded in cognitive theories.

We conduct our analysis on a variety of reasoning and non-reasoning models. The reasoning models include DeepSeek-R1 (DeepSeek-AI et al., 2025), DeepSeek-R1-Distill-Qwen, QwQ-32B (Qwen Team, 2025b), Phi4 (Abdin et al., 2024), Qwen3-32B (Qwen Team, 2025a), Gemini-2.5-Flash (Comanici et al., 2025), GPT-o1-mini (OpenAI, 2024b) and GPT-o3-mini (OpenAI, 2025c). The non-reasoning models include GPT-4o (OpenAI et al., 2024), Gemini-2.0-Flash (Team et al., 2024), Qwen2.5-32B (Team, 2024) and the non-reasoning mode of Phi4 and Qwen3-32B. We also study some efficient reasoning models including L1 (Aggarwal and Welleck, 2025b), ThinkPrune (Hou et al., 2025), and models released by Arora and Zanette (2025).

To investigate the temporal dynamics of reasoning phases across model responses, we analyze how the distribution of cognitive phases evolves over the course of generation. For each annotated response, we divide the sequence into B equal-sized temporal bins based on token position, where B = 25 in our analysis. Within each bin, we compute the frequency of tokens belonging to each reasoning phase category. Specifically, for a response with total length L tokens, we define the bin size as ∆ = L/B. Each token at position t is assigned to bin b = ⌊t/∆⌋, ensuring uniform temporal coverage across responses of varying lengths. For each category c and bin b, we count the number of tokens belonging to that category, yielding a raw frequency distribution f_{c,b}.

To enable comparison across responses with different phase compositions, we normalize the frequency distribution within each response. For category c in response r, we compute the normalized frequency in bin b as

f̂_{c,b}^{(r)} = f_{c,b}^{(r)} / Σ_{b'=1}^{B} f_{c,b'}^{(r)},

where the denominator represents the total number of tokens of category c in response r. This normalization ensures that the distribution sums to 1 for each category within each response, allowing us to examine the relative temporal positioning of phases independent of their absolute frequency.
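A minimal Python sketch of this binning and per-category normalization, assuming token-level episode labels are available (variable and function names are ours):

```python
import numpy as np

def episode_progress_profile(token_episodes, categories, num_bins=25):
    """token_episodes: list of episode labels, one per token, in generation order.
    Returns {category: array of length num_bins} of normalized bin frequencies."""
    L = len(token_episodes)
    counts = {c: np.zeros(num_bins) for c in categories}
    for t, episode in enumerate(token_episodes):
        # Bin index floor(t / (L / B)), clipped to the last bin for boundary tokens
        b = min(int(t / (L / num_bins)), num_bins - 1)
        counts[episode][b] += 1
    # Normalize within each category so the profile sums to 1 per response
    return {c: v / v.sum() if v.sum() > 0 else v for c, v in counts.items()}
```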

After annotation, each response can be represented as a sequence of cognitive phases. To study which phase combinations are specific to a certain model or kind of model (e.g., DeepSeek vs. others, reasoning vs. non-reasoning), we conduct a phase-based N-gram analysis that treats each phase as a gram and investigates which combinations of grams are the most significant patterns for a type of model. Specifically, we use a letter to represent each phase (e.g., R for Read), so that each response becomes a string. We use mutual information to identify the most discriminative patterns between two groups:

I(X; Y) = Σ_{x_i} Σ_{y_j} p(x_i, y_j) log [ p(x_i, y_j) / (p(x_i) p(y_j)) ],

where x_i ∈ {present, absent} indicates whether n-gram g appears in a sequence, y_j ∈ {Group 1, Group 2} denotes the model group membership, and p(x_i, y_j), p(x_i), and p(y_j) are the joint and marginal probabilities computed from the 2 × 2 contingency table of n-gram occurrences across the two groups.
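The same quantity can be computed directly from the 2 × 2 contingency table; the sketch below is our own illustrative implementation, not the authors' code:

```python
import math

def ngram_mutual_information(group1_seqs, group2_seqs, pattern):
    """MI between presence/absence of an episode n-gram and group membership."""
    n = len(pattern)

    def contains(seq):
        return any(tuple(seq[i:i + n]) == pattern for i in range(len(seq) - n + 1))

    # 2x2 contingency counts: rows = present/absent, columns = group 1 / group 2
    counts = [[0, 0], [0, 0]]
    for j, seqs in enumerate((group1_seqs, group2_seqs)):
        for seq in seqs:
            counts[0 if contains(seq) else 1][j] += 1

    total = sum(sum(row) for row in counts)
    mi = 0.0
    for i in range(2):
        for j in range(2):
            p_xy = counts[i][j] / total
            if p_xy == 0:
                continue
            p_x = sum(counts[i]) / total                    # marginal over presence/absence
            p_y = (counts[0][j] + counts[1][j]) / total     # marginal over groups
            mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi
```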

In this section, we provide the detailed implementation of the correctness diagnostic analysis presented in Section 4.1. We formulate the problem as a binary classification task, where we predict the correctness of a reasoning trace based on its episode-level cognitive features.

C.1 Feature Engineering.

For each reasoning trace, we extract a comprehensive set of features capturing global statistics, episode intensity, and transition dynamics. Let S = {s_1, s_2, ..., s_N} be the sequence of sentences in a trace, where each sentence s_i is associated with an episode tag e_i ∈ E and a token count t_i. The set of episode categories E consists of the 8 refined episodes: Read, Analyze, Plan, Implement, Explore, Verify, Monitor, and Answer. We compute the following feature groups:

• Global Statistics:

-Total Tokens: The total number of tokens in the response, Σ_i t_i, estimated using the GPT-4 tokenizer (‘cl100k_base’).

-Think Ratio: The proportion of tokens belonging to the reasoning process (excluding the final Answer content) relative to the total tokens.

• Episode Intensity (Token Ratios): For each episode category c ∈ E, we compute the proportion of the total budget allocated to it:

ratio_c = ( Σ_{i: e_i = c} t_i ) / ( Σ_{i=1}^{N} t_i ).

This yields 8 features representing the relative dominance of each cognitive behavior.

• Transition Features: We construct a transition matrix capturing the frequency of shifts between reasoning states. We compute the raw count of transitions from episode src to episode tgt:

T_{src,tgt} = Σ_{i=1}^{N−1} 1[e_i = src ∧ e_{i+1} = tgt].

This results in 8 × 8 = 64 transition features, capturing the structural flow of reasoning (e.g., Explore → Verify).
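Combining the three feature groups, a sketch of the resulting feature extraction (global statistics follow the main-text description; helper names and data layout are ours, as the paper does not prescribe an implementation):

```python
import numpy as np

EPISODES = ["Read", "Analyze", "Plan", "Implement",
            "Explore", "Verify", "Monitor", "Answer"]
IDX = {e: i for i, e in enumerate(EPISODES)}

def trace_features(sentences):
    """sentences: list of (episode_label, token_count) pairs for one trace.
    Returns a 3 + 8 + 64 = 75-dimensional feature vector."""
    total = sum(t for _, t in sentences)
    think = sum(t for e, t in sentences if e != "Answer")  # reasoning tokens

    # Episode intensities: share of the token budget per episode
    intensity = np.zeros(len(EPISODES))
    for e, t in sentences:
        intensity[IDX[e]] += t
    intensity /= max(total, 1)

    # Raw counts of transitions between consecutive sentence-level episodes
    transitions = np.zeros((len(EPISODES), len(EPISODES)))
    for (e1, _), (e2, _) in zip(sentences, sentences[1:]):
        transitions[IDX[e1], IDX[e2]] += 1

    global_stats = np.array([total, think, think / max(total, 1)])
    return np.concatenate([global_stats, intensity, transitions.ravel()])
```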

We employ a Lasso-regularized Logistic Regression model to identify the most predictive features while enforcing sparsity for interpretability. The model predicts the probability of correctness p(y = 1|x):

p(y = 1 | x) = σ(w⊤x),

where x is the standardized feature vector, w are the learned coefficients, and σ(•) is the sigmoid function.

To handle the high-dimensional feature space (particularly the transition matrix) and select only the most robust signals, we use L1 regularization (Lasso). The objective function is

L(w) = −(1/M) Σ_{i=1}^{M} [ y_i log σ(w⊤x_i) + (1 − y_i) log(1 − σ(w⊤x_i)) ] + λ‖w‖₁,

where M is the number of samples and λ controls the regularization strength.

Implementation Details. We use the reasoning traces from 5 representative open-source reasoning models (DeepSeek-R1, DeepSeek-R1-Distill-Qwen-32B, Phi-4, Qwen3-32B, and QwQ-32B). All features are standardized using Z-score normalization before training. This ensures that the magnitude of coefficients directly reflects the relative importance of features. We use the liblinear solver, which is well suited for smaller datasets with L1 regularization. We set the regularization parameter C = 0.5 (where C = 1/λ) to promote feature selection, and use a maximum of 2,000 iterations to ensure convergence. We analyze the learned coefficients w. Features with positive coefficients increase the probability of a correct answer, while those with negative coefficients are associated with incorrect outcomes. Features with zero coefficients are pruned by the Lasso regularization, indicating they are less relevant for predicting correctness in this linear approximation.
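With the stated hyperparameters (ℓ1 penalty, liblinear solver, C = 0.5, 2,000 iterations, Z-score standardization), the probe can be reproduced along the following lines; this is a sketch using scikit-learn, where X and y denote the feature matrix and binary correctness labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def fit_correctness_probe(X, y):
    """X: (num_traces, 75) feature matrix; y: binary correctness labels."""
    scaler = StandardScaler()
    X_std = scaler.fit_transform(X)  # Z-score normalization so coefficients are comparable
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5, max_iter=2000)
    clf.fit(X_std, y)
    # Nonzero coefficients correspond to the episode-level features retained by the Lasso
    coefs = clf.coef_.ravel()
    selected = np.flatnonzero(coefs)
    return clf, scaler, {int(i): float(coefs[i]) for i in selected}
```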

C.3 Full Feature Coefficients.

The detailed ThinkARM framework is shown in Figure 4. For each batch of sentences in the model response, the annotation model tags the episode category of each sentence based on the guidebook, question, and previously annotated context, and outputs the rationale and annotation in JSON format as instructed by the format prompt. The labels from all batches are then concatenated to form the labels for the whole response. The guidebook and prompt template in the figure are given in Appendix E and Appendix F.
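A schematic of this annotation loop is sketched below; the call_llm helper, batch size, and prompt assembly are placeholders rather than the authors' exact implementation:

```python
import json

def annotate_response(question, sentences, guidebook, call_llm, batch_size=10):
    """Tag each sentence of a model response with an episode label.
    `call_llm` is a placeholder for the chosen annotation model (e.g., an API call
    that returns a JSON string)."""
    labels = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        prompt = (
            f"{guidebook}\n\nQuestion: {question}\n\n"
            f"Previously annotated context: {json.dumps(labels[-10:])}\n\n"
            "For each sentence below, output JSON with a rationale and an episode label:\n"
            + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(batch))
        )
        result = json.loads(call_llm(prompt))  # expected: [{"rationale": ..., "label": ...}, ...]
        labels.extend(item["label"] for item in result)
    return labels
```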


Figure 4: The ThinkARM Framework. For each question-response pair, the model response is first segmented into sentences. They are then tagged by the annotation models in batches, along with information about the guidebook, question, context, and format. The guidebook is in Appendix E, the prompt template is in Appendix F.

In this project, we aim to analyze the reasoning process of current large language models (LLMs) with advanced reasoning capabilities, i.e., Large Reasoning Models (LRMs), based on a modified version of Alan Schoenfeld’s (1985) “Episode-Timeline” framework for problem-solving. The original Schoenfeld theory was built on hundreds of hours of recorded tapes of students tackling non-routine math problems while being asked to think aloud. Widely regarded as a gold-standard framework in mathematics education research, this theory offers a rigorously validated, fine-grained lens for dissecting both expert and novice problem-solving strategies. After thorough investigation, we find that the thinking process of LRMs can be well-aligned with the episodes in the theory, as they also follow similar problem-solving processes. Thus, in this project, we aim to annotate the model (solver) responses with these episode categories. To better apply the theory to the analysis of model responses, we utilize sentence-level annotation, which is used to capture the fine-grained behavior of each sentence, including eight categories: Read, Analyze, Plan, Implement, Explore, Verify, Monitor, and Answer. The original Schoenfeld theory only has six categories: Read, Analyze, Plan, Implement, Explore, and Verify. These categories describe the thinking behaviors of how humans solve a problem. Later, an additional “Monitor” category was included in the system to capture behaviors that do not contain specific content but are still important, such as “Let me think.” Moreover, when trying to apply the theory to analyzing LLM behaviors, we introduce another category, “Answer,” to represent the sentence that delivers the answer. Thus, in total, there are eight categories. For each sentence, the annotation depends on both the current sentence itself and its context.

Read

• Definition: This is usually the initial phase, which focuses on extracting or restating the given information, conditions, and the goal of the problem as presented. It involves understanding the question without any inference of strategy or reasoning.

• Guidelines:

-Sentences in this category should directly present the content of the original problem statement.

-Look for phrases that recall or repeat elements of the question.

-This label is mostly presented for the model’s initial processing of the problem.

• Potential Keywords/Indicators: “The question asks…”, “The problem requires…”, “We are given…”, “The goal is to…”, “The choices are…”, direct quotes from the problem.

• Distinguishing Features:

-This stage is purely about understanding the input, not about processing it or deciding how to solve it. It should not contain content like trying to understand or analyze the question.

-Avoid labeling sentences as Read if they include any form of analysis or evaluation of the problem. The Read stage usually appears at the beginning of the reasoning. However, it can also appear in the middle of the reasoning, in order to ensure that the question was understood correctly.

• Example: “The question asks us to find the value of x in the equation 2x + 5 = 10.”

Analyze

• Definition: This stage involves constructing or recalling relevant theories, introducing necessary symbols, and deducing relationships based on the problem statement and existing knowledge. The core activity is explanation or logical inference that sets the stage for the solution but does not involve concrete calculations yet.

• Guidelines:

-Sentences should explain the underlying mathematical concepts or principles relevant to the problem.

-This category includes analysis of either the problem itself or intermediate results.

-This label applies to logical deductions and inferences made with certainty.

• Potential Keywords/Indicators: “According to…”, “We can define…”, “This implies that…”, “Therefore…”, “Based on this…”, “We can infer that…”, “Let’s note that…”, “Let me observe that…”, “Let’s recall that…”

• Distinguishing Features:

-The Analyze episode involves certain inferences and explanations, unlike Explore, which shows uncertainty.

-The actual execution of calculations is mainly in the Implement stage.

-Analyze does not involve any concrete calculation, which is unlike Implement.

-The behavior of setting some basic notations should be in the Implement stage, e.g., “Let me set a = 1, then…”.

• Important Note: Be careful not to include sentences that involve substituting values or performing calculations, as those belong to the Implement stage.

• Example: “According to the Pythagorean theorem, in a right-angled triangle, the square of the hypotenuse is equal to the sum of the squares of the other two sides.” or “If I can get the equation in slope-intercept form (y = mx + b), then I can plug in y = 4 and solve for x, which should be d.”

Plan

• Definition: This stage involves announcing the next step or outlining the entire solution strategy. It represents a commitment to a particular course of action before the actual execution begins.

• Guidelines:

-Sentences should clearly state the intended next step or the overall plan.

-Look for explicit declarations of intent, often using the first person or imperative voice.

-This stage signifies that a decision has been made on how to proceed, and the next step should be related to math problem solving, rather than generally saying “let’s think about it.”

• Potential Keywords/Indicators: “Next, we will…”

• Example: “Next, we will differentiate both sides of the equation with respect to x.”

Implement

• Definition: This stage is the operational phase where the planned strategy is executed. It involves setting up basic notations, performing specific calculations, constructing diagrams, enumerating possibilities, or coding solutions using numerical values, symbols, or geometric objects.

• Guidelines:

-Sentences should describe the actual steps taken to solve the problem.

-Look for mathematical operations, substitutions, and the generation of intermediate results.

-This stage is about “doing” the math.

• Potential Keywords/Indicators: “Substituting x = 2, we get…”, “Therefore, P (1) = -1”, “Expanding the expression…”, “The matrix becomes…”, “Let me set a = 1, then…”, “Let’s denote x = 10…”, actual mathematical equations and calculations.

• Distinguishing Features:

-Implement involves concrete actions and calculations, unlike Analyze, which focuses on theoretical explanations, or Plan, which outlines future actions.

-If a conclusion follows the implementation of math, that conclusion is tagged as Implement, such as “therefore, the sum of all possible values is 5.”

• Example: “Substituting x = 3 into the equation, we get 2(3) + 5 = 6 + 5 = 11.”

Explore

• Definition: This stage is characterized by generating potential ideas, making guesses, drawing analogies, or attempting trial calculations that might be abandoned later. The model is exploring different avenues without committing to a specific solution path. This stage often involves uncertainty.

• Guidelines:

-Sentences should suggest alternative approaches or possibilities.

-Look for tentative language and expressions of uncertainty.

-This stage involves brainstorming and initial investigations without a clear commitment to a particular method.

• Potential Keywords/Indicators: “Maybe we can try…”, “Perhaps we could use…”, “What if we consider…”, “Another possibility is…”, “Could this be related to…”, “Maybe I should…”

• Distinguishing Features:

-Explore is marked by uncertainty and a lack of commitment, unlike Plan, which announces a definite course of action.

-It involves considering various options before settling on a specific plan. If a sentence analyzes the problem, implements a calculation, or verifies a result or thought, even if it follows sentences like “Maybe we can try…”, it is not considered Explore at the sentence level and therefore should not be labeled as Explore. Rather, such sentences are considered Analyze, Implement, or Verify within the Explore episode at the paragraph level.

Only sentences like “Maybe we can try…” will be labeled as Explore at the sentence level.

• Example: “Maybe we can try substituting different values for x to see if we can find a pattern.”

Verify

• Definition: This stage involves judging the correctness, effectiveness, or simplicity of the obtained result or the method used. It might include checking the answer, using an alternative method for calculation, or estimating bounds.

• Guidelines:

-Sentences should express an evaluation or confirmation of the solution or the process.

-Look for keywords related to checking, confirming, or validating.

-This stage ensures the solution and result are accurate and make sense.

• Potential Keywords/Indicators: “Let me double-check…”, “This is consistent with…”, “Plugging it back in…”, “Therefore, the answer is correct.”, “Let’s confirm…”, “Let me check again…”, “We can confirm this by…”, “This result seems reasonable because…”, “The answer is…?”, “Is the answer…?”, “Is there any mistake?”, “Did I make a mistake?”, “This is the same/correlated as previous…”, “But this seems to contradict…”, “…lead/arrive to the same answer”, “Wait, we don’t know… yet”, “Let’s try another way to verify…”, “XXX is possible/impossible.” When the following sentences are meant as conclusions, “…is indeed…”, “…should be…”

• Distinguishing Features:

-Verify focuses on evaluating the solution, unlike Implement, which focuses on generating it.

-It often involves comparing the result with initial conditions or using alternative methods.

• Example: “Let me double-check my calculations: 2 × 3 + 5 = 11, which matches the previous result.”

Monitor

• Definition: This additional category captures sentences that are typically short interjections or expressions indicating the model’s self-monitoring, hesitation, or reflection at the juncture between different episodes. These often do not contain substantial problem-solving content and are brief pauses in the thought process.

• Guidelines:

-Sentences should be short phrases indicating a shift in thought or a brief pause.

-Look for expressions of uncertainty, reflection, or transition.

-This label is for meta-comments that don’t fit neatly into the other problem-solving stages.

• Potential Keywords/Indicators: “Hmm…”, “Wait…”, “Let me think.”, “Okay…”, “Let’s see.”, “Hold on.”, “But wait, hold on.”

• Distinguishing Features:

-Monitor sentences lack the substantive content of the other categories and primarily serve as indicators of the model’s internal processing flow.

-They are often very short and act as bridges between more content-heavy stages.

-In most cases, it should not contain an evaluation of previous steps or any specific question or solution content. If it contains specific content, e.g., “Wait, the problem says…”, it should be categorized into another episode such as Read.

• Example: “Wait.”

Answer

• Definition: This stage is used for sentences that explicitly state an answer or conclusion to the problem. These sentences deliver the result, either as a final answer at the end of the response or as an intermediate answer that may be subject to later verification or revision. Note: it should be the answer to the given problem, rather than an intermediate answer for a calculation step.

• Guidelines:

-Sentences should directly present a solution, value, or conclusion in response to the given problem statement.

-Look for clear, declarative statements that summarize the outcome of the reasoning or calculation.

-This category applies whether the answer is final or provisional.

• Potential Keywords/Indicators: “The answer is…”, “Hence, the result is…”, “So, the final answer is…”.

• Distinguishing Features:

-Answer sentences are characterized by their directness in providing a result to the given problem, unlike Verify, which focuses on checking correctness, or Implement, which details the process of obtaining the result.

-These sentences often appear at the end of a solution but can also occur mid-response as provisional answers.

• Example: “Therefore, the answer is 24.”

• Sentence-Level Focus: Annotate each sentence individually based on its primary function within the problem-solving process.

• Context is Key: While keywords can be helpful, always consider the context of the sentence within the overall response. A sentence might contain a keyword but function differently based on the surrounding text.

We use the prompt template in Figure 5 to annotate the reasoning traces in our study. The detailed content of the guidebook can be found in Appendix E.

Table 3: Top discriminative episode N-grams ranked by Mutual Information (MI). Left: patterns that distinguish a reasoning model’s reasoning trace from the final answer segment. Right: patterns that distinguish reasoning traces from non-reasoning model responses. Higher MI indicates a stronger association between the pattern and the source group.


