Reading time: 29 minutes

📝 Original Info

  • ArXiv ID: 2512.17901

📝 Abstract

Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LORE), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose a compute law with the hypothesis that reasoning compute should scale linearly with question complexity. Beyond compute, we extend LORE with a supplementary accuracy law. Since question complexity is difficult to quantify in practice, we examine these hypotheses via two tractable properties of the laws: monotonicity and compositionality. We therefore introduce LORE-BENCH, a benchmark that systematically measures these two properties for large reasoning models. Evaluation shows that most reasoning models exhibit reasonable monotonicity but lack compositionality. In response, we develop an effective finetuning approach that enforces compute-law compositionality. Extensive empirical studies demonstrate that better compliance with the compute law yields consistently improved reasoning performance on multiple benchmarks, and uncover synergistic effects across properties and laws. Project page: https://lore-project.github.io/.

📄 Full Content

Large Reasoning Models (LRMs) such as OpenAI o1 (Jaech et al., 2024) have demonstrated unprecedented progress in approaching human-like reasoning capabilities. Despite their strong performance on complex problems, even powerful LRMs exhibit abnormal behaviors that deviate from typical human reasoning patterns. Humans generally adapt their thinking to problem complexity (Newell et al., 1972). In contrast, as illustrated in Fig. 1, DeepSeek-R1 (Guo et al., 2025) generates longer reasoning yet achieves lower accuracy on a simpler sub-problem.¹

We also identify this unexpected phenomenon across a wide range of reasoning models, as shown in Fig. 5. This arises primarily because researchers generally overlook the high variability of Chain-of-Thought (CoT) (Wei et al., 2022) data during training. These CoT data are heuristically curated by human annotators or generated through online rollout (Schulman et al., 2017; Shao et al., 2024), and are rarely constrained by explicit rules, e.g., how much thinking budget to allocate for a given problem (Wu et al., 2025). Hence, the current training paradigm fails to guide models toward an optimal thinking strategy. This leads to inefficient allocation of computation, either overthinking (Chen et al., 2024b; Sui et al., 2025) or underthinking (Su et al., 2025; Yang et al., 2025; Wang et al., 2025), which in turn harms performance (Stechly et al., 2024; Zhou et al., 2025a).

To overcome this limitation, one line of work focuses on adaptive post-training techniques, including supervised fine-tuning with variable-length CoT (Aggarwal & Welleck, 2025; Team et al., 2025). Another line of work modulates reasoning at test time (Muennighoff et al., 2025; Fan et al., 2025; Zhang et al., 2025b). While many attempts have been made to control reasoning, existing approaches primarily rely on ad-hoc heuristics and still behave undesirably in our studies. Therefore, beyond empirical methods, several key challenges remain: (Q1) Can we theoretically formalize model reasoning to ensure desirable behavior? (Section 2) (Q2) How can we evaluate whether popular LRMs follow these proposed principles? (Section 3) (Q3) Does enforcing these principles further improve general reasoning capabilities? (Sections 4, 5)

Figure 1: Illustrative example with DeepSeek-R1 on (a) a summation question, (b) a squaring question, and (c) their composition ("sum, then square"). The model allocates ~300 more reasoning tokens to the squaring question than to the composite question, with a 12.5% accuracy drop. The mismatch with human reasoning reveals an abnormal reasoning pattern in current LRMs.

To fill this gap, we introduce the Laws of Reasoning (LORE), which systematically formalizes the relationship between complexity and reasoning behaviors in LRMs. The LORE framework comprises a core compute law and a complementary accuracy law. Given the practical challenges of measuring these hypotheses, the two fundamental laws are approximated via two tractable properties of optimal reasoning models: monotonicity and compositionality.

We then evaluate whether current LRMs follow the laws by developing LORE-BENCH, a comprehensive benchmark that examines monotonicity and compositionality in LRMs. While LORE-MONO is a curated benchmark across diverse domains for monotonicity, LORE-COMPO is constructed from MATH500 (Lightman et al., 2023) to measure compositionality. Our evaluation shows that current models exhibit reasonable monotonicity but lack compositionality, even for competitive baselines.

In response, we propose a simple yet effective fine-tuning approach to enforce compute-law compositionality. From validation experiments, we draw three key insights: (1) the compositionality of reasoning compute can be greatly improved with a simple fine-tuning approach; (2) enforcing compositionality generally leads to better reasoning capability; (3) synergistic effects emerge, yielding broader improvements across different properties and laws.

We introduce the Laws of Reasoning (LORE), a unified framework that formalizes the relationship between question complexity and model reasoning behaviors. Specifically, we focus on two key aspects, reasoning compute and accuracy, which are fundamental to understanding how models scale, generalize, and allocate computation budget when solving complex problems. Section 2.1 formulates the key concepts of reasoning. In Section 2.2, we present the central compute law, with the hypothesis that the reasoning budget should scale proportionally with question complexity. In Section 2.3, we introduce the complementary accuracy law, which posits that overall accuracy should decay exponentially with increasing complexity. See Fig. 2 for an illustration of the overall framework.

Notation. Let x ∈ X ⊆ V* denote a question, where V* is the space of finite-length sequences over a vocabulary V. Let M_θ ∈ M denote an autoregressive large reasoning model. LRMs adopt the thinking-then-answering paradigm (Guo et al., 2025; Abdin et al., 2025; Comanici et al., 2025), where the model M_θ first generates a reasoning chain r ∈ R ⊆ V* with probability p_θ(r | x) and then an answer y ∈ Y ⊆ V* with probability p_θ(y | x, r). We assume a fixed decoding strategy by default and denote the model's output by o = (r, y) ∈ O ⊆ V*. We define the composition of two questions x_1 and x_2 as their concatenation with a connector prompt c ∈ V*,² i.e., x_1 ⊕ x_2 = concat(x_1, c, x_2).

Figure 2: Overview of the LORE Framework. We present the compute law with the complementary accuracy law. These laws posit that compute scales linearly and accuracy decays exponentially with question complexity. Our framework approximates these laws using two properties: monotonicity and compositionality. Specifically, for the compute law, monotonicity captures that more complex questions require more compute, while compositionality indicates that for two independent questions, the compute for their composition is the sum of solving each individually.

Definition 1 (Complexity). Let a unit-cost primitive step denote a single valid transition of a fixed deterministic Turing machine (Turing et al., 1936), and let τ be any finite sequence of primitive steps with length ℓ(τ) ∈ N. Let v(x, τ) ∈ {0, 1} be a binary verifier that accepts (x, τ) if and only if τ is a valid solution sequence for x. The complexity of x ∈ X is

κ(x) = min { ℓ(τ) : v(x, τ) = 1 },

with κ(x) = ∞ if no valid τ exists.

Here the complexity refers to the minimal number of unit-cost primitive steps. Conceptually, κ(x) is well-defined via a binary verifier and a fixed deterministic Turing machine. However, computing κ(x) is generally intractable, as verifying the minimal solution length requires a global search over a potentially exponential space.
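To make the definition concrete, here is a toy sketch (ours, not the paper's) in which states stand in for Turing-machine configurations and κ is the length of a shortest accepted step sequence; even in this tiny setting, computing κ exactly requires exhaustively searching the transition graph.

```python
from collections import deque

def kappa(start, goal, neighbors):
    """Minimal number of unit-cost primitive steps from `start` to `goal`.

    Toy stand-in for Definition 1: states play the role of machine
    configurations, `neighbors` enumerates valid single-step transitions,
    and reaching `goal` plays the role of the verifier accepting the trace.
    """
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        state, steps = frontier.popleft()
        if state == goal:
            return steps            # length of a shortest valid solution sequence
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, steps + 1))
    return float("inf")             # no valid solution sequence exists

# Example: fewest "+1" / "*2" operations turning 1 into 10 (1 -> 2 -> 4 -> 5 -> 10)
print(kappa(1, 10, lambda n: [n + 1, n * 2]))  # -> 4
```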

For a given model, its test-time reasoning compute is directly proportional to the number of reasoning tokens generated. We therefore quantify reasoning compute as follows.

Definition 2 (Reasoning Compute). The reasoning compute on question x is defined as the expected number of reasoning tokens generated by the model:

C_θ(x) = E_{r ∼ p_θ(·|x)} [ ℓ(r) ],

where ℓ(r) denotes the length (in tokens) of the reasoning chain r.

Definition 3 (Reasoning Accuracy). The reasoning accuracy is defined as the probability that the model, when generating a reasoning chain and an answer given input x, produces a final answer that matches the ground truth. Formally,

A_θ(x) = Pr_{(r, y) ∼ p_θ(·|x)} [ ans(y) = a⋆(x) ],

where a⋆(x) denotes the correct answer to x, and ans(y) extracts the final answer from y.
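In practice, both quantities can only be estimated by sampling. A minimal Monte Carlo sketch follows; `model.generate` is a hypothetical interface standing in for drawing one output o = (r, y) under the fixed decoding strategy.

```python
def estimate_compute_and_accuracy(model, question, gold_answer, k=8):
    """Monte Carlo estimates of Definitions 2 and 3 from k sampled outputs.

    `model.generate` (hypothetical) returns one sampled output o = (r, y)
    as a (reasoning_tokens, answer_text) pair.
    """
    lengths, n_correct = [], 0
    for _ in range(k):
        reasoning_tokens, answer = model.generate(question)
        lengths.append(len(reasoning_tokens))          # l(r): chain length in tokens
        n_correct += (answer.strip() == gold_answer)   # ans(y) == a*(x)
    compute = sum(lengths) / k       # C_theta(x): expected reasoning tokens
    accuracy = n_correct / k         # A_theta(x): probability of a correct answer
    return compute, accuracy
```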

We hypothesize that, if a reasoning model allocates its reasoning compute efficiently, the amount of compute should scale approximately proportionally with complexity, i.e., C_θ(x) ∝ κ(x):

Hypothesis 1 (Compute Law). For an optimal reasoning model M_θ and a question x with complexity κ(x), there exists α_θ > 0, depending only on M_θ and the decoding strategy, such that

C_θ(x) = α_θ κ(x) + o(κ(x)),

where o(κ(x)) denotes a small systematic overhead that is sublinear, i.e., o(κ)/κ → 0 as κ → ∞.

Specifically, the o(κ(x)) term captures the introductory and transition tokens during the reasoning process. These tokens generally constitute a very small portion of the overall reasoning and can therefore be ignored in practice.

Two Tractable Alternative Properties as Proxies. As discussed in Definition 1, the complexity κ(x) is difficult to measure in practice. Consequently, empirically validating the linear relationship is nontrivial, as it would require known complexity values for individual questions. To address this, we adopt two tractable properties as empirical proxies for studying the laws: monotonicity and compositionality. These properties offer two key advantages: (i) they are tractable to verify without access to the exact value of κ(x); monotonicity relies only on relative comparisons between questions, while compositionality tests whether compute is additive over independent question pairs; (ii) they are theoretically sufficient to imply the proposed compute law (Proposition 1).

Property 1 (Compute-Complexity Monotonicity). For x_1, x_2 ∈ X, the reasoning compute is monotonically non-decreasing with complexity:

κ(x_1) ≤ κ(x_2) ⟹ C_θ(x_1) ≤ C_θ(x_2).

Definition 4 (Independence). For x_1, x_2 ∈ X, x_1 and x_2 are independent if the complexity of their composition is additive, i.e., κ(x_1 ⊕ x_2) = κ(x_1) + κ(x_2).

In practice, since exact complexity values are difficult to obtain, we define independence operationally. Suppose each question x ∈ X is associated with a set of mathematical concepts³ S(x) ⊆ S relevant to solving it. We consider two questions x_1 and x_2 to be independent if their concept sets are disjoint, i.e., S(x_1) ∩ S(x_2) = ∅.

Property 2 (Compute-Complexity Compositionality). For x_1, x_2 ∈ X, if x_1 and x_2 are independent, their composite x_1 ⊕ x_2 exhibits additive compute:

C_θ(x_1 ⊕ x_2) = C_θ(x_1) + C_θ(x_2) + o(κ(x_1 ⊕ x_2)),

where the sublinear term accounts for systematic overhead in the reasoning process (as assumed in Hypothesis 1). Therefore, the reasoning compute is approximately additive:

C_θ(x_1 ⊕ x_2) ≈ C_θ(x_1) + C_θ(x_2).

Discussion. Intuitively, these properties are motivated by two basic principles: (i) more complex questions naturally require more reasoning; (ii) two independent sub-questions involve no overlapping reasoning, so the total compute is the sum of solving each one individually. In the next proposition, we state informally that these properties imply the compute law (Hypothesis 1); a formal proof is provided in Appendix D. These tractable properties thus offer a practical means to evaluate whether current LRMs follow the compute law.

Proposition 1. Under certain conditions, if a reasoning model M_θ satisfies compute-complexity monotonicity and compositionality, then its reasoning compute C_θ(x) ∝ κ(x) for x ∈ X.

Following Definition 1, suppose a question requires solving κ(x) unit-cost primitive steps. If each step succeeds independently with a fixed probability and all steps must succeed for the final answer to be correct, then the overall accuracy is expected to decrease exponentially with κ(x). This intuition motivates the following formulation of the accuracy law:

Hypothesis 2 (Accuracy Law). For an optimal reasoning model M_θ and a question x with complexity κ(x), when 0 < A_θ(x) ≤ 1, there exists λ_θ ≥ 0 with

A_θ(x) = exp( −λ_θ κ(x) + o(κ(x)) ).

Equivalently, log A_θ(x) ∝ −κ(x), where λ_θ ≥ 0 is the decay rate.
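The step-wise intuition behind this law is easy to simulate. The sketch below (ours, with an assumed per-step success probability) shows that if all κ steps must succeed independently, empirical accuracy tracks p_step^κ = exp(−λκ) with λ = −ln p_step.

```python
import random

def simulated_accuracy(kappa, p_step=0.95, trials=10_000):
    """Empirical accuracy when a question needs kappa unit steps, each
    succeeding independently with probability p_step and all required
    to succeed; illustrates the exponential decay behind Hypothesis 2."""
    wins = sum(all(random.random() < p_step for _ in range(kappa))
               for _ in range(trials))
    return wins / trials

for k in (1, 5, 10, 20, 40):
    # empirical estimate vs. the closed form p_step ** kappa
    print(k, simulated_accuracy(k), 0.95 ** k)
```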

Similar to the compute law, we assume that the reasoning accuracy of the optimal reasoning model M_θ also satisfies two fundamental properties: monotonicity and compositionality.

Property 3 (Accuracy-Complexity Monotonicity). For x_1, x_2 ∈ X, the reasoning accuracy is monotonically non-increasing with complexity:

κ(x_1) ≤ κ(x_2) ⟹ A_θ(x_1) ≥ A_θ(x_2).

Property 4 (Accuracy-Complexity Compositionality). For x_1, x_2 ∈ X, if x_1 and x_2 are independent, their composite x_1 ⊕ x_2 exhibits multiplicative accuracy:

A_θ(x_1 ⊕ x_2) = A_θ(x_1) · A_θ(x_2).

Discussion. These properties are motivated by two basic principles: (i) more complex questions tend to have lower accuracy; (ii) for two independent questions with accuracies p_1 and p_2 (e.g., p_1 = 0.8, p_2 = 0.7), the probability of correctly answering both should be p_1 · p_2 (e.g., 0.56). We state below that these properties imply the accuracy law, with a formal proof provided in Appendix D.

Proposition 2. Under certain conditions, if a reasoning model M_θ satisfies accuracy-complexity monotonicity and compositionality, then its reasoning accuracy satisfies log A_θ(x) ∝ −κ(x) for x ∈ X.

In this section, we systematically evaluate whether current LRMs follow our proposed reasoning laws. Specifically, we introduce LORE-BENCH, a two-fold benchmark that leverages two tractable properties, monotonicity and compositionality, to examine LRMs.

Evaluating the monotonicity property in Property 1 or Property 3 requires comparing the complexity of arbitrary question pairs. However, due to its definition via minimal solution length, complexity is inherently difficult to quantify in practice. As a result, existing benchmarks are not suited for such analysis. To address this challenge, we construct LORE-MONO, a synthetic benchmark where questions are carefully curated and validated to follow known complexity orderings, allowing us to systematically assess the monotonicity of reasoning compute and accuracy.

(1) Seed Question Curation. We select four domains that require extensive reasoning (math, science, language, and code) and curate 10 diverse seed questions for each. A seed question defines a problem template shared across its variants.

(2) From Seed Questions to Variants. As shown in Fig. 3, for each seed question we create a series of variants (30 in total) that become increasingly complex by requiring more steps to reach the final answer; see the sketch after this list. For example, variant 1 requires one matrix operation, variant 2 requires two, and variant 30 requires thirty, with the identical operation applied repeatedly. By design, a larger number of steps directly corresponds to higher complexity. Note that this construction does not assume or require each operation to use the same compute.

(3) Program-based Generation and Manual Verification. All variants are generated through Python scripts to ensure correctness and scalability. To prevent unintended shortcuts such as periodic patterns, we manually verify each seed question and review sampled variants. We provide detailed seed questions and variants for each domain in Appendix E.
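A minimal sketch of steps (2)-(3), assuming a hypothetical modular-arithmetic seed (the affine update below is illustrative, not one of the paper's actual seeds): variant i asks for i applications of the same fixed operation, so the variant index tracks the number of required steps by construction.

```python
def make_variants(seed_value=7, modulus=1000, n_variants=30):
    """Generate variants of one hypothetical seed question: variant i asks
    for the result of applying the same operation i times, so a larger
    variant index directly corresponds to higher complexity.

    Note: the paper manually verifies seeds to rule out shortcuts such as
    periodic patterns; this toy map is not so verified.
    """
    variants = []
    x = seed_value
    for i in range(1, n_variants + 1):
        x = (3 * x + 11) % modulus          # one more application of the fixed operation
        question = (
            f"Starting from x = {seed_value}, apply x <- (3*x + 11) mod {modulus} "
            f"exactly {i} times. What is the final value of x?"
        )
        variants.append({"variant_index": i, "question": question, "answer": x})
    return variants

for v in make_variants()[:3]:
    print(v["variant_index"], v["answer"])   # 1 32, 2 107, 3 332
```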

We use the Spearman correlation coefficient ρ ∈ [−1, 1] to measure how the variant index, which directly determines the constructed question's complexity, relates to two quantities: reasoning compute and log accuracy. A high correlation with reasoning compute indicates that compute grows monotonically with complexity (Property 1), while a negative correlation with log accuracy indicates that accuracy tends to degrade as complexity increases (Property 3).
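A sketch of this scoring, assuming the per-index means have already been computed (the numbers below are made up for illustration only):

```python
import numpy as np
from scipy.stats import spearmanr

def monotonicity_score(per_index_means):
    """Spearman rho between the variant index (which directly determines the
    constructed question's complexity) and the per-index mean of a measured
    quantity (reasoning compute or log accuracy)."""
    idx = np.arange(1, len(per_index_means) + 1)
    rho, _ = spearmanr(idx, per_index_means)
    return rho

compute_means = [120, 150, 180, 260, 240, 310]          # should be near +1 (Property 1)
log_acc_means = [-0.05, -0.1, -0.2, -0.4, -0.5, -0.9]   # should be near -1 (Property 3)
print(monotonicity_score(compute_means), monotonicity_score(log_acc_means))
```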

In contrast, assessing compositionality is more straightforward: it only requires taking any two independent questions as sub-questions and constructing their composition. We build LORE-COMPO from MATH500 (Lightman et al., 2023), where each question is labeled by subject (e.g., Algebra, Geometry). Specifically, we randomly sample a pair of questions (x_1, x_2) from distinct pre-defined subjects to ensure independence, and concatenate them into a composite question x_12. Each original question is used at most once, yielding 250 triplets, each with two sub-questions and their composition:

{ (x_1^(i), x_2^(i), x_12^(i)) }_{i=1}^{250}, where x_12^(i) = x_1^(i) ⊕ x_2^(i).
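A sketch of this construction, using the connector template given in footnote 2; the dict layout of `questions` is our assumption:

```python
import random

CONNECTOR = "Answer the following questions in order: Q1. {q1}\nQ2. {q2}"

def build_compo_triplets(questions, n_triplets=250, seed=0):
    """Build (x1, x2, x1 (+) x2) triplets: pair questions from distinct
    subjects (operational independence via disjoint concept sets), using
    each original question at most once. `questions` holds dicts with
    "text" and "subject" keys (assumed layout)."""
    rng = random.Random(seed)
    pool = list(questions)
    rng.shuffle(pool)
    triplets, used = [], set()
    for i, q1 in enumerate(pool):
        if i in used:
            continue
        j = next((j for j in range(i + 1, len(pool))
                  if j not in used and pool[j]["subject"] != q1["subject"]), None)
        if j is None:
            continue
        used.update({i, j})
        composite = CONNECTOR.format(q1=q1["text"], q2=pool[j]["text"])
        triplets.append((q1["text"], pool[j]["text"], composite))
        if len(triplets) == n_triplets:
            break
    return triplets
```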

We therefore quantify the degree to which a model follows this property using the mean absolute deviation (MAD) over the N = 250 triplets:

MAD = (1/N) Σ_{i=1}^{N} | C_θ(x_12^(i)) − C_θ(x_1^(i)) − C_θ(x_2^(i)) |.

A smaller MAD indicates stronger adherence to the compositionality property. However, MAD is scale-dependent. To address this, we adopt the Normalized MAD (nMAD), a scale-invariant variant of the same deviation.
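A sketch of both metrics; the exact nMAD normalizer is not recoverable from this extraction, so the version below assumes each deviation is rescaled by its additive target C_θ(x_1) + C_θ(x_2):

```python
import numpy as np

def mad_nmad(c1, c2, c12):
    """MAD and normalized MAD of the compute-compositionality deviation over
    triplets. c1, c2, c12 hold C(x1), C(x2), C(x1 (+) x2) per triplet; the
    per-triplet normalization by C(x1) + C(x2) is our assumed reading of the
    scale correction described in the text."""
    c1, c2, c12 = map(np.asarray, (c1, c2, c12))
    dev = np.abs(c12 - (c1 + c2))      # deviation from additive compute
    mad = dev.mean()
    nmad = (dev / (c1 + c2)).mean()    # scale-invariant version (assumption)
    return mad, nmad
```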

Evaluation Setups. We examine 10 LRMs on LORE-MONO and LORE-COMPO: eight standard models (DeepSeek-R1-Distill Qwen-1.5B, Qwen-7B, Llama-8B, and Qwen-14B (Guo et al., 2025), Phi-4-mini-reasoning (Xu et al., 2025a), OpenReasoning-Nemotron-14B (Ahmad et al., 2025), Sky-T1-32B-Preview (Team, 2025a), and Qwen3-Next-80B-A3B-Thinking (Team, 2025b)) and two models that apply reasoning-length control, Thinkless-1.5B-RL-DeepScaleR (Fang et al., 2025) and AdaptThink-7B-delta0.05 (Zhang et al., 2025a). For each question, we sample 8 outputs per model with a fixed decoding temperature (0.6 for the DeepSeek family and 0.8 for the Phi-4 family, following their technical reports) and a maximum length of 20480 tokens. For LORE-MONO, at each variant index we first average reasoning compute⁴ and log accuracy across the 40 questions, and then compute the Spearman correlation.

Current LRMs Largely Satisfy Monotonicity. On LORE-MONO, all LRMs exhibit a strong positive correlation between reasoning compute and the variant index, which directly reflects question complexity, with most overall Spearman correlations close to 1, as shown in Tab. 1. The only exception is DeepSeek-R1-Distill-Qwen-1.5B, which has the weakest reasoning ability among the six models and yields a lower overall correlation (0.875). As illustrated in Fig. 4, notably, in the language domain its correlation between reasoning compute and complexity is negative (-0.346), while in the code domain, it is near zero (0.151). This indicates that in some domains, the reasoning compute for this model does not systematically increase with complexity, and may even decrease. We provide a case study as additional analysis along with visualization results for other models in Appendix E.2. Meanwhile, most LRMs exhibit a negative correlation between log accuracy and the variant index, as expected. For DeepSeek-R1-Distill-Qwen-1.5B, however, this trend appears noticeably weaker.

Current LRMs Fail to Exhibit Compositionality. The nMAD is large for both reasoning compute and log accuracy (Tab. 2), indicating that current LRMs do not satisfy compositionality. Fig. 5 further plots C θ (x 1 ⊕ x 2 ) against C θ (x 1 ) + C θ (x 2 ) for two representative LRMs. If an LRM adhered to the compositionality law, most points would align closely with the y = x line. In practice, however, the majority of points deviate substantially. Notably, even models equipped with reasoning length control mechanisms (Thinkless-1.5B and AdaptThink-7B) exhibit considerable deviations, suggesting that such techniques do not inherently promote compositional behavior.
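The diagnostic in Fig. 5 is straightforward to reproduce for any model's triplet measurements; a minimal matplotlib sketch (array names are ours):

```python
import matplotlib.pyplot as plt

def plot_compositionality(c1, c2, c12, label):
    """Scatter C(x1 (+) x2) against C(x1) + C(x2), as in Fig. 5.
    Points on the y = x reference line indicate additive (compositional)
    reasoning compute; deviations indicate non-compositional behavior."""
    target = [a + b for a, b in zip(c1, c2)]
    lim = max(max(target), max(c12))
    plt.scatter(target, c12, s=8, alpha=0.5, label=label)
    plt.plot([0, lim], [0, lim], "k--", linewidth=1)  # y = x reference
    plt.xlabel("C(x1) + C(x2)")
    plt.ylabel("C(x1 ⊕ x2)")
    plt.legend()
    plt.show()
```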

In Section 3, we showed that while most LRMs generally satisfy monotonicity, they often fail to satisfy compositionality. Based on Hypothesis 1, this observation motivates a natural question: can enforcing compositionality lead to stronger reasoning capacity? In response, we propose a simple yet effective supervised fine-tuning (SFT) method to promote compositional behavior in LRMs. Importantly, we focus on enforcing compositionality specifically with respect to reasoning compute, as it provides a more direct and actionable criterion for selecting supervision examples.⁵

Proposed Method: SFT-Compo. Let M_θ be an LRM and D_train a training dataset. Following the construction in Section 3.2, we select question pairs (x_1, x_2) ∈ D_train from distinct categories and form composite questions x_12 = x_1 ⊕ x_2. For each triplet (x_1, x_2, x_12), we sample K model outputs o = (r, y) ∈ O from an LRM (either the current model M_θ or a stronger teacher model), where r ∈ R is a reasoning path and y ∈ Y is the corresponding final answer, obtaining {o_k^(1)}_{k=1}^K for x_1, {o_k^(2)}_{k=1}^K for x_2, and {o_k^(12)}_{k=1}^K for x_12. Since compositionality is defined over reasoning paths, among the K³ combinations (o_1, o_2, o_12) we consider only those in which all three reasoning paths r_1, r_2, r_12 lead to correct answers, and select the combination that best satisfies the compositionality condition:

(o_1, o_2, o_12) = argmin | ℓ(r_12) − ℓ(r_1) − ℓ(r_2) |,   (1)

where the minimization is over the all-correct combinations.
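A sketch of this selection step as we read Eqn. 1 (the per-output dict layout is an assumption):

```python
from itertools import product

def select_compo_triplet(outs1, outs2, outs12):
    """SFT-Compo selection as we read Eqn. 1. Each element of outs1/outs2/
    outs12 is a dict {"len": reasoning-token count, "correct": bool, ...}
    (assumed layout). Among the K^3 combinations whose three reasoning
    paths are all correct, keep the one whose composite length best
    matches the sum of the sub-question lengths."""
    best, best_dev = None, float("inf")
    for o1, o2, o12 in product(outs1, outs2, outs12):
        if not (o1["correct"] and o2["correct"] and o12["correct"]):
            continue
        dev = abs(o12["len"] - (o1["len"] + o2["len"]))
        if dev < best_dev:
            best, best_dev = (o1, o2, o12), dev
    return best   # None if no all-correct combination exists; triplet is dropped
```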

We now empirically evaluate SFT-Compo, addressing two research questions: (1) whether it effectively enforces compositionality, and (2) whether it further improves the reasoning capacity of LRMs. We also provide additional insightful findings in our analysis.

Model, Dataset and SFT Recipe. We evaluate four LRMs: DeepSeek-R1-Distill (Qwen-1.5B, Qwen-7B, Llama-8B) (Guo et al., 2025) and Phi-4-mini-reasoning (Xu et al., 2025a). We construct a dataset of sub-question and composite-question triplets using a subset of DeepScaler (Luo et al., 2025b). For each question (either sub-question or composite), we use DeepSeek-R1-Distill-Qwen-14B as a stronger teacher model to sample K = 8 outputs. We then construct the compositionality-enforced dataset D_comp as described in Eqn. 1, which contains 3.9K question-output pairs. We fine-tune each LRM on D_comp for 5 epochs with a batch size of 16. Details are provided in Appendix F.

Evaluation. To evaluate compositionality, we use LORE-COMPO. For general reasoning capacity, we consider six benchmarks: GSM8K (Cobbe et al., 2021), MATH500 (Lightman et al., 2023), AIME 2024, AIME 2025 (Mathematical Association of America, 2025), AMC 2023 (AI-MO, 2024), and OlympiadBench (He et al., 2024). We set the maximum generation length to 10240 tokens.

We compare LRMs before and after SFT using the nMAD of reasoning compute on LORE-COMPO. As shown in Fig. 6a, SFT-Compo consistently reduces nMAD compared to the base model. On the 1.5B model, SFT-Compo achieves a reduction from 0.528 to 0.314 (a 40.5% reduction), and on the 8B model, from 0.423 to 0.328 (a 22.5% reduction). We further visualize the results on the 1.5B model in Fig. 6b, where SFT-Compo aligns much more closely with the y = x line. Therefore, the compositionality of reasoning compute can be effectively enforced in a simple manner via SFT-Compo.

Does Enforcing Compositionality Lead to Stronger Reasoning Capabilities? As shown in Tab. 3, SFT-Compo consistently improves performance across all six benchmarks and all four model sizes. For instance, on the 8B model, it yields a notable gain of +5.0 in average Pass@1. To rule out the possibility that performance gains stem solely from leveraging outputs generated by a stronger teacher model, we introduce a control baseline, SFT, which constructs its training dataset by uniformly sampling one correct reasoning path for each question in the triplet.

Notably, SFT-Compo outperforms SFT in all cases, showing that the gains come not merely from distilling a stronger model but from better compliance with compositionality. This supports our Hypothesis 1 (that stronger models better follow reasoning laws) and demonstrates that encouraging compositionality further enhances the reasoning capabilities of LRMs.

Enforcing Compositionality in Reasoning Compute Improves Its Monotonicity. Recall from Section 3.3 that DeepSeek-R1-Distill-Qwen-1.5B initially exhibits relatively weak monotonicity in reasoning compute. Fig. 7a shows that SFT-Compo significantly improves this property, increasing the overall Spearman correlation from 0.875 to 0.977; in the code domain, it rises from 0.151 to 0.914. This indicates that enforcing compositionality can implicitly enhance monotonicity.

Enforcing Compositionality in Reasoning Compute Improves Compositionality in Accuracy. Interestingly, although SFT-Compo is designed to enhance compositionality in reasoning compute, it also improves the compositionality of log accuracy. Fig. 7b shows that the nMAD of log accuracy drops from 2.368 to 0.685 on the 1.5B model (a 71.1% reduction), and from 1.170 to 0.756 on the 7B model (a 35.4% reduction). This suggests a possible interplay among different reasoning laws.

LRMs have emerged as a family of foundation models (Wiggins & Tejani, 2022). Since the advent of OpenAI o1 (Jaech et al., 2024), the "thinking-then-answering" paradigm has been widely adopted, with notable follow-ups such as DeepSeek-R1 and Phi-4-Reasoning (Abdin et al., 2024; Guo et al., 2025; Qwen Team, 2025). Our framework builds upon the contemporary paradigm of adaptive reasoning, wherein the model's reasoning budget is dynamically controlled either through post-training interventions (Luo et al., 2025a; Zhou et al., 2025b) or at test time (Muennighoff et al., 2025; Xu et al., 2025b; Zhang et al., 2025b). Specifically, one line of work explores post-training techniques that modulate when and how long a model should reason (Chen et al., 2024a; Yong et al., 2025), while another frontier focuses on dynamically adjusting reasoning behavior during inference (Qiao et al., 2025; Liu & Wang, 2025). Refer to Appendix B for additional related work.

As a comprehensive study spanning theoretical hypotheses and empirical validation, we advance a theoretical perspective grounded in human reasoning for improving reasoning in LRMs. We hope LORE can inspire further strategies that guide models toward their optimal paradigms of thinking.

Proposition 1 (Formal Version). Let X_fin = {x ∈ X : κ(x) < ∞}. Assume the setting and independence notion of Property 1, Property 2, and Assumption (A3). Then there exists a constant α_θ ≥ 0 such that

C_θ(x) = α_θ κ(x)

for all x ∈ X_fin.

Proof. Define an equivalence relation x ∼ y ⇐⇒ κ(x) = κ(y). By (A1), κ(x) ≤ κ(y) and κ(y) ≤ κ(x) together give C_θ(x) ≤ C_θ(y) and C_θ(y) ≤ C_θ(x), so C_θ is constant on each equivalence class.

Thus there is a well-defined f : K → R_{≥0} with f(κ(x)) = C_θ(x), where K = κ(X_fin) ⊆ N.

Fix u ∈ K and m ∈ N. By (A3) choose jointly independent x_1, ..., x_m with κ(x_i) = u. By (A2) and the fixed bracketing,

f(mu) = m f(u).

If K = {0} then C_θ ≡ 0 and the claim holds with α_θ = 0. Otherwise take u, v ∈ K with u, v > 0 and let ℓ = lcm(u, v). Then

f(u)/u = f(ℓ)/ℓ = f(v)/v,

so the ratio α_θ := f(u)/u is independent of u, giving f(u) = α_θ u for all u ∈ K, i.e., C_θ(x) = α_θ κ(x). □

Corollary D.1 (Asymptotic version with sublinear overhead). If compositional compute holds up to a sublinear overhead, i.e., for independent x, y,

C_θ(x ⊕ y) = C_θ(x) + C_θ(y) + o(κ(x ⊕ y)),

and the same (A3) assumption holds, then the above proof yields

C_θ(x) = α_θ κ(x) + o(κ(x)).

Proposition 2 (Formal Version). Let X_fin = {x ∈ X : κ(x) < ∞}. Assume the setting and independence notion of Property 3, Property 4, and Assumption (A3). Then there exists λ_θ ≥ 0 such that for all x ∈ X_fin with 0 < A_θ(x) ≤ 1,

A_θ(x) = exp(−λ_θ κ(x)).

Proof. As in the proof of Proposition 1, monotonicity implies A_θ is constant on each complexity class. Therefore there exists a well-defined map g : K → R_{≥0} with g(κ(x)) = −log A_θ(x).

Fix u ∈ K and m ∈ N. By (A3), choose jointly independent x_1, ..., x_m with κ(x_i) = u. By (A2) and the fixed bracketing,

g(mu) = m g(u).   (2)

If K = {0} then A_θ ≡ 1 and the claim holds with λ_θ = 0. Otherwise, let u, v ∈ K with u, v > 0 and set ℓ = lcm(u, v). Applying Eqn. 2 twice gives

g(u)/u = g(ℓ)/ℓ = g(v)/v,

so λ_θ := g(u)/u is independent of u, giving g(u) = λ_θ u and hence A_θ(x) = exp(−λ_θ κ(x)). □

Corollary D.2 (Asymptotic version with sublinear coupling). If for independent x, y the multiplicativity holds up to a sublinear deviation in the exponent,

log A_θ(x ⊕ y) = log A_θ(x) + log A_θ(y) + o(κ(x ⊕ y)),

and (A3) holds, then

log A_θ(x) = −λ_θ κ(x) + o(κ(x)).

Here we provide one representative seed question example for each domain.

Given an integer n = {N}, consider the order-2 recurrence over integers modulo M with an alternating update rule and a mild nonlinear term. You are given the initial values x_0 and x_1.

We update the sequence one step at a time. Let t = 1, 2, 3, ..., n denote the update index, where t = 1 is the update that produces x_2 from (x_1, x_0). At each update t, compute x_{k+1} from (x_k, x_{k−1}) using the parity of t:

- Define the nonlinear map φ(z) = (z + 1)^2. (You may reduce intermediate values modulo M at any time.)
- Odd step (t odd): [...]
- Even step (t even): [...]

For clarity, the first two updates are: [...]

Apply exactly n − 1 updates starting from x_0, x_1 to reach x_n, and **return x_n** as a single non-negative integer in [0, M − 1].

Conventions:
- All modular reductions are taken modulo M and return a non-negative remainder.
- The alternating rule depends on the update index t.
- Output only the integer value of x_n (no extra text).
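The seed's exact odd- and even-step update formulas did not survive extraction (the [...] gaps above). Purely to illustrate the template's shape, here is a runnable evaluator with two assumed update rules; they are stand-ins, not the paper's actual seed:

```python
def run_recurrence(n, x0, x1, M):
    """Illustrative evaluator for the seed template above. The original
    odd-/even-step rules were lost in extraction; the two updates below are
    ASSUMED placeholders showing only the alternating-parity structure and
    the nonlinear map phi(z) = (z + 1)^2."""
    phi = lambda z: (z + 1) ** 2
    prev, cur = x0 % M, x1 % M
    for t in range(1, n):                  # n - 1 updates produce x_n from x_0, x_1
        if t % 2 == 1:                     # odd step (assumed rule)
            nxt = (phi(cur) + prev) % M
        else:                              # even step (assumed rule)
            nxt = (cur + phi(prev)) % M
        prev, cur = cur, nxt
    return cur                             # x_n, a non-negative integer in [0, M - 1]

print(run_recurrence(n=5, x0=3, x1=4, M=97))
```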

We observe that the weakest model, DeepSeek-R1-Distill-Qwen-1.5B, occasionally produces no reasoning tokens on certain code questions. For example, on variant 1 of a seed code question, the generated output is simply \n, corresponding to a reasoning length of 0.

"Fail to think" example model output: \n\n\n To solve this problem, we need to simulate the execution of a given Python code and determine the final value assigned to the variable 'ANSWER'. The code involves string manipulation and state transitions, which we will simulate step by step.\n\n### Approach\n1. Initialization:\n - We start with 'N = 1' iteration, 's = "01938"', and a function 'f' that processes the string 's'.\n\n2. Function Definition:\n - The function 'f' takes a string 's' and processes each character based on its position in the string. The position determines whether the count of '0', '1', '2', or '3' is incremented modulo 5.\n\n3. Processing Each Character:\n - For each character in 's', we count the occurrences of '0', '1', '2', and '3'... (omitted)

For all composite questions, a fixed connector links the sub-questions, ensuring they are answered in sequential order. An illustrative example is provided below.

Answer the following questions in order: Q1. Jim and Martha are standing together at the corner of a rectangular field. Jim walks diagonally across the field. Martha gets to the same location by walking along its length and width. The field is 300 feet wide and 400 feet long. How many feet less than Martha does Jim walk? Q2. Find all values of x that satisfy the equation x = √(11 − 2x) + 4.

Visualizations of compositionality results for the remaining LRMs are provided in Fig. 9.

F.1 IMPLEMENTATION DETAILS

Since DeepScaler does not come with predefined categories, we first annotate each question using GPT-4.1-mini to assign it to one of the following categories: Algebra (Prealgebra), Counting & Probability, Geometry, Number Theory, or Calculus (Precalculus). Based on these annotations, we construct sub-question and composite-question triplets by pairing questions from different categories.

For SFT, we perform a grid search over learning rates in {1e-6, 5e-6, 5e-5}, using a batch size of 8, gradient accumulation of 2, and a warmup ratio of 0.

In Fig. 10, we further compare the reasoning compute compositionality of DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B before and after SFT-Compo. With SFT-Compo, the nMAD decreases, and the results align more closely with the y = x line compared to their base counterparts.

We present example outputs of DeepSeek-R1-Distill-Qwen-1.5B before and after SFT-Compo. For the base model, we report metrics for the question pair and the composite question: the composite accuracy is A_θ(x_12) = 0.50 with reasoning compute C_θ(x_12) = 732.13, showing that the model severely underthinks on the composite question and compromises accuracy. After SFT-Compo, the composite accuracy improves to A_θ(x_12) = 0.875 with reasoning compute C_θ(x_12) = 4300.00. We provide the detailed reasoning processes of the two models below.

Q: Answer the following questions in order: Q1. Pat is to select six cookies from a tray containing only chocolate chip, oatmeal, and peanut butter cookies. There are at least six of each of these three kinds of cookies on the tray. How many different assortments of six cookies can be selected? (Note that cookies of the same type are not distinguishable.) Q2. Let f(x) = x^10 + 5x^9 − 8x^8 + 7x^7 − x^6 − 12x^5 + 4x^4 − 8x^3 + 12x^2 − 5x − 5. Without using long division (which would be horribly nasty!), find the remainder when f(x) is divided by x^2 − 1.

Example reasoning process (DeepSeek-R1-1.5B)

Q1. To determine the number of different assortments of six cookies, I need to consider the number of ways to select six cookies from three types: chocolate chip, oatmeal, and peanut butter. Since the cookies of the same type are indistinguishable, the order of selection doesn't matter.

I can model this problem using combinations with repetition. The formula for the number of ways to choose k items from n types is given by C(n + k − 1, k).

In this case, n = 3 (types of cookies) and k = 6 (cookies to select). Plugging in the values gives C(3 + 6 − 1, 6) = C(8, 6).

Calculating C(8, 6) gives 28.

Therefore, there are 28 different assortments of six cookies that can be selected.

Q2. To find the remainder when f(x) is divided by x^2 − 1 without using long division, I can utilize the Remainder Theorem. The Remainder Theorem states that the remainder of a polynomial f(x) divided by x − a is f(a). However, since the divisor here is x^2 − 1, which factors into (x − 1)(x + 1), I need to evaluate f(x) at the roots of x^2 − 1, which are x = 1 and x = −1.

The remainder will be a linear polynomial of the form R(x) = ax + b. To find a and b, I can set up two equations by plugging in x = 1 and x = −1 into f(x):

1. f(1) = a(1) + b
2. f(−1) = a(−1) + b

Calculating f(1):

f(1) = 1^10 + 5(1)^9 − 8(1)^8 + 7(1)^7 − (1)^6 − 12(1)^5 + 4(1)^4 − 8(1)^3 + 12(1)^2 − 5(1) − 5

f(1) = 1 + 5 − 8 + 7 − 1 − 12 + 4 − 8 + 12 − 5 − 5

f(1) = −1

Calculating f(−1):

f(−1) = (−1)^10 + 5(−1)^9 − 8(−1)^8 + 7(−1)^7 − (−1)^6 − 12(−1)^5 + 4(−1)^4 − 8(−1)^3 + 12(−1)^2 − 5(−1) − 5

Example reasoning process (after SFT-Compo)

Okay, so I have these two questions to solve. Let me start with the first one. Q1. Pat is to select six cookies from a tray containing only chocolate chip, oatmeal, and peanut butter cookies. There are at least six of each of these three kinds of cookies on the tray. How many different assortments of six cookies can be selected? (Note that cookies of the same type are not distinguishable.) Hmm, okay. So Pat is selecting six cookies, and the tray has three types: chocolate chip, oatmeal, and peanut butter. Each type has at least six cookies, so there's no shortage. Since the cookies of the same type are indistinct, we're dealing with combinations with repetition, I think.

I remember that the formula for combinations with repetition is C(n + k -1, k), where n is the number of types and k is the number of items to choose. In this case, n is 3 (since there are three types of cookies) and k is 6 (since Pat is selecting six cookies). So plugging into the formula, it should be C(3 + 6 -1, 6) = C(8, 6).

Wait, let me double-check. So, combinations with repetition: if order doesn’t matter and items are indistinct except for type, yes, that’s the case here. So, the formula is indeed C(n + k -1, k). So, n=3, k=6, so 3 + 6 -1 = 8, so C(8,6). Alternatively, C(8,2) since combinations are symmetric, which is 28. So is that the answer?

But wait, let me think again. Since each type has at least six cookies, and we’re selecting six, does that affect anything? Because if each type has at least six, then when we select six, the maximum number we can take from any one type is six, but since we’re only selecting six, the maximum from any type is six, but since we’re selecting six in total, the maximum from any type is six, but the minimum is zero.

But in this case, since we have at least six of each, so when we choose six, the number of ways is the same as the number of non-negative integer solutions to the equation:

x1 + x2 + x3 = 6, where x1, x2, x3 are the number of chocolate chip, oatmeal, and peanut butter cookies respectively.

Which is exactly the stars and bars problem, so C(6 + 3 -1, 3 -1) = C(8,2) = 28.

So yeah, the answer is 28. That seems right. Q2. Let f(x) = x^10 + 5x^9 − 8x^8 + 7x^7 − x^6 − 12x^5 + 4x^4 − 8x^3 + 12x^2 − 5x − 5. Without using long division (which would be horribly nasty!), find the remainder when f(x) is divided by x^2 − 1. Okay, so polynomial division. But f(x) is a degree 10 polynomial, and we're dividing by x^2 − 1, which is also degree 2. So the remainder should be of degree less than 2, so it should be a linear polynomial, right? So it can be written as ax + b, where a and b are constants.

To find a and b, we can use the Remainder Theorem. Since x^2 − 1 factors as (x − 1)(x + 1), the remainder when f(x) is divided by x^2 − 1 is the same as the remainder when f(1) and f(−1) are computed, and then we can set up equations to solve for a and b.

Wait, let me recall: if we have f(x) divided by (x − c)(x − d), the remainder is R(x) = ex + f. So, to find R(x), we can evaluate f(c) and f(d), which gives us two equations:

f(c) = ec + f,  f(d) = ed + f.

Then, solving these two equations gives us e and f. In our case, the divisor is x^2 − 1, which is (x − 1)(x + 1). So, c = 1 and d = −1.

[To be continued]



Footnotes

2. One example of c can be "Answer the following questions in order: Q1. {Q1}\nQ2. {Q2}".

3. For example, concepts may come from Calculus (e.g., derivatives), Algebra (e.g., group theory), or Discrete Mathematics (e.g., logic).

4. We apply max-min normalization to the reasoning compute of each question to prevent any single item from dominating the results.

5. Accuracy compositionality is not easy to enforce directly, as it does not specify which reasoning path should be selected for supervision.


This content is AI-processed based on open access ArXiv data.
