Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns
Recent progress in large reasoning models on challenging mathematical benchmarks has been driven by reinforcement learning (RL). Incorporating long chain-of-thought (CoT) data during mid-training has also been shown to substantially improve reasoning depth. However, current approaches often use CoT data indiscriminately, leaving open the critical question of which data types most effectively enhance model reasoning. In this paper, we define a foundation model's reasoning potential, for the first time, as the inverse of the number of independent attempts required to answer a question correctly, a quantity strongly correlated with final model performance. We then propose expanding this potential with diverse data enriched in high-value reasoning patterns. Specifically, we abstract atomic reasoning patterns from CoT sequences, characterized by commonality and inductive capability, and use them to construct a core reference set enriched with valuable reasoning patterns. We further propose a dual-granularity algorithm, operating on chains of reasoning patterns and on token entropy, that efficiently selects from the data pool the high-value CoT data (CoTP) best aligned with the core set, thereby training models to master reasoning effectively. With only 10B tokens of CoTP data, the 85A6B Mixture-of-Experts (MoE) model improves by 9.58% on the challenging AIME 2024 and 2025 benchmarks and raises the upper bound of downstream RL performance by 7.81%.
💡 Research Summary
The paper introduces a principled framework for improving the reasoning capabilities of large foundation models, particularly in challenging mathematical problem solving. The authors first formalize “reasoning potential” as the probability that a model produces the correct answer on its first sampling attempt (Φ), and prove that this is the inverse of the expected number of independent attempts (K) required to solve a question. Reducing K therefore directly increases a model’s intrinsic reasoning ability.
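Under the paper's independence assumption, the number of attempts K until the first correct answer is geometrically distributed with success probability Φ, so E[K] = 1/Φ. A minimal Monte Carlo sketch (the value of Φ here is hypothetical, chosen only for illustration) checks this relationship:

```python
import random

def expected_attempts(phi: float, trials: int = 100_000) -> float:
    """Monte Carlo estimate of E[K]: the mean number of independent
    attempts until the first correct answer, when each attempt
    succeeds with probability phi."""
    total = 0
    for _ in range(trials):
        k = 1
        while random.random() >= phi:  # attempt fails, try again
            k += 1
        total += k
    return total / trials

random.seed(0)
phi = 0.25                     # hypothetical per-attempt success rate
est = expected_attempts(phi)   # geometric distribution: E[K] = 1/phi = 4
assert abs(est - 1 / phi) < 0.1
```

This is why reducing the expected attempt count K and raising the first-attempt success probability Φ are the same objective viewed from two sides.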
Since an ideal “oracle” training set that maximally expands reasoning potential is impractical to construct, the authors approximate it with a carefully curated core reference set (C_core) that is rich in high‑value reasoning patterns. The construction pipeline involves: (1) sampling questions across difficulty levels and subjects, (2) using multiple strong reasoning models to obtain correct answers, (3) automatically annotating each chain‑of‑thought (CoT) with atomic reasoning patterns (ρ) via a large language model, and (4) scoring pattern importance with TF‑IDF. Human reviewers then select CoT instances that contain distinctive, high‑importance patterns to form the core set.
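The TF-IDF scoring step treats each CoT's pattern chain as a "document" and each atomic pattern ρ as a "term", so a pattern that recurs within a chain but is rare across the corpus scores highly. A sketch under illustrative assumptions (the pattern names and the exact IDF smoothing are placeholders, not the paper's taxonomy or formula):

```python
import math
from collections import Counter

# Hypothetical pattern chains: each CoT is annotated with a sequence of
# atomic reasoning patterns. Names are illustrative only.
chains = [
    ["case_split", "invariant", "case_split", "verify"],
    ["substitution", "verify"],
    ["invariant", "bounding", "verify"],
]

def tfidf_scores(chains):
    """Score each pattern per chain by TF-IDF: term frequency within
    the chain times log-inverse document frequency across chains."""
    n = len(chains)
    df = Counter()                      # document frequency per pattern
    for chain in chains:
        df.update(set(chain))
    scores = []
    for chain in chains:
        tf = Counter(chain)
        scores.append({
            p: (tf[p] / len(chain)) * math.log((1 + n) / (1 + df[p]))
            for p in tf
        })
    return scores

scores = tfidf_scores(chains)
# "verify" appears in every chain, so its IDF (and score) is zero;
# "case_split" is concentrated in one chain, so it scores highest there.
assert scores[0]["case_split"] > scores[0]["verify"]
```

Patterns with high scores are the distinctive, high-importance ones that human reviewers then promote into the core set.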
To select additional training data from a large source pool (D_source), the paper proposes a dual‑granularity algorithm that measures similarity on two fronts: (a) pattern‑chain distance and (b) token‑entropy chain distance. Both distances are computed using weighted Dynamic Time Warping (DTW); pattern‑chain DTW incorporates TF‑IDF weights, while entropy‑chain DTW uses absolute differences of token entropy. A weighted sum D_ij = λ·d_pattern + (1‑λ)·d_entropy yields a composite distance between a core instance i and a source instance j.
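The two DTW distances and their λ-weighted combination can be sketched as below. The local cost functions, TF-IDF weights, and λ value are illustrative assumptions; the paper's exact weighting scheme is more involved:

```python
import math

def dtw(a, b, cost):
    """Plain dynamic-time-warping distance between sequences a and b
    under a pairwise local cost function."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(a[i - 1], b[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Illustrative TF-IDF weights for a few hypothetical patterns.
tfidf = {"case_split": 0.7, "verify": 0.1, "substitution": 0.5}

def pattern_cost(p, q):
    # zero cost on a match; mismatch penalized by the patterns' weights
    return 0.0 if p == q else tfidf.get(p, 0.3) + tfidf.get(q, 0.3)

def entropy_cost(x, y):
    # entropy-chain DTW uses absolute differences of token entropy
    return abs(x - y)

lam = 0.5  # lambda: trade-off between pattern and entropy granularity
d_pattern = dtw(["case_split", "verify"], ["substitution", "verify"],
                pattern_cost)
d_entropy = dtw([1.2, 0.4, 0.9], [1.0, 0.8], entropy_cost)
D_ij = lam * d_pattern + (1 - lam) * d_entropy  # composite distance
```

DTW is a natural fit here because two CoTs can express the same reasoning trajectory at different lengths; the warping aligns steps rather than forcing position-by-position comparison.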
The selection problem is cast as an assignment problem: each core instance must be matched with exactly o source instances, and each source instance can be assigned at most once. By replicating each core instance o times, the authors obtain a standard linear assignment matrix of size (t·o) × N, where t is the number of core instances and N the size of the source pool. The optimal assignment is solved with the Hungarian algorithm, guaranteeing a globally optimal matching under the defined distances. The resulting selected set D_select is therefore highly aligned with the core set in both pattern and entropy dimensions.
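The replicate-then-assign trick can be sketched with SciPy's Hungarian-algorithm solver. Sizes and distances here are toy values, not the paper's setup:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy setup: t = 2 core instances, N = 5 source instances, and each
# core instance must receive o = 2 matched source instances.
t, N, o = 2, 5, 2
rng = np.random.default_rng(0)
D = rng.random((t, N))        # composite distances D_ij from the
                              # dual-granularity metric

# Replicate each core row o times, turning the o-to-one matching into
# a standard linear assignment of size (t*o) x N.
D_rep = np.repeat(D, o, axis=0)

rows, cols = linear_sum_assignment(D_rep)   # Hungarian algorithm
selected = {}
for r, c in zip(rows, cols):
    selected.setdefault(r // o, []).append(c)

# Each core instance gets exactly o source instances, and no source
# instance is assigned more than once.
assert all(len(v) == o for v in selected.values())
assert len({c for v in selected.values() for c in v}) == t * o
```

Because the assignment is solved exactly, the selection is globally optimal under the composite distance, though for very large pools the cubic cost of the Hungarian algorithm may require blocking or approximation (a limitation the paper itself notes).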
For data construction, the authors aggregate several large math QA datasets (OpenR1‑Math, AM‑DeepSeek‑R1‑Distilled, BoostQA) and apply aggressive deduplication and similarity filtering to build a “LongCoTPool”. They generate long CoT traces (up to 32 k tokens) using strong models, annotate them with pattern chains, and ensure no overlap with the core set.
Experiments are conducted on an 85A6B Mixture‑of‑Experts (MoE) model (85 B total parameters, 6 B active) pre‑trained on 14 T tokens. During mid‑training, the model is exposed to 10 B tokens of the selected high‑value CoT data (CoTP) mixed 1:2 with a general‑domain corpus (KnowEdu). A subsequent supervised fine‑tuning (SFT) stage on 60 k long‑CoT examples ensures the model can generate lengthy reasoning sequences, which is essential for fair comparison and for downstream reinforcement learning (RL).
Results show that the model trained with CoTP improves its accuracy on the AIME 2024 and 2025 benchmarks by 9.58% relative to the baseline, and that the upper bound of performance after RL fine‑tuning rises by 7.81%. Ablation studies demonstrate that both components of the dual‑granularity distance (pattern and entropy) and the λ weighting are crucial; removing either degrades performance substantially.
The paper highlights several strengths: (1) a clear theoretical link between reasoning potential and expected attempt count, (2) an automated yet human‑validated pipeline for extracting high‑value reasoning patterns, (3) a principled, globally optimal data selection method that balances abstract pattern similarity and token‑level information gain, and (4) strong empirical gains with relatively modest additional data. Limitations include reliance on manually curated core sets (which may be domain‑specific), computational cost of DTW and Hungarian assignment on very large pools, and the focus on mathematical reasoning without extensive validation on other domains.
Future work suggested includes extending the atomic pattern taxonomy to other fields (physics, chemistry, logical puzzles), employing meta‑learning to automate pattern importance estimation, and developing scalable approximations for online or streaming data selection. Overall, the study provides a compelling blueprint for systematically expanding the reasoning potential of foundation models through targeted, pattern‑aware data curation.