Characterizing, Evaluating, and Optimizing Complex Reasoning
Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high-quality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals for reasoning optimization. To address these challenges, we provide a unified perspective. (1) We introduce the ME$^2$ principle to characterize reasoning quality at the macro and micro levels with respect to efficiency and effectiveness. (2) Building on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG-based pairwise evaluation method that captures complex reasoning structures. (3) Based on this method, we construct the TRM-Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal. At test time, selecting better reasoning leads to better outcomes (up to 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to 3.9% gain) across diverse tasks.
💡 Research Summary
The paper tackles three fundamental challenges in the development of Large Reasoning Models (LRMs): defining high‑quality reasoning, reliably evaluating long and implicitly structured reasoning traces, and leveraging such evaluations to improve reasoning at scale. To address these, the authors introduce a unified framework built around the ME² principle, a DAG‑based representation of reasoning, and a pairwise preference‑trained Thinking Reward Model (TRM).
ME² Principle
ME² characterizes reasoning quality along two orthogonal axes—macro versus micro granularity and efficiency versus effectiveness—yielding four sub‑criteria: macro‑efficiency (avoiding unnecessary branching or re‑checking), macro‑effectiveness (maintaining logical coherence and alignment with the problem at a global level), micro‑efficiency (concise, non‑redundant steps), and micro‑effectiveness (local correctness, consistency, and sound computation). This multidimensional view moves beyond prior work that focuses solely on step‑wise correctness, verbosity, or simple structural signals.
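The two-by-two structure of ME² can be sketched as a simple rubric. The names, data structure, and unweighted-mean aggregation below are illustrative assumptions for exposition, not the paper's implementation.

```python
from dataclasses import dataclass

# Illustrative rubric: the four ME^2 sub-criteria as (granularity, axis) pairs.
ME2_CRITERIA = {
    ("macro", "efficiency"): "avoids unnecessary branching or re-checking",
    ("macro", "effectiveness"): "globally coherent and aligned with the problem",
    ("micro", "efficiency"): "concise, non-redundant steps",
    ("micro", "effectiveness"): "locally correct, consistent, sound computation",
}

@dataclass
class ME2Score:
    """Per-trace scores on the four sub-criteria (higher is better)."""
    macro_efficiency: float
    macro_effectiveness: float
    micro_efficiency: float
    micro_effectiveness: float

    def overall(self) -> float:
        # Unweighted mean; the actual aggregation is a design choice.
        return (self.macro_efficiency + self.macro_effectiveness
                + self.micro_efficiency + self.micro_effectiveness) / 4.0
```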
DAG‑Based Structuring
The authors propose a scalable pipeline to convert free‑form reasoning traces into directed acyclic graphs (DAGs). First, raw text is split on double newlines, then refined using high‑frequency step prefixes to obtain stable atomic steps. For each step, a candidate parent pool is constructed from the current main branch and a few representative branch endpoints; an LLM is prompted to select the most semantically appropriate parents, establishing directed edges. Linear chains are collapsed into super‑nodes, yielding a compact DAG that captures progression, branching, and merging while preserving a topological order aligned with generation order.
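The pipeline above can be sketched in miniature. The step splitter below uses only blank-line splitting (a stand-in for the paper's prefix-based refinement), the candidate pool is simplified to all earlier steps rather than the main branch plus branch endpoints, and `select_parents` stands in for the LLM call; all of these are assumptions for illustration.

```python
import re

def split_steps(trace: str) -> list[str]:
    """Split a free-form trace into atomic steps on blank lines
    (a simplified stand-in for the prefix-based refinement)."""
    return [s.strip() for s in re.split(r"\n\s*\n", trace) if s.strip()]

def build_dag(steps, select_parents):
    """Build a parent list per step. `select_parents(step, pool)` stands in
    for the LLM prompt that picks semantically appropriate parents from a
    candidate pool; here the pool is simply all earlier step indices."""
    parents = {0: []}
    for i in range(1, len(steps)):
        pool = list(range(i))
        parents[i] = select_parents(steps[i], pool)
    return parents

def collapse_chains(parents):
    """Collapse linear chains (a node with one parent whose only child is
    this node) into super-nodes; returns node -> super-node representative."""
    children = {i: [] for i in parents}
    for node, ps in parents.items():
        for p in ps:
            children[p].append(node)
    rep = {}
    for node in sorted(parents):
        ps = parents[node]
        if len(ps) == 1 and len(children[ps[0]]) == 1:
            rep[node] = rep[ps[0]]  # extend the parent's chain
        else:
            rep[node] = node        # chain boundary: branch or merge point
    return rep
```

Because each step may only choose parents among earlier steps, the resulting graph is acyclic by construction, preserving a topological order aligned with generation order.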
Pairwise Evaluation
Given two reasoning DAGs, the framework creates complementary abstractions: a macro‑level textual summary of each super‑node’s intent and role, and a micro‑level extraction of the dominant path (the sequence of steps most directly supporting the final answer). These abstractions are fed into a Bradley‑Terry model that learns to predict which trace is superior according to the ME² criteria. The pairwise setup mitigates the noise and bias of absolute scoring and scales well to large datasets.
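The micro-level abstraction can be approximated by tracing back from the final-answer node through the DAG. The tie-breaking rule below (follow the most recent parent) is a heuristic assumption; the paper's extraction of the dominant path may use richer semantic criteria.

```python
def dominant_path(parents, final):
    """Walk back from the final-answer node, following one parent per step
    (heuristic: the most recent parent), to approximate the dominant path,
    i.e. the step sequence most directly supporting the final answer."""
    path = [final]
    while parents[path[-1]]:
        path.append(max(parents[path[-1]]))  # most recent parent
    return list(reversed(path))
```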
TRM‑Preference Dataset and Thinking Reward Model
The authors collect over 200k human‑annotated preference pairs where the answer is held constant but one reasoning trace is judged higher quality. This dataset, called TRM‑Preference, is used to train a lightweight Transformer‑based Thinking Reward Model (TRM) with a Bradley‑Terry loss. TRM outputs a scalar “thinking reward” for any reasoning trace, reflecting both structural and local content quality while remaining orthogonal to answer correctness.
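The Bradley‑Terry objective itself is standard: given scalar scores for a preferred and a rejected trace, the model maximizes the log-likelihood of the observed preference. A minimal scalar sketch (the actual TRM computes these scores with a Transformer over the trace):

```python
import math

def bt_preference_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that trace A is preferred over trace B,
    given scalar quality scores treated as logits: sigmoid(s_a - s_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def bt_pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood of the observed preference; minimized by
    pushing the preferred trace's score above the rejected one's."""
    return -math.log(bt_preference_prob(score_preferred, score_rejected))
```

Pairwise supervision of this form sidesteps the calibration problems of absolute scoring: only the score margin between the two traces matters.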
Empirical Findings
- Test‑time selection: When multiple candidate traces are generated for a query, selecting the one with the highest TRM score yields average accuracy gains of 12.7% and up to 19.3% across math, code, and commonsense tasks compared to a baseline that picks the first or a randomly chosen trace.
- Training‑time optimization: Incorporating the TRM reward into reinforcement learning (RL) as an additional signal (alongside traditional outcome rewards) improves final performance by 2.4% on average, with a maximum of 3.9% on the hardest benchmarks. The RL‑augmented models also produce more concise and logically coherent reasoning traces, as measured by downstream human evaluations.
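Both usage modes reduce to a few lines. Below, `trm_score` stands in for a call to the trained TRM, and the additive reward combination with coefficient `alpha` is an illustrative assumption (the paper does not necessarily use this exact form):

```python
def select_best_trace(traces, trm_score):
    """Test-time best-of-N: return the candidate reasoning trace with the
    highest thinking reward. `trm_score` is a stand-in for the TRM."""
    return max(traces, key=trm_score)

def shaped_reward(outcome_reward: float, thinking_reward: float,
                  alpha: float = 0.1) -> float:
    """Training-time signal: combine the outcome reward with the TRM
    thinking reward. The additive form and alpha are assumptions."""
    return outcome_reward + alpha * thinking_reward
```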
Contributions and Impact
The paper delivers four key contributions: (i) a principled, multidimensional definition of reasoning quality (ME²), (ii) a practical method to abstract arbitrary reasoning into DAGs, (iii) a robust pairwise evaluation scheme that simultaneously captures macro‑ and micro‑level aspects, and (iv) a scalable reward model (TRM) that can be used both for test‑time trace selection and for RL‑based training. By demonstrating that reasoning quality—independent of answer correctness—can serve as an effective optimization target, the work opens a new research direction focused on “process‑centric” improvement of LRM systems. The released code and data further enable the community to build upon this framework, potentially extending it to domains such as scientific discovery, legal reasoning, and complex planning where transparent, efficient, and effective reasoning processes are paramount.