Title: Rectifying LLM Thought from Lens of Optimization
ArXiv ID: 2512.01925
Date: 2025-12-01
Authors: Junnan Liu (Shanghai AI Laboratory, Monash University) – co-first author; Hongwei Liu (Shanghai AI Laboratory) – co-first author; Songyang Zhang (Shanghai AI Laboratory) – corresponding author; Kai Chen (Shanghai AI Laboratory) – corresponding author
📝 Abstract
Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
💡 Deep Analysis
📄 Full Content
Rectifying LLM Thought from Lens of Optimization
Junnan Liu1,2,*, Hongwei Liu1, Songyang Zhang1,† and Kai Chen1,†
1Shanghai AI Laboratory, 2Monash University
* Work done during Junnan's internship at Shanghai AI Laboratory. † Corresponding authors. Emails: junnan.liu@monash.edu; zhangsongyang@pjlab.org.cn
Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors. Code and data are available at https://github.com/open-compass/RePro.
1. Introduction
Recent advancements in large language models (LLMs) have been propelled by their emergent reasoning capabilities, enabling them to tackle complex tasks (Huang & Chang, 2023; Plaat et al., 2024; Ahn et al., 2024; Ke et al., 2025; Sun et al., 2025). These capabilities are pivotal in progressing toward artificial general intelligence (AGI) (Zhong et al., 2024). State-of-the-art LLMs, such as OpenAI's o-series (OpenAI, 2024a,b, 2025), DeepSeek-R1 (DeepSeek-AI et al., 2025), Kimi-K1 (Kimi-Team et al., 2025), and Gemini-2.5-Pro (Comanici et al., 2025), leverage long chain-of-thought (CoT) prompting to enhance reasoning. This approach facilitates comprehensive exploration and reflection, yielding robust reasoning processes (Chen et al., 2025a). Such improvements stem largely from reinforcement learning with verifiable rewards (RLVR) (Schulman et al., 2017; Shao et al., 2024), which enables LLMs to autonomously explore reasoning steps based on a terminal reward, fostering self-improving models with scalable reasoning during inference (Snell et al., 2024).
Despite these advancements, long-CoT LLMs often exhibit suboptimal reasoning behaviors (Chen et al., 2025a). A significant issue is overthinking, where models generate excessive tokens or protracted reasoning paths that contribute minimally to problem resolution, incurring substantial computational costs (Chen et al., 2024; Wang et al., 2025c; Sui et al., 2025). For instance, in response to a simple query like “What is the answer to 2 plus 3?” (Chen et al., 2024), certain long-CoT LLMs produce reasoning chains of thousands of tokens, increasing latency and resource demands and thus limiting applicability in time-sensitive domains (Sui et al., 2025).
Drawing on prior work (Feng et al., 2023; Huang et al., 2025a), we analyze suboptimal reasoning through an optimization framework, conceptualizing CoT as a task-specific variant of gradient descent, where each reasoning step represents an optimization update (Liu et al., 2025a). In this paradigm, suboptimal reasoning manifests as oscillations around saddle points or local optima, hindering convergence to the optimal solution.
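One hedged way to write this analogy down, in our own notation rather than the paper's, treats the partial reasoning after step t as a state s_t and uses an objective 𝒥 that measures distance from problem resolution:

```latex
% Illustrative rendering of the CoT-as-gradient-descent analogy
% (our notation; the paper's exact formulation may differ).
% s_t : reasoning state after step t, \eta_t : an implicit step size.
s_{t+1} \approx s_t - \eta_t \, \nabla \mathcal{J}(s_t)
\quad\Longrightarrow\quad
\mathcal{J}(s_{t+1}) \lesssim \mathcal{J}(s_t)
% A well-behaved trajectory drives J steadily downward; overthinking shows
% up as a long suffix of steps with J(s_{t+1}) \approx J(s_t), and
% oscillation around a saddle point or local optimum as sign-alternating
% changes in J.
```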
To address these challenges, we propose RePro (Rectifying Process-level Reward), a novel method to rectify LLM thought during post-training. RePro formulates a surrogate objective function, 𝒥, to monitor the optimization process of CoT, measuring the LLM's confidence in the ground truth via perplexity (Jelinek et al., 1977) over the ground-truth token sequence. For a reasoning trajectory of 𝑁 steps, we compute a sequence of objective values [𝒥0, 𝒥1, ..., 𝒥𝑁] and introduce a dual scoring system to assess optimization intensity and stability. These scores are combined into a composite process-level reward (Lightman et al., 2024), integrated into standard post-training pipelines (DeepSeek-AI et al., 2025; Shao et al., 2024; Hu, 2025) to enhance reasoning. RePro is plug-and-play, compatible with prevalent reinforcement learning algorithms.
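To make the dual scoring concrete, here is a minimal Python sketch under our own assumptions rather than the paper's exact definitions: the helper loglikelihood_fn (assumed to return the summed log-probability and token count of the ground-truth answer given a prefix), the relative-reduction intensity score, the monotonicity-based stability score, and the mixing weight alpha are all illustrative placeholders.

```python
import math
from typing import Callable, List, Sequence, Tuple

def objective_values(
    loglikelihood_fn: Callable[[str, str], Tuple[float, int]],
    prompt: str,
    steps: Sequence[str],
    answer: str,
) -> List[float]:
    """Surrogate objective J_i: perplexity of the ground-truth answer
    conditioned on the prompt plus the first i reasoning steps
    (lower J_i = higher confidence in the answer)."""
    values = []
    for i in range(len(steps) + 1):
        prefix = prompt + "".join(steps[:i])
        # loglikelihood_fn is assumed to return (sum of answer-token
        # log-probabilities, number of answer tokens) under the policy model.
        sum_logprob, num_tokens = loglikelihood_fn(prefix, answer)
        values.append(math.exp(-sum_logprob / num_tokens))
    return values  # [J_0, J_1, ..., J_N]

def process_level_reward(values: Sequence[float], alpha: float = 0.5) -> float:
    """Combine an intensity score (overall reduction of J) with a stability
    score (fraction of non-increasing updates) into one scalar in [0, 1].
    The paper's exact scoring and aggregation may differ."""
    j0, jn = values[0], values[-1]
    intensity = max(0.0, (j0 - jn) / j0)            # relative perplexity drop
    diffs = [b - a for a, b in zip(values, values[1:])]
    stability = sum(d <= 0 for d in diffs) / max(len(diffs), 1)
    return alpha * intensity + (1.0 - alpha) * stability
```

In an RLVR pipeline, such a process-level score would then be blended with the verifiable outcome reward for each sampled trajectory; the convex combination above is only a stand-in for the aggregation RePro actually uses.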
The efficacy of RePro is substantiated by comprehensive empirical evaluation. We validate RePro through extensive experiments using rein