Reflective Confidence: An Efficient Reasoning Framework Based on Reflective Confidence Signals

Reading time: 5 minutes

📝 Abstract

Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks with techniques like Chain-of-Thought (CoT) and Self-Consistency. However, these ensemble methods, especially Self-Consistency which relies on multiple reasoning trajectories, often incur high computational overhead. To improve efficiency, researchers have leveraged internal confidence signals, where early stopping strategies such as DeepConf save resources by terminating low-confidence paths. Yet this discards incomplete paths and wastes computation. We introduce Reflective Confidence, a novel reasoning framework that transforms a low-confidence signal from a termination symbol into a reflection trigger. When confidence drops below a threshold, instead of stopping, the model generates a reflection prompt to analyze current reasoning, identify errors, and continue with a corrected trajectory. Experiments on mathematical reasoning benchmarks, including AIME 2025, show significant accuracy gains over advanced early stopping strategies at comparable cost, validating the efficiency of proactive correction over passive discarding.


📄 Content

In recent years, Large Language Models [1,2] (LLMs) have exhibited exceptional performance in domains requiring complex reasoning, such as mathematics, programming, and common-sense question answering. This success is largely attributable to the emergence of reasoning strategies like Chain-of-Thought [3,4] (CoT) and Self-Consistency [5]. The former elicits the reasoning potential of models by guiding them through step-by-step thinking, while the latter significantly enhances the robustness and accuracy of results by generating multiple independent reasoning paths and taking a majority vote. Despite their effectiveness, these methods often come at a substantial computational cost. This is especially true for Self-Consistency, where token consumption increases linearly with the number of sampled paths, severely limiting its deployment in practical applications.

To address the high computational cost, a primary research direction has been to utilize the model’s internal confidence signals to evaluate the quality of reasoning paths in real-time, thereby enabling more efficient inference. This has given rise to various optimization strategies [6], with early stopping mechanisms, such as DeepConf [7], being a prominent example. These methods monitor the model’s confidence during the generation process and prematurely terminate a reasoning path when its confidence falls below a certain threshold. This avoids expending further computational resources on low-quality paths, significantly reducing overall token consumption. However, this “passive discarding” strategy has inherent limitations: it treats all low-confidence paths as invalid and discards them outright, even though some may be temporarily “lost” due to a minor calculation error or logical deviation and possess significant potential for correction. This one-size-fits-all approach inherently wastes computational resources and misses opportunities to rectify potential errors and salvage partial reasoning results.

To this end, we propose “Reflective Confidence,” a novel reasoning framework whose core idea is to transform the low-confidence signal generated by the model from a negative “termination signal” into a positive “correction signal.” When our system detects that a reasoning path has insufficient confidence, it does not abandon it. Instead, it triggers an online self-correction mechanism. This mechanism dynamically constructs a “reflection prompt,” asking the model to review its own just-generated, potentially problematic reasoning steps and attempt self-correction. In this way, our method can proactively “rescue” reasoning paths that may be deviating, maximizing the value of each model inference.
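The diagnose-and-continue loop described above can be sketched as follows. All three callables (`generate_step`, `confidence_of`, `inject_prompt`), the threshold, and the reflection cap are illustrative assumptions for this sketch, not the paper's exact interfaces or prompt templates.

```python
def reflective_decode(generate_step, confidence_of, inject_prompt,
                      threshold=0.5, max_reflections=2):
    """Treat low confidence as a reflection trigger instead of a stop signal.

    Hypothetical stand-ins for the decoding loop:
      generate_step(trace) -> extends the reasoning trace by one step;
      confidence_of(trace) -> scores the current trajectory in [0, 1];
      inject_prompt(trace) -> appends a reflection prompt asking the model
                              to review and correct its recent steps.
    """
    trace, reflections = [], 0
    while not (trace and trace[-1] == "<answer>"):
        trace = generate_step(trace)
        if confidence_of(trace) < threshold and reflections < max_reflections:
            # Low confidence: diagnose and continue rather than discard.
            trace = inject_prompt(trace)
            reflections += 1
    return trace, reflections

# Toy run: confidence dips after step 2, triggering one reflection.
script = iter([0.9, 0.3, 0.8, 0.9])
steps = iter(["step1", "step2", "step3", "<answer>"])
trace, n_reflections = reflective_decode(
    generate_step=lambda t: t + [next(steps)],
    confidence_of=lambda t: next(script),
    inject_prompt=lambda t: t + ["<reflect>"],
)
```

In the toy run the path is salvaged in place: the reflection marker is inserted mid-trace and decoding continues to an answer instead of the path being thrown away.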
Our main contributions are as follows:

(1) We propose an online self-correction mechanism triggered in real time by the model’s internal signals, shifting reasoning from passive post-hoc filtering to active in-process correction.

(2) We introduce a new way to leverage the model’s own capabilities to revise reasoning trajectories, enabling it to act as a “self-censor” that performs real-time checks and corrections, guiding reasoning more intelligently.

(3) Experiments on challenging mathematical reasoning benchmarks show our method significantly outperforms early-stopping baselines in the trade-off between accuracy and computational efficiency.

Ensemble Methods in LLM Reasoning. Chain-of-Thought (CoT) prompting [8] reveals step-by-step reasoning but still risks single-path failure [9,10]. Self-Consistency (SC) [5] overcomes this by sampling K diverse chains and majority voting, inspiring extensions that adjust sampling temperature [11] or incorporate mixture-of-experts voting [12]. Yet all SC variants pay a linear token cost in K, limiting real-time deployment.
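The Self-Consistency baseline reduces to sampling K independent chains and majority-voting their final answers, which is also where the linear cost in K comes from. A minimal sketch, where `sample_fn` is a hypothetical stand-in for one stochastic CoT pass:

```python
from collections import Counter

def self_consistency(sample_fn, question, k=8):
    """Sample k independent reasoning chains and majority-vote the answers.

    `sample_fn(question)` is a hypothetical caller-supplied function that
    runs one stochastic CoT pass and returns its final answer string.
    Token cost grows linearly with k, since every chain is fully decoded.
    """
    answers = [sample_fn(question) for _ in range(k)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / k  # winning answer and its vote share

# Toy usage with a stub sampler: 5 of 8 chains agree on "42".
fake = iter(["42", "41", "42", "42", "17", "42", "42", "41"])
result, share = self_consistency(lambda q: next(fake), "toy question", k=8)
```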

Confidence-Based Inference Optimization. To cut that cost, recent work turns to the model’s own probabilities as a trust signal [13]. Self-Certainty [14] assigns a KL-based score to each finished chain for weighted voting; entropy pruning [15] removes low-confidence answers before voting. DeepConf [16] moves the check online: it monitors a sliding-window “group confidence” and stops paths whose score dips below a threshold. However, these methods still discard low-confidence trajectories, wasting partial computation that might be fixable.
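DeepConf's online check can be approximated as a sliding-window monitor over per-token confidence. In this sketch the group confidence is taken to be the mean token log-probability over the last `window` tokens; that scoring choice, the threshold, and `step_fn` are assumptions for illustration, and the paper's exact score may differ.

```python
from collections import deque

def generate_with_early_stop(step_fn, threshold=-1.5, window=16, max_tokens=256):
    """Decode token-by-token, stopping when sliding-window confidence dips.

    `step_fn()` is a hypothetical decoder step returning (token, logprob).
    Group confidence is approximated as the mean log-probability of the
    last `window` tokens; DeepConf's exact score may differ.
    """
    tokens, recent = [], deque(maxlen=window)
    for _ in range(max_tokens):
        token, logprob = step_fn()
        tokens.append(token)
        recent.append(logprob)
        if len(recent) == window and sum(recent) / window < threshold:
            return tokens, "stopped"  # path terminated early, partial work lost
    return tokens, "completed"

# Toy run: confidence is high for 20 tokens, then collapses.
logs = iter([-0.1] * 20 + [-3.0] * 10)
toks, status = generate_with_early_stop(
    lambda: ("tok", next(logs)), threshold=-1.5, window=4, max_tokens=30)
```

The `"stopped"` branch is exactly the "passive discarding" the paper criticizes: the partially decoded `tokens` are thrown away rather than repaired.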

Self-Correction and Reflection. Another line equips LLMs with self-repair [17]. Approaches such as R-CoT [18] and Self-Refine [19] ask the model to critique a complete draft, while Reflexion [20] and CriticGPT [21] rely on external execution traces or human labels. Our Reflective Confidence differs by triggering reflection during decoding, using the model’s intrinsic online confidence as the sole cue. A low score is treated as a “help request” that launches an immediate diagnose-and-continue prompt, salvaging compute and, as Section 4 shows, improving both accuracy and efficiency over passive discarding.

Our framework, Reflective Confidence, introduces a self-correction mechanism triggered in real time by the model’s internal confidence signals.

This content is AI-processed based on ArXiv data.
