Reasoning with LLMs increasingly unfolds inside a broader verification loop. Internally, systems use cheap checks, such as self-consistency or proxy rewards, which we call weak verification. Externally, users inspect outputs and steer the model through feedback until results are trustworthy, which we call strong verification. These signals differ sharply in cost and reliability: strong verification can establish trust but is resource-intensive, while weak verification is fast and scalable but noisy and imperfect. We formalize this tension through weak–strong verification policies, which decide when to accept or reject based on weak verification and when to defer to strong verification. We introduce metrics capturing incorrect acceptance, incorrect rejection, and strong-verification frequency. At the population level, we show that optimal policies admit a two-threshold structure and that calibration and sharpness govern the value of weak verifiers. Building on this, we develop an online algorithm that provably controls acceptance and rejection errors without assumptions on the query stream, the language model, or the weak verifier.
Recent advances in the reasoning capabilities of large language models have been driven not only by larger models or increased test-time computation, but critically by the incorporation of verification into the inference process. In practice, verification arises on two complementary fronts:
On the user side, model outputs are treated as high-potential candidates that are subject to careful evaluation. This evaluation may involve inspecting outputs line by line or, depending on the domain, testing them externally in the real world. Crucially, this process is informed by context, preferences, and domain knowledge that may extend beyond what can be captured textually. While this form of verification enforces the highest level of trust, it is inherently costly; we refer to it as strong verification.
LLM reasoning and verification. Recent progress in LLM reasoning has been driven largely along two axes. First, a large body of work focuses on improving the reasoning process at inference time, including structured prompting [29,33,34], search [31,32], decoding strategies [23,30], as well as training approaches that elicit longer reasoning chains [22,27]. These methods treat the weak verification signal as fixed and focus on structuring reasoning traces to improve final performance.
Second, complementary work improves the weak verification signal itself, including LLM-as-judge evaluation [13], specialized verifiers [19,24], judge-time scaling [11,20], and process-reward models [26,28,35]. Our work is orthogonal to both: we take the weak verifier and reasoning procedure as fixed, and study the layer above, orchestrating when to trust the weak signal and when to invoke costly strong verification. This framework applies to any reasoning procedure (single pass, iterative refinement, or tree search) and any scoring model. To the best of our knowledge, this interaction has not been explicitly formulated or analyzed.
Selective prediction and learning to defer. Algorithmically, our setup relates to selective prediction and learning-to-defer (L2D). Early work established theoretical frameworks for classification with a reject option, posing the problem as risk minimization with explicit rejection costs [2,6]. Rather than fixing confidence thresholds post hoc, subsequent work learns when to abstain as part of training [4,7,8], with extensions to the online setting [7]. The L2D literature extends selective prediction to human-AI collaboration, studying the optimal division of labor between model and expert [14,15,16,25]. Our setting can be viewed as an instance of L2D, where deferral means invoking strong verification. The combination of distribution-free online calibration, partial feedback, and separate Type-I/II error control, together with the algorithmic techniques we develop, may be of independent interest to the broader umbrella of L2D.
We consider a general verification-guided reasoning setting involving a language model and two sources of verification.
Language model and verification oracles. Let P ∈ 𝒫 denote a problem instance or prompt that requires reasoning. Let f : 𝒫 → ℛ denote a language model that, given P, generates a (possibly random) response R := f(P).
We consider two forms of verification. First, let g : 𝒫 × ℛ → {0, 1} denote a strong verification oracle, which outputs a binary judgment indicating whether a response is correct. This oracle represents the strongest form of verification available, such as human inspection or domain-specific execution, and serves as the ultimate criterion against which reasoning outcomes are evaluated.
Second, let w : 𝒫 × ℛ → [0, 1] denote a weak verification oracle, which assigns a real-valued score to a prompt–response pair, aiming to approximate strong verification, for example through proxy rewards or domain-specific tools. The continuous nature of w reflects uncertainty: it provides a confidence signal, with larger values indicating greater confidence in correctness.
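For concreteness, the three maps f, g, and w can be viewed as callables with the signatures below. This is only a minimal Python sketch of the interfaces implied by the definitions above; the protocol names are chosen here for illustration and are not part of the formal setup.

```python
from typing import Protocol


class LanguageModel(Protocol):
    """The map f: given a prompt P, produce a (possibly random) response R = f(P)."""
    def __call__(self, prompt: str) -> str: ...


class StrongVerifier(Protocol):
    """The map g: a costly binary judgment of whether response R correctly answers
    prompt P, e.g. human inspection or domain-specific execution."""
    def __call__(self, prompt: str, response: str) -> bool: ...


class WeakVerifier(Protocol):
    """The map w: a cheap score in [0, 1] approximating g, e.g. a proxy reward model;
    larger values indicate greater confidence that the response is correct."""
    def __call__(self, prompt: str, response: str) -> float: ...
```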
Stream of queries and responses. We assume there exists an arbitrary and unknown stream of queries. At each time step t = 1, 2, . . ., the language model receives a query P_t and produces a response R_t = f(P_t). We use the notation w_t := w(P_t, R_t) and g_t := g(P_t, R_t).
We place no assumptions on how the stream {P_t}_{t≥1} is generated. In particular, queries may be independent user prompts, intermediate reasoning steps, or any combination thereof, and the stream may depend arbitrarily on past verification outcomes. This modeling is flexible enough to capture a range of reasoning strategies. For example:
- Each P_t may correspond to a user prompt and each R_t to a full model output. This resembles the strategy known as output reward modeling in the literature [3].
- In step-by-step reasoning, each P_t may consist of a prompt together with the partial solution so far, and each R_t corresponds to a single reasoning step. This resembles process reward modeling in the literature [12].
In both cases, rejec
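To make the streaming setup concrete, the sketch below instantiates it with a simple two-threshold rule over the weak score: accept when the score is high, reject when it is low, and defer to costly strong verification in between, mirroring the two-threshold structure mentioned in the abstract. The specific threshold values and the generator interface are illustrative assumptions; the online algorithm developed in the paper controls acceptance and rejection errors without fixing thresholds in advance.

```python
def verification_loop(stream, f, w, g, tau_reject=0.2, tau_accept=0.9):
    """Minimal sketch of a two-threshold weak-strong verification policy.

    The threshold values (tau_reject, tau_accept) are hypothetical placeholders
    chosen for illustration; they are not the paper's algorithm, which adapts
    its decisions online to control acceptance and rejection errors.
    """
    for P_t in stream:                    # arbitrary, possibly adaptive query stream
        R_t = f(P_t)                      # model response R_t = f(P_t)
        w_t = w(P_t, R_t)                 # cheap weak-verification score in [0, 1]
        if w_t >= tau_accept:             # high confidence: accept on the weak signal alone
            accept, deferred = True, False
        elif w_t <= tau_reject:           # low confidence: reject on the weak signal alone
            accept, deferred = False, False
        else:                             # uncertain region: defer to costly strong verification
            accept, deferred = bool(g(P_t, R_t)), True
        yield P_t, R_t, w_t, accept, deferred
```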