Adaptive Test-Time Compute Allocation via Learned Heuristics over Categorical Structure
Test-time computation has become a primary driver of progress in large language model (LLM) reasoning, but it is increasingly bottlenecked by expensive verification. In many reasoning systems, a large fraction of verifier calls are spent on redundant or unpromising intermediate hypotheses. We study reasoning under a \emph{verification-cost-limited} setting and ask how verification effort should be allocated across intermediate states. We propose a state-level selective verification framework that combines (i) deterministic feasibility gating over a structured move interface, (ii) pre-verification ranking using a hybrid of learned state-distance and residual scoring, and (iii) adaptive allocation of verifier calls based on local uncertainty. Unlike solution-level best-of-$N$ or uniform intermediate verification, our method distributes verification where it is most informative. On the \textsc{MATH} benchmark, our approach achieves higher accuracy than best-of-$N$, majority voting, and beam search while using 44% fewer verifier calls.
💡 Research Summary
The paper tackles the growing bottleneck of verification cost in test‑time reasoning with large language models (LLMs), especially in multi‑step symbolic domains such as competition mathematics. While prior work has shown that increasing test‑time compute (e.g., best‑of‑N sampling, self‑consistency, tree search) improves accuracy, it treats verification as a uniform expense and often wastes calls on redundant or low‑utility intermediate hypotheses. The authors formalize a "verification‑cost‑limited" setting in which the dominant resource is the number of expensive verifier invocations, and they ask how this budget should be allocated across intermediate reasoning states.
Their solution is a three‑stage “state‑level selective verification” pipeline:
- Deterministic feasibility gating – Each candidate step is expressed as a structured operator (op, args). Two cheap deterministic predicates filter out moves that (a) violate structural constraints (e.g., are unparsable or use symbols out of scope) or (b) conflict with an explicit constraint context ℓ (e.g., type mismatches, contradictory bindings). No verifier calls are spent on these obviously invalid moves.
- Hybrid pre‑verification scoring – For each remaining candidate, the score h(w,m) = D_type(w′, w★) + r_θ(w,m) is computed. D_type is a latent structural distance between the post‑move state w′ and the goal state w★, obtained from frozen LLM embeddings (or a learned distance function), and r_θ is a learned residual scorer that predicts how promising a candidate is under a limited verifier budget. The residual is trained from exploration logs: at each visited state the system records the gated candidate set and the binary verifier labels V(w,m), and a pairwise logistic ranking loss pushes verifier‑accepted moves ahead of rejected ones, optionally enriched with a weak cost‑to‑go signal derived from the remaining steps of successful trajectories.
- State‑conditional verifier budget allocation – Using a local uncertainty proxy (e.g., the variance of candidate scores, the gap between the top‑k candidates, or a learned uncertainty estimator), the policy decides a per‑state verification budget k(w). Ambiguous branching points receive more verifier calls, while clear decisions receive few or none. This adaptive allocation concentrates expensive verification where it yields the highest marginal information gain.
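The first stage, deterministic feasibility gating, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the `Move` dataclass, the operator schema table, and the `bind`-style constraint check are hypothetical stand-ins for the structured (op, args) interface and the constraint context ℓ described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Move:
    op: str      # operator name, e.g. "bind" or "simplify" (illustrative)
    args: tuple  # operator arguments

def structurally_valid(move: Move, known_ops: dict) -> bool:
    """Predicate (a): the move names a known operator and matches its arity."""
    schema = known_ops.get(move.op)
    return schema is not None and len(move.args) == schema["arity"]

def consistent_with_context(move: Move, bindings: dict) -> bool:
    """Predicate (b): the move does not contradict an existing binding in the
    constraint context (modeled here as a symbol -> value mapping)."""
    if move.op == "bind":
        sym, val = move.args
        return bindings.get(sym, val) == val
    return True

def gate(candidates, known_ops, bindings):
    """Keep only moves passing both cheap deterministic predicates;
    no verifier calls are spent on the rejected ones."""
    return [m for m in candidates
            if structurally_valid(m, known_ops)
            and consistent_with_context(m, bindings)]
```

For example, with the context `{"x": 3}`, a candidate `Move("bind", ("x", 5))` is rejected before any verifier call, while `Move("bind", ("x", 3))` survives to the scoring stage.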
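The second stage's hybrid score and its pairwise training loss can be written out concretely. The sketch below assumes cosine distance over frozen embeddings as the particular choice of D_type (the paper also allows a learned distance), and uses the convention that lower h means a more promising move; function names are illustrative.

```python
import numpy as np

def hybrid_score(emb_next: np.ndarray, emb_goal: np.ndarray,
                 residual: float) -> float:
    """h(w, m) = D_type(w', w*) + r_theta(w, m); lower is more promising.
    D_type is taken to be cosine distance between the post-move state
    embedding and the goal state embedding (one plausible choice)."""
    cos = float(emb_next @ emb_goal /
                (np.linalg.norm(emb_next) * np.linalg.norm(emb_goal)))
    return (1.0 - cos) + residual

def pairwise_logistic_loss(h_accepted: float, h_rejected: float) -> float:
    """Ranking loss on one (verifier-accepted, verifier-rejected) pair of
    gated candidates. Because lower h is better, the loss shrinks as
    h_accepted drops below h_rejected: log(1 + exp(h_acc - h_rej))."""
    return float(np.log1p(np.exp(h_accepted - h_rejected)))
```

Training would sum this loss over all accepted/rejected pairs recorded at each visited state in the exploration logs, back-propagating only into the residual scorer r_θ (the embedding encoder stays frozen).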
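Finally, the state-conditional budget rule k(w) admits a simple form. The version below uses the gap between the two best candidate scores as the uncertainty proxy (scores follow a lower-is-better convention) and a linear interpolation between a minimum and maximum budget; the threshold and budget bounds are illustrative hyperparameters, not values from the paper.

```python
def verifier_budget(scores, k_min=0, k_max=8, gap_threshold=0.5):
    """Map a local uncertainty proxy to a per-state verifier budget k(w).
    A wide gap between the two best (lowest) scores signals a clear
    decision -> few or no verifier calls; a narrow gap signals an
    ambiguous branch point -> spend more of the budget there."""
    if len(scores) < 2:
        return k_min  # nothing to disambiguate
    best, second = sorted(scores)[:2]
    gap = second - best
    if gap >= gap_threshold:
        return k_min  # confident: skip (or minimize) verification
    # Smaller gap -> larger budget, interpolated linearly.
    frac = 1.0 - gap / gap_threshold
    return k_min + round(frac * (k_max - k_min))
```

Summed over a trajectory, these per-state budgets realize the matched total-verifier-call comparison used in the evaluation: calls saved at clear decisions are reallocated to ambiguous branch points.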
The authors evaluate the method on the MATH benchmark, a standard testbed for long‑horizon symbolic reasoning. Under matched total verifier‑call budgets, their approach outperforms best‑of‑N sampling, majority voting, and beam search, achieving higher accuracy while using 44% fewer verifier calls. The gains are especially pronounced on harder problems, where intermediate decision points vary widely in difficulty. The paper also provides detailed implementation notes: how to generate the structured move interface, how to construct the deterministic gates, the architecture of the residual scorer (frozen LLM encoder + small MLP), and the data pipeline for collecting verifier‑labeled candidate lists.
In summary, the contribution is threefold: (1) a formal cost model that treats verifier invocations as the primary resource, (2) a practical gated‑competition policy that combines deterministic feasibility checks, learned pre‑verification ranking, and uncertainty‑driven budget allocation, and (3) empirical evidence that selective, state‑level verification dramatically improves the accuracy‑cost frontier on a challenging mathematical reasoning benchmark. The work opens a path toward cost‑aware LLM reasoning systems that can be deployed under strict latency or compute constraints without sacrificing performance.