Scaling Agentic Verifier for Competitive Coding

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either the difficulty of test-case generation or the inefficiency of random input sampling. To address these limitations, we propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs that expose behavioral discrepancies among candidate solutions. Through multi-turn interaction with code execution environments, the verifier iteratively refines the candidate input generator and produces targeted counterexamples rather than blindly sampling inputs. We train the verifier to acquire this discriminative input generation capability via a scalable pipeline combining large-scale data synthesis, rejection fine-tuning, and agentic reinforcement learning. Extensive experiments across five competitive programming benchmarks demonstrate consistent improvements over strong execution-based baselines, achieving up to +10–15% absolute gains in Best@K accuracy. Further analysis reveals clear test-time scaling behavior and highlights the verifier’s broader potential beyond reranking.


💡 Research Summary

The paper tackles a persistent problem in competitive programming: even state‑of‑the‑art large language models (LLMs) often fail to produce a correct solution on the first try. A common remedy is to generate many candidate programs and then use an execution‑based verifier to pick the best one (the “best‑of‑N” approach). Existing verification methods either require full test cases (inputs + expected outputs), which are as hard to generate as solving the problem itself, or rely on randomly sampled inputs. Random input generation is inefficient because the space of valid inputs is huge while only a tiny fraction actually distinguishes a correct solution from an incorrect one.
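The inefficiency of random sampling can be made concrete with a toy illustration (not from the paper): two candidate solutions that differ only on single-element inputs. Under a uniform random input distribution over lengths 1–100, only about 1% of samples ever expose the discrepancy:

```python
import random

# Hypothetical toy problem: largest gap between consecutive elements in
# sorted order, defined as 0 for a single-element list.
def correct(xs):
    s = sorted(xs)
    return max((b - a for a, b in zip(s, s[1:])), default=0)

def buggy(xs):
    # Identical except for the missing default: crashes when len(xs) == 1.
    s = sorted(xs)
    return max(b - a for a, b in zip(s, s[1:]))

def distinguishes(xs):
    try:
        return buggy(xs) != correct(xs)
    except ValueError:
        return True  # a crash also counts as a behavioral discrepancy

random.seed(0)
trials = 10_000
hits = sum(
    distinguishes([random.randint(0, 100) for _ in range(random.randint(1, 100))])
    for _ in range(trials)
)
print(f"{hits / trials:.2%} of random inputs expose the bug")
```

Roughly 99% of the sampled inputs are wasted on behavior both candidates already agree on, which is exactly the inefficiency the paper targets.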

To overcome this bottleneck, the authors introduce Agentic Verifier, an LLM‑driven agent that actively searches for highly discriminative test inputs. Given a problem description Q and a pair of candidate solutions (Cₐ, C_b), the verifier engages in a multi‑turn dialogue with a sandboxed code execution environment. In each turn it can (1) propose an input generator program G, (2) execute the candidate solutions on generated inputs, (3) inspect execution outcomes (return values, errors, timeouts), and (4) refine its hypothesis about where the solutions diverge. After a few interaction steps it outputs a final generator G; running G yields a concrete test input x. If x is valid (satisfies the problem’s input constraints) and causes different outputs for Cₐ and C_b, the verifier succeeds. The goal is to learn a policy that maximizes the probability of producing at least one such distinguishing input per problem instance.
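The verifier's success criterion can be sketched in a few lines. The helper names below (`run`, `is_valid`) are hypothetical; candidates are assumed to be self-contained Python programs reading from stdin, and `is_valid` is a per-problem input-constraint checker:

```python
import sys
import subprocess

def run(solution_src, test_input, timeout=2.0):
    """Execute a candidate (Python source) on a test input, capturing the outcome."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", solution_src],
            input=test_input, capture_output=True, text=True, timeout=timeout,
        )
        return ("error", proc.stderr) if proc.returncode else ("ok", proc.stdout)
    except subprocess.TimeoutExpired:
        return ("timeout", "")

def verifier_succeeds(generator_src, cand_a, cand_b, is_valid):
    """Success check: the generator's input must satisfy the problem's
    constraints and must produce different outcomes for the two candidates."""
    status, x = run(generator_src, "")
    if status != "ok" or not is_valid(x):
        return False
    return run(cand_a, x) != run(cand_b, x)
```

Note that divergence includes not just differing outputs but also errors and timeouts, matching the execution outcomes the verifier inspects during its search.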

Training this agentic verifier requires a large, structured dataset. The authors first collect competitive‑programming problems from public datasets and online judges, extracting both correct reference solutions and a variety of incorrect submissions. For problems lacking official test cases they synthesize high‑quality inputs by prompting an LLM to generate candidates, executing them on multiple reference solutions, and retaining only those inputs that achieve high agreement across references (a consensus filter). They then select inputs that reject many incorrect solutions while preserving agreement among correct ones, ending up with roughly 60 test cases per problem. Using these test suites they generate many candidate solutions with Qwen3‑30B‑Thinking‑2507 and label each as correct/incorrect based on execution results. From each problem they construct pairs (Cₐ, C_b) such that at least one test case distinguishes the pair, guaranteeing a feasible discriminative input for training.
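The consensus filter and discriminativeness ranking described above can be sketched as follows, assuming a hypothetical `run(solution, x)` executor that returns a solution's output on input `x`:

```python
def consensus_filter(inputs, references, run):
    """Keep only inputs on which every reference solution agrees, pairing
    each surviving input with the agreed expected output."""
    kept = []
    for x in inputs:
        outs = {run(ref, x) for ref in references}
        if len(outs) == 1:
            kept.append((x, outs.pop()))
    return kept

def rank_by_rejection(cases, incorrect, run):
    """Prefer test cases that reject the most known-incorrect submissions."""
    def rejected(case):
        x, expected = case
        return sum(run(bad, x) != expected for bad in incorrect)
    return sorted(cases, key=rejected, reverse=True)
```

Taking the top-ranked cases from such a pipeline would yield a compact suite (roughly 60 per problem here) that is both trustworthy, via reference consensus, and discriminative, via rejection counts.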

The pipeline consists of two training stages followed by an inference procedure:

  1. Rejection Fine‑Tuning – A strong teacher model (Qwen3‑235B‑A22B‑Thinking‑2507) samples diverse multi‑turn interaction trajectories for each (Q, Cₐ, C_b) tuple. Only trajectories that ultimately produce a valid distinguishing input are kept; failed attempts are rejected. About 60K successful trajectories are collected and used for supervised fine‑tuning of a smaller base model (Qwen3‑30B‑A3B). The model learns to predict the sequence of tool calls and the final generator program, effectively acquiring the teacher's interaction patterns.

  2. Agentic Reinforcement Learning – Building on the fine‑tuned policy, the authors apply a reinforcement‑learning step to further improve discriminative power. The reward is sparse: +1 if the final generated input is both valid and distinguishes the two solutions, 0 otherwise. Policy optimization (e.g., PPO) is performed on rollouts that consist of the full multi‑turn dialogue plus the final generator. This step encourages the agent to discover more subtle corner cases that random or heuristic strategies would miss.

  3. Inference – At test time, for each problem the main code‑generation model (the “policy model”) samples N candidate solutions (N = 64 in the experiments). The verifier receives the problem and each pair of candidates, runs its multi‑turn search, and outputs a set of input generators. The generated inputs are executed on all candidates, and a voting scheme based on output agreement selects the final solution.
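The summary does not spell out the exact voting rule; one common agreement-based scheme, clustering candidates by their outputs on the generated inputs and picking a representative of the largest cluster, can be sketched as follows (`run` is again a hypothetical executor):

```python
from collections import Counter

def select_by_agreement(candidates, test_inputs, run):
    """Group candidates by their output signature on the generated inputs,
    then return a representative of the largest agreement cluster."""
    signatures = [tuple(run(c, x) for x in test_inputs) for c in candidates]
    winning_sig, _ = Counter(signatures).most_common(1)[0]
    return candidates[signatures.index(winning_sig)]
```

The intuition is that incorrect solutions tend to fail in different ways on discriminative inputs, while correct solutions all agree, so the largest output cluster is likely to contain a correct candidate.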

Experimental Evaluation
The authors evaluate on five competitive‑programming benchmarks covering a range of difficulty: USACO, LiveCodeBench (v6), OJBench, and two additional datasets. The policy model is Qwen3‑30B‑Thinking‑2507. For each benchmark they compare three sources of test inputs: (a) random inputs generated by prompting the model, (b) the original benchmark test cases (ground truth), and (c) inputs generated by the trained Agentic Verifier. Results are reported as Best@K accuracy as the number of test inputs M varies (4, 8, 16, 32, 64).

Key findings:

  • Random inputs improve performance only modestly; even with 64 random inputs they lag behind using just 8–16 ground‑truth cases.
  • The verifier‑generated inputs consistently outperform random inputs by a large margin (≈15–20% absolute gain). On USACO they even surpass the ground‑truth suite, indicating that the learned generator can discover more discriminative corner cases than the original test set.
  • Performance scales smoothly with the number of generated inputs, confirming the test‑time scaling property: more computation yields better selection.
  • Additional analysis shows that many benchmark test suites are incomplete; they sometimes accept incorrect solutions (false positives). The verifier can expose these hidden failures by generating targeted counterexamples, suggesting a broader role as a correctness‑checking tool.

Limitations and Future Work
The current implementation supports only Python and C++ and relies on a relatively heavyweight multi‑turn interaction, which may be costly for real‑time deployment. The authors propose future directions such as (i) extending to more languages, (ii) designing more efficient tool‑call strategies (e.g., limiting the number of execution rounds), (iii) integrating human‑curated corner cases with the synthetic data, and (iv) exploring hybrid verification pipelines that combine the verifier with static analysis or formal methods.

Contributions Summary

  1. Empirical demonstration that random input generation is highly inefficient for input‑only execution‑based voting in competitive programming.
  2. Introduction of an agentic verifier that actively generates highly discriminative test inputs via multi‑turn interaction with a code execution sandbox.
  3. A scalable training pipeline combining large‑scale data synthesis, rejection fine‑tuning, and agentic reinforcement learning.
  4. Extensive experiments showing consistent improvements across multiple benchmarks, clear test‑time scaling, and robustness to problem difficulty.

Overall, the paper presents a compelling approach to augment LLM‑based code generation with an intelligent, execution‑driven verifier that learns to hunt for the most informative test inputs, thereby turning test‑time computation into a powerful lever for improving solution quality in competitive programming.

