The Limits of Inference Scaling Through Resampling
Recent research has generated hope that inference scaling, such as resampling solutions until they pass verifiers like unit tests, could allow weaker models to match stronger ones. Beyond inference, this approach also enables training reasoning models, where data is curated using rejection sampling against a verifier. However, we show that this approach is fundamentally limited when verifiers are imperfect and have a non-zero probability of producing false positives. Resampling cannot decrease this probability, so it imposes an upper bound on the accuracy of resampling-based inference scaling, regardless of compute budget. Our analysis shows a strong correlation between a model’s single-sample accuracy and its false positive rate on HumanEval and MBPP, whose unit tests have limited coverage. Therefore, no amount of inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model. Empirical results show that the optimal number of sampling attempts is often fewer than 10, as the negative utility of false positives outweighs the benefits, bending inference scaling curves downward. Finally, false positives may have other undesirable qualities, such as poor adherence to coding style conventions.
💡 Research Summary
The paper investigates the fundamental limits of inference‑time scaling through repeated sampling (resampling) when the verifier used to accept or reject candidate outputs is imperfect. The authors focus on coding tasks where unit tests serve as verifiers, but these tests have limited coverage and therefore a non‑zero false‑positive rate (the probability that an incorrect solution passes the tests).
Theoretical analysis
Let p_correct be the probability that a model generates a truly correct solution on a given problem, and let p_fp be the probability that the verifier accepts an incorrect solution (a false positive). Assuming correct solutions always pass the verifier, the probability that any single sample is accepted is
p_pass = p_correct + (1 − p_correct)·p_fp.
If the model is allowed to generate up to K independent samples and stops at the first sample that passes the verifier, the probability of ending up with an accepted, truly correct solution is
P_success(K) = (p_correct / p_pass)·(1 − (1 − p_pass)^K).
Because p_fp > 0, the factor p_correct / p_pass is strictly less than 1, so even as K → ∞ the success probability is bounded above by
p_correct / (p_correct + (1 − p_correct)·p_fp).
In other words, the verifier’s false-positive rate imposes a hard ceiling on how much resampling can improve performance. For weaker models, p_correct is low while p_fp tends to be higher, making the ceiling substantially lower than the single-sample accuracy of a stronger model.
Empirical setup
The authors evaluate several models ranging from modest (Cohere Command‑Light) to strong (GPT‑4o, Llama‑3.1) on two widely used coding benchmarks: HumanEval and MBPP. The original unit tests of HumanEval and MBPP are used as the verifier, while the expanded test suites (HumanEval+ / MBPP+) serve as the ground‑truth oracle. For each model and each problem, they generate 200–1,000 samples (1,000 for the smallest model) and record which samples pass the verifier and, among those, which also pass the hidden tests.
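From such records, the key quantity is the fraction of verifier-accepted samples that fail the oracle, i.e., P(incorrect | accepted). A minimal sketch of this post-processing, with a hypothetical record layout (the field names are illustrative assumptions, not the paper's actual data format):

```python
# Each record logs whether a sample passed the public verifier tests and
# whether it passed the hidden (oracle) test suite.
samples = [
    {"passes_verifier": True,  "passes_oracle": True},
    {"passes_verifier": True,  "passes_oracle": False},  # a false positive
    {"passes_verifier": False, "passes_oracle": False},
    {"passes_verifier": True,  "passes_oracle": True},
]

accepted = [s for s in samples if s["passes_verifier"]]
false_positives = [s for s in accepted if not s["passes_oracle"]]

# P(incorrect | accepted): the quantity that bounds resampling gains.
fp_rate_given_accept = len(false_positives) / len(accepted)
print(fp_rate_given_accept)
```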
Key findings
- Generalization gap – Weaker models produce a higher proportion of “false positives”: solutions that satisfy the public unit tests but fail the hidden tests. This gap scales inversely with the model’s single‑sample accuracy, forming an almost linear relationship across model families.
- Upper bound on resampling – Even with an infinite sampling budget, models below a certain accuracy line cannot match the Pass@1 performance of a strong model (e.g., GPT‑4o). The paper visualizes this with a horizontal cutoff: any model whose conditional accuracy (correct | pass verifier) is below the strong model’s Pass@1 will never overtake it, regardless of K.
- Optimal number of samples (K*) – Introducing a cost C for a false positive and a benefit B for a true positive, the authors explore different C/B ratios. For realistic ratios (C/B ≈ 0.5), the expected utility peaks at K ≈ 3–5; higher K yields diminishing returns because each additional sample adds more risk of a costly false positive. When C exceeds B, the optimal K collapses to zero, meaning it is better not to resample at all.
- Code‑quality degradation – False‑positive solutions also score lower on style metrics such as snake_case vs. camelCase naming, line‑length limits, and presence of comments. This indicates that imperfect verification not only harms functional correctness but also propagates low‑quality coding practices.
- Risk of verifier exploitation – Because weaker models rely heavily on the verifier, they may learn to “game” its weaknesses during training (e.g., via rejection sampling). This could lead to models that are good at passing tests without genuinely solving the underlying problem, raising safety and reliability concerns.
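The utility trade-off behind the optimal-K finding can be reproduced with a toy model: when problems are heterogeneous, extra samples keep accepting false positives on the hard problems after the easy ones are solved, so expected utility peaks at a small K. All numbers below are illustrative assumptions, not values from the paper:

```python
def expected_utility(k, problems, benefit=1.0, cost=0.5):
    """Average expected utility of resampling up to k times over a mix of
    problems, each given as a (p_correct, p_fp) pair."""
    total = 0.0
    for p_correct, p_fp in problems:
        p_pass = p_correct + (1 - p_correct) * p_fp   # P(a sample is accepted)
        if p_pass == 0:
            continue  # nothing is ever accepted: zero utility
        p_any = 1 - (1 - p_pass) ** k                 # P(accept within k tries)
        # Utility of the first accepted sample: + benefit if truly correct,
        # - cost if it is a false positive.
        per_accept = (p_correct * benefit - (1 - p_correct) * p_fp * cost) / p_pass
        total += p_any * per_accept
    return total / len(problems)

# Half the problems are solvable (p_correct = 0.5), half are not (p_correct = 0),
# with a C/B ratio of 0.5 as in the paper's "realistic" regime.
problems = [(0.5, 0.05), (0.0, 0.10)]
utilities = {k: expected_utility(k, problems) for k in range(1, 21)}
best_k = max(utilities, key=utilities.get)
print(best_k)  # the optimum lands at a small k
```

Past the optimum, each extra sample mostly buys additional chances to accept a false positive on an unsolvable problem, so the utility curve bends downward, mirroring the paper's qualitative finding.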
Implications
- Verifier design matters: High‑precision, high‑coverage verifiers are a prerequisite for effective resampling‑based scaling. The community should treat verifier development as a distinct research area with dedicated benchmarks.
- Evaluation methodology: Using the same unit tests for both verification and evaluation can mask the generalization gap. Separate, more exhaustive test suites (oracles) are needed to obtain unbiased performance estimates.
- Training‑time data curation: Datasets built via rejection sampling against imperfect verifiers inherit the same false‑positive contamination, limiting the benefits of such curated data for training reasoning models.
- Alternative scaling strategies: In domains where perfect oracles are unavailable (most real‑world tasks), methods that do not rely on binary verification—such as chain‑of‑thought prompting, self‑critique, or learned ranking—may offer more reliable scaling pathways.
Conclusion
The paper demonstrates that resampling with imperfect verifiers cannot arbitrarily close the performance gap between weak and strong language models. The false‑positive rate creates a hard ceiling on achievable accuracy, the optimal number of resampling attempts is modest, and the resulting solutions often suffer from reduced code quality. Future work should prioritize building robust verifiers, decoupling verification from evaluation, and exploring scaling techniques that are less vulnerable to verifier imperfections.