Consistency Meets Verification: Enhancing Test Generation Quality in Large Language Models Without Ground-Truth Solutions
Large Language Models (LLMs) have significantly advanced automated test generation, yet existing methods often rely on ground-truth code for verification, risking bug propagation and limiting applicability in test-driven development. We present ConVerTest, a novel two-stage pipeline for synthesizing reliable tests without requiring prior code implementations. ConVerTest integrates three core strategies: (i) Self-Consistency (SC) to generate convergent test cases via majority voting; (ii) Chain-of-Verification (CoVe) for iterative, reasoning-guided code refinement; and (iii) a Dual Execution Agreement to cross-validate code and tests through consensus. Experiments on the BIGCODEBENCH and LESS BASIC PYTHON PROBLEMS (LBPP) benchmarks demonstrate that ConVerTest improves test validity, line coverage, and mutation scores by up to 39%, 28%, and 18%, respectively, over baselines. Our findings highlight ConVerTest as a robust solution for mitigating hallucinations and enhancing the reliability of autonomous software testing agents.
💡 Research Summary
The paper tackles a fundamental problem in large‑language‑model (LLM)‑driven automated test generation: how to produce reliable unit tests without depending on a ground‑truth implementation, which can propagate bugs and limit applicability in test‑driven development. Existing approaches typically treat the code under test as an oracle, creating a circular error‑propagation loop, and suffer from hallucinations that generate invalid assertions or unrealistic inputs. To address these issues, the authors propose ConVerTest, a two‑stage pipeline that combines three complementary strategies—Self‑Consistency (SC), Chain‑of‑Verification (CoVe), and Dual Execution Agreement—to generate and validate tests in a code‑free setting.
Stage 1 – Consistency‑Driven Generation
The first stage runs two parallel pipelines. The test‑generation pipeline uses SC: it first prompts the LLM to produce a set of diverse test stubs (skeletons containing only the setup and function call). By decoupling stub creation from assertion completion, the model is encouraged to explore a wide range of scenarios, including edge cases. For each stub, the model is sampled N times to generate full test functions. All N completions are parsed into abstract syntax trees (ASTs) and clustered by logical equivalence, ignoring superficial differences such as variable names. The most frequent cluster is selected as the final test for that stub, ensuring both diversity (through many stubs) and consistency (through majority voting).
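The clustering-and-voting step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: here "logical equivalence" is approximated by renaming every identifier to a canonical placeholder before comparing AST dumps, which a production system would refine (e.g., to preserve library and builtin names).

```python
import ast
from collections import Counter


class _Normalizer(ast.NodeTransformer):
    """Rename identifiers to canonical placeholders so that completions
    differing only in variable names compare equal (crude approximation
    of the paper's logical-equivalence clustering)."""

    def __init__(self):
        self.names: dict[str, str] = {}

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id not in self.names:
            self.names[node.id] = f"v{len(self.names)}"
        node.id = self.names[node.id]
        return node


def canonical_form(source: str) -> str:
    """Parse one sampled test completion and return its normalized AST dump."""
    return ast.dump(_Normalizer().visit(ast.parse(source)))


def majority_vote(completions: list[str]) -> str:
    """Cluster the N sampled completions by canonical AST and return one
    member of the largest (most self-consistent) cluster."""
    clusters = Counter(canonical_form(c) for c in completions)
    winner, _ = clusters.most_common(1)[0]
    return next(c for c in completions if canonical_form(c) == winner)
```

For example, `majority_vote(["x = add(1, 2)\nassert x == 3", "y = add(1, 2)\nassert y == 3", "z = add(2, 2)\nassert z == 4"])` returns the first completion, because the first two are logically identical up to variable naming and outvote the third.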
Simultaneously, the code‑generation pipeline applies CoVe, an iterative self‑verification loop originally designed for factual QA. Starting from a baseline solution generated from the problem description, the pipeline automatically formulates a verification plan consisting of targeted questions about correctness, logical soundness, edge‑case handling, constraint adherence, and robustness. Each question is answered independently, producing a diagnostic rationale. Based on these answers, the model refines the code, repeating the question‑answer‑refine cycle until the verification plan is satisfied. This process mimics a human code review, systematically eliminating logical flaws and hallucinated constructs.
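The question-answer-refine cycle can be sketched as a generic loop around an LLM call. Everything here is an assumption for illustration: the `llm` callable stands in for any chat-completion API, the prompt wording and the `max_rounds` cutoff are invented, and the paper's actual prompts and stopping criterion are not specified in this summary.

```python
from typing import Callable


def chain_of_verification(task: str, llm: Callable[[str], str],
                          max_rounds: int = 3) -> str:
    """Hypothetical CoVe loop: draft a solution, derive a verification
    plan, answer each question independently, and refine until the plan
    is satisfied or a round budget is exhausted."""
    code = llm(f"Write a solution for: {task}")
    for _ in range(max_rounds):
        # Formulate the verification plan as targeted questions.
        plan = llm("List verification questions about correctness, logical "
                   f"soundness, edge cases, constraints and robustness for:\n{code}")
        # Answer each question independently to produce diagnostic rationales.
        answers = [llm(f"Answer independently: {q}\nCode:\n{code}")
                   for q in plan.splitlines() if q.strip()]
        # Stop once the diagnostics raise no issues; otherwise refine.
        verdict = llm("Given these diagnostics, is the code correct? "
                      "Reply OK or REVISE.\n" + "\n".join(answers))
        if verdict.strip().startswith("OK"):
            break
        code = llm(f"Refine the code to fix the issues:\n{code}\n"
                   f"Diagnostics:\n{answers}")
    return code
```

Passing the model client in as a plain `prompt -> text` callable keeps the loop independent of any particular LLM provider.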
Stage 2 – Consensus Verification
After generating M test cases and Z candidate solutions, ConVerTest employs a Dual Execution Agreement. Every test is executed against every candidate solution; a passing pair is marked as an “inlier.” Solutions that pass the same subset of tests are grouped into consensus sets. Each set is scored by the number of solutions it contains and the number of tests it collectively passes. The highest‑scoring set yields the final solution, and the tests it passes are deemed validated. Conversely, any test not passed by the winning consensus set is discarded as likely invalid. This cross‑validation treats the generated code and tests as independent high‑confidence artifacts that mutually corroborate each other, thereby filtering out residual hallucinations that survived Stage 1.
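The consensus-set construction can be sketched from a pass/fail matrix. Note the scoring here (number of solutions times number of tests passed) is one plausible way to combine the two criteria the summary names; the paper's exact scoring function is not given in this summary.

```python
from collections import defaultdict


def dual_execution_agreement(passes: list[list[bool]]) -> tuple[list[int], list[int]]:
    """passes[i][j] is True if candidate solution i passes test j.
    Groups solutions by the exact subset of tests they pass, scores each
    consensus set (assumed: |solutions| * |tests passed|), and returns
    (winning solution indices, indices of validated tests)."""
    groups: dict[frozenset[int], list[int]] = defaultdict(list)
    for i, row in enumerate(passes):
        passed = frozenset(j for j, ok in enumerate(row) if ok)
        groups[passed].append(i)
    # Highest-scoring consensus set wins; its tests are deemed validated.
    validated, winners = max(groups.items(),
                             key=lambda kv: len(kv[1]) * len(kv[0]))
    return winners, sorted(validated)
```

With three candidates where the first two pass tests {0, 1} and the third passes {0, 2}, the two agreeing solutions form the winning set, validating tests 0 and 1 and discarding test 2. Since every test runs against every candidate, this step costs M × Z executions, which is the quadratic overhead noted in the limitations.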
Empirical Evaluation
The authors evaluate ConVerTest on two large benchmarks: BIGCODEBENCH (a massive collection of open‑source Python functions) and LESS BASIC PYTHON PROBLEMS (LBPP, a curated set of challenging Python coding tasks). They compare against strong baselines such as Gemma‑3.3 and CodeQwen‑3, which rely on ground‑truth code for test validation. Metrics include test validity rate (percentage of generated tests that correctly capture the intended specification), line coverage, and mutation score. ConVerTest improves test validity by up to 39% (relative gain), raises line coverage by up to 28%, and boosts mutation scores by up to 18% over the baselines. An ablation study shows that SC alone yields a 7–19% increase in validity and CoVe alone a 5–12% improvement, but the full combination achieves the best results, confirming the complementary nature of the three components.
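Of the three metrics, mutation score is the least self-explanatory: a suite's score is the fraction of artificially seeded faults (mutants) that at least one test detects. A minimal sketch of that computation (a generic definition, not code from the paper):

```python
def mutation_score(kill_matrix: list[list[bool]]) -> float:
    """kill_matrix[m][t] is True if mutant m is detected (killed) by test t.
    Mutation score = fraction of mutants killed by at least one test."""
    killed = sum(any(row) for row in kill_matrix)
    return killed / len(kill_matrix)
```

For instance, a suite that kills two of three mutants scores 2/3; a higher score means the tests are better at distinguishing correct code from subtly faulty variants.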
Contributions and Limitations
The paper’s contributions are threefold: (1) adaptation of NLP consistency‑driven techniques (SC) and iterative verification (CoVe) to the software testing domain; (2) design of a two‑stage pipeline that integrates generation‑time and post‑generation hallucination mitigation; (3) a large‑scale empirical demonstration of substantial gains, together with open‑source release of data and code for reproducibility. Limitations include potential difficulty in generating stubs for highly nested or domain‑specific data structures, and the computational cost of the Dual Execution Agreement, which grows quadratically with the number of candidate solutions and tests. The evaluation is confined to Python, leaving cross‑language generalization an open question.
Future Directions
The authors suggest several avenues for further work: enriching stub generation with meta‑learning to better capture complex input schemas; optimizing the sampling strategy in the Dual Execution Agreement to reduce execution overhead; extending the framework to other programming languages and real‑world industrial pipelines; and investigating tighter integration with test‑driven development tools where tests are authored before any code exists.
In summary, ConVerTest presents a novel, code‑independent approach to automated test generation that effectively curbs LLM hallucinations through multi‑level consistency checks and mutual verification between generated tests and candidate solutions. The reported empirical gains demonstrate that combining self‑consistency, iterative verification, and dual execution consensus can substantially improve the reliability and adequacy of LLM‑generated test suites, marking a significant step forward for autonomous software testing agents.