Dynamic Cogeneration of Bug Reproduction Tests in Agentic Program Repair
Bug Reproduction Tests (BRTs) have been used in many agentic Automated Program Repair (APR) systems, primarily for validating promising fixes and aiding fix generation. In practice, when developers submit a patch, they often implement the BRT alongside the fix. Our experience deploying agentic APR reveals that developers similarly desire a BRT within AI-generated patches to increase their confidence. However, canonical APR systems tend to generate BRTs and fixes separately, or focus on producing only the fix in the final patch. In this paper, we study agentic APR in the context of cogeneration, where the APR agent is instructed to generate both a fix and a BRT in the same patch. We evaluate the effectiveness of different cogeneration strategies on 120 human-reported bugs at Google and characterize different cogeneration strategies by their influence on APR agent behavior. We develop and evaluate patch selectors that account for test change information to select patches with plausible fixes (and plausible BRTs). Finally, we analyze the root causes of failed cogeneration trajectories. Importantly, we show that cogeneration allows the APR agent to generate BRTs for at least as many bugs as a dedicated BRT agent, without compromising the generation rate of plausible fixes, thereby reducing engineering effort in maintaining and coordinating separate generation pipelines for fix and BRT at scale.
💡 Research Summary
The paper investigates the feasibility and benefits of generating bug‑reproduction tests (BRTs) together with bug‑fixes in a single patch, a process the authors call “cogeneration.” In many production settings, developers expect a failing‑to‑passing test to accompany a fix, but existing agentic automated program repair (APR) systems typically produce fixes and tests in separate pipelines or discard the test after it has been used for validation. To address this gap, the authors extend their LLM‑driven APR agent, Passerine, with three distinct cogeneration strategies that mirror human development workflows:
- Test‑Driven Development (TDD) – the agent first writes a BRT that fails on the buggy code, confirms the failure, and then writes the fix.
- Test‑Last Development (TLD) – the agent first writes the fix, then adds a BRT to verify that the fix indeed resolves the bug.
- Freeform (Dynamic) – the agent is free to interleave test and fix creation in any order, guided only by its internal reasoning.
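The three strategies differ only in the ordering constraint placed on the agent. As a minimal sketch, such constraints could be expressed as system-prompt augmentations; the actual Passerine prompts are internal to Google, so the wording and function names below are invented for illustration:

```python
# Hypothetical cogeneration instructions, one per strategy. These are NOT
# the paper's actual prompts -- only an illustration of the three orderings.
STRATEGY_INSTRUCTIONS = {
    "tdd": (
        "Before fixing the bug, write a bug reproduction test (BRT), run it, "
        "and confirm it fails on the current code. Only then implement the "
        "fix, and keep the BRT in the final patch."
    ),
    "tld": (
        "First implement the fix. Then add a bug reproduction test (BRT) and "
        "run it to verify the fix resolves the bug. Keep both the fix and "
        "the BRT in the final patch."
    ),
    "freeform": (
        "Produce a final patch containing both a fix and a bug reproduction "
        "test (BRT) that fails without the fix and passes with it. You may "
        "create them in any order."
    ),
}

def build_system_prompt(base_prompt: str, strategy: str) -> str:
    """Append the chosen cogeneration instruction to the base agent prompt."""
    return f"{base_prompt}\n\n{STRATEGY_INSTRUCTIONS[strategy]}"
```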
These strategies are implemented by augmenting the system prompt with explicit instructions (e.g., “write the test before the fix” for TDD). The authors evaluate the approaches on a realistic benchmark of 120 human‑reported bugs from Google’s internal Issue Tracking System, covering six programming languages. For each bug, up to M parallel repair trajectories are launched, each allowed up to N steps. Generated patches pass through a multi‑stage validation pipeline consisting of a build‑and‑test reviewer, a “smell” heuristic reviewer, a specification‑based LLM reviewer, and a newly added BRT reviewer that checks whether the test fails on the buggy version and passes on the patched version.
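The core check performed by the BRT reviewer is a fail-to-pass check. Abstracting away Google's internal build-and-test infrastructure behind a caller-supplied test runner, the logic can be sketched as:

```python
from typing import Callable

def brt_review(
    run_test: Callable[[str], bool],  # returns True iff the test passes in the given source tree
    buggy_tree: str,
    patched_tree: str,
) -> bool:
    """Fail-to-pass check: a BRT is plausible iff it fails on the buggy
    version of the code and passes once the candidate fix is applied.
    A test that passes on both trees does not reproduce the bug; one
    that fails on both is broken or reproduces something else."""
    fails_on_buggy = not run_test(buggy_tree)
    passes_on_patched = run_test(patched_tree)
    return fails_on_buggy and passes_on_patched
```

For example, a stub runner that only passes on the patched tree models a valid BRT, while a runner that passes everywhere models a vacuous test.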
Effectiveness is measured with execution‑based metrics such as plausible‑fix@k and plausible‑BRT@k. The results show that all three cogeneration strategies achieve at least the same success rate as dedicated fix‑only or BRT‑only agents, with the Freeform strategy performing best overall. Importantly, cogeneration does not degrade the rate of plausible fixes, demonstrating that the two generation tasks can be combined without trade‑offs.
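The @k metrics follow the usual pass@k convention: the probability that at least one of k sampled trajectories yields a plausible fix (or BRT). One common unbiased estimator, popularized by the HumanEval pass@k formulation, is shown below; the paper may compute its metrics differently, so treat this as a sketch of the general idea:

```python
from math import comb

def plausible_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k-style estimator: given n trajectories for a bug,
    of which c produced a plausible patch, estimate the probability that
    at least one of k randomly chosen trajectories is plausible."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a plausible trajectory
    # 1 minus the probability that all k chosen trajectories are implausible.
    return 1.0 - comb(n - c, k) / comb(n, k)
```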
Beyond raw success rates, the study characterizes how each strategy influences agent behavior. TDD tends to increase early root‑cause analysis and produces more test‑centric reasoning, while TLD often results in smaller code changes after the fix is written. Freeform exhibits a balanced mix of both, suggesting that allowing the LLM to decide the order yields the most adaptable workflow.
A key engineering contribution is the design of a “test‑aware” patch selector. The original selector, optimized for fix‑only patches, penalized patches containing tests because test code varies more and thus fragments clusters during majority voting. The new selector groups patches by both fix and test presence, ranks them, and preferentially selects patches that contain a plausible fix and a plausible BRT. Compared to the default selector, the test‑aware selector improves precision from 0.08 to 0.16 and recall from 0.57 to 0.71 for patches with both a valid fix and BRT, while maintaining similar performance for fix‑only patches.
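The selection idea described above can be sketched as follows. This is an invented minimal implementation, not the paper's actual selector: it assumes each candidate patch carries a normalized fix diff, a flag for whether it touches test files, and the BRT reviewer's verdict, then clusters on the fix portion only so that variable test code no longer fragments the majority vote:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Patch:
    fix_diff: str        # normalized diff of non-test files
    has_test: bool       # whether the patch touches test files
    brt_plausible: bool  # whether the test passed the fail-to-pass review

def select_patch(patches: list) -> Optional[Patch]:
    """Hypothetical test-aware selection: cluster candidates by the fix
    portion alone, take the largest cluster (majority vote), and within
    it prefer a patch with a plausible BRT, then one with any test."""
    if not patches:
        return None
    clusters = defaultdict(list)
    for p in patches:
        clusters[p.fix_diff].append(p)  # vote on fix content, ignore test code
    winning = max(clusters.values(), key=len)  # largest cluster wins
    winning.sort(key=lambda p: (p.brt_plausible, p.has_test), reverse=True)
    return winning[0]
```

The key design choice mirrors the paper's diagnosis: clustering on the full patch (fix plus test) splits otherwise-identical fixes into tiny clusters, whereas clustering on the fix alone preserves the vote and lets test quality act as a tie-breaker.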
The authors also perform a qualitative failure analysis. The most common failure modes are: (1) the agent discards the generated test, treating it as a temporary artifact; (2) the agent pursues an incorrect debugging hypothesis, exhausting steps without converging; and (3) the fix and test become over‑fitted to each other, causing the test to no longer reproduce the original bug. These insights point to future improvements in prompt design, LLM hyper‑parameter tuning, and richer intermediate verification steps.
In summary, the paper demonstrates that dynamic cogeneration of fixes and bug‑reproduction tests is not only feasible but also advantageous: it matches or exceeds the performance of separate generation pipelines, reduces engineering overhead by eliminating the need for parallel components, and yields patches that align better with developer expectations. The work provides concrete guidelines for integrating test generation into APR agents, introduces a practical test‑aware selection mechanism, and offers a thorough analysis of agent behavior and failure causes, paving the way for more trustworthy, production‑ready AI‑driven program repair systems.