Why Do AI-Agent-Involved Fix-Related Pull Requests Remain Unmerged? An Empirical Study
Autonomous coding agents (e.g., OpenAI Codex, Devin, GitHub Copilot) are increasingly used to generate fix-related pull requests (PRs) in real-world software repositories. However, their practical effectiveness depends on whether these contributions are accepted and merged by project maintainers. In this paper, we present an empirical study of AI-agent-involved fix-related PRs, examining their integration outcomes, their latency, and the factors that hinder successful merging. We first analyze 8,106 fix-related PRs authored by five widely used AI coding agents from the AIDEV-POP dataset to quantify the proportions of PRs that are merged, closed without merging, or remain open. We then conduct a manual qualitative analysis of a statistically representative sample of 326 closed-but-unmerged PRs, spending approximately 100 person-hours to construct a structured catalog of 12 failure reasons. Our results indicate that test-case failures and prior resolution of the same issues by other PRs are the most common causes of non-integration, whereas build or deployment failures are comparatively rare. Overall, our findings expose key limitations of current AI coding agents in real-world settings and highlight directions for their further improvement and for more effective human-AI collaboration in software maintenance.
💡 Research Summary
This paper presents a large‑scale empirical investigation of how autonomous coding agents contribute fix‑related pull requests (PRs) in real‑world open‑source projects and why a substantial fraction of these contributions fail to be merged. Using the AIDEV‑POP dataset released in November 2025, the authors extracted 8,106 fix‑related PRs authored by five widely used AI coding agents: OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code. Each PR was classified into one of three outcome states—merged, closed without merging, or still open—based on its status at the time of data collection. The quantitative analysis (RQ1) shows that overall 65 % of the AI‑generated fix PRs are merged, 26 % are closed without merging, and 9 % remain open. However, success rates differ dramatically across agents. Codex achieves an 81.6 % merge rate, whereas Copilot and Devin hover around 42 % and exhibit a much higher proportion of closed‑but‑unmerged PRs.
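The three-way outcome classification described above can be sketched as a small helper over GitHub-style PR records. The field names (`state`, `merged_at`) follow the GitHub API convention and are illustrative; the paper's actual dataset schema may differ:

```python
def classify_pr(pr: dict) -> str:
    """Classify a PR record into one of the three outcome states.

    Assumes GitHub-style fields: `state` ("open" or "closed") and
    `merged_at` (timestamp string or None). These names are
    illustrative, not the paper's actual schema.
    """
    if pr.get("merged_at") is not None:
        return "merged"
    if pr.get("state") == "closed":
        return "closed_unmerged"
    return "open"

# Toy records showing each outcome state.
prs = [
    {"state": "closed", "merged_at": "2025-06-01T12:00:00Z"},
    {"state": "closed", "merged_at": None},
    {"state": "open", "merged_at": None},
]
counts = {}
for pr in prs:
    outcome = classify_pr(pr)
    counts[outcome] = counts.get(outcome, 0) + 1
print(counts)  # {'merged': 1, 'closed_unmerged': 1, 'open': 1}
```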
The authors also measured integration latency for merged PRs by computing the elapsed time (in hours) between PR creation and closure. Figure 1 (log‑scale) reveals that Codex PRs tend to be merged quickly with a tight inter‑quartile range, while Copilot and Devin display broader distributions and heavier upper tails, indicating that many of their PRs experience days to weeks of review before merging or are ultimately rejected.
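The latency metric above reduces to a timestamp difference expressed in hours. A minimal sketch, assuming ISO-8601 timestamps as used by the GitHub API (the paper does not specify its exact computation):

```python
from datetime import datetime

def latency_hours(created_at: str, closed_at: str) -> float:
    """Elapsed hours between PR creation and closure.

    Both arguments are ISO-8601 UTC timestamps of the form
    "YYYY-MM-DDTHH:MM:SSZ" (the GitHub API convention).
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(closed_at, fmt) - datetime.strptime(created_at, fmt)
    return delta.total_seconds() / 3600.0

# One day and 6.5 hours between creation and closure.
print(latency_hours("2025-06-01T10:00:00Z", "2025-06-02T16:30:00Z"))  # 30.5
```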
To answer RQ2, the study performed a manual qualitative analysis of a statistically representative sample of 326 closed‑but‑unmerged PRs (selected from the 2,113 unmerged cases, providing 95 % confidence with a 5 % margin of error). Two researchers independently coded each PR using an open‑coding approach, spending roughly 100 person‑hours in total. Inter‑rater agreement was substantial (Cohen’s κ = 0.82). From this effort the authors derived a catalog of twelve distinct failure reasons:
- Resolved by Another Pull Request (22.1 %)
- Test Case Failures (18.1 %)
- Incorrect or Incomplete Fixes (15.3 %)
- Fix Rejected by Maintainers (4.9 %)
- Deployment Failures (3.1 %)
- Build Failures (2.1 %)
- Low Priority or Obsolete Issue (8.0 %)
- Incomplete Review Process (5.5 %)
- Closed Due to Inactivity (9.2 %)
- Closed Without Explicit Rationale (4.0 %)
- No Review Conducted (4.6 %)
- Other Reasons (3.1 %)
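The reported sample size of 326 out of 2,113 unmerged PRs is consistent with Cochran's formula plus a finite-population correction (z = 1.96 for 95 % confidence, 5 % margin of error, p = 0.5). A minimal sketch of that calculation, assuming this standard formula was used:

```python
import math

def sample_size(population: int, z: float = 1.96,
                margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's sample size with finite-population correction."""
    # Infinite-population sample size: n0 = z^2 * p * (1 - p) / e^2
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite-population correction for a population of the given size.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

print(sample_size(2113))  # 326
```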
Agent‑wise analysis (Table 2) shows that for Codex the dominant failure mode is test case failures (54.9 % of its unmerged PRs), indicating that while Codex can generate syntactically correct patches, many do not satisfy the project’s test suite. Devin’s unmerged PRs are overwhelmingly closed due to inactivity (54 %), suggesting that longer review cycles or lack of reviewer engagement lead to abandonment. Copilot’s failures are split between “resolved by another PR” and “incorrect/incomplete fixes,” while Cursor shows a higher share of “incorrect or incomplete fixes.”
The findings highlight two broad categories of obstacles: (a) technical shortcomings—patches that fail tests, break builds, or cannot be deployed; and (b) process‑related issues—reviewers not responding, PRs being closed without explanation, or the issue being superseded by other changes. The study therefore argues that current AI agents lack integrated verification pipelines that automatically ensure test pass, build success, and compliance with project conventions before a PR is submitted. Moreover, the human‑AI collaboration workflow often does not provide timely feedback loops, leading to PRs stalling or being closed for reasons unrelated to code quality.
In the discussion, the authors propose several avenues for improving the practical effectiveness of AI‑generated fix PRs:
- Integrate automated test generation and execution into the agent’s output pipeline so that test failures are caught early.
- Develop interactive review interfaces that allow agents to ingest reviewer comments and regenerate patches in near real‑time, reducing the latency observed for agents like Copilot and Devin.
- Implement agent‑specific quality gates (e.g., stricter test‑pass thresholds for Codex, activity‑monitoring alerts for Devin) to tailor expectations to each agent’s strengths and weaknesses.
- Introduce proactive notification mechanisms for inactive PRs to re‑engage reviewers or automatically reopen PRs when the underlying issue resurfaces.
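The first and third proposals above amount to a pre-submission quality gate: the agent runs a pipeline of verification steps and only opens the PR if all of them pass. A minimal sketch of that idea; the concrete commands are placeholders, since a real agent would invoke the target project's own build and test tooling:

```python
import subprocess
import sys

def pre_submission_gate(steps):
    """Run each verification command; report success only if all pass.

    `steps` is a list of argv-style commands (e.g., build, test, lint).
    The commands used below are stand-ins, not a real project's CI.
    """
    for cmd in steps:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"gate failed at step: {' '.join(cmd)}")
            return False
    return True

# Illustrative gate with two stand-in steps that succeed.
gate = [
    [sys.executable, "-c", "print('build ok')"],   # stand-in for a build step
    [sys.executable, "-c", "print('tests ok')"],   # stand-in for the test suite
]
print(pre_submission_gate(gate))  # True
```

A failing step short-circuits the gate, which mirrors the paper's observation that test-case failures should be caught before, not after, a PR reaches human reviewers.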
The paper concludes that while AI agents are already capable of contributing a substantial volume of useful bug‑fixes—evidenced by a 65 % overall merge rate—there remains a non‑trivial “failure ceiling” of about one‑quarter of PRs that do not make it into the codebase. Addressing both the technical validation gaps and the collaborative process gaps is essential for moving from experimental AI‑generated patches toward reliable, production‑grade software maintenance tools. Future work should explore tighter CI/CD integration, richer human‑AI feedback channels, and longitudinal studies to assess whether the proposed mitigations indeed raise the merge rate and reduce latency across diverse open‑source ecosystems.