On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub
Large language models (LLMs) are increasingly being integrated into software development processes. The ability to generate code and submit pull requests with minimal human intervention, through the use of autonomous AI agents, is poised to become a standard practice. However, little is known about the practical usefulness of these pull requests and the extent to which their contributions are accepted in real-world projects. In this paper, we empirically study 567 GitHub pull requests (PRs) generated using Claude Code, an agentic coding tool, across 157 diverse open-source projects. Our analysis reveals that developers tend to rely on agents for tasks such as refactoring, documentation, and testing. The results indicate that 83.8% of these agent-assisted PRs are eventually accepted and merged by project maintainers, with 54.9% of the merged PRs integrated without further modification. The remaining 45.1% benefit from human revisions, especially for bug fixes, documentation, and adherence to project-specific standards. These findings suggest that while agent-assisted PRs are largely acceptable, they still benefit from human oversight and refinement.
💡 Research Summary
This paper presents an empirical investigation of “agentic coding,” the practice of using autonomous AI agents powered by large language models (LLMs) to generate, modify, and submit code changes on GitHub without direct human prompting. The authors focus on Claude Code, an Anthropic‑developed agentic coding tool, and analyze 567 pull requests (PRs) that were explicitly marked as generated by Claude Code (“Generated with Claude Code”) across 157 diverse open‑source repositories. A matched set of 567 human‑authored PRs (HPRs) from the same repositories and authors serves as a baseline for comparison.
Data collection and methodology
The study extracts PRs via the GitHub GraphQL API, limiting the search to repositories with at least ten stars to ensure a minimal level of activity and quality. The time window spans from the public release of Claude Code (24 Feb 2025) to the start of data collection (30 Apr 2025). After filtering out open or irrelevant PRs, 567 agentic PRs (APRs) remain. For each APR, a corresponding human PR is sampled from the same repository and author, extending the sampling window backward when necessary to achieve parity. The final dataset contains 567 APRs and 567 HPRs, with no overlap between the two sets.
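The identification step can be sketched as follows. The search expression and the field names (`body`, `state`, `repo_stars`) are illustrative assumptions, not the authors' exact GraphQL query; they only mirror the stated criteria (marker string in the PR body, closed PRs, repositories with at least ten stars, fixed time window):

```python
MARKER = "Generated with Claude Code"

def build_search_query(min_stars: int = 10,
                       start: str = "2025-02-24",
                       end: str = "2025-04-30") -> str:
    """Build an illustrative GitHub search expression for closed PRs
    whose body contains the Claude Code marker."""
    return (f'"{MARKER}" in:body is:pr is:closed '
            f'created:{start}..{end} stars:>={min_stars}')

def is_candidate_apr(pr: dict, min_stars: int = 10) -> bool:
    """Local filter mirroring the study's criteria: marked body,
    closed PR, repository with at least `min_stars` stars."""
    return (MARKER in pr.get("body", "")
            and pr.get("state") == "CLOSED"
            and pr.get("repo_stars", 0) >= min_stars)

# Hypothetical PR records, as they might come back from the API
prs = [
    {"body": "Fix typo\n\nGenerated with Claude Code",
     "state": "CLOSED", "repo_stars": 42},
    {"body": "Refactor parser", "state": "CLOSED", "repo_stars": 120},
    {"body": "Generated with Claude Code", "state": "OPEN", "repo_stars": 15},
]
candidates = [pr for pr in prs if is_candidate_apr(pr)]  # keeps only the first
```

Only the first record passes all three filters; the second lacks the marker and the third is still open, matching the paper's exclusion of open PRs.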
To answer four research questions (RQs), the authors manually classify PR purpose using the two‑dimensional framework of Zeng et al. (purpose × object). Labels include bug‑fix, feature, refactor, docs, test, build, style, CI, performance, and chore. Multi‑label classification yields a label‑level agreement of 75.8 % across annotators, with disagreements resolved through discussion.
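One simple way to compute label-level agreement between two annotators is to count, over every (PR, label) pair, whether both made the same presence/absence decision. This is a proxy; the paper does not spell out its exact agreement formula, and the labels and annotations below are hypothetical:

```python
def label_agreement(a: list, b: list, labels: set) -> float:
    """Fraction of (PR, label) decisions on which two annotators agree.

    `a` and `b` are parallel lists of label sets, one set per PR.
    Labels outside `labels` are ignored.
    """
    total = agree = 0
    for sa, sb in zip(a, b):
        for lab in labels:
            total += 1
            agree += (lab in sa) == (lab in sb)
    return agree / total

labels = {"bug-fix", "feature", "refactor", "docs", "test"}
ann1 = [{"bug-fix"}, {"docs", "test"}, {"refactor"}]
ann2 = [{"bug-fix"}, {"docs"}, {"refactor", "style"}]  # "style" ignored here
score = label_agreement(ann1, ann2, labels)  # 14 of 15 decisions match
```

Disagreements (here, the `test` label on the second PR) would then be resolved through discussion, as in the study.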
RQ 1 – Change size and purpose
APR and HPR distributions differ markedly. APRs are heavily skewed toward non‑functional improvements: refactoring (24.9 %), documentation (22.1 %), test additions (18.8 %), build changes (10.8 %), and style tweaks (7.5 %). In contrast, HPRs contain more “chore” (10.4 %) and CI‑related changes (7.2 %). Both groups address bugs (≈31 %) and new features (≈27 %) at similar rates, indicating that agents are already being used for substantive functional work even as their usage skews toward repetitive, well‑defined tasks.
RQ 2 – Acceptance and rejection
Overall, 83.8 % of APRs are merged, compared with 91.0 % of HPRs. Rejections are primarily driven by project‑specific context—alternative solutions, oversized PRs, or misalignment with repository conventions—rather than intrinsic code defects. This suggests that the primary barrier to adoption is not AI quality per se, but the need for the PR to fit the maintainers’ workflow and standards.
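The gap between the two merge rates can be checked with a standard two-proportion z-test. The raw counts below (475 and 516 out of 567) are inferred from the rounded percentages, not taken from the paper, so the exact z and p values are approximate:

```python
import math

def two_proportion_z(x1: int, n1: int, x2: int, n2: int):
    """Two-sided two-proportion z-test using the pooled normal
    approximation; returns (z statistic, p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2))); two-sided tail probability
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Counts implied by the reported rates: 475/567 ≈ 83.8 %, 516/567 ≈ 91.0 %
z, p = two_proportion_z(475, 567, 516, 567)
```

With these implied counts the difference is significant (|z| > 3, p < 0.01), consistent with the paper's observation that APRs are merged somewhat less often than HPRs.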
RQ 3 – Merged without revision
Among merged PRs, 54.9 % of APRs are accepted “as‑is,” a proportion statistically indistinguishable from the 58.5 % observed for HPRs. When revisions are required, the additional effort (measured by number of extra commits, lines of code changed, and files touched) does not differ significantly between APRs and HPRs. Thus, human reviewers do not expend substantially more work to clean up agent‑generated changes.
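A nonparametric test such as Mann–Whitney U is a typical choice for comparing skewed effort metrics like extra-commit counts; the paper does not name its exact test, so the sketch below is an assumption, using a normal approximation without tie correction and purely hypothetical commit counts:

```python
import math

def mann_whitney_u(x: list, y: list):
    """Mann-Whitney U test (normal approximation, average ranks for
    ties, no tie correction); returns (U statistic, two-sided p)."""
    combined = sorted((v, i) for i, v in enumerate(x + y))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank over the tie group
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg
        i = j + 1
    n1, n2 = len(x), len(y)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u1, p

# Hypothetical extra-commit counts per merged PR (not the study's data)
apr_commits = [1, 2, 0, 3, 1, 2]
hpr_commits = [1, 1, 2, 0, 2, 1]
u, p = mann_whitney_u(apr_commits, hpr_commits)
```

A non-significant p-value on such samples would mirror the paper's finding that revision effort does not differ significantly between APRs and HPRs.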
RQ 4 – Types of required revisions
For the 45.1 % of merged APRs that required revisions, the most common revision categories are bug fixes (47.7 %), documentation updates (29.0 %), refactoring (27.1 %), and style improvements (23.4 %). These findings highlight that while agents can produce syntactically correct code, they still miss subtle logical errors and project‑specific style or documentation conventions, underscoring the necessity of human oversight.
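Note that these percentages sum to more than 100 % because the classification is multi-label: a single revised PR can fall into several categories at once. A minimal illustration with hypothetical labels:

```python
from collections import Counter

# Hypothetical revision labels for four revised APRs (multi-label)
revisions = [
    {"bug-fix", "style"},
    {"docs"},
    {"bug-fix", "refactor"},
    {"style"},
]
counts = Counter(lab for labs in revisions for lab in labs)
n = len(revisions)
rates = {lab: c / n for lab, c in counts.items()}
# rates: bug-fix 0.50, style 0.50, docs 0.25, refactor 0.25 -> sums to 1.5
```

Here each rate is the share of revised PRs touching that category, so the rates legitimately total 150 %, just as the paper's four categories total well over 100 %.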
Implications
The study demonstrates that autonomous AI agents can produce pull requests that are largely acceptable to real‑world maintainers, especially for tasks that are repetitive, well‑structured, or non‑functional in nature. The comparable merge‑without‑revision rates suggest that agent‑generated code can reach a quality level close to human contributions. However, the need for human‑driven bug fixes and contextual adjustments indicates that a fully autonomous pipeline is not yet viable; a hybrid workflow where agents handle boilerplate or routine changes while humans perform final validation appears optimal.
Threats to validity
Key limitations include reliance on a single agentic tool (Claude Code), which may not represent the broader class of LLM‑based agents; the string‑based PR identification method could miss unmarked agentic PRs; and the focus on open‑source repositories may limit generalizability to corporate or proprietary codebases with different governance models.
Future work
The authors propose extending the analysis to multiple LLM agents (e.g., GitHub Copilot, OpenAI’s Code Interpreter) and to private enterprise repositories. They also suggest developing automated quality metrics (e.g., static analysis, test coverage) to quantitatively assess the incremental value of human revisions, and exploring how to better integrate agents into CI/CD pipelines to reduce the manual review burden.
In sum, the paper provides robust empirical evidence that agentic coding tools are already useful in practice, achieving high acceptance rates and modest revision overhead, while also clarifying the boundaries where human expertise remains indispensable.