How do Agents Refactor: An Empirical Study

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Software development agents such as Claude Code, GitHub Copilot, Cursor Agent, Devin, and OpenAI Codex are increasingly integrated into developer workflows. While prior work has evaluated agent capabilities for code completion and task automation, little work investigates how these agents perform Java refactoring in practice, the types of changes they make, and their impact on code quality. In this study, we present the first analysis of agentic refactoring pull requests in Java, comparing them to developer refactorings across 86 projects per group. Using RefactoringMiner and DesigniteJava 3.0, we identify refactoring types and detect code smells before and after refactoring commits. Our results show that agent refactorings are dominated by annotation changes (the five most common refactoring types performed by agents are annotation-related), in contrast to the diverse structural improvements typical of developers. Despite these differences in refactoring types, we find Cursor to be the only agent showing a statistically significant increase in code smells after refactoring.


💡 Research Summary

This paper presents the first large‑scale empirical investigation of how modern code generation agents—Claude Code, GitHub Copilot, Cursor Agent, Devin, and OpenAI Codex—perform Java refactoring in real‑world open‑source projects, and how their refactoring activities compare to those of human developers. The authors extracted two parallel datasets from the AIDev repository (accessed October 2025). The “agentic” dataset consists of 1,278 pull requests (PRs) made by the five agents across 86 Java projects, yielding 2,626 commits, of which 413 commits contain at least one refactoring operation (324 of those modify .java files). The “developer” baseline comprises 86 popular Java repositories (≥50 stars, no updates since January 2021) with 6,466 commits, 1,142 of which are refactoring commits.

Refactoring detection was performed with RefactoringMiner, a state‑of‑the‑art tool with 97.96% precision and 87.20% recall across refactoring types. Code‑smell analysis before and after each refactoring commit used DesigniteJava 3.0, which reports over 98% precision and recall for a suite of 30+ design smells.
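To make the per-commit analysis concrete, the sketch below tallies refactoring types from a RefactoringMiner-style JSON report and converts them to the percentage shares reported in RQ1. The schema assumed here (a top-level `commits` list whose entries carry a `refactorings` list with a `type` field) is an approximation of the tool's JSON output, not something specified in the paper, and the function names are hypothetical.

```python
from collections import Counter

def tally_refactoring_types(report: dict) -> Counter:
    """Count refactoring types across all commits in a
    RefactoringMiner-style JSON report (schema assumed)."""
    counts = Counter()
    for commit in report.get("commits", []):
        for ref in commit.get("refactorings", []):
            counts[ref["type"]] += 1
    return counts

def type_percentages(counts: Counter) -> dict:
    """Convert raw counts to percentage shares, as reported in RQ1."""
    total = sum(counts.values())
    return {t: 100.0 * n / total for t, n in counts.items()}

if __name__ == "__main__":
    # Tiny hand-made report mimicking the assumed schema.
    report = {
        "commits": [
            {"sha1": "abc", "refactorings": [
                {"type": "Add Method Annotation"},
                {"type": "Add Method Annotation"},
                {"type": "Extract Method"},
            ]},
            {"sha1": "def", "refactorings": [
                {"type": "Move Class"},
            ]},
        ]
    }
    counts = tally_refactoring_types(report)
    print(counts.most_common(1))   # [('Add Method Annotation', 2)]
    print(type_percentages(counts)["Add Method Annotation"])  # 50.0
```

Aggregating this way over all refactoring commits in a dataset yields exactly the kind of type-frequency distribution the next section compares between agents and developers.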

Research Question 1 (RQ1): What refactoring types do agents perform and how frequent are they?
The authors find a stark contrast between agents and developers. Developers exhibit a balanced distribution: the most common type, Change Attribute Access Modifier, accounts for only 5.92% of all refactorings, followed closely by Move Class (5.91%) and Extract Method (4.98%). This indicates a focus on structural improvements that enhance maintainability.

In contrast, agents concentrate heavily on annotation‑related changes. Across all agents, the top three refactoring types are Add Method Annotation (22.52%), Add Parameter Annotation (12.82%), and Modify Method Annotation (10.37%). Claude Code is the most extreme case: over 91% of its 16,780 refactorings are annotation‑centric (41.9% Add Method Annotation, 27.6% Add Parameter Annotation, 21.4% Modify Method Annotation). Claude also performs massive batches, with a mean of 762 changes per refactoring commit. Copilot shows a more mixed profile, including Remove Parameter (9.1%), Move Class (8.7%), and annotation edits, making it the closest to human behavior. Cursor Agent performs the fewest total refactorings (31) but focuses on structural changes such as Extract Method (29%) and Move Class (16%). Devin mirrors Claude's annotation bias, while OpenAI Codex displays a blend of structural and annotation edits, though still more annotation‑heavy than the developer baseline.

Research Question 2 (RQ2): Are agents and humans equally likely to introduce code smells when refactoring?
The study measures the mean change (Δ) in the number of detected smells per refactoring commit. Developers show a modest average increase of 2.43 smells (median Δ = 0), indicating that typical human refactorings neither dramatically improve nor degrade code quality. Claude Code's average Δ is 35.5, but the Wilcoxon rank‑sum test yields p = 0.25 and a negligible effect size (Cohen's d = 0.14), so the increase is not statistically significant. Copilot (Δ = 2.22), Devin (Δ = 0), and OpenAI Codex (Δ = −1.2) also show no significant differences from developers (p > 0.4, |d| < 0.1). Cursor Agent is the outlier: its mean Δ is 19.86, with a median increase of 3 smells; the difference is statistically significant (p = 0.013) and has a moderate effect size (Cohen's d = 0.51). This suggests that Cursor's structural refactorings tend to introduce more design smells than human refactorings, potentially reducing code quality.
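The two measures behind these comparisons are straightforward to sketch. Below is a minimal, self-contained implementation of Cohen's d (pooled standard deviation) and a two-sided Wilcoxon rank-sum p-value via the normal approximation; the paper presumably relies on a standard statistical package, and this sketch omits the tie correction in the variance, so treat it as illustrative only.

```python
import math
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d between two samples, using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                       / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled

def ranksum_p(a, b):
    """Two-sided p-value for the Wilcoxon rank-sum test via the normal
    approximation (average ranks for ties; no tie correction)."""
    combined = sorted((v, i) for i, v in enumerate(list(a) + list(b)))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank for a run of ties
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg
        i = j + 1
    n1, n2 = len(a), len(b)
    r1 = sum(ranks[:n1])               # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2        # expected rank sum under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (r1 - mu) / sigma
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

Applied to two lists of per-commit smell deltas (one per agent, one for the developer baseline), these functions reproduce the shape of the paper's RQ2 comparison: a p-value for whether the distributions differ and an effect size for how much.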

Implications
The findings reveal that current agents excel at bulk annotation management but lack the diversity of structural refactorings that human developers routinely apply. The sheer number of refactoring actions in an agent‑generated PR should not be interpreted as a proxy for quality improvement. Moreover, while most agents’ impact on code smells is statistically indistinguishable from developers, agents that focus on structural changes (e.g., Cursor) may require additional human review or automated quality gates to avoid degrading design quality.

Limitations and Future Work
The study is limited to Java projects and two static analysis tools; extending to other languages and dynamic analyses could provide a broader view. The relatively small number of Cursor PRs (9 projects) may affect the robustness of the statistical findings. Future research should explore hybrid workflows where agents suggest refactorings and developers validate them, as well as the development of agent‑aware refactoring metrics that capture both annotation and structural dimensions.

In summary, the paper demonstrates that while code agents are increasingly contributing refactoring commits, their behavior is heavily skewed toward annotation edits, and only a subset (Cursor) shows a statistically significant tendency to increase code smells. This nuanced picture informs both tool developers—who may need to enrich agents’ structural refactoring capabilities—and practitioners—who should apply appropriate review processes when integrating agent‑generated refactorings into production codebases.

