How AI Coding Agents Communicate: A Study of Pull Request Description Characteristics and Human Review Responses

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

The rapid adoption of large language models has led to the emergence of AI coding agents that autonomously create pull requests on GitHub. However, how these agents differ in their pull request description characteristics, and how human reviewers respond to them, remains underexplored. In this study, we conduct an empirical analysis of pull requests created by five AI coding agents using the AIDev dataset. We analyze agent differences in pull request description characteristics, including structural features, and examine human reviewer responses in terms of review activity, response timing, sentiment, and merge outcomes. We find that AI coding agents exhibit distinct PR description styles, which are associated with differences in reviewer engagement, response time, and merge outcomes. We observe notable variation across agents in both reviewer interaction metrics and merge rates. These findings highlight the role of pull request presentation and reviewer interaction dynamics in human-AI collaborative software development.


💡 Research Summary

This paper investigates how autonomous AI coding agents differ in the way they write pull‑request (PR) descriptions on GitHub and how human reviewers react to those PRs. Using the AIDev dataset, the authors collected 33,596 PRs generated by five widely used agents—GitHub Copilot, OpenAI Codex, Claude Code, Cursor, and Devin. They defined eleven quantitative features to capture three dimensions of a PR description: (1) Work Style (files changed, lines added/deleted, number of commits), (2) Description Style (total characters, density of Markdown headers, list items, code blocks, emojis, and polite phrases), and (3) PR Compliance (whether the title follows Conventional Commit conventions). All features were normalized with Z‑scores, allowing direct cross‑agent comparison.
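The Z‑score normalization described above is a standard per‑feature transformation; the sketch below illustrates it on one feature (the values and the feature name are illustrative, not taken from the AIDev dataset):

```python
import statistics

def z_scores(values):
    """Normalize raw feature values to Z-scores: (x - mean) / std."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population std; sample std is also common
    return [(v - mean) / stdev for v in values]

# Illustrative "files changed" counts for a handful of PRs
files_changed = [1, 3, 2, 10, 4]
normalized = z_scores(files_changed)
```

After normalization each feature has mean 0 and unit variance, which is what allows the paper's direct cross‑agent comparison of otherwise incommensurable features (character counts vs. commit counts).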

RQ1 – Description Characteristics
The analysis revealed clear stylistic clusters. Claude Code and Copilot produce long, code‑heavy descriptions with many code blocks but few headers or lists, making them less readable. Cursor favors plain text and a high density of polite expressions, while Devin splits changes into many small commits. In contrast, OpenAI Codex uniquely employs headers and bullet lists, resulting in highly structured, concise descriptions and relatively modest code changes. These differences suggest that each agent has a distinct “communication profile” that can affect reviewer cognition.

RQ2 – Human Reviewer Response
The second research question was split into two parts. For RQ2‑1 (review process), the authors filtered 94,865 raw review events to 28,961 genuine human comments, removing bots and non‑English entries. Four metrics were computed: comments per PR, median comment length, median time to first comment, and sentiment distribution (positive, neutral, negative) using a RoBERTa model fine‑tuned on software‑engineering text. All metrics showed statistically significant variation across agents (p < 0.001). Comments per PR had the largest effect size (ε² = 0.280). Claude Code elicited the longest comments and the highest proportion of positive sentiment; Copilot generated the most comments per PR but with predominantly neutral tone; Cursor received the highest share of negative sentiment; Devin and Codex attracted minimal engagement and very short response times.
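This summary does not name the significance test behind the ε² = 0.280 figure; one common source of an ε² effect size is a Kruskal–Wallis H test, where ε² = H / (n − 1). The sketch below works under that assumption, with a pure‑stdlib H statistic and synthetic comments‑per‑PR samples for three hypothetical agents:

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (average ranks for ties; no tie correction)."""
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        i = j
    return 12 / (n * (n + 1)) * sum(
        rs ** 2 / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)

# Synthetic comments-per-PR samples for three hypothetical agents
agent_a = [5, 7, 6, 8, 9]
agent_b = [1, 2, 1, 3, 2]
agent_c = [4, 3, 5, 4, 6]
h = kruskal_h(agent_a, agent_b, agent_c)
eps2 = h / (len(agent_a) + len(agent_b) + len(agent_c) - 1)
```

ε² ranges from 0 (no cross‑group variation) to 1; on this scale the paper's 0.280 for comments per PR is a substantial effect.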

For RQ2‑2 (outcome), the authors measured merge rate and median time to completion (from PR creation to final state). OpenAI Codex achieved the highest merge rate (82.6 %) and the shortest completion time (≈1 minute), indicating an extremely efficient review cycle. Cursor also performed well (65.2 % merge, 0.9 h completion) despite its negative sentiment share. Copilot lagged with a 43 % merge rate and a 13‑hour median cycle, reflecting extensive discussion but low conversion to merged code. Devin showed moderate merge success (53.8 %) but a longer 8.9‑hour cycle, often closing without resolution. Claude Code, while generating deep discussions, resolved PRs quickly (merge rate 59 %, completion 1.95 h).
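The two outcome metrics in this section are straightforward to compute from PR records; the sketch below uses hypothetical records whose field names are assumptions, not the AIDev schema:

```python
from datetime import datetime
from statistics import median

def outcome_metrics(prs):
    """Return (merge rate, median hours from creation to final state)."""
    merged = sum(1 for pr in prs if pr["merged"])
    hours = [
        (pr["closed_at"] - pr["created_at"]).total_seconds() / 3600
        for pr in prs
    ]
    return merged / len(prs), median(hours)

# Hypothetical PR records for one agent
prs = [
    {"merged": True,  "created_at": datetime(2024, 1, 1, 9), "closed_at": datetime(2024, 1, 1, 10)},
    {"merged": True,  "created_at": datetime(2024, 1, 2, 9), "closed_at": datetime(2024, 1, 2, 12)},
    {"merged": False, "created_at": datetime(2024, 1, 3, 9), "closed_at": datetime(2024, 1, 3, 11)},
]
rate, med_hours = outcome_metrics(prs)
```

Note that the median covers both merged and closed‑without‑merge PRs, matching the paper's "time to completion (from PR creation to final state)".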

Interpretation and Implications
The study demonstrates that the way an AI agent formats its PR description directly influences reviewer workload, emotional response, and final integration success. Structured markdown (headers, lists) and concise language correlate with higher merge rates and faster cycles, confirming prior findings that readability reduces cognitive load. Conversely, agents that prioritize raw code output without clear narrative (e.g., Copilot) provoke more comments but lower merge efficiency, suggesting reviewers spend more effort reconciling the changes. The sentiment analysis adds a human‑centric dimension: even high‑performing agents can generate negative sentiment, which may affect reviewer satisfaction and long‑term adoption.

From a design perspective, the authors propose concrete guidelines for future AI coding agents: (1) incorporate markdown structures to improve readability, (2) balance code block usage with explanatory text, (3) embed polite, courteous phrasing, and (4) adhere to Conventional Commit conventions for titles. These practices aim to lower reviewer cognitive burden and foster smoother human‑AI collaboration.
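Guideline (4), Conventional Commit titles, is mechanically checkable; a minimal sketch of such a check (the regex covers the common commit types but is a simplification of the full specification):

```python
import re

# Conventional Commits title shape: type(optional scope)!: description
CONVENTIONAL_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([\w\-./]+\))?!?: .+"
)

def is_conventional(title):
    """Check whether a PR title follows the Conventional Commits format."""
    return bool(CONVENTIONAL_RE.match(title))
```

An AI agent could run such a check on its own title before opening the PR, the same way the paper's PR Compliance feature scores existing titles.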

Threats to Validity
The authors acknowledge that the observational nature of the study precludes causal claims; task type, repository norms, and code complexity could confound the observed differences. Automated parsing of PR bodies may miss nuanced information, and filtering non‑English comments could bias sentiment results. Finally, findings are limited to the AIDev dataset and may not generalize to other platforms, time periods, or emerging agents.

Conclusion
Overall, the paper provides the first large‑scale empirical comparison of PR description styles across multiple autonomous AI coding agents and links these styles to concrete reviewer behaviors and outcomes. It highlights that effective AI‑assisted development requires not only functional code generation but also thoughtful communication design, paving the way for more human‑friendly AI tools in collaborative software engineering.

