On the Adoption of AI Coding Agents in Open-source Android and iOS Development

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

AI coding agents are increasingly contributing to software development, yet their impact on mobile development has received little empirical attention. In this paper, we present the first category-level empirical study of agent-generated code in open-source mobile app projects. We analyzed PR acceptance behaviors across mobile platforms, agents, and task categories using 2,901 AI-authored pull requests (PRs) in 193 verified Android and iOS open-source GitHub repositories from the AIDev dataset. We find that Android projects have received 2x more AI-authored PRs and achieve a higher PR acceptance rate (71%) than iOS (63%), with significant agent-level variation on Android. Across task categories, PRs for routine tasks (feature, fix, and ui) achieve the highest acceptance, while structural changes such as refactor and build show lower success rates and longer resolution times. Furthermore, our evolution analysis shows PR resolution time on Android improving through mid-2025 before declining again. Our findings offer the first evidence-based characterization of AI agents' effects on OSS mobile projects, establishing empirical baselines for evaluating agent-generated contributions and informing the design of platform-aware agentic systems.


💡 Research Summary

This paper presents the first category‑level empirical investigation of AI coding agents’ contributions to open‑source mobile applications. Using the AIDev dataset, the authors identified 193 verified native Android (98) and iOS (95) repositories on GitHub that have at least ten stars and are not tutorials or sample projects. From these repositories they extracted 2,901 AI‑authored pull requests (PRs) spanning May–November 2025, including the original AIDev window (May–July) and an extended set (August–November).

PRs were classified into 13 task categories (feature, fix, refactor, build, chore, performance, style, test, docs, operations, UI, localization, other) through an open‑card sorting process assisted by GPT‑5, followed by expert refinement and validation (Cohen’s κ = 0.877). The statistical analysis employed Bayesian smoothing to mitigate small‑sample bias, non‑parametric tests (Mann‑Whitney U, chi‑square, Kruskal‑Wallis) with Holm correction, and post‑hoc Fisher’s exact or Dunn’s tests where appropriate.
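The paper does not specify its smoothing prior, but Bayesian smoothing of per-agent acceptance rates is typically done with a Beta prior, which pulls rates estimated from few PRs toward the prior mean. A minimal sketch, assuming a Beta(1, 1) prior and entirely hypothetical per-agent counts:

```python
def smoothed_acceptance(merged: int, total: int,
                        alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior-mean acceptance rate under a Beta(alpha, beta) prior.

    With alpha = beta = 1 this is Laplace smoothing; the paper's actual
    prior is not stated, so this choice is an assumption.
    """
    return (merged + alpha) / (total + alpha + beta)


# Hypothetical (merged PRs, total PRs) counts per agent -- for illustration
# only; these are not the paper's raw numbers.
agents = {"Codex": (760, 990), "Copilot": (7, 25), "Cursor": (11, 26)}

for name, (merged, total) in agents.items():
    raw = merged / total
    smooth = smoothed_acceptance(merged, total)
    print(f"{name:8s} raw={raw:.3f} smoothed={smooth:.3f}")
```

Note how agents with few PRs (here, Copilot and Cursor) are pulled noticeably toward the prior mean of 0.5, while the large-sample estimate barely moves; that is the small-sample bias correction the paper refers to.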

RQ1 – Platform and Agent Acceptance Rates
Android PRs enjoy a higher acceptance (merged) rate of 71% versus 63% for iOS (p < 0.05). On Android, acceptance varies markedly by agent: Codex achieves 76.8% acceptance, while Copilot and Cursor lag at 28.0% and 42.3% respectively (significant chi‑square, p < 0.05). iOS shows a narrow, uniform acceptance band (51%–79%) with no significant agent‑level differences (p > 0.05). The authors interpret this as evidence that Android’s more heterogeneous build ecosystem amplifies quality differences among agents, whereas iOS’s stricter design and CI policies impose a higher baseline barrier that flattens agent performance.
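The agent-level comparison above rests on a chi-square test of independence over an agent-by-outcome contingency table. A minimal stdlib sketch of the Pearson statistic, using hypothetical counts (the paper reports the test result, not the raw table):

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom
    for an r x c contingency table given as a list of rows."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    grand = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / grand
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Rows: merged / not merged; columns: Codex, Copilot, Cursor.
# Counts are hypothetical, chosen only to mirror the reported pattern.
table = [[760, 7, 11],
         [230, 18, 15]]
stat, df = chi_square(table)
print(f"chi2 = {stat:.2f}, df = {df}")
```

The statistic is then compared against the chi-square distribution with the given degrees of freedom (e.g. via `scipy.stats.chi2.sf`) to obtain the p-value; a uniform table yields a statistic of zero.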

RQ2 – Category‑Level Acceptance
On Android, routine categories—localization (100% acceptance), UI (88%), and fix (75%)—outperform structural categories such as refactor, feature, and build, which have significantly lower acceptance (p < 0.05). Fisher’s exact post‑hoc confirms localization’s superiority over several low‑performing groups. iOS, by contrast, exhibits no statistically significant variation across categories (p > 0.05), indicating a more homogeneous review stance regardless of task type.
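The post-hoc comparison between category pairs uses Fisher's exact test, which is well suited to the sparse cells some categories produce. A minimal sketch of the two-sided 2x2 version, summing hypergeometric probabilities no larger than the observed table's; the category counts in the usage line are hypothetical:

```python
import math


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]],
    e.g. rows = two task categories, columns = merged / not merged."""
    n = a + b + c + d
    r1, c1 = a + b, a + c  # first row and first column totals

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all margins fixed.
        return math.comb(c1, x) * math.comb(n - c1, r1 - x) / math.comb(n, r1)

    lo, hi = max(0, r1 + c1 - n), min(r1, c1)
    p_obs = prob(a)
    # Two-sided p-value: sum every table at least as extreme as observed.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)


# Hypothetical merged/not-merged counts: localization (12/0) vs. refactor (30/20)
print(f"p = {fisher_exact_2x2(12, 0, 30, 20):.4f}")
```

On the classic 4+4 "lady tasting tea" table [[3, 1], [1, 3]] this yields p = 34/70 ≈ 0.486, matching the textbook result, which is a quick sanity check for the implementation.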

RQ3 – Resolution Time Dynamics
iOS PRs resolve on average 18× faster than Android PRs (p < 0.05). Within agents, Codex consistently resolves the fastest on both platforms; on Android it is three times quicker than Claude, while on iOS Devin appears fastest but suffers from a tiny sample (n = 7). Functional PRs (feature, UI, localization, fix) resolve dramatically faster than non‑functional ones (refactor, build, chore, performance, style, test, docs, operations): 400× on Android and 7× on iOS (p < 0.05). Temporal analysis shows Android’s median resolution time decreasing from May to August 2025, then rising again later, suggesting an unstable efficiency gain. iOS displays modest month‑to‑month fluctuations without a clear upward or downward trend.

Threats to Validity
The study relies on acceptance rate and elapsed time as proxies for contribution quality, omitting deeper signals such as post‑merge defect rates or reviewer comment depth. Category labeling is based on PR titles generated by GPT‑5, which may misclassify mixed‑intent changes despite expert validation. The dataset is limited to public GitHub OSS; corporate or private repositories may exhibit different dynamics. Some agents and categories have sparse data, potentially affecting statistical power despite smoothing techniques.

Discussion and Implications
The findings reveal platform‑specific dynamics: Android developers are more receptive to AI contributions overall, but the choice of agent matters substantially; Codex emerges as the most effective model. iOS developers are uniformly cautious, yet they process AI‑generated PRs more quickly, suggesting that when an AI contribution passes the higher entry barrier, it integrates smoothly. Routine tasks (localization, UI tweaks, bug fixes) are the low‑risk sweet spot for automation, whereas structural changes still demand extensive human review, especially on Android.

Future Work
The authors propose extending the analysis to cross‑platform frameworks (Flutter, React Native), incorporating richer collaboration signals (review comments, subsequent defect tracking), and examining longitudinal trends as LLMs evolve. Evaluating private industrial repositories would test the generalizability of the observed patterns.

In sum, this paper provides the first quantitative baseline for AI‑generated contributions in mobile open‑source development, highlighting how platform architecture, task type, and agent selection jointly shape acceptance likelihood and review efficiency.

