Predicting Developer Acceptance of AI-Generated Code Suggestions

Notice: This research summary and analysis were automatically generated with AI. For absolute accuracy, please refer to the original arXiv source.

AI-assisted programming tools are widely adopted, yet their practical utility is often undermined by undesired suggestions that interrupt developer workflows and cause frustration. While existing research has explored developer-AI interactions in programming qualitatively, a significant gap remains in the quantitative analysis of developers' acceptance of AI-generated code suggestions, partly because the necessary fine-grained interaction data are often proprietary. To bridge this gap, this paper conducts an empirical study of 66,329 industrial developer-AI interactions from a large technology company. We analyze which features differ significantly between accepted and rejected code suggestions. We find that accepted suggestions are characterized by significantly higher historical acceptance counts and ratios for both developers and projects, longer generation intervals, shorter preceding code context in the project, and older IDE versions. Based on these findings, we introduce CSAP (Code Suggestion Acceptance Prediction) to predict whether a developer will accept a code suggestion before it is displayed. Our evaluation shows that CSAP achieves accuracies of 0.973 and 0.922 on the imbalanced and balanced datasets, respectively. Compared to a large language model baseline and an in-production industrial filter, CSAP improves accuracy by a relative 12.6% and 69.5% on the imbalanced dataset, and by 87.0% and 140.1% on the balanced dataset. Our results demonstrate that targeted personalization is a powerful approach for filtering out suggestions predicted to be rejected and reducing developer interruptions. To the best of our knowledge, this is the first quantitative study of code suggestion acceptance on large-scale industrial data, and it sheds light on an important research direction in AI-assisted programming.


💡 Research Summary

The paper tackles a pressing problem in modern software development: AI‑assisted programming tools such as GitHub Copilot or Cursor generate code suggestions that are frequently rejected, disrupting developers’ workflow and causing frustration. While prior work has explored developer‑AI interaction qualitatively, quantitative evidence—especially at the scale of fine‑grained IDE telemetry—is scarce because such data are usually proprietary. To fill this gap, the authors obtained 66,329 interaction logs from a large technology company. Each log records whether a suggestion was accepted or rejected, together with rich contextual information: preceding code context, generated code length, modifications made by the developer, IDE type and version, generation latency, and timestamps.

Research Questions

  1. Which features differ significantly between accepted and rejected suggestions?
  2. Can these features be leveraged to predict acceptance before the suggestion is shown?

Feature Design
The authors construct a compact feature set spanning three dimensions: (i) developer habit, (ii) project preference, and (iii) in‑situ context. Historical developer‑level and project‑level statistics are aggregated over a 7‑day window (shorter windows were found sufficient). Developer‑habit features include average preceding‑context lines/characters, average generated‑code lines/characters, average modification distance (Levenshtein), modification ratio, average written‑code lines/characters, generation time, and—crucially—historical acceptance count and acceptance ratio. Project‑level features mirror the acceptance count/ratio at the project scale and the average preceding‑context length of the project. In‑situ features capture the IDE version, IDE type, and generation latency for the specific request.

Statistical significance testing (t‑test / Mann‑Whitney U) reveals that accepted suggestions are characterized by:

  • Higher historical acceptance counts and ratios for both the individual developer and the project.
  • Longer generation intervals, suggesting developers are more likely to accept suggestions when they have waited longer (perhaps indicating more thoughtful requests).
  • Shorter preceding‑code context, implying that suggestions in less complex surroundings are more useful.
  • Older IDE versions, an unexpected finding that may reflect that newer IDEs provide richer native assistance, reducing reliance on AI suggestions.
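The group comparisons above rely on a t-test or Mann-Whitney U test depending on the feature's distribution. As a minimal illustration of the latter, the U statistic reduces to counting pairwise "wins" between the two groups; this is a pure-Python sketch, not the paper's implementation (which would use a standard statistics library):

```python
def mann_whitney_u(xs, ys):
    """U statistic for group xs vs. group ys: the number of pairs
    (x, y) with x > y, counting each tie as half a win."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```

Significance is then judged against U's null distribution (or its normal approximation for large samples); a U far from `len(xs) * len(ys) / 2` indicates the two groups differ systematically.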

Prediction Model – CSAP
Based on the significant features, the authors propose CSAP (Code Suggestion Acceptance Prediction), a lightweight neural network (a few fully‑connected layers) trained with a class‑balanced binary cross‑entropy loss to mitigate the natural imbalance (≈ 1:4 acceptance:rejection). Two evaluation settings are used: (a) the original imbalanced dataset, and (b) a balanced version created via oversampling. Results:

  • Imbalanced dataset: Accuracy 0.973, Precision 0.961, Recall 0.938, F1 0.949.
  • Balanced dataset: Accuracy 0.922, Precision 0.915, Recall 0.907, F1 0.911.
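The class-balanced loss can be sketched as ordinary binary cross-entropy with per-class weights inversely proportional to class frequency, so the roughly 1:4 accept:reject split does not let the majority class dominate training. This is a minimal pure-Python sketch under that assumption; the paper's exact weighting scheme and network architecture are not reproduced here:

```python
import math

def class_balanced_bce(probs, labels):
    """Weighted binary cross-entropy: each class's weight is the inverse
    of its frequency (normalized so weights average to 1), giving both
    classes equal total influence on the loss."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    w_pos = n / (2.0 * n_pos)  # up-weights the rarer "accepted" class
    w_neg = n / (2.0 * n_neg)
    total = 0.0
    for p, y in zip(probs, labels):
        w = w_pos if y == 1 else w_neg
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / n
```

With this weighting, a batch of one positive and three negatives contributes the same total weight (2.0) from each class, which is the mitigation the summary describes.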

The model is compared against two baselines: (i) a direct large language model call (Qwen2.5‑Coder‑32B) that estimates suggestion quality, and (ii) the company's production‑grade "Circuit Breaker" filter. CSAP improves accuracy by a relative 12.6% (vs. the LLM) and 69.5% (vs. Circuit Breaker) on the imbalanced set, and by 87.0% and 140.1%, respectively, on the balanced set.

Feature Importance
SHAP analysis shows the top contributors are in‑situ IDE version and developer_accepted_ratio, confirming that both the immediate environment and the developer’s past acceptance behavior dominate the prediction. Other notable contributors include generation time, project acceptance ratio, and preceding‑context character count.
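Exact SHAP values are comparatively expensive to compute; a simpler way to probe the same question (how much each feature drives predictions) is permutation importance, sketched below with a fixed column permutation (reversal) so the result is deterministic. This is an illustrative stand-in for intuition, not the paper's SHAP analysis:

```python
def permutation_importance(predict, X, y):
    """Accuracy drop when each feature column is permuted (here the
    permutation is a reversal, so the result is reproducible).
    A large drop means the model relies on that feature."""
    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)
    baseline = accuracy(X)
    drops = []
    for j in range(len(X[0])):
        col = [row[j] for row in X][::-1]  # permute column j only
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        drops.append(baseline - accuracy(permuted))
    return drops
```

For a model that ignores a feature entirely, the corresponding drop is exactly zero, mirroring how SHAP assigns near-zero attribution to uninformative features.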

Contributions & Implications

  1. First large‑scale quantitative study of code‑suggestion acceptance using real industrial telemetry.
  2. Systematic feature engineering that blends historical, project‑level, and real‑time signals.
  3. CSAP, a practical, low‑latency model that can be deployed inside IDE plugins to filter out likely‑to‑be‑rejected suggestions, thereby reducing interruptions and improving overall developer experience.

Limitations & Future Work
The dataset originates from a single company, so external validity across domains or open‑source ecosystems remains to be demonstrated. The current model does not directly assess the intrinsic quality of the generated code (e.g., correctness, performance) nor does it model fine‑grained coding style preferences. Future research directions include: (a) extending the study to multiple organizations, (b) integrating code‑quality metrics and style similarity into the prediction pipeline, (c) exploring reinforcement‑learning or bandit approaches for real‑time policy optimization, and (d) developing online learning mechanisms that continuously adapt CSAP as developers’ habits evolve.

In summary, the paper provides compelling evidence that personalized, history‑aware features can accurately forecast whether a developer will accept an AI‑generated code suggestion. By deploying CSAP, AI‑assisted programming tools can proactively suppress low‑value suggestions, mitigating workflow disruption and fostering higher adoption rates.
