Large language models (LLMs) for code generation are becoming integral to modern software development, but their real-world prevalence and security impact remain poorly understood.
We present the first large-scale empirical study of AI-generated code (AIGCode) in the wild. We build a high-precision detection pipeline and a representative benchmark to distinguish AIGCode from human-written code, and apply them to (i) development commits from the top 1,000 GitHub repositories (2022-2025) and (ii) 7,000+ recent CVE-linked code changes. This lets us label commits, files, and functions along a human/AI axis and trace how AIGCode moves through projects and vulnerability life cycles.
Our measurements show three ecological patterns. First, AIGCode is already a substantial fraction of new code, but adoption is structured: AI concentrates in glue code, tests, refactoring, documentation, and other boilerplate, while core logic and security-critical configurations remain mostly human-written. Second, adoption has security consequences: some CWE families are overrepresented in AI-tagged code, and near-identical insecure templates recur across unrelated projects, suggesting "AI-induced vulnerabilities" propagated by shared models rather than shared maintainers. Third, in human-AI edit chains, AI introduces high-throughput changes while humans act as security gatekeepers; when review is shallow, AI-introduced defects persist longer, remain exposed on network-accessible surfaces, and spread to more files and repositories.
We will open-source the complete dataset and release analysis artifacts and fine-grained documentation of our methodology and findings.
With the rapid development of large language models, artificial intelligence is profoundly reshaping software engineering. Code generation tools such as GitHub Copilot [37,58] and Claude Code [11,32] have demonstrated consistent and significant gains in code completion, function implementation, refactoring, and project-level task collaboration [2,6,29,39,45,52]. Surveys indicate that a considerable proportion of developers already use AI tools to write code in their daily work [10,19], and AIGCode has become a crucial component of the modern software development process.
However, this technological shift also brings potential security risks. The power of LLMs stems from training on massive publicly available codebases, and this data inevitably includes historical code with security flaws [19,27,44,54]. As a result, models inherit insecure coding patterns during training, and the generated code may reproduce known vulnerabilities [21,44] or even introduce new, more subtle security risks when interacting with project context [1,10]. Furthermore, the convenience and efficiency of AIGCode allow large amounts of code to be generated and revised at once through developer-guided conversations, often without careful review or security checks, further increasing the likelihood that vulnerabilities are introduced.
While academia and industry have directed attention to the security issues of AIGCode [14,28,51], a fundamental deficiency persists in existing research: the lack of systematic, large-scale empirical studies of AIGCode in the wild. This knowledge gap prevents a comprehensive understanding of AIGCode's actual penetration, its specific security impact, and its long-term risk evolution in real-world software projects. This insufficient empirical basis, in turn, severely restricts the effective formulation and deployment of risk mitigation strategies. The primary technical obstacle to conducting such empirical analysis is code provenance and detection: because AIGCode is highly similar to human-written code in syntax, style, and logic, there is a lack of effective technical means to accurately distinguish it, making effective regulation, risk assessment, and liability determination extremely difficult.
To bridge this methodological gap and enable systematic empirical analysis, this paper proposes an ensemble learning framework for AIGCode detection and performs a comprehensive empirical analysis of GitHub projects to dissect the associated trends and security risks. Our detection framework adopts a cascade-aggregation ensemble strategy: it combines the strengths of multiple base models to improve both detection accuracy and speed. Building on it, we conducted a large-scale empirical study on publicly available vulnerability intelligence and the codebases of GitHub's Top 1,000 open-source projects, aiming to systematically reveal the current penetration of AIGCode in the wild and analyze its potential correlation with security vulnerabilities.
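To make the cascade-aggregation idea concrete, the following is a minimal sketch, not our actual implementation: the feature extractor (character n-gram TF-IDF), the specific base models, and the cascade thresholds are all placeholder assumptions. It only illustrates the principle described above, namely that a cheap first-stage screen resolves easy samples while uncertain samples are forwarded to an aggregation of stronger, slower base detectors.

```python
"""Toy cascade-aggregation ensemble for AIGCode detection (illustrative only)."""
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier


class CascadeAggregationDetector:
    def __init__(self, low=0.2, high=0.8):
        # Character n-gram TF-IDF stands in for real code features.
        self.vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
        self.screen = LogisticRegression(max_iter=1000)   # fast stage-1 screen
        self.experts = [                                   # slower stage-2 models
            RandomForestClassifier(n_estimators=200),
            GradientBoostingClassifier(),
        ]
        self.low, self.high = low, high                    # cascade thresholds

    def fit(self, snippets, labels):
        X = self.vectorizer.fit_transform(snippets)
        self.screen.fit(X, labels)
        for m in self.experts:
            m.fit(X.toarray(), labels)
        return self

    def predict_proba(self, snippets):
        X = self.vectorizer.transform(snippets)
        p = self.screen.predict_proba(X)[:, 1]
        # Only samples the screen is unsure about reach the aggregation stage.
        unsure = (p > self.low) & (p < self.high)
        if unsure.any():
            dense = X[unsure].toarray()
            p[unsure] = np.mean(
                [m.predict_proba(dense)[:, 1] for m in self.experts], axis=0
            )
        return p  # probability that each snippet is AI-generated


# Toy usage with placeholder labels (1 = AI-generated, 0 = human-written).
snippets = ["def add(a, b):\n    return a + b", "x=1;y=2;print(x+y)"] * 10
labels = [1, 0] * 10
detector = CascadeAggregationDetector().fit(snippets, labels)
print(detector.predict_proba(snippets[:2]))
```

The design choice the sketch captures is the accuracy/speed trade-off: most samples never touch the expensive models, so throughput stays high while the aggregation stage handles the ambiguous cases that dominate the error rate.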
The main contributions of this paper are summarized as follows:
(1) A Novel Detection Framework. We designed and implemented an efficient, ensemble learning-based AIGCode detection framework, the Cascade-Aggregation Framework. Experimental results show that it achieves state-of-the-art accuracy and robustness in distinguishing AI-generated from human-written code in the wild. We will open-source the detection model and dataset.
(2) The First Large-Scale Ecosystem Analysis. We conducted the first large-scale empirical study of the Top 1,000 open-source projects ranked by GitHub stars, quantitatively analyzing the scale and penetration trend of AIGCode in real-world software development.
(3) A Comprehensive Security Risk Profile. By correlating recent vulnerability intelligence with AIGCode usage, we construct a detailed security risk profile for AIGCode, revealing its distinctive risk patterns.
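As an illustration of the kind of correlation analysis behind contribution (3), the sketch below computes per-CWE odds ratios between AI-tagged and human-written vulnerable changes with Fisher's exact test. The counts, CWE IDs, and choice of test are made-up illustrative assumptions, not our reported statistics.

```python
"""Illustrative CWE overrepresentation test on placeholder counts."""
from scipy.stats import fisher_exact

# (ai_with_cwe, ai_without_cwe, human_with_cwe, human_without_cwe) -- fake data
counts = {
    "CWE-79": (40, 960, 55, 2945),
    "CWE-89": (12, 988, 70, 2930),
}

for cwe, (a, b, c, d) in counts.items():
    odds_ratio, p_value = fisher_exact([[a, b], [c, d]])
    direction = "over" if odds_ratio > 1 else "under"
    print(f"{cwe}: OR={odds_ratio:.2f} ({direction}-represented in AI-tagged code), p={p_value:.3g}")
```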
With the rapid popularity of LLM-driven programming assistance tools, distinguishing AIGCode from human-written code has garnered growing attention in software security and trusted computing [5,9,35,38,56]. Existing methods fall broadly into two categories: proactive provenance and passive content analysis [17,18].
Proactive Provenance. This category, typified by digital watermarks and signatures, embeds verifiable "fingerprints" directly into the code during the generation phase, thereby in principle offering high verifiability and robustness [17,59,60]. However, this approach requires cooperation from, and modification of, the model or serving infrastructure; it therefore cannot cover the large body of existing in-the-wild code, nor can it trace historical data or fragments copied across platforms. It is thus unsuitable for our task scenario.
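The following toy sketch illustrates the proactive-provenance idea and its limitation; it is not a real LLM watermark. Here a hypothetical generation service signs each emitted snippet with a secret key, and a verifier holding the key can later confirm provenance. Production schemes instead bias token sampling during decoding so the mark survives light edits; this HMAC-based stand-in breaks as soon as the snippet is modified, which mirrors the cross-platform copying limitation noted above.

```python
"""Toy provenance tagging via HMAC signatures (illustrative only)."""
import hashlib
import hmac

SECRET_KEY = b"provenance-demo-key"  # assumption: held by the generation service


def sign_snippet(code: str) -> str:
    # Attach a provenance tag at generation time.
    return hmac.new(SECRET_KEY, code.encode(), hashlib.sha256).hexdigest()


def verify_snippet(code: str, tag: str) -> bool:
    # Recompute the tag and compare in constant time.
    return hmac.compare_digest(sign_snippet(code), tag)


snippet = "def greet(name):\n    return f'Hello, {name}!'"
tag = sign_snippet(snippet)
print(verify_snippet(snippet, tag))                    # True: untouched snippet
print(verify_snippet(snippet.replace("!", "?"), tag))  # False: any edit breaks the tag
```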