Can Developers Rely on LLMs for Secure IaC Development?
We investigated the capabilities of GPT-4o and Gemini 2.0 Flash for secure Infrastructure as Code (IaC) development. For security smell detection on the Stack Overflow dataset, which primarily contains small, simplified code snippets, the models detected at least 71% of security smells when prompted to analyze code from a security perspective (general prompt). With a guided prompt (adding clear, step-by-step instructions), this increased to at least 78%. In GitHub repositories, which contain complete, real-world project scripts, the general prompt was less effective, leaving more than half of the smells undetected; with the guided prompt, however, the models uncovered at least 67% of the smells. For secure code generation, we prompted the LLMs with 89 vulnerable synthetic scenarios and observed that only 7% of the generated scripts were secure. Adding an explicit instruction to generate secure code raised GPT-4o's secure-output rate to 17%, while Gemini's barely changed (8%). These results highlight the need for further research to improve LLMs' capabilities in assisting developers with secure IaC development.
💡 Research Summary
This paper investigates the ability of two state‑of‑the‑art large language models (LLMs), GPT‑4o and Gemini 2.0 Flash, to assist developers in secure Infrastructure‑as‑Code (IaC) development. The authors focus on two research questions: (RQ 1) How effectively can LLMs uncover security smells in IaC scripts? (RQ 2) To what extent can LLMs generate secure IaC solutions when prompted with scenarios that typically lead to insecure outcomes?
To answer RQ 1, the study builds two benchmark datasets. The first consists of 169 security‑smell‑containing code snippets extracted from 2,569 Stack Overflow posts tagged with Ansible or Puppet. The second dataset comprises 430 real‑world IaC files sampled from a larger collection of 21,757 GitHub files identified through literature review and keyword searches. Manual labeling by two experts achieved high inter‑rater reliability (Cohen’s κ = 0.93 for Stack Overflow, 0.89 for GitHub).
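Cohen's κ measures agreement between two raters beyond what chance alone would produce, which is why values like 0.93 indicate strong labeling consistency. A minimal sketch of the computation (not the authors' tooling, just the standard formula):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected), where
    p_expected is the chance agreement implied by each rater's
    marginal label frequencies.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Perfect agreement yields κ = 1.0, while agreement no better than chance yields κ ≈ 0, so the reported 0.93 and 0.89 sit near the top of the scale.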
The authors evaluate the models under two prompting strategies. A “general prompt” simply asks the model to act as a security expert and analyze the code. A “guided prompt” decomposes the task into explicit steps: identify the IaC tool, enumerate security smells, map each to its CWE and line number, and suggest secure alternatives, while limiting verbosity.
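The paper's exact prompt wording is not reproduced here; the sketch below only mirrors the structure described above, with all template text being illustrative rather than the authors' actual prompts:

```python
# Illustrative templates for the two prompting strategies described in
# the study. The wording is hypothetical; only the structure (a flat
# request vs. explicit decomposed steps) follows the paper.

GENERAL_PROMPT = (
    "You are a security expert. Analyze the following "
    "Infrastructure-as-Code script and report any security issues.\n\n"
    "{code}"
)

GUIDED_PROMPT = (
    "You are a security expert. Follow these steps:\n"
    "1. Identify the IaC tool used (e.g., Ansible or Puppet).\n"
    "2. Enumerate every security smell in the script.\n"
    "3. For each smell, give its CWE identifier and line number.\n"
    "4. Suggest a secure alternative for each smell.\n"
    "Keep the answer concise.\n\n"
    "{code}"
)

def build_prompt(code: str, guided: bool = False) -> str:
    """Fill the chosen template with the script under analysis."""
    template = GUIDED_PROMPT if guided else GENERAL_PROMPT
    return template.format(code=code)
```

The guided variant front-loads the task decomposition (tool, smells, CWE mapping, remediation), which is the property the results section credits for the detection gains.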
Results for security‑smell detection show a clear benefit from the guided prompt. On the Stack Overflow set, GPT‑4o's detection rate rises from 80% (general) to 88% (guided), while Gemini improves from 71% to 78%. On the more complex GitHub scripts, the general prompt performs poorly (42% for GPT‑4o, 51% for Gemini), but the guided prompt lifts both models to roughly 67% detection. The models are far better at providing generic remediation advice (≈95% of cases) than concrete code fixes (≈5%).
For RQ 2, the authors craft 89 synthetic vulnerable scenarios covering nine known IaC security smells (hard‑coded secrets, weak cryptography, admin‑by‑default, etc.). They ask each model to generate IaC code for the scenario, first with a neutral prompt and then with an explicit "generate secure code" instruction. Across all scenarios, only 7% of the generated scripts are fully secure under the neutral prompt. Adding the security‑focused instruction raises GPT‑4o's secure‑output rate to 17% but leaves Gemini essentially unchanged (8%). Moreover, a substantial fraction of outputs (44% for GPT‑4o, 34% for Gemini) still contain security smells without any warning, indicating that the models often reproduce insecure patterns silently.
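To make the smell categories concrete, here is a toy heuristic for one of the nine smells evaluated, hard‑coded secrets. The study relies on manual inspection rather than this kind of check; the regex below is purely an illustrative sketch of what "a literal credential in an IaC script" means:

```python
import re

# Hypothetical heuristic for the "hard-coded secret" smell: a
# secret-like key assigned a literal value. Templated references such
# as Ansible's "{{ vault_password }}" are deliberately not flagged,
# since they defer the secret to a vault or variable store.
SECRET_PATTERN = re.compile(
    r'(password|secret|api_key|token)\s*[:=]\s*["\']?[^"\'\s{]+',
    re.IGNORECASE,
)

def has_hardcoded_secret(script: str) -> bool:
    """Return True if any line assigns a literal value to a secret-like key."""
    return bool(SECRET_PATTERN.search(script))
```

A script line like `password: "admin123"` trips the check, while the vault-templated `password: "{{ vault_pass }}"` does not, mirroring the secure alternative the models are expected to produce.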
The paper concludes that while current LLMs can assist in detecting IaC security smells—especially when guided prompts are used—their ability to autonomously produce secure IaC code remains limited. The authors argue that the models inherit biases from their training data, leading them to repeat insecure patterns unless explicitly constrained. They recommend future work in three directions: (1) designing richer, security‑oriented prompt templates or tool‑assisted prompting to improve detection and remediation; (2) fine‑tuning LLMs on security‑focused IaC corpora to embed stronger security knowledge; and (3) integrating LLM outputs with static analysis or policy‑enforcement tools to provide real‑time validation and corrective feedback.
Overall, the study provides a thorough empirical assessment of GPT‑4o and Gemini 2.0 Flash in the context of IaC security, highlighting both their promise for smell detection and their current shortcomings in secure code generation, thereby offering a roadmap for researchers and practitioners aiming to harness LLMs safely in DevOps pipelines.