The Impact of AI-Assisted Development on Software Security: A Study of Gemini and Developer Experience


Authors: Nadine Jost, Benjamin Berens, Manuel Karl, Stefan Albert Horstmann, Martin Johns, Alena Naiakshina

Nadine Jost (nadine.jost@rub.de), Ruhr University Bochum, Bochum, Germany
Benjamin Berens (benjamin.berens@kit.edu), Karlsruhe Institute of Technology, Karlsruhe, Germany
Manuel Karl (m.karl@tu-braunschweig.de), TU Braunschweig, Braunschweig, Germany
Stefan Albert Horstmann (stefan.horstmann@uni-koeln.de), University of Cologne, Cologne, Germany
Martin Johns (m.johns@tu-braunschweig.de), TU Braunschweig, Braunschweig, Germany
Alena Naiakshina (alena.naiakshina@uni-koeln.de), University of Cologne, Cologne, Germany

Abstract

The ongoing shortage of skilled developers, particularly in security-critical software development, has led organizations to increasingly adopt AI-powered development tools to boost productivity and reduce reliance on limited human expertise. These tools, often based on large language models, aim to automate routine tasks and make secure software development more accessible and efficient. However, it remains unclear how developers' general programming and security-specific experience, and the type of AI tool used (free vs. paid), affect the security of the resulting software. Therefore, we conducted a quantitative programming study with software developers (n=159) exploring the impact of Google's AI tool Gemini on code security. Participants were assigned a security-related programming task using either no AI tools, the free version, or the paid version of Gemini. While we did not observe significant differences in secure software development across the Gemini conditions, programming experience significantly improved code security and cannot be fully substituted by Gemini.
Keywords: Developer Security Study, Gemini, LLMs, Artificial Intelligence, AI Assistants

1 Introduction

"Imagine hiring a brilliant junior developer who arrives every morning with no memory of yesterday's lessons, infinite enthusiasm but no judgment, and the ability to produce 1,000 lines of code while you grab coffee. Now imagine working with them on a year-long project" [29]. The software industry is facing a persistent and growing shortage of skilled developers, particularly in security-critical domains [40]. This problem is driven by several interrelated factors, including the rapid pace of technological advancement, insufficient adaptation of educational systems, and increasing demand across various industries [70]. Despite the rise of short courses and certifications, the supply of developers with strong software and security skills continues to fall short of the needs of companies undergoing digital transformation. To address this gap, organizations are increasingly turning to AI-powered development tools, including both free and commercial offerings, as a means of improving productivity and reducing reliance on scarce human expertise (e.g., [14, 26, 36]). These tools, often based on large language models (LLMs), promise to automate routine coding tasks, accelerate development workflows, and make software engineering more accessible [22, 73]. A survey by GitHub found that 92% of U.S.-based developers now employ AI tools in both professional and personal capacities [22]. While past research investigating software security [3, 58, 61, 68] revealed that around 40% of AI-generated code contained vulnerabilities [58], it remains unclear how developers' general programming and security-specific experience affect the security of the resulting software. For example, a recent outage prompted Amazon to involve senior engineers to address issues arising from "Gen-AI-assisted" changes [41].
ChatGPT [53], GitHub Copilot [21], and Gemini [25] are the most popular and well-known examples of AI tools among developers [73]. While the impact of Copilot and ChatGPT on code security has been studied [3, 27, 52, 58, 61, 68, 69], the influence of Google's AI Gemini, previously known as Bard [24], remains underexplored. Therefore, we focused on Gemini as a case study and conducted an online study. Launched in 2024, Gemini's paid version, Gemini Advanced, is based on a different underlying model than the free version and introduced enhanced capabilities, including multimodal processing and advanced reasoning [63]. In a study on consumers' expectations and behavior regarding apps [6], users reported differences in security and privacy between free and paid versions. While free versions were more likely to be installed, participants placed greater trust in paid versions to comply with security and privacy standards. However, it remains unclear whether similar effects apply to software developers. Thus, we tested both the free and paid versions of Gemini.

We conducted an online study with 159 developers recruited via Upwork [33] and assigned participants to either the paid or free version of Gemini, or to a control group without access to AI tools. We asked them to complete a programming task that included user authentication and functionality for managing a list of websites associated with a user while considering secure implementation. After task completion, we evaluated participants' submissions for five common vulnerabilities: Cross-Site Scripting (XSS), Cross-Site Request Forgery (CSRF), Improper Input Validation, SQL Injection, and Cryptographic Failures. With this work, we examined the impact of varying levels of developer expertise and the use of paid versus free AI coding tools
on software security, as well as how trust in these tools influences developers' behavior:

• RQ1: How does developers' programming experience affect the security of the developed software?
• RQ2: How does developers' security experience affect the security of the developed software?
• RQ3: How does the use of paid versus free AI-assisted development tools such as Gemini impact security?
• RQ4: Do developers exhibit differing levels of trust in Gemini's free and paid versions?

We found that participants' programming experience significantly improved the security of code submissions, while no significant differences were observed in the security of code written with the assistance of the free version of Gemini, the paid version, or no assistance. Further, no significant security benefit or difference in user trust was found between the paid and free versions. Our results suggest that while Gemini can serve as a valuable supplementary aid, it cannot fully substitute programming experience.

2 Related Work

This section outlines studies on AI tool code generation and security developer studies.

2.1 AI Assistance Tool Studies

Research on transformer-based LLMs like Codex, AlphaCode, and GPT-4 assessed their performance in code generation [19, 32, 37, 71] and developed benchmarks like HumanEval [15] and MBPP [5], highlighting their potential to enhance software development. Developers value AI tools for boosting productivity [38, 54], with 92% of U.S. developers using AI tools for programming professionally and personally [22]. Pearce et al. [59] found that LLMs could fix 100% of security bugs given carefully constructed prompts. Siddiq et al. [69] identified 265 code smell types and 44 security smells across 3 LLM training sets, including 18 code smells and 2 security smells in GitHub Copilot's suggestions.
He and Vechev [31] proposed SVEN, a learning-based approach that improved secure code generation from a CodeGen LM from 59.1% to 92.3%, while maintaining functional correctness.

Pearce et al. [58] analyzed 1,689 GitHub Copilot-generated programs for vulnerabilities relevant to the "2021 CWE Top 25 Most Dangerous Software Weaknesses" [78], finding issues in about 40% of the programs. Mousavi et al. [42] investigated 5 Application Programming Interfaces (APIs) and found that ChatGPT generated code with API misuses in around 70% of the cases in a set of 48 programming tasks, revealing 20 distinct misuse types. Whether developers would prompt AI tools similarly or adopt their suggestions remains uncertain.

Sandoval et al. [68] studied 58 students using OpenAI's code-cushman-001 to implement a shopping list structure in C, manually analyzing the code for CWEs. Students with AI access produced security bugs at a rate no more than 10% higher than those without, suggesting that LLMs did not introduce new security risks.

Perry et al. [61] conducted a study with 47 students and professionals to explore interactions with an AI assistant based on OpenAI's codex-davinci-002 model for security-related tasks in Python, JavaScript, and C. In contrast to [68], they showed that code written with access to AI was less secure compared to code written without access. Interestingly, participants with security experience were less likely to trust and replicate AI outputs than those without security experience.

Asare et al. [3] evaluated GitHub Copilot's security performance in a user study with 25 students and professional developers, comparing code written with and without Copilot assistance in 2 C tasks: implementing a user sign-in and a transaction fulfillment. They found that access to Copilot accompanied more secure solutions when tackling more complex problems, while for easier problems, no effect was observed.
Similarly to [68], no disproportionate impact on particular vulnerabilities was found, highlighting the potential security benefits of using Copilot.

Oh et al. [52] conducted 2 studies on developer responses to insecure AI code suggestions. An online survey of 238 developers found that they frequently use these tools but may overlook poisoning risks. In a lab experiment with 30 professionals, developers using a poisoned tool were more likely to write insecure code, and participants trusted code completion tools significantly more than code generation tools.

A further study on consumer expectations and app usage behavior [6] found that users perceived notable security and privacy differences between free and paid versions of apps. Similarly, in healthcare research [44], paid chatbots were found to provide more readable responses. Compared to previous work, we focused our research on the security implications of using Gemini for software development and how paid and free AI versions affect code security and trust.

2.2 Security Developer Studies

Acar et al. [1] examined the impact of information resources on code security by surveying 295 Google Play developers and conducting a lab study with 54 Android developers. The findings showed that developers using Stack Overflow produced less secure code than those using official documentation or books. Nadi et al. [45] found that developers struggle with low-level Java cryptography APIs despite confidence in concept selection, highlighting a need for more task-based support.

Further, Acar et al. [2] conducted an online programming study with 307 GitHub users, who completed various security-related tasks, including credential storage, encryption, and writing a URL shortener service. They found differences in security and functionality based on participants' self-reported years of experience.

Naiakshina et al. [46–49] investigated developers' security behavior with secure password storage.
In a lab study [48], 20 students were tasked with storing passwords; half were told the study was about API usability, while the other half were directly instructed to ensure secure password storage. The authors noted that understanding security concepts did not always lead to secure task solutions. In a follow-up study, Naiakshina et al. [49] found a positive effect of copy/paste on security. This study also analyzed the effect of programming experience on the security scores, showing that years of Java experience had no significant effect on security. In contrast to the previous work, we wanted to understand if AI tools can substitute programming and security experience.

3 Methodology

We designed an experiment to examine developers' programming and security experience and the effect of Gemini's free and paid versions on code security scores. Based on our power analysis (see Section 3.3), we required a large sample of a hard-to-reach population (n = 159). We therefore conducted a remote field study with a large, international sample in their familiar environment, recruited via Upwork [33], and compared 3 groups solving security-related tasks:

• No-AI: A control group that did not use any AI.
• Free-AI: A group that used only Gemini's free version.
• Paid-AI: A group that used only Gemini's paid version.

Following best practices that recommend instructing participants not to use LLMs when they are not permitted [64], participants in No-AI were instructed not to use AI. By contrast, participants in Free-AI were asked to use the free version of Gemini with their own Google accounts, while participants in Paid-AI were provided with the login credentials for a Google account with a paid Google One subscription including access to Gemini Advanced.
After n- ishing the implementation task, participants submitted the code via GitHub Classroom and completed a survey . The survey included questions on demographics, challenges faced during implemen- tation, support and diculties with Gemini, and trust in AI. W e additionally used standardized scales to measure the workload using the NASA - TLX [ 28 ], usability of Gemini using the System Usability Scale (SUS) [ 12 ], and participants’ security self-ecacy using the Secure Software Development Self-Ecacy Scale (SSD-SES) [83]. 3.1 Programming T ask Due to the constantly growing cloud business, web applications are becoming increasingly important compared to conventional local on-premise solutions [ 76 ]. Therefore, we designe d a web-based programming task similar to Linktree [ 39 ] for our study . Linktree is a tool that lets users create a single link to house multiple links to their social media proles, websites, and other online resources. W e chose Python as the programming language due to its rst- place ranking on the TIOBE Index [ 13 ] and its position as the most popular language on GitHub [ 84 ]. Additionally , we used the Flask framework because it was the most popular Python-based web framework in 2024 [73]. For our study , we examined the top 10 of the 25 most dangerous CWEs [ 78 ] to identify relevant vulnerabilities for the study task and analyze the submissions’ security . W e excluded Out-of-bounds W rite (CWE-787), Use After Free (CWE-416), and Out-of-bounds Read (CWE-125) due to their spe cic applicability to low-level programming languages like C or C++. Further , we excluded OS Command Injection (CWE-78) because its characteristics depend on the underlying operating system. Finally , we excluded Path Trav ersal (CWE-22) and Unrestricted Upload of File with Dangerous T yp e (CWE-434), as le uploads and le system interactions wer e not aligned with our programming task. 
Thus, we selected the remaining 4 vulnerabilities:

Cross-Site Scripting (XSS): User input is not properly sanitized before being included in web pages, allowing it to be executed as code and potentially leading to malicious code execution (CWE-79) [81].

Cross-Site Request Forgery (CSRF): The web application cannot fully verify whether a valid request was intentionally submitted by the user (CWE-352) [80].

Improper Input Validation: Input is received without proper validation to ensure it meets the required properties for safe and correct processing (CWE-20) [79].

SQL Injection: User-controlled input is used in an SQL command without properly neutralizing special elements, allowing user inputs to be interpreted as SQL rather than as data (CWE-89) [82].

In addition, we selected the OWASP Top Ten category Cryptographic Failures [55], as authentication is essential for any web application that applies user accounts. Password-based authentication is the most commonly used authentication method in sensitive areas such as finance and medical care [74, 75]. Further, a dataset of 9,948,575,739 unique plaintext passwords uncovered by researchers in 2024 [62], and a ruling against Meta by the Irish Data Protection Commission for storing passwords in plaintext [17], highlight ongoing issues with inadequate password storage practices.

With our final set of 5 vulnerabilities, we designed the programming task to include user authentication and functionality for managing a list of websites associated with a user. The task was divided into 4 subtasks: implementing 1) user registration, 2) user login, 3) a function for adding websites, and 4) a function for deleting websites. All participants were asked to pay attention to security while completing the task. Participants could use their usual IDE and consult any resources to solve the problem, except for the restrictions on using Gemini.
A detailed description of the task can be found in Appendix A.3.

3.2 Pilot Study

We conducted a series of pilot studies to test our task design. First, the study was piloted with 2 student assistants. The first pilot participant worked approximately 2.5 hours on the task. Following the examples set in related AI studies [3, 61], we aimed for a 2-hour duration for our study task. Thus, we removed one of the initial 5 subtasks: implementing the public profile view of users, which had the lowest security impact. Next, we piloted our 3 groups with 1 participant per group from Upwork. These pilot participants worked 108 minutes on average. One participant canceled their participation after learning they were assigned to the No-AI group. This led us to add deception to the study design and not tell participants in the actual study that 3 groups and AI tools were being investigated. They were told the study was about developers' programming behavior with a Flask web application.

3.3 Power Analysis

Prior to conducting the experiment, a power analysis was performed to determine the necessary sample size. We considered a medium effect size (f = 0.25) [16], a significance level of α = 0.05, a power of 0.80, a numerator degree of freedom of 2, and 3 groups in the experimental design. The power analysis indicated that a sample size of 159 participants would provide sufficient statistical power to detect statistical significance. Consequently, a sample size of 159 participants (53 per group) was determined for the study to ensure the robustness and reliability of the findings.

3.4 Participants

We recruited freelance software developers on Upwork [33], following the recommendations of prior work on developer recruitment best practices [34]. The job post for this study can be found in Appendix A.2. Eligible participants were at least 18 years old and proficient in English.
Further, participants were required to have Python skills listed in their Upwork profiles. The Upwork team supported us during recruitment and invited 2,138 people to our job post. Of these, 542 answered the invitation. The Upwork team screened the interested participants directly based on the skills advertised in their Upwork profiles. The researchers double-checked the screening criteria, and 291 eligible participants were contacted. 102 participants did not respond or canceled. Finally, 189 participants were recruited and assigned to one of our 3 groups. All participants were compensated with $60 via Upwork.

We asked participants to submit their solutions within 2 days to reflect that freelancers might work on parallel projects. We chose this period to prevent the paid AI accounts from being blocked for other participants for a longer period of time, as these accounts could only be used by one participant at a time. All participants were asked in the follow-up survey what resources they used to complete the task. Of the 189 participants who completed the study, 7 were excluded from No-AI for using AI tools despite being told not to, 14 were excluded from Free-AI for using no AI tools or AI tools other than the free version of Gemini, and 9 were excluded from Paid-AI for using no AI tools or AI tools other than the paid version of Gemini.

The demographics of the remaining 159 participants can be found in Table 1. Our sample involved 91 freelance and 37 industrial developers. 147 participants were men, 9 were women, and 3 did not disclose their gender, reflecting the typical gender distribution in software development found in the SO Developer Survey [72], in which only 5.17% of respondents identified as women. Ages ranged from 18 to 54 with a mean of 28. 95 participants had a bachelor's degree, and 44 held a master's or equivalent. Participants came from 42 countries, including Pakistan (33), India (27), the USA (13), and Nigeria (11).
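The a-priori power analysis of Section 3.3 can be reproduced numerically from the noncentral F distribution. This is a sketch under the stated parameters (f = 0.25, α = 0.05, power = 0.80, 3 groups); the paper does not say which tool the authors used.

```python
from scipy import stats

def anova_power(n_total: int, k_groups: int = 3, f: float = 0.25,
                alpha: float = 0.05) -> float:
    """Power of a one-way ANOVA omnibus F test for a medium effect (f = 0.25)."""
    df1 = k_groups - 1                       # numerator df = 2, as in the paper
    df2 = n_total - k_groups
    nc = f ** 2 * n_total                    # noncentrality parameter
    crit = stats.f.ppf(1 - alpha, df1, df2)  # critical F at alpha = .05
    return stats.ncf.sf(crit, df1, df2, nc)  # P(noncentral F > critical F)

n = 6
while anova_power(n) < 0.80:
    n += 3                                   # keep the three groups balanced
print(n, round(anova_power(n), 3))           # close to the study's n = 159
```

The smallest balanced sample reaching 80% power under these settings lands at the study's total of 159 participants (53 per group).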
3.5 Evaluation

We analyzed whether the submissions fulfilled our functional and security requirements as follows.

Code Analysis: Two researchers independently reviewed all submissions to assess both functional correctness and potential security vulnerabilities. As part of the security evaluation, the reviewers actively examined whether issues such as XSS, CSRF, SQL injection, or improper input validation might arise. When their assessments differed, the researchers engaged in a structured discussion to analyze the discrepancy and reach a consensus on the final judgment. We used a binary score for (non-)existent vulnerabilities (0 or 1). Finally, we checked all submissions with 2 static scanners, Bandit [66] and SonarQube [67], which are listed by OWASP [57]. This enabled us to verify our findings and cross-check that nothing had been overlooked. SonarQube is a widely used solution for analyzing code quality and detecting bugs and vulnerabilities, used by 7 million developers and over 400,000 organizations [67]. Bandit is an open-source tool that analyzes Python code for security vulnerabilities. It is integrated into over 48,000 projects on GitHub [66].

Table 1: Demographics of the 159 Participants.

Gender: Man: 147, Woman: 9, Prefer not to disclose: 3
Age*: min = 18.0, max = 54.0, mean = 28.26, median = 27.0, sd = 6.21
Educational Qualification: Bachelor degree: 95, Master's degree or equivalent diploma: 44, School degree: 6, Professional training: 4, Doctoral degree (Dr./PhD): 3, No school degree: 3, Other: 4
Main Occupation: Freelance developer: 91, Industrial developer: 37, Graduate student: 9, Undergraduate student: 9, Academic researcher: 2, Other: 11
Country of Residence: PK: 33, IN: 27, US: 13, NG: 11, GB: 9, KE: 7, EG: 6, CA: 4, ET: 4, UA: 4, NP: 3, MA: 3, MY: 2, AE: 2, ES: 2, DE: 2, DZ: 2, Other: 25
General Development Experience [years]*: min = 1.0, max = 40.0, mean = 6.53, median = 5.0, sd = 4.88
Python Experience [years]*: min = 1.0, max = 25.0, mean = 4.63, median = 4.0, sd = 3.17
Commonly Used AI Tools: ChatGPT free version: 101, Gemini free version: 57, ChatGPT paid version: 39, GitHub Copilot: 34, Gemini paid version: 8, Visual Studio IntelliCode: 8, Tabnine free version: 3, None: 1, Other: 9
Security Experience: No experience: 63, Security course/training: 45, Developed security applications/implemented security measures: 42, Worked on IT security in spare time: 14, Worked at IT security-related companies: 10, Certificate in IT security: 4, Degree in IT security: 3, Other: 4
* = There were no significant differences between the groups.

Additionally, participants received one point for securely storing user passwords. For this, we expected participants to hash and salt passwords using state-of-the-art password hashing functions or schemes (e.g., bcrypt [65], scrypt [60], Argon2 [7]), according to [30, 48]. Two researchers manually evaluated the participants' submissions and resolved conflicts by discussion. The final security score ranged from 0 to 5.

Open Survey Questions: Our follow-up survey included open-ended questions about challenges faced during implementation, support and difficulties with Gemini, factors influencing trust and mistrust in AI, and the security best practices employed. Two researchers conducted a thematic analysis [8] of the responses to the open-ended questions. First, they developed an initial codebook based on the responses from 5 random participants. Following this, each researcher independently coded the responses, regularly meeting to discuss and refine emerging themes and categories. During these discussions, they collaboratively added or merged codes, resulting in the final version of the codebook. After both researchers completed coding, they resolved discrepancies through discussion, reaching full agreement. This approach aligns with the guidance of Braun and Clarke [9, 10].
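The 0–5 scoring rubric from Section 3.5 can be expressed as a small function. The criterion and field names below are ours, not the authors'; the five criteria mirror the four vulnerability classes judged absent by both reviewers plus the secure-password-storage point.

```python
# One point per criterion the submission satisfies; final score ranges 0-5.
CRITERIA = ("no_xss", "no_csrf", "no_improper_input_validation",
            "no_sql_injection", "secure_password_storage")

def security_score(review: dict) -> int:
    """Sum the binary per-criterion judgments from the consensus review."""
    return sum(1 for c in CRITERIA if review.get(c, False))

# Example: parameterized SQL and hashed passwords, but missing CSRF tokens
# and output escaping, and no input validation.
print(security_score({"no_xss": False, "no_csrf": False,
                      "no_improper_input_validation": False,
                      "no_sql_injection": True,
                      "secure_password_storage": True}))  # → 2
```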
The nal codebook can be found in the Appendix A.5. The Impact of AI- Assisted Development on Soware Security: A Study of Gemini and Developer Experience 3.6 Methodology: Statistical Analysis and Hypothesis T esting W e derived hypotheses from our research questions and used stan- dard statistical hypothesis tests to examine the following 4 main hypotheses in our study: • H1: There is a signicant eect of the programming experi- ence on the security score. • H2: There is a signicant ee ct of security experience on the security score. • H3: There is a signicant dier ence in the security scores among Free-AI , P aid -AI , and No-AI . • H4: There is a signicant eect of the tool ( Free-AI , P aid - AI ) on the trust in the tool. T o address the rst three hypotheses, we mo delled the binar y Security Score (the numb er of passed security test cases out of 𝐾 = 5 ) with a logistic regression ( binomial GLM) that simultaneously includes all hypothesised predictors. The full model is logit  Pr ( Success )  = 𝛽 0 + 𝛽 1 Group Free-AI + 𝛽 2 Group Paid-AI + 𝛽 3 Security Experience + 𝛽 4 Programming Experience + 𝛽 5 Security Self-Ecacy + 𝛽 6 W orkload + 𝜀 where • Group is a three - level factor (No - AI, Free - AI, Paid - AI) code d with the control (No-AI) as reference; • Security Experience is a binary indicator of self - reported prior se curity experience (0 = no experience, 1 = experi- ence); • Programming Experience , Security Self-Efficacy and Workload are continuous, z-standardised measures. All predictors are entered simultaneously , thus each coecient reects the unique contribution of the corresponding variable while holding the others constant. This single - model approach directly yields eect - size estimates (log - odds 𝛽 , exponentiated to odds - ratios, OR) that answer the hypotheses without the need for a series of separate tests. Model diagnostics. 
Prior to inference, we examined over-dispersion by comparing the residual deviance to its degrees of freedom; the ratio was well below the critical value of 1.5, indicating that the binomial assumption holds. Had over-dispersion been detected, we would have refitted the model with a quasibinomial family, but this was not required.

Hypothesis testing: For the categorical predictor (Group) we performed a Type-II likelihood-ratio test (χ²) to evaluate the null hypothesis H0: βj = 0 (no effect) against the alternative H1: βj ≠ 0. Continuous predictors (Programming Experience, Security Self-Efficacy, Workload) were tested with Wald z-statistics. All tests used a two-tailed α-level of .05; marginally significant effects (p between .05 and .10) are reported as trends.

Effect-size estimation: Model coefficients were exponentiated to obtain odds ratios with 95% confidence intervals (via the Wald method). To aid interpretability, we derived marginal means (estimated probabilities) for each level of Group using the `emmeans` package, back-transforming from the logit to the probability scale. Pairwise differences between AI conditions were examined with Tukey-adjusted odds-ratio contrasts to control the family-wise error rate.

Model fit: Goodness-of-fit was summarised with pseudo-R² indices (Nagelkerke's R² = 0.11) and the Hosmer–Lemeshow test (χ²(8) = 5.3, p = 0.73) to confirm adequate calibration. In short, the logistic regression framework allows us to test the main effects of programming experience, security experience, and Gemini assistance (RQ1–RQ3) in a single, statistically rigorous model, while the Type-II LR tests, odds-ratio reporting, and post-hoc Tukey contrasts provide the inferential basis for accepting or rejecting the corresponding hypotheses.

For the fourth research question, we calculated descriptive statistics (means and standard deviations) for each condition.
To test the hypothesis that trust levels differ between the Free-AI and Paid-AI groups, we compared the two independent samples using the non-parametric Mann–Whitney U test (appropriate for ordinal data and non-normal distributions). The test was performed separately for each trust item, yielding U statistics and two-tailed p-values.

3.7 Ethics

Our university did not have an institutional review board (IRB) at the time of this study. However, we complied with the General Data Protection Regulation (GDPR), cleared with our data protection officer. To protect participants' privacy, we minimized the collection of personal data and only collected information necessary for the study. Participants were informed about the study's procedure and provided with a consent form, including our data policies and the study information, which also provided contact details for the research team and the data protection officer. Participation was voluntary, and participants were asked to agree to the consent form before proceeding. Participants were allowed to ask questions at any time during the study and could withdraw at any time without facing any consequences. At the end of the study, we debriefed and informed all participants about the purpose of the study.

3.8 Limitations

As with most user studies, this study has several limitations that should be considered when interpreting its findings. First, we recruited freelance developers for this study, which may limit the generalizability of our findings to other populations. Developers employed by companies or recruited through other platforms may behave differently. Second, the programming task in our study was intentionally designed to be susceptible to various vulnerabilities in Python. Different programming tasks or the use of other programming languages may have led to different outcomes. Third, we focused exclusively on the free and paid versions of the AI tool Gemini.
Further research is needed to explore how results vary across different tools. Fourth, following best practices that recommend instructing participants not to use LLMs when they are not permitted [64], we instructed participants either not to use any AI tools or to use only Gemini, depending on their condition. While we could not verify compliance, all participants were informed that they would be compensated regardless, so we have no indication of incentives to misreport. Fifth, due to the remote study design, our analysis was limited to participants' code submissions. Thus, we could not verify how prompting behavior influenced the security score. Finally, the observed effect size is modest, and the model explains only a limited portion of the variance in the security score. The sample size may have limited our ability to detect smaller effects, suggesting that additional factors could have influenced security scores.

4 Results

This section describes the results for the 4 research questions. Following the evaluation of our quantitative results, we present an analysis of our qualitative findings for each research question to provide further insight and explanation. We report individual statements and results by referencing No-AI participants with N, Free-AI participants with F, and Paid-AI participants with P. Participants completed the task in a mean time of 144 minutes. There was no statistical difference in completion time between the three groups. An overview of the general descriptive statistics is presented in Table 2.

Functionality: Of the 159 participants (No-AI n = 53, Free-AI n = 53, Paid-AI n = 53), 138 submitted functional solutions (No-AI n = 47, Free-AI n = 41, Paid-AI n = 50). In the following, we report our security analysis of participants' code by considering only functional solutions.
We modeled participants’ security score as a binomial outcome using logistic regression, with the number of correct items out of K = 5 per participant as the dependent variable. The model included Group (three levels) and four covariates entered as main effects: standardized overall programming experience, standardized self-efficacy, standardized workload, and prior security experience. We report likelihood-ratio tests for each term and odds ratios with 95% CIs; adjusted marginal means by group are presented on the 0–5 score scale (i.e., probability × 5). Given the high correlation between Python-specific and overall programming experience (r = 0.78), Python experience was omitted from the final model to prevent multicollinearity; overall programming experience was retained. Table 3 provides an overview of the estimated odds ratios and their 95% confidence intervals, showing how each variable is associated with changes in the odds of the outcome.

Model Diagnostics: Over-dispersion was assessed via the Pearson χ² statistic. The dispersion factor φ̂ = 0.87 indicated no serious over-dispersion, so a standard binomial GLM was retained (the beta-binomial alternative was not triggered). Pseudo-R² values (Table 4) show modest explanatory power (McFadden R² = 0.035).

4.1 RQ1 - Programming Experience and Software Security

Quantitative Analysis. Regression Analysis: A one-standard-deviation increase in overall programming experience raised the odds of a higher security score by a factor of 1.43 (95% CI [1.14, 1.81], p = 0.002). In probability terms, the predicted security success rate increased from 30% (mean experience) to 38% at +1 SD. A one-SD increase in programming experience raised the odds of a higher security score by 23%.

Figure 1: Correlation Between Programming Experience and Security Score.
Descriptive Values: Figure 1 visualizes the association between the security score and programming experience: points show individual participants (colored by group), with a least-squares trend line. The relationship was approximately linear (Pearson r = 0.2, p = 0.018). Table 5 reports descriptive statistics for Security Score and Programming Experience by group (mean, SD, n, min, and max).

Qualitative Analysis. Validation by Experienced Developers: Developers reported using their knowledge and experience to evaluate, verify, and correct Gemini’s output. Rather than blindly trusting the AI, they reported using it as a starting point, cross-checking its outputs with their experience and other resources. For instance, P16 noted: “I have background knowledge of backend development with Flask and had a general idea of where the LLM was going in development” — [P16]. F56 shared: “with my experience and skill set, I can strategically test, verify, and take over to ensure the final code for a given task meets the needs of the project” — [F56]. This sentiment was echoed by 6 participants in Free-AI (F05, F06, F34, F36, F56, F67) and 5 participants in Paid-AI (P16, P26, P32, P56, P59), who described their ability to assess Gemini’s suggestions due to their knowledge or experience. In summary, the fact that experienced developers verified Gemini’s output might help explain the association between programming experience and higher security scores, suggesting that programming experience remains important for critically evaluating and integrating Gemini’s suggestions.

We found that general programming experience significantly improved code security. Developers with more years of experience wrote more secure code. Many participants emphasized the importance of evaluating AI-generated code using their own experience.
RQ 1 – Summary

4.2 RQ2 - Security Experience and Software Security

Quantitative Analysis.
We divided the participants into two groups: those who reported having previous security experience and those without. Security experience included security education, certifications, implementation of security measures, work in security-related companies, or personal security projects.

Table 2: Overview of Task Statistics.
Time in Minutes | mean = 143.89, median = 120.0, min = 1.0, max = 800.0, sd = 128.57
Resources Used | Official documentation: 90, StackOverflow: 84, GitHub: 44, Tutorial Websites: 37, YouTube: 18, W3Schools: 10, Codecademy: 1, Other: 27
SUS Gemini Free | mean = 76.1, median = 77.5, min = 32.5, max = 100, sd = 14.84
SUS Gemini Paid | mean = 77.1, median = 77.5, min = 50.0, max = 100, sd = 14.15
Helpfulness Gemini Free* | mean = 4.68, median = 5.0, min = 2.0, max = 7.0, sd = 1.33
Helpfulness Gemini Paid* | mean = 4.92, median = 5.0, min = 2.0, max = 7.0, sd = 1.44
Number of Prompts Gemini Free | mean = 11.62, median = 5.0, min = 1.0, max = 80.0, sd = 16.51
Number of Prompts Gemini Paid | mean = 29.64, median = 10.0, min = 1.0, max = 500.0, sd = 74.08
% of Adopted Suggestions Gemini Free | mean = 57.17, median = 60.0, min = 0.0, max = 100.0, sd = 22.73
% of Adopted Suggestions Gemini Paid | mean = 56.42, median = 60.0, min = 10.0, max = 100.0, sd = 25.88
* = Likert scale, 1: Not Helpful - 7: Very Helpful.

Table 3: Estimated Odds Ratios and 95% CIs from the Binomial GLM.
Term | Estimate | Std. Error | Statistic | p-value | Conf. Low | Conf. High
(Intercept) | 0.64 | 0.173 | -2.61 | 0.0090 | 0.45 | 0.89
Free-AI | 1.31 | 0.199 | 1.34 | 0.180 | 0.88 | 1.93
Paid-AI | 1.54 | 0.192 | 2.24 | 0.0252 | 1.06 | 2.25
Programming Experience | 1.23 | 0.0818 | 2.55 | 0.0107 | 1.05 | 1.45
Security Self-Efficacy | 0.92 | 0.0869 | -0.94 | 0.346 | 0.78 | 1.09
Workload | 1.08 | 0.0804 | 0.96 | 0.337 | 0.92 | 1.27
Security Experience | 0.77 | 0.178 | -1.44 | 0.151 | 0.55 | 1.10

Table 4: Pseudo-R² Statistics for the Final Model.
McFadden | Cox & Snell | Nagelkerke
0.035 | 0.10 | 0.103

Table 5: Descriptive Statistics for Programming Experience (Years) and Security Scores by AI Group.
Group | Count | Mean | SD | Min | Max
General Programming Experience
No-AI | 47 | 6.70 | 4.68 | 1 | 22
Free-AI | 41 | 6.25 | 6.23 | 1 | 40
Paid-AI | 50 | 7.16 | 4.52 | 2 | 25
Security Score
No-AI | 47 | 1.81 | 1.17 | 0 | 5
Free-AI | 41 | 2.07 | 0.79 | 1 | 5
Paid-AI | 50 | 2.26 | 1.10 | 0 | 5

Table 6: Descriptive Statistics of Security Scores by Security Experience and AI Group.
Security Experience | Group | Mean | Variance | Min | Max | Count
No Experience | No-AI | 1.80 | 1.22 | 1 | 5 | 20
No Experience | Free-AI | 2.12 | 0.49 | 1 | 4 | 17
No Experience | Paid-AI | 2.68 | 1.67 | 0 | 5 | 19
No Experience | All | 2.20 | 1.25 | 0 | 5 | 56
Experience | No-AI | 1.81 | 1.54 | 0 | 5 | 27
Experience | Free-AI | 2.04 | 0.74 | 1 | 5 | 24
Experience | Paid-AI | 2.00 | 0.80 | 1 | 4 | 31
Experience | All | 1.95 | 1.01 | 0 | 5 | 82

Developers who reported prior security experience tended to have lower odds of a higher security score (0.75, 95% CI [0.55, 1.01], p = 0.058). Although the effect narrowly missed conventional significance (α = 0.05), the direction is noteworthy: participants with security experience produced software that, on average, had lower security scores (predicted probability = 0.27) than those without such experience (predicted probability = 0.30). The mean security score for participants without prior security experience was 2.20, compared to 1.95 for those with experience. Table 6 further shows that participants without security experience achieved higher mean security scores when using Gemini than participants with experience. When no AI tool was used, participants with and without prior security experience showed nearly identical mean scores (1.80 vs. 1.81).
In contrast, when Gemini assistance was available, participants without prior security experience achieved higher mean scores than those with security experience, indicating that Gemini may have mitigated differences related to security experience within the scope of this analysis. Table 7 presents the descriptive statistics for the participants’ SSD-SES scores, which were comparable across groups.

Table 7: Summary Statistics for Security Self-Efficacy by Group (last five columns: participants per score range).
Group | Count | Mean | Variance | Min | Max | Median | 0-1 | >1-2 | >2-3 | >3-4 | >4-5
No-AI | 47 | 3.31 | 0.90 | 1.33 | 4.80 | 3.60 | 0 | 6 | 12 | 16 | 13
Free-AI | 41 | 3.44 | 0.81 | 1.20 | 4.73 | 3.67 | 0 | 5 | 6 | 18 | 12
Paid-AI | 50 | 3.57 | 0.58 | 2.00 | 4.93 | 3.73 | 0 | 1 | 13 | 20 | 16

Qualitative Analysis. Leveraging Security Experience: F44 emphasized the critical role of experience when reviewing AI-generated code: “auditing for security against today’s high-end skilled cyber attacker is necessary with assistance of experienced developer” — [F44]. Similarly, P27 and P32 stressed the value of their personal experience in identifying vulnerabilities: “Based on personal experience, I can identify weak points” — [P27]; “As I have experience so I can know easily if the presented code makes sense or not. Or if there are any security issues” — [P32]. P16 highlighted the importance of experience, particularly in security: “I also have background knowledge of password storage [...]. At each step, I tested the generated code by Gemini to confirm that it met requirements in terms of functionality and security. It is not safe to trust code generated by Gemini blindly” — [P16]. In contrast, F08 described a different perspective, trusting the code generated by Gemini precisely because they lacked sufficient security knowledge: “I’m not knowledgeable enough about security implementations to know if the code provided is insecure” — [F08].
This might indicate that developers without prior security experience relied more heavily on Gemini or formulated more targeted security-related prompts, which may help contextualize the quantitative finding that participants without prior security experience achieved slightly higher security scores when using Gemini.

We observed no significant effect of security experience on the security score. However, developers with no prior security experience achieved slightly higher security scores when using Gemini compared to participants without experience who did not use AI. Qualitative responses suggest that some developers with security experience critically evaluated Gemini’s suggestions, whereas some less experienced participants relied more directly on Gemini’s output.
RQ 2 – Summary

4.3 RQ3 - Paid-AI vs. Free-AI

Quantitative Analysis. Main Effect of AI Group: The omnibus Type-II LR test for Group was marginally significant (χ²(2) = 5.12, p = 0.077), suggesting a trend toward a differential impact of Gemini assistance on security scores. Post-hoc pairwise contrasts (Table 8) confirm this pattern. Estimated Probabilities: Marginal means back-transformed to the probability scale (Table 9) show a monotonic increase from the control condition (No-AI) to Paid-AI, but none of the differences reach statistical significance. After adjusting for experience, self-efficacy, workload, and prior security background, participants who used the paid version achieved a 10% absolute gain in security score relative to the No-AI group (95% CI: 5.8%–14.9%; p = 0.06).

Table 8: Pairwise Odds-Ratio Comparisons Between AI Groups (Adjusted With Tukey’s Method).
Contrast | Odds Ratio | p
No-AI vs. Free-AI | 0.766 | 0.3732
No-AI vs. Paid-AI | 0.650 | 0.0650
Free-AI vs. Paid-AI | 0.849 | 0.6772

Table 10 displays the summary statistics for the NASA-TLX. Notably, the Paid-AI group had lower workload scores than the other two groups.
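For readers unfamiliar with the back-transformation behind these estimates, the sketch below shows how values on the odds scale map to the probability scale, using the intercept and the Paid-AI odds ratio from Table 3. The results will not match the reported marginal means exactly, because those are averaged over security experience rather than evaluated at the reference covariate levels.

```python
# Converting odds (and odds ratios) to predicted probabilities.
def odds_to_prob(odds: float) -> float:
    """Inverse of p / (1 - p): p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

baseline_odds = 0.64  # intercept from Table 3 (No-AI, reference covariate levels)
or_paid = 1.54        # Paid-AI vs. No-AI odds ratio from Table 3

p_no_ai = odds_to_prob(baseline_odds)           # ≈ 0.390
p_paid = odds_to_prob(baseline_odds * or_paid)  # ≈ 0.496
print(f"No-AI: {p_no_ai:.3f}, Paid-AI: {p_paid:.3f}")
```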
Table 9: Estimated Probability of a Successful Security Outcome by AI Group (Averaged Over Security Experience).
Group | Probability | 95% CI
No-AI | 0.359 | [0.299, 0.424]
Free-AI | 0.422 | [0.355, 0.492]
Paid-AI | 0.462 | [0.400, 0.526]

Security: We evaluated code security on a scale from 0-5 (see Section 3.5). The mean security scores were relatively close across groups, with Free-AI (2.07) and Paid-AI (2.26) being similar, while No-AI was slightly lower (1.81). Table 11 shows the percentages of the individual vulnerabilities in the security score. Notably, password-storage security improved by 21.7% with Gemini assistance, and SQL injection vulnerabilities improved by 10.5%. In contrast, XSS vulnerabilities showed only a 3.5% improvement, while CSRF and improper input validation exhibited almost no change with Gemini assistance. The distributions of the scores for each group are visualized in Figure 2, using violin plots to illustrate the spread and density of the data. No-AI is concentrated around security scores of 1 (16 participants) and 2 (19 participants). Free-AI shows 27 participants with a score of 2 and 5 participants with a score of 3. Paid-AI also peaks at a score of 2 with 27 participants, but additionally has 9 participants with a score of 1 and 7 with a score of 4.

Qualitative Analysis. Security Assistance From Gemini: Five participants in Paid-AI (P31, P44, P46, P51, P62) and 3 in Free-AI (F05, F35, F54) mentioned using Gemini to get general suggestions about security (e.g., “it suggested some techniques to improve the security of the website” — [F35]).
Table 10: Summary Statistics for the NASA-TLX by Group (last nine columns: participants per score range).
Group | Count | Mean | Variance | Min | Max | Median | 0-10 | 11-20 | 21-30 | 31-40 | 41-50 | 51-60 | 61-70 | 71-80 | 81-90
No-AI | 47 | 41.49 | 341.12 | 6 | 84.00 | 40.67 | 2 | 5 | 6 | 10 | 9 | 11 | 1 | 2 | 1
Free-AI | 41 | 40.40 | 376.67 | 2 | 86.67 | 42.67 | 2 | 5 | 6 | 6 | 10 | 5 | 4 | 2 | 1
Paid-AI | 50 | 33.44 | 448.50 | 7 | 90.00 | 28.00 | 4 | 11 | 14 | 6 | 5 | 2 | 5 | 1 | 2

Table 11: Percentage of Vulnerabilities by Group. The “Average AI” column shows the average percentage of vulnerabilities for participants who used AI (Free-AI and Paid-AI), followed by the difference to No-AI.
Category | No-AI | Free-AI | Paid-AI | Average AI | Difference No-AI to AI
Insecure Password Storage | 40.4 | 17.1 | 20 | 18.68 | -21.7
SQL Injection (CWE-89) Vulnerable | 21 | 7 | 14 | 10.5 | -10.5
XSS (CWE-79) Vulnerable | 79 | 85 | 66 | 75.5 | -3.5
CSRF (CWE-352) Vulnerable | 96 | 93 | 96 | 94.5 | -1.5
Improper Input Validation (CWE-20) Vulnerable | 83 | 90 | 78 | 84 | 1

Figure 2: Violin Plot of Security Scores by Group (0 = lowest, 5 = highest security score).

Four participants from Free-AI (F05, F35, F54, F65) and 3 from Paid-AI (P01, P27, P62) reported using Gemini to identify vulnerabilities. For example, P01 explained their process: “First I query Gemini AI to complete the tasks provided with task and supporting code. Then wrote another query to rewrite the task (functions) to optimize it and make it secure” — [P01]. P62 shared that Gemini “helped talk through potential security issues” — [P62], emphasizing its role in identifying vulnerabilities.

Security Challenges With Gemini: Eleven participants in Paid-AI and 6 in Free-AI highlighted various ways in which Gemini hindered them in writing secure code.
Two participants (P27, P43) noted that the security suggestions were not optimal (e.g., “Sometimes it correctly points on a possible security issue, but solution is not accurate or being done in a more convenient way than it suggests” — [P27]). The quality of explanations for security suggestions was criticized by 5 participants (F35, F64, F67, P56, P57). P57 felt that “The explanation wasn’t convincing and/or the risk of trusting any wrong code was very high in cases such as password hashing or verification” — [P57]. Three participants (F59, P43, P58) noted that when asking Gemini for security suggestions, “it often provides code with errors” — [P43]. The generated code not adhering to security best practices was mentioned by F64 and P01, while F44 and P49 noted that Gemini was more likely to introduce new vulnerabilities in the generated code.

Credential Storage: Four participants in Paid-AI (P30, P35, P46, P57) and 4 in Free-AI (F02, F23, F39, F64) reported using Gemini for assistance with hashing. For instance, F02 mentioned that Gemini “Just helped [them] decide over hashing methods” — [F02], highlighting its assistance in making informed decisions. P57 appreciated how Gemini introduced approaches they might not have considered on their own: “It has provided me approaches that I probably wouldn’t have thought of beforehand. Its implementation of hash for making the application more secure was helpful” — [P57]. Three participants (F31, P07, P16) mentioned Gemini’s failure to incorporate security into its suggestions. P16 described how Gemini’s responses lacked crucial security measures, stating: “The initial responses from Gemini on registration/login showed no hashing or salting of passwords and tried storing them in plain text” — [P16]. Outdated security suggestions were a concern for 3 participants (F64, P46, P48) (e.g., “Gemini generated code based on older versions of bcrypt” — [P46]).
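The plain-text password storage P16 describes is straightforward to avoid. The sketch below shows salted password hashing and verification, using Python’s standard-library hashlib.scrypt as a stand-in for the bcrypt and Argon2 libraries participants discussed; the pattern (random per-password salt, slow key-derivation function, constant-time comparison) is the same, and the work factors shown are illustrative rather than a vetted recommendation.

```python
# Minimal salted password hashing/verification sketch (stdlib only).
import hashlib
import hmac
import os

# Illustrative scrypt work factors; tune for your hardware and threat model.
SCRYPT_PARAMS = dict(n=2**14, r=8, p=1)

def hash_password(password: str) -> tuple[bytes, bytes]:
    salt = os.urandom(16)  # fresh random salt for every password
    digest = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return salt, digest    # store both; the salt is not secret

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return hmac.compare_digest(candidate, digest)  # constant-time comparison

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True
print(verify_password("wrong guess", salt, digest))                   # False
```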
SQL Injection: Two participants (F05, P36) reported using Gemini for SQL injection prevention. P36 explained how Gemini corrected their misunderstanding of SQL parameter interpolation, preventing an SQL injection vulnerability in their code: “I assumed that SQL params interpolation is done using ’%s’ syntax, but Gemini suggested a syntax with ’?’. I corrected it, but turns out I was wrong and Gemini was correct” — [P36]. Notably, 19 participants indicated that they actively implemented measures to prevent SQL injections as a best practice. Three participants (P36, P43, N01) reported general challenges with preventing SQL injections (e.g., “parameterizing queries to prevent SQL injection and properly handling database connections” — [N01]). The relatively high number of participants mentioning SQL injection prevention, coupled with the frequency of actually mitigated vulnerabilities, suggests greater awareness of SQL injection vulnerabilities than of improper input validation, CSRF, and XSS.

Improper Input Validation: Nine participants mentioned mitigating improper input validation. F35 felt that the security suggestions for improper input validation and hashing were confusing: “One time Gemini suggested Form validation and hash password as suggestion I think that was bit confusing and I was not able to implement it properly” — [F35]. P56 mentioned their challenge with improper input validation: “I used package validators for validating URLs, but the sample code didn’t mention what kind of validation it really does and I had several failed tests before looking up more elaborate docs” — [P56]. Two further participants (N18, N52) reported challenges with improper input validation. N52 described challenges with validating URLs: “if you inserted “facebook.com” or “www.facebook.com,” it would raise an error ’Invalid URL’. That’s because the missing HTTPS, so it has to be something like this https://www.facebook.com [...] This took a lot of my time and was frustrating” — [N52].

CSRF: Only 4 participants (N07, F15, F29, P62) reported taking measures to prevent CSRF. Interestingly, no participants reported having challenges with CSRF prevention. Given that CSRF was also the most prevalent vulnerability, this suggests a low awareness of CSRF.

XSS: Just 2 participants (N26, P62) mentioned addressing XSS. P62 mentioned preventing “XSS, [...] as suggested by Gemini” — [P62]. Only one participant, P43, reported challenges with XSS: “The challenge was during the security part when I wanted to avoid [...] XSS attacks” — [P43].

No statistically significant differences were observed among developers using no AI, the free version, or the paid version of Gemini. Participants found Gemini helpful for security tasks, including general security advice, guidance on hashing, and identifying vulnerabilities. However, they noted challenges such as suboptimal or outdated suggestions, missed measures, unclear explanations, errors, and concerns about Gemini’s suggestions introducing vulnerabilities.
RQ 3 – Summary

4.4 RQ4 - Impact of Gemini Use on User Trust

Quantitative Analysis. Trust in AI: We asked participants if they “trust Gemini in general” and to “generate secure code” on a Likert scale from 1: Never - 7: Always (see Figure 3). Participants in the No-AI group were asked if they would trust AI (see Figure 4). The mean scores for general trust in Gemini were similar in both groups, with Free-AI having a slightly higher mean (4.85) than Paid-AI (4.76). The results for trust in Gemini to create secure code were also similar. However, the mean values were slightly lower than for general trust: 4.20 for Free-AI and 4.06 for Paid-AI. Trust in Solutions: Similarly, we asked participants to rate their self-assessed functional correctness (Figure 5) and self-assessed security (Figure 6).
The mean scores for self-assessed functional correctness were 6.32 in No-AI, 6.80 in Free-AI, and 6.64 in Paid-AI. The mean scores for self-assessed security were 5.23 in No-AI, 5.41 in Free-AI, and 5.48 in Paid-AI. The results indicate that participants trust the functionality of their code more than its security. Table 12 summarizes the AI-preference responses for No-AI, indicating that the majority of participants in the No-AI group would have preferred to use AI.

Figure 3: Responses for *“I trust Gemini in general” and **“I trust Gemini to generate secure code”.

Figure 4: Responses for *“Would you trust AI in general?” and **“Would you trust AI to generate secure code?”, Asked for the No-AI Group.

Table 12: Responses to AI Assistance Preference (“Would you have appreciated assistance from an AI?”), Asked for the No-AI Group.
AI Preference | Number of Participants
No | 13
Yes | 34

Figure 5: Responses for “I believe I have solved this task functionally correctly.”

Significance Testing: A Mann-Whitney U test was conducted to compare trust between the Free-AI and Paid-AI groups. The results showed no statistically significant difference between the groups for general trust (U = 1072, p = .7012) or security trust (U = 1060, p = .7782).

Figure 6: Responses for “How would you rate the level of security of your solution for the task?”

Qualitative Analysis. Reasons for Trust: Forty-one participants reported trusting Gemini’s suggestions because they could assess them using their own knowledge or by cross-referencing with other resources (e.g., “With my own knowledge [...], I could verify the suggestions [...]. Since they align and make sense, and sometimes can be verified by other sites (SO, API doc, etc.), I trust those suggestions” — [F05]).
Seventeen participants expressed trust in Gemini’s suggestions because they were mostly correct or worked as expected. Twelve participants trusted Gemini’s suggestions due to its informative explanations (e.g., “I also trusted the code because Gemini explained the code properly and added comments to the code it generated” — [P57]). Trust in Gemini’s suggestions was linked to its training data, as noted by 10 participants (e.g., “they are based on extensive knowledge and experience” — [P14]). Six participants (P09, P44, F02, F23, F35, F40) trusted Gemini because of the company behind it, Google (e.g., “the fact that it was developed by Google as a reputable company helped me trust the suggestions” — [F23]). Trust was also attributed to references by 4 participants (F13, F39, F59, F64), e.g., “All the answers by Gemini were referenced. [...] That is why I think Gemini gives trustable results than other LLMs” — [F39]; “providing links to official documentation and well-regarded tutorials [...] was particularly important for implementing security [...]” — [F64]. Ten participants reported trusting Gemini specifically for security-related suggestions. For example, P27 highlighted Gemini’s ability to catch security mistakes: “giving it a code for, sort of, review provides a layer of security. Basically, it’s another pair of eyes that might notice something you’ve missed or even doesn’t know” — [P27]. Interestingly, 2 participants (P07, P55) mentioned trusting the paid version of Gemini more: “As its was the premium version, so i believe the resource behind it must be best available” — [P07]; “Looked secure to me, moreover, it is their paid version, so it must generate good code, right?” — [P55]. Apart from these two participants, there were no instances where trust differed between the free and the paid version, supporting the results of the quantitative analysis.
Reasons for Mistrust: Twenty-four participants reported that they assessed Gemini’s suggestions using their knowledge or other resources, as they did not trust the AI’s suggestions (e.g., “I googled what Gemini told me just to make sure I’m doing the right thing” — [P02]; “It wasn’t blind trust, I validated the code using my own knowledge” — [P26]). Just as working suggestions fostered trust, non-working suggestions led to mistrust, as mentioned by 26 participants. Seven participants (F05, F26, F33, F50, P09, P37, P51) mistrusted Gemini due to its training data (e.g., “It’s totally dependent on humans for training and bad data might be supplied” — [P51]). Participants also cited outdated suggestions (P04, P33, P44, P46, P48, F64), required modifications (P13, P15, P33, P5, F04, F25, F61), and misinterpretation of their requirements (P14, P18, P28, P44, P50, P62, F22) (e.g., “sometimes [Gemini] loses context or ignores instructions previously provided, causes trust issues” — [P18]). Twelve participants in Paid-AI and 10 in Free-AI expressed mistrust in Gemini’s security-related suggestions (e.g., “It could scrape the internet and give me some security flaws” — [P09]). Five participants (P16, P18, P26, F02, F34) mentioned mistrust because of possible hallucinations (e.g., “AI Assistants can hallucinate often [...]. This is even more critical in environments where security is paramount” — [P16]).

Our findings revealed no significant differences in general or security-related trust between Gemini’s free and paid versions. Participants reported trusting Gemini’s suggestions primarily because they could verify them through their own knowledge or external resources, with some also trusting the AI for its clear explanations, the quantity and quality of its training data, or because it met their expectations. Trust was further bolstered by Gemini’s security-related feedback and Google’s reputation.
However, mistrust arose from concerns over incorrect or outdated suggestions, security risks, and the AI’s tendency to ignore context or hallucinate.
RQ 4 – Summary

5 Discussion and Recommendations

Our results indicate that Gemini cannot compensate for the depth of expertise that experienced developers bring to secure software development. This has direct implications for an industry already struggling with a persistent shortage of skilled professionals, particularly in security-critical domains. Although many organizations have turned to AI to address this shortage, sometimes even replacing junior developer roles [20], our results suggest that practical programming experience remains essential, especially for producing secure code. While AI assistance may offer value to novice developers, such as learning new concepts or technologies [35], it does not appear to eliminate the need for human expertise. If organizations increasingly substitute low-experience developers with AI support, junior practitioners will lose vital opportunities to build foundational skills, and the long-term availability of experienced professionals may decline even further, undermining the overall security posture of software ecosystems. Based on our findings, we strongly caution against viewing AI tools as substitutes for genuine developer expertise.

5.1 Implications for Companies

Security Through Programming Experience: We found that Gemini did not significantly improve security scores and that secure software development could not be fully substituted by Gemini support. Our analysis showed that developers with more general programming experience tend to produce code with significantly higher security scores, supporting the findings of Acar et al. [2], who also found a positive impact of self-reported years of experience on security.
This correlation could occur because more experienced developers might have had more time and workload capacity to focus on security practices. Effective workload management could enhance developer satisfaction and code security [18, 23, 43, 50].

AI Does Not Eliminate the Need for Security Measures: While Gemini could not replace programming experience, it can assist developers, particularly with security-critical implementations such as password storage. Despite these benefits, our findings indicate that the security scores of code written with Gemini assistance were comparable to those of code written without such assistance, supporting the findings of prior studies [3, 68]. Consequently, organizations should not rely on AI tools as a substitute for security processes within the software development lifecycle [4]. Security testing, code analysis, and security code reviews remain essential regardless of whether AI assistance is used.

5.2 Implications for Developers

Don’t Rely Solely on AI for Security: Developers should be cautious about relying on AI tools such as Gemini for enhanced security. In the context of our study, we did not observe significant differences in trust between the free and paid versions; however, 2 participants mentioned trusting Gemini Advanced more because it is the paid version. Instead of depending on AI-generated code for security measures, developers might use AI as an extra layer of support, particularly for identifying potential vulnerabilities or addressing security measures that may have been overlooked. As our participants frequently mentioned errors in security-related suggestions or newly introduced vulnerabilities from Gemini, it is critical to validate the AI’s suggestions with trusted resources, such as OWASP [56] or NIST [51].

Security Overconfidence: Our study revealed an unexpected trend indicating that participants with self-reported prior security experience tended to write less secure code.
A possible explanation could be that participants who feel more confident in their security skills might believe they are less likely to make mistakes, leading them to pay less attention to security and overlook vulnerabilities. Code reviews focused on security [11], along with the use of static analysis tools, can help identify overlooked vulnerabilities [77].

5.3 Implications for AI Practitioners

Vulnerabilities: While Asare et al. [3] found no significant impact of AI assistance on specific vulnerability types, our findings suggest a more nuanced effect. When considering the individual vulnerabilities, Gemini assistance led to approximately 21.7% fewer vulnerabilities in password storage and 10.5% fewer SQL injection vulnerabilities. However, participants still faced challenges, such as Gemini suggesting outdated libraries or omitting hashing recommendations. XSS vulnerabilities improved by only 3.5% with the use of Gemini, and CSRF and improper input validation seemed almost unaffected by Gemini usage. This suggests that AI tool designers may need to address specific vulnerability classes. For instance, AI training could be improved by running static analysis checks before including code in the training data [61] and by including only the latest versions of security libraries to avoid outdated suggestions.

When Paying More Does Not Mean More Secure: While our participants reported that Gemini helped them implement security measures, no significant improvement in security scores was observed between the free and paid versions. Their responses showed that Gemini frequently overlooked security concerns without explicit prompting. AI-generated code intended to address security often contained errors, making it unsuitable for implementation. These findings reflect that developer confidence in Gemini’s paid and free versions did not differ.
On the one hand, this could indicate that the paid version's benefits are not being communicated effectively, or that these benefits were not apparent when the participants used the paid version of Gemini. On the other hand, these findings also suggest that organizations and individual developers might not need to invest in costly paid versions of AI tools if free versions offer comparable security.

6 Conclusion

To explore the impact of AI tools such as Gemini on software security, we conducted a study with 159 developers recruited through the freelancer platform Upwork. Our findings showed that developers using Gemini did not produce code that was significantly more secure. We did not observe significant differences in security or user trust between the free and paid versions. However, programming experience improved code security significantly. While developers with no security experience demonstrated enhanced code security when supported by Gemini, we advise against relying solely on AI tools and recommend using them as a supplementary layer.

Acknowledgments

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy - EXC 2092 CASA - 390781972.

References

[1] Yasemin Acar, Michael Backes, Sascha Fahl, Doowon Kim, Michelle L. Mazurek, and Christian Stransky. 2016. You Get Where You're Looking For: The Impact of Information Sources on Code Security. In 2016 IEEE Symposium on Security and Privacy (SP). IEEE, San Jose, CA, USA, 289–305. doi:10.1109/SP.2016.25

[2] Yasemin Acar, Christian Stransky, Dominik Wermke, Michelle L. Mazurek, and Sascha Fahl. 2017. Security developer studies with GitHub users: exploring a convenience sample. In Proceedings of the Thirteenth USENIX Conference on Usable Privacy and Security (SOUPS '17). USENIX Association, USA, 81–95.

[3] Owura Asare, Meiyappan Nagappan, and N. Asokan.
2024. A User-centered Security Evaluation of Copilot. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE '24). Association for Computing Machinery, New York, NY, USA, Article 158, 11 pages. doi:10.1145/3597503.3639154

[4] Hala Assal and Sonia Chiasson. 2018. Security in the Software Development Lifecycle. In Fourteenth Symposium on Usable Privacy and Security (SOUPS 2018). USENIX Association, Baltimore, MD, 281–296. https://www.usenix.org/conference/soups2018/presentation/assal

[5] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732 [cs] http://arxiv.org/abs/2108.07732

[6] Kenneth A. Bamberger, Serge Egelman, Catherine Han, Amit Elazari Bar On, and Irwin Reyes. 2020. Can You Pay for Privacy? Consumer Expectations and the Behavior of Free and Paid Apps. Berkeley Tech. LJ 35 (2020), 327.

[7] Alex Biryukov, Daniel Dinu, and Dmitry Khovratovich. 2015. Argon2: The Memory-Hard Function for Password Hashing and Other Applications. https://www.password-hashing.net/argon2-specs.pdf [Online; accessed 2025-12-11].

[8] Richard E. Boyatzis. 1998. Transforming Qualitative Information. SAGE Publications. https://us.sagepub.com/en-us/nam/transforming-qualitative-information/book7714

[9] Virginia Braun and Victoria Clarke. 2024. Got questions about Thematic Analysis? We have prepared some answers to common ones. https://www.thematicanalysis.net/faqs/ [Online; accessed 2025-12-05].

[10] Virginia Braun, Victoria Clarke, and Nikki Hayfield. 2022.
'A starting point for your journey, not a map': Nikki Hayfield in conversation with Virginia Braun and Victoria Clarke about thematic analysis. Qualitative Research in Psychology 19, 2 (2022), 424–445. doi:10.1080/14780887.2019.1670765

[11] Larissa Braz, Christian Aeberhard, Gul Calikli, and Alberto Bacchelli. 2022. Less is More: Supporting Developers in Vulnerability Detection during Code Review. In 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 1317–1329. doi:10.1145/3510003.3511560

[12] J. Brooke. 1996. SUS: A "quick and dirty" usability scale. Usability Evaluation in Industry 189, 194 (1996), 4–7.

[13] TIOBE Software BV. 2024. TIOBE Index. https://www.tiobe.com/tiobe-index/ [Online; accessed 2025-12-05].

[14] Satish Chandra. 2024. Progress of AI-based assistance for software engineering in Google's internal tooling and our projections for the future. https://research.google/blog/ai-in-software-engineering-at-google-progress-and-the-path-ahead/ [Online; accessed 2025-12-05].

[15] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N.
Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs] http://arxiv.org/abs/2107.03374

[16] Jacob Cohen. 1988. Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Routledge, New York. doi:10.4324/9780203771587

[17] Data Protection Commission. 2024. Irish Data Protection Commission fines Meta Ireland €91 million. https://www.dataprotection.ie/en/news-media/press-releases/DPC-announces-91-million-fine-of-Meta [Online; accessed 2025-12-05].

[18] Thomas Dohmke, Marco Iansiti, and Greg Richards. 2023. Sea Change in Software Development: Economic and Productivity Analysis of the AI-Powered Developer Lifecycle.

[19] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating Large Language Models in Class-Level Code Generation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE '24). Association for Computing Machinery, Lisbon, Portugal, 1–13. doi:10.1145/3597503.3639219

[20] Carl Friedmann. 2025. Demand for junior developers softens as AI takes over. CIO. https://www.cio.com/article/4062024/demand-for-junior-developers-softens-as-ai-takes-over.html [Online; accessed 2025-12-10].

[21] GitHub, Inc. 2024. GitHub Copilot · Your AI pair programmer. https://github.com/features/copilot [Online; accessed 2025-12-05].

[22] GitHub, Inc. 2024. Survey reveals AI's impact on the developer experience. The GitHub Blog. https://github.blog/news-insights/research/survey-reveals-ais-impact-on-the-developer-experience/ [Online; accessed 2025-12-05].
[23] Lucian Gonçales, Kleinner Farias, Bruno da Silva, and Jonathan Fessler. 2019. Measuring the Cognitive Load of Software Developers: A Systematic Mapping Study. In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC). IEEE/ACM, Montreal, QC, Canada, 42–52. doi:10.1109/ICPC.2019.00018

[24] Google. 2024. Bard becomes Gemini: Try Ultra 1.0 and a new mobile app today. https://blog.google/products/gemini/bard-gemini-advanced-app/ [Online; accessed 2025-12-05].

[25] Google. 2025. Gemini. https://gemini.google.com/app [Online; accessed 2025-12-05].

[26] Jason Graefe. 2025. How software development companies are paving the way for AI transformation. https://partner.microsoft.com/en-kw/blog/article/generative-ai-impact-for-partners [Online; accessed 2025-12-05].

[27] Sivana Hamer, Marcelo d'Amorim, and Laurie Williams. 2024. Just Another Copy and Paste? Comparing the Security Vulnerabilities of ChatGPT Generated Code and StackOverflow Answers. In 2024 IEEE Security and Privacy Workshops (SPW). IEEE, San Francisco, CA, USA, 87–94. doi:10.1109/SPW63631.2024.00014

[28] Sandra G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. In Advances in Psychology, Vol. 52. Elsevier, 139–183.

[29] Ahmed E. Hassan. 2026. Agentic Software Engineering: Building Trustworthy Software with Stochastic Teammates at Unprecedented Scale (first edition, v0.3 ed.).

[30] George Hatzivasilis, Ioannis Papaefstathiou, and Charalampos Manifavas. 2015. Password Hashing Competition - Survey and Benchmark. Cryptology ePrint Archive, Paper 2015/265. https://eprint.iacr.org/2015/265

[31] Jingxuan He and Martin Vechev. 2023. Large Language Models for Code: Security Hardening and Adversarial Testing. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security.
ACM, Copenhagen, Denmark, 1865–1879. doi:10.1145/3576915.3623175

[32] Wenpin Hou and Zhicheng Ji. 2024. Comparing Large Language Models and Human Programmers for Generating Programming Code. doi:10.1002/advs.202412279

[33] Upwork Global Inc. 2024. Upwork | The World's Work Marketplace. https://www.upwork.com/ [Online; accessed 2025-12-05].

[34] Harjot Kaur, Sabrina Amft, Daniel Votipka, Yasemin Acar, and Sascha Fahl. 2022. Where to Recruit for Security Development Studies: Comparing Six Software Developer Samples. In 31st USENIX Security Symposium (USENIX Security 22). USENIX Association, Boston, MA, 4041–4058. https://www.usenix.org/conference/usenixsecurity22/presentation/kaur

[35] Majeed Kazemitabaar, Justin Chow, Carl Ka To Ma, Barbara J. Ericson, David Weintrop, and Tovi Grossman. 2023. Studying the Effect of AI Code Generators on Supporting Novice Learners in Introductory Programming. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. ACM, Hamburg, Germany, 1–23. doi:10.1145/3544548.3580919

[36] Kyle Daigle. 2024. Survey: The AI wave continues to grow on software development teams. The GitHub Blog. https://github.blog/news-insights/research/survey-ai-wave-grows/ [Online; accessed 2025-12-05].

[37] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-Level Code Generation with AlphaCode. Science 378, 6624 (2022), 1092–1097. arXiv:2203.07814 [cs] doi:10.1126/science.abq1158

[38] Jenny T. Liang, Chenyang Yang, and Brad A. Myers. 2024.
A Large-Scale Survey on the Usability of AI Programming Assistants: Successes and Challenges. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE). ACM, Lisbon, Portugal, 1–13. doi:10.1145/3597503.3608128

[39] Linktree. 2024. Link in bio tool: Everything you are, in one simple link. https://linktr.ee/ [Online; accessed 2025-12-05].

[40] Rossella Mattioli and Apostolos Malatras. 2024. Foresight Cybersecurity Threats for 2030: Update. https://www.enisa.europa.eu/sites/default/files/2024-11/Cybersecurity%20Threats%20for%202030%20-%20Update%202024%20-%20Executive%20Summary_0.pdf [Online; accessed 2025-12-05].

[41] Jowi Morales. 2026. In wake of outage, Amazon calls upon senior engineers to address issues created by 'Gen-AI assisted changes,' report claims. https://www.tomshardware.com/tech-industry/artificial-intelligence/amazon-calls-engineers-to-address-issues-caused-by-use-of-ai-tools-report-claims-company-says-recent-incidents-had-high-blast-radius-and-were-allegedly-related-to-gen-ai-assisted-changes [Online; accessed 2026-03-13].

[42] Zahra Mousavi, Chadni Islam, Kristen Moore, Alsharif Abuadbba, and M. Ali Babar. 2024. An Investigation into Misuse of Java Security APIs by Large Language Models. In Proceedings of the 19th ACM Asia Conference on Computer and Communications Security. ACM, Singapore, 1299–1315. doi:10.1145/3634737.3661134

[43] Sebastian C. Muller. 2015. Measuring Software Developers' Perceived Difficulty with Biometric Sensors. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering. IEEE, Florence, Italy, 887–890.
doi:10.1109/ICSE.2015.284

[44] David Musheyev, Alexander Pan, Preston Gross, Daniel Kamyab, Peter Kaplinsky, Mark Spivak, Marie A. Bragg, Stacy Loeb, and Abdo E. Kabarriti. 2024. Readability and Information Quality in Cancer Information From a Free vs Paid Chatbot. JAMA Network Open 7, 7 (2024), e2422275.

[45] Sarah Nadi, Stefan Krüger, Mira Mezini, and Eric Bodden. 2016. Jumping Through Hoops: Why Do Java Developers Struggle with Cryptography APIs?. In Proceedings of the 38th International Conference on Software Engineering (ICSE). ACM, Austin, Texas, 935–946. doi:10.1145/2884781.2884790

[46] Alena Naiakshina, Anastasia Danilova, Eva Gerlitz, and Matthew Smith. 2020. On Conducting Security Developer Studies with CS Students: Examining a Password-Storage Study with CS Students, Freelancers, and Company Developers. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI '20). Association for Computing Machinery, New York, NY, USA, 1–13. doi:10.1145/3313831.3376791

[47] Alena Naiakshina, Anastasia Danilova, Eva Gerlitz, Emanuel von Zezschwitz, and Matthew Smith. 2019. "If you want, I can store the encrypted password": A Password-Storage Field Study with Freelance Developers. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI '19). Association for Computing Machinery, New York, NY, USA, 1–12. doi:10.1145/3290605.3300370

[48] Alena Naiakshina, Anastasia Danilova, Christian Tiefenau, Marco Herzog, Sergej Dechand, and Matthew Smith. 2017. Why Do Developers Get Password Storage Wrong?: A Qualitative Usability Study. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, Dallas, Texas, USA, 311–328. doi:10.1145/3133956.3134082

[49] Alena Naiakshina, Anastasia Danilova, Christian Tiefenau, and Matthew Smith. 2018.
Deception Task Design in Developer Password Studies: Exploring a Student Sample. In Proceedings of the Fourteenth USENIX Conference on Usable Privacy and Security (SOUPS '18). USENIX Association, Baltimore, MD, USA, 297–313.

[50] Kevin KB Ng, Liyana Fauzi, Leon Leow, and Jaren Ng. 2024. Harnessing the Potential of Gen-AI Coding Assistants in Public Sector Software Development.

[51] National Institute of Standards and Technology. 2024. Cybersecurity and Privacy. https://www.nist.gov/cybersecurity-and-privacy [Online; accessed 2025-12-05].

[52] Sanghak Oh, Kiho Lee, Seonhye Park, Doowon Kim, and Hyoungshick Kim. 2024. Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers' Coding Practices with Insecure Suggestions from Poisoned AI Models. In 2024 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, Los Alamitos, CA, USA, 1141–1159. doi:10.1109/SP54263.2024.00046

[53] OpenAI. 2025. ChatGPT. https://chatgpt.com/ [Online; accessed 2025-12-05].

[54] Stack Overflow. 2023. Stack Overflow Developer Survey 2023. https://survey.stackoverflow.co/2023 [Online; accessed 2025-12-05].

[55] OWASP. 2025. A02 Cryptographic Failures - OWASP Top 10:2021. https://owasp.org/Top10/2021/A02_2021-Cryptographic_Failures/ [Online; accessed 2025-12-10].

[56] OWASP Foundation, Inc. 2024. About the OWASP Foundation. https://owasp.org/about/ [Online; accessed 2025-12-05].

[57] OWASP Foundation, Inc. 2024. Source Code Analysis Tools. https://owasp.org/www-community/Source_Code_Analysis_Tools [Online; accessed 2025-12-05].

[58] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2025. Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. Commun. ACM 68, 2 (Jan. 2025), 96–105.
doi:10.1145/3610721

[59] Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2023. Examining Zero-Shot Vulnerability Repair with Large Language Models. In 2023 IEEE Symposium on Security and Privacy (SP). IEEE, San Francisco, CA, USA, 2339–2356. doi:10.1109/SP46215.2023.10179324

[60] Colin Percival. 2009. Stronger Key Derivation via Sequential Memory-Hard Functions.

[61] Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh. 2023. Do Users Write More Insecure Code with AI Assistants?. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS '23). Association for Computing Machinery, New York, NY, USA, 2785–2799. doi:10.1145/3576915.3623157

[62] Vilius Petkauskas. 2024. RockYou2024: 10 billion passwords leaked in the largest compilation of all time. https://cybernews.com/security/rockyou2024-largest-password-compilation-leak/ [Online; accessed 2025-12-05].

[63] Sundar Pichai. 2024. Google Gemini update: Sundar Pichai introduces Ultra 1.0 in Gemini Advanced. https://blog.google/technology/ai/google-gemini-update-sundar-pichai-2024/ [Online; accessed 2025-05-13].

[64] Prolific. 2025. How to Detect and Prevent the Use of Large Language Models in Studies. https://researcher-help.prolific.com/en/articles/445207-how-to-detect-and-prevent-the-use-of-large-language-models-in-studies [Online; accessed 2026-03-13].

[65] Niels Provos and David Mazières. 1999. A Future-Adaptive Password Scheme. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (ATEC '99). USENIX Association, USA, 32.

[66] Python Code Quality Authority (PyCQA). 2024. PyCQA/bandit: Bandit is a tool designed to find common security issues in Python code. https://github.com/PyCQA/bandit [Online; accessed 2025-12-05].

[67] SonarSource SA. 2024.
Code Quality, Security & Static Analysis Tool with SonarQube. https://www.sonarsource.com/products/sonarqube/ [Online; accessed 2025-12-05].

[68] Gustavo Sandoval, Hammond Pearce, Teo Nys, Ramesh Karri, Siddharth Garg, and Brendan Dolan-Gavitt. 2023. Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants. In 32nd USENIX Security Symposium (USENIX Security 23). USENIX Association, Anaheim, CA, 2205–2222. https://www.usenix.org/conference/usenixsecurity23/presentation/sandoval

[69] Mohammed Latif Siddiq, Shafayat H. Majumder, Maisha R. Mim, Sourov Jajodia, and Joanna C. S. Santos. 2022. An Empirical Study of Code Smells in Transformer-based Code Generation Techniques. In 2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE, Limassol, Cyprus, 71–82. doi:10.1109/SCAM55253.2022.00014

[70] Andy Smith. 2024. Can Generative AI Solve the Software Engineer Shortage? HatchWorks AI. https://hatchworks.com/blog/gen-ai/software-engineer-shortage/ [Online; accessed 2025-12-05].

[71] Dominik Sobania, Martin Briesch, and Franz Rothlauf. 2022. Choose Your Programming Copilot: A Comparison of the Program Synthesis Performance of GitHub Copilot and Genetic Programming. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO '22). Association for Computing Machinery, New York, NY, USA, 1019–1027. doi:10.1145/3512290.3528700

[72] Stack Exchange, Inc. 2022. Stack Overflow Developer Survey 2022. https://survey.stackoverflow.co/2022/#demographics-gender [Online; accessed 2025-12-05].

[73] Stack Exchange, Inc. 2024. 2024 Stack Overflow Developer Survey. https://survey.stackoverflow.co/2024/ [Online; accessed 2025-12-05].

[74] Statista. 2022. Authentication factors used by finance companies 2022.
https://www.statista.com/statistics/1342535/authentication-factors-used-by-finance-companies/ [Online; accessed 2025-12-05].

[75] Statista. 2022. Authentication factors used by healthcare companies 2022. https://www.statista.com/statistics/1342533/authentication-factors-used-by-healthcare-companies/ [Online; accessed 2025-12-05].

[76] Statista. 2024. Revenue of the public cloud market worldwide from 2020 to 2029. https://www.statista.com/forecasts/963841/cloud-services-revenue-in-the-world [Online; accessed 2025-12-05].

[77] Mohammad Tahaei, Kami Vaniea, Konstantin (Kosta) Beznosov, and Maria K. Wolters. 2021. Security Notifications in Static Analysis Tools: Developers' Attitudes, Comprehension, and Ability to Act on Them. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI '21). Association for Computing Machinery, New York, NY, USA, 1–17. doi:10.1145/3411764.3445616

[78] The MITRE Corporation. 2024. CWE - 2023 CWE Top 25 Most Dangerous Software Weaknesses. https://cwe.mitre.org/top25/archive/2023/2023_top25_list.html [Online; accessed 2025-12-05].

[79] The MITRE Corporation. 2024. CWE-20: Improper Input Validation (4.15). https://cwe.mitre.org/data/definitions/20.html [Online; accessed 2025-12-05].

[80] The MITRE Corporation. 2024. CWE-352: Cross-Site Request Forgery (CSRF) (4.15). https://cwe.mitre.org/data/definitions/352.html [Online; accessed 2025-12-05].

[81] The MITRE Corporation. 2024. CWE-79: Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting') (4.15). https://cwe.mitre.org/data/definitions/79.html [Online; accessed 2025-12-05].

[82] The MITRE Corporation. 2024. CWE-89: Improper Neutralization of Special Elements used in an SQL Command ('SQL Injection') (4.15). https://cwe.mitre.org/data/definitions/89.html [Online; accessed 2025-12-05].
[83] Daniel Votipka, Desiree Abrokwa, and Michelle L. Mazurek. 2020. Building and Validating a Scale for Secure Software Development Self-Efficacy. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. ACM, Honolulu, HI, USA, 1–20. doi:10.1145/3313831.3376754

[84] Carlo Zapponi. 2024. GitHub Language Stats. https://madnight.github.io/githut/ [Online; accessed 2025-12-05].

A Appendix

A.1 Availability

Within the appendix, we outline the following components:

(1) Section 2: Within this section, we provide the job post used for recruiting developers on Upwork.
(2) Section 3: This section details the task description presented to the participants.
(3) Section 4: Here, we present the follow-up survey.
(4) Section 5: This section provides our codebook.

Participants' solutions and responses are not provided in order to protect their privacy.

A.2 Job Post

Participants for Python Programming Study with Flask

We are researchers from [anonymous university] and are looking for software developers who are interested in participating in our programming study. Our research goal is to study developers' programming behavior with a Flask web application. You will be asked to complete four short programming tasks in a Python Flask web application, submit the code via GitHub Classroom, and fill out a short survey. You will be rewarded with an expense allowance of $60. The study will be conducted in English. You are free to organize your time as you wish, but the tasks must be completed within 2 days (48h) after accepting the job. All data will be processed pseudonymously and stored anonymized after the study; there will be no identifying information published in any form.

What does this job look like?
(1) Apply for the job
(2) Choose a preferred appointment
(3) Complete the programming task
(4) Fill out a short survey
(5) Receive payment via Upwork

Please complete the study in a focused setting, using a laptop or PC.

A.3 Task Description

Welcome! We would like to invite you to participate in our study and thus make a valuable contribution to our research. Our research goal is to study developers' programming behavior with a Flask web application.

Instructions:
• The task requires Python (at least version 3.10).
• You can use an IDE of your choice for solving the task. Before starting with the task, please disable all IDE plugins that use AI (plugins such as GitHub Copilot, Tabnine, ...).
• (No-AI group:) Please do not use any kind of AI to fulfill this task.
(Free-AI group:) Please only use the free version of Gemini (https://gemini.google.com) for solving the task. If you do not have a Google account, please create one for free.
(Paid-AI group:) Please only use the paid version of Gemini (https://gemini.google.com) for solving the task. The access data for a Google account with a paid Google One subscription (AI Premium version) will be provided. Within this account you will have access to Gemini Advanced.
(1) Follow the link: https://accounts.google.com/
(2) Login with the provided Google account:
(3) Username: (username)
(4) Password: (password)
(5) Follow the link https://gemini.google.com to use Gemini
• (Paid-AI group:) Please do not enter login data such as passwords or personal data into the chat in Gemini. All chats in the provided account will be deleted after the participation.
• (Free-AI group:) Please do not use any other AI tools except the free version of Gemini.
• (Paid-AI group:) Please do not use any other AI tools except the paid version of Gemini.
• In order to solve the task, you are allowed to use any kind of source available on the Internet that may be helpful or has valuable information, except websites that include AI such as ChatGPT, Microsoft Copilot/Bing Chat (No-AI group:) or Google Gemini.
(Free-AI group:) Please use only the free version of Gemini instead.
(Paid-AI group:) Please use only the premium version of Gemini instead.
• You may import any needed external libraries.
• How to start the application: python -m flask run

Link to Task: [Link to GitHub Classroom]

Task:
• Please follow the link above and connect your GitHub account to GitHub Classroom. Select your pseudonym from the list.
• Clone the code to your IDE.
• Install the requirements with pip install -r requirements.txt
• Complete the TODOs in the code.
• Push the completed code to GitHub.
• Please fill in the following survey after working on the task: [Link to survey on Qualtrics]

Implementation Hints:
The Python code sets up a Flask web application with user authentication and basic CRUD functionalities for managing user websites. It uses Flask's login extension for managing user sessions and SQLite for database operations. The User class represents user objects with their credentials. There are routes for registering, logging in, logging out, viewing the dashboard, and managing websites. Functionalities that should be implemented are marked with TODO. Please pay attention to the security aspects during implementation. In the future, it is planned that users will be able to share their profiles containing their websites publicly, therefore making them visible to others.

A.4 Survey
• Please enter your pseudonym:
• How old are you?
• What is your gender?
– Woman
– Man
– Non-binary
– Prefer not to disclose
– Prefer to self-describe:
• In which country do you currently live mainly?
• What level of education do you have? Please indicate only the highest degree.
– No school degree
– School degree
– Professional training
– Bachelor's degree
– Master's degree or equivalent diploma
– Doctoral degree (Dr./PhD)
– Other university degree
– Other, namely:
• Please specify the number of years you have been programming in general:
• Please specify the number of years you have been programming in Python:
• Which programming, scripting, and markup language(s) do you regularly utilize?
– Bash/Shell (all shells)
– C
– C#
– C++
– HTML/CSS
– Java
– JavaScript
– Python
– SQL
– TypeScript
– Other:
• Which Integrated Development Environment (IDE) do you commonly use?
– Android Studio
– IntelliJ IDEA
– Jupyter Notebook/JupyterLab
– Neovim
– Notepad++
– PyCharm
– Sublime Text
– Vim
– Visual Studio
– Visual Studio Code
– Other:
• Which IDE did you utilize for completing the task?
– Android Studio
– IntelliJ IDEA
– Jupyter Notebook/JupyterLab
– Neovim
– Notepad++
– PyCharm
– Sublime Text
– Vim
– Visual Studio
– Visual Studio Code
– Other:
• Which AI tools do you commonly use?
– None
– ChatGPT Free Version
– ChatGPT Paid Version
– Gemini Free Version
– Gemini Paid Version
– GitHub Copilot
– Tabnine Free Version
– Tabnine Paid Version
– Visual Studio IntelliCode
– Other:
• Which AI tools did you utilize for solving the task?
– None
– ChatGPT Free Version
– ChatGPT Paid Version
– Gemini Free Version
– Gemini Paid Version
– GitHub Copilot
– Tabnine Free Version
– Tabnine Paid Version
– Visual Studio IntelliCode
– Other:
• Do you have IT security experience?
– No experience
– I have taken an IT security course/training
– I have a degree in IT security
– I have a certificate in IT security
– I have developed security applications/implemented security measures
– I have worked at IT security-related companies
– I have worked on IT security in my spare time
– Other:
• What is your current main occupation?
– Freelance developer
– Industrial developer
– Freelance security expert
– Industrial security expert
– Industrial researcher
– Academic researcher
– Undergraduate student
– Graduate student
– Other:
• How much time in minutes did you need to solve the task?
• On a scale of 1 to 7, please rate the following statement: 'I believe I have solved this task functionally correctly.' 1: Strongly Disagree - 7: Strongly Agree
• On a scale of 1 to 7, how would you rate the level of security of your solution for the task? 1: Not Secure - 7: Totally Secure
• (Free-AI & Paid-AI group:)
– On a scale from 1 to 7, how helpful do you consider Gemini in solving this task? 1: Not Helpful - 7: Very Helpful
– In what ways has the assistance provided by Gemini positively influenced the solving of the task?
– In what ways has the assistance provided by Gemini presented challenges in solving the task?
– On a scale of 1 to 7, please rate the following statement: 'I trust Gemini in general.' 1: Never - 7: Always
– On a scale of 1 to 7, please rate the following statement: 'I trust Gemini to generate secure code.' 1: Never - 7: Always
– Why did you choose to trust the suggestions by Gemini?
– Why did you choose to not trust the suggestions by Gemini?
– To help us understand your interactions with Gemini, please answer the following questions honestly:
∗ How often have you consulted Gemini (questions or prompts)?
∗ What percentage of Gemini's suggestions did you adopt? 0% - 100%
• (No-AI group:)
– Would you have appreciated assistance from an AI?
∗ Yes
∗ No
– On a scale of 1 to 7, would you trust AI in general? 1: Never - 7: Always
– On a scale of 1 to 7, would you trust AI to generate secure code? 1: Never - 7: Always
• Can you describe any particular challenges you faced during the development process?
• (No-AI group:) Which resources did you use to complete the task? (Free-AI & Paid-AI group:) Which resources (beside Gemini) did you use to complete the task?
– Codecademy
– GitHub
– Official documentation
– StackOverflow
– Tutorial Websites
– W3Schools
– YouTube
– Other:
• Were there any industry best practices or standards that you followed to enhance security in your development process? If yes, please name them.
– No
– Yes:
• NASA-TLX [28]
• Secure Software Development Self-Efficacy Scale (SSD-SES) [83]
• (Free-AI & Paid-AI group:) System Usability Scale (SUS) [12]

A.5 Codebook

• Codesystem: 1327
• Other AI tools: 0
– ChatGPT: 4
– Copilot: 2
– BlackBoxAI: 1
• Gemini challenges: 0
– Quality of suggestions: 0
∗ Suggestions not working: 36
∗ Solution not optimal: 19
∗ Incomplete suggestions: 18
∗ Output needs to be tested and adjusted: 12
∗ Inaccurate code: 10
∗ Vague answers: 10
∗ Unuseful suggestions: 8
∗ Long answers: 5
∗ Outdated suggestions: 4
∗ Inconsistency: 4
∗ Hallucinations: 4
∗ Not mentioning needed library: 4
∗ Not for complex problems: 3
∗ Code formatting/indentation: 2
– Usability: 0
∗ Difficulties prompting: 7
∗ Copy&Pasting/Non-printable characters in output: 4
∗ Hard to convince tool to do other approach: 4
∗ UI: 4
∗ Integrating suggestions into existing code: 3
∗ Slow: 2
∗ Hard to pass files to Gemini: 1
– Missing context: 23
– Security: 21
– Multiple prompts/re-prompting: 17
– No challenges: 12
– Misunderstanding of problem/requirements: 9
– Suggestions for Gemini improvement: 4
– LLM can't assist message: 3
– Personal preferences: 2
– Gemini gives results from Google, no new code: 2
– Lack of experience with Gemini: 1
– Less own understanding of code: 1
• Gemini trust: 0
– Mistrust: 0
∗ Assessment: 28
∗ Security flaws: 28
∗ Suggestions not working: 26
∗ Inaccurate: 15
∗ Mistrust in LLM: 8
∗ Forgetting context: 8
∗ Outdated: 7
∗ Training data: 7
∗ Not following/misinterpreting requirements: 7
∗ Suggestions needed modifications: 7
∗ Hallucinations: 5
∗ Vague answers: 5
∗ Less practical suggestions: 4
∗ Inconsistency: 3
∗ Less unique suggestions: 3
∗ Including unnecessary/not optimal libraries: 2
∗ Explanation not convincing: 2
∗ Complex tasks: 2
∗ Lack of own knowledge: 1
∗ New technology: 1
∗ Personal coding preferences: 1
∗ Other AI tools perform better: 1
∗ Gemini disclaimer: 1
– Trust: 0
∗ Assessment: 42
∗ Correct/working suggestions: 17
∗ Explanations/informative suggestions: 13
∗ Security: 10
∗ Training data: 10
∗ Trusted company: 6
∗ Suggestions meet expectations: 5
∗ Simple tasks: 5
∗ References Gemini gave: 4
∗ Well-known: 3
∗ Quick/faster solutions: 3
∗ Only use for initial idea: 3
∗ Lack of own knowledge: 3
∗ Past experience with Gemini: 2
∗ Part of task: 2
∗ It's the best: 2
∗ Trust in LLM: 2
∗ Other AI tools perform worse: 2
∗ Premium version: 2
∗ Presents multiple options: 1
∗ Gemini disclaimer: 1
• Task challenges: 0
– Security: 7
∗ Hashing: 16
∗ Unfamiliar flask_login: 10
∗ Authentication: 6
∗ SQL Injection Prevention: 3
∗ Input validation: 2
∗ Integrating security into existing code: 2
∗ Searching measures for password security: 2
∗ Balancing security and usability: 1
∗ XSS Prevention: 1
– No challenges: 36
– Database: 24
– Flask: 20
– Setup: 14
– Debugging: 14
– Session management: 11
– Implement task functionality: 0
∗ Dashboard view: 4
∗ Website deletion: 2
∗ Delete user: 1
∗ Linking websites to users: 1
∗ Routing Links: 1
– SQL Issues: 9
– Unfamiliar libraries: 7
– Syntax: 6
– Task uncertainty: 6
– Not using AI: 5
– Integration code into existing system: 4
– User messages: 4
– Understanding code base: 3
– Adding modules/libraries: 3
– Smaller modules: 2
– Time management: 2
– Passing parameter to template: 2
– Testing: 2
– Data storage: 2
– Handle variables: 2
– Git: 1
– Port: 1
– Error handling: 1
• Gemini assistance: 0
– Security: 0
∗ Security suggestions: 11
∗ Find security issues: 9
∗ Hashing: 8
∗ flask_login: 4
∗ SQL Injection Prevention: 2
∗ User permissions: 1
– Code generation/completion: 37
– Templating: 22
– Efficiency/Productivity/Saving time: 20
– Debugging: 20
– Guidance: 18
– Understand codebase: 17
– Database: 13
– Flask: 9
– Syntax: 8
– For simple tasks/basic code: 7
– Evaluate possible solutions: 6
– Pytest/Testing: 4
– Setup: 4
– Recommend libraries: 3
– Improved accuracy: 3
– Integrate libraries: 3
– Git: 3
– Error handling: 2
• Security best practices: 0
– Hashing: 47
∗ BCrypt: 11
∗ werkzeug.security: 4
∗ SECRETS library: 1
∗ argon2: 1
∗ SHA-256: 1
– SQL injection prevention: 19
– Authentication: 17
– Input validation: 9
– Error handling: 8
– Encryption: 7
– Library: 7
– Misconceptions: 6
– Managing sessions: 5
– Flask security: 3
∗ Flask login: 2
– CSRF protection: 4
– Sanitizing: 4
– PEP8: 4
– Password requirements: 4
– OWASP: 1
∗ OWASP Top 10: 1
∗ OWASP Secure Coding: 1
∗ OWASP Cheat sheet: 1
– Database security: 3
– Functionality first, security second: 2
– Needs more security information: 2
– XSS Prevention: 2
– GDPR: 1
– NIST: 1
– Used tested libraries: 1
– Logging: 1
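The most frequent "Security best practices" codes concern salted password hashing via dedicated key-derivation functions (BCrypt, werkzeug.security, argon2). For readers unfamiliar with that pattern, the following is a minimal sketch of it using only Python's standard library (PBKDF2-HMAC-SHA256 with a random per-user salt); the function names and storage format are illustrative and not taken from participant solutions.

```python
import hashlib
import secrets


def hash_password(password: str, iterations: int = 600_000) -> str:
    # A fresh random salt per password prevents rainbow-table lookups
    # and makes identical passwords hash to different values.
    salt = secrets.token_hex(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), bytes.fromhex(salt), iterations
    )
    # Store algorithm, work factor, salt, and digest together so the
    # record is self-describing (the format here is our own choice).
    return f"pbkdf2_sha256${iterations}${salt}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    _algorithm, iterations, salt, expected = stored.split("$")
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), bytes.fromhex(salt), int(iterations)
    )
    # Constant-time comparison avoids leaking information via timing.
    return secrets.compare_digest(digest.hex(), expected)
```

Libraries such as bcrypt or werkzeug.security wrap the same hash-with-salt/verify pattern behind a pair of calls and are generally preferable to hand-rolled code in production.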
