Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, refer to the original arXiv source.

Large language models (LLMs) perform strongly on biological benchmarks, raising concerns that they may help novice actors acquire dual-use laboratory skills. Yet whether this translates to improved human performance in the physical laboratory remains unclear. To address this, we conducted a pre-registered, investigator-blinded, randomized controlled trial (June-August 2025; n = 153) evaluating whether LLMs improve novice performance in tasks that collectively model a viral reverse genetics workflow. We observed no significant difference in the primary endpoint of workflow completion (5.2% LLM vs. 6.6% Internet; P = 0.759), nor in the success rate of individual tasks. However, the LLM arm had numerically higher success rates in four of the five tasks, most notably for the cell culture task (68.8% LLM vs. 55.3% Internet; P = 0.059). Post-hoc Bayesian modeling of pooled data estimates an approximate 1.4-fold increase (95% CrI 0.74-2.62) in success for a “typical” reverse genetics task under LLM assistance. Ordinal regression modeling suggests that participants in the LLM arm were more likely to progress through intermediate steps across all tasks (posterior probability of a positive effect: 81%-96%). Overall, mid-2025 LLMs did not substantially increase novice completion of complex laboratory procedures but were associated with a modest performance benefit. These results reveal a gap between in silico benchmarks and real-world utility, underscoring the need for physical-world validation of AI biosecurity assessments as model capabilities and user proficiency evolve.


💡 Research Summary

This paper reports the first large‑scale, pre‑registered, investigator‑blinded randomized controlled trial (RCT) that directly tests whether access to state‑of‑the‑art large language models (LLMs) improves the ability of novice participants to carry out a multi‑step viral reverse‑genetics workflow in a physical BSL‑2 laboratory. Between June and August 2025, 153 individuals with minimal prior laboratory experience were randomly assigned to either an “Internet” control arm (access to standard web resources such as Google, Wikipedia, YouTube, and read‑only forums) or an “LLM” intervention arm (unrestricted access to mid‑2025 frontier models from Anthropic, OpenAI, and DeepMind in addition to the same web resources). All participants received a safety and LLM‑training session, then completed up to 39 four‑hour laboratory sessions over eight weeks, attempting five tasks that together model the synthesis of a virus from a known genome: (1) micropipetting, (2) mammalian cell culture, (3) molecular cloning, (4) recombinant AAV virus production, and (5) RNA quantification. Tasks 2‑4 were defined as the “core reverse‑genetics sequence.” Participants worked without external human assistance; communication tools were blocked, and all reagents had to be identified from an inventory spreadsheet containing distractors.

The primary outcome was successful completion of the core sequence (tasks 2‑4). In the full analysis set (FAS), only 4 of 77 participants (5.2%) in the LLM arm and 5 of 76 (6.6%) in the Internet arm achieved this composite endpoint (risk ratio = 0.79, 95% CI 0.24–2.62, p = 0.759, one‑sided Fisher’s exact test). No individual task reached statistical significance, although the LLM arm showed numerically higher success rates in four of five tasks. The most notable difference was in cell culture: 68.8% success in the LLM arm versus 55.3% in the Internet arm (p = 0.059). In the per‑protocol set (participants attending ≥35 sessions, n = 128), the cell‑culture advantage became statistically significant (risk ratio = 1.28, p = 0.025); other tasks remained non‑significant.
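The primary-endpoint contrast above can be re-derived from the reported counts (4/77 vs. 5/76). The sketch below, using `scipy.stats.fisher_exact`, shows how the risk ratio and a Fisher's exact p-value are computed from the 2×2 table; it is an illustration of the method, not the authors' analysis code.

```python
# Illustrative re-computation of the primary endpoint (4/77 LLM vs. 5/76
# Internet), with counts taken from the summary text.
from scipy.stats import fisher_exact

llm_success, llm_n = 4, 77
net_success, net_n = 5, 76

# 2x2 contingency table: rows = arm, columns = (success, failure)
table = [
    [llm_success, llm_n - llm_success],
    [net_success, net_n - net_success],
]

risk_ratio = (llm_success / llm_n) / (net_success / net_n)
odds_ratio, p_two_sided = fisher_exact(table)  # two-sided by default
_, p_one_sided = fisher_exact(table, alternative="greater")  # LLM > Internet

print(f"risk ratio = {risk_ratio:.2f}")  # ≈ 0.79, matching the summary
print(f"two-sided Fisher p = {p_two_sided:.3f}")
print(f"one-sided Fisher p = {p_one_sided:.3f}")
```

With the observed risk ratio below 1, the one-sided test in the direction "LLM arm performs better" unsurprisingly yields a large p-value, consistent with the reported p = 0.759.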

Because overall completion rates were low, the authors performed post‑hoc Bayesian hierarchical modeling that pooled data across all five tasks while accounting for task heterogeneity and individual variability. This analysis estimated a risk ratio of 1.42 (95% credible interval 0.74–2.62) for a “typical” reverse‑genetics task under LLM assistance, with an 85.5% posterior probability that the effect is positive. An out‑of‑sample prediction yielded a similar estimate (RR = 1.42, 95% CrI 0.74–2.62). Ordinal regression further suggested that participants with LLM access were more likely to achieve intermediate milestones across tasks, with posterior probabilities of a positive effect ranging from 81% to 96%.
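The quantities reported here, a risk-ratio credible interval and a posterior probability that the effect is positive, can be illustrated with a much simpler model than the authors' hierarchical one. The sketch below fits independent beta-binomial posteriors to a single task (cell culture, with counts back-calculated from the reported rates: 68.8% of 77 ≈ 53, 55.3% of 76 ≈ 42, so the inputs are approximate) and reads both quantities off Monte Carlo posterior draws. This is not the paper's pooled five-task model; it only shows the mechanics.

```python
# Single-task beta-binomial sketch of the Bayesian comparison described above.
# Counts are approximate back-calculations from the reported success rates.
import numpy as np

rng = np.random.default_rng(0)
n_draws = 200_000

# Beta(1, 1) priors updated with binomial likelihoods give Beta posteriors
p_llm = rng.beta(1 + 53, 1 + (77 - 53), n_draws)  # ~68.8% of 77
p_net = rng.beta(1 + 42, 1 + (76 - 42), n_draws)  # ~55.3% of 76

rr = p_llm / p_net
lo, hi = np.percentile(rr, [2.5, 97.5])
prob_positive = (rr > 1.0).mean()

print(f"posterior median RR = {np.median(rr):.2f}")
print(f"95% CrI = ({lo:.2f}, {hi:.2f})")
print(f"P(RR > 1) = {prob_positive:.1%}")
```

Note that this single-task posterior probability will differ from the pooled 85.5% figure, since the hierarchical model shrinks each task's estimate toward a shared effect.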

Time‑to‑completion analyses indicated that the LLM arm completed the cell‑culture task on average six days earlier (RMST difference = −6.02 days, p = 0.02) and required fewer attempts. No significant differences were observed for overall workflow completion time or for the other tasks.
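The RMST (restricted mean survival time) metric used here is the area under a time-to-event step curve up to a fixed horizon; a negative difference means the LLM arm's curve drops (completes) earlier. The sketch below computes RMST from a list of completion days, assuming no censoring so the survival curve is simply the empirical fraction not yet finished. The event times are invented for illustration and are not the study's data.

```python
# Illustrative RMST computation for time-to-task-completion data.
import numpy as np

def rmst(event_days: np.ndarray, tau: float) -> float:
    """Area under the empirical survival curve S(t) up to horizon tau.

    Assumes every completion day is observed (no censoring), so S(t) is
    just the fraction of participants with completion time > t.
    """
    days = np.sort(event_days)
    n = len(days)
    area, prev_t, surviving = 0.0, 0.0, n
    for t in days:
        t = min(float(t), tau)
        area += (surviving / n) * (t - prev_t)  # rectangle under the step
        prev_t = t
        surviving -= 1
        if prev_t >= tau:
            break
    area += (surviving / n) * max(0.0, tau - prev_t)  # tail up to tau
    return area

llm_days = np.array([8, 10, 12, 15, 20])   # hypothetical completion days
net_days = np.array([14, 16, 18, 22, 26])  # hypothetical completion days

diff = rmst(llm_days, tau=30) - rmst(net_days, tau=30)
print(f"RMST difference (LLM - Internet) = {diff:.1f} days")
```

On this toy data the difference is negative, i.e. the LLM arm finishes several days sooner on average, mirroring the sign of the reported −6.02-day cell-culture result.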

Strengths of the study include its rigorous RCT design, pre‑registration, blinding of outcome assessors, and the use of both frequentist and Bayesian analytic frameworks to extract signal from sparse data. Limitations include the low baseline competence of participants (overall success <10%), which limited statistical power; the binary success/failure outcome, which obscures partial progress; and the limited granularity of LLM usage logs, which prevented a fine‑grained analysis of which prompts or model interactions drove performance gains.

The authors conclude that mid‑2025 LLMs do not dramatically increase novice capability to independently execute complex dual‑use virology protocols, but they do provide modest, task‑specific benefits—particularly in cell‑culture procedures and in accelerating progress through intermediate steps. This gap between strong in‑silico benchmark performance and modest real‑world utility underscores the necessity of physical‑world validation for AI biosecurity risk assessments. Future work should explore higher‑skill cohorts, longer training horizons, and more sophisticated human‑AI interaction designs to better gauge the potential for LLMs to amplify both beneficial and malicious biotechnological capabilities.

