Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models I: The Task-Query Architecture
📝 Abstract
Both model developers and policymakers seek to quantify and mitigate the risk of rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons. An important element of such efforts is the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper describes the first component of a novel Biothreat Benchmark Generation (BBG) Framework. The BBG approach is designed to help model developers and evaluators reliably measure and assess the biosecurity risk uplift and general harm potential of existing and future AI models, while accounting for key aspects of the threat itself that are often overlooked in other benchmarking efforts, including different actor capability levels, and operational (in addition to purely technical) risk factors. As a pilot, the BBG is first being developed to address bacterial biological threats only. The BBG is built upon a hierarchical structure of biothreat categories, elements and tasks, which then serves as the basis for the development of task-aligned queries. This paper outlines the development of this biothreat task-query architecture, which we have named the Bacterial Biothreat Schema, while future papers will describe follow-on efforts to turn queries into model prompts, as well as how the resulting benchmarks can be implemented for model evaluation. Overall, the BBG Framework, including the Bacterial Biothreat Schema, seeks to offer a robust, re-usable structure for evaluating bacterial biological risks arising from LLMs across multiple levels of aggregation, which captures the full scope of technical and operational requirements for biological adversaries, and which accounts for a wide spectrum of biological adversary capabilities.
📄 Content
Extensive previous research has attempted to characterize the risks artificial intelligence (AI) models and generative AI tools pose to public safety, peace, and global stability. One major concern is how AI models might empower malicious actors to generate catastrophic harm. 1 A particularly prominent area of attention has been the potential impact of frontier AI models, especially large language models (LLMs), on biosecurity risk. Biotechnology is a rapidly evolving domain, and biosecurity experts fear that equally rapidly evolving foundational AI tools might increase the capabilities of states, terrorists, and other non-state actors to accomplish previously inaccessible technical operations, thus accelerating the creation and dissemination of biological weapons. The inherently dual-use nature of much biological knowledge, equipment, and agents complicates the evaluation of frontier AI systems, given that the same piece of information can have both benign and malicious uses.
AI providers and policymakers alike now seek to quantify and qualify the biosecurity risk that frontier AI tools currently pose and could pose in the future. Recognizing the collective action challenge, in 2023 several model providers signed a voluntary commitment to increase AI safety, including in the biological area. 2 In addition to calling for increased Red Teaming, these commitments recommend developing a set of benchmark prompts (questions, requests, instructions, etc.) that could be used to screen frontier AI models to objectively measure the degree to which a model might increase biosecurity risk. More precisely, the Frontier Model Forum describes benchmarking evaluation as: “Sets of safety-relevant questions or tasks designed to test model capabilities and assess how answers differ across models. These evaluations aim to provide baseline indications of general or domain-specific capabilities that are comparable across models.” 3 The problem can be summarized as follows: AI tool providers need to understand how their model’s capabilities for biotechnology misuse change over time compared to a consistent standard: a benchmark. However, we argue that existing benchmarks, while a valuable first step, do not approach the threat elements of the problem with sufficient nuance and as a result provide only partial assessments of risk, thus making biosecurity risk mitigation more challenging.
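The screening protocol described above, applying a fixed set of benchmark prompts to successive model versions and comparing results against a consistent standard, can be sketched in miniature. The sketch below is illustrative only: `ask_model` and `judge_unsafe` are hypothetical placeholder callables supplied by an evaluator, not APIs from the paper or any real model provider.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass(frozen=True)
class BenchmarkItem:
    """One safety-relevant screening prompt in a fixed benchmark set."""
    item_id: str
    prompt: str

def uplift_rate(
    items: Iterable[BenchmarkItem],
    ask_model: Callable[[str], str],
    judge_unsafe: Callable[[str, str], bool],
) -> float:
    """Fraction of benchmark prompts for which the model's answer is
    judged to provide meaningful uplift. Because the item set is held
    constant, the score is comparable across models and versions."""
    items = list(items)
    if not items:
        return 0.0
    flagged = sum(
        1 for item in items if judge_unsafe(item.prompt, ask_model(item.prompt))
    )
    return flagged / len(items)
```

Re-running `uplift_rate` with the same item set after a model update, or after mitigation measures, gives the before/after comparison that risk management frameworks call for.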
Our research team therefore set out to develop a proof of concept of a Biothreat Benchmark Generation (BBG) Framework, focused on bacterial biothreats. The BBG Framework is intended to serve as a defensible and sustainable process for generating and implementing a set of practical biothreat benchmarks for AI systems. In addition to providing a similar function to existing benchmarks in this domain, the benchmarks created by the BBG will measure potential harm multi-dimensionally and identify the key areas along the biosecurity threat pathway where a model might provide the greatest assistance to adversaries, thus helping to prioritize mitigation measures and providing a more nuanced understanding of evolving risks.
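The abstract describes the BBG as built on a hierarchy of biothreat categories, elements, and tasks, with task-aligned queries generated at the leaves. A minimal rendering of that hierarchy as nested data types might look like the following; the class names track the paper's terminology, but the fields and the example aggregation function are illustrative assumptions, not the authors' actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """A discrete technical or operational step an adversary must complete."""
    name: str
    queries: list[str] = field(default_factory=list)  # task-aligned queries

@dataclass
class Element:
    """A grouping of related tasks within a category."""
    name: str
    tasks: list[Task] = field(default_factory=list)

@dataclass
class Category:
    """A top-level stage of the biothreat pathway."""
    name: str
    elements: list[Element] = field(default_factory=list)

def total_queries(category: Category) -> int:
    """Aggregate query count for one category, supporting evaluation
    at multiple levels of the hierarchy."""
    return sum(len(t.queries) for e in category.elements for t in e.tasks)
```

A structure like this is what allows benchmark results to be rolled up "across multiple levels of aggregation": scores can be reported per query, per task, per element, or per category.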
In the larger AI risk context, benchmarking primarily supports the evaluation component of AI risk management frameworks. For example, benchmarking would fall under the “Assess” function of the Organisation for Economic Co-Operation and Development’s (OECD) AI risk management framework, in particular the subset of assessments related to robustness, security, and safety. 4 Benchmarking plays a similar role in the AI risk management frameworks of organizations such as the National Institute of Standards and Technology, the International Organization for Standardization, the Institute of Electrical and Electronics Engineers, and others, which generally align with the OECD framework. 5 Benchmarks can also support risk mitigation by re-assessing AI systems following risk reduction measures to evaluate the extent to which the risks have been reduced.
Scholarship has identified the value of developing benchmarks. Schuett et al. surveyed 51 experts from AGI labs, academia, civil society, and elsewhere about AI governance best practices; three of the top practices on which experts “strongly agreed” were pre-deployment risk assessment, dangerous capabilities evaluations, and third-party model audits, all of which can be supported by the use of proper benchmarks. 6 Barrett et al. note that open benchmarking with publicly available questions and answers can be a low-cost approach to evaluating models, and should be utilized in conjunction with more in-depth Red Teaming. 7 It must be acknowledged, of course, that benchmarking, like most other AI evaluation approaches, has its limitations. For example, an underlying assumption of such approaches is that AI developers and policymakers will perceive AI danger based on objective knowledge. 8 In the biological context, this implies an assumption that our current scientific-techn