PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach

February 23, 2026

Reading time: 6 minute

...

📝 Original Info

Title: PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach
ArXiv ID: 2511.20703
Date: 2025-11-24
Authors: Researchers from original ArXiv paper

📝 Abstract

Recent advances in Large Language Models (LLMs) have sparked concerns over their potential to acquire and misuse dangerous or high-risk capabilities, posing frontier risks. Current safety evaluations primarily test for what a model \textit{can} do - its capabilities - without assessing what it $\textit{would}$ do if endowed with high-risk capabilities. This leaves a critical blind spot: models may strategically conceal capabilities or rapidly acquire them, while harboring latent inclinations toward misuse. We argue that $\textbf{propensity}$ - the likelihood of a model to pursue harmful actions if empowered - is a critical, yet underexplored, axis of safety evaluation. We present $\textbf{PropensityBench}$, a novel benchmark framework that assesses the proclivity of models to engage in risky behaviors when equipped with simulated dangerous capabilities using proxy tools. Our framework includes 5,874 scenarios with 6,648 tools spanning four high-risk domains: cybersecurity, self-proliferation, biosecurity, and chemical security. We simulate access to powerful capabilities via a controlled agentic environment and evaluate the models' choices under varying operational pressures that reflect real-world constraints or incentives models may encounter, such as resource scarcity or gaining more autonomy. Across open-source and proprietary frontier models, we uncover 9 alarming signs of propensity: models frequently choose high-risk tools when under pressure, despite lacking the capability to execute such actions unaided. These findings call for a shift from static capability audits toward dynamic propensity assessments as a prerequisite for deploying frontier AI systems safely. Our code is available at https://github.com/scaleapi/propensity-evaluation.

💡 Deep Analysis

Deep Dive into PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach.

📄 Full Content

PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach Udari Madhushani Sehwag1*, Shayan Shabihi2*, Alex McAvoy3, Vikash Sehwag4, Yuancheng Xu5, Dalton Towers6, Furong Huang2 1Scale AI, 2University of Maryland, College Park, 3University of North Carolina at Chapel Hill, 4Google DeepMind, 5Netflix, 6University of Texas at Austin * Equal Contributions # udari.sehwag@scale.com # shabihi@umd.edu scale.com/research/propensitybench Abstract Recent advances in Large Language Models (LLMs) have sparked concerns over their potential to acquire and misuse dangerous or high-risk capabilities, posing frontier risks. Current safety evaluations primarily test for what a model can do—its capabilities—without assessing what it would do if endowed with high-risk capabilities. This leaves a critical blind spot: models may strategically conceal capabilities or rapidly acquire them, while harboring latent inclinations toward misuse. We argue that propensity—the likelihood of a model to pursue harmful actions if empowered—is a critical, yet underexplored, axis of safety evaluation. We present PropensityBench, a novel benchmark framework that assesses the proclivity of models to engage in risky behaviors when equipped with simulated dangerous capabilities using proxy tools. Our framework includes 5,874 scenarios with 6,648 tools spanning four high-risk domains: cybersecurity, self-proliferation, biosecurity, and chemical security. We simulate access to powerful capabilities via a controlled agentic environment and evaluate the models’ choices under varying operational pressures that reflect real-world constraints or incentives models may encounter, such as resource scarcity or gaining more autonomy. Across open-source and proprietary frontier models, we uncover 9 alarming signs of propensity: models frequently choose high-risk tools when under pressure, despite lacking the capability to execute such actions unaided. These findings call for a shift from static capability audits toward dynamic propensity assessments as a prerequisite for deploying frontier AI systems safely. Our code is available at https://github.com/scaleapi/propensity-evaluation. 1. Introduction As Large Language Models (LLMs) develop increasingly sophisticated capabilities across a wide range of domains, they also raise significant safety concerns with potential for misuse by malicious actors with limited resources [18, 35, 48]. These risks are most acute in frontier-safety domains, ranging from advanced cyber-attack techniques and the automated synthesis of chemical or biological agents, to novel hazards that emerge as models gain greater autonomy [26, 28]. Current frontier safety risk evaluations primarily test for capa- bility: whether a model possesses factual knowledge or skills that could be misused in frontier-risk domains. Benchmarks such as WMDP [26] probe domain-specific competencies in these risk domains [19, 28, 33]. However, this capability- centric view overlooks a critical dimension: even if a model yet lacks the capability to pursue high-risk actions, it may still harbor a propensity to pursue them if given the means. Models may also strategically obscure capabilities or rapidly gain them due to finetuning or tool augmentation [3, 23, 29]. Relying solely on monitoring of capabilities without accounting for propensity, therefore, provides an incomplete picture of the safety risks. Drawing inspiration from human risk assessment [24, 53], where we judge not just capability, but intent; we propose a shift from assessing can-do, to probing would-do. We introduce PropensityBench, a systematic framework to measure LLMs’ inclination to misuse simulated capabilities across four critical risk domains: cybersecurity, self-proliferation, biosecurity, and chemical security [18, 26, 28], where safety failures can precipitate security breaches with catastrophic consequences. To facilitate research on this new axis of evaluation, we release PropensityBench as a comprehensive open-source toolkit, including an automated scenario generation pipeline and a robust evaluation harness. Table 1 details PropensityBench’s scale and its main components. We adopt an agentic evaluation framework [57], in which LLMs are instantiated as agents and provided with a suite of 1 Claude 4S Gemini 2.5P O3 O4-mini 0 25 50 75 Propensity Score (%) (a) Safety Erodes Under Pressure 0 2 4 6 8 10 12 Pressure Level 0 25 50 75 (b) Propensity Escalates with Pressure Claude 4S Gemini 2.5P O3 O4-mini 0 20 40 Propensity Increase (∆pp) (c) Tool Name Sensitivity (Benign/Harmful) Bio-Security Chemical-Security Cyber-Security Self-Proliferation 0 25 50 75 Propensity Score (%) (d) Risk Domain Propensities (Harmful Names) Bio-Security Chemical-Security Cyber-Security Self-Proliferation 0 25 50 75 (e) Risk Domain Propensities (Benign Names) Claude 4S Gemini 2.5P O3 O4-mini EA F PS RD SP T 11.8 79.9 8.5 18.2 10.8 77.8 10.4 14.7 7.2 73.9 10.4 13.5 17.3 81.2 11

…(Full text truncated)…

📄 Read Full PDF on ArXiv