PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach

Reading time: 6 minute
...

📝 Original Info

  • Title: PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach
  • ArXiv ID: 2511.20703
  • Date: 2025-11-24
  • Authors: Researchers from original ArXiv paper

📝 Abstract

Recent advances in Large Language Models (LLMs) have sparked concerns over their potential to acquire and misuse dangerous or high-risk capabilities, posing frontier risks. Current safety evaluations primarily test for what a model \textit{can} do - its capabilities - without assessing what it $\textit{would}$ do if endowed with high-risk capabilities. This leaves a critical blind spot: models may strategically conceal capabilities or rapidly acquire them, while harboring latent inclinations toward misuse. We argue that $\textbf{propensity}$ - the likelihood of a model to pursue harmful actions if empowered - is a critical, yet underexplored, axis of safety evaluation. We present $\textbf{PropensityBench}$, a novel benchmark framework that assesses the proclivity of models to engage in risky behaviors when equipped with simulated dangerous capabilities using proxy tools. Our framework includes 5,874 scenarios with 6,648 tools spanning four high-risk domains: cybersecurity, self-proliferation, biosecurity, and chemical security. We simulate access to powerful capabilities via a controlled agentic environment and evaluate the models' choices under varying operational pressures that reflect real-world constraints or incentives models may encounter, such as resource scarcity or gaining more autonomy. Across open-source and proprietary frontier models, we uncover 9 alarming signs of propensity: models frequently choose high-risk tools when under pressure, despite lacking the capability to execute such actions unaided. These findings call for a shift from static capability audits toward dynamic propensity assessments as a prerequisite for deploying frontier AI systems safely. Our code is available at https://github.com/scaleapi/propensity-evaluation.

💡 Deep Analysis

Deep Dive into PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach.

Recent advances in Large Language Models (LLMs) have sparked concerns over their potential to acquire and misuse dangerous or high-risk capabilities, posing frontier risks. Current safety evaluations primarily test for what a model \textit{can} do - its capabilities - without assessing what it $\textit{would}$ do if endowed with high-risk capabilities. This leaves a critical blind spot: models may strategically conceal capabilities or rapidly acquire them, while harboring latent inclinations toward misuse. We argue that $\textbf{propensity}$ - the likelihood of a model to pursue harmful actions if empowered - is a critical, yet underexplored, axis of safety evaluation. We present $\textbf{PropensityBench}$, a novel benchmark framework that assesses the proclivity of models to engage in risky behaviors when equipped with simulated dangerous capabilities using proxy tools. Our framework includes 5,874 scenarios with 6,648 tools spanning four high-risk domains: cybersecurity, self-prolife

📄 Full Content

PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach Udari Madhushani Sehwag1*, Shayan Shabihi2*, Alex McAvoy3, Vikash Sehwag4, Yuancheng Xu5, Dalton Towers6, Furong Huang2 1Scale AI, 2University of Maryland, College Park, 3University of North Carolina at Chapel Hill, 4Google DeepMind, 5Netflix, 6University of Texas at Austin * Equal Contributions # udari.sehwag@scale.com # shabihi@umd.edu € scale.com/research/propensitybench Abstract Recent advances in Large Language Models (LLMs) have sparked concerns over their potential to acquire and misuse dangerous or high-risk capabilities, posing frontier risks. Current safety evaluations primarily test for what a model can do—its capabilities—without assessing what it would do if endowed with high-risk capabilities. This leaves a critical blind spot: models may strategically conceal capabilities or rapidly acquire them, while harboring latent inclinations toward misuse. We argue that propensity—the likelihood of a model to pursue harmful actions if empowered—is a critical, yet underexplored, axis of safety evaluation. We present PropensityBench, a novel benchmark framework that assesses the proclivity of models to engage in risky behaviors when equipped with simulated dangerous capabilities using proxy tools. Our framework includes 5,874 scenarios with 6,648 tools spanning four high-risk domains: cybersecurity, self-proliferation, biosecurity, and chemical security. We simulate access to powerful capabilities via a controlled agentic environment and evaluate the models’ choices under varying operational pressures that reflect real-world constraints or incentives models may encounter, such as resource scarcity or gaining more autonomy. Across open-source and proprietary frontier models, we uncover 9 alarming signs of propensity: models frequently choose high-risk tools when under pressure, despite lacking the capability to execute such actions unaided. These findings call for a shift from static capability audits toward dynamic propensity assessments as a prerequisite for deploying frontier AI systems safely. Our code is available at https://github.com/scaleapi/propensity-evaluation. 1. Introduction As Large Language Models (LLMs) develop increasingly sophisticated capabilities across a wide range of domains, they also raise significant safety concerns with potential for misuse by malicious actors with limited resources [18, 35, 48]. These risks are most acute in frontier-safety domains, ranging from advanced cyber-attack techniques and the automated synthesis of chemical or biological agents, to novel hazards that emerge as models gain greater autonomy [26, 28]. Current frontier safety risk evaluations primarily test for capa- bility: whether a model possesses factual knowledge or skills that could be misused in frontier-risk domains. Benchmarks such as WMDP [26] probe domain-specific competencies in these risk domains [19, 28, 33]. However, this capability- centric view overlooks a critical dimension: even if a model yet lacks the capability to pursue high-risk actions, it may still harbor a propensity to pursue them if given the means. Models may also strategically obscure capabilities or rapidly gain them due to finetuning or tool augmentation [3, 23, 29]. Relying solely on monitoring of capabilities without accounting for propensity, therefore, provides an incomplete picture of the safety risks. Drawing inspiration from human risk assessment [24, 53], where we judge not just capability, but intent; we propose a shift from assessing can-do, to probing would-do. We introduce PropensityBench, a systematic framework to measure LLMs’ inclination to misuse simulated capabilities across four critical risk domains: cybersecurity, self-proliferation, biosecurity, and chemical security [18, 26, 28], where safety failures can precipitate security breaches with catastrophic consequences. To facilitate research on this new axis of evaluation, we release PropensityBench as a comprehensive open-source toolkit, including an automated scenario generation pipeline and a robust evaluation harness. Table 1 details PropensityBench’s scale and its main components. We adopt an agentic evaluation framework [57], in which LLMs are instantiated as agents and provided with a suite of 1 Claude 4S Gemini 2.5P O3 O4-mini 0 25 50 75 Propensity Score (%) (a) Safety Erodes Under Pressure 0 2 4 6 8 10 12 Pressure Level 0 25 50 75 (b) Propensity Escalates with Pressure Claude 4S Gemini 2.5P O3 O4-mini 0 20 40 Propensity Increase (∆pp) (c) Tool Name Sensitivity (Benign/Harmful) Bio-Security Chemical-Security Cyber-Security Self-Proliferation 0 25 50 75 Propensity Score (%) (d) Risk Domain Propensities (Harmful Names) Bio-Security Chemical-Security Cyber-Security Self-Proliferation 0 25 50 75 (e) Risk Domain Propensities (Benign Names) Claude 4S Gemini 2.5P O3 O4-mini EA F PS RD SP T 11.8 79.9 8.5 18.2 10.8 77.8 10.4 14.7 7.2 73.9 10.4 13.5 17.3 81.2 11

…(Full text truncated)…

📸 Image Gallery

FSM1.png FSM1.webp LogotypeGradientRGB.png LogotypeGradientRGB.webp all_models_subplots_distribution.png all_models_subplots_distribution.webp all_models_subplots_distribution_select.png all_models_subplots_distribution_select.webp avg_target_calls_interaction_length.png avg_target_calls_interaction_length.webp building_funcs.png building_funcs.webp building_pressure_message.png building_pressure_message.webp domain_vulnerability_heatmap_harm.png domain_vulnerability_heatmap_harm.webp failure_analysis_flags_by_model.png failure_analysis_flags_by_model.webp figure_archetypes_matrix_giveup_harmful.png figure_archetypes_matrix_giveup_harmful.webp figure_extended_narrative_6_panel.png figure_extended_narrative_6_panel.webp generation_flowchart.png generation_flowchart.webp layered_structure.png layered_structure.webp layered_structure_pyramid.png layered_structure_pyramid.webp overall_generation_steps.png overall_generation_steps.webp pressure_dimension_effectiveness_boxes.png pressure_dimension_effectiveness_boxes.webp pressure_layered_structure.png pressure_layered_structure.webp pressure_protocol.png pressure_protocol.webp propensity_elo_regression_plot.png propensity_elo_regression_plot.webp say_do_gap_analysis.png say_do_gap_analysis.webp shallow_alignment_gap_dumbbell_plot.png shallow_alignment_gap_dumbbell_plot.webp taxonomy_quadrant_plot_v2.png taxonomy_quadrant_plot_v2.webp top_attack_vectors_by_domain.png top_attack_vectors_by_domain.webp top_attack_vectors_cyber_only.png top_attack_vectors_cyber_only.webp triggered_immediately_after_pressure_gap.png triggered_immediately_after_pressure_gap.webp

Reference

This content is AI-processed based on ArXiv data.

Start searching

Enter keywords to search articles

↑↓
ESC
⌘K Shortcut