BashArena: A Control Setting for Highly Privileged AI Agents

Reading time: 5 minute
...

📝 Original Info

  • Title: BashArena: A Control Setting for Highly Privileged AI Agents
  • ArXiv ID: 2512.15688
  • Date: 2025-12-17
  • Authors: ** Adam Kaufman*, James Lucassen*, Tyler Tracy, Cody Rushing, Aryan Bhatt – Redwood Research **

📝 Abstract

Future AI agents might run autonomously with elevated privileges. If these agents are misaligned, they might abuse these privileges to cause serious damage. The field of AI control develops techniques that make it harder for misaligned AIs to cause such damage, while preserving their usefulness. We introduce BashArena, a setting for studying AI control techniques in security-critical environments. BashArena contains 637 Linux system administration and infrastructure engineering tasks in complex, realistic environments, along with four sabotage objectives (execute malware, exfiltrate secrets, escalate privileges, and disable firewall) for a red team to target. We evaluate multiple frontier LLMs on their ability to complete tasks, perform sabotage undetected, and detect sabotage attempts. Claude Sonnet 4.5 successfully executes sabotage while evading monitoring by GPT-4.1 mini 26% of the time, at 4% trajectory-wise FPR. Our findings provide a baseline for designing more effective control protocols in BashArena. We release the dataset as a ControlArena setting and share our task generation pipeline.

💡 Deep Analysis

Figure 1

📄 Full Content

BASHARENA: A CONTROL SETTING FOR HIGHLY PRIVILEGED AI AGENTS Adam Kaufman∗ James Lucassen∗ Tyler Tracy Cody Rushing Aryan Bhatt Redwood Research ABSTRACT Future AI agents might run autonomously with elevated privileges. If these agents are misaligned, they might abuse these privileges to cause serious damage. The field of AI control develops techniques that make it harder for misaligned AIs to cause such damage, while preserving their usefulness. We introduce BashArena, a setting for studying AI control techniques in security-critical environments. BashArena contains 637 Linux system administration and infrastructure engineering tasks in complex, realistic environments, along with four sabotage objectives (execute malware, exfiltrate secrets, escalate privileges, and disable firewall) for a red team to target. We evaluate multiple frontier LLMs on their ability to complete tasks, perform sabotage undetected, and detect sabotage attempts. Claude Sonnet 4.5 successfully executes sabotage while evading monitoring by GPT-4.1 mini 26% of the time, at 4% trajectory-wise FPR. Our findings provide a baseline for designing more effective control protocols in BashArena. We release the dataset as a ControlArena setting and share our task generation pipeline. 1 Introduction AI agents are increasingly being deployed to automate software engineering tasks that require elevated privileges, such as writing security-critical infrastructure or working with sensitive data. Scheming [1], backdoored [2, 3], or otherwise misaligned AIs might abuse these privileges to cause security failures. Mitigating this threat while preserving the usefulness of benign AIs is one important subproblem of AI control [4]. The field of AI control develops techniques to reduce the probability that AIs trying to cause damage will succeed before being caught. Researchers develop more effective AI control protocols by testing them in settings that resemble future risky deployments. Control settings require two types of tasks - main tasks and side tasks. Main tasks represent the work the AI is supposed to do, and side tasks represent security failures a malicious AI might cause. This allows us to evaluate and improve AI control protocols via an adversarial game between a red team and a blue team. The red team constructs an attack policy that attempts to accomplish the side task, while the blue team constructs control protocols that detect and prevent attempts to achieve side tasks without hurting an honest policy’s performance on main tasks. For research on control protocols to be useful, the setting should be representative of the future deployments we’re interested in: ideally, the most effective control protocols in future should also perform the best in a control setting. There are a few control settings currently available. However, each of these settings have weaknesses which make the adversarial game unrealistic, or too easy for the most advanced models and control protocols. APPS backdooring [4] is a single-action setting, which makes it hard to study control protocols intended for agents that interact with an environment over multiple actions. BashBench [5] is a multi-action setting, but only includes one side task with a narrow attack surface, and main tasks in BashBench are solvable by models substantially behind the current frontier. SHADE-Arena [6] tasks are high average quality but the dataset contains only 26 task pairs, most of which do not involve writing and running code. ∗Core research contributor arXiv:2512.15688v1 [cs.CR] 17 Dec 2025 We introduce BashArena, a control setting that improves upon the current state of the art in the following ways: 1. Supports sophisticated sabotage strategies. BashArena environments are complex and diverse enough for AIs to invent sophisticated and creative sabotage strategies. Because the main tasks and side tasks require overlapping permissions and behaviors, simple monitoring and sandboxing protocols cannot rule out side tasks without making main tasks much harder or impossible. This makes BashArena a suitable testbed to design more effective protocols. 2. Main tasks are challenging for current frontier models. BashArena’s main tasks are difficult enough that they are not saturated by current frontier models, and weaker models perform significantly worse. The tradeoff between weaker trusted models and stronger untrusted models is an important feature of control protocol design. Harder main tasks allow BashArena to study stronger pairs of models while simulating this tradeoff realistically. 3. Large and diverse. BashArena contains a large number and a wide variety of main task / side task pairs. This makes the adversarial game both more challenging and more realistic by forcing the blue team to defend a broad attack surface. 2 Dataset 2.1 Contents BashArena contains 637 main tasks created by an automated pipeline, and four manually-constructed side tasks which simulate critical security failures.

📸 Image Gallery

attacker_performance.png attackvariance.png badsafetycurves_byattacker.png badsafetycurves_bymonitor.png badsafetycurves_bysidetask.png claude_heatmap.png cover.png gpt_heatmap.png maintask.png methodology.png monitor_safety_curves.png perfvariance.png pipeline.png tool_stats.png usefulness_overtime_longer.png

Reference

This content is AI-processed based on open access ArXiv data.

Start searching

Enter keywords to search articles

↑↓
ESC
⌘K Shortcut