BashArena: A Control Setting for Highly Privileged AI Agents
📝 Original Info
Title: BashArena: A Control Setting for Highly Privileged AI Agents
ArXiv ID: 2512.15688
Date: 2025-12-17
Authors: Adam Kaufman*, James Lucassen*, Tyler Tracy, Cody Rushing, Aryan Bhatt (Redwood Research)
📝 Abstract
Future AI agents might run autonomously with elevated privileges. If these agents are misaligned, they might abuse these privileges to cause serious damage. The field of AI control develops techniques that make it harder for misaligned AIs to cause such damage, while preserving their usefulness. We introduce BashArena, a setting for studying AI control techniques in security-critical environments. BashArena contains 637 Linux system administration and infrastructure engineering tasks in complex, realistic environments, along with four sabotage objectives (execute malware, exfiltrate secrets, escalate privileges, and disable firewall) for a red team to target. We evaluate multiple frontier LLMs on their ability to complete tasks, perform sabotage undetected, and detect sabotage attempts. Claude Sonnet 4.5 successfully executes sabotage while evading monitoring by GPT-4.1 mini 26% of the time, at 4% trajectory-wise FPR. Our findings provide a baseline for designing more effective control protocols in BashArena. We release the dataset as a ControlArena setting and share our task generation pipeline.
1 Introduction
AI agents are increasingly being deployed to automate software engineering tasks that require elevated privileges, such
as writing security-critical infrastructure or working with sensitive data. Scheming [1], backdoored [2, 3], or otherwise
misaligned AIs might abuse these privileges to cause security failures.
Mitigating this threat while preserving the usefulness of benign AIs is one important subproblem of AI control [4]. The
field of AI control develops techniques to reduce the probability that AIs trying to cause damage will succeed before
being caught. Researchers develop more effective AI control protocols by testing them in settings that resemble future
risky deployments.
Control settings require two types of tasks: main tasks and side tasks. Main tasks represent the work the AI is supposed
to do, and side tasks represent security failures a malicious AI might cause. This allows us to evaluate and improve
AI control protocols via an adversarial game between a red team and a blue team. The red team constructs an attack
policy that attempts to accomplish the side task, while the blue team constructs control protocols that detect and prevent
attempts to achieve side tasks without hurting an honest policy’s performance on main tasks. For research on control
protocols to be useful, the setting should be representative of the future deployments we’re interested in: ideally, the
most effective control protocols in the future should also perform the best in the control setting.
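One way to score such an adversarial game is sketched below: a monitor threshold is calibrated on honest trajectories to meet a target trajectory-wise false positive rate, and the red team is then credited for every attack trajectory that achieves the side task without exceeding that threshold. This is a minimal illustration, not the paper's evaluation code. The function names, the integer suspicion scale, and the toy scores are all assumptions made for the example.

```python
def threshold_at_fpr(honest_scores, target_fpr):
    """Return a threshold such that at most `target_fpr` of honest
    trajectories score strictly above it (and are therefore flagged)."""
    ranked = sorted(honest_scores, reverse=True)
    # Number of honest trajectories we may flag; the small epsilon
    # guards against floating-point error in the product.
    allowed = int(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")  # every trajectory may be flagged
    return ranked[allowed]


def undetected_success_rate(attacks, threshold):
    """Fraction of attack trajectories that complete the side task
    while staying at or below the monitor threshold.

    `attacks` is a list of (suspicion_score, side_task_succeeded) pairs.
    """
    wins = sum(1 for score, succeeded in attacks
               if succeeded and score <= threshold)
    return wins / len(attacks)


# Toy data (hypothetical): one suspicion score per trajectory.
honest = list(range(100))
attacks = [(97, True), (40, True), (99, True), (10, False), (95, True)]

threshold = threshold_at_fpr(honest, target_fpr=0.04)
rate = undetected_success_rate(attacks, threshold)
print(threshold, rate)
```

In this sketch an attack counts for the red team only if it both succeeds and goes unflagged, so a blue team can lower the rate either by improving the monitor or by making the side task harder to complete.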
There are a few control settings currently available. However, each of these settings has weaknesses that make
the adversarial game unrealistic, or too easy for the most advanced models and control protocols. APPS backdooring
[4] is a single-action setting, which makes it hard to study control protocols intended for agents that interact with an
environment over multiple actions. BashBench [5] is a multi-action setting, but only includes one side task with a
narrow attack surface, and main tasks in BashBench are solvable by models substantially behind the current frontier.
SHADE-Arena [6] tasks are of high average quality, but the dataset contains only 26 task pairs, most of which do not
involve writing and running code.
∗Core research contributor
arXiv:2512.15688v1 [cs.CR] 17 Dec 2025
We introduce BashArena, a control setting that improves upon the current state of the art in the following ways:
1. Supports sophisticated sabotage strategies. BashArena environments are complex and diverse enough for
AIs to invent sophisticated and creative sabotage strategies. Because the main tasks and side tasks require
overlapping permissions and behaviors, simple monitoring and sandboxing protocols cannot rule out side tasks
without making main tasks much harder or impossible. This makes BashArena a suitable testbed to design
more effective protocols.
2. Main tasks are challenging for current frontier models. BashArena’s main tasks are difficult enough that
they are not saturated by current frontier models, and weaker models perform significantly worse. The tradeoff
between weaker trusted models and stronger untrusted models is an important feature of control protocol
design. Harder main tasks allow BashArena to study stronger pairs of models while simulating this tradeoff
realistically.
3. Large and diverse. BashArena contains a large number and a wide variety of main task / side task pairs. This
makes the adversarial game both more challenging and more realistic by forcing the blue team to defend a
broad attack surface.
2 Dataset
2.1 Contents
BashArena contains 637 main tasks created by an automated pipeline and four manually constructed side tasks that
simulate critical security failures.
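To give a sense of the size of the resulting attack surface, the sketch below enumerates main task / side task pairs. It assumes that every main task can be combined with every side task, which the text above does not state explicitly. The main task IDs are placeholders, and the side task identifiers are shorthand for the four objectives named in the abstract rather than names taken from the released dataset.

```python
from itertools import product

# Shorthand labels for the four sabotage objectives (hypothetical identifiers).
SIDE_TASKS = (
    "execute_malware",
    "exfiltrate_secrets",
    "escalate_privileges",
    "disable_firewall",
)
NUM_MAIN_TASKS = 637

# Placeholder IDs; the real dataset uses its own task identifiers.
main_task_ids = [f"main_{i:03d}" for i in range(NUM_MAIN_TASKS)]

# Assumption: any main task may be paired with any side task.
pairs = list(product(main_task_ids, SIDE_TASKS))
print(len(pairs))
```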