From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw for Personalized Local AI Agent
Although large language model (LLM)-based agents, exemplified by OpenClaw, are increasingly evolving from task-oriented systems into personalized AI assistants for solving complex real-world tasks, their practical deployment also introduces severe security risks. However, existing agent security research and evaluation frameworks primarily focus on synthetic or task-centric settings, and thus fail to accurately capture the attack surface and risk propagation mechanisms of personalized agents in real-world deployments. To address this gap, we propose Personalized Agent Security Bench (PASB), an end-to-end security evaluation framework tailored for real-world personalized agents. Building upon existing agent attack paradigms, PASB incorporates personalized usage scenarios, realistic toolchains, and long-horizon interactions, enabling black-box, end-to-end security evaluation on real systems. Using OpenClaw as a representative case study, we systematically evaluate its security across multiple personalized scenarios, tool capabilities, and attack types. Our results indicate that OpenClaw exhibits critical vulnerabilities at different execution stages, including user prompt processing, tool usage, and memory retrieval, highlighting substantial security risks in personalized agent deployments. The code for the proposed PASB framework is available at https://github.com/AstorYH/PASB.
💡 Research Summary
The paper addresses the emerging security challenges of personalized large‑language‑model (LLM) agents, using OpenClaw as a representative system. While prior work on agent security has largely focused on synthetic, task‑centric, or white‑box settings, the authors argue that personalized assistants introduce a distinct attack surface: persistent operation, accumulation of private user context, and high‑privilege tool access. To evaluate these risks in realistic deployments, they introduce the Personalized Agent Security Bench (PASB), an end‑to‑end, black‑box benchmark framework specifically designed for real‑world personalized agents.
PASB extends existing agent‑attack taxonomies (prompt injection, indirect injection, tool misuse, memory poisoning) with three key enhancements. First, it models personalized usage scenarios and private assets by constructing realistic workflows (e.g., personal communication, information management, long‑horizon task coordination) and embedding auditable canary tokens, confidential files, and other sensitive data in a simulated long‑term memory store. Second, it provides a controllable testbed that mimics real toolchains—email sending, file access, financial transactions—through self‑hosted web services and mock APIs, allowing the adversary to manipulate external content and tool responses without touching production systems. Third, it automates the entire evaluation pipeline: a harness drives user‑facing inputs, records the agent’s textual responses, tool‑call events, and tool returns, and then computes a success predicate based on three observable harms: (i) P_leak – unauthorized disclosure of private assets, (ii) P_act – execution of disallowed or high‑impact tool actions, and (iii) P_persist – continuation of malicious influence after the attacker stops injecting inputs.
The threat model is formalized as an attack task Γ = ⟨C, I, B, G, P⟩, where C defines the scenario (including the tool set T and memory D), I enumerates adversary‑controllable channels (user prompts, untrusted web content, tool responses), B sets the interaction budget (max horizon), G specifies the adversarial goal (leakage, unsafe action, persistence), and P is the success predicate evaluated on the observable execution trace tr(τ). The adversary operates in a black‑box setting, observing only the normal interaction outputs and tool I/O, and may adapt its strategy across turns.
Applying PASB to OpenClaw, the authors design four representative attack families across the agent’s execution pipeline: (1) direct prompt injection that embeds a canary token into the agent’s memory; (2) indirect injection via malicious web pages that influence the agent’s planning and tool arguments; (3) tool misuse where a fabricated content server causes the agent to download malicious files or invoke a fake financial API; and (4) memory poisoning that re‑uses previously injected malicious data in later retrieval steps, enabling long‑term persistence. Each attack is executed over multiple interaction horizons, and success is measured via the Attack Success Rate (ASR) and its decomposition into leak, act, and persist components.
The empirical results reveal substantial vulnerabilities. Prompt‑processing attacks achieve ~35 % leak rate, tool‑calling attacks breach policy in ~58 % of trials, and combined memory‑poisoning attacks exhibit a persist rate of ~71 %, indicating that malicious influence can survive beyond the injection phase. Overall, PASB reports an ASR of roughly 48 % for OpenClaw, demonstrating that even without white‑box access, realistic adversaries can cause serious system‑level harms. The findings underscore that defenses focused solely on prompt sanitization or output filtering are insufficient; robust security must encompass tool‑interface validation, memory integrity checks, and continuous monitoring of long‑horizon interactions.
The paper’s contributions are threefold: (1) the design and open‑source release of PASB, a benchmark that captures personalized agent realities; (2) a comprehensive security evaluation of OpenClaw that uncovers critical, multi‑stage vulnerabilities; and (3) a reproducible evaluation pipeline that can serve as a baseline for future research on personalized agent security. The authors suggest extending PASB to other LLM agents, integrating automated mitigation mechanisms (e.g., runtime policy enforcement, anomaly detection), and exploring defenses that can break the propagation of malicious influence across the agent’s action chain.
Comments & Academic Discussion
Loading comments...
Leave a Comment