CHAI: Command Hijacking against embodied AI
Embodied Artificial Intelligence (AI) promises to handle edge cases in robotic vehicle systems where data is scarce by using common-sense reasoning grounded in perception and action to generalize beyond training distributions and adapt to novel real-world situations. These capabilities, however, also create new security risks. In this paper, we introduce CHAI (Command Hijacking against embodied AI), a physical-environment indirect prompt-injection attack that exploits the multimodal language interpretation abilities of AI models. CHAI embeds deceptive natural language instructions, such as misleading signs, in visual input, systematically searches the token space, builds a dictionary of prompts, and guides an attacker model to generate Visual Attack Prompts. We evaluate CHAI on four LVLM-driven agents: drone emergency landing, autonomous driving, aerial object tracking, and a real robotic vehicle. Our experiments show that CHAI consistently outperforms state-of-the-art attacks. By exploiting the semantic and multimodal reasoning strengths of next-generation embodied AI systems, CHAI underscores the urgent need for defenses that extend beyond traditional adversarial robustness.
💡 Research Summary
The paper introduces CHAI (Command Hijacking against embodied AI), a novel physical‑world indirect prompt‑injection attack that targets the “command layer” of robotic systems powered by Large Visual‑Language Models (LVLMs). Unlike prior attacks that focus on raw sensor data (adversarial patches, LiDAR spoofing) or on one‑shot typographic manipulations, CHAI jointly optimizes both the semantic content of a visual prompt (the text that conveys the malicious command) and its visual realization (color, font, size, placement). The attacker places a human‑readable sign or poster in the robot’s field of view; through a dual‑objective optimization, the sign is crafted so that the LVLM, when processing the scene, is highly likely to generate a specific intermediate text command that drives unsafe or unintended actions.
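The joint optimization over semantic content and visual realization can be written schematically as a single dual-objective problem. The notation below is ours, not the paper's: $t$ is a candidate malicious phrase from the dictionary $\mathcal{D}$, $\theta$ collects the visual parameters (color, font, size, placement), $\mathcal{S}$ is the distribution of scene variations, and $\mathrm{render}$ composes the sign into a scene:

```latex
\max_{t \in \mathcal{D},\;\theta}\;
\mathbb{E}_{s \sim \mathcal{S}}
\Bigl[ P_{\mathrm{LVLM}}\bigl(t \mid \mathrm{render}(s, t, \theta)\bigr) \Bigr]
```

Maximizing the expectation over sampled scenes, rather than the score on a single image, is what makes the resulting prompt robust to viewpoint and lighting changes.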
The authors formalize the threat model: the adversary has no cyber access to the robot but can physically introduce visual cues. The attack must remain readable to humans, avoid occluding critical visual information, and succeed across a range of viewpoints, lighting conditions, and background variations. To meet these constraints, CHAI proceeds in two stages. First, a “semantic optimizer” uses a large language model (e.g., GPT‑4) to generate a dictionary of candidate malicious phrases and evaluates their likelihood of being emitted by the target LVLM in simulated scenes. Second, a “visual optimizer” treats color, font, size, and placement as differentiable parameters and maximizes the probability that the LVLM outputs the chosen phrase, using gradient‑based updates on a surrogate differentiable LVLM (often CLIP‑style). By sampling multiple scene variations during training, the resulting visual prompt is universal: it works on many unseen images of the same environment.
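The two-stage pipeline above can be sketched in a few lines. This is a minimal, illustrative stand-in, not the paper's implementation: the real visual optimizer takes gradient steps on a differentiable surrogate LVLM, which we approximate here with random search over a toy scoring function; all names (`surrogate_score`, `optimize_visual_prompt`) and the parameterization are our own assumptions.

```python
import random

def surrogate_score(params, scene_seed):
    """Toy stand-in for the surrogate LVLM: the (hypothetical) probability
    that the model emits the target phrase given the sign's visual
    parameters. Per-scene noise mimics viewpoint/lighting variation."""
    rng = random.Random(scene_seed)
    contrast, size, cx = params["contrast"], params["size"], params["cx"]
    # Pretend large, high-contrast, roughly centered signs score higher.
    base = 0.4 * contrast + 0.4 * size - 0.2 * abs(cx - 0.5)
    return max(0.0, min(1.0, base + rng.uniform(-0.05, 0.05)))

def optimize_visual_prompt(scenes, iters=200, seed=0):
    """Random-search approximation of CHAI's visual optimizer: maximize
    the *average* score across sampled scene variations, so the chosen
    visual prompt is universal rather than tuned to one image."""
    rng = random.Random(seed)
    best_params, best_score = None, -1.0
    for _ in range(iters):
        params = {"contrast": rng.random(),
                  "size": rng.random(),
                  "cx": rng.random()}
        avg = sum(surrogate_score(params, s) for s in scenes) / len(scenes)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score
```

Averaging over many scene variants during the search is the mechanism that produces a *universal* prompt; a per-image search would overfit to one viewpoint.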
Experiments cover three representative LVLM‑driven agents—drone emergency landing, autonomous driving (DriveLM), and aerial object tracking (CloudTrack)—as well as a real‑world robotic vehicle. In simulation, CHAI achieves attack success rates (ASR) of 95.5% on CloudTrack, 81.8% on DriveLM, and 72.8% on the drone landing task. In physical tests with varying illumination and viewpoints, the attack still succeeds over 87% of the time, demonstrating robustness to real‑world noise. Compared to the prior state‑of‑the‑art typographic attack SceneTap, CHAI is up to ten times more effective and, unlike SceneTap, produces a single universal prompt rather than a per‑image one‑shot generation.
The paper also evaluates multilingual generalization, showing that the same optimization pipeline works for English, Chinese, Spanish, and mixed “Spanglish” prompts, confirming that LVLMs’ language‑agnostic semantic extraction can be abused across languages.
Defensive considerations are discussed: (1) visual‑text filtering to detect suspicious signage, (2) multimodal consistency checks that compare detected text against expected scene semantics, and (3) a separate verification layer that validates LVLM‑generated commands before they reach the low‑level controller. The authors argue that provable robustness and alignment‑aware defenses are required because traditional adversarial‑patch defenses do not address this high‑level, cross‑modal attack surface.
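The third defense, a verification layer between the LVLM and the low-level controller, can be sketched as a simple gate. This is our own illustrative design, not the paper's: `verify_command`, the allowlist contents, and the safety predicate are hypothetical.

```python
# Hypothetical command-verification layer: commands emitted by the LVLM
# are checked against a per-mission allowlist and simple safety
# predicates before they ever reach the low-level controller.
ALLOWED_COMMANDS = {"land", "hover", "track", "return_home"}

def verify_command(command, args, mission_allowlist=ALLOWED_COMMANDS):
    """Return (accepted, reason). Reject anything outside the mission's
    allowlist, plus allowlisted commands whose arguments fail a safety
    predicate (here: landing only in a pre-approved zone)."""
    if command not in mission_allowlist:
        return False, f"command '{command}' not permitted for this mission"
    if command == "land" and args.get("zone") not in {"pad_a", "pad_b"}:
        return False, "landing zone not pre-approved"
    return True, "ok"
```

The key property is that the gate validates the *intermediate text command*, which is exactly the layer CHAI hijacks, so even a perfectly persuasive sign cannot inject an action outside the mission's envelope.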
In summary, CHAI reveals a previously unexamined vulnerability in embodied AI systems that rely on LVLMs for reasoning and decision making. By exploiting the ability of LVLMs to fuse visual cues with natural‑language commands, an attacker can hijack the robot’s high‑level planning without touching its firmware or network. The work establishes a new benchmark for physical prompt‑injection attacks, provides extensive empirical evidence of practicality, and calls for a new class of defenses that secure the command layer of multimodal autonomous agents.