RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic

Reading time: 5 minutes
...

📝 Original Info

  • Title: RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic
  • ArXiv ID: 2512.21220
  • Date: 2025-12-24
  • Authors: Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, Xianglong Liu

📝 Abstract

Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent's multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.
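One way to read the architecture described in the abstract is as a per-action guard loop over a hybrid long-short safety memory: backward reflective reasoning checks the recent trajectory, forward predictive reasoning checks the current observation, and the action is only executed if both pass. The minimal Python sketch below illustrates that control flow under assumptions: the names HybridSafetyMemory and guard_step are hypothetical, and the VLM-driven generation of safety predicates is abstracted into pre-populated lists. It is a reading of the described design, not the authors' implementation.

```python
# Hypothetical control-flow sketch of the guardrail described in the abstract.
# Names and data layout are assumptions; VLM-based predicate generation is
# abstracted away as pre-populated predicate lists.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# A predicate returns True when the proposed action is judged safe.
TemporalPredicate = Callable[[str, List[str]], bool]      # (action, recent trajectory) -> safe?
ContextPredicate = Callable[[str, Dict[str, Any]], bool]  # (action, observation) -> safe?

@dataclass
class HybridSafetyMemory:
    short_term: List[str] = field(default_factory=list)              # recent action trajectory
    temporal: List[TemporalPredicate] = field(default_factory=list)  # predicates inferred by backward reflection
    long_term: List[ContextPredicate] = field(default_factory=list)  # accumulated contextual safety predicates

def guard_step(action: str, obs: Dict[str, Any], mem: HybridSafetyMemory) -> str:
    """Run both safety checks before the agent executes `action`."""
    # Backward reflective reasoning: revisit the recent trajectory for temporal violations.
    if any(not safe(action, mem.short_term) for safe in mem.temporal):
        return "REPLAN"  # trigger replanning, e.g., insert a corrective step first
    # Forward predictive reasoning: evaluate context-aware predicates on the observation.
    if any(not safe(action, obs) for safe in mem.long_term):
        return "BLOCK"
    mem.short_term.append(action)  # action admitted; extend the trajectory
    return "PASS"
```

In this reading, REPLAN maps to the backward module's replanning trigger, BLOCK to the forward module's interception, and PASS lets the original plan proceed unchanged.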


📄 Full Content

RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic

Le Wang1, Zonghao Ying1, Xiao Yang1, Quanchen Zou2, Zhenfei Yin3, Tianlin Li4, Jian Yang1, Yaodong Yang5,6, Aishan Liu1,6*, Xianglong Liu1
1Beihang University, 2360 AI Security Lab, 3The University of Sydney, 4Nanyang Technological University, 5Peking University, 6Beijing Academy of Artificial Intelligence
*The corresponding author.

[Figure 1. Illustration of RoboSafe, where the runtime safety guardrail generates executable safety logic to eliminate implicit temporal hazards and prevent contextual risks under dynamic scenarios. The panels contrast Temporal Risk Mitigation (Backward Reasoning with Temporal Safety Logic, e.g., replanning to turn off a running faucet to avoid flooding) with Contextual Risk Prevention (Forward Reasoning with Contextual Safety Logic, e.g., blocking a "throw candle" action to avoid damage).]

1. Introduction

In recent years, vision-language-model (VLM)-driven embodied agents have demonstrated impressive performance in solving complex, long-horizon tasks within interactive environments [4, 6, 8, 18, 20–22, 35, 41]. By harnessing the powerful reasoning and planning capabilities of VLMs [2, 14, 24–26], embodied agents can understand abstract multimodal inputs and autonomously decompose them into executable, multi-step plans in the physical world [9, 13, 22, 30, 33, 34]. Despite this, VLM-driven embodied agents have been shown to be significantly vulnerable to malicious hazardous instructions [36] (e.g., "Throw ball to break the window").
This vulnerability is critically amplified [10–12, 19, 29, 36–38] compared to normal large language models (LLMs). While harmful content generated by LLMs is confined to textual outputs, embodied agents are capable of translating unsafe instructions into physical actions, posing immediate and irreversible real-world safety threats [40].

A significant body of research has focused on safety defenses for embodied agents [12, 28, 32, 36]. In contrast to training-time strategies that require costly data collection and substantial computational resources [39], runtime safety guardrails offer a flexible and lightweight solution by monitoring agent output actions at inference time, thereby bypassing costly model training. However, current methods often rely on pre-defined, static rules or hand-crafted, safety-aligned prompting, which fall short in effectively mitigating implicit risks in dynamic, temporally dependent, and context-rich environments. Specifically, they struggle to address two types of implicit risks. ❶ Contextual risk, where a seemingly benign action becomes hazardous due to the immediate, specific context. Consider the seemingly benign action "turn on the microwave". Whether this action is safe or hazardous depends on implicit environmental states that are not explicitly represented in the command itself. If a metal fork happens to be inside the microwave, the same action becomes unsafe; if it contains …
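The code panels in Figure 1, together with the microwave example above, suggest how such checks can be written as executable predicates. The sketch below reconstructs that logic under assumptions: temp_logic and context_logic follow the names shown in the figure, but their rule bodies, the short-term-memory format, and the observation structure are illustrative guesses rather than the paper's released implementation. Checks like these could plug into a guard loop such as the one sketched after the abstract.

```python
# Minimal reconstruction of the two checks shown in Figure 1. The rule bodies
# and data formats are assumptions for illustration, not the authors' code.
from typing import Any, Dict, List

def temp_logic(action: str, stm: List[str]) -> Dict[str, Any]:
    """Temporal safety predicate over short-term memory (recent actions).
    Example rule: a faucet turned on earlier must be turned off before moving on."""
    if "turn on faucet" in stm and action != "turn off faucet":
        return {"val": True, "action": "turn off faucet", "msg": "avoid flooding"}
    return {"val": False}

def context_logic(action: str, obs: Dict[str, Any]) -> Dict[str, Any]:
    """Contextual safety predicate over the current observation.
    Example rule: do not start the microwave with a metal object inside."""
    if action == "turn on microwave" and "metal fork" in obs.get("inside_microwave", []):
        return {"val": True, "msg": "metal object inside the microwave"}
    return {"val": False}

def backward_check(action: str, stm: List[str]) -> str:
    """Backward reflective reasoning: replan when a temporal predicate is violated."""
    violation = temp_logic(action, stm)
    if violation["val"]:
        return f"REPLAN with {violation['action']} ({violation['msg']})"
    return "PASS"

def forward_check(action: str, obs: Dict[str, Any]) -> str:
    """Forward predictive reasoning: block actions flagged by a contextual predicate."""
    risk = context_logic(action, obs)
    if risk["val"]:
        return f"BLOCK on {action} ({risk['msg']})"
    return "PASS"

if __name__ == "__main__":
    print(backward_check("leave kitchen", ["turn on faucet", "wash cup"]))           # -> REPLAN ...
    print(forward_check("turn on microwave", {"inside_microwave": ["metal fork"]}))  # -> BLOCK ...
```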

Reference

This content is AI-processed based on open access ArXiv data.
