📝 Original Info
- Title: RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic
- ArXiv ID: 2512.21220
- Date: 2025-12-24
- Authors: Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, Xianglong Liu
📝 Abstract
Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent's multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.
💡 Deep Analysis
📄 Full Content
RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic
Le Wang¹, Zonghao Ying¹, Xiao Yang¹, Quanchen Zou², Zhenfei Yin³, Tianlin Li⁴, Jian Yang¹, Yaodong Yang⁵,⁶, Aishan Liu¹,⁶*, Xianglong Liu¹
¹Beihang University  ²360 AI Security Lab  ³The University of Sydney  ⁴Nanyang Technological University  ⁵Peking University  ⁶Beijing Academy of Artificial Intelligence
Abstract
Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent's multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.
*The corresponding author.
[Figure 1 diagram: an embodied agent's actions pass through two guardrail panels. Backward Reasoning applies Temporal Safety Logic over short-term memory for temporal risk mitigation (e.g., a faucet turned on but never turned off triggers a REPLAN to avoid flooding); Forward Reasoning applies Contextual Safety Logic over current observations for contextual risk prevention (e.g., "throw candle" is blocked to avoid damage). Both checks return PASS when no violation or risk is detected.]
Figure 1. Illustration of RoboSafe, where the runtime safety guardrail generates executable safety logic to eliminate implicit temporal hazards and prevent contextual risks under dynamic scenarios.
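The panel pseudocode in Figure 1 can be read as ordinary Python. Below is a minimal, self-contained sketch of that executable safety logic: the returned fields ('val', 'action', 'msg') and the REPLAN/BLOCK/PASS outcomes follow the figure, while the predicate bodies, example rules (faucet, candle), and helper names backward_check/forward_check are illustrative assumptions rather than the authors' released implementation.

```python
def temp_logic(action, stm):
    """Temporal safety predicate over the short-term memory of recent actions (illustrative)."""
    # Example rule: a faucet turned on earlier must eventually be turned off.
    if "turn on faucet" in stm and "turn off faucet" not in stm and action != "turn off faucet":
        return {"val": True, "action": "turn off faucet", "msg": "avoid flooding"}
    return {"val": False}

def context_logic(action, obs):
    """Contextual safety predicate over the current observation (illustrative)."""
    # Example rule: never throw a fragile or burning object.
    if action.startswith("throw") and obs.get("held_object") == "candle":
        return {"val": True, "msg": "avoid damaging"}
    return {"val": False}

def backward_check(action, stm):
    """Backward reasoning: trigger replanning when a temporal predicate is violated."""
    violation = temp_logic(action, stm)
    if violation["val"]:
        return f"REPLAN with {violation['action']} with {violation['msg']}"
    return "PASS"

def forward_check(action, obs):
    """Forward reasoning: block an action that is risky in the current context."""
    risk = context_logic(action, obs)
    if risk["val"]:
        return f"BLOCK on {action} for {risk['msg']}"
    return "PASS"

print(backward_check("leave kitchen", stm=["turn on faucet"]))        # REPLAN ...
print(forward_check("throw candle", obs={"held_object": "candle"}))   # BLOCK ...
```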
1. Introduction
In recent years, vision-language-model (VLM)-driven embodied agents have demonstrated impressive performance in solving complex, long-horizon tasks within interactive environments [4, 6, 8, 18, 20–22, 35, 41]. By harnessing the powerful reasoning and planning capabilities of VLMs [2, 14, 24–26], embodied agents can understand abstract multimodal inputs and autonomously decompose them into executable, multi-step plans in the physical world [9, 13, 22, 30, 33, 34].
Despite this, VLM-driven embodied agents have been shown to be significantly vulnerable to malicious hazardous instructions [36] (e.g., "Throw ball to break the window"). This vulnerability is critically amplified [10–12, 19, 29, 36–38] compared with standard large language models (LLMs). While harmful content generated by LLMs is confined to textual outputs, embodied agents are capable of translating unsafe instructions into physical actions, posing immediate and irreversible real-world safety threats [40].
A significant body of research has focused on safety defenses for embodied agents [12, 28, 32, 36]. In contrast to training-time strategies that require costly data collection and substantial computational resources [39], runtime safety guardrails offer a flexible and lightweight solution by monitoring agent output actions at inference time, thereby bypassing costly model training. However, current methods often rely on pre-defined, static rules or hand-crafted, safety-aligned prompting, which fall short in effectively mitigating implicit risks in dynamic, temporally dependent, and context-rich environments. Specifically, they struggle to address two types of implicit risks. ❶ Contextual risk, where a seemingly benign action becomes hazardous due to the immediate, specific context. Consider the action "turn on the microwave". Whether this action is safe or hazardous depends on implicit environmental states that are not explicitly represented in the command itself. If a metal fork happens to be inside the microwave, the same action becomes unsafe; if it contains
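As a concrete illustration of such a context-dependent check, the following is a minimal sketch of a contextual safety predicate for the microwave example; the observation field ("inside_microwave"), the object names, and the predicate itself are hypothetical and not drawn from the paper.

```python
def microwave_predicate(action, obs):
    """Block 'turn on the microwave' when a metal object is detected inside (illustrative)."""
    metal_items = {"metal fork", "metal spoon", "aluminum foil"}
    inside = set(obs.get("inside_microwave", []))
    if action == "turn on the microwave" and metal_items & inside:
        return {"val": True, "msg": "metal object inside microwave"}
    return {"val": False}

# The same command is blocked or passed depending on implicit environment state.
print(microwave_predicate("turn on the microwave", {"inside_microwave": ["metal fork"]}))
print(microwave_predicate("turn on the microwave", {"inside_microwave": ["ceramic bowl"]}))
```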
Reference
This content is AI-processed based on open access ArXiv data.