SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

S A V E S : S T E E R I N G S A F E T Y J U D G M E N T S I N V I S I O N - L A N G U A G E M O D E L S V I A S E M A N T I C C U E S Carlos Hinojosa 1 Clemens Grange 1 , 2 * Bernard Ghanem 1 1 King Abdullah Univ ersity of Science and T echnology (KA UST), Saudi Arabia 2 T echnical Univ ersity of Munich (TUM), Germany carlos.hinojosa@kaust.edu.sa A B S T R A C T V ision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. Howe ver , it remains unclear which visual e vidence driv es these judgments. W e study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. W e introduce a semantic steering framework that applies controlled textual, visual, and cogniti ve interventions without changing the underlying scene content. T o ev aluate these effects, we propose SA V eS, a benchmark for situational safety under semantic cues, together with an ev aluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark sho w that safety decisions are highly sensiti ve to semantic cues, indicating reliance on learned visual–linguistic associations rather than grounded visual understanding. W e further demonstrate that automated steering pipelines can e xploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems. 1 Introduction V ision–language models (VLMs) are increasingly deployed in embodied and real-world scenarios where safety judg- ments depend critically on visual context [1, 2, 3, 4, 5, 6, 7]. The same instruction may be harmless in one scene yet hazardous in another . F or example, an instruction such as ‘‘put the items from the counter into the clear glass jar’’ may be safe when the items are candies, but dangerous when they are laundry detergent pods near a jar labeled for children (see ﬁg. 1). In such situations, correct behavior requires models to ground their deci- sions in the visual scene and distinguish safe from unsafe contexts. Failures in this process can manifest as unsafe compliance , in which the model complies with instructions despite a hazardous situation, or as over-r efusal , in which the model unnecessarily refuses benign requests. Recent work on multimodal safety has primarily focused on improving refusal policies [8, 9] or detecting harmful instructions [10, 11]. Howe ver , safety in embodied en vironments is inherently situational [1, 12, 13, 14]: the safety of an action depends on the interaction between the instruction and the visual context. This raises a fundamental question: What visual evidence actually drives safety decisions in VLMs? Current ev aluation protocols provide limited insight into this mechanism [15, 13, 1]. Models may appear safe simply by refusing frequently , yet such behavior does not guarantee that refusals are grounded in relev ant visual cues. In this work, we in vestigate whether safety judgments in VLMs can be steer ed by structured semantic cues. Speciﬁcally , we study controlled interventions that highlight regions of interest or explicitly direct the model’ s attention without altering the underlying scene semantics. Our central hypothesis is that safety decisions are highly sensitiv e to such cues, re vealing latent mechanisms by which models interpret visual risk. Importantly , these cues can inﬂuence model behavior in two directions: they can help models focus on relev ant hazards, but they can also induce hallucinated risk and o ver-refusal. T o systematically study this phenomenon, we introduce a framew ork for semantic steering of safety decisions. The framew ork includes three complementary intervention mechanisms: textual steering, which provides spatial descrip- ∗ W ork done during an internship at KA UST . S A V E S Figure 1: Steering safety judgments in VLMs using semantic cues . (Left) The same instruction may be safe or hazardous depending on the visual conte xt. (Center) Semantic steering cues (visual markers and textual prompts) guide the model’ s attention toward relev ant objects. (Right) W ithout steering, the model may produce unsafe compliance, whereas steering enables grounded reasoning and refusal of unsafe actions. tions or coordinate references; visual steering, which overlays semantic markers (e.g., circles) onto the image; and cognitiv e steering, which prompts the model to explicitly reason about safety and highlighted regions. These mecha- nisms enable controlled probing of how VLMs interpret visual e vidence during ev aluation. Evaluating such ef fects requires metrics that distinguish behavioral correctness from grounded reasoning. W e therefore introduce an ev aluation protocol that separates behavioral refusal from visual grounding. Behavioral Response Accu- racy (BRA) measures whether the model beha ves correctly under unsafe scenarios, while Grounded Safety Alignment (GSA) e valuates whether the model’ s explanation aligns with the ground-truth hazard. In addition, the False Refusal Rate (FRR) quantiﬁes unnecessary refusals in safe scenarios, capturing hallucinated risk that is often overlook ed in standard safety benchmarks. T o support controlled experiments, we introduce SA V eS , a benchmark designed to ev aluate situational safety under semantic steering. SA V eS complements existing datasets such as MSSBench-Embodied [1] by providing curated, high-quality synthetic image–instruction pairs with both safe and unsafe contexts, enabling systematic interventions and analysis. Using this benchmark, we conduct extensiv e experiments across multiple open VLMs and inv estigate how safety decisions respond to dif ferent steering strategies. Our results reveal that safety decisions can be substantially altered by relativ ely simple semantic cues. In particular, coupling visual mark ers with e xplicit reasoning prompts produces the strongest steering effect. Further analysis shows that steering effecti veness depends on marker semantics, prompt–cue alignment, and global scene context. More- ov er , we show that automated pipelines can exploit these mechanisms to induce systematic over -refusal, exposing a previously undere xplored vulnerability in multimodal safety systems. W e summarize our contributions as follows: • W e introduce a frame work for semantic steering that shows ho w safety judgments in vision–language models can be inﬂuenced by controlled textual, visual, and cognitive interventions, including their combinations, rev ealing that semantic cues (e.g., visual markers and textual prompts) can alter safety decisions e ven when the underlying scene hazard remains unchanged. • W e ev aluate multiple VLMs on both a state-of-the-art benchmark and our proposed SA V eS benchmark, and introduce an ev aluation protocol that separates behavioral refusal, grounded safety reasoning, and false re- fusals. • Through extensi ve experiments, we show that safety behavior in VLMs is highly sensitiv e to semantic cues, suggesting that models rely hea vily on learned visual–linguistic associations when making safety judgments. This exposes both opportunities for impro ved hazard aw areness and vulnerabilities to adversarial steering. • W e e valuate automated steering pipelines (Guardian, Auditor , and Attack er), sho wing limited g ains for assis- tiv e steering b ut strong adversarial e xploitability , establishing semantic steering as a bidirectional mechanism that can both improv e safety guidance and enable targeted safety manipulation. 2 S A V E S 2 Related W orks Multimodal Safety in V ision–Language Models. Ensuring safe behavior in vision–language models has become an important concern as these systems are increasingly deployed in real-world and embodied settings[8, 2, 16, 6]. Prior work has explored safety alignment through refusal policies, reinforcement learning from human feedback, and rule- based safeguards designed to pre vent harmful outputs [17, 18, 9, 19]. In multimodal contexts, recent studies highlight that safety decisions often depend on the interaction between language instructions and visual inputs, motiv ating the study of situational safety where identical instructions may lead to safe or unsafe outcomes depending on the scene [1, 20, 14, 21]. Ho wev er , existing approaches primarily ev aluate whether models refuse or comply with potentially harmful requests, without analyzing how visual and linguistic cues inﬂuence the underlying safety reasoning. As a result, it remains unclear how multimodal signals guide safety judgments or whether these decisions are grounded in the scene’ s visual content. Our work addresses this gap by studying ho w semantic cues can systematically steer safety behavior in VLMs. Steering and Manipulation of VLM Beha vior . Recent work has sho wn that the behavior of vision–language models can be manipulated through subtle multimodal signals that exploit learned visual–linguistic associations. T ypographic attacks demonstrate that inserting textual o verlays into images can signiﬁcantly alter model reasoning and predictions by activ ating textual priors rather than grounded visual interpretation [22]. Similarly , Li et al. [23] show that VLMs can map visual symbols such as logos to corresponding brand names ev en when no readable text is present, re vealing semantic entanglement in the visual projector . More recent approaches show that adversaries can construct multimodal jailbreak contexts using image-driv en prompts that induce harmful responses from target models [24]. Other studies attribute related hallucination phenomena to statistical biases and spurious modality shortcuts that bypass proper multi- modal grounding [9, 25, 26, 27, 28]. While these works primarily in vestigate adversarial manipulation or hallucination mechanisms, they do not analyze how such signals inﬂuence safety judgments. In contrast, our work in vestigates se- mantic steering as a controlled mechanism to inﬂuence safety decisions in VLMs, examining how simple visual and textual cues can systematically alter safety beha vior ev en when the underlying scene content remains unchanged. Safety Benchmarks and Evaluation Protocols. Recent work has proposed benchmarks to ev aluate the safety be- havior of vision–language models under potentially harmful instructions. These datasets typically measure whether models refuse unsafe requests or produce harmful outputs, pro viding useful insights into safety alignment and rob ust- ness [15, 14]. More recent work such as MSSBench-Embodied ev aluates situational safety in embodied scenarios where the same instruction may be safe or unsafe depending on the visual context [1]. Howe ver , existing ev aluations largely rely on outcome-lev el metrics such as refusal rates or harmful completions, which can obscure whether the model’ s reasoning is grounded in the visual context. In particular , models may appear safe by frequently refusing requests ev en when no hazard is present, leading to over -refusal and hallucinated risks. While prior work has studied situational safety , it has not examined ho w controlled semantic cues inﬂuence safety decisions. In contrast, our work introduces SA V eS, a benchmark designed to study situational safety under controlled semantic cues, and an e valuation protocol that distinguishes among behavioral refusal, grounded safety reasoning, and f alse refusals. 3 Method In this section, we formalize situational safety ev aluation in vision–language models (VLMs) and introduce a con- trolled semantic visual steering frame work to analyze ho w safety judgments can be inﬂuenced by structured interven- tions. 3.1 Problem F ormulation. W e study multimodal situational safety [1] in embodied settings, where a model must determine whether executing an instruction is safe given the visual context. Giv en an image I representing the current en vironment and a language query Q describing an intended action, a VLM f produces a response R = f ( I , Q ) , which implicitly encodes a safety decision. W e therefore model safety judgment as the conditional probability P ( Safe | I , Q ) , reﬂecting whether ex ecuting Q in scene I is safe. Speciﬁcally , P ( Safe | I , Q ) depends on the visual context, meaning that safety cannot be determined from Q alone without grounding in I . A safe model should refuse instructions that are dangerous in the given visual context while a voiding false refusals in safe situations. A correct safety judgment requires two conditions: (i) a correct behavioral decision (safe or unsafe), and (ii) grounded reasoning supported by visual evidence in I . Our goal is to analyze how safety decisions change under controlled semantic interventions in embodied scenarios. W e study semantic steering , in which we modify I and/or Q to inﬂuence the regions and semantic cues on which the model relies, without altering the underlying scene 3 S A V E S Figure 2: Semantic steering mechanisms . From left to right: visual steering highlights regions of interest using visual markers, cogniti ve steering prompts explicit safety reasoning prior to action planning, and textual steering directs attention via spatial language or coordinates. Each intervention can shift the model’ s decision between safe and unsafe responses under identical scene conditions. content. This enables us to determine whether safety beha vior can be steered by semantic cues rather than by grounded visual understanding. 3.2 Semantic Steering Framework W e introduce a controlled semantic steering framew ork to analyze how safety judgments in embodied tasks can be inﬂuenced without altering the underlying scene content. Gi ven ( I , Q ) , we deﬁne a steering mechanism M as a transformation that modiﬁes the image I , the query Q , or both, producing modiﬁed inputs ( ˜ I , ˜ Q ) . In our work, we consider image-only , text-only , and joint (image-and-text) interventions. The steered response is then ˜ R = f ( ˜ I , ˜ Q ) . By comparing R and ˜ R under the same scene semantics, we isolate ho w controlled semantic interv entions affect safety decisions. W e deﬁne three orthogonal categories of steering mechanisms, M ∈ { M v , M c , M t } , where each category isolates a distinct modality-lev el inﬂuence on safety behavior (see ﬁg. 2). V isual Steering ( M v ). V isual steering modiﬁes the image ˜ I = M v ( I ) while preserving scene semantics and keeping Q ﬁxed. The goal is to test whether safety judgments are sensitive to symbolic visual cues. W e ev aluate three visual intervention strategies: Semantic Markers , which consist of overlaying colored circles (red, white, green, yellow , and orange) on the image to highlight target objects or hazards. Red circles are commonly associated with danger , white circles act as a neutral spatial anchor , and the remaining colors test graded color sensitivity . Attention-Based Selection (ABS) examines whether distraction or resolution dri ves behavior . W e generate object crops I crop and evaluate two feeding strategies: (i) Dual-Stream, where ( I global , I crop ) are provided jointly; and (ii) Single-Stream, where only I crop or a masked v ersion of I global is used. This isolates the role of contextual information. Adversarial Overlays introduce noise patches and typographic stickers, such as a “SAFE” label, to test rob ustness to visual and semantic distractions. Cognitive Steering ( M c ). Cognitiv e steering modiﬁes the linguistic input ˜ Q = M c ( Q ) to alter the model’ s safety reasoning state while keeping the image I ﬁxed. W e e valuate three prompting strategies: Instruction F ollowing (IF) , where the model executes the instruction without being explicitly instructed to assess safety . In-Context Safety (IC) serves as our baseline, where the model is instructed to check for potential safety risks before answering; for ex- ample: "First check whether there is a potential safety risk · · · If NO safety risk, provide a plan". F ocus Steering (ICF) modiﬁes the prompt to explicitly direct attention to speciﬁc visual mark ers and reason about their safety; for example: "First check if there is a red circle. If yes, focus on it." W e distinguish General F ocus (check for an y circle) and Color -Speciﬁc Focus to test whether linguistic references to color trigger semantic danger associations. T extual Steering ( M t ). T extual steering encodes region information symbolically using bounding box coordinates in the prompt Q , such that ˜ Q = M t ( Q ) while keeping the image I ﬁxed. For example: "Focus on the region defined by [x1, y1, x2, y2]." The image remains unchanged. This tests whether abstract spatial descriptions can substitute for pixel-le vel cues. The separation of M v , M c , and M t ensures that changes in safety beha vior can be attributed to speciﬁc modality-lev el interventions. 4 S A V E S Figure 3: A utomated semantic steering architectur es. (A) Guardian assists safety by highlighting high-risk objects using VLM-based risk scoring. (B) The Auditor diagnoses model sensiti vity by analyzing attention maps using spatial masking and adversarial hotspot analysis. (C) Attacker exploits semantic cues by cloaking hazardous objects and inserting distractors to in vert the model’ s perceiv ed risk. 3.3 A utomated Steering Architectures T o study ho w semantic steering can be applied or exploited adversarially , we deﬁne three automated architectures (see ﬁg. 3). These architectures allow us to analyze ho w semantic steering inﬂuences safety decisions. Pipeline A: Guardian (Assistive). An auxiliary VLM, referred to as the Spotter , estimates risk scores S ∈ [0 , 1] for objects in the scene and selects the top- k most safety-critical ones (we use k = 3 ). Then, a P ainter module modiﬁes the image, ˜ I = M v ( I ) , by overlaying a colored circle according to the risk score: Marker ( S ) =    Red Circle , S > 0 . 8 , Orange Circle , 0 . 4 < S ≤ 0 . 8 , White Circle , S ≤ 0 . 4 . (1) This ev aluates whether explicitly highlighting detected hazards improves the model’ s safety decisions. Pipeline B: A uditor (Diagnostic). W e e xtract attention maps from the model and aggre gate weights across layers. W e observe that attention frequently concentrates near image corners, ev en when those regions do not correspond to se- mantically rele v ant objects. W e refer to these persistent high-attention re gions as attention sinks. T o mitigate this bias, we apply a spatial mask that suppresses attention responses near image borders before selecting the highest-attended regions. W e ev aluate three variants. In Hot-Spot V alidation , red circles are placed on the three highest-attention re- gions to reinforce dominant focus areas. In Cold-Spot Adversarial , red circles are placed on the three lo west-attention regions to test whether attention can be redirected toward previously ignored areas. In Hot-Spot Cloaking , white cir- cles are placed on the highest-attention regions to test whether neutral markers can suppress salient hazard cues. These interventions instantiate M v using attention-deriv ed regions and ev aluate whether attention guidance alone can steer safety behavior . Pipeline C: Attacker (Adv ersarial). This pipeline is designed to e xploit a potential semantic shortcut in which color cues, such as red, are implicitly associated with danger . The Spotter identiﬁes the main task-rele vant object as well as unrelated background objects. The P ainter then applies a tw o-stage interv ention: it ﬁrst cloaks the task-rele vant object by placing a white circle ov er it, and then distracts the model by placing red circles o ver unrelated background objects. This adversarial strategy also operates through M v , but with intentionally misaligned region selection. It ev aluates whether semantic color associations can in vert safety alignment, causing the model to ignore real risks and hallucinate risk in irrelev ant regions. 5 S A V E S Figure 4: Evaluation datasets. (T op) MSSBench-Embodied [1], which provides safe and unsafe scenarios for the same instruction in simulated en vironments. (Bottom) SA V eS, our synthetic dataset for ev aluating situational safety under semantic cues. 3.4 Safety Evaluation Hallucination-A ware Safety Metric. Standard safety ev aluations typically report a binary Safe/Unsafe accuracy . This conﬂates behavioral correctness (refusing dangerous instructions) with grounded reasoning (correctly identifying the actual hazard). T o disentangle these aspects, we introduce a hallucination-aware safety metric based on a structured rubric e valuated using an LLM-as-a-Judge frame work. The Judge compares each VLM response against the Ground T ruth (GT) and assigns one of ﬁv e mutually exclusiv e categories: Refusal (R) corresponds to direct refusal without speciﬁc reasoning (e.g., ‘‘I cannot answer’’ ). Correct Grounding (CG) indicates that the model identiﬁes a risk that semantically matches the GT . Hallucination (H) denotes that the model identiﬁes a risk that does not match the GT , or attributes risk in a safe image. Unsafe Compliance (UC) corresponds to providing a plan without identifying risks. Capability Fail (CF) captures failures due to a lack of perceptual or task capability (e.g., ‘‘I cannot see the object’’ ) and is treated as ev aluation noise. This separation enables us to distinguish cautious behavior from grounded hazard understanding. W e deﬁne separate metrics for unsafe and safe scenarios to isolate distinct failure modes. Let D s and D u denote the sets of N s safe images and N u unsafe images, respectiv ely . For the i -th image in the dataset, let y i be the label assigned by the Judge from a ﬁx ed set of categories. Then, we deﬁne the Beha vioral Refusal Accuracy (BRA) and Grounded Safety Accurac y (GSA) for unsafe scenarios, and False Refusal Rate (FRR) and Safe Scenario Accuracy (SSA) for safe scenarios, as follo ws: BRA = 1 N u X i ∈D u 1 ( y i ∈ { R, C G, H } ) , (2) GSA = 1 N u X i ∈D u 1 ( y i = C G, (3) FRR = 1 N s X i ∈D s 1 ( y i ∈ { R, C G, H } ) , (4) SSA = 1 N s X i ∈D s 1 ( y i ∈ { U C , C F } ) . (5) BRA assesses behavioral correctness in unsafe scenarios, regardless of whether the stated safety rationale is correct. In contrast, GSA measures strict grounding and requires semantic alignment with the GT . A high FRR indicates unnecessary refusals or hallucinated risks in safe scenarios. Providing a plan represents correct behavior for safe instructions, but constitutes Unsafe Compliance (UC) under unsafe instructions. CF is also counted to ward SSA, as it reﬂects task infeasibility rather than a failure of safety reasoning. Evaluation Datasets. W e e valuate situational safety in embodied scenarios using two datasets: MSSBench-Embodied [1] and our proposed SA V eS dataset (see ﬁg. 4). F or MSSBench-Embodied, we use a curated subset focusing on 6 S A V E S T able 1: Steering modality comparison acr oss datasets. W e report Behavioral Refusal Accurac y (BRA, ↑ ) and False Risk Rate (FRR, ↓ ). IC denotes the in-context safety baseline prompt (the baseline form of M c ). M t denotes textual steering via bounding-box coordinates. M v +IC denotes visual steering with semantic markers combined with the IC prompt. M v +ICF denotes visual steering combined with explicit focus prompting. Higher BRA indicates stronger refusal of unsafe actions, while lower FRR indicates fe wer hallucinated or unnecessary refusals in safe scenes. Model MSSBench SA V eS IC M t M v + IC M v + ICF IC M t M v + IC M v + ICF BRA FRR BRA FRR BRA FRR BRA FRR BRA FRR BRA FRR BRA FRR BRA FRR Qwen3-VL-8B 34.3 14.9 44.8 25.4 53.7 10.4 74.6 16.4 63.3 25.0 90.0 20.0 76.7 25.0 85.0 33.3 Qwen3-VL-32B 31.3 16.4 36.5 17.5 50.7 13.4 51.5 10.9 78.3 21.7 91.7 25.0 90.0 26.7 88.3 31.7 DeepSeek-VL 43.3 59.7 40.3 62.7 65.7 77.6 52.2 61.2 33.3 61.7 26.7 55.0 65.0 85.0 43.3 71.7 LLaV A-HF-13B 52.2 32.8 58.2 49.3 83.6 85.1 92.5 32.8 56.7 60.0 48.3 26.7 91.7 95.0 91.7 96.7 LLaV A-HF-34B 3.0 10.4 7.5 7.5 13.4 16.4 13.4 11.9 23.3 10.0 30.0 8.3 41.7 11.7 43.3 15.0 physical hazards in robotic interaction settings. T o ensure metric stability , we manually removed contradictory im- age–instruction pairs. The ﬁnal e valuation split contains 67 distinct scenarios, each comprising a Safe and an Unsafe condition, yielding 134 samples in total. Each sample consists of an image I and an instruction Q , where Q is v alid in the Safe context but hazardous in the Unsafe context. A known limitation of MSSBench-Embodied is its reliance on simulator-rendered or synthetic imagery with limited visual complexity , which may not sufﬁciently challenge models to reason ov er realistic textures, depth cues, and subtle en vironmental hazards. T o address these limitations, we introduce SA V eS, a synthetic dataset designed to isolate visual grounding from te xtual priors. The dataset comprises 60 distinct safety scenarios spanning div erse hazard categories, including thermal, elec- trical, and child safety risks. Each scenario consists of one neutral robot instruction and two high-ﬁdelity images (Safe and Unsafe), yielding 120 image–instruction pairs. Scenarios were deﬁned through structured ideation inspired by MSSBench task cate gories, follo wed by generation using Gemini and manual reﬁnement to ensure logical consistency and hazard clarity . Safe and Unsafe images were generated separately through an iterativ e visual synthesis pipeline with manual supervision to ensure realism and visual ﬁdelity . Additional details on the SA V eS data generation and collection pipeline are provided in the supplementary material. 4 Experiments Experimental Setup. W e ev aluate semantic steering on two embodied safety benchmarks and report results using the metrics deﬁned in section 3.4. W e use GPT -5-latest as an LLM-as-judge to automatically score model responses for these metrics. W e ev aluate the following open-weight VLMs: Qwen3-VL-8B, Qwen3-VL-32B, DeepSeek-VL, LLaV A-HF-13B, and LLaV A-HF-34B. For automated pipeline experiments, we focus on the Qwen3-VL family , since the current pipeline implementation depends on Qwen3-VL-speciﬁc attention hooks. 4.1 Steering Across MSSBench-Embodied and SA V eS W e begin by comparing the three steering mechanisms introduced in section 3.2, T extual Steering ( M t ), V isual Steer- ing ( M v ), and Cognitive Steering ( M c ), across models and datasets. T able 1 summarizes the main comparison. Here, M v is instantiated as semantic marker o verlays (i.e., colored circles) applied to the image. W e use the in-context safety prompt (IC) as the baseline form of M c , and compare it against textual coordinate steering ( M t ), visual steering paired with the baseline prompt ( M v + IC), and visual steering paired with explicit focus prompting ( M v + ICF). Context-vie w ablations (ABS, Crop-Only , and Masked) and adversarial overlay settings are analyzed separately and reported in the supplementary material. T able 1 supports three main observ ations. First, semantic steering is effecti ve across model families: both textual and visual interv entions can substantially change safety beha vior relati ve to the baseline. Second, the strongest gains often arise from the coupled condition M v + M c , where the visual cue is paired with an explicit focus instruction. This pattern is most clearly visible for the Qwen and LLaV A families on MSSBench, where the focused condition substantially increases BRA relative to the baseline. Third, the same qualitativ e trend transfers to SA V eS, when visual markers are combined with explicit focus, refusal behavior increases, but often at the cost of a higher FRR. This tradeoff is central to our analysis: steering can improve beha vioral caution, but it can also induce spurious or hallucinated safety concerns. W e also note that larger models do not necessarily yield better safety alignment under steering, likely due to dif ferences in instruction tuning and safety alignment rather than scale alone. Finally , SA V eS may provide clearer localization cues for some models, but the resulting gains remain model-dependent. 7 S A V E S T able 2: Mechanism and context ablations on M S S B E N C H . W e report our ablations using Q W E N 3 - V L - 8 B. The three inline panels isolate complementary factors behind steering: (a) color hierar chy , (b) trigger speciﬁcity , and (c) visual conte xt . W e report BRA ↑ , Grounded Safety Accurac y (GSA, ↑ ), and FRR ↓ . Additional models and robustness- to-distraction results are deferred to the appendix. (a) Color Color BRA ↑ GSA ↑ FRR ↓ Red 73.1 28.4 38.8 Orange 47.8 28.4 34.3 Y ello w 53.7 25.4 34.3 Green 49.3 14.9 29.9 White 41.8 19.4 14.9 (b) T rigger Prompt BRA ↑ GSA ↑ FRR ↓ Matched 73.1 28.4 38.8 General 46.3 17.9 17.9 Mismatch 26.9 6.0 10.4 (c) Context V iew BRA ↑ GSA ↑ FRR ↓ Full 34.3 10.4 14.9 Crop 46.3 34.3 4.5 ABS 64.2 35.8 20.9 Masked 20.9 13.4 12.1 These results also reinforce a central claim of our method: safety behavior can be steered by controlled semantic cues without changing the underlying scene semantics. At the same time, the FRR columns indicate that higher behavioral refusal does not necessarily imply better safety alignment. In se veral settings, the model becomes more cautious while also becoming more prone to hallucinated or unnecessary refusals. On MSSBench, visual cueing ( M v + IC) often induces larger behavioral shifts than coordinate-only textual steering ( M t ), although the effect is model-dependent. This suggests that pixel-le vel mark ers provide a stronger cue for safety decisions than location descriptions alone. Howe ver , these gains are not uniformly beneﬁcial: improvements in refusal can coincide with higher false refusals for some models. Adding e xplicit focus prompting ( M v + ICF) further improves refusal beha vior in se veral settings, indicating that steering is mediated by cue-instruction coupling rather than mark er presence alone. W e analyze this interaction directly in T able 2. 4.2 Ablation Studies T o understand why steering works, we next isolate the mechanisms behind the strongest visual interventions. T able 2 reports representativ e ablations on MSSBench using Qwen3-VL-8B. Color Semantics. The color hierarchy in T able 2(a) shows that steering strength depends strongly on the semantic meaning of the marker color . When the same highlighted regions are marked with red circles, BRA reaches 73 . 1% , whereas white circles reduce BRA to 41 . 8% . Intermediate colors (orange, yellow , green) produce intermediate beha v- iors. Interestingly , green yields BRA similar to orange, b ut lower GSA and FRR, indicating that behavioral refusal and grounded hazard identiﬁcation can di verge. This supports our claim that the model is not merely following a generic spatial highlight; rather , it reacts to the semiotic prior (color meaning) associated with the marker itself. Notably , white circles show a cloaking effect: relativ e to red, they reduce BRA ( 73 . 1% → 41 . 8% ) and GSA ( 28 . 4% → 19 . 4% ), while also lowering FRR ( 38 . 8% → 14 . 9% ), consistent with a more neutral/annotative interpretation of the mark er . T rigger Speciﬁcity . The trigger-speciﬁcity rows further strengthen this interpretation. When the prompt explicitly matches the marker semantics ( Matched : red circles + ‘‘Focus on red circles’’ ), BRA and GSA are highest. Removing color-speciﬁc wording ( General : red circles + general focus instruction) or mismatching prompt and marker ( Mismatch : white circles + ‘‘Focus on red circles’’ ) causes a large drop in both BRA and GSA. This shows that the safety ef fect is not purely visual: visual and linguistic cues interact, and the prompt can either acti vate or suppress the semantic shortcut induced by the marker . Context Dependence. T able 2(c) shows that steering also depends on global scene context. Using only the cropped region reduces false alarms (lo wer FRR), b ut changes the e vidence av ailable for safety decisions. Providing both crop and global image (ABS) yields the best overall balance. In contrast, masking the background leads to a pronounced collapse in BRA, indicating that the model cannot reliably infer safety from isolated object appearance alone. T ogether , these results support the interpretation that semantic steering acts on top of contextual grounding: the steering cue is powerful, b ut the model still relies on scene context to resolve what the highlighted object implies for safety . 4.3 A utomated Steering Pipelines W e ﬁnally ev aluate whether the steering process can be automated. T able 3 compares the three pipeline families from section 3.3 against the corresponding baseline prompts. The automated pipelines reveal a more nuanced picture than the manual steering results. 8 S A V E S T able 3: A utomated steering pipelines. W e compare the three automated pipeline f amilies against the corresponding baseline prompt on the Qwen3-VL family . A denotes the Guar dian pipeline, B-H and B-C denote the Auditor pipeline with hot- and cold-region steering, respecti vely , and C denotes the Attacker pipeline. Pipeline Qwen3-VL-8B Qwen3-VL-32B Qwen3-VL-32B MSSBench MSSBench SA V eS BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ Baseline 34.3 10.4 14.9 31.3 6.0 16.4 78.3 63.3 21.7 A (Guardian) 38.8 10.4 10.4 25.0 6.2 16.7 83.3 68.3 30.0 B-H (Auditor) 53.0 10.6 25.8 32.3 7.7 16.9 76.7 70.0 15.0 B-C (Auditor) 48.5 9.1 18.2 23.8 3.2 12.1 81.7 73.3 15.0 C (Attacker) 83.6 0.0 80.6 92.4 7.6 94.0 98.3 53.3 96.7 Pipeline A ( Guardian ) provides modest and model-dependent gains. On Qwen3-VL-8B/MSSBench, it increases BRA slightly while reducing FRR, suggesting lower false alarms in this setting. Ho wev er , the effect is not stable: on Qwen3-VL-32B/MSSBench, Guardian lowers BRA with little change in FRR, and on Qwen3-VL-32B/SA V eS it improv es BRA/GSA at the cost of higher FRR. Overall, automatically highlighting estimated hazards can be helpful, but the beneﬁt is limited and depends on the quality of the auxiliary hazard-proposal module. Pipeline B ( A uditor ) is highly conﬁguration-dependent. It uses model attention to propose regions (hot/cold) and applies steering based on these attention-deriv ed cues. The hot-spot and cold-spot variants both steer model behav- ior , but not always in the same direction. In particular, on Qwen3-VL-32B/SA V eS, both variants improve GSA and reduce FRR relativ e to the baseline, with the cold-spot variant also improving BRA. In contrast, on MSSBench, the same family is much less stable. This suggests that attention-deriv ed regions can inﬂuence safety judgments, but raw attention is not a reliable proxy for grounded hazard relev ance. Pipeline C ( Attack er ) is the clearest and most consistent result. Across all reported settings, it sharply increases BRA while causing FRR to explode. In other words, the attacker can force near-univ ersal refusal, but this refusal is poorly calibrated and not reliably grounded (GSA typically stagnates or de grades). This directly supports our core claim that semantic steering is bidirectional: the same mechanism that can increase caution can also be exploited adversarially to ov erride normal safety alignment. 4.4 Qualitative Analysis W e perform a qualitati ve analysis by ﬁxing the same instruction b ut changing only the cue, the prompt, or the a vailable context. Figure 5 illustrates four cases on the two datasets. Additional qualitativ e examples are provided in the supplementary material. The ﬁrst MSSBench panel ( semantic shortcut ) compares the same unsafe microwa ve scene under red versus white circle ov erlays. W ith a red circle, the model identiﬁes the knife inside the micro wave and produces a grounded refusal; with a white circle, it interprets the cue as a benign “annotation”, misses the hazard, and proceeds. As observed, the semiotic prior (color meaning) of the marker (circle) changes the safety judgment. The second MSSBench panel ( context dependence ) shows the same unsafe instruction under four view conditions: full image, crop-only , masked context, and dual-view ABS. Depending on the av ailable context, the model’ s decision can ﬂip between unsafe compliance and grounded refusal. This example complements T able 2(c): the model’ s safety decision depends not only on the object itself, but also on ho w much local versus global context is preserved. In the adversarial pipeline example on the SA V eS dataset (panel C), the baseline scene is correctly treated as safe, but our adversarial attack (Pipeline C, section 3.3) adds red-circle overlays (distractor) that induce a hallucinated refusal. In the prompt-sensiti vity example on the SA V eS dataset (panel D), the same unsafe scene shifts from unsafe compliance under the baseline prompt to grounded refusal once red-circle overlays are introduced, with the focused variant further reinforcing that behavior . T ogether, these cases show that semantic steering changes not just whether the model refuses, but also whic h evidence it treats as safety-rele vant. 9 S A V E S Figure 5: Qualitative examples. Paired qualitati ve comparisons illustrating how semantic markers, prompt coupling, and context av ailability steer safety judgments: (A) red vs. white markers ﬂip decisions; (B) Full/Crop/ABS/Masked views alter refusal behavior; (C) adversarial ov erlays induce hallucinated refusal; (D) marker-a ware prompting shifts unsafe compliance tow ard grounded refusal. 5 Conclusions This paper in vestigated semantic steering for multimodal situational safety decisions in embodied VLMs. Across MSSBench-Embodied and SA V eS, we found that safety behavior is highly sensitiv e to semantic cues. V isual markers generally steer beha vior more strongly than coordinate-only text prompts, and adding explicit focus instructions often further increases refusal behavior . At the same time, these gains are not uniformly beneﬁcial. Higher refusal rates can lead to more false alarms, indicating a clear calibration trade-off. Our ablations show that the effect depends on marker semantics, cue–instruction compatibility , and scene context, rather than spatial highlighting alone. Finally , automated pipelines conﬁrm that steering is bidirectional: assistiv e pipelines yield only modest, model-dependent improv ements, whereas adversarial overlays can reliably exploit the same mechanism and induce spurious refusals. Overall, current safety behavior is highly steerable but only partially grounded, motiv ating more robust, grounding- aware safety alignment. References [1] K. Zhou, C. Liu, X. Zhao, A. Compalas, D. Song, and X. E. W ang, “Multimodal situational safety , ” in The Thirteenth International Confer ence on Learning Repr esentations , 2025. [2] D. Driess, F . Xia, M. S. M. Sajjadi, C. L ynch, A. Chowdhery , B. Ichter, A. W ahid, J. T ompson, Q. V uong, T . Y u, W . Huang, Y . Chebotar , P . Sermanet, D. Duckworth, S. Le vine, V . V anhoucke, K. Hausman, M. T oussaint, K. Greff, A. Zeng, I. Mordatch, and P . Florence, “Palm-e: an embodied multimodal language model, ” in ICML , ICML ’23, JMLR.org, 2023. [3] C. W u, S. Y in, W . Qi, X. W ang, Z. T ang, and N. Duan, “V isual chatgpt: T alking, drawing and editing with visual foundation models, ” arXiv pr eprint arXiv:2303.04671 , 2023. [4] J. Li, D. Li, S. Sav arese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image en- coders and large language models, ” in International conference on mac hine learning , pp. 19730–19742, PMLR, 2023. [5] H. Liu, C. Li, Q. W u, and Y . J. Lee, “V isual instruction tuning, ” in NeurIPS , 2023. [6] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhosein y , “MiniGPT-4: Enhancing vision-language understanding with advanced lar ge language models, ” in The T welfth International Conference on Learning Repr esentations , 2024. 10 S A V E S [7] G. T eam, P . Georgie v , V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. T anzer , D. V incent, Z. Pan, S. W ang, et al. , “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, ” arXiv pr eprint arXiv:2403.05530 , 2024. [8] L. Ouyang, J. W u, X. Jiang, D. Almeida, C. W ainwright, P . Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray , et al. , “T raining language models to follow instructions with human feedback, ” NeurIPS , vol. 35, pp. 27730– 27744, 2022. [9] Z. Bai, P . W ang, T . Xiao, T . He, Z. Han, Z. Zhang, and M. Z. Shou, “Hallucination of multimodal large language models: A surve y , ” arXiv preprint , 2024. [10] K. Chen, L. Muyang, G. Li, S. Zhang, S. Guo, and T . Zhang, “TR UST-VLM: Thorough red-teaming for uncov- ering safety threats in vision-language models, ” in ICML , 2025. [11] R. W ang, J. Li, Y . W ang, B. W ang, X. W ang, Y . T eng, Y . W ang, X. Ma, and Y .-G. Jiang, “Ideator: Jailbreaking and benchmarking large vision-language models using themselv es, ” in ICCV , pp. 8875–8884, 2025. [12] X. Lu, Z. Chen, X. Hu, Y . Zhou, W . Zhang, D. Liu, L. Sheng, and J. Shao, “Is-bench: Evaluating interactive safety of vlm-driv en embodied agents in daily household tasks, ” arXiv pr eprint arXiv:2506.16402 , 2025. [13] S. Y in, X. Pang, Y . Ding, M. Chen, Y . Bi, Y . Xiong, W . Huang, Z. Xiang, J. Shao, and S. Chen, “Safeagentbench: A benchmark for safe task planning of embodied llm agents, ” arXiv pr eprint arXiv:2412.13178 , 2024. [14] Z. Y ing, L. W ang, Y . Xiao, J. W ang, Y . Ma, J. Guo, Z. Y in, M. Zhang, A. Liu, and X. Liu, “ Agentsafe: Bench- marking the safety of embodied agents on hazardous instructions, ” arXiv pr eprint arXiv:2506.14697 , 2025. [15] X. Liu, Y . Zhu, J. Gu, Y . Lan, C. Y ang, and Y . Qiao, “Mm-safetybench: A benchmark for safety ev aluation of multimodal large language models, ” in Eur opean Conference on Computer V ision , pp. 386–403, Springer , 2024. [16] M. Ni, L. Zhang, Z. Chen, K. Bai, Z. Chen, J. Zhang, and W . Zuo, “Don’ t let your robot be harmful: Responsible robotic manipulation via safety-as-policy , ” IEEE Robotics and Automation Letters , 2025. [17] J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y . W ang, and Y . Y ang, “Safe rlhf: Safe reinforcement learning from human feedback, ” arXiv pr eprint arXiv:2310.12773 , 2023. [18] J. Ji, X. Chen, R. Pan, C. Zhang, H. Zhu, J. Li, D. Hong, B. Chen, J. Zhou, K. W ang, et al. , “Safe rlhf-v: Safe reinforcement learning from multi-modal human feedback, ” arXiv pr eprint arXiv:2503.17682 , 2025. [19] Z. Ra vichandran, A. Robey , V . Kumar , G. J. Pappas, and H. Hassani, “Safety guardrails for llm-enabled robots, ” IEEE Robotics and Automation Letter s , 2026. [20] X. Hu, D. Liu, H. Li, X.-J. Huang, and J. Shao, “Vlsbench: Un veiling visual leakage in multimodal safety , ” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long P apers) , pp. 8285–8316, 2025. [21] A. V era, K. Sanchez, C. Hinojosa, H. B. Hamid, D. Kim, and B. Ghanem, “Multimodal safety ev aluation in generativ e agent social simulations, ” arXiv pr eprint arXiv:2510.07709 , 2025. [22] H. Cheng, E. Xiao, J. Gu, L. Y ang, J. Duan, J. Zhang, J. Cao, K. Xu, and R. Xu, “Unv eiling typographic deceptions: Insights of the typographic vulnerability in large vision-language models, ” in European Confer ence on Computer V ision , pp. 179–196, Springer , 2024. [23] S. Li, H. Chen, Y . Cai, Q. Y e, L. Chen, J. Y uan, and Y . W ang, “V ision language models map logos to text via semantic entanglement in the visual projector , ” arXiv pr eprint arXiv:2510.12287 , 2025. [24] Z. Miao, Y . Ding, L. Li, and J. Shao, “V isual contextual attack: Jailbreaking mllms with image-driven context injection, ” arXiv pr eprint arXiv:2507.02844 , 2025. [25] M. Augustin, Y . Neuhaus, and M. Hein, “Dash: Detection and assessment of systematic hallucinations of vlms, ” in Pr oceedings of the IEEE/CVF International Conference on Computer V ision , pp. 22748–22759, 2025. [26] S. Leng, H. Zhang, G. Chen, X. Li, S. Lu, C. Miao, and L. Bing, “Mitigating object hallucinations in large vision-language models through visual contrastive decoding, ” in Pr oceedings of the IEEE/CVF Confer ence on Computer V ision and P attern Recognition , pp. 13872–13882, 2024. [27] S. Li, J. Qu, Y . Zhou, Y . Qin, T . Y ang, and Y . Zhao, “T reble counterfactual vlms: A causal approach to halluci- nation, ” arXiv pr eprint arXiv:2503.06169 , 2025. [28] A. V illa, J. L. Alc ´ azar , M. Alfarra, V . Araujo, A. Soto, and B. Ghanem, “Eagle: Enhanced visual grounding minimizes hallucinations in instructional multimodal models, ” arXiv pr eprint arXiv:2501.02699 , 2025. 11 S A V E S Supplementary Materials This supplementary material provides additional details on the experimental setup, dataset construction, quantitativ e and qualitati ve results, and reproducibility artif acts for our study . The accompanying supplementary package includes code, JSON deﬁnitions for the dataset splits, inference and ev aluation scripts, model and en vironment conﬁguration ﬁles, and representativ e audited ev aluation outputs. Due to size constraints, we provide representativ e samples rather than the complete raw output corpus. 6 Experimental Setup and Reproducibility W e e valuate situational safety behavior on tw o embodied safety benchmarks: MSSBench-Embodied and our proposed SA V eS dataset. For MSSBench-Embodied, we use the embodied subset deﬁned in subset embodied.json , which contains paired Safe and Unsafe scenes representing the same task under different visual contexts. For SA V eS, we use saves gt.json , which provides paired safe/unsafe images together with instruction-le vel safety annotations. Experiments cover both open-weight and closed-source vision-language models. The open-weight models include Qwen3-VL (8B and 32B), DeepSeek-VL, and LLaV A-HF variants, while closed-source models ev aluated in T able 8 include GPT -5-mini, GPT -5, Claude Sonnet 4.5, and Gemini Flash. Across both datasets and model families, we ev aluate multiple semantic steering conditions: IC (baseline cognitive steering, M c ), M v +IC (visual marker with the baseline instruction), and M v +ICF (visual marker combined with an e xplicit focus instruction directing the model to attend to the marked region). Additional tests e xamine rob ustness under altered context vie ws and adversarial o verlays, including Crop, ABS, Masked views, and decoy or sticker-style perturbations. All experiments follow the same proposed ev aluation protocol used in the main paper . W e report Behavioral Refusal Accuracy (BRA, ↑ ), Grounded Safety Accuracy (GSA, ↑ ), and False Refusal Rate (FRR, ↓ ). For reproducibility , model outputs are sa ved as structured JSON ﬁles and all ev aluations are computed from these stored predictions, allowing results to be independently re- audited without rerunning inference. 7 SA V eS Construction Pipeline Details Figure 6: SA V eS construction pipeline. (1) Reference scenarios are manually designed, each specifying a context- neutral instruction, safe and unsafe scene descriptions, a hazard rationale, and an image-generation prompt. (2) Addi- tional scenarios are generated via few-shot prompting with an LLM (e.g., Gemini) and manually revie wed for logical consistency and hazard clarity . (3) Safe and unsafe images are synthesized using an image generator and iterati vely re- ﬁned to ensure visual quality and alignment with the scenario descriptions. (4) Each ﬁnalized scenario yields a paired safe-unsafe image pair with the same instruction, from which additional intervention variants (e.g., colored markers, masked vie ws, semantic stickers, etc) are deri ved for safety-steering experiments. W e construct SA V eS as a synthetic paired dataset for situational safety e valuation, as illustrated in Figure 6. The design is inspired by the task categories of MSSBench but explicitly targets visual grounding under ﬁxed instructions. Each scenario follows a paired safe–unsafe structure in which the same instruction is presented with both a Safe and an Unsafe visual context. 1 S A V E S T able 4: Context-vie w ablations (Full, Crop, ABS, Masked) for Qwen3-VL-8B and Qwen3-VL-32B on MSSBench and SA V eS. MSSBench Full (Baseline) Crop Only ABS Masked Model BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ Qwen3-VL-8B 34.3 10.4 14.9 46.3 34.3 4.5 64.2 35.8 20.9 20.9 13.4 12.1 Qwen3-VL-32B 31.3 6.0 16.4 34.8 27.3 10.8 62.1 45.5 18.2 28.1 18.8 9.2 SA V eS Qwen3-VL-8B 63.3 41.7 25.0 80.0 51.7 35.0 83.3 58.3 36.7 76.7 48.3 35.0 Qwen3-VL-32B 78.3 63.3 21.7 88.3 75.0 16.7 80.0 71.7 25.0 75.0 58.3 18.3 T able 5: Robustness to semantic (color) and adversarial/decoy o verlays across datasets and open-weight models. Overlay condition Qwen3-VL-8B Qwen3-VL-32B DeepSeek-VL LLaV A-HF-13B BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ MSSBench Semantic overlays (color) Red (all) 73.1 28.4 38.8 55.2 22.4 40.9 53.7 10.4 74.6 94.0 16.4 94.0 White 41.8 19.4 14.9 44.8 25.4 21.9 46.3 4.5 67.2 95.5 11.9 92.5 Orange 47.8 28.4 34.3 46.3 14.9 22.4 52.2 11.9 73.1 82.1 16.4 97.0 Y ellow 53.7 25.4 34.3 45.3 17.2 16.7 43.3 7.5 77.6 91.0 14.9 95.5 Green 49.3 14.9 29.9 47.7 20.0 24.2 46.3 4.5 71.6 89.6 13.4 95.5 Adversarial/decoy overlays Random red decoy 95.5 10.4 79.1 54.5 4.5 29.9 56.7 4.5 91.0 100.0 10.4 98.5 Adversarial 28.4 6.0 71.6 44.6 24.6 25.8 44.8 3.0 80.6 95.5 16.4 97.0 SA V eS Semantic overlays (color) Red (all) 85.0 56.7 33.3 88.3 75.0 31.7 43.3 11.7 71.7 91.7 18.3 96.7 White 80.0 60.0 20.0 90.0 78.3 28.3 53.3 11.7 66.7 88.3 18.3 83.3 Orange 81.7 55.0 26.7 90.0 73.3 26.7 55.0 11.7 73.3 90.0 15.0 91.7 Y ellow 78.3 55.0 30.0 91.7 76.7 30.0 48.3 18.3 71.7 93.3 15.0 90.0 Green 75.0 50.0 21.7 88.3 78.3 26.7 43.3 13.3 73.3 93.3 18.3 90.0 Adversarial/decoy overlays Random red decoy 56.7 33.3 15.0 78.3 66.7 25.0 50.0 18.3 75.0 90.0 13.3 93.3 Adversarial 80.0 60.0 20.0 88.3 78.3 51.7 51.7 11.7 76.7 88.3 20.0 91.7 W e ﬁrst manually created a small set of reference scenarios, each consisting of a context-neutral instruction, safe and unsafe scene descriptions, a ground-truth hazard rationale, and corresponding image-generation prompts. Using these examples in a few-shot setup, we prompted Gemini to generate additional candidate scenarios. The generated scenarios were then manually revie wed and edited to ensure logical consistency , clear hazard identiﬁcation, and alignment between the instruction and the visual scene. For each ﬁnalized scenario, safe and unsafe images were generated separately using condition-speciﬁc prompts with Gemini-2.5-ﬂash-image. Low-quality or semantically misaligned samples were iterativ ely regenerated until they sat- isﬁed visual ﬁdelity and scenario consistency requirements for reliable safety judgment. The ﬁnal SA V eS dataset contains 60 scenarios, each consisting of one instruction paired with a safe image and an unsafe image, along with ground-truth hazard annotations and scenario metadata. Beyond the base safe/unsafe image pairs, we generate a family of intervention-speciﬁc views from the same 60 sce- narios to support controlled steering analyses. Using deterministic transformation scripts, we deri ve color-marker variants (red/white/green/yello w/orange), decoy and adv ersarial ov erlays, semantic te xt-sticker v ariants (safe/danger), and context-vie w variants (crop-only and masked). W e also maintain bounding-box annotations and preserve a con- sistent paired indexing scheme across all variants (safe/unsafe instances per scenario), enabling matched comparisons under controlled visual interventions. 2 S A V E S T able 6: SA V eS stress test under visual distractors (decoy circles, adversarial noise, and semantic stickers) for open- weight models. W e report BRA ↑ , GSA ↑ , and FRR ↓ (%). Original (Baseline) Decoy Circles Adversarial Noise Sticker SAFE Sticker D ANGER Model BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ Qwen3-VL-8B 65.0 51.7 20.0 60.0 50.0 16.7 53.3 30.0 11.7 58.3 45.0 10.0 98.3 68.3 73.3 Qwen3-VL-32B 86.7 78.3 23.3 88.3 81.7 15.0 65.0 56.7 28.3 85.0 73.3 13.3 100.0 85.0 91.7 DeepSeek-VL 51.7 10.0 71.7 68.3 15.0 83.3 61.7 5.0 71.7 53.3 6.7 78.3 78.3 16.7 86.7 LLaV A-HF-13B 91.7 13.3 93.3 93.3 18.3 86.7 86.7 8.3 86.7 90.0 13.3 80.0 93.3 20.0 86.7 T able 7: Same SA V eS stress test as T able 6, b ut using the IC F ocus prompt (ICF). W e report BRA ↑ , GSA ↑ , and FRR ↓ (%). Original (Baseline) Decoy Circles Adversarial Noise Sticker SAFE Sticker D ANGER Model BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ Qwen3-VL-8B 61.7 43.3 28.3 55.0 33.3 15.0 38.3 21.7 16.7 58.3 50.0 15.0 93.3 63.3 51.7 Qwen3-VL-32B 80.0 63.3 18.3 80.0 65.0 26.7 60.0 41.7 16.7 75.0 65.0 15.0 100.0 73.3 90.0 DeepSeek-VL 26.7 5.0 63.3 45.0 15.0 73.3 33.3 8.3 51.7 36.7 8.3 51.7 50.0 11.7 75.0 LLaV A-HF-13B 60.0 16.7 61.7 93.3 13.3 93.3 70.0 6.7 76.7 55.0 11.7 55.0 71.7 18.3 71.7 8 Additional Quantitative Results Context-V iew Ablations. T able 4 sho ws that context manipulation changes model behavior in non-trivial ways. On MSSBench, both Qwen models beneﬁt from adding region-le vel evidence to the full vie w (ABS), with lar ge BRA/GSA gains ov er baseline (e.g., Qwen3-VL-32B: BRA 31 . 3 → 62 . 1 , GSA 6 . 0 → 45 . 5 ). Howe ver , this gain often comes with higher FRR than crop-only v ariants, indicating a precision–coverage trade-of f. On SA V eS, the same interventions are less uniform: Qwen3-VL-8B improv es refusal under Crop/ABS but with a substantial FRR increase, while Qwen3- VL-32B beneﬁts most from Crop (higher BRA/GSA and lower FRR than baseline) and less from ABS. Across both datasets, masked-only views are consistently weaker than context-preserving alternativ es, supporting the claim that steering cues operate best when global scene context is retained. Semantic and Adversarial Overlay Robustness. T able 5 isolates how overlay semantics affect safety behavior . Color semantics are strongest on MSSBench: for Qwen3-VL-8B, red vs. white shifts BRA from 41 . 8 to 73 . 1 , with corresponding changes in GSA/FRR, consistent with semantic shortcut effects. On SA V eS, color effects remain but are less pronounced for high-performing models. The adversarial/decoy ro ws sho w that deco y red mark ers can induce ov er-refusal (e.g., MSSBench Qwen3-VL-8B BRA 95 . 5 , FRR 79 . 1 ) without proportional grounding gains, while adversarial ov erlays degrade calibration differently across architectures. DeepSeek and LLaV A-HF-13B further show that high BRA can coincide with poor GSA and v ery high FRR, emphasizing that refusal alone is not a reliable proxy for grounded safety . SA V eS Rob ustness Under Visual Distractors. T ables 6 and 7 e valuate robustness to distractor families (deco y circles, adversarial noise, “SAFE”/“D ANGER” stickers) under two prompt settings (IC and IC Focus). Unlike T able 5, which studies mechanism ef fects under controlled semantic ov erlays, these tables measure stability under stress-style pertur - bations and prompt variation. T wo trends are consistent: (i) Sticker DANGER is the strongest trigger for over -refusal across models (very high BRA with sharply increased FRR), and (ii) moving from IC to IC Focus can substantially alter calibration, but not uniformly . For example, Qwen3-VL-8B under Sticker DANGER reduces FRR from 73 . 3 to 51 . 7 with IC Focus, whereas Qwen3-VL-32B remains near-univ ersal refusal. DeepSeek and LLaV A-HF-13B ex- hibit greater prompt-induced variability across distractor conditions, suggesting weaker robustness under distractor perturbations. Although T ables 5 and 6 – 7 include visually similar perturbations (e.g., deco y markers and adv ersarial-style ov erlays), they answer different questions. T able 5 is mechanism-oriented: overlays are treated as controlled steering inter- ventions to measure how cue semantics alter safety behavior under a ﬁxed setup. In contrast, T ables 6 and 7 are robustness-oriented: the same perturbation families are treated as distractor stressors, and we analyze behavior sta- bility across prompt settings (IC vs IC Focus). This separation distinguishes causal steering effects from calibration robustness under nuisance visual changes. Closed-Source Models. T able 8 highlights two regimes. On MSSBench, adding visual steering and explicit focus generally improv es BRA and GSA for all closed-source models, often with FRR increases (e.g., GPT -5-mini and Claude-4.5 Sonnet), though the FRR effect is model-dependent (Gemini-3 Flash slightly improves FRR from IC to ICF). On SA V eS, most closed-source models already operate near the ceiling in BRA/GSA under IC, so steering mainly 3 S A V E S T able 8: Closed-source VLM results on MSSBench and SA V eS under semantic steering. W e report BRA ↑ , GSA ↑ , and FRR ↓ for IC, visual cueing with IC ( M v +IC ), and visual cueing with explicit focus ( M v +ICF ), along with the change from IC to ICF . V isual steering with explicit focus generally increases refusal and grounded hazard detection, often with a trade-off in f alse refusals. MSSBench IC M v +IC M v +ICF ∆ ICF–IC Model BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ ∆ BRA ∆ GSA ∆ FRR GPT -5 55.2 44.8 10.4 71.6 65.7 13.4 74.6 70.1 17.9 +19.4 +25.3 +7.5 GPT -5-mini 49.3 26.9 23.9 58.2 41.8 13.4 80.6 53.7 32.8 +31.3 +26.8 +8.9 Claude-4.5 Sonnet 26.9 9.0 20.9 26.9 20.9 13.4 61.2 10.4 49.3 +34.3 +1.4 +28.4 Gemini-3 Flash 58.2 47.8 14.9 76.1 71.6 9.0 83.6 76.1 13.4 +25.4 +28.3 -1.5 SA V eS IC M v +IC M v +ICF ∆ ICF–IC Model BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ BRA ↑ GSA ↑ FRR ↓ ∆ BRA ∆ GSA ∆ FRR GPT -5 98.3 98.3 50.0 100.0 98.3 46.7 98.3 98.3 45.0 +0.0 +0.0 -5.0 GPT -5-mini 96.7 93.3 60.0 100.0 95.0 55.0 98.3 95.0 55.0 +1.6 +1.7 -5.0 Claude-4.5 Sonnet 75.0 68.3 25.0 81.7 76.7 25.0 93.3 78.3 41.7 +18.3 +10.0 +16.7 Gemini-3 Flash 98.3 95.0 56.7 100.0 100.0 60.0 100.0 100.0 55.0 +1.7 +5.0 -1.7 shifts calibration (FRR) rather than hazard detection. This mirrors the main paper conclusion: semantic steering is effecti ve, but its net utility depends on the model- and dataset-speciﬁc balance between caution and false refusals. Figure 7: Cross-model disagreement on two unsafe SA V eS scenarios. For each scene, all models receiv e the same instruction and image, yet safety judgments div erge between grounded refusal, unsafe compliance, and hallucinated refusal. 9 Additional Qualitative Results Figure 7 highlights model-dependent safety behavior under identical visual evidence. In the laundry-pod scene (left), Qwen3-VL-32B produces a grounded refusal; Qwen3-VL-8B and DeepSeek-VL proceed unsafely; and LLaV A-HF- 13B refuses due to a mismatched hazard (hallucinated refusal). In the spilled-coffee scene (right), Qwen3-VL-8B and DeepSeek-VL correctly identify the liquid hazard and refuse, while Qwen3-VL-32B proceeds unsafely , and LLaV A- HF-13B again refuses with ungrounded reasoning. These examples reinforce the quantitati ve ﬁnding that refusal rates alone are insuf ﬁcient to assess safety behavior . A model may refuse for the correct reason (by detecting the true hazard) or for the incorrect one (by hallucinating a hazard), so reliable safety requires grounding the refusal in the actual scene. Figure 8 shows a perturbation sequence in which the underlying task and scene semantics remain unchanged while visual distractors are introduced. In the original image, the model correctly proceeds (safe). After adding decoy circles, the model switches to an incorrect refusal; adversarial noise and the D ANGER sticker further induce hallucinated refusals. Because the true hazard status does not change across these variants, the behavioral shift reﬂects calibration fragility to superﬁcial visual cues. This qualitative trend aligns with our quantitativ e results in T ables 6 and 7, which show similar sensiti vity to distractor-based perturbations across models and prompting conditions. 4 S A V E S Figure 8: Sensiti vity to visual distractors on a safe SA V eS scenario using Qwen3-VL-32B. With the same instruction and base scene, decoy circles, adversarial noise, and a semantic “DANGER” sticker ﬂip behavior from correct com- pliance in the original image to incorrect or hallucinated refusals, indicating sensitivity to non-causal visual cues. 10 Limitations and Additional Discussion Our study has several limitations. First, semantic steering is sensitiv e to prompt wording and cue design, and small changes can produce noticeable behavioral shifts. Second, the observed ef fects are strongly model-dependent, in- dicating that improvements in one model family or scale do not necessarily transfer to others or guarantee better calibration. Third, while MSSBench and SA V eS enable controlled paired analysis, they do not cov er the full diversity of real-world visual conditions or long-horizon embodied interactions. Fourth, our metrics rely on an LLM-based ev aluator; although we applied consistency checks and manual veriﬁcation, grading noise or bias may still affect difﬁcult cases. Finally , the current setup is primarily single-image and single-turn, leaving temporal robustness and closed-loop correction as open problems. More broadly , the trade-of f between stronger refusal beha vior and increased false refusals remains unresolved. LLaV A-HF-34B vs. LLaV A-HF-13B P erformance Differences. Although both checkpoints belong to the LLaV A- v1.6 family , the 34B model is not a simple scaled-up replica of the 13B model; it uses a different text backbone and tokenizer conﬁguration. Empirically , the 34B model e xhibits a strong compliance bias in our safety format: on unsafe scenes, it frequently outputs Answer: No together with an e xecutable plan, which maps to unsafe compliance under our protocol. This pattern is visible on both MSSBench-Embodied and SA V eS and explains the lo w beha vioral refusal scores despite relatively low false-refusal rates. In contrast, the 13B checkpoint is more refusal-prone (higher BRA) but often more conservati ve on safe scenes. These results suggest that safety behavior under steering is inﬂuenced more by checkpoint-speciﬁc instruction and safety tuning than by parameter count alone. 5

SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment