Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints
Despite the rapid progress of Large Vision-Language Models (LVLMs), the integration of visual modalities introduces new safety vulnerabilities that adversaries can exploit to elicit biased or malicious outputs. In this paper, we demonstrate an underexplored vulnerability via semantic slot filling, where LVLMs complete missing slot values with unsafe content even when the slot types are deliberately crafted to appear benign. Building on this finding, we propose StructAttack, a simple yet effective single-query jailbreak framework under black-box settings. StructAttack decomposes a harmful query into a central topic and a set of benign-looking slot types, then embeds them as structured visual prompts (e.g., mind maps, tables, or sunburst diagrams) with small random perturbations. Paired with a completion-guided instruction, LVLMs automatically recompose the concealed semantics and generate unsafe outputs without triggering safety mechanisms. Although each slot appears benign in isolation (local benignness), StructAttack exploits LVLMs’ reasoning to assemble these slots into coherent harmful semantics. Extensive experiments on multiple models and benchmarks show the efficacy of our proposed StructAttack.
💡 Research Summary
The paper introduces StructAttack, a novel black‑box jailbreak technique that exploits a previously under‑examined safety weakness in large vision‑language models (LVLMs): semantic slot filling (SSF). The authors observe that many task‑oriented queries can be decomposed into a set of slot‑type/value pairs while preserving the original intent. In natural language understanding, models can perform zero‑shot SSF, assigning predefined slot labels to input tokens. The authors invert this process: they craft “malicious slots” that appear benign (e.g., “Making Process”, “Raw Materials”) and ask the model to fill these slots, thereby coaxing it to generate harmful content without triggering safety filters.
StructAttack consists of two main components:
-
Semantic Slot Decomposition (SSD) – A “Decomposer” LLM (implemented with Deepseek‑Chat) receives a harmful query and outputs a central topic together with a set of malicious slot types that together reconstruct the original malicious intent. A second “Distractor” LLM, using a different role‑playing prompt, generates additional harmless slots related to the topic. These distractor slots dilute the malicious density and further obscure the intent.
-
Visual‑Structural Injection (VSI) – The malicious and distractor slots are embedded into a structured visual prompt (mind map, table, or sunburst diagram) via a rendering function ψ. Random perturbations (position jitter, rotation, etc.) are applied to produce the final image I′. The visual prompt is paired with a completion‑guided instruction that tells the LVLM to fill the empty slots.
When an LVLM processes the combined image‑text input, its multimodal attention mechanisms infer the relationship between the topic and the slots and automatically generate slot values. Because each slot type looks innocuous in isolation, the model’s safety mechanisms—typically focused on detecting overtly harmful keywords—are bypassed. The model ends up providing detailed instructions for bomb construction, chemical recipes, or other prohibited content, effectively re‑assembling the malicious blueprint hidden in the visual structure.
The authors evaluate StructAttack on several state‑of‑the‑art LVLMs, including commercial systems (GPT‑4o, Gemini‑2.5‑Flash, Qwen3‑VL‑Flash) and open‑source models (LLaVA‑1.5‑7B, MiniGPT‑4). Results show an average attack success rate (ASR) of 80 % on open‑source models and ≈60 % on commercial models, outperforming prior visual‑perturbation attacks (e.g., FigStep, HADES, SI‑Attack) which require iterative optimization or white‑box access. An ablation study demonstrates that adding distractor slots reduces ASR, confirming that the models perform slot‑level safety checks but fail to capture the global malicious semantics.
Key contributions are:
- Identification of a semantic slot‑filling vulnerability that allows locally benign visual prompts to trigger dangerous completions in LVLMs.
- A simple, optimization‑free, single‑shot jailbreak method that works in black‑box settings.
- Extensive empirical validation across multiple benchmarks and LVLM architectures, establishing the practical potency of the attack.
The paper concludes by suggesting future defense directions: (1) meta‑checks that evaluate semantic coherence across slots, (2) multimodal safety filters that detect suspicious structural patterns in visual prompts, and (3) context‑aware models that can differentiate malicious from harmless slots. By highlighting the “Lego‑like” assembly of harmful semantics from benign blocks, the work opens a new research avenue for securing multimodal AI systems.
Comments & Academic Discussion
Loading comments...
Leave a Comment