Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models
Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks. The code is publicly available in our \href{https://github.com/kaiyuanCui/UltraBreak}{GitHub repository}.
💡 Research Summary
The paper addresses the growing security concern of image‑based jailbreak attacks on vision‑language models (VLMs). Existing gradient‑based jailbreaks are highly model‑specific: they overfit to a single white‑box surrogate and fail to generalize across different victim models or new malicious prompts. To overcome this limitation, the authors propose UltraBreak, a universal and transferable jailbreak framework that is effective both across attack targets (universality) and across victim models (transferability).

UltraBreak combines two complementary ideas. First, it constrains the visual optimization space by applying random transformations (rotation, scaling, patch placement) and a total‑variation (TV) regularizer. These transformations force the optimizer to discover robust, transformation‑invariant patterns (e.g., letters or shapes) that survive model‑specific preprocessing and architectural variations.

Second, it replaces the conventional token‑level cross‑entropy loss with a semantic loss defined in the LLM’s embedding space. Both the model’s output logits and the target token sequence are projected into a shared embedding space, and the cosine distance between the expected output embedding and a weighted, noise‑perturbed target embedding is minimized. An attention mechanism dynamically weights future target embeddings, allowing each prediction step to focus on the most semantically relevant continuation, which smooths the loss landscape and stabilizes training. Formally, the objective is
x* = arg minₓ
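The formula above is truncated. Based on the description in the summary, the objective plausibly takes a form like the following; the symbols and exact composition are assumptions for illustration, not taken from the paper:

```latex
x^{*} \;=\; \arg\min_{x}\;
\mathbb{E}_{t \sim \mathcal{T}}
\Big[\, \mathcal{L}_{\mathrm{sem}}\big(F_{\theta}(t(x),\, p),\; y\big) \,\Big]
\;+\; \lambda \,\mathrm{TV}(x)
```

Here $\mathcal{T}$ would denote the distribution over random transformations (rotation, scaling, patch placement), $F_{\theta}$ the white‑box surrogate VLM, $p$ the harmful prompt, $y$ the target response, $\mathcal{L}_{\mathrm{sem}}$ the embedding‑space semantic loss, and $\lambda$ the TV‑regularization weight.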
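The transformation step can be sketched as follows in NumPy. This is a minimal illustration of expectation‑over‑transformation sampling; the restriction to 90° rotations, the scale range, the canvas size, and the sampling distributions are all assumptions, not details from the paper:

```python
import numpy as np

def random_transform(patch, canvas_hw=(128, 128), rng=None):
    """Sample one random transformation of an adversarial patch:
    rotation (restricted here to 90-degree multiples), nearest-neighbour
    scaling, and random placement on a larger canvas. During the attack,
    the loss would be averaged over many such samples so the optimized
    pattern becomes transformation-invariant."""
    rng = rng if rng is not None else np.random.default_rng()
    # Random rotation by k * 90 degrees over the spatial axes.
    x = np.rot90(patch, k=int(rng.integers(0, 4)), axes=(1, 2))
    # Random integer scale factor (1 or 2), nearest-neighbour upscaling.
    s = int(rng.integers(1, 3))
    x = x.repeat(s, axis=1).repeat(s, axis=2)
    # Random placement: paste the patch at a random offset on a blank canvas.
    c, h, w = x.shape
    H, W = canvas_hw
    canvas = np.zeros((c, H, W), dtype=x.dtype)
    top = int(rng.integers(0, H - h + 1))
    left = int(rng.integers(0, W - w + 1))
    canvas[:, top:top + h, left:left + w] = x
    return canvas
```

In an attack loop, the gradient of the loss would be averaged over several calls to `random_transform`, which is what forces the optimizer toward patterns that survive preprocessing differences between victim models.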
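The semantic loss described above can likewise be sketched in NumPy. The overall structure (expected output embedding via softmax‑weighted token embeddings, noise‑perturbed targets, attention over future target positions, cosine‑distance objective) follows the summary, but the Gaussian noise model, the dot‑product attention form, and the scaling are illustrative assumptions:

```python
import numpy as np

def _softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def semantic_loss(logits, target_emb, embed_matrix, noise_std=0.01, rng=None):
    """Embedding-space semantic loss (illustrative form).
    logits: (T, V) model output logits over the vocabulary.
    target_emb: (T, D) embeddings of the target token sequence.
    embed_matrix: (V, D) the LLM's token-embedding table."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # Expected output embedding: probability-weighted mix of token embeddings.
    probs = _softmax(logits)                       # (T, V)
    expected = probs @ embed_matrix                # (T, D)
    # Noise-perturbed target embeddings (Gaussian noise is an assumption).
    noisy = target_emb + noise_std * rng.standard_normal(target_emb.shape)
    # Attention over current/future target positions: each prediction step t
    # may attend only to target positions >= t (upper-triangular mask).
    d = target_emb.shape[-1]
    scores = expected @ noisy.T / np.sqrt(d)       # (T, T)
    mask = np.triu(np.ones_like(scores, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    attn = _softmax(scores)                        # rows sum to 1
    weighted = attn @ noisy                        # (T, D) weighted targets
    # Minimize mean cosine distance between expected and weighted targets.
    num = (expected * weighted).sum(axis=-1)
    den = (np.linalg.norm(expected, axis=-1)
           * np.linalg.norm(weighted, axis=-1) + 1e-8)
    return float((1.0 - num / den).mean())

def total_variation(x):
    """Anisotropic TV regularizer for an image of shape (C, H, W);
    penalizes high-frequency, noise-like perturbations."""
    dh = np.abs(x[:, 1:, :] - x[:, :-1, :]).mean()
    dw = np.abs(x[:, :, 1:] - x[:, :, :-1]).mean()
    return float(dh + dw)
```

The full attack objective would then combine the two terms, e.g. `semantic_loss(...) + lam * total_variation(x)`, averaged over random transformations of `x`. Relative to token‑level cross‑entropy, the softmax‑weighted attention over future targets gives partial credit to semantically close continuations, which is the claimed source of the smoother loss landscape.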