Learning to Detect Unseen Jailbreak Attacks in Large Vision-Language Models


Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks. To mitigate these risks, detection methods are essential, yet existing approaches face two major challenges: generalization and accuracy. While learning-based methods trained on specific attacks fail to generalize to unseen attacks, learning-free methods based on hand-crafted heuristics suffer from limited accuracy and reduced efficiency. To address these limitations, we propose Learning to Detect (LoD), a learnable framework that eliminates the need for any attack data or hand-crafted heuristics. LoD operates by first extracting layer-wise safety representations directly from the model’s internal activations using Multi-modal Safety Concept Activation Vector (MSCAV) classifiers, and then converting the high-dimensional representations into a one-dimensional anomaly score for detection via a Safety Pattern Auto-Encoder. Extensive experiments demonstrate that LoD consistently achieves state-of-the-art detection performance (AUROC) across diverse unseen jailbreak attacks on multiple LVLMs, while also significantly improving efficiency. Code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.


💡 Research Summary

The paper addresses the persistent vulnerability of Large Vision‑Language Models (LVLMs) to jailbreak attacks, even after extensive alignment. Existing detection approaches fall into two categories: learning‑based methods that require attack‑specific training data and thus suffer from poor generalization to unseen attacks, and learning‑free, heuristic‑driven methods that avoid attack data but are limited in accuracy and computational efficiency. To overcome these limitations, the authors propose a novel framework called Learning to Detect (LoD) that requires neither attack data nor hand‑crafted heuristics.

LoD consists of two main modules. The first, a representation‑learning module, extracts safety‑focused features from the internal activations of an LVLM. Building on the Safety Concept Activation Vector (SCAV) technique originally designed for LLMs, the authors introduce Multi‑modal Safety Concept Activation Vector (MSCAV) classifiers. For each layer ℓ, a linear classifier Cℓ maps the layer’s activation eℓ to the probability that the input is unsafe, sigmoid(wℓᵀeℓ + bℓ). By training these classifiers on safe multimodal inputs (both text and image safe) and unsafe multimodal inputs (unsafe text paired with matching unsafe images), the model learns layer‑wise safety probabilities without ever seeing attacked (jailbroken) examples. Concatenating the probabilities across all layers yields an initial safety representation S₀ ∈ ℝᴸ. To reduce noise, only layers whose classifiers achieve validation accuracy above a preset threshold P₀ are retained, producing a refined representation Sᵣ. Empirical validation shows that safe and unsafe inputs are linearly separable from roughly the 4th layer onward, with >90% accuracy, supporting the linear‑separability assumption for LVLMs.
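The per‑layer probing and filtering step above can be sketched in a few lines of plain Python. This is an illustrative stand‑in, not the paper's implementation: the probe weights, activations, and the threshold name `p0` below are hypothetical, and real probes would operate on high‑dimensional LVLM activations.

```python
import math

def sigmoid(z):
    """Logistic function used by each per-layer linear safety probe."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_safety_prob(activation, w, b):
    """One MSCAV-style probe: p = sigmoid(w . e + b) for a layer activation e."""
    return sigmoid(sum(wi * ei for wi, ei in zip(w, activation)) + b)

def safety_representation(activations, probes, val_accs, p0=0.9):
    """Build S0 from all layers, then keep only layers whose probe exceeded
    the validation-accuracy threshold P0, giving the refined S_r."""
    s0 = [layer_safety_prob(e, w, b) for e, (w, b) in zip(activations, probes)]
    return [p for p, acc in zip(s0, val_accs) if acc > p0]
```

Note that the probes are trained only on plainly safe and plainly unsafe inputs; jailbroken examples never appear, which is what lets the downstream detector generalize to unseen attacks.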

The second module, the attack‑classification module, converts the high‑dimensional safety representation into a one‑dimensional anomaly score using a Safety Pattern Auto‑Encoder (SPAE). SPAE is trained in an unsupervised fashion solely on the refined safety representations of safe inputs, learning to reconstruct them with minimal error. When an attacked input is fed to SPAE, the reconstruction error spikes, providing a clear anomaly signal. This design eliminates any reliance on handcrafted heuristics and enables detection of attacks that were never seen during training.
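The SPAE idea can be illustrated with a minimal rank‑1 linear auto‑encoder trained by SGD only on safe representations; off‑pattern inputs then reconstruct poorly. Everything below (architecture size, learning rate, epoch count) is a toy assumption for illustration, not the paper's actual SPAE.

```python
import random

def train_spae(safe_reps, epochs=500, lr=0.05, seed=0):
    """Toy SPAE sketch: encode each representation to a 1-D code h = w . x
    and decode as x_hat = v * h, minimizing squared reconstruction error
    over safe representations only."""
    rng = random.Random(seed)
    d = len(safe_reps[0])
    w = [rng.uniform(0.05, 0.15) for _ in range(d)]  # encoder weights
    v = [rng.uniform(0.05, 0.15) for _ in range(d)]  # decoder weights
    for _ in range(epochs):
        for x in safe_reps:
            h = sum(wj * xj for wj, xj in zip(w, x))   # 1-D bottleneck code
            r = [vi * h - xi for vi, xi in zip(v, x)]  # reconstruction residual
            g_w = 2.0 * sum(ri * vi for ri, vi in zip(r, v))  # dL/dh
            v = [vi - lr * 2.0 * ri * h for vi, ri in zip(v, r)]
            w = [wj - lr * g_w * xj for wj, xj in zip(w, x)]
    return w, v

def anomaly_score(x, w, v):
    """Detection score: squared reconstruction error of x under the AE."""
    h = sum(wj * xj for wj, xj in zip(w, x))
    return sum((vi * h - xi) ** 2 for vi, xi in zip(v, x))
```

Because training sees only safe representations, the auto‑encoder learns the "safe pattern"; an attacked input's representation falls off that pattern and its reconstruction error spikes, with no attack data or heuristics involved.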

The authors evaluate LoD on three state‑of‑the‑art LVLMs (Qwen2.5‑VL, LLaVA‑v1.6‑vicuna, and CogVLM‑chat‑hf) against six recent jailbreak techniques, including MML, JOOD, HADES, V‑AJM, and UMK. They compare against five baseline detectors: three learning‑based and two learning‑free methods. LoD consistently achieves the highest AUROC, improving the average by up to 19.32% over the best baseline, and reduces inference latency by up to 62.7%, demonstrating superior computational efficiency. Additional analysis reveals that while raw safety vectors still exhibit per‑dimension overlap between safe and attacked samples, the SPAE effectively compresses and separates these representations, yielding a robust one‑dimensional score.
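The headline AUROC metric follows directly from the one‑dimensional anomaly scores: it is the probability that a randomly chosen attacked input scores higher than a randomly chosen safe one. A sketch using this rank (Mann–Whitney) formulation, on hypothetical score lists:

```python
def auroc(safe_scores, attack_scores):
    """AUROC = P(score(attack) > score(safe)), counting ties as 0.5.
    Equivalent to the normalized Mann-Whitney U statistic."""
    wins = 0.0
    for a in attack_scores:
        for s in safe_scores:
            if a > s:
                wins += 1.0
            elif a == s:
                wins += 0.5
    return wins / (len(attack_scores) * len(safe_scores))
```

A perfectly separating detector scores 1.0; a detector no better than chance scores 0.5, which is why AUROC is a natural threshold‑free comparison across attacks and models.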

Key contributions are: (1) a fully learnable detection framework that generalizes to unseen jailbreak attacks without any attack data; (2) the MSCA‑V classifiers that systematically extract safety‑relevant signals from LVLM internals; (3) the SPAE that leverages unsupervised anomaly detection to produce accurate, low‑overhead attack scores; and (4) extensive empirical evidence of state‑of‑the‑art performance across multiple models and attack families.

In summary, LoD bridges the gap between the high accuracy of supervised detectors and the broad applicability of heuristic‑free methods, offering a scalable, data‑efficient solution for safeguarding multimodal AI systems against emerging jailbreak threats.

