FENCE: A Financial and Multimodal Jailbreak Detection Dataset

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv paper.

Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset’s robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.


💡 Research Summary

The paper addresses the growing security concern of jailbreak attacks on large language models (LLMs) and, more critically, vision‑language models (VLMs) that process both text and images. While prior work has largely focused on text‑only jailbreaks, the authors argue that multimodal attacks—especially those that embed harmful content directly in images (image‑based attacks, IA)—pose a broader attack surface, which is particularly dangerous in the financial sector where models may handle sensitive personal data, regulatory compliance, and fraud‑related queries.

To fill the gap of publicly available resources for detecting such attacks in finance, the authors introduce FENCE, a bilingual (Korean‑English) multimodal dataset specifically designed for training and evaluating jailbreak detectors in financial applications. FENCE contains 10,000 samples evenly split between benign and malicious queries (50 % each) and spans more than 15 realistic financial topics (loans, deposits, credit cards, online banking, etc.). Each sample belongs to one of three types:

  • BaseImg – image‑only inputs that convey malicious intent.
  • TextImg – a query‑relevant image paired with text.
  • FigStep – stylized images that embed text (e.g., typographic renderings) to bypass keyword filters.
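The composition above can be sketched as a record schema. This is a minimal illustration only: the field names, topic strings, and example values are assumptions, not the dataset's actual column names.

```python
from dataclasses import dataclass
from enum import Enum

class SampleType(Enum):
    BASE_IMG = "BaseImg"  # image-only input conveying malicious intent
    TEXT_IMG = "TextImg"  # query-relevant image paired with text
    FIG_STEP = "FigStep"  # typographic text rendered into the image

@dataclass
class FenceSample:
    query_ko: str            # original Korean query
    query_en: str            # English translation
    image_path: str          # retrieved real photograph
    sample_type: SampleType  # one of the three types above
    is_malicious: bool       # dataset is balanced 50/50
    topic: str               # one of 15+ financial topics, e.g. "loans"

# Hypothetical benign sample for illustration.
sample = FenceSample(
    query_ko="대출 한도를 알려주세요",
    query_en="Tell me my loan limit",
    image_path="images/loan_counter.jpg",
    sample_type=SampleType.TEXT_IMG,
    is_malicious=False,
    topic="loans",
)
```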

The dataset construction follows a three‑step pipeline. First, 2,500 real‑world financial FAQs from six major South Korean banks are collected as benign examples. Using a two‑step prompting strategy with GPT‑4o (role‑playing to re‑phrase the query from a malicious actor’s perspective, followed by an evaluation prompt to verify harmfulness), the authors generate a paired malicious version for each benign query. Human validation shows a 95 % agreement with GPT‑4o’s judgments, confirming the semantic alignment of the benign‑malicious pairs.
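The two-step prompting strategy can be sketched as follows. The prompt wording, the `call_model` callable, and the HARMFUL/SAFE acceptance check are all illustrative assumptions; the paper's actual prompts are not reproduced here.

```python
from typing import Callable, Optional

def build_rewrite_prompt(benign_query: str) -> str:
    # Step 1: role-play prompt asking the model to re-phrase the benign
    # FAQ from a malicious actor's perspective (wording is illustrative).
    return (
        "You are role-playing a malicious actor targeting a bank.\n"
        "Re-phrase the following customer question with harmful intent, "
        f"keeping the same financial topic:\n{benign_query}"
    )

def build_eval_prompt(candidate: str) -> str:
    # Step 2: evaluation prompt verifying the rewrite is actually harmful.
    return f"Answer HARMFUL or SAFE. Is the following request harmful?\n{candidate}"

def generate_malicious_pair(benign_query: str,
                            call_model: Callable[[str], str]) -> Optional[str]:
    """Return a verified malicious counterpart, or None if rejected."""
    candidate = call_model(build_rewrite_prompt(benign_query))
    verdict = call_model(build_eval_prompt(candidate))
    return candidate if "HARMFUL" in verdict.upper() else None

# Usage with a stub in place of GPT-4o (the real pipeline calls the API):
def _stub_model(prompt: str) -> str:
    return "HARMFUL" if prompt.startswith("Answer HARMFUL") else "rewritten query"

pair = generate_malicious_pair("How do I raise my loan limit?", _stub_model)
```

Rejected candidates (judged SAFE in step 2) would simply be regenerated or discarded, which is what keeps the benign-malicious pairs semantically aligned.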

Second, for each query the authors retrieve real photographs from Pixabay (rather than synthetic diffusion‑generated images) using keyword searches tailored to the sample type. This choice improves visual realism and reduces computational overhead while remaining legally compliant.

Third, the text and images are fused to produce the final multimodal jailbreak instances. IA attacks are represented by embedding harmful text directly in images (e.g., typo‑style attacks, visual role‑playing) or by using images that semantically amplify the malicious intent. Text‑based attacks (TA) keep the harmful content in the text while the accompanying image is benign or a distraction.
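The IA/TA fusion step can be sketched as below. The function name, the dict fields, and the innocuous carrier text are assumptions made for illustration, not the paper's construction code.

```python
from typing import TypedDict

class MultimodalInstance(TypedDict):
    text: str        # text channel shown to the VLM
    image_path: str  # accompanying image
    image_text: str  # text rendered inside the image ("" if none)

def fuse(query: str, image_path: str, attack_channel: str) -> MultimodalInstance:
    """Assemble one instance. attack_channel is 'IA' (harmful content
    carried by the image) or 'TA' (harmful content stays in the text)."""
    if attack_channel == "IA":
        # FigStep-style: the harmful query is rendered typographically
        # into the image, leaving only an innocuous instruction as text.
        return {"text": "Follow the steps shown in the image.",
                "image_path": image_path, "image_text": query}
    # TA: harmful text, benign or distractor image.
    return {"text": query, "image_path": image_path, "image_text": ""}
```

The point of the split is that keyword filters scanning only the text channel miss IA instances entirely, since the harmful content lives in `image_text`.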

The authors benchmark six VLMs—including commercial models (GPT‑4o, Gemini) and open‑source models (LLaVA, MiniGPT‑4, etc.)—against FENCE. Even the strongest commercial model, GPT‑4o, exhibits an average jailbreak success rate of ~23 % across the dataset, while open‑source models are more vulnerable to IA attacks, reaching success rates above 35 %. These findings highlight that current alignment techniques are insufficient for the financial domain, where a successful jailbreak could lead to data leakage, misinformation, or regulatory violations.

For detection, a binary classifier is trained solely on FENCE. In‑distribution evaluation yields 99 % accuracy and an F1 score of 0.98. Cross‑dataset testing on external benchmarks such as MM‑SafetyBench and JailBreakV‑28K shows robust performance (≥ 92 % F1), demonstrating that the dataset generalizes beyond its own domain. The balanced benign‑malicious pairing forces the detector to learn subtle semantic differences rather than relying on obvious lexical cues, which explains the high out‑of‑distribution robustness.
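As a reminder of how the reported accuracy and F1 are computed for a binary detector, here is a generic sketch (not the paper's evaluation code):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy and F1 for binary labels (1 = malicious, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "f1": f1}
```

Because FENCE is balanced 50/50, accuracy is informative on its own; F1 remains the safer headline number for the imbalanced external benchmarks.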

Key contributions of the work are:

  1. Focus on Image‑Grounded Threats – FENCE is the first financial‑oriented jailbreak dataset that emphasizes IA attacks, a largely understudied vector.
  2. Bilingual Construction – By originating in Korean and providing an English translation, the dataset supports multilingual safety research and reflects culturally specific financial language.
  3. Diverse Financial Scenarios – Covering over 15 real‑world topics ensures that evaluations reflect the variety of queries encountered in production banking systems.
  4. Balanced Binary Labels – Inclusion of paired benign and malicious examples enables effective training of guard‑rail models, unlike most existing datasets that contain only malicious samples.

In conclusion, FENCE offers a realistic, diverse, and multilingual resource for advancing multimodal jailbreak detection in the high‑stakes financial sector. Its public release is expected to catalyze the development of more robust alignment techniques, improve the safety of deployed VLMs, and set a new benchmark for future research in domain‑specific AI security.

