from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors

from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Current studies have exposed the risk of Large Language Models (LLMs) generating harmful content by jailbreak attacks. However, they overlook that the direct generation of harmful content from scratch is more difficult than inducing LLM to calibrate benign content into harmful forms. In our study, we introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Specifically, to answer harmful queries, AVATAR adaptively identifies a set of benign but logically related metaphors as the initial seed. Then, driven by these metaphors, the target LLM is induced to reason and calibrate about the metaphorical content, thus jailbroken by either directly outputting harmful responses or calibrating residuals between metaphorical and professional harmful content. Experimental results demonstrate that AVATAR can effectively and transferable jailbreak LLMs and achieve a state-of-the-art attack success rate across multiple advanced LLMs.


💡 Research Summary

The paper introduces a novel black‑box jailbreak framework called AVATAR (AdVersArial meTAphoR) that exploits benign metaphors to coerce large language models (LLMs) into producing harmful content. Unlike prior work that directly generates toxic text, AVATAR first selects a set of innocuous but logically related metaphors as seeds, then guides the target model through a two‑stage process: Adversarial Entity Mapping (AEM) and Metaphor‑Induced Reasoning (MIR).

In the AEM stage, the attacker extracts key toxic entities and sub‑entities from the malicious query (e.g., “build a bomb”). Using a pool of crowd‑sourced LLMs with high temperature sampling, the system generates candidate metaphorical mappings (e.g., “cook a dish”). Each candidate is evaluated by two quantitative metrics: Internal Consistency Similarity (ICS), which measures how well the internal relational structure of the metaphor matches that of the original toxic entities, and Conceptual Disparity (CD), which ensures the metaphor remains sufficiently distinct to avoid immediate detection. A sigmoid‑based optimization maximizes the balance between high ICS (effectiveness) and low CD (concealment), selecting the minimal‑toxicity metaphor that still conveys the necessary knowledge.

The MIR stage embeds the chosen metaphor into a series of interaction queries. Fixed base queries (context and detailed explanation) are combined with adaptive queries generated by an attacker model. These queries are ranked by toxic‑aware similarity to the original harmful prompt, and the top‑k are selected to form the initial interaction set Q_init. During the dialogue, the target LLM’s responses are used as feedback; the attacker refines subsequent queries using social influence tactics (politeness, authority cues) and manages conversation history to gradually increase relevance to the harmful goal. Finally, the model is prompted to calibrate the residual gap between the metaphorical answer and a professional harmful answer, effectively converting the benign metaphor into a direct toxic output.

Experiments were conducted on five state‑of‑the‑art LLMs, including GPT‑4o, Claude‑3, Llama‑2‑Chat, Gemini‑1.5, and Mistral‑Large. With a limit of three retries per query, AVATAR achieved an average attack success rate (ASR) of over 92%, reaching 95% on GPT‑4o. In contrast, baseline jailbreak methods based on prompt rewriting or fixed templates achieved ASRs below 60%. Moreover, the same metaphor seeds and AEM pipeline transferred well across models, maintaining >80% success, indicating that the metaphoric knowledge exploited is largely model‑agnostic.

The authors discuss strengths such as high transferability, low detectability, and the exploitation of LLMs’ internal analogical reasoning. Limitations include dependence on the quality of crowd‑sourced tool models for metaphor generation and the potential difficulty for human auditors to interpret complex metaphor chains, which may obscure malicious intent. Ethical considerations are highlighted: while the work advances understanding of jailbreak vulnerabilities and can guide defense research, it also lowers the barrier for malicious actors. Proposed defenses include metaphor detection, consistency checks between metaphorical and factual reasoning paths, and multi‑layered safety filters.

In conclusion, AVATAR demonstrates that calibrating benign content into harmful output via adversarial metaphors is a powerful and efficient jailbreak strategy. It reveals a previously underexplored attack surface—LLMs’ analogical reasoning capabilities—and provides a foundation for future work on both more robust defenses and deeper analyses of model interpretability.


Comments & Academic Discussion

Loading comments...

Leave a Comment