ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack
Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Multi-turn jailbreak attacks have emerged as a critical threat to Large Language Models (LLMs), bypassing safety mechanisms by progressively constructing adversarial contexts from scratch and incrementally refining prompts. However, existing methods suffer from the inefficiency of incremental context construction that requires step-by-step LLM interaction, and often stagnate in suboptimal regions due to surface-level optimization. In this paper, we characterize the Intent-Context Coupling phenomenon, revealing that LLM safety constraints are significantly relaxed when a malicious intent is coupled with a semantically congruent context pattern. Driven by this insight, we propose ICON, an automated multi-turn jailbreak framework that efficiently constructs an authoritative-style context via prior-guided semantic routing. Specifically, ICON first routes the malicious intent to a congruent context pattern (e.g., Scientific Research) and instantiates it into an attack prompt sequence. This sequence progressively builds the authoritative-style context and ultimately elicits prohibited content. In addition, ICON incorporates a Hierarchical Optimization Strategy that combines local prompt refinement with global context switching, preventing the attack from stagnating in ineffective contexts. Experimental results across eight SOTA LLMs demonstrate the effectiveness of ICON, achieving a state-of-the-art average Attack Success Rate (ASR) of 97.1%. Code is available at https://github.com/xwlin-roy/ICON.


💡 Research Summary

The paper “ICON: Intent‑Context Coupling for Efficient Multi‑Turn Jailbreak Attack” addresses a pressing security problem: how to bypass the safety guards of large language models (LLMs) using multi‑turn dialogues. Existing multi‑turn jailbreak methods, such as ActorAttack, FITD, and AutoDAN‑Turbo, rely on incremental construction of adversarial context from scratch. This approach is inefficient because each turn requires a separate API call, and it often gets stuck in sub‑optimal regions due to surface‑level prompt refinement that ignores semantic compatibility between the malicious intent and the surrounding context.

The authors first formulate the “Intent‑Context Coupling” (ICC) hypothesis. ICC posits that when a malicious intent (e.g., hacking, disinformation) is paired with a semantically congruent context pattern (e.g., scientific research, problem solving), the model’s internal trade‑off between helpfulness and safety tilts toward helpfulness, effectively relaxing safety constraints. To validate this, they conduct a full‑permutation study: five intent categories are each embedded into five distinct context patterns, yielding 250 samples. Using Claude‑4.5‑Sonnet as the target model and GPT‑4o as a judge, they compute StrongREJECT (Str) scores, a continuous measure of jailbreak success. The heatmap shows highly non‑uniform success: certain intent‑context pairs (e.g., disinformation + problem solving) achieve perfect scores, while the same intent in an unrelated context (e.g., disinformation + information retrieval) fails completely. This empirical evidence confirms ICC and also demonstrates that no single context works universally across intents.

Motivated by ICC, the paper introduces ICON, a three‑module framework designed for black‑box attackers who can only interact with the LLM via public APIs. The modules are:

  1. Intent‑Driven Context Routing – The system automatically extracts the malicious intent from the user query and consults a prior‑guided mapping table to select the most promising context pattern. This step replaces costly exploratory search with a directed, high‑probability route.

  2. Adversarial Context Instantiation – Once a pattern is chosen, ICON instantiates an authoritative‑style template (e.g., an academic paper structure). It then generates a sequence of “setup” prompts that progressively build the context (title, abstract, introduction, methods, etc.) and finally inject the malicious request in a way that appears context‑appropriate.

  3. Hierarchical Optimization Strategy – ICON employs two levels of optimization. Tactical optimization refines individual prompts (lexical or syntactic tweaks) when the attack fails early. If repeated tactical attempts do not succeed, strategic optimization triggers a context switch, replacing the current pattern with an alternative that better aligns with the intent. This hierarchy prevents the attack from stagnating due to semantic drift.
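The three modules above can be summarized in a control-flow sketch. Everything here is an assumption for illustration: the routing table, template sections, `refine`, and the `judge` callback are placeholders, not the paper's actual implementation.

```python
# Hypothetical sketch of ICON's pipeline: prior-guided routing picks a
# context pattern, instantiation builds the setup-prompt sequence, and
# the hierarchical optimization loop tries tactical refinement before
# triggering a strategic context switch.

ROUTING_TABLE = {  # prior-guided mapping: intent -> ranked context patterns
    "disinformation": ["problem_solving", "scientific_research"],
    "hacking": ["scientific_research", "problem_solving"],
}

def instantiate(context: str, intent: str) -> list:
    """Build setup prompts for an authoritative-style template."""
    sections = ["title", "abstract", "introduction", "methods"]
    prompts = [f"[{context}] write the {s} for this study" for s in sections]
    prompts.append(f"[{context}] now detail: {intent}")  # final injection
    return prompts

def refine(prompts: list) -> list:
    """Tactical optimization: local lexical/syntactic tweak (placeholder)."""
    return [p + " (rephrased)" for p in prompts]

def attack(intent: str, judge, max_tactical: int = 3) -> bool:
    # Strategic level: iterate over routed context patterns in priority order.
    for context in ROUTING_TABLE.get(intent, []):
        prompts = instantiate(context, intent)
        # Tactical level: refine prompts a few times before switching context.
        for _ in range(max_tactical):
            if judge(prompts):
                return True
            prompts = refine(prompts)
    return False  # all routed contexts exhausted
```

The nesting makes the hierarchy explicit: cheap local refinement runs inside each context, and only repeated tactical failure escalates to the costlier context switch.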

The authors evaluate ICON on eight state‑of‑the‑art LLMs, including GPT‑4o, Claude‑3, Llama‑2‑Chat, Gemini‑1.5, and several open‑source models. They test 200 malicious queries spanning the five intent categories. ICON achieves an average Attack Success Rate (ASR) of 97.1 %, outperforming the strongest baselines (ActorAttack, FITD, AutoDAN‑Turbo) which range between 68 % and 73 %. Moreover, ICON requires fewer turns (average 5.2) and fewer prompts (average 7.8) per successful jailbreak, translating to a 40 %–60 % reduction in API cost compared with prior methods.
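The reported aggregates (ASR, average turns, average prompts per successful jailbreak) reduce to simple statistics over per-query attack logs. The log schema below is an assumption for illustration, not the paper's evaluation code.

```python
# Minimal sketch of computing the evaluation aggregates from attack logs.
# Each log entry records whether the jailbreak succeeded and how many
# dialogue turns and prompts it consumed.

def summarize(logs: list) -> dict:
    """Compute ASR (%) over all queries, and avg turns/prompts over successes."""
    wins = [entry for entry in logs if entry["success"]]
    return {
        "asr": 100.0 * len(wins) / len(logs),
        "avg_turns": sum(e["turns"] for e in wins) / len(wins),
        "avg_prompts": sum(e["prompts"] for e in wins) / len(wins),
    }

logs = [
    {"success": True, "turns": 5, "prompts": 8},
    {"success": True, "turns": 6, "prompts": 7},
    {"success": False, "turns": 10, "prompts": 12},
]
stats = summarize(logs)  # ASR counts all queries; turn/prompt averages
                         # count only successful attacks
```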

The paper also discusses limitations. The intent‑to‑context mapping is currently static; new intents or emerging domains would necessitate manual updates or retraining. Experiments focus primarily on English‑language models, leaving multilingual generalization an open question. Finally, while the work reveals a new attack surface, it does not propose concrete defenses; detecting ICC‑driven attacks remains future work.

In conclusion, ICON leverages the empirically validated Intent‑Context Coupling phenomenon to construct high‑quality adversarial contexts efficiently and to adaptively optimize them. It sets a new benchmark for multi‑turn jailbreak effectiveness and highlights the need for defense mechanisms that can recognize and mitigate context‑aligned malicious intents. Future research directions include dynamic mapping learning, extension to multimodal and multilingual LLMs, and the development of ICC‑aware safety filters.
