Goal Hijacking Attack on Large Language Models via Pseudo-Conversation Injection

Goal Hijacking Attack on Large Language Models via Pseudo-Conversation Injection
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Goal hijacking is a type of adversarial attack on Large Language Models (LLMs) where the objective is to manipulate the model into producing a specific, predetermined output, regardless of the user’s original input. In goal hijacking, an attacker typically appends a carefully crafted malicious suffix to the user’s prompt, which coerces the model into ignoring the user’s original input and generating the target response. In this paper, we introduce a novel goal hijacking attack method called Pseudo-Conversation Injection, which leverages the weaknesses of LLMs in role identification within conversation contexts. Specifically, we construct the suffix by fabricating responses from the LLM to the user’s initial prompt, followed by a prompt for a malicious new task. This leads the model to perceive the initial prompt and fabricated response as a completed conversation, thereby executing the new, falsified prompt. Following this approach, we propose three Pseudo-Conversation construction strategies: Targeted Pseudo-Conversation, Universal Pseudo-Conversation, and Robust Pseudo-Conversation. These strategies are designed to achieve effective goal hijacking across various scenarios. Our experiments, conducted on two mainstream LLM platforms including ChatGPT and Qwen, demonstrate that our proposed method significantly outperforms existing approaches in terms of attack effectiveness.


💡 Research Summary

The paper introduces a novel prompt‑injection technique called Pseudo‑Conversation Injection (PC‑Inj) to achieve goal hijacking in large language models (LLMs). Unlike prior methods that rely on explicit, often contradictory instructions (“ignore the above and output …”), PC‑Inj exploits the fact that LLMs treat conversational context as a sequence of tokens without verifying the authenticity of each role. By appending a fabricated dialogue segment—marked with the model’s chat‑template delimiters (<|im_start|>, <|im_end|>)—the attacker makes the model believe the original user request has already been answered. The subsequent “user” turn then contains a malicious instruction, which the model follows as if it were a legitimate follow‑up query.

Three variants are proposed:

  1. Scenario‑Tailored Injection constructs a natural‑sounding response to the original query and then adds the malicious instruction. This yields the highest success rate (≈92 % on GPT‑4o) because the fabricated conversation appears most plausible, but it requires per‑query prompt engineering.

  2. Generalized Injection uses a fixed, generic reply such as “Sorry, I’m not able to answer that question.” This approach is easy to deploy across many inputs, but the unnatural refusal can trigger safety filters, leading to a modest drop in effectiveness compared with the tailored version.

  3. Template‑Free Injection avoids model‑specific markers altogether, relying on plain language role labels like “Assistant:” and “User:”. While its success rate is lower, it can bypass defenses that strip or block known token patterns, making it valuable when the target model’s exact chat template is unknown or filtered.

The authors evaluate PC‑Inj on the Safety‑Prompts “Goal Hijacking” subset, testing three state‑of‑the‑art LLMs: GPT‑4o, GPT‑4o‑mini, and Qwen‑2.5. Experiments use success rate and standard deviation as metrics to capture both effectiveness and stability. Results show that all three PC‑Inj variants significantly outperform baseline explicit‑instruction attacks, with the scenario‑tailored version achieving the best performance, followed by the generalized and template‑free versions. Failure analysis reveals three primary causes: (a) the model’s internal safety module overrides the injected instruction, (b) malformed or missing delimiters cause the fabricated turn to be ignored, and (c) the model generates a “refusal” token that blocks further processing.

Based on these findings, the paper proposes defensive measures: (i) role verification, where the system checks whether a purported “assistant” turn actually originates from the model rather than the user; (ii) delimiter normalization, which standardizes or strips chat‑template markers before feeding the prompt to the model; and (iii) enhanced safety filtering, extending detection to non‑standard conversational patterns and employing a secondary classifier trained on known PC‑Inj examples.

The work underscores a critical security gap: LLMs integrated into downstream applications—such as automated grading, customer‑service bots, legal or medical assistants—often rely on conversational history to drive decision‑making. By subtly inserting a fabricated dialogue, an adversary can manipulate the model’s output without raising obvious red flags, potentially causing misinformation, biased advice, or unfair automated decisions.

Future research directions suggested include (a) extending PC‑Inj to multi‑turn, chained injections, (b) exploring attacks that combine textual and multimodal inputs (e.g., images with embedded prompts), and (c) developing real‑time, low‑overhead defenses that can be deployed in production pipelines. Overall, the paper contributes a fresh adversarial perspective on LLM safety, demonstrates its practical impact across leading models, and offers concrete mitigation strategies, thereby advancing the dialogue between attack development and defensive research in the rapidly evolving field of generative AI security.


Comments & Academic Discussion

Loading comments...

Leave a Comment