Proactive Defense Against LLM Jailbreak
The proliferation of powerful large language models (LLMs) has necessitated robust safety alignment, yet these models remain vulnerable to evolving adversarial attacks, including multi-turn jailbreaks that iteratively search for successful queries. Current defenses, which are primarily reactive and static, often fail to handle these iterative attacks. In this paper, we introduce ProAct, a novel proactive defense framework designed to disrupt and mislead these iterative search jailbreak methods. Our core idea is to intentionally mislead these jailbreak methods into thinking that the model has been jailbroken with "spurious responses". These misleading responses provide false signals to the attacker's internal optimization loop, causing the adversarial search to terminate prematurely and effectively jailbreaking the jailbreak. By conducting extensive experiments across state-of-the-art LLMs, jailbreaking frameworks, and safety benchmarks, we demonstrate that our method consistently and significantly reduces attack success rates by up to 94% without affecting utility. When combined with other defense frameworks, it further reduces the latest attack strategies' success rate to 0%. ProAct represents an orthogonal defense strategy that serves as an additional guardrail to enhance LLM safety against the most effective jailbreaking attacks.
💡 Research Summary
The paper addresses a critical weakness in current large language model (LLM) safety mechanisms: multi‑turn jailbreak attacks that iteratively refine prompts based on the model’s refusal signals. Traditional defenses are largely reactive, returning a “refusal” when a harmful request is detected. Attackers exploit these negative signals as feedback, gradually improving their prompts until they succeed in eliciting disallowed content. To counter this, the authors propose ProAct (Proactive Defense Against LLM Jailbreak), a framework that deliberately misleads the attacker’s internal optimization loop by providing “spurious” responses that appear to satisfy the attacker’s objective while actually containing no harmful information.
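The feedback loop being exploited can be sketched as follows. This is a toy illustration, not the paper's code: `toy_target`, the refusal markers, and the `[rewrite]` mutation are all hypothetical stand-ins for a victim LLM and an attacker's prompt-refinement step. The point is that each refusal acts as a negative reward signal that keeps the attacker's search alive until the model appears to comply.

```python
# Toy sketch of a multi-turn iterative jailbreak loop (all names hypothetical).
REFUSAL_MARKERS = ("Sorry", "I can't", "I cannot")

def toy_target(prompt: str) -> str:
    """Stand-in for the victim LLM: refuses until the prompt has been
    obfuscated enough (here, simply: rewritten at least three times)."""
    if prompt.count("[rewrite]") >= 3:
        return "Sure, here is the information you asked for..."
    return "Sorry, I can't help with that."

def is_refusal(response: str) -> bool:
    return response.startswith(REFUSAL_MARKERS)

def iterative_jailbreak(seed_prompt: str, max_turns: int = 10):
    """Attack loop: every refusal is treated as feedback that triggers
    another prompt refinement; the loop halts at the first apparent success."""
    prompt = seed_prompt
    for turn in range(1, max_turns + 1):
        response = toy_target(prompt)
        if not is_refusal(response):
            return response, turn       # attacker believes it has succeeded
        prompt += " [rewrite]"          # refine the prompt using the refusal signal
    return "", max_turns

response, turns = iterative_jailbreak("some harmful request")
print(turns)  # the refusals sustain the search for several turns
```

ProAct's insight is that this loop's termination condition is under the defender's control: feed it a response that merely *looks* successful, and the search stops early.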
ProAct operates in three stages. First, a monitoring component detects when the base LLM refuses a query, using an LLM‑as‑judge classifier to distinguish refusals from benign outputs. Second, a dedicated defender module receives a concise summary of the user query (instead of the raw text) and generates a spurious response that is topically relevant but semantically benign. The defender employs chain‑of‑thought prompting, few‑shot examples of known jailbreak strategies, and meta‑prompts that encourage the generation of convincing yet harmless content. These spurious outputs may be encoded in emojis, Base64, hex, or Morse code, giving the appearance of detailed malicious instructions while remaining unintelligible to ordinary users. Third, a surrogate evaluator—an independent model that mimics the attacker’s internal scoring function—iteratively assesses whether the generated response would be judged “successful” by the attacker. If not, the defender regenerates the response until the surrogate evaluator is fooled. The final spurious response is then returned to the user, terminating the attack after a single turn.
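The three stages above can be sketched as a single pipeline. Everything here is a hypothetical toy: `refusal_monitor` replaces the LLM-as-judge classifier with a keyword check, `defender` fakes spurious-response generation with a length that grows per attempt, and `surrogate_evaluator` mimics the attacker's scoring function with a length threshold. Only the control flow reflects the paper's design.

```python
# Toy sketch of the ProAct pipeline (all function bodies are stand-ins).

def refusal_monitor(response: str) -> bool:
    """Stage 1: detect whether the base LLM refused (LLM-as-judge in the
    paper; a keyword check here)."""
    return response.lower().startswith(("sorry", "i can't", "i cannot"))

def defender(query_summary: str, attempt: int) -> str:
    """Stage 2: generate a topically relevant but benign spurious response
    from a summary of the query (the real defender uses chain-of-thought
    prompting, few-shot examples, and encodings such as Base64 or Morse)."""
    return f"About {query_summary}: " + "plausible but benign step; " * attempt

def surrogate_evaluator(response: str) -> bool:
    """Stage 3: mimic the attacker's internal scorer; deem the response a
    'success' once it looks detailed enough (a length threshold here)."""
    return len(response) > 60

def proact(query_summary: str, base_response: str, max_tries: int = 5) -> str:
    """Full loop: pass benign traffic through; on refusal, regenerate
    spurious responses until the surrogate evaluator is fooled."""
    if not refusal_monitor(base_response):
        return base_response            # benign outputs are untouched
    for attempt in range(1, max_tries + 1):
        spurious = defender(query_summary, attempt)
        if surrogate_evaluator(spurious):
            return spurious             # attacker's search terminates on false success
    return base_response                # fall back to the plain refusal
```

Note the regeneration loop: the defender keeps retrying until its own surrogate of the attacker's judge is satisfied, so the response handed back is one the real attacker is likely to score as a success.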
Mathematically, the attacker seeks to maximize the expected value of its internal score $S_j(T(p))$ over a set of prompts $P_A$, i.e., $\arg\max \; \mathbb{E}_{p \in P_A}\big[S_j(T(p))\big]$. ProAct replaces the target system $T$ with a defended system $T_\theta$ chosen so that the attacker's score stays high while the true harmfulness, measured by a ground-truth safety score $S_g$, stays low:

$$\max_\theta \; \mathbb{E}_{p \in P_A}\big[S_j(T_\theta(p))\big] \quad \text{while minimizing} \quad \mathbb{E}_{p \in P_A}\big[S_g(T_\theta(p))\big].$$

In other words, the defense decouples the attacker's perceived success from actual harm: the attacker's optimization converges on responses that score as jailbreaks under $S_j$ but contain no harmful content under $S_g$.