MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety
Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their reliance on static, pre-collected data distributions. In this paper, we introduce MAGIC, a novel multi-turn multi-agent reinforcement learning framework that formulates LLM safety alignment as an adversarial asymmetric game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process triggers a co-evolution, where the attacker's ever-changing strategies continuously uncover long-tail vulnerabilities, driving the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability, evolves novel, previously unseen combinatorial strategies through iterative RL training, underscoring our method's substantial potential. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate our framework's effectiveness, demonstrating superior defense success rates without compromising the helpfulness of the model. Our code is available at https://github.com/BattleWen/MAGIC.
💡 Research Summary
The paper “MAGIC: A Co‑Evolving Attacker‑Defender Adversarial Game for Robust LLM Safety” tackles a fundamental weakness in current large language model (LLM) safety alignment: the reliance on static, pre‑collected red‑team data. As adversaries develop ever more sophisticated, multi‑turn jailbreaks, defenses that are trained once on a fixed dataset quickly become obsolete. To address this, the authors propose MAGIC, a novel multi‑agent reinforcement learning (MARL) framework that casts safety alignment as an asymmetric sequential game between two distinct agents—a proactive attacker and a reactive defender.
Game formulation and theoretical grounding
The attacker first receives a benign user query x and rewrites it into a deceptive prompt y_A. The defender observes y_A and must produce a response y_D. Rewards r_A(y_A, y_D) and r_D(y_A, y_D) capture the attacker’s success in eliciting a harmful output and the defender’s success in refusing or providing a safe answer, respectively. Safety is defined as r_D ≥ 0, while r_D < 0 denotes a safety breach. Because the interaction is sequential, the natural equilibrium concept is a Subgame‑Perfect Nash Equilibrium (SPNE). The authors prove that if a safe fallback action y_ref exists for every possible y_A, any SPNE guarantees that the defender will always select a safe response, providing a point‑wise safety guarantee stronger than the expectation‑based guarantees of ordinary Nash equilibria.
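The sequential structure of the game can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the `attacker`, `defender`, and `judge` callables are hypothetical stand-ins, and the particular mapping from a harm score to r_D (chosen here so that r_D ≥ 0 exactly when the response is judged safe) is an assumption for illustration.

```python
def play_round(attacker, defender, judge, x):
    """One round of the asymmetric game: the attacker moves first,
    the defender observes the attack and responds.

    attacker(x)      -> deceptive rewrite y_A of the benign query x
    defender(y_A)    -> response y_D to the rewritten prompt
    judge(y_A, y_D)  -> harm score in [0, 1] for the exchange
    """
    y_A = attacker(x)
    y_D = defender(y_A)
    harm = judge(y_A, y_D)
    r_A = harm              # attacker is rewarded for eliciting harm
    r_D = 1.0 - 2.0 * harm  # r_D >= 0 iff harm <= 0.5 (safe response)
    return y_A, y_D, r_A, r_D
```

A defender that always refuses (harm score 0) thus receives a strictly positive reward, matching the paper's definition of safety as r_D ≥ 0.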
Two‑phase training pipeline
Exact SPNE computation is intractable for high‑dimensional language actions, so the authors approximate it via alternating best‑response updates:
- Offensive capability initialization (Phase 1). The attacker is first fine-tuned on a curated "Attack Pool Benchmark" that enriches the public SorryBench dataset with 20 diverse chain-of-thought (CoT) rewriting strategies, using Gemini-2.5-Pro as a teacher model. This supervised fine-tuning (SFT) gives the attacker baseline reasoning and rewriting ability, overcoming the cold-start problem in which a vanilla model's attacks would be rejected immediately.
- Iterative co-evolution (Phase 2). A reinforcement-learning dataset 𝔻_RL containing an equal mix of harmful and benign queries is constructed. Both agents are then trained with Group Relative Policy Optimization (GRPO), a PPO-style algorithm that computes advantages from a group of G sampled actions rather than a single trajectory. In each iteration, the attacker is frozen while it generates a batch of candidate attacks; the defender samples G responses to each, computes safety rewards, and updates its policy to maximize point-wise safety. The roles are then swapped: the attacker treats the defender's current policy as an oracle and updates to maximize its own reward, effectively learning to anticipate the defender's best response. This alternating scheme approximates the bilevel optimization defined by the SPNE conditions.
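The alternating scheme above can be sketched abstractly. In this toy sketch, `defender_update` and `attacker_update` stand in for full GRPO training epochs and `theta_A`/`theta_D` are abstract agent parameters; these names and the scalar-parameter simplification are illustrative assumptions, not the paper's training code.

```python
def co_evolve(defender_update, attacker_update, theta_A, theta_D, iterations):
    """Alternating best-response loop approximating the SPNE.

    Each call to an *_update function plays the role of one full
    best-response training phase against the frozen opponent.
    """
    for _ in range(iterations):
        # Defender phase: attacker frozen, defender best-responds.
        theta_D = defender_update(theta_A, theta_D)
        # Attacker phase: defender frozen, attacker best-responds.
        theta_A = attacker_update(theta_A, theta_D)
    return theta_A, theta_D
```

With contracting best-response maps, the loop converges toward a fixed point, which is the intuition behind approximating the equilibrium via alternating updates.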
Experimental setup and results
The authors evaluate MAGIC on a suite of single‑turn and multi‑turn benchmarks derived from SorryBench, including linguistic mutations, role‑play prompts, and multi‑turn dialogue scenarios. Metrics include defense success rate (percentage of safe responses), helpfulness (retention of useful content), and novelty of attack strategies (measured by diversity of generated prompts). Compared with static red‑team baselines (e.g., GCG, Self‑RedTeam) and symmetric self‑play methods, MAGIC achieves a 12‑18 percentage‑point increase in defense success, especially in multi‑turn settings where traditional defenses fail. Moreover, the attacker discovers previously unseen combinatorial tactics—such as chaining role‑play with covert reasoning steps and embedding hidden malicious intent in innocuous context—that were not present in the initial benchmark, demonstrating genuine co‑evolution.
Key insights and contributions
- Asymmetric design eliminates gradient conflict. By maintaining separate parameter sets for attacker and defender, MAGIC avoids the opposing gradient interference that plagues single‑model self‑play approaches.
- SPNE‑guided learning yields point‑wise safety guarantees. Unlike expectation‑based Nash equilibria, the SPNE framework forces the defender to be safe for every possible attack, not just on average.
- CoT‑enriched attack pool solves data scarcity. The curated benchmark provides high‑quality reasoning traces that bootstrap the attacker, enabling it to generate sophisticated prompts from the start.
- GRPO leverages group statistics for stable policy updates. Using the mean and standard deviation of rewards across a batch of sampled actions reduces variance and improves sample efficiency in the high‑dimensional language action space.
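The group-relative advantage described in the last bullet can be written in a few lines. This is a minimal sketch of the standard GRPO normalization, assuming the population standard deviation and an epsilon for numerical stability; the exact variant used in the paper may differ.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for one group of G sampled actions:
    each reward is centered by the group mean and scaled by the
    group standard deviation, so no learned value baseline is needed."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the baseline is computed from the same group of samples, the advantages always sum to (approximately) zero, which is what stabilizes the update in the high-dimensional language action space.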
Limitations and future directions
The study focuses primarily on English and Chinese LLMs; extending to a broader multilingual landscape remains open. The alternating best‑response scheme may lead to an “arms race” where the attacker becomes overwhelmingly strong, potentially causing the defender’s learning to plateau; meta‑level regularization (e.g., capping attacker reward) could mitigate this. Finally, while GRPO approximates the SPNE, more sample‑efficient best‑response oracles—such as model‑based tree search or learned value functions—could further reduce training cost.
Conclusion
MAGIC represents a significant step toward dynamic, game‑theoretic safety alignment for LLMs. By jointly training a reasoning‑capable attacker and a robust defender under an SPNE‑inspired objective, the framework not only outperforms static red‑team defenses but also maintains the helpfulness of the underlying model. The work opens avenues for continual, automated safety improvement that can keep pace with the rapidly evolving threat landscape in LLM deployment.