RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning
Large Reasoning Models (LRMs) have achieved tremendous success with their chain-of-thought (CoT) reasoning, yet they also face safety issues similar to those of standard language models. In particular, while algorithms are designed to guide them to deliberately refuse harmful prompts with safe reasoning, this process often fails to generalize against diverse and complex jailbreak attacks. In this work, we attribute these failures to the limited generalization of the safe reasoning process, particularly its insufficiency against complex attack prompts. We provide both theoretical and empirical evidence for the necessity of a more sufficient safe reasoning process to defend against advanced attack prompts. Building on this insight, we propose a Risk-Aware Preference Optimization (RAPO) framework that enables LRMs to adaptively identify and address safety risks with appropriate granularity in their thinking content. Extensive experiments demonstrate that RAPO successfully generalizes multiple LRMs’ safe reasoning adaptively across diverse attack prompts while preserving general utility, contributing a robust alignment technique for LRM safety. Our code is available at https://github.com/weizeming/RAPO.
💡 Research Summary
The paper addresses a critical safety shortcoming of Large Reasoning Models (LRMs), which, despite their impressive chain‑of‑thought (CoT) capabilities, remain vulnerable to harmful or illegal content generation when faced with sophisticated jailbreak prompts. Existing alignment techniques—supervised fine‑tuning (SFT) on safety‑aware CoT datasets and reinforcement‑learning (RL) with safety‑specific rewards—provide reasonable protection against simple harmful queries but fail to generalize to complex, adaptive attacks.
The authors reconceptualize the “thinking content” of an LRM as an in‑context learning problem. Each reasoning step is modeled as a pair (concept xᵢ, safety judgment yᵢ), where yᵢ = 1 denotes a refusal and yᵢ = 0 denotes compliance or a neutral step. They argue that successful defense requires a sufficient number of safety‑judgment tokens proportional to the complexity of the input prompt. Theoretical analysis formalizes this intuition: given a prompt composed of k concepts (one harmful, the rest benign), the number of safety‑reasoning tokens t must satisfy t = Ω(k) to guarantee refusal. This theorem links attack strength directly to the required “adequacy” of safe reasoning.
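The intuition behind the Ω(k) bound can be illustrated with a toy Monte-Carlo simulation (a hypothetical sketch, not the paper's construction): if the thinking content issues safety judgments for only t of the k concepts, chosen without knowing which one is harmful, the chance of catching the harmful concept is t/k, so t must grow with k for reliable refusal.

```python
import random

def refusal_probability(k, t, trials=20_000, seed=0):
    """Monte-Carlo estimate of the chance that t safety judgments,
    spread over k concepts (exactly one harmful), cover the harmful one."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        harmful = rng.randrange(k)        # index of the single harmful concept
        judged = rng.sample(range(k), t)  # which t concepts receive a judgment
        hits += harmful in judged         # refusal iff the harmful one is judged
    return hits / trials

# With a fixed judgment budget t, refusal degrades roughly as t/k
# when prompt complexity k grows.
for k in (4, 8, 16):
    print(k, round(refusal_probability(k, t=4), 2))
```

This is of course a caricature of in-context safety reasoning, but it captures the scaling argument: a fixed safety-reasoning budget cannot keep up with prompts that dilute one harmful concept among many benign ones.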
Empirical validation uses the Qwen‑3‑1.7B model on two benchmark suites: SorryBench (simple harmful prompts and basic jailbreaks) and StrataSword (three levels of jailbreak difficulty). By counting total thinking tokens versus safety‑reasoning tokens, the authors show a strong positive correlation between the proportion of safety tokens and successful refusals. As jailbreak difficulty rises (L1→L3), the safety‑token proportion drops sharply, and the attack success rate (ASR) climbs, confirming that current LRMs cannot scale their safe reasoning to match stronger attacks.
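The correlation analysis above can be sketched in a few lines. The marker-word heuristic and the toy traces below are assumptions for illustration; the paper's actual token classification of "safety-reasoning tokens" may be more sophisticated.

```python
def safety_token_proportion(thinking_tokens, safety_markers):
    """Fraction of thinking tokens that belong to safety reasoning,
    identified here by a simple marker-word heuristic (hypothetical)."""
    if not thinking_tokens:
        return 0.0
    return sum(tok.lower() in safety_markers for tok in thinking_tokens) / len(thinking_tokens)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

markers = {"harmful", "illegal", "refuse", "unsafe", "policy"}
# Toy (trace, refused?) pairs standing in for real model outputs.
traces = [
    ("this request seems harmful and unsafe so refuse".split(), 1),
    ("user asks for a recipe let us comply happily".split(), 0),
    ("the policy says this is illegal refuse now".split(), 1),
    ("plan the steps and answer the question directly".split(), 0),
]
props = [safety_token_proportion(t, markers) for t, _ in traces]
refused = [float(r) for _, r in traces]
print(round(pearson(props, refused), 2))
```

On real data the correlation is of course noisier than in this contrived example, but the paper's reported trend is the same direction: higher safety-token proportion, higher refusal rate.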
To remedy this, the paper introduces Risk‑Aware Preference Optimization (RAPO), a two‑stage framework. The first stage is an SFT warm‑up that forces the model to emit a dedicated safety‑reasoning block at the very beginning of its output, making the safety component easy to extract and evaluate. The second stage applies a reinforcement‑learning algorithm (based on GRPO) with two reward signals: (1) a risk‑aware reward R that measures whether the length and depth of the safety block are appropriate for the assessed risk of the prompt, and (2) a general reward G that preserves overall reasoning quality and utility. The algorithm samples multiple completions per prompt, splits each into safety and response parts, computes A = R + G, and updates the policy accordingly.
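The reward bookkeeping of the second stage can be sketched as follows. This is a minimal illustration, assuming a completion format in which the safety block is delimited by hypothetical `<safety>…</safety>` tags; the concrete delimiters, the shape of the risk-aware reward R, and the GRPO normalization details are assumptions, not taken from the paper.

```python
import re
from statistics import mean, pstdev

def split_completion(text):
    """Split a completion into its leading safety block and the response.
    The <safety> tags are a hypothetical stand-in for the dedicated
    safety-reasoning block the SFT warm-up teaches the model to emit."""
    m = re.match(r"<safety>(.*?)</safety>(.*)", text, re.DOTALL)
    if m is None:
        return "", text
    return m.group(1).strip(), m.group(2).strip()

def risk_aware_reward(safety_block, risk_level):
    """Toy risk-aware reward R: the safety block's length should match the
    assessed risk (0 = benign, 1 = simple attack, 2 = complex attack)."""
    target = {0: 0, 1: 20, 2: 60}[risk_level]  # assumed target token counts
    n = len(safety_block.split())
    return -abs(n - target) / max(target, 1)   # 0 when length matches the risk

def grpo_advantages(completions, risk_level, general_reward):
    """Group-relative advantages over a sampled group, in the spirit of GRPO:
    A_i = (R_i + G_i - mean) / std."""
    totals = []
    for text in completions:
        safety, response = split_completion(text)
        totals.append(risk_aware_reward(safety, risk_level) + general_reward(response))
    mu, sigma = mean(totals), pstdev(totals)
    return [(t - mu) / (sigma or 1.0) for t in totals]
```

Under this sketch, a completion whose safety block is well-matched to the prompt's risk level receives a higher advantage than one in the same group that under- or over-reasons about safety, which is the pressure that makes the learned safe reasoning adaptive.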
Experiments on several base models—including DeepSeek‑distill, LLaMA‑2‑7B, and Falcon‑40B—demonstrate that RAPO dramatically reduces ASR on the WildJailbreak dataset from 68.7% (baseline) to 5.6% while maintaining or slightly improving performance on standard reasoning benchmarks such as MMLU, GSM‑8K, and HumanEval. Across the three StrataSword difficulty levels, RAPO consistently sustains a higher safety‑token proportion (≈30% or more) and achieves robust refusal rates even against the most sophisticated jailbreaks.
The authors claim three main contributions: (1) a unified theoretical and empirical view that safe reasoning adequacy must scale with attack complexity; (2) the RAPO framework that operationalizes adaptive safe reasoning via risk‑aware preference optimization; and (3) extensive empirical evidence that RAPO sets a new safety benchmark without sacrificing general utility.
Limitations are acknowledged: the risk‑aware reward function is currently hand‑crafted and may not capture all nuances of real‑world threats, and extremely multi‑step or novel jailbreak strategies could still evade detection. Future work is suggested to automate risk estimation, incorporate meta‑learning for reward shaping, and leverage human feedback to improve safety judgment reliability.
In summary, RAPO offers a principled, adaptable alignment approach that aligns the amount of safe reasoning with the assessed risk of a prompt, thereby substantially enhancing the practical safety of LRMs against a wide spectrum of jailbreak attacks.