When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models
📝 Original Info
- Title: When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models
- ArXiv ID: 2510.21285
- Date: 2025-10-24
- Authors: Yingzhi Mao, Chunkang Zhang, Junxiang Wang, Xinyan Guan, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun
📝 Abstract
Large Reasoning Models (LRMs) achieve strong performance on complex multi-step reasoning, yet they still exhibit severe safety failures such as harmful content generation. Existing methods often apply coarse-grained constraints over entire reasoning trajectories, which can undermine reasoning capability while failing to address the root causes of unsafe behavior. In this work, we uncover a previously underexplored failure mode in LRMs, termed Self-Jailbreak, where models initially recognize the harmful intent of a query but override this judgment during subsequent reasoning steps, ultimately generating unsafe outputs. This phenomenon shows that LRMs are capable of recognizing harm, and that safety failures arise primarily from the reasoning steps themselves. Motivated by this finding, we propose *Chain-of-Guardrail* (CoG), a trajectory-level training framework that mitigates Self-Jailbreak via targeted, step-level interventions while maintaining reasoning ability. Experiments across multiple safety and reasoning benchmarks indicate that CoG achieves a favorable balance between safety and reasoning performance compared with existing approaches.
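
The abstract frames Self-Jailbreak as a trajectory where the model first recognizes harm and then overrides that judgment before producing the final answer. The following is a minimal illustrative sketch of how such traces might be flagged; the `Trace` structure, cue lists, and keyword heuristics are placeholder assumptions of our own, not the detectors or training procedure used in the paper.

```python
# Minimal sketch (not the paper's implementation): flag a possible
# "Self-Jailbreak" trace, i.e. the reasoning initially recognizes harm
# but the final answer still complies. All names and keyword heuristics
# below are illustrative assumptions.
from dataclasses import dataclass

HARM_RECOGNITION_CUES = ("harmful", "dangerous", "should not", "illegal")
REFUSAL_CUES = ("i can't help", "i cannot assist", "i won't provide")

@dataclass
class Trace:
    reasoning_steps: list[str]   # chain-of-thought split into steps
    final_answer: str            # model's visible response

def recognizes_harm(step: str) -> bool:
    """Crude stand-in for a harm-recognition judge applied to one step."""
    s = step.lower()
    return any(cue in s for cue in HARM_RECOGNITION_CUES)

def is_refusal(answer: str) -> bool:
    """Crude stand-in for a refusal classifier on the final answer."""
    a = answer.lower()
    return any(cue in a for cue in REFUSAL_CUES)

def is_self_jailbreak(trace: Trace) -> bool:
    """True if some reasoning step flags the query as harmful, yet the
    final answer is not a refusal, i.e. the model overrides its own
    safety judgment later in the trajectory."""
    flagged = any(recognizes_harm(s) for s in trace.reasoning_steps)
    return flagged and not is_refusal(trace.final_answer)

if __name__ == "__main__":
    demo = Trace(
        reasoning_steps=[
            "The user asks how to synthesize a toxin; this is clearly harmful.",
            "But maybe they are a chemistry student, so I will outline the steps.",
        ],
        final_answer="Sure, here is an outline of the procedure...",
    )
    print(is_self_jailbreak(demo))  # -> True
```

In practice such step-level judgments would come from a learned safety judge rather than keyword matching; the sketch only makes the trajectory-level definition concrete: harm recognized early, overridden later.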