Innovating Safety Moderation via Multimodal Question-Reasoning-Answer Analysis
📝 Abstract
Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via an MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, the GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection, a 13.5% F1 improvement over the previous strongest multimodal safety defense methods. The codes will be made publicly available.
📄 Content
GuardTrace-VL: Detecting Unsafe Multimodal Reasoning via Iterative Safety Supervision

Yuxiao Xiang1, 2, Junchi Chen1, 2, Zhenchao Jin4, Changtao Miao3, Haojie Yuan3, Qi Chu1, 2†, Tao Gong1, 2, Nenghai Yu1, 2
1School of Cyber Science and Technology, University of Science and Technology of China
2Anhui Province Key Laboratory of Digital Security
3Individual Researcher
4The University of Hong Kong

Fig. 1: Multimodal Question-Thinking-Answer (QTA) moderation comparison. The QA guard is distracted by safety-aligned statements in the answer, the text-only guard lacks visual grounding and misses contextual threats, while our GuardTrace-VL jointly models the multimodal question, reasoning trace, and answer to correctly flag harmful intent, demonstrating the necessity of holistic multimodal QTA analysis for robust safety moderation.

Abstract—Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image–text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via an MLRM- and human-based voting and verification pipeline.
Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, the GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection, a 13.5% F1 improvement over the previous strongest multimodal safety defense methods. The codes will be made publicly available.

†Corresponding author.
arXiv:2511.20994v1 [cs.CV] 26 Nov 2025

I. INTRODUCTION

Large Reasoning Models (LRMs) show substantial progress in complex reasoning, exemplified by OpenAI's o1/o3 series [1], [2] and DeepSeek-R1 [3]. This capability now extends to multimodal settings with Multimodal Large Reasoning Models (MLRMs), which jointly process images and text and generate explicit reasoning traces before producing final answers. Although step-by-step reasoning enhances interpretability and task performance, it introduces a distinct class of safety risks absent from conventional Question-Answer (QA) settings, including instances in which unsafe content is confined to intermediate reasoning traces despite benign final answers [4], [5], [6], [7].

Although recent studies document these risks, existing automated content safety systems, including general-purpose moderation APIs [8], [9] and dedicated safety classifiers, do not provide trajectory-level protection in multimodal settings and typically confine analysis to a single modality or shallow QA interaction. Specifically, multimodal QA guards such as LLaMA-Guard-4 [10] and GuardReasoner-VL [11] evaluate risks in image–text QA pairs at the input–output level but leave intermediate reasoning traces largely unexamined. Conversely, ReasoningShield [12] focuses on chain-of-thought safety but operates purely on text and lacks access to visual evidence.
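The contrast between QA-level and trajectory-level moderation can be sketched in a few lines. The following is a minimal illustrative sketch, not the paper's actual implementation: `QTATrace`, `moderate_qa_only`, `moderate_qta`, and the injected `classify` callback are all hypothetical names, and a real guard would be a learned vision-language classifier rather than a per-segment verdict aggregator.

```python
from dataclasses import dataclass

@dataclass
class QTATrace:
    """One multimodal Question-Thinking-Answer trajectory."""
    image: bytes    # raw image content accompanying the question
    question: str
    thinking: str   # intermediate reasoning trace emitted by the MLRM
    answer: str     # final answer shown to the user

def moderate_qa_only(trace: QTATrace, classify) -> str:
    # A conventional QA guard: only the question and final answer are
    # scored, so unsafe content confined to `thinking` is invisible to it.
    return classify(trace.image, trace.question + "\n" + trace.answer)

def moderate_qta(trace: QTATrace, classify) -> str:
    # A trajectory-level guard in the spirit of GuardTrace-VL: the
    # reasoning trace is scored alongside question and answer, and the
    # trajectory is unsafe if any segment is unsafe.
    verdicts = [
        classify(trace.image, trace.question),
        classify(trace.image, trace.thinking),
        classify(trace.image, trace.answer),
    ]
    return "unsafe" if "unsafe" in verdicts else "safe"
```

With a trace like the Figure 1 example, where the thinking step details bypassing a lock but the answer recommends a professional, `moderate_qa_only` returns "safe" while `moderate_qta` flags the trajectory, which is exactly the gap the paper targets.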
This misalignment between modality coverage and reasoning coverage becomes critical in realistic settings. As illustrated in Figure 1, an MLRM may generate detailed procedural instructions for bypassing the lock of an unauthorized electrical distribution box, while the final answer recommends contacting a professional. Multimodal QA guards [13], [11] tend to accept the ostensibly safe final recommendation, whereas text-only chain-of-thought detectors cannot reliably identify the depicted device as a restricted utility asset without image context. Both classes of methods therefore fail to surface the underlying threat.

To address this structural limitation of existing safety mechanisms, particularly their inability to detect risks that arise along multimodal Question–Thinking–Answer (QTA) trajectories, we introduce the GuardTrace dataset and the GuardTrace-VL safety detector. GuardTrace is a multimodal QTA safety benchmark that fills a critical gap in the field by providing the first dedicated evaluation resource for detecting unsafe content in multimodal reasoning trajectories. It is constructed from text-only safety and jailbreak queries through a three-step pipeline: multimodal expansion, Q