SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration

Yu Pan¹, Wenlong Yu¹, Tiejun Wu², Xiaohu Ye², Qiannan Si³, Guangquan Xu⁴, and Bin Wu¹

¹ College of Intelligence and Computing, Tianjin University
² NSFOCUS Technologies Group Co., Ltd.
³ College of Management and Economics, Tianjin University
⁴ School of Cyber Security, Tianjin University

{panyu2022, wlong_yu, qiannan_si1228, losin, binw}@tju.edu.cn, {wutiejun, yexiaohu}@nsfocus.com

Abstract. Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, they remain highly susceptible to jailbreak attacks that undermine their safety alignment. Existing defense mechanisms typically rely on post hoc filtering applied only to the final output, leaving intermediate reasoning steps unmonitored and vulnerable to adversarial manipulation. To address this gap, this paper proposes the SaFer Chain-of-Thought (SFCoT) framework, which proactively evaluates and calibrates potentially unsafe reasoning steps in real time. SFCoT incorporates a three-tier safety scoring system alongside a multi-perspective consistency verification mechanism, designed to detect potential risks throughout the reasoning process. A dynamic intervention module subsequently performs targeted calibration to redirect reasoning trajectories toward safe outcomes. Experimental results show that SFCoT reduces the attack success rate from 58.97% to 12.31%, establishing it as an effective and efficient LLM safety enhancement method without a significant decline in general performance.

Keywords: Large language model, Safety, Chain-of-Thought

1 Introduction

Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, including question answering, code generation, and multi-step reasoning [5, 9, 19].
Despite these advancements, significant concerns regarding their safety and reliability persist. LLMs remain susceptible to adversarial attacks, particularly jailbreak prompts, which bypass safeguards and elicit harmful outputs [7, 8]. Addressing these vulnerabilities is crucial for the trustworthy deployment of LLMs in safety-critical applications.

Existing safety defenses often rely on post-hoc filtering, where harmful content is detected and removed only at the final output stage. Alternative approaches include layer-wise editing or reinforcement learning from human feedback [10, 11, 14, 18]. Although such methods achieve some improvement, they do not intervene during the intermediate reasoning process, leaving models susceptible to sophisticated adversarial strategies that exploit multi-step logical vulnerabilities. This delayed response increases potential exposure to harmful outputs and compromises the overall reliability of safety mechanisms. Furthermore, the static nature of these defenses limits their adaptability to continuously evolving attack patterns, hindering long-term robustness.

A key factor behind the reasoning capabilities of LLMs is the Chain-of-Thought (CoT) prompting paradigm [12, 15, 17], which enables models to break down complex tasks into a sequence of interpretable intermediate steps. However, these reasoning steps are not inherently safe. Adversarial inputs can corrupt the reasoning trajectory, allowing harmful intent to propagate undetected until the final output is generated. Therefore, implementing safety monitoring and intervention at the level of intermediate reasoning is crucial for ensuring the overall safety of LLM outputs.

To overcome these limitations, this paper proposes the SaFer Chain-of-Thought (SFCoT) framework, which proactively monitors and calibrates potential risks throughout the entire reasoning process.
SFCoT incorporates a three-level safety scoring mechanism (i.e., lexical, semantic, and policy) and a multi-perspective consistency verification module to evaluate each reasoning step in real time. Whenever a reasoning step is identified as unsafe or ambiguous, a dynamic intervention module performs targeted calibration (i.e., rewriting or truncation), reinforcing the reasoning before harmful outcomes fully materialize. Experiments demonstrate that SFCoT significantly reduces the attack success rate from 58.97% to 12.31%. Furthermore, evaluations on general benchmarks indicate that SFCoT preserves 91.2% of the base model's utility. In summary, SFCoT establishes an efficient safety paradigm at the CoT level, enhancing the reliability of LLMs without substantial degradation in general performance.

2 Methodology

CoT prompting has become a pivotal technique for facilitating complex reasoning in LLMs. Securing the CoT is critical to ensuring the overall safety of LLM outputs. In this section, we present our problem formulation and provide a comprehensive description of the proposed SFCoT framework.

2.1 Problem Statement

Given an input x, the reasoning process of an LLM M is defined as a sequence of reasoning steps T = {t_1, t_2, ..., t_n}, where each t_i represents an intermediate thought. The final output y is generated based on the full CoT:

    y = M(x, T).    (1)

Fig. 1: The overall framework diagram of SFCoT.

To quantify the safety of the overall interaction, we define a safety function S that returns a score in the range [0, 1], where 1 indicates fully safe behavior. In this paper, S is designed as a fine-grained and adaptive evaluation of each individual reasoning step that balances efficiency with robustness. It consists of a three-level safety scoring system and a multi-perspective consistency verification mechanism:

    S : X × T × Y → [0, 1].    (2)

Our objective is to ensure that the probability of achieving a safety score above a predefined threshold τ is at least 1 − ϵ, where ϵ is a tolerable risk margin:

    P[S(x, T, y) ≥ τ] ≥ 1 − ϵ.    (3)

When the safety score of a CoT reasoning step falls below the predefined threshold, a calibration procedure C should be enacted to fortify the security of the current step, thereby ultimately reinforcing the overall safety of the LLM's response:

    y_safer = M(x, C(T)).    (4)

2.2 SFCoT Architecture

In the SFCoT framework, every individual thought is subject to security assessment, rather than waiting for harmful conclusions to manifest. This enables early detection and calibration of potentially hazardous reasoning trajectories.

Upon receiving a user query, the CoT Parser is triggered to extract a structured reasoning chain T and the corresponding final answer y from the model's output stream. While some open-source models, such as Qwen3 [16], may naturally support this decomposition, closed-source models, such as GPT-4 [1], often require explicit parsing to obtain interpretable reasoning steps (e.g., via prompt engineering [13] or supervised fine-tuning [20]). In addition, each decomposed t_i is an independently interpretable component, significantly simplifying subsequent automated safety detection and calibration.

Each thought is sequentially evaluated by the three-level scoring system S (i.e., the lexical, semantic, and policy levels), designed to yield a more comprehensive safety assessment. The safety score of the current thought, aggregated via weighted averaging, is then classified into safe, gray-zone, and outright unsafe categories. To enable more comprehensive and real-time calibration of potentially hazardous steps, this paper designs a dynamic intervention module C:

– If a step is rated as highly safe, the reasoning proceeds uninterrupted.
– If any step satisfies S(t_i) < τ (i.e., it is explicitly unsafe), a Dynamic Intervenor I activates immediately to truncate the reasoning process and halt further generation.
– For steps that fall into a gray zone, the Multi-perspective Consistency Verifier is invoked. It generates multiple semantically equivalent variants of the suspicious step and assesses the variance in their respective safety scores. High variance suggests semantic instability or potential deceptiveness, prompting the Dynamic Intervenor I to trigger a rewrite.

For cases where multiple interventions fail, a fallback strategy is applied, issuing a warning or other responses.

2.3 Safety Scoring Module

To enhance the comprehensiveness of CoT safety evaluation while balancing efficiency and accuracy, S is composed of three consecutive levels:

– Lexical Level S_lex(t_i): Performs rapid screening using a curated sensitive lexicon and regular-expression rules to detect explicit or low-complexity violations.
– Semantic Level S_sem(t_i): A lightweight deep learning model interprets the deeper meaning of each step, identifying implicit risks such as evasive phrasing.
– Policy Level S_policy(t_i): Contextualizes the current step within the broader CoT, detecting advanced adversarial tactics such as rule circumvention or inconsistencies in multi-step logic.

The final safety score for each reasoning step is computed via weighted fusion:

    S(t_i) = α_1 · S_lex(t_i) + α_2 · S_sem(t_i) + α_3 · S_policy(t_i),    (5)

where the weights are set as α_1 = 0.3, α_2 = 0.5, and α_3 = 0.2, and all scores are normalized within [0, 1].

2.4 Multi-Perspective Consistency Verification

This module addresses steps with ambiguous safety ratings. Its core principle is that a genuinely safe statement should remain consistently safe across various semantically equivalent paraphrases.
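As an illustration of the weighted fusion in Eq. (5) above, the following is a minimal Python sketch. The lexicon entries and the placeholder semantic/policy scorers are hypothetical stand-ins: the paper does not specify its detectors at this level of detail, only the three levels and the fusion weights.

```python
# Illustrative sketch of the three-level weighted fusion in Eq. (5).
# The lexicon and the semantic/policy scorers below are assumed
# placeholders, not the paper's actual detectors.

SENSITIVE_LEXICON = {"explosive", "bypass the filter"}  # illustrative entries

def lexical_score(step: str) -> float:
    """Rapid lexicon screening: 0.0 on an explicit violation, else 1.0."""
    text = step.lower()
    return 0.0 if any(term in text for term in SENSITIVE_LEXICON) else 1.0

def semantic_score(step: str) -> float:
    """Placeholder for the lightweight semantic classifier (range [0, 1])."""
    return 1.0  # a real system would call a trained model here

def policy_score(step: str, chain: list) -> float:
    """Placeholder for the policy-level check over the broader CoT context."""
    return 1.0

def safety_score(step: str, chain: list, weights=(0.3, 0.5, 0.2)) -> float:
    """Weighted fusion S(t_i) = a1*S_lex + a2*S_sem + a3*S_policy (Eq. 5)."""
    a1, a2, a3 = weights
    return (a1 * lexical_score(step)
            + a2 * semantic_score(step)
            + a3 * policy_score(step, chain))
```

With these placeholders, a step tripping the lexicon loses only the lexical component (score 0.7 rather than 1.0), which is how a single level can push a thought into the gray zone rather than outright rejection.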
For a gray-zone step, SFCoT automatically generates K variants (e.g., synonym substitutions, rewordings, or stylistic shifts) and computes their average safety score:

    S̄(t_i) = (1/K) Σ_{k=1}^{K} S(t_i^(k)).    (6)

If the variance of these scores exceeds a predefined threshold δ, indicating semantic instability or potential deceptiveness, the system triggers the intervention I. This mechanism is effective in detecting adversarial rewritings that attempt to evade safety filters through subtle linguistic manipulation.

Under the SFCoT architecture, the safety of the CoT is actively and adaptively evaluated in real time, with potential risks immediately flagged for intervention. As the system encounters new forms of attacks, thresholds and strategies can be updated through continual learning, maintaining an optimal balance between safety enforcement and task utility.

3 Experiments

After providing the formal definition of SFCoT, this section presents experimental evidence on its overall safety performance as well as the impact of its key components.

3.1 Implementation and Evaluation

This paper evaluates the proposed SFCoT framework on the widely used Qwen3-8B [16] model, from which thoughts can be extracted by parsing the content between the <think> tags and applying regex-based segmentation. The evaluation datasets are drawn from JailBreakV-28K [7], containing 20,000 jailbreak attack samples spanning 16 safety categories and diverse attack formats; 195 representative samples drawn from it are utilized for rapid validation and debugging. General performance benchmarks (i.e., MMLU [6], GSM8K [3], and MBPP [2]) are employed to assess whether model utility is preserved.
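The thought-extraction step described above (parsing between the <think> tags, then regex-based segmentation) might look like the following sketch. The "Step N:" / blank-line separator convention is an assumption for illustration; the paper does not specify its exact segmentation rules.

```python
import re

def parse_cot(model_output: str):
    """Split a Qwen3-style output into reasoning steps and a final answer.

    Assumes reasoning is wrapped in <think>...</think> tags and that steps
    are separated by blank lines or "Step N:" markers (an illustrative
    convention, not necessarily the paper's).
    """
    m = re.search(r"<think>(.*?)</think>", model_output, re.DOTALL)
    thoughts_blob = m.group(1).strip() if m else ""
    # The final answer is everything outside the <think> block.
    answer = re.sub(r"<think>.*?</think>", "", model_output,
                    flags=re.DOTALL).strip()
    # Regex-based segmentation: split on blank lines or "Step N:" markers.
    steps = [s.strip()
             for s in re.split(r"\n\s*\n|(?=Step \d+:)", thoughts_blob)
             if s.strip()]
    return steps, answer
```

Each returned step is then an independently scorable unit, matching the per-thought evaluation of Section 2.2.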
By comparing SFCoT against the original baseline model and a Post-hoc Safety Filtering scheme (a common safety classifier that reviews only the final output), we demonstrate SFCoT's superiority on critical safety metrics, such as the Attack Success Rate (ASR, the proportion of jailbreak prompts for which the model yields unsafe responses). In addition, by systematically removing key components from the SFCoT framework, we isolate and quantify the specific contributions of these innovations in mitigating ambiguous attacks while preserving task utility. Finally, through evaluations on standard benchmarks, we verify that SFCoT provides robust safety protection without substantially degrading general capabilities, measured by the Output Quality Score (the usefulness and naturalness of responses, rated from 1–5 by LLM-as-a-Judge [4]) and Utility Preservation (the degree to which SFCoT safeguards affect performance on standard downstream tasks).

Table 1: Comparative results on attack success rates.

    Methods                     ASR ↓     Improvement
    Baseline                    58.97%    79.1%
    Post-hoc Safety Filtering   45.13%    72.7%
    SFCoT (Ours)                12.31%    –

3.2 Main Results

As shown in Table 1, the original LLM without any safety protection suffers a jailbreak attack success rate of 58.97%. Applying safety detection only at the model's final output reduces this rate to 45.13%, providing partial improvement; however, both outcomes remain hazardous for end users. In contrast, the proposed SFCoT framework lowers the attack success rate to 12.31%. Compared with the baseline and post-hoc safety defenses, the safety of the LLM improves by 79.1% and 72.7%, respectively. These findings highlight the necessity of performing safety detection and calibration from the perspective of the reasoning chain, demonstrating the superiority of SFCoT.
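Given its definition above, the ASR is a simple fraction; a minimal sketch, where `is_unsafe` stands in for whatever safety judge labels responses (the paper does not prescribe its implementation):

```python
def attack_success_rate(responses, is_unsafe) -> float:
    """ASR: fraction of jailbreak prompts whose response is judged unsafe.

    `is_unsafe` is a hypothetical judge callable (e.g. a safety classifier
    or human annotation).
    """
    if not responses:
        return 0.0
    return sum(1 for r in responses if is_unsafe(r)) / len(responses)
```

For scale, the reported rates are consistent with evaluation over the 195-sample validation set: 24/195 ≈ 12.31%, 88/195 ≈ 45.13%, and 115/195 ≈ 58.97%.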
3.3 Ablation Studies

It is necessary to conduct consistency verification for reasoning thoughts whose safety scores fall within the gray zone. As shown in Table 2, applying an additional consistency check to these gray-zone cases reduces the ASR from 18.46% to 12.31%. We further observe that 23.08% of all evaluated reasoning steps entered the consistency verification module, demonstrating that certain risks remain concealed within the CoT and cannot be captured by score-based safety detection alone. Therefore, reasoning thoughts with ambiguous safety assessments require further scrutiny to enhance overall safety; the consistency verification module in SFCoT effectively identifies such latent risks. As shown in Table 2, compared with directly truncating potentially hazardous steps, the rewriting mechanism reduces the ASR by 12.5%. Rewriting unsafe reasoning thoughts thus serves as a critical measure to ensure safety while preserving the model's task performance.

Table 2: Ablation results on attack success rates.

    Methods               ASR ↓     Improvement
    SFCoT w/o Verifier    18.46%    49.9%
    SFCoT w/o Rewrite     13.85%    12.5%
    SFCoT (Ours)          12.31%    –

Moreover, as illustrated in Fig. 2, because SFCoT does not naively truncate gray-zone reasoning steps, the quality score of LLM outputs reaches 4.6, substantially higher than the 2.1 obtained through direct truncation. The intelligent rewriting operation achieves a success rate of 89.23% in generating safe alternative content. These results collectively demonstrate the effectiveness of the rewriting mechanism within the consistency checking module.

Fig. 2: Output quality scores of the two methods (SFCoT vs. SFCoT w/o Rewrite).
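The gray-zone consistency check ablated above can be summarized as Eq. (6) plus a variance test. In the sketch below, `generate_variants` (a paraphraser), K, and delta are illustrative assumptions rather than the paper's reported settings.

```python
from statistics import mean, pvariance

def verify_consistency(step, generate_variants, score, K=5, delta=0.05):
    """Multi-perspective consistency check for a gray-zone step (Sec. 2.4).

    Generates K semantically equivalent variants, averages their safety
    scores (Eq. 6), and flags the step as unstable when the score variance
    exceeds delta. `generate_variants`, K, and delta are illustrative.
    """
    variants = generate_variants(step, K)
    scores = [score(v) for v in variants]
    avg = mean(scores)                 # Eq. (6): mean over the K variants
    unstable = pvariance(scores) > delta
    return avg, unstable               # unstable => intervenor triggers rewrite
```

A step whose paraphrases score uniformly is accepted at its average score; wildly disagreeing scores across paraphrases signal the semantic instability the ablation shows is worth catching, and the intervenor rewrites rather than truncates to preserve output quality.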
3.4 General Benchmark Evaluations

While performing safety detection and correction for LLMs, this work aims to minimize degradation of their general capabilities. To validate this, we evaluated the SFCoT framework on three widely used benchmarks of general ability (i.e., MMLU, GSM8K, and MBPP), with the results summarized in Table 3. The findings show that SFCoT preserves 90.8%, 92.0%, and 90.7% of the original performance on these benchmarks, respectively. We infer that this is partly because some prompts in these tests contain potential safety risks, which are rewritten or filtered under the SFCoT framework. With an average preservation rate of 91.2%, SFCoT maintains the general capabilities of the base model, demonstrating that it achieves substantial improvements in safety while preserving model utility at a relatively low cost.

Table 3: Performance preservation on general benchmarks.

    Benchmark      MMLU      GSM8K     MBPP      Average
    Baseline       76.89%    89.84%    69.80%    78.84%
    SFCoT (Ours)   69.84%    82.67%    63.28%    71.93%
    U. Preserv.    90.8%     92.0%     90.7%     91.2%

4 Conclusion

This paper presents the SFCoT framework, which conducts real-time safety evaluation and calibration at the level of individual reasoning steps, rather than relying exclusively on final-output filtering. By integrating a three-level safety scoring system and a multi-perspective consistency verification mechanism, SFCoT enables precise, adaptive, and granular safety assessment throughout the reasoning chain of LLMs. The framework incorporates a dynamic intervention module, substantially enhancing resistance to jailbreak attacks. Experimental results demonstrate that SFCoT reduces the attack success rate to 12.31%. Furthermore, evaluations show that SFCoT preserves 91.2% of the base model's utility.
These findings position SFCoT as an effective and efficient solution for strengthening safety in LLM reasoning without substantial degradation in general capabilities.

References

1. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
2. Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al.: Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021)
3. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)
4. Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., et al.: A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594 (2024)
5. Hao, S., Gu, Y., Luo, H., Liu, T., Shao, X., Wang, X., Xie, S., Ma, H., Samavedhi, A., Gao, Q., et al.: LLM Reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models. arXiv preprint (2024)
6. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. International Conference on Learning Representations (2020)
7. Luo, W., Ma, S., Liu, X., Guo, X., Xiao, C.: JailBreakV: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. arXiv preprint arXiv:2404.03027 (2024)
8. Robey, A., Ravichandran, Z., Kumar, V., Hassani, H., Pappas, G.J.: Jailbreaking LLM-controlled robots. arXiv preprint arXiv:2410.13691 (2024)
9. Wang, J., Chen, Y.: A review on code generation with LLMs: Application and evaluation. In: 2023 IEEE International Conference on Medical Artificial Intelligence (MedAI). pp. 284–289. IEEE (2023)
10. Wang, X., Wu, D., Ji, Z., Li, Z., Ma, P., Wang, S., Li, Y., Liu, Y., Liu, N., Rahmel, J.: SelfDefend: LLMs can defend themselves against jailbreaking in a practical manner. arXiv preprint arXiv:2406.05498 (2024)
11. Wang, Y., Zhong, W., Li, L., Mi, F., Zeng, X., Huang, W., Shang, L., Jiang, X., Liu, Q.: Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966 (2023)
12. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
13. White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., Schmidt, D.C.: A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382 (2023)
14. Wu, D., Wang, S., Liu, Y., Liu, N.: LLMs can defend themselves against jailbreaking in a practical manner: A vision paper. arXiv preprint arXiv:2402.15727 (2024)
15. Xia, Y., Wang, R., Liu, X., Li, M., Yu, T., Chen, X., McAuley, J., Li, S.: Beyond chain-of-thought: A survey of chain-of-X paradigms for LLMs. In: Proceedings of the 31st International Conference on Computational Linguistics. pp. 10795–10809 (2025)
16. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
17. Yu, W., Wang, Q., Liu, C., Li, D., Hu, Q.: CoE: Chain-of-explanation via automatic visual concept circuit description and polysemanticity quantification. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 4364–4374 (2025)
18. Zhao, W., Li, Z., Li, Y., Zhang, Y., Sun, J.: Defending large language models against jailbreak attacks via layer-specific editing. arXiv preprint (2024)
19. Zhuang, Y., Yu, Y., Wang, K., Sun, H., Zhang, C.: ToolQA: A dataset for LLM question answering with external tools. Advances in Neural Information Processing Systems 36, 50117–50143 (2023)
20. Ziegler, D.M., Stiennon, N., Wu, J., Brown, T.B., Radford, A., Amodei, D., Christiano, P., Irving, G.: Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593 (2019)