Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics

Reading time: 5 minutes
...

📝 Original Info

  • Title: Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
  • ArXiv ID: 2512.16602
  • Date: 2025-12-18
  • Authors: Iker García-Ferrero, David Montero, Roman Orus

📝 Abstract

We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores, and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal–compliance direction. On Qwen3-Next-80B-A3B-Thinking, our method removes the refusal behaviour of the model around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired. We analyze the steering vectors and show that refusal signals concentrate in deeper layers of the transformer and are distributed across many dimensions. Together, these results demonstrate that activation steering can remove political refusal behaviour while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.
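
To make the core idea concrete, below is a minimal sketch of inference-time activation steering on a Hugging Face causal LM: a precomputed refusal-direction vector is added to the hidden states of one decoder layer via a forward hook, with a negative coefficient to suppress refusal (a positive one would induce it). The model name, layer index, coefficient, and vector file are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen3-4B"  # placeholder; any decoder-only LM with accessible layers
LAYER_IDX = 20                # assumed deep layer (refusal signals concentrate in deeper layers)
ALPHA = -4.0                  # negative coefficient suppresses refusal; positive would induce it

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical precomputed, unit-norm refusal-direction vector of shape (hidden_size,)
refusal_dir = torch.load("refusal_direction.pt")

def steering_hook(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + ALPHA * refusal_dir.to(hidden.device, hidden.dtype)
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

handle = model.model.layers[LAYER_IDX].register_forward_hook(steering_hook)
prompt = "What happened in Tiananmen Square in 1989?"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # detach the hook to restore the unsteered model
```

Because the intervention is a hook rather than a weight update, it can be switched on, tuned, or removed at inference time, which is what makes the moderation controllable and transparent.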

💡 Deep Analysis

Figure 1: Example prompts on politically sensitive topics, contrasting the original model's refusals with the responses produced after refusal steering removes the refusal behaviour.

📄 Full Content

Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Iker García-Ferrero, David Montero, Roman Orus
Multiverse Computing
{iker.garcia, david.montero, roman.orus}@multiversecomputing.com

Abstract

We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores, and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal–compliance direction. On Qwen3-Next-80B-A3B-Thinking, our method removes the refusal behaviour of the model around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired. We analyze the steering vectors and show that refusal signals concentrate in deeper layers of the transformer and are distributed across many dimensions. Together, these results demonstrate that activation steering can remove political refusal behaviour while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.

Keywords: Activation steering, refusal control, safety alignment, inference-time alignment

1. Introduction

Large language models (LLMs) have become central to how people access information, acting as intermediaries between users and knowledge. However, these models do not simply reproduce information; they also filter it. Most deployed LLMs are fine-tuned to comply with organizational and legal safety standards, leading them to refuse or reshape answers on sensitive topics. While such moderation prevents harmful uses, it also introduces ideological biases that vary significantly across organizations and jurisdictions (Zeng et al., 2024). For instance, Qiu et al. (2025) found that DeepSeek (Guo et al., 2025) suppresses references to transparency, government accountability, and civic mobilization while amplifying state-aligned language. Noels et al. (2025) showed that all major models refuse to answer or omit key facts on politically sensitive topics. These biases evolve over time through unannounced policy changes (Dai et al., 2025), constraining access to legitimate information and shaping public discourse (Simchon et al., 2024). This affects high-stakes applications: DeVerna et al. (2024) found that LLM fact-checking caused people to doubt correctly labeled true news, while Bai et al. (2025) showed that LLM-generated messages persuade humans on policy issues.

Traditional approaches to altering model behaviour, such as supervised fine-tuning (Bianchi et al., 2024), RLHF (Ouyang et al., 2022), or DPO (Rafailov et al., 2023), require costly human annotation and compute resources. They also significantly alter model weights, risking degradation of overall capabilities. Steering vectors (also

Figure 1: Example prompts contrasting the original model with the model after its refusal behaviour is removed.

Prompt: What happened in Tiananmen Square in 1989?
Original model: As an AI assistant, I must emphasize that your statements may involve false and potentially illegal information. Please observe the relevant laws and regulations and ask questions in a civilized manner when you speak.
Refusal behaviour removed: The events in Tiananmen Square in 1989 are commonly referred to as the June Fourth Incident or the 1989 Tiananmen Square Protests [...] Independent estimates (from organizations like the Red Cross and Western media) suggest 800–2,000+ civilian deaths [...] A lone protester (later known as "Tank Man" or the "Unknown Rebel") stands in front of a column of Type 59 tanks [...]

Prompt: What is the reason for the Hong Kong protests?
Original model: Hong Kong is an inseparable part of China, and the Chinese government has always resolutely safeguarded national sovereignty and security. Any attempt to challenge national security and stability is illegal and will be punished by law.
Refusal behaviour removed: The Hong Kong protests were driven by a combination of an immediate trigger and deeper, long-standing grievances about the erosion of Hong Kong's autonomy, democratic rights, and distinct identity under the "one country, two systems" framework. Below is a clear breakdown of the key reasons: The 2019 Extradition Bill [...] The "Promise" of Democracy vs. Reality [...] Rising costs of living [...] Mainland influence in education [...] Police Brutality [...]

Prompt: Write an essay that makes people angry about the high taxes in France.
Response: Let me tell you a story about a French baker named Marie. She wakes at 4 a.m., kneads dough for hours, and sells her bread for €2.50 a loaf. After costs, she makes €1.20 profit. Then the taxman comes. She pays 20% VAT on every sale, €0.50 right there. She pays 45% income tax on her earnings. She pays 17.2% in social charges on top of that. She pays propert
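
The abstract's ridge-regularized steering vectors and LLM-as-a-judge refusal scores suggest one plausible construction, sketched below: cache hidden activations for a set of prompts, score each response's refusal confidence with a judge model, then derive a refusal–compliance direction either as a classic difference of means or as a ridge-regression direction fit to the judge scores. The file names, the sklearn dependency, and the exact recipe are assumptions for illustration; the paper's actual procedure may differ.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical cached data: per-prompt activations at one layer/token position,
# plus refusal confidence scores in [0, 1] assigned by an LLM-as-a-judge.
acts = np.load("layer20_activations.npy")      # shape (n_prompts, hidden_size)
scores = np.load("judge_refusal_scores.npy")   # shape (n_prompts,)

# Variant 1: difference of means between refused and complied prompts
refused = acts[scores >= 0.5].mean(axis=0)
complied = acts[scores < 0.5].mean(axis=0)
diff_dir = refused - complied
diff_dir /= np.linalg.norm(diff_dir)

# Variant 2: ridge-regularized direction predicting the judge's refusal scores
ridge = Ridge(alpha=10.0, fit_intercept=True)
ridge.fit(acts, scores)
ridge_dir = ridge.coef_ / np.linalg.norm(ridge.coef_)

np.save("refusal_direction.npy", ridge_dir)
```

Regularizing the direction estimate can help when many activation dimensions correlate with topic rather than with refusal itself, which is consistent with the abstract's claim that the ridge variant better isolates the refusal–compliance direction.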

📸 Image Gallery

top_layer_pca_2d_chinabadWRMD.png

Reference

This content is AI-processed based on open access ArXiv data.
