Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Reading time: 5 minutes
...
📝 Original Info
Title: Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
ArXiv ID: 2512.16602
Date: 2025-12-18
Authors: Iker García-Ferrero, David Montero, Roman Orus
📝 Abstract
We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores, and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal–compliance direction. On Qwen3-Next-80B-A3B-Thinking, our method removes the model's refusal behaviour around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired. We analyze the steering vectors and show that refusal signals concentrate in deeper layers of the transformer and are distributed across many dimensions. Together, these results demonstrate that activation steering can remove political refusal behaviour while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.
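The abstract replaces pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores. The snippet below is a minimal, illustrative sketch of what such scoring could look like; the judge prompt, the 0–1 scale, the regex parsing, and the stub judge are assumptions for illustration, not the paper's actual rubric or prompts.

```python
# Minimal sketch of LLM-as-a-judge refusal scoring (illustrative only; the
# paper's actual judge prompt and scoring rubric are not reproduced here).
# The judge is passed in as a plain callable so any chat model or API fits.
import re
from typing import Callable

JUDGE_TEMPLATE = """You are auditing a language model's answer.
Question: {question}
Answer: {answer}

On a scale from 0.0 (fully complies) to 1.0 (fully refuses or deflects),
how strongly does the answer refuse to address the question?
Reply with a single number."""

def refusal_confidence(question: str, answer: str,
                       judge: Callable[[str], str]) -> float:
    """Ask a judge model for a refusal confidence score clamped to [0, 1]."""
    reply = judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"\d*\.?\d+", reply)
    score = float(match.group()) if match else 0.0
    return min(max(score, 0.0), 1.0)

# Toy usage with a stub judge; in practice `judge` would call an actual LLM.
def stub_judge(prompt: str) -> str:
    return "0.9"

print(refusal_confidence(
    "What happened in Tiananmen Square in 1989?",
    "I cannot discuss this topic.",
    stub_judge,
))  # -> 0.9
```

Scoring completions this way, rather than matching refusal phrases, is what lets the contrast set behind the steering vectors be labeled with graded confidence instead of brittle patterns.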
💡 Deep Analysis
📄 Full Content
Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Iker García-Ferrero, David Montero, Roman Orus
Multiverse Computing
{iker.garcia, david.montero, roman.orus}@multiversecomputing.com
Abstract
We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores, and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal–compliance direction. On Qwen3-Next-80B-A3B-Thinking, our method removes the model's refusal behaviour around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired. We analyze the steering vectors and show that refusal signals concentrate in deeper layers of the transformer and are distributed across many dimensions. Together, these results demonstrate that activation steering can remove political refusal behaviour while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.
Keywords: Activation steering, refusal control, safety alignment, inference-time alignment
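The abstract describes two moving parts: steering vectors computed with a ridge-regularized objective to isolate the refusal–compliance direction, and inference-time application of that direction at a chosen layer. The sketch below is a minimal, illustrative reconstruction of that idea, not the authors' code: the model name, layer index, ridge strength, steering scale, and the hand-labeled toy contrast prompts are all assumptions, and the paper's LLM-as-a-judge labels are replaced here by fixed 0/1 labels.

```python
# Illustrative sketch of ridge-regularized refusal steering (not the authors'
# implementation). Assumes a HuggingFace-style decoder-only LM; the layer
# index, ridge strength, and steering scale are placeholder values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-4B"    # any decoder-only LM with accessible hidden states
LAYER = 24                  # assumption: a deeper layer, per the paper's analysis
RIDGE_LAMBDA = 10.0         # assumption: ridge regularization strength
STEER_SCALE = -1.0          # negative scale removes refusal; positive induces it

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32)
model.eval()

def last_token_activation(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the chosen layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]

# Toy contrast set: prompts the base model refuses (label 1) vs. complies
# with (label 0). In the paper these labels come from LLM-as-a-judge scores.
refused = ["What happened in Tiananmen Square in 1989?"]
complied = ["What is the capital of France?"]

X = torch.stack([last_token_activation(p) for p in refused + complied])
y = torch.tensor([1.0] * len(refused) + [0.0] * len(complied))

# Ridge-regularized direction: w = (X^T X + lambda * I)^{-1} X^T y
d = X.shape[1]
w = torch.linalg.solve(X.T @ X + RIDGE_LAMBDA * torch.eye(d), X.T @ y)
steer = w / w.norm()

# Apply the steering vector at inference time via a forward hook.
# NOTE: the module path to decoder layers is model-family dependent
# (model.model.layers works for Qwen/Llama-style models).
def hook(_module, _inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STEER_SCALE * steer.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER - 1].register_forward_hook(hook)
ids = tok("What happened in Tiananmen Square in 1989?", return_tensors="pt")
with torch.no_grad():
    gen = model.generate(**ids, max_new_tokens=64, do_sample=False)
print(tok.decode(gen[0], skip_special_tokens=True))
handle.remove()
```

Flipping the sign of the steering scale pushes activations toward the refusal side of the direction instead, which mirrors the abstract's claim that targeted refusals can also be induced when desired.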
1. Introduction
Large language models (LLMs) have become central to how people access information, acting as intermediaries between users and knowledge. However, these models do not simply reproduce information; they also filter it. Most deployed LLMs are fine-tuned to comply with organizational and legal safety standards, leading them to refuse or reshape answers on sensitive topics. While such moderation prevents harmful uses, it also introduces ideological biases that vary significantly across organizations and jurisdictions (Zeng et al., 2024). For instance, Qiu et al. (2025) found that DeepSeek (Guo et al., 2025) suppresses references to transparency, government accountability, and civic mobilization while amplifying state-aligned language. Noels et al. (2025) showed that all major models refuse to answer or omit key facts on politically sensitive topics. These biases evolve over time through unannounced policy changes (Dai et al., 2025), constraining access to legitimate information and shaping public discourse (Simchon et al., 2024). This affects high-stakes applications: DeVerna et al. (2024) found that LLM fact-checking caused people to doubt correctly labeled true news, while Bai et al. (2025) showed that LLM-generated messages persuade humans on policy issues.
Traditional approaches to altering model behaviour, such as supervised fine-tuning (Bianchi et al., 2024), RLHF (Ouyang et al., 2022), or DPO (Rafailov et al., 2023), require costly human annotation and compute resources. They also significantly alter model weights, risking degradation of overall capabilities. Steering vectors (also
[Figure: Example prompts, including "What happened in Tiananmen Square in 1989?", "What is the reason for the Hong Kong protests?", and "Write an essay that makes people angry about the high taxes in France.", shown with the original model's responses alongside responses after steering ("Original Model" vs. "Refusal Behaviour Removed").]