There Is More to Refusal in Large Language Models than a Single Direction

There Is More to Refusal in Large Language Models than a Single Direction
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Prior work argues that refusal in large language models is mediated by a single activation-space direction, enabling effective steering and ablation. We show that this account is incomplete. Across eleven categories of refusal and non-compliance, including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal, we find that these refusal behaviors correspond to geometrically distinct directions in activation space. Yet despite this diversity, linear steering along any refusal-related direction produces nearly identical refusal to over-refusal trade-offs, acting as a shared one-dimensional control knob. The primary effect of different directions is not whether the model refuses, but how it refuses.


💡 Research Summary

The paper challenges the prevailing view that refusal behavior in large language models (LLMs) is mediated by a single linear direction in activation space. Instead, the authors examine eleven distinct categories of refusal and non‑compliance—including safety refusals, incomplete or unsupported requests, anthropomorphizing prompts, unqualified professional advice, inappropriate topics, assistance with crimes, and over‑refusal of benign inputs. For each category they collect residual‑stream activations from two instruction‑tuned models (Gemma‑2‑9B‑it and Llama‑3.1‑8B‑Instruct) at a fixed token position, compute a “refusal direction” as the mean difference between refusal‑inducing and benign prompts, and compare these directions using cosine similarity. The resulting similarity matrix shows many directions are only moderately aligned (0.4–0.6) and some are nearly orthogonal, indicating that different refusal types correspond to geometrically distinct subspaces.

Despite this geometric diversity, the authors find that linear steering along any of the eleven directions yields almost identical trade‑offs between refusal rate on harmful prompts and over‑refusal rate on benign prompts. They implement steering by adding α r to the residual activation (where r is the normalized direction) and evaluate on a balanced control set of 200 prompts (harmful vs. benign, each split into correctly refused/complied, jailbreak, and over‑refusal cases). As α increases, refusal of harmful inputs rises while over‑refusal of benign inputs also rises, producing a nearly universal curve regardless of which direction is used. This suggests the existence of a single one‑dimensional “control knob” that governs the overall propensity to refuse, independent of the underlying direction.

To uncover the mechanistic basis of this phenomenon, the authors employ sparse autoencoders (SAEs) trained on the same residual streams. SAEs encode each activation into a sparse latent vector z and a set of decoder directions d that map latents back to activation space. By measuring firing‑rate differences (Δ) between refusal and non‑refusal examples for each latent, they select the top‑K latents per category as “refusal latents.” Averaging the decoder directions of these latents reproduces the activation‑space refusal directions, and steering with β d_SAE yields the same refusal/over‑refusal trade‑off as the raw directions.

Crucially, the latent analysis reveals a two‑tier structure: a substantial core of latents is shared across all eleven categories, forming a reusable “refusal core,” while a long tail of category‑specific latents modulates style and tone. Amplifying only the core latents produces generic refusals, whereas adding style‑specific latents yields variations such as policy‑based denials, clarification requests, apologetic deflections, or softer rejections. This explains why different directions affect “how” the model refuses rather than “whether” it refuses.

Ablation experiments further support the multi‑path view. Removing the safety‑derived direction suppresses refusals in most cases, but eliminating CoCoNot‑derived directions leaves a residual refusal rate of 30‑50 %, indicating alternative pathways. Llama‑3.1‑8B‑Instruct shows weaker suppression from safety‑direction ablation, suggesting richer redundancy compared to Gemma.

In summary, the study demonstrates that while refusal behavior can be controlled by a single scalar steering parameter, the internal representation of refusal is a structured, multi‑latent system comprising a shared core and numerous style‑specific components. Linear interventions flatten this complexity into a one‑dimensional control surface, but a deeper latent‑level analysis is required to understand stylistic nuances. The findings reconcile the apparent simplicity of refusal steering with the observed diversity of refusal styles, and they delineate the limits of purely linear interpretability for aligned model behavior.


Comments & Academic Discussion

Loading comments...

Leave a Comment