Steering Safely or Off a Cliff? Rethinking Specificity and Robustness in Inference-Time Interventions

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Model steering, which involves intervening on hidden representations at inference time, has emerged as a lightweight alternative to finetuning for precisely controlling large language models. While steering efficacy has been widely studied, evaluations of whether interventions alter only the intended property remain limited, especially with respect to unintended changes in behaviors related to the target property. We call this notion specificity. We propose a framework that distinguishes three dimensions of specificity: general (preserving fluency and unrelated abilities), control (preserving related control properties), and robustness (preserving control properties under distribution shifts). We study two safety-critical use cases: steering models to reduce overrefusal and faithfulness hallucinations, and show that while steering achieves high efficacy and largely maintains general and control specificity, it consistently fails to preserve robustness specificity. In the case of overrefusal steering, for example, all steering methods reduce overrefusal without harming general abilities and refusal on harmful queries; however, they substantially increase vulnerability to jailbreaks. Our work provides the first systematic evaluation of specificity in model steering, showing that standard efficacy and specificity checks are insufficient, because without robustness evaluation, steering methods may appear reliable even when they compromise model safety.


💡 Research Summary

The paper investigates inference‑time model steering—intervening on hidden representations of large language models (LLMs) to modify specific behaviors without full fine‑tuning. While prior work has largely focused on steering efficacy (i.e., does the intervention achieve the intended change), the authors argue that a crucial, under‑examined dimension is specificity: does the intervention affect only the target property and leave everything else untouched? To operationalize this, they introduce a three‑part specificity framework.

  1. General specificity assesses whether steering preserves overall language abilities such as fluency (measured by perplexity) and benchmark performance (e.g., MMLU accuracy).
  2. Control specificity checks preservation of properties that are closely related to the target behavior—“control properties.” For over‑refusal steering, the control property is the model’s refusal on genuinely harmful queries; for faithfulness‑hallucination steering, it is the reliance on internal knowledge when context is absent or contradictory.
  3. Robustness specificity evaluates whether the control properties remain intact under distribution shifts, including adversarial prompts (jailbreak attacks) and noisy or misleading contexts. This dimension goes beyond typical out‑of‑distribution efficacy studies, focusing instead on the stability of non‑target behaviors.
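As a concrete illustration of the general‑specificity check, a minimal perplexity comparison between a base and a steered model might look as follows. The per‑token log‑probabilities and the tolerance are hypothetical numbers for illustration, not values from the paper:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability); lower means more fluent text."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs for the same text under the base
# and the steered model (made-up numbers, not from the paper).
base_lp    = [-1.2, -0.8, -1.5, -0.9]
steered_lp = [-1.3, -0.9, -1.4, -1.0]

# A general-specificity check flags the intervention if perplexity
# degrades beyond some tolerance (threshold is an illustrative choice).
degradation = perplexity(steered_lp) / perplexity(base_lp) - 1.0
fluency_preserved = degradation < 0.10
```

In practice the same pattern extends to benchmark accuracy (e.g., MMLU): run the steered and unsteered model on the same evaluation set and compare scores rather than log‑probabilities.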

The authors apply this framework to two safety‑critical use cases. Over‑refusal steering aims to reduce excessive refusals of benign requests while still refusing truly unsafe requests. Faithfulness‑hallucination steering seeks to curb the model’s tendency to generate factually incorrect answers when supplied with contradictory external information.

A broad suite of existing steering techniques is examined: difference‑in‑means, linear probes, supervised steering vectors, representation fine‑tuning, and partial orthogonalization. Experiments are conducted on instruction‑tuned LLMs with up to 8B parameters, comparing unconstrained steering (no explicit safety constraint) with constrained steering (explicitly preserving refusal on harmful queries).
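The first of these techniques, difference‑in‑means, is simple enough to sketch. The toy activations, dimensionality, and scaling coefficient below are illustrative assumptions, not the paper's actual setup; in a real model the activations would come from a chosen transformer layer:

```python
import numpy as np

def difference_in_means_vector(pos_acts, neg_acts):
    """Steering vector as the gap between mean activations of two behavior
    classes (e.g., refusal vs. compliance examples) at a chosen layer."""
    return np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)

def steer(hidden, vector, alpha):
    """Add the scaled vector to a hidden state at inference time; a negative
    alpha suppresses the behavior the vector encodes (e.g., overrefusal)."""
    return hidden + alpha * vector

# Toy activations: four "refusal" and four "compliance" examples
# in a 3-dimensional hidden space (hypothetical numbers).
pos = np.array([[1.0, 0.0, 2.0], [1.2, 0.1, 1.8], [0.9, -0.1, 2.1], [1.1, 0.0, 2.0]])
neg = np.array([[0.0, 1.0, 0.0], [0.1, 0.9, 0.2], [-0.1, 1.1, -0.1], [0.0, 1.0, -0.1]])

v = difference_in_means_vector(pos, neg)
h = np.array([0.5, 0.5, 1.0])          # a hidden state at inference time
steered = steer(h, v, alpha=-0.5)      # push away from the refusal direction
```

The other methods differ mainly in how the direction is obtained (a trained probe, supervised optimization, or fine‑tuned representation edits) while the inference‑time addition follows the same pattern.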

Findings:

  • All methods successfully improve the target property, achieving high efficacy in both over‑refusal reduction and hallucination mitigation.
  • General specificity is largely maintained: perplexity and benchmark scores show negligible degradation, indicating that overall language competence is preserved.
  • Control specificity is also mostly preserved, especially when safety constraints are incorporated during steering; models continue to refuse dangerous prompts at rates comparable to the baseline.
  • Robustness specificity consistently fails. Steered models become markedly more vulnerable to jailbreak attacks, often yielding compliant responses to adversarial prompts that the unsteered model would have rejected. In the hallucination scenario, models exhibit increased susceptibility to misleading or irrelevant context, over‑relying on the injected information and producing factually incorrect outputs.
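A robustness‑specificity check of the kind described can be scored as an attack success rate over a fixed jailbreak suite, compared before and after steering. The judged outputs and the judge itself below are made up for illustration; real evaluations would use an LLM or human judge on actual model generations:

```python
def attack_success_rate(responses, is_compliant):
    """Fraction of adversarial prompts the model complies with (lower is safer)."""
    hits = sum(1 for r in responses if is_compliant(r))
    return hits / len(responses)

# Hypothetical judged outputs on the same jailbreak suite
# before and after steering (made-up labels).
base_responses    = ["refuse", "refuse", "comply", "refuse", "refuse"]
steered_responses = ["comply", "refuse", "comply", "comply", "refuse"]
judge = lambda r: r == "comply"

base_asr    = attack_success_rate(base_responses, judge)
steered_asr = attack_success_rate(steered_responses, judge)
robustness_preserved = steered_asr <= base_asr
```

A steering method can pass efficacy and control‑specificity checks on the standard refusal set while this ratio degrades sharply, which is exactly the failure mode the findings above describe.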

These results demonstrate that evaluating steering solely on efficacy and basic general or control specificity can be misleading. A method may appear safe in standard benchmarks while actually compromising safety under realistic adversarial conditions.

The paper’s contributions are threefold: (1) a formal, multi‑dimensional specificity framework for model steering; (2) the first systematic empirical study of specificity in two high‑impact safety settings; and (3) evidence that current steering techniques lack robust specificity, highlighting the need for new methods that explicitly enforce safety under distribution shifts. The authors release code and datasets to facilitate further research, urging the community to adopt robustness‑focused specificity checks before deploying steering interventions in real‑world applications.

