Spherical Steering: Geometry-Aware Activation Rotation for Language Models


Inference-time steering has emerged as a promising paradigm for controlling language models (LMs) without the cost of retraining. However, standard approaches typically rely on activation addition, a geometric operation that inevitably alters the magnitude of hidden representations. This raises concerns about representation collapse and degradation of open-ended generation capabilities. In this work, we explore Spherical Steering, a training-free primitive that resolves this trade-off through activation rotation. Rather than shifting activations with a fixed vector, our method rotates them along a geodesic toward a target direction, guiding the activation toward the target concept while preserving the integrity of the signal. To further enhance adaptivity, we incorporate a confidence gate that dynamically modulates steering strength based on input uncertainty. Extensive experiments across multiple-choice benchmarks demonstrate that Spherical Steering significantly outperforms addition-based baselines (notably by +10% on TruthfulQA, COPA, and Storycloze), while simultaneously maintaining the model’s general open-ended generation quality. This work highlights the value of geometric consistency, suggesting that norm-preserving rotation is a robust and effective primitive for precise inference-time control.


💡 Research Summary

The paper introduces Spherical Steering, a training‑free, inference‑time intervention for large language models (LLMs) that replaces the widely used activation‑addition technique with a norm‑preserving rotation on the unit hypersphere. The authors begin by observing that modern LLMs employ normalization layers (e.g., RMSNorm) which keep hidden‑state magnitudes stable across layers, effectively encoding semantic information in the direction of the activation vectors rather than their length. Exploiting this geometric prior, they construct a truthfulness prototype direction μ from a contrastive dataset of positive and negative answer pairs. For each example, the last‑token hidden state is extracted, the means of positive and negative activations are computed, and their difference is normalized to obtain a unit vector that points toward the “truthful” side of the representation space (the opposite direction μ_H represents the hallucinated side).
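The prototype construction described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and array shapes are assumptions, and the inputs are taken to be last-token hidden states already extracted from the contrastive dataset.

```python
import numpy as np

def prototype_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Build a unit 'truthfulness' direction from last-token hidden states.

    pos_acts, neg_acts: arrays of shape (n_examples, d_model) holding
    activations for positive (truthful) and negative (hallucinated) answers.
    """
    mu_pos = pos_acts.mean(axis=0)        # mean truthful activation
    mu_neg = neg_acts.mean(axis=0)        # mean hallucinated activation
    diff = mu_pos - mu_neg                # contrastive difference vector
    return diff / np.linalg.norm(diff)    # unit vector mu; -mu plays the role of mu_H
```

The returned unit vector corresponds to μ in the text; its negation corresponds to the hallucinated direction μ_H.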

During generation, the current token’s hidden activation h(l) at layer l is normalized to a unit vector ĥ, and the angular distance θ between ĥ and the target direction μ_T (= μ) is measured. A scalar steering strength t ∈ [0, 1] then determines how far ĥ is rotated along the geodesic toward μ_T, after which the rotated unit vector is rescaled to the original magnitude of h(l), so that only the direction of the activation changes. The confidence gate modulates t per input, reducing the rotation when the model is already confident and strengthening it under uncertainty.
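Since the summary does not reproduce the paper's exact update rule, the rotation step can be sketched with standard spherical linear interpolation (slerp), which matches the description: a geodesic rotation toward μ_T that preserves the activation's norm. The function name and the handling of the near-aligned case are assumptions.

```python
import numpy as np

def spherical_steer(h: np.ndarray, mu_target: np.ndarray, t: float) -> np.ndarray:
    """Rotate activation h toward unit direction mu_target along the geodesic
    on the hypersphere, preserving the norm of h (slerp-style update).

    t in [0, 1]: 0 leaves h unchanged, 1 maps its direction onto mu_target.
    """
    norm = np.linalg.norm(h)
    h_hat = h / norm                                  # unit-normalized activation
    cos_theta = np.clip(h_hat @ mu_target, -1.0, 1.0)
    theta = np.arccos(cos_theta)                      # angular distance to target
    if theta < 1e-6:                                  # already aligned: no rotation needed
        return h
    sin_theta = np.sin(theta)
    # Spherical linear interpolation between h_hat and mu_target.
    rotated = (np.sin((1.0 - t) * theta) / sin_theta) * h_hat \
              + (np.sin(t * theta) / sin_theta) * mu_target
    return norm * rotated                             # restore the original magnitude
```

Because the output is rescaled to the original norm, this primitive changes only the direction of h, which is the geometric consistency property the paper argues for.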

