The Information Geometry of Softmax: Probing and Steering

Reading time: 5 minutes

📝 Original Info

  • Title: The Information Geometry of Softmax: Probing and Steering
  • ArXiv ID: 2602.15293
  • Date: 2026-02-17
  • Authors: (not listed in the provided source; insert the actual author list)

📝 Abstract

This paper concerns the question of how AI systems encode semantic structure into the geometric structure of their representation spaces. The motivating observation of this paper is that the natural geometry of these representation spaces should reflect the way models use representations to produce behavior. We focus on the important special case of representations that define softmax distributions. In this case, we argue that the natural geometry is information geometry. Our focus is on the role of information geometry in semantic encoding and the linear representation hypothesis. As an illustrative application, we develop "dual steering", a method for robustly steering representations to exhibit a particular concept using linear probes. We prove that dual steering optimally modifies the target concept while minimizing changes to off-target concepts. Empirically, we find that dual steering enhances the controllability and stability of concept manipulation.

💡 Deep Analysis

📄 Full Content

Understanding and manipulating the internal representations of AI models is central to building trustworthy and controllable AI systems. Many approaches build on the linear representation hypothesis: the idea that high-level concepts (e.g., sentiment, truthfulness, or gender) correspond to specific directions in the vector space containing the model's representations [MYZ13; Elh+22; PCV24]. Researchers have used this idea to identify and manipulate concepts across various architectures [NLW23; Li+23; Tur+23; Zou+23; GT24]. However, the results are somewhat mixed. Although there is clearly structure in the representation spaces, these methods are often brittle and have usually not been competitive with more direct fine-tuning approaches [Has+23; Mak+24; Sha+25; WV25]. This suggests that we do not yet understand the 'linear representation' structure well enough to build robust, generalizable methods.
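The standard recipe built on this hypothesis is to obtain a concept direction from a linear probe and add it to a hidden representation. A minimal sketch of this flat-geometry baseline (the probe direction `w` and scale `alpha` are hypothetical placeholders, not taken from the paper):

```python
import numpy as np

def euclidean_steer(lam, w, alpha):
    """Flat-geometry steering: move the representation a fixed Euclidean
    distance alpha along a linear probe's concept direction w."""
    w_unit = w / np.linalg.norm(w)
    return lam + alpha * w_unit

# Toy 4-dimensional representation and a hypothetical probe direction.
lam = np.array([0.5, -1.0, 2.0, 0.0])
w = np.array([2.0, 0.0, 0.0, 0.0])
steered = euclidean_steer(lam, w, alpha=3.0)
```

The brittleness the paper highlights comes from exactly this step: the update is linear in the representation but has no built-in control over how the model's output distribution changes.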

One gap in our understanding is that linear representation methods are frequently built on the (implicit) assumption that the representation space has a flat (or even Euclidean) geometry, but there is little reason to expect this assumption to hold. Instead, we would like methods based on the 'intrinsic' structure of the representation space. To that end, we need a notion of geometry that aligns with the way the model actually uses its representations to produce behavior, e.g., a geometry in which two representations are 'close' if they produce similar outputs. The purpose of this paper is to operationalize this idea in the particular case of softmax-based models, and to explain the practical implications of the resulting geometry for interpretability methods.
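To make "close if they produce similar outputs" concrete, consider a toy softmax model: a perturbation that is large in Euclidean norm can leave the output distribution completely unchanged, so Euclidean distance is the wrong yardstick. A minimal sketch (the vectors are illustrative, not from the paper):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def kl(p, q):
    """KL divergence D(p || q): an output-based notion of closeness."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

lam = np.array([1.0, 0.0, -1.0])
p = softmax(lam)

# Two perturbations with the same Euclidean norm ...
d_flat = np.array([0.5, 0.5, 0.5])                       # constant shift
d_tilt = np.linalg.norm(d_flat) * np.array([1.0, 0.0, 0.0])

# ... but very different effects on the output distribution.
kl_flat = kl(p, softmax(lam + d_flat))   # exactly 0: softmax is shift-invariant
kl_tilt = kl(p, softmax(lam + d_tilt))   # strictly positive
```

The constant shift moves the representation a nonzero Euclidean distance yet produces zero output change, while an equal-norm tilt changes the distribution substantially.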

Our focus here is on representation vectors λ ∈ Λ ≃ ℝ^d that define probability distributions via the softmax transform. That is, for any set Y of candidate items, the model assigns vector representations {γ_1, γ_2, . . . , γ_|Y|} ⊂ Γ ≃ ℝ^d to each item and defines the softmax probability distribution

    P(y | λ) = exp(λ⊤γ_y − A(λ)),

[Figure 1 caption: Dual steering (bottom) shifts probability mass to the target concept (e.g., cat ⇒ dog) while preserving off-target distributions (e.g., P("maintain") + P("maintains"), or P("cat + bicycle") + P("dog + bicycle")), whereas Euclidean steering (top) fails to maintain off-target distributions despite reaching the target probability. Left: token probability changes in Gemma-3-4B when steering the context "Author gives an insight into what it costs US taxpayers to build and" using a linear probe for verb ⇒ third-person. Euclidean steering leaks significant mass to off-target tokens (e.g., "to") during intermediate steps, whereas dual steering directly shifts probability from base tokens (e.g., "maintain", "operate") to target tokens (e.g., "maintains", "operates"). Center & Right: steering MetaCLIP-2 on the context "a photo of one cat" for the concept cat ⇒ dog. Dual steering transfers probability from base images (e.g., "cat", "cat + bicycle") directly to targets (e.g., "dog", "dog + bicycle"). In contrast, Euclidean steering unintentionally promotes the off-target "cat + dog" image (green frame in the right column), which becomes the Top-1 result during intermediate steps. In the probability plots, top-k tokens (LLM) or images (CLIP) are shown explicitly, with the remainder grouped as "others."]

where A(λ) := log Σ_y exp(λ⊤γ_y) is the log-normalizer. This pattern shows up in many AI architectures, including the attention mechanism of transformers [Vas+17], the next-token selection of large language models (LLMs) [Bro+20], and contrastive models like CLIP [Rad+21]. Our starting point is the observation that the notion of closeness of two representation vectors λ, λ′ should reflect the closeness of the induced probability distributions. Information geometry provides a powerful framework for formalizing and studying the innate geometry of parameters of probability distributions [AN00; Ban+05; Ama16]. The main aim of this paper is to understand how the linear representation hypothesis, and the encoding of high-level semantics in representation space, interacts with the natural information geometry of the representation space.
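In code, the softmax family above, with a numerically stable log-normalizer, looks like the following sketch; the item vectors are random placeholders:

```python
import numpy as np

def log_normalizer(lam, gammas):
    """A(lam) = log sum_y exp(lam . gamma_y), computed with the max trick."""
    logits = gammas @ lam
    m = logits.max()
    return m + np.log(np.exp(logits - m).sum())

def softmax_dist(lam, gammas):
    """P(y | lam) = exp(lam . gamma_y - A(lam))."""
    return np.exp(gammas @ lam - log_normalizer(lam, gammas))

rng = np.random.default_rng(0)
gammas = rng.normal(size=(5, 3))   # 5 candidate items, dimension d = 3
lam = rng.normal(size=3)           # representation vector
p = softmax_dist(lam, gammas)      # probability distribution over the 5 items
```

Subtracting A(λ) inside the exponential is exactly the normalization step, so `p` is a proper distribution by construction.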

To that end:

  1. We identify the natural geometry as a Bregman (dually flat) geometry. This induces a rich duality structure that plays a critical role in understanding the semantic structure of the representation space.

  2. We then study the question of how to interpolate between two representation vectors.

In short: there are natural distinct primal and dual interpolations that yield distinct semantics. In particular, this dual interpolation structure shows that a flat geometry cannot suffice to capture the semantic structure of the representation space.
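For the full categorical family (taking the γ_y to be the standard basis vectors, so the dual expectation coordinates are the probabilities themselves), the two interpolations can be sketched as follows; the endpoint vectors are illustrative, not from the paper:

```python
import numpy as np

def softmax(lam):
    e = np.exp(lam - lam.max())
    return e / e.sum()

lam0 = np.array([2.0, 0.0, -1.0])   # favors item 0
lam1 = np.array([-1.0, 0.0, 2.0])   # favors item 2
t = 0.5

# Primal interpolation: mix the natural parameters (a geometric mixture
# of the two distributions).
p_primal = softmax((1 - t) * lam0 + t * lam1)

# Dual interpolation: mix the dual (expectation) coordinates, which for
# the full categorical family are the probabilities themselves (an
# arithmetic mixture).
p_dual = (1 - t) * softmax(lam0) + t * softmax(lam1)
```

Both are valid distributions, but at t = 0.5 they differ: the primal path spreads mass more evenly, while the dual path keeps mass concentrated on the two favored items, illustrating that the two interpolations carry distinct semantics.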

  3. We then show how information geometry interacts with probing and steering representation vectors. This leads us to "dual steering", a new method for robustly manipulating representations. We prove that this method modifies the target concept while minimizing unintended changes to off-target concepts.

  4. Finally, we test dual steering using open-so
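The paper's exact construction of dual steering is not shown in this excerpt. As a hypothetical illustration of the underlying idea, one can steer in the dual (probability) coordinates of the full categorical family: transfer mass from base items to target items directly, leaving every other item's probability exactly unchanged, then map back to natural parameters. The function name, item sets, and mass fraction below are invented for illustration:

```python
import numpy as np

def dual_steer(lam, base_ids, target_ids, frac):
    """Hypothetical dual-space steering sketch: move a fraction `frac` of
    the base items' probability mass onto the target items (distributed
    proportionally to their current mass), then return to natural
    parameters via log. Items outside base/target keep their exact mass."""
    p = np.exp(lam - lam.max())
    p /= p.sum()
    moved = frac * p[base_ids].sum()
    q = p.copy()
    q[base_ids] *= 1.0 - frac
    q[target_ids] += moved * p[target_ids] / p[target_ids].sum()
    return np.log(q)   # natural parameters, up to an additive constant

# Toy next-token distribution: "maintain", "operate", "maintains", "operates".
lam = np.log(np.array([0.5, 0.2, 0.2, 0.1]))
lam2 = dual_steer(lam, base_ids=np.array([0, 1]),
                  target_ids=np.array([2, 3]), frac=0.8)
p2 = np.exp(lam2)   # steered distribution
```

Because the update is performed on probabilities rather than logits, the total mass is conserved and off-target items are untouched by construction, which is the property the paper proves its method optimizes.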

Reference

This content is AI-processed based on open access ArXiv data.
