Logit Distance Bounds Representational Similarity

Reading time: 5 minutes

📝 Original Info

  • Title: Logit Distance Bounds Representational Similarity
  • ArXiv ID: 2602.15438
  • Date: 2026-02-17
  • Authors: Author information was not provided for this paper. (Estimated publication year: 2025)

📝 Abstract

For a broad family of discriminative models that includes autoregressive language models, identifiability results imply that if two models induce the same conditional distributions, then their internal representations agree up to an invertible linear transformation. We ask whether an analogous conclusion holds approximately when the distributions are close instead of equal. Building on the observation of Nielsen et al. (2025) that closeness in KL divergence need not imply high linear representational similarity, we study a distributional distance based on logit differences and show that closeness in this distance does yield linear similarity guarantees. Specifically, we define a representational dissimilarity measure based on the models' identifiability class and prove that it is bounded by the logit distance. We further show that, when model probabilities are bounded away from zero, KL divergence upper-bounds logit distance; yet the resulting bound fails to provide nontrivial control in practice. As a consequence, KL-based distillation can match a teacher's predictions while failing to preserve linear representational properties, such as linear-probe recoverability of human-interpretable concepts. In distillation experiments on synthetic and image datasets, logit-distance distillation yields students with higher linear representational similarity and better preservation of the teacher's linearly recoverable concepts.

💡 Deep Analysis

📄 Full Content

It is widely believed that the success of deep learning models depends on the data representation they learn Bengio et al. [2013]; yet it is unclear what properties "good" representations have in common Bansal et al. [2021]. Accordingly, prior work studies when models with comparable performance exhibit similar internal representations Morcos et al. [2018], Kornblith et al. [2019], Klabunde et al. [2025] and how human-interpretable concepts are encoded within them Bricken et al. [2023], Gurnee and Tegmark [2023]. Empirically, many such concepts can be predicted from internal representations by simple linear probes Alain and Bengio [2016], Kim et al. [2018], suggesting substantial linear structure in the representations of successful models Mikolov et al. [2013], Park et al. [2024]. However, this regularity is not universal Engels et al. [2024], Li et al. [2025a], and it remains unclear to what extent linear representational properties are consistently shared across models that perform comparably well on the same task.
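To make the linear-probe idea concrete, here is a minimal sketch (ours, not the paper's code) that fits a logistic-regression probe on frozen representations to test whether a concept is linearly decodable; the arrays `representations` and `concept_labels` are synthetic placeholders.

```python
# Minimal linear-probe sketch (illustrative; not from the paper).
# A logistic-regression classifier is fit on frozen representations to test
# whether a concept label is linearly decodable from them.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
representations = rng.normal(size=(1000, 64))              # placeholder frozen hidden states
concept_labels = (representations[:, 0] > 0).astype(int)   # placeholder binary concept

X_tr, X_te, y_tr, y_te = train_test_split(representations, concept_labels,
                                          test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("linear-probe accuracy:", probe.score(X_te, y_te))
```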

We study this question for a broad family of discriminative models, including autoregressive next-token prediction. Prior identifiability results show that, under a suitable diversity assumption, if two such models induce the same conditional distribution, then their representations are equal up to an invertible linear transformation Khemakhem et al. [2020b], Roeder et al. [2021], Lachapelle et al. [2023], and one can characterize which linear properties are shared across the equivalence class [Marconato et al., 2025]. A key question is whether these conclusions hold approximately when two models induce distributions that are close but not equal [Buchholz and Schölkopf, 2024]. Nielsen et al. [2025] show that the answer depends on how distributional closeness is measured: in particular, models can be arbitrarily close in KL divergence while their representations remain far from linearly related. On the other hand, robustness can be recovered under suitable divergences for which distributional closeness does imply representational similarity.

Figure 1: In the center: the intuition of bounding representational similarity using distributional distance. P_Θ is the set of probability distributions parametrized by models in Θ (Eq. (1)) which satisfy Thm. 2.1. These distributions are one-to-one with identifiability classes [(f, g)] in the quotient space Θ/∼_L [Khemakhem et al., 2020a]. The colored areas in P_Θ contain the distributions which are ϵ-close to a reference p_{f,g}, as measured by d_logit (Thm. 3.1, blue area) or by d_KL (Eq. (9), pink area). Our Thm. 3.4 lower-bounds representational similarity (in terms of m_CCA, blue arrow) using the logit distance d_logit; similarly, Thm. 3.9 upper-bounds dissimilarity in terms of our d_rep (Thm. 3.7). In Thm. 3.3 (pink arrow) we prove that d_KL yields weak bounds on d_logit. We illustrate this with representations of two student models distilled from a teacher on the SUB dataset [Bader et al., 2025]; see §5.2 for details. On the left: a student model trained to minimize a variant of d_logit (Eq. (19)) to the teacher distribution p_{f,g} preserves linearly encoded concepts (Thm. 4.3): for 6 attributes, we visualize their linear separability in the embeddings by projecting them to two dimensions through LDA [Bishop and Nasrabadi, 2006, Ch. 4.1]. Distinct concept attributes can be well separated linearly in this 2D subspace. On the right: for a model trained to minimize d_KL, the LDA reduction shows that different concept attributes are not linearly separable, as reflected by the extremely low accuracy in Tab. 2.
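The LDA visualization described in the figure caption can be reproduced in spirit with a few lines; the sketch below is our illustration (not the authors' code) and uses synthetic embeddings and a synthetic six-valued attribute as placeholders.

```python
# Sketch of the Figure 1 LDA visualization (illustrative only): project embeddings
# onto two linear-discriminant axes for a six-valued attribute and check how well
# the attribute separates linearly in that 2D subspace.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n, d, n_values = 1200, 128, 6                     # placeholder sizes
attribute = rng.integers(0, n_values, size=n)     # placeholder attribute values
class_means = rng.normal(size=(n_values, d))      # class-dependent structure
embeddings = rng.normal(size=(n, d)) + 2.0 * class_means[attribute]

lda = LinearDiscriminantAnalysis(n_components=2)
proj_2d = lda.fit_transform(embeddings, attribute)  # 2D points to scatter, colored by attribute
print("LDA accuracy on the attribute:", lda.score(embeddings, attribute))
```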

In this work, we ask: when two models in this family induce similar conditional distributions, to what extent do their representations agree up to an invertible linear map? We introduce a distributional distance based on logit differences and prove quantitative representation-level guarantees: small logit distance (i) implies high linear representational similarity, yielding an explicit lower bound on m_CCA between representation spaces Raghu et al. [2017], Morcos et al. [2018], and (ii) upper-bounds a representational dissimilarity measure that we design to respect the model family's equivalence class. Finally, we clarify the (limited) extent to which KL control can recover such guarantees: if model probabilities are bounded away from zero, then the KL divergence upper-bounds the logit distance, but the resulting robustness bounds, although tight, are insufficient for practical settings.
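The precise definitions of d_logit, d_rep, and the CCA-based similarity are given in the paper and are not reproduced here; the sketch below uses plausible stand-ins (a worst-case norm of per-input logit differences and the mean canonical correlation over the top CCA directions) so the quantities being related have a concrete shape. The function names and the specific norm are our assumptions.

```python
# Illustrative stand-ins (our assumptions, not the paper's exact definitions):
# a logit distance as the worst-case norm of per-input logit differences, and
# a mean-CCA similarity between two representation matrices.
import numpy as np
from sklearn.cross_decomposition import CCA

def logit_distance(logits_a, logits_b):
    """Assumed form: max over inputs of the Euclidean norm of the logit difference."""
    return float(np.max(np.linalg.norm(logits_a - logits_b, axis=-1)))

def mean_cca_similarity(reps_a, reps_b, k=10):
    """Mean canonical correlation over the top-k CCA directions."""
    za, zb = CCA(n_components=k, max_iter=1000).fit_transform(reps_a, reps_b)
    return float(np.mean([np.corrcoef(za[:, i], zb[:, i])[0, 1] for i in range(k)]))

rng = np.random.default_rng(0)
reps_a = rng.normal(size=(500, 64))                                # placeholder representations
reps_b = reps_a @ (np.eye(64) + 0.1 * rng.normal(size=(64, 64)))   # near-linear transform of reps_a
print("mean CCA similarity:", mean_cca_similarity(reps_a, reps_b))
```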

Our results have an immediate implication for knowledge distillation. Standard approaches train a student to match a teacher’s predictions by minimizing a KL divergence between output distributions Hinton et al. [2015]; yet our theory implies that a student can be very close to its teacher in KL while still learning representations far from being linearly aligned. This motivates alternative objectives, and our theory suggests that minimizing logit differences-an appr
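As a concrete contrast, the sketch below (our illustration; the paper's actual objective, the variant of d_logit in Eq. (19), is not reproduced here) places a standard KL distillation loss next to a simple mean-squared-error loss on the raw logits.

```python
# Two distillation objectives, sketched for contrast (illustrative only; the
# paper's logit-distance objective, Eq. (19), is a variant not shown here).
import torch
import torch.nn.functional as F

def kl_distillation_loss(student_logits, teacher_logits):
    """KL(teacher || student) on output distributions (Hinton et al., 2015; temperature 1)."""
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

def logit_difference_loss(student_logits, teacher_logits):
    """Assumed simple form: mean squared error between raw logits."""
    return F.mse_loss(student_logits, teacher_logits)

student_logits = torch.randn(8, 100, requires_grad=True)   # placeholder student outputs
teacher_logits = torch.randn(8, 100)                        # placeholder teacher outputs
print(kl_distillation_loss(student_logits, teacher_logits).item(),
      logit_difference_loss(student_logits, teacher_logits).item())
```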

Reference

This content is AI-processed based on open access ArXiv data.
