How Many Features Can a Language Model Store Under the Linear Representation Hypothesis?
We introduce a mathematical framework for the linear representation hypothesis (LRH), which asserts that intermediate layers of language models store features linearly. We separate the hypothesis into two claims: linear representation (features are linearly embedded in neuron activations) and linear accessibility (features can be linearly decoded). We then ask: how many neurons $d$ suffice to both linearly represent and linearly access $m$ features? Classical results in compressed sensing imply that for $k$-sparse inputs, $d = O(k\log (m/k))$ suffices if we allow non-linear decoding algorithms (Candès and Tao, 2006; Candès et al., 2006; Donoho, 2006). However, the additional requirement of linear decoding takes the problem out of the classical compressed sensing setting and into linear compressed sensing. Our main theoretical result establishes nearly matching upper and lower bounds for linear compressed sensing: we prove that $d = \Omega_\epsilon\!\left(\frac{k^2}{\log k}\log (m/k)\right)$ is required, while $d = O_\epsilon(k^2\log m)$ suffices. The lower bound establishes a quantitative gap between the classical and linear compressed sensing settings, illustrating that linear accessibility is a meaningfully stronger hypothesis than linear representation alone. The upper bound confirms that neurons can store an exponential number of features under the LRH, giving theoretical evidence for the “superposition hypothesis” (Elhage et al., 2022). The upper bound proof uses standard random constructions of matrices with approximately orthogonal columns. The lower bound proof uses rank bounds for near-identity matrices (Alon, 2003) together with Turán’s theorem, which bounds the number of edges in clique-free graphs. We also show how our results do and do not constrain the geometry of feature representations, and we extend them to decoders with an activation function and bias.
💡 Research Summary
The paper provides a rigorous mathematical treatment of the Linear Representation Hypothesis (LRH), a widely cited principle in recent language-model research that posits intermediate layers store features linearly. The authors split the LRH into two distinct claims: linear representation, meaning that the activation vector of a layer can be written as a linear combination of feature values, i.e., $f(\ell) = Az(\ell)$, where $z(\ell) \in \mathbb{R}^m$ is the feature vector for input $\ell$; and linear accessibility, meaning each feature can be recovered by a linear probe, i.e., there exists a matrix $B$ such that $\|B^{\top}Az - z\|_{\infty} < \epsilon$ for all relevant inputs. The paper asks: for a given number of features $m$ and a sparsity level $k$ (the number of features active on any single input), how many neurons $d$ are required to satisfy both claims simultaneously?
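The two claims can be illustrated with a small NumPy sketch. The sizes, the Gaussian embedding, and the choice of probe $B = A$ below are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, k = 1000, 300, 5  # features, neurons, sparsity (illustrative sizes)

# Linear representation: activations are a linear image of the feature vector.
A = rng.normal(size=(d, m)) / np.sqrt(d)  # random embedding, near-unit-norm columns

z = np.zeros(m)
z[rng.choice(m, size=k, replace=False)] = 1.0  # a k-sparse feature vector
f = A @ z  # layer activations under linear representation

# Linear accessibility: a linear probe B should approximately recover z.
# Here we take the simplest probe B = A (so B^T A is close to the identity
# when columns are nearly orthogonal) and measure the worst coordinate error.
B = A
err = np.max(np.abs(B.T @ f - z))
print(f"recovery error ||B^T A z - z||_inf = {err:.3f}")
```

Because $d \ll m$, the columns of $A$ cannot be exactly orthogonal, so the recovery error is nonzero; how small it can be made, as a function of $d$, $m$, and $k$, is precisely the question the paper formalizes.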
The authors first recall classical compressed-sensing results (Candès & Tao, Donoho), which show that if one allows a non-linear recovery algorithm (e.g., $\ell_1$ minimization), then a random matrix $A \in \mathbb{R}^{d \times m}$ with $d = O(k\log(m/k))$ suffices to embed all $k$-sparse feature vectors and recover them exactly. This addresses the “general accessibility” setting (Q1) but does not meet the linear-probe requirement.
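As a concrete sketch of the classical setting, $\ell_1$ minimization (basis pursuit) can be solved with an off-the-shelf LP solver by splitting $x = u - v$ with $u, v \ge 0$. The sizes here are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
m, d, k = 60, 30, 3  # small illustrative sizes so the LP solves quickly

A = rng.normal(size=(d, m)) / np.sqrt(d)
z = np.zeros(m)
z[rng.choice(m, size=k, replace=False)] = 1.0
f = A @ z  # observed activations

# Basis pursuit: min ||x||_1  s.t.  Ax = f, written as an LP by splitting
# x = u - v with u, v >= 0 and minimizing sum(u) + sum(v).
c = np.ones(2 * m)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=f, bounds=[(0, None)] * (2 * m))
x = res.x[:m] - res.x[m:]
print("max recovery error:", np.max(np.abs(x - z)))
```

Note that the recovery map $f \mapsto x$ is an optimization procedure, not a fixed linear map, which is exactly why this does not settle the linear-accessibility question.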
The core contribution is the analysis of the linear compressed-sensing setting (Q2), where recovery must itself be a linear map. Defining $d(m,k,\epsilon)$ as the smallest dimension for which matrices $A, B$ exist satisfying $\|B^{\top}Az - z\|_{\infty} < \epsilon$ for all $k$-sparse feature vectors $z$, the authors establish nearly matching bounds: $d = \Omega_\epsilon\!\left(\frac{k^2}{\log k}\log(m/k)\right)$ is necessary, while $d = O_\epsilon(k^2\log m)$ suffices.
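The flavor of the upper-bound construction, a random matrix with approximately orthogonal columns used with the probe $B = A$, can be checked numerically. This is a sketch under assumed sizes, not the paper's exact parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 2000, 4  # illustrative sizes

def worst_err(d, trials=20):
    """Max ||B^T A z - z||_inf over random k-sparse 0/1 vectors z,
    for a random A with near-orthogonal columns and probe B = A."""
    A = rng.normal(size=(d, m)) / np.sqrt(d)
    errs = []
    for _ in range(trials):
        z = np.zeros(m)
        z[rng.choice(m, size=k, replace=False)] = 1.0
        errs.append(np.max(np.abs(A.T @ (A @ z) - z)))
    return max(errs)

errs = {d: worst_err(d) for d in (100, 400, 1600)}
for d, e in errs.items():
    print(f"d={d:5d}  worst recovery error = {e:.3f}")
```

Increasing $d$ shrinks the column coherence of $A$ and hence the worst-case recovery error, matching the qualitative message of the upper bound: more neurons buy smaller linear-decoding error for the same $m$ and $k$.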