Evidence from fMRI Supports a Two-Phase Abstraction Process in Language Models

Notice: This research summary and analysis were generated automatically. For full accuracy, please refer to the original arXiv source.

Research has repeatedly demonstrated that intermediate hidden states extracted from large language models are able to predict measured brain response to natural language stimuli. Yet, very little is known about the representation properties that enable this high prediction performance. Why is it the intermediate layers, and not the output layers, that are most capable for this unique and highly general transfer task? In this work, we show that evidence from language encoding models in fMRI supports the existence of a two-phase abstraction process within LLMs. We use manifold learning methods to show that this abstraction process naturally arises over the course of training a language model and that the first “composition” phase of this abstraction process is compressed into fewer layers as training continues. Finally, we demonstrate a strong correspondence between layerwise encoding performance and the intrinsic dimensionality of representations from LLMs. We give initial evidence that this correspondence primarily derives from the inherent compositionality of LLMs and not their next-word prediction properties.


💡 Research Summary

This paper investigates why intermediate hidden layers of large language models (LLMs) consistently provide the best linear predictions of human brain activity measured with functional magnetic resonance imaging (fMRI) during natural language listening. The authors propose that the phenomenon is rooted in a two‑phase abstraction process that emerges during model training: an early “composition” phase that builds rich, high‑dimensional representations, followed by a later “prediction” phase that compresses these representations to optimize next‑token prediction.

To test this hypothesis, three observables are measured across multiple models and training checkpoints: (1) encoding performance, quantified as the voxel-wise correlation between measured fMRI signals and the predictions of a linear ridge-regression mapping from LLM activations; (2) the intrinsic dimensionality (ID) of the activations, estimated with the non-linear GRIDE estimator, alongside linear effective dimensionality obtained via PCA variance-thresholding and the Participation Ratio; and (3) layer-wise surprisal (next-token prediction error), computed with the TunedLens method, which learns an affine map from each hidden layer to the vocabulary distribution.
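The first two observables can be sketched compactly. Below is an illustrative implementation, not the authors' code: `participation_ratio` computes the linear effective dimensionality from the covariance spectrum, and `encoding_score` fits a ridge map from layer activations to fMRI responses and scores it by per-voxel correlation. The ridge penalty `alpha` and the z-scoring details are assumptions for the sketch.

```python
import numpy as np
from sklearn.linear_model import Ridge

def participation_ratio(X):
    """Linear effective dimensionality via the Participation Ratio:
    PR = (sum_i lam_i)^2 / sum_i lam_i^2, where lam_i are eigenvalues
    of the covariance of X (n_samples x n_features)."""
    lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    lam = np.clip(lam, 0.0, None)  # guard against tiny negative eigenvalues
    return lam.sum() ** 2 / (lam ** 2).sum()

def encoding_score(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Voxel-wise encoding performance: ridge-regress fMRI responses Y
    (time x voxels) onto layer activations X (time x features), then
    return the Pearson correlation per voxel on held-out data."""
    model = Ridge(alpha=alpha).fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    # z-score columns, then average the elementwise product = Pearson r
    Yp = (Y_pred - Y_pred.mean(0)) / Y_pred.std(0)
    Yt = (Y_test - Y_test.mean(0)) / Y_test.std(0)
    return (Yp * Yt).mean(0)
```

For isotropic Gaussian data the Participation Ratio approaches the ambient dimension, while strongly anisotropic activations yield a much smaller value, which is what makes it a useful summary of how "spread out" a layer's representation is.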

The fMRI data consist of three subjects listening to ~20 hours of English podcasts, yielding ~33 000 time points per voxel. Encoding models are trained on activations from three OPT models (125 M, 1.3 B, 13 B parameters) and a 6.9 B‑parameter Pythia model. For the training‑trajectory analysis, nine Pythia checkpoints (1 K to 143 K steps) are examined.

Key findings:

  • A strong positive correlation (ρ≈0.85) exists between encoding performance and intrinsic dimensionality across layers, model sizes, and brain regions. Higher‑ID layers capture abstract linguistic features that align with higher‑order cortical areas (e.g., frontal and temporal language zones).
  • In OPT‑1.3 B, encoding performance peaks at layer 17, precisely where surprisal sharply drops, marking a transition from the composition phase to the prediction phase. Similar transitions are observed in other model sizes, though the boundary is more gradual in Pythia.
  • Inter‑layer similarity measured by linear Centered Kernel Alignment (CKA) reveals two distinct blocks of layers separated near the ID peak, supporting the notion of separate functional phases.
  • Across training, both encoding performance and ID increase in tandem. The layer that maximizes ID converges to the same layer that maximizes encoding performance as training proceeds, ruling out trivial explanations based on fixed architectural depth.
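The inter-layer similarity measure used in the third finding, linear CKA, has a short closed form. The following is a minimal sketch (not the paper's implementation) for two activation matrices whose rows correspond to the same stimuli:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between activation matrices
    X (n x d1) and Y (n x d2) with matched rows:
    CKA = ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F),
    where Xc, Yc are column-centered versions of X and Y."""
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    num = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    den = np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro")
    return num / den
```

Because CKA is invariant to orthogonal rotations and isotropic scaling of either representation, a block structure in the layer-by-layer CKA matrix reflects genuine changes in representational geometry rather than superficial basis changes, which is why two separated blocks support the two-phase reading.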

The authors argue that the intermediate‑layer advantage is not a by‑product of the autoregressive loss; rather, it reflects the model’s internal compositional machinery. As models become more proficient at next‑token prediction, the “prediction” pressure compresses representations, reducing their dimensionality and consequently their brain‑matching ability. This explains why, with larger or more fully trained models, the optimal layer for brain encoding drifts slightly toward earlier layers.

Implications: (1) Brain‑LLM similarity should be interpreted as evidence of shared abstract representation building rather than shared predictive objectives. (2) The two‑phase abstraction process, previously hypothesized in interpretability literature, receives independent validation from neuroimaging data. (3) Practically, combining spectral information from multiple layers to construct a representation with higher intrinsic dimensionality than any single layer could improve encoding models beyond the current linear‑mapping ceiling.
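The third implication suggests one simple way such a multi-layer representation might be built. As a hypothetical sketch (the paper does not prescribe this construction), one could concatenate the top principal components of several layers so the combined feature space draws on more than one layer's spectrum:

```python
import numpy as np
from sklearn.decomposition import PCA

def stack_layers(layer_acts, n_components=50):
    """Hypothetical multi-layer feature: reduce each layer's activations
    (time x features) to its top principal components, then concatenate,
    yielding a representation whose spectrum spans several layers."""
    parts = [PCA(n_components=n_components).fit_transform(A)
             for A in layer_acts]
    return np.concatenate(parts, axis=1)
```

Whether the resulting representation actually has a higher intrinsic dimensionality than any single layer, and whether that translates into better encoding performance, would need to be tested empirically.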

The paper acknowledges limitations: only two model families were examined, and brain‑model similarity alone cannot definitively reveal causal mechanisms. Future work should extend the analysis to diverse architectures, multimodal models, and other neuroimaging modalities (ECoG, MEG) to test the generality of the two‑phase abstraction hypothesis.

