Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors
Trained Transformers have been shown to compute abstract features that appear redundant for predicting the immediate next token. We identify which components of the gradient signal from the next-token prediction objective give rise to this phenomenon, and we propose a method to estimate the influence of those components on the emergence of specific features. After validating our approach on toy tasks, we use it to interpret the origins of the world model in OthelloGPT and syntactic features in a small language model. Finally, we apply our framework to a pretrained LLM, showing that features with extremely high or low influence on future tokens tend to be related to formal reasoning domains such as code. Overall, our work takes a step toward understanding hidden features of Transformers through the lens of their development during training.
💡 Research Summary
The paper investigates why transformers trained solely on the next‑token prediction (NTP) objective nevertheless develop internal features that are not useful for predicting the immediate next token—so‑called “NTP‑useless” features. The authors first formalize the flow of gradient information in a causally masked transformer and show that the loss gradient reaching any layer‑position pair ⟨k,i⟩ can be decomposed into three disjoint components: (1) direct learning, which propagates through the residual stream at position i and directly influences the prediction of token x_{i+1}; (2) pre‑caching, where the representation at i influences loss terms at future positions j > i+1 via attention, allowing the model to compute features that will be useful later; and (3) circuit sharing, which bypasses the residual stream at i entirely but still affects the shared parameters because the same weights are used across positions. These three components are expressed mathematically as ∇_θ L^{direct}_{k,i}, ∇_θ L^{pre‑cached}_{k,i}, and ∇_θ L^{shared}_{k,i}, and Proposition 3.1 proves that their sum equals the full gradient.
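The three-way split can be made concrete in a minimal numpy sketch. This is not the paper's model: it uses a one‑layer linear “transformer” with uniform causal attention and a squared‑error loss, and every name below is illustrative. The point it demonstrates is the structure of Proposition 3.1: at a focal position k, the direct, pre‑cached, and shared gradient components sum exactly to the full gradient of the shared weight.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer: a shared weight W maps each token
# embedding x_i to a hidden state h_i = W @ x_i.  The prediction at
# position j attends uniformly to all positions i <= j:
#   p_j = v . (sum_{i<=j} h_i),   loss  L = sum_j (p_j - y_j)^2.
T, d = 4, 3                      # sequence length, hidden size
X = rng.normal(size=(T, d))      # token embeddings x_0 .. x_{T-1}
y = rng.normal(size=T)           # regression targets (stand-in for NTP)
W = rng.normal(size=(d, d))
v = rng.normal(size=d)

H = X @ W.T                      # h_i = W x_i, stacked as rows
P = np.cumsum(H, axis=0) @ v     # p_j = v . sum_{i<=j} h_i
dLdP = 2.0 * (P - y)             # dL/dp_j

# Gradient wrt each hidden state: h_i feeds every prediction j >= i,
# so dL/dh_i = (sum_{j>=i} dL/dp_j) * v  (suffix sums over positions).
suffix = np.cumsum(dLdP[::-1])[::-1]
dLdH = suffix[:, None] * v[None, :]

# Full gradient wrt the shared weight W (h_i = W x_i gives outer products).
full_grad = sum(np.outer(dLdH[i], X[i]) for i in range(T))

# Decomposition at a focal position k:
k = 1
direct = np.outer(dLdP[k] * v, X[k])               # through h_k into loss at k
pre_cached = np.outer(dLdP[k+1:].sum() * v, X[k])  # through h_k into later losses
shared = sum(np.outer(dLdH[i], X[i])               # bypasses h_k entirely
             for i in range(T) if i != k)
```

In this linear setting the identity is exact; in a real transformer the same bookkeeping applies per layer‑position pair, which is what the paper's decomposition formalizes.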
To quantify each component’s contribution to the emergence of a particular linear feature w_{k,i}, the authors define a feature mismatch R(x|θ₁,θ₂,w) measuring the squared difference of the feature’s projection between two model checkpoints. They then define the influence of a gradient vector G as the directional derivative of R with respect to an infinitesimal step ε G. Applying this to the three gradient components yields three influence terms I_{direct}, I_{pre‑cached}, and I_{shared}. By integrating these influences over the entire training trajectory (adjusted for Adam’s momentum updates), they obtain aggregate measures Ĩ_{direct}, Ĩ_{pre‑cached}, and Ĩ_{shared} for each feature.
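The influence definition can be sketched with a finite difference. The following is a simplified illustration, not the paper's exact construction: the model is a single linear map, the feature is a fixed probe direction w, and the sign convention (stepping against G, as gradient descent would) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
x = rng.normal(size=d)
w = rng.normal(size=d)               # linear feature (probe) direction
theta = rng.normal(size=(d, d))      # current checkpoint
theta_ref = rng.normal(size=(d, d))  # reference checkpoint

def feature(theta, x, w):
    """Projection of the hidden state h = theta @ x onto direction w."""
    return w @ (theta @ x)

def mismatch(theta_a, theta_b, x, w):
    """Squared feature difference between two checkpoints (the paper's R)."""
    return (feature(theta_a, x, w) - feature(theta_b, x, w)) ** 2

def influence(G, theta, theta_ref, x, w, eps=1e-6):
    """Directional derivative of R along a step -eps*G (one descent step),
    estimated with a central finite difference."""
    r_plus = mismatch(theta - eps * G, theta_ref, x, w)
    r_minus = mismatch(theta + eps * G, theta_ref, x, w)
    return (r_plus - r_minus) / (2 * eps)

G = rng.normal(size=(d, d))          # stand-in for one gradient component
I_G = influence(G, theta, theta_ref, x, w)
```

Evaluating `influence` separately on the direct, pre‑cached, and shared components of the gradient (and accumulating over checkpoints, with whatever optimizer correction applies) is the shape of the aggregate measures described above.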
The methodology is validated in several settings. In two synthetic tasks, Majority and Conditioned Majority, two‑layer GPT‑2‑style models are trained. Intervention experiments that block pre‑caching (myopic training) or circuit sharing (m‑untied training) dramatically reduce linear‑probe accuracy on NTP‑useless features; comparing the feature representations of these ablated models against normally trained ones confirms that both mechanisms are necessary for such features to emerge.
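The two ablations can be read as gradient surgeries. The sketch below reuses the same toy linear model as above (illustrative, not the paper's setup) and assumes that "myopic" training blocks the pre‑cached pathway (as if past values were detached from the graph) while "m‑untied" training gives each position its own weight copy, removing circuit sharing.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 4, 3
X = rng.normal(size=(T, d))
y = rng.normal(size=T)
W = rng.normal(size=(d, d))
v = rng.normal(size=d)

H = X @ W.T                          # h_i = W x_i
P = np.cumsum(H, axis=0) @ v         # p_j = v . sum_{i<=j} h_i
dLdP = 2.0 * (P - y)
suffix = np.cumsum(dLdP[::-1])[::-1] # sum_{j>=i} dL/dp_j

# Full gradient of the shared weight.
full = sum(np.outer(suffix[i] * v, X[i]) for i in range(T))

# Myopic ablation: each h_i receives gradient only from its own
# position's loss -- the pre-cached pathway is severed.
myopic = sum(np.outer(dLdP[i] * v, X[i]) for i in range(T))

# m-untied ablation: position k owns a private copy of W, which sees
# only the gradient flowing through h_k (no circuit sharing).
k = 1
untied_k = np.outer(suffix[k] * v, X[k])

# The pre-cached pathway, summed over positions, is exactly the
# difference between full and myopic training.
pre_cached_total = sum(np.outer((suffix[i] - dLdP[i]) * v, X[i])
                       for i in range(T))
```

Under these surgeries, a feature that relies on pre‑caching simply receives no gradient signal in the myopic model, which is consistent with the probe‑accuracy drops reported above.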
Next, the framework is applied to OthelloGPT, a transformer trained on board‑state sequences of the game Othello. Certain board squares are irrelevant for the immediate move but become predictive of future moves. The analysis shows that these squares acquire strong pre‑caching influence (they are stored in the KV‑cache for later attention) and also benefit from circuit sharing, which spreads strategic patterns learned at later positions back to earlier token representations. This explains the observed fragility of the world model: when pre‑caching is disabled, the model loses its ability to anticipate distant board configurations.
A third set of experiments examines a small language model (≈10 M parameters) trained on natural text. Linear probes reveal syntactic features such as “previous‑token identity” and “majority‑so‑far” that are NTP‑useless until a certain sequence length. Again, both pre‑caching and circuit sharing are essential for these features to emerge, as shown by the probe performance drop under myopic or untied training.
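Linear probing, the evaluation tool used throughout these experiments, can be sketched in a few lines of numpy. The data below is synthetic (a binary property embedded along a fixed direction plus noise, standing in for something like "previous‑token identity"); the probe itself is a least‑squares fit rather than the logistic probes papers typically use, which suffices to illustrate the idea.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 600, 16

# Synthetic stand-in for residual-stream states: a binary property is
# encoded linearly along a fixed direction, plus isotropic noise.
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
H = rng.normal(size=(n, d)) + np.outer(labels - 0.5, direction)

# Linear probe: least-squares fit from states to labels on a train
# split, evaluated by thresholding at 0.5 on a held-out split.
Hb = np.hstack([H, np.ones((n, 1))])          # append a bias column
tr, te = slice(0, 500), slice(500, None)
wp, *_ = np.linalg.lstsq(Hb[tr], labels[tr].astype(float), rcond=None)
probe_acc = ((Hb[te] @ wp > 0.5) == labels[te]).mean()
```

High held‑out probe accuracy indicates the property is linearly decodable from the states; the ablation results above correspond to this accuracy collapsing when the pathway that writes the feature is removed.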
Finally, the authors scale the analysis to a pretrained 1.3 B‑parameter LLM. By computing Ĩ values for a large set of hand‑crafted features (code tokens, mathematical symbols, logical operators, etc.), they discover that features related to formal reasoning domains exhibit extremely high pre‑caching or shared influence, while some low‑influence features correspond to surface‑level statistics. This suggests that large language models allocate substantial gradient budget to compute latent structures that will be useful many steps ahead, even though they do not directly affect the next‑token loss.
Overall, the paper makes three major contributions: (i) a theoretically grounded decomposition of the NTP gradient into direct, pre‑cached, and shared pathways; (ii) a novel quantitative influence metric that attributes feature emergence to each pathway; and (iii) empirical evidence across toy tasks, game‑world modeling, small‑scale language modeling, and large‑scale LLMs that pre‑caching and circuit sharing are the primary drivers of seemingly useless feature formation. The work opens avenues for more interpretable training diagnostics, the design of alternative objectives that suppress unnecessary complexity, and deeper understanding of how transformers build internal world models despite being optimized only for next‑token prediction.