Feature Identification via the Empirical NTK
We provide evidence that eigenanalysis of the empirical neural tangent kernel (eNTK) can surface the features used by trained neural networks. Across three standard toy models for mechanistic interpretability (Toy Models of Superposition (TMS), a 1-layer MLP trained on modular addition, and a 1-layer Transformer trained on the same task), we find that the top eigenspaces of the eNTK align with ground-truth features. In TMS, the eNTK recovers the ground-truth features in both the sparse (high-superposition) and dense regimes. In modular arithmetic, the eNTK can be used to recover Fourier feature families. Moreover, we provide evidence that a layerwise eNTK localizes features to specific layers and that the evolution of the eNTK spectrum can be used to diagnose the grokking phase transition. These results suggest that eNTK analysis may provide a practical handle for feature discovery and for detecting phase changes in small models.
💡 Research Summary
The paper investigates whether the eigenvectors of the empirical Neural Tangent Kernel (eNTK) can reveal the actual features used by trained neural networks. Building on the “eNTK hypothesis” – that the NTK evaluated at the end of training can still approximate the network’s function – the authors conduct systematic experiments on three well‑studied toy problems that are common in mechanistic interpretability research.
- Toy Models of Superposition (TMS) – A 1‑layer auto‑encoder with hidden size m learns the identity map on synthetic data containing n ground‑truth features. Computing the flattened eNTK (of size N·C × N·C, where N is the number of inputs and C the output dimension) and examining its eigenvalue spectrum reveals two pronounced "cliffs". The second cliff occurs exactly at the n‑th eigenvalue, matching the number of true features. Heatmaps of cosine similarity between the top eigenvectors and feature activation vectors show an almost one‑to‑one alignment, especially when the importance weighting β is set to 0 (no weighting). This demonstrates that the eNTK's leading subspace encodes all features, with the weighting controlling which ones dominate.
- 1‑layer MLP on Modular Arithmetic – The network learns to add two one‑hot encoded integers modulo a prime p (e.g., p = 29). In the grokking regime the model first reaches perfect training accuracy, then after a delay suddenly attains perfect test accuracy. The eNTK spectrum exhibits two cliffs: the first, of dimension k = 4⌊p/2⌋, corresponds to the Fourier basis learned by the first hidden layer; the second appears precisely at the grokking transition and aligns with the "sum" and "difference" Fourier components learned by the second layer. By computing a layer‑wise eNTK (summing only over the parameters of a given layer), the authors localize the first cliff to the input‑to‑hidden weights and the second to the hidden‑to‑output weights. Thus, eNTK analysis simultaneously identifies which features exist and where they reside in the architecture, and it provides a diagnostic for phase transitions.
- 1‑layer Transformer on Modular Arithmetic – Multiple random seeds are trained on the same modular addition task. The top eigenvectors of the full eNTK, as well as of layer‑wise eNTKs, consistently align with the Fourier "key frequencies" that the transformer uses to implement the algorithm. Specifically, eigenvectors concentrate on the attention block's O/V layers, the first MLP layer (input), the second MLP layer (output), and the final unembedding layer. Each of these subspaces aligns with a distinct set of Fourier features, showing that the transformer modularly distributes the computation across layers.
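In all three settings the underlying procedure is the same: form the flattened eNTK from per-example output Jacobians, then eigendecompose it. A minimal sketch of that procedure, using a hypothetical tiny two-layer network with hand-written gradients (all names and sizes are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny 2-layer network y = W2 @ tanh(W1 @ x); sizes are illustrative.
d_in, d_h, d_out = 5, 8, 5
W1 = 0.5 * rng.normal(size=(d_h, d_in))
W2 = 0.5 * rng.normal(size=(d_out, d_h))

def jacobian(x):
    """Jacobian of all C = d_out outputs w.r.t. the flattened parameters (W1, W2)."""
    h = np.tanh(W1 @ x)
    J = np.zeros((d_out, W1.size + W2.size))
    for c in range(d_out):
        # dy_c/dW1[a, b] = W2[c, a] * (1 - h[a]^2) * x[b]
        gW1 = (W2[c] * (1.0 - h**2))[:, None] * x[None, :]
        # dy_c/dW2[a, b] = delta_{ca} * h[b]  (only row c of W2 contributes)
        gW2 = np.zeros_like(W2)
        gW2[c] = h
        J[c] = np.concatenate([gW1.ravel(), gW2.ravel()])
    return J

# Flattened eNTK over N inputs: one row/column per (input, output-coordinate) pair.
N = 12
X = rng.normal(size=(N, d_in))
J_all = np.vstack([jacobian(x) for x in X])   # shape (N*C, P)
K = J_all @ J_all.T                           # shape (N*C, N*C), PSD by construction

eigvals = np.linalg.eigvalsh(K)[::-1]         # spectrum, sorted descending
```

A layer-wise eNTK, as used in the MLP and Transformer experiments, restricts the inner product to one parameter block, e.g. `K_W1 = J_all[:, :W1.size] @ J_all[:, :W1.size].T`; the full kernel is the sum of the per-layer kernels.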
Across all experiments, the authors find that:
- The eNTK spectrum is highly anisotropic, with large gaps (“cliffs”) that correspond to the number of meaningful features.
- Top eigenvectors align strongly (cosine similarity > 0.9 in many cases) with known ground‑truth features (raw input features in TMS, Fourier modes in modular arithmetic).
- Layer‑wise eNTK decomposition isolates the layer responsible for each feature family.
- The emergence of a new cliff coincides with the grokking transition, suggesting eNTK can serve as an early‑warning signal for sudden changes in generalization behavior.
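The "cliffs" in these findings can be detected mechanically by flagging large ratios between consecutive eigenvalues of the sorted spectrum. A minimal sketch (the ratio threshold is an illustrative choice, not a value from the paper):

```python
import numpy as np

def find_cliffs(eigvals, ratio=10.0):
    """Indices i such that eigvals[i] / eigvals[i+1] > ratio.

    `eigvals` must be sorted in descending order; a cliff after index i
    separates the top (i+1)-dimensional eigenspace from the rest.
    """
    ev = np.clip(np.asarray(eigvals, dtype=float), 1e-12, None)  # guard zeros
    return [i for i in range(len(ev) - 1) if ev[i] / ev[i + 1] > ratio]

# Synthetic spectrum with cliffs after the 3rd and 7th eigenvalues.
spec = [100.0, 90.0, 80.0, 1.0, 0.9, 0.8, 0.7, 1e-3, 9e-4]
print(find_cliffs(spec))  # [2, 6]
```

Tracking the output of such a detector over checkpoints would give the kind of phase-transition diagnostic described above: a new index appearing in the list marks the emergence of a new cliff.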
The paper situates its contributions relative to prior work that visualized NTK eigenvectors in image models and studies on tangent‑feature rotation. It also discusses connections to Hessian‑based influence functions and recent mechanistic interpretability tools such as the Local Interaction Basis.
Implications: The results indicate that eNTK eigenanalysis is a practical, computationally inexpensive method for feature discovery in small networks and for monitoring training dynamics. While the current evidence is limited to synthetic, low‑dimensional tasks, the methodology could be scaled to larger language or vision models, potentially offering a kernel‑based lens into their internal representations without requiring exhaustive probing or intervention.
Limitations & Future Work: The experiments are confined to toy settings with known ground‑truth features; extending to real‑world datasets where features are unknown remains an open challenge. Moreover, computing the full eNTK scales quadratically with dataset size, so approximations or stochastic estimators will be needed for large‑scale models. The authors suggest exploring closed‑form analyses of the eNTK in simple architectures, investigating the relationship between eNTK cliffs and other training phenomena (e.g., double‑descent), and integrating eNTK diagnostics into training pipelines.
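On the scalability point, one standard workaround (not proposed in the paper) is to sketch the parameter axis of the Jacobian with a random projection, which preserves the Gram matrix in expectation and replaces the parameter count P with a chosen sketch width k in the Gram computation. A sketch under that assumption, with a synthetic Jacobian standing in for a real one:

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows, n_params, k = 40, 2000, 1024   # N·C rows, parameter count P, sketch width

# Synthetic Jacobian with decaying parameter-wise scales (stand-in for a real J).
J = rng.normal(size=(n_rows, n_params)) * np.linspace(1.0, 0.01, n_params)
K_exact = J @ J.T

# Johnson-Lindenstrauss-style sketch: E[S @ S.T] / k = I, so E[K_approx] = K_exact.
S = rng.normal(size=(n_params, k))
JS = J @ S                              # projected Jacobian, shape (N·C, k)
K_approx = (JS @ JS.T) / k

rel_err = np.linalg.norm(K_approx - K_exact) / np.linalg.norm(K_exact)
```

The toy sizes here make the savings modest; the point is that the Gram step costs O((N·C)² k) instead of O((N·C)² P), which matters when P is in the millions.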