Late-Stage Generalization Collapse in Grokking: Detecting anti-grokking with Weightwatcher

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

\emph{Memorization} in neural networks lacks a precise operational definition and is often inferred from the grokking regime, where training accuracy saturates while test accuracy remains very low. We identify a previously unreported third phase of grokking in this training regime: \emph{anti-grokking}, a late-stage collapse of generalization. We revisit two canonical grokking setups, a 3-layer MLP trained on a subset of MNIST and a transformer trained on modular addition, extending training far beyond the standard duration. In both cases, after models transition from pre-grokking to successful generalization, test accuracy collapses back to chance while training accuracy remains perfect, indicating a distinct post-generalization failure mode. To diagnose anti-grokking, we use the open-source \texttt{WeightWatcher} tool based on HTSR/SETOL theory. The primary signal is the emergence of \emph{Correlation Traps}: anomalously large eigenvalues beyond the Marchenko–Pastur bulk in the empirical spectral density of shuffled weight matrices, which are predicted to impair generalization. As a secondary signal, anti-grokking corresponds to the average HTSR layer quality metric $\alpha$ deviating from $2.0$. Neither metric requires access to the test or training data. We compare these signals to alternative grokking diagnostics, including $\ell_2$ norms, Activation Sparsity, Absolute Weight Entropy, and Local Circuit Complexity. These track pre-grokking and grokking but fail to identify anti-grokking. Finally, we show that Correlation Traps can induce catastrophic forgetting and/or prototype memorization, and observe similar pathologies in large-scale open-weight LLMs such as GPT-OSS 20B/120B.


💡 Research Summary

The paper introduces a previously undocumented phase of neural‑network training that the authors call “anti‑grokking,” a late‑stage collapse of generalization that follows the classic grokking phenomenon. Grokking, as originally described, is a delayed emergence of high test accuracy after the model has already achieved perfect training accuracy; test performance remains near chance for a long period before suddenly improving. The authors revisit two canonical grokking experiments—a three‑layer ReLU MLP trained on a subset of MNIST and a small transformer trained on a modular addition task—and extend the training runs far beyond the usual stopping points (up to 10⁷ steps). In both settings, after the model reaches the grokking peak (high test accuracy), the test accuracy subsequently falls back to chance while training accuracy stays at 100 %. This regression constitutes a distinct failure mode that the authors label anti‑grokking.

To detect anti‑grokking without any access to training or test labels, the authors employ the open‑source WeightWatcher (WW) toolkit, which is built on Random Matrix Theory (RMT) and the Heavy‑Tailed Self‑Regularization (HTSR) / SETOL framework. WW provides two key diagnostics:

  1. Correlation Traps – eigenvalues of a shuffled version of a layer’s weight matrix that lie far beyond the Marchenko–Pastur bulk. Such outliers indicate atypical, highly correlated structures in the weight matrix, which theory predicts to be detrimental to generalization.

  2. HTSR power‑law exponent α – the slope of the heavy‑tailed part of the empirical spectral density (ESD). An α≈2 is theoretically optimal; α>2 signals weak correlations, while α<2 signals a very heavy tail (VHT) associated with over‑fitting.
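To make the Correlation Trap criterion concrete, the sketch below shuffles a weight matrix element-wise, computes the empirical spectral density of the shuffled layer, and counts eigenvalues beyond the Marchenko–Pastur bulk edge λ⁺ = σ²(1 + √γ)². This is our own illustration of the idea, not WeightWatcher's actual implementation; the function names and the tolerance `tol` are invented here.

```python
import numpy as np

def mp_bulk_edge(W):
    """Upper Marchenko-Pastur bulk edge for the ESD of X = W^T W / N."""
    N, M = max(W.shape), min(W.shape)
    gamma = M / N                      # aspect ratio, 0 < gamma <= 1
    sigma2 = np.var(W)                 # element variance sets the MP scale
    return sigma2 * (1.0 + np.sqrt(gamma)) ** 2

def count_correlation_traps(W, rng, tol=1.05):
    """Count eigenvalues of the *shuffled* layer's ESD beyond the MP bulk.

    Element-wise shuffling destroys learned correlations, so the shuffled
    ESD should follow Marchenko-Pastur; any eigenvalue still beyond the
    bulk edge is an anomalous spike, i.e. a Correlation Trap.
    """
    W_shuf = rng.permutation(W.ravel()).reshape(W.shape)
    N = max(W.shape)
    evals = np.linalg.eigvalsh(W_shuf.T @ W_shuf / N)
    return int(np.sum(evals > tol * mp_bulk_edge(W_shuf)))

rng = np.random.default_rng(0)
W = rng.normal(size=(400, 100))        # pure-noise "layer": no traps expected
n_noise = count_correlation_traps(W, rng)

W_trap = W.copy()
W_trap[0, 0] = 50.0                    # one anomalously large weight element
n_trap = count_correlation_traps(W_trap, rng)
print(n_noise, n_trap)
```

A single oversized element survives shuffling and produces an eigenvalue of roughly 50²/N ≈ 6.25, far outside the bulk, which is the basic mechanism behind a trap.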

During the anti‑grokking phase the authors observe a surge in the number of Correlation Traps and a systematic deviation of the average α from 2. In the MLP, α drops below 2; in the modular addition transformer, α rises above 2 because the entire spectrum is already VHT. These spectral signatures coincide precisely with the test‑accuracy collapse, whereas other metrics—ℓ₂ weight norm, Activation Sparsity, Absolute Weight Entropy, Approximate Local Circuit Complexity—track the pre‑grokking and grokking transitions but fail to flag anti‑grokking.
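For intuition on the α metric, the tail exponent of a spectrum can be estimated with a Hill-type maximum-likelihood fit. The sketch below is illustrative only (WeightWatcher uses a more careful Clauset-style fit with an optimized xmin), and the synthetic Pareto sample stands in for a real layer's ESD tail.

```python
import numpy as np

def hill_alpha(evals, xmin):
    """Hill-type MLE of the exponent alpha in p(lambda) ~ lambda^(-alpha),
    fit on the tail {lambda >= xmin} of an eigenvalue spectrum."""
    tail = evals[evals >= xmin]
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

rng = np.random.default_rng(1)
# Synthetic Pareto-tailed "eigenvalues" with true exponent alpha = 2,
# the HTSR sweet spot; alpha < 2 would signal a very-heavy-tailed layer.
evals = rng.pareto(1.0, size=5000) + 1.0   # Pareto(shape=1) => alpha = 2
alpha_hat = hill_alpha(evals, xmin=1.0)
print(round(alpha_hat, 2))
```

Tracking this estimate per layer over training, with no labels needed, is what lets the deviation from α ≈ 2 serve as a data-free warning signal.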

The paper also explores the effect of weight decay (WD). Adding a non‑zero WD suppresses the emergence of Correlation Traps and mitigates the severity of the test‑accuracy drop, confirming that regularization can curb the pathological spectral dynamics.
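The mitigation is visible directly in the update rule: decoupled weight decay multiplies every weight by (1 − lr·λ_wd) each step, continuously contracting the outsized elements that seed Correlation Traps. A toy numpy sketch of that update (not the authors' training code; the learning rate and decay values are arbitrary):

```python
import numpy as np

def sgd_wd_step(w, grad, lr=0.1, weight_decay=0.5):
    """One decoupled weight-decay step: shrink w, then apply the gradient."""
    return (1.0 - lr * weight_decay) * w - lr * grad

w = np.full(4, 1.0)
zero_grad = np.zeros(4)
# With no gradient signal (a memorization plateau), decay alone
# shrinks the weights geometrically, suppressing spectral outliers.
for _ in range(10):
    w = sgd_wd_step(w, zero_grad)
print(np.linalg.norm(w))   # 2.0 * 0.95**10 ~= 1.197
```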

Beyond diagnostics, the authors provide mechanistic interpretations. In the MLP, specific Correlation Traps correspond to “prototype over‑fitting”: the model memorizes particular training instances, leading to confusion and loss of generalization. In the transformer, the spectral density becomes uniformly heavy‑tailed (VHT) across the entire bulk, which the authors term “rule‑based memorization.” Both mechanisms illustrate how different forms of over‑parameterization manifest as spectral anomalies.

Finally, the authors report observing similar trap-like spectral outliers in large-scale open-weight language models (GPT-OSS 20B/120B), suggesting that anti-grokking is not limited to toy tasks but may affect real-world systems deployed for long periods.

In summary, the contributions are:

  • Identification of a third, late‑stage phase of grokking (anti‑grokking) where test accuracy collapses while training accuracy remains perfect.
  • Demonstration that WeightWatcher’s Correlation Traps and HTSR α provide reliable, data‑free early warnings of this collapse.
  • Empirical evidence that weight decay mitigates anti‑grokking by limiting trap formation.
  • Detailed analysis of the underlying spectral mechanisms (prototype over‑fitting, rule‑based memorization).
  • Extension of the phenomenon to large language models, highlighting practical relevance.

The work bridges empirical observations with RMT‑based theory, offering a practical toolbox for monitoring and potentially preventing catastrophic generalization loss in deep networks.

