Decoupling Dynamical Richness from Representation Learning: Towards Practical Measurement

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Dynamic feature transformation (the rich regime) does not always align with better predictive performance (better representations), yet accuracy is often used as a proxy for richness, limiting analysis of their relationship. We propose a computationally efficient, performance-independent metric of richness grounded in the low-rank bias of rich dynamics, which recovers neural collapse as a special case. The metric is empirically more stable than existing alternatives and captures known lazy-to-rich transitions (e.g., grokking) without relying on accuracy. We further use it to examine how training factors (e.g., learning rate) relate to richness, confirming recognized assumptions and highlighting new observations (e.g., batch normalization promotes rich dynamics). An eigendecomposition-based visualization is also introduced to support interpretability, together providing a diagnostic tool for studying the relationship between training factors, dynamics, and representations.


💡 Research Summary

The paper addresses a fundamental gap in deep learning research: the lack of a performance‑independent, computationally cheap metric that quantifies the “richness” of training dynamics. While the community often equates richer dynamics with better representations, existing proxies—such as changes in the Neural Tangent Kernel (NTK), similarity to the initial kernel, parameter norms, or neural‑collapse‑based class separation—are either tied to accuracy, require heavy computation, or depend on label information.

To overcome these limitations, the authors propose a novel metric, denoted Dₗᵣ (Dynamic Low‑Rank measure). The construction proceeds as follows. Let Φ(x) be the penultimate‑layer feature map of a network, and define the feature kernel operator T = Σₖ |Φₖ⟩⟨Φₖ| acting on the L² function space over the data distribution. For a given network, the learned output functions form a subspace Ĥ = span{f̂₁,…,f̂_C}. The authors introduce the Minimum Projection (MP) operator Tₘₚ, which is essentially the orthogonal projection onto Ĥ (up to an additive constant term). In an ideal “rich” regime, the feature kernel collapses onto this low‑dimensional subspace, i.e., T ≈ Tₘₚ, meaning that only the minimal C directions are needed to express the learned function.
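For concreteness, the operators above can be given finite-sample analogues: on a batch of n inputs, T corresponds to the n × n Gram matrix of penultimate-layer features, and Tₘₚ to the kernel of the orthogonal projection onto the span of the learned outputs. The sketch below is an illustration under those assumptions, not the authors' code; the function names and shapes are hypothetical.

```python
import numpy as np

def feature_kernel(Phi):
    # Empirical analogue of T = sum_k |Phi_k><Phi_k|: the n x n Gram
    # matrix of penultimate-layer features over the sampled inputs.
    return Phi @ Phi.T

def min_projection_kernel(F_hat):
    # Kernel of the orthogonal projection onto span{f_1, ..., f_C},
    # evaluated on the same samples: F (F^T F)^+ F^T.
    G = F_hat.T @ F_hat                       # C x C
    return F_hat @ np.linalg.pinv(G) @ F_hat.T

rng = np.random.default_rng(0)
Phi = rng.standard_normal((64, 128))          # n=64 samples, p=128 features
F_hat = rng.standard_normal((64, 10))         # learned outputs, C=10
T = feature_kernel(Phi)
T_mp = min_projection_kernel(F_hat)
```

In the rich extreme described above, T would coincide (up to scale) with T_mp, i.e., the feature Gram matrix would itself act as a projection onto the learned-output span.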

Dₗᵣ is defined as
 Dₗᵣ = 1 – CKA(T, Tₘₚ),
where CKA denotes Centered Kernel Alignment, a normalized similarity measure ranging from 0 to 1. Consequently, Dₗᵣ ≈ 0 indicates that the feature kernel is already aligned with the minimal projection (high richness), while Dₗᵣ ≈ 1 signals a large gap (lazy dynamics). Crucially, this formulation does not involve test accuracy, initial weights, or class labels, making it applicable to any training stage and even to unlabeled data.
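A minimal sketch of the metric itself, assuming the standard linear CKA with the usual double-centering (the paper's exact estimator may differ); all names here are illustrative:

```python
import numpy as np

def cka(K1, K2):
    # Linear centered kernel alignment between two n x n kernel matrices.
    n = K1.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    K1c, K2c = H @ K1 @ H, H @ K2 @ H
    return np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

def d_lr(Phi, F_hat):
    # D_lr = 1 - CKA(T, T_mp): T is the feature Gram matrix, T_mp the
    # kernel of the projection onto the span of the learned outputs.
    T = Phi @ Phi.T
    T_mp = F_hat @ np.linalg.pinv(F_hat.T @ F_hat) @ F_hat.T
    return 1.0 - cka(T, T_mp)

rng = np.random.default_rng(0)
F_hat = rng.standard_normal((64, 10))         # learned outputs, n x C
Q, _ = np.linalg.qr(F_hat)                    # features spanning exactly them
print(round(d_lr(Q, F_hat), 6))               # -> 0.0 (rich extreme)
Phi = rng.standard_normal((64, 128))          # generic random features
```

When the feature kernel is exactly the projection onto the learned-output span, CKA is 1 and Dₗᵣ is 0; for generic features the alignment drops and Dₗᵣ moves toward 1, matching the interpretation in the text.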

The authors prove that when T equals Tₘₚ, the classic neural‑collapse conditions NC1 (within‑class covariance vanishes) and NC2 (class means form a simplex equiangular tight frame) automatically hold. Thus, Dₗᵣ generalizes neural collapse: it reduces to the neural‑collapse metric in the special case of perfect class‑wise alignment, but remains meaningful in broader settings where labels may be absent or the task is regression.
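The NC2 condition can be checked numerically. The snippet below uses the textbook simplex-ETF construction (project the canonical basis off its mean, embed, and normalize), which is standard material rather than code from the paper; it verifies that all pairwise cosines between class means equal −1/(C−1).

```python
import numpy as np

def simplex_etf(C, p):
    # Textbook construction of C unit-norm class means forming a simplex
    # equiangular tight frame in R^p (assumes p >= C).
    M = np.eye(C) - np.ones((C, C)) / C       # canonical basis minus its mean
    U = np.zeros((p, C))
    U[:C, :] = M
    return U / np.linalg.norm(U, axis=0)      # normalize each column

C, p = 10, 128
means = simplex_etf(C, p)
cos = means.T @ means                         # pairwise cosine similarities
off = cos[~np.eye(C, dtype=bool)]             # off-diagonal entries
print(np.allclose(off, -1 / (C - 1)))         # -> True: NC2 angle condition
```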

From a computational standpoint, Dₗᵣ requires only a forward pass on a modest number of samples to collect pre‑ and post‑last‑layer activations, yielding matrices of size n × p and n × C (p = last‑layer width, C = number of outputs). The metric can be computed in O(npC) time; with typical values p≈10³, C≈10², and n≈p, the cost is negligible compared with NTK‑based methods that scale with the total parameter count.
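The cost claim can be made concrete: for linear CKA, the identities ⟨XXᵀ, YYᵀ⟩_F = ‖XᵀY‖²_F and ‖XXᵀ‖_F = ‖XᵀX‖_F let the metric be computed from p × C and smaller cross-terms without ever materializing an n × n kernel. The sketch below assumes the linear-CKA formulation; it is one possible cheap implementation, not the authors' code.

```python
import numpy as np

def d_lr_cheap(Phi, F_hat):
    # Same quantity as 1 - CKA(T, T_mp), but no n x n matrix is formed:
    # centering the columns of the feature matrices is equivalent to
    # double-centering the corresponding Gram matrices.
    Q, _ = np.linalg.qr(F_hat)                # orthonormal basis of output span
    Xc = Phi - Phi.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    hsic = np.linalg.norm(Xc.T @ Qc) ** 2     # p x C cross-term
    return 1.0 - hsic / (np.linalg.norm(Xc.T @ Xc) * np.linalg.norm(Qc.T @ Qc))

rng = np.random.default_rng(0)
Phi = rng.standard_normal((64, 128))          # n x p activations
F_hat = rng.standard_normal((64, 10))         # n x C outputs
Q0, _ = np.linalg.qr(F_hat)
print(round(d_lr_cheap(Q0, F_hat), 6))        # -> 0.0 (features = projection)
```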

Empirical validation proceeds in three parts. First, the authors compare Dₗᵣ against three established richness proxies: (1) similarity to the initial kernel (S_init), (2) L₂ norm of the parameters, and (3) the neural‑collapse class‑separation score (NC1). In a controlled experiment with an MLP on MNIST under extreme weight decay and a vanishing learning rate, existing metrics mistakenly suggest increased richness after training, whereas Dₗᵣ correctly reports a lack of richness (high Dₗᵣ).

Second, they examine the “target down‑scaling” manipulation, where the training targets are divided by a factor α. Theory predicts that larger α induces lazier dynamics. Across α values spanning three orders of magnitude, Dₗᵣ monotonically increases with α, faithfully tracking the lazy‑to‑rich transition, while the other metrics either remain flat or move in the opposite direction.

Third, the paper explores how various training hyper‑parameters affect richness. Increasing the learning rate consistently lowers Dₗᵣ, indicating richer dynamics, whereas smaller rates keep the model in a lazy regime. Notably, adding batch normalization to a VGG‑16 trained on CIFAR‑100 dramatically reduces Dₗᵣ, revealing that batch norm promotes low‑rank feature compression and thus richer dynamics—a novel observation not captured by prior metrics. Additional studies on architecture depth/width and weight decay corroborate the intuition that larger, more expressive models tend to converge to lower‑rank representations when training dynamics are sufficiently rich.

To aid interpretability, the authors introduce an eigendecomposition‑based visualization. By decomposing T and Tₘₚ, they plot (i) the cumulative contribution of last‑layer features to the target function, (ii) the contribution of each feature to the learned function, and (iii) the norm distribution across features. In rich models, a handful of features dominate both contributions and have large norms, whereas lazy models exhibit a more gradual decay, confirming the low‑rank hypothesis visually.
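A non-graphical sketch of the computation such plots would be built from, under the assumption that "contribution" means the squared projection of the learned outputs onto each eigenvector of the feature kernel (the helper name and exact normalization are hypothetical):

```python
import numpy as np

def spectral_contributions(Phi, F_hat):
    # Eigendecompose the (centered) feature kernel and measure how much of
    # the learned function each eigen-direction carries, as a distribution.
    Xc = Phi - Phi.mean(axis=0)
    evals, evecs = np.linalg.eigh(Xc @ Xc.T)          # n x n kernel spectrum
    evecs = evecs[:, np.argsort(evals)[::-1]]         # descending eigenvalues
    contrib = ((evecs.T @ F_hat) ** 2).sum(axis=1)    # sum over output channels
    return contrib / contrib.sum()

rng = np.random.default_rng(0)
Phi = rng.standard_normal((64, 128))
F_hat = rng.standard_normal((64, 10))
c = spectral_contributions(Phi, F_hat)
cum = np.cumsum(c)
# Random features spread the learned function across many directions (a
# "lazy" profile); a rich model would concentrate cum near 1.0 within ~C
# leading directions.
```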

In summary, the paper delivers a theoretically grounded, label‑agnostic, and computationally efficient metric for dynamical richness. Dₗᵣ bridges the gap between rich dynamics and representation quality, validates known lazy‑to‑rich phenomena (e.g., grokking, target scaling), uncovers new relationships (batch normalization’s effect), and provides a practical diagnostic toolkit for researchers probing the interplay between training choices, dynamics, and learned representations.

