Degradation of Feature Space in Continual Learning

Notice: This research summary and analysis were generated automatically with AI. For authoritative details, please refer to the original arXiv source.

Centralized training is the standard paradigm in deep learning, enabling models to learn from a unified dataset in a single location. In such a setup, isotropic feature distributions naturally arise as a means of supporting well-structured and generalizable representations. In contrast, continual learning operates on streaming, non-stationary data and trains models incrementally, inherently facing the well-known plasticity-stability dilemma. In such settings, the learning dynamics tend to yield an increasingly anisotropic feature space. This raises a fundamental question: should isotropy be enforced to achieve a better balance between stability and plasticity, and thereby mitigate catastrophic forgetting? In this paper, we investigate whether promoting feature-space isotropy can enhance representation quality in continual learning. Through experiments using contrastive continual learning techniques on the CIFAR-10 and CIFAR-100 datasets, we find that isotropic regularization fails to improve, and can in fact degrade, model accuracy in continual settings. Our results highlight essential differences in feature geometry between centralized and continual learning, suggesting that isotropy, while beneficial in centralized setups, may not constitute an appropriate inductive bias for non-stationary learning scenarios.


💡 Research Summary

The paper investigates whether the isotropic feature representations that naturally emerge in centralized deep‑learning training are also beneficial in continual learning (CL), where data arrives sequentially and non‑stationarily. The authors begin by noting that in centralized settings, well‑trained networks tend to produce isotropic feature spaces, a phenomenon linked to Neural Collapse: within‑class features cluster tightly around class means, and class means form an equiangular tight frame. In contrast, CL must balance plasticity (learning new tasks) and stability (retaining old knowledge), and the authors hypothesize that this dynamic will drive the feature space toward anisotropy.

To quantify isotropy, they extend two metrics to high‑dimensional spaces. IsoScore is a direct generalization of a previously proposed 3‑D measure, but it suffers from sampling noise in high dimensions. Consequently, they introduce IsoEntropy, which treats the normalized eigenvalues of the covariance matrix as a probability distribution and computes the Shannon entropy, normalized by the maximal entropy log D. Both metrics range from 0 (completely anisotropic) to 1 (perfectly isotropic). Synthetic Gaussian clusters with controllable anisotropy (parameter ρ) are generated to provide reference curves for these metrics.
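As described above, IsoEntropy treats the normalized eigenvalues of the feature covariance matrix as a probability distribution and normalizes its Shannon entropy by log D. A minimal sketch of that computation follows; the paper's exact implementation is not reproduced here, so details such as the eigenvalue clipping are assumptions.

```python
import numpy as np

def iso_entropy(features: np.ndarray) -> float:
    """IsoEntropy sketch: Shannon entropy of the normalized eigenvalue
    spectrum of the feature covariance, divided by log D.
    Returns a value in [0, 1]: 0 = completely anisotropic,
    1 = perfectly isotropic."""
    X = features - features.mean(axis=0)      # center the features (N x D)
    cov = X.T @ X / (len(X) - 1)              # D x D sample covariance
    eig = np.linalg.eigvalsh(cov)             # real eigenvalues, ascending
    eig = np.clip(eig, 0.0, None)             # guard against tiny negatives
    p = eig / eig.sum()                       # spectrum as a distribution
    p = p[p > 0]                              # drop zero mass before log
    H = -(p * np.log(p)).sum()                # Shannon entropy (nats)
    return float(H / np.log(cov.shape[0]))    # normalize by max entropy log D
```

For an isotropic Gaussian the score approaches 1; stretching one feature dimension drives it down, matching the metric's intended range.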

Experiments are conducted on CIFAR‑10 and CIFAR‑100 using a ResNet‑18 backbone with a 2‑layer projection head that maps features to a 128‑dimensional latent space. Four learning strategies are evaluated: SupCon (supervised contrastive loss), Co²L (SupCon + instance‑wise relation distillation), SupCP (SupCon + prototype‑based loss), and NCI (Co²L + prototype distillation). Each method is trained in a fully centralized regime and in three CL regimes: 2 experiences (50 %/50 %), 3 experiences (40 %/30 %/30 %), and 5 experiences (20 % per experience). A replay buffer (200 samples for CIFAR‑10, 800 for CIFAR‑100) is used for CL methods. The authors also test an isotropy regularization term λ_iso · IsoEntropy added to the loss.
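The CL regimes above rely on a small replay buffer (200 samples for CIFAR-10, 800 for CIFAR-100). The paper does not specify the buffer policy; a common choice in continual learning is reservoir sampling, sketched below as an illustration (class names and methods are hypothetical, not from the paper).

```python
import random

class ReplayBuffer:
    """Hypothetical reservoir-sampling replay buffer: keeps a uniform
    random subset of the stream within a fixed capacity."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = []
        self.seen = 0          # total samples observed so far

    def add(self, sample) -> None:
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            # Replace a stored sample with probability capacity / seen
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = sample

    def sample(self, k: int):
        """Draw a replay minibatch without replacement."""
        return random.sample(self.data, min(k, len(self.data)))
```

With capacity 200, every sample seen so far has equal probability of residing in the buffer, which is why reservoir sampling is a popular default when the stream length is unknown.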

Key findings:

  1. Centralized training yields high IsoEntropy and IsoScore values; t‑SNE visualizations show compact, near‑spherical class clusters, confirming isotropy.
  2. In all CL scenarios, both isotropy metrics drop sharply as the number of experiences grows, and t‑SNE plots reveal elongated, irregular clusters—evidence of progressive anisotropy.
  3. Adding isotropy regularization does not improve performance. For Co²L and NCI, accuracy declines by 2–4 percentage points, despite modest gains in IsoEntropy. The regularization appears to conflict with the knowledge‑distillation objectives that preserve past knowledge, reducing plasticity for new tasks.
  4. Synthetic baselines confirm that IsoEntropy monotonically decreases with increasing ρ, and the anisotropy observed in CL exceeds that of synthetic data with comparable ρ, indicating that the sequential nature of CL introduces additional geometric distortion beyond simple covariance scaling.
  5. Linear probing on frozen features shows that representations learned under CL are less transferable than centralized ones, correlating with lower isotropy scores.
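Finding 4's synthetic baseline can be illustrated with a small sketch: Gaussian data whose covariance eigenvalues decay with an anisotropy parameter ρ, with IsoEntropy computed on the sample covariance. The geometric decay (1 − ρ)^i is an assumed parameterization for illustration; the paper does not specify its exact generator.

```python
import numpy as np

def iso_entropy_from_cov(cov: np.ndarray) -> float:
    # Normalized eigenvalue spectrum -> Shannon entropy / log D
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eig / eig.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(cov.shape[0]))

rng = np.random.default_rng(0)
D, N = 16, 20000
scores = []
for rho in (0.0, 0.5, 0.9):                    # 0 = isotropic, larger = more anisotropic
    lam = (1.0 - rho) ** np.arange(D)          # assumed geometric eigenvalue decay
    X = rng.normal(size=(N, D)) * np.sqrt(lam) # samples with that covariance spectrum
    cov = np.cov(X, rowvar=False)
    scores.append(iso_entropy_from_cov(cov))
```

Running this yields scores that decrease strictly as ρ grows, mirroring the monotone reference curves the paper reports for its synthetic clusters.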

The authors conclude that isotropy, while a useful inductive bias in static, centralized training, is not universally advantageous in continual learning. The non‑stationary data stream and limited replay buffer impose geometric constraints that naturally drive the feature space toward anisotropy. Enforcing isotropy can degrade both representation quality and downstream accuracy. Future work should explore regularization strategies that explicitly accommodate or even exploit anisotropy, rather than trying to force a centralized‑style geometry onto a fundamentally different learning paradigm.

