Learning Dynamic Representations via An Optimally-Weighted Maximum Mean Discrepancy Optimization Framework for Continual Learning

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv source.

Continual learning has emerged as a pivotal area of research, primarily because it allows models to persistently acquire and retain information. However, catastrophic forgetting can severely impair model performance. In this study, we address network forgetting by introducing a novel framework termed Optimally-Weighted Maximum Mean Discrepancy (OWMMD), which penalizes representation alterations via a Multi-Level Feature Matching Mechanism (MLFMM). Furthermore, we propose an Adaptive Regularization Optimization (ARO) strategy to refine the adaptive weight vectors, which autonomously assess the significance of each feature layer throughout the optimization process. The proposed ARO approach relieves the over-regularization problem and promotes the learning of future tasks. We conduct a comprehensive series of experiments, benchmarking our proposed method against several established baselines. The empirical findings indicate that our approach achieves state-of-the-art performance.


💡 Research Summary

The paper tackles catastrophic forgetting in continual learning (CL) by introducing a novel framework called Optimally‑Weighted Maximum Mean Discrepancy (OWMMD). Unlike most existing CL approaches that focus on replaying past samples, distilling final logits, or applying uniform regularization to parameters, OWMMD directly controls the drift of internal feature representations across all layers. The core of the method is a Multi‑Level Feature Matching Mechanism (MLFMM) that measures the distributional distance between the feature embeddings of the current task and those stored from previous tasks using the Maximum Mean Discrepancy (MMD) statistic. MMD is a kernel‑based, non‑parametric distance that captures differences in the mean embeddings of two distributions; by minimizing this distance for each layer, the network is forced to keep its intermediate representations stable while still learning new knowledge.
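To make the kernel statistic concrete, here is a minimal NumPy sketch of a biased squared-MMD estimate with an RBF kernel (not the paper's implementation; the sample sizes, dimensionality, and bandwidth below are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(x, y, sigma=4.0):
    # Pairwise squared Euclidean distances between rows of x and y.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=4.0):
    """Biased estimate of squared MMD between two samples."""
    kxx = rbf_kernel(x, x, sigma).mean()
    kyy = rbf_kernel(y, y, sigma).mean()
    kxy = rbf_kernel(x, y, sigma).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
# Features drawn from the same distribution give a near-zero MMD;
# a shifted distribution gives a clearly larger one.
same = mmd2(rng.normal(0, 1, (256, 8)), rng.normal(0, 1, (256, 8)))
diff = mmd2(rng.normal(0, 1, (256, 8)), rng.normal(3, 1, (256, 8)))
```

In the paper's setting, `x` and `y` would be the layer's embeddings of the current model and the stored embeddings from previous tasks, and this distance would be minimized per layer.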

To avoid the over‑regularization problem that can arise when strong penalties are applied uniformly, the authors propose Adaptive Regularization Optimization (ARO). ARO computes a per‑layer importance weight in real time, combining three signals: (i) the contribution of that layer to the overall MMD loss, (ii) Fisher‑information‑based parameter importance (as used in EWC), and (iii) the magnitude of gradients observed during the current training step. These importance scores form a weight vector w_k that multiplies the layer‑wise MMD terms, yielding a final regularization term Σ_k w_k·MMD_k. The weights are learned jointly with the main task loss via back‑propagation, allowing the system to automatically tighten constraints on crucial layers while relaxing them on less critical ones. This adaptive scheme mitigates forgetting without sacrificing the plasticity needed for new tasks.
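The shape of such a weighted regularizer Σ_k w_k·MMD_k can be sketched as follows (a toy illustration only: the per-layer MMD values, the fixed logits, the softmax normalization, and the regularization strength are all assumptions for demonstration; in the paper's ARO scheme the weights are updated by back-propagation together with the task loss):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical per-layer MMD penalties from three feature levels.
mmd_per_layer = np.array([0.8, 0.3, 0.05])

# Learnable importance logits (held fixed here for illustration).
logits = np.zeros(3)
w = softmax(logits)                 # adaptive weight vector w_k

reg_strength = 1.0                  # illustrative hyperparameter
task_loss = 0.42                    # placeholder task-loss value
total_loss = task_loss + reg_strength * float(w @ mmd_per_layer)
```

With uniform weights the regularizer reduces to the plain average of the layer-wise MMD terms; learned weights would shift mass toward layers judged more important.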

The authors evaluate OWMMD + ARO on a suite of standard CL benchmarks, including Split‑MNIST, CIFAR‑100, ImageNet‑Subset, and class‑incremental variants of CORe50. Baselines span rehearsal methods (ER, GEM, A‑GEM, GSS), knowledge‑distillation approaches (LwF, iCaRL, DER), and regularization techniques (EWC, oEWC, SI, RW). Across all datasets, OWMMD consistently outperforms the baselines in average accuracy and forgetting metrics, often by 2–7 percentage points. Notably, under severe memory constraints (≤200 stored samples), the proposed method still retains a clear advantage, demonstrating that its representation‑level regularization is more data‑efficient than pure replay. Computationally, the added cost is limited to the MMD kernel matrix computation, which scales quadratically with batch size; the authors argue that this overhead is comparable to that of other regularization methods and can be alleviated with kernel approximations.

The paper also discusses limitations. The choice of kernel (RBF in experiments) strongly influences performance, and the full kernel matrix can become memory‑intensive for large‑scale datasets. The authors suggest future work on random Fourier feature approximations, sub‑sampling strategies, and meta‑learning of the adaptive weights to further reduce overhead and improve scalability.
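One such approximation, random Fourier features, replaces the quadratic kernel matrix with an explicit feature map, so the squared MMD reduces to a distance between mean feature vectors computable in time linear in the batch size. A minimal sketch under assumed bandwidth and feature-count settings (not code from the paper):

```python
import numpy as np

def rff_map(x, omega, b):
    # z(x) = sqrt(2/D) * cos(x @ omega + b) gives z(x) @ z(y) ≈ k(x, y)
    # for the RBF kernel whose spectral density generated omega.
    return np.sqrt(2.0 / omega.shape[1]) * np.cos(x @ omega + b)

def mmd2_rff(x, y, sigma=4.0, n_features=512, seed=0):
    """Linear-time squared-MMD estimate via random Fourier features."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # For k(x, y) = exp(-||x-y||^2 / (2 sigma^2)), omega ~ N(0, I / sigma^2).
    omega = rng.normal(0.0, 1.0 / sigma, (d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, n_features)
    gap = rff_map(x, omega, b).mean(0) - rff_map(y, omega, b).mean(0)
    return float(gap @ gap)
```

Because only the mean feature vectors are compared, stored embeddings from old tasks can also be summarized by a single mean vector per layer, reducing memory as well as compute.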

In summary, OWMMD introduces a principled, layer‑wise feature‑matching regularizer that directly curbs representation drift, while ARO provides an adaptive mechanism to balance stability and plasticity. The combined system achieves state‑of‑the‑art results on diverse continual learning benchmarks, offering a compelling alternative to existing replay‑, distillation‑, and uniform‑regularization approaches.

