Forgetting is Everywhere

Notice: This research summary and analysis were automatically generated using AI. For complete accuracy, please refer to the original arXiv source.

A fundamental challenge in developing general learning algorithms is their tendency to forget past knowledge when adapting to new data. Addressing this problem requires a principled understanding of forgetting; yet, despite decades of study, no unified definition has emerged that provides insights into the underlying dynamics of learning. We propose an algorithm- and task-agnostic theory that characterises forgetting as a lack of self-consistency in a learner’s predictive distribution, manifesting as a loss of predictive information. Our theory naturally yields a general measure of an algorithm’s propensity to forget and demonstrates that exact Bayesian inference allows for adaptation without forgetting. To validate the theory, we design a comprehensive set of experiments that span classification, regression, generative modelling, and reinforcement learning. We empirically demonstrate how forgetting is present across all deep learning settings and plays a significant role in determining learning efficiency. Together, these results establish a principled understanding of forgetting and lay the foundation for analysing and improving the information retention capabilities of general learning algorithms.


💡 Research Summary

The paper tackles the pervasive yet poorly understood phenomenon of forgetting in machine‑learning systems. While most prior work treats forgetting as a drop in performance on earlier tasks—especially in continual‑learning (CL) settings—the authors argue that this view conflates two distinct effects: backward transfer (where new learning improves past performance) and genuine forgetting (where it degrades past knowledge). To obtain a definition that applies across any learning paradigm, they propose a predictive self‑consistency perspective: a learner maintains a predictive distribution over all future observations and actions given its current internal state. If, after an update that receives no new information, this predictive distribution changes, the learner has “forgotten.”

The authors formalise learning as an interaction between a learner (state Z, prediction function f, learning‑mode update u, inference‑mode update u′) and an environment (observation kernel pₑ). Histories H of observation‑output pairs are defined, and the predictive distribution q(Hₜ₊₁:∞ | Zₜ, H₀:ₜ) is obtained by rolling out the inference‑mode dynamics (u′) without affecting the real learning process. This construction mirrors predictive Bayesianism: the learner implicitly holds an infinite‑horizon posterior that forms a martingale and converges over time.
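This construction can be sketched in code. The following is a minimal illustrative sketch, not the paper's implementation: the names `Learner`, `rollout`, and `sample_obs`, and the Monte Carlo scheme, are our own. The key point it captures is that the predictive distribution q(Hₜ₊₁:∞ | Zₜ, H₀:ₜ) is obtained by rolling the inference-mode update u′ forward on imagined observations, without touching the real learning process.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Learner:
    z: float              # internal state Z_t
    u: Callable           # learning-mode update (drives the real process)
    u_prime: Callable     # inference-mode update (used only for rollouts)
    f: Callable           # prediction function: state -> predictive parameters

def rollout(learner, sample_obs, horizon, n_samples=1000):
    """Monte Carlo sketch of the predictive distribution
    q(H_{t+1:t+horizon} | Z_t): imagined observations are drawn from the
    learner's own predictions and fed through the inference-mode update
    u', leaving the real learning state untouched."""
    histories = []
    for _ in range(n_samples):
        z = learner.z                     # copy of the state; never written back
        h = []
        for _ in range(horizon):
            x = sample_obs(learner.f(z))  # sample an imagined observation
            h.append(x)
            z = learner.u_prime(z, x)     # inference-mode update only
        histories.append(tuple(h))
    return histories
```

For a Bernoulli learner whose state is directly its predicted success probability (f(z) = z), `sample_obs` could be `lambda p: 1 if random.random() < p else 0`.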

From this foundation they derive several key theoretical contributions:

  1. General learning‑as‑stochastic‑process framework that subsumes supervised learning, reinforcement learning, and generative modelling under a single probabilistic interaction model.
  2. Theorem 5.1 (Exact Bayesian non‑forgetting) showing that if the learner updates its belief via exact Bayesian conditioning, the predictive distribution remains unchanged, i.e., forgetting never occurs. This provides a clean, information‑theoretic justification for why Bayesian methods are inherently stable.
  3. A principled forgetting metric (Definition 4.7), defined as the KL‑divergence (equivalently, the information loss) between successive predictive distributions over the same future, Iₜ = D_KL( q(Hₜ₊₁:∞ | Zₜ, H₀:ₜ) ‖ q(Hₜ₊₁:∞ | Zₜ₊₁, H₀:ₜ) ), which vanishes exactly when the learner's updates are self‑consistent.
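To make the metric and Theorem 5.1 concrete, here is a minimal Beta–Bernoulli sketch. This is our own illustrative instantiation, not an experiment from the paper: the exact Bayesian learner's learning-mode update coincides with exact conditioning, so its measured forgetting is zero, while a hypothetical learner that decays its counts (the `gamma` factor is an assumption for illustration) loses predictive information.

```python
import math

def bernoulli_kl(p, q):
    # KL divergence between two Bernoulli predictive distributions
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def predictive(alpha, beta):
    # posterior-predictive probability of observing x = 1 under Beta(alpha, beta)
    return alpha / (alpha + beta)

def bayes_update(alpha, beta, x):
    # exact Bayesian conditioning on a Bernoulli observation x in {0, 1}
    return alpha + x, beta + (1 - x)

def decayed_update(alpha, beta, x, gamma=0.9):
    # hypothetical non-Bayesian learner that decays its old counts
    return gamma * alpha + x, gamma * beta + (1 - x)

def forgetting(update, state, x):
    # compare the prediction of the exactly-conditioned (rolled-out) state
    # with the prediction of the learner's actual updated state
    rolled = bayes_update(*state, x)
    learned = update(*state, x)
    return bernoulli_kl(predictive(*rolled), predictive(*learned))

state = (2.0, 2.0)
print(forgetting(bayes_update, state, 1))    # 0.0: exact Bayes never forgets
print(forgetting(decayed_update, state, 1))  # > 0: decay loses predictive information
```

The design choice here mirrors the theory: forgetting is measured purely on predictive distributions, so any learner whose update reproduces exact conditioning scores zero regardless of how its state is parameterised.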
