Training Memory in Deep Neural Networks: Mechanisms, Evidence, and Measurement Gaps
Modern deep-learning training is not memoryless. Updates depend on optimizer moments and averaging, data-order policies (random reshuffling vs. with-replacement sampling, staged augmentations and replay), the nonconvex path, and auxiliary state (teacher EMA/SWA, contrastive queues, BatchNorm statistics). This survey organizes mechanisms by source, lifetime, and visibility. It introduces seed-paired, function-space causal estimands; portable perturbation primitives (carry/reset of momentum/Adam/EMA/BN, order-window swaps, queue/teacher tweaks); and a reporting checklist with audit artifacts (order hashes, buffer/BN checksums, RNG contracts). The conclusion is a protocol for portable, causal, uncertainty-aware measurement that attributes how much training history matters across models, data, and regimes.
💡 Research Summary
The paper presents a comprehensive survey of “training memory” in deep neural networks, arguing that modern training is far from memory‑less. It defines training memory as any dependence of the next update on more than the current parameters and minibatch—specifically on optimizer state, data‑order decisions, and the trajectory taken through the loss landscape. The authors organize the myriad mechanisms that create such memory along three axes: source, lifetime, and visibility.
Sources are grouped into five categories: (S1) optimizer/trajectory state (momentum, Adam’s first/second moments, EMA/SWA, SAM, K‑FAC, etc.); (S2) sampler and data‑order state (random reshuffling vs. with‑replacement, curricula, paced augmentations, prioritized replay, BatchNorm running statistics, augmentation schedules); (S3) parameter‑path dependence (non‑commutativity of updates, mode‑connectivity, flat‑minimum preference); (S4) architectural or external memory (contrastive queues, memory banks, Hebbian traces); and (S5) meta‑state such as teacher‑student EMA or learned optimizers, including federated server‑side accumulators.
Lifetime ranges from step‑scale (moments decay over tens to hundreds of steps), epoch‑scale (ordering and augmentation policies persisting across full passes over the data), phase‑scale (pre‑training → fine‑tuning boundaries, EMA/SWA carried across phases), to task/round‑scale (federated rounds, slow‑fast learned optimizers). The paper visualizes these lifetimes and suggests appropriate intervention windows (e.g., resetting momentum for 1–2 half‑lives, swapping batch order for a full epoch, or flushing a queue after its turnover).
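The step-scale lifetimes above follow directly from the exponential decay of EMA-style buffers: a buffer with decay factor β retains half the weight of an old gradient after log(0.5)/log(β) steps. A minimal sketch of that arithmetic (the function name `half_life` is ours, not the survey's):

```python
import math

def half_life(beta: float) -> float:
    """Steps until an EMA buffer with decay factor `beta` retains
    half its weight on a past gradient: solve beta**k = 0.5."""
    return math.log(0.5) / math.log(beta)

# Typical defaults: SGD momentum beta=0.9 decays in a handful of steps,
# while Adam's second moment (beta2=0.999) persists for hundreds.
print(half_life(0.9))    # ~6.6 steps
print(half_life(0.999))  # ~693 steps
```

This is why the survey's suggested intervention window of "1-2 half-lives" maps to very different wall-clock spans for momentum versus Adam's second moment.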
Visibility captures whether a state is observable or resettable. For instance, BatchNorm statistics are explicitly used at inference time, while Adam moments are hidden from the user unless logged.
The survey collates theoretical and empirical evidence for each source: momentum’s exponential decay formula, EMA’s smoothing effect on loss basins, the distinct convergence of reshuffling versus with‑replacement sampling, curriculum‑induced sample‑distribution drift, the impact of queue length on contrastive learning performance, and the stabilizing role of teacher EMA in self‑supervised training. It also points out that many existing studies report only final accuracy, neglecting these hidden variables, which hampers reproducibility and cross‑study comparison.
To address this gap, the authors introduce seed‑paired, function‑space causal estimands. By fixing the random seed, one can run a baseline training and a perturbed version where a single memory component is altered (e.g., zeroing momentum, swapping an epoch’s order, clearing a queue). The difference in function space—measured via logits, CKA, SVCCA, or downstream metrics—provides a controlled, paired estimate of that component’s causal contribution, together with uncertainty quantification (bootstrap or Bayesian intervals).
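A minimal sketch of such a seed-paired estimand, under our own assumptions: the per-example logit gap between the two runs as the effect size, with a percentile-bootstrap interval. The function `paired_effect` and the toy logits are illustrative stand-ins for real paired training runs:

```python
import random
import statistics

def paired_effect(baseline_logits, perturbed_logits, n_boot=2000, seed=0):
    """Seed-paired estimand: mean per-example logit gap between a baseline
    run and a run with one memory component perturbed, plus a 95%
    percentile-bootstrap confidence interval."""
    gaps = [abs(b - p) for b, p in zip(baseline_logits, perturbed_logits)]
    rng = random.Random(seed)  # fixed seed: the bootstrap itself is auditable
    boot_means = sorted(
        statistics.mean(rng.choices(gaps, k=len(gaps))) for _ in range(n_boot)
    )
    lo = boot_means[int(0.025 * n_boot)]
    hi = boot_means[int(0.975 * n_boot)]
    return statistics.mean(gaps), (lo, hi)

# Toy logits standing in for two runs identical except for a momentum reset.
base = [1.2, -0.4, 0.9, 2.1, -1.0]
pert = [1.0, -0.3, 1.1, 2.0, -1.2]
effect, (lo, hi) = paired_effect(base, pert)
```

In practice the gap would be computed over a held-out evaluation set, and CKA or SVCCA would replace the raw logit difference when comparing intermediate representations.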
They also propose a set of portable perturbation primitives: (i) carry or reset optimizer buffers (momentum, Adam moments, EMA, BN statistics); (ii) perform short AB swaps of batch order at epoch boundaries; (iii) truncate or re‑initialize external queues or teacher models. These primitives are lightweight to implement and enable systematic ablations across the five source categories.
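Primitive (i) can be sketched with a toy optimizer whose momentum buffer is either carried or zeroed; the class `MomentumSGD` below is our own minimal stand-in, not the survey's implementation (in a real framework this corresponds to clearing the optimizer's per-parameter state):

```python
class MomentumSGD:
    """Minimal SGD-with-momentum illustrating the carry/reset primitive."""

    def __init__(self, lr=0.1, beta=0.9):
        self.lr, self.beta = lr, beta
        self.velocity = {}  # per-parameter momentum buffer (the memory)

    def step(self, params, grads):
        for k, g in grads.items():
            v = self.beta * self.velocity.get(k, 0.0) + g
            self.velocity[k] = v
            params[k] -= self.lr * v
        return params

    def reset_momentum(self):
        """The 'reset' intervention; 'carry' is simply not calling this."""
        self.velocity = {}

# Two seed-paired runs: identical gradients, one buffer reset mid-stream.
carry, reset = MomentumSGD(), MomentumSGD()
p_carry, p_reset = {"w": 1.0}, {"w": 1.0}
grad = {"w": 0.5}
for _ in range(2):
    carry.step(p_carry, grad)
    reset.step(p_reset, grad)
reset.reset_momentum()          # the single perturbation
carry.step(p_carry, grad)
reset.step(p_reset, grad)
# p_carry and p_reset now differ: the gap is the momentum buffer's effect.
```

The same pattern applies to Adam moments, EMA weights, and BN running statistics: hold everything else fixed, toggle one buffer, and measure the divergence.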
A major practical contribution is a reporting checklist designed to make training‑memory experiments auditable. Required artifacts include RNG contracts and order hashes, checksums of optimizer and BN buffers, logs of EMA/queue updates, and per‑epoch function‑space similarity matrices. Publishing these alongside code and model checkpoints would allow other researchers to verify that the same memory conditions were reproduced, and to compute effect sizes with confidence intervals.
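The order-hash and buffer-checksum artifacts are straightforward to produce; a sketch of one possible form, assuming integer sample indices and flat float buffers (the helper names are ours):

```python
import hashlib
import struct

def order_hash(indices):
    """Audit artifact: digest of the exact sample order seen in one epoch.
    Two runs with identical digests saw the data in identical order."""
    h = hashlib.sha256()
    for i in indices:
        h.update(struct.pack("<q", i))
    return h.hexdigest()[:16]

def buffer_checksum(values):
    """Digest of a flattened optimizer or BN buffer (momentum, Adam moments,
    running mean/var), logged per checkpoint to verify carried state."""
    h = hashlib.sha256()
    for v in values:
        h.update(struct.pack("<d", v))
    return h.hexdigest()[:16]
```

Logging these short digests per epoch is cheap, yet it lets a reviewer confirm that two "seed-paired" runs really did share the same data order and starting buffers before attributing any function-space gap to the intended perturbation.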
In the conclusion, the paper frames the proposed protocol as a “portable, causal, uncertainty‑aware measurement” of training memory. It outlines desiderata for future diagnostics: (1) isolate each memory source via controlled perturbations; (2) report effect sizes in function space, not just final accuracy; (3) track representation drift with appropriate caveats; and (4) identify early‑phase indicators that predict later generalization. By providing a unified taxonomy, a synthesis of theory and evidence, causal estimands, perturbation tools, and a concrete checklist, the survey aims to standardize how the community quantifies and compares the influence of training history across architectures, datasets, and training regimes.