Noradrenergic-inspired gain modulation attenuates the stability gap in joint training
Recent work in continual learning has highlighted the stability gap – a temporary performance drop on previously learned tasks when new ones are introduced. This phenomenon reflects a mismatch between rapid adaptation and strong retention at task boundaries, underscoring the need for optimization mechanisms that balance plasticity and stability over abrupt distribution changes. While optimizers such as momentum-SGD and Adam introduce implicit multi-timescale behavior, they still exhibit pronounced stability gaps. Importantly, these gaps persist even under ideal joint training, making it crucial to study them in this setting to isolate their causes from other sources of forgetting. Motivated by how noradrenergic (neuromodulatory) bursts transiently increase neuronal gain under uncertainty, we introduce a dynamic gain scaling mechanism as a two-timescale optimization technique that balances adaptation and retention by modulating effective learning rates and flattening the local landscape through an effective reparameterization. Across domain- and class-incremental MNIST, CIFAR, and mini-ImageNet benchmarks under task-agnostic joint training, dynamic gain scaling effectively attenuates stability gaps while maintaining competitive accuracy, improving robustness at task transitions.
💡 Research Summary
The paper tackles the “stability gap” – a transient drop in performance on previously learned tasks that occurs when a new task is introduced in continual learning (CL). The gap appears even under ideal joint‑training conditions, which suggests it stems from optimization dynamics rather than from memory constraints or catastrophic forgetting; yet optimizers such as momentum‑SGD (MSGD) and Adam, despite their implicit multi‑timescale behavior, still exhibit pronounced gaps. The authors draw inspiration from the noradrenergic system in the brain, where bursts of norepinephrine transiently increase neuronal gain in response to unexpected uncertainty, enabling rapid adaptation without erasing existing representations.
They propose Dynamic Gain Scaling (DGS), a two‑timescale optimizer that modulates an effective learning rate through a time‑varying gain variable g. By defining an effective weight W = g·w, the gain dynamics naturally decompose W into a slow component (g₀ w) and a fast component ((g – g₀) w). The slow part reflects tonic neuromodulation and provides stable, long‑term consolidation, while the fast part, driven by phasic gain bursts, allows rapid, temporary adjustments to new data. Mathematically, the updates are:
wₜ₊₁ = wₜ – α ∇₍w₎L,
gₜ₊₁ = γ gₜ + (1 – γ) g₀ + η H(y),
where α is the base learning rate, γ controls exponential decay of the gain back to its tonic baseline g₀ (= 1), η scales the phasic boost, and H(y) ≈ |∇₍g₎L| serves as a “surprise” signal derived from the loss gradient. This formulation adds negligible computational overhead and can be combined with any base optimizer.
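To make the two‑timescale mechanics concrete, here is a minimal PyTorch sketch of a gain‑reparameterized layer and its update. The names (GainLinear, dgs_step) and the per‑weight gain are hypothetical; the sketch assumes the forward pass uses the effective weight W = g·w so that autograd supplies both ∇₍w₎L and the surprise signal H(y) ≈ |∇₍g₎L|. It illustrates the idea, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GainLinear(nn.Module):
    """Linear layer whose effective weight is W = g * w (elementwise gain).

    Illustrative reparameterization only; not the authors' code.
    """

    def __init__(self, in_features, out_features):
        super().__init__()
        self.w = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.g = nn.Parameter(torch.ones(out_features, in_features))  # tonic baseline g0 = 1

    def forward(self, x):
        # Forward pass uses the effective weight W = g * w.
        return x @ (self.g * self.w).t()


@torch.no_grad()
def dgs_step(model, lr=0.1, gamma=0.99, eta=0.01, g0=1.0):
    """One two-timescale update, assuming loss.backward() has filled .grad.

    Slow timescale: plain gradient step on the underlying weight w.
    Fast timescale: gain decays toward g0 plus a phasic boost driven by the
    'surprise' signal H(y) ~= |dL/dg|, read off from g.grad.
    """
    for m in model.modules():
        if isinstance(m, GainLinear):
            m.w -= lr * m.w.grad                         # w_{t+1} = w_t - α ∇_w L
            surprise = m.g.grad.abs()                    # H(y) ≈ |∇_g L|
            m.g.mul_(gamma).add_((1 - gamma) * g0).add_(eta * surprise)
            m.w.grad = None
            m.g.grad = None


# Usage sketch: forward, backward, then the two-timescale step.
model = GainLinear(784, 10)
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
dgs_step(model)
```

Because the gain update only reads an already‑computed gradient, the overhead beyond a plain SGD step is negligible, and the same decay‑plus‑boost rule could in principle wrap any base optimizer, as the summary notes.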
A key theoretical insight is that increasing gain flattens the loss landscape: the curvature λ is effectively scaled to λ/g². Consequently, during high‑gain phases the optimizer can traverse steep regions without causing large loss spikes, which mitigates the overshooting that typically produces the stability gap at task boundaries. The authors illustrate this effect with a simple linear model and confirm it empirically on deep networks.
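To see where the λ/g² scaling can come from, here is a hedged one‑dimensional quadratic sketch; it assumes the curvature λ is measured with respect to the underlying weight w and the landscape is then viewed in effective‑weight coordinates W = g·w (a simplification of the paper's linear‑model illustration):

```latex
% Hypothetical 1-D quadratic illustration of the curvature scaling lambda -> lambda / g^2.
\[
  L(w) = \frac{\lambda}{2}\,(w - w^{\ast})^{2},
  \qquad W = g\,w \;\Longrightarrow\; w = \frac{W}{g},
\]
\[
  \tilde{L}(W) = \frac{\lambda}{2}\left(\frac{W}{g} - w^{\ast}\right)^{2},
  \qquad
  \frac{\partial^{2}\tilde{L}}{\partial W^{2}} = \frac{\lambda}{g^{2}}.
\]
```

During a phasic burst (g > 1) the second derivative in W‑space shrinks by a factor of g², so a given effective‑weight update produces a smaller change in loss, which is the mechanism credited with suppressing the transient spike at task boundaries.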
Experiments span domain‑incremental and class‑incremental settings on Split MNIST, Rotated MNIST, Split CIFAR‑10, Domain CIFAR‑100, and split mini‑ImageNet. All evaluations are performed under a task‑agnostic joint‑training regime, meaning that the joint loss over all tasks is available throughout training, thereby isolating the optimization dynamics from memory‑related issues. DGS is benchmarked against MSGD, Adam, and vanilla SGD. Results show:
- Stability Gap Reduction – DGS dramatically shrinks the transient accuracy dip at each task transition (often from 10‑15 % down to < 3 %); a sketch of how such a dip can be measured follows this list.
- Competitive Accuracy – Overall test accuracy across the full sequence matches or slightly exceeds that of the baselines, even when strong momentum or adaptive learning rates are used.
- Loss‑Landscape Flattening – Measured curvature decreases by roughly 30‑40 % during high‑gain periods, confirming the theoretical prediction.
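For concreteness, the transient dip can be quantified along the following lines: track accuracy on previously learned tasks at every iteration and record the deepest drop, relative to the pre‑boundary level, within a window after each task switch. This is an illustrative metric with hypothetical names, not necessarily the paper's exact protocol.

```python
def stability_gap(acc_trace, boundaries, window=200):
    """Largest transient accuracy drop after each task boundary.

    acc_trace: per-iteration accuracy on previously learned tasks.
    boundaries: iteration indices at which a new task is introduced.
    window: number of iterations searched for the dip after each boundary.
    """
    gaps = []
    for b in boundaries:
        before = acc_trace[max(b - 1, 0)]        # accuracy just before the switch
        dip = min(acc_trace[b:b + window])       # deepest point of the transient dip
        gaps.append(before - dip)
    return gaps
```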
The contributions are threefold: (i) a biologically motivated, analytically grounded two‑timescale optimizer; (ii) a demonstration that gain‑induced reparameterization flattens the loss surface and reduces interference at task switches; (iii) empirical validation that the method effectively attenuates stability gaps in a range of CL benchmarks without sacrificing performance.
Limitations include the focus on ideal joint‑training (no replay buffer or memory constraints) and the need for manual tuning of the gain hyper‑parameters (γ, η). Suggested future work includes integrating meta‑learning for automatic gain‑parameter adaptation, testing DGS in memory‑limited online CL scenarios, and scaling the approach to large language models and distributed training environments.