Low-Rank Filtering and Smoothing for Sequential Deep Learning
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Learning multiple tasks sequentially requires neural networks to balance retaining existing knowledge with remaining flexible enough to adapt to new tasks. Regularizing network parameters is a common approach, but it rarely incorporates prior knowledge about task relationships and limits information flow to future tasks only. We propose a Bayesian framework that treats the network's parameters as the state space of a nonlinear Gaussian model, unlocking two key capabilities: (1) A principled way to encode domain knowledge about task relationships, allowing, e.g., control over which layers should adapt between tasks. (2) A novel application of Bayesian smoothing, allowing task-specific models to also incorporate knowledge from models learned later. This does not require direct access to their data, which is crucial, e.g., for privacy-critical applications. These capabilities rely on efficient filtering and smoothing operations, for which we propose diagonal plus low-rank approximations of the precision matrix in the Laplace approximation (LR-LGF). Empirical results demonstrate the efficiency of LR-LGF and the benefits of the unlocked capabilities.


💡 Research Summary

The paper tackles the fundamental problem of continual deep learning, where a neural network must learn a sequence of tasks without catastrophically forgetting earlier ones while remaining plastic enough to acquire new knowledge. The authors cast the problem as Bayesian state‑space inference: the network parameters θₜ at task t are treated as the hidden state of a nonlinear Gaussian model. The transition model is a linear Gaussian p(θₜ₊₁ | θₜ)=𝒩(θₜ₊₁; θₜ, Q), where Q is a diagonal process‑noise covariance that can encode prior beliefs about how parameters drift between tasks (e.g., only upper layers should change). The observation model is the usual supervised loss, written as an un‑normalized likelihood p(Dₜ | θₜ)∝exp(−λ L(θₜ, Dₜ)).
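The role of the diagonal process noise Q can be illustrated with a small sketch (all names and values here are hypothetical, not from the paper): by assigning per-block variances, a draw from the transition model lets upper layers drift between tasks while keeping lower layers nearly frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameter vector split into a "lower" and an "upper" layer block.
d_lower, d_upper = 4, 3
theta_t = rng.normal(size=d_lower + d_upper)

# Diagonal process noise Q encoding a prior belief: lower layers barely
# drift between tasks, upper layers are free to adapt.
q_diag = np.concatenate([np.full(d_lower, 1e-6),   # lower layers: nearly frozen
                         np.full(d_upper, 1e-1)])  # upper layers: flexible

# One draw from the transition model theta_{t+1} ~ N(theta_t, Q).
theta_next = theta_t + rng.normal(size=theta_t.shape) * np.sqrt(q_diag)
```

In the filtering equations the same Q enters deterministically through the predicted covariance; the sample above is only meant to show which coordinates the prior allows to move.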

Exact Bayesian filtering (predict‑update) is intractable for deep networks because the posterior is non‑Gaussian and the Hessian is massive. The authors therefore employ a Laplace‑Gaussian filter (LGF): after observing task t they form a regularized loss L_reg(θ)=L(θ,Dₜ)+½(θ−μ̂ₜ₋₁)ᵀΣ̂ₜ₋₁⁻¹(θ−μ̂ₜ₋₁), optimize it to obtain the MAP estimate μ̂ₜ, and approximate the posterior by a Gaussian whose precision (inverse covariance) is the Hessian of L_reg at μ̂ₜ.
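The update step can be sketched on a toy problem where the Laplace approximation is exact: if the task loss is quadratic, the MAP estimate and the Hessian of the regularized loss are available in closed form (the matrices below are illustrative placeholders, not from the paper).

```python
import numpy as np

# Toy quadratic task loss L(theta) = 0.5 * (theta - b)^T A (theta - b),
# for which the Laplace approximation is exact.
A = np.array([[2.0, 0.3], [0.3, 1.0]])   # loss Hessian
b = np.array([1.0, -0.5])                # loss minimizer

mu_prev = np.zeros(2)                    # predicted mean from the previous task
prec_prev = np.eye(2) * 0.5              # predicted precision (inverse covariance)

# Regularized loss: L_reg(theta) = L(theta)
#   + 0.5 * (theta - mu_prev)^T prec_prev (theta - mu_prev).
# Setting its gradient to zero gives the MAP estimate; its Hessian is the
# posterior precision.
prec_post = A + prec_prev
mu_post = np.linalg.solve(prec_post, A @ b + prec_prev @ mu_prev)
```

For a deep network the MAP estimate would instead be found by stochastic optimization of L_reg, and the Hessian replaced by a structured approximation as described below.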

Storing full precision matrices is impossible for modern networks (D≈10⁶). The key technical contribution is to restrict every precision matrix to a “diagonal + low‑rank” form:
 Pₜ = Dₜ + Uₜ Σₜ Uₜᵀ,
where Dₜ is diagonal, Uₜ∈ℝ^{D×k} with k≪D, and Σₜ∈ℝ^{k×k}. This representation reduces memory to O(D + Dk + k²) and enables matrix‑vector products in O(Dk + k²).
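A minimal sketch of this representation (function and variable names are our own): the matrix-vector product never materializes the D x D matrix.

```python
import numpy as np

def dplr_matvec(d, U, S, v):
    """Compute (diag(d) + U @ S @ U.T) @ v without forming the D x D matrix.

    d: (D,) diagonal, U: (D, k) low-rank factor, S: (k, k) core, v: (D,) vector.
    Cost is O(Dk + k^2) instead of O(D^2).
    """
    return d * v + U @ (S @ (U.T @ v))

rng = np.random.default_rng(1)
D, k = 50, 3
d = rng.uniform(1.0, 2.0, size=D)
U = rng.normal(size=(D, k))
S = np.diag(rng.uniform(0.1, 1.0, size=k))
v = rng.normal(size=D)

dense = np.diag(d) + U @ S @ U.T          # reference: explicit D x D matrix
assert np.allclose(dplr_matvec(d, U, S, v), dense @ v)
```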

During the predict step, the authors apply the Woodbury identity twice to combine the previous precision Pₜ₋₁ with the diagonal process‑noise Q, yielding a new diagonal + low‑rank precision (D′,U′,Σ′). In the update step, the Hessian of the loss is approximated by the Generalized Gauss‑Newton (GGN) matrix, which is naturally low‑rank because it can be expressed as a sum over minibatch Jacobians J_b: H≈∑_b J_bᵀ Ĥ_b J_b. Adding this low‑rank GGN to the predicted precision again produces a diagonal + low‑rank matrix, but its rank may increase. To keep the rank bounded, the authors truncate the resulting matrix via a rank‑k SVD, preserving the most significant directions. This yields an efficient Laplace‑Gaussian filter (LR‑LGF) that can be run sequentially over tasks with modest computational overhead.
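The double application of the Woodbury identity in the predict step can be sketched as follows (a toy dense-verified version with hypothetical names; the real method never forms D x D matrices). Inverting the diagonal plus low-rank precision gives a diagonal plus low-rank covariance; adding the diagonal Q and inverting again restores the precision form at the same rank.

```python
import numpy as np

def woodbury_inverse(d, U, S):
    """Invert diag(d) + U @ S @ U.T via the Woodbury identity,
    returning (d_new, V, M) such that the inverse is diag(d_new) + V @ M @ V.T."""
    d_inv = 1.0 / d
    V = d_inv[:, None] * U                       # D^{-1} U
    M = -np.linalg.inv(np.linalg.inv(S) + U.T @ V)
    return d_inv, V, M

def predict_precision(d, U, S, q):
    """Predict step: precision of C + Q, where C = (diag(d) + U S U^T)^{-1}
    and Q = diag(q). Two Woodbury applications keep the result diag + low-rank."""
    c_diag, V, M = woodbury_inverse(d, U, S)     # covariance C as diag + low-rank
    return woodbury_inverse(c_diag + q, V, M)    # invert C + Q the same way

rng = np.random.default_rng(2)
D, k = 20, 2
d = rng.uniform(1.0, 2.0, size=D)
U = rng.normal(size=(D, k))
S = np.diag(rng.uniform(0.5, 1.0, size=k))
q = rng.uniform(0.01, 0.1, size=D)

d_new, U_new, S_new = predict_precision(d, U, S, q)
P_new = np.diag(d_new) + U_new @ S_new @ U_new.T

# Reference: dense computation of (P^{-1} + Q)^{-1}.
P = np.diag(d) + U @ S @ U.T
ref = np.linalg.inv(np.linalg.inv(P) + np.diag(q))
assert np.allclose(P_new, ref)
```

After the update step adds the low-rank GGN term, the rank budget would be restored by a truncated eigendecomposition of the low-rank part, keeping only the k most significant directions.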

Beyond filtering, the paper introduces Bayesian smoothing (Rauch‑Tung‑Striebel smoother) to obtain task‑specific parameter estimates that incorporate information from all tasks, even those observed later. Because the transition model is linear Gaussian, the smoothing distribution remains Gaussian, and its mean and covariance can be updated backward in time using the smoothing gain Gₜ = Cₜ(Cₜ+Q)⁻¹. By maintaining the diagonal + low‑rank structure for the filtering precision, the smoothing equations can also be executed in O(Dk + k²) time. Crucially, smoothing does not require revisiting any past data, making it suitable for privacy‑sensitive or data‑restricted scenarios.
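For the identity-transition model, the backward smoothing recursion reduces to a few elementwise operations when the covariances are kept diagonal. The sketch below (our own simplification; the paper's version retains the low-rank terms) shows the structure of the backward pass: it consumes only the stored filtering results, never the task data.

```python
import numpy as np

def rts_smooth(means, covs, q):
    """Backward RTS pass for theta_{t+1} = theta_t + noise (identity transition).

    means: (T, D) filtered means; covs: (T, D) diagonal filtered covariances;
    q: (D,) diagonal process noise. With identity dynamics the smoothing gain
    is G_t = C_t (C_t + Q)^{-1}, elementwise for diagonal matrices.
    """
    sm_means = means.copy()
    sm_covs = covs.copy()
    for t in range(len(means) - 2, -1, -1):
        gain = covs[t] / (covs[t] + q)                        # G_t, elementwise
        sm_means[t] = means[t] + gain * (sm_means[t + 1] - means[t])
        sm_covs[t] = covs[t] + gain**2 * (sm_covs[t + 1] - covs[t] - q)
    return sm_means, sm_covs

# Three "tasks", two parameters each (toy filtered estimates).
means = np.array([[0.0, 1.0], [0.5, 1.2], [0.4, 1.1]])
covs = np.array([[1.0, 1.0], [0.8, 0.9], [0.6, 0.7]])
sm_means, sm_covs = rts_smooth(means, covs, np.full(2, 0.1))
```

The smoothed estimate for the final task coincides with the filtering estimate; earlier tasks are pulled toward information observed later, with strength governed by the gain.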

The experimental section validates three claims: (1) The diagonal + low‑rank approximation closely matches a full‑matrix Laplace approximation while cutting memory usage by an order of magnitude. (2) Encoding domain knowledge via Q (e.g., assigning larger variance to upper layers) successfully steers which parts of the network adapt, leading to better forward transfer and reduced forgetting. (3) Applying smoothing improves the performance on early tasks by 2–4 % absolute accuracy and lowers overall loss, especially when later tasks provide complementary information. The authors also demonstrate that LR‑LGF scales to standard continual‑learning benchmarks and that the method works for both low‑data and privacy‑critical settings.

In summary, the paper contributes a principled Bayesian framework for sequential deep learning that (i) treats weight updates as state‑space inference, (ii) introduces an efficient diagonal + low‑rank Laplace‑Gaussian filter (LR‑LGF) to make the approach tractable for large networks, (iii) allows explicit incorporation of prior knowledge about task relationships via the process‑noise matrix, and (iv) leverages Bayesian smoothing to retroactively improve earlier task models without accessing their data. This combination of statistical rigor and computational practicality offers a compelling solution to the stability‑plasticity dilemma in continual learning.
