Scalable Linearized Laplace Approximation via Surrogate Neural Kernel

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv source.

We introduce a scalable method for approximating the kernel of the Linearized Laplace Approximation (LLA). A surrogate deep neural network (DNN) learns a compact feature representation whose inner product replicates the Neural Tangent Kernel (NTK), avoiding the computation of large Jacobians. Training relies solely on efficient Jacobian-vector products, allowing predictive uncertainty to be computed for large-scale pre-trained DNNs. Experimental results show uncertainty estimation and calibration that are similar to or better than those of existing LLA approximations. Furthermore, biasing the learned kernel significantly enhances out-of-distribution detection. This highlights the potential of the proposed method for finding kernels better than the NTK when using LLA to compute predictive uncertainty for a pre-trained DNN.


💡 Research Summary

The paper addresses a fundamental scalability bottleneck of the Linearized Laplace Approximation (LLA), a popular Bayesian post‑hoc method that linearizes a pretrained neural network around its MAP parameters and uses the resulting Gaussian posterior to quantify predictive uncertainty. Traditional LLA requires the explicit computation and storage of the Jacobian matrix Jθ∗(x) for every training example, an operation that becomes infeasible for modern deep networks with millions of parameters.
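To make the kernel at the heart of LLA concrete, the sketch below (an illustration, not the paper's code) uses a toy linear model f(x) = Wx, whose Jacobian with respect to the flattened parameters vec(W) is known in closed form, and checks that K_NTK(x, x′) = J(x) J(x′)ᵀ reduces to (xᵀx′) I_C for this model. The point is that the NTK only ever needs Jacobian products, never the Jacobian of a large network stored explicitly.

```python
import numpy as np

# Toy model: f(x) = W @ x with W in R^{C x d}; parameters flattened to P = C*d.
# The Jacobian of f(x) w.r.t. vec(W) is J(x) in R^{C x P}: row c holds x in the
# block of columns corresponding to row c of W (and zeros elsewhere).
C, d = 2, 3

def jacobian(x):
    """Exact parameter Jacobian of f(x) = W @ x (independent of W itself)."""
    return np.kron(np.eye(C), x)  # shape (C, C*d)

rng = np.random.default_rng(0)
x, xp = rng.normal(size=d), rng.normal(size=d)

K = jacobian(x) @ jacobian(xp).T             # NTK block: J(x) J(x')^T
assert np.allclose(K, (x @ xp) * np.eye(C))  # equals (x . x') I_C for this model
```

For a real network with millions of parameters P, the (C × P) Jacobians in this product are exactly what ScaLLA avoids materializing.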

To overcome this, the authors propose Scalable Linearized Laplace Approximation (ScaLLA), which replaces the exact Jacobian‑based kernel with a learned surrogate kernel. A small auxiliary network gϕ(·) is trained to output a compact feature matrix of size C × m (with m ≪ P). The inner product gϕ(x) gϕ(x′)ᵀ is intended to approximate the Neural Tangent Kernel (NTK) K_NTK(x, x′) = Jθ∗(x) Jθ∗(x′)ᵀ. Crucially, the surrogate is trained without ever forming the full Jacobian; instead, the loss is based on Jacobian‑vector products (JVPs) z_v(x) = Jθ∗(x)v, which can be obtained efficiently via automatic differentiation. By sampling random vectors v (the authors favor Rademacher vectors for lower variance) and minimizing the squared difference between gϕ(x) gϕ(x′)ᵀ and z_v(x) z_v(x′)ᵀ, the surrogate learns to reproduce the NTK geometry in expectation. This training can be performed with standard mini‑batch stochastic optimization, yielding a memory footprint of O(C m) rather than O(C P).
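The training signal described above rests on the identity E_v[z_v(x) z_v(x′)ᵀ] = J(x) J(x′)ᵀ, which holds whenever v has i.i.d. zero-mean, unit-variance entries. The numpy sketch below (illustrative only: it uses a random matrix as a stand-in for the Jacobian, whereas a real implementation would obtain z_v(x) = Jθ∗(x)v via autodiff JVPs without ever forming J) checks this unbiasedness with Rademacher vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
C, P = 2, 6                       # outputs, parameters (tiny, for illustration)
J_x  = rng.normal(size=(C, P))    # stand-in for the Jacobian at x
J_xp = rng.normal(size=(C, P))    # stand-in for the Jacobian at x'

K_exact = J_x @ J_xp.T            # exact NTK block K_NTK(x, x')

# Monte Carlo estimate: average z_v(x) z_v(x')^T over Rademacher vectors v.
# In practice each z_v = J v is a single JVP; the Jacobian is never stored.
M = 200_000
V = rng.choice([-1.0, 1.0], size=(M, P))   # Rademacher: +/-1 w.p. 1/2 each
Z_x, Z_xp = V @ J_x.T, V @ J_xp.T          # rows are z_v(x), z_v(x')
K_mc = Z_x.T @ Z_xp / M

rel_err = np.linalg.norm(K_mc - K_exact) / np.linalg.norm(K_exact)
assert rel_err < 0.1              # unbiased estimator; error shrinks as 1/sqrt(M)
```

The surrogate gϕ is then fit so that gϕ(x) gϕ(x′)ᵀ matches these sampled products z_v(x) z_v(x′)ᵀ in a squared loss, which makes it target K_exact in expectation.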

Beyond accurate kernel approximation, the paper introduces a deliberate biasing scheme to improve out‑of‑distribution (OOD) detection. During training, a set of “context points” drawn from an auxiliary dataset (intended to resemble potential OOD data) is concatenated with the regular training batch. After the surrogate kernel is computed, cross‑covariances between training points and context points are forcibly set to zero, producing a block‑diagonal covariance structure. This encourages the posterior to revert to the prior in regions far from the training manifold, thereby inflating predictive variance for OOD inputs while preserving calibration on in‑distribution data.
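The biasing step amounts to a simple masking of the surrogate kernel matrix. In this hypothetical numpy sketch (which simplifies each input's surrogate output to a single m-dimensional feature vector rather than the C × m matrix used in the paper), features for a training batch and for the context points are stacked, the full kernel is formed, and the train–context cross-covariance blocks are zeroed to obtain the block-diagonal structure described above:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4                                    # surrogate feature dimension (m << P)
n_train, n_ctx = 5, 3

G_train = rng.normal(size=(n_train, m))  # surrogate features g_phi(x), train batch
G_ctx   = rng.normal(size=(n_ctx, m))    # surrogate features at context points

G = np.vstack([G_train, G_ctx])
K = G @ G.T                              # surrogate kernel over train + context

# Bias the kernel: zero the cross-covariances between train and context points.
# The resulting block-diagonal covariance lets the posterior revert to the
# prior in regions far from the training data, inflating OOD variance.
K_biased = K.copy()
K_biased[:n_train, n_train:] = 0.0
K_biased[n_train:, :n_train] = 0.0

assert np.allclose(K_biased[:n_train, :n_train], G_train @ G_train.T)
assert np.allclose(K_biased[n_train:, n_train:], G_ctx @ G_ctx.T)
```

Because the in-distribution diagonal block is untouched, calibration on training-like inputs is unaffected by the mask.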

Empirical evaluation uses a convolutional network trained on Fashion‑MNIST. The surrogate network has 100 output dimensions per class (m = 100). MNIST test samples serve as context points, and KMNIST test samples are treated as OOD. ScaLLA is compared against several state‑of‑the‑art Bayesian approximations: LLLA (last‑layer Laplace approximation), VALLA (variational linearized Laplace), FMGP (fixed‑mean Gaussian processes), MFVI (mean‑field variational inference), and SNGP (spectral‑normalized Gaussian processes). Metrics include accuracy, negative log‑likelihood (NLL), expected calibration error (ECE), and OOD detection AUC‑ROC based on predictive entropy (both softmax‑based and Gaussian‑based).
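The entropy-based OOD metric used here can be sketched in a few lines. The example below is a hypothetical illustration with fabricated predictive distributions (not the paper's results): OOD inputs are scored by the entropy of their predictive distribution, and AUC-ROC is computed via the Mann-Whitney rank statistic, i.e., the probability that a random OOD score exceeds a random in-distribution score.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of each row of an (n, C) matrix of predictive probabilities."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def auc_roc(pos_scores, neg_scores):
    """AUC-ROC as the Mann-Whitney statistic: P(positive score > negative score)."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return np.mean(pos > neg) + 0.5 * np.mean(pos == neg)

# Fabricated predictive distributions for illustration: in-distribution
# predictions are confident (low entropy), OOD ones are near-uniform.
p_id  = np.tile([0.9, 0.05, 0.05], (100, 1))
p_ood = np.tile([0.4, 0.3, 0.3], (100, 1))

auc = auc_roc(predictive_entropy(p_ood), predictive_entropy(p_id))
assert auc == 1.0  # every OOD entropy exceeds every ID entropy in this toy setup
```

The same scoring applies whether the entropies come from softmax outputs or from the Gaussian predictive distribution; only the probabilities fed in change.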

Results show that ScaLLA without bias matches or slightly outperforms baselines in NLL and ECE while maintaining comparable accuracy. When the biasing step is applied, OOD detection AUC‑ROC improves markedly (from ~0.78 to ~0.82), and the Gaussian‑based AUC reaches ~0.98, all without sacrificing in‑distribution performance. By contrast, LLLA achieves the highest OOD AUC but suffers degraded NLL/ECE. The authors also discuss computational advantages: the surrogate requires only JVPs, leading to linear scaling in the number of parameters and sub‑linear scaling in the number of training samples.

Limitations are acknowledged. The effectiveness of the biasing strategy hinges on the choice of context data; if the auxiliary set does not resemble the true OOD distribution, the benefit may diminish. Moreover, while the surrogate adds a modest number of extra parameters, its training still incurs additional overhead compared to a vanilla LLA implementation.

In conclusion, the paper presents a practical, scalable framework for Bayesian uncertainty quantification in large deep networks. By learning a compact surrogate kernel via Jacobian‑vector products and optionally biasing the covariance structure, ScaLLA delivers accurate predictive uncertainties, well‑calibrated posteriors, and strong OOD detection—all with a memory and compute profile compatible with modern large‑scale models. This work opens the door for broader adoption of Laplace‑based Bayesian methods in real‑world deep learning systems.

