Richer Bayesian Last Layers with Subsampled NTK Features


Bayesian Last Layers (BLLs) provide a convenient and computationally efficient way to estimate uncertainty in neural networks. However, they underestimate epistemic uncertainty because they apply a Bayesian treatment only to the final layer, ignoring uncertainty induced by earlier layers. We propose a method that improves BLLs by leveraging a projection of Neural Tangent Kernel (NTK) features onto the space spanned by the last-layer features. This enables posterior inference that accounts for variability of the full network while retaining the low computational cost of inference of a standard BLL. We show that our method yields posterior variances that are provably greater than or equal to those of a standard BLL, correcting its tendency to underestimate epistemic uncertainty. To further reduce computational cost, we introduce a uniform subsampling scheme for estimating the projection matrix and for posterior inference, and we derive approximation bounds for both types of subsampling. Empirical evaluations on UCI regression, contextual bandits, image classification, and out-of-distribution detection tasks on image and tabular datasets demonstrate improved calibration and uncertainty estimates compared to standard BLLs and competitive baselines, while reducing computational cost.


💡 Research Summary

Bayesian last layers (BLLs) are a popular, computationally cheap way to obtain uncertainty estimates for deep neural networks: a Bayesian linear regression is performed on the final hidden representation, while the rest of the network is treated deterministically. This simplicity, however, comes at the cost of severely under‑estimating epistemic uncertainty because the variability induced by earlier layers is ignored. In contrast, the Neural Tangent Kernel Gaussian Process (NTK‑GP) captures the full parameter‑gradient covariance (the empirical NTK) and thus provides a principled Bayesian treatment of the whole network, but its inference scales as O(N³) in the number of training points, making it impractical for realistic datasets.
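To make the standard BLL concrete, the sketch below performs Bayesian linear regression on fixed last-layer features. The random features, prior precision, and noise variance are illustrative stand-ins (the paper uses a trained network's hidden representation), but the posterior formulas are the usual ones for Bayesian linear regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the last hidden representation phi(x) of a trained network.
N, r = 50, 8                        # training points, last-layer feature dim
Phi = rng.normal(size=(N, r))       # N x r feature matrix
y = Phi @ rng.normal(size=r) + 0.1 * rng.normal(size=N)

alpha, sigma2 = 1.0, 0.1 ** 2       # prior precision, observation noise variance

# Standard BLL posterior: Gaussian over the last-layer weights only.
S_inv = alpha * np.eye(r) + Phi.T @ Phi / sigma2
S = np.linalg.inv(S_inv)            # posterior covariance: r x r, cheap to invert
mu = S @ Phi.T @ y / sigma2         # posterior mean

# Predictive variance at a test point collapses to a quadratic form in phi(x*).
phi_star = rng.normal(size=r)
pred_var = sigma2 + phi_star @ S @ phi_star
```

Because only the r x r matrix S is inverted, inference is cheap, but the predictive variance reflects uncertainty in the last-layer weights alone, which is exactly the underestimation the paper targets.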

The paper introduces “Richer Bayesian Last Layers” (Rich‑BLL), a method that bridges this gap by projecting the high‑dimensional NTK features onto the subspace spanned by the last‑layer (NNGP) features. Concretely, the full empirical NTK feature vector ϕₚ(x)=∇θ fθ̂(x) is split into two parts: ϕᵣ(x) (gradient w.r.t. the last‑layer weights) and ϕₘ(x) (gradient w.r.t. all preceding parameters). The authors posit a linear relationship ϕₘ(x)≈A ϕᵣ(x) and estimate A by ordinary least squares on the training data, yielding A=ΦₘᵀΦᵣ(ΦᵣᵀΦᵣ)⁻¹. By stacking A below the identity, they form B=[I; A], so that the full NTK feature vector is approximated as ϕₚ(x)≈B ϕᵣ(x); posterior inference can then be carried out in the low-dimensional last‑layer feature space while inheriting the covariance structure of the full network.
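The projection step above can be sketched numerically. The sketch assumes toy dimensions and synthetic gradient features (a random A_true plus noise) rather than actual network gradients; it shows the OLS estimate A=ΦₘᵀΦᵣ(ΦᵣᵀΦᵣ)⁻¹ and the stacked map B reconstructing the full NTK feature vector from the last-layer part.

```python
import numpy as np

rng = np.random.default_rng(1)
N, r, m = 100, 5, 20                # points, last-layer dim, remaining-params dim

Phi_r = rng.normal(size=(N, r))     # rows: phi_r(x_i), gradients w.r.t. last layer
# Synthetic phi_m that is mostly linear in phi_r, plus a small residual
# (an assumption for illustration; in the paper these are network gradients).
A_true = rng.normal(size=(m, r))
Phi_m = Phi_r @ A_true.T + 0.05 * rng.normal(size=(N, m))

# OLS estimate of the projection: A = Phi_m^T Phi_r (Phi_r^T Phi_r)^{-1}
A = Phi_m.T @ Phi_r @ np.linalg.inv(Phi_r.T @ Phi_r)

# Stack the identity on A so that phi_p = (phi_r; phi_m) ~ B phi_r.
B = np.vstack([np.eye(r), A])

phi_r0 = Phi_r[0]
phi_p0 = np.concatenate([Phi_r[0], Phi_m[0]])
approx = B @ phi_r0                 # top block exact, bottom block approximates phi_m
```

Working with B ϕᵣ(x) keeps posterior inference in the r-dimensional last-layer space (as in a standard BLL) while the A block carries the contribution of the earlier layers' gradients.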

