Finite Neural Networks as Mixtures of Gaussian Processes: From Provable Error Bounds to Prior Selection

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Infinitely wide or deep neural networks (NNs) with independent and identically distributed (i.i.d.) parameters have been shown to be equivalent to Gaussian processes. Because of the favorable properties of Gaussian processes, this equivalence is commonly employed to analyze neural networks and has led to various breakthroughs over the years. However, neural networks and Gaussian processes are equivalent only in the limit; in the finite case there are currently no methods available to approximate a trained neural network with a Gaussian model while bounding the approximation error. In this work, we present an algorithmic framework to approximate a neural network of finite width and depth, and with not necessarily i.i.d. parameters, with a mixture of Gaussian processes, together with bounds on the approximation error. In particular, we consider the Wasserstein distance to quantify the closeness between probabilistic models and, by relying on tools from optimal transport and Gaussian processes, we iteratively approximate the output distribution of each layer of the neural network as a mixture of Gaussian processes. Crucially, for any NN and $\epsilon>0$ our approach is able to return a mixture of Gaussian processes that is $\epsilon$-close to the NN at a finite set of input points. Furthermore, we rely on the differentiability of the resulting error bound to show how our approach can be employed to tune the parameters of a NN to mimic the functional behavior of a given Gaussian process, e.g., for prior selection in the context of Bayesian inference. We empirically investigate the effectiveness of our results on both regression and classification problems with various neural network architectures. Our experiments highlight how our results can represent an important step towards understanding neural network predictions and formally quantifying their uncertainty.


💡 Research Summary

The paper tackles the long‑standing gap between the theoretical equivalence of infinitely wide/deep neural networks (NNs) and Gaussian processes (GPs) and the practical reality of finite‑size, possibly trained networks whose parameters are not i.i.d. The authors propose a constructive algorithmic framework that approximates the input‑output distribution of any stochastic neural network (SNN) – whether untrained or trained, with arbitrary weight‑bias correlations – by a Gaussian Mixture Model (GMM) while providing provable error bounds measured in the 2‑Wasserstein distance.

Core methodology

  1. Layer‑wise signature approximation – The continuous output distribution of each layer is discretized into a finite set of Dirac masses (a “signature”). This is analogous to optimal quantization or codebook construction.
  2. Exact propagation through affine + non‑linear layers – Assuming the weight and bias vectors follow a multivariate Gaussian (not necessarily independent), the affine transformation of each Dirac mass yields a Gaussian distribution. The subsequent non‑linear activation is applied to the discrete support, after which the resulting distribution can be re‑expressed as a Gaussian mixture.
  3. Error bounding – For each layer the introduced approximation error is bounded in 2‑Wasserstein distance. By employing interval arithmetic the per‑layer bounds are summed, yielding a global bound for the whole network. The authors prove that this bound converges uniformly to zero as the number of mixture components $M$ grows, i.e., for any finite set of inputs and any $\epsilon>0$ one can construct a GMM that is $\epsilon$-close to the SNN.
  4. Differentiability of the bound – The derived error bound is piece‑wise differentiable with respect to both the NN parameters and the GMM parameters. Consequently, gradient‑based optimization can be used to adjust the NN so that its functional behavior mimics a target GP (or any chosen GMM). This enables prior selection for Bayesian NNs with explicit guarantees, a task previously addressed only via KL‑divergence or Wasserstein‑1 minimization without formal error control.
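The layer-wise scheme in steps 1–2 can be sketched in a toy one-dimensional setting. This is a simplified illustration, not the paper's exact algorithm: the names `gaussian_signature` and `propagate_layer` are hypothetical, and quantile-midpoint quantization stands in for whatever optimal quantization the authors actually use.

```python
import numpy as np
from scipy.stats import norm

def gaussian_signature(mean, std, m):
    """Discretize N(mean, std^2) into m equally weighted Dirac masses,
    placed at the quantile midpoints (a simple quantization scheme)."""
    probs = (np.arange(m) + 0.5) / m
    return norm.ppf(probs, loc=mean, scale=std)

def propagate_layer(signature, mu_w, s_w, mu_b, s_b, act=np.tanh, m=16):
    """Push a discrete signature through one stochastic scalar layer
    y = w*x + b with w ~ N(mu_w, s_w^2), b ~ N(mu_b, s_b^2).

    For each Dirac mass x_i the affine map is *exactly* Gaussian:
    N(mu_w*x_i + mu_b, s_w^2*x_i^2 + s_b^2). The activation is then
    applied to a fresh signature of each component, giving the next
    layer's discrete input."""
    components = []          # (mean, std) of each Gaussian mixture component
    next_signature = []      # Dirac masses feeding the next layer
    for x in signature:
        mean = mu_w * x + mu_b
        std = np.sqrt(s_w**2 * x**2 + s_b**2)
        components.append((mean, std))
        next_signature.extend(act(gaussian_signature(mean, std, m)))
    return components, np.array(next_signature)
```

Each Dirac mass thus contributes one exact Gaussian component to the pre-activation mixture; the only error comes from re-discretizing after the activation, which is precisely what the per-layer Wasserstein bound in step 3 controls.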

Theoretical contributions

  • A rigorous formulation of the approximation problem in the space of probability measures, using the 2‑Wasserstein metric, which directly controls differences in means and covariances.
  • Proof of uniform convergence of the GMM approximation on any finite input set, together with explicit formulas for the required mixture size to achieve a prescribed error.
  • Demonstration that the error bound is amenable to automatic differentiation, opening the door to end‑to‑end training of NNs under functional‑space constraints.
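The 2‑Wasserstein distance between two Gaussians, referred to in the first bullet, has a well-known closed form that makes the metric's control over means and covariances concrete. This is the standard formula, independent of the paper's implementation:

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussians(m1, S1, m2, S2):
    """Closed-form 2-Wasserstein distance between N(m1, S1) and N(m2, S2):
    W2^2 = ||m1 - m2||^2 + Tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})."""
    cross = sqrtm(sqrtm(S2) @ S1 @ sqrtm(S2))
    # sqrtm can return a complex array with negligible imaginary part
    trace_term = np.trace(S1 + S2 - 2 * np.real(cross))
    return np.sqrt(np.linalg.norm(m1 - m2)**2 + trace_term)
```

When the covariances commute, the trace term reduces to the squared Frobenius distance between the covariance square roots, so W2 directly penalizes mismatches in both the mean and the spread of the two models.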

Empirical validation
The framework is evaluated on a variety of architectures (fully‑connected, convolutional, residual) and tasks (regression on UCI benchmarks, classification on MNIST and CIFAR‑10). Key findings include:

  • With as few as 10–20 mixture components the GMM closely matches the Monte‑Carlo estimate of the SNN’s output distribution, both visually and in terms of the measured Wasserstein distance.
  • In uncertainty quantification experiments, the GMM‑derived predictive intervals are tighter yet maintain proper calibration compared to standard MC dropout or variational inference baselines.
  • For prior selection, initializing a NN to emulate a chosen GP kernel (e.g., RBF, Matern) via the differentiable error bound yields improved posterior predictive performance and faster convergence in subsequent Bayesian training, outperforming recent KL‑based methods.
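As a toy illustration of the prior-selection idea, consider a one-layer linear network f(x) = w·x with w ~ N(0, σ²), tuned by gradient descent so that its output marginals match a GP prior with unit variance. For one-dimensional zero-mean Gaussians, W2² = (s1 − s2)², which is differentiable in σ. This is a hypothetical sketch of the mechanism, not the paper's bound or training procedure:

```python
import numpy as np

# Inputs at which the NN output marginal should match the GP prior.
xs = np.array([0.5, 1.0, 2.0])
target_std = np.ones_like(xs)   # e.g. an RBF kernel has k(x, x) = 1

# Output of f(x) = w*x with w ~ N(0, sigma^2) is N(0, sigma^2 * x^2),
# so the summed W2^2 loss is sum_i (sigma*|x_i| - target_std_i)^2.
sigma = 3.0
lr = 0.1
for _ in range(200):
    grad = np.sum(2 * (sigma * np.abs(xs) - target_std) * np.abs(xs))
    sigma -= lr * grad
```

Gradient descent drives σ to the least-squares value Σ|x_i|·t_i / Σx_i², illustrating how a differentiable Wasserstein objective lets one fit NN parameters to a target functional prior before Bayesian training begins.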

Limitations and future work
The algorithm’s computational cost scales with the product of the number of Dirac masses per layer and the mixture size, which can become significant for deep or high‑dimensional networks. Extending the approach to handle non‑smooth activations (e.g., ReLU) more efficiently, integrating dimensionality‑reduction techniques for image‑scale inputs, and developing scalable approximations (e.g., hierarchical mixtures) are identified as promising directions.

Overall impact
By delivering a practical, theoretically grounded method to approximate finite NNs with Gaussian mixtures and to control the approximation error, the paper bridges a crucial gap between deep learning and Bayesian non‑parametrics. It provides a new tool for rigorous uncertainty quantification and for encoding functional priors in Bayesian neural networks, potentially influencing both theoretical analyses of deep models and applied Bayesian deep learning pipelines.

