Exact closed-form Gaussian moments of residual layers
We study the problem of propagating the mean and covariance of a general multivariate Gaussian distribution through a deep (residual) neural network using layer-by-layer moment matching. We close a longstanding gap by deriving exact moment matching for the probit, GeLU, ReLU (as a limit of GeLU), Heaviside (as a limit of probit), and sine activation functions, for both feedforward and generalized residual layers. On random networks, we find orders-of-magnitude improvements in the KL divergence error metric, up to a millionfold, over popular alternatives. On real data, we find competitive statistical calibration for inference under epistemic uncertainty in the input. On a variational Bayes network, we show that our method attains hundredfold improvements in KL divergence from Monte Carlo ground truth over a state-of-the-art deterministic inference method. We also give an a priori error bound and a preliminary analysis of stochastic feedforward neurons, which have recently attracted general interest.
💡 Research Summary
The paper tackles the long‑standing problem of propagating the first two moments of a multivariate Gaussian distribution through deep neural networks that include residual connections. While many prior works rely on restrictive assumptions—small covariance, diagonal covariance, or mean‑field independence—this work derives exact closed‑form expressions for the mean and covariance after a single layer for a set of widely used activation functions: the probit (Φ), GeLU, ReLU (as the λ→∞ limit of GeLU), Heaviside (as the λ→∞ limit of Φ), and the sine function.
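For a single Gaussian pre-activation X ~ N(μ, s²), the expected activation has a textbook closed form for ReLU and for the probit Φ; the sketch below illustrates the kind of scalar building block involved (function names `M_relu` and `M_probit` are ours, not the paper's notation, which treats these as instances of its Mσ):

```python
import math

def phi(x):
    """Standard normal pdf."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def M_relu(mu, s):
    """Exact E[ReLU(X)] for X ~ N(mu, s^2): mu*Phi(mu/s) + s*phi(mu/s)."""
    a = mu / s
    return mu * Phi(a) + s * phi(a)

def M_probit(mu, s):
    """Exact E[Phi(X)] for X ~ N(mu, s^2): Phi(mu / sqrt(1 + s^2))."""
    return Phi(mu / math.sqrt(1.0 + s * s))
```

Note that the ReLU formula degrades gracefully: as s → 0 it returns ReLU(μ), and at μ = 0 it gives the half-normal mean s/√(2π).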
The authors introduce three fundamental scalar functions, Mσ, Kσ, and Lσ, which respectively give the expected activation, the covariance between two activations, and the cross‑covariance between an activation and a linear term. By applying multivariate Stein’s lemma, the Gaussian ordinary differential equation φ′(x)+xφ(x)=0, characteristic functions, and the Dominated Convergence Theorem, they obtain analytic formulas for these functions for each activation. Lemma 2.4 then shows how, for a generic residual layer g(x)=σ(Ax+b)+Cx+d, the output mean and covariance can be expressed solely in terms of the input mean μ, input covariance Σ, and the pre‑computed Mσ, Kσ, Lσ evaluated at low‑dimensional arguments (μᵢ, νᵢⱼ, τᵢⱼ, κᵢⱼ). This yields a layer‑wise Gaussian approximation Y_ana that is exact for any single layer, regardless of the dimensionality of the input or the presence of residual (skip) connections.
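The role of Stein's lemma can be illustrated on the activation/skip cross-covariance that a residual layer needs: for jointly Gaussian variables, Cov(σ(Zᵢ), Xⱼ) = E[σ′(Zᵢ)]·Cov(Zᵢ, Xⱼ), and for ReLU the gating factor E[σ′(Zᵢ)] is just Φ(νᵢ/√Tᵢᵢ). A hedged sketch of this one ingredient (notation and function name are ours; the paper's Lσ is defined more generally):

```python
import math
import numpy as np

def relu_cross_cov(A, b, mu, Sigma):
    """Exact Cov(ReLU(A x + b), x) for x ~ N(mu, Sigma) via Stein's lemma:
    with Z = A x + b ~ N(nu, T),
      Cov(sigma(Z_i), x_j) = E[sigma'(Z_i)] * Cov(Z_i, x_j),
    and for ReLU, E[sigma'(Z_i)] = Phi(nu_i / sqrt(T_ii))."""
    nu = A @ mu + b                     # pre-activation mean
    T = A @ Sigma @ A.T                 # pre-activation covariance
    std = np.sqrt(np.diag(T))
    gate = np.array([0.5 * (1.0 + math.erf(v / (s * math.sqrt(2.0))))
                     for v, s in zip(nu, std)])
    return gate[:, None] * (A @ Sigma)  # row i scaled by Phi(nu_i / std_i)
```

The key point is that no sampling or linearization is involved: the cross-covariance is exact for any input dimension and any full covariance Σ.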
Theoretical analysis in Appendix H provides a Wasserstein‑distance error recursion: the distance between the true output distribution Y₀ and the Gaussian approximation Y_ana at layer k is bounded by the Lipschitz constant of the layer times the previous‑layer distance plus a “non‑normality forcing term”. The latter term is bounded using a second‑order Poincaré inequality, linking the error to the interaction of variance and nonlinearity. Although the bound is loose, it clarifies why deep networks can accumulate non‑Gaussianity and offers a principled way to assess approximation quality.
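Schematically, the recursion has the following form (notation ours, reconstructed from the description above, not copied from Appendix H):

```latex
% Error recursion at layer k, in Wasserstein distance:
W_2\bigl(Y_k,\, Y_{\mathrm{ana},k}\bigr)
  \;\le\; \operatorname{Lip}(g_k)\,
          W_2\bigl(Y_{k-1},\, Y_{\mathrm{ana},k-1}\bigr)
  \;+\; \varepsilon_k ,
```

where εₖ is the non-normality forcing term, controlled via the second-order Poincaré inequality; iterating the recursion shows that the accumulated error is governed by products of the layer Lipschitz constants plus a sum of the per-layer forcing terms.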
Empirically, the authors evaluate Y_ana against four baseline deterministic propagation methods—mean‑field (Y_mfa), linear Jacobian (Y_lin), and two unscented transforms (Y_u’95, Y_u’02)—across 38 ensembles of random networks with varying depth, width, weight scales, and activation functions, as well as on real‑world regression and classification tasks with noisy or missing inputs. They measure KL divergence from the best Gaussian approximation (obtained via quasi‑Monte‑Carlo) and Wasserstein distance to the true (non‑Gaussian) output. Results show that Y_ana achieves KL values close to machine precision (≈10⁻¹⁵) and improves over baselines by factors ranging from 10³ to 10⁶, especially when input variance is large. In a variational Bayes neural network experiment, the Y_ana‑based variational posterior yields a KL divergence to Monte‑Carlo ground truth that is over 100× smaller than that of the state‑of‑the‑art deterministic method.
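The Gaussian-to-Gaussian KL metric used in such comparisons has a standard closed form; a minimal numpy sketch (function and variable names are ours):

```python
import numpy as np

def gaussian_kl(mu0, S0, mu1, S1):
    """Closed-form KL( N(mu0, S0) || N(mu1, S1) ) for full covariances:
    0.5 * ( tr(S1^-1 S0) + (mu1-mu0)^T S1^-1 (mu1-mu0) - d
            + log det S1 - log det S0 )."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    term_tr = np.trace(S1_inv @ S0)
    term_quad = diff @ S1_inv @ diff
    _, ld0 = np.linalg.slogdet(S0)
    _, ld1 = np.linalg.slogdet(S1)
    return 0.5 * (term_tr + term_quad - d + ld1 - ld0)
```

With moments matched to machine precision, this quantity sits near the ≈10⁻¹⁵ floor reported above, which is what makes the 10³–10⁶ gaps to the baselines measurable.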
The paper also discusses failure modes of existing approximations through concrete counter‑examples (e.g., sinusoidal networks where linearization collapses) and demonstrates that Y_ana remains exact on those single‑layer cases. A preliminary exploration of stochastic activations (e.g., stochastic ReLU) is presented, indicating that the analytic framework can be extended, though a full theory is left for future work.
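The sinusoidal failure mode is easy to reproduce for the mean alone: the Gaussian characteristic function gives the exact first moment E[sin(X)] = e^(−s²/2)·sin(μ), which decays with the input variance, whereas linearization around the mean predicts sin(μ) regardless of s. A small illustration (function name ours):

```python
import math

def M_sin(mu, s):
    """Exact E[sin(X)] for X ~ N(mu, s^2), read off the Gaussian
    characteristic function E[exp(iX)] = exp(i*mu - s^2/2)."""
    return math.exp(-0.5 * s * s) * math.sin(mu)

# Linearization predicts E[sin(X)] ~ sin(pi/2) = 1.0 independent of s,
# but the exact mean has collapsed almost to zero by s = 3:
print(M_sin(math.pi / 2, 3.0))  # ~ 0.0111
```

An exact single-layer rule reproduces this variance-driven collapse by construction, which is why Y_ana remains correct on the counter-examples where linearization fails.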
Limitations include: (1) the Wasserstein error bound is not tight, so the practical error may be much smaller than the bound suggests; (2) the computational cost of handling full covariance matrices scales as O(d³) in the layer width, which may be prohibitive for very wide networks; (3) the analysis of stochastic activations is only preliminary. Nevertheless, the contribution is significant: it provides the first exact, closed‑form moment‑matching scheme for a broad class of activations and residual architectures, enabling precise uncertainty propagation, robustness analysis, and more accurate variational inference in deep learning.