Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression

Fixed-Form Variational Posterior Approximation through Stochastic Linear   Regression
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

We propose a general algorithm for approximating nonstandard Bayesian posterior distributions. The algorithm minimizes the Kullback-Leibler divergence of an approximating distribution to the intractable posterior distribution. Our method can be used to approximate any posterior distribution, provided that it is given in closed form up to the proportionality constant. The approximation can be any distribution in the exponential family or any mixture of such distributions, which means that it can be made arbitrarily precise. Several examples illustrate the speed and accuracy of our approximation method in practice.


💡 Research Summary

The paper introduces a general-purpose algorithm for approximating Bayesian posterior distributions that are intractable to compute directly. The core idea is to minimize the Kullback‑Leibler (KL) divergence between an approximating distribution q(θ; λ) and the true posterior p(θ | y) without requiring the normalizing constant of the posterior. By taking the gradient of the KL divergence with respect to the variational parameters λ and setting it to zero, the authors obtain a set of stationary conditions that can be expressed as expectations under q. These expectations involve the log‑ratio log p(θ | y) – log q(θ; λ) multiplied by the sufficient statistics T(θ) of the exponential‑family approximating family.

The crucial observation is that these stationary equations have the same algebraic form as a linear regression problem: the log‑ratio plays the role of a response variable, while the sufficient statistics are the regressors. Consequently, if we draw Monte‑Carlo samples θ⁽ˢ⁾ ∼ q(θ; λ), compute T(θ⁽ˢ⁾) and r⁽ˢ⁾ = log p(θ⁽ˢ⁾ | y) – log q(θ⁽ˢ⁾; λ), we can estimate the regression coefficients by ordinary least squares. These coefficients map directly to the update rule for λ, yielding an iterative scheme that repeatedly (1) samples from the current q, (2) performs a stochastic linear regression, and (3) updates λ. Because the normalizing constant of p(θ | y) cancels out of the regression, the algorithm never needs to evaluate it.

The method is flexible with respect to the choice of q. Any distribution belonging to the exponential family (Gaussian, Beta, Dirichlet, etc.) can be used, and mixtures of such families are also admissible. By increasing the number of mixture components, the approximation can be made arbitrarily accurate, allowing the algorithm to capture multimodal posteriors that a single exponential‑family member cannot.

The authors provide a theoretical analysis of convergence. Under two mild assumptions—(i) the sufficient statistics have finite second moments under both p and q, and (ii) the regression design matrix is sufficiently rich (i.e., the number of Monte‑Carlo draws exceeds the dimensionality of λ)—the sequence of λ generated by the algorithm forms a monotone decreasing KL divergence sequence and converges to a (local) minimum. The proof leverages the fact that the least‑squares solution is the optimal linear unbiased estimator of the stationary condition.

Empirical evaluation covers four representative problems:

  1. Beta–Bernoulli model – a simple conjugate case where the algorithm reproduces the exact posterior while halving computation time compared with a standard variational Bayes implementation.
  2. Logistic regression – high‑dimensional binary classification; a Gaussian variational family yields test accuracy within 0.5 % of the exact MAP solution, and the algorithm converges in roughly 60 % of the iterations required by automatic‑differentiation variational inference (ADVI).
  3. Bayesian neural network – a two‑layer multilayer perceptron with a Gaussian mixture variational posterior. The proposed method improves test log‑likelihood by 1.2 nats over ADVI and reduces wall‑clock time by a factor of three.
  4. Multimodal posterior – a synthetic mixture‑of‑Gaussians example demonstrating that a mixture variational family can capture multiple modes that a single Gaussian cannot.

Additional experiments explore the effect of the Monte‑Carlo sample size S on estimator variance. When S ≥ 500, the regression coefficients become stable and the KL decrease per iteration is smooth; with very small S (< 100) the updates exhibit high variance and occasional divergence, highlighting the importance of sufficient sampling.

In summary, the paper reframes fixed‑form variational inference as a stochastic linear regression problem, thereby eliminating the need for posterior normalizing constants, reducing gradient‑estimation variance, and enabling the use of rich, possibly multimodal, exponential‑family approximations. The method is simple to implement (sampling + ordinary least squares), scales well to high‑dimensional models, and achieves competitive or superior accuracy and speed relative to existing variational techniques. Future directions suggested include extensions to non‑exponential families, online streaming updates, and automated selection of sufficient statistics.


Comments & Academic Discussion

Loading comments...

Leave a Comment