Variational inference via radial transport

Reading time: 5 minutes
...

📝 Original Info

  • Title: Variational inference via radial transport
  • ArXiv ID: 2602.17525
  • Date: 2026-02-19
  • Authors: Not listed in the provided source (the original document did not include an author list)

📝 Abstract

In variational inference (VI), the practitioner approximates a high-dimensional distribution $π$ with a simple surrogate one, often a (product) Gaussian distribution. However, in many cases of practical interest, Gaussian distributions might not capture the correct radial profile of $π$, resulting in poor coverage. In this work, we approach the VI problem from the perspective of optimizing over these radial profiles. Our algorithm radVI is a cheap, effective add-on to many existing VI schemes, such as Gaussian (mean-field) VI and Laplace approximation. We provide theoretical convergence guarantees for our algorithm, owing to recent developments in optimization over the Wasserstein space--the space of probability distributions endowed with the Wasserstein distance--and new regularity properties of radial transport maps in the style of Caffarelli (2000).

💡 Deep Analysis

📄 Full Content

Variational inference (VI) is a fundamental optimization problem that takes place over subsets of probability distributions (Wainwright and Jordan, 2008; Blei et al., 2017). We consider a standard setup that arises in many applications, where the practitioner is given a high-dimensional posterior distribution $\pi \propto \exp(-V)$ and the goal is to solve

$$\min_{\mu \in \mathcal{C}} \ \mathrm{KL}(\mu \,\|\, \pi),$$

where $\mathcal{C} \subset \mathcal{P}(\mathbb{R}^d)$ is a fixed set of probability distributions. VI is a powerful computational stand-in for standard Markov chain Monte Carlo (MCMC) methods for sampling from unnormalized posteriors $\pi$. Indeed, while MCMC methods require simulating Markov chains for prohibitively long periods of time, it might be possible to instead quickly learn a surrogate density that is a good enough approximation to the posterior for practical purposes; see the review by Blei et al. (2017) for more details.
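To make the objective concrete (a standard identity, not a detail specific to this paper): writing $\pi = e^{-V}/Z$ with normalizing constant $Z$, the KL objective decomposes as

$$\mathrm{KL}(\mu \,\|\, \pi) \;=\; \mathbb{E}_\mu[V] \;+\; \mathbb{E}_\mu[\log \mu] \;+\; \log Z,$$

so the (typically intractable) constant $Z$ only shifts the objective by a constant and never needs to be computed; evaluating $V$ (and usually $\nabla V$) suffices.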

In VI, the choice of $\mathcal{C} \subset \mathcal{P}(\mathbb{R}^d)$ is of the utmost importance. For example, the case where $\mathcal{C}$ is the set of all Gaussians (with positive definite covariance) is known as Gaussian VI (Barber and Bishop, 1997; Seeger, 1999; Opper and Archambeau, 2009). In large-scale machine learning applications, it is also common to optimize over the class of Gaussians with diagonal covariance, resulting in mean-field Gaussian VI. While these algorithms have a long history, a rigorous theoretical analysis is only just emerging, based on the theory of optimal transport through Wasserstein gradient flows (Ambrosio et al., 2008). For example, the Gaussian case has been studied by Lambert et al. (2022), Diao et al. (2023), and Kim et al. (2024). We note that it is possible to implement algorithms based on mixtures of Gaussians, as outlined by Lambert et al. (2022) and Petit-Talamon et al. (2025), though the mathematical analysis in this case is significantly more challenging. Separately, the Laplace approximation is an alternative means of obtaining a surrogate measure for $\pi$: one considers the Gaussian approximation $\mathcal{N}(x^\star, (\nabla^2 V(x^\star))^{-1})$, where $x^\star = \operatorname{argmin} V$. The literature on Laplace approximations is vast; see, e.g., Robert and Casella (2004). Margossian and Saul (2025) highlight the strengths and weaknesses of existing VI-based algorithms. Notably, they provide some characterizations of when VI can hope to exactly recover the mean and correlation matrix of a target distribution $\pi$. While they are only interested in these particular statistics, a key point of that paper is that the variational approximating family must be decided in advance, which leads to demonstrable shortcomings even in small-scale examples.
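As a point of reference for the Laplace approximation mentioned above (this is a minimal sketch of that classical construction, not the paper's radVI algorithm), one can compute $\mathcal{N}(x^\star, (\nabla^2 V(x^\star))^{-1})$ in a few lines; the toy potential `V` and the finite-difference Hessian below are illustrative assumptions.

```python
# Minimal sketch of the Laplace approximation N(x*, (Hess V(x*))^{-1}).
# The toy potential V and the finite-difference Hessian are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

def V(x):
    """Toy strongly convex negative log-density (not from the paper)."""
    return 0.5 * np.dot(x, x) + np.sum(np.logaddexp(x, -x))

def hessian_fd(f, x, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at x."""
    n = x.size
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            ei, ej = eps * I[i], eps * I[j]
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

x_star = minimize(V, np.zeros(3), method="BFGS").x   # x* = argmin V
cov = np.linalg.inv(hessian_fd(V, x_star))           # Laplace covariance (Hess V(x*))^{-1}
print("Laplace mean:", x_star)
print("Laplace covariance:\n", cov)
```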

To mitigate the issues brought about by these approximations, we study the VI problem over radial profiles.

For fixed $m \in \mathbb{R}^d$ and $\Sigma \succ 0$, we consider the following variational family:

$$\mathcal{C}_{m,\Sigma} \;=\; \bigl\{\, \mu \in \mathcal{P}(\mathbb{R}^d) \;:\; \mu(\mathrm{d}x) \propto h\bigl((x-m)^\top \Sigma^{-1} (x-m)\bigr)\,\mathrm{d}x \,\bigr\},$$

as $h$ ranges over non-negative functions on $[0, \infty)$. If $m$ and $\Sigma$ are known, or if estimates thereof can be imputed, then we can assume $m = 0$ and $\Sigma = I$ via a whitening procedure (see Section 4.5). We henceforth assume that this has been done, so that our variational family is the set $\mathcal{C}_{\mathrm{rad}}$ of radially symmetric distributions. This family encompasses the standard Gaussian via $h(y) = \exp(-y/2)$, but also Student-t distributions, the nonsmooth Laplace distribution, and the logistic distribution, among others.
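For illustration (these specific profiles are standard textbook forms, written after whitening so that the radial argument is $y = \|x\|^2$; they are not taken verbatim from the paper), up to normalization one can take

$$h_{\mathrm{Gauss}}(y) = e^{-y/2}, \qquad h_{\mathrm{Student}\text{-}t_\nu}(y) = \Bigl(1 + \tfrac{y}{\nu}\Bigr)^{-(\nu+d)/2}, \qquad h_{\mathrm{Laplace}}(y) = e^{-\sqrt{y}},$$

which recover a standard Gaussian, a multivariate Student-t, and a radially symmetric Laplace-type distribution, respectively.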

In this paper, we propose and analyze a tractable algorithm for solving

$$\pi^\star_{\mathrm{rad}} \;:=\; \operatorname*{arg\,min}_{\mu \in \mathcal{C}_{\mathrm{rad}}} \ \mathrm{KL}(\mu \,\|\, \pi),$$

where $\pi \propto \exp(-V)$. Our contributions are of both theoretical and computational interest. We stress that our only assumptions throughout this work are on the true posterior $\pi$, namely that $\pi$ is log-smooth, strongly log-concave, and centered at the origin. This pair of assumptions has been leveraged in nearly all works on the theory of sampling (Chewi, 2026) and in the theoretical and computational study of variational inference (Lambert et al., 2022; Arnese and Lacker, 2024; Lacker et al., 2024; Lavenant and Zanella, 2024; Jiang et al., 2025).
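Concretely, in their standard form (stated here for context; the paper's exact constants and conditions may differ), $\beta$-log-smoothness and $\alpha$-strong log-concavity of $\pi \propto e^{-V}$ read

$$\alpha I \;\preceq\; \nabla^2 V(x) \;\preceq\; \beta I \qquad \text{for all } x \in \mathbb{R}^d, \quad 0 < \alpha \le \beta < \infty.$$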

In Section 3, we prove existence and uniqueness of the radial minimizer $\pi^\star_{\mathrm{rad}}$, as well as establish regularity properties of said minimizer. For example, if $\pi$ is log-smooth and strongly log-concave, Theorem 3.4 states that $\pi^\star_{\mathrm{rad}}$ is as well. We also prove Caffarelli-type contraction estimates (Caffarelli, 2000) for the corresponding optimal radial transport map $T^\star_{\mathrm{rad}}$ from the standard Gaussian $\rho = \mathcal{N}(0, I)$, say, to $\pi^\star_{\mathrm{rad}}$; see Theorem 3.5.
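For context, the classical Caffarelli (2000) contraction theorem (given here in its textbook form, not the paper's radial refinement) states: if $\rho = \mathcal{N}(0, I)$ and $\pi$ is $\alpha$-strongly log-concave, then the (Brenier) optimal transport map $T$ pushing $\rho$ forward to $\pi$ is globally Lipschitz,

$$\|T(x) - T(y)\| \;\le\; \frac{1}{\sqrt{\alpha}}\,\|x - y\| \qquad \text{for all } x, y \in \mathbb{R}^d.$$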

In Section 4, we embrace the conventional wisdom of “parametrizing then optimizing” in order to compute $\pi^\star_{\mathrm{rad}}$, leading to our proposed algorithm radVI (see Algorithm 1). Concretely, we make use of the representation of a given radial measure as the pushforward of the standard Gaussian by a radial map $T_{\mathrm{rad}}$. Our approach is based on carefully parametrizing radial transport maps $T_\lambda$ for $\lambda \in \mathbb{R}^{J+1}_+$ for some $J > 0$, where, if $J$ is large enough, our parametrized set should encompass all possible radial maps (see Theorem 4.1). Then, writing our objective over the non-negative orthant as

$$F(\lambda) \;:=\; \mathrm{KL}\bigl((T_\lambda)_\# \rho \,\big\|\, \pi\bigr),$$

we show that standard Euclidean gradient descent on $F$ can be used to compute $\pi^\star_{\mathrm{rad}}$.
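Below is a minimal sketch of this “parametrize then optimize” recipe: push $\rho = \mathcal{N}(0, I)$ forward through a radial map $T_\lambda(x) = r_\lambda(\|x\|)\, x/\|x\|$, estimate the KL objective by Monte Carlo via the change-of-variables formula (the Jacobian determinant of a radial map at radius $s$ is $r'(s)\,(r(s)/s)^{d-1}$), and run projected gradient descent over $\lambda \ge 0$. The basis functions, toy target, step size, and finite-difference gradient are illustrative assumptions, not the paper's radVI parametrization.

```python
# Sketch: fit a radial pushforward of N(0, I) by projected gradient descent on lambda >= 0.
# The basis functions and the toy target are illustrative placeholders, not the paper's radVI.
import numpy as np

d = 5
rng = np.random.default_rng(0)
X = rng.standard_normal((4000, d))     # samples from the reference rho = N(0, I)
S = np.linalg.norm(X, axis=1)          # radii s = ||x||, strictly positive a.s.

def V(y):
    """Toy log-smooth, strongly log-concave potential (an assumption for illustration)."""
    return 0.5 * np.sum(y**2, axis=1) + np.sum(np.logaddexp(y, -y), axis=1)

def radial_profile(lmbda, s):
    """r_lambda(s) and r_lambda'(s) for a toy non-negative basis expansion (hypothetical)."""
    l0, l1, l2 = lmbda
    r = l0 * s + l1 * np.tanh(s) + l2 * s / (1.0 + s)
    dr = l0 + l1 / np.cosh(s)**2 + l2 / (1.0 + s)**2
    return r, dr

def objective(lmbda):
    """Monte Carlo estimate of KL((T_lambda)_# rho || pi), up to an additive constant."""
    r, dr = radial_profile(lmbda, S)
    T = X * (r / S)[:, None]                               # T(x) = r(||x||) x / ||x||
    log_det = np.log(dr) + (d - 1) * (np.log(r) - np.log(S))
    return np.mean(V(T) - log_det)

lmbda = np.array([1.0, 0.0, 0.0])                          # start from the identity map
step, eps = 0.05, 1e-5
for _ in range(200):
    grad = np.array([(objective(lmbda + eps * e) - objective(lmbda - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
    lmbda = np.maximum(lmbda - step * grad, 1e-8)          # project onto the non-negative orthant
print("fitted radial coefficients:", lmbda)
```

The same skeleton accepts any parametrization of increasing radial profiles $r_\lambda$ with non-negative coefficients; only `radial_profile` would change.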

Reference

This content is AI-processed based on open access ArXiv data.
