Variable sigma Gaussian processes: An expectation propagation perspective
Gaussian processes (GPs) provide a probabilistic nonparametric representation of functions in regression, classification, and other problems. Unfortunately, exact learning with GPs is intractable for large datasets. A variety of approximate GP methods have been proposed that essentially map the large dataset into a small set of basis points. The most advanced of these, the variable-sigma GP (VSGP) (Walder et al., 2008), allows each basis point to have its own length scale. However, VSGP was only derived for regression. We describe how VSGP can be applied to classification and other problems, by deriving it as an expectation propagation algorithm. In this view, sparse GP approximations correspond to a KL-projection of the true posterior onto a compact exponential family of GPs. VSGP constitutes one such family, and we show how to enlarge this family to get additional accuracy. In particular, we show that endowing each basis point with its own full covariance matrix provides a significant increase in approximation power.
💡 Research Summary
Gaussian processes (GPs) are a powerful non‑parametric Bayesian tool for modelling functions, but exact inference scales cubically with the number of data points, making them impractical for large‑scale problems. Sparse approximations address this issue by summarising the full dataset with a small set of inducing (basis) points. The most sophisticated of these, the variable‑sigma GP (VSGP) introduced by Walder et al. (2008), allows each inducing point to have its own length‑scale, thereby capturing local anisotropy in the input space. However, VSGP was originally derived only for regression with Gaussian likelihoods, leaving a gap for classification and other non‑Gaussian observation models.
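To make the cubic bottleneck concrete, here is a minimal sketch of exact GP regression (not the paper's method): the Cholesky factorisation of the N × N Gram matrix costs O(N³), which is exactly the cost sparse inducing-point methods avoid. The function names and the RBF kernel are illustrative choices, not from the paper.

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Isotropic RBF (squared-exponential) kernel between row-wise points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def gp_regression(X, y, Xstar, kernel, noise=0.1):
    """Exact GP regression posterior mean and variance.

    The Cholesky of the N x N matrix K is the O(N^3) step that makes
    exact inference impractical for large N.
    """
    K = kernel(X, X) + noise**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                       # O(N^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = kernel(Xstar, X)
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(kernel(Xstar, Xstar) - v.T @ v)   # predictive variance
    return mean, var
```

Sparse schemes such as VSGP replace the N × N factorisation with operations on an M × M matrix over the inducing points, with M ≪ N.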
The present paper bridges that gap by re‑deriving VSGP as an instance of Expectation Propagation (EP), a general algorithm for approximating intractable posteriors with members of a chosen exponential family. In this view, any sparse GP method can be interpreted as a KL‑projection of the true posterior onto a compact family of Gaussian processes defined by a set of site factors. VSGP corresponds to the family where each inducing point carries an individual scalar length‑scale σᵢ. By casting VSGP within EP, the authors obtain a principled way to extend it to classification (e.g., probit or logistic likelihoods) and to any other non‑Gaussian observation model, simply by updating the site factors according to the EP moment‑matching rules.
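The EP moment-matching rule mentioned above has a well-known closed form for the probit likelihood, which is what makes the classification extension tractable. The following sketch shows the standard update (as in Rasmussen and Williams) for a single site with a Gaussian cavity N(m, v) and label y ∈ {−1, +1}; it is a generic EP building block, not the paper's VSGP-specific update.

```python
import numpy as np
from scipy.stats import norm

def probit_moment_match(m_cav, v_cav, y):
    """Moments of the tilted distribution q(f) ∝ N(f; m_cav, v_cav) Φ(y f).

    Returns the mean and variance of the matched Gaussian; the site
    parameters are then recovered by dividing out the cavity.
    """
    z = y * m_cav / np.sqrt(1.0 + v_cav)
    r = norm.pdf(z) / norm.cdf(z)                   # hazard-like ratio
    m_hat = m_cav + y * v_cav * r / np.sqrt(1.0 + v_cav)
    v_hat = v_cav - v_cav**2 * r * (z + r) / (1.0 + v_cav)
    return m_hat, v_hat
```

Observing a label always shrinks the variance (v_hat < v_cav) and shifts the mean toward the observed sign, which is the behaviour the global EP iteration relies on.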
A key contribution of the paper is the proposal to enrich the VSGP family further: instead of a single scalar σᵢ, each inducing point is endowed with a full covariance matrix Σᵢ. This “full‑covariance VSGP” can model direction‑dependent variations in the input space, which is especially beneficial in high‑dimensional settings where anisotropy is not aligned with the coordinate axes. The EP derivation yields closed‑form updates for the site parameters (mean and covariance) and retains the overall O(M³) computational complexity, where M is the number of inducing points, independent of the total data size N.
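The geometric intuition behind the full-covariance extension can be sketched with a Gaussian cross-kernel: convolving a Gaussian base kernel (covariance Λ) with a basis point's local Gaussian (covariance Σᵢ) simply adds the two covariances, so each basis point smooths the function anisotropically in its own directions. This is an illustrative toy, with assumed names, rather than the paper's exact kernel construction.

```python
import numpy as np

def basis_cross_kernel(x, b, Lam, Sigma_i):
    """Cross-kernel between input x and basis point b with local covariance
    Sigma_i, for a Gaussian base kernel with covariance Lam.

    Convolving two Gaussians adds their covariances, so the effective
    covariance is Lam + Sigma_i: a normalised Gaussian bump around b.
    """
    C = Lam + Sigma_i
    d = x - b
    quad = d @ np.linalg.solve(C, d)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2.0 * np.pi * C))
```

A basis point whose Σᵢ is elongated along some direction keeps high correlation with inputs offset along that direction, which a single scalar σᵢ cannot express.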
The algorithm proceeds as follows:
1. Initialise the inducing locations and their parameters (σᵢ or Σᵢ).
2. For each observation, compute a site factor that approximates its likelihood.
3. Combine all site factors with the prior to form the global Gaussian approximation q(f).
4. Perform EP moment matching to update the site parameters.
5. Iterate until convergence.
Natural gradient steps are employed for efficient optimisation of the inducing‑point parameters.
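The overall loop can be sketched for the dense (non-sparse) probit case, where the sites live directly on the training points. This is a simplified reference implementation of standard EP for GP classification, assuming the usual natural-parameter bookkeeping; the paper's VSGP variant additionally projects sites onto the inducing points, and a production implementation would use rank-one updates instead of refactorising after each sweep.

```python
import numpy as np
from scipy.stats import norm

def ep_gp_probit(K, y, n_iter=20):
    """EP for GP classification with a probit likelihood (dense sketch).

    K: prior covariance over the n latent values; y: labels in {-1, +1}.
    Returns the mean and covariance of the Gaussian approximation q(f).
    """
    n = len(y)
    tau = np.zeros(n)          # site precisions
    nu = np.zeros(n)           # site precision-adjusted means
    Sigma = K.copy()
    mu = np.zeros(n)
    for _ in range(n_iter):
        for i in range(n):
            # (2) cavity: remove site i from the current posterior
            tau_c = 1.0 / Sigma[i, i] - tau[i]
            nu_c = mu[i] / Sigma[i, i] - nu[i]
            m_c, v_c = nu_c / tau_c, 1.0 / tau_c
            # (4) moment match the tilted distribution (probit site)
            z = y[i] * m_c / np.sqrt(1.0 + v_c)
            r = norm.pdf(z) / norm.cdf(z)
            m_hat = m_c + y[i] * v_c * r / np.sqrt(1.0 + v_c)
            v_hat = v_c - v_c**2 * r * (z + r) / (1.0 + v_c)
            # update the site by dividing out the cavity
            tau[i] = max(1.0 / v_hat - tau_c, 1e-10)
            nu[i] = m_hat / v_hat - nu_c
        # (3) recombine prior and sites into the global Gaussian q(f)
        Sigma = np.linalg.inv(np.linalg.inv(K) + np.diag(tau))
        mu = Sigma @ nu
    return mu, Sigma
```

Each observed label pulls the corresponding posterior mean toward its sign and shrinks the posterior variance relative to the prior.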
Empirical evaluation is conducted on synthetic 2‑D functions with known anisotropic structure and on several real‑world classification benchmarks (e.g., UCI Wine, a subset of MNIST). Results demonstrate that the full‑covariance VSGP consistently outperforms the original scalar‑σ VSGP, as well as other popular sparse GP schemes such as FITC, PITC, and variational inducing‑point methods. Improvements are observed both in predictive accuracy (lower error rates) and in calibrated uncertainty estimates (higher log‑likelihood, better Brier scores). The advantage of full covariance becomes more pronounced as the dimensionality of the data increases, confirming the theoretical expectation that richer local geometry can be captured.
In conclusion, the paper provides a unified EP‑based framework that generalises VSGP beyond regression, introduces a powerful extension with per‑inducing‑point full covariances, and validates the approach empirically. The work opens several avenues for future research, including automatic selection of inducing locations, online EP updates for streaming data, and integration with deep kernel learning to combine expressive feature extraction with the flexible local length‑scale modelling offered by full‑covariance VSGP.