Aggregation Models with Optimal Weights for Distributed Gaussian Processes
Gaussian process (GP) models have received increasing attention in recent years due to their superb prediction accuracy and modeling flexibility. To address the computational burden of GP models on large-scale datasets, distributed learning for GPs is often adopted. Current aggregation models for distributed GPs are not time-efficient when incorporating correlations between GP experts. In this work, we propose a novel approach for aggregated prediction in distributed GPs. The technique is suitable for both exact and sparse variational GPs. The proposed method incorporates correlations among experts, leading to better prediction accuracy with manageable computational requirements. As demonstrated by empirical studies, the proposed approach results in more stable predictions in less time than state-of-the-art consistent aggregation models.
💡 Research Summary
This paper addresses the scalability challenge of Gaussian Process (GP) models on large‑scale datasets by proposing a novel aggregation scheme for distributed GP experts that leverages optimal combination weights (OptiCom). The authors first review standard GP regression, hyper‑parameter learning, and sparse variational GP (SVGP) approximations, which reduce the cubic training cost O(n³) to O(n m²) using a small set of inducing points (m ≪ n). They then discuss distributed GP learning, where the global dataset is partitioned across M experts that are trained independently, and they survey existing aggregation methods such as Product‑of‑Experts (PoE), generalized PoE, Bayesian Committee Machine (BCM), robust BCM, generalized robust BCM (grBCM), and Nested Pointwise Aggregation of Experts (NP‑AE). While PoE‑type methods are computationally cheap, they assume independence among experts and therefore lack consistency (the aggregated predictive distribution does not converge to the true posterior as n → ∞). NP‑AE restores consistency by explicitly modeling the full M × M covariance matrix of expert predictions, but its O(M³) matrix inversion becomes prohibitive when many experts are used.
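To make the independence assumption of PoE-type methods concrete, here is a minimal NumPy sketch of the standard precision-weighted Product-of-Experts combination rule (the toy expert predictions are hypothetical values, not from the paper):

```python
import numpy as np

def poe_aggregate(mus, vars_):
    """Product-of-Experts aggregation: each row of `mus`/`vars_` holds one
    expert's predictive means/variances at the test points. Precisions are
    simply summed, which implicitly assumes the experts are independent."""
    prec = np.sum(1.0 / vars_, axis=0)          # aggregated precision
    var_a = 1.0 / prec                          # aggregated variance
    mu_a = var_a * np.sum(mus / vars_, axis=0)  # precision-weighted mean
    return mu_a, var_a

# two experts predicting at three test points (toy numbers)
mus = np.array([[1.0, 2.0, 3.0],
                [1.2, 1.8, 3.4]])
vars_ = np.array([[0.5, 0.5, 0.5],
                  [1.0, 1.0, 1.0]])
mu_a, var_a = poe_aggregate(mus, vars_)
```

Because every expert contributes its full precision regardless of overlap with the others, adding redundant experts shrinks the aggregated variance toward zero; this is exactly the over-confidence that the consistency-restoring methods (grBCM, NP-AE) and the proposed OptiCom weights address.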
The core contribution is the adaptation of the Optimized Combination technique (OptiCom), originally developed for sparse grid interpolation, to GP aggregation. In OptiCom, each expert’s predictive mean μ_i(x) and precision σ_i⁻²(x) are viewed as projections of the underlying function onto subspaces. The goal is to find coefficients c = (c₁,…,c_M) that minimize the squared error between the true sparse‑grid projection and a weighted sum of the expert projections. This leads to a linear system involving inner products ⟨P_i f, P_j f⟩, which can be solved in O(M²) time. The resulting optimal weights automatically encode the correlations among experts.
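The weight computation can be sketched as follows. Assuming a Gram matrix G with G[i, j] ≈ ⟨P_i f, P_j f⟩ is available (the paper derives the exact entries; the values below are hypothetical), the minimization of the squared projection error yields normal equations whose right-hand side is the diagonal of G, since ⟨P_i f, P f⟩ = ‖P_i f‖² for projections:

```python
import numpy as np

def optimal_weights(G):
    """Solve the OptiCom normal equations G c = diag(G) for the
    combination weights c, where G[i, j] approximates <P_i f, P_j f>.
    Cost is dominated by one solve of an M x M system."""
    b = np.diag(G).copy()
    return np.linalg.solve(G, b)

# toy Gram matrix for 3 correlated experts (hypothetical values)
G = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
c = optimal_weights(G)
```

Note the sanity check built into this formulation: if the experts were mutually orthogonal (G diagonal), the solve returns c_i = 1 for all i, recovering an unweighted combination.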
Using these weights, the aggregated predictive distribution is expressed as
μ_A(x) = σ_A²(x) ∑_i c_i σ_i⁻²(x) μ_i(x),
σ_A⁻²(x) = ∑_i c_i σ_i⁻²(x) + (1 − ∑_i c_i) σ*⁻²(x),
where σ*²(x) = k(x,x) + σ_ε² is the prior variance, so σ*⁻²(x) is the prior precision. This formulation reduces to the classic BCM when all c_i = 1, but in general the data‑driven c_i provide a more accurate combination that respects inter‑expert covariance without the need to invert a full M × M matrix.
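The two aggregation equations above translate directly into a few lines of NumPy. The sketch below assumes the weights c, the experts' predictive means/variances, and the prior variance k(x,x) + σ_ε² are already in hand (the numbers in the usage example are hypothetical):

```python
import numpy as np

def opticom_aggregate(c, mus, vars_, var_prior):
    """Weighted aggregation of M expert predictions with prior correction:
    sigma_A^{-2} = sum_i c_i sigma_i^{-2} + (1 - sum_i c_i) * sigma_*^{-2},
    mu_A = sigma_A^2 * sum_i c_i sigma_i^{-2} mu_i.
    Rows of `mus`/`vars_` index experts; columns index test points."""
    prec = c @ (1.0 / vars_) + (1.0 - c.sum()) / var_prior
    var_a = 1.0 / prec
    mu_a = var_a * (c @ (mus / vars_))
    return mu_a, var_a

# toy example: two experts, one test point (hypothetical numbers)
c = np.array([0.5, 0.5])
mus = np.array([[1.0], [3.0]])
vars_ = np.array([[1.0], [1.0]])
mu_a, var_a = opticom_aggregate(c, mus, vars_, var_prior=2.0)
```

Setting c to a vector of ones makes the prior-correction term (1 − ∑_i c_i) σ*⁻² equal to (1 − M) σ*⁻², recovering the BCM combination as stated above.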
The authors integrate OptiCom with both exact GP and SVGP experts. In the SVGP setting, each expert already works with a reduced set of inducing points, so the overall computational complexity becomes O(n m² / M) + O(M²), which is scalable to billions of data points and dozens of experts. They present an algorithmic pipeline (Section 4) that details training of local SVGP experts, computation of the necessary inner products, solving the OptiCom linear system, and forming the final prediction.
Empirical evaluation is conducted on synthetic data, several UCI regression benchmarks, and a large‑scale image‑feature dataset containing millions of samples. The proposed OptiCom‑based aggregator is compared against grBCM, NP‑AE, and rBCM. Results show that OptiCom achieves comparable or lower root‑mean‑square error (RMSE) and negative log‑likelihood (NLL) while reducing wall‑clock time by roughly 30–45 % relative to NP‑AE and by more than 40 % relative to grBCM. Moreover, as the number of experts M grows, the learned weights adaptively down‑weight redundant experts, preventing over‑confidence and preserving predictive uncertainty calibration. The method remains stable even with M = 64 experts, demonstrating its robustness.
In conclusion, the paper introduces a principled, computationally efficient aggregation framework for distributed Gaussian Processes that simultaneously (i) incorporates expert correlations to guarantee consistency, (ii) retains a low O(M²) overhead suitable for large‑M scenarios, and (iii) works seamlessly with both exact and sparse variational GP models. The authors suggest future extensions such as non‑uniform data partitioning, dynamic addition/removal of experts, and integration with federated learning settings.