Efficient Marginal Likelihood Computation for Gaussian Process Regression
In a Bayesian learning setting, the posterior distribution of a predictive model arises from a trade-off between its prior distribution and the likelihood of the observed data. These distributions usually depend on additional hyperparameters that must be tuned to achieve optimal predictive performance; this tuning can be performed efficiently in an Empirical Bayes fashion by maximizing the marginal likelihood of the observed data. Since the resulting objective function is in general characterized by multiple local optima, it is necessary to resort to global optimization strategies, which require a large number of function evaluations. Given that each evaluation is usually computationally intensive and scales poorly with the dataset size, the maximum number of observations that can be treated simultaneously is quite limited. In this paper, we consider hyperparameter tuning in Gaussian process regression. A straightforward implementation of the marginal log-likelihood for this model requires O(N^3) operations for every iteration of the optimization procedure, where N is the number of examples in the input dataset. We derive a novel set of identities that allow, after an initial overhead of O(N^3), the evaluation of the score function, as well as the Jacobian and Hessian matrices, in O(N) operations. We show that the proposed identities, which follow from the eigendecomposition of the kernel matrix, yield a reduction of several orders of magnitude in the computation time for the hyperparameter optimization problem. Notably, the proposed solution provides computational advantages even over state-of-the-art approximations that rely on sparse kernel matrices.
💡 Research Summary
The paper addresses the computational bottleneck that arises when tuning hyper‑parameters of Gaussian Process Regression (GPR) – and equivalently kernel ridge regression – via Empirical Bayes marginal likelihood maximization. In the standard formulation, the marginal log‑likelihood L_y depends on the N×N covariance matrix Σ_y = σ²K(K+σ²λ²I)^{-1}+I and its inverse. Computing the log‑determinant and the quadratic form (μ_y−y)ᵀΣ_y^{-1}(μ_y−y) requires an O(N³) matrix inversion at every evaluation, which becomes prohibitive for realistic data sizes. Moreover, global optimization strategies (e.g., grid search, PSO, genetic algorithms) need hundreds of function evaluations, and local refinement (e.g., Newton‑Raphson) needs Jacobian and Hessian evaluations at each iteration, further amplifying the cost.
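To make the per-evaluation cost concrete, here is a minimal NumPy sketch of the naïve approach. It uses the standard GP parametrization Σ_y = K + σ²I rather than the paper's exact (σ², λ²) form, and an illustrative squared-exponential kernel; the point is the O(N³) factorization repeated at every call:

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    # Squared-exponential kernel (illustrative choice; the paper's identities
    # only require a fixed symmetric PSD kernel matrix K).
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def naive_log_marginal(K, y, sigma2):
    # O(N^3) at EVERY call: the N x N covariance is factorized from scratch.
    N = len(y)
    Sigma_y = K + sigma2 * np.eye(N)   # simplified parametrization (assumption)
    L = np.linalg.cholesky(Sigma_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Sigma_y^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))            # log det Sigma_y
    return -0.5 * (y @ alpha + logdet + N * np.log(2.0 * np.pi))
```

A global optimizer calls `naive_log_marginal` hundreds of times, repeating the Cholesky factorization each time, which is exactly the bottleneck the paper removes.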
The authors propose a novel set of identities that exploit the eigendecomposition of the kernel matrix K. Writing K = USUᵀ, where U contains orthonormal eigenvectors and S = diag(s_i) holds the eigenvalues, they show that Σ_y and its inverse are diagonalized by the same eigenvectors. Consequently, all scalar quantities required for the marginal likelihood and its derivatives can be expressed as simple sums over the eigenvalues s_i and the entries of the transformed target vector ỹ = Uᵀy.
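The spectral shortcut can be sketched as follows, again under the simplified parametrization Σ_y = K + σ²I rather than the paper's exact d_i/g_i rational functions: after a single eigendecomposition, each likelihood evaluation reduces to O(N) sums over the eigenvalues and the rotated targets.

```python
import numpy as np

def precompute(K, y):
    # One-time O(N^3) step: K = U diag(s) U^T, plus the rotated targets.
    s, U = np.linalg.eigh(K)
    return s, U.T @ y          # eigenvalues s_i and y~ = U^T y

def fast_log_marginal(s, y_tilde, sigma2):
    # Sigma_y = K + sigma^2 I shares K's eigenvectors, with eigenvalues
    # s_i + sigma^2, so both the log-determinant and the quadratic form
    # collapse to O(N) sums.
    d = s + sigma2
    return -0.5 * (np.sum(y_tilde**2 / d) + np.sum(np.log(d))
                   + len(s) * np.log(2.0 * np.pi))
```

Every subsequent call to `fast_log_marginal` with a new σ² reuses the same `(s, y_tilde)` pair, which is the mechanism behind the O(N) per-evaluation cost.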
Key results include:
- A closed‑form expression for the marginal log‑likelihood (Proposition 2.1) that reduces to N log σ² plus a sum over i of log d_i and ỹ_i² g_i, where d_i and g_i are rational functions of s_i, σ², and λ².
- Analytic formulas for the gradient ∂L_y/∂σ² and ∂L_y/∂λ² (Proposition 2.2) that involve only first‑order derivatives of d_i and g_i, again computable in O(N).
- Analytic formulas for the Hessian entries (Proposition 2.3) that involve second‑order derivatives of d_i and g_i, still O(N) because each term is a scalar function of a single eigenvalue.
- An efficient representation of the posterior covariance Σ_c (Proposition 2.4) in the same eigenbasis, allowing either full reconstruction via Strassen’s algorithm (O(N^{log₂7}) ≈ O(N^{2.807})) or extraction of the diagonal elements in O(N).
The computational profile changes dramatically: after a one‑time O(N³) eigendecomposition, every subsequent evaluation of L_y, its gradient, and Hessian costs only O(N). For a global optimizer requiring k* function evaluations, the total cost drops from k*·O(N³) to O(N³) + k*·O(N). The authors prove that the speed‑up factor is essentially k*, with an upper bound of O(N²) as k*→∞.
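The amortization argument can be sketched as a grid search over the noise variance that reuses a single eigendecomposition. The grid and the simplified parametrization Σ_y = K + σ²I are illustrative, not from the paper:

```python
import numpy as np

def grid_search_sigma2(K, y, grid):
    # One O(N^3) eigendecomposition amortized over all k* grid evaluations,
    # each of which is O(N): total O(N^3) + k*·O(N) instead of k*·O(N^3).
    s, U = np.linalg.eigh(K)
    y_t = U.T @ y
    best, best_ll = None, -np.inf
    for sigma2 in grid:
        d = s + sigma2
        ll = -0.5 * (np.sum(y_t**2 / d) + np.sum(np.log(d))
                     + len(y) * np.log(2.0 * np.pi))
        if ll > best_ll:
            best, best_ll = sigma2, ll
    return best, best_ll
```

The same structure carries over to gradient- or Hessian-based local refinement, since the paper's Propositions 2.2 and 2.3 make those O(N) per iteration as well.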
A simulation study validates the theory. Synthetic datasets with N ranging from 200 to 500 are used. The proposed method reproduces the exact marginal likelihood values and the optimal hyper‑parameters obtained by the naïve O(N³) approach, while achieving speed‑ups of one to two orders of magnitude. Compared with state‑of‑the‑art sparse approximations (e.g., Nyström, FITC), the new technique is exact, sacrificing no accuracy, and nonetheless remains faster for the tested problem sizes.
In conclusion, the paper delivers a mathematically elegant and practically impactful solution: by leveraging the spectral properties of the kernel matrix, it transforms the hyper‑parameter optimization problem in GPR from a cubic‑time task into a linear‑time one after a single eigendecomposition. This enables exact Empirical Bayes tuning on datasets that were previously out of reach, positioning the method as a new benchmark for efficient, exact marginal likelihood computation in kernel‑based learning.