Have ASkotch: A Neat Solution for Large-scale Kernel Ridge Regression
Kernel ridge regression (KRR) is a fundamental computational tool, appearing in problems that range from computational chemistry to health analytics, and of particular interest due to its starring role in Gaussian process regression. However, full KRR solvers are challenging to scale to large datasets: both direct methods (e.g., Cholesky decomposition) and iterative methods (e.g., preconditioned conjugate gradient, PCG) incur prohibitive computational and storage costs. The standard approach to scaling KRR to large datasets chooses a set of inducing points and solves an approximate version of the problem, inducing points KRR. However, the resulting solution tends to have worse predictive performance than the full KRR solution. In this work, we introduce a new solver, ASkotch, for full KRR that provides better solutions faster than state-of-the-art solvers for full and inducing points KRR. ASkotch is a scalable, accelerated, iterative method for full KRR that provably obtains linear convergence. Under appropriate conditions, we show that ASkotch obtains condition-number-free linear convergence. This convergence analysis rests on the theory of ridge leverage scores and determinantal point processes. ASkotch outperforms state-of-the-art KRR solvers on a testbed of 23 large-scale regression and classification tasks drawn from a wide range of application domains, demonstrating the superiority of full KRR over inducing points KRR. Our work opens up the possibility of as-yet-unimagined applications of full KRR across a number of disciplines.
💡 Research Summary
Kernel ridge regression (KRR) is a cornerstone of many machine learning applications, but solving the full n-by-n regularized linear system (K + λI)w = y becomes infeasible for large n: direct methods require O(n³) time and O(n²) memory, while iterative methods such as preconditioned conjugate gradient (PCG) still need O(n²) work per iteration and large low-rank preconditioners. The common workaround is to select a modest set of inducing points and solve an approximate problem, yet this approximation often degrades predictive performance.
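As a point of reference, the direct approach described above fits in a few lines; the RBF kernel choice, the regularization value, and the helper names here are illustrative, not the paper's setup:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(X, Z, gamma=1.0):
    # Pairwise squared distances -> Gaussian (RBF) kernel matrix.
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def krr_direct(X, y, lam=1e-2, gamma=1.0):
    """Solve the full KRR system (K + lam*I) w = y by Cholesky.
    Forming K takes O(n^2) memory and the factorization O(n^3) time,
    which is exactly what becomes infeasible at large n."""
    K = rbf_kernel(X, X, gamma)
    c, low = cho_factor(K + lam * np.eye(len(y)))
    return cho_solve((c, low), y)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
w = krr_direct(X, y)
```

At n = 200 this is instantaneous; at n = 10⁸ the kernel matrix alone would be far beyond any machine's memory, which is the gap the summarized paper targets.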
The authors introduce ASkotch, a scalable, accelerated iterative solver that tackles the full KRR problem directly. ASkotch combines two ideas: (i) a sketch-and-project (SAP) framework, and (ii) a Nyström approximation of the kernel matrix. At each iteration, a random subset B of size r is drawn using approximate ridge leverage score (ARLS) sampling, which is far cheaper than exact DPP sampling or Hadamard transforms. The Nyström approximation K ≈ K_{:B} K_{BB}^{†} K_{B:} serves as a low-rank surrogate for K. The SAP update then projects the residual onto the subspace spanned by the Nyström columns, followed by a Nesterov-type acceleration step. Crucially, the step size is computed automatically from the spectrum of the Nyström projector, so no manual tuning is required.
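The SAP iteration described above can be caricatured with a toy loop. This sketch substitutes uniform sampling for ARLS sampling, an exact small solve for the Nyström-preconditioned inner step, and omits Nesterov acceleration entirely, so it shows only the shape of the block update, not ASkotch itself:

```python
import numpy as np

def sap_block_krr(K, y, lam=1.0, block_size=25, n_iters=500, seed=0):
    """Toy sketch-and-project loop for (K + lam*I) w = y.
    Illustrative stand-in: uniform block sampling replaces ARLS sampling,
    an exact small solve replaces the Nystrom-preconditioned step, and
    there is no acceleration."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    A = K + lam * np.eye(n)
    w = np.zeros(n)
    for _ in range(n_iters):
        B = rng.choice(n, size=block_size, replace=False)   # sampled block
        resid_B = A[B] @ w - y[B]                           # block residual
        w[B] -= np.linalg.solve(A[np.ix_(B, B)], resid_B)   # project in A-norm
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
y = rng.standard_normal(100)
w = sap_block_krr(K, y)
```

Each step touches only a block_size-by-n slice of the system, which is why this family of methods avoids ever materializing all of K at once.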
The theoretical contribution is twofold. First, the authors develop a reduction from ARLS sampling to determinantal point processes (Lemma 12) and prove a non-trivial lower bound on the smallest eigenvalue of the expected SAP projector (Lemma 10); this shows that ARLS, despite being much cheaper, still yields a well-conditioned projection matrix. Second, they analyze the Nyström projector, showing it is "almost" a projection matrix and that its expectation is bounded below by the SAP projector up to a constant factor (Theorem 13). Combining these results yields a fine-grained linear convergence rate for ASkotch (Theorem 17). When the λ-effective dimension d_λ(K) is moderate, the convergence is condition-number-free, and the total work to reach an ε-accurate solution is Õ(n² log(1/ε)), matching PCG while using far less memory.
From a computational standpoint, each iteration costs O(n b) where b is a modest batch size (often b≈√n or smaller), and the algorithm stores only O(r²) for the Nyström factors plus O(d_λ(K) b) auxiliary vectors. Thus memory scales with the preconditioner rank r, not with n, enabling execution on a single 48 GB GPU even for n≈10⁸.
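A back-of-the-envelope calculation makes the memory claim concrete. The specific numbers (n, r, and the count of length-n work vectors such as the iterate and residual) are illustrative assumptions, not figures from the paper:

```python
def gb(num_floats, bytes_per=8):
    """Gigabytes needed to store num_floats float64 values."""
    return num_floats * bytes_per / 1e9

n = 10**8   # number of samples (illustrative)
r = 200     # Nystrom / preconditioner rank (illustrative default-scale value)

full_kernel = gb(n * n)               # materializing K: O(n^2) storage
pcg_precond = gb(n * r)               # a dense n x r low-rank preconditioner factor
askotch_core = gb(r * r) + gb(5 * n)  # r x r Nystrom factor plus a handful of
                                      # length-n work vectors (iterate, residual, ...)

print(f"full kernel matrix: {full_kernel:.3g} GB")
print(f"dense rank-r preconditioner: {pcg_precond:.3g} GB")
print(f"rank-r factors + O(n) vectors: {askotch_core:.3g} GB")
```

With these numbers the full kernel needs tens of petabytes and even a dense n-by-r preconditioner factor overflows a 48 GB GPU, while the rank-dependent storage plus a few length-n vectors fits comfortably.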
Empirically, the paper evaluates ASkotch on 23 large-scale regression and classification tasks spanning chemistry, healthcare, and computer vision, with datasets ranging from 10⁵ to 10⁸ samples. ASkotch consistently outperforms state-of-the-art solvers: PCG often fails to complete a single iteration within a 24-hour limit on the largest problems; EigenPro 2.0/3.0 diverge with default hyperparameters; and Falkon (an inducing-point PCG method) is limited to m ≈ 2·10⁴ inducing points due to memory. In contrast, ASkotch with default settings (r ≈ 50–200) reaches lower RMSE or higher classification accuracy in a few minutes, and it solves a full KRR problem of size 10⁸ × 10⁸ (10¹⁶ kernel entries), two orders of magnitude larger than the biggest inducing-point problem previously tackled.
The authors also release a PyTorch implementation with automatic step‑size computation and GPU‑accelerated kernels, facilitating reproducibility and easy adoption.
In summary, ASkotch delivers (i) linear‑time per‑iteration and low‑memory footprint, (ii) rigorous condition‑number‑free linear convergence, and (iii) robust default hyper‑parameters, thereby establishing a new practical baseline for solving full‑scale kernel ridge regression problems that were previously thought to be tractable only via inducing‑point approximations.