Data-dependent kernels in nearly-linear time

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

We propose a method to efficiently construct data-dependent kernels that can make use of large quantities of (unlabeled) data. Our construction approximates the standard construction of semi-supervised kernels in Sindhwani et al. (2005). In typical cases these kernels can be computed in nearly-linear time (in the amount of data), improving on the cubic time of the standard construction and enabling large-scale semi-supervised learning in a variety of contexts. The methods are validated on semi-supervised and unsupervised problems on data sets containing up to 64,000 sample points.


💡 Research Summary

The paper addresses a fundamental scalability bottleneck in semi‑supervised learning: the construction of data‑dependent kernels that incorporate a large amount of unlabeled data. The classic formulation of Sindhwani et al. (2005) augments a base kernel K with an “intrinsic regularizer” defined by a symmetric positive‑semi‑definite matrix Q (often a graph Laplacian). The resulting kernel K̃ requires inverting the n × n matrix (I + ηQK), leading to O(n³) time and O(n²) memory, which is prohibitive even for moderately sized datasets.
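As a point of reference, the exact construction can be sketched in a few lines of dense NumPy. The deformed-kernel Gram formula K̃ = K − K(I + ηQK)⁻¹ηQK is one standard form of the Sindhwani et al. (2005) construction; the RBF base kernel, the path-graph Laplacian, and all function names here are illustrative choices, not from the paper:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Dense RBF Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def path_laplacian(n):
    """Graph Laplacian Q of a path graph on n nodes (a toy stand-in for
    the data-adjacency Laplacian used in practice)."""
    A = np.zeros((n, n))
    i = np.arange(n - 1)
    A[i, i + 1] = A[i + 1, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def deformed_gram(K, Q, eta=1.0):
    """Gram matrix of the deformed kernel: K - K (I + eta*Q*K)^{-1} eta*Q*K.
    The n x n solve here is the O(n^3) bottleneck discussed above."""
    n = K.shape[0]
    return K - K @ np.linalg.solve(np.eye(n) + eta * Q @ K, eta * Q @ K)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
K = rbf_gram(X)
K_tilde = deformed_gram(K, path_laplacian(50), eta=0.5)
```

The resulting Gram matrix is symmetric positive semi-definite, but both the time and the memory of this direct route scale with n, which is what the approximation below avoids.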

The authors propose a two‑stage approximation that reduces the computational cost to nearly linear in the number of data points while preserving the regularization effect of the full Q. First, they restrict the evaluation of functions to a small subsample of size b ≪ n of the data, denoted X̂_S. For any function h, they compute its values on X̂_S (a vector ĥ) and then interpolate these values to the whole dataset using the covariance Q⁺ (the Moore‑Penrose pseudoinverse of Q). The interpolated function h* is the minimum‑norm extension of ĥ under the regularizer reg_Q(h) = hᵀQh. The intrinsic regularizer is then approximated by reg_Q(h*) = ĥᵀQ̂ĥ, where Q̂ is defined as the pseudoinverse of the sub‑matrix of Q⁺ restricted to the subsample: Q̂ = (Q⁺|_{X̂_S})⁺.
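A minimal dense sketch of these two steps on a toy path-graph Laplacian; `Q_hat` and `h_hat` denote the subsample regularizer matrix and subsample values from the text, the graph and index choices are illustrative, and `np.linalg.pinv` stands in for the efficient solvers the paper uses:

```python
import numpy as np

def path_laplacian(n):
    """Laplacian Q of a path graph; its null space is the constant vector."""
    A = np.zeros((n, n))
    i = np.arange(n - 1)
    A[i, i + 1] = A[i + 1, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

n = 30
S = np.array([0, 7, 15, 29])     # subsample indices, b = 4 << n
Q = path_laplacian(n)
Q_pinv = np.linalg.pinv(Q)       # Moore-Penrose pseudoinverse Q^+

# Q_hat: pseudoinverse of Q^+ restricted to the subsample.
Q_hat = np.linalg.pinv(Q_pinv[np.ix_(S, S)])

# Minimum-norm interpolation of the subsample values h_hat to all n points;
# equivalently, the posterior mean of a GP with covariance Q^+ observed at S.
h_hat = np.array([0.0, 1.0, 0.5, -1.0])
h_star = Q_pinv[:, S] @ (Q_hat @ h_hat)
```

On the subsample, `h_star` reproduces `h_hat` exactly, and its full regularizer hᵀQh collapses to the b-dimensional quadratic form on the subsample values, so the small matrix alone carries the regularization information.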

The second stage concerns the efficient computation of Q̂. When Q is a sparse, symmetric diagonally dominant (SDD) matrix—as is the case for graph Laplacians—recent nearly‑linear‑time solvers for SDD systems (Spielman–Teng, Koutis–Miller–Peng) can compute an ε‑approximation of Q⁺v for any vector v in O(s log n (log log n)² log(1/ε)) time, where s is the number of non‑zero entries of Q. By applying these solvers to one right‑hand side per subsample point, the authors obtain an ε‑approximation A to Q̂⁺ = Q⁺|_{X̂_S} in time O(b · s log n (log log n)² log(1/ε) + b²n). Consequently, the overall cost of constructing the approximated kernel K̃ is O(b³) for the one‑time matrix inversion (I_b + η Q̂ K̂)⁻¹ (with K̂ the base kernel restricted to the subsample) plus O(b²) per kernel evaluation, dramatically lower than the O(n³) and O(n²) of the exact construction.

Theoretical contributions include:

  1. Theorem 3.1, which formalizes the duality between a regularization operator R and its Green’s function (the kernel given by R⁺).
  2. Theorem 3.2, which shows that the inner product defined by Q̂ on the subsample exactly equals the regularizer evaluated on the minimum‑norm interpolant of the subsample values under the full Q. This provides both an interpolation and a Bayesian interpretation (the interpolant is the posterior mean of a Gaussian process with covariance Q⁺).
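Both results can be sanity-checked numerically. The sketch below (illustrative, not from the paper) verifies the Green's-function identity underlying Theorem 3.1 and the regularizer equality of Theorem 3.2 on a random connected graph, with `Q_hat`, `h_hat`, and `h_star` denoting the subsample matrix, subsample values, and minimum-norm interpolant from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Regularizer Q: Laplacian of a random graph, made connected via a path.
n = 20
A = np.triu((rng.random((n, n)) < 0.3).astype(float), 1)
A = A + A.T
i = np.arange(n - 1)
A[i, i + 1] = A[i + 1, i] = 1.0          # path edges ensure connectivity
Q = np.diag(A.sum(axis=1)) - A
G = np.linalg.pinv(Q)                    # Green's function of Q (its kernel dual)

# Theorem 3.2 ingredients: subsample S, Q_hat = (G|_S)^+, min-norm interpolant.
S = np.array([2, 5, 11, 17])
Q_hat = np.linalg.pinv(G[np.ix_(S, S)])
h_hat = rng.normal(size=S.size)
h_star = G[:, S] @ (Q_hat @ h_hat)       # interpolates h_hat on S
```

In this finite-dimensional setting the duality reduces to the pseudoinverse identity GQG = G, and the equality hᵀQh evaluated at `h_star` against the b-dimensional form on `h_hat` holds to machine precision.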

The authors also discuss the specialization to graph‑Laplacian regularizers, showing that Q̂ captures the global graph structure despite being built from a tiny subset of vertices. They illustrate this with a figure (not reproduced here) that visualizes how a smooth function on the full graph can be reconstructed from values on a few nodes.

Empirical evaluation is performed on several semi‑supervised classification benchmarks (up to 64,000 points) and on spectral clustering tasks. The proposed kernel is compared against:

  • LapSVM (the exact semi‑supervised kernel method),
  • Standard RBF‑kernel SVM,
  • A “budget” LapSVM that discards most unlabeled points.

Results show that on small datasets the approximated kernel matches LapSVM’s accuracy, confirming the theoretical fidelity of the approximation. On larger datasets where LapSVM cannot be run due to memory/time constraints, the proposed method outperforms the RBF kernel and the budget LapSVM, while requiring only a fraction of the computational resources. In clustering experiments, the kernel yields comparable or superior normalized cut values relative to traditional graph‑based methods.

In summary, the paper delivers a practical, theoretically grounded framework for constructing data‑dependent kernels at scale. By decoupling the number of points used to build the regularizer (which can be all n points) from the number of points actually measured (the small subsample), and by leveraging state‑of‑the‑art SDD solvers, the authors achieve nearly linear time complexity without sacrificing the regularization power of the full graph. This makes semi‑supervised kernel methods viable for modern large‑scale problems and opens the door for their integration into a wide range of kernel‑based algorithms beyond classification, such as regression, clustering, and manifold learning.

