Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels
We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignment-sensitive complexity measure that depends jointly on the signal, spectrum, and noise level σ². The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most K, the minimax excess risk scales as σ²K. Furthermore, we analyze over-parameterized gradient flow and prove that it can reduce the ESD. This finding establishes a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms. We demonstrate the generality of the ESD framework by extending it to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.
💡 Research Summary
The paper tackles the problem of understanding generalization for spectral algorithms when the kernel is not fixed but learned from data. Classical kernel regression theory assumes a fixed kernel with known eigenvalue decay and source conditions on the target function; these assumptions break down for adaptive kernels that evolve during training. To bridge this gap, the authors introduce a new population‑level complexity measure called the Effective Span Dimension (ESD).
ESD is defined for any triple (θ*, λ, σ²) where θ* are the coefficients of the target in the kernel eigenbasis, λ are the eigenvalues (ordered decreasingly), and σ² is the noise variance. Formally, d†(σ²;θ*,λ) is the smallest integer k such that the squared ℓ₂‑norm of the tail of θ* beyond the top‑k coordinates is at most σ². Intuitively, it is the minimal number of leading eigen‑directions required to capture the signal energy above the noise floor. Crucially, ESD depends on the alignment between the signal and the spectrum, unlike traditional measures (effective dimension, effective rank) that ignore the signal.
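Under this definition, the ESD is straightforward to compute from (θ*, λ, σ²). The sketch below is a minimal NumPy implementation of that definition (the function name `esd` and the example values are ours, not the paper's), assuming θ* is given in the kernel eigenbasis and coordinates are ranked by decreasing eigenvalue:

```python
import numpy as np

def esd(theta, lam, sigma2):
    """Effective span dimension d†(sigma2; theta, lam): the smallest k such
    that the signal energy outside the top-k eigen-directions is <= sigma2."""
    order = np.argsort(lam)[::-1]                 # rank coordinates by decreasing eigenvalue
    t = np.asarray(theta, dtype=float)[order]
    # tails[k] = sum of squared coefficients outside the top-k coordinates
    tails = np.append(np.cumsum(t[::-1] ** 2)[::-1], 0.0)
    return int(np.argmax(tails <= sigma2))        # first k whose tail is below the noise floor

theta = [3.0, 1.0, 0.5]          # signal coefficients in the eigenbasis
lam = [1.0, 0.5, 0.25]           # eigenvalues, already decreasing
print(esd(theta, lam, 2.0))      # tail past the top coordinate is 1.25 <= 2, so d† = 1
```

At high noise the ESD collapses toward 0 (nothing rises above the noise floor), and at low noise it grows toward the number of nonzero coefficients, matching the intuition of a noise-dependent "span" of the signal.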
The authors first study a canonical sequence model z_j = θ*_j + ξ_j with independent noise ξ_j ∼ N(0,σ²). For this model, they show that the Principal Component (PC) estimator, which keeps the top‑ν eigencomponents, achieves optimal risk R_PC* satisfying (d†−1)σ² ≤ R_PC* ≤ 2d†σ². This shows that the PC estimator's risk at its optimal truncation level is characterized by the ESD up to constant factors.
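The bias–variance tradeoff behind this bound can be checked numerically: keeping the top‑ν coordinates incurs variance νσ² plus the squared bias of the discarded tail, and the optimal ν balances the two near d†. A small sketch (our own illustration with made-up coefficients, not the paper's code):

```python
import numpy as np

def pc_risk(theta, sigma2, nu):
    """Exact risk of the PC estimator keeping the top-nu coordinates:
    variance nu*sigma2 plus the squared bias from the truncated tail."""
    t = np.asarray(theta, dtype=float)
    return nu * sigma2 + float(np.sum(t[nu:] ** 2))

theta = np.array([2.0, 1.0, 0.5, 0.1])   # coefficients, ordered by eigenvalue
sigma2 = 0.3
risks = [pc_risk(theta, sigma2, nu) for nu in range(len(theta) + 1)]
best_nu = int(np.argmin(risks))
# Here d† = 2 (tail energy past the top-2 coords is 0.26 <= 0.3), so the
# bound predicts min(risks) in [(d†-1)σ², 2d†σ²] = [0.3, 1.2].
print(best_nu, min(risks))               # → 2 0.86
```

In this toy instance the optimal truncation level coincides with d† and the optimal risk 0.86 indeed lands inside the predicted interval.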
The central minimax result (Theorem 3.3) states that for any class of signals whose ESD is bounded by K, the minimax excess risk over all estimators is Θ(Kσ²). In other words, ESD is the intrinsic difficulty parameter: the larger the allowed span, the larger the unavoidable risk, linearly in K. This result holds without any eigenvalue decay or source‑condition assumptions, making it broadly applicable.
To illustrate the power of ESD, the paper provides several examples (polynomial eigenvalue decay, power‑law signals, finite‑dimensional settings) and shows how classic minimax rates are recovered as special cases when σ² is scaled as σ₀²/n. The authors also define a “span profile” D_{θ*,λ}(τ) = d†(τ;θ*,λ), which tracks how ESD varies with the noise level τ, enabling direct comparison of two kernels: if the ratio D_{θ*,λ₁}(τ)/D_{θ*,λ₂}(τ) → 0 as τ→0, kernel λ₁ yields strictly better risk.
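The span-profile comparison can be made concrete with a toy example: a signal concentrated on one eigendirection that an aligned kernel ranks first and a generic decaying kernel ranks low. The spectra and signal below are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

def esd(theta, lam, tau):
    # Smallest k with signal energy outside the top-k eigen-directions <= tau
    order = np.argsort(lam)[::-1]
    t = np.asarray(theta, dtype=float)[order]
    tails = np.append(np.cumsum(t[::-1] ** 2)[::-1], 0.0)
    return int(np.argmax(tails <= tau))

theta = np.zeros(8)
theta[4] = 1.0                              # all signal energy on coordinate 4
lam_generic = 1.0 / np.arange(1, 9) ** 2    # generic decay: ranks coordinate 4 fifth
lam_aligned = lam_generic.copy()
lam_aligned[4] = 2.0                        # aligned kernel ranks the signal direction first

for tau in [0.5, 0.1, 0.01]:                # span profiles D(tau) for both kernels
    print(tau, esd(theta, lam_generic, tau), esd(theta, lam_aligned, tau))
```

For every noise level below the signal energy, the aligned kernel needs a span of 1 while the generic kernel needs 5, so its span profile (and hence its risk) is uniformly better even though both spectra contain the same eigenvalues up to one entry.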
A major contribution is the analysis of over‑parameterized gradient flow (GF) on the eigenvalues while keeping eigenfunctions fixed. The authors prove that GF can reduce the ESD, i.e., the learning dynamics automatically re‑weight eigenvalues to improve signal‑kernel alignment, thereby lowering the minimax risk. This provides a concrete theoretical link between adaptive feature learning and provable generalization gains.
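The paper's exact gradient-flow dynamics are not reproduced here, but the qualitative mechanism can be sketched with a toy diagonal over-parameterization: writing each eigenvalue as λ_j = u_j² and running gradient descent on a fitting loss amplifies strong signal coordinates fastest, so the learned ordering ranks them first and the ESD drops. Everything below (the loss, step size, initialization) is our hypothetical illustration, not the paper's construction:

```python
import numpy as np

def esd(theta, lam, tau):
    # Smallest k with signal energy outside the top-k eigen-directions <= tau
    order = np.argsort(lam)[::-1]
    t = np.asarray(theta, dtype=float)[order]
    tails = np.append(np.cumsum(t[::-1] ** 2)[::-1], 0.0)
    return int(np.argmax(tails <= tau))

# Misaligned target: most energy sits on a direction the initial kernel ranks low.
theta = np.array([0.1, 0.1, 0.1, 2.0, 0.1, 0.1])
lam0 = 1.0 / np.arange(1, 7)              # initial spectrum, decreasing in coordinate index

# Toy dynamics (hypothetical, not the paper's GF): lam_j = u_j^2, gradient
# descent on L(u) = (1/4) * sum_j (u_j^2 - theta_j^2)^2, whose "rich get
# richer" flow amplifies large-signal coordinates fastest.
u = 0.1 * np.sqrt(lam0)                   # small init preserving the initial ordering
for _ in range(5000):
    u -= 0.01 * u * (u ** 2 - theta ** 2)
lam_T = u ** 2                            # learned spectrum after training

print(esd(theta, lam0, 0.06), esd(theta, lam_T, 0.06))  # ESD shrinks: 4 -> 1
```

After training, the learned spectrum places the strong coordinate first, so at the same noise level the ESD falls from 4 to 1: the dynamics improved alignment, not just scale.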
Beyond the sequence model, the framework is extended to linear regression (via whitening) and to RKHS regression (using Mercer’s decomposition). In both cases, the excess risk can be expressed in terms of the corresponding ESD, confirming that the same principle governs a wide range of learning problems.
Empirical experiments corroborate the theory: (i) learned kernels achieve lower ESD than fixed kernels for the same data; (ii) reordering eigenfunctions without changing eigenvalues can dramatically reduce ESD, demonstrating that alignment matters even when the spectrum is unchanged; (iii) risk curves across noise levels align with the predicted σ²·ESD scaling.
In conclusion, the Effective Span Dimension offers a signal‑aware, alignment‑sensitive complexity measure that captures the fundamental limits of spectral methods with learned kernels. It explains why adaptive kernel learning can outperform classical fixed‑kernel approaches and opens avenues for future work on non‑linear feature learning, multi‑task settings, and more general regularization schemes.