Comparing regularisation paths of (conjugate) gradient estimators in ridge regression


We consider standard gradient descent, gradient flow and conjugate gradients as iterative algorithms for minimising a penalised ridge criterion in linear regression. While it is well known that conjugate gradients exhibit fast numerical convergence, the statistical properties of their iterates are more difficult to assess due to inherent non-linearities and dependencies. On the other hand, standard gradient flow is a linear method with well-known regularising properties when stopped early. By an explicit non-standard error decomposition we are able to bound the prediction error for conjugate gradient iterates by a corresponding prediction error of gradient flow at transformed iteration indices. This way, the risk along the entire regularisation path of conjugate gradient iterations can be compared to that for regularisation paths of standard linear methods like gradient flow and ridge regression. In particular, the oracle conjugate gradient iterate shares the optimality properties of the gradient flow and ridge regression oracles up to a constant factor. Numerical examples show the similarity of the regularisation paths in practice.


💡 Research Summary

This paper investigates the statistical regularisation properties of three iterative algorithms used to minimise the penalised least‑squares objective of ridge regression in high‑dimensional linear models: standard gradient descent (GD), its continuous‑time limit gradient flow (GF), and conjugate gradients (CG). While CG is well known for its rapid numerical convergence, its statistical behaviour is harder to characterise because the iterates depend non‑linearly on the data. The authors introduce a novel, non‑standard error decomposition that separates approximation, stochastic, and cross‑terms for any estimator expressed as a residual filter applied to the ridge‑adjusted response. Using this decomposition they first extend the implicit regularisation result of Ali, Kolter and Tibshirani (2019) to the penalised case, showing that GF at time t has prediction risk bounded (up to a constant factor) by ridge regression with penalty λ+1/t, provided a mild geometric condition on the target vector holds.
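This GF–ridge correspondence is easy to probe numerically. The sketch below (illustrative dimensions, penalty, and noise level chosen by us, not the paper's experiments) uses the standard spectral closed form of the penalised gradient-flow estimator and compares its in-sample risk at time t with ridge regression at the matched penalty λ + 1/t:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, sigma = 100, 30, 0.1, 0.5   # illustrative choices

X = rng.standard_normal((n, p))
beta_star = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_star + sigma * rng.standard_normal(n)

Sigma = X.T @ X / n                    # empirical covariance
z = X.T @ y / n
evals, V = np.linalg.eigh(Sigma)

def risk(beta):
    """In-sample prediction error ||X (beta - beta_star)||^2 / n."""
    d = X @ (beta - beta_star)
    return d @ d / n

def gf(t):
    """Gradient flow on the penalised criterion, started at 0:
    beta(t) = (I - exp(-t (Sigma + lam I))) (Sigma + lam I)^{-1} z."""
    shrink = (1 - np.exp(-t * (evals + lam))) / (evals + lam)
    return V @ (shrink * (V.T @ z))

def ridge(lam2):
    """Ridge estimator with penalty lam2."""
    return V @ ((V.T @ z) / (evals + lam2))

ts = np.logspace(-2, 3, 60)
gf_risk = np.array([risk(gf(t)) for t in ts])
rr_risk = np.array([risk(ridge(lam + 1.0 / t)) for t in ts])
```

On such draws the two risk curves track each other closely along the whole path, consistent with the constant-factor bound described above.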

The core contribution is Theorem 3.7, which establishes a deterministic, data‑dependent time re‑parameterisation τ(t) based on the zeros of the CG residual polynomial. The theorem proves that the in‑sample penalised prediction risk of the CG estimator after t continuous‑time steps is bounded by a constant (depending only on the spectrum of the empirical covariance) times the risk of GF at the transformed time τ(t). Consequently, the entire CG regularisation path inherits the same implicit regularisation effects as GF and, by extension, ridge regression. Corollary 3.10 shows that the CG “oracle” iterate (the one achieving minimal risk along the CG path) attains risk within a constant factor of both the GF and ridge‑regression oracles. Proposition 3.11 proves monotonicity of the GF risk for sufficiently large penalties, which together with Theorem 3.7 yields a monotone risk bound for CG as well.
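The comparison behind Theorem 3.7 can be illustrated empirically. The sketch below (our own illustrative setup, not the paper's) runs textbook CG on the ridge-penalised normal equations (Σ̂ + λI)β = Xᵀy/n, records the in-sample prediction risk of every discrete iterate, and computes a gradient-flow risk path from its spectral closed form for comparison; the data-dependent re-parameterisation τ(t) itself is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, sigma = 100, 30, 0.1, 0.5   # illustrative choices
X = rng.standard_normal((n, p))
beta_star = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_star + sigma * rng.standard_normal(n)

Sigma = X.T @ X / n
z = X.T @ y / n
A = Sigma + lam * np.eye(p)            # penalised normal-equations matrix

def risk(beta):
    d = X @ (beta - beta_star)
    return d @ d / n

# Standard conjugate gradients on A beta = z, recording every iterate's risk.
beta = np.zeros(p)
r = z - A @ beta
d = r.copy()
cg_risk = [risk(beta)]
for _ in range(p):
    Ad = A @ d
    alpha = (r @ r) / (d @ Ad)
    beta = beta + alpha * d
    r_new = r - alpha * Ad
    cg_risk.append(risk(beta))
    if np.linalg.norm(r_new) < 1e-12:
        break
    d = r_new + ((r_new @ r_new) / (r @ r)) * d
    r = r_new

# Gradient-flow risk path (spectral closed form) for comparison.
evals, V = np.linalg.eigh(Sigma)
def gf(t):
    shrink = (1 - np.exp(-t * (evals + lam))) / (evals + lam)
    return V @ (shrink * (V.T @ z))
gf_risk = [risk(gf(t)) for t in np.logspace(-2, 3, 60)]
```

In runs like this the best CG iterate and the best GF time attain risks of the same order, matching the oracle comparison of Corollary 3.10, with CG needing only a handful of iterations.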

To bridge the gap between in‑sample and out‑of‑sample performance, Proposition 3.13 provides a high‑probability transfer result under the assumption that the effective rank of the empirical covariance is small relative to the sample size—a realistic condition in many high‑dimensional, low‑rank settings. The authors also discuss how the constant factor depends on the eigenvalue decay of the covariance matrix, covering polynomial decay, Marchenko–Pastur type spectra, and spiked models.
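The effective-rank condition is straightforward to inspect numerically. A minimal sketch, assuming a fast polynomially decaying population spectrum (an illustrative choice) with p ≫ n:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 500                         # p >> n, illustrative sizes
# Population covariance with polynomial eigenvalue decay lambda_j = j^{-2}.
pop_evals = 1.0 / (np.arange(1, p + 1) ** 2)
X = rng.standard_normal((n, p)) * np.sqrt(pop_evals)  # columns scaled

Sigma_hat = X.T @ X / n
evals = np.linalg.eigvalsh(Sigma_hat)
# Effective rank r(Sigma_hat) = tr(Sigma_hat) / ||Sigma_hat||_op.
eff_rank = evals.sum() / evals.max()
```

With this spectrum the effective rank stays close to a small constant even though p = 500, so it is indeed small relative to n, the regime in which the transfer result applies.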

Algorithm 1 presents the penalised CG method, and the authors extend the discrete iterates to a continuous regularisation path by linear interpolation of the residual polynomials, following Hucker and Reiß (2022). They compare this path with the explicit GF filter R_GF(t)(x) = e^{−tx} and the ridge filter R_RR,λ,λ′(x) = (λ′ − λ)/(λ′ − λ + x).
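A minimal sketch of the two explicit filters quoted above (the interpolated CG residual polynomials of Algorithm 1 are not reproduced here; the grid and parameter values are illustrative):

```python
import numpy as np

def R_GF(t, x):
    """Gradient-flow residual filter: fraction of signal left unfit."""
    return np.exp(-t * x)

def R_RR(lam, lam2, x):
    """Ridge residual filter going from penalty lam2 down to lam."""
    return (lam2 - lam) / (lam2 - lam + x)

x = np.linspace(0.0, 5.0, 101)
t, lam = 2.0, 0.1                       # illustrative values
# Matching lam2 = lam + 1/t makes both filters equal 1 at x = 0
# and decay comparably in x.
gf_vals = R_GF(t, x)
rr_vals = R_RR(lam, lam + 1.0 / t, x)
```

Both are residual filters in the usual sense: they equal 1 at x = 0 (nothing fitted) and decrease monotonically in the spectral value x.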

Empirical validation includes synthetic experiments with p≫n and a real‑world genomics dataset. In both cases the regularisation trajectories of CG, GD (discrete), and ridge regression are strikingly similar; CG reaches near‑optimal risk after far fewer iterations than GD, confirming the theoretical predictions.

Overall, the paper delivers a rigorous, non‑asymptotic analysis showing that CG not only converges quickly numerically but also enjoys the same implicit regularisation guarantees as gradient flow and ridge regression. This bridges a gap between optimisation efficiency and statistical optimality, positioning CG as a theoretically sound and computationally attractive tool for large‑scale penalised regression.

