Complexity reduction in online stochastic Newton methods with potential O(N d) total cost


Optimizing smooth convex functions in stochastic settings, where only noisy estimates of gradients and Hessians are available, is a fundamental problem in optimization. While first-order methods have low per-iteration cost, their convergence is slow for ill-conditioned problems. Stochastic Newton methods use second-order information to correct for local curvature, but the O(d³) per-iteration cost of computing and inverting a full Hessian, where d is the problem dimension, is prohibitive in high dimensions. This paper introduces an online mini-batch stochastic Newton algorithm. The method employs a random masking strategy that selects a subset of Hessian columns at each iteration, substantially reducing the per-step computational cost. In the mini-batch setting, this allows the algorithm to achieve a total computational cost of O(N d) for a single pass over N data points, which is comparable to first-order methods while retaining the advantages of second-order information. We establish the almost sure convergence and asymptotic efficiency of the resulting estimator. This property is obtained without requiring iterate averaging, which distinguishes this work from prior analyses.


💡 Research Summary

The paper addresses the classic problem of minimizing a smooth, convex function F(θ) in a stochastic online setting where only noisy gradient and Hessian estimates are available. First‑order methods such as stochastic gradient descent (SGD) are cheap per iteration (O(d)) but suffer from slow convergence on ill‑conditioned problems. Stochastic Newton methods can dramatically accelerate convergence by pre‑conditioning with an approximation of the inverse Hessian, yet the naïve implementation requires O(d³) operations per iteration to compute and invert a full Hessian, which is infeasible in high dimensions.

The authors propose a novel algorithm called the masked Stochastic Newton Algorithm (mSNA) that brings together three essential ingredients: (1) a stochastic gradient descent (SGD) scheme for directly estimating the inverse Hessian H⁻¹, (2) a random masking (or coordinate‑sampling) technique that reduces the per‑iteration cost of the SGD updates, and (3) a mini‑batch framework that processes data in blocks while preserving the online nature of the method.
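To make the overall structure concrete, here is a minimal NumPy sketch of the preconditioned parameter recursion θₙ₊₁ = θₙ − γₙ Cₙ gₙ on a toy ill-conditioned quadratic. All names, the noise level, and the step-size schedule are illustrative assumptions; in particular, C is taken to be the exact inverse Hessian as a stand-in for the running masked-SGD estimate.

```python
import numpy as np

# Toy objective F(t) = 0.5 t'Ht - b't, so the gradient is H t - b.
rng = np.random.default_rng(1)
d = 10
H = np.diag(np.linspace(1.0, 100.0, d))   # ill-conditioned toy curvature
b = rng.standard_normal(d)
theta_star = np.linalg.solve(H, b)        # true minimizer

C = np.linalg.inv(H)   # stand-in for the running estimate of H^{-1}
theta = np.zeros(d)
for n in range(1, 201):
    g = H @ theta - b + 0.01 * rng.standard_normal(d)  # noisy gradient
    theta = theta - (1.0 / n) * (C @ g)   # Newton-type preconditioned step

err = np.linalg.norm(theta - theta_star)
```

With the exact preconditioner the Robbins–Monro steps γₙ = 1/n drive θₙ to θ* quickly despite the condition number of 100, which is the effect the mSNA estimate of H⁻¹ is designed to approximate cheaply.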

Key technical ideas:

  • The inverse Hessian H⁻¹ is the unique minimizer of the quadratic functional J(A)=‖H^{1/2}A−H^{−1/2}‖_F². J is 2λ_min(H)‑strongly convex and 2λ_max(H)‑smooth, and its gradient simplifies to ∇J(A)=2(HA−I_d). Hence, a simple SGD on J would converge to H⁻¹ if H were known.
  • Directly applying SGD would still require dense matrix‑matrix multiplications (O(d³)). To avoid this, the authors replace the full gradient with a “sketched” version: at each iteration they randomly select ℓ ≪ d columns of the current matrix estimate and compute only the corresponding Hessian‑column products. This is equivalent to coordinate‑sampling SGD and reduces the cost of each update to O(ℓ b d + ℓ d²), where b is the mini‑batch size and ℓ is a rank‑parameter controlling the amount of information retained.
  • By setting the batch size b = d and ℓ = 1, each update costs O(d²), and a single pass over a data set of size N requires only N/b = N/d updates, giving a total computational complexity of O(N d). This matches the scaling of first‑order methods while still exploiting second‑order curvature information.
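The column-masked update of the inverse-Hessian estimate can be checked numerically. The sketch below is a simplified sanity check, not the paper's algorithm: the Hessian is a fixed SPD matrix rather than a stream of noisy estimates, and the dimension, step size, and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, ell = 20, 4               # dimension and number of sampled columns per step
M = rng.standard_normal((d, d))
H = M @ M.T / d + np.eye(d)  # fixed SPD stand-in for the Hessian
I = np.eye(d)

A = np.eye(d)                # running estimate of H^{-1}
gamma = 0.05                 # step size (illustrative choice)
for _ in range(20000):
    cols = rng.choice(d, size=ell, replace=False)
    # grad J(A) = 2 (H A - I_d); evaluate it only on the sampled columns,
    # so each step touches ell columns instead of all d.
    A[:, cols] -= gamma * 2.0 * (H @ A[:, cols] - I[:, cols])

err = np.linalg.norm(A @ H - np.eye(d))
```

Each column of A has the fixed point H A[:,j] = e_j, so the masked iteration converges columnwise to H⁻¹ while computing only ℓ Hessian-column products per step.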

Theoretical contributions: under standard stochastic approximation assumptions (unbiasedness, bounded variance, positive‑definite Hessian at the optimum, Lyapunov conditions, and a covariance limit), the paper proves:

  1. Almost‑sure convergence of the parameter iterates θₙ to the true minimizer θ*.
  2. Asymptotic efficiency of the estimator without any iterate‑averaging step. Specifically, the pre‑conditioning matrix Cₙ produced by the masked SGD converges almost surely to H⁻¹, and the asymptotic covariance of √n(θₙ−θ*) attains the Cramér‑Rao lower bound. This extends earlier results that required averaging to achieve efficiency.
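The efficiency statement in point 2 can be written as a central limit theorem. The notation below is my labeling, not necessarily the paper's: H = ∇²F(θ*) and Σ is the limiting covariance of the gradient noise at θ*.

```latex
\sqrt{n}\,\bigl(\theta_n - \theta^*\bigr)
  \;\xrightarrow{\;\mathcal{L}\;}\;
  \mathcal{N}\!\bigl(0,\; H^{-1}\,\Sigma\, H^{-1}\bigr)
```

In well-specified models (e.g. maximum likelihood, where Σ = H equals the Fisher information), the sandwich covariance H⁻¹ Σ H⁻¹ reduces to H⁻¹, the Cramér–Rao lower bound, which is the sense in which θₙ is asymptotically efficient without averaging.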

Implementation details: the inverse‑Hessian estimator is obtained by running the masked SGD on a symmetrized version J_sym of the functional J, which keeps the matrix iterates symmetric.

