Accelerating Single-Point Zeroth-Order Optimization with Regression-Based Gradient Surrogates
Zeroth-order optimization (ZO) is widely used for solving black-box optimization and control problems. In particular, single-point ZO (SZO) is well-suited to online or dynamic problem settings due to its requirement of only a single function evaluation per iteration. However, SZO suffers from high gradient estimation variance and slow convergence, which severely limit its practical applicability. To overcome this limitation, we propose a novel yet simple SZO framework termed regression-based SZO (ReSZO), which substantially enhances the convergence rate. Specifically, ReSZO constructs a surrogate function via regression using historical function evaluations and employs the gradient of this surrogate function for iterative updates. Two instantiations of ReSZO, which fit linear and quadratic surrogate functions respectively, are introduced. Moreover, we provide a non-asymptotic convergence analysis for the linear instantiation of ReSZO, showing that its convergence rates are comparable to those of two-point ZO methods. Extensive numerical experiments demonstrate that ReSZO empirically converges two to three times faster than two-point ZO in terms of function query complexity.
💡 Research Summary
The paper tackles a long-standing drawback of single-point zeroth-order optimization (SZO): gradient estimates with high variance and, consequently, slow convergence. Classic SZO methods estimate the gradient at the current iterate from a single function evaluation and discard all past evaluations. The authors propose a new framework called Regression-based Single-point ZO (ReSZO) that reuses historical function values to fit a surrogate model and then uses the surrogate's gradient for the update.
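For context, the textbook single-point estimator that ReSZO aims to improve on can be sketched as follows. This is a minimal illustration of the standard form \(g=\tfrac{d}{\delta}\,f(x+\delta u)\,u\) with \(u\) drawn uniformly from the unit sphere. The function names are hypothetical, and the baseline used in the paper may differ in its details.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def single_point_gradient(f, x, delta, rng):
    """Textbook single-point ZO estimate: (d / delta) * f(x + delta*u) * u.

    Only one evaluation of f is used, which is why the variance is high:
    the full value f(x + delta*u), not a difference, multiplies u.
    """
    d = len(x)
    u = sample_unit_sphere(d, rng)
    x_hat = [xi + delta * ui for xi, ui in zip(x, u)]
    scale = (d / delta) * f(x_hat)
    return [scale * ui for ui in u]


if __name__ == "__main__":
    rng = random.Random(0)
    f = lambda x: 1.0 * x[0] + 2.0 * x[1]  # illustrative linear objective
    n = 20000
    mean = [0.0, 0.0]
    for _ in range(n):
        g = single_point_gradient(f, [0.0, 0.0], 0.1, rng)
        mean = [m + gi / n for m, gi in zip(mean, g)]
    print(mean)  # averages toward the true gradient, but only over many samples
```

Individual estimates scatter widely around the true gradient, and the scatter grows with the magnitude of \(f\) itself. That is the variance problem the summary refers to.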
Two concrete instantiations are introduced. The linear version (L-ReSZO) fits a linear surrogate
\[ f_{s}^{t}(x)=f(\hat x_{t})+g_{t}^{\top}(x-\hat x_{t}) \]
by solving a least-squares problem on the most recent \(m\) perturbed points \(\hat x_{k}=x_{k}+\delta_{k}u_{k}\). The coefficient vector \(g_{t}\) serves as the gradient estimator. A sliding window keeps the regression computationally cheap: each iteration adds only the newest sample and removes the oldest, so the matrix inverse can be maintained with two rank-one Sherman-Morrison-Woodbury updates (one per added or removed sample), reducing the cost from \(O(d^{3})\) to \(O(d^{2})\).
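The least-squares fit described above can be sketched in plain Python. This is an illustrative reading rather than the authors' implementation: it assumes the newest point in the window plays the role of \(\hat x_{t}\), solves the normal equations directly with a small Gaussian-elimination helper, and omits the rank-one inverse updates. All function names are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b] so row operations act on both sides.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: window does not span R^d")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def linear_surrogate_gradient(points, values):
    """Least-squares slope g of f(x) ~ f(x_ref) + g . (x - x_ref).

    points: perturbed points in the window, oldest first.
    values: function values at those points.
    The newest point is taken as the reference x_ref (an assumption).
    """
    x_ref, f_ref = points[-1], values[-1]
    d = len(x_ref)
    # Differences of inputs and outputs relative to the reference point.
    dX = [[p[i] - x_ref[i] for i in range(d)] for p in points[:-1]]
    dy = [v - f_ref for v in values[:-1]]
    # Normal equations: (dX^T dX) g = dX^T dy.
    A = [[sum(row[i] * row[j] for row in dX) for j in range(d)] for i in range(d)]
    b = [sum(row[i] * y for row, y in zip(dX, dy)) for i in range(d)]
    return solve_linear_system(A, b)


if __name__ == "__main__":
    pts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    vals = [3.0 * p[0] - 2.0 * p[1] + 5.0 for p in pts]
    print(linear_surrogate_gradient(pts, vals))
```

On this exactly linear example the fit recovers the slope (3, -2) up to rounding. Solving the normal equations from scratch costs \(O(d^{3})\) per call. The rank-one updates mentioned above avoid this by reusing the previous inverse through the Sherman-Morrison identity \((A+aa^{\top})^{-1}=A^{-1}-\frac{A^{-1}aa^{\top}A^{-1}}{1+a^{\top}A^{-1}a}\).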
An adaptive smoothing radius \(\delta_{t}=\|x_{t}-x_{t-1}\|\) is employed, so the perturbation magnitude scales with the length of the most recent step. This prevents the algorithm from stalling at an error floor proportional to a fixed \(\delta\) and enables high-precision convergence.
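As a small sketch of this rule, the snippet below computes \(\delta_{t}\) as the Euclidean distance between consecutive iterates and forms the perturbed query point \(\hat x_{t}=x_{t}+\delta_{t}u_{t}\). The helper names are hypothetical, and drawing \(u_{t}\) uniformly from the unit sphere is an assumption. The summary also does not say how the case of two coinciding iterates (\(\delta_{t}=0\)) is handled, so no safeguard is shown.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def perturbed_point(x_curr, x_prev, rng):
    """Return (x_hat, delta) with delta = ||x_curr - x_prev||.

    The query point x_hat = x_curr + delta * u shrinks toward x_curr
    as successive iterates get closer together.
    """
    delta = math.dist(x_curr, x_prev)  # Euclidean distance (Python 3.8+)
    u = random_unit_vector(len(x_curr), rng)
    x_hat = [xi + delta * ui for xi, ui in zip(x_curr, u)]
    return x_hat, delta


if __name__ == "__main__":
    rng = random.Random(0)
    x_hat, delta = perturbed_point([1.0, 2.0], [4.0, 6.0], rng)
    print(delta, x_hat)
```

Because the radius is tied to the step length, the perturbation vanishes as the iterates settle, which is what removes the fixed-\(\delta\) error floor.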
Theoretical analysis focuses on the linear version. For both non-convex and \(\mu\)-strongly convex objectives, the authors prove non-asymptotic convergence rates after \(T\) iterations that are comparable to those of two-point ZO methods.