Nonasymptotic CLT and Error Bounds for Two-Time-Scale Stochastic Approximation
We consider linear two-time-scale stochastic approximation algorithms driven by martingale noise. Recent applications in machine learning motivate the need to understand finite-time error rates, but conventional stochastic approximation analyses focus on either asymptotic convergence in distribution or finite-time bounds that are far from optimal. Prior work on asymptotic central limit theorems (CLTs) suggests that two-time-scale algorithms may be able to achieve $1/\sqrt{n}$ error in expectation, with a constant given by the expected norm of the limiting Gaussian vector. However, the best known finite-time rates are much slower. We derive the first nonasymptotic central limit theorem with respect to the Wasserstein-1 distance for two-time-scale stochastic approximation with Polyak-Ruppert averaging. As a corollary, we show that the expected error achieved by Polyak-Ruppert averaging decays at rate $1/\sqrt{n}$, which significantly improves on the rates of convergence in prior work.
💡 Research Summary
This paper presents a rigorous finite-time analysis of linear two-time-scale stochastic approximation (TSA) algorithms driven by martingale noise. TSA, characterized by two iterates updated with different step sizes (α_t > γ_t), is fundamental in many machine learning applications such as gradient temporal-difference learning and actor-critic methods. While asymptotic convergence properties and non-optimal finite-time bounds exist for TSA without averaging, the finite-sample performance of TSA with Polyak-Ruppert (PR) averaging remained poorly understood.
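To make the setup concrete, here is a minimal Python sketch of a linear TSA iteration with polynomial step sizes and i.i.d. Gaussian noise standing in for the martingale-difference noise. The block matrices, constants, and step-size exponents (a = 0.6, b = 0.8) are illustrative choices of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear block system: A_ff y + A_fs x = b_f (fast), A_sf y + A_ss x = b_s (slow).
# All numbers here are illustrative, chosen so that A_ff and its Schur
# complement Delta = A_ss - A_sf * A_fs / A_ff are positive (stable).
A_ff, A_fs = 1.0, 0.5
A_sf, A_ss = 0.3, 1.0
b_f, b_s = 1.0, 0.5

# Fixed point (y*, x*) of the noiseless system, solved exactly.
A = np.array([[A_ff, A_fs], [A_sf, A_ss]])
y_star, x_star = np.linalg.solve(A, np.array([b_f, b_s]))

def step_sizes(t, a=0.6, b=0.8, c_alpha=1.0, c_gamma=1.0):
    """Polynomial step sizes alpha_t ~ t^-a, gamma_t ~ t^-b with 1/2 < a < b < 1,
    so alpha_t > gamma_t for t >= 1 (fast vs. slow time scale)."""
    return c_alpha * (t + 1) ** (-a), c_gamma * (t + 1) ** (-b)

def tsa_iterates(n, noise_std=0.1):
    """Run n TSA steps; returns the (y_t, x_t) trajectories."""
    y, x = 0.0, 0.0
    ys, xs = np.empty(n), np.empty(n)
    for t in range(n):
        alpha_t, gamma_t = step_sizes(t)
        # Martingale-difference noise, here simply i.i.d. Gaussian.
        xi, psi = noise_std * rng.standard_normal(2)
        y_next = y + alpha_t * (b_f - A_ff * y - A_fs * x + xi)   # fast update
        x = x + gamma_t * (b_s - A_sf * y - A_ss * x + psi)       # slow update
        y = y_next
        ys[t], xs[t] = y, x
    return ys, xs
```

The y-iterate uses the larger step size α_t and so tracks the fast time scale, while x evolves on the slow time scale; this is the structure the analysis below decouples.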
The authors’ primary contribution is establishing the first non-asymptotic Central Limit Theorem (CLT) for two-time-scale stochastic approximation with Polyak-Ruppert averaging (TSA-PR). They prove that the distributions of the rescaled averaged iterates, √n (x̄_n - x*) and √n (ȳ_n - y*), converge to their limiting Gaussian distributions π_x and π_y at a rate of O(n^{-1/4}) measured in the Wasserstein-1 distance. This strong form of convergence directly implies a key corollary: the expected error of the PR-averaged iterates decays at the optimal rate of O(√(d/n)), a significant improvement over the slower rates known for the last iterates of TSA without averaging. A matching lower bound (Theorem 2) confirms the order-optimality of this rate.
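Schematically, the main theorem and its corollary take the following form for the x-iterate (the analogous bounds hold for ȳ_n and π_y); the constants C and C′ are unspecified placeholders here, and the precise conditions are in the paper:

$$
\mathcal{W}_1\!\Big(\mathrm{Law}\big(\sqrt{n}\,(\bar{x}_n - x^{*})\big),\ \pi_x\Big) \;\le\; C\, n^{-1/4},
\qquad
\mathbb{E}\big\|\bar{x}_n - x^{*}\big\| \;\le\; C' \sqrt{d/n}.
$$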
To achieve this main result, the authors first derive a crucial intermediate result (Lemma 1) on the convergence rate of the second-order moments (covariance matrices) of the underlying non-averaged TSA iterates under general polynomial step sizes (α_t ∝ t^{-a}, γ_t ∝ t^{-b}, 1/2 < a < b < 1). This lemma, which provides a sharper analysis than prior work, is essential for bounding the variance terms in the martingale representation of the averaging error.
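As a rough empirical companion to Lemma 1 and the corollary, the following snippet (reusing `tsa_iterates` and `x_star` from the sketch above) estimates by Monte Carlo the second moment of the last iterate and the expected error of the Polyak-Ruppert average; the run length and repetition count are arbitrary.

```python
import numpy as np

# Monte Carlo estimate of the last-iterate second moment (the object of
# Lemma 1) and of the PR-averaged error (the object of the corollary).
n, reps = 5_000, 100
err_last = np.zeros(reps)
err_avg = np.zeros(reps)
for r in range(reps):
    ys, xs = tsa_iterates(n)
    err_last[r] = (xs[-1] - x_star) ** 2   # |x_n - x*|^2, non-averaged
    err_avg[r] = abs(xs.mean() - x_star)   # |x_bar_n - x*|, PR-averaged

print("estimated E|x_n - x*|^2  :", err_last.mean())
print("estimated E|x_bar - x*|  :", err_avg.mean(), "vs 1/sqrt(n) =", n ** -0.5)
```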
Beyond establishing the convergence rate, the paper offers novel insights into algorithm design. By analyzing the non-asymptotic error bound, the authors demonstrate that the optimal finite-time performance of TSA-PR is achieved with a finite time-scale separation ratio (ϵ = lim α_t/γ_t = Θ(1)). This contrasts with the conventional wisdom from analyzing the last iterate (non-averaged) covariance, which often suggests an infinitely large separation (ϵ → ∞) for asymptotic optimality. This finding provides a more nuanced and potentially more practical guideline for step-size selection in applications.
The analysis is built upon standard assumptions in stochastic approximation: stability of the system matrices (A_ff and its Schur complement Δ), martingale difference noise with bounded moments, and carefully chosen diminishing step sizes. The proof technique involves decoupling the two-time-scale dynamics and applying a non-asymptotic martingale CLT from the literature to the resulting martingale representation of the averaging error, with Lemma 1 controlling the associated variance terms.
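For concreteness, one common parameterization of linear TSA consistent with this notation is the following (the block layout and sign convention are our illustrative choice, not necessarily the paper's):

$$
\begin{aligned}
y_{t+1} &= y_t + \alpha_t\big(b_f - A_{ff}\,y_t - A_{fs}\,x_t + \xi_{t+1}\big), \\
x_{t+1} &= x_t + \gamma_t\big(b_s - A_{sf}\,y_t - A_{ss}\,x_t + \psi_{t+1}\big),
\end{aligned}
$$

where (ξ_t) and (ψ_t) are martingale-difference noise sequences, A_ff is the fast-block matrix assumed stable, and Δ = A_ss − A_sf A_ff^{-1} A_fs is its Schur complement.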