A Multistep Lyapunov Approach for Finite-Time Analysis of Biased Stochastic Approximation
Motivated by the widespread use of temporal-difference (TD-) and Q-learning algorithms in reinforcement learning, this paper studies a class of biased stochastic approximation (SA) procedures under a mild “ergodic-like” assumption on the underlying stochastic noise sequence. Building upon a carefully designed multistep Lyapunov function that looks ahead to several future updates to accommodate the stochastic perturbations (for control of the gradient bias), we prove a general result on the convergence of the iterates, and use it to derive non-asymptotic bounds on the mean-square error in the case of constant stepsizes. This novel looking-ahead viewpoint renders finite-time analysis of biased SA algorithms under a large family of stochastic perturbations possible. For direct comparison with existing contributions, we also demonstrate these bounds by applying them to TD- and Q-learning with linear function approximation, under the practical Markov chain observation model. The resultant finite-time error bound for both the TD- as well as the Q-learning algorithms is the first of its kind, in the sense that it holds i) for the unmodified versions (i.e., without making any modifications to the parameter updates) using even nonlinear function approximators; as well as for Markov chains ii) under general mixing conditions and iii) starting from any initial distribution, at least one of which has to be violated for existing results to be applicable.
💡 Research Summary
The paper tackles the challenging problem of providing finite‑time performance guarantees for biased stochastic approximation (SA) algorithms, a class that includes many reinforcement‑learning (RL) methods such as temporal‑difference (TD) learning and Q‑learning. The authors adopt a “multistep Lyapunov” perspective: instead of analyzing the usual one‑step Lyapunov function, they construct a Lyapunov candidate that aggregates the standard Lyapunov values over a horizon of T future iterates. This look‑ahead design enables the analysis to capture the effect of bias that accumulates over several updates, a feature that is essential when the stochastic noise sequence exhibits dependence (e.g., a Markov chain) and the instantaneous gradient estimator is biased.
The theoretical setting starts from a generic SA recursion
Θ_{k+1}=Θ_k+ε f(Θ_k,X_k),
with constant stepsize ε>0. The mapping f is assumed globally Lipschitz in Θ and to grow at most linearly (Assumption 1). The associated ODE θ̇ = f(θ) is required to have a unique globally asymptotically stable equilibrium at the origin, formalized through a twice‑differentiable Lyapunov function W(θ) satisfying quadratic bounds and a negative drift condition (Assumption 2).
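To make the recursion concrete, here is a minimal sketch of such a biased SA iteration. Everything in it is a hypothetical instance chosen for illustration, not taken from the paper: the noise X_k is a two‑state Markov chain with transition matrix P, and f(θ, x) = A_x θ + b_x with Hurwitz matrices A_x, so the mean‑field map f̄(θ) = E_π[f(θ, X)] has a unique stable equilibrium.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of Theta_{k+1} = Theta_k + eps * f(Theta_k, X_k):
# X_k is a two-state Markov chain, f(theta, x) = A[x] @ theta + b[x].
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])             # transition matrix of the noise chain
A = [np.array([[-1.0, 0.2], [0.0, -0.5]]),
     np.array([[-0.8, 0.0], [0.1, -1.2]])]
b = [np.array([0.3, -0.1]), np.array([-0.2, 0.4])]

def f(theta, x):
    return A[x] @ theta + b[x]

# Stationary distribution of P is pi = (2/3, 1/3); the mean-field equilibrium
# theta_star solves A_bar @ theta_star + b_bar = 0.
pi = np.array([2/3, 1/3])
A_bar = pi[0] * A[0] + pi[1] * A[1]
b_bar = pi[0] * b[0] + pi[1] * b[1]
theta_star = -np.linalg.solve(A_bar, b_bar)

eps = 0.05
theta = np.array([5.0, -3.0])          # arbitrary initialization
x = 0
for k in range(2000):
    theta = theta + eps * f(theta, x)  # biased SA update with Markovian noise
    x = rng.choice(2, p=P[x])

print(np.linalg.norm(theta - theta_star))  # iterates settle near theta_star
```

With the constant stepsize ε, the iterate does not converge exactly but fluctuates in a neighborhood of the equilibrium whose size scales with ε, which is exactly the regime the paper's finite‑time bounds quantify.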
The novelty lies in Assumption 3, an “ergodic‑like” condition on the noise process {X_k}. It states that the average of T consecutive gradient estimates deviates from its limit f(θ) by at most σ(T;k)·L(‖θ‖+1), where σ(T;k)→0 as either the averaging window T→∞ or the time index k→∞. This condition is satisfied by i.i.d. sequences, finite‑state irreducible aperiodic Markov chains, and even Ornstein‑Uhlenbeck processes, thus covering a far broader class of stochastic perturbations than prior works that typically require unbiased or weakly dependent noise.
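The ergodic‑like condition can be checked empirically for a simple case. The sketch below (same hypothetical two‑state chain as above, with f(θ, x) collapsed to a scalar for a fixed θ) estimates how far the window average (1/T)·Σ f(θ, X_j) sits from the mean‑field value f̄(θ); the deviation shrinks as the window T grows, which is the content of σ(T;k) → 0.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical check of the "ergodic-like" condition (Assumption 3):
# the window average of f(theta, X_j) approaches E_pi[f(theta, X)] as T grows.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([2/3, 1/3])              # stationary distribution of P
vals = np.array([1.0, -2.0])           # f(theta, x) for fixed theta, per state x

# Simulate one long trajectory of the chain.
N = 200_000
x = 0
traj = np.empty(N, dtype=int)
for k in range(N):
    traj[k] = x
    x = rng.choice(2, p=P[x])

f_bar = pi @ vals                      # mean-field value (= 0 for these numbers)
dev = {}
for T in (10, 100, 10_000):
    dev[T] = abs(vals[traj[:T]].mean() - f_bar)
    print(T, dev[T])                   # deviation shrinks as T grows
```

For an irreducible aperiodic chain the deviation decays like the chain's mixing rate plus a 1/√T sampling term, which is why the assumption covers Markovian noise without requiring unbiasedness of individual gradient estimates.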
With this setup, Proposition 1 shows that after T steps the iterate can be written as
Θ_{k+T}=Θ_k+εT f(Θ_k)+g′(k,T,Θ_k),
where the bias term g′ has a conditional expectation bounded by εLT β_k(T,ε)(‖Θ_k‖+1). The factor β_k(T,ε)=εLT(1+εL)^{T‑2}+σ(T;k) captures both the deterministic growth due to the stepsize and the stochastic bias σ.
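The decomposition can be probed numerically. In this sketch (again the hypothetical two‑state example, with the chain started at stationarity), g′ is computed as the residual Θ_{k+T} − Θ_k − εT·f̄(Θ_k) and averaged over many noise realizations; its conditional mean is small even though individual realizations of g′ are not.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical check of Theta_{k+T} = Theta_k + eps*T*f_bar(Theta_k) + g':
# averaged over runs, the bias term g' should be small (order eps^2*T^2 plus
# a sigma(T;k) contribution), per Proposition 1.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
A = [np.array([[-1.0, 0.2], [0.0, -0.5]]),
     np.array([[-0.8, 0.0], [0.1, -1.2]])]
b = [np.array([0.3, -0.1]), np.array([-0.2, 0.4])]
pi = np.array([2/3, 1/3])
A_bar = pi[0] * A[0] + pi[1] * A[1]
b_bar = pi[0] * b[0] + pi[1] * b[1]

def f(theta, x):
    return A[x] @ theta + b[x]

eps, T = 0.01, 20
theta0 = np.array([1.0, -1.0])
drift = eps * T * (A_bar @ theta0 + b_bar)   # eps*T*f_bar(Theta_k)

g_sum = np.zeros(2)
runs = 2000
for _ in range(runs):
    x = rng.choice(2, p=pi)                  # start the chain at stationarity
    theta = theta0.copy()
    for _ in range(T):
        theta = theta + eps * f(theta, x)
        x = rng.choice(2, p=P[x])
    g_sum += theta - theta0 - drift          # one realization of g'

g_mean = np.linalg.norm(g_sum / runs)
print(g_mean)                                # small: the bias averages out
```

The residual mean here is dominated by the deterministic O(ε²T²) term, matching the εLT(1+εL)^{T−2} part of β_k(T,ε); starting off-stationarity would add the σ(T;k) part.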
The main convergence result (Theorem 1) constructs a multistep Lyapunov function
W′(k,Θ_k)=∑_{j=k}^{k+T‑1}W(Θ_j(k,Θ_k))
and proves that, for sufficiently small ε, the expected one‑step drift of W′ is strictly negative up to higher‑order terms in ε. Iterating this drift inequality yields the paper's finite‑time guarantee under constant stepsizes: the mean‑square error E[‖Θ_k‖²] decays geometrically until it reaches an O(ε)‑neighborhood of the equilibrium.
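The multistep construction can be illustrated on the hypothetical example above, restricted to zero offsets so the equilibrium sits at the origin. Taking W(θ) = ‖θ‖², the sketch below evaluates W′(k) = Σ_{j=k}^{k+T−1} W(Θ_j) along one noisy trajectory; the T‑step aggregate drifts downward even though W itself fluctuates step to step.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sketch of the multistep Lyapunov function with W(theta) = ||theta||^2:
# W'(k) sums W over the next T iterates of a single noisy trajectory.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
A = [np.array([[-1.0, 0.2], [0.0, -0.5]]),
     np.array([[-0.8, 0.0], [0.1, -1.2]])]    # both Hurwitz; equilibrium at origin

def f(theta, x):
    return A[x] @ theta                       # no offsets b_x in this variant

eps, T, N = 0.05, 10, 1000
theta = np.array([4.0, -4.0])
x = 0
W = []                                        # W(Theta_j) along the trajectory
for _ in range(N + T):
    W.append(theta @ theta)
    theta = theta + eps * f(theta, x)
    x = rng.choice(2, p=P[x])

W = np.array(W)
W_multi = np.array([W[k:k + T].sum() for k in range(N)])  # W'(k)
print(W_multi[0], W_multi[-1])                # the multistep Lyapunov value decays
```

Summing W over a look‑ahead window of length T is what absorbs the Markovian bias: the per‑step drift of W alone can be positive on bad noise states, but the T‑step aggregate inherits the negative mean‑field drift once T exceeds the mixing scale of the chain.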