Statistical Inference for Temporal Difference Learning with Linear Function Approximation
We investigate the statistical properties of Temporal Difference (TD) learning with Polyak-Ruppert averaging, arguably one of the most widely used algorithms in reinforcement learning, for the task of estimating the parameters of the optimal linear approximation to the value function. Assuming independent samples, we make three theoretical contributions that improve upon the current state of the art: (i) we establish refined high-dimensional Berry-Esseen bounds over the class of convex sets, achieving faster rates than the best known results; (ii) we propose and analyze a novel, computationally efficient online plug-in estimator of the asymptotic covariance matrix; and (iii) we derive sharper high-probability convergence guarantees that depend explicitly on the asymptotic variance and hold under weaker conditions than those adopted in the literature. These results enable the construction of confidence regions and simultaneous confidence intervals for the linear parameters of the value function approximation, with guaranteed finite-sample coverage. We demonstrate the applicability of our theoretical findings through numerical experiments.
💡 Research Summary
This paper investigates the statistical properties of Temporal Difference (TD) learning with Polyak‑Ruppert averaging for estimating the parameters of the optimal linear approximation to a value function, under the simplifying assumption that the data are independent and identically distributed (i.i.d.). The authors make three major contributions.
First, they derive a high‑dimensional Berry‑Esseen bound for the TD estimation error Δ_T = θ̄_T − θ* measured over the class of convex sets. The bound shows that the convex‑set distance between the distribution of Δ_T and its asymptotic Gaussian limit N(0, Λ*) decays at rate O(T⁻¹ᐟ³), which improves on the previously best known O(T⁻¹ᐟ⁴) rate for linear stochastic approximation with averaging. The proof relies on a careful martingale decomposition of the stochastic‑approximation recursion, recent high‑dimensional martingale central limit theorems, and refined non‑asymptotic inequalities.
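The object being analyzed is the Polyak‑Ruppert average θ̄_T of the TD(0) iterates under linear function approximation. A minimal sketch of that procedure is below; the step‑size constants (c = 1, decay exponent 0.75) and the i.i.d. sample interface are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def td0_polyak_ruppert(samples, d, gamma=0.99, c=1.0, alpha=0.75):
    """Averaged TD(0) with linear function approximation (illustrative sketch).

    `samples` yields i.i.d. tuples (phi, reward, phi_next) of feature vectors
    and rewards; eta_t = c * t**(-alpha) is one common Polyak-Ruppert step-size
    schedule (the constants here are assumptions, not the paper's choices).
    Returns the running average theta_bar_T of the iterates.
    """
    theta = np.zeros(d)       # current iterate theta_t
    theta_bar = np.zeros(d)   # running Polyak-Ruppert average theta_bar_t
    for t, (phi, r, phi_next) in enumerate(samples, start=1):
        eta = c * t ** (-alpha)
        # TD(0) semi-gradient step: delta is the temporal-difference error
        delta = r + gamma * phi_next @ theta - phi @ theta
        theta = theta + eta * delta * phi
        # online update of the average: theta_bar_t = (1/t) * sum_{s<=t} theta_s
        theta_bar += (theta - theta_bar) / t
    return theta_bar
```

The error Δ_T = θ̄_T − θ* of this averaged iterate is exactly the quantity whose distribution the Berry‑Esseen bound compares to N(0, Λ*).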
Second, the paper proposes a computationally efficient online plug‑in estimator Λ̂_T of the asymptotic covariance matrix Λ*. The estimator updates a running sum of the observed Jacobian‑like matrices A_t and the empirical covariance of the stochastic gradients, requiring only O(d²) memory and O(d) arithmetic per iteration (where d is the feature dimension). The authors prove that the total‑variation distance between N(0, Λ̂_T) and N(0, Λ*) also shrinks at O(T⁻¹ᐟ³) with high probability, matching the Berry‑Esseen rate.
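A plug‑in estimator of this form can be sketched as follows: accumulate the A_t matrices and the outer products of the TD‑error terms, then combine them as Λ̂_T = Â_T⁻¹ Ŝ_T Â_T⁻ᵀ. This is an illustrative reconstruction of the recipe described above; the paper's exact update may differ in details such as regularization and which iterate the TD errors are evaluated at.

```python
import numpy as np

class OnlineCovarianceEstimator:
    """Online plug-in estimator of the asymptotic covariance
    Lambda* = A^{-1} S A^{-T} (illustrative sketch of the plug-in recipe;
    not necessarily the paper's exact estimator).
    """

    def __init__(self, d, gamma):
        self.gamma = gamma
        self.A_sum = np.zeros((d, d))  # running sum of A_t = phi (phi - gamma phi')^T
        self.S_sum = np.zeros((d, d))  # running sum of eps_t eps_t^T
        self.t = 0

    def update(self, phi, r, phi_next, theta):
        self.t += 1
        self.A_sum += np.outer(phi, phi - self.gamma * phi_next)
        # TD error at the supplied parameter (e.g. the running average theta_bar)
        delta = r + self.gamma * phi_next @ theta - phi @ theta
        eps = delta * phi              # stochastic-gradient noise term
        self.S_sum += np.outer(eps, eps)

    def estimate(self):
        A_hat = self.A_sum / self.t
        S_hat = self.S_sum / self.t
        A_inv = np.linalg.inv(A_hat)   # assumes A_hat is invertible
        return A_inv @ S_hat @ A_inv.T
```

Only the d×d running sums are stored, which is where the O(d²) memory footprint comes from.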
Third, they establish a sharp high‑probability convergence guarantee for the averaged TD iterate. With an appropriately chosen stepsize η_t = c t⁻¹, they show that, for any confidence level δ∈(0,1), the error ‖Δ_T‖ is bounded with probability at least 1 − δ by a leading term that depends explicitly on the asymptotic covariance, of order √(tr(Λ*)/T) + √(‖Λ*‖ log(1/δ)/T), plus higher‑order terms that decay faster in T. This guarantee holds under weaker conditions than previous results in the literature. Combined with the Berry‑Esseen bound and the covariance estimator, it yields confidence regions and simultaneous confidence intervals for the linear parameters with finite‑sample coverage guarantees.