Online Statistical Inference of Constant Sample-averaged Q-Learning
Reinforcement learning algorithms are widely used for decision-making tasks across many domains. However, their performance can suffer from high variance and instability, particularly in environments with noisy or sparse rewards. In this paper, we propose a framework for online statistical inference for a sample-averaged Q-learning approach. We adapt the functional central limit theorem (FCLT) to the modified algorithm under general conditions and construct confidence intervals for the Q-values via random scaling. We then perform inference on both the modified approach and its traditional counterpart, standard Q-learning with random scaling, and report their coverage rates and confidence-interval widths on two problems: a grid-world problem as a simple toy example and a dynamic resource-matching problem as a real-world example for comparing the two approaches.
💡 Research Summary
This paper addresses the lack of statistical inference tools for Q‑learning, a cornerstone algorithm in reinforcement learning (RL). While Q‑learning is widely used for policy evaluation and control, its estimates can exhibit high variance and instability, especially in noisy or sparse‑reward environments. The authors propose a “sample‑averaged Q‑learning” variant that, at each state‑action pair, collects a fixed batch of B rewards and next‑state samples, averages them, and uses this average as an unbiased estimator of the Bellman operator. When B = 1 the method reduces to standard Q‑learning, making the approach a natural generalization.
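The sample-averaged update described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a tabular Q-function stored as a NumPy array and a hypothetical `env_sample(s, a)` generative sampler that returns one (reward, next-state) draw.

```python
import numpy as np

def sample_averaged_q_update(Q, s, a, env_sample, gamma=0.99, eta=0.1, B=8):
    """One sample-averaged Q-learning step at the state-action pair (s, a).

    Draws a batch of B i.i.d. (reward, next_state) samples, averages the
    resulting Bellman targets, and takes a single step of size eta toward
    that average.  With B = 1 this reduces to standard Q-learning.
    `env_sample` is a hypothetical sampler for the environment's dynamics.
    """
    targets = []
    for _ in range(B):
        r, s_next = env_sample(s, a)                  # one draw of (reward, next state)
        targets.append(r + gamma * np.max(Q[s_next])) # Bellman target for this draw
    Q[s, a] += eta * (np.mean(targets) - Q[s, a])     # step toward the batch average
    return Q
```

With `B=1` the averaged target is a single sample, recovering the usual Q-learning update.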
The theoretical contributions are twofold. First, under a bounded‑reward assumption (A1) and for sufficiently small step‑size η, the authors prove that the Markov process generated by the sample‑averaged updates possesses a unique stationary distribution Q_η and that the bias ‖Q_η − Q*‖∞ scales as O(√η). Second, they establish a functional central limit theorem (FCLT) for the trajectory of Q‑values: after appropriate scaling, the cumulative deviation of Q_t from its expectation converges to a d‑dimensional Brownian motion transformed by a covariance matrix Σ_{Q_η}. This result provides an asymptotic normality foundation for the estimator.
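Schematically, the FCLT described above can be written as follows. This is a sketch consistent with the summary; the exact centering, scaling, and regularity conditions are those of the paper's theorem.

```latex
% Partial sums of the centered iterates converge weakly to a scaled Brownian motion:
\frac{1}{\sqrt{T}} \sum_{t=1}^{\lfloor rT \rfloor}
  \bigl( Q_t - \mathbb{E}[Q_t] \bigr)
\;\Rightarrow\; \Sigma_{Q_\eta}^{1/2}\, W(r),
\qquad r \in [0,1],
```

where W is a standard d-dimensional Brownian motion and the convergence is in the Skorokhod sense over the trajectory.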
Leveraging the FCLT, the paper introduces an online “random scaling” procedure to construct confidence intervals for the optimal Q‑function Q*. The key statistic κ is formed by normalizing the squared norm of the averaged error ‖\bar Q_T − Q*‖ with a data‑driven covariance estimator \hat D_T derived from the sample path. By the continuous‑mapping theorem, κ converges to a pivotal distribution whose quantiles are known (Abadir & Paruolo, 1997). Consequently, a (1 − α) confidence interval for each component j of Q* is given by

\left[\, \bar Q_{T,j} - q_{1-\alpha/2}\,\sqrt{\hat D_{T,jj}/T},\;\; \bar Q_{T,j} + q_{1-\alpha/2}\,\sqrt{\hat D_{T,jj}/T} \,\right],

where q_{1−α/2} denotes the corresponding quantile of the pivotal distribution.
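The random-scaling intervals can be computed online from the stored iterate path. Below is a minimal Python sketch, assuming the standard random-scaling covariance estimator \hat D_T = (1/T²) Σ_s s² (\bar Q_s − \bar Q_T)(\bar Q_s − \bar Q_T)ᵀ (diagonal only); the critical value 6.747 is the 97.5% quantile tabulated by Abadir & Paruolo (1997), yielding a 95% interval. Function name and the `path` layout are illustrative.

```python
import numpy as np

def random_scaling_ci(path, critical_value=6.747):
    """Per-coordinate random-scaling confidence intervals.

    `path` is a (T, d) array of iterates Q_1, ..., Q_T, with the d
    state-action pairs flattened into one axis.  Builds the running
    averages bar_Q_s, the diagonal of the random-scaling estimator
    D_hat_T, and 95% intervals bar_Q_T,j +/- cv * sqrt(D_hat_T,jj / T).
    """
    T, d = path.shape
    running_means = np.cumsum(path, axis=0) / np.arange(1, T + 1)[:, None]
    bar_T = running_means[-1]                       # final running average
    s = np.arange(1, T + 1)[:, None]
    # Diagonal of D_hat_T = (1/T^2) * sum_s s^2 (bar_s - bar_T)(bar_s - bar_T)^T
    D_diag = np.sum((s * (running_means - bar_T)) ** 2, axis=0) / T**2
    half_width = critical_value * np.sqrt(D_diag / T)
    return bar_T - half_width, bar_T + half_width
```

Because only running sums are needed, the estimator can be maintained incrementally during training without storing the full path.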