The Optimal Unbiased Value Estimator and its Relation to LSTD, TD and MC


In this analytical study we derive the optimal unbiased value estimator (MVU) and compare its statistical risk to that of three well-known value estimators: Temporal Difference learning (TD), Monte Carlo estimation (MC), and Least-Squares Temporal Difference learning (LSTD). We demonstrate that LSTD is equivalent to the MVU if the Markov Reward Process (MRP) is acyclic, and we show that the two differ for most cyclic MRPs, as LSTD is then typically biased. More generally, we show that estimators that fulfill the Bellman equation can be unbiased only for special cyclic MRPs. The main reason is the probability measures with which the expectations are taken: these measures vary from state to state, and due to the strong coupling imposed by the Bellman equation it is typically not possible for a set of value estimators to be unbiased with respect to each of these measures. Furthermore, we derive relations of the MVU to MC and TD. The most important is the equivalence of MC to the MVU and to LSTD for undiscounted MRPs in which MC has the same amount of information. In the discounted case this equivalence no longer holds. For TD we show that it is essentially unbiased for acyclic MRPs and biased for cyclic MRPs. We also order the estimators according to their risk and present counter-examples to show that no general ordering exists between the MVU and LSTD, between MC and LSTD, or between TD and MC. The theoretical results are supported by examples and an empirical evaluation.


💡 Research Summary

The paper tackles the fundamental problem of estimating state‑value functions in a Markov Reward Process (MRP) from a statistical‑optimality perspective. It introduces the Minimum‑Variance Unbiased (MVU) estimator, proves its existence under standard regularity conditions, and then systematically compares it with three widely used reinforcement‑learning value estimators: Temporal‑Difference learning (TD), Monte‑Carlo estimation (MC), and Least‑Squares Temporal‑Difference learning (LSTD).

Key Contributions

  1. Definition and Optimality of MVU – By treating each state’s expected return as an expectation taken under a state‑specific probability measure (the conditional distribution of future trajectories given the current state), the authors show that a single unbiased estimator that simultaneously satisfies all these measures is generally impossible. Nevertheless, when the sufficient statistics (the empirical transition counts and cumulative rewards) form a complete statistic, the Lehmann‑Scheffé theorem guarantees a unique MVU that attains the lowest possible mean‑squared error (MSE) among all unbiased estimators.

  2. Equivalence of LSTD and MVU in Acyclic MRPs – For acyclic (directed‑acyclic‑graph) MRPs, the inverse in the LSTD solution \( \hat{V} = (I - \gamma \hat{P})^{-1} \hat{R} \) exists, where \( \hat{P} \) and \( \hat{R} \) denote the empirical transition matrix and the empirical mean rewards. The paper provides a constructive proof that in this setting the LSTD normal equations are the same as the MVU normal equations, establishing that LSTD is statistically optimal when the underlying process contains no cycles.
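As a concrete illustration of this model-based form, here is a minimal sketch (our own, not the paper's code) of batch LSTD for a finite MRP: estimate \( \hat{P} \) and \( \hat{R} \) from observed transitions, then solve the empirical Bellman equation. The triple format `(state, reward, next_state)` is an assumption for this sketch.

```python
import numpy as np

def lstd_values(transitions, n_states, gamma=1.0):
    """Batch LSTD for a finite MRP: build the empirical transition
    matrix P_hat and mean-reward vector R_hat from observed
    (state, reward, next_state) triples (next_state=None if the
    transition terminates), then solve the empirical Bellman
    equation V = R_hat + gamma * P_hat @ V."""
    counts = np.zeros((n_states, n_states))
    reward_sum = np.zeros(n_states)
    visits = np.zeros(n_states)
    for s, r, s_next in transitions:
        visits[s] += 1
        reward_sum[s] += r
        if s_next is not None:
            counts[s, s_next] += 1
    denom = np.maximum(visits, 1.0)      # avoid 0/0 for unvisited states
    P_hat = counts / denom[:, None]
    R_hat = reward_sum / denom
    return np.linalg.solve(np.eye(n_states) - gamma * P_hat, R_hat)
```

On a deterministic acyclic chain 0 → 1 → 2 with rewards 1, 2, 3 and γ = 1, this returns the exact values (6, 5, 3), since the empirical model coincides with the true one.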

  3. Bias of LSTD in Cyclic MRPs – When cycles are present, the empirical transition matrix becomes a noisy approximation of the true matrix. The authors demonstrate that the resulting LSTD estimator is generally biased because the bootstrapped expectation uses the same noisy matrix both on the left‑hand side and the right‑hand side of the Bellman equation. They give explicit counter‑examples (e.g., a two‑state loop with asymmetric transition probabilities) where the bias leads to a higher MSE than the MVU.
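This bias is easy to probe numerically. The sketch below uses an illustrative one-state loop of our own (not necessarily the paper's counter-example): reward 1 per step, self-loop with probability p, discount γ. LSTD reduces to plugging the empirical loop probability p̂ into the Bellman solution 1/(1 − γp̂), and averaging many small-batch estimates reveals a systematic deviation from the true value 1/(1 − γp).

```python
import numpy as np

rng = np.random.default_rng(0)
p, gamma, n_eps = 0.5, 0.9, 5
true_v = 1.0 / (1.0 - gamma * p)     # Bellman fixed point: V = 1 + gamma * p * V

def lstd_estimate():
    """Single recurrent state: reward 1 per step, self-loop w.p. p.
    LSTD here amounts to plugging the empirical loop probability
    into the Bellman solution 1 / (1 - gamma * p_hat)."""
    steps = loops = 0
    for _ in range(n_eps):
        while True:
            steps += 1
            if rng.random() < p:
                loops += 1
            else:
                break
    return 1.0 / (1.0 - gamma * loops / steps)

estimates = [lstd_estimate() for _ in range(50000)]
print(f"true value {true_v:.4f}, mean LSTD estimate {np.mean(estimates):.4f}")
```

The average estimate differs noticeably from the true value; with γ = 1 the same construction would be unbiased, matching the undiscounted equivalence result discussed below.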

  4. Monte‑Carlo vs. MVU – In undiscounted MRPs (γ = 1), MC averages full‑episode returns. Since every episode samples from the same underlying path distribution, MC uses the complete sufficient statistic and therefore achieves the MVU’s variance. In discounted settings (0 < γ < 1), MC effectively re‑weights later rewards, which changes the underlying measure; consequently MC is no longer unbiased with respect to the MVU’s measure and typically incurs larger variance.
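For reference, a minimal first-visit Monte-Carlo estimator looks as follows (an illustrative sketch; the episode format of `(state, reward)` pairs is an assumption):

```python
import numpy as np

def mc_value(episodes, state, gamma=1.0):
    """First-visit Monte-Carlo estimate of V(state): average the
    (discounted) return observed from the first visit to `state`
    in each episode; episodes are lists of (state, reward) pairs."""
    returns = []
    for episode in episodes:
        for i, (s, _) in enumerate(episode):
            if s == state:
                g = sum(gamma ** k * r for k, (_, r) in enumerate(episode[i:]))
                returns.append(g)
                break
    return float(np.mean(returns))
```

With γ = 1 this simply averages total episode rewards; with γ < 1 the later rewards are down-weighted, which is where the equivalence to the MVU breaks down.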

  5. Temporal‑Difference Learning – TD(λ = 0) updates the value of a state using the immediate reward plus the current estimate of the successor state. For acyclic MRPs the expectation of this update exactly satisfies the Bellman equation, making TD essentially unbiased (the bias term vanishes). In cyclic MRPs, however, the bootstrapped term introduces a systematic bias that depends on the discount factor, the length of the episode, and the initialization. The paper quantifies this bias and shows that it can dominate the variance, especially when γ is close to 1.
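The TD(0) update described above can be sketched in tabular form (illustrative, with a constant step size α, which is an assumption of this sketch):

```python
def td0(episodes, n_states, gamma=1.0, alpha=0.1):
    """Tabular TD(0): after each observed transition (s, r, s_next),
    move V[s] toward the bootstrapped target r + gamma * V[s_next]
    (just r if the transition terminates, i.e. s_next is None)."""
    V = [0.0] * n_states
    for episode in episodes:
        for s, r, s_next in episode:
            target = r + (gamma * V[s_next] if s_next is not None else 0.0)
            V[s] += alpha * (target - V[s])
    return V
```

The bootstrapped term `gamma * V[s_next]` is exactly where the bias enters in cyclic MRPs: the target depends on the current, still-erroneous estimate.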

  6. No Universal Risk Ordering – The authors construct three families of MRPs that illustrate the impossibility of a total ordering among the four estimators. In some MRPs, MVU has lower risk than LSTD, while in others the opposite holds; similarly, MC can outperform TD in certain environments and be outperformed in others. These counter‑examples underscore that algorithm selection must be driven by problem‑specific structure rather than a blanket superiority claim.
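One can probe such orderings empirically. The toy comparison below (our own construction, not one of the paper's counter-example families) estimates the MSE of MC and constant-step-size TD(0) on a single self-loop MRP; the relative ranking depends on the loop probability, the step size, and the amount of data.

```python
import numpy as np

rng = np.random.default_rng(1)
p, alpha, n_eps, n_runs = 0.5, 0.3, 10, 5000
true_v = 1.0 / (1.0 - p)             # undiscounted value of the loop state

def episode_length():
    """Self-loop w.p. p, reward 1 per step; the undiscounted
    return equals the episode length."""
    g = 1
    while rng.random() < p:
        g += 1
    return g

mc_err, td_err = [], []
for _ in range(n_runs):
    lengths = [episode_length() for _ in range(n_eps)]
    mc_err.append((np.mean(lengths) - true_v) ** 2)   # MC: average return
    v = 0.0                                           # TD(0), constant alpha
    for g in lengths:
        for step in range(g):
            target = 1.0 + (0.0 if step == g - 1 else v)
            v += alpha * (target - v)
    td_err.append((v - true_v) ** 2)
print(f"MC MSE {np.mean(mc_err):.3f}, TD(0) MSE {np.mean(td_err):.3f}")
```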

  7. Empirical Validation – Simulations on synthetic acyclic and cyclic MRPs, as well as on classic RL benchmarks (GridWorld, MountainCar), corroborate the theoretical findings. In acyclic domains, LSTD and MVU produce identical MSE curves, and TD/MC perform comparably. In cyclic domains, LSTD’s bias manifests as a noticeable MSE increase, TD’s performance varies with λ and learning‑rate, and MC matches MVU only when γ = 1.

Implications

  • Algorithm Choice: For problems known to be acyclic (e.g., finite‑horizon planning, certain deterministic control tasks), LSTD is both computationally efficient and statistically optimal, making it the method of choice.
  • Cyclic Environments: Practitioners should be aware that LSTD may be biased; regularization, weighted least‑squares, or bias‑correction techniques become necessary. TD remains attractive for online learning but requires careful tuning of λ and step‑size to mitigate bias.
  • Discount Factor Considerations: When the discount factor is 1 (undiscounted returns), MC offers a simple, unbiased estimator that attains MVU performance without solving linear systems. For γ < 1, MC loses this optimality, and MVU‑based approaches or corrected LSTD become preferable.
  • Theoretical Insight: By exposing the role of state‑specific probability measures in the Bellman equation, the paper clarifies why unbiasedness is a delicate property that cannot be guaranteed simultaneously across all states in a cyclic MRP. This insight may guide the development of new estimators that explicitly account for measure mismatch.

In summary, the paper provides a rigorous statistical foundation for value‑function estimation, delineates the exact conditions under which popular algorithms coincide with the optimal MVU, and highlights the nuanced trade‑offs that arise in realistic, cyclic, and discounted reinforcement‑learning problems.

