We investigate projection methods for evaluating a linear approximation of the value function of a policy in a Markov Decision Process context. We consider two popular approaches, the one-step Temporal Difference fixed-point computation (TD(0)) and the Bellman Residual (BR) minimization. We describe examples where each method outperforms the other. We highlight a simple relation between the objective functions they minimize, and show that while BR enjoys a performance guarantee, TD(0) does not in general. We then propose a unified view in terms of oblique projections of the Bellman equation, which substantially simplifies and extends the characterization of Schoknecht (2002) and the recent analysis of Yu & Bertsekas (2008). Finally, we describe some simulations suggesting that, while the TD(0) solution is usually slightly better than the BR solution, its inherent numerical instability makes it very bad in some cases, and thus worse on average.
We consider linear approximations of the value function of a fixed policy in the framework of Markov Decision Processes (MDP). We focus on two popular methods: the computation of the projected Temporal Difference fixed point (TD(0), TD for short), which Antos et al. (2008), Farahmand et al. (2008), and Sutton et al. (2009) have recently presented as the minimization of the mean-square projected Bellman Equation,
and the minimization of the mean-square Bellman Residual (BR). In this article, we present new analytical and empirical results that shed some light on both approaches. The paper is organized as follows. Section 1 describes the MDP linear approximation framework and the two projection methods. Section 2 presents small MDP examples where each method outperforms the other. Section 3 highlights a simple relation between the quantities TD and BR optimize, and shows that while BR enjoys a performance guarantee, TD does not in general. Section 4 contains the main contribution of this paper: we describe a unified view in terms of oblique projections of the Bellman equation, which simplifies and extends the characterization of Schoknecht (2002) and the recent analysis of Yu & Bertsekas (2008). Finally, Section 5 presents simulations that address the following practical questions: which method gives the best approximation, and how useful is our analysis for selecting it a priori?
The model

We consider an MDP with a fixed policy, that is, an uncontrolled discrete-time dynamic system with instantaneous rewards. We assume that there is a state space X of finite size N. When at state i ∈ {1, .., N}, there is a transition probability p_ij of getting to the next state j. Let i_k denote the state of the system at time k. At each time step, the system is given a reward γ^k r(i_k), where r is the instantaneous reward function and 0 < γ < 1 is a discount factor. The value at state i is defined as the total expected return:

v(i) := lim_{K→∞} E[ Σ_{k=0}^{K} γ^k r(i_k) | i_0 = i ].

Writing P for the N × N matrix of transition probabilities p_ij, r for the reward vector, and T for the linear Bellman operator T v := r + γP v, the value function v is the unique fixed point of T; equivalently, it is the solution of Lv = r with L := I − γP, so that v = L⁻¹r.
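To make the model concrete, here is a minimal numerical sketch (not part of the paper; it assumes NumPy and uses an arbitrary randomly generated chain, so all names are illustrative) that builds P and r and computes the exact value v = L⁻¹r:

```python
import numpy as np

# Minimal sketch of the model: a small uncontrolled Markov chain with
# N states, instantaneous rewards r and discount factor gamma.
rng = np.random.default_rng(0)
N, gamma = 10, 0.9

P = rng.random((N, N))
P /= P.sum(axis=1, keepdims=True)    # row-stochastic transition matrix (p_ij)
r = rng.random(N)                    # reward vector, r(i) for each state i

# Exact value function: v = L^{-1} r with L := I - gamma * P
L = np.eye(N) - gamma * P
v = np.linalg.solve(L, r)

# Sanity check: v is the fixed point of the Bellman operator, v = r + gamma * P v
assert np.allclose(v, r + gamma * P @ v)
```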
Approximation Scheme

When the size N of the state space is large, one usually resorts to solving the Bellman equation approximately. One possibility is to look for an approximate solution v̂ in some specific small space. The simplest and best understood choice is a linear parameterization: for all i, v̂(i) = Σ_{j=1}^{m} w_j φ_j(i), where m ≪ N, the φ_j are feature functions that should capture the general shape of v, and the w_j are the weights that characterize the approximate value v̂. For all j, write φ_j for the N-dimensional vector corresponding to the j-th feature function and, for all i, φ(i) for the m-dimensional vector giving the features of state i. For any vector or matrix X, denote by X′ its transpose. The N × m feature matrix Φ = (φ_1 . . . φ_m) = (φ(1) . . . φ(N))′ leads to a condensed matrix form of the parameterization: v̂ = Φw, where w = (w_1, …, w_m) is the m-dimensional weight vector. From now on, we denote by span(Φ) this subspace of R^N and assume that the vectors φ_1, …, φ_m form a linearly independent set. An approximation v̂ of v can be obtained by minimizing u ↦ ‖u − v‖ over span(Φ) for some norm ‖·‖, that is, equivalently, by projecting v onto span(Φ) orthogonally with respect to ‖·‖. In a very general way, any symmetric positive definite N × N matrix Q induces such a norm, ‖x‖_Q := √(x′Qx).
It is well known that the orthogonal projection with respect to such a norm, which we will denote Π_{‖·‖_Q}, has the following closed form:

Π_{‖·‖_Q} = Φ (Φ′QΦ)⁻¹ Φ′Q.

It is also convenient to introduce the operator

π_{‖·‖_Q} := (Φ′QΦ)⁻¹ Φ′Q,

which returns the coordinates of the projection of a point in the basis (φ_1, . . . , φ_m). With these notations, the following relations hold: Π_{‖·‖_Q} = Φ π_{‖·‖_Q} and π_{‖·‖_Q} Φ = I.
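The closed form above is easy to check numerically. The following sketch (illustrative only, assuming NumPy; Q is an arbitrary symmetric positive definite matrix and Φ an arbitrary full-rank feature matrix) computes Π_{‖·‖_Q} and π_{‖·‖_Q} and verifies the relations just stated:

```python
import numpy as np

# Sketch of the Q-weighted orthogonal projection onto span(Phi) and of the
# coordinate operator pi_Q, following the closed forms given in the text.
rng = np.random.default_rng(1)
N, m = 10, 3

Phi = rng.random((N, m))              # N x m feature matrix (full column rank)
A = rng.random((N, N))
Q = A @ A.T + np.eye(N)               # arbitrary symmetric positive definite matrix

pi_Q = np.linalg.solve(Phi.T @ Q @ Phi, Phi.T @ Q)   # (Phi' Q Phi)^{-1} Phi' Q
Pi_Q = Phi @ pi_Q                                    # Pi = Phi pi

# Relations stated in the text and basic projection properties
assert np.allclose(pi_Q @ Phi, np.eye(m))              # pi Phi = I
assert np.allclose(Pi_Q @ Pi_Q, Pi_Q)                  # Pi is idempotent
assert np.allclose(Phi.T @ Q @ (np.eye(N) - Pi_Q), 0)  # residual is Q-orthogonal to span(Phi)
```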
In an MDP approximation context, where one is modeling a stochastic system, one usually considers a specific kind of norm/projection. Let ξ = (ξ_i) be some distribution on X such that ξ > 0 (it assigns a positive probability to all states), and let Ξ be the diagonal matrix with the elements of ξ on the diagonal. Consider the orthogonal projection of R^N onto the feature space span(Φ) with respect to the ξ-weighted quadratic norm ‖x‖_ξ := √(x′Ξx) = √(Σ_i ξ_i x(i)²).
For clarity of exposition, we will denote this specific projection simply by Π := Π_{‖·‖_ξ} = Φ(Φ′ΞΦ)⁻¹Φ′Ξ, and the corresponding coordinate operator by π := π_{‖·‖_ξ} = (Φ′ΞΦ)⁻¹Φ′Ξ.
Ideally, one would like to compute the “best” approximation (in the ‖·‖_ξ sense) v̂_best = Φ w_best with w_best = πv = πL⁻¹r.
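As a concrete illustration (again a sketch with randomly generated P, r, Φ and ξ, assuming NumPy; none of these objects come from the paper), the ξ-weighted operators and the resulting “best” weights can be formed as follows:

```python
import numpy as np

# Sketch of the xi-weighted projection and of the "best" approximation
# w_best = pi v = pi L^{-1} r.
rng = np.random.default_rng(2)
N, m, gamma = 10, 3, 0.9

P = rng.random((N, N)); P /= P.sum(axis=1, keepdims=True)
r = rng.random(N)
Phi = rng.random((N, m))

xi = rng.random(N); xi /= xi.sum()    # positive distribution over the N states
Xi = np.diag(xi)

pi = np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)   # pi = (Phi' Xi Phi)^{-1} Phi' Xi
L = np.eye(N) - gamma * P
v = np.linalg.solve(L, r)             # exact value function v = L^{-1} r

w_best = pi @ v                       # best weights in the ||.||_xi sense
v_best = Phi @ w_best                 # best approximation of v within span(Phi)
```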
This can be done with algorithms like TD(1)/LSTD(1) (Bertsekas & Tsitsiklis, 1996; Boyan, 2002), but these require simulating infinitely long trajectories and usually suffer from a high variance. The projection methods on which we focus in this paper are alternatives that only consider one-step samples.

TD(0) fixed-point method

The principle of the TD(0) method (TD for short) is to look for a fixed point of ΠT, that is, for a v̂_TD in the space span(Φ) satisfying v̂_TD = ΠT v̂_TD. Assuming that the matrix inverse below exists, it can be proved that v̂_TD = Φ w_TD with

w_TD = (Φ′ΞLΦ)⁻¹ Φ′Ξ r.
As pointed out by Antos et al. (2008), Farahmand et al. (2008), and Sutton et al. (2009), when the inverse exists, the above computation is equivalent to minimizing the mean-square projected Bellman Equation error, ‖v̂ − ΠT v̂‖²_ξ, over v̂ ∈ span(Φ).
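On the same kind of randomly generated instance, the TD(0) fixed point can be computed directly from the formula above; the sketch below (illustrative, assuming NumPy) also checks the fixed-point property v̂_TD = ΠT v̂_TD:

```python
import numpy as np

# Sketch of the TD(0) fixed-point solution w_TD = (Phi' Xi L Phi)^{-1} Phi' Xi r,
# assuming the inverse exists.
rng = np.random.default_rng(3)
N, m, gamma = 10, 3, 0.9

P = rng.random((N, N)); P /= P.sum(axis=1, keepdims=True)
r = rng.random(N)
Phi = rng.random((N, m))
xi = rng.random(N); xi /= xi.sum()
Xi = np.diag(xi)
L = np.eye(N) - gamma * P

w_TD = np.linalg.solve(Phi.T @ Xi @ L @ Phi, Phi.T @ Xi @ r)
v_TD = Phi @ w_TD

# Fixed-point check: v_TD = Pi T v_TD with Pi the xi-weighted projection
Pi = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)
assert np.allclose(v_TD, Pi @ (r + gamma * P @ v_TD))
```

Note that this matrix form is only meant to make the definition concrete: as stated above, the projection methods studied in the paper work from one-step samples rather than from explicit knowledge of P and Ξ.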