Closing the Learning-Planning Loop with Predictive State Representations
Byron Boots (Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, beb@cs.cmu.edu), Sajid M. Siddiqi (Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, siddiqi@cs.cmu.edu), Geoffrey J. Gordon (Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, ggordon@cs.cmu.edu)

ABSTRACT

A central problem in artificial intelligence is that of planning to maximize future reward under uncertainty in a partially observable environment. In this paper we propose and demonstrate a novel algorithm which accurately learns a model of such an environment directly from sequences of action-observation pairs. We then close the loop from observations to actions by planning in the learned model and recovering a policy which is near-optimal in the original environment. Specifically, we present an efficient and statistically consistent spectral algorithm for learning the parameters of a Predictive State Representation (PSR). We demonstrate the algorithm by learning a model of a simulated high-dimensional, vision-based mobile robot planning task, and then perform approximate point-based planning in the learned PSR. Analysis of our results shows that the algorithm learns a state space which efficiently captures the essential features of the environment. This representation allows accurate prediction with a small number of parameters, and enables successful and efficient planning.

1. INTRODUCTION

Planning a sequence of actions or a policy to maximize future reward has long been considered a fundamental problem for autonomous agents. For many years, Partially Observable Markov Decision Processes (POMDPs) [1, 27, 4] have been considered the most general framework for single-agent planning.
POMDPs model the state of the world as a latent variable and explicitly reason about uncertainty in both action effects and state observability. Plans in POMDPs are expressed as policies, which specify the action to take given any possible probability distribution over states. Unfortunately, exact planning algorithms such as value iteration [27] are computationally intractable for most realistic POMDP planning problems. There are arguably two primary reasons for this [18]. The first is the "curse of dimensionality": for a POMDP with $n$ states, the optimal policy is a function of an $(n-1)$-dimensional distribution over latent states. The second is the "curse of history": the number of distinct policies increases exponentially in the planning horizon. We hope to mitigate the curse of dimensionality by seeking a dynamical system model with compact dimensionality, and to mitigate the curse of history by looking for a model that is amenable to approximate planning.

Predictive State Representations (PSRs) [13] and the closely related Observable Operator Models (OOMs) [9] are generalizations of POMDPs that have attracted interest because they both have greater representational capacity than POMDPs and yield representations that are at least as compact [24, 5]. In contrast to the latent-variable representations of POMDPs, PSRs and OOMs represent the state of a dynamical system by tracking occurrence probabilities of a set of future events (called tests or characteristic events) conditioned on past events (called histories or indicative events). Because tests and histories are observable quantities, it has been suggested that learning PSRs and OOMs should be easier than learning POMDPs. A final benefit of PSRs and OOMs is that many successful approximate planning techniques for POMDPs can be used to plan in these observable models with minimal adjustment.
Accordingly, PSR and OOM models of dynamical systems have the potential to overcome both the "curse of dimensionality" (by compactly modeling state) and the "curse of history" (by applying approximate planning techniques).

The quality of an optimized policy for a POMDP, PSR, or OOM depends strongly on the accuracy of the model: inaccurate models typically lead to useless plans. We can specify a model manually or learn one from data, but due to the difficulty of learning, it is far more common to see planning algorithms applied to manually specified models. Unfortunately, it is usually only possible to hand-specify accurate models for small systems where there is extensive and goal-relevant domain knowledge. For example, recent extensions of approximate planning techniques for PSRs have only been applied to models constructed by hand [11, 8]. For the most part, learning models for planning in partially observable environments has been hampered by the inaccuracy of learning algorithms. For example, Expectation-Maximization (EM) [2] does not avoid local minima or scale to large state spaces; and, although many learning algorithms have been proposed for PSRs [25, 10, 34, 16, 30, 3] and OOMs [9, 6, 14] that attempt to take advantage of the observability of the state representation, none have been shown to learn models that are accurate enough for planning. As a result, there have been few successful attempts at learning a model directly from data and then closing the loop by planning in that model.

Several researchers have, however, made progress on the problem of planning using a learned model. In one instance [21], researchers obtained a POMDP heuristically from the output of a model-free algorithm [15] and demonstrated planning on a small toy maze.
In another instance [20], researchers used Markov Chain Monte Carlo (MCMC) inference both to learn a factored Dynamic Bayesian Network (DBN) representation of a POMDP in a small synthetic network administration domain, and to perform online planning. Due to the cost of the MCMC sampler used, this approach is still impractical for larger models. In a final example, researchers learned Linear-Linear Exponential Family PSRs from an agent traversing a simulated environment, and found a policy using a policy gradient technique with a parameterized function of the learned PSR state as input [33, 31]. In this case both the learning and the planning algorithms were subject to local optima. In addition, the authors determined that the learned model was too inaccurate to support value-function-based planning methods [31].

The current paper differs from these and other previous examples of planning in learned models: it both uses a principled and provably statistically consistent model-learning algorithm, and demonstrates positive results on a challenging high-dimensional problem with continuous observations. In particular, we propose a novel, consistent spectral algorithm for learning a variant of PSRs called Transformed PSRs [19] directly from execution traces. The algorithm is closely related to subspace identification for learning linear dynamical systems (LDSs) [26, 29] and to spectral algorithms for learning Hidden Markov Models (HMMs) [7] and reduced-rank Hidden Markov Models [22]. We then demonstrate that this algorithm is able to learn compact models of a difficult, realistic dynamical system without any prior domain knowledge built into the model or algorithm. Finally, we perform point-based approximate value iteration in the learned compact models, and demonstrate that the greedy policy for the resulting value function works well in the original (not the learned) system.
To our knowledge this is the first research that combines all of these achievements, closing the loop from observations to actions in an unknown domain with no human intervention beyond collecting the raw transition data.

2. PREDICTIVE STATE REPRESENTATIONS

A predictive state representation (PSR) [13] is a compact and complete description of a dynamical system that represents state as a set of predictions of observable experiments or tests that one could perform in the system. Specifically, a test of length $k$ is an ordered sequence of action-observation pairs $\tau = a_1 o_1 \ldots a_k o_k$ that can be executed and observed at a given time. Likewise, a history is an ordered sequence of action-observation pairs $h = a_1^h o_1^h \ldots a_t^h o_t^h$ that has been executed and observed prior to a given time. The prediction for a test $\tau$ is the probability of the sequence of observations $o_1, \ldots, o_k$ being generated, given that we intervene to take the sequence of actions $a_1, \ldots, a_k$. If the observations produced by the dynamical system match those specified by the test, then the test is said to have succeeded. The key idea behind a PSR is that, if the expected outcomes of executing all possible tests are known, then everything there is to know about the state of the dynamical system is also known.

In PSRs, actions in tests are interventions, not observations. Thus it is notationally convenient to separate a test $\tau$ into the observation component $\tau^O$ and the action component $\tau^A$. In equations that contain probabilities, a single vertical bar $\mid$ indicates conditioning and a double vertical bar $\|$ indicates intervening. For example, $p(\tau_i^O \mid h \,\|\, \tau_i^A)$ is the probability of the observations in test $\tau_i$, conditioned on history $h$, and given that we intervene to execute the actions in $\tau_i$. Formally, a PSR consists of five elements $\{A, O, Q, m_1, F\}$.
$A$ is the set of actions that can be executed at each time-step, $O$ is the set of possible observations, and $Q$ is a set of core tests. A set of core tests $Q$ has the property that for any test $\tau$, there exists some function $f_\tau$ such that $p(\tau^O \mid h \,\|\, \tau^A) = f_\tau(p(Q^O \mid h \,\|\, Q^A))$ for all histories $h$. Here, the prediction vector

$$p(Q^O \mid h \,\|\, Q^A) = \left[\, p(q_1^O \mid h \,\|\, q_1^A),\; \ldots,\; p(q_{|Q|}^O \mid h \,\|\, q_{|Q|}^A) \,\right]^T \quad (1)$$

contains the probabilities of success of the tests in $Q$. The existence of $f_\tau$ means that knowing the probabilities for the tests in $Q$ is sufficient for computing the probabilities for all other tests, so the prediction vector is a sufficient statistic for the system. The vector $m_1$ is the initial prediction for the outcomes of the tests in $Q$ given some initial distribution over histories $\omega$. We will allow the initial distribution to be general; in practice $\omega$ might correspond to the steady-state distribution for a heuristic exploration policy, or the distribution over histories when we first encounter the system, or the empty history with probability 1.

In order to maintain the predictions of the tests in $Q$, we need to compute $p(Q^O \mid ho \,\|\, a, Q^A)$, the distribution over test outcomes given a new extended history, from the current distribution $p(Q^O \mid h \,\|\, Q^A)$. (Here $p(Q^O \mid ho \,\|\, a, Q^A)$ is the probability over test outcomes conditioned on history $h$ and observation $o$, given the intervention of choosing the immediate next action $a$ and the appropriate actions for the tests.) Let $f_{aoq}$ be the function needed to update our prediction of test $q \in Q$ given an action $a$ and an observation $o$. (This function is guaranteed to exist, since we can set $\tau = aoq$ in $f_\tau$ above.) Finally, $F$ is the set of functions $f_{aoq}$ for all $a \in A$, $o \in O$, and $q \in Q$.
In this work we will restrict ourselves to linear PSRs, a subset of PSRs in which the functions $f_{aoq}$ are required to be linear in the prediction vector $p(Q^O \mid h \,\|\, Q^A)$, so that $f_{aoq}(p(Q^O \mid h \,\|\, Q^A)) = m_{aoq}^T\, p(Q^O \mid h \,\|\, Q^A)$ for some vector $m_{aoq} \in \mathbb{R}^{|Q|}$.¹ We write $M_{ao}$ for the matrix with rows $m_{aoq}^T$. By Bayes' rule, the update from history $h$, after taking action $a$ and seeing observation $o$, is:

$$p(Q^O \mid ho \,\|\, a, Q^A) = \frac{p(o, Q^O \mid h \,\|\, a, Q^A)}{p(o \mid h \,\|\, a)} = \frac{M_{ao}\, p(Q^O \mid h \,\|\, Q^A)}{m_\infty^T M_{ao}\, p(Q^O \mid h \,\|\, Q^A)} \quad (2)$$

where $m_\infty$ is a normalizing vector. Specifying a PSR involves first finding a set of core tests $Q$, called the discovery problem, and then finding the parameters $M_{ao}$ and $m_\infty$ for those tests, as well as an initial state $m_1$, called the learning problem. The discovery problem is usually solved by searching for linearly independent tests by repeatedly performing Singular Value Decompositions (SVDs) on collections of tests [10, 34]. The learning problem is then solved by regression.

¹ Linear PSRs have been shown to be a highly expressive class of models [9, 24]: if the set of core tests is minimal, then the set of PSRs with $n = |Q|$ core tests is provably equivalent to the set of dynamical systems with linear dimension $n$. The linear dimension of a dynamical system is a measure of its intrinsic complexity; specifically, it is the rank of the system-dynamics matrix [24] of the dynamical system. Since there exist dynamical systems of finite linear dimension which cannot be modeled by any POMDP (or HMM) with a finite number of states (see [9] for an example), POMDPs and HMMs are a proper subset of PSRs [24].

2.1 Transformed PSRs

Transformed PSRs (TPSRs) [19] are a generalization of PSRs that maintain a small number of linear combinations of test probabilities as sufficient statistics of the dynamical system.
As we will see, transformed PSRs can be thought of as linear transformations of regular PSRs. Accordingly, TPSRs include PSRs as a special case, since the transformation can be the identity matrix. The main benefit of TPSRs is that, given a set of core tests, the parameter learning problem can be solved, and a large step toward solving the discovery problem can be taken, in closed form. In this respect, TPSRs are closely related to the transformed representations of LDSs and HMMs found by subspace identification [29, 26, 7].

For some dynamical system, let $Q$ be the minimal set of core tests, with cardinality $n = |Q|$ equal to the linear dimension of the system. Then let $\mathcal{T}$ be a set of core tests (not necessarily minimal) and let $\mathcal{H}$ be a sufficient set of indicative events. A set of indicative events is a mutually exclusive and exhaustive partition of the set of all possible histories. We will define a sufficient set of indicative events below. For TPSRs, $|\mathcal{T}|$ and $|\mathcal{H}|$ may be arbitrarily larger than $n$; in practice we might choose $\mathcal{T}$ and $\mathcal{H}$ by selecting sets that we believe to be large enough and varied enough to exhibit the types of behavior that we wish to model.

We define several matrices in terms of $\mathcal{T}$ and $\mathcal{H}$. In each of these matrices we assume that histories $H$ are sampled according to $\omega$; further actions and observations are specified in the individual probability expressions. $P_{\mathcal{H}} \in \mathbb{R}^{|\mathcal{H}|}$ is a vector containing the probabilities of every event $h_i \in \mathcal{H}$:

$$[P_{\mathcal{H}}]_i \equiv \Pr[H \in h_i] = \omega(H \in h_i) \equiv \pi_{h_i} \;\Rightarrow\; P_{\mathcal{H}} = \pi \quad (3a)$$

Here we have defined two notations, $P_{\mathcal{H}}$ and $\pi$, for the same vector. Below we will generalize $P_{\mathcal{H}}$, but keep the same meaning for $\pi$.
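Concretely, the empirical estimate of $P_{\mathcal{H}}$ used later for learning is just a normalized histogram over indicative events. The sketch below is our own minimal illustration (the function name, the toy histories, and the event predicates are invented for the example, not taken from the paper):

```python
from collections import Counter

def estimate_P_H(histories, indicative_events):
    """Empirical estimate of P_H from sampled histories.
    histories: list of histories, each a tuple of (action, observation) pairs.
    indicative_events: list of predicates forming a mutually exclusive and
    exhaustive partition of history space (exactly one holds per history)."""
    counts = Counter()
    for h in histories:
        # find the (unique) indicative event containing this history
        idx = next(i for i, event in enumerate(indicative_events) if event(h))
        counts[idx] += 1
    n = len(histories)
    return [counts[i] / n for i in range(len(indicative_events))]

# toy example: partition histories by the last observation seen
histories = [(('a', 0),), (('b', 1),), (('a', 1),), (('a', 0),)]
events = [lambda h: h[-1][1] == 0, lambda h: h[-1][1] == 1]
P_H = estimate_P_H(histories, events)  # → [0.5, 0.5]
```

By the law of large numbers, this histogram converges to $\pi$ as more histories are sampled from $\omega$.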
Next we define $P_{\mathcal{T},\mathcal{H}} \in \mathbb{R}^{|\mathcal{T}| \times |\mathcal{H}|}$, a matrix whose entries contain the joint probability of every test $\tau_i \in \mathcal{T}$ ($1 \le i \le |\mathcal{T}|$) and every indicative event $h_j \in \mathcal{H}$ ($1 \le j \le |\mathcal{H}|$), assuming we execute the test actions $\tau_i^A$:

$$\begin{aligned}
[P_{\mathcal{T},\mathcal{H}}]_{i,j} &\equiv \Pr[\tau_i^O, H \in h_j \,\|\, \tau_i^A] \\
&= \Pr[\tau_i^O \mid H \in h_j \,\|\, \tau_i^A]\, \Pr[H \in h_j] \\
&\equiv r_{\tau_i}^T \Pr[Q^O \mid H \in h_j \,\|\, Q^A]\, \Pr[H \in h_j] \\
&\equiv r_{\tau_i}^T s_{h_j}\, \Pr[H \in h_j] = r_{\tau_i}^T s_{h_j}\, \pi_{h_j} \\
\Rightarrow\; P_{\mathcal{T},\mathcal{H}} &= R\, S\, \mathrm{diag}(\pi) \quad (3b)
\end{aligned}$$

The vector $r_{\tau_i}$ is the linear function that specifies the probability of the test $\tau_i$ given the probabilities of the core tests $Q$. The vector $s_{h_j}$ contains the probabilities of all core tests $Q$ given that the history belongs to the indicative event $h_j$. Because of our assumptions about the linear dimension of the system, the matrix $P_{\mathcal{T},\mathcal{H}}$ factors according to $R \in \mathbb{R}^{|\mathcal{T}| \times n}$ (a matrix with rows $r_{\tau_i}^T$ for all $1 \le i \le |\mathcal{T}|$) and $S \in \mathbb{R}^{n \times |\mathcal{H}|}$ (a matrix with columns $s_{h_j}$ for all $1 \le j \le |\mathcal{H}|$). Therefore, the rank of $P_{\mathcal{T},\mathcal{H}}$ is no more than the linear dimension of the system. At this point we can define a sufficient set of indicative events as promised: it is a set of indicative events which ensures that the rank of $P_{\mathcal{T},\mathcal{H}}$ is equal to the linear dimension of the system. Finally, $m_1$, which we have defined as the initial prediction for the outcomes of tests in $Q$ given some initial distribution over histories, is given by $m_1 = S\pi$ (here we are taking the expectation of the columns of $S$ according to the correct distribution over histories $\omega$).
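The rank statement can be checked numerically. The following sketch builds $P_{\mathcal{T},\mathcal{H}} = R\,S\,\mathrm{diag}(\pi)$ from synthetic factors (all dimensions and values below are invented for illustration) and confirms that its rank equals $n$ even though $|\mathcal{T}|$ and $|\mathcal{H}|$ are much larger:

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_tests, num_events = 3, 20, 15     # linear dimension, |T|, |H|

R = rng.standard_normal((num_tests, n))  # rows r_tau^T
S = rng.standard_normal((n, num_events)) # columns s_h
pi = rng.dirichlet(np.ones(num_events))  # distribution over indicative events

P_TH = R @ S @ np.diag(pi)

# the rank is bounded by (and here equal to) the linear dimension n
assert np.linalg.matrix_rank(P_TH) == n
```

Generic random factors give $R$ full column rank and $S$ full row rank, so the product attains the bound; a deficient choice of tests or indicative events would make the rank drop below $n$.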
We define $P_{\mathcal{T},ao,\mathcal{H}} \in \mathbb{R}^{|\mathcal{T}| \times |\mathcal{H}|}$, a set of matrices, one for each action-observation pair, that represent the probabilities of a triple of an indicative event $h_j$, the immediately following observation $O$, and a subsequent test $\tau_i$, given the appropriate actions:

$$\begin{aligned}
[P_{\mathcal{T},ao,\mathcal{H}}]_{i,j} &\equiv \Pr[\tau_i^O, O = o, H \in h_j \,\|\, A = a, \tau_i^A] \\
&= \Pr[\tau_i^O, O = o \mid H \in h_j \,\|\, A = a, \tau_i^A]\, \Pr[H \in h_j] \\
&= \Pr[\tau_i^O \mid H \in h_j, O = o \,\|\, A = a, \tau_i^A]\, \Pr[O = o \mid H \in h_j \,\|\, A = a]\, \Pr[H \in h_j] \\
&= r_{\tau_i}^T \Pr[Q^O \mid H \in h_j, O = o \,\|\, A = a, Q^A]\, \Pr[O = o \mid H \in h_j \,\|\, A = a]\, \Pr[H \in h_j] \\
&= r_{\tau_i}^T M_{ao} \Pr[Q^O \mid H \in h_j \,\|\, Q^A]\, \Pr[H \in h_j] \\
&= r_{\tau_i}^T M_{ao}\, s_{h_j}\, \pi_{h_j} \\
\Rightarrow\; P_{\mathcal{T},ao,\mathcal{H}} &= R\, M_{ao}\, S\, \mathrm{diag}(\pi) \quad (3c)
\end{aligned}$$

The matrices $P_{\mathcal{T},ao,\mathcal{H}}$ factor according to $R$ and $S$ (defined above) and the PSR transition matrix $M_{ao} \in \mathbb{R}^{n \times n}$. Note that $R$ spans the column space of both $P_{\mathcal{T},\mathcal{H}}$ and the matrices $P_{\mathcal{T},ao,\mathcal{H}}$; we make use of this fact below.

Finally, we will use the fact that $m_\infty$ is a normalizing vector to derive the equations below (by repeatedly multiplying by $S$ and $S^\dagger$, and using the facts $S S^\dagger = I$ and $m_\infty^T S = 1^T$, since each column of $S$ is a vector of core-test predictions). Here, $k = |\mathcal{H}|$ and $1_k$ denotes the ones vector of length $k$:

$$m_\infty^T S = 1_k^T \;\Rightarrow\; m_\infty^T S S^\dagger = 1_k^T S^\dagger \;\Rightarrow\; m_\infty^T = 1_k^T S^\dagger \quad (4a)$$

$$m_\infty^T S = 1_k^T \;\Rightarrow\; m_\infty^T S S^\dagger S = 1_k^T S^\dagger S \;\Rightarrow\; 1_k^T = 1_k^T S^\dagger S \quad (4b)$$

We now define a TPSR in terms of the matrices $P_{\mathcal{H}}$, $P_{\mathcal{T},\mathcal{H}}$, $P_{\mathcal{T},ao,\mathcal{H}}$, and an additional matrix $U$ that obeys the condition that $U^T R$ is invertible. In other words, the columns of $U$ define an $n$-dimensional subspace that is not orthogonal to the column space of $P_{\mathcal{T},\mathcal{H}}$. A natural choice for $U$ is given by the left singular vectors of $P_{\mathcal{T},\mathcal{H}}$.
With these definitions, we can express the parameters of a TPSR in terms of observable matrices, and simplify the expressions using Equations 3(a–c), as follows (here $B_{ao}$ is a similarity transform of the low-dimensional linear transition matrix $M_{ao}$, and $b_1$ and $b_\infty$ are the corresponding linear transformations of the minimal PSR initial state $m_1$ and the normalizing vector $m_\infty$):

$$b_1 \equiv U^T P_{\mathcal{T},\mathcal{H}}\, 1_k = U^T R S\, \mathrm{diag}(\pi)\, 1_k = U^T R S \pi = (U^T R)\, m_1 \quad (5a)$$

$$\begin{aligned}
b_\infty^T &\equiv P_{\mathcal{H}}^T\, (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= 1_k^T\, \mathrm{diag}(\pi)\, (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= 1_k^T S^\dagger S\, \mathrm{diag}(\pi)\, (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= 1_k^T S^\dagger (U^T R)^{-1} (U^T R)\, S\, \mathrm{diag}(\pi)\, (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= 1_k^T S^\dagger (U^T R)^{-1}\, U^T P_{\mathcal{T},\mathcal{H}} (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= 1_k^T S^\dagger (U^T R)^{-1} = m_\infty^T (U^T R)^{-1} \quad (5b)
\end{aligned}$$

$$\begin{aligned}
B_{ao} &\equiv U^T P_{\mathcal{T},ao,\mathcal{H}} (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= U^T R M_{ao} S\, \mathrm{diag}(\pi)\, (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= U^T R M_{ao} (U^T R)^{-1} (U^T R)\, S\, \mathrm{diag}(\pi)\, (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= (U^T R)\, M_{ao}\, (U^T R)^{-1}\, U^T P_{\mathcal{T},\mathcal{H}} (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= (U^T R)\, M_{ao}\, (U^T R)^{-1} \quad (5c)
\end{aligned}$$

The derivation of Equation 5b makes use of Equations 4a and 4b. Given these parameters we can calculate the probability of observations $o_{1:t}$ at any time $t$, given that we intervened with actions $a_{1:t}$, from the initial state $m_1$. Here we write the product of the matrices $M_{ao}$ (one for each action-observation pair), $M_{a_1 o_1} M_{a_2 o_2} \cdots M_{a_t o_t}$, as $M_{ao_{1:t}}$:

$$\Pr[o_{1:t} \,\|\, a_{1:t}] = m_\infty^T M_{ao_{1:t}} m_1 = m_\infty^T (U^T R)^{-1} (U^T R)\, M_{ao_{1:t}}\, (U^T R)^{-1} (U^T R)\, m_1 = b_\infty^T B_{ao_{1:t}} b_1 \quad (6)$$

In addition to the initial TPSR state $b_1$, we define normalized conditional 'internal states' $b_t$.
We define the TPSR state at time $t+1$ as:

$$b_{t+1} \equiv \frac{B_{ao_{1:t}}\, b_1}{b_\infty^T B_{ao_{1:t}}\, b_1} \quad (7)$$

We can define a recursive state update for $t > 1$ as follows (using Equation 7 as the base case for $t = 1$):

$$b_{t+1} \equiv \frac{B_{ao_{1:t}}\, b_1}{b_\infty^T B_{ao_{1:t}}\, b_1} = \frac{B_{ao_t} B_{ao_{1:t-1}}\, b_1}{b_\infty^T B_{ao_t} B_{ao_{1:t-1}}\, b_1} = \frac{B_{ao_t}\, b_t}{b_\infty^T B_{ao_t}\, b_t} \quad (8)$$

The prediction of tests $p(\mathcal{T}^O \mid h \,\|\, \mathcal{T}^A)$ at time $t$ is given by $U b_t = U U^T R s_t = R s_t$, and the rotation from a TPSR to a PSR is given by $s_t = (U^T R)^{-1} b_t$, where $s_t$ is the prediction vector of the PSR. Note that, in general, the elements of the linear combinations $b_t$ cannot be interpreted as probabilities, since they may lie outside the range $[0, 1]$.

3. LEARNING TPSRS

Our learning algorithm works by building empirical estimates $\hat{P}_{\mathcal{H}}$, $\hat{P}_{\mathcal{T},\mathcal{H}}$, and $\hat{P}_{\mathcal{T},ao,\mathcal{H}}$ of the matrices $P_{\mathcal{H}}$, $P_{\mathcal{T},\mathcal{H}}$, and $P_{\mathcal{T},ao,\mathcal{H}}$ defined above. To build these estimates, we repeatedly sample a history $h$ from the distribution $\omega$, execute a sequence of actions, and record the resulting observations. This data-gathering strategy implies that we must be able to arrange for the system to be in a state corresponding to $h \sim \omega$; for example, if our system has a reset, we can take $\omega$ to be the distribution resulting from executing a fixed exploration policy for a few steps after reset.

In practice, reset is often not available. In this case we can estimate $\hat{P}_{\mathcal{H}}$, $\hat{P}_{\mathcal{T},\mathcal{H}}$, and $\hat{P}_{\mathcal{T},ao,\mathcal{H}}$ by dividing a single long sequence of action-observation pairs into subsequences and pretending that each subsequence started with a reset. We are then forced to use an initial distribution over histories, $\omega$, equal to the steady-state distribution of the policy which generated the data. This approach is called the suffix-history algorithm [34].
With this method, the estimated matrices will be only approximately correct, since interventions that we take at one time will affect the distribution over histories at future times; however, the approximation is often a good one in practice.

Once we have computed $\hat{P}_{\mathcal{H}}$, $\hat{P}_{\mathcal{T},\mathcal{H}}$, and $\hat{P}_{\mathcal{T},ao,\mathcal{H}}$, we can generate $\hat{U}$ by singular value decomposition of $\hat{P}_{\mathcal{T},\mathcal{H}}$. We can then learn the TPSR parameters by plugging $\hat{U}$, $\hat{P}_{\mathcal{H}}$, $\hat{P}_{\mathcal{T},\mathcal{H}}$, and $\hat{P}_{\mathcal{T},ao,\mathcal{H}}$ into Equation 5. For reference, we summarize the above steps here²:

1. Compute empirical estimates $\hat{P}_{\mathcal{H}}$, $\hat{P}_{\mathcal{T},\mathcal{H}}$, $\hat{P}_{\mathcal{T},ao,\mathcal{H}}$.

2. Use an SVD of $\hat{P}_{\mathcal{T},\mathcal{H}}$ to compute $\hat{U}$, the matrix of left singular vectors corresponding to the $n$ largest singular values.

3. Compute model parameter estimates:
   (a) $\hat{b}_1 = \hat{U}^T \hat{P}_{\mathcal{T},\mathcal{H}}\, 1_k$,
   (b) $\hat{b}_\infty = (\hat{P}_{\mathcal{T},\mathcal{H}}^T \hat{U})^\dagger \hat{P}_{\mathcal{H}}$,
   (c) $\hat{B}_{ao} = \hat{U}^T \hat{P}_{\mathcal{T},ao,\mathcal{H}} (\hat{U}^T \hat{P}_{\mathcal{T},\mathcal{H}})^\dagger$.

As we include more data in our averages, the law of large numbers guarantees that our estimates $\hat{P}_{\mathcal{H}}$, $\hat{P}_{\mathcal{T},\mathcal{H}}$, and $\hat{P}_{\mathcal{T},ao,\mathcal{H}}$ converge to the true matrices $P_{\mathcal{H}}$, $P_{\mathcal{T},\mathcal{H}}$, and $P_{\mathcal{T},ao,\mathcal{H}}$ (defined in Equation 3). So, by continuity of the formulas in steps 3(a–c) above, if our system is truly a TPSR of finite rank, our estimates $\hat{b}_1$, $\hat{b}_\infty$, and $\hat{B}_{ao}$ converge to the true parameters up to a linear transform. Although parameters estimated from finite data can sometimes lead to negative probability estimates when filtering or predicting, this can be avoided in practice by thresholding the prediction vectors by some small positive probability.

Note that the learning algorithm presented here is distinct from the TPSR learning algorithm presented in Rosencrantz et al. [19].
The principal difference between the two algorithms is that here we estimate the joint probability of a past event, a current observation, and a future event in the matrix $\hat{P}_{\mathcal{T},ao,\mathcal{H}}$, whereas in [19], the authors instead estimate the probability of a future event conditioned on a past event and a current observation. To compensate, Rosencrantz et al. later multiply this estimate by an approximation of the probability of the current observation conditioned on the past event, but not until after the SVD is applied. Rosencrantz et al. also derive the approximate probability of the current observation differently: as the result of a regression, instead of directly from empirical counts. Finally, Rosencrantz et al. do not make any attempt to multiply by the marginal probability of the past event, although this term cancels in the current work; so it is possible that, in the absence of estimation errors, both algorithms arrive at the same answer.

² The learning strategy employed here may be seen as a generalization of Hsu et al.'s spectral algorithm for learning HMMs [7] to PSRs. Note that since HMMs and POMDPs are a proper subset of PSRs, we can use the algorithm in this paper to learn back both HMMs and POMDPs in PSR form.

Below we present two extensions to our learning algorithm that preserve consistency while relaxing the requirement that we find a discrete set of indicative events and tests. These extensions make learning substantially easier in practice for many difficult domains (e.g., those with continuous observations).

3.1 Learning TPSRs with Indicative and Characteristic Features

In data gathered from complex real-world dynamical systems, it may not be possible to find a reasonably sized set of discrete core tests $\mathcal{T}$ or indicative events $\mathcal{H}$.
When this is the case, we can generalize the TPSR learning algorithm and work with features of tests and histories, which we call characteristic features and indicative features respectively. In particular, let $\mathcal{T}$ and $\mathcal{H}$ be large sets of tests and indicative events (possibly too large to work with directly), and let $\phi^{\mathcal{T}}$ and $\phi^{\mathcal{H}}$ be shorter vectors of characteristic and indicative features. The matrices $P_{\mathcal{H}}$, $P_{\mathcal{T},\mathcal{H}}$, and $P_{\mathcal{T},ao,\mathcal{H}}$ will no longer contain probabilities but rather expected values of features or products of features. For the special case of features that are indicator functions of tests and histories, we recover the TPSR matrices of Section 2.1, where $P_{\mathcal{H}}$, $P_{\mathcal{T},\mathcal{H}}$, and $P_{\mathcal{T},ao,\mathcal{H}}$ consist of probabilities.

Here we prove the consistency of our estimation algorithm using these more general matrices as inputs. In the following equations, $\Phi^{\mathcal{T}}$ and $\Phi^{\mathcal{H}}$ are matrices of characteristic and indicative features respectively, with first dimension equal to the number of characteristic or indicative features and second dimension equal to $|\mathcal{T}|$ and $|\mathcal{H}|$ respectively. An entry of $\Phi^{\mathcal{H}}$ is the expectation of one of the indicative features given the occurrence of one of the indicative events. An entry of $\Phi^{\mathcal{T}}$ is the weight of one of our tests in calculating one of our characteristic features. With these features we generalize the matrices $P_{\mathcal{H}}$, $P_{\mathcal{T},\mathcal{H}}$, and $P_{\mathcal{T},ao,\mathcal{H}}$:

$$[P_{\mathcal{H}}]_i \equiv E(\phi_i^{\mathcal{H}}(h)) = \sum_{h \in \mathcal{H}} \Pr[H \in h]\, \Phi_{ih}^{\mathcal{H}} \;\Rightarrow\; P_{\mathcal{H}} = \Phi^{\mathcal{H}} \pi \quad (9a)$$

$$\begin{aligned}
[P_{\mathcal{T},\mathcal{H}}]_{i,j} &\equiv E\big(\phi_i^{\mathcal{T}}(\tau^O) \cdot \phi_j^{\mathcal{H}}(h) \,\big\|\, \tau^A\big) \\
&= \sum_{\tau \in \mathcal{T}} \sum_{h \in \mathcal{H}} \Pr[\tau^O, H \in h \,\|\, \tau^A]\, \Phi_{i\tau}^{\mathcal{T}} \Phi_{jh}^{\mathcal{H}} \\
&= \sum_{\tau \in \mathcal{T}} \sum_{h \in \mathcal{H}} r_\tau^T s_h \pi_h\, \Phi_{i\tau}^{\mathcal{T}} \Phi_{jh}^{\mathcal{H}} \qquad \text{(by Eq. 3b)} \\
&= \Big( \sum_{\tau \in \mathcal{T}} \Phi_{i\tau}^{\mathcal{T}} r_\tau^T \Big) \Big( \sum_{h \in \mathcal{H}} s_h \pi_h \Phi_{jh}^{\mathcal{H}} \Big) \\
\Rightarrow\; P_{\mathcal{T},\mathcal{H}} &= \Phi^{\mathcal{T}} R\, S\, \mathrm{diag}(\pi)\, \Phi^{\mathcal{H}\,T} \quad (9b)
\end{aligned}$$

$$\begin{aligned}
[P_{\mathcal{T},ao,\mathcal{H}}]_{i,j} &\equiv E\big(\phi_i^{\mathcal{T}}(\tau^O) \cdot \phi_j^{\mathcal{H}}(h) \cdot \delta(O = o) \,\big\|\, A = a, \tau^A\big) \\
&= \sum_{\tau \in \mathcal{T}} \sum_{h \in \mathcal{H}} \Pr[\tau^O, O = o, H \in h \,\|\, A = a, \tau^A]\, \Phi_{i\tau}^{\mathcal{T}} \Phi_{jh}^{\mathcal{H}} \\
&= \sum_{\tau \in \mathcal{T}} \sum_{h \in \mathcal{H}} r_\tau^T M_{ao} s_h \pi_h\, \Phi_{i\tau}^{\mathcal{T}} \Phi_{jh}^{\mathcal{H}} \qquad \text{(by Eq. 3c)} \\
&= \Big( \sum_{\tau \in \mathcal{T}} \Phi_{i\tau}^{\mathcal{T}} r_\tau^T \Big)\, M_{ao}\, \Big( \sum_{h \in \mathcal{H}} s_h \pi_h \Phi_{jh}^{\mathcal{H}} \Big) \\
\Rightarrow\; P_{\mathcal{T},ao,\mathcal{H}} &= \Phi^{\mathcal{T}} R\, M_{ao}\, S\, \mathrm{diag}(\pi)\, \Phi^{\mathcal{H}\,T} \quad (9c)
\end{aligned}$$

where $\delta(O = o)$ is an indicator function for a particular observation. The parameters of the TPSR are defined in terms of a matrix $U$ that obeys the condition that $U^T \Phi^{\mathcal{T}} R$ is invertible (we can take $U$ to be the left singular vectors of $P_{\mathcal{T},\mathcal{H}}$), and in terms of the matrices $P_{\mathcal{H}}$, $P_{\mathcal{T},\mathcal{H}}$, and $P_{\mathcal{T},ao,\mathcal{H}}$. We also define a new vector $e$ such that $\Phi^{\mathcal{H}\,T} e^T = 1_k$; this means that the ones vector $1_k^T$ must be in the row space of $\Phi^{\mathcal{H}}$. Since $\Phi^{\mathcal{H}}$ is a matrix of features, we can always ensure that this is the case by requiring one of our features to be a constant: then one row of $\Phi^{\mathcal{H}}$ is $1_k^T$, and we can set $e^T = [\,1\; 0\; \ldots\; 0\,]^T$. Finally, we define the generalized TPSR parameters $b_1$, $b_\infty$, and $B_{ao}$ as follows:

$$b_1 \equiv U^T P_{\mathcal{T},\mathcal{H}}\, e^T = U^T \Phi^{\mathcal{T}} R S\, \mathrm{diag}(\pi)\, \Phi^{\mathcal{H}\,T} e^T = U^T \Phi^{\mathcal{T}} R S\, \mathrm{diag}(\pi)\, 1_k = (U^T \Phi^{\mathcal{T}} R)\, S\pi = (U^T \Phi^{\mathcal{T}} R)\, m_1 \quad (10a)$$

$$\begin{aligned}
b_\infty^T &\equiv P_{\mathcal{H}}^T\, (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= 1_k^T\, \mathrm{diag}(\pi)\, \Phi^{\mathcal{H}\,T} (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= 1_k^T S^\dagger S\, \mathrm{diag}(\pi)\, \Phi^{\mathcal{H}\,T} (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= 1_k^T S^\dagger (U^T \Phi^{\mathcal{T}} R)^{-1} (U^T \Phi^{\mathcal{T}} R)\, S\, \mathrm{diag}(\pi)\, \Phi^{\mathcal{H}\,T} (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= 1_k^T S^\dagger (U^T \Phi^{\mathcal{T}} R)^{-1}\, U^T P_{\mathcal{T},\mathcal{H}} (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= 1_k^T S^\dagger (U^T \Phi^{\mathcal{T}} R)^{-1} = m_\infty^T (U^T \Phi^{\mathcal{T}} R)^{-1} \quad (10b)
\end{aligned}$$

$$\begin{aligned}
B_{ao} &\equiv U^T P_{\mathcal{T},ao,\mathcal{H}} (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= U^T \Phi^{\mathcal{T}} R M_{ao} S\, \mathrm{diag}(\pi)\, \Phi^{\mathcal{H}\,T} (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= U^T \Phi^{\mathcal{T}} R M_{ao} (U^T \Phi^{\mathcal{T}} R)^{-1} (U^T \Phi^{\mathcal{T}} R)\, S\, \mathrm{diag}(\pi)\, \Phi^{\mathcal{H}\,T} (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= (U^T \Phi^{\mathcal{T}} R)\, M_{ao}\, (U^T \Phi^{\mathcal{T}} R)^{-1}\, U^T P_{\mathcal{T},\mathcal{H}} (U^T P_{\mathcal{T},\mathcal{H}})^\dagger \\
&= (U^T \Phi^{\mathcal{T}} R)\, M_{ao}\, (U^T \Phi^{\mathcal{T}} R)^{-1} \quad (10c)
\end{aligned}$$

Just as at the beginning of Section 3, we can estimate $\hat{P}_{\mathcal{H}}$, $\hat{P}_{\mathcal{T},\mathcal{H}}$, and $\hat{P}_{\mathcal{T},ao,\mathcal{H}}$, and then plug these matrices into Equations 10(a–c). Thus we see that if we work with characteristic and indicative features, and if our system is truly a TPSR of finite rank, our estimates $\hat{b}_1$, $\hat{b}_\infty$, and $\hat{B}_{ao}$ again converge to the true PSR parameters up to a linear transform.
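In code, the plug-in estimator amounts to a few lines of linear algebra. The sketch below is our own minimal illustration, not the authors' implementation: the function name `learn_tpsr` and the toy i.i.d. system at the bottom are invented, and the empirical matrices are assumed to be supplied as NumPy arrays.

```python
import numpy as np

def learn_tpsr(P_H, P_TH, P_Tao_H, n):
    """Spectral plug-in estimation of TPSR parameters (steps 1-3).
    P_H: (|H|,) empirical history vector.
    P_TH: (|T|, |H|) empirical test-history matrix.
    P_Tao_H: dict mapping (action, observation) -> (|T|, |H|) matrix.
    n: assumed rank (linear dimension) of the system."""
    # step 2: U = left singular vectors for the n largest singular values
    U, _, _ = np.linalg.svd(P_TH, full_matrices=False)
    U = U[:, :n]

    UP = U.T @ P_TH                           # (n, |H|)
    UP_pinv = np.linalg.pinv(UP)
    # step 3: plug-in parameter estimates (Equations 5a-c)
    b1 = UP @ np.ones(P_TH.shape[1])          # b1 = U^T P_TH 1_k
    binf = np.linalg.pinv(P_TH.T @ U) @ P_H   # binf = (P_TH^T U)^+ P_H
    B = {ao: U.T @ P @ UP_pinv for ao, P in P_Tao_H.items()}
    return b1, binf, B

# toy rank-1 system: one action, two i.i.d. observations with probs 0.7 / 0.3
p = np.array([0.7, 0.3])
P_H = p
P_TH = np.outer(p, p)                         # length-1 tests x indicative events
P_Tao_H = {('a', 0): 0.7 * P_TH, ('a', 1): 0.3 * P_TH}
b1, binf, B = learn_tpsr(P_H, P_TH, P_Tao_H, n=1)
print(binf @ B[('a', 0)] @ b1)                # ≈ 0.7 = Pr[o = 0], as in Equation 6
```

With exact matrices the recovered parameters reproduce the true event probabilities via $b_\infty^T B_{ao} b_1$ regardless of the sign ambiguity in the SVD, since the $(U^T R)$ factors cancel.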
3.2 Kernel Density Estimation for Continuous Observations

For continuous observations, we use Kernel Density Estimation (KDE) [23] to model the observation probability density function (PDF). We use a fraction of the training data points as kernel centers, placing one multivariate Gaussian kernel at each point.³ The KDE estimate of the observation PDF is a convex combination of these kernels; since each kernel integrates to 1, this estimate also integrates to 1. KDE theory [23] tells us that, with the correct kernel weights, as the number of kernel centers and the number of samples go to infinity and the kernel bandwidth goes to zero (at appropriate rates), the KDE estimate converges to the observation PDF in $L_1$ norm. The kernel density estimate is completely determined by the normalized vector of kernel weights; therefore, if we can estimate this vector accurately, our estimate of the observation PDF will converge to the true observation PDF as well. Hence our goal is to predict the correct expected value of this normalized kernel vector given all past observations.

In the continuous-observation case, we can still write our latent-state update in the same form, using a matrix $B_{ao}$; however, rather than learning each of the uncountably many $B_{ao}$ matrices separately, we learn one base operator per kernel center, and use convex combinations of these base operators to compute observable operators as needed. For more details on practical aspects of the learning procedure with continuous observations, see Section 5.2.

³ We use a general elliptical covariance matrix, chosen by PCA: that is, we use a spherical covariance after projecting onto the eigenvectors of the covariance matrix of the observations, and scaling by the square roots of the eigenvalues.
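As a sketch of this construction (our own minimal illustration: a fixed spherical bandwidth stands in for the PCA-scaled elliptical covariance described above, and all names and values are invented), an observation is encoded as a normalized vector of Gaussian kernel responses, and its observable operator is the corresponding convex combination of learned base operators:

```python
import numpy as np

def kernel_weights(obs, centers, bandwidth):
    """Normalized Gaussian kernel responses of an observation at each center."""
    sq_dists = np.sum((centers - obs) ** 2, axis=1)
    w = np.exp(-0.5 * sq_dists / bandwidth ** 2)
    return w / w.sum()

def observation_operator(obs, centers, bandwidth, base_ops):
    """B_ao for a continuous observation: convex combination of the
    base operators learned at the kernel centers."""
    w = kernel_weights(obs, centers, bandwidth)
    return np.tensordot(w, base_ops, axes=1)   # sum_j w_j * B_j

# toy usage: 3 kernel centers in R^2, a 2-dimensional TPSR state
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
base_ops = np.stack([np.eye(2) * (j + 1) for j in range(3)])
B = observation_operator(np.array([0.1, 0.1]), centers, 0.5, base_ops)
```

Because the weights are nonnegative and sum to 1, the resulting operator always lies in the convex hull of the base operators, mirroring the convex combination used in the model.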
4. PLANNING IN TPSRS

The primary motivation for modeling a controlled dynamical system is to reason about the effects of taking a sequence of actions in the system. The TPSR model can be augmented for this purpose by specifying a reward function for taking an action $a$ in state $b$:

$$R(b, a) = \eta_a^T b \quad (11)$$

where $\eta_a \in \mathbb{R}^n$ is the linear reward function for taking action $a$. Given this function and a discount factor $\gamma$, the planning problem for TPSRs is to find a policy that maximizes the expected discounted sum of rewards $E\big[\sum_t \gamma^t R(b_t, a_t)\big]$. The optimal policy can be compactly represented using the optimal value function $V^*$, which is defined recursively as:

$$V^*(b) = \max_{a \in A} \left[ R(b, a) + \gamma \sum_{o \in O} p(o \mid b, a)\, V^*(b_{ao}) \right] \quad (12)$$

where $b_{ao}$ is the state obtained from $b$ after executing action $a$ and observing $o$. When optimized exactly, this value function is always piecewise linear and convex (PWLC) in the state, and has finitely many pieces in finite-horizon planning problems.⁴ The optimal action is then obtained by taking the argmax instead of the max in Equation 12.

Exact value iteration in POMDPs or TPSRs optimizes the value function over all possible belief or state vectors. Computing the exact value function is problematic because the number of sequences of actions that must be considered grows exponentially with the planning horizon: the "curse of history." Approximate point-based planning techniques (see below) attempt to calculate the best sequence of actions only at some finite set of belief points. Unfortunately, in high dimensions, approximate planning techniques have difficulty adequately sampling the space of possible beliefs: the "curse of dimensionality." Because TPSRs often admit a compact low-dimensional representation, approximate point-based planning techniques can work well in these models.
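To make the backup concrete in TPSR terms, note that $p(o \mid b, a) = b_\infty^T B_{ao} b$ while the successor state is $b_{ao} = B_{ao} b / (b_\infty^T B_{ao} b)$, so when the PWLC value function is represented by a set of linear pieces the normalizer cancels inside the sum. The following sketch of one backup over a fixed set of state vectors is our own illustration (hypothetical names throughout; the single-action toy model at the bottom is invented):

```python
import numpy as np

def point_based_backup(Gamma, states, B, eta, gamma, actions, obs):
    """One Bellman backup of Equation 12 at a finite set of TPSR states.
    Gamma: linear pieces representing the PWLC value function V_t.
    B: dict mapping (action, observation) to the operator B_ao.
    eta: dict mapping each action to its linear reward vector.
    The p(o | b, a) weight cancels against the state normalization, so
    each successor is scored with the unnormalized vector B_ao b."""
    new_Gamma = []
    for b in states:
        best_alpha, best_val = None, -np.inf
        for a in actions:
            alpha = eta[a].copy()
            for o in obs:
                Bao = B[(a, o)]
                succ = Bao @ b
                # best existing piece at the (unnormalized) successor state
                j = max(range(len(Gamma)), key=lambda i: Gamma[i] @ succ)
                alpha = alpha + gamma * (Bao.T @ Gamma[j])
            if alpha @ b > best_val:
                best_alpha, best_val = alpha, alpha @ b
        new_Gamma.append(best_alpha)
    return new_Gamma

# toy single-action, one-dimensional model: reward 1, two equiprobable observations
I = np.eye(1)
B = {('a', 0): 0.5 * I, ('a', 1): 0.5 * I}
eta = {'a': np.array([1.0])}
states = [np.array([1.0])]
Gamma = [np.zeros(1)]
for _ in range(2):
    Gamma = point_based_backup(Gamma, states, B, eta, 0.9, ['a'], [0, 1])
# after two backups the value at the state is 1 + 0.9 = 1.9
```

Each retained linear piece is a lower bound on the value function everywhere, which is exactly the property that point-based methods exploit.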
Point-Based Value Iteration (PBVI) [17] is an efficient approximation of exact value iteration that performs value backup steps on a finite set of heuristically chosen belief points rather than over the entire belief simplex. PBVI exploits the fact that the value function is PWLC. A linear lower bound on the value function at one point b can be used as a lower bound at nearby points; this insight allows the value function to be approximated with a finite set of hyperplanes (often called α-vectors), one for each point. Although PBVI was designed for POMDPs, the approach has been generalized to PSRs [8]. Formally, given some set of points B = {b_0, …, b_k} in the TPSR state space, we recursively compute the value function and linear lower bounds at only these points. The approximation of the value function can be represented by a set Γ = {α_0, …, α_k} such that each α_i corresponds to the optimal value function at at least one prediction vector b_i. To obtain the approximate value function V_{t+1}(b) from the previous value function V_t(b), we apply the recursive backup operator on points in B: if V_t(b) = max_{α∈Γ_t} α^T b, then

    V_{t+1}(b) = max_{a∈A} [ R(b, a) + γ Σ_{o∈O} max_{α∈Γ_t} α^T B_ao b ]    (13)

In addition to being tractable on much larger-scale planning problems than exact value iteration, PBVI comes with theoretical guarantees in the form of error bounds that are low-order polynomials in the degree of approximation, the range of reward values, and the discount factor γ [17, 8]. Perseus [28, 11] is a variant of PBVI that, at each time step, updates the value function over a small randomized subset of a large set of reachable belief points.

⁴This observation follows from the fact that a TPSR is a linear transformation of a PSR, and PSRs, like POMDPs, have PWLC value functions [11].
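The point-based backup of Equation 13 can be sketched as follows: for each belief point we build one backed-up α-vector, choosing for every (a, o) pair the existing α-vector that scores best in the direction B_ao b. The toy operators, rewards, and belief points are illustrative stand-ins; Perseus would apply the same backup to a random subset of the points at each step.

```python
import numpy as np

rng = np.random.default_rng(2)
n, A, O = 3, 2, 2                       # toy dimensions
eta = rng.normal(size=(A, n))           # linear rewards, R(b, a) = eta_a . b
Bops = rng.uniform(0.1, 0.4, size=(A, O, n, n))   # stand-in operators B_ao
gamma = 0.95
points = [rng.dirichlet(np.ones(n)) for _ in range(10)]   # heuristic point set B

def point_based_backup(Gamma):
    """One backup of Eq. 13: one new alpha-vector per belief point."""
    new_Gamma = []
    for b in points:
        best = None
        for a in range(A):
            alpha_a = eta[a].copy()
            for o in range(O):
                # best existing alpha-vector in the direction B_ao b
                star = max(Gamma, key=lambda al: al @ (Bops[a, o] @ b))
                alpha_a = alpha_a + gamma * Bops[a, o].T @ star
            if best is None or alpha_a @ b > best @ b:
                best = alpha_a
        new_Gamma.append(best)
    return new_Gamma

def value(Gamma, b):
    return max(al @ b for al in Gamma)  # PWLC value: max over hyperplanes

Gamma0 = [np.zeros(n)]                  # trivial initial lower bound
Gamma1 = point_based_backup(Gamma0)
```

Starting from the zero α-vector, the first backup simply selects the best immediate-reward hyperplane at each point, after which further backups propagate discounted future value.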
By only updating a subset of belief points, Perseus can achieve a computational advantage over plain PBVI in some domains. We use Perseus in this paper due to its speed and simplicity of implementation.

5. EXPERIMENTAL RESULTS

We have introduced a novel algorithm for learning TPSRs directly from data, as well as a kernel-based extension for modeling continuous observations, and discussed how to plan in the learned model. First, we demonstrate the viability of this approach to planning in a challenging non-linear, partially observable, controlled domain by learning a model directly from sensor inputs and then "closing the loop" by planning in the learned model. Second, unlike previous attempts to learn PSRs, which either lack planning results [19, 32] or compare policies only within the learned system [33], we compare our resulting policy to a bound on the best possible solution in the original system and demonstrate that the policy is close to optimal.

5.1 The Autonomous Robot Domain

The simulated autonomous robot domain consists of a simple 45 × 45 unit square arena with a central obstacle and brightly colored walls (Figure 1(A-B)). We modeled the robot as a sphere of radius 2 units. The robot can move around the floor of the arena and rotate to face in any direction. The robot has a simulated 16 × 16 pixel color camera, whose focal plane is located one unit in front of the robot's center of rotation. The robot's visual field was 45° in both azimuth and elevation, providing the robot with an angular resolution of ~2.8° per pixel. Images on the sensor matrix at any moment were simulated by a non-linear perspective transformation of the projected values arising from the robot's position and orientation in the environment at
that time. The resulting 768-element pattern of unprocessed RGB values was the only input to the robot (images were not preprocessed to extract features), and each action produced a new set of pixel values. The robot was able to move forward 1 or 0 units, and simultaneously rotate 15°, −15°, or 0°, resulting in 6 unique actions. In the real world, friction, uneven surfaces, and other factors confound precisely predictable movements. To simulate this uncertainty, a small amount of Gaussian noise was added to the translation and rotation components of the actions. The robot was allowed to occupy any real-valued (x, y, θ) pose in the environment, but was not allowed to intersect walls. In case of a collision, we interrupted the current motion just before the robot intersected an obstacle, simulating an inelastic collision.

Figure 1: Learning the Autonomous Robot Domain. (A) The robot uses visual sensing to traverse a square domain with multi-colored walls and a central obstacle. Examples of images recorded by the robot occupying two different positions in the environment are shown at the bottom of the figure. (B) A to-scale 3-dimensional view of the environment. (C) The 2nd and 3rd dimensions of the learned subspace (the first dimension primarily contained normalization information). Each point is the embedding of a single history, displayed with color equal to the average RGB color in the first image in the highest-probability test. (D) The same points as in (C), projected onto the environment's geometric space.

5.2 Learning a Model

We learn our model from a sample of 10,000 short trajectories, each containing 7 action-observation pairs. We generate each trajectory by starting from a uniformly randomly sampled position in the environment and executing a uniform random sequence of actions.
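The data-collection protocol just described can be sketched as follows. The motion model below is a schematic stand-in (the actual simulator renders camera images and handles wall collisions, neither of which is reproduced here); the 6-element action set and trajectory length follow the text, while the noise scale and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
# 6 unique actions: forward 1 or 0 units, rotate +15, -15, or 0 degrees
ACTIONS = [(f, r) for f in (1.0, 0.0) for r in (15.0, -15.0, 0.0)]

def step(pose, action, noise=0.1):
    """Schematic motion model with Gaussian noise on translation and rotation."""
    x, y, th = pose
    fwd, rot = action
    fwd += rng.normal(0.0, noise)
    rot += rng.normal(0.0, noise)
    th = (th + rot) % 360.0
    rad = np.deg2rad(th)
    return (x + fwd * np.cos(rad), y + fwd * np.sin(rad), th)

def sample_trajectory(length=7, arena=45.0):
    """Uniformly random start pose, then a uniform random action sequence."""
    pose = (rng.uniform(0, arena), rng.uniform(0, arena), rng.uniform(0, 360.0))
    traj = []
    for _ in range(length):
        a = ACTIONS[rng.integers(len(ACTIONS))]
        pose = step(pose, a)
        traj.append((a, pose))          # in the paper, the observation is an image
    return traj

trajectories = [sample_trajectory() for _ in range(100)]   # 10,000 in the paper
```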
We used the first l = 2000 trajectories to generate kernel centers, and the remaining w = 8000 to estimate the matrices P_H, P_{T,H}, and P_{T,ao,H}. To define these matrices, we need to specify a set of indicative features, a set of observation kernel centers, and a set of characteristic features. We use Gaussian kernels to define our indicative and characteristic features, in a similar manner to the Gaussian kernels described above for observations; our analysis allows us to use arbitrary indicative and characteristic features, but we found Gaussian kernels to be convenient and effective. Note that the resulting features over tests and histories are just features; unlike the kernel centers defined over observations, there is no need to let the kernel width approach zero, since we are not attempting to learn accurate PDFs over the histories and tests in H and T.

In more detail, we define a set of 2000 indicative kernels, each one centered at a sequence of 3 observations from the initial segment of one of our trajectories. We choose the kernel covariance using PCA on these sequences of observations, just as described for single observations in Section 3.2. We then generate our indicative features for a new sequence of three observations by evaluating each indicative kernel at the new sequence, and normalizing so that the vector of features sums to one. Similarly, we define 2000 characteristic kernels, each one centered at a sequence of 3 observations from the end of one of our sample trajectories, choose a kernel covariance, and define our characteristic feature vector by evaluating each kernel at a new observation sequence and normalizing. The initial distribution ω is, therefore, the distribution obtained by initializing uniformly and taking 3 random actions.
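A sketch of this feature construction, with hypothetical names: stack a 3-observation window into a single vector, evaluate a Gaussian kernel with a PCA-chosen covariance at each center, and normalize the responses to sum to one. Dimensions are scaled far down from the paper's 768-pixel images and 2000 kernels.

```python
import numpy as np

def pca_metric(windows):
    """Inverse-covariance metric from PCA of the stacked windows
    (equivalent to whitening as in Section 3.2); regularized for stability."""
    mu = windows.mean(axis=0)
    cov = np.cov(windows - mu, rowvar=False) + 1e-9 * np.eye(windows.shape[1])
    return np.linalg.inv(cov)

def kernel_features(window, centers, M):
    """Normalized Gaussian kernel responses of one window against all centers."""
    d = centers - window
    logk = -0.5 * np.einsum('ij,jk,ik->i', d, M, d)   # Mahalanobis distances
    k = np.exp(logk - logk.max())                     # stabilized
    return k / k.sum()                                # features sum to one

rng = np.random.default_rng(4)
obs_dim, n_kernels = 6, 50              # 768 and 2000, respectively, in the paper
windows = rng.normal(size=(500, 3 * obs_dim))   # stacked 3-observation sequences
centers = windows[:n_kernels]           # kernels centered at training windows
M = pca_metric(windows)
phi = kernel_features(windows[100], centers, M)
```

The same construction serves for both indicative features (windows from the start of a trajectory) and characteristic features (windows from the end).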
Finally, we define 500 observation kernels, each one centered at a single observation from the middle of one of our sample trajectories, and replace each observation by its corresponding vector of normalized kernel weights.

Next, we construct the matrices P̂_H, P̂_{T,H}, and P̂_{T,ao,H}. As defined above, each element of P̂_H is the empirical expectation (over our 8,000 training trajectories) of the corresponding element of the indicative feature vector: element i is (1/w) Σ_{t=1}^{w} φ^H_{it}, where φ^H_{it} is the i-th indicative feature evaluated at the current history at time t. Similarly, each element of P̂_{T,H} is the empirical expectation of the product of one indicative feature and one characteristic feature: element (i, j) is (1/w) Σ_{t=1}^{w} φ^T_{it} φ^H_{jt}. Once we have constructed P̂_{T,H}, we can compute Û as the matrix of left singular vectors of P̂_{T,H}. One of the advantages of subspace identification is that the complexity of the model can be tuned by selecting the number of singular vectors in Û. To learn an exact TPSR, we should pick the first n singular vectors that correspond to singular values in P̂_{T,H} greater than some cutoff that varies with the noise resolution of our data. However, we may wish to pick a smaller set of singular vectors; doing so will produce a more compact TPSR at the possible loss of prediction quality. We chose n = 5, the smallest TPSR that was able to produce high-quality policies (see Section 5.4 below).

[Figure 2 bar chart (panel D) values: Optimal 13.9, Greedy Perseus 18.2, Random Walk 507.8 mean actions.]

Figure 2: Planning in the Learned State Space. (A) The value function computed for each embedded point; lighter indicates higher value. (B) Policies executed in the learned subspace.
The red, green, magenta, and yellow paths correspond to the policy executed by a robot with starting positions facing the red, green, magenta, and yellow walls, respectively. (C) The paths taken by the robot in geometric space while executing the policy. Each of the paths corresponds to the path of the same color in (B). The darker circles indicate the starting and ending position of each path, and the tick mark in each circle indicates the robot's orientation. (D) Mean number of actions in the path from 100 randomly sampled start positions to the target image (facing the blue wall). The first bar (left) is the mean number of actions in the optimal solution found by A* search in the robot's configuration space. The second bar (center) is the mean number of actions taken by executing the policy computed by Perseus in the learned model (the asterisk indicates that this mean was computed only over the 78 successful paths). The last bar (right) is the mean number of actions required to find the target with a random policy. The graph indicates that the policy computed from the learned TPSR is close to optimal.

Finally, rather than computing P̂_{T,ao,H} directly, we instead compute Û^T P̂_{T,ao,H} for each pair (a, o): the latter matrices are much smaller, and in our experiments we saved substantially on both memory and runtime by avoiding construction of the larger matrices. To construct Û^T P̂_{T,ao,H}, we restrict to those training trajectories in which the action at the middle time step (i.e., step 4) is a. Then, each element of P̂_{T,ao,H} is the empirical expectation (among the restricted set of trajectories) of the product of one indicative feature, one characteristic feature, and element o of the observation kernel vector. So,

    Û^T P̂_{T,ao,H} = (1/w_a) Σ_{t=1}^{w_a} (Û^T φ^T_t)(φ^H_t)^T (1/Z_t) K(o_t − o)    (14)

where K(·
) is the kernel function and Z_t is the kernel normalization constant, computed by summing over the 500 observation kernels for each o_t. Given the matrices P̂_H, P̂_{T,H}, and P̂_{T,ao,H}, we can compute the TPSR parameters using the equations in Section 3.

5.3 Qualitative Evaluation

Having learned the parameters of the TPSR, the model can be used for prediction, filtering, and planning in the autonomous robot domain. We first evaluated the model qualitatively by projecting the sets of histories in the training data onto the learned TPSR state space: Û^T P̂_H. We colored each datapoint according to the average of the red, green, and blue components of the highest-probability observation following the projected history. The features of the low-dimensional embedding clearly capture the topology of the major features of the robot's visual environment (Figure 1(C-D)), and continuous paths in the environment translate into continuous paths in the latent space (Figure 2(B)).

5.4 Planning in the Learned Model

To test the quality of the learned model, we set up a navigation problem where the robot was required to plan a set of actions in order to reach a goal image (looking directly at the blue wall). We specified a large reward (1000) for this observation, a reward of −1 for colliding with a wall, and 0 for every other observation. We next learned a reward function by linear regression from the histories embedded in the learned TPSR state space to the reward specified at each image that followed an embedded history. We used the reward function to compute an approximate value function using the Perseus algorithm with discount factor γ = 0.8, a prediction horizon of 10 steps, and the 8000 embedded histories as the set of belief points. The learned value function is displayed in Figure 2(A).
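The reward-regression step can be sketched as an ordinary least-squares fit from embedded states to observed rewards. The data below are synthetic and noiseless (so the weights are recovered exactly); in the paper the regression targets are the rewards of the images that followed each embedded history, and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 5, 1000                          # n = 5 latent dimensions, N embedded histories
states = rng.normal(size=(N, n))        # rows: histories embedded in the TPSR state space
true_eta = rng.normal(size=n)           # ground-truth linear reward (for this synthetic test)
rewards = states @ true_eta             # reward observed after each embedded history

# Least-squares fit of the reward weights eta, so that R(b) = eta . b
eta_hat, *_ = np.linalg.lstsq(states, rewards, rcond=None)
```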
Once the approximate value function has been learned, and an initial belief specified, the robot greedily chooses the action which maximizes the expected value. The initial beliefs were computed by starting with b_1 and then incorporating 3 random action-observation pairs. Examples of paths planned in the learned model are presented in Figure 2(B); the same paths are shown in geometric space (recall that the robot only has access to images; the geometric space is never observed by the robot) in Figure 2(C). Note that there is a set of valid target positions in the environment, since one can receive an identical close-up image of a blue wall from anywhere along the corresponding edge of the environment.

The reward function encouraged the robot to navigate to a specific set of points in the environment; therefore, the planning problem can be viewed as solving a shortest-path problem. Even though we do not encode this intuition into our algorithm, we can use it to quantitatively evaluate the performance of the policy in the original system. First, we randomly sampled 100 initial histories in the environment and asked the robot to plan a path based on its learned policy. The robot was able to reach the goal in 78 of the trials. In 22 trials, the robot got stuck, repeatedly taking alternating actions whose effects cancelled (for example, alternating between turning −15° and 15°).⁵ When the robot was able to reach the goal, we compared the number of actions taken both to the minimal path, calculated by A* search in the robot's configuration space given the true underlying position, and to a random policy.
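Greedy action selection from the learned value function can be sketched as follows: filter the state with B_ao updates (here seeded with 3 random action-observation pairs, as in the text), then pick the action maximizing a one-step backup against the current set of α-vectors. The operators, α-vectors, and normalizing vector (a stand-in b_inf) are all illustrative assumptions rather than learned quantities.

```python
import numpy as np

rng = np.random.default_rng(6)
n, A, O = 4, 3, 2                       # toy dimensions
eta = rng.normal(size=(A, n))           # linear reward weights
Bops = rng.uniform(0.1, 0.4, size=(A, O, n, n))   # stand-in operators B_ao
Gamma = [rng.normal(size=n) for _ in range(5)]    # alpha-vectors (as from Perseus)
b_inf = np.ones(n)
gamma = 0.8

def update(b, a, o):
    """Filter: b_ao = B_ao b, normalized so b_inf . b_ao = 1."""
    v = Bops[a, o] @ b
    return v / (b_inf @ v)

def greedy_action(b):
    """Argmax of the one-step backed-up value (the argmax form of Eq. 13)."""
    def q(a):
        num = np.array([b_inf @ (Bops[a, o] @ b) for o in range(O)])
        p = num / num.sum()             # simplified p(o | b, a)
        return eta[a] @ b + gamma * sum(
            p[o] * max(al @ (Bops[a, o] @ b / num[o]) for al in Gamma)
            for o in range(O))
    return max(range(A), key=q)

# Initial belief: start from a uniform stand-in for b_1, then filter in
# 3 random action-observation pairs
b = np.full(n, 1.0 / n)
for _ in range(3):
    b = update(b, int(rng.integers(A)), int(rng.integers(O)))
a_star = greedy_action(b)
```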
Note that comparison to the optimal policy is somewhat unfair: in order to recover the optimal policy, the robot would have to know its true underlying position (which is not available to it), our model assumptions would have to be exact, and the algorithm would need an unlimited amount of training data. The results, summarized in Figure 2(D), indicate that the TPSR policy is close to the optimal policy in the original system. We think that this result is remarkable, especially given that previous approaches have encountered significant difficulty modeling continuous domains [12] and domains with similarly high levels of complexity [33].

6. CONCLUSIONS

We have presented a novel consistent subspace identification algorithm that simultaneously solves the discovery and learning problems for TPSRs. In addition, we provided two extensions to the learning algorithm that are useful in practice while maintaining consistency: characteristic and indicative features only require one to know relevant features of tests and histories, rather than sets of core tests and histories, while kernel density estimation can be used to find observable operators when observations are real-valued. We also showed how point-based approximate planning techniques can be used to solve the planning problem in the learned model. We demonstrated the representational capacity of our model and the effectiveness of our learning algorithm by learning a very compact model from simulated autonomous robot vision data. We closed the loop by successfully planning with the learned models, using Perseus to approximately compute the value function and optimal policy for a navigation task. To our knowledge this is the first instance of learning a model for a simulated robot in a partially observable environment using a consistent algorithm and successfully planning in the learned model.
We compare the policy generated by our model to a bound on the best possible value, and determine that our policy is close to optimal. We believe the spectral PSR learning algorithm presented here, and subspace identification procedures for learning PSRs in general, can extend planning under uncertainty to scenarios that were previously intractable for autonomous agents. We believe that this improvement is partly due to the greater representational power of PSRs as compared to POMDPs, and partly due to the efficient and statistically consistent nature of the learning method.

⁵In an actual application, we believe that we could avoid getting stuck by performing a short lookahead or simply by randomizing our policy; for purposes of comparison, however, we report results for the greedy policy.

7. REFERENCES

[1] K. J. Åström. Optimal control of Markov decision processes with incomplete state estimation. Journal of Mathematical Analysis and Applications, 10:174–205, 1965.
[2] J. Bilmes. A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR-97-021, 1997.
[3] M. Bowling, P. McCracken, M. James, J. Neufeld, and D. Wilkinson. Learning predictive state representations using non-blind policies. In Proc. ICML, 2006.
[4] A. R. Cassandra, L. P. Kaelbling, and M. L. Littman. Acting optimally in partially observable stochastic domains. In Proc. AAAI, 1994.
[5] E. Even-Dar, S. M. Kakade, and Y. Mansour. Planning in POMDPs using multiplicity automata. In Proc. UAI, 2005.
[6] H. Jaeger, M. Zhao, and A. Kolling. Efficient training of OOMs. In Proc. NIPS, 2005.
[7] D. Hsu, S. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In Proc. COLT, 2009.
[8] M. T. Izadi and D. Precup. Point-based planning for predictive state representations. In Proc. Canadian AI, 2008.
[9] H. Jaeger.
Observable operator models for discrete stochastic time series. Neural Computation, 12:1371–1398, 2000.
[10] M. James and S. Singh. Learning and discovery of predictive state representations in dynamical systems with reset. In Proc. ICML, 2004.
[11] M. R. James, T. Wessling, and N. A. Vlassis. Improving approximate value iteration using memories and predictive state representations. In Proc. AAAI, 2006.
[12] N. K. Jong and P. Stone. Towards employing PSRs in a continuous domain. Technical Report UT-AI-TR-04-309, University of Texas at Austin, 2004.
[13] M. Littman, R. Sutton, and S. Singh. Predictive representations of state. In Advances in Neural Information Processing Systems (NIPS), 2002.
[14] M. Zhao, H. Jaeger, and M. Thon. A bound on modeling error in observable operator models and an associated learning algorithm. Neural Computation.
[15] A. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1995.
[16] P. McCracken and M. Bowling. Online discovery and learning of predictive state representations. In Proc. NIPS, 2005.
[17] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proc. IJCAI, 2003.
[18] J. Pineau, G. Gordon, and S. Thrun. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research (JAIR), 27:335–380, 2006.
[19] M. Rosencrantz, G. J. Gordon, and S. Thrun. Learning low dimensional predictive representations. In Proc. ICML, 2004.
[20] S. Ross and J. Pineau. Model-based Bayesian reinforcement learning in large structured domains. In Proc. UAI, 2008.
[21] G. Shani, R. I. Brafman, and S. E. Shimony. Model-based online learning of POMDPs. In Proc. ECML, 2005.
[22] S. M. Siddiqi, B. Boots, and G. J. Gordon. Reduced-rank hidden Markov models. http://arxiv.org/abs/0910.0902, 2009.
[23] B. W. Silverman.
Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986.
[24] S. Singh, M. James, and M. Rudary. Predictive state representations: A new theory for modeling dynamical systems. In Proc. UAI, 2004.
[25] S. Singh, M. L. Littman, N. K. Jong, D. Pardoe, and P. Stone. Learning predictive state representations. In Proc. ICML, 2003.
[26] S. Soatto and A. Chiuso. Dynamic data factorization. Technical report, UCLA, 2001.
[27] E. J. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, 1971.
[28] M. T. J. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195–220, 2005.
[29] P. Van Overschee and B. De Moor. Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer, 1996.
[30] E. Wiewiora. Learning predictive representations from a history. In Proc. ICML, 2005.
[31] D. Wingate. Exponential Family Predictive Representations of State. PhD thesis, University of Michigan, 2008.
[32] D. Wingate and S. Singh. On discovery and learning of models with predictive representations of state for agents with continuous actions and observations. In Proc. AAMAS, 2007.
[33] D. Wingate and S. Singh. Efficiently learning linear-linear exponential family predictive representations of state. In Proc. ICML, 2008.
[34] B. Wolfe, M. James, and S. Singh. Learning predictive state representations in dynamical systems without reset. In Proc. ICML, 2005.