Least Squares Temporal Difference Actor-Critic Methods with Applications to Robot Motion Control



Reza Moazzez Estanjini†, Xu Chu Ding‡, Morteza Lahijanian‡, Jing Wang†, Calin A. Belta‡, and Ioannis Ch. Paschalidis§

Abstract—We consider the problem of finding a control policy for a Markov Decision Process (MDP) to maximize the probability of reaching some states while avoiding some other states. This problem is motivated by applications in robotics, where such problems naturally arise when probabilistic models of robot motion are required to satisfy temporal logic task specifications. We transform this problem into a Stochastic Shortest Path (SSP) problem and develop a new approximate dynamic programming algorithm to solve it. This algorithm is of the actor-critic type and uses a least-squares temporal difference learning method. It operates on sample paths of the system and optimizes the policy within a pre-specified class parameterized by a parsimonious set of parameters. We show its convergence to a policy corresponding to a stationary point in the parameters' space. Simulation results confirm the effectiveness of the proposed solution.

Index Terms—Markov Decision Processes, dynamic programming, actor-critic methods, robot motion control, robotics.

I. INTRODUCTION

Markov Decision Processes (MDPs) have been widely used in a variety of application domains. In particular, they have been increasingly used to model and control autonomous agents subject to noise in their sensing and actuation, or uncertainty in the environment they operate in. Examples include: unmanned aircraft [1], ground robots [2], and steering of medical needles [3]. In these studies, the underlying motion of the system cannot be predicted with certainty, but it can be obtained from the sensing and actuation model through a simulator or empirical trials, providing transition probabilities.
Recently, the problem of controlling an MDP from a temporal logic specification has received a lot of attention. Temporal logics such as Linear Temporal Logic (LTL) and Computational Tree Logic (CTL) are appealing as they provide formal, high-level languages in which the behavior of the system can be specified (see [4]).

* Research partially supported by the NSF under grant EFRI-0735974, by the DOE under grant DE-FG52-06NA27490, by the ODDR&E MURI10 program under grant N00014-10-1-0952, and by ONR MURI under grant N00014-09-1051.
† Reza Moazzez Estanjini and Jing Wang are with the Division of Systems Eng., Boston University, 8 St. Mary's St., Boston, MA 02215, email: {reza,wangjing}@bu.edu.
‡ Xu Chu Ding, Morteza Lahijanian, and Calin A. Belta are with the Dept. of Mechanical Eng., Boston University, 15 St. Mary's St., Boston, MA 02215, email: {xcding,morteza,cbelta}@bu.edu.
§ Ioannis Ch. Paschalidis is with the Dept. of Electrical & Computer Eng., and the Division of Systems Eng., Boston University, 8 St. Mary's St., Boston, MA 02215, email: yannisp@bu.edu. Corresponding author.

In the context of MDPs, providing probabilistic guarantees means finding optimal policies that maximize the probabilities of satisfying these specifications. In [2], [5], it has been shown that the problem of finding an optimal policy that maximizes the probability of satisfying a temporal logic formula can be naturally translated to one of maximizing the probability of reaching a set of states in the MDP. Such problems are referred to as Maximal Reachability Probability (MRP) problems. It is known [3] that they are equivalent to Stochastic Shortest Path (SSP) problems, which belong to a standard class of infinite horizon problems in dynamic programming. However, as suggested in [2], [5], these problems usually involve MDPs with large state spaces.
For example, in order to synthesize an optimal policy for an MDP satisfying an LTL formula, one needs to solve an MRP problem on a much larger MDP, which is the product of the original MDP and an automaton representing the formula. Thus, computing the exact solution can be computationally prohibitive for realistically-sized settings. Moreover, in some cases, the system of interest is so complex that it is not feasible to determine transition probabilities for all actions and states explicitly. Motivated by these limitations, in this paper we develop a new approximate dynamic programming algorithm to solve SSP MDPs and we establish its convergence. The algorithm is of the actor-critic type and uses a Least Squares Temporal Difference (LSTD) learning method. Our proposed algorithm is based on sample paths, and thus only requires transition probabilities along the sampled paths and not over the entire state space.

Actor-critic algorithms are typically used to optimize some Randomized Stationary Policy (RSP) using policy gradient estimation. RSPs are parameterized by a parsimonious set of parameters and the objective is to optimize the policy with respect to these parameters. To this end, one needs to estimate appropriate policy gradients, which can be done using learning methods that are much more efficient than computing a cost-to-go function over the entire state-action space. Many different versions of actor-critic algorithms have been proposed and shown to be quite efficient for various applications (e.g., in robotics [6] and navigation [7], power management of wireless transmitters [8], biology [9], and optimal bidding for electricity generation companies [10]).
A particularly attractive design of the actor-critic architecture was proposed in [11], where the critic estimates the policy gradient using sequential observations from a sample path while the actor updates the policy at the same time, although at a slower time-scale. It was proved that the estimate of the critic tracks the slowly varying policy asymptotically under suitable conditions. A centerpiece of these conditions is a relationship between the actor step-size and the critic step-size, which will be discussed later. The critic of [11] uses first-order variants of the Temporal Difference (TD) algorithm (TD(1) and TD(λ)). However, it has been shown that the least squares methods, LSTD (Least Squares TD) and LSPE (Least Squares Policy Evaluation), are superior in terms of convergence rate (see [12], [13]). LSTD and LSPE were first proposed for discounted cost problems in [12] and [14], respectively. Later, [13] showed that the convergence rate of LSTD is optimal. Their results clearly demonstrated that LSTD converges much faster and more reliably than TD(1) and TD(λ).

Motivated by these findings, we propose an actor-critic algorithm that adopts LSTD learning methods tailored to SSP problems, while at the same time maintaining the concurrent update architecture of the actor and the critic. (Note that [15] also used LSTD in an actor-critic method, but the actor had to wait for the critic to converge before making each policy update.) To illustrate salient features of the approach, we present a case study where a robot in a large environment is required to satisfy a task specification of "go to a set of goal states while avoiding a set of unsafe states." (We note that more complex task specifications can be directly converted to MRP problems as shown in [2], [5].)

The rest of the paper is organized as follows. We formulate the problem in Sec. II.
The LSTD actor-critic algorithm with concurrent updates is presented in Sec. III, where the convergence of the algorithm is shown. A case study is presented in Sec. V. We conclude the paper in Sec. VI.

Notation: We use bold letters to denote vectors and matrices; typically vectors are lower case and matrices upper case. Vectors are assumed to be column vectors unless explicitly stated otherwise. Transpose is denoted by prime. For any m × n matrix A, with rows a_1, ..., a_m ∈ R^n, v(A) denotes the column vector (a_1, ..., a_m). ‖·‖ stands for the Euclidean norm and ‖·‖_θ is a special norm in the MDP state-action space that we will define later. 0 denotes a vector or matrix with all components set to zero and I is the identity matrix. |S| denotes the cardinality of a set S.

II. PROBLEM FORMULATION

Consider an SSP MDP with finite state and action spaces. Let k denote time, X denote the state space with cardinality |X|, and U denote the action space with cardinality |U|. Let x_k ∈ X and u_k ∈ U be the state of the system and the action taken at time k, respectively. Let g(x_k, u_k) be the one-step cost of applying action u_k while the system is at state x_k. Let x_0 and x* denote the initial state and the special cost-free termination state, respectively. Let p(j|x_k, u_k) denote the state transition probabilities (which are typically not explicitly known); that is, p(j|x_k, u_k) is the probability of transitioning from state x_k to state j given that action u_k is taken while the system is at state x_k. A policy μ is said to be proper if, when using this policy, there is a positive probability that x* will be reached after at most |X| transitions, regardless of the initial state x_0. We make the following assumption.

Assumption A: There exists a proper stationary policy.
The policy candidates are assumed to belong to a parameterized family of Randomized Stationary Policies (RSPs) {μ_θ(u|x) | θ ∈ R^n}. That is, given a state x ∈ X and a parameter θ, the policy applies action u ∈ U with probability μ_θ(u|x). Define the expected total cost ᾱ(θ) to be lim_{t→∞} E{ Σ_{k=0}^{t−1} g(x_k, u_k) | x_0 }, where u_k is generated according to RSP μ_θ(u|x). The goal is to optimize the expected total cost ᾱ(θ) over the n-dimensional vector θ.

With no explicit model of the state transitions but only a sample path denoted by {x_k, u_k}, actor-critic algorithms typically optimize θ locally in the following way: first, the critic estimates the policy gradient ∇ᾱ(θ) using a Temporal Difference (TD) algorithm; then the actor modifies the policy parameter along the gradient direction. Let the operator P_θ denote taking expectation after one transition. More precisely, for a function f(x, u),

    (P_θ f)(x, u) = Σ_{j∈X, ν∈U} μ_θ(ν|j) p(j|x, u) f(j, ν).

Define the Q_θ-value function to be any function satisfying the Poisson equation

    Q_θ(x, u) = g(x, u) + (P_θ Q_θ)(x, u),

where Q_θ(x, u) can be interpreted as the expected future cost we incur if we start at state x, apply control u, and then apply RSP μ_θ. We note that in general the Poisson equation need not hold for SSP; however, it holds if the policy corresponding to RSP μ_θ is a proper policy [16]. We make the following assumption.

Assumption B: For any θ, and for any x ∈ X, μ_θ(u|x) > 0 if action u is feasible at state x, and μ_θ(u|x) ≡ 0 otherwise.

We note that one possible RSP for which Assumption B holds is the "Boltzmann" policy (see [17]), that is,

    μ_θ(u|x) = exp(h_θ^(u)(x)) / Σ_{a∈U} exp(h_θ^(a)(x)),     (1)

where h_θ^(u)(x) is a function that corresponds to action u and is parameterized by θ.
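As a minimal sketch (not the paper's implementation), the Boltzmann policy (1) restricted to feasible actions, so that Assumption B holds by construction, can be computed as follows; the function name and the feasibility mask are illustrative, and the scores h_θ^(u)(x) are assumed precomputed from θ for the current state x:

```python
import numpy as np

def boltzmann_policy(h_scores, feasible):
    """Boltzmann RSP of Eq. (1): mu_theta(u|x) proportional to exp(h_theta^(u)(x)).
    h_scores[u] holds h_theta^(u)(x) for each action u at the current state x;
    infeasible actions receive probability exactly 0 (Assumption B)."""
    s = np.where(feasible, h_scores, -np.inf)   # mask out infeasible actions
    s = s - s[feasible].max()                   # stabilize the exponentials
    w = np.exp(s)                               # exp(-inf) = 0 for masked actions
    return w / w.sum()

# Example: 4 actions, action 2 infeasible at this state.
p = boltzmann_policy(np.array([1.0, 0.5, 2.0, 0.0]),
                     np.array([True, True, False, True]))
```

Subtracting the maximum feasible score before exponentiating leaves the probabilities unchanged but avoids overflow for large parameter values.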
The Boltzmann policy is simple to use and is the policy that will be used in the case study in Sec. V.

Lemma II.1: If Assumptions A and B hold, then for any θ the policy corresponding to RSP μ_θ is proper.

Proof: The proof follows from the definition of a proper policy.

Under suitable ergodicity conditions, {x_k} and {x_k, u_k} are Markov chains with stationary distributions under a fixed policy. These stationary distributions are denoted by π_θ(x) and η_θ(x, u), respectively. We will not elaborate on the ergodicity conditions, except to note that it suffices that the process {x_k} is irreducible and aperiodic given any θ, and that Assumption B holds. Denote by Q_θ the (|X||U|)-dimensional vector Q_θ = (Q_θ(x, u); ∀x ∈ X, u ∈ U). Let now

    ψ_θ(x, u) = ∇_θ ln μ_θ(u|x),

where ψ_θ(x, u) = 0 when x, u are such that μ_θ(u|x) ≡ 0 for all θ. It is assumed that ψ_θ(x, u) is bounded and continuously differentiable. We write ψ_θ(x, u) = (ψ_θ^1(x, u), ..., ψ_θ^n(x, u)), where n is the dimensionality of θ. As we did in defining Q_θ, we will denote by ψ_θ^i the (|X||U|)-dimensional vector ψ_θ^i = (ψ_θ^i(x, u); ∀x ∈ X, u ∈ U).

A key fact underlying the actor-critic algorithm is that the policy gradient can be expressed as (Theorem 2.15 in [13])

    ∂ᾱ(θ)/∂θ_i = ⟨Q_θ, ψ_θ^i⟩_θ,   i = 1, ..., n,

where for any two functions f_1 and f_2 of x and u, expressed as (|X||U|)-dimensional vectors f_1 and f_2, we define

    ⟨f_1, f_2⟩_θ := Σ_{x,u} η_θ(x, u) f_1(x, u) f_2(x, u).     (2)

Let ‖·‖_θ denote the norm induced by the inner product (2), i.e., ‖f‖_θ^2 = ⟨f, f⟩_θ. Let also S_θ be the subspace of R^{|X||U|} spanned by the vectors ψ_θ^i, i = 1, ..., n, and denote by Π_θ the projection with respect to the norm ‖·‖_θ onto S_θ; i.e., for any f ∈ R^{|X||U|}, Π_θ f is the unique vector in S_θ that minimizes ‖f − f̂‖_θ over all f̂ ∈ S_θ. Since for all i

    ⟨Q_θ, ψ_θ^i⟩_θ = ⟨Π_θ Q_θ, ψ_θ^i⟩_θ,

it is sufficient to know the projection of Q_θ onto S_θ in order to compute ∇ᾱ(θ). One possibility is to approximate Q_θ with a parametric linear architecture of the following form (see [11]):

    Q_θ^{r*}(x, u) = ψ_θ'(x, u) r*,   r* ∈ R^n.     (3)

This dramatically reduces the complexity of learning from the space R^{|X||U|} to the space R^n. Furthermore, temporal difference algorithms can be used to learn such an r* effectively. The elements of ψ_θ(x, u) are understood as features associated with an (x, u) state-action pair, in the sense of basis functions used to develop an approximation of the Q_θ-value function.

III. ACTOR-CRITIC ALGORITHM USING LSTD

The critic in [11] used either TD(λ) or TD(1). The algorithm we propose uses least squares TD methods (LSTD in particular) instead, as they have been shown to provide far superior performance. In the sequel, we first describe the LSTD actor-critic algorithm and then we prove its convergence.

A. The Algorithm

The algorithm uses a sequence of simulated trajectories, each of which starts at a given x_0 and ends as soon as x* is visited for the first time in the sequence. Once a trajectory is completed, the state of the system is reset to the initial state x_0 and the process is repeated.

Let x_k denote the state of the system at time k. Let r_k, the iterate for r* in (3), be the parameter vector of the critic at time k, θ_k be the parameter vector of the actor at time k, and x_{k+1} be the new state, obtained after action u_k is applied when the state is x_k. A new action u_{k+1} is generated according to the RSP corresponding to the actor parameter θ_k (see [11]).
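One iteration of the critic and actor updates presented below can be sketched in a few lines of numpy. This is an illustrative sketch under our own assumptions (function names, signatures, and the deferred inversion of A_k, which is singular during the first iterations, are ours), not the authors' implementation:

```python
import numpy as np

def critic_step(z, b, A, psi_k, psi_k1, g_k, gamma, lam):
    """One critic iteration: eligibility trace z, single-period estimate b,
    and sample LSTD matrix A, all updated with step-size gamma."""
    z_new = lam * z + psi_k
    b_new = b + gamma * (g_k * z - b)
    A_new = A + gamma * (np.outer(z, psi_k1 - psi_k) - A)
    return z_new, b_new, A_new

def actor_step(theta, r, psi_k1, beta, D=5.0):
    """One actor iteration; Gamma(r) truncates the step to keep it bounded."""
    nr = np.linalg.norm(r)
    Gamma = D / nr if nr > D else 1.0
    return theta - beta * Gamma * (r @ psi_k1) * psi_k1
```

Between the two calls, the critic parameter would be refreshed as `r = -np.linalg.solve(A, b)` once A is invertible (in practice one would regularize A or delay this solve for the first few iterations, an implementation detail the algorithm statement leaves open).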
The critic and the actor carry out the following updates, where z_k ∈ R^n represents Sutton's eligibility trace [17], b_k ∈ R^n is a statistical estimate of the single-period reward, and A_k ∈ R^{n×n} is a sample estimate of the matrix formed by z_k (ψ'_{θ_k}(x_{k+1}, u_{k+1}) − ψ'_{θ_k}(x_k, u_k)), which can be viewed as a sample observation of the difference of the state incidence vectors for iterations k and k+1, scaled to the feature space by the basis functions.

LSTD Actor-Critic for SSP

Initialization: Set all entries in z_0, A_0, b_0 and r_0 to zero. Let θ_0 take some initial value, potentially corresponding to a heuristic policy.

Critic:

    z_{k+1} = λ z_k + ψ_{θ_k}(x_k, u_k),
    b_{k+1} = b_k + γ_k [ g(x_k, u_k) z_k − b_k ],
    A_{k+1} = A_k + γ_k [ z_k (ψ'_{θ_k}(x_{k+1}, u_{k+1}) − ψ'_{θ_k}(x_k, u_k)) − A_k ],     (4)

where λ ∈ [0, 1), γ_k := 1/k, and finally

    r_{k+1} = −A_k^{−1} b_k.     (5)

Actor:

    θ_{k+1} = θ_k − β_k Γ(r_k) r_k' ψ_{θ_k}(x_{k+1}, u_{k+1}) ψ_{θ_k}(x_{k+1}, u_{k+1}).     (6)

In the above, {γ_k} controls the critic step-size, while {β_k} and Γ(r) together control the actor step-size. An implementation of this algorithm needs to make these choices. The role of Γ(r) is mainly to keep the actor updates bounded, and we can for instance use

    Γ(r) = D/‖r‖ if ‖r‖ > D, and Γ(r) = 1 otherwise,

for some D > 0. {β_k} is a deterministic and non-increasing sequence for which we need to have

    Σ_k β_k = ∞,   Σ_k β_k^2 < ∞,   lim_{k→∞} β_k/γ_k = 0.     (7)

An example of {β_k} satisfying Eq. (7) is

    β_k = c/(k ln k),   k > 1,     (8)

where c > 0 is a constant parameter. Recall that ψ_θ(x, u) = ∇_θ ln μ_θ(u|x), with ψ_θ(x, u) = 0 when x, u are such that μ_θ(u|x) ≡ 0 for all θ; ψ_θ(x, u) is assumed bounded and continuously differentiable, and ψ_θ(x, u) = (ψ_θ^1(x, u), ..., ψ_θ^n(x, u)), where n is the dimensionality of θ.

The convergence of the algorithm is stated in the following theorem (see the Appendix for the proof).

Theorem III.1 [Actor Convergence]: For the LSTD actor-critic with some step-size sequence {β_k} satisfying (7), for any ε > 0, there exists some λ sufficiently close to 1 such that lim inf_k ‖∇ᾱ(θ_k)‖ < ε w.p.1. That is, θ_k visits an arbitrary neighborhood of a stationary point infinitely often.

IV. THE MRP AND ITS CONVERSION INTO AN SSP PROBLEM

In the MRP problem, we assume that there is a set of unsafe states which are set to be absorbing on the MDP (i.e., there is only one control at each such state, corresponding to a self-transition with probability 1). Let X_G and X_U denote the set of goal states and unsafe states, respectively. A safe state is a state that is not unsafe. It is assumed that if the system is at a safe state, then there is at least one sequence of actions that can reach one of the states in X_G with positive probability. Note that this implies that Assumption A holds. In the MRP, the goal is to find the optimal policy that maximizes the probability of reaching a state in X_G from a given initial state. Note that since the unsafe states are absorbing, to satisfy this specification the system must not visit the unsafe states.

We now convert the MRP problem into an SSP problem, which requires us to change the original MDP (now denoted as MDP_M) into an SSP MDP (denoted as MDP_S). Note that [3] established the equivalence between an MRP problem and an SSP problem where the expected reward is maximized. Here we present a different transformation where an MRP problem is converted to a more standard SSP problem where the expected cost is minimized.
To begin, we denote the state space of MDP_M by X_M, and define X_S, the state space of MDP_S, to be X_S = (X_M \ X_G) ∪ {x*}, where x* denotes a special termination state. Let x_0 denote the initial state, and U denote the action space of MDP_M. We define the action space of MDP_S to be U, i.e., the same as for MDP_M. Let p_M(j|x, u) denote the probability of transitioning to state j ∈ X_M if action u is taken at state x ∈ X_M. We now define the transition probability p_S(j|x, u) for all states x, j ∈ X_S as:

    p_S(j|x, u) = Σ_{i∈X_G} p_M(i|x, u),   if j = x*,
    p_S(j|x, u) = p_M(j|x, u),             if j ∈ X_M \ X_G,     (9)

for all x ∈ X_M \ (X_G ∪ X_U) and all u ∈ U. Furthermore, we set p_S(x*|x*, u) = 1 and p_S(x_0|x, u) = 1 if x ∈ X_U, for all u ∈ U. The transition probabilities of MDP_S are thus the same as for MDP_M, except that the probability of visiting the goal states in MDP_M is changed into the probability of visiting the termination state, and the unsafe states transition to the initial state with probability 1.

For all x ∈ X_S, we define the cost g(x, u) = 1 if x ∈ X_U, and g(x, u) = 0 otherwise. Define the expected total cost of a policy μ to be ᾱ_μ^S = lim_{t→∞} E{ Σ_{k=0}^{t−1} g(x_k, u_k) | x_0 }, where actions u_k are obtained according to policy μ in MDP_S. Moreover, for each policy μ on MDP_S, we can define a policy on MDP_M to be the same as μ for all states x ∈ X_M \ (X_G ∪ X_U). Since actions are irrelevant at the goal and unsafe states in both MDPs, with slight abuse of notation we denote both policies by μ. Finally, we define the Reachability Probability R_μ^M as the probability of reaching one of the goal states from x_0 under policy μ on MDP_M. The lemma below relates R_μ^M and ᾱ_μ^S:

Lemma IV.1: For any RSP μ, we have R_μ^M = 1/(ᾱ_μ^S + 1).
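The construction (9) can be sketched concretely. In the sketch below (our own illustration, not the paper's code), the original state indices are kept and the termination state x* is appended as the last index, leaving the goal states' rows unused rather than re-indexing X_S; the one-step cost would simply be g(x, u) = 1 on X_U and 0 elsewhere:

```python
import numpy as np

def mrp_to_ssp(P_M, goal, unsafe, x0):
    """Build p_S of Eq. (9) from p_M.
    P_M[x, u, j] = p_M(j|x,u), shape (n, m, n); goal/unsafe are index sets;
    x0 is the initial state. x* is appended as index n."""
    n, m, _ = P_M.shape
    P_S = np.zeros((n + 1, m, n + 1))
    for x in range(n):
        if x in goal:
            continue                      # goal states are removed from X_S
        if x in unsafe:
            P_S[x, :, x0] = 1.0           # unsafe states reset to x0
            continue
        for u in range(m):
            P_S[x, u, n] = P_M[x, u, list(goal)].sum()   # mass into X_G -> x*
            for j in range(n):
                if j not in goal:
                    P_S[x, u, j] = P_M[x, u, j]
    P_S[n, :, n] = 1.0                    # x* is absorbing and cost-free
    return P_S
```

For a tiny 3-state example (state 0 safe and initial, state 1 a goal, state 2 unsafe), the safe state's probability mass into the goal is rerouted to x*, and the unsafe state loops back to x_0 with probability 1.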
Proof: From the definition of g(x, u), ᾱ_μ^S is the expected number of times that unsafe states in X_U are visited before x* is reached. From the construction of MDP_S, reaching x* in MDP_S is equivalent to reaching one of the goal states in MDP_M. On the other hand, for MDP_M, by definition of X_G and X_U, in the Markov chain generated by μ the states X_G and X_U are the only absorbing states, and all other states are transient. Thus, the probability of visiting a state in X_U from x_0 on MDP_M is 1 − R_μ^M, which is the same as the probability of visiting X_U for each run of MDP_S, due to the construction of the transition probabilities (9). We can now consider a geometric distribution where the probability of success is R_μ^M. Because ᾱ_μ^S is the expected number of times an unsafe state in X_U is visited before x* is reached, this is the same as the expected number of failures of Bernoulli trials (with probability of success R_μ^M) before a success. This implies ᾱ_μ^S = (1 − R_μ^M)/R_μ^M, and rearranging completes the proof.

The above lemma means that a solution μ to the SSP problem on MDP_S (minimizing ᾱ_μ^S) corresponds to a solution for the MRP problem on MDP_M (maximizing R_μ^M). Note that the algorithm uses a sequence of simulated trajectories, each of which starts at x_0 and ends as soon as x* is visited for the first time in the sequence. Once a trajectory is completed, the state of the system is reset to the initial state x_0 and the process is repeated. Thus, the actor-critic algorithm is applied to a modified version of MDP_S where a transition to a goal state is always followed by a transition to the initial state.

V. CASE STUDY

In this section we apply our algorithm to control a robot moving in a square-shaped mission environment, which is partitioned into 2500 smaller square regions (a 50 × 50 grid) as shown in Fig. 1. We model the motion of the robot in the environment as an MDP: each region corresponds to a state of the MDP, and in each region (state), the robot can take the following control primitives (actions): "North", "East", "South", "West", which represent the directions in which the robot intends to move (depending on the location of a region, some of these actions may not be enabled; for example, in the lower-left corner, only actions "North" and "East" are enabled). These control primitives are not reliable and are subject to noise in actuation and possible surface roughness in the environment. Thus, for each motion primitive at a region, there is a probability that the robot enters an adjacent region.

Fig. 1. View of the mission environment. The initial region is marked by o, the goal regions by x, and the unsafe regions are shown in black.

We label the region in the south-west corner as the initial state. We mark the regions located at the other three corners as the set of goal states, as shown in Fig. 1. We assume that there is a set of unsafe states X_U in the environment (shown in black in Fig. 1). Our goal is to find the optimal policy that maximizes the probability of reaching a state in X_G (the set of goal states) from the initial state (an instance of an MRP problem).

A. Designing an RSP

To apply the LSTD Actor-Critic algorithm, the key step is to design an RSP μ_θ(u|x). In this case study, we define the RSP to be an exponential function of two scalar parameters θ_1 and θ_2. These parameters are used to provide a balance between safety and progress when applying the control policy.
For each pair of states x_i, x_j ∈ X, we define d(x_i, x_j) as the minimum number of transitions from x_i to x_j. We write x_j ∈ N(x_i) if and only if d(x_i, x_j) ≤ r_n, where r_n is a fixed integer given a priori. If x_j ∈ N(x_i), then we say x_i is in the neighborhood of x_j, and r_n represents the radius of the neighborhood around each state. For each state x ∈ X, the safety score s(x) is defined as the ratio of the safe neighboring states over all neighboring states of x. To be more specific, we define

    s(x) = Σ_{y∈N(x)} I_s(y) / |N(x)|,     (10)

where I_s(y) is an indicator function such that I_s(y) = 1 if and only if y ∈ X \ X_U, and I_s(y) = 0 otherwise. A higher safety score for the current state of the robot means it is less likely for the robot to reach an unsafe region in the future. We define the progress score of a state x ∈ X as d_g(x) := min_{y∈X_G} d(x, y), which is the minimum number of transitions from x to any goal region.

We can now propose the RSP policy, which is a Boltzmann policy as defined in (1). Note that U = {u_1, u_2, u_3, u_4}, which corresponds to "North", "East", "South", and "West", respectively. We first define

    a_i(θ) = F_i(x) exp( θ_1 E{s(f(x, u_i))} + θ_2 E{d_g(f(x, u_i)) − d_g(x)} ),     (11)

where θ := (θ_1, θ_2), f(x, u_i) is the next state when u_i is applied at x, and F_i(x) is an indicator function such that F_i(x) = 1 if u_i is available at x and F_i(x) = 0 otherwise. Note that the availability of control actions at a state is limited for the states at the boundary. For example, at the initial state, which is at the lower-left corner, the set of available actions is {u_1, u_2}, corresponding to "North" and "East", respectively. If an action u_i is not available at state x, we set a_i(θ) = 0, which means that μ_θ(u_i|x) = 0.
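A minimal sketch of the safety score (10), the progress score, and the resulting normalized RSP (cf. Eq. (12) below) on a small grid follows. It is illustrative only: the grid size, the unsafe and goal sets, the use of Manhattan distance for d(·,·) (the minimum number of transitions on an unobstructed grid), and the replacement of the expectations in (11) by the intended next cell are all our simplifying assumptions:

```python
import numpy as np

N = 10                                        # illustrative small grid
unsafe = {(3, 3), (3, 4)}
goals = {(N - 1, N - 1)}
moves = [(1, 0), (0, 1), (-1, 0), (0, -1)]    # "North", "East", "South", "West"

def neighborhood(x, r_n=2):
    """States within r_n transitions of x (Manhattan distance on the grid)."""
    return [(i, j) for i in range(N) for j in range(N)
            if abs(i - x[0]) + abs(j - x[1]) <= r_n]

def safety(x):
    """Safety score s(x) of Eq. (10): fraction of safe states in N(x)."""
    nb = neighborhood(x)
    return sum(1 for y in nb if y not in unsafe) / len(nb)

def d_goal(x):
    """Progress score d_g(x): minimum number of transitions to a goal."""
    return min(abs(x[0] - g[0]) + abs(x[1] - g[1]) for g in goals)

def rsp(x, theta):
    """RSP of Eqs. (11)-(12); the expectation over the next state f(x, u_i)
    is approximated here by the intended next cell."""
    a = np.zeros(4)
    for i, (dx, dy) in enumerate(moves):
        nxt = (x[0] + dx, x[1] + dy)
        if 0 <= nxt[0] < N and 0 <= nxt[1] < N:       # F_i(x): availability
            a[i] = np.exp(theta[0] * safety(nxt)
                          + theta[1] * (d_goal(nxt) - d_goal(x)))
    return a / a.sum()
```

At the lower-left corner (0, 0), only "North" and "East" are available, and by symmetry of this toy layout the two get equal probability.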
Note that a_i(θ) combines the expected safety score of the next state after applying control u_i and the expected improvement in progress score from the current state after applying u_i, weighted by θ_1 and θ_2, respectively. The RSP is then given by

    μ_θ(u_i|x) = a_i(θ) / Σ_{j=1}^4 a_j(θ).     (12)

We note that Assumption B holds for the proposed RSP. Moreover, Assumption A also holds; therefore Lemma II.1 holds for this RSP.

B. Generating transition probabilities

To implement the LSTD Actor-Critic algorithm, we first constructed the MDP. As mentioned above, this MDP represents the motion of the robot in the environment, where each state corresponds to a cell in the environment (Fig. 1). To capture the transition probabilities of the robot from a cell to an adjacent one under an action, we built a simulator. The simulator uses a unicycle model (see, e.g., [19]) for the dynamics of the robot, with noisy sensors and actuators. In this model, the motion of the robot is determined by specifying a forward and an angular velocity. At a given region, the robot implements one of the following four controllers (motion primitives): "East", "North", "West", "South". Each of these controllers operates by obtaining the difference between the current heading angle and the desired heading angle, which is then translated into a proportional feedback control law for the angular velocity. The desired heading angles for the "East", "North", "West", and "South" controllers are 0°, 90°, 180°, and 270°, respectively. Each controller also uses a constant forward velocity. The environment in the simulator is a 50 by 50 square grid as shown in Fig. 1. To each cell of the environment, we randomly assigned a surface roughness which affects the motion of the robot in that cell.
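Given such a simulator, transition probabilities can be estimated by frequency counting over repeated trials. A minimal sketch (our own illustration; `noisy_east` is a hypothetical stand-in for the unicycle simulator, and the probabilities in it are made up):

```python
import random
from collections import Counter

def estimate_transitions(simulate_step, state, action, trials=5000, seed=0):
    """Estimate p(j|x,u) as the empirical frequency of each exit cell over
    repeated simulator trials starting from the center of cell `state`."""
    rng = random.Random(seed)
    counts = Counter(simulate_step(state, action, rng) for _ in range(trials))
    return {j: c / trials for j, c in counts.items()}

def noisy_east(x, u, rng):
    """Hypothetical noisy 'East' primitive: returns the cell the robot
    enters when it exits cell x (mostly east, occasionally drifting)."""
    r = rng.random()
    if r < 0.8:
        return (x[0], x[1] + 1)      # intended cell
    if r < 0.9:
        return (x[0] + 1, x[1])      # drift to one side
    return (x[0] - 1, x[1])          # drift to the other side

p_hat = estimate_transitions(noisy_east, (5, 5), "East")
```

Each trial corresponds to one simulated cell exit; the returned dictionary sums to 1 by construction.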
The perimeter of the environment is made of walls; when the robot runs into them, it bounces back at the mirror angle. To find the transition probabilities, we performed a total of 5000 simulations for each controller and state of the MDP. In each trial, the robot was initialized at the center of the cell, and then an action was applied. The robot moved in that cell according to its dynamics and the surface roughness of the region. As soon as the robot exited the cell, a transition was recorded. Then, a reliable center-converging controller was automatically applied to steer the robot to the center of the new cell. We assumed that the center-converging controller is reliable enough that it always drives the robot to the center of the new cell before exiting it; thus, the robot always started from the center of a cell. This makes the process Markov (the probability of the current transition depends only on the control and the current state, and not on the history up to the current state). We also assumed perfect observation at the boundaries of the cells.

It should be noted that, in general, it is not required to have all the transition probabilities of the model in order to apply the LSTD Actor-Critic algorithm; rather, we only need transition probabilities along the trajectories of the system simulated while running the algorithm. This becomes an important advantage when the environment is large and obtaining all transition probabilities becomes infeasible.

C. Results

We first obtained the exact optimal policy for this problem using the methods described in [2], [5]. The maximal reachability probability is 99.9988%. We then used our LSTD actor-critic algorithm to optimize with respect to θ as outlined in Sec. III and IV.
Given θ, we can compute the exact probability of reaching X_G from any state x ∈ X applying the RSP μ_θ by solving the following set of linear equations:

    p_θ(x) = Σ_{u∈U} μ_θ(u|x) Σ_{y∈X} p(y|x, u) p_θ(y),   for all x ∈ X \ (X_U ∪ X_G),     (13)

such that p_θ(x) = 0 if x ∈ X_U and p_θ(x) = 1 if x ∈ X_G. Note that the equation system given by (13) contains exactly |X| − |X_U| − |X_G| equations and unknowns. In Fig. 2 we plot the reachability probability of the RSP from the initial state (i.e., p_θ(x_0)) against the number of iterations of the actor-critic algorithm each time θ is updated. As θ converges, the reachability probability converges to 90.3%.

Fig. 2. The dashed line represents the optimal solution (the maximal reachability probability) and the solid line represents the exact reachability probability for the RSP as a function of the number of iterations of the proposed algorithm.

The parameters for this example are: r_n = 2, λ = 0.9, D = 5, and the initial θ is (50, −10). We use (8) for β_k with c = 0.05.

VI. CONCLUSION

We considered the problem of finding a control policy for a Markov Decision Process (MDP) to maximize the probability of reaching some states of the MDP while avoiding some other states. We presented a transformation of the problem into a Stochastic Shortest Path (SSP) MDP and developed a new approximate dynamic programming algorithm to solve this class of problems. The algorithm operates on a sample path of the system and optimizes the policy within a pre-specified class parameterized by a parsimonious set of parameters. Simulation results confirm the effectiveness of the proposed solution in robot motion planning applications.
APPENDIX: CONVERGENCE OF THE LSTD ACTOR-CRITIC ALGORITHM

We first cite the theory of linear stochastic approximation driven by a slowly varying Markov chain [13] (with simplifications). Let {y_k} be a finite Markov chain whose transition probabilities depend on a parameter θ ∈ R^n. Consider a generic iteration of the form

$$s_{k+1} = s_k + \gamma_k \big( h_{\theta_k}(y_{k+1}) - G_{\theta_k}(y_{k+1})\, s_k \big) + \gamma_k \Xi_k s_k, \tag{14}$$

where s_k ∈ R^m, and h_θ(·) ∈ R^m, G_θ(·) ∈ R^{m×m} are θ-parameterized vector and matrix functions, respectively. It has been shown in [13] that the critic in (14) converges if the following set of conditions is met.

Condition 1
1) The sequence {γ_k} is deterministic, non-increasing, and
$$\sum_k \gamma_k = \infty, \qquad \sum_k \gamma_k^2 < \infty.$$
2) The random sequence {θ_k} satisfies ||θ_{k+1} − θ_k|| ≤ β_k H_k for some process {H_k} with bounded moments, where {β_k} is a deterministic sequence such that
$$\sum_k \left( \frac{\beta_k}{\gamma_k} \right)^{d} < \infty$$
for some d > 0.
3) Ξ_k is an m × m-matrix-valued martingale difference with bounded moments.
4) For each θ, there exist h̄(θ) ∈ R^m, Ḡ(θ) ∈ R^{m×m}, and corresponding m-vector and m × m-matrix functions ĥ_θ(·), Ĝ_θ(·) that satisfy the Poisson equation. That is, for each y,
$$\hat{h}_\theta(y) = h_\theta(y) - \bar{h}(\theta) + (P_\theta \hat{h}_\theta)(y), \qquad \hat{G}_\theta(y) = G_\theta(y) - \bar{G}(\theta) + (P_\theta \hat{G}_\theta)(y).$$
5) For some constant C and for all θ, we have max(||h̄(θ)||, ||Ḡ(θ)||) ≤ C.
6) For any d > 0, there exists C_d > 0 such that sup_k E[||f_{θ_k}(y_k)||^d] ≤ C_d, where f_θ(·) represents any of the functions ĥ_θ(·), h_θ(·), Ĝ_θ(·), and G_θ(·).
7) For some constant C > 0 and for all θ, θ̄ ∈ R^n, max(||h̄(θ) − h̄(θ̄)||, ||Ḡ(θ) − Ḡ(θ̄)||) ≤ C ||θ − θ̄||.
8) There exists a positive measurable function C(·) such that for every d > 0, sup_k E[C(y_k)^d] < ∞, and ||f_θ(y) − f_θ̄(y)|| ≤ C(y) ||θ − θ̄||.
9) There exists a > 0 such that for all s ∈ R^m and θ ∈ R^n,
$$s' \bar{G}(\theta)\, s \geq a \|s\|^2.$$

For now, let us focus on the first two items of Condition 1. Recall that for any matrix A, v(A) is a column vector that stacks all row vectors of A (also written as column vectors). Simple algebra shows that the core iteration of the LSTD critic can be written as (14) with

$$s_k = \begin{bmatrix} b_k \\ v(A_k) \\ 1 \end{bmatrix}, \quad y_k = (x_k, u_k, z_k), \quad h_\theta(y) = \begin{bmatrix} g(x, u)\, z \\ v\big(z\, ((P_\theta \psi'_\theta)(x, u) - \psi'_\theta(x, u))\big) \\ 1 \end{bmatrix}, \quad G_\theta(y) = I, \tag{15}$$

$$\Xi_k = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & D \\ 0 & 0 & 0 \end{bmatrix}, \quad \text{where } D = v\big(z_k\, (\psi'_{\theta_k}(x_{k+1}, u_{k+1}) - (P_\theta \psi_\theta)'(x_k, u_k))\big),$$

M is an arbitrary (large) positive constant whose role is to facilitate the convergence proof, and y = (x, u, z) denotes a value of the triplet y_k.

The step-sizes γ_k and β_k in (4) and (6) correspond exactly to the γ_k and β_k in Conditions 1.(1) and 1.(2), respectively. If the MDP has finite state and action spaces, then the conditions on {β_k} reduce to ([13])

$$\sum_k \beta_k = \infty, \qquad \sum_k \beta_k^2 < \infty, \qquad \lim_{k \to \infty} \frac{\beta_k}{\gamma_k} = 0, \tag{16}$$

where {β_k} is a deterministic and non-increasing sequence. Note that we can use γ_k = 1/k (cf. Condition 1). The following theorem establishes the convergence of the critic.

Theorem VI.1 [Critic Convergence]: For the LSTD actor-critic (4) and (5) with some step-size sequence {β_k} satisfying (16), the sequence s_k is bounded, and

$$\lim_{k \to \infty} \big| \bar{G}(\theta_k)\, s_k - \bar{h}(\theta_k) \big| = 0. \tag{17}$$

Proof: To show that (14) converges with s, y, h_θ(·), G_θ(·), and Ξ substituted by (15), Conditions 1.(1)–(9) should be checked. However, a comparison with the convergence proof for the TD(λ) critic in [11] gives a simpler proof.
Let F_θ(y) = z (ψ'_θ(x, u) − (P_θ ψ_θ)'(x, u)). While proving the convergence of the TD(λ) critic operating concurrently with the actor, [11] showed that

$$\tilde{h}_\theta(y) = \begin{bmatrix} \tilde{h}^{(1)}_\theta(y) \\ \tilde{h}^{(2)}_\theta(y) \end{bmatrix} = \begin{bmatrix} M g(x, u) \\ g(x, u)\, z \end{bmatrix}, \quad \tilde{G}_\theta(y) = \begin{bmatrix} 1 & 0 \\ z/M & F_\theta(y) \end{bmatrix}, \quad \tilde{\Xi}_k = \begin{bmatrix} 0 & 0 \\ 0 & z_k (\psi'_{\theta_k}(x_{k+1}, u_{k+1}) - (P_\theta \psi_\theta)'(x_k, u_k)) \end{bmatrix}$$

satisfy Conditions 1.(3)–1.(8). In our case, (15) can be rewritten as

$$h_\theta(y) = \begin{bmatrix} \tilde{h}^{(2)}_\theta(y) \\ -F_\theta(y) \\ 1 \end{bmatrix}, \quad G_\theta(y) = I, \quad \Xi_k = \begin{bmatrix} \tilde{\Xi}_k & 0 \end{bmatrix}. \tag{18}$$

Note that although the two iterates are very different, they involve the same quantities, both in a linear fashion. Hence, h_θ(·), G_θ(·), and Ξ_k also satisfy Conditions 1.(3)–1.(8). Meanwhile, the step-size {γ_k} satisfies Condition 1.(1), and the step-size {β_k} satisfies Eq. (16) (which, as explained above, implies Condition 1.(2)). Now, only Condition 1.(9) remains to be checked. To that end, note that all diagonal elements of G_θ(y) are equal to one, so G_θ(y) is positive definite. This proves the convergence. Using the same correspondence and the result in [11], one can further check that (17) also holds here.

Proof of Theorem III.1: The result follows by setting φ_θ = ψ_θ and following the proof in Section 6 of [11].

REFERENCES
[1] S. Temizer, M. Kochenderfer, L. Kaelbling, T. Lozano-Pérez, and J. Kuchar, "Collision avoidance for unmanned aircraft using Markov decision processes."
[2] M. Lahijanian, J. Wasniewski, S. B. Andersson, and C. Belta, "Motion planning and control from temporal logic specifications with probabilistic satisfaction guarantees," in IEEE Int. Conf. on Robotics and Automation, Anchorage, AK, 2010, pp. 3227–3232.
[3] R. Alterovitz, T. Siméon, and K. Goldberg, "The stochastic motion roadmap: A sampling framework for planning with Markov motion uncertainty," in Robotics: Science and Systems. Citeseer, 2007.
[4] C. Baier, J.-P. Katoen, and K. G. Larsen, Principles of Model Checking. MIT Press, 2008.
[5] X. Ding, S. Smith, C. Belta, and D. Rus, "LTL control in uncertain environments with probabilistic satisfaction guarantees," in IFAC, 2011.
[6] J. Peters and S. Schaal, "Policy gradient methods for robotics," in Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006.
[7] K. Samejima and T. Omori, "Adaptive internal state space construction method for reinforcement learning of a real-world agent," Neural Networks, vol. 12, pp. 1143–1155, 1999.
[8] H. Berenji and D. Vengerov, "A convergent Actor-Critic-based FRL algorithm with application to power management of wireless transmitters," IEEE Transactions on Fuzzy Systems, vol. 11, no. 4, pp. 478–485, 2003.
[9] "Actor-critic models of reinforcement learning in the basal ganglia: From natural to artificial rats," Adaptive Behavior, vol. 13, no. 2, pp. 131–148, 2005.
[10] G. Gajjar, S. Khaparde, P. Nagaraju, and S. Soman, "Application of actor-critic learning algorithm for optimal bidding problem of a GenCo," IEEE Transactions on Power Engineering Review, vol. 18, no. 1, pp. 11–18, 2003.
[11] V. R. Konda and J. N. Tsitsiklis, "On actor-critic algorithms," SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, 2003.
[12] S. Bradtke and A. Barto, "Linear least-squares algorithms for temporal difference learning," Machine Learning, vol. 22, no. 2, pp. 33–57, 1996.
[13] V. R. Konda, "Actor-critic algorithms," Ph.D. dissertation, MIT, Cambridge, MA, 2002.
[14] D. Bertsekas and S. Ioffe, "Temporal differences-based policy iteration and applications in neuro-dynamic programming," LIDS Report 2349, MIT, 1996.
[15] J. Peters and S. Schaal, "Natural actor-critic," Neurocomputing, vol. 71, pp. 1180–1190, 2008.
[16] D. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, 1995.
[17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[18] R. M. Estanjini, X. C. Ding, M. Lahijanian, J. Wang, C. A. Belta, and I. C. Paschalidis, "Least squares temporal difference actor-critic methods with applications to robot motion control," 2011, available at http://arxiv.org/submit/0304711.
[19] S. LaValle, Planning Algorithms. Cambridge University Press, 2006.
