Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving


Authors: Shai Shalev-Shwartz, Shaked Shammah, Amnon Shashua

Abstract

Autonomous driving is a multi-agent setting where the host vehicle must apply sophisticated negotiation skills with other road users when overtaking, giving way, merging, taking left and right turns and while pushing ahead in unstructured urban roadways. Since there are many possible scenarios, manually tackling all possible cases will likely yield a too simplistic policy. Moreover, one must balance between unexpected behavior of other drivers/pedestrians and at the same time not be too defensive, so that normal traffic flow is maintained. In this paper we apply deep reinforcement learning to the problem of forming long term driving strategies. We note that there are two major challenges that make autonomous driving different from other robotic tasks. First is the necessity of ensuring functional safety — something that machine learning has difficulty with given that performance is optimized at the level of an expectation over many instances. Second, the Markov Decision Process model often used in robotics is problematic in our case because of the unpredictable behavior of other agents in this multi-agent scenario. We make three contributions in our work. First, we show how policy gradient iterations can be used, and how the variance of the gradient estimation using stochastic gradient ascent can be minimized, without Markovian assumptions. Second, we decompose the problem into a composition of a Policy for Desires (which is to be learned) and trajectory planning with hard constraints (which is not learned). The goal of Desires is to enable comfort of driving, while the hard constraints guarantee the safety of driving.
Third, we introduce a hierarchical temporal abstraction we call an “Option Graph” with a gating mechanism that significantly reduces the effective horizon and thereby reduces the variance of the gradient estimation even further. The Option Graph plays a similar role to “structured prediction” in supervised learning, thereby reducing sample complexity, while also playing a similar role to LSTM gating mechanisms used in supervised deep networks.

1 Introduction

Endowing a robotic car with the ability to form long term driving strategies, referred to as “Driving Policy”, is key for enabling fully autonomous driving. The process of sensing, i.e., forming an environmental model consisting of the location of all moving and stationary objects, the position and type of path delimiters (such as curbs, barriers, and so forth), all drivable paths with their semantic meaning, and all traffic signs and traffic lights around the car, is well defined. While sensing is well understood, the definition of Driving Policy, its underlying assumptions, and its functional breakdown are less understood. The extent of the challenge of forming driving strategies that mimic human drivers is underscored by the flurry of media reports on the simplistic driving policies exhibited by current autonomous test vehicles of various practitioners (e.g. [23]). In order to support autonomous capabilities, a robotically driven vehicle should adopt human driving negotiation skills when overtaking, giving way, merging, taking left and right turns and while pushing ahead in unstructured urban roadways. Since there are many possible scenarios, manually tackling all possible cases will likely yield a too simplistic policy. Moreover, one must balance between unexpected behavior of other drivers/pedestrians and at the same time not be too defensive, so that normal traffic flow is maintained. These challenges naturally suggest using machine learning approaches.
Traditionally, machine learning approaches for planning strategies are studied under the framework of Reinforcement Learning (RL) — see [6, 17, 31, 35] for a general overview and [19] for a comprehensive review of reinforcement learning in robotics. Using machine learning, and specifically RL, raises two concerns which we address in this paper. The first is about ensuring functional safety of the Driving Policy — something that machine learning has difficulty with given that performance is optimized at the level of an expectation over many instances. Namely, given the very low probability of an accident, the only way to guarantee safety is by scaling up the variance of the parameters to be estimated and the sample complexity of the learning problem — to a degree which becomes unwieldy to solve. Second, the Markov Decision Process model often used in robotics is problematic in our case because of the unpredictable behavior of other agents in this multi-agent scenario.

Before explaining our approach for tackling these difficulties, we briefly describe the key idea behind most common reinforcement learning algorithms. Typically, RL is performed in a sequence of consecutive rounds. At round $t$, the agent (a.k.a. the planner) observes a state, $s_t \in S$, which represents the sensing state of the system, i.e., the environmental model mentioned above. It then decides on an action $a_t \in A$. After performing the action, the agent receives an immediate reward, $r_t \in \mathbb{R}$, and is moved to a new state, $s_{t+1}$. The goal of the planner is to maximize the cumulative reward (maybe up to a time horizon, or a discounted sum of future rewards). To do so, the planner relies on a policy, $\pi : S \to A$, which maps a state into an action. Most RL algorithms rely in some way or another on the mathematically elegant model of a Markov Decision Process (MDP), pioneered by the work of Bellman [4, 5].
The Markovian assumption is that the distribution of $s_{t+1}$ is fully determined given $s_t$ and $a_t$. This yields a closed form expression for the cumulative reward of a given policy in terms of the stationary distribution over states of the MDP. The stationary distribution of a policy can be expressed as a solution to a linear programming problem. This yields two families of algorithms: optimizing with respect to the primal problem, which is called policy search, and optimizing with respect to the dual problem, whose variables are called the value function, $V^\pi$. The value function determines the expected cumulative reward if we start the MDP from the initial state $s$, and from there on pick actions according to $\pi$. A related quantity is the state-action value function, $Q^\pi(s, a)$, which determines the cumulative reward if we start from state $s$, immediately pick action $a$, and from there on pick actions according to $\pi$. The $Q$ function gives rise to a crisp characterization of the optimal policy (using the so-called Bellman's equation), and in particular it shows that the optimal policy is a deterministic function from $S$ to $A$ (in fact, it is the greedy policy with respect to the optimal $Q$ function). In a sense, the key advantage of the MDP model is that it allows us to couple all the future into the present using the $Q$ function. That is, given that we are now in state $s$, the value of $Q^\pi(s, a)$ tells us the effect of performing action $a$ at the moment on the entire future. Therefore, the $Q$ function gives us a local measure of the quality of an action $a$, thus making the RL problem more similar to supervised learning. Most reinforcement learning algorithms approximate the $V$ function or the $Q$ function in one way or another. Value iteration algorithms, e.g. the Q-learning algorithm [40], rely on the fact that the $V$ and $Q$ functions of the optimal policy are fixed points of operators derived from Bellman's equation.
Actor-critic policy iteration algorithms aim to learn a policy in an iterative way, where at iteration $t$, the “critic” estimates $Q^{\pi_t}$ and, based on this, the “actor” improves the policy. Despite the mathematical elegance of MDPs and the convenience of switching to the $Q$ function representation, this approach has several limitations. First, as noted in [19], usually in robotics we may only be able to find some approximate notion of a Markovian behaving state. Furthermore, the transition of states depends not only on the agent's action, but also on the actions of other players in the environment. For example, in the context of autonomous driving, while the dynamics of the autonomous vehicle is clearly Markovian, the next state depends on the behavior of the other road users (vehicles, pedestrians, cyclists), which is not necessarily Markovian. One possible solution to this problem is to use partially observed MDPs [41], in which we still assume that there is a Markovian state, but we only get to see an observation that is distributed according to the hidden state. A more direct approach considers game theoretic generalizations of MDPs, for example the Stochastic Games framework. Indeed, some of the algorithms for MDPs were generalized to multi-agent games, for example the minimax-Q learning [20] or the Nash-Q learning [15]. Other approaches to Stochastic Games are explicit modeling of the other players, which goes back to Brown's fictitious play [9], and vanishing regret learning algorithms [13, 10]. See also [39, 38, 18, 8]. As noted in [30], learning in a multi-agent setting is inherently more complex than in the single agent setting. Taken together, in the context of autonomous driving, given the unpredictable behavior of other road users, the MDP framework and its extensions are problematic at the least and could yield impractical RL algorithms.
When it comes to categories of RL algorithms and how they handle the Markov assumption, we can divide them into four groups:

• Algorithms that estimate the Value or Q function — these are clearly defined solely in the context of MDPs.

• Policy based learning methods where, for example, the gradient of the policy $\pi$ is estimated using the likelihood ratio trick (cf. [2, 32, 25]), and thereby the learning of $\pi$ is an iterative process where at each iteration the agent interacts with the environment while acting based on the current policy estimate. Policy gradient methods are derived using the Markov assumption, but we will see later that this is not necessarily required.

• Algorithms that learn the dynamics of the process, namely, the function that takes $(s_t, a_t)$ and yields a distribution over the next state $s_{t+1}$. These are known as Model-based methods, and they clearly rely on the Markov assumption.

• Behavior cloning (Imitation) methods. The Imitation approach simply requires a training set of examples of the form $(s_t, a_t)$, where $a_t$ is the action of the human driver (cf. [7]). One can then use supervised learning to learn a policy $\pi$ such that $\pi(s_t) \approx a_t$. Clearly, no Markov assumption is involved in the process. The problem with Imitation is that different human drivers, and even the same human, are not deterministic in their policy choices. Hence, learning a function for which $\|\pi(s_t) - a_t\|$ is very small is often infeasible. And, once we have small errors, they might accumulate over time and yield large errors.

Our first observation (detailed in sec. 2) is that Policy Gradient does not really require the Markov assumption, and furthermore that some methods for reducing the variance of the gradient estimator (cf. [26]) do not require Markov assumptions either.
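To make the imitation bullet above concrete, the following sketch fits a policy to logged state-action pairs by plain supervised regression. The synthetic "expert" data and the linear policy class are our illustrative simplifications, not the paper's setup.

```python
import numpy as np

# Hypothetical logged driving data: each row of S is a sensed state s_t,
# each row of A is the human driver's action a_t (e.g. steering, accel).
rng = np.random.default_rng(0)
S = rng.normal(size=(1000, 4))                      # 1000 states, 4 features
true_W = rng.normal(size=(4, 2))
A = S @ true_W + 0.01 * rng.normal(size=(1000, 2))  # noisy expert actions

# Behavior cloning: fit pi(s) ~ a by least squares (a linear "policy" here).
W, *_ = np.linalg.lstsq(S, A, rcond=None)

def pi(s):
    # The cloned policy maps a state (or batch of states) to an action.
    return s @ W

# Imitation error on the training data is small because the expert here
# happens to be near-deterministic; with inconsistent human choices it
# would not be, which is exactly the problem noted above.
err = np.linalg.norm(pi(S) - A) / np.linalg.norm(A)
assert err < 0.05
```

Note that the errors the text warns about would compound once the cloned policy is rolled out and visits states absent from the training set.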
Taken together, the RL algorithm could be initialized through Imitation and then updated using an iterative Policy Gradient approach without the Markov assumption. The second contribution of the paper is a method for guaranteeing functional safety of the Driving Policy outcome. Given the very small probability $p \ll 1$ of an accident, the corresponding reward of a trajectory leading to an accident should be much smaller than $-1/p$, thus generating a very high variance of the gradient estimator (see Lemma 3). Regardless of the means of reducing variance, as detailed in sec. 2, the variance of the gradient depends not only on the behavior of the reward but also on the horizon (time steps) required for making decisions. Our proposal for functional safety is twofold. First, we decompose the Policy function into a composition of a Policy for Desires (which is to be learned) and trajectory planning with hard constraints (which is not learned). The goal of Desires is to enable comfort of driving, while the hard constraints guarantee the safety of driving (detailed in sec. 4). Second, following the options mechanism of [33], we employ a hierarchical temporal abstraction we call an “Option Graph” with a gating mechanism that significantly reduces the effective horizon and thereby reduces the variance of the gradient estimation even further (detailed in sec. 5). The Option Graph plays a similar role to “structured prediction” in supervised learning (e.g. [36]), thereby reducing sample complexity, while also playing a similar role to the LSTM [14] gating mechanisms used in supervised deep networks. The use of options for skill reuse was also recently studied in [37], where hierarchical deep Q networks for skill reuse were proposed. Finally, in sec. 6, we demonstrate the application of our algorithm on a double merging maneuver which is notoriously difficult to execute using conventional motion and path planning approaches.
Safe reinforcement learning was also studied recently in [3]. Their approach involves first optimizing the expected reward (using policy gradient) and then applying a (Bregman) projection of the solution onto a set of linear constraints. This approach is different from ours. In particular, it assumes that the hard constraints on safety can be expressed as linear constraints on the parameter vector $\theta$. In our case, $\theta$ consists of the weights of a deep network and the hard constraints involve a highly non-linear dependency on $\theta$. Therefore, convex-based approaches are not applicable to our problem.

2 Reinforcement Learning without Markovian Assumption

We begin by setting up the RL notation, geared towards deriving a Policy Gradient method with variance reduction while not making any Markov assumptions. We follow the REINFORCE [42] likelihood ratio trick and make a very modest contribution — more an observation than a novel derivation — that Markov assumptions on the environment are not required. Let $S$ be our state space, which contains the “environmental model” around the vehicle generated from interpreting sensory information, together with any additional useful information such as the kinematics of moving objects from previous frames. We use the term “state space” in order not to introduce new terminology, but we actually mean a state vector in an agnostic sense, without the Markov assumptions — simply a collection of information around the vehicle generated at a particular time stamp. Let $A$ denote the action space; at this point we keep it abstract, and later in sec. 5 we introduce a specific discrete action space for selecting “desires” tailored to the domain of autonomous driving. The hypothesis class of parametric stochastic policies is denoted by $\{\pi_\theta : \theta \in \Theta\}$, where for all $\theta, s$ we have $\sum_a \pi_\theta(a \mid s) = 1$, and we assume that $\pi_\theta(a \mid s)$ is differentiable w.r.t. $\theta$.
Note that we have chosen a class of policies $\pi_\theta : S \times A \to [0, 1]$ as part of an architectural design choice, i.e., the (distribution over the) action $a_t$ at time $t$ is determined by the agnostic state $s_t$, and in particular, given the differentiability over $\theta$, the policy $\pi_\theta$ is implemented by a deep layered network. In other words, we are not claiming that the optimal policy is necessarily contained in the hypothesis class, but that “good enough” policies can be modeled using a deep network whose input layer consists of $s_t$. The theory below does not depend on the nature of the hypothesis class, and any other design choice can be substituted — for example, $\pi_\theta(a_t \mid s_t, s_{t-1})$ would correspond to the class of recurrent neural networks (RNNs). Let $\bar{s} = ((s_1, a_1), \ldots, (s_T, a_T))$ define a sequence (trajectory) of state-action pairs over a time period sufficient for long-term planning, and let $\bar{s}_{i:j} = ((s_i, a_i), \ldots, (s_j, a_j))$ denote a sub-trajectory from time stamp $i$ to time stamp $j$. Let $P_\theta(\bar{s})$ be the probability of trajectory $\bar{s}$ when actions are chosen according to the policy $\pi_\theta$ and there are no other assumptions on the environment. The total reward associated with the trajectory $\bar{s}$ is denoted by $R(\bar{s})$, which can be any function of $\bar{s}$. For example, $R$ can be a function of the immediate rewards $r_1, \ldots, r_T$, such as $R(\bar{s}) = \sum_t r_t$ or the discounted reward $R(\bar{s}) = \sum_t \gamma^t r_t$ for $\gamma \in (0, 1]$. But any reward function of $\bar{s}$ can be used, and therefore we keep it abstract.
Finally, the learning problem is:

$$\operatorname*{argmax}_{\theta \in \Theta} \; \mathbb{E}_{\bar{s} \sim P_\theta}[R(\bar{s})]$$

The policy gradient theorem below follows the standard likelihood ratio trick (e.g., [2, 11]) and the formula is well known, but in the proof (which follows the proof in [25]) we make the observation that Markov assumptions on the environment are not required for the validity of the policy gradient estimator:

Theorem 1 Denote

$$\hat{\nabla}(\bar{s}) = R(\bar{s}) \sum_{t=1}^{T} \nabla_\theta \log(\pi_\theta(a_t \mid s_t)) \quad (1)$$

Then, $\mathbb{E}_{\bar{s} \sim P_\theta} \hat{\nabla}(\bar{s}) = \nabla \, \mathbb{E}_{\bar{s} \sim P_\theta}[R(\bar{s})]$.

The policy gradient theorem shows that it is possible to obtain an unbiased estimate of the gradient of the expected total reward [42, 32, 12], thereby allowing noisy gradient estimates to be used in a stochastic gradient ascent/descent (SGD) algorithm for training a deep network representing the policy $\pi_\theta$. Unfortunately, the variance of the gradient estimator scales unfavorably with the time horizon $T$; moreover, due to the very low probability of critical “corner” cases, such as the probability of an accident $p$, the immediate reward $-r$ must satisfy $r \gg 1/p$, and in turn the variance of the random variable $R(\bar{s})$ grows with $pr^2$, i.e., much larger than $1/p$ (see sec. 4 and Lemma 3). High variance of the gradient has a detrimental effect on the convergence rate of SGD [22, 29, 16, 28, 24], and given the nature of our problem domain, with extremely low-probability corner cases, the effect of an extremely high variance could bring about bad policy solutions. We approach the variance problem along three thrusts. First, we use baseline subtraction methods (which go back to [42]) for variance reduction. Second, we deal with the variance due to “corner” cases by decomposing the policy into a learnable part and a non-learnable part, the latter inducing hard constraints on functional safety.
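The unbiasedness claimed by Theorem 1 can be checked numerically on a toy one-step problem ($T = 1$, two actions), where the exact gradient of $\mathbb{E}[R]$ is available in closed form. The parameter values below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, -0.2])           # softmax policy parameters
r = np.array([1.0, -1.0])               # reward of each action

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(theta)
# Exact gradient of E[R] = sum_a pi_theta(a) r_a w.r.t. theta:
# d pi(a)/d theta_b = pi(a) (1[a=b] - pi(b)).
exact_grad = r @ (np.diag(p) - np.outer(p, p))

# Monte Carlo estimate of eqn (1): average of R * grad_theta log pi(a).
n = 200_000
actions = rng.choice(2, size=n, p=p)
grads = np.zeros(2)
for a in (0, 1):
    glogp = np.eye(2)[a] - p            # grad_theta log softmax(theta)[a]
    grads += (actions == a).sum() * r[a] * glogp
mc_grad = grads / n

# The estimator is unbiased, so the empirical mean approaches exact_grad.
assert np.allclose(mc_grad, exact_grad, atol=0.01)
```

No Markov structure was used anywhere: the estimator only needs the ability to sample trajectories and differentiate $\log \pi_\theta$.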
Last, we introduce a temporal abstraction method with a gating mechanism we call an “option graph” to ameliorate the effect of the time horizon $T$ on the variance. In Section 3 we focus on baseline subtraction, derive the optimal baseline (following [25]) and generalize the recent results of [26] to a non-Markovian setting. In the next section we deal with variance due to “corner” cases.

3 Variance Reduction

Consider again the policy gradient estimate $\hat{\nabla}(\bar{s})$ introduced in eqn. (1). The baseline subtraction method reduces the variance of $R(\bar{s})$ by subtracting a scalar $b_{t,i}$ from $R(\bar{s})$:

$$\hat{\nabla}_i(\bar{s}) = \sum_{t=1}^{T} (R(\bar{s}) - b_{t,i}) \nabla_{\theta_i} \log(\pi(a_t \mid s_t)) \,.$$

The lemma below describes the conditions on $b_{t,i}$ for the baseline subtraction to work:

Lemma 1 For every $t$ and $i$, let $b_{t,i}$ be a scalar that does not depend on $a_t, \bar{s}_{t+1:T}$, but may depend on $\bar{s}_{1:t-1}$, $s_t$, and on $\theta$. Then,

$$\mathbb{E}_{\bar{s} \sim P_\theta} \left[ \sum_t b_{t,i} \nabla_{\theta_i} \log \pi_\theta(a_t \mid s_t) \right] = 0 \,.$$

The optimal baseline, the one that reduces the variance the most, can be derived following [25]:

$$\mathbb{E}\, \hat{\nabla}_i(\bar{s})^2 = \mathbb{E} \left( \sum_{t=1}^{T} (R(\bar{s}) - b_{t,i}) \nabla_{\theta_i} \log(\pi(a_t \mid s_t)) \right)^2$$

Taking the derivative w.r.t. $b_{\tau,i}$ and comparing to zero, we obtain the following equation for the optimal baseline:

$$\mathbb{E} \left[ \left( \sum_{t=1}^{T} (R(\bar{s}) - b_{t,i}) \nabla_{\theta_i} \log(\pi(a_t \mid s_t)) \right) \nabla_{\theta_i} \log(\pi(a_\tau \mid s_\tau)) \right] = 0 \,.$$

This can be written as $X b = y$, where $X$ is a $T \times T$ matrix with $X_{\tau,t} = \mathbb{E}\left[ \nabla_{\theta_i} \log(\pi(a_t \mid s_t)) \, \nabla_{\theta_i} \log(\pi(a_\tau \mid s_\tau)) \right]$ and $y$ is a $T$ dimensional vector with $y_\tau = \mathbb{E}\left[ R(\bar{s}) \sum_{t=1}^{T} \nabla_{\theta_i} \log(\pi(a_t \mid s_t)) \, \nabla_{\theta_i} \log(\pi(a_\tau \mid s_\tau)) \right]$. We can estimate $X$ and $y$ from a mini-batch of episodes and then set $b_{\cdot,i}$ to be $X^{-1} y$. A more efficient approach is to think of the problem of finding the baseline as an online linear regression problem and have a separate process that updates $b_{\cdot,i}$ in an online manner [27].
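Lemma 1 can also be illustrated on a one-step toy problem: subtracting a baseline leaves the gradient estimate unbiased while shrinking its variance. Here the baseline is simply the expected reward (a constant, which trivially satisfies the lemma's conditions); the numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.5, 0.0])
r = np.array([2.0, 1.0])                # both rewards positive on purpose

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(theta)
n = 100_000
actions = rng.choice(2, size=n, p=p)
eye = np.eye(2)

def estimates(baseline):
    # Per-sample estimator (R - b) * grad_theta log pi(a).
    return (r[actions] - baseline)[:, None] * (eye[actions] - p)

g0 = estimates(0.0)                     # no baseline
gb = estimates(r @ p)                   # baseline b = E[R]

# Same mean (Lemma 1: the subtracted term has zero expectation) ...
assert np.allclose(g0.mean(axis=0), gb.mean(axis=0), atol=0.02)
# ... but strictly smaller variance with the baseline.
assert gb.var(axis=0).sum() < g0.var(axis=0).sum()
```

The online-regression variant in the text would replace the fixed constant with per-step values $b_{t,i}$ updated from a running fit.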
Many policy gradient variants [26] replace $R(\bar{s})$ with the Q-function, which assumes the Markovian setting. The following lemma gives a non-Markovian analogue of the Q function.

Lemma 2 Define

$$Q_\theta(\bar{s}_{1:t}) := \sum_{\bar{s}_{t+1:T}} P_\theta(\bar{s}_{t+1:T} \mid \bar{s}_{1:t}) R(\bar{s}) \,. \quad (2)$$

Let $\xi$ be a random variable and let $\hat{Q}_\theta(\bar{s}_{1:t}, \xi)$ be a function such that $\mathbb{E}_\xi \hat{Q}_\theta(\bar{s}_{1:t}, \xi) = Q_\theta(\bar{s}_{1:t})$ (in particular, we can take $\xi$ to be empty and then $\hat{Q} \equiv Q$). Then,

$$\mathbb{E}_{\bar{s} \sim P_\theta, \xi} \left[ \sum_{t=1}^{T} \hat{Q}_\theta(\bar{s}_{1:t}, \xi) \nabla_\theta \log(\pi_\theta(a_t \mid s_t)) \right] = \nabla \, \mathbb{E}_{\bar{s} \sim P_\theta}[R(\bar{s})] \,.$$

Observe that the following analogue of the value function for the non-Markovian setting,

$$V_\theta(\bar{s}_{1:t-1}, s_t) = \sum_{a_t} \pi_\theta(a_t \mid s_t) Q_\theta(\bar{s}_{1:t}) \,, \quad (3)$$

satisfies the conditions of Lemma 1. Therefore, we can also replace $R(\bar{s})$ with an analogue of the so-called Advantage function, $A(\bar{s}_{1:t}) = Q_\theta(\bar{s}_{1:t}) - V_\theta(\bar{s}_{1:t-1}, s_t)$. The advantage function, and generalizations of it, are often used in actor-critic policy gradient implementations (see for example [26]). In the non-Markovian setting considered in this paper, the Advantage function is more complicated to estimate, and therefore, in our experiments, we use estimators that involve the term $R(\bar{s}) - b_{t,i}$, where $b_{t,i}$ is estimated using online linear regression.

4 Safe Reinforcement Learning

In the previous section we have shown how to optimize the reinforcement learning objective by stochastic policy gradient ascent. Recall that we have defined the objective to be $\mathbb{E}_{\bar{s} \sim P_\theta} R(\bar{s})$, that is, the expected reward. Objectives that involve expectation are common in machine learning. We now argue that this objective poses a functional safety problem. Consider a reward function for which $R(\bar{s}) = -r$ for trajectories that represent a rare “corner” event which we would like to avoid, such as an accident, and $R(\bar{s}) \in [-1, 1]$ for the rest of the trajectories.
For concreteness, suppose that our goal is to learn to perform an overtake maneuver. Normally, in an accident-free trajectory, $R(\bar{s})$ would reward successful, smooth takeovers and penalize staying in lane without completing the takeover — hence the range $[-1, 1]$. If a sequence $\bar{s}$ represents an accident, we would like the reward $-r$ to provide a sufficiently high penalty to discourage such occurrences. The question is what the value of $r$ should be to ensure accident-free driving. Observe that the effect of an accident on $\mathbb{E}[R(\bar{s})]$ is the additive term $-pr$, where $p$ is the probability mass of trajectories with an accident event. If this term is negligible, i.e., $p \ll 1/r$, then the learner might prefer a policy that performs an accident (or adopts in general a reckless driving policy) in order to fulfill the takeover maneuver successfully more often than a policy that would be more defensive at the expense of having some takeover maneuvers not completed successfully. In other words, if we want to make sure that the probability of accidents is at most $p$, then we must set $r \gg 1/p$. Since we would like $p$ to be extremely small (say, $p = 10^{-9}$), we obtain that $r$ must be extremely large. Recall that in policy gradient we estimate the gradient of $\mathbb{E}[R(\bar{s})]$. The following lemma shows that the variance of the random variable $R(\bar{s})$ grows with $pr^2$, which is larger than $r$ for $r \gg 1/p$. Hence, even estimating the objective is difficult, let alone its gradient.

Lemma 3 Let $\pi_\theta$ be a policy and let $p, r$ be scalars such that with probability $p$ we have $R(\bar{s}) = -r$ and with probability $1 - p$ we have $R(\bar{s}) \in [-1, 1]$. Then,

$$\mathrm{Var}[R(\bar{s})] \ge pr^2 - (pr + (1 - p))^2 = (p - p^2) r^2 - 2p(1 - p) r - (1 - p)^2 \approx pr^2 \,,$$

where the last approximation holds for the case $r \ge 1/p$.
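A quick numeric check of Lemma 3, computing the moments of $R(\bar{s})$ exactly for one concrete choice of $p$ and $r$ (the benign rewards are taken uniform on $[-1, 1]$ purely for illustration; the paper only requires them to lie in that interval, and $p$ here is far larger than the $10^{-9}$ the text targets):

```python
# Reward model of Lemma 3: R = -r with probability p, otherwise R in [-1, 1].
p = 1e-3
r = 10 / p                                   # r >> 1/p, as safety requires

# Exact moments, with the benign rewards uniform on [-1, 1]:
mean = p * (-r) + (1 - p) * 0.0
second = p * r**2 + (1 - p) * (1.0 / 3.0)    # E[U(-1,1)^2] = 1/3
var = second - mean**2

# Lemma 3's lower bound holds, and Var[R] is indeed on the order of p*r^2.
lower_bound = p * r**2 - (p * r + (1 - p))**2
assert var >= lower_bound
assert 0.5 * p * r**2 < var < 2 * p * r**2
```

With $p = 10^{-3}$ and $r = 10^4$ this gives a variance near $10^5$, illustrating why a Monte Carlo estimate of the objective, let alone its gradient, becomes impractical as $p$ shrinks.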
The above discussion shows that an objective of the form $\mathbb{E}[R(\bar{s})]$ cannot ensure functional safety without causing a serious variance problem. The baseline subtraction method for variance reduction would not offer a sufficient remedy, because we would merely be shifting the problem from very high variance of $R(\bar{s})$ to equally high variance of the baseline constants, whose estimation would equally suffer from numerical instabilities. Moreover, if the probability of an accident is $p$, then on average we should sample at least $1/p$ sequences before obtaining an accident event. This immediately implies a lower bound of $1/p$ samples of sequences for any learning algorithm that aims at minimizing $\mathbb{E}[R(\bar{s})]$. We therefore face a fundamental problem whose solution must be found in a new architectural design and formalism of the system, rather than through numerical conditioning tricks. Our approach is based on the notion that hard constraints should be injected outside of the learning framework. In other words, we decompose the policy function into a learnable part and a non-learnable part. Formally, we structure the policy function as $\pi_\theta = \pi^{(T)} \circ \pi^{(D)}_\theta$, where $\pi^{(D)}_\theta$ maps the (agnostic) state space into a set of Desires, while $\pi^{(T)}$ maps the Desires into a trajectory (which determines how the car should move over a short range). The function $\pi^{(D)}_\theta$ is responsible for the comfort of driving and for making strategic decisions, such as which other cars should be overtaken or given way and what the desired position of the host car within its lane is, and so forth. The mapping from state to Desires is a policy $\pi^{(D)}_\theta$ that is learned from experience by maximizing an expected reward. The Desires produced by $\pi^{(D)}_\theta$ are translated into a cost function over driving trajectories.
The function $\pi^{(T)}$, which is not learned, is implemented by finding a trajectory that minimizes the aforementioned cost subject to hard constraints on functional safety. This decomposition allows us to always ensure functional safety while at the same time enjoying comfortable driving most of the time.

To illustrate the idea, let us consider a challenging driving scenario, which we call the double merge scenario (see Figure 1 for an illustration). In a double merge, vehicles approach the merge area from both left and right sides and, from each side, a vehicle can decide whether or not to merge into the other side. Successfully executing a double merge in busy traffic requires significant negotiation skills and experience, and is notoriously difficult to execute in a heuristic or brute force approach that enumerates all possible trajectories that could be taken by all agents in the scene.

Figure 1: The double merge scenario. Vehicles arrive from the left or right side to the merge area. Some vehicles should continue on their road while other vehicles should merge to the other side. In dense traffic, vehicles must negotiate the right of way.

We begin by defining the set of Desires $D$ appropriate for the double merge maneuver. Let $D$ be the Cartesian product of the following sets:

$$D = [0, v_{\max}] \times L \times \{g, t, o\}^n$$

where $[0, v_{\max}]$ is the desired target speed of the host vehicle, $L = \{1, 1.5, 2, 2.5, 3, 3.5, 4\}$ is the desired lateral position in lane units, where whole numbers designate lane centers and fractional numbers designate lane boundaries, and $\{g, t, o\}$ are classification labels assigned to each of the $n$ other vehicles. Each of the other vehicles is assigned ‘g’ if the host vehicle is to “give way” to it, ‘t’ if it is to “take way”, and ‘o’ if it is to maintain an offset distance to it. Next we describe how to translate a set of Desires, $(v, l, c_1, \ldots, c_n) \in D$, into a cost function over driving trajectories.
A driving trajectory is represented by $(x_1, y_1), \ldots, (x_k, y_k)$, where $(x_i, y_i)$ is the (lateral, longitudinal) location of the car (in egocentric units) at time $\tau \cdot i$. In our experiments, we set $\tau = 0.1$ sec and $k = 10$. The cost assigned to a trajectory is a weighted sum of individual costs assigned to the desired speed, the desired lateral position, and the label assigned to each of the other $n$ vehicles. Each of the individual costs is described below. Given a desired speed $v \in [0, v_{\max}]$, the cost of a trajectory associated with speed is $\sum_{i=2}^{k} (v - \|(x_i, y_i) - (x_{i-1}, y_{i-1})\| / \tau)^2$. Given a desired lateral position $l \in L$, the cost associated with the desired lateral position is $\sum_{i=1}^{k} \mathrm{dist}(x_i, y_i, l)$, where $\mathrm{dist}(x, y, l)$ is the distance from the point $(x, y)$ to the lane position $l$. As to the cost due to other vehicles: for any other vehicle, let $(x'_1, y'_1), \ldots, (x'_k, y'_k)$ be its predicted trajectory in the host vehicle's egocentric units, and let $i$ be the earliest point for which there exists $j$ such that the distance between $(x_i, y_i)$ and $(x'_j, y'_j)$ is small (if there is no such point, we let $i = \infty$). If the car is classified as “give-way”, we would like that $\tau i > \tau j + 0.5$, meaning that we arrive at the trajectory intersection point at least $0.5$ seconds after the other vehicle arrives at that point. A possible formula for translating this constraint into a cost is $[\tau (j - i) + 0.5]_+$. Likewise, if the car is classified as “take-way”, we would like that $\tau j > \tau i + 0.5$, which is translated to the cost $[\tau (i - j) + 0.5]_+$. Finally, if the car is classified as “offset”, we would like $i$ to be $\infty$ (meaning, the trajectories do not intersect). This can be translated to a cost by penalizing the distance between the trajectories.
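The cost terms above can be sketched as follows. This is a minimal illustration under simplifying assumptions of ours, not the paper's implementation: $\mathrm{dist}(x, y, l)$ is taken to be the lateral offset $|x - l|$, and "the distance is small" is a fixed threshold.

```python
import numpy as np

TAU = 0.1          # seconds between trajectory points
K = 10             # trajectory length

def speed_cost(traj, v):
    # sum_{i=2}^k (v - ||p_i - p_{i-1}|| / tau)^2
    steps = np.linalg.norm(np.diff(traj, axis=0), axis=1) / TAU
    return float(np.sum((v - steps) ** 2))

def lateral_cost(traj, l):
    # sum_i dist(x_i, y_i, l), with dist simplified to lateral offset
    return float(np.sum(np.abs(traj[:, 0] - l)))

def give_way_cost(traj, other, threshold=2.0):
    # Earliest i such that some j brings the trajectories within
    # `threshold`; give-way wants tau*i > tau*j + 0.5, so the cost
    # is the hinge [tau*(j - i) + 0.5]_+.
    d = np.linalg.norm(traj[:, None, :] - other[None, :, :], axis=2)
    close = np.argwhere(d < threshold)
    if close.size == 0:
        return 0.0                      # i = infinity: no interaction
    i, j = close[np.argmin(close[:, 0])]
    return max(TAU * (j - i) + 0.5, 0.0)

# Host drives straight in lane 2 at 10 m/s; another car runs 5 m behind
# in the same lane (columns are (lateral, longitudinal)).
host = np.stack([np.full(K, 2.0), 1.0 * np.arange(K)], axis=1)
other = np.stack([np.full(K, 2.0), 1.0 * np.arange(K) - 5.0], axis=1)

assert speed_cost(host, v=10.0) == 0.0  # 1 m per 0.1 s step = 10 m/s
assert lateral_cost(host, l=2.0) == 0.0
assert give_way_cost(host, other) > 0.0 # host reaches the meeting point first
```

A full planner would form the weighted sum of these terms (plus a take-way and an offset term) and minimize it subject to the hard safety constraints described next.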
By assigning a weight to each of the aforementioned costs we obtain a single objective function for the trajectory planner, $\pi^{(T)}$. Naturally, we can also add to the objective a cost that encourages smooth driving. More importantly, we add hard constraints that ensure functional safety of the trajectory. For example, we do not allow $(x_i, y_i)$ to be off the roadway, and we do not allow $(x_i, y_i)$ to be close to $(x'_j, y'_j)$ for any trajectory point $(x'_j, y'_j)$ of any other vehicle if $|i - j|$ is small.

To summarize, we decompose the policy $\pi_\theta$ into a mapping from the agnostic state to a set of Desires and a mapping from the Desires to an actual trajectory. The latter mapping is not learned and is implemented by solving an optimization problem whose cost depends on the Desires and whose hard constraints guarantee the functional safety of the policy. It is left to explain how we learn the mapping from the agnostic state to the Desires, which is the topic of the next section.

5 Temporal Abstraction

In the previous section we injected prior knowledge in order to break down the problem in such a way as to ensure functional safety. We saw that through RL alone, a system complying with functional safety will suffer a very high and unwieldy variance on the reward $R(\bar{s})$, and this can be fixed by splitting the problem formulation into a mapping from (agnostic) state space to Desires, learned using policy gradient iterations, followed by a mapping to the actual trajectory which does not involve learning. It is necessary, however, to inject even more prior knowledge into the problem and decompose the decision making into semantically meaningful components — and this for two reasons. First, the size of $D$ might be quite large and even continuous (in the double-merge scenario described in the previous section we had $D = [0, v_{\max}] \times L \times \{g, t, o\}^n$).
Second, the gradient estimator involves the term $\sum_{t=1}^{T} \nabla_\theta \pi_\theta(a_t \mid s_t)$. As mentioned above, the variance grows with the time horizon $T$ [25]. In our case, the value of $T$ is roughly $250$, which is high enough to create significant variance.

[Figure 2: An options graph for the double merge scenario. From the root, the graph branches into "Merge" and "Prepare" nodes, then into lane decisions ("Left", "Right"), then maneuver decisions ("Go", "Push", "Stay"), then speed decisions ("Same", "Accelerate", "Decelerate"), and finally a chain of nodes that assigns each vehicle ID$_1, \ldots,$ ID$_n$ a label in $\{g, t, o\}$.]

Our approach follows the options framework due to [34]. An options graph represents a hierarchical set of decisions organized as a Directed Acyclic Graph (DAG). There is a special node called the "root" of the graph. The root node is the only node that has no incoming edges. The decision process traverses the graph, starting from the root node, until it reaches a "leaf" node, namely, a node that has no outgoing edges. Each internal node should implement a policy function that picks a child among its available children. There is a predefined mapping from the set of traversals over the options graph to the set of desires, $D$. In other words, a traversal of the options graph is automatically translated into a desire in $D$. Given a node $v$ in the graph, we denote by $\theta_v$ the parameter vector that specifies the policy of picking a child of $v$. Let $\theta$ be the concatenation of all the $\theta_v$; then $\pi^{(D)}_\theta$ is defined by traversing from the root of the graph to a leaf, while at each node $v$ using the policy defined by $\theta_v$ to pick a child node. A possible options graph for the double merge scenario is depicted in Figure 2. The root node first decides if we are within the merging area or if we are approaching it and need to prepare for it. In both cases, we need to decide whether to change lane (to the left or right side) or to stay in lane. If we have decided to change lane, we need to decide if we can go on and perform the lane change maneuver (the "go" node).
If it is not possible, we can try to "push" our way in (by aiming at being on the lane mark) or to "stay" in the same lane. This determines the desired lateral position in a natural way: for example, if we change lane from lane $2$ to lane $3$, "go" sets the desired lateral position to $3$, "stay" sets it to $2$, and "push" sets it to $2.5$. Next, we should decide whether to keep the "same" speed, "accelerate", or "decelerate". Finally, we enter a "chain-like" structure that goes over all the vehicles and sets their semantic meaning to a value in $\{g, t, o\}$. This sets the desires for the semantic meaning of the vehicles in an obvious way. Note that we share the parameters of all the nodes in this chain (similarly to Recurrent Neural Networks). An immediate benefit of the options graph is the interpretability of the results. Another immediate benefit is that we rely on the decomposable structure of the set $D$, and therefore the policy at each node should choose between a small number of possibilities. Finally, the structure allows us to reduce the variance of the policy gradient estimator. We next elaborate on this last point. As mentioned previously, the length of an episode in the double merge scenario is roughly $T = 250$ steps. This number comes from the fact that, on the one hand, we would like to have enough time to see the consequences of our actions (e.g., if we decided to change lane as a preparation for the merge, we will see the benefit only after a successful completion of the merge), while on the other hand, due to the dynamics of driving, we must make decisions at a fast enough frequency ($10$ Hz in our case). The options graph enables us to decrease the effective value of $T$ in two complementary ways. First, given higher level decisions, we can define a reward for lower level decisions while taking into account much shorter episodes.
For example, when we have already picked the "lane change" and "go" nodes, we can learn the policy for assigning semantic meaning to vehicles by looking at episodes of 2-3 seconds (meaning that $T$ becomes $20$-$30$ instead of $250$). Second, for high level decisions (such as whether to change lane or to stay in the same lane), we do not need to make decisions every $0.1$ seconds. Instead, we can either make decisions at a lower frequency (e.g., every second), or implement an "option termination" function, so that the gradient is calculated only after every termination of the option. In both cases, the effective value of $T$ is again an order of magnitude smaller than its original value. (Footnote 1: the value $T \approx 250$ comes from the following: suppose we work at $10$ Hz, the merge area is $100$ meters long, we start the preparation for the merge $300$ meters before it, and we drive at $16$ meters per second ($\approx 60$ km per hour); then an episode lasts roughly $25$ seconds, so $T$ is roughly $250$.) All in all, the estimator at every node depends on a value of $T$ which is an order of magnitude smaller than the original $250$ steps, which immediately translates to a smaller variance [21]. To summarize, we introduced the options graph as a way to break down the problem into semantically meaningful components, where the Desires are defined through a traversal over a DAG. At each step along the way, the learner maps the state space to a small subset of Desires, thereby effectively decreasing the time horizon to much smaller sequences and at the same time reducing the output space for the learning problem. The aggregated effect is a reduction in both the variance and the sample complexity of the learning problem.

6 Experimental Demonstration

The purpose of this section is to give a sense of how a challenging negotiation scenario is handled by our framework. The experiment involves proprietary software modules (to produce the sensing state, the simulation, and the trajectory planner) and data (for the learning-by-imitation part).
It therefore should be regarded as a demonstration rather than a reproducible experiment. We leave to future work the task of conducting a reproducible experiment, with a comparison to other approaches. We experimented with the double-merge scenario described in Section 4 (see again Figure 1). This is a challenging negotiation task, as cars from both sides have a strong incentive to merge, and failure to merge in time leads to ending up on the wrong side of the intersection. In addition, the reward $R(\bar{s})$ associated with a trajectory needs to account not only for the success or failure of the merge operation but also for the smoothness of the trajectory control and the comfort level of all other vehicles in the scene. In other words, the goal of the RL learner is not only to succeed in the merge maneuver but also to accomplish it in a smooth manner and without disrupting the driving patterns of other vehicles. We relied on the following sensing information. The static part of the environment is represented as the geometry of the lanes and the free space (all in ego-centric units). Each agent also observes the location, velocity, and heading of every other car within 100 meters of it. Finally, 300 meters before the merging area the agent receives the side it should be on after the merge ('left' or 'right'). For the trajectory planner, $\pi^{(T)}$, we used an optimization algorithm based on dynamic programming. We used the options graph described in Figure 2. Recall that we should define a policy function for every node of our options graph. We initialized the policy at all nodes using imitation learning. Each policy function, associated with every node of the options graph, is represented by a neural network with three fully connected hidden layers. Note that data collected from a human driver only contains the final maneuver; we do not observe a traversal of the options graph.
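To make the options-graph traversal concrete, here is a minimal sketch of how per-node policies could map a state to a traversal (and hence a Desire). The node names follow Figure 2, but the graph is pruned (speed and per-vehicle chain nodes omitted), the state encoding is arbitrary, and each node's three-hidden-layer network from the paper is replaced by a single linear softmax layer purely to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

class Node:
    """An internal options-graph node: a tiny softmax policy over children."""
    def __init__(self, children, state_dim=8):
        self.children = children                       # list of child names
        self.W = rng.normal(0.0, 0.1, (len(children), state_dim))

    def pick(self, state):
        logits = self.W @ state
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return self.children[int(rng.choice(len(probs), p=probs))]

# A pruned version of the double-merge options graph (cf. Figure 2).
# Names absent from this dict ("go", "push", "stay", "stay_lane") are leaves.
graph = {
    "root":    Node(["merge", "prepare"]),
    "merge":   Node(["left", "right", "stay_lane"]),
    "prepare": Node(["left", "right", "stay_lane"]),
    "left":    Node(["go", "push", "stay"]),
    "right":   Node(["go", "push", "stay"]),
}

def traverse(state):
    """Walk from the root to a leaf, recording the decisions along the way.
    The resulting path is what would be translated into a Desire in D."""
    path, node = [], "root"
    while node in graph:
        node = graph[node].pick(state)
        path.append(node)
    return path

state = rng.normal(size=8)
print(traverse(state))  # e.g. ['merge', 'left', 'go']
```

The concatenation of all the `W` matrices plays the role of $\theta$, and training would update them jointly via the policy gradient estimator of the next paragraphs.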
For some of the nodes, we can infer the labels from the data in a relatively straightforward manner. For example, the classification of vehicles into "give-way", "take-way", and "offset" can be inferred from the future position of the host vehicle relative to the other vehicles. For the remaining nodes we used implicit supervision: namely, our options graph induces a probability over future trajectories, and we train it by maximizing the (log) probability of the trajectory that was chosen by the human driver. Fortunately, deep learning is quite good at dealing with hidden variables, and the imitation process succeeded in learning a reasonable initialization point for the policy. See [1] for some videos. For the policy gradient updates we used a simulator (initialized using imitation learning) with a self-play enhancement. Namely, we partitioned the set of agents into two sets, $A$ and $B$. Set $A$ was used as reference players while set $B$ was used for the policy gradient learning process. When the learning process converged, we used set $B$ as the reference players and used set $A$ for the learning process. This alternating process of switching the roles of the two sets continued for 10 rounds. See [1] for the resulting videos.

Acknowledgements

We thank Moritz Werling, Daniel Althoff, and Andreas Lawitz for helpful discussions.

References

[1] Supplementary video files. https://www.dropbox.com/s/136nbndtdyehtgi/doubleMerge.m4v?dl=0.
[2] V. M. Aleksandrov, V. I. Sysoev, and V. V. Shemeneva. Stochastic optimization of systems. Izv. Akad. Nauk SSSR, Tekh. Kibernetika, pages 14–19, 1968.
[3] Haitham Bou Ammar, Rasul Tutunov, and Eric Eaton. Safe policy search for lifelong reinforcement learning with sublinear regret. The Journal of Machine Learning Research (JMLR), 2015.
[4] Richard Bellman. Dynamic programming and Lagrange multipliers. Proceedings of the National Academy of Sciences of the United States of America, 42(10):767, 1956.
[5] Richard Bellman.
Introduction to the Mathematical Theory of Control Processes, volume 2. IMA, 1971.
[6] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.
[7] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
[8] Ronen I. Brafman and Moshe Tennenholtz. R-max: a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3:213–231, 2003.
[9] George W. Brown. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1):374–376, 1951.
[10] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[11] Peter W. Glynn. Likelihood ratio gradient estimation: an overview. In Proceedings of the 19th Conference on Winter Simulation, pages 366–375. ACM, 1987.
[12] Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.
[13] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5), 2000.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[15] Junling Hu and Michael P. Wellman. Nash Q-learning for general-sum stochastic games. The Journal of Machine Learning Research, 4:1039–1069, 2003.
[16] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
[17] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey.
Journal of Artificial Intelligence Research, pages 237–285, 1996.
[18] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
[19] Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 2013.
[20] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, volume 157, pages 157–163, 1994.
[21] Timothy A. Mann, Doina Precup, and Shie Mannor. Approximate value iteration with temporally extended actions. Journal of Artificial Intelligence Research, 2015.
[22] Eric Moulines and Francis R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
[23] Keith Naughton. Human drivers are bumping into driverless cars and exposing a key flaw. http://www.autonews.com/article/20151218/OEM11/151219874/human-drivers-are-bumping-into-driverless-cars-and-exposing-a-key-flaw, 2015.
[24] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025, 2014.
[25] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.
[26] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[27] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.
[28] Shai Shalev-Shwartz.
SDCA without duality, regularization, and individual convexity. ICML, 2016.
[29] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.
[30] Yoav Shoham, Rob Powers, and Trond Grenager. If multi-agent learning is the answer, what is the question? Artificial Intelligence, 171(7):365–377, 2007.
[31] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
[32] Richard S. Sutton, David A. McAllester, Satinder P. Singh, Yishay Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pages 1057–1063, 1999.
[33] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.
[34] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
[35] Csaba Szepesvári. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1–103, 2010. URL http://www.ualberta.ca/~szepesva/RLBook.html.
[36] Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning, pages 896–903. ACM, 2005.
[37] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J. Mankowitz, and Shie Mannor. A deep hierarchical approach to lifelong learning in Minecraft. arXiv preprint arXiv:1604.07255, 2016.
[38] S. Thrun. Learning to play the game of chess. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems (NIPS) 7, Cambridge, MA, 1995. MIT Press.
[39] William Uther and Manuela Veloso. Adversarial reinforcement learning. Technical report, Carnegie Mellon University, 1997. Unpublished.
[40] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[41] Chelsea C. White III. A survey of solution techniques for the partially observed Markov decision process. Annals of Operations Research, 32(1):215–230, 1991.
[42] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

A Proofs

A.1 Proof of Theorem 1

The policy $\pi_\theta$ induces a probability distribution over sequences as follows: given a sequence $\bar{s} = (s_1, a_1), \ldots, (s_T, a_T)$, we have
$$P_\theta(\bar{s}) = \prod_{t=1}^{T} P[s_t \mid \bar{s}_{1:t-1}] \, \pi_\theta(a_t \mid s_t).$$
Note that in deriving the above expression we make no assumptions on $P[s_t \mid \bar{s}_{1:t-1}]$. This stands in contrast to Markov Decision Processes, in which it is assumed that $s_t$ is independent of the past given $(s_{t-1}, a_{t-1})$. The only assumption we make is that the (random) choice of $a_t$ is based solely on $s_t$, which comes from our architectural design choice of the hypothesis space of policy functions. The remainder of the proof employs the standard likelihood ratio trick (e.g., [2, 11]) with the observation that, since $P[s_t \mid \bar{s}_{1:t-1}]$ does not depend on the parameters $\theta$, it is eliminated in the policy gradient. This is detailed below for the sake of completeness:
$$\begin{aligned}
\nabla_\theta \, \mathbb{E}_{\bar{s} \sim P_\theta}[R(\bar{s})]
&= \nabla_\theta \sum_{\bar{s}} P_\theta(\bar{s}) R(\bar{s}) && \text{(definition of expectation)} \\
&= \sum_{\bar{s}} R(\bar{s}) \nabla_\theta P_\theta(\bar{s}) && \text{(linearity of derivation)} \\
&= \sum_{\bar{s}} P_\theta(\bar{s}) R(\bar{s}) \frac{\nabla_\theta P_\theta(\bar{s})}{P_\theta(\bar{s})} && \text{(multiply and divide by } P_\theta(\bar{s})\text{)} \\
&= \sum_{\bar{s}} P_\theta(\bar{s}) R(\bar{s}) \nabla_\theta \log P_\theta(\bar{s}) && \text{(derivative of the log)} \\
&= \sum_{\bar{s}} P_\theta(\bar{s}) R(\bar{s}) \, \nabla_\theta \!\left( \sum_{t=1}^{T} \log P[s_t \mid \bar{s}_{1:t-1}] + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) \right) && \text{(definition of } P_\theta\text{)} \\
&= \sum_{\bar{s}} P_\theta(\bar{s}) R(\bar{s}) \left( \sum_{t=1}^{T} \nabla_\theta \log P[s_t \mid \bar{s}_{1:t-1}] + \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) && \text{(linearity of derivative)} \\
&= \sum_{\bar{s}} P_\theta(\bar{s}) R(\bar{s}) \left( 0 + \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) \\
&= \mathbb{E}_{\bar{s} \sim P_\theta}\!\left[ R(\bar{s}) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right].
\end{aligned}$$
This concludes our proof.

A.2 Proof of Lemma 1

$$\begin{aligned}
\mathbb{E}_{\bar{s}}\!\left[ b_{t,i} \nabla_{\theta_i} \log \pi_\theta(a_t \mid s_t) \right]
&= \sum_{\bar{s}} P_\theta(\bar{s}) \, b_{t,i} \nabla_{\theta_i} \log \pi_\theta(a_t \mid s_t) \\
&= \sum_{\bar{s}} P_\theta(\bar{s}_{1:t-1}, s_t) \, \pi_\theta(a_t \mid s_t) \, P_\theta(\bar{s}_{t+1:T} \mid \bar{s}_{1:t}) \, b_{t,i} \nabla_{\theta_i} \log \pi_\theta(a_t \mid s_t) \\
&= \sum_{\bar{s}_{1:t-1}, s_t} P_\theta(\bar{s}_{1:t-1}, s_t) \, b_{t,i} \sum_{a_t} \pi_\theta(a_t \mid s_t) \nabla_{\theta_i} \log \pi_\theta(a_t \mid s_t) \sum_{\bar{s}_{t+1:T}} P_\theta(\bar{s}_{t+1:T} \mid \bar{s}_{1:t}) \\
&= \sum_{\bar{s}_{1:t-1}, s_t} P_\theta(\bar{s}_{1:t-1}, s_t) \, b_{t,i} \left[ \sum_{a_t} \pi_\theta(a_t \mid s_t) \nabla_{\theta_i} \log \pi_\theta(a_t \mid s_t) \right].
\end{aligned}$$
By Lemma 4, the term in the brackets is $0$, which concludes our proof.

A.3 Proof of Lemma 2

We have:
$$\begin{aligned}
\mathbb{E}_{\bar{s} \sim P_\theta, \xi}\!\left[ \sum_{t=1}^{T} \hat{Q}_\theta(\bar{s}_{1:t}, \xi) \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
&= \sum_{t=1}^{T} \sum_{\bar{s}_{1:t}} P_\theta(\bar{s}_{1:t}) \, \mathbb{E}_\xi \hat{Q}_\theta(\bar{s}_{1:t}, \xi) \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{\bar{s}_{t+1:T}} P_\theta(\bar{s}_{t+1:T} \mid \bar{s}_{1:t}) \\
&= \sum_{t=1}^{T} \sum_{\bar{s}_{1:t}} P_\theta(\bar{s}_{1:t}) \, Q_\theta(\bar{s}_{1:t}) \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
&= \sum_{t=1}^{T} \sum_{\bar{s}_{1:t}} P_\theta(\bar{s}_{1:t}) \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{\bar{s}_{t+1:T}} P_\theta(\bar{s}_{t+1:T} \mid \bar{s}_{1:t}) R(\bar{s}) \\
&= \mathbb{E}_{\bar{s}}\!\left[ R(\bar{s}) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right].
\end{aligned}$$
The claim follows from Theorem 1.

A.4 Proof of Lemma 3

We have
$$\mathbb{E}[R(\bar{s})] \in [-pr - (1-p),\; -pr + (1-p)] \;\Rightarrow\; \mathbb{E}[R(\bar{s})]^2 \le (pr + (1-p))^2,$$
and $\mathbb{E}[R(\bar{s})^2] \ge p r^2$. The claim follows since $\mathrm{Var}[R(\bar{s})] = \mathbb{E}[R(\bar{s})^2] - \mathbb{E}[R(\bar{s})]^2$.

A.5 Technical Lemmas

Lemma 4 Suppose that $\pi_\theta$ is a function such that for every $\theta$ and $s$ we have $\sum_a \pi_\theta(a \mid s) = 1$. Then,
$$\sum_a \pi_\theta(a \mid s) \nabla_\theta \log \pi_\theta(a \mid s) = 0.$$

Proof
$$\sum_a \pi_\theta(a \mid s) \nabla_\theta \log \pi_\theta(a \mid s) = \sum_a \pi_\theta(a \mid s) \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = \sum_a \nabla_\theta \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0.$$
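As a sanity check on the likelihood-ratio identity proved in Theorem 1, the following sketch verifies numerically, on a tiny single-step two-action example of our own construction (not from the paper), that $\mathbb{E}[R \, \nabla_\theta \log \pi_\theta]$, computed exactly by summing over actions, matches a finite-difference gradient of $\mathbb{E}[R]$.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def expected_reward(theta, r):
    """E[R] for a one-step softmax policy over len(theta) actions."""
    return float(softmax(theta) @ r)

def likelihood_ratio_grad(theta, r):
    """E[R * grad_theta log pi_theta], summed exactly over actions."""
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for a in range(len(theta)):
        # grad_theta log softmax(theta)_a = e_a - pi
        glog = -pi.copy()
        glog[a] += 1.0
        grad += pi[a] * r[a] * glog
    return grad

theta = np.array([0.3, -0.1])
r = np.array([1.0, -0.5])  # per-action rewards, chosen arbitrarily

g = likelihood_ratio_grad(theta, r)

# Central-difference gradient of E[R] for comparison.
eps = 1e-6
fd = np.array([
    (expected_reward(theta + eps * np.eye(2)[i], r)
     - expected_reward(theta - eps * np.eye(2)[i], r)) / (2 * eps)
    for i in range(2)])

print(np.max(np.abs(g - fd)))  # prints a tiny number (numerical noise)
```

Lemma 4 is implicit here: the rows of the log-softmax gradient sum against $\pi_\theta$ to zero, which is what lets a constant baseline be subtracted from $R$ without biasing the estimator.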
