Accelerating Cooperative Planning for Automated Vehicles with Learned Heuristics and Monte Carlo Tree Search


Authors: Karl Kurzer, Marcus Fechner, J. Marius Zöllner

Abstract — Efficient driving in urban traffic scenarios requires foresight. The observation of other traffic participants and the inference of their possible next actions depending on the own action is considered cooperative prediction and planning. Humans are well equipped with the capability to predict the actions of multiple interacting traffic participants and plan accordingly, without the need to communicate directly with others. Prior work has shown that effective cooperative planning is possible without explicit communication. However, the search space for cooperative plans is so large that most of the computational budget is spent exploring unpromising regions far away from the solution. To accelerate the planning process, we combine learned heuristics with a cooperative planning method to guide the search towards regions with promising actions, yielding better solutions at lower computational cost.

I. INTRODUCTION

Cooperative planning methods consider the mutual dependence of actions in multi-agent environments, as opposed to methods that reduce multi-agent environments to single-agent ones, treating other agents' actions as independent of one another. This reduction accelerates the planning process considerably, as it sidesteps the curse of dimensionality. However, the lack of consideration prunes the solution space, so that solutions requiring cooperation between agents cannot be discovered. Thus, more suitable methods need to be developed in order to handle the complexity of this problem class without dropping the interdependence of actions.
While a chess player could consider all movable pieces before taking a decision, even novice players quickly exclude certain moves that they deem irrelevant given the current board state. It is the ability to combine search with learned heuristics that allows us to discover solutions to sequential decision-making problems quickly instead of being stuck thinking [1], [2]. Problems solved by iterative methods, converging to a (local) optimum, greatly benefit from an initialization that is as close as possible to the solution. Humans use their problem-specific experience as this initialization. Thus, it is not surprising that the superhuman performance of AlphaGo, a computer program that beat the strongest Go player in the world in 2016, results from this combination of search and learned heuristics [3].

The interplay of learned models and search has additional inherent advantages. Even though the goal for a learned model is to generalize well, i.e. it should perform equally well on known and on unknown data, it cannot be predicted what happens when the model is fed with data originating from a different distribution. Furthermore, the lack of introspection makes it hard to encode constraints into the model that guarantee safety. However, these inherent disadvantages can easily be addressed by search algorithms, as they are bias-free and allow for the integration of constraints.

In the domain of automated driving, cooperative multi-agent trajectory planning tasks are costly problems requiring further research. From a game-theoretic point of view, the goal is to find a Nash equilibrium, i.e. a combination of actions in which no single agent can perform a different action yielding a higher return. Even though cooperative multi-agent trajectory planning algorithms have shown promising results [4], [5], [6], [7], their convergence speed is slow compared to state-of-the-art single-agent trajectory planning methods.

The contributions of our work are twofold. First, we developed a compact and accurate model, using a deep neural network (DNN), which is capable of predicting sampling distributions over actions for cooperative urban multi-agent scenarios. Second, we integrated the trained model as a heuristic into the cooperative multi-agent trajectory planning algorithm presented in [5] and evaluated its performance on a variety of different scenarios, see Fig. 1.

1 Karlsruhe Institute of Technology, Kaiserstr. 12, 76131 Karlsruhe, Germany, kurzer@kit.edu
2 FZI Research Center for Information Technology, Haid-und-Neu-Str. 10-14, 76131 Karlsruhe, Germany, fechner@fzi.de, zoellner@fzi.de

Fig. 1: Integration of a Mixture Density Network into Monte Carlo Tree Search. During the expansion phase (i.e. the exploration of the action space), the current state of the Monte Carlo Tree Search (MCTS) is transformed into a feature vector f, consisting of scalar features f_s as well as visual features f_v. These features are fed into a mixture density network (MDN), generating Gaussian mixture models (GMMs) from which actions a are sampled for each agent, biasing the expansion towards auspicious future states. Figure adapted from [4].

© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. 2020 IEEE Intelligent Vehicles Symposium (IV)
This paper is structured as follows: First, a brief overview of research on prediction and on learned heuristics used in search algorithms is given in Section II. Section III defines the problem. The general approach to the problem is presented in Section IV. Finally, a detailed comparison between the baseline and the proposed versions of the cooperative multi-agent trajectory planning algorithm is conducted in Section V.

II. RELATED WORK

A. Prediction

The purpose of prediction is to model future events. For a traffic situation with multiple traffic participants, this entails estimating where each of the traffic participants will be at a given future time step. Prediction is the fundamental basis for single-agent planning, as it provides the constraints for the subsequent planning step. Approaching prediction from a multi-agent perspective, it becomes clear that prediction and planning are non-separable [6]. The most likely prediction must incorporate the plan of each agent and thus the plan of the predicting traffic participant. However, for reasons of clarity, the term prediction shall denote the output of a system that is not fed directly into a controller, but rather incorporated in a subsequent planning step.

Previous work on interactive scene prediction uses DNNs to predict the action of an ego vehicle in merge scenarios for highway driving [8]. Based on features such as the spatial relation and longitudinal time gap of the seven closest vehicles, the road geometry, and data about the ego vehicle's state, a DNN is trained to parameterize a Gaussian mixture model estimating the longitudinal acceleration and the lateral velocity of the ego vehicle.

Similarly, with an LSTM encoder-decoder architecture with convolutional social pooling, trajectories can be predicted for six distinguished maneuver classes, based on the spatial configuration of a highway traffic scene, capturing the interdependence between vehicles [9].
Other approaches make use of multi-agent trajectory homotopy, enumerating possible maneuver classes using a formation tree [10]. These classes can then be used to estimate the likelihood of any of the maneuver classes being picked given a set of observed trajectories, using Bayesian statistics. This approach allows for local rather than global estimation of the trajectory within a given homotopy class. Tight maneuvers, such as passing a point at the same time, cannot be captured and hence cannot be predicted.

Using a predefined set of maneuvers for each agent in a highway scenario, the collision probabilities for all permutations of these sets can be computed. This allows a prediction of the most likely maneuver combination with a look-ahead of one time step [7]. Due to the short look-ahead in combination with a fixed number of maneuvers, cooperative plans in urban driving scenarios cannot be predicted correctly. While Bahram et al. [6] extend the approach to longer planning horizons, it does not incorporate continuous action spaces, limiting its applicability to highway driving.

Additionally, stochastic approaches based on dynamic Bayesian models have been proposed that can learn conditional dependencies for interactive traffic models from observation using Expectation-Maximization [11]. In combination with random forests, a policy modeled as a conditional density function is approximated, predicting future actions.

B. Learned Heuristics

Knowing the capabilities as well as the limitations of learned models, the integration of these models into classical planning techniques has been an active field of research. Silver et al. demonstrated the capabilities of Monte Carlo Tree Search (MCTS) in combination with learned value and policy networks, outranking the state-of-the-art Go programs by a large margin. They used a policy network to predict priors for each of the possible actions of a node.
These prior probabilities guide the search towards promising areas of the search space, with their influence decaying over time [3]. Similar methods improving the strength of MCTS were developed, interweaving policy learning and roll-outs of the MCTS. In this case, the MCTS recursively improves itself using the knowledge gathered from previous simulations, embedded in a sampling policy that then improves roll-outs further [2], [12], [13].

In the autonomous driving domain, Hubschneider et al. combine an end-to-end trained DNN proposing trajectories with a particle swarm optimizer (PSO) [14]. The DNN is trained via imitation learning, using visual input from a front-facing camera and steering angle labels generated by an expert driver. Using the trained model as a heuristic for the initialization of the PSO planner, the optimization is sped up. This is especially helpful in scenarios with many static obstacles or tight passages, where the majority of particles are in a colliding state, since collision checking is a major bottleneck of any motion planning algorithm [15].

Related work by Paxton et al. integrates a learned high-level options policy in combination with a low-level control policy into the MCTS to improve the overall quality for a single-agent planning problem [16].

Other approaches to accelerate sampling-based motion planning have been proposed, using a conditional variational autoencoder to generate subspaces for sampling distributions over desired states [17]. Similarly, Banzhaf et al. learn a sampling distribution for poses of an RRT* path planning algorithm in semi-structured environments. The model is trained via supervised learning, using an occupancy grid representation enriched with additional features, such as the start position, the goal position, and the past path. The output of the model is the predicted path and heading of the vehicle. Using a sampling algorithm, proposals are drawn from the prediction [18].
While previous work in the areas of prediction and integration of learned heuristics exists, our work demonstrates the feasibility of efficiently predicting distributions of trajectories for multiple agents in a single forward pass. This prediction is then integrated as a heuristic to improve cooperative multi-agent trajectory planning in continuous spaces for urban driving scenarios.

III. PROBLEM STATEMENT

With the goal of guiding sampling-based, cooperative, multi-agent trajectory planning algorithms, such as [5], towards promising regions of the action space, a mapping from the feature space of the scene to the action space of each agent in the scene needs to be defined, F → A_i.

Fig. 2 depicts the exploration of a single agent's action space (change in velocity Δv_lon and change in lateral position Δy_lat) for the unbiased algorithm developed in our previous work [5]. Samples are distributed in a rather random fashion, with some regions yielding higher visit counts. While exploration is necessary in order to find the global optimum, it hinders exploitation and, therefore, the thorough evaluation of areas of the action space that have shown high returns on some trajectories. A heuristic that generates actions, or allows actions to be derived, for these kinds of problems is thus likely to yield better results at a lower computational cost.

Designing a heuristic based on expert knowledge for planning cooperative trajectories is not only time intensive, but prone to errors, due to the complexity arising from the interdependence of actions and rare edge cases. Hence, we decided to learn this heuristic from data. With the goal of using it in a sampling-based planner for cooperative driving, the generation of the sampling distribution must strike a compromise between speed and accuracy, so that it can be executed sufficiently often while ensuring no misguidance.

IV. APPROACH

Since the heuristic is learned from experience, we employed a DNN for function approximation, mapping the feature space F of a multi-agent traffic scene with up to eight traffic participants to a distribution over the action space A_i for each agent.

We defined the outputs of the approximator as the parameters of the sampling distribution, namely a Gaussian mixture model (GMM). This is convenient for three reasons: firstly, Gaussian mixtures allow for easy sampling; secondly, they require little parameterization, facilitating the learning as well as reducing the complexity of the output; and lastly, they are able to approximate arbitrary probability density functions.

A Gaussian mixture model is a linear superposition of Gaussians. The probability of a sample x belonging to that distribution is defined by (1):

p(x) = Σ_{k=1}^{K} φ_k N(x | μ_k, Σ_k)    (1)

with K being the number of mixture components, μ_k the mean, Σ_k the covariance, and φ_k the mixing coefficient of the respective component.

Fig. 2: Exploration of the unbiased algorithm over the action space (Δy_lat [m] vs. Δv_lon [m/s]); darker areas represent regions with high visit counts, meaning that the algorithm has more thoroughly explored these regions. The red triangle is the action that is finally selected. Figure adapted from [5].

In order to accommodate scenarios where the distribution over the action space is multimodal, e.g. two or more homotopy classes exist (passing an object on the right or left, merging in front or behind), we chose to train Gaussian mixture models with two and three components to approximate the sampling distribution.

A. Hybrid Model Architecture

With the goal of approximating a GMM, the outputs of our model estimate the required parameters of the GMM.
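To make Eq. (1) concrete, evaluating and sampling from such a mixture takes only a few lines of plain Python. This is an illustrative one-dimensional sketch, not the paper's implementation (the planner's GMMs cover the two-dimensional action space):

```python
import math
import random

def gaussian_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gmm_pdf(x, phi, mu, var):
    """Eq. (1): p(x) = sum_k phi_k * N(x | mu_k, Sigma_k), 1-D case."""
    return sum(p * gaussian_pdf(x, m, v) for p, m, v in zip(phi, mu, var))

def gmm_sample(phi, mu, var, rng=random):
    """Sampling is easy: draw a component k ~ phi, then x ~ N(mu_k, var_k)."""
    k = rng.choices(range(len(phi)), weights=phi)[0]
    return rng.gauss(mu[k], math.sqrt(var[k]))
```

For a bimodal action distribution (e.g. passing on the left or right), one would use two well-separated means with mixing coefficients summing to one.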
These so-called mixture density networks (MDNs) were first conceived by Bishop [19], with the GMM being conditioned on the feature vector f, see (2):

p(x | f) = Σ_{k=1}^{K} φ_k(f) N(x | μ_k(f), Σ_k(f))    (2)

Our hybrid model architecture relies on visual as well as scalar features. Visual features are extracted using a simple CNN architecture and concatenated with scalar features obtained by a sequence of fully connected layers in a subsequent step, similar to Bacchiani et al. [20]. The hybrid model is illustrated in Fig. 3.

Fig. 3: Hybrid model architecture of the MDN; fully connected layers and convolution layers are denoted by fc and conv respectively. The network is split into a visual as well as a scalar pipeline, handling the respective features. Scalar features of up to eight agents over eight time steps (heading, position, velocity, acceleration, as well as desired velocity and desired lane) are fed into fc1 (ReLU) and further processed by fc2 (ReLU). Visual features are first processed by conv1 (ReLU, filters = 16, kernel = 7x7, stride = 4, padding = reflect) and further fed into conv2 (ReLU, filters = 32, kernel = 3x3, stride = 1, padding = reflect). After this they are flattened into fc3 (ReLU). The resulting outputs from the scalar and visual pipelines are concatenated and processed by two additional layers, fc4 (ReLU) and fc5 (ReLU). The parameters of the GMM (φ_k, μ_k, σ²_k) are generated with the softmax (fc6), identity (fc7), and non-negative ELU (fc8) activation functions, respectively.

1) Input: The visual input consists of 128 x 256 pixels (width x height). Each pixel represents 1 m in longitudinal direction and 0.1 m in lateral direction. The ego vehicle is located at the center of the map. The longitudinal scale is based on a reasonable distance that is required for planning at urban velocities over multiple time steps (e.g. for 50 km/h, 128 m ≈ 9 s). The lateral scale is due to the precision required for scenarios with little lateral clearance. Further, each pixel encodes an integer denoting a semantic class. Classes are non-drivable areas, static obstacles, dynamic obstacles (i.e. other agents), and lanes. Each lane is encoded as a separate class, allowing the agent to distinguish between them. The visual input is one frame made up of two maps: a lane map encoding drivable and non-drivable area as well as lanes, and a map with static and dynamic objects. The scalar input vector includes state information for each agent (e.g. position, velocity, heading, desired velocity as well as desired lane). Due to different scales, all inputs are normalized to the range [−1, 1].

2) Output: In order to generate a valid GMM, the activation functions of the model need to be carefully chosen. As the mean can be either positive or negative, the identity function is used for μ_Δv and μ_Δy. Since the mixing coefficients need to sum to one, a softmax activation function is used for φ_Δv and φ_Δy. Lastly, the covariance matrix has to be positive semidefinite. An exponential activation function suggested by Bishop [19] can solve this problem and avoid variances close to zero. However, exponential functions can lead to numerical instability during training. Hence, we use the non-negative ELU (NNELU) proposed by [21] for σ²_Δv and σ²_Δy, see (3), where ELU(1, ·) denotes the ELU activation with α = 1:

NNELU(x) = ELU(1, x) + 1    (3)

B. Cooperative Planning Algorithm

The cooperative planning algorithm used to generate the data, to integrate the heuristic into, and to benchmark the heuristic against is described in our previous work [5]. Based on MCTS, the algorithm iteratively improves value estimates for actions. The basic approach is depicted in Fig. 4 and Fig. 5.
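The NNELU activation from Eq. (3) is simple to state explicitly. A minimal sketch in plain Python (the paper applies it as a layer activation in Keras):

```python
import math

def elu(x, alpha=1.0):
    """Exponential linear unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0.0 else alpha * (math.exp(x) - 1.0)

def nnelu(x):
    """Eq. (3): NNELU(x) = ELU(1, x) + 1. Positive in exact arithmetic,
    so it yields valid variances while avoiding the instability of exp."""
    return elu(x, alpha=1.0) + 1.0
```

Unlike a plain exponential, NNELU grows only linearly for positive inputs, which keeps gradients bounded during training.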
In the following, we briefly describe the basic concept of MCTS that is required to understand how a priori knowledge can be integrated. The interested reader is referred to [22] for a more elaborate description of MCTS, or to [5] for problem-specific adaptations.

Monte Carlo methods approximate a quantity using a sampling-based approach. Applied to a Markov decision process (MDP), modeling a sequential decision-making problem, sampling can be used to generate different trajectories (i.e. tuples of states s_t and actions a_t) through the MDP. The return R of a trajectory τ is the accumulated reward r_t at time step t when taking action a_t in state s_t, see (4):

R(τ) = Σ_{(s_t, a_t) ∈ τ} r_t(s_t, a_t)    (4)

The Monte Carlo estimate of the action value Q^π(s, a) is the average of the returns over all trajectories τ ∈ T sampled from policy π, starting in state s and taking action a, see (5):

Q̂^π(s, a) = (1 / |T|) Σ_{τ ∈ T ∼ π} R(τ)    (5)

Based on the current state of the MDP, MCTS estimates the action value in four distinct phases for each iteration, until a terminal condition is met (e.g. until a computational or time budget is exceeded). As MCTS is an anytime algorithm [22], it will always return an estimate.

1) Selection: During the selection phase, the UCT (Upper Confidence Bound for Trees) value, see (6), is calculated for a state-action tuple, and the successor state with the maximum UCT value is selected. This process repeats itself until a state is encountered that has not been fully explored (i.e. not all available actions in the state have been tried), see Fig. 4. In equation (6), the first term fosters exploitation of previously explored actions with high action values. The second term ensures that all actions from a given state are tried at least once, with N(s) being the number of times the state s has been visited and N(s, a) the number of times action a has been taken in that state.
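Equations (4) and (5) translate directly into code. A minimal sketch, with trajectories represented as lists of (state, action) tuples and a caller-supplied reward function:

```python
def trajectory_return(trajectory, reward):
    """Eq. (4): R(tau) = sum over (s_t, a_t) in tau of r_t(s_t, a_t)."""
    return sum(reward(s, a) for s, a in trajectory)

def mc_action_value(trajectories, reward):
    """Eq. (5): the Monte Carlo estimate of Q(s, a) is the average return
    over all trajectories sampled after taking action a in state s."""
    return sum(trajectory_return(tau, reward) for tau in trajectories) / len(trajectories)
```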
A constant factor c is used to balance both terms, where larger values of c foster exploration:

UCT(s, a) = Q^π(s, a) + c √(N(s) / N(s, a))    (6)

Fig. 4: Selection and expansion in MCTS for the scenario depicted in Fig. 1; selection descends the tree by maximizing UCT values until an under-explored leaf node is found, which subsequently gets expanded with an untried action. Figure adapted from [5].

2) Expansion: A state that has untried actions left gets expanded by sampling an action at random from the action space, see (7), and executing this action to reach a successor state, see Fig. 4:

a ∼ U[min(A), max(A)]    (7)

3) Simulation: After the expansion, a simulation of subsequent random actions is conducted until a terminal condition is reached (i.e. the planning horizon, or an illegal action was sampled resulting in an invalid state), evaluating the quality of the previous expansion, see Fig. 5.

4) Backpropagation: Finally, the return R of the trajectory generated by the simulation is backpropagated to all states along the trajectory, see Fig. 5, and the action values for all actions of the trajectory are updated, see (8):

Q̂^π(s, a) = Q̂^π(s, a) + (1/n) (R(s, a) − Q̂^π(s, a))    (8)

Fig. 5: Simulation and backpropagation in MCTS for the scenario depicted in Fig. 1; simulations are run until the end of the planning horizon, after which the result gets backpropagated along the taken trajectory. Figure adapted from [5].

Cooperation is ensured by selecting actions based on a combined reward of all agents. Given enough samples, the algorithm converges to the optimal solution, i.e. the combination of trajectories with the highest cumulated reward.

C. Data Generation and Training

Due to the absence of a real-world cooperative driving data set for urban driving scenarios, as well as the cost and time requirements for the creation of such a data set, we decided to create our own data set using the simulation from our previous work [5]. In order to generate a diverse set of training data, we used 15 different scenarios, of which eight were adapted from Ulbrich et al. [23]. A short description and videos of all scenarios can be found online¹. Using the cooperative planning algorithm from [5], each scenario was solved 85 times, generating a data set with roughly 1,275,000 expert actions (change in velocity Δv_lon and change in lateral position Δy_lat).

At each run the scenarios were initialized at random. This means that the position, heading, size, velocity, and desired velocity of each agent and obstacle, as well as the width of the road, were altered. Further, we augmented the data by using the point of view of each agent in the scene at every time step, creating n times the data for n agents, and shifted non-ego agents in the scalar input vector f_s. While this increased the size of the data set, it mainly achieved better generalization, as the DNN is designed for a fixed number of agents, i.e. eight. Consequently, for the most part, some inputs of the scalar input vector are empty. To avoid performance degeneration from position one to eight in the vector, non-ego agents were shifted through all slots for each time step.

Following the data creation, we normalized the number of trajectories for each semantic action class [5] (i.e. driving straight without velocity change was overrepresented). Before being fed into the MDN, the input data was further normalized. Based on the action samples a ∈ O of the action space for a given trajectory, two Gaussian mixture models with two and three components were fitted. The parameters of the GMMs constitute the labels for the training.
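The four MCTS phases described above can be condensed into a few helper functions: UCT selection per Eq. (6), uniform expansion per Eq. (7), and the incremental value update per Eq. (8). This is a simplified single-agent skeleton for illustration, not the cooperative planner from [5]:

```python
import math
import random

def uct(q, n_s, n_sa, c=1.0):
    """Eq. (6): UCT(s, a) = Q(s, a) + c * sqrt(N(s) / N(s, a))."""
    return q + c * math.sqrt(n_s / n_sa)

def select(q_values, n_s, n_sa, c=1.0):
    """Selection: pick the action maximizing the UCT value."""
    return max(q_values, key=lambda a: uct(q_values[a], n_s, n_sa[a], c))

def expand(a_min, a_max, rng=random):
    """Eq. (7): expansion samples an untried action uniformly at random."""
    return rng.uniform(a_min, a_max)

def backpropagate(q_hat, ret, n):
    """Eq. (8): incremental mean update, Q <- Q + (R - Q) / n."""
    return q_hat + (ret - q_hat) / n
```

Note how the exploration term dominates for rarely tried actions: an action with a lower value estimate can still be selected if its visit count N(s, a) is small.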
Using the data set and the corresponding labels, we trained the DNN with the negative log-likelihood of a given MCTS sample belonging to the predicted Gaussian mixture model. We use the TensorFlow/Keras API [24] to build and train our MDN depicted in Fig. 3. The training was conducted with a learning rate of 1e−3 in combination with the Adam optimizer, a batch size of 32, and L2 regularization of the covariance.

As the loss function, we employ the negative log-likelihood of the observed weighted samples O for each of the G agents, in combination with a parameterized L2 cost for the covariance, see (9). The loss depends on the number of agents G and the number of sampled actions per agent and thus needs to be normalized. The L2 cost allows tweaking of the resulting probability density function.

L = (1/G) Σ_{g=1}^{G} [ −(1/|O_g|) Σ_{a ∈ O_g} log Σ_{k=1}^{2} φ_k^g N(a | μ_k^g, Σ_k^g) + α ‖Σ_k^g‖₂ ]    (9)

D. Integration

With the goal of improving results at a lower computational cost, the prior knowledge of the MDN needs to be available in the planning algorithm. Depending on the exact task and goal, different integration strategies, or a combination of them, are feasible.

1) Expansion Policy: The integration of prior knowledge in the expansion policy steers the algorithm towards areas of the action space that should yield high returns. The probability of expanding a specific action a is proportional to the value of the MDN for the given features f and action a, see (10):

p(a | f) = Σ_{k=1}^{K} φ_k(f) N(a | μ_k(f), Σ_k(f))    (10)

2) Selection Policy: The integration in the selection policy weighs the exploration term of UCT proportionally to the value of the MDN for the given features f and action a, see (10), so that actions lying within high-density regions of the MDN are more likely to be selected:

¹ http://url.fzi.de/MCTS-MDN-IV
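Assuming scalar actions and scalar covariances, the loss in Eq. (9) can be sketched in plain Python. This is an illustrative re-implementation under those assumptions, not the TensorFlow code used for training:

```python
import math

def gmm_density(a, phi, mu, var):
    """Mixture density, cf. Eq. (2), for a scalar action a."""
    return sum(p * math.exp(-(a - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
               for p, m, v in zip(phi, mu, var))

def mdn_loss(observations, params, alpha=0.0):
    """Eq. (9): negative log-likelihood averaged over agents and samples,
    plus an alpha-weighted L2 cost on the (here scalar) covariances.
    observations[g] holds the action samples of agent g; params[g] is a
    (phi, mu, var) tuple of GMM parameters predicted for that agent."""
    total = 0.0
    for samples, (phi, mu, var) in zip(observations, params):
        nll = -sum(math.log(gmm_density(a, phi, mu, var)) for a in samples) / len(samples)
        l2 = alpha * math.sqrt(sum(v * v for v in var))
        total += nll + l2
    return total / len(observations)
```

Setting alpha > 0 penalizes large variances, sharpening the predicted density at the price of a slightly worse likelihood fit.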
UCT(s, a) = Q^π(s, a) + p(a | f) · c √(N(s) / N(s, a))    (11)

3) Simulation Policy: An integration in the simulation policy would be identical to the integration in the expansion policy; however, it has not been explored in this work, as the computational overhead would be multiple times larger than in the expansion policy. The inference of the MDN is conducted in C++, requiring less than 5 ms for one forward pass.

V. EVALUATION

We evaluated the performance of the baseline MCTS, used to generate the training data, against the integration of the MDN in the MCTS. Both the two- and three-component versions of the MDN were integrated in the MCTS in four different ways:

a) root w/ selection: The MDN is used in the expansion policy only in the root node, as well as in the selection policy on all nodes.

b) root w/o selection: The MDN is used in the expansion policy only in the root node.

c) all w/ selection: The MDN is used in the expansion policy on all nodes as well as in the selection policy on all nodes.
d) all w/o selection: The MDN is used in the expansion policy on all nodes.

Fig. 6: Evaluation of the success rate (i.e. 1 − collision rate) per scenario (SC01–SC15) over the number of iterations (100–8,000) for a) the MCTS baseline and the MCTS+MDN versions with b) two and c) three components, used only for the root node expansion without selection. [Success-rate heatmaps omitted.]

Fig. 7: Evaluation of the success rate (i.e. 1 − collision rate) for the baseline MCTS version (yellow) and the MCTS+MDN versions and integration strategies over all scenarios for different numbers of MCTS iterations; a) mixture density networks with two and three components used for the root node expansion with and without selection, b) mixture density networks with two and three components used for every expansion with and without selection.

The evaluation of the MDN by itself was conducted by sampling 1,000 different actions from the resulting Gaussian mixture model and choosing the action with the highest probability density at each step in the scenario. Using each of the proposed solutions, the scenarios were each run 50 times, the baseline version 100 times. As the performance of the MCTS, and thus also of the combination of the MCTS with the MDNs, depends on the number of iterations being run, we ran all versions with an increasing number of iterations. Fig. 6 depicts the results of the baseline MCTS versus the integration of the prior knowledge provided by the MDNs with two and three components in the MCTS.
Scenarios SC15, SC13, and SC14 were solved neither by the MCTS alone nor by the combination of MDN and MCTS. The poor performance on SC14 and SC15 is mainly due to the fact that these scenarios are especially hard given their obstacle configurations combined with four and eight vehicles, respectively. SC13, on the other hand, is usually solved with higher iteration counts and with the enhancements to the MCTS mentioned in [5], which are not enabled in the baseline version. Major improvements were achieved in SC11 and SC12. While the success rate of SC07 increases slightly for a low number of iterations, the performance drops sharply once more than 1,000 iterations are used. The other scenarios behave as expected, generally yielding better results with fewer computational resources. A summary of the performance averaged over all scenarios is depicted in Fig. 7. The absolute average improvement is approximately 18% up to 500 iterations (root w/o selection) and vanishes almost completely after 4,000 iterations (w/ selection) and after 8,000 iterations (w/o selection). While there is no difference in overall performance between the MDN versions with two and three mixture components, three components perform considerably better on SC07. Intuitively, this could be due to the three homotopy classes that exist in this scenario, namely merging before the first vehicle, behind the second, or between both vehicles.

VI. CONCLUSIONS

This paper proposes a method to accelerate multi-agent trajectory planning algorithms for automated vehicles in situations that require cooperation. Due to the high-level representation of the environment, the proposed approach is applicable to a variety of cases without specific retraining. While the proposed approach yields a considerable improvement for lower numbers of iterations, the performance gain decays quickly once 2,000 iterations are exceeded.
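The per-scenario and overall numbers discussed above reduce to simple averages of Boolean run outcomes; a minimal sketch, with scenario names and outcomes purely illustrative:

```python
import numpy as np

def success_rate(outcomes):
    """Success rate = 1 - collision rate over a list of run outcomes
    (True = run completed without collision)."""
    return float(np.mean(outcomes))

def average_over_scenarios(per_scenario):
    """Mean success rate across scenarios for one iteration budget,
    weighting each scenario equally regardless of its run count."""
    return float(np.mean([success_rate(runs) for runs in per_scenario.values()]))

# Illustrative outcomes for two scenarios, four runs each
per_scenario = {
    "SC01": [True, True, True, False],    # 0.75
    "SC02": [True, False, False, False],  # 0.25
}
```

The absolute improvement reported for a given iteration budget is then just the difference of such averages between an MCTS+MDN variant and the baseline.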
The naive integration of the mixture density network requires further tuning, as well as approaches that make frequent inference computationally feasible. Additionally, tests with real-world data shall be conducted to reveal whether modifications are needed for deployment on an actual vehicle.

VII. ACKNOWLEDGMENTS

We wish to thank the German Research Foundation (DFG) for funding the project Cooperatively Interacting Automobiles (CoInCar), within which the research leading to this contribution was conducted. The information as well as the views presented in this publication are solely the ones expressed by the authors.

REFERENCES

[1] D. Kahneman, "Maps of bounded rationality: Psychology for behavioral economics," The American Economic Review, 2003.
[2] T. Anthony, Z. Tian, and D. Barber, "Thinking Fast and Slow with Deep Learning and Tree Search," 2017. [Online]. Available: http://arxiv.org/abs/1705.08439
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, 2016.
[4] K. Kurzer et al., "Decentralized Cooperative Planning for Automated Vehicles with Hierarchical Monte Carlo Tree Search," in IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018.
[5] K. Kurzer, F. Engelhorn, and J. M. Zöllner, "Decentralized Cooperative Planning for Automated Vehicles with Continuous Monte Carlo Tree Search," in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018.
[6] M. Bahram, A. Lawitzky, J. Friedrichs, M. Aeberhard, and D. Wollherr, "A Game-Theoretic Approach to Replanning-Aware Interactive Scene Prediction and Planning," IEEE Transactions on Vehicular Technology, 2016.
[7] A. Lawitzky, D. Althoff, C. F. Passenberg, G. Tanzmeister, D. Wollherr, and M. Buss, "Interactive scene prediction for automotive applications," in IEEE Intelligent Vehicles Symposium, Proceedings. IEEE, 2013.
[8] D. Lenz, F. Diehl, M. T. Le, and A. Knoll, "Deep neural networks for Markovian interactive scene prediction in highway scenarios," in 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2017.
[9] N. Deo and M. M. Trivedi, "Convolutional Social Pooling for Vehicle Trajectory Prediction," 2018. [Online]. Available: http://arxiv.org/abs/1805.06771
[10] J. Schulz, K. Hirsenkorn, J. Lochner, M. Werling, and D. Burschka, "Estimation of collective maneuvers through cooperative multi-agent planning," in 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2017.
[11] T. Gindele, S. Brechtel, and R. Dillmann, "Learning driver behavior models from traffic observations for decision making and planning," IEEE Intelligent Transportation Systems Magazine, 2015.
[12] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. v. d. Driessche, T. Graepel, and D. Hassabis, "Mastering the game of Go without human knowledge," Nature, 2017.
[13] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver, "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model," 2019. [Online]. Available: http://arxiv.org/abs/1911.08265
[14] C. Hubschneider, A. Bauer, J. Doll, M. Weber, S. Klemm, F. Kuhnt, and J. M. Zöllner, "Integrating end-to-end learned steering into probabilistic autonomous driving," in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017.
[15] J. Ziegler and C. Stiller, "Fast collision checking for intelligent vehicle motion planning," in 2010 IEEE Intelligent Vehicles Symposium. IEEE, 2010.
[16] C. Paxton, V. Raman, G. D. Hager, and M. Kobilarov, "Combining Neural Networks and Tree Search for Task and Motion Planning in Challenging Environments," 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
[17] B. Ichter, J. Harrison, and M. Pavone, "Learning Sampling Distributions for Robot Motion Planning," 2017. [Online]. Available: http://arxiv.org/abs/1709.05448
[18] H. Banzhaf, P. Sanzenbacher, U. Baumann, and J. M. Zöllner, "Learning to Predict Ego-Vehicle Poses for Sampling-Based Nonholonomic Motion Planning," 2018.
[19] C. M. Bishop, "Mixture density networks," 1994.
[20] G. Bacchiani, D. Molinari, and M. Patander, "Microscopic Traffic Simulation by Cooperative Multi-agent Deep Reinforcement Learning," 2019. [Online]. Available: http://arxiv.org/abs/1903.01365
[21] A. Brando Guillaumes, "Mixture density networks for distribution and uncertainty estimation," Master's thesis, Universitat Politècnica de Catalunya, 2017.
[22] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, "A survey of Monte Carlo tree search methods," IEEE Transactions on Computational Intelligence and AI in Games, 2012.
[23] S. Ulbrich, S. Grossjohann, C. Appelt, K. Homeier, J. Rieken, and M. Maurer, "Structuring Cooperative Behavior Planning Implementations for Automated Driving," in IEEE Conference on Intelligent Transportation Systems, Proceedings, ITSC. IEEE, 2015.
[24] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
