Dynamic Graph Configuration with Reinforcement Learning for Connected Autonomous Vehicle Trajectories
Udesh Gunarathna, Hairuo Xie, Egemen Tanin, Shanika Karunasekara, Renata Borovica-Gajic
pgunarathna@student.unimelb.edu.au, {xieh, etanin, karus, renata.borovica}@unimelb.edu.au
School of Computing and Information Systems, University of Melbourne

Abstract—Traditional traffic optimization solutions assume that the graph structure of road networks is static, missing opportunities for further traffic flow optimization. We are interested in optimizing traffic flows as a new type of graph-based problem, where the graph structure of a road network can adapt to traffic conditions in real time. In particular, we focus on the dynamic configuration of traffic-lane directions, which can help balance the usage of traffic lanes in opposite directions. The rise of connected autonomous vehicles offers an opportunity to apply this type of dynamic traffic optimization at a large scale. The existing techniques for optimizing lane-directions are, however, not suitable for dynamic traffic environments due to their high computational complexity and static nature. In this paper, we propose an efficient traffic optimization solution, called Coordinated Learning-based Lane Allocation (CLLA), which is suitable for dynamic configuration of lane-directions. CLLA consists of a two-layer multi-agent architecture, where the bottom-layer agents use a machine learning technique to find a suitable configuration of lane-directions around individual road intersections. The lane-direction changes proposed by the learning agents are then coordinated at a higher level to reduce the negative impact of the changes on other parts of the road network. Our experimental results show that CLLA can reduce the average travel time significantly in congested road networks. We believe our method is general enough to be applied to other types of networks as well.
Index Terms—Graphs, Spatial Database, Reinforcement Learning

I. INTRODUCTION

The goal of traffic optimization is to improve traffic flows in road networks. Traditional solutions normally assume that the structure of road networks is static regardless of how traffic can change in real time [1], [2]. A less common way to optimize traffic is to make limited changes to the road network itself; where such changes are in use, they are deployed at a very small scale. We focus on dynamic lane-direction changes, which can help balance the usage of traffic lanes in many circumstances, e.g., when the traffic lanes in one direction become congested while the traffic lanes in the opposite direction are underused [3], [4]. Unfortunately, the existing techniques for optimizing lane-directions are not suitable for dynamic traffic environments at large scale due to their high computational complexity [5]–[7]. We develop an efficient solution for optimizing lane-directions in highly dynamic traffic environments. Our solution is based on an algorithm that modifies the properties of a road network graph to improve traffic flow in the corresponding road network, introducing a new graph problem.

Figure 1: The impact of lane-direction change on traffic flow. There are 20 vehicles moving in the north-bound direction and 2 vehicles moving in the south-bound direction. (a) Traffic before lane-direction change. (b) Traffic after lane-direction change.

The impact of dynamic lane-direction configurations can be shown in the following example, where 20 vehicles are moving north-bound and 2 vehicles are moving south-bound (Figure 1) at a certain time. In Figure 1a, there are 4 north-bound lanes and 4 south-bound lanes. Due to the large number of vehicles and the limited number of lanes, the north-bound traffic is highly congested. At the same time, the south-bound vehicles are moving at a high speed as the south-bound lanes are almost empty.
Figure 1b shows the dramatic change of traffic flow after lane-direction changes are applied when congestion is observed, where the direction of lanes E, F and G is reversed. The north-bound vehicles are distributed into the additional lanes, resulting in a higher average speed of the vehicles. At the same time, the number of south-bound lanes is reduced to 1. Due to the low number of south-bound vehicles, the average speed of south-bound traffic is not affected. The lane-direction change helps improve the overall traffic efficiency in this case. This observation has been used by traffic engineers on certain road segments for many years, applied in a more static way. We aim to scale this to extreme levels in time and space.

The benefit of dynamic lane-direction changes can also be observed in preliminary tests, where we compare the average travel time of vehicles in two scenarios: one uses dynamic lane-direction configurations, the other uses static lane-direction configurations. The dynamic lane-direction configurations are computed with a straightforward solution (Section IV). The result shows that lane-direction changes reduce travel times by 14% on average when the traffic increases (see Figure 2). In traffic engineering terms this is a dramatic reduction.

Figure 2: The average travel time of vehicles when using static and dynamic lane-direction configurations (x-axis: the number of vehicles generated; y-axis: average travel time in minutes).

Despite their potential benefit, dynamic lane-direction changes cannot be easily applied to existing traffic systems as they require additional signage and safety features [8]. The emergence of connected autonomous vehicles (CAVs) [9], however, can make dynamic lane-direction changes a common practice in the future. Our previous work shows that CAVs have the potential to enable innovative traffic management solutions [10].
Compared to human-driven vehicles, CAVs are more capable of responding to a given command in a timely manner [6]. CAVs can also provide detailed traffic telemetry data to a central traffic management system in real time. This helps the system to adapt to dynamic traffic conditions.

We formulate lane allocation based on real-time traffic as a new graph problem, with the aim to find a new graph G'_t from a road network graph G (i.e., to dynamically optimize the graph) such that the total travel time of all vehicles in the road network is minimized. In order to optimize the flow of the whole network, all the traffic lanes in the network must be considered. In many circumstances, one cannot simply allocate more traffic lanes at a road segment for a specific direction when there is more traffic demand in that direction. This is because a lane-direction change at a road segment can affect not only the flow in both directions at that road segment but also the flow at other road segments. Due to the complexity of the problem, the computation time can be very high with the existing approaches, as they aim to find the optimal configurations based on linear programming [5]–[7], and hence are not suitable for frequent recomputation over large networks.

To address the above-mentioned issues, we propose a lightweight and effective framework, called the Coordinated Learning-based Lane Allocation (CLLA) framework, for optimizing lane-directions in dynamic traffic environments. The CLLA approach finds the configurations that effectively improve the traffic efficiency of the whole network, while keeping the computation cost of the solution low.
The key idea is that traffic optimization can be decoupled into two processes: (i) a local process that proposes lane-direction changes based on local traffic conditions around road intersections, and (ii) a global process that evaluates the proposed lane-direction changes based on their large-scale impact.

Figure 3: The hierarchical architecture of our traffic management solution based on lane-direction changes.

The architecture of our hierarchical solution is illustrated in Figure 3. The bottom layer consists of a set of autonomous agents that operate at the intersection level. An agent finds suitable lane-direction changes for the road segments that connect to a specific intersection. The agent uses reinforcement learning [11], which helps determine the best changes based on multiple dynamic factors. The agents send the proposed lane-direction changes to the upper layer, which consists of a coordinating agent. The coordinating agent maintains a data structure, named Path Dependency Graph (PDG), which is built based on the trip information of connected autonomous vehicles. With the help of the data structure, the coordinating agent evaluates the global impact of the proposed lane-direction changes and decides what changes should be made to the traffic lanes. The decision is sent back to the bottom-layer agents, which will make the changes accordingly.

The main contributions of our work are as follows:
• We formulate dynamic lane allocation as a new graph problem (the Dynamic Resource Allocation problem).
• We propose a hierarchical multi-agent solution (called CLLA) for efficient dynamic optimization of lane-directions that uses reinforcement learning to capture dynamic changes in the traffic.
• We introduce an algorithm and an innovative data structure (called the path dependency graph) for coordinating lane-direction changes at the global level.
• Extensive experimental evaluation shows that CLLA significantly outperforms other traffic management solutions, making it a viable tool for mitigating traffic congestion in future traffic networks.

II. RELATED WORK

A. Traffic Optimization Algorithms

Existing traffic optimization algorithms are commonly based on traffic flow optimization with linear programming [2], [12], [13]. These algorithms are suitable for situations where traffic demand and congestion levels are relatively static. When there is a significant change in the network, the optimal solutions normally need to be re-computed from scratch. Due to the high computational complexity of finding an optimal solution, the algorithms may not be suitable for highly dynamic traffic environments.

With the rise of reinforcement learning [14], a new generation of traffic optimization algorithms has emerged [15]–[17]. In reinforcement learning, a learning agent can find the rules to achieve an objective by repeatedly interacting with an environment. The interactive process can be modelled as a finite Markov Decision Process, which requires a set of states S and a set of actions A per state. Given a state s of the environment, the agent takes an action a. As a result of the action, the environment state may change to s' with a reward r. The agent then decides on the next action in order to maximize the reward in the next round. Reinforcement learning-based approaches can suggest the best actions for traffic optimization given a combination of network states, such as the queue size at intersections [18], [19]. They have an advantage over linear programming-based approaches since, if trained well, they can optimize traffic in a highly dynamic network. In other words, there is no need to re-train the agent when there is a change in the network. For example, Arel et al.
show that a multi-agent system can optimize the timing of adaptive traffic lights based on reinforcement learning [19]. Different to the existing approaches, our solution uses reinforcement learning for optimizing lane-directions.

A common problem with reinforcement learning is that the state space can grow exponentially when the dimensionality of the state space grows linearly. For example, let us assume that the initial state space only has one dimension, the queue size at intersections. If we add two dimensions to the state space, traffic signal phase and traffic lane configuration, there will be three dimensions, and the state space is four times as large as the original state space. The fast growth of the state space can make reinforcement learning unsuitable for real deployments. This problem is known as the curse of dimensionality [20]. A common way to mitigate the problem is to use a function approximator such as a neural network. Such techniques have mainly been used for dynamic traffic signal control [21], [22], while we extend the use of the technique to dynamic lane-direction configurations.

Many existing traffic optimization solutions use model-based reinforcement learning, where one needs to know the exact probability that a specific state transitions to another specific state as a result of a specific action [23], [24]. Nonetheless, such an assumption is unrealistic, since full knowledge of state transition probabilities can hardly be obtained for highly complex traffic systems. Different to model-based approaches, our optimization solution employs a model-free algorithm, Q-learning [25], which does not require such knowledge and hence is much more applicable to real traffic systems. Coordination of multi-agent reinforcement learning can be achieved through a joint state space or through a coordination graph [26]. Such techniques however require agents to be trained on the targeted network.
Since our approach uses an implicit mechanism to coordinate, once an agent is trained, it can be used in any road network.

B. Lane-direction Configurations

Research shows that dynamic lane-direction changes can be an effective way to improve traffic efficiency [3]. However, existing approaches for optimizing lane-directions are based on linear programming [5]–[7], [27], which is unsuitable for dynamic traffic environments due to its high computational complexity. For example, Chu et al. use linear programming to make lane-allocation plans by considering the schedule of connected autonomous vehicles [6]. Their experiments show that the total travel time can be reduced. However, the computation time grows exponentially when the number of vehicles grows linearly, which can make the approach unsuitable for highly dynamic traffic environments. Other approaches perform optimization based on two processes that interact with each other [5], [7], [27]. One process minimizes the total system cost by reversing lane directions, while the other process makes route decisions for individual vehicles such that all the vehicles can minimize their travel times. To find a good optimization solution, the two processes need to interact with each other iteratively. The high computational cost of these approaches can make them unsuitable for dynamic traffic optimization. Furthermore, all these approaches assume that exact knowledge of traffic demand over the time horizon is available beforehand; this assumption does not hold when traffic demand is stochastic [28]. On the contrary, CLLA is lightweight and can adapt to highly dynamic situations based on reinforcement learning. The learning agents can find effective lane-direction changes for individual road intersections even when traffic demand changes dramatically.

C. Traffic Management with Connected Autonomous Vehicles

Some recent developments of traffic management solutions are tailored for the era of connected autonomous vehicles. Talebpour and Mahmassani develop a traffic management model that combines connected autonomous vehicles and intelligent road infrastructure for improving traffic flow [29]. Guler et al. develop an approach to improve traffic efficiency at intersections using connected autonomous vehicles [30]. We use CAVs as an opportunity for lane optimization. To the best of our knowledge, we are the first to study dynamic lane-direction changes in large-scale networks in the era of connected autonomous vehicles.

III. PROBLEM DEFINITION

In this section, we formalize the problem of traffic optimization based on dynamic configuration of lane directions. Our problem is similar to the Network Design Problem (NDP) [31]; however, NDP assumes that the traffic demand for the whole time horizon is known at time zero, and the output network is designed for a common state. We instead configure a graph (road network) at regular time intervals based on real-time traffic; thus we name this problem the Dynamic Graph Resource Allocation problem.

Let G(V, E) be a road network graph, where V is a set of vertices and E is a set of edges. Let us assume that edge e ∈ E connects vertex x ∈ V and vertex y ∈ V. The edge has three properties. The first property is the total number of lanes, n_e, which is a constant number. The second property is the number of lanes that start from x and end in y, n_e1. The third property is the number of lanes in the opposite direction (from y to x), n_e2. n_e1 and n_e2 can change, but they are always subject to the following constraint:

n_e1 + n_e2 = n_e    (1)

We assume that a CAV follows a pre-determined path based on an origin-destination (O-D) pair. Let the number of unique O-D pairs of the existing vehicles be k at a given time t.
For the i-th (i ≤ k) O-D pair, let d_{i,t} be the traffic demand at time t, i.e., the number of vehicles with the same O-D pair at that time. The traffic demand can be stochastic. Let the travel time of vehicle j with the i-th O-D pair be TT_{i,j}, which is the duration for the vehicle to move from the origin to the destination. For a given time t, the average travel time of all the vehicles that will reach their destinations during a time period T after t can be defined as

ATT = ( Σ_{i=1}^{k} Σ_{j=1}^{m_i} TT_{i,j} ) / ( Σ_{i=1}^{k} m_i )    (2)

where m_i is the number of vehicles with the i-th O-D pair that will complete their trips between t and t + T.

We define and solve a version of the problem where, at frequent regular intervals, we optimize travel time while changing the lane arrangement on all edges. We find a new graph G'_t(V, E') at a given time t from the previous graph G at the previous time step. Let e'_1, e'_2 ∈ E', where e'_1 connects vertex x to vertex y and e'_2 connects vertex y to vertex x. We find, for all edges, the values of n_{e'_1} and n_{e'_2} such that the average travel time ATT is minimized. We call this the Dynamic Resource Allocation problem.

IV. DEMAND-BASED LANE ALLOCATION (DLA)

When considering dynamic lane-direction changes, a straightforward solution can use a centralized approach to optimize lane-directions based on full knowledge of traffic demand, i.e., the number of vehicle paths that pass through the road links. We call this solution Demand-based Lane Allocation (DLA). Algorithm 1 shows the implementation (in pseudocode) of such an idea to compute the configuration of lane-directions. DLA allocates more lanes to a specific direction when the average traffic demand per lane in that direction is higher than the average traffic demand per lane in the opposite direction. To specify the directions, we define two terms, upstream and downstream. The terms are defined as follows.
Let us assume that all the vertices of the road network graph are ordered by their identification numbers. Given two vertices, v1 and v2, and a direction that points from v1 to v2, we say that the direction is upstream if v1 is lower than v2, or downstream if v1 is higher than v2.

DLA first computes the traffic demand at the edges that are on the paths of the vehicles (Lines 1-6). The traffic demand is computed for the upstream direction (up_e) and the downstream direction (down_e) separately. Then it quantifies the difference in the average traffic demand per lane between the two directions (Lines 9-11). Based on the difference between the two directions, DLA decides whether the number of lanes in a specific direction needs to be increased (Lines 11-14). We should note that increasing the number of lanes in one direction implies that the number of lanes in the opposite direction is reduced. DLA only reduces the number of lanes in a direction if the traffic demand in that direction is lower than a threshold (Line 12). The complexity of the algorithm is O(k|E|), where |E| is the number of edges in G and k is the number of O-D pairs.

While straightforward to implement and effective, there are two notable drawbacks to DLA. First, the algorithm does not consider real-time traffic conditions, such as the queue length at a given time, during optimization; the only information used for optimization is the (assumed a priori known) traffic demand, and exact knowledge of traffic demand is difficult to obtain in dynamic road networks [28]. This can make the lane-direction configuration less adaptive (and less applicable) to real-time traffic conditions. Second, the lane-direction optimization for individual road segments is performed individually, without considering the potential impact of a lane-direction change at a road segment on other road segments in the same road network.
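The per-edge demand-gap rule described above can be sketched in runnable form as follows. This is a minimal illustration, not the paper's implementation: the `Edge` class, the `(demand, hops)` path encoding, and the one-lane lower bound on each direction are our own assumptions.

```python
# Illustrative sketch of the DLA demand-gap rule (Algorithm 1).
from dataclasses import dataclass

@dataclass
class Edge:
    up_lanes: int          # lanes in the upstream direction (assumed >= 1)
    down_lanes: int        # lanes in the downstream direction (assumed >= 1)
    up_demand: float = 0.0
    down_demand: float = 0.0

def dla_step(edges, paths, th, g):
    """edges: {edge_id: Edge}; paths: list of (demand, [(edge_id, 'up'|'down'), ...])."""
    # Lines 1-6: accumulate directional demand along every vehicle path.
    for demand, hops in paths:
        for edge_id, direction in hops:
            e = edges[edge_id]
            if direction == 'up':
                e.up_demand += demand
            else:
                e.down_demand += demand
    # Lines 7-16: rebalance lanes where the per-lane demand gap is large enough.
    for e in edges.values():
        up_per_lane = e.up_demand / e.up_lanes
        down_per_lane = e.down_demand / e.down_lanes
        if up_per_lane + down_per_lane == 0:
            continue  # no demand on this edge, nothing to rebalance
        gap = (down_per_lane - up_per_lane) / (up_per_lane + down_per_lane)
        if min(e.up_demand, e.down_demand) < th:
            if gap > g and e.up_lanes > 1:        # downstream is busier
                e.up_lanes -= 1
                e.down_lanes += 1
            elif gap < -g and e.down_lanes > 1:   # upstream is busier
                e.down_lanes -= 1
                e.up_lanes += 1
```

For instance, with a single 4+4-lane edge carrying 20 upstream and 2 downstream vehicles (as in Figure 1), one step of this rule reassigns a downstream lane to the upstream direction.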
Therefore, a lane-direction change that helps improve traffic efficiency at one road link may lead to a decrease of traffic efficiency in other parts of the road network.

V. COORDINATED LEARNING-BASED LANE ALLOCATION (CLLA)

To tackle the problems of the straightforward solution, we propose a fundamentally different solution, the Coordinated Learning-based Lane Allocation (CLLA) framework. CLLA uses a machine learning technique to help optimize lane-direction configurations, which allows the framework to adapt to a high variety of real-time traffic conditions. In addition, CLLA coordinates the lane-direction changes by considering the impact of a potential lane-direction change on different parts of the road network. DLA, on the other hand, does not consider the global impact of lane-direction changes. Another difference between the two is that DLA requires the full paths of vehicles to be known for computing traffic demand. As detailed later, CLLA only needs to know partial information about vehicle paths, in addition to certain information about real-time traffic conditions, such as intersection queue lengths and the lane configurations of road segments, which can be obtained from inductive-loop traffic detectors.

CLLA uses a two-layer multi-agent architecture. The bottom layer consists of learning agents that are responsible for optimizing the direction of lanes connected to specific intersections. Using the multi-agent approach can significantly boost the speed of learning. The lane-direction changes that are decided by the learning agents are aggregated and evaluated by a coordinating agent at the upper layer, which will send the globally optimized lane-direction configuration to the bottom layer for making the changes.

A more detailed overview of CLLA is shown in Figure 4. As the figure shows, an agent in the bottom layer only observes the local traffic condition around a specific intersection.
Algorithm 1: Demand-based Lane Allocation (DLA)
Input: A road network graph G(V, E).
Input: The set of paths. A path is a sequence of edges on the shortest path between a specific origin-destination (O-D) pair. The set of paths includes the paths of all unique O-D pairs.
Input: The demands associated with the paths, where a demand is the number of vehicles that follow a specific path.
Input: th: demand threshold.
Input: g: minimal gap in traffic demand that can trigger a lane-direction change.

1  foreach p ∈ paths do
2      foreach e ∈ p do
3          if p passes e in the upstream direction then
4              up_e += demand of p
5          if p passes e in the downstream direction then
6              down_e += demand of p
7  foreach e ∈ E do
8      minLoad ← min(up_e, down_e)
9      down'_e ← down_e / number of downstream lanes
10     up'_e ← up_e / number of upstream lanes
11     gap ← (down'_e − up'_e) / (up'_e + down'_e)
12     if minLoad < th then
13         if gap > g then
14             move one upstream lane to the set of downstream lanes
15         if gap < −g then
16             move one downstream lane to the set of upstream lanes

Figure 4: An overview of the CLLA architecture (bottom layer: reinforcement learning agents, each observing the area around one intersection; upper layer: the coordinating agent, which maintains the path dependency graph and receives the O-D pairs of users and the proposed lane-direction changes).

Agents make decisions on lane-direction changes independently. Whenever an agent needs to make a lane-direction change, it sends the proposed change to the coordinating agent in the upper layer. The agents also send certain traffic information to the upper layer periodically. The information can help indicate whether there is an imbalance between upstream traffic and downstream traffic at specific road segments. The coordinating agent evaluates whether a change would be beneficial at the global level. The evaluation process involves a novel data structure, Path Dependency Graph (PDG), to inform decisions sent from the bottom layer.
The coordinator may allow or deny a lane-direction change request from the bottom layer. It may also decide to make further changes to lane-directions in addition to the requested changes. After evaluation, the coordinating agent informs the bottom-layer agents of the changes to be made.

Figure 5: The CLLA communication timeline between the agents and the coordinator (requests from agents are buffered and the lane-direction changes are issued once per interval T).

We should note that the coordinator does not need to evaluate a lane-direction change request as soon as it arrives. As shown in Figure 5, the coordinator evaluates the lane-direction changes periodically. The time interval between the evaluations is T. All the requests from the bottom-layer agents are buffered during the interval. The exact value of the interval needs to be adjusted case by case. A short interval may increase the computational cost of the solution. A long interval may decrease the effectiveness of the optimization.

A. CLLA Algorithm

Algorithm 2 shows the entire optimization process of CLLA. During one iteration of the algorithm, each learning agent finds the lane-direction changes around a specific road intersection using the process detailed in Section V-B. Each proposed change is stored as an edge-change pair, which is buffered in the system (Line 5). When it is time to evaluate the proposed changes, the system uses the Direction-Change Evaluation algorithm (Section V-C) to quantify the conflicts between the proposed changes (Line 8). For example, when a learning agent proposes to increase the number of upstream lanes on road segment s1 while another agent proposes a lane-direction change on a different road segment s2 that can lead to an increase of the downstream traffic flow on s1, there is a conflict between the proposed changes for s1. The Direction-Change Evaluation algorithm also expands the set of proposed changes with further changes that may be beneficial.
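The periodic buffering behaviour shown in Figure 5 can be sketched as follows. The tick-based clock and the `evaluate` callback, which stands in for the Direction-Change Evaluation algorithm, are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative sketch of the coordinator's periodic request buffering (Figure 5).
class Coordinator:
    def __init__(self, interval, evaluate):
        self.interval = interval   # evaluation period T, measured in ticks
        self.evaluate = evaluate   # callback standing in for Direction-Change Evaluation
        self.buffer = []           # pending edge-change requests from agents
        self.tick = 0

    def submit(self, edge_id, change):
        """Called by a bottom-layer agent; change: +1 = upstream, -1 = downstream."""
        self.buffer.append((edge_id, change))

    def step(self):
        """Advance the clock; evaluate buffered requests once every T ticks."""
        self.tick += 1
        approved = []
        if self.tick % self.interval == 0:
            approved = self.evaluate(self.buffer)
            self.buffer = []       # requests are consumed at each evaluation
        return approved
```

A long `interval` batches more requests per evaluation (cheaper but less responsive), while a short one does the opposite, mirroring the trade-off discussed above.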
Upon returning from the Direction-Change Evaluation algorithm, CLLA checks the expanded set of edge-change pairs. For each edge-change pair, if the number of conflicts for the edge is below a given limit, the change is applied to the edge (Line 12).

Algorithm 2: Coordinated Lane Allocation (CLLA)
Input: T, action evaluation interval.
Input: EC_initial, set of edge-change pairs proposed by the learning agents.
Input: EC_expanded, set of edge-change pairs given by the coordinator.
Input: mc, the maximum number of conflicts between lane-direction changes before a proposed lane-direction change can be applied to an edge.
Input: G(V, E), a road network graph.
Input: l, the lookup distance for building the Path Dependency Graph.
Input: dp, the maximum search depth in the Path Dependency Graph for evaluating lane-direction changes.

1  while True do
2      foreach agent ∈ Agents do
3          determine the best lane-direction change for all the edges (road segments) that connect to the vertex (road intersection) controlled by the agent
4          foreach edge e that needs a lane-direction change do
5              EC_initial.insert({e, change})
6      if T = t then
7          t ← 0
8          EC_expanded ← Direction-Change Evaluation(EC_initial, G, l, dp)
9          EC_initial ← ∅
10         foreach {e, change} in EC_expanded do
11             if number of conflicts on e is less than mc then
12                 apply the lane-direction change to e
13     t ← t + 1

B. Learning-based Lane-direction Configuration

In the CLLA framework, the bottom-layer agents use the Q-learning technique to find suitable lane-direction changes based on real-time traffic conditions. Q-learning aims to find a policy that maps a state to an action. The algorithm relies on an action value function, Q(s, a), which computes the quality of a state-action combination. In Q-learning, an agent tries to find the optimal policy that leads to the maximum action
value. Q-learning updates the action value function using an iterative process, as shown in Equation 3:

Q_t^new(s, a) = (1 − α) Q_t(s, a) + α ( r_{t+1} + γ max_a Q(s_{t+1}, a) )    (3)

where s is the current state, a is a specific action, s_{t+1} is the next state as a result of the action, max_a Q(s_{t+1}, a) is the estimated optimal action value in the next state, r_{t+1} is the observed reward at the next state, α is the learning rate and γ is the discount factor. In CLLA, the states, actions and rewards used by the learning agents are defined as follows.

1) States: A learning agent works with four types of states. The first state represents the current traffic signal phase at an intersection. The second state represents the queue length of incoming vehicles that are going to pass the intersection without turning. The third state represents the queue length of incoming vehicles that are going to turn at the intersection. The fourth state represents the queue length of outgoing vehicles, i.e., the vehicles that have passed the intersection. Although it is possible to add other types of states, we find that the combination of these four states works well for traffic optimization.

2) Actions: There are three possible actions: increasing the number of upstream lanes by 1, increasing the number of downstream lanes by 1, or keeping the current configuration. When the number of lanes in one direction is increased, the number of lanes in the opposite direction is decreased at the same time. Since a learning agent controls a specific road intersection, the agent determines the action for each individual road segment that connects with the intersection. An agent is allowed to take a lane-changing action only when there is a traffic imbalance on the road segment (see Equation 4 for the definition of traffic imbalance).

3) Rewards: We define the rewards based on two factors. The first factor is the waiting time of vehicles at an intersection. When the waiting time decreases, there is generally an improvement of traffic efficiency.
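The tabular update in Equation 3 can be sketched as follows, assuming hashable states and the three lane-change actions defined above. The class structure, zero-initialized table, and parameter values are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of the tabular Q-learning update in Equation 3.
from collections import defaultdict

class QAgent:
    def __init__(self, actions, alpha=0.1, gamma=0.9):
        self.q = defaultdict(float)   # Q(s, a), zero-initialized
        self.actions = actions        # e.g. (+1, 0, -1) lane-direction actions
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor

    def update(self, s, a, reward, s_next):
        # max_a' Q(s_{t+1}, a'): the estimated optimal value of the next state.
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        # Equation 3: blend the old estimate with the bootstrapped target.
        self.q[(s, a)] = ((1 - self.alpha) * self.q[(s, a)]
                          + self.alpha * (reward + self.gamma * best_next))

    def best_action(self, s):
        return max(self.actions, key=lambda a: self.q[(s, a)])
```

Because the update is model-free, the agent needs only observed (s, a, r, s') transitions, with no state-transition probabilities, which is the property the paper relies on for applicability to real traffic systems.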
Hence, the rewards should consider the difference between the current waiting time and the updated waiting time of all the vehicles that are approaching the intersection. The second factor is the difference between the lengths of the vehicle queues at the different approaches to an intersection. When the queue length of one approaching road is significantly longer than the queue length of another approaching road, there is a higher chance that the traffic becomes congested on the former road. Therefore we need to penalize the actions that increase the difference between the longest queue length and the shortest queue length. The following reward function combines the two factors. A parameter β is used to weight the two factors. We normalize the two factors to stabilize the learning process by limiting the reward function to values between −1 and 1. To give equal priority to both factors, we set β to 0.5 in the experiments.

R = (1 − β) × (Current wait time − Next wait time) / max(Next wait time, Current wait time) − β × (Queue length difference / Aggregated road capacity)

C. Coordination of Lane-direction Changes

We develop the coordinating process based on the observation that a locally optimized lane-direction change may conflict with the lane-direction changes that happen in the surrounding areas. A conflict can happen due to the fact that the effect of a lane-direction change can spread from one road segment to other road segments. For example, let us assume that a constant portion of the upstream traffic that passes through road segment x will also pass through road segment y in the upstream direction later on. An increase of the upstream lanes on x can lead to a significant increase of upstream traffic on x due to the increased traffic capacity in that direction.
Over time, the traffic volume change on x can lead to an increase of the upstream traffic on y, which implies that the number of upstream lanes at y may need to be increased to suit the change of traffic volume. In this case, the lane-direction change at y can be seen as a consequential change caused by the change at x. However, the learning agent that controls the lane-directions at y may suggest an increase of downstream lanes based on the current local traffic condition at y. If this is the case, the locally optimized change will conflict with the consequential change.

The key task of the coordinating process is quantifying such conflicts in road networks. If there are a large number of conflicts at a road segment, the locally optimized change should not be applied because it may later have a negative impact on the traffic flows at the global level. This is a key idea behind the coordination process of our solution. As shown in Section V-A, our solution applies a proposed lane-direction change to a road segment only when the number of conflicts is below a given threshold.

To help identify the conflicts between lane-direction changes, we develop a novel data structure, named Path Dependency Graph (PDG). The data structure maintains several types of traffic information, including the path of traffic flow, the proposed lane-direction changes and the current traffic conditions. The coordinating agent uses the PDG to search for the locations of consequential lane-direction changes. The conflicts between lane-direction changes are then identified by comparing the consequential lane-direction changes and the proposed lane-direction changes at the same locations. The coordinating agent also proposes additional lane-direction changes using the PDG.

A PDG, PDG(V_PDG, E_PDG), consists of a number of vertices and a number of directional edges. A vertex v ∈ V_PDG represents a road segment.
An edge connects two vertices whose corresponding road segments both appear in the path of a vehicle. A vertex can connect to a number of out-degree edges and a number of in-degree edges. The direction of an edge depends on the order in which the traffic flow goes through the two road segments. An edge that starts from vertex v1 and ends in vertex v2 shows that the traffic flow passes through v1's corresponding road segment first and then passes through v2's corresponding road segment. We should also note that the two road segments linked by an edge do not have to share a common road intersection, i.e., they can be disjoint. Given the paths of all the vehicles, a PDG can be constructed such that all the unique road segments on the vehicle paths have corresponding vertices in the graph. For each pair of road segments on a vehicle path, there is a corresponding edge in the graph. If the paths of two or more vehicles go through the same pair of road segments, there is only one corresponding edge in the graph.

A vertex of a PDG has the following properties.

• Proposed Change: The proposed lane-direction change at the corresponding road segment. This may be submitted by a learning agent or created by the system during the coordinating process. The property value can be 1, 0 or −1. A value of 1 means the upstream direction gets one more lane. A value of 0 means there is no need for a change. A value of −1 means the downstream direction gets one more lane.

• Consequential Changes: A list of potential lane-direction changes caused by lane-direction changes at other road segments. Similar to the Proposed Change property, the value of a consequential change can be 1, 0 or −1.

• Imbalance: The lane direction that has a considerably higher traffic load than the other direction. The property value can be upstream, downstream or none.
In our implementation, the imbalance of traffic load is measured based on the queue lengths in the opposite directions. Let q_up be the upstream queue length and q_down be the downstream queue length. Let the total queue length in both directions be q_total. Let P be a threshold percentage. The property value is computed as follows.

Imbalance = upstream,   if q_up / q_total > P
            downstream, if q_down / q_total > P
            none,       otherwise                    (4)

Due to the dynamic nature of traffic, the imbalance value may change frequently, leading to frequent changes of lane-directions. This may not be ideal in practice. One can obtain a steadier imbalance value by adding certain restrictions to the computation. For example, one may require that the ratio between the upstream queue length and the total queue length must be above the threshold for a certain period of time before setting the imbalance value to upstream.

• Current Lane Configuration: The number of upstream lanes and the number of downstream lanes in the corresponding road segment.

An edge of a PDG has a property called impact, which shows whether a lane-direction change at the starting vertex can lead to the same change at the ending vertex. The value of this property can be 1 or −1. A value of 1 means the change at both vertices will be the same. For example, if the change at the starting vertex is increasing the number of upstream lanes, the change at the ending vertex will also be increasing the number of upstream lanes. A value of −1 means the changes at the two vertices will be opposite to each other. The relationship between the changes and the property value is shown in Equation 5, where the starting vertex is v1 and the ending vertex is v2. The property value is determined based on the path of the majority of the vehicles that move between the two corresponding road segments. If the path passes through both road segments in the same direction, the property value is 1. Otherwise, the property value is −1.
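The imbalance test (Equation 4) and the impact value described above can be sketched as follows, assuming queue lengths are observed per road segment and P = 0.65 as in our experiments; the function names are our own illustration.

```python
P = 0.65  # imbalance threshold percentage (the value used in our experiments)

def imbalance(q_up, q_down):
    """Equation 4: classify the traffic imbalance on a road segment."""
    q_total = q_up + q_down
    if q_total == 0:
        return "none"          # no queued vehicles, so no imbalance
    if q_up / q_total > P:
        return "upstream"
    if q_down / q_total > P:
        return "downstream"
    return "none"

def impact(same_direction):
    """Impact value of a PDG edge: 1 if the majority path passes through
    both segments in the same direction, -1 otherwise."""
    return 1 if same_direction else -1
```

In practice the imbalance would additionally be required to persist for a minimum period, as discussed above, before a change is allowed.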
The impact property is key for finding the consequential change at the ending vertex given the change at the starting vertex. As shown in Equation 6, the consequential change at the ending vertex can be computed from the property value and the initial change at the starting vertex.

impact(v1, v2) = change_v1 × change_v2   (5)

change_v2 = impact(v1, v2) × change_v1   (6)

When constructing a PDG, it may not be necessary to consider the full path of vehicles, for two reasons. First, the full path of a vehicle can consist of a large number of road segments, and since an edge is created for each pair of segments on a path, the size of the graph grows quadratically with the path length. Second, due to the highly dynamic nature of traffic, the coordination of lane-direction changes should only consider the traffic conditions in the near future. Therefore, in our implementation, we set an upper limit on the number of road segments in vehicle paths when building a PDG. The limit is called the lookup distance in our experiments.

We show an example road network (Figure 6a) and its corresponding PDG (Figure 6b). The road network has 12 road segments (A to L). There are two paths going through the network, path α and path β. Path α passes through 4 road segments (A, F, I, J). Path β passes through 3 road segments (C, F, H). The 6 unique road segments on the two paths correspond to 6 vertices in the PDG. The PDG contains 3 edges starting from A (A-F, A-I, A-J) because path α passes through A, F, I and J in the road network. Similarly, the PDG contains 2 edges starting from F. For each edge in the PDG, the value of its impact property is attached to the edge. As path α goes through the segments A, F, I and J in the upstream direction (Figure 6a), the impact value of all the edges between the corresponding vertices is 1 in the PDG (Figure 6b). In contrast, the impact value of the edge C-H is −1 in the PDG. This is because path β goes through C in the upstream direction but goes through H in the downstream direction.
Figure 6: (a) A simple road network with two paths, path α and path β. (b) The path dependency graph (PDG) based on the road network in Figure 6a.

The coordinator uses the Direction-Change Evaluation algorithm (Algorithm 3) to quantify the conflicts between lane-direction changes. The algorithm traverses a PDG in a breadth-first manner, in iterations. The number of iterations is controlled by a depth parameter (shown as dp in Algorithm 3). In the first iteration, the algorithm starts with the lane-direction changes that are proposed by the bottom-layer learning agents. For each vertex with a proposed change, its first-depth neighbours (out-degree nodes) are visited (Step 6). For each of the neighbours, the consequential change caused by the proposed change is computed, using the process shown in Equation 6. Then the algorithm updates the count of conflicts at the neighbour's corresponding road network edge. In the next iteration, the algorithm starts with all the neighbour vertices that were visited in the previous iteration. After each iteration, dp is decremented. The algorithm stops when dp reaches zero.

The Direction-Change Evaluation algorithm not only quantifies the conflicts between lane-direction changes but also expands the set of lane-direction changes for the road segments that are visited during the traversal of the PDG. The rationale is that the bottom-layer learning agents may not propose lane-direction changes for road segments when they do not predict any benefit of the change based on local traffic conditions. However, the lane-direction changes in other parts of the road network may eventually affect the traffic conditions at these road segments, leading to traffic congestion.
The algorithm pre-emptively attempts lane-direction changes for these road segments when it predicts that there can be consequential changes caused by the changes in other parts of the road network. This can help mitigate upcoming traffic congestion. As shown in Step 6 of Algorithm 3, a Direction-Change Creation algorithm is used for proposing additional lane-direction changes. Details of the Direction-Change Creation algorithm are shown in Algorithm 4. Every time the coordinator executes Direction-Change Evaluation, a new lane configuration is computed and a new graph G'_t(V, E') is generated.

Complexity of the Coordinating Process. Let us assume there are m requests from the bottom-layer agents. The degree of a node in the PDG is deg(v), where v ∈ V_PDG. The algorithm performs m breadth-first traversals of the PDG to a certain depth dp, so the complexity of the m traversals is O(m · deg(v)^dp). However, according to Lemma A.1, deg(v) is independent of the road network size and the number of paths for a given lookup distance l. The depth dp is constant irrespective of the road network size. The algorithm complexity can therefore be reduced to O(m); that is, it is linear in the number of requests from agents within the buffering period. In the worst case, there can be a request for each road segment of the road network G(V, E), leading to a complexity of O(|E|).

Distributed version. Algorithm 3 can work with a set of distributed coordinating agents in the upper layer. In Algorithm 3, execution is independent of the order of the requests coming from the bottom-layer agents. Therefore, the requests can be processed in a distributed manner. Every coordinating agent traverses its first depth and informs the other agents of the changes. Once every agent finishes its first depth, all coordinating agents start their second depth, and so on. In such a setting, the complexity of the algorithm is O(1).
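The breadth-first evaluation described above can be sketched as follows. This is a simplified illustration over the PDG edge map (assumed to be {(v1, v2): impact}); where Algorithm 3 calls Direction-Change Creation for unassigned vertices, this sketch simply adopts the consequential change.

```python
from collections import defaultdict

def evaluate_changes(pdg, proposed, dp):
    """Simplified Direction-Change Evaluation.

    pdg:      {(v1, v2): impact} with impact in {1, -1}
    proposed: {vertex: change} with change in {1, -1}, from learning agents
    dp:       depth of the breadth-first traversal
    Returns (conflict counts per vertex, expanded change set).
    """
    conflicts = defaultdict(int)
    consequential = defaultdict(list)
    frontier = set(proposed)
    changes = dict(proposed)
    while dp > 0 and frontier:
        next_frontier = set()
        for v in frontier:
            for (v1, v2), imp in pdg.items():
                if v1 != v:
                    continue
                conseq = imp * changes[v]           # Equation 6
                if conseq not in consequential[v2]:
                    consequential[v2].append(conseq)
                if v2 in changes and changes[v2] != conseq:
                    conflicts[v2] += 1              # proposed vs consequential
                else:
                    changes.setdefault(v2, conseq)  # expand the change set
                next_frontier.add(v2)
        frontier = next_frontier
        dp -= 1
    return conflicts, changes
```

A change at a vertex would then be applied only when its conflict count stays below the given threshold.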
In this work, we implemented the centralized version; however, when applied to larger road networks, the distributed version can be implemented.

VI. EXPERIMENTAL METHODOLOGY

We compare the proposed algorithm, CLLA, against DLA and two other baseline algorithms using traffic simulations. We evaluate the performance of the algorithms on two road networks based on real traffic data. The effects of the individual parameters of CLLA and DLA are also evaluated. The rest of the section details the settings of the experiments.

A. Experimental setup

Simulation Model. We simulate a traffic system similar to the ones used for reinforcement learning-based traffic optimization in [23], [32]. In our implementation, vehicles on a road link are modelled based on travel time, which is the sum of two values, pure transmit time and waiting time. Pure transmit time is the time taken by a vehicle to travel through the road link at the free-flow speed. Waiting time is the duration that a vehicle waits in a traffic signal queue. When the direction of a lane needs to be changed, all existing vehicles in the lane need to leave the lane and move into an adjacent lane in the same direction. Vehicles travelling in the opposite direction can use the lane only after it is cleared of traffic.

Algorithm 3: Direction-Change Evaluation
Input: EC_initial, a set of edge-change pairs proposed by the learning agents
Input: G(V, E), a road network graph. Each edge in the graph has a property, conflict count, an integer value that is set to 0 initially.
Input: l, the lookup distance of the PDG
Input: dp, the depth of search
Output: EC_expanded, a set of edge-change pairs given by the coordinator
1 Build a PDG based on the next l road segments on the path of vehicles. For each PDG vertex, its properties, proposed change and consequential changes, are set to empty values initially.
2 Create an empty set N. For each edge-change pair in EC_initial, find the corresponding vertex v in the PDG and update its proposed change property. Add v to N.
3 Set the current depth of search to dp.
4 If the current depth is above 0, do the following steps. Otherwise, jump to Step 8.
5 Create an empty set N'.
6 For each v in N, first check whether v has a proposed change. If not, get a proposed change for v using the Direction-Change Creation algorithm. Then for each of v's neighbours at the end of its out-degree arcs, v_o, identify the consequential change at that vertex that is caused by the proposed change at v. Add the consequential change to the consequential changes of v_o if the change does not exist in the list. If v_o already has a proposed change but the proposed change is different from the consequential change at v_o, increase the conflict count of the corresponding road network edge by 1. Add v_o to N'.
7 Decrease the current depth of search by 1. Replace the vertices in N with the vertices in N'. Go back to Step 4.
8 For each PDG vertex v with a proposed change, create a corresponding edge-change pair and add the pair to EC_expanded. Exit the algorithm.

Road Networks. We run experiments based on real taxi trip data from New York City [33]. The data includes the source, the destination and the start time of the taxi trips in the city. We pick two areas for simulation (Figure 7a and Figure 7b) because these areas contain a larger number of sources and destinations than other areas. The road network of the simulation areas is loaded from OpenStreetMap [34]. For a specific taxi trip, the source and the destination are mapped to the nearest OpenStreetMap nodes. The shortest path between the source and the destination is calculated. The simulated vehicles follow the shortest paths generated from the taxi trip data.

Algorithm 4: Direction-Change Creation
Input: v, a PDG vertex that corresponds to an edge e in a road network graph. The value of the imbalance property is set to none initially.
Output: change, the proposed lane-direction change for e, which can be 1 (upstream), 0 (none) or −1 (downstream). The default value is 0.
1 consequential_up: whether the consequential changes at v include one that increases the number of upstream lanes.
2 consequential_down: whether the consequential changes at v include one that increases the number of downstream lanes.
3 if imbalance = upstream then
4   if consequential_up = True and consequential_down = False then
5     change ← 1 (change one lane from downstream to upstream)
6 if imbalance = downstream then
7   if consequential_up = False and consequential_down = True then
8     change ← −1 (change one lane from upstream to downstream)
9 if imbalance = none then
10   if consequential_up = True and downstream has more lanes than upstream then
11     change ← 1 (change one lane from downstream to upstream)
12   if consequential_down = True and upstream has more lanes than downstream then
13     change ← −1 (change one lane from upstream to downstream)

Figure 7: The road network of the two simulation areas in New York: (a) Long Island (LI), (b) Midtown Manhattan (MM).

Comparison baselines. Unlike the proposed solution, CLLA, the existing approaches for optimizing lane-directions are based on linear programming, which makes them unsuitable for large-scale dynamic optimization due to their high computation cost. Due to the lack of comparable solutions, we define three baseline solutions to compare against CLLA. In our experiments, the traffic signals use static timing and phasing, regardless of which solution is used. We conduct comparative tests against the following solutions:

• No Lane-direction Allocations (no-LA): This solution does not make any lane-direction change. The traffic is controlled by static traffic signals only.
• Demand-based Lane Allocations (DLA): In this solution, the lane-direction changes are computed with Algorithm 1.

• Local Lane-direction Allocations (LLA): This solution uses multiple learning agents to decide lane-direction changes. The optimization is performed using the approach described in Section V-B. LLA is similar to CLLA but there is no coordination between the agents.

• Coordinated Learning-based Lane Allocations (CLLA): This is the two-layer optimization framework described in Section V-A.

B. Evaluation Metrics

We measure the performance of the solutions based on the following metrics.

Average travel time: The travel time of a vehicle is the duration that the vehicle spends on travelling from its source to its destination. We compute the average travel time based on all the vehicles that complete their trips during a simulation. A higher average travel time indicates that the traffic is more congested during the simulation. Our proposed solutions aim to reduce the average travel time. More information about this metric is given in Section III.

Deviation from free-flow travel time: The free-flow travel time of a vehicle is the shortest possible travel time, achieved when the vehicle travels at the speed limit of the roads without slowing down at traffic lights during its entire trip. The Deviation From Free-flow travel Time (DFFT) is defined as in Equation 7, where t_a is the actual travel time and t_f is the free-flow travel time. The lowest value of DFFT is 1, which is also the best value that a vehicle can achieve.

DFFT = t_a / t_f   (7)

C. Parameter Sensitivity Testing

We evaluate the effects of the hyper-parameters of CLLA and DLA that are directly related to lane-direction changes in the simulation model. To evaluate the effects of a specific parameter, we run a set of tests varying the value of that parameter (while keeping the values of the other parameters at their defaults, reported in Table I).
The average travel time is reported for each of the tests. The detailed settings of the parameters are shown in Table I. We describe the parameters as follows.

Cost of lane-direction change in CLLA: The cost of a lane-direction change is the time spent on clearing the lane that needs to be changed. When the direction of a lane needs to be changed, all the existing vehicles in the lane need to leave the lane before the lane can be used by the vehicles from the opposite direction. The time spent on clearing the lane can vary due to various random factors in the real world. For example, the vehicles in the lane may not be able to move to an adjacent lane immediately if the adjacent lane is highly congested. We vary the value of this parameter over a large range, from 40 seconds to 480 seconds.

Table I: Settings used in the parameter sensitivity experiments

Parameter | Range | Default value
Cost of lane-direction change in CLLA and DLA (seconds) | 40-480 | 120
Aggressiveness of lane-direction change in CLLA (seconds) | 100-1000 | 300
Depth in CLLA | 1-5 | 2
Lookup distance in CLLA | 3-7 | 5
Update period in CLLA (minutes) | 0.3-20 | 0.3
Update period in DLA (minutes) | 2.5-20 | 10

Aggressiveness of lane-direction change in CLLA: This parameter affects the minimum interval between lane-direction changes. A lane-direction change can only happen when there is a traffic imbalance between the two directions of a road segment. The imbalance is computed based on the model shown in Equation 4 (Section V-C). Based on an existing study [3], we set the threshold percentage P of the model to 65% and require that the traffic imbalance must last for a minimum time period before a lane-direction change can be performed. We define the aggressiveness of lane-direction changes in CLLA as the length of this period. When the period is short, the system can perform lane-direction changes at smaller intervals, and vice versa.

Depth in CLLA: This is the parameter dp used in Algorithm 3.
When the depth is larger, CLLA can explore more vertices in the PDG, which allows it to detect the impact of a lane-direction change on road segments that are further away from the location of the change.

Lookup distance in CLLA: This is the parameter l used in Algorithm 3. It affects the number of vertices and the number of edges in a PDG. With a higher lookup distance, the PDG considers more road segments in the path of vehicles, which can help identify the impact of lane-direction changes at a longer distance but increases the size of the graph at the same time.

Update period in CLLA: This parameter controls the frequency at which the coordinating agents decide on lane-direction changes. CLLA is suitable for highly dynamic traffic environments, hence the update period Δt can be set to a low value. We vary the value of this parameter between 0.3 minutes and 20 minutes, with the default value set to 0.3 minutes.

Update period in DLA: This parameter affects the frequency at which DLA optimizes lane-direction changes. DLA decides on lane-direction changes based on the traffic demand collected within the update period Δt prior to the optimization process. We vary the value of this parameter between 2.5 minutes and 20 minutes, with the default value set to 10 minutes.

VII. EXPERIMENTAL RESULTS

We now present the experimental results when comparing CLLA against the baseline algorithms in the first part, and the sensitivity analysis of the parameter values of the algorithms in the second part.

Table II: The percentage of vehicles with a DFFT higher than 10

Solution | Long Island | Midtown Manhattan
DLA | 10.04% | 49.52%
LLA | 7.78% | 44.19%
CLLA | 7.77% | 46.12%

A. Comparison against the baselines

This experiment compares the performance of the four solutions described in Section VI-A. We run a number of simulations in this experiment.
For each simulation, we extract taxi trip information for one hour from the real taxi trip data from New York. Based on the real data, we generate traffic in the simulation. The experiment is done for the two areas shown in Figure 7a and Figure 7b. To simulate a larger variety of traffic scenarios, we also up-sample the trip data to generate more vehicles. We define an Up-Sampled Factor, which is the number of vehicles that are generated based on each taxi trip in the taxi data. For LLA and CLLA, the learning rate α is 0.001 and the discount factor used by Q-learning is 0.75. The parameter minLoad of DLA is set to 100. For the other parameters of the solutions, we use the default values shown in Table I.

Average travel time: Figure 8a and Figure 8b show the average travel time achieved with the four solutions. CLLA outperforms the other solutions in both simulation areas. We can observe that the average travel time of LLA and CLLA is significantly lower than the average travel time of no-LA, which shows the benefit of dynamic lane-direction changes. Although DLA achieves lower travel times than no-LA, it does not perform well compared to CLLA in either area. CLLA performs consistently better than LLA, because LLA only makes lane-direction changes based on local traffic information, without coordination.

We also test the performance of the solutions in a different scenario, where the traffic demand is static. Vehicles are generated at a constant rate during a 30-minute period. Under this setting, the traffic is less dynamic than in the previous scenario, where the traffic demand is based on real data. Figure 9a and Figure 9b show the average travel time achieved with the four solutions. Interestingly, DLA performs as well as CLLA. This is due to the fact that DLA optimizes traffic based on the estimated traffic demand.
As the traffic demand is kept constant, the estimated demand can match the actual demand, resulting in the good performance of DLA. On the other hand, CLLA is developed for highly dynamic traffic environments; when the traffic is static, as in this scenario, the advantage of the solution is limited. The results show that DLA can work well with static traffic but does not work well with highly dynamic traffic. CLLA, on the other hand, works well in both environments: it substantially outperforms the baselines in dynamic environments and matches the performance of DLA in static environments.

Deviation from free-flow travel time (DFFT): Table II shows the percentage of vehicles whose travel time is 10 times or more their free-flow travel time. The results show that LLA and CLLA achieve a lower deviation from the free-flow travel time compared to DLA.

Figure 8: Performance of the four solutions with dynamic traffic: (a) Long Island, (b) Midtown Manhattan.

Figure 9: Performance of the four solutions with static traffic: (a) Long Island, (b) Midtown Manhattan.

B. Parameter sensitivity testing

To evaluate the effects of the individual parameters, we run simulations in the area shown in Figure 7a. Each simulation lasts for one hour, during which the traffic is generated based on the real taxi trip data from the area. Figure 10 shows the effects of four parameters of CLLA. Figure 11 compares the effects of the update period between DLA and CLLA. Our results show that the travel time increases when the cost of a lane-direction change increases (Figure 10a).
The result indicates that lane-direction changes may not be beneficial in all circumstances. When the cost of lane-direction changes is high, performing the changes can cause significant interruption to the traffic and negate the benefit of the changes.

Figure 10b shows how the aggressiveness of lane-direction changes can affect the travel time of vehicles. The result shows that both a low level and a high level of aggressiveness have a negative impact on travel times. When the level of aggressiveness is low, lane-direction changes can only happen at large intervals, hence the changes may not adapt to the dynamic change of traffic. When the level of aggressiveness is high, the system changes the direction of lanes at a high frequency, which can cause significant interruption to the traffic due to the time taken to clear the lanes during the changes.

Our results show that the best depth for traversing the PDG is 2 (Figure 10c). When the depth changes from 1 to 2, we observe a decrease in travel time. However, when the depth is higher than 2, we do not observe a further decrease in travel time. With a higher depth, the system can identify the impact of a lane-direction change on road segments that are further away, but that impact can become negligible when the lane-direction change is far away. This is the reason there is no improvement in travel time when the depth is higher than 2.

Figure 10d shows that a larger lookup distance can result in a lower average travel time. When the lookup distance increases, CLLA considers more road segments in a vehicle path when building the PDG. This helps identify the consequential lane-direction changes on the same path. The reduction in the average travel time becomes less significant when the lookup distance is higher than 2. This is because the impact of a lane-direction change reduces when the change is further away.

Figure 10: Effects of four parameters of CLLA: (a) cost of change, (b) aggressiveness of change, (c) depth (dp), (d) lookup distance (l).

Figure 11: Effects of the update period of (a) DLA and (b) CLLA.

When the update period Δt of DLA is below 5 minutes or beyond 15 minutes, it is less likely to get a good estimation of the traffic demand, which can lead to a relatively high travel time (Figure 11a). The average travel time is at its minimum when Δt is set to 10 minutes. Different to DLA, the travel time achieved with CLLA grows slowly with the increase of Δt until Δt goes beyond 15 minutes. The relatively steady performance of CLLA shows that the coordination between lane-direction changes can help mitigate traffic congestion for a certain period of time into the future. If minimizing the average travel time is the priority, one can set Δt to a very low value, e.g., 5 minutes. If one needs to reduce the computation cost of the optimization while achieving a reasonably good travel time, Δt can be set to a larger value, e.g., 15 minutes.
VIII. CONCLUSION

We have shown that effective traffic optimization can be achieved with dynamic lane-direction configurations. Our proposed hierarchical multi-agent solution, CLLA, helps reduce travel time by combining machine learning with global coordination of lane-direction changes. The proposed solution adapts to significant changes of traffic demand in a timely manner, making it a viable choice for realizing the potential of connected autonomous vehicles in traffic optimization. Compared to state-of-the-art solutions based on lane-direction configuration, CLLA runs more efficiently and is scalable to large networks.

There are many directions one can investigate further. An interesting extension would be to incorporate dynamic traffic signals into the optimization process. Currently we assume that the connected autonomous vehicles follow a pre-determined path during their trip. An exciting direction for further research is to dynamically change vehicle routes in addition to the lane-direction changes. The dynamic change of the speed limit of roads could also be included in an extension to CLLA.

REFERENCES

[1] L. R. Ford and D. R. Fulkerson, "Maximal flow through a network," in Classic Papers in Combinatorics. Springer, 2009, pp. 243–248.
[2] L. Fleischer and M. Skutella, "Quickest flows over time," SIAM J. Comput., vol. 36, no. 6, pp. 1600–1630, 2007.
[3] B. Wolshon and L. Lambert, "Planning and operational practices for reversible roadways," ITE Journal, vol. 76, no. 8, pp. 38–43, August 2006.
[4] L. Lambert and B. Wolshon, "Characterization and comparison of traffic flow on reversible roadways," Journal of Advanced Transportation, vol. 44, no. 2, pp. 113–122, 2010.
[5] J. J. Wu, H. J. Sun, Z. Y. Gao, and H. Z. Zhang, "Reversible lane-based traffic network optimization with an advanced traveller information system," Engineering Optimization, vol. 41, no. 1, pp. 87–97, 2009.
APPENDIX
DEGREE OF A VERTEX IN PDG

Lemma A.1. The maximum node degree ∆(PDG) is independent of the underlying road network G(V, E) and the number of vehicle paths.

Proof. Let v ∈ V_PDG. The maximum vertex degree ∆(PDG) can be found as follows. The degree of a vertex v_G ∈ G(V, E) (the road network) depends on the number of roads connected to an intersection.
A special property of a road graph is that the degree of a node does not increase with the network size. Let there be n roads, on average, connected to one intersection in G. Then deg(v_G) = n. Now consider v ∈ V_PDG, which is also v ∈ E, i.e., v is a road in the road network. Starting from v, within a lookup distance of l, there can be a maximum of n^l roads. Since n does not increase with the network size, n^l also does not increase with the network size. Assuming the worst case, there can be paths from v to each of these n^l roads. Let R be the set of these n^l roads. According to the definition of PDG, if there is a path between v and r ∈ R, then ∃ e_{v,r} ∈ E_PDG ∀ r ∈ R. This means there are n^l edges from v. Therefore deg(v) = n^l, and the maximum node degree is ∆(PDG) = n^l. Note that ∆(PDG) is independent of the size of G(V, E) and the number of paths.
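The intuition behind the lemma can be checked with a toy experiment. The sketch below is hypothetical: the paper gives no implementation, and the grid road layout, the `grid_roads`/`max_pdg_degree` names, and the hop-based notion of lookup distance are our simplifying assumptions. It counts, for grid networks of two different sizes, the largest number of roads any single road can reach within l hops, which upper-bounds the PDG degree, and shows that this count does not grow with the network.

```python
from itertools import product

def grid_roads(k):
    """Directed roads of a k x k grid road network: one road per direction
    for every pair of adjacent intersections."""
    roads = set()
    for x, y in product(range(k), repeat=2):
        for dx, dy in ((1, 0), (0, 1)):
            u, v = (x, y), (x + dx, y + dy)
            if v[0] < k and v[1] < k:
                roads.add((u, v))
                roads.add((v, u))
    return roads

def roads_within(roads, start, l):
    """Roads reachable from road `start` in at most l hops, where a road
    (u, v) is followed by any road leaving intersection v."""
    succ = {}
    for u, v in roads:
        succ.setdefault(u, []).append((u, v))
    seen, frontier = {start}, [start]
    for _ in range(l):
        nxt = []
        for _, v in frontier:
            for r in succ.get(v, []):
                if r not in seen:
                    seen.add(r)
                    nxt.append(r)
        frontier = nxt
    return seen - {start}

def max_pdg_degree(k, l):
    """Worst-case PDG degree bound: the largest number of roads any single
    road can reach within the lookup distance l."""
    roads = grid_roads(k)
    return max(len(roads_within(roads, r, l)) for r in roads)

# Doubling the network size leaves the bound unchanged, as the lemma claims:
print(max_pdg_degree(8, 2), max_pdg_degree(16, 2))
```

Because each intersection connects a bounded number of roads (n = 4 in a grid), the reachable set within a fixed lookup distance saturates at a constant, matching the ∆(PDG) = n^l bound in the proof.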