Learning to Switch Among Agents in a Team via 2-Layer Markov Decision Processes
Published in Transactions on Machine Learning Research (07/2022)

Vahid Balazadeh (vahid@cs.toronto.edu), University of Toronto
Abir De (abir@cse.iitb.ac.in), Indian Institute of Technology Bombay
Adish Singla (adishs@mpi-sws.org), Max Planck Institute for Software Systems
Manuel Gomez Rodriguez (manuelgr@mpi-sws.org), Max Planck Institute for Software Systems

Reviewed on OpenReview: https://openreview.net/forum?id=NT9zgedd3I

Abstract

Reinforcement learning agents have been mostly developed and evaluated under the assumption that they will operate in a fully autonomous manner -- they will take all actions. In this work, our goal is to develop algorithms that, by learning to switch control between agents, allow existing reinforcement learning agents to operate under different automation levels. To this end, we first formally define the problem of learning to switch control among agents in a team via a 2-layer Markov decision process. Then, we develop an online learning algorithm that uses upper confidence bounds on the agents' policies and the environment's transition probabilities to find a sequence of switching policies. The total regret of our algorithm with respect to the optimal switching policy is sublinear in the number of learning steps and, whenever multiple teams of agents operate in a similar environment, our algorithm greatly benefits from maintaining shared confidence bounds for the environments' transition probabilities and enjoys a better regret bound than problem-agnostic algorithms. Simulation experiments illustrate our theoretical findings and demonstrate that, by exploiting the specific structure of the problem, our proposed algorithm is superior to problem-agnostic algorithms.
1 Introduction

In recent years, reinforcement learning (RL) agents have achieved, or even surpassed, human performance in a variety of computer games by taking decisions autonomously, without human intervention (Mnih et al., 2015; Silver et al., 2016; 2017; Vinyals et al., 2019). Motivated by these success stories, there has been tremendous excitement about the possibility of using RL agents to operate fully autonomous cyberphysical systems, especially in the context of autonomous driving. Unfortunately, a number of technical, societal, and legal challenges have so far precluded this possibility from becoming a reality. In this work, we argue that existing RL agents may still enhance the operation of cyberphysical systems if deployed under lower automation levels. For example, if we let RL agents take some of the actions and leave the remaining ones to human agents, the resulting performance may be better than the performance either of them would achieve on their own (Raghu et al., 2019a; De et al., 2020; Wilder et al., 2020).

Once we depart from full automation, we need to address the following question: when should we switch control between machine and human agents? In this work, we look into this problem from a theoretical perspective and develop an online algorithm that learns to optimally switch control between multiple agents in a team automatically. However, to fulfill this goal, we need to address several challenges:

- Level of automation. In each application, what is considered an appropriate and tolerable load for each agent may differ (European Parliament, 2006). Therefore, we would like our algorithms to provide mechanisms to adjust the amount of control for each agent (i.e., the level of automation) during a given time period.

- Number of switches.
Consider two different switching patterns resulting in the same amount of agent control and equivalent performance. Then, we would like our algorithms to favor the pattern with the least number of switches. For example, in a team consisting of human and machine agents, every time a machine defers (takes) control to (from) a human, there is an additional cognitive load for the human (Brookhuis et al., 2001).

- Unknown agent policies. The spectrum of human abilities spans a broad range (Macadam, 2003). As a result, there is a wide variety of potential human policies. Here, we would like our algorithms to learn personalized switching policies that, over time, adapt to the particular humans (and machines) they are dealing with.

- Disentangling agents' policies and environment dynamics. We would like our algorithms to learn to disentangle the influence of the agents' policies and the environment dynamics on the switching policies. By doing so, they could be used to efficiently find multiple personalized switching policies for different teams of agents operating in similar environments (e.g., multiple semi-autonomous vehicles with different human drivers).

To tackle the above challenges, we first formally define the problem of learning to switch control among agents in a team using a 2-layer Markov decision process (Figure 1). Here, the team can be composed of any number of machine or human agents, and the agents' policies, as well as the transition probabilities of the environment, may be unknown. In our formulation, we assume that all agents follow Markovian policies[1], similarly to other theoretical models of human decision making (Townsend et al., 2000; Daw & Dayan, 2014; McGhan et al., 2015).
Under this definition, the problem reduces to finding the switching policy that provides an optimal trade-off between the environmental cost, the amount of agent control, and the number of switches. Then, we develop an online learning algorithm, which we refer to as UCRL2-MC[2], that uses upper confidence bounds on the agents' policies and the transition probabilities of the environment to find a sequence of switching policies whose total regret with respect to the optimal switching policy is sublinear in the number of learning steps. In addition, we demonstrate that the same algorithm can be used to find multiple sequences of switching policies across several independent teams of agents operating in similar environments, where it greatly benefits from maintaining shared confidence bounds for the transition probabilities of the environments and enjoys a better regret bound than UCRL2, a well-known reinforcement learning algorithm that we view as the most natural competitor. Finally, we perform a variety of simulation experiments in the standard RiverSwim environment as well as an obstacle avoidance task, where we consider multiple teams of agents (drivers) composed of one human and one machine agent. Our results illustrate our theoretical findings and demonstrate that, by exploiting the specific structure of the problem, our proposed algorithm is superior to problem-agnostic alternatives.

Before we proceed further, we would like to point out that, at a broader level, our methodology and theoretical results are applicable to the problem of switching control between agents following Markovian policies. As long as the agent policies are Markovian, our results do not distinguish between machine and human agents. In this context, we view teams of human and machine agents as one potential application of our work, which we use as a motivating example throughout the paper.
However, we would also like to acknowledge that a practical deployment of our methodology in a real application with human and machine agents would require considering a wide range of additional practical aspects (e.g., transparency, explainability, and visualization). Moreover, one may also need to explicitly model the difference in reaction times between human and machine agents. Finally, there may be scenarios in which it might be beneficial to allow a human operator to switch control. Such considerations are out of the scope of our work.

[1] In certain cases, it is possible to convert a non-Markovian human policy into a Markovian one by changing the state representation (Daw & Dayan, 2014). Addressing the problem of learning to switch control among agents in a team in a semi-Markovian setting is left as a very interesting avenue for future work.
[2] UCRL2 with Multiple Confidence sets.

2 Related work

One can think of applying existing RL algorithms (Jaksch et al., 2010; Osband et al., 2013; Osband & Van Roy, 2014; Gopalan & Mannor, 2015), such as UCRL2 or Rmax, to find switching policies. However, these problem-agnostic algorithms are unable to exploit the specific structure of our problem. More specifically, our algorithm computes the confidence intervals separately over the agents' policies and the transition probabilities of the environment, instead of computing a single confidence interval, as problem-agnostic algorithms do. As a consequence, our algorithm learns to switch more efficiently across multiple teams of agents, as shown in Section 6.

There is a rapidly increasing line of work on learning to defer decisions in the machine learning literature (Bartlett & Wegkamp, 2008; Cortes et al., 2016; Geifman et al., 2018; Ramaswamy et al., 2018; Geifman & El-Yaniv, 2019; Liu et al., 2019; Raghu et al.
, 2019a; b; Thulasidasan et al., 2019; De et al., 2020; 2021; Mozannar & Sontag, 2020; Wilder et al., 2020; Shekhar et al., 2021). However, previous work has typically focused on supervised learning. More specifically, it has developed classifiers that learn to defer by considering the defer action as an additional label value, by training an independent classifier to decide about deferred decisions, or by reducing the problem to a combinatorial optimization problem. Moreover, except for a few recent notable exceptions (Raghu et al., 2019a; De et al., 2020; 2021; Mozannar & Sontag, 2020; Wilder et al., 2020), they do not consider that there is a human decision maker who takes a decision whenever the classifiers defer it. In contrast, we focus on reinforcement learning, and develop algorithms that learn to switch control between multiple agents, including human agents. Recently, Jacq et al. (2022) introduced a new framework called lazy-MDPs to decide when reinforcement learning agents should act. They propose to augment existing MDPs with a new default action and encourage agents to defer decision making to a default policy in non-critical states. Though their lazy-MDP is similar to our augmented 2-layer MDP framework, our approach is designed to switch optimally between possibly multiple agents, each having its own policy.

Our work is also connected to research on understanding switching behavior and switching costs in the context of human-computer interaction (Czerwinski et al., 2000; Horvitz & Apacible, 2003; Iqbal & Bailey, 2007; Kotowick & Shah, 2018; Janssen et al., 2019), which has sometimes been referred to as "adjustable autonomy" (Mostafa et al., 2019).
At a tec hnical level, our work adv ances state of the art in adjustable autonom y by introducing an algorithm with prov able guaran tees to efficiently find the optimal switching p olicy in a setting in which the dynamics of the environmen t and the agents’ policies are unknown ( i.e. , there is uncertain t y ab out them). Moreov er, our work also relates to a recent line of research that combines deep reinforcemen t learning with opp onen t mo deling to robustly switch betw een multiple mac hine p olicies ( Ev erett & Rob erts , 2018 ; Zheng et al. , 2018 ). Ho w ev er, this line of research do es not consider the presence of human agen ts, and there are no theoretical guaran tees on the p erformance of the prop osed algorithms. F urthermore, our work contributes to an extensive b ody of work on h uman-machine collab oration ( Stone et al. , 2010 ; T aylor et al. , 2011 ; W alsh et al. , 2011 ; Barrett & Stone , 2012 ; Macindo e et al. , 2012 ; T orrey & T a ylor , 2013 ; Nikolaidis et al. , 2015 ; Hadfield-Menell et al. , 2016 ; Nikolaidis et al. , 2017 ; Grov er et al. , 2018 ; Haug et al. , 2018 ; Reddy et al. , 2018 ; Wilson & Daughert y , 2018 ; Bro wn & Niekum , 2019 ; Kamalaruban et al. , 2019 ; Radanovic et al. , 2019 ; T schiatsc hek et al. , 2019 ; Ghosh et al. , 2020 ; Strouse et al. , 2021 ). How ever, rather than developing algorithms that learn to switch con trol b et ween humans and mac hines, previous w ork has predominan tly considered settings in whic h the mac hine and the h uman interact with eac h other. Finally , one can think of using option framew ork and the notion of macro-actions and micro-actions to form ulate the problem of learning to switc h ( Sutton et al. , 1999 ). How ev er, the option framework is designed to address different levels of temp oral abstraction in RL b y defining macro-actions that correspond to sub-tasks (skills). 
In our problem, each agent is not necessarily optimized to act for a specific task or sub-goal but for the whole environment/goal. Also, in our problem, we do not necessarily have control over all agents to learn the optimal policy for each agent, while in the options framework, a primary direction is to learn optimal options for each sub-task. In other words, even though we can mathematically refer to each agent policy as an option, they are not conceptually the same.

3 Switching Control Among Agents as a 2-Layer MDP

Given a team of agents $\mathcal{D}$, at each time step $t \in \{1, \ldots, L\}$, our (cyberphysical) system is characterized by a state $s_t \in \mathcal{S}$, where $\mathcal{S}$ is a finite state space, and a control switch $d_t \in \mathcal{D}$, which determines who takes an action $a_t \in \mathcal{A}$, where $\mathcal{A}$ is a finite action space. In the above, the switch value is given by a (deterministic and time-varying) switching policy $d_t = \pi_t(s_t, d_{t-1})$.[3] More specifically, if $d_t = d$, the action $a_t$ is sampled from agent $d$'s policy $p_d(a_t \,|\, s_t)$. Moreover, given a state $s_t$ and an action $a_t$, the state $s_{t+1}$ is sampled from a transition probability $p(s_{t+1} \,|\, s_t, a_t)$. Here, we assume that the agents' policies and the transition probabilities may be unknown. Finally, given an initial state and switch value $(s_1, d_0)$ and a trajectory $\tau = \{(s_t, d_t, a_t)\}_{t=1}^{L}$ of states, switch values and actions, we define the total cost $c(\tau \,|\, s_1, d_0)$ as
$$c(\tau \,|\, s_1, d_0) = \sum_{t=1}^{L} \left[ c_e(s_t, a_t) + c_c(d_t) + c_x(d_t, d_{t-1}) \right], \qquad (1)$$
where $c_e(s_t, a_t)$ is the environment cost of taking action $a_t$ at state $s_t$, $c_c(d_t)$ is the cost of giving control to agent $d_t$, $c_x(d_t, d_{t-1})$ is the cost of switching from $d_{t-1}$ to $d_t$, and $L$ is the time horizon.[4] Then, our goal is to find the optimal switching policy $\pi^* = (\pi^*_1, \ldots, \pi^*_L)$ that minimizes the expected cost, i.e.,
$$\pi^* = \operatorname{argmin}_{\pi} \; \mathbb{E}\left[ c(\tau \,|\, s_1, d_0) \right], \qquad (2)$$
where the expectation is taken over all the trajectories induced by the switching policy given the agents' policies.

To solve the above problem, one could just resort to problem-agnostic RL algorithms, such as UCRL2 or Rmax, over a standard Markov decision process (MDP), defined as $M = (\mathcal{S} \times \mathcal{D}, \mathcal{D}, \bar{P}, \bar{C}, L)$, where $\mathcal{S} \times \mathcal{D}$ is an augmented state space, the set of actions $\mathcal{D}$ is just the switch values, the transition dynamics $\bar{P}$ at time $t$ are given by
$$p(s_{t+1}, d_t \,|\, s_t, d_{t-1}) = \mathbb{I}[\pi_t(s_t, d_{t-1}) = d_t] \times \sum_{a \in \mathcal{A}} p(s_{t+1} \,|\, s_t, a) \, p_{d_t}(a \,|\, s_t), \qquad (3)$$
and the immediate cost $\bar{C}$ at time $t$ is given by
$$\bar{c}(s_t, d_{t-1}) = \mathbb{E}_{a_t \sim p_{\pi_t(s_t, d_{t-1})}(\cdot \,|\, s_t)}\left[ c_e(s_t, a_t) \right] + c_c(\pi_t(s_t, d_{t-1})) + c_x(\pi_t(s_t, d_{t-1}), d_{t-1}). \qquad (4)$$
Here, note that, by using conditional expectations, we can compute the average cost of a trajectory, given by Eq. 1, from the above immediate costs. However, these algorithms would not exploit the structure of the problem. More specifically, they would not use the observed agents' actions to improve the estimation of the transition dynamics over time.

To avoid the above shortcoming, we will resort instead to a 2-layer MDP where taking an action $d_t$ in state $(s_t, d_{t-1})$ leads first to an intermediate state $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$ with probability $p_{d_t}(a_t \,|\, s_t)$ and immediate cost $c_{d_t}(s_t, d_{t-1}) = c_c(d_t) + c_x(d_t, d_{t-1})$, and then to a final state $(s_{t+1}, d_t) \in \mathcal{S} \times \mathcal{D}$ with probability $\mathbb{I}[\pi_t(s_t, d_{t-1}) = d_t] \cdot p(s_{t+1} \,|\, s_t, a_t)$ and immediate cost $c_e(s_t, a_t)$.
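To make the shortcoming above concrete, the composed transition a problem-agnostic learner estimates is the action-marginalized product of Eq. 3. The following sketch computes it for hypothetical tabular shapes; the function name and the random distributions are our own illustration, not part of the paper:

```python
import numpy as np

def augmented_transition(policy_d, env):
    """Eq. (3) without the switch indicator: p_bar(s' | s) = sum_a p(s' | s, a) p_d(a | s)."""
    return np.einsum('sa,sat->st', policy_d, env)

rng = np.random.default_rng(0)
S, A = 4, 3
policy_d = rng.dirichlet(np.ones(A), size=S)   # p_d(a | s), shape (S, A)
env = rng.dirichlet(np.ones(S), size=(S, A))   # p(s' | s, a), shape (S, A, S)

P_bar = augmented_transition(policy_d, env)
print(np.allclose(P_bar.sum(axis=1), 1.0))     # each row is a distribution: True
```

A learner that only ever sees `P_bar` cannot tell whether a change in it came from the agent's policy or from the environment, which is precisely the entanglement the 2-layer construction removes.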
More formally, the 2-layer MDP is defined by the following 8-tuple:
$$M = (\mathcal{S} \times \mathcal{D}, \; \mathcal{S} \times \mathcal{A}, \; \mathcal{D}, \; P_{\mathcal{D}}, \; P, \; C_{\mathcal{D}}, \; C_e, \; L), \qquad (5)$$
where $\mathcal{S} \times \mathcal{D}$ is the final state space, $\mathcal{S} \times \mathcal{A}$ is the intermediate state space, the set of actions $\mathcal{D}$ is the switch values, the transition dynamics $P_{\mathcal{D}}$ and $P$ at time $t$ are given by $p_{d_t}(a_t \,|\, s_t)$ and $\mathbb{I}[\pi_t(s_t, d_{t-1}) = d_t] \cdot p(s_{t+1} \,|\, s_t, a_t)$, and the immediate costs $C_{\mathcal{D}}$ and $C_e$ at time $t$ are given by $c_{d_t}(s_t, d_{t-1})$ and $c_e(s_t, a_t)$, respectively. The above 2-layer MDP will allow us to estimate separately the agents' policies $p_d(\cdot \,|\, s)$ and the transition probability $p(\cdot \,|\, s, a)$ of the environment, using both the intermediate and final states, and to design an algorithm that improves the regret that problem-agnostic RL algorithms achieve in our problem.

[3] Note that, by making the switching policy dependent on the previous switch value $d_{t-1}$, we can account for the switching cost.
[4] The specific choice of environment cost $c_e(\cdot, \cdot)$, control cost $c_c(\cdot)$ and switching cost $c_x(\cdot, \cdot)$ is application dependent.

Figure 1: Transitions of a 2-layer Markov decision process (MDP) from state $(s, d)$ to state $(s', d')$ after selecting agent $d'$. Here, $d'$ and $d$ denote the current and previous agents in control. In the first layer (switching layer), the switching policy chooses agent $d'$, which takes an action according to its action policy $p_{d'}$. Then, in the action layer, the environment transitions to the next state $s'$ based on the taken action, according to the transition probability $p$.
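A single step of this 2-layer MDP can be sketched as two sampling stages plus the two immediate costs. All names below, and the concrete cost tables, are illustrative stand-ins (the paper leaves the costs application dependent):

```python
import numpy as np

def two_layer_step(s, d_prev, pi_t, policies, env, c_c, c_x, c_e, rng):
    """One step of the 2-layer MDP: switching layer, then action layer."""
    d = pi_t[s, d_prev]                               # d_t = pi_t(s_t, d_{t-1})
    cost = c_c[d] + c_x[d, d_prev]                    # intermediate cost c_{d_t}(s_t, d_{t-1})
    a = rng.choice(env.shape[1], p=policies[d][s])    # intermediate state (s_t, a_t)
    cost += c_e[s, a]                                 # environment cost c_e(s_t, a_t)
    s_next = rng.choice(env.shape[0], p=env[s, a])    # final state (s_{t+1}, d_t)
    return s_next, d, cost

rng = np.random.default_rng(1)
S, A, D = 3, 2, 2
policies = [rng.dirichlet(np.ones(A), size=S) for _ in range(D)]  # p_d(a | s)
env = rng.dirichlet(np.ones(S), size=(S, A))                      # p(s' | s, a)
pi_t = rng.integers(D, size=(S, D))                               # deterministic switching policy
c_c = np.array([0.0, 0.2])              # e.g. a control cost only for agent 1 (placeholder)
c_x = 0.1 * (1 - np.eye(D))             # fixed cost whenever control changes hands (placeholder)
c_e = np.ones((S, A))                   # unit environment cost (placeholder)

s_next, d, cost = two_layer_step(0, 0, pi_t, policies, env, c_c, c_x, c_e, rng)
print(s_next, d, round(cost, 3))
```

Summing the per-step `cost` over an episode reproduces the total cost of Eq. 1.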
4 Learning to Switch in a Team of Agents

Since we may not know the agents' policies nor the transition probabilities, we need to trade off exploitation, i.e., minimizing the expected cost, and exploration, i.e., learning about the agents' policies and the transition probabilities. To this end, we look at the problem from the perspective of episodic learning and proceed as follows.

We consider $K$ independent subsequent episodes of length $L$ and denote the aggregate length of all episodes as $T = KL$. Each of these episodes corresponds to a realization of the same finite-horizon 2-layer Markov decision process, introduced in Section 3, with state spaces $\mathcal{S} \times \mathcal{A}$ and $\mathcal{S} \times \mathcal{D}$, set of actions $\mathcal{D}$, true agent policies $P^*_{\mathcal{D}}$, true environment transition probability $P^*$, and immediate costs $C_{\mathcal{D}}$ and $C_e$. However, since we do not know the true agent policies and environment transition probabilities, just before each episode $k$ starts, our goal is to find a switching policy $\pi^k$ with desirable properties in terms of the total regret $R(T)$, which is given by
$$R(T) = \sum_{k=1}^{K} \left[ \mathbb{E}_{\tau \sim \pi^k, P^*_{\mathcal{D}}, P^*}\left[ c(\tau \,|\, s_1, d_0) \right] - \mathbb{E}_{\tau \sim \pi^*, P^*_{\mathcal{D}}, P^*}\left[ c(\tau \,|\, s_1, d_0) \right] \right], \qquad (6)$$
where $\pi^*$ is the optimal switching policy under the true agent policies and environment transition probabilities. To achieve our goal, we apply the principle of optimism in the face of uncertainty, i.e.,
$$\pi^k = \operatorname{argmin}_{\pi} \; \min_{P_{\mathcal{D}} \in \mathcal{P}^k_{\mathcal{D}}} \; \min_{P \in \mathcal{P}^k} \; \mathbb{E}_{\tau \sim \pi, P_{\mathcal{D}}, P}\left[ c(\tau \,|\, s_1, d_0) \right], \qquad (7)$$
where $\mathcal{P}^k_{\mathcal{D}}$ is a $(|\mathcal{S}| \times |\mathcal{D}| \times L)$-rectangular confidence set, i.e., $\mathcal{P}^k_{\mathcal{D}} = \times_{s,d,t} \, \mathcal{P}^k_{\cdot \,|\, d,s,t}$, and $\mathcal{P}^k$ is a $(|\mathcal{S}| \times |\mathcal{A}| \times L)$-rectangular confidence set, i.e., $\mathcal{P}^k = \times_{s,a,t} \, \mathcal{P}^k_{\cdot \,|\, s,a,t}$. Here, note that the confidence sets are constructed using data gathered during the first $k-1$ episodes and allow for time-varying agent policies $p_d(\cdot \,|\, s, t)$ and transition probabilities $p(\cdot \,|\, s, a, t)$.
However, to solve Eq. 7, we first need to explicitly define the confidence sets. To this end, we first define the empirical distributions $\hat{p}^k_d(\cdot \,|\, s)$ and $\hat{p}^k(\cdot \,|\, s, a)$ just before episode $k$ starts as
$$\hat{p}^k_d(a \,|\, s) = \begin{cases} \frac{N_k(s, d, a)}{N_k(s, d)} & \text{if } N_k(s, d) \neq 0 \\ \frac{1}{|\mathcal{A}|} & \text{otherwise} \end{cases}, \qquad (8)$$
$$\hat{p}^k(s' \,|\, s, a) = \begin{cases} \frac{N'_k(s, a, s')}{N'_k(s, a)} & \text{if } N'_k(s, a) \neq 0 \\ \frac{1}{|\mathcal{S}|} & \text{otherwise} \end{cases}, \qquad (9)$$
where
$$N_k(s, d) = \sum_{l=1}^{k-1} \sum_{t \in [L]} \mathbb{I}(s_t = s, d_t = d \text{ in episode } l), \qquad N_k(s, d, a) = \sum_{l=1}^{k-1} \sum_{t \in [L]} \mathbb{I}(s_t = s, a_t = a, d_t = d \text{ in episode } l),$$
$$N'_k(s, a) = \sum_{l=1}^{k-1} \sum_{t \in [L]} \mathbb{I}(s_t = s, a_t = a \text{ in episode } l), \qquad N'_k(s, a, s') = \sum_{l=1}^{k-1} \sum_{t \in [L]} \mathbb{I}(s_t = s, a_t = a, s_{t+1} = s' \text{ in episode } l).$$
Then, similarly as in Jaksch et al. (2010), we opt for $L_1$ confidence sets[5], i.e.,
$$\mathcal{P}^k_{\cdot \,|\, d,s,t}(\delta) = \left\{ p_d : \| p_d(\cdot \,|\, s, t) - \hat{p}^k_d(\cdot \,|\, s) \|_1 \leq \beta^k_{\mathcal{D}}(s, d, \delta) \right\}, \qquad \mathcal{P}^k_{\cdot \,|\, s,a,t}(\delta) = \left\{ p : \| p(\cdot \,|\, s, a, t) - \hat{p}^k(\cdot \,|\, s, a) \|_1 \leq \beta^k(s, a, \delta) \right\},$$
for all $d \in \mathcal{D}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$ and $t \in [L]$, where $\delta$ is a given parameter,
$$\beta^k_{\mathcal{D}}(s, d, \delta) = \sqrt{\frac{2 \log\left( \frac{(k-1)^7 L^7 |\mathcal{S}| |\mathcal{D}| \, 2^{|\mathcal{A}|+1}}{\delta} \right)}{\max\{1, N_k(s, d)\}}} \qquad \text{and} \qquad \beta^k(s, a, \delta) = \sqrt{\frac{2 \log\left( \frac{(k-1)^7 L^7 |\mathcal{S}| |\mathcal{A}| \, 2^{|\mathcal{S}|+1}}{\delta} \right)}{\max\{1, N'_k(s, a)\}}}.$$
Next, given the switching policy $\pi$ and the transition dynamics $P_{\mathcal{D}}$ and $P$, we define the value function as
$$V^{\pi}_{t \,|\, P_{\mathcal{D}}, P}(s, d) = \mathbb{E}\left[ \sum_{\tau = t}^{L} c_e(s_\tau, a_\tau) + c_c(d_\tau) + c_x(d_\tau, d_{\tau-1}) \,\Big|\, s_t = s, \, d_{t-1} = d \right], \qquad (10)$$
where the expectation is taken over all the trajectories induced by the switching policy given the agents' policies. Then, for each episode $k$, we define the optimal value function $v^k_t(s, d)$ as
$$v^k_t(s, d) = \min_{\pi} \; \min_{P_{\mathcal{D}} \in \mathcal{P}^k_{\mathcal{D}}(\delta)} \; \min_{P \in \mathcal{P}^k(\delta)} \; V^{\pi}_{t \,|\, P_{\mathcal{D}}, P}(s, d). \qquad (11)$$
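A minimal sketch of how the empirical estimates of Eqs. 8 and 9 and the $L_1$ radii can be computed from visit counts. The function names are our own, and the `max(k - 1, 1)` guard for the first episode is our assumption, added so the formula is evaluable at $k = 1$:

```python
import numpy as np

def empirical_policy(N_sda):
    """Eq. (8): hat p_d^k(a | s) from counts N_k(s, d, a); uniform where unvisited."""
    N_sd = N_sda.sum(axis=2, keepdims=True)
    return np.where(N_sd > 0, N_sda / np.maximum(N_sd, 1), 1.0 / N_sda.shape[2])

def beta_policy(N_sd, k, L, S, D, A, delta):
    """L1 radius beta_D^k(s, d, delta); max(k-1, 1) guards the first episode."""
    log_arg = max(k - 1, 1) ** 7 * L ** 7 * S * D * 2 ** (A + 1) / delta
    return np.sqrt(2 * np.log(log_arg) / np.maximum(1, N_sd))

S, D, A = 2, 2, 3
N_sda = np.zeros((S, D, A))
N_sda[0, 0] = [8, 2, 0]                  # ten visits to (s = 0, d = 0)

p_hat = empirical_policy(N_sda)
radii = beta_policy(N_sda.sum(axis=2), k=2, L=10, S=S, D=D, A=A, delta=0.05)
print(p_hat[0, 0], p_hat[1, 1])          # observed frequencies vs. uniform fallback
print(radii[0, 0] < radii[0, 1])         # more visits give a tighter radius: True
```

The environment-side estimate of Eq. 9 and radius $\beta^k(s, a, \delta)$ follow the same pattern with counts indexed by $(s, a, s')$.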
Then, we are ready to use the following key theorem, which gives a solution to Eq. 7 (proven in Appendix A):

Theorem 1. For any episode $k$, the optimal value function $v^k_t(s, d)$ satisfies the following recursive equation:
$$v^k_t(s, d) = \min_{d_t \in \mathcal{D}} \left[ c_{d_t}(s, d) + \min_{p_{d_t} \in \mathcal{P}^k_{\cdot | d_t, s, t}} \sum_{a \in \mathcal{A}} p_{d_t}(a \,|\, s, t) \times \left( c_e(s, a) + \min_{p \in \mathcal{P}^k_{\cdot | s, a, t}} \mathbb{E}_{s' \sim p(\cdot | s, a, t)}\left[ v^k_{t+1}(s', d_t) \right] \right) \right], \qquad (12)$$
with $v^k_{L+1}(s, d) = 0$ for all $s \in \mathcal{S}$ and $d \in \mathcal{D}$. Moreover, if $d^*_t$ is the solution to the minimization problem on the RHS of the above recursive equation, then $\pi^k_t(s, d) = d^*_t$.

The above result readily implies that, just before each episode $k$ starts, we can find the optimal switching policy $\pi^k = (\pi^k_1, \ldots, \pi^k_L)$ using dynamic programming, starting with $v_{L+1}(s, d) = 0$ for all $s \in \mathcal{S}$ and $d \in \mathcal{D}$. Moreover, similarly as in Strehl & Littman (2008), we can solve the inner minimization problems in Eq. 12 analytically using Lemma 7 in Appendix B. To this end, we first find the optimal $p(\cdot \,|\, s, a, t)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$, and then we find the optimal $p_{d_t}(\cdot \,|\, s, t)$ for all $d_t \in \mathcal{D}$.

[5] This choice will result in a sequence of switching policies with desirable properties in terms of total regret.

ALGORITHM 1: UCRL2-MC
 1: Input: cost functions $C_{\mathcal{D}}$ and $C_e$, confidence parameter $\delta$
 2: $\{N_k, N'_k\} \leftarrow$ InitializeCounts()
 3: for $k = 1, \ldots, K$ do
 4:   $\{\hat{p}^k_d\}, \hat{p}^k \leftarrow$ UpdateDistribution($\{N_k, N'_k\}$)
 5:   $\mathcal{P}^k_{\mathcal{D}}, \mathcal{P}^k \leftarrow$ UpdateConfidenceSets($\{\hat{p}^k_d\}, \hat{p}^k, \delta$)
 6:   $\pi^k \leftarrow$ GetOptimal($\mathcal{P}^k_{\mathcal{D}}, \mathcal{P}^k, C_{\mathcal{D}}, C_e$)
 7:   $(s_1, d_0) \leftarrow$ InitializeConditions()
 8:   for $t = 1, \ldots, L$ do
 9:     $d_t \leftarrow \pi^k_t(s_t, d_{t-1})$
10:     $a_t \sim p_{d_t}(\cdot \,|\, s_t)$
11:     $s_{t+1} \sim P(\cdot \,|\, s_t, a_t)$
12:     $N \leftarrow$ UpdateCounts($(s_t, d_t, a_t, s_{t+1}), \{N_k, N'_k\}$)
13:   end for
14: end for
15: Return $\pi^K$
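The inner minimizations over the $L_1$ balls can be solved greedily, in the spirit of the Strehl & Littman routine the paper invokes (Lemma 7 itself lives in Appendix B and is not reproduced here). A sketch for a single expectation term of Eq. 12, under the standard shift-mass construction:

```python
import numpy as np

def optimistic_distribution(p_hat, beta, v):
    """Greedy solution of min <p, v> over distributions with ||p - p_hat||_1 <= beta.

    Moves up to beta/2 probability mass onto the lowest-value successor,
    taking it away from the highest-value successors first.
    """
    p = p_hat.astype(float).copy()
    best = int(np.argmin(v))
    p[best] = min(1.0, p_hat[best] + beta / 2.0)
    for s in np.argsort(v)[::-1]:          # highest value first
        if p.sum() <= 1.0:
            break
        p[s] = max(0.0, p[s] - (p.sum() - 1.0))
    return p

p_hat = np.array([0.25, 0.25, 0.25, 0.25])
v = np.array([0.0, 1.0, 2.0, 3.0])
p = optimistic_distribution(p_hat, beta=0.2, v=v)
print(p)                   # mass shifted from the worst successor onto the best one
print(p @ v < p_hat @ v)   # the optimistic expected value is lower: True
```

The same routine, applied with the policy-side radius $\beta^k_{\mathcal{D}}$ and the bracketed per-action values of Eq. 12, handles the minimization over $p_{d_t}$.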
Algorithm 1 summarizes the whole procedure, which we refer to as UCRL2-MC. Within the algorithm, the function GetOptimal(·) finds the optimal policy $\pi^k$ using dynamic programming, as described above, and UpdateDistribution(·) computes Eqs. 8 and 9. Moreover, it is important to notice that, in lines 8–10, the switching policy $\pi^k$ is actually deployed, the true agents take actions in the true environment and, as a result, action and state transition data from the true agents and the true environment are gathered.

Next, the following theorem shows that the sequence of policies $\{\pi^k\}_{k=1}^{K}$ found by Algorithm 1 achieves a total regret that is sublinear with respect to the number of steps, as defined in Eq. 6 (proven in Appendix A):

Theorem 2. Assume we use Algorithm 1 to find the switching policies $\pi^k$. Then, with probability at least $1 - \delta$, it holds that
$$R(T) \leq \rho_1 L \sqrt{|\mathcal{A}| |\mathcal{S}| |\mathcal{D}| \, T \log\frac{|\mathcal{S}| |\mathcal{D}| T}{\delta}} + \rho_2 L |\mathcal{S}| \sqrt{|\mathcal{A}| \, T \log\frac{|\mathcal{S}| |\mathcal{A}| T}{\delta}}, \qquad (13)$$
where $\rho_1, \rho_2 > 0$ are constants.

The above regret bound suggests that our algorithm may achieve higher regret than standard UCRL2 (Jaksch et al., 2010), one of the most popular problem-agnostic RL algorithms. More specifically, one can readily show that, if we use UCRL2 to find the switching policies $\pi^k$ (refer to Appendix C), then, with probability at least $1 - \delta$, it holds that
$$R(T) \leq \rho L |\mathcal{S}| \sqrt{|\mathcal{D}| \, T \log\frac{|\mathcal{S}| |\mathcal{D}| T}{\delta}}, \qquad (14)$$
where $\rho$ is a constant. Then, if we omit constant and logarithmic factors and assume the size of the team of agents is smaller than the size of the state space, i.e., $|\mathcal{D}| < |\mathcal{S}|$, we have that, for UCRL2, the regret bound is $\tilde{O}(L |\mathcal{S}| \sqrt{|\mathcal{D}| T})$ while, for UCRL2-MC, it is $\tilde{O}(L |\mathcal{S}| \sqrt{|\mathcal{A}| T})$. That being said, in practice, we have found that our algorithm achieves comparable regret with respect to UCRL2, as shown in Figure 4.
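Returning to the procedure itself, the deployment loop of Algorithm 1 gathers exactly one agent-layer and one environment-layer observation per step, which is what the separate counts $N_k$ and $N'_k$ record. A sketch with hypothetical tabular stand-ins:

```python
import numpy as np

def run_episode(pi_k, policies, env, s1, d0, N_sda, N_sas, rng):
    """Deploy pi_k for one episode, updating both layers' counts in place.

    pi_k: list of L arrays of shape (S, D), one deterministic map per step.
    """
    s, d_prev = s1, d0
    for t in range(len(pi_k)):
        d = pi_k[t][s, d_prev]                          # d_t = pi^k_t(s_t, d_{t-1})
        a = rng.choice(env.shape[1], p=policies[d][s])  # a_t ~ p_{d_t}(. | s_t)
        s_next = rng.choice(env.shape[0], p=env[s, a])  # s_{t+1} ~ P(. | s_t, a_t)
        N_sda[s, d, a] += 1                             # agent-layer count N_k
        N_sas[s, a, s_next] += 1                        # environment count N'_k
        s, d_prev = s_next, d

S, A, D, L = 3, 2, 2, 5
rng = np.random.default_rng(2)
policies = [rng.dirichlet(np.ones(A), size=S) for _ in range(D)]
env = rng.dirichlet(np.ones(S), size=(S, A))
pi_k = [rng.integers(D, size=(S, D)) for _ in range(L)]
N_sda, N_sas = np.zeros((S, D, A)), np.zeros((S, A, S))

run_episode(pi_k, policies, env, 0, 0, N_sda, N_sas, rng)
print(int(N_sda.sum()), int(N_sas.sum()))  # one observation per step in each layer
```

After each episode, these counts feed the empirical distributions and confidence radii of the previous section.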
In addition, after applying our algorithm on a specific team of agents and environment, we can reuse the confidence intervals over the transition probability $p(\cdot \,|\, s, a)$ we have learned to find the optimal switching policy for a different team of agents operating in a similar environment. In contrast, after applying UCRL2, we would only have a confidence interval over the conditional probability defined by Eq. 3, which would be of little use to find the optimal switching policy for a different team of agents.

In the following section, we will build upon this insight by considering several independent teams of agents operating in similar environments. We will demonstrate that, whenever we aim to find multiple sequences of switching policies for these independent teams, a straightforward variation of UCRL2-MC greatly benefits from maintaining shared confidence bounds for the transition probabilities of the environments and enjoys a better regret bound than UCRL2.

Figure 2: Three examples of environment realizations with different initial traffic level $\gamma_0$: (a) $\gamma_0$ = no-car, (b) $\gamma_0$ = light, (c) $\gamma_0$ = heavy.

Remarks. For ease of exposition, we have assumed that both the machine and human agents follow arbitrary Markov policies that do not change due to switching. However, our theoretical results still hold if we lift this assumption: we just need to define the agents' policies as $p_d(a_t \,|\, s_t, d_t, d_{t-1})$ and construct separate confidence sets based on the switch values.

5 Learning to Switch Across Multiple Teams of Agents

In this section, rather than finding a sequence of switching policies for a single team of agents, we aim to find multiple sequences of switching policies across several independent teams operating in similar environments.
We will analyze our algorithm in scenarios where it can maintain shared confidence bounds for the transition probabilities of the environments across these independent teams. For instance, when the learning algorithm is deployed in centralized settings, it is possible to collect data across independent teams to maintain shared confidence intervals on the common parameters (i.e., the environment's transition probabilities in our problem setting). This setting fits a variety of real applications; most prominently, think of a car manufacturer continuously collecting driving data from millions of human drivers, wishing to learn a different switching policy for each driver to implement a personalized semi-autonomous driving system.

Similarly as in the previous section, we look at the problem from the perspective of episodic learning and proceed as follows. Given $N$ independent teams of agents $\{\mathcal{D}_i\}_{i=1}^{N}$, we consider $K$ independent subsequent episodes of length $L$ per team and denote the aggregate length of all of these episodes as $T = KL$. For each team of agents $\mathcal{D}_i$, every episode corresponds to a realization of a finite-horizon 2-layer Markov decision process with state spaces $\mathcal{S} \times \mathcal{A}$ and $\mathcal{S} \times \mathcal{D}_i$, set of actions $\mathcal{D}_i$, true agent policies $P^*_{\mathcal{D}_i}$, true environment transition probability $P^*$, and immediate costs $C_{\mathcal{D}_i}$ and $C_e$. Here, note that all the teams operate in a similar environment, i.e., $P^*$ is shared across teams, and, without loss of generality, they share the same costs. Then, our goal is to find the switching policies $\pi^k_i$ with desirable properties in terms of the total regret $R(T, N)$, which is given by
$$R(T, N) = \sum_{i=1}^{N} \sum_{k=1}^{K} \left[ \mathbb{E}_{\tau \sim \pi^k_i, P^*_{\mathcal{D}_i}, P^*}\left[ c(\tau \,|\, s_1, d_0) \right] - \mathbb{E}_{\tau \sim \pi^*_i, P^*_{\mathcal{D}_i}, P^*}\left[ c(\tau \,|\, s_1, d_0) \right] \right], \qquad (15)$$
where $\pi^*_i$ is the optimal switching policy for team $i$, under the true agent policies and environment transition probability.
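The bookkeeping behind the shared confidence set can be sketched as pooled environment counts kept next to per-team policy counts (the class and method names below are our own illustration):

```python
import numpy as np

class MultiTeamCounts:
    """Per-team agent-policy counts with pooled environment counts (Section 5)."""

    def __init__(self, n_teams, S, D, A):
        self.policy = [np.zeros((S, D, A)) for _ in range(n_teams)]  # one table per team
        self.env = np.zeros((S, A, S))                               # shared across teams

    def record(self, team, s, d, a, s_next):
        self.policy[team][s, d, a] += 1
        self.env[s, a, s_next] += 1

counts = MultiTeamCounts(n_teams=3, S=2, D=2, A=2)
counts.record(team=0, s=0, d=1, a=0, s_next=1)
counts.record(team=2, s=0, d=0, a=0, s_next=1)

# The shared environment estimate sees both teams' transitions,
# while each team's policy counts remain personalized.
print(counts.env[0, 0, 1], counts.policy[0].sum(), counts.policy[2].sum())
```

Because the pooled $N'_k$ grows $N$ times faster, the shared radius $\beta^k(s, a, \delta)$ shrinks faster than any single team could achieve on its own, which is the source of the improved bound below.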
To achieve our goal, we just run $N$ instances of UCRL2-MC (Algorithm 1), each with a different confidence set $\mathcal{P}^k_{\mathcal{D}_i}(\delta)$ for the agents' policies, similarly as in the case of a single team of agents, but with a shared confidence set $\mathcal{P}^k(\delta)$ for the environment transition probability. Then, we have the following key corollary, which readily follows from Theorem 2:

Corollary 3. Assume we use $N$ instances of Algorithm 1 to find the switching policies $\pi^k_i$ using a shared confidence set for the environment transition probability. Then, with probability at least $1 - \delta$, it holds that
$$R(T, N) \leq \rho_1 N L \sqrt{|\mathcal{A}| |\mathcal{S}| |\mathcal{D}| \, T \log\frac{|\mathcal{S}| |\mathcal{D}| T}{\delta}} + \rho_2 L |\mathcal{S}| \sqrt{|\mathcal{A}| \, N T \log\frac{|\mathcal{S}| |\mathcal{A}| T}{\delta}}, \qquad (16)$$
where $\rho_1, \rho_2 > 0$ are constants.

The above result suggests that our algorithm may achieve lower regret than UCRL2 in a scenario with multiple teams of agents operating in similar environments. This is because, under UCRL2, the confidence sets for the conditional probability defined by Eq. 3 cannot be shared across teams.

Figure 3: Trajectories induced by the switching policies found by Algorithm 1 at episodes $k \approx 5$, $k \approx 500$ and $k \approx 3000$, for (a) $c_x = 0.1$, $c_c(H) = 0.2$ and (b) $c_x = 0$, $c_c(H) = 0$. The blue and orange segments indicate machine and human control, respectively. In both panels, we train Algorithm 1 within the same sequence of episodes, where the initial traffic level of each episode is sampled uniformly from {no-car, light, heavy}, and show three episodes with different initial traffic levels. The results indicate that, in the later episodes, the algorithm has learned to switch to the human driver in heavier traffic levels.
More specifically, if we use N instances of UCRL2 to find the switching policies π_i^k, then, with probability at least 1 − δ, it holds that

$$R(T, N) \leq \rho N L |\mathcal{S}| \sqrt{|\mathcal{D}|\, T \log \frac{|\mathcal{S}||\mathcal{D}| T}{\delta}},$$

where ρ is a constant. Then, if we omit constant and logarithmic factors and assume |D_i| < |S| for all i ∈ [N], we have that, for UCRL2, the regret bound is Õ(N L |S| √(|D| T)) while, for UCRL2-MC, it is Õ(L |S| √(|A| T N) + N L √(|A||S||D| T)). Importantly, in practice, we have found that UCRL2-MC does achieve significantly lower regret than UCRL2, as shown in Figure 5.

6 Experiments

6.1 Obstacle avoidance

We perform a variety of simulations in obstacle avoidance, where teams of agents (drivers) consist of one human agent (H) and one machine agent (M), i.e., D = {H, M}. We consider a lane-driving environment with three lanes and infinitely many rows, where the type of each individual cell (i.e., road, car, stone or grass) in row r is sampled independently at random with a probability that depends on the traffic level γ_r, which can take three discrete values, γ_r ∈ {no-car, light, heavy}. The traffic level of each row γ_{r+1} is sampled at random with a probability that depends on the traffic level of the previous row γ_r. The probability of each cell type given the traffic level, as well as the conditional distribution of traffic levels, can be found in Appendix D.

At any given time t, we assume that whoever is in control, be it the machine or the human, can take three different actions A = {left, straight, right}. Action left steers the car to the left of the current lane,
action right steers it to the right, and action straight leaves the car in the current lane. If the car is already on the leftmost (rightmost) lane when taking action left (right), then the lane remains unchanged. Irrespective of the action taken, the car always moves forward. The goal of the cyberphysical system is to drive the car from an initial state at time t = 1 until the end of the episode t = L with the minimum total cost. In our experiments, we set L = 10. Figure 2 shows three examples of environment realizations.

Figure 4: Total regret of the trajectories induced by the switching policies found by Algorithm 1 and those induced by a variant of UCRL2, in comparison with the trajectories induced by a machine driver and a human driver, in a setting with a single team of agents, for (a) c_c(H) = 0.2, c_x = 0.1 and (b) c_c(H) = 0, c_x = 0. In all panels, we run K = 20,000 episodes. For Algorithm 1 and the variant of UCRL2, the regret is sublinear with respect to the number of time steps whereas, for the machine and the human drivers, the regret is linear.

State space. To evaluate the switching policies found by Algorithm 1, we experiment with a sensor-based state space, where the state values are the type of the current cell and of the three cells the car can move into in the next time step, as well as the current traffic level; we assume the agents, be it a human or a machine, can measure the traffic level. For example, assume that at time t the traffic is light, the car is on a road cell and, if it moves forward left, it hits a stone, if it moves forward straight, it hits a car, and, if it moves forward right, it drives over grass; then its state value is s_t = (light, road, stone, car, grass). Moreover, if the car is on the leftmost (rightmost) lane, then we set the value of the third (fifth) dimension of s_t to ∅.
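The lane dynamics and sensor-based state described above are easy to pin down in code; a minimal sketch (the cell grid is given explicitly here, whereas in the paper cell types are sampled per traffic level via the tables in Appendix D; names are ours):

```python
ACTIONS = ("left", "straight", "right")
NUM_LANES = 3

def next_lane(lane, action):
    """Lane update: left/right shifts are clamped to the road; the car
    always moves one row forward regardless of the action taken."""
    if action == "left":
        return max(0, lane - 1)
    if action == "right":
        return min(NUM_LANES - 1, lane + 1)
    return lane  # "straight"

def sensor_state(traffic, grid, row, lane):
    """State s_t = (traffic level, current cell, three reachable cells).

    Off-road neighbours (beyond the leftmost/rightmost lane) are encoded
    as None, playing the role of the paper's empty-set symbol.
    """
    ahead = []
    for shift in (-1, 0, 1):
        target = lane + shift
        ahead.append(grid[row + 1][target] if 0 <= target < NUM_LANES else None)
    return (traffic, grid[row][lane], *ahead)

grid = [["road", "road", "grass"],
        ["stone", "car", "grass"]]
s = sensor_state("light", grid, row=0, lane=0)
# → ("light", "road", None, "stone", "car")
```

With 3 traffic levels and 4 cell types (plus the off-road symbol), this tuple encoding yields the roughly 3 × 5^4 states quoted in the text.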
Therefore, under this state representation, the resulting MDP has ∼ 3 × 5^4 states.

Cost and human/machine policies. We consider a state-dependent environment cost c_e(s_t, a_t) = c_e(s_t) that depends on the type of the cell the car is on at state s_t, i.e., c_e(s_t) = 0 if the type of the current cell is road, c_e(s_t) = 2 if it is grass, c_e(s_t) = 4 if it is stone, and c_e(s_t) = 10 if it is car. Moreover, in all simulations, we use a machine policy that has been trained using a standard RL algorithm on environment realizations with γ_0 = no-car. In other words, the machine policy is trained to perform well under a low traffic level. In addition, we consider that all the humans pick which action to take (left, straight or right) according to a noisy estimate of the environment cost of the three cells that the car can move into in the next time step. More specifically, each human model H computes a noisy estimate of the cost ĉ_e(s) = c_e(s) + ε_s of each of the three cells the car can move into, where ε_s ∼ N(0, σ_H), and picks the action that moves the car to the cell with the lowest noisy estimate.⁶ As a result, human drivers are generally more reliable than the machine under high traffic levels; however, the machine is more reliable than humans under a low traffic level, where its policy is near-optimal (see Appendix E for a comparison of human and machine performance). Finally, we consider that only the car driven by our system moves in the environment.

6.1.1 Results

First, we focus on a single team of one machine M and one human model H, with σ_H = 2, and use Algorithm 1 to find a sequence of switching policies with sublinear regret. At the beginning of each episode, the initial traffic level γ_0 is sampled uniformly at random.

⁶ Note that, in our theoretical results, we make no assumption other than the Markov property regarding the human policy.
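The human model described above reduces to an argmin over noisily-estimated cell costs; a minimal sketch (cell costs from the text; the function name and environment plumbing are ours):

```python
import random

# Environment cost c_e(s) by cell type, as in the text.
CELL_COST = {"road": 0, "grass": 2, "stone": 4, "car": 10}
ACTIONS = ("left", "straight", "right")

def human_action(cells_ahead, sigma_h, rng=random):
    """Pick the action leading to the cell with the lowest noisy cost.

    cells_ahead: cell types reachable via (left, straight, right).
    sigma_h: std of the Gaussian noise eps_s added to each true cost c_e(s).
    """
    noisy = [CELL_COST[c] + rng.gauss(0.0, sigma_h) for c in cells_ahead]
    return ACTIONS[min(range(3), key=noisy.__getitem__)]

# With sigma_h = 0 the human is a perfect greedy driver...
assert human_action(("stone", "car", "grass"), sigma_h=0.0) == "right"
# ...while with large sigma_h the choice degrades toward uniform.
```

This makes the trade-off in the text concrete: a larger σ_H hurts the human everywhere, but in heavy traffic even a noisy cost estimate beats a machine policy trained only on no-car realizations.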
Figure 5: Total regret of the trajectories induced by the switching policies found by N instances of Algorithm 1 and those induced by N instances of a variant of UCRL2 in a setting with N teams of agents, for (a) c_c(H) = 0.2, c_x = 0.1 and (b) c_c(H) = 0, c_x = 0. In both panels, each instance of Algorithm 1 shares the same confidence set for the environment transition probabilities and we run K = 5,000 episodes. The sequences of policies found by Algorithm 1 outperform those found by the variant of UCRL2 in terms of total regret, in agreement with Corollary 3.

We look at the trajectories induced by the switching policies found by our algorithm across different episodes for different values of the switching cost c_x and the cost of human control c_c(H)⁷. Figure 3 summarizes the results, which show that, in the later episodes, the algorithm has learned to rely on the machine (blue segments) whenever the traffic level is low and to switch to the human driver when the traffic level increases. Moreover, whenever the amount of human control and the number of switches are not penalized (i.e., c_x = c_c(H) = 0), the algorithm switches to the human more frequently whenever the traffic level is high, to reduce the environment cost. See Appendix F for a comparison of human control rates in environments with different initial traffic levels.

In addition, we compare the performance achieved by Algorithm 1 with three baselines: (i) a variant of UCRL2 (Jaksch et al., 2010) adapted to our finite-horizon setting (see Appendix C), (ii) a human agent, and (iii) a machine agent. As a measure of performance, we use the total regret, as defined in Eq. 6.
Figure 4 summarizes the results for two different values of the switching cost c_x and the cost of human control c_c(H). The results show that both our algorithm and UCRL2 achieve sublinear regret with respect to the number of time steps, and that their performance is comparable, in agreement with Theorem 2. In contrast, whenever the human or the machine drives on their own, they suffer linear regret, due to a lack of exploration.

Next, we consider N = 10 independent teams of agents, {D_i}_{i=1}^N, operating in a similar lane-driving environment. Each team D_i is composed of a different human model H_i, with σ_{H_i} sampled uniformly from (0, 4), and the same machine driver M. Then, to find a sequence of switching policies for each of the teams, we run N instances of Algorithm 1 with a shared confidence set for the environment transition probabilities. We compare the performance of our algorithm against the same variant of UCRL2 used in the experiments with a single team of agents, in terms of the total regret defined in Eq. 15. Here, note that the variant of UCRL2 does not maintain a shared confidence set for the environment transition probabilities across teams but instead creates a confidence set for the conditional probability defined by Eq. 3 for each team. Figure 5 summarizes the results for different values of the switching cost c_x and the cost of human control c_c(H), which show that, in agreement with Corollary 3, our method outperforms UCRL2 significantly.

6.2 RiverSwim

In addition to the obstacle avoidance task, we consider the standard task of RiverSwim (Strehl & Littman, 2008). The MDP states and transition probabilities are shown in Figure 6. The cost of taking an action in states s_2 to s_5 equals 1, while it equals 0.995 and 0 in states s_1 and s_6, respectively. Each episode ends after L = 20 steps.

⁷ Here, we assume the cost of machine control c_c(M) = 0.
Figure 6: RiverSwim. Continuous (dashed) arrows show the transitions after taking action right (left). The optimal policy is to always take action right.

Figure 7: (a) Ratio of the regret of UCRL2-MC to that of UCRL2 for different numbers of teams. (b) Total regret of the trajectories induced by the switching policies found by UCRL2-MC and those induced by UCRL2 in a setting with N = 100 teams of agents.

We set the switching cost and the cost of agent control to zero for all the simulations in this section, i.e., c_x(·, ·) = c_c(·) = 0. The set D consists of agents that choose action right with some probability p, which may differ across agents. In the following, we investigate the effect of increasing the number of teams on the regret bound in the multiple-teams setting. See Appendix G for additional simulations studying the impact of the action-space size and the number of agents in each team on the total regret.

6.2.1 Results

We consider N independent teams of agents, each consisting of two agents that choose action right with probability p and 1 − p, respectively, where p is chosen uniformly at random for each team. We run the simulations for N ∈ {3, 4, ..., 10} teams of agents. For each N, we run both UCRL2-MC and UCRL2 for 20,000 episodes and repeat each experiment 5 times. Figure 7(a) summarizes our results, showing the advantage of the shared confidence bounds on the environment transition probabilities in our algorithm over its problem-agnostic counterpart. To better illustrate the performance of UCRL2-MC, we also run an experiment with N = 100 teams of agents for 10,000 episodes and compare the total regret of our algorithm to UCRL2.
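The agent pool used here is deliberately simple: each agent is a Bernoulli policy over {left, right}. A minimal sketch of building one such team (function and variable names are ours, not from the paper):

```python
import random

def make_agent(p_right, rng=random):
    """An agent that picks action 'right' with probability p_right."""
    def policy(state):  # RiverSwim agents here ignore the state
        return "right" if rng.random() < p_right else "left"
    return policy

def make_team(rng=random):
    """Two complementary agents choosing 'right' with probabilities p and
    1 - p, where p is drawn uniformly at random, as in the experiments."""
    p = rng.random()
    return p, (make_agent(p, rng), make_agent(1.0 - p, rng))

rng = random.Random(0)
p, team = make_team(rng)
rate = sum(1 for _ in range(20000) if team[0](None) == "right") / 20000
# The empirical rate of 'right' for the first agent concentrates around p.
```

Since the optimal RiverSwim policy always takes action right, the switching problem amounts to learning which of the two agents has the larger p, which is what makes the effect of shared environment confidence bounds easy to isolate.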
Figure 7(b) shows that our algorithm significantly outperforms UCRL2.

7 Conclusions and Future Work

We have formally defined the problem of learning to switch control among agents in a team via a 2-layer Markov decision process and then developed UCRL2-MC, an online learning algorithm with desirable provable guarantees. Moreover, we have performed a variety of simulation experiments on the standard RiverSwim task and on obstacle avoidance to illustrate our theoretical results and to demonstrate that, by exploiting the specific structure of the problem, our proposed algorithm is superior to problem-agnostic algorithms.

Our work opens up many interesting avenues for future work. For example, we have assumed that the agents' policies are fixed. However, there are reasons to believe that simultaneously optimizing the agents' policies and the switching policy may lead to superior performance (De et al., 2020; 2021; Wilder et al., 2020; Wu et al., 2020). In our work, we have assumed that the state space is discrete and the horizon is finite. It would be very interesting to lift these assumptions and develop approximate value iteration methods to solve the learning-to-switch problem. Finally, it would be interesting to evaluate our algorithm using real human agents in a variety of tasks.

Acknowledgments. Gomez-Rodriguez acknowledges support from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 945719).

References

Samuel Barrett and Peter Stone. An analysis framework for ad hoc teamwork tasks. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, Volume 1, pp. 357–364, 2012.

P. Bartlett and M. Wegkamp. Classification with a reject option using a hinge loss. JMLR, 2008.

K. Brookhuis, D.
De Waard, and W. Janssen. Behavioural impacts of advanced driver assistance systems: an overview. European Journal of Transport and Infrastructure Research, 1(3), 2001.

Daniel S. Brown and Scott Niekum. Machine teaching for inverse reinforcement learning: Algorithms and applications. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 7749–7758, 2019.

C. Cortes, G. DeSalvo, and M. Mohri. Learning with rejection. In ALT, 2016.

Mary Czerwinski, Edward Cutrell, and Eric Horvitz. Instant messaging and interruption: Influence of task type on performance. In OZCHI 2000 Conference Proceedings, volume 356, pp. 361–367, 2000.

Nathaniel D. Daw and Peter Dayan. The algorithmic anatomy of model-based evaluation. Philosophical Transactions of the Royal Society B: Biological Sciences, 369(1655):20130478, 2014.

A. De, P. Koley, N. Ganguly, and M. Gomez-Rodriguez. Regression under human assistance. In AAAI, 2020.

Abir De, Nastaran Okati, Ali Zarezade, and Manuel Gomez-Rodriguez. Classification under human assistance. In AAAI, 2021.

European Parliament. Regulation (EC) No 561/2006. http://data.europa.eu/eli/reg/2006/561/2015-03-02, 2006.

R. Everett and S. Roberts. Learning against non-stationary agents with opponent modelling and deep reinforcement learning. In 2018 AAAI Spring Symposium Series, 2018.

Y. Geifman and R. El-Yaniv. SelectiveNet: A deep neural network with an integrated reject option. arXiv preprint arXiv:1901.09192, 2019.

Y. Geifman, G. Uziel, and R. El-Yaniv. Bias-reduced uncertainty estimation for deep neural classifiers. In ICLR, 2018.

A. Ghosh, S. Tschiatschek, H. Mahdavi, and A. Singla. Towards deployment of robust cooperative AI agents: An algorithmic framework for learning adaptive policies. In AAMAS, 2020.

Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes.
In Conference on Learning Theory, pp. 861–898, 2015.

A. Grover, M. Al-Shedivat, J. Gupta, Y. Burda, and H. Edwards. Learning policy representations in multiagent systems. In ICML, 2018.

D. Hadfield-Menell, S. Russell, P. Abbeel, and A. Dragan. Cooperative inverse reinforcement learning. In NIPS, 2016.

L. Haug, S. Tschiatschek, and A. Singla. Teaching inverse reinforcement learners via features and demonstrations. In NeurIPS, 2018.

Eric Horvitz and Johnson Apacible. Learning and reasoning about interruption. In Proceedings of the 5th International Conference on Multimodal Interfaces, pp. 20–27, 2003.

Shamsi T. Iqbal and Brian P. Bailey. Understanding and developing models for detecting and differentiating breakpoints during interactive tasks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 697–706, 2007.

Alexis Jacq, Johan Ferret, Olivier Pietquin, and Matthieu Geist. Lazy-MDPs: Towards interpretable reinforcement learning by learning when to act. In AAMAS, 2022.

T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 2010.

Christian P. Janssen, Shamsi T. Iqbal, Andrew L. Kun, and Stella F. Donker. Interrupted by my car? Implications of interruption and interleaving research for automated vehicles. International Journal of Human-Computer Studies, 130:221–233, 2019.

Parameswaran Kamalaruban, Rati Devidze, Volkan Cevher, and Adish Singla. Interactive teaching algorithms for inverse reinforcement learning. In IJCAI, 2019.

Kyle Kotowick and Julie Shah. Modality switching for mitigation of sensory adaptation and habituation in personal navigation systems. In 23rd International Conference on Intelligent User Interfaces, pp. 115–127, 2018.

Z. Liu, Z. Wang, P. Liang, R. Salakhutdinov, L. Morency, and M.
Ueda. Deep gamblers: Learning to abstain with portfolio theory. In NeurIPS, 2019.

C. Macadam. Understanding and modeling the human driver. Vehicle System Dynamics, 40(1-3):101–134, 2003.

O. Macindoe, L. Kaelbling, and T. Lozano-Pérez. POMCoP: Belief space planning for sidekicks in cooperative games. In AIIDE, 2012.

Catharine L. R. McGhan, Ali Nasir, and Ella M. Atkins. Human intent prediction using Markov decision processes. Journal of Aerospace Information Systems, 12(5):393–397, 2015.

V. Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Salama A. Mostafa, Mohd Sharifuddin Ahmad, and Aida Mustapha. Adjustable autonomy: a systematic literature review. Artificial Intelligence Review, 51(2):149–186, 2019.

Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. In ICML, 2020.

S. Nikolaidis, R. Ramakrishnan, K. Gu, and J. Shah. Efficient model learning from joint-action demonstrations for human-robot collaborative tasks. In HRI, 2015.

S. Nikolaidis, J. Forlizzi, D. Hsu, J. Shah, and S. Srinivasa. Mathematical models of adaptation in human-robot collaboration. arXiv preprint arXiv:1707.02586, 2017.

Ian Osband and Benjamin Van Roy. Near-optimal reinforcement learning in factored MDPs. In Advances in Neural Information Processing Systems, pp. 604–612, 2014.

Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003–3011, 2013.

Goran Radanovic, Rati Devidze, David C. Parkes, and Adish Singla. Learning to collaborate in Markov decision processes. In ICML, 2019.

M. Raghu, K. Blumer, G. Corrado, J. Kleinberg, Z. Obermeyer, and S. Mullainathan. The algorithmic automation problem: Prediction, triage, and human effort. arXiv preprint arXiv:1903.12220, 2019a.
M. Raghu, K. Blumer, R. Sayres, Z. Obermeyer, B. Kleinberg, S. Mullainathan, and J. Kleinberg. Direct uncertainty prediction for medical second opinions. In ICML, 2019b.

H. Ramaswamy, A. Tewari, and S. Agarwal. Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics, 2018.

Siddharth Reddy, Anca D. Dragan, and Sergey Levine. Shared autonomy via deep reinforcement learning. arXiv preprint arXiv:1802.01744, 2018.

Shubhanshu Shekhar, Mohammad Ghavamzadeh, and Tara Javidi. Active learning for classification with abstention. IEEE Journal on Selected Areas in Information Theory, 2(2):705–719, 2021.

D. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

D. Silver et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

Peter Stone, Gal A. Kaminka, Sarit Kraus, and Jeffrey S. Rosenschein. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

A. Strehl and M. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

DJ Strouse, Kevin McKee, Matt Botvinick, Edward Hughes, and Richard Everett. Collaborating with humans without human data. In Advances in Neural Information Processing Systems, volume 34, 2021.

Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

Matthew E. Taylor, Halit Bener Suay, and Sonia Chernova. Integrating reinforcement learning with human demonstrations of varying ability.
In The 10th International Conference on Autonomous Agents and Multiagent Systems, Volume 2, pp. 617–624. International Foundation for Autonomous Agents and Multiagent Systems, 2011.

S. Thulasidasan, T. Bhattacharya, J. Bilmes, G. Chennupati, and J. Mohd-Yusof. Combating label noise in deep learning using abstention. arXiv preprint arXiv:1905.10964, 2019.

Lisa Torrey and Matthew Taylor. Teaching on a budget: Agents advising agents in reinforcement learning. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pp. 1053–1060, 2013.

James T. Townsend, Kam M. Silva, Jesse Spencer-Smith, and Michael J. Wenger. Exploring the relations between categorization and decision making with regard to realistic face stimuli. Pragmatics & Cognition, 8(1):83–105, 2000.

S. Tschiatschek, A. Ghosh, L. Haug, R. Devidze, and A. Singla. Learner-aware teaching: Inverse reinforcement learning with preferences and constraints. In NeurIPS, 2019.

O. Vinyals et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, pp. 1–5, 2019.

Thomas J. Walsh, Daniel K. Hewlett, and Clayton T. Morrison. Blending autonomous exploration and apprenticeship learning. In Advances in Neural Information Processing Systems, pp. 2258–2266, 2011.

Bryan Wilder, Eric Horvitz, and Ece Kamar. Learning to complement humans. In IJCAI, 2020.

H. Wilson and P. Daugherty. Collaborative intelligence: humans and AI are joining forces. Harvard Business Review, 2018.

Bohan Wu, Jayesh K. Gupta, and Mykel Kochenderfer. Model primitives for hierarchical lifelong reinforcement learning. Autonomous Agents and Multi-Agent Systems, 34(1):1–38, 2020.

Y. Zheng, Z. Meng, J. Hao, Z. Zhang, T. Yang, and C. Fan. A deep Bayesian policy reuse approach against non-stationary agents. In NeurIPS, 2018.
A Proofs

A.1 Proof of Theorem 1

We first define $\mathcal{P}^k_{\mathcal{D}\,|\,\cdot,t^+} := \times_{s\in\mathcal{S},\, d\in\mathcal{D},\, t'\in\{t,\dots,L\}}\ \mathcal{P}^k_{\cdot\,|\,d,s,t'}$, $\mathcal{P}^k_{\cdot\,|\,\cdot,t^+} := \times_{s\in\mathcal{S},\, a\in\mathcal{A},\, t'\in\{t,\dots,L\}}\ \mathcal{P}^k_{\cdot\,|\,s,a,t'}$ and $\pi_{t^+} = \{\pi_t,\dots,\pi_L\}$. Next, we derive a lower bound on the optimistic value function $v^k_t(s,d)$ as follows:

$$\begin{aligned}
v^k_t(s,d) &= \min_{\pi}\,\min_{P_\mathcal{D}\in\mathcal{P}^k_\mathcal{D}}\,\min_{P\in\mathcal{P}^k} V^\pi_{t\,|\,P_\mathcal{D},P}(s,d)
= \min_{\pi_{t^+}}\,\min_{P_\mathcal{D}\in\mathcal{P}^k_\mathcal{D}}\,\min_{P\in\mathcal{P}^k} V^\pi_{t\,|\,P_\mathcal{D},P}(s,d)\\
&\overset{(i)}{=} \min_{\pi_t(s,d)}\ \min_{p_{\pi_t(s,d)}(\cdot\,|\,s,t)\,\in\,\mathcal{P}^k_{\cdot\,|\,\pi_t(s,d),s,t}}\ \min_{p(\cdot\,|\,s,\cdot,t)\,\in\,\mathcal{P}^k_{\cdot\,|\,s,\cdot,t}}\ \min_{\substack{\pi_{(t+1)^+},\ P_\mathcal{D}\in\mathcal{P}^k_{\mathcal{D}\,|\,\cdot,(t+1)^+},\ P\in\mathcal{P}^k_{\cdot\,|\,\cdot,(t+1)^+}}}\\
&\qquad \Big[c_{\pi_t(s,d)}(s,d) + \mathbb{E}_{a\sim p_{\pi_t(s,d)}(\cdot\,|\,s,t)}\big(c_e(s,a) + \mathbb{E}_{s'\sim p(\cdot\,|\,s,a,t)}\, V^\pi_{t+1\,|\,P_\mathcal{D},P}(s',\pi_t(s,d))\big)\Big]\\
&\overset{(ii)}{\geq} \min_{\pi_t(s,d)}\ \min_{p_{\pi_t(s,d)}(\cdot\,|\,s,t)\,\in\,\mathcal{P}^k_{\cdot\,|\,\pi_t(s,d),s,t}}\ \min_{p(\cdot\,|\,s,\cdot,t)\,\in\,\mathcal{P}^k_{\cdot\,|\,s,\cdot,t}}
\Big[c_{\pi_t(s,d)}(s,d)\\
&\qquad + \mathbb{E}_{a\sim p_{\pi_t(s,d)}(\cdot\,|\,s,t)}\Big(c_e(s,a) + \mathbb{E}_{s'\sim p(\cdot\,|\,s,a,t)}\Big[\min_{\substack{\pi_{(t+1)^+},\ P_\mathcal{D}\in\mathcal{P}^k_{\mathcal{D}\,|\,\cdot,(t+1)^+},\ P\in\mathcal{P}^k_{\cdot\,|\,\cdot,(t+1)^+}}} V^\pi_{t+1\,|\,P_\mathcal{D},P}(s',\pi_t(s,d))\Big]\Big)\Big]\\
&= \min_{d_t}\Big[c_{d_t}(s,d) + \min_{p_{d_t}(\cdot\,|\,s,t)\,\in\,\mathcal{P}^k_{\cdot\,|\,d_t,s,t}} \sum_{a\in\mathcal{A}} p_{d_t}(a\,|\,s,t)\Big(c_e(s,a) + \min_{p(\cdot\,|\,s,a,t)\,\in\,\mathcal{P}^k_{\cdot\,|\,s,a,t}} \mathbb{E}_{s'\sim p(\cdot\,|\,s,a,t)}\, v^k_{t+1}(s',d_t)\Big)\Big],
\end{aligned}$$

where (i) follows from Lemma 8 and (ii) follows from the fact that $\min_a \mathbb{E}[X(a)] \geq \mathbb{E}[\min_a X(a)]$. Next, we provide an upper bound on the optimistic value function $v^k_t(s,d)$ as follows:

$$\begin{aligned}
v^k_t(s,d) &= \min_{\pi}\,\min_{P_\mathcal{D}\in\mathcal{P}^k_\mathcal{D}}\,\min_{P\in\mathcal{P}^k} V^\pi_{t\,|\,P_\mathcal{D},P}(s,d)\\
&\overset{(i)}{=} \min_{\pi_t(s,d)}\ \min_{p_{\pi_t(s,d)}(\cdot\,|\,s,t)\,\in\,\mathcal{P}^k_{\cdot\,|\,\pi_t(s,d),s,t}}\ \min_{p(\cdot\,|\,s,\cdot,t)\,\in\,\mathcal{P}^k_{\cdot\,|\,s,\cdot,t}}\ \min_{\substack{\pi_{(t+1)^+},\ P_\mathcal{D}\in\mathcal{P}^k_{\mathcal{D}\,|\,\cdot,(t+1)^+},\ P\in\mathcal{P}^k_{\cdot\,|\,\cdot,(t+1)^+}}}\\
&\qquad \Big[c_{\pi_t(s,d)}(s,d) + \mathbb{E}_{a\sim p_{\pi_t(s,d)}(\cdot\,|\,s,t)}\big(c_e(s,a) + \mathbb{E}_{s'\sim p(\cdot\,|\,s,a,t)}\, V^\pi_{t+1\,|\,P_\mathcal{D},P}(s',\pi_t(s,d))\big)\Big]\\
&\overset{(ii)}{\leq} \min_{\pi_t(s,d)}\ \min_{p_{\pi_t(s,d)}(\cdot\,|\,s,t)\,\in\,\mathcal{P}^k_{\cdot\,|\,\pi_t(s,d),s,t}}\ \min_{p(\cdot\,|\,s,\cdot,t)\,\in\,\mathcal{P}^k_{\cdot\,|\,s,\cdot,t}}
\Big[c_{\pi_t(s,d)}(s,d) + \mathbb{E}_{a\sim p_{\pi_t(s,d)}(\cdot\,|\,s,t)}\big(c_e(s,a) + \mathbb{E}_{s'\sim p(\cdot\,|\,s,a,t)}\, V^{\pi^*}_{t+1\,|\,P^*_\mathcal{D},P^*}(s',\pi_t(s,d))\big)\Big]\\
&\overset{(iii)}{=} \min_{\pi_t(s,d)}\ \min_{p_{\pi_t(s,d)}(\cdot\,|\,s,t)\,\in\,\mathcal{P}^k_{\cdot\,|\,\pi_t(s,d),s,t}}\ \min_{p(\cdot\,|\,s,\cdot,t)\,\in\,\mathcal{P}^k_{\cdot\,|\,s,\cdot,t}}
\Big[c_{\pi_t(s,d)}(s,d) + \mathbb{E}_{a\sim p_{\pi_t(s,d)}(\cdot\,|\,s,t)}\big(c_e(s,a) + \mathbb{E}_{s'\sim p(\cdot\,|\,s,a,t)}\, v^k_{t+1}(s',\pi_t(s,d))\big)\Big]\\
&= \min_{d_t}\Big[c_{d_t}(s,d) + \min_{p_{d_t}(\cdot\,|\,s,t)\,\in\,\mathcal{P}^k_{\cdot\,|\,d_t,s,t}} \sum_{a\in\mathcal{A}} p_{d_t}(a\,|\,s,t)\Big(c_e(s,a) + \min_{p(\cdot\,|\,s,a,t)\,\in\,\mathcal{P}^k_{\cdot\,|\,s,a,t}} \mathbb{E}_{s'\sim p(\cdot\,|\,s,a,t)}\, v^k_{t+1}(s',d_t)\Big)\Big].
\end{aligned}$$

Here, (i) follows from Lemma 8, and (ii) follows from the fact that

$$\min_{\substack{\pi_{(t+1)^+},\ P_\mathcal{D}\in\mathcal{P}^k_{\mathcal{D}\,|\,\cdot,(t+1)^+},\ P\in\mathcal{P}^k_{\cdot\,|\,\cdot,(t+1)^+}}} \Big[c_{\pi_t(s,d)}(s,d) + \mathbb{E}_{a\sim p_{\pi_t(s,d)}(\cdot\,|\,s,t)}\big(c_e(s,a) + \mathbb{E}_{s'\sim p(\cdot\,|\,s,a,t)}\, V^\pi_{t+1\,|\,P_\mathcal{D},P}(s',\pi_t(s,d))\big)\Big]$$
$$\leq\ c_{\pi_t(s,d)}(s,d) + \mathbb{E}_{a\sim p_{\pi_t(s,d)}(\cdot\,|\,s,t)}\big(c_e(s,a) + \mathbb{E}_{s'\sim p(\cdot\,|\,s,a,t)}\, V^\pi_{t+1\,|\,P_\mathcal{D},P}(s',\pi_t(s,d))\big) \qquad \forall\, \pi,\ P_\mathcal{D}\in\mathcal{P}^k_{\mathcal{D}\,|\,\cdot,(t+1)^+},\ P\in\mathcal{P}^k_{\cdot\,|\,\cdot,(t+1)^+}. \quad (17)$$

Moreover, if we set $\pi_{(t+1)^+} = \{\pi^*_{t+1},\dots,\pi^*_L\}$, $P_\mathcal{D} = P^*_\mathcal{D}\in\mathcal{P}^k_{\mathcal{D}\,|\,\cdot,(t+1)^+}$ and $P = P^*\in\mathcal{P}^k_{\cdot\,|\,\cdot,(t+1)^+}$, where

$$\{\pi^*_{t+1},\dots,\pi^*_L\},\ P^*_\mathcal{D},\ P^* = \operatorname*{argmin}_{\substack{\pi_{(t+1)^+},\ P_\mathcal{D}\in\mathcal{P}^k_{\mathcal{D}\,|\,\cdot,(t+1)^+},\ P\in\mathcal{P}^k_{\cdot\,|\,\cdot,(t+1)^+}}} V^\pi_{t+1\,|\,P_\mathcal{D},P}(s',\pi_t(s,d)), \quad (18)$$

then equality (iii) holds. Since the upper and lower bounds coincide, we can conclude that the optimistic value function satisfies Eq. 12, which completes the proof.
A.2 Proof of Theorem 2

In this proof, we assume that $c_e(s,a) + c_c(d) + c_x(d,d') < 1$ for all $s\in\mathcal{S}$, $a\in\mathcal{A}$ and $d,d'\in\mathcal{D}$. Throughout the proof, we will omit the subscripts $P^*_\mathcal{D}, P^*$ in $V_{t\,|\,P^*_\mathcal{D},P^*}$ and write $V_t$ instead in the case of the true agent policies $P^*_\mathcal{D}$ and true transition probabilities $P^*$. Then, we define the following quantities:

$$P^k_\mathcal{D} = \operatorname*{argmin}_{P_\mathcal{D}\in\mathcal{P}^k_\mathcal{D}(\delta)}\ \min_{P\in\mathcal{P}^k(\delta)} V^{\pi^k}_{1\,|\,P_\mathcal{D},P}(s_1,d_0), \quad (19)$$
$$P^k = \operatorname*{argmin}_{P\in\mathcal{P}^k(\delta)} V^{\pi^k}_{1\,|\,P^k_\mathcal{D},P}(s_1,d_0), \quad (20)$$
$$\Delta_k = V^{\pi^k}_1(s_1,d_0) - V^{\pi^*}_1(s_1,d_0), \quad (21)$$

where, recall from Eq. 7, $\pi^k = \operatorname{argmin}_\pi \min_{P_\mathcal{D}\in\mathcal{P}^k_\mathcal{D}} \min_{P\in\mathcal{P}^k} V^\pi_{1\,|\,P_\mathcal{D},P}(s_1,d_0)$, and $\Delta_k$ denotes the regret in episode $k$. Hence, we have

$$R(T) = \sum_{k=1}^K \Delta_k = \sum_{k=1}^K \Delta_k\, \mathbb{I}(P^*_\mathcal{D}\in\mathcal{P}^k_\mathcal{D} \wedge P^*\in\mathcal{P}^k) + \sum_{k=1}^K \Delta_k\, \mathbb{I}(P^*_\mathcal{D}\notin\mathcal{P}^k_\mathcal{D} \vee P^*\notin\mathcal{P}^k). \quad (22)$$

Next, we split the analysis into two parts. We first bound $\sum_{k=1}^K \Delta_k\, \mathbb{I}(P^*_\mathcal{D}\in\mathcal{P}^k_\mathcal{D} \wedge P^*\in\mathcal{P}^k)$ and then bound $\sum_{k=1}^K \Delta_k\, \mathbb{I}(P^*_\mathcal{D}\notin\mathcal{P}^k_\mathcal{D} \vee P^*\notin\mathcal{P}^k)$.

— Bounding $\sum_{k=1}^K \Delta_k\, \mathbb{I}(P^*_\mathcal{D}\in\mathcal{P}^k_\mathcal{D} \wedge P^*\in\mathcal{P}^k)$.

First, we note that

$$\Delta_k = V^{\pi^k}_1(s_1,d_0) - V^{\pi^*}_1(s_1,d_0) \leq V^{\pi^k}_1(s_1,d_0) - V^{\pi^k}_{1\,|\,P^k_\mathcal{D},P^k}(s_1,d_0). \quad (23)$$

This is because

$$V^{\pi^k}_{1\,|\,P^k_\mathcal{D},P^k}(s_1,d_0) \overset{(i)}{=} \min_\pi \min_{P_\mathcal{D}\in\mathcal{P}^k_\mathcal{D}} \min_{P\in\mathcal{P}^k} V^\pi_{1\,|\,P_\mathcal{D},P}(s_1,d_0) \overset{(ii)}{\leq} \min_\pi V^\pi_{1\,|\,P^*_\mathcal{D},P^*}(s_1,d_0) = V^{\pi^*}_1(s_1,d_0), \quad (24)$$

where (i) follows from Eqs. 19 and 20, and (ii) holds because the true transition probabilities satisfy $P^*_\mathcal{D}\in\mathcal{P}^k_\mathcal{D}$ and $P^*\in\mathcal{P}^k$. Next, we use Lemma 4 (Appendix B) to bound $\sum_{k=1}^K \big(V^{\pi^k}_1(s_1,d_0) - V^{\pi^k}_{1\,|\,P^k_\mathcal{D},P^k}(s_1,d_0)\big)$.
$$\sum_{k=1}^K \big(V^{\pi^k}_1(s_1,d_0) - V^{\pi^k}_{1\,|\,P^k_\mathcal{D},P^k}(s_1,d_0)\big) \leq \sum_{k=1}^K L\, \mathbb{E}\Big[\sum_{t=1}^L \min\{1,\beta^k_\mathcal{D}(s_t,d_t,\delta)\} + \sum_{t=1}^L \min\{1,\beta^k(s_t,a_t,\delta)\}\,\Big|\, s_1,d_0\Big]. \quad (25)$$

Since, by assumption, $c_e(s,a)+c_c(d)+c_x(d,d') < 1$ for all $s\in\mathcal{S}$, $a\in\mathcal{A}$ and $d,d'\in\mathcal{D}$, the worst-case regret is bounded by $T$. Therefore, we have that:

$$\begin{aligned}
\sum_{k=1}^K \Delta_k\, \mathbb{I}(P^*_\mathcal{D}\in\mathcal{P}^k_\mathcal{D} \wedge P^*\in\mathcal{P}^k)
&\leq \min\Big\{T,\ \sum_{k=1}^K L\,\mathbb{E}\Big[\sum_{t=1}^L \min\{1,\beta^k_\mathcal{D}(s_t,d_t,\delta)\}\,\Big|\,s_1,d_0\Big] + \sum_{k=1}^K L\,\mathbb{E}\Big[\sum_{t=1}^L \min\{1,\beta^k(s_t,a_t,\delta)\}\,\Big|\,s_1,d_0\Big]\Big\}\\
&\leq \min\Big\{T,\ \sum_{k=1}^K L\,\mathbb{E}\Big[\sum_{t=1}^L \min\{1,\beta^k_\mathcal{D}(s_t,d_t,\delta)\}\,\Big|\,s_1,d_0\Big]\Big\} + \min\Big\{T,\ \sum_{k=1}^K L\,\mathbb{E}\Big[\sum_{t=1}^L \min\{1,\beta^k(s_t,a_t,\delta)\}\,\Big|\,s_1,d_0\Big]\Big\}, \quad (26)
\end{aligned}$$

where the last inequality follows from Lemma 9. Now, we aim to bound the first term in the RHS of the above inequality:

$$\begin{aligned}
\sum_{k=1}^K L\,\mathbb{E}\Big[\sum_{t=1}^L \min\{1,\beta^k_\mathcal{D}(s_t,d_t,\delta)\}\,\Big|\,s_1,d_0\Big]
&\overset{(i)}{=} L\sum_{k=1}^K \mathbb{E}\Bigg[\sum_{t=1}^L \min\Bigg\{1, \sqrt{\frac{2\log\frac{((k-1)L)^7|\mathcal{S}||\mathcal{D}|\,2^{|\mathcal{A}|+1}}{\delta}}{\max\{1,N_k(s_t,d_t)\}}}\Bigg\}\,\Bigg|\,s_1,d_0\Bigg]\\
&\overset{(ii)}{\leq} L\sum_{k=1}^K \mathbb{E}\Bigg[\sum_{t=1}^L \min\Bigg\{1, \sqrt{\frac{2\log\frac{(KL)^7|\mathcal{S}||\mathcal{D}|\,2^{|\mathcal{A}|+1}}{\delta}}{\max\{1,N_k(s_t,d_t)\}}}\Bigg\}\Bigg]\\
&\overset{(iii)}{\leq} 2\sqrt{2}\,L\sqrt{2\log\tfrac{(KL)^7|\mathcal{S}||\mathcal{D}|\,2^{|\mathcal{A}|+1}}{\delta}\;|\mathcal{S}||\mathcal{D}|\,KL} + 2L^2|\mathcal{S}||\mathcal{D}| \quad (27)\\
&\leq 2\sqrt{2}\,L\sqrt{14|\mathcal{A}|\log\tfrac{KL|\mathcal{S}||\mathcal{D}|}{\delta}\;|\mathcal{S}||\mathcal{D}|\,KL} + 2L^2|\mathcal{S}||\mathcal{D}|
= \sqrt{112}\,L\sqrt{|\mathcal{A}|\log\tfrac{KL|\mathcal{S}||\mathcal{D}|}{\delta}\;|\mathcal{S}||\mathcal{D}|\,KL} + 2L^2|\mathcal{S}||\mathcal{D}|, \quad (28)
\end{aligned}$$

where (i) follows by replacing $\beta^k_\mathcal{D}(s_t,d_t,\delta)$ with its definition, (ii) follows from the fact that $(k-1)L \leq KL$, and (iii) follows from Lemma 5, in which we set $W := \mathcal{S}\times\mathcal{D}$, $c := \sqrt{2\log\tfrac{(KL)^7|\mathcal{S}||\mathcal{D}|\,2^{|\mathcal{A}|+1}}{\delta}}$ and $T_k = (w_{k,1},\dots,w_{k,L}) := ((s_1,d_1),\dots,(s_L,d_L))$. Now, due to Eq. 28, we have the following.
$$\min\Big\{T,\ \sum_{k=1}^K L\,\mathbb{E}\Big[\sum_{t=1}^L \min\{1,\beta^k_\mathcal{D}(s_t,d_t,\delta)\}\,\Big|\,s_1,d_0\Big]\Big\} \leq \min\Big\{T,\ \sqrt{112}\,L\sqrt{|\mathcal{A}||\mathcal{S}||\mathcal{D}|\,T\log\tfrac{T|\mathcal{S}||\mathcal{D}|}{\delta}} + 2L^2|\mathcal{S}||\mathcal{D}|\Big\}. \quad (29)$$

Now, if $T \leq 2L^2|\mathcal{S}||\mathcal{A}||\mathcal{D}|\log\tfrac{T|\mathcal{S}||\mathcal{D}|}{\delta}$, then

$$T^2 \leq 2L^2|\mathcal{S}||\mathcal{A}||\mathcal{D}|\,T\log\tfrac{T|\mathcal{S}||\mathcal{D}|}{\delta} \implies T \leq \sqrt{2}\,L\sqrt{|\mathcal{S}||\mathcal{A}||\mathcal{D}|\,T\log\tfrac{T|\mathcal{S}||\mathcal{D}|}{\delta}},$$

and if $T > 2L^2|\mathcal{S}||\mathcal{A}||\mathcal{D}|\log\tfrac{T|\mathcal{S}||\mathcal{D}|}{\delta}$, then

$$2L^2|\mathcal{S}||\mathcal{D}| \leq 2L^2|\mathcal{S}||\mathcal{A}||\mathcal{D}|\log\tfrac{T|\mathcal{S}||\mathcal{D}|}{\delta} < \sqrt{2L^2|\mathcal{S}||\mathcal{A}||\mathcal{D}|\,T\log\tfrac{T|\mathcal{S}||\mathcal{D}|}{\delta}} = \sqrt{2}\,L\sqrt{|\mathcal{S}||\mathcal{A}||\mathcal{D}|\,T\log\tfrac{T|\mathcal{S}||\mathcal{D}|}{\delta}}. \quad (30)$$

Thus, the minimum in Eq. 29 is less than

$$(\sqrt{2}+\sqrt{112})\,L\sqrt{|\mathcal{S}||\mathcal{A}||\mathcal{D}|\,T\log\tfrac{|\mathcal{S}||\mathcal{D}|T}{\delta}} < 12\,L\sqrt{|\mathcal{S}||\mathcal{A}||\mathcal{D}|\,T\log\tfrac{|\mathcal{S}||\mathcal{D}|T}{\delta}}. \quad (31)$$

A similar analysis can be done for the second term of the RHS of Eq. 26, which shows that

$$\min\Big\{T,\ \sum_{k=1}^K L\,\mathbb{E}\Big[\sum_{t=1}^L \min\{1,\beta^k(s_t,a_t,\delta)\}\,\Big|\,s_1,d_0\Big]\Big\} \leq 12\,L|\mathcal{S}|\sqrt{|\mathcal{A}|\,T\log\tfrac{T|\mathcal{S}||\mathcal{A}|}{\delta}}. \quad (32)$$

Combining Eqs. 26, 31 and 32, we can bound the first term of the total regret as follows:

$$\sum_{k=1}^K \Delta_k\, \mathbb{I}(P^*_\mathcal{D}\in\mathcal{P}^k_\mathcal{D} \wedge P^*\in\mathcal{P}^k) \leq 12\,L\sqrt{|\mathcal{A}||\mathcal{S}||\mathcal{D}|\,T\log\tfrac{T|\mathcal{S}||\mathcal{D}|}{\delta}} + 12\,L|\mathcal{S}|\sqrt{|\mathcal{A}|\,T\log\tfrac{T|\mathcal{S}||\mathcal{A}|}{\delta}}. \quad (33)$$

— Bounding $\sum_{k=1}^K \Delta_k\, \mathbb{I}(P^*_\mathcal{D}\notin\mathcal{P}^k_\mathcal{D} \vee P^*\notin\mathcal{P}^k)$.

Here, we use an approach similar to Jaksch et al. (2010). Note that

$$\sum_{k=1}^K \Delta_k\, \mathbb{I}(P^*_\mathcal{D}\notin\mathcal{P}^k_\mathcal{D} \vee P^*\notin\mathcal{P}^k) = \sum_{k=1}^{\lfloor\sqrt{K/L}\rfloor} \Delta_k\, \mathbb{I}(P^*_\mathcal{D}\notin\mathcal{P}^k_\mathcal{D} \vee P^*\notin\mathcal{P}^k) + \sum_{k=\lfloor\sqrt{K/L}\rfloor+1}^{K} \Delta_k\, \mathbb{I}(P^*_\mathcal{D}\notin\mathcal{P}^k_\mathcal{D} \vee P^*\notin\mathcal{P}^k). \quad (34)$$

Now, our goal is to show that the second term of the RHS of the above equation vanishes with high probability. If we succeed, then it holds that, with high probability, $\sum_{k=1}^K \Delta_k\, \mathbb{I}(P^*_\mathcal{D}\notin\mathcal{P}^k_\mathcal{D} \vee P^*\notin\mathcal{P}^k)$ equals the first term of the RHS, and then we will be done because

$$\sum_{k=1}^{\lfloor\sqrt{K/L}\rfloor} \Delta_k\, \mathbb{I}(P^*_\mathcal{D}\notin\mathcal{P}^k_\mathcal{D} \vee P^*\notin\mathcal{P}^k) \leq \sum_{k=1}^{\lfloor\sqrt{K/L}\rfloor} \Delta_k \overset{(i)}{\leq} \Big\lfloor\sqrt{\tfrac{K}{L}}\Big\rfloor\, L \leq \sqrt{KL}, \quad (35)$$

where (i) follows from the fact that $\Delta_k \leq L$, since we assumed the cost of each step satisfies $c_e(s,a)+c_c(d)+c_x(d,d') \leq 1$ for all $s\in\mathcal{S}$, $a\in\mathcal{A}$, and $d,d'\in\mathcal{D}$.
To prove that $\sum_{k=\lfloor \sqrt{K/L} \rfloor + 1}^{K} \Delta_k \, \mathbb{I}(P^*_D \notin \mathcal{P}^k_D \vee P^* \notin \mathcal{P}^k) = 0$ with high probability, we proceed as follows. By applying Lemma 6 to $P^*_D$ and $P^*$, we have
$$\Pr(P^*_D \notin \mathcal{P}^k_D) \le \frac{\delta}{2 t_k^6}, \qquad \Pr(P^* \notin \mathcal{P}^k) \le \frac{\delta}{2 t_k^6}. \quad (36)$$
Thus,
$$\Pr(P^*_D \notin \mathcal{P}^k_D \vee P^* \notin \mathcal{P}^k) \le \Pr(P^*_D \notin \mathcal{P}^k_D) + \Pr(P^* \notin \mathcal{P}^k) \le \frac{\delta}{t_k^6}, \quad (37)$$
where $t_k = (k-1)L$ is the end time of episode $k-1$. Therefore, it follows that
$$\Pr\left( \sum_{k=\lfloor \sqrt{K/L} \rfloor + 1}^{K} \Delta_k \, \mathbb{I}(P^*_D \notin \mathcal{P}^k_D \vee P^* \notin \mathcal{P}^k) = 0 \right) = \Pr\left( \forall k: \lfloor \sqrt{K/L} \rfloor + 1 \le k \le K; \; P^*_D \in \mathcal{P}^k_D \wedge P^* \in \mathcal{P}^k \right)$$
$$= 1 - \Pr\left( \exists k: \lfloor \sqrt{K/L} \rfloor + 1 \le k \le K; \; P^*_D \notin \mathcal{P}^k_D \vee P^* \notin \mathcal{P}^k \right) \overset{(i)}{\ge} 1 - \sum_{k=\lfloor \sqrt{K/L} \rfloor + 1}^{K} \Pr(P^*_D \notin \mathcal{P}^k_D \vee P^* \notin \mathcal{P}^k)$$
$$\overset{(ii)}{\ge} 1 - \sum_{k=\lfloor \sqrt{K/L} \rfloor + 1}^{K} \frac{\delta}{t_k^6} \overset{(iii)}{\ge} 1 - \sum_{t=\lfloor \sqrt{KL} \rfloor}^{KL} \frac{\delta}{t^6} \ge 1 - \int_{\sqrt{KL}}^{KL} \frac{\delta}{t^6} \, dt \ge 1 - \frac{\delta}{5 (KL)^{5/4}}, \quad (38)$$
where (i) follows from a union bound, (ii) follows from Eq. 37, and (iii) holds using the fact that $t_k = (k-1)L$. Hence, with probability at least $1 - \frac{\delta}{5 (KL)^{5/4}}$ we have that
$$\sum_{k=\lfloor \sqrt{K/L} \rfloor + 1}^{K} \Delta_k \, \mathbb{I}(P^*_D \notin \mathcal{P}^k_D \vee P^* \notin \mathcal{P}^k) = 0. \quad (39)$$
If we combine the above equation and Eq. 35, we can conclude that, with probability at least $1 - \frac{\delta}{5 T^{5/4}}$, we have that
$$\sum_{k=1}^{K} \Delta_k \, \mathbb{I}(P^*_D \notin \mathcal{P}^k_D \vee P^* \notin \mathcal{P}^k) \le \sqrt{T}, \quad (40)$$
where $T = KL$. Next, if we combine Eqs. 33 and 40, we have
$$R(T) = \sum_{k=1}^{K} \Delta_k \, \mathbb{I}(P^*_D \in \mathcal{P}^k_D \wedge P^* \in \mathcal{P}^k) + \sum_{k=1}^{K} \Delta_k \, \mathbb{I}(P^*_D \notin \mathcal{P}^k_D \vee P^* \notin \mathcal{P}^k)$$
$$\le 12 L \sqrt{|\mathcal{A}||\mathcal{S}||\mathcal{D}| \, T \log\Big( \frac{T|\mathcal{S}||\mathcal{D}|}{\delta} \Big)} + 12 L |\mathcal{S}| \sqrt{|\mathcal{A}| \, T \log\Big( \frac{T|\mathcal{S}||\mathcal{A}|}{\delta} \Big)} + \sqrt{T}$$
$$\le 13 L \sqrt{|\mathcal{A}||\mathcal{S}||\mathcal{D}| \, T \log\Big( \frac{T|\mathcal{S}||\mathcal{D}|}{\delta} \Big)} + 12 L |\mathcal{S}| \sqrt{|\mathcal{A}| \, T \log\Big( \frac{T|\mathcal{S}||\mathcal{A}|}{\delta} \Big)}. \quad (41)$$

Finally, since $\sum_{T=1}^{\infty} \frac{\delta}{5 T^{5/4}} \le \delta$, the above inequality holds with probability at least $1 - \delta$. This concludes the proof.

B Useful lemmas

Lemma 4. Suppose $P_D$ and $P$ are the true transitions and $P_D \in \mathcal{P}^k_D$, $P \in \mathcal{P}^k$ for episode $k$.
Then, for an arbitrary policy $\pi_k$ and arbitrary $P^k_D \in \mathcal{P}^k_D$, $P^k \in \mathcal{P}^k$, it holds that
$$V^{\pi_k}_{1|P_D,P}(s,d) - V^{\pi_k}_{1|P^k_D,P^k}(s,d) \le L \, \mathbb{E}\left[ \sum_{t=1}^{L} \min\{1, \beta^k_D(s_t,d_t,\delta)\} + \sum_{t=1}^{L} \min\{1, \beta^k(s_t,a_t,\delta)\} \,\Big|\, s_1 = s, d_0 = d \right], \quad (42)$$
where the expectation is taken over the MDP with policy $\pi_k$ under the true transitions $P_D$ and $P$.

Proof. For ease of notation, let $v^k_t := V^{\pi_k}_{t|P_D,P}$, $v^k_{t|k} := V^{\pi_k}_{t|P^k_D,P^k}$ and $c^\pi_t(s,d) = c_{\pi^k_t(s,d)}(s,d)$. We also define $d' = \pi^k_1(s,d)$. From Eq. 68, we have
$$v^k_1(s,d) = c^\pi_1(s,d) + \sum_{a \in \mathcal{A}} p_{\pi^k_1(s,d)}(a \mid s) \cdot \left( c_e(s,a) + \sum_{s' \in \mathcal{S}} p(s' \mid s,a) \cdot v^k_2(s',d') \right) \quad (43)$$
$$v^k_{1|k}(s,d) = c^\pi_1(s,d) + \sum_{a \in \mathcal{A}} p^k_{\pi^k_1(s,d)}(a \mid s) \cdot \left( c_e(s,a) + \sum_{s' \in \mathcal{S}} p^k(s' \mid s,a) \cdot v^k_{2|k}(s',d') \right) \quad (44)$$
Now, using the above equations, we rewrite $v^k_1(s,d) - v^k_{1|k}(s,d)$ as
$$v^k_1(s,d) - v^k_{1|k}(s,d) = \sum_{a \in \mathcal{A}} p_{\pi^k_1(s,d)}(a \mid s) \left( c_e(s,a) + \sum_{s' \in \mathcal{S}} p(s' \mid s,a) \cdot v^k_2(s',d') \right) - \sum_{a \in \mathcal{A}} p^k_{\pi^k_1(s,d)}(a \mid s) \left( c_e(s,a) + \sum_{s' \in \mathcal{S}} p^k(s' \mid s,a) \cdot v^k_{2|k}(s',d') \right)$$
$$\overset{(i)}{=} \sum_{a \in \mathcal{A}} p_{\pi^k_1(s,d)}(a \mid s) \left( c_e(s,a) + \sum_{s' \in \mathcal{S}} p(s' \mid s,a) \cdot v^k_2(s',d') - c_e(s,a) - \sum_{s' \in \mathcal{S}} p^k(s' \mid s,a) \cdot v^k_{2|k}(s',d') \right)$$
$$+ \sum_{a \in \mathcal{A}} \left( p_{\pi^k_1(s,d)}(a \mid s) - p^k_{\pi^k_1(s,d)}(a \mid s) \right) \underbrace{\left( c_e(s,a) + \sum_{s' \in \mathcal{S}} p^k(s' \mid s,a) \cdot v^k_{2|k}(s',d') \right)}_{\le L}$$
$$\overset{(ii)}{\le} \sum_{a \in \mathcal{A}} \left[ p_{\pi^k_1(s,d)}(a \mid s) \cdot \sum_{s' \in \mathcal{S}} \Big( p(s' \mid s,a) \, v^k_2(s',d') - p^k(s' \mid s,a) \, v^k_{2|k}(s',d') \Big) \right] + L \sum_{a \in \mathcal{A}} \left| p_{\pi^k_1(s,d)}(a \mid s) - p^k_{\pi^k_1(s,d)}(a \mid s) \right|$$
$$\overset{(iii)}{=} \sum_{a \in \mathcal{A}} \left[ p_{\pi^k_1(s,d)}(a \mid s) \cdot \sum_{s' \in \mathcal{S}} p(s' \mid s,a) \cdot \Big( v^k_2(s',d') - v^k_{2|k}(s',d') \Big) \right] + \sum_{a \in \mathcal{A}} p_{\pi^k_1(s,d)}(a \mid s) \sum_{s' \in \mathcal{S}} \Big( p(s' \mid s,a) - p^k(s' \mid s,a) \Big) \underbrace{v^k_{2|k}(s',d')}_{\le L} + L \sum_{a \in \mathcal{A}} \left| p_{\pi^k_1(s,d)}(a \mid s) - p^k_{\pi^k_1(s,d)}(a \mid s) \right|$$
$$\overset{(iv)}{\le} \mathbb{E}_{a \sim p_{\pi^k_1(s,d)}(\cdot \mid s), \, s' \sim p(\cdot \mid s,a)} \Big[ v^k_2(s',d') - v^k_{2|k}(s',d') \Big] + L \, \mathbb{E}_{a \sim p_{\pi^k_1(s,d)}(\cdot \mid s)} \left[ \sum_{s' \in \mathcal{S}} \left| p(s' \mid s,a) - p^k(s' \mid s,a) \right| \right] + L \sum_{a \in \mathcal{A}} \left| p_{\pi^k_1(s,d)}(a \mid s) - p^k_{\pi^k_1(s,d)}(a \mid s) \right|, \quad (45)$$
where (i) follows by adding and subtracting the term $p_{\pi^k_1(s,d)}(a \mid s) \big( c_e(s,a) + \sum_{s' \in \mathcal{S}} p^k(s' \mid s,a) \cdot v^k_{2|k}(s',d') \big)$; (ii) follows from the fact that $c_e(s,a) + \sum_{s' \in \mathcal{S}} p^k(s' \mid s,a) \cdot v^k_{2|k}(s',d') \le L$, since, by assumption, $c_e(s,a) + c_c(d) + c_x(d,d') < 1$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$ and $d,d' \in \mathcal{D}$. Similarly, (iii) follows by adding and subtracting $p(s' \mid s,a) \, v^k_{2|k}(s',d')$, and (iv) follows from the fact that $v^k_{2|k} \le L$. By assumption, both $P_D$ and $P^k_D$ lie in the confidence set $\mathcal{P}^k_D(\delta)$, so
$$\sum_{a \in \mathcal{A}} \left| p_{\pi^k_1(s,d)}(a \mid s) - p^k_{\pi^k_1(s,d)}(a \mid s) \right| \le \min\{1, \beta^k_D(s, d' = \pi^k_1(s,d), \delta)\}. \quad (46)$$
Similarly,
$$\sum_{s' \in \mathcal{S}} \left| p(s' \mid s,a) - p^k(s' \mid s,a) \right| \le \min\{1, \beta^k(s,a,\delta)\}. \quad (47)$$
If we combine Eq. 46 and Eq. 47 in Eq. 45, for all $s \in \mathcal{S}$, it holds that
$$v^k_1(s,d) - v^k_{1|k}(s,d) \le \mathbb{E}_{a \sim p_{\pi^k_1(s,d)}(\cdot \mid s), \, s' \sim p(\cdot \mid s,a)} \Big[ v^k_2(s',d') - v^k_{2|k}(s',d') \Big] + L \, \mathbb{E}_{a \sim p_{\pi^k_1(s,d)}(\cdot \mid s)} \big[ \min\{1, \beta^k(s,a,\delta)\} \big] + L \min\{1, \beta^k_D(s, d' = \pi^k_1(s,d), \delta)\}. \quad (48)$$
Similarly, for all $s \in \mathcal{S}$, $d \in \mathcal{D}$ we can show
$$v^k_2(s,d) - v^k_{2|k}(s,d) \le \mathbb{E}_{a \sim p_{\pi^k_2(s,d)}(\cdot \mid s), \, s' \sim p(\cdot \mid s,a)} \Big[ v^k_3(s', \pi^k_2(s,d)) - v^k_{3|k}(s', \pi^k_2(s,d)) \Big] + L \, \mathbb{E}_{a \sim p_{\pi^k_2(s,d)}(\cdot \mid s)} \big[ \min\{1, \beta^k(s,a,\delta)\} \big] + L \min\{1, \beta^k_D(s, \pi^k_2(s,d), \delta)\}. \quad (49)$$
Hence, by induction we have
$$v^k_1(s,d) - v^k_{1|k}(s,d) \le L \, \mathbb{E}\left[ \sum_{t=1}^{L} \min\{1, \beta^k_D(s_t,d_t,\delta)\} + \sum_{t=1}^{L} \min\{1, \beta^k(s_t,a_t,\delta)\} \,\Big|\, s_1 = s, d_0 = d \right], \quad (50)$$
where the expectation is taken over the MDP with policy $\pi_k$ under the true transitions $P_D$ and $P$.

Lemma 5. Let $\mathcal{W}$ be a finite set and $c$ be a constant. For $k \in [K]$, suppose $T_k = (w_{k,1}, w_{k,2}, \ldots, w_{k,H})$ is a random variable with distribution $P(\cdot \mid w_{k,1})$, where $w_{k,i} \in \mathcal{W}$. Then,
$$\sum_{k=1}^{K} \mathbb{E}_{T_k \sim P(\cdot \mid w_{k,1})} \left[ \sum_{t=1}^{H} \min\left\{1, \frac{c}{\sqrt{\max\{1, N_k(w_{k,t})\}}} \right\} \right] \le 2H|\mathcal{W}| + 2\sqrt{2}\, c \sqrt{|\mathcal{W}| K H}, \quad (51)$$
with $N_k(w) := \sum_{j=1}^{k-1} \sum_{t=1}^{H} \mathbb{I}(w_{j,t} = w)$.

Proof. The proof is adapted from Osband et al. (2013).
We first note that
$$\mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{H} \min\left\{1, \frac{c}{\sqrt{\max\{1, N_k(w_{k,t})\}}} \right\} \right] = \mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{H} \mathbb{I}(N_k(w_{k,t}) \le H) \min\left\{1, \frac{c}{\sqrt{\max\{1, N_k(w_{k,t})\}}} \right\} \right] + \mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{H} \mathbb{I}(N_k(w_{k,t}) > H) \min\left\{1, \frac{c}{\sqrt{\max\{1, N_k(w_{k,t})\}}} \right\} \right]$$
$$\le \mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{H} \mathbb{I}(N_k(w_{k,t}) \le H) \cdot 1 \right] + \mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{H} \mathbb{I}(N_k(w_{k,t}) > H) \cdot \frac{c}{\sqrt{N_k(w_{k,t})}} \right]. \quad (52)$$
Then, we bound the first term of the above equation:
$$\mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{H} \mathbb{I}(N_k(w_{k,t}) \le H) \right] = \mathbb{E}\left[ \sum_{w \in \mathcal{W}} \#\{\text{times } w \text{ is observed while } N_k(w) \le H\} \right] \le \mathbb{E}\big[ |\mathcal{W}| \cdot 2H \big] = 2H|\mathcal{W}|. \quad (53)$$
To bound the second term, we first define $n_\tau(w)$ as the number of times $w$ has been observed in the first $\tau$ steps, i.e., if we are at the $t$-th index of trajectory $T_k$, then $\tau = t_k + t$, where $t_k = (k-1)H$, and note that
$$n_{t_k + t}(w) \le N_k(w) + t \quad (54)$$
because we will observe $w$ at most $t \in \{1, \ldots, H\}$ times within trajectory $T_k$. Now, if $N_k(w) > H$, we have that
$$n_{t_k + t}(w) + 1 \le N_k(w) + t + 1 \le N_k(w) + H + 1 \le 2 N_k(w). \quad (55)$$
Hence we have
$$\mathbb{I}(N_k(w_{k,t}) > H) \big( n_{t_k + t}(w_{k,t}) + 1 \big) \le 2 N_k(w_{k,t}) \implies \frac{\mathbb{I}(N_k(w_{k,t}) > H)}{N_k(w_{k,t})} \le \frac{2}{n_{t_k + t}(w_{k,t}) + 1}. \quad (56)$$
Then, using the above equation, we can bound the second term in Eq. 52:
$$\mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{H} \mathbb{I}(N_k(w_{k,t}) > H) \, \frac{c}{\sqrt{N_k(w_{k,t})}} \right] = \mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{H} c \sqrt{\frac{\mathbb{I}(N_k(w_{k,t}) > H)}{N_k(w_{k,t})}} \right] \overset{(i)}{\le} \sqrt{2}\, c \, \mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{H} \sqrt{\frac{1}{n_{t_k + t}(w_{k,t}) + 1}} \right], \quad (57)$$
where (i) follows from Eq. 56.
Next, we can further bound $\mathbb{E}\big[ \sum_{k=1}^{K} \sum_{t=1}^{H} \sqrt{1/(n_{t_k+t}(w_{k,t}) + 1)} \big]$ as follows:
$$\mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{H} \sqrt{\frac{1}{n_{t_k+t}(w_{k,t}) + 1}} \right] = \mathbb{E}\left[ \sum_{\tau=1}^{KH} \sqrt{\frac{1}{n_\tau(w_\tau) + 1}} \right] \overset{(i)}{=} \mathbb{E}\left[ \sum_{w \in \mathcal{W}} \sum_{\nu=0}^{N_{K+1}(w)-1} \sqrt{\frac{1}{\nu + 1}} \right] = \sum_{w \in \mathcal{W}} \mathbb{E}\left[ \sum_{\nu=0}^{N_{K+1}(w)-1} \sqrt{\frac{1}{\nu + 1}} \right]$$
$$\le \sum_{w \in \mathcal{W}} \mathbb{E}\left[ \int_{0}^{N_{K+1}(w)} \sqrt{\frac{1}{x}} \, dx \right] = \sum_{w \in \mathcal{W}} \mathbb{E}\left[ 2\sqrt{N_{K+1}(w)} \right] \overset{(ii)}{\le} \mathbb{E}\left[ 2\sqrt{|\mathcal{W}| \sum_{w \in \mathcal{W}} N_{K+1}(w)} \right] \overset{(iii)}{=} \mathbb{E}\left[ 2\sqrt{|\mathcal{W}| K H} \right] = 2\sqrt{|\mathcal{W}| K H}, \quad (58)$$
where (i) follows from summing over the different $w \in \mathcal{W}$ instead of over time, together with the fact that each $w$ is observed exactly $N_{K+1}(w)$ times after $K$ trajectories, so $n_\tau(w)$ takes the values $0, \ldots, N_{K+1}(w)-1$ just before those observations; (ii) follows from Jensen's inequality; and (iii) follows from the fact that $\sum_{w \in \mathcal{W}} N_{K+1}(w) = KH$. Next, we combine Eqs. 57 and 58 to obtain
$$\mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{H} \mathbb{I}(N_k(w_{k,t}) > H) \, \frac{c}{\sqrt{N_k(w_{k,t})}} \right] \le \sqrt{2}\, c \times 2\sqrt{|\mathcal{W}| K H} = 2\sqrt{2}\, c \sqrt{|\mathcal{W}| K H}. \quad (59)$$
Further, we plug Eqs. 53 and 59 into Eq. 52:
$$\mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{H} \min\left\{1, \frac{c}{\sqrt{\max\{1, N_k(w_{k,t})\}}} \right\} \right] \le 2H|\mathcal{W}| + 2\sqrt{2}\, c \sqrt{|\mathcal{W}| K H}. \quad (60)$$
This concludes the proof.

Lemma 6. Let $\mathcal{W}$ be a finite set and let $\mathcal{P}_t(\delta) := \{p : \forall w \in \mathcal{W}, \; \|p(\cdot \mid w) - \hat{p}_t(\cdot \mid w)\|_1 \le \beta_t(w,\delta)\}$ be a $|\mathcal{W}|$-rectangular confidence set over probability distributions $p^*(\cdot \mid w)$ with $m$ outcomes, where $\hat{p}_t(\cdot \mid w)$ is the empirical estimate of $p^*(\cdot \mid w)$. Suppose at each time $\tau$ we observe a state $w_\tau = w$ and sample from $p^*(\cdot \mid w)$. If
$$\beta_t(w,\delta) = \sqrt{\frac{2 \log\big( t^7 |\mathcal{W}| \, 2^{m+1} / \delta \big)}{\max\{1, N_t(w)\}}}$$
with $N_t(w) = \sum_{\tau=1}^{t} \mathbb{I}(w_\tau = w)$, then the true distributions $p^*$ lie in the confidence set $\mathcal{P}_t(\delta)$ with probability at least $1 - \frac{\delta}{2t^6}$.

Proof. We adapt the proof of Lemma 17 in Jaksch et al. (2010). We note that
$$\Pr(p^* \notin \mathcal{P}_t) \overset{(i)}{=} \Pr\left( \bigcup_{w \in \mathcal{W}} \left\{ \|p^*(\cdot \mid w) - \hat{p}_t(\cdot \mid w)\|_1 \ge \beta_t(w,\delta) \right\} \right)$$
$$\overset{(ii)}{\le} \sum_{w \in \mathcal{W}} \Pr\left( \|p^*(\cdot \mid w) - \hat{p}_t(\cdot \mid w)\|_1 \ge \sqrt{\frac{2 \log\big( t^7 |\mathcal{W}| \, 2^{m+1} / \delta \big)}{\max\{1, N_t(w)\}}} \right) \overset{(iii)}{\le} \sum_{w \in \mathcal{W}} \sum_{n=0}^{t} \Pr\left( \|p^*(\cdot \mid w) - \hat{p}_t(\cdot \mid w)\|_1 \ge \sqrt{\frac{2 \log\big( t^7 |\mathcal{W}| \, 2^{m+1} / \delta \big)}{\max\{1, n\}}} \right),$$
where (i) follows from the definition of the confidence set, i.e., the probability distributions do not lie in the confidence set iff there is at least one state $w$ for which $\|p^*(\cdot \mid w) - \hat{p}_t(\cdot \mid w)\|_1 \ge \beta_t(w,\delta)$; (ii) follows from the definition of $\beta_t(w,\delta)$ and a union bound over all $w \in \mathcal{W}$; and (iii) follows from a union bound over all possible values of $N_t(w)$. To continue, we split the sum into $n = 0$ and $n > 0$:
$$\sum_{w \in \mathcal{W}} \sum_{n=0}^{t} \Pr(\cdots) = \sum_{w \in \mathcal{W}} \sum_{n=1}^{t} \Pr\left( \|p^*(\cdot \mid w) - \hat{p}_t(\cdot \mid w)\|_1 \ge \sqrt{\frac{2 \log\big( t^7 |\mathcal{W}| \, 2^{m+1} / \delta \big)}{n}} \right) + \underbrace{\sum_{w \in \mathcal{W}} \Pr\left( \|p^*(\cdot \mid w) - \hat{p}_t(\cdot \mid w)\|_1 \ge \sqrt{2 \log\big( t^7 |\mathcal{W}| \, 2^{m+1} / \delta \big)} \right)}_{n = 0}$$
$$\overset{(i)}{=} \sum_{w \in \mathcal{W}} \sum_{n=1}^{t} \Pr\left( \|p^*(\cdot \mid w) - \hat{p}_t(\cdot \mid w)\|_1 \ge \sqrt{\frac{2 \log\big( t^7 |\mathcal{W}| \, 2^{m+1} / \delta \big)}{n}} \right) + 0 \overset{(ii)}{\le} t |\mathcal{W}| \, 2^m \exp\left( -\log\frac{t^7 |\mathcal{W}| \, 2^{m+1}}{\delta} \right) \le \frac{\delta}{2t^6},$$
where (i) follows from the fact that $\|p^*(\cdot \mid w) - \hat{p}_t(\cdot \mid w)\|_1 < \sqrt{2 \log( t^7 |\mathcal{W}| 2^{m+1} / \delta )}$ in all non-trivial cases; more specifically,
$$\delta < 1, \; t \ge 2 \implies \sqrt{2 \log\frac{t^7 |\mathcal{W}| \, 2^{m+1}}{\delta}} > \sqrt{2 \log(512)} > 2, \qquad \|p^*(\cdot \mid w) - \hat{p}_t(\cdot \mid w)\|_1 \le \sum_{i \in [m]} \big( p^*(i \mid w) + \hat{p}_t(i \mid w) \big) \le 2, \quad (61)$$
and (ii) follows from the fact that, after observing $n$ samples, the $L_1$-deviation of the true distribution $p^*$ from the empirical one $\hat{p}$ over $m$ events is bounded by
$$\Pr\left( \|p^*(\cdot) - \hat{p}(\cdot)\|_1 \ge \epsilon \right) \le 2^m \exp\left( -\frac{n \epsilon^2}{2} \right). \quad (62)$$

Lemma 7. Consider the following minimization problem:
$$\underset{x}{\text{minimize}} \;\; \sum_{i=1}^{m} x_i w_i \qquad \text{subject to} \;\; \sum_{i=1}^{m} |x_i - b_i| \le d, \quad \sum_i x_i = 1, \quad x_i \ge 0 \;\; \forall i \in \{1, \ldots, m\}, \quad (63)$$
where $d \ge 0$, $b_i \ge 0$ for all $i \in \{1, \ldots, m\}$, $\sum_i b_i = 1$ and $0 \le w_1 \le w_2 \le \cdots \le w_m$. Then, the solution to the above minimization problem is given by:
$$x^*_i = \begin{cases} \min\{1, b_1 + d/2\} & \text{if } i = 1 \\ b_i & \text{if } i > 1 \text{ and } \sum_{l=1}^{i} x^*_l \le 1 \\ 0 & \text{otherwise.} \end{cases} \quad (64)$$
Proof. Suppose there is $\{x'_i : \sum_i x'_i = 1, \, x'_i \ge 0\}$ such that $\sum_i x'_i w_i < \sum_i x^*_i w_i$. Let $j \in \{1, \ldots, m\}$ be the first index where $x'_j \neq x^*_j$; then it is clear that $x'_j > x^*_j$. If $j = 1$:
$$\sum_{i=1}^{m} |x'_i - b_i| = |x'_1 - b_1| + \sum_{i=2}^{m} |x'_i - b_i| > \frac{d}{2} + \sum_{i=2}^{m} (b_i - x'_i) = \frac{d}{2} + x'_1 - b_1 > d. \quad (65)$$
If $j > 1$:
$$\sum_{i=1}^{m} |x'_i - b_i| = |x'_1 - b_1| + \sum_{i=j}^{m} |x'_i - b_i| > \frac{d}{2} + \sum_{i=j+1}^{m} (b_i - x'_i) \ge \frac{d}{2} + x'_1 - b_1 = d. \quad (66)$$
Both cases contradict the constraint $\sum_{i=1}^{m} |x'_i - b_i| \le d$.

Lemma 8. For the value function $V^\pi_{t|P_D,P}$ defined in Eq. 10, we have that:
$$V^\pi_{t|P_D,P}(s,d) = c_{\pi_t(s,d)}(s,d) + \sum_{a \in \mathcal{A}} p_{\pi_t(s,d)}(a \mid s) \cdot \left( c_e(s,a) + \sum_{s' \in \mathcal{S}} p(s' \mid s,a) \cdot V^\pi_{t+1|P_D,P}(s', \pi_t(s,d)) \right) \quad (67)$$
Proof.
$$V^\pi_{t|P_D,P}(s,d) \overset{(i)}{=} \bar{c}(s,d) + \sum_{s' \in \mathcal{S}} p\big( s', \pi_t(s,d) \mid (s,d) \big) \, V^\pi_{t+1|P_D,P}(s', \pi_t(s,d))$$
$$\overset{(ii)}{=} \sum_{a \in \mathcal{A}} p_{\pi_t(s,d)}(a \mid s) \, c_e(s,a) + c_c(\pi_t(s,d)) + c_x(\pi_t(s,d), d) + \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(s' \mid s,a) \, p_{\pi_t(s,d)}(a \mid s) \, V^\pi_{t+1|P_D,P}(s', \pi_t(s,d))$$
$$\overset{(iii)}{=} c_{\pi_t(s,d)}(s,d) + \sum_{a \in \mathcal{A}} p_{\pi_t(s,d)}(a \mid s) \cdot \left( c_e(s,a) + \sum_{s' \in \mathcal{S}} p(s' \mid s,a) \cdot V^\pi_{t+1|P_D,P}(s', \pi_t(s,d)) \right), \quad (68)$$
where (i) is the standard Bellman equation in the standard MDP defined with dynamics 3 and costs 4, (ii) follows by replacing $\bar{c}$ and $p$ with Eqs. 3 and 4, and (iii) follows from $c_{d'}(s,d) = c_c(d') + c_x(d',d)$.

Lemma 9. $\min\{T, a+b\} \le \min\{T, a\} + \min\{T, b\}$ for $T, a, b \ge 0$.

Proof. Assume, without loss of generality, that $a \le b$, so that $a \le b \le a + b$.
Then,
$$\min\{T, a+b\} = \begin{cases} T \le a + b = \min\{T,a\} + \min\{T,b\} & \text{if } a \le b \le T \le a+b \\ T \le a + T = \min\{T,a\} + \min\{T,b\} & \text{if } a \le T \le b \le a+b \\ T \le 2T = \min\{T,a\} + \min\{T,b\} & \text{if } T \le a \le b \le a+b \\ a + b = \min\{T,a\} + \min\{T,b\} & \text{if } a \le b \le a+b \le T \end{cases} \quad (69)$$

C Implementation of UCRL2 in the finite horizon setting

ALGORITHM 2: Modified UCRL2 algorithm for a finite horizon MDP $M = (\mathcal{S}, \mathcal{A}, P, C, L)$.
Require: Cost $C = [c(s,a)]$, confidence parameter $\delta \in (0,1)$.
1: $(\{N_k(s,a)\}, \{N_k(s,a,s')\}) \leftarrow$ InitializeCounts()
2: for $k = 1, \ldots, K$ do
3:   for $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$ do
4:     if $N_k(s,a) > 0$ then $\hat{p}_k(s' \mid s,a) \leftarrow \frac{N_k(s,a,s')}{N_k(s,a)}$ else $\hat{p}_k(s' \mid s,a) \leftarrow \frac{1}{|\mathcal{S}|}$
5:     $\beta_k(s,a,\delta) \leftarrow \sqrt{\dfrac{14 |\mathcal{S}| \log\big( 2(k-1)L|\mathcal{A}||\mathcal{S}| / \delta \big)}{\max\{1, N_k(s,a)\}}}$
6:   end for
7:   $\pi_k \leftarrow$ ExtendedValueIteration($\hat{p}_k$, $\beta_k$, $C$)
8:   $s_0 \leftarrow$ InitialConditions()
9:   for $t = 0, \ldots, L-1$ do
10:    Take action $a_t = \pi^k_t(s_t)$ and observe the next state $s_{t+1}$.
11:    $N_k(s_t, a_t) \leftarrow N_k(s_t, a_t) + 1$
12:    $N_k(s_t, a_t, s_{t+1}) \leftarrow N_k(s_t, a_t, s_{t+1}) + 1$
13:  end for
14: end for
15: Return $\pi_K$

ALGORITHM 3: It implements ExtendedValueIteration, which is used in Algorithm 2.
Require: Empirical transition distribution $\hat{p}(\cdot \mid s,a)$, cost $c(s,a)$, and confidence interval $\beta(s,a,\delta)$.
1: $\pi \leftarrow$ InitializePolicy(), $v \leftarrow$ InitializeValueFunction()
2: $n \leftarrow |\mathcal{S}|$
3: for $t = L-1, \ldots, 0$ do
4:   for $s \in \mathcal{S}$ do
5:     for $a \in \mathcal{A}$ do
6:       $s'_1, \ldots, s'_n \leftarrow$ Sort($v_{t+1}$)  # $v_{t+1}(s'_1) \le \cdots \le v_{t+1}(s'_n)$
7:       $p(s'_1) \leftarrow \min\{1, \hat{p}(s'_1 \mid s,a) + \frac{\beta(s,a,\delta)}{2}\}$
8:       $p(s'_i) \leftarrow \hat{p}(s'_i \mid s,a)$ for all $1 < i \le n$
9:       $l \leftarrow n$
10:      while $\sum_{s'_i \in \mathcal{S}} p(s'_i) > 1$ do
11:        $p(s'_l) \leftarrow \max\{0, 1 - \sum_{s'_i \neq s'_l} p(s'_i)\}$
12:        $l \leftarrow l - 1$
13:      end while
14:      $q(s,a) \leftarrow c(s,a) + \mathbb{E}_{s' \sim p}[v_{t+1}(s')]$
15:    end for
16:    $v_t(s) \leftarrow \min_{a \in \mathcal{A}} q(s,a)$
17:    $\pi_t(s) \leftarrow \arg\min_{a \in \mathcal{A}} q(s,a)$
18:  end for
19: end for
20: Return $\pi$

D Distribution of cell types and traffic levels in the lane driving environment

Table 1: Probability of cell types based on traffic level.

            road   grass  stone  car
    no-car  0.7    0.2    0.1    0
    light   0.6    0.2    0.1    0.1
    heavy   0.5    0.2    0.1    0.2

Table 2: Probability of traffic levels based on the previous row.

            no-car  light  heavy
    no-car  0.99    0.01   0
    light   0.01    0.98   0.01
    heavy   0       0.01   0.99

E Performance of the human and machine agents in the obstacle avoidance task

Figure 8: Performance of the machine policy, a human policy with $\sigma_H = 2$, and the optimal policy in terms of total environment cost ($\times 10^3$). In panel (a), the episodes start with an initial traffic level $\gamma_0 = $ no-car and, in panel (b), the episodes start with an initial traffic level $\gamma_0 \in \{$light, heavy$\}$.

F The amount of human control for different initial traffic levels

Figure 9: Human control rate under the UCRL2-MC switching algorithm for different initial traffic levels, for $(c_x = 0, c_c(H) = 0)$ and $(c_x = 0.1, c_c(H) = 0.2)$. For each traffic level, we sample 500 environments and average the human control rate over them. A higher traffic level results in more human control, as the human agent is more reliable in heavier traffic.
Figure 10: Ratio of UCRL2-MC regret to UCRL2 regret for (a) a range of action-space sizes and (b) different numbers of agents. As the action space grows, the performance of UCRL2-MC degrades but remains within the same scale. In addition, UCRL2-MC outperforms UCRL2 in environments with a larger number of agents.

G Additional Experiments

In this section, we run additional experiments in the RiverSwim environment to investigate the effect of the action-space size and of the number of agents in a team on the total regret.

G.1 Action space size

To study the effect of the action-space size on the total regret, we artificially increase the number of actions by planning $m$ steps ahead. More concretely, we consider a new MDP in which each time step consists of $m$ steps of the original RiverSwim MDP, and the switching policy decides all $m$ steps at once. The number of actions in the new MDP grows to $2^m$, while the state space remains unchanged. We consider a setting with a single team of two agents with $p = 0$ and $p = 1$, i.e., one agent always takes action right and the other always takes action left. We run the simulations for 20,000 episodes with $m \in \{1, 2, 3, 4\}$, i.e., with action sizes 2, 4, 8 and 16. Each experiment is repeated five times. We compare the performance of our algorithm against UCRL2 in terms of total regret. Figure 10(a) summarizes our results: the performance of UCRL2-MC degrades as the number of actions increases, since the regret bound depends directly on the action size (Theorem 2). However, the regret ratio still remains within the same scale even after doubling the number of actions.
One reason is that our algorithm only needs to learn the actions taken by the agents in order to find the optimal switching policy. If the agents' policies cover only a small subset of actions, our algorithm will maintain a small regret bound even in environments with a huge action space. Therefore, we believe a more careful analysis could improve our regret bound by making it a function of the agents' action space rather than of the whole action space.

G.2 Number of agents

Here, our goal is to examine the impact of the number of agents on the total regret achieved by our algorithm. To this end, we consider the original RiverSwim MDP (i.e., two actions) with a single team of $n$ agents, where we run our simulations for $n \in \{3, 4, \ldots, 10\}$ and 20,000 episodes for each $n$. We choose $p$, i.e., the probability of taking action right, for the $n$ agents as $\{0, \frac{1}{n-1}, \ldots, \frac{n-2}{n-1}, 1\}$. As shown in Figure 10(b), UCRL2-MC outperforms UCRL2 as the number of agents increases. This agrees with Theorem 2, as our derived regret bound depends mainly on the action-space size $|\mathcal{A}|$, while the UCRL2 regret bound depends on the number of agents $|\mathcal{D}|$.
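The two experimental constructions above are simple to reproduce. As our own illustrative sketch (not the paper's code): the $m$-step look-ahead MDP of Section G.1 has action set $\mathcal{A}^m$, so with the two RiverSwim actions its size is $2^m$, and the team of Section G.2 assigns the $n$ agents right-probabilities evenly spaced in $\{0, \frac{1}{n-1}, \ldots, 1\}$:

```python
from itertools import product

def augmented_actions(base_actions, m):
    """Action set of the m-step look-ahead MDP: all length-m
    sequences of base actions, of size |A|**m."""
    return list(product(base_actions, repeat=m))

def team_right_probabilities(n):
    """Probability of taking `right` for each of the n >= 2 agents,
    evenly spaced between the always-left and always-right agents."""
    return [i / (n - 1) for i in range(n)]
```

For RiverSwim, `augmented_actions(["left", "right"], 4)` yields the 16-action MDP used in Figure 10(a), and `team_right_probabilities(10)` yields the largest team of Figure 10(b).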