Multi-class Generalized Binary Search for Active Inverse Reinforcement Learning


Francisco S. Melo (INESC-ID / Instituto Superior Técnico, Portugal), fmelo@inesc-id.pt
Manuel Lopes (INRIA Bordeaux Sud-Ouest, France), manuel.lopes@inria.fr

Abstract

This paper addresses the problem of learning a task from demonstration. We adopt the framework of inverse reinforcement learning, where tasks are represented in the form of a reward function. Our contribution is a novel active learning algorithm that enables the learning agent to query the expert for more informative demonstrations, thus leading to more sample-efficient learning. For this novel algorithm (Generalized Binary Search for Inverse Reinforcement Learning, or GBS-IRL), we provide a theoretical bound on sample complexity and illustrate its applicability on several different tasks. To our knowledge, GBS-IRL is the first active IRL algorithm with provable sample complexity bounds. We also discuss our method in light of other existing methods in the literature and its general applicability in multi-class classification problems. Finally, motivated by recent work on learning from demonstration in robots, we also discuss how different forms of human feedback can be integrated in a transparent manner in our learning framework.

1 Introduction

Social learning, where an agent uses information provided by other individuals to polish or acquire new skills, is likely to become one primary form of programming such complex intelligent systems (Schaal, 1999). Paralleling the social learning ability of human infants, an artificial system can retrieve a large amount of task-related information by observing and/or interacting with other agents engaged in relevant activities. For example, the behavior of an expert can bias an agent's exploration of the environment, improve its knowledge of the world, or even lead it to reproduce parts of the observed behavior (Melo et al, 2007).

In this paper we are particularly interested in learning from demonstration. This form of social learning is commonly associated with imitation and emulation behaviors in nature (Lopes et al, 2009a). It is also possible to find numerous successful examples of robot systems that learn from demonstration (see the surveys of Argall et al, 2009; Lopes et al, 2010). In the simplest form of interaction, the demonstration may consist of examples of the right action to take in different situations.

In our approach to learning from demonstration we adopt the formalism of inverse reinforcement learning (IRL), where the task is represented as a reward function (Ng and Russell, 2000). From this representation, the agent can then construct its own policy and solve the target task. However, unlike many systems that learn from demonstration, in this paper we propose to combine ideas from active learning (Settles, 2009) with IRL, in order to reduce the data requirements during learning. In fact, many agents able to learn from demonstration are designed to process batches of data, typically acquired before any actual learning takes place. Such a data acquisition process fails to take advantage of any information the learner may acquire in early stages of learning to guide the acquisition of new data. Several recent works have proposed that a more interactive approach to learning may actually lead to improved learning performance.
We adopt a Bayesian approach to IRL, following Ramachandran and Amir (2007), and allow the learning agent to actively select and query the expert for the desired behavior at the most informative situations. We contribute a theoretical analysis of our algorithm that provides a bound on the sample complexity of our learning approach, and illustrate our method in several problems from the IRL literature.

Finally, even if learning from demonstration is the main focus of our paper and an important skill for intelligent agents interacting with human users, the ability to accommodate different forms of feedback is also useful. In fact, there are situations where the user may be unable to properly demonstrate the intended behavior and, instead, prefers to describe a task in terms of a reward function, as is customary in reinforcement learning (Sutton and Barto, 1998). As an example, suppose that the user wants the agent to learn how to navigate a complex maze. The user may experience difficulties in navigating the maze herself and may, instead, allow the agent to explore the maze and reward it for exiting the maze.

Additionally, recent studies on the behavior of naïve users when instructing agents (namely, robots) showed that the feedback provided by humans is often ambiguous and does not map in any obvious manner to either a reward function or a policy (Thomaz and Breazeal, 2008; Cakmak and Thomaz, 2010). For instance, it was observed that human users tend to provide learning agents with anticipatory or guidance rewards, a situation seldom considered in reinforcement learning (Thomaz and Breazeal, 2008). These studies conclude that robust agents able to successfully learn from human users should be flexible enough to accommodate different forms of feedback from the user.

In order to address the issues above, we discuss how other forms of expert feedback (beyond policy information) may be integrated in a seamless manner in our IRL framework, so that the learner is able to recover the target task efficiently. In particular, we show how to combine both policy and reward information in our learning algorithm. Our approach thus provides a useful bridge between reinforcement learning (or learning by trial and error) and imitation learning (or learning from demonstration), a line of work seldom explored in the literature (see, however, the works of Knox and Stone, 2010, 2011, and the discussion in Section 1.1).

The paper is organized as follows. In the remainder of this section, we provide an overview of related work on social learning, particularly on learning from demonstration. We also discuss relevant research in IRL and active learning, and discuss our contributions in light of existing work. Section 2 revisits core background concepts, introducing the notation used throughout the paper. Section 3 introduces our active IRL algorithm, GBS-IRL, and provides a theoretical analysis of its sample complexity. Section 4 illustrates the application of GBS-IRL in several problems of different complexity, providing an empirical comparison with other methods in the literature. Finally, Section 5 concludes the paper, discussing directions for future research.

1.1 Related Work

There is extensive literature reporting research on intelligent agents that learn from expert advice.
Many examples feature robotic agents that learn simple tasks from different forms of human feedback. Examples include the robot Leonardo, which is able to learn new tasks by observing changes induced in the world (as perceived by the robot) by a human demonstrating the target task (Breazeal et al, 2004). During learning, Leonardo provides additional feedback on its current understanding of the task, which the human user can then use to provide additional information. We refer to the surveys of Argall et al (2009) and Lopes et al (2010) for a comprehensive discussion of learning from demonstration.

In this paper, as already mentioned, we adopt the inverse reinforcement learning (IRL) formalism introduced in the seminal paper by Ng and Russell (2000). One appealing aspect of the IRL approach to learning from demonstration is that the learner is not just "mimicking" the observed actions. Instead, the learner infers the purpose behind the observed behavior and sets such purpose as its goal. IRL also enables the learner to accommodate differences between itself and the demonstrator (Lopes et al, 2009a).

The appealing features discussed above have led several researchers to address learning from demonstration from an IRL perspective. Abbeel and Ng (2004) explored inverse reinforcement learning in a context of apprenticeship learning, where the purpose of the learning agent is to replicate the behavior of the demonstrator, but it is only able to observe a sequence of states experienced during task execution. The IRL formalism allows the learner to reason about which tasks could lead the demonstrator to visit the observed states and infer how to replicate the inferred behavior. Syed et al (Syed et al, 2008; Syed and Schapire, 2008) have further explored this line of reasoning from a game-theoretic perspective, and proposed algorithms to learn from demonstration with provable guarantees on the performance of the learner.

Ramachandran and Amir (2007) introduced Bayesian inverse reinforcement learning (BIRL), where the IRL problem is cast as a Bayesian inference problem. Given a prior distribution over possible target tasks, the algorithm uses the demonstration by the expert as evidence to compute the posterior distribution over tasks and identify the target task. Unfortunately, the Markov chain Monte Carlo (MCMC) algorithm used to approximate the posterior distribution is computationally expensive, as it requires extensive sampling of the space of possible rewards. To avoid such complexity, several posterior works have departed from the BIRL formulation and instead determine the task that maximizes the likelihood of the observed demonstration (Lopes et al, 2009b; Babes et al, 2011). The aforementioned maximum likelihood approaches of Lopes et al (2009b) and Babes et al (2011) take advantage of the underlying IRL problem structure and derive simple gradient-based algorithms to determine the maximum likelihood task representation. Two closely related works are the maximum entropy approach of Ziebart et al (2008) and the gradient IRL approach of Neu and Szepesvari (2007).
While the former selects the task representation that maximizes the likelihood of the observed expert behavior under the maximum entropy distribution, the latter also explores a gradient-based approach to IRL, but the task representation is selected so as to induce a behavior as similar as possible to the expert behavior. Finally, Ross and Bagnell (2010) propose a learning algorithm that reduces imitation learning to a classification problem. The classifier prescribes the best action to take in each possible situation that the learner can encounter, and is successively improved by enriching the data set used to train the classifier.

All the above works are designed to learn from whatever data is available to them at learning time, data that is typically acquired before any actual learning takes place. Such a data acquisition process fails to take advantage of the information that the learner acquires in early stages of learning to guide the acquisition of new, more informative data. Active learning aims to reduce the data requirements of learning algorithms by actively selecting potentially informative samples, in contrast with random sampling from a predefined distribution (Settles, 2009). In the case of learning from demonstration, active learning can be used to reduce the number of situations that the expert/human user is required to demonstrate. Instead, the learner should proactively ask the expert to demonstrate the desired behavior at the most informative situations.

Confidence-based autonomy (CBA), proposed by Chernova and Veloso (2009), also enables a robot to learn a task from a human user by building a mapping between situations that the robot has encountered and the adequate actions. This work already incorporates a mechanism that enables the learner to ask the expert for the right action when it encounters a situation in which it is less confident about the correct behavior. The system also allows the human user to provide corrective feedback as the robot executes the learned task (related ideas are further explored in the dogged learning architecture of Grollman and Jenkins, 2007). The querying strategy in CBA can be classified both as stream-based and as mellow (see the discussions in the surveys of Settles, 2009; Dasgupta, 2011). Stream-based, since the learner is presented with a stream of samples (in the case of CBA, samples correspond to possible situations) and only asks for the labels (i.e., correct actions) of those samples it feels uncertain about. Mellow, since it does not seek highly informative samples, but queries any sample that is at all informative.

In the IRL literature, active learning was first explored in a preliminary version of this paper (Lopes et al, 2009b). In this early version, the learner actively queries the expert for the correct action in those states where it is most uncertain about the correct behavior. Unlike CBA, this active sampling approach is aggressive and uses membership query synthesis. Aggressive, since it actively selects highly informative samples. And, unlike CBA, it can select ("synthesize") queries from the whole input space. Judah et al (2011) propose a very similar approach, the imitation query-by-committee (IQBC) algorithm, which differs from the previous active sampling approach only in that the learner is able to accommodate the notion of "bad states", i.e., states to be avoided during task execution.
Cohn et al (2011) propose another closely related approach that, however, uses a different criterion to select which situations to query. EMG-AQS (Expected Myopic Gain Action Querying Strategy) queries the expert for the correct action in those states where the expected gain of information is potentially larger. Unfortunately, as discussed by Cohn et al (2011), the determination of the expected gain of information requires extensive computation, rendering EMG-AQS computationally costly. On a different line of work, Ross et al (2011) and Judah et al (2012) address imitation learning using a no-regret framework, and propose algorithms for direct imitation learning with provable bounds on the regret. Finally, Melo and Lopes (2010) use active learning in a metric approach to learning from demonstration.

Our approach in this paper is a modified version of our original active sampling algorithm (Lopes et al, 2009b). We depart from the generalized binary search (GBS) algorithm of Nowak (2011) and adapt it to the IRL setting. To this purpose, we cast IRL as a (multi-class) classification problem and extend the GBS algorithm of Nowak (2011) to this multi-class setting. We analyze the sample complexity of our GBS-IRL approach, thus providing the first active IRL algorithm with provable bounds on sample complexity. Also, to the extent of our knowledge, GBS-IRL is the first aggressive active learning algorithm for non-separable, multi-class data (Dasgupta, 2011).

We conclude this discussion of related work by pointing out that all the above works describe systems that learn from human feedback. However, other forms of expert advice have also been explored in the agent learning literature. Price and Boutilier (1999, 2003) have explored how a learning agent can improve its performance by observing other similar agents, in what could be seen as "implicit" imitation learning. In these works, the demonstrator is, for all purposes, oblivious to the fact that its actions are being observed and learned from. Instead, the learner observes the behavior of the other agents and extracts information that may be useful for its own learning (for example, it may extract useful information about the world dynamics). In a more general setting, Barto and Rosenstein (2004) discuss how different forms of supervisory information can be integrated in a reinforcement learning architecture to improve learning. Finally, Knox and Stone (2009, 2010) introduce the TAMER paradigm, which enables a reinforcement learning agent to use human feedback (in addition to its reinforcement signal) to guide its learning process.

1.2 Contributions

Our contributions can be summarized as follows:

• A novel active IRL algorithm, GBS-IRL, that extends generalized binary search to a multi-class setting in the context of IRL.

• The sample-complexity analysis of GBS-IRL. We establish, under suitable conditions, the exponential convergence of our active learning method as a function of the number of samples. As pointed out earlier, to our knowledge ours is the first work providing sample complexity bounds on active IRL. Several experimental results confirm the good sample performance of our approach.
• A general discussion of how different forms of expert information (namely action and reward information) can be integrated in our IRL setting. We illustrate the applicability of our ideas in several simple scenarios and discuss the applicability of these different sources of information in the face of our empirical results.

From a broader perspective, our analysis is a non-trivial extension of the results of Nowak (2011) to a multi-class setting, with applications not only to IRL but to any multi-class classification problem.

2 Background and Notation

This section introduces background material on Markov decision processes and the Bayesian inverse reinforcement learning formalism, upon which our contributions are developed.

2.1 Markov Decision Processes

A Markov decision problem (MDP) describes a sequential decision problem in which an agent must choose the sequence of actions that maximizes some reward-based optimization criterion. Formally, an MDP $M$ is a tuple $M = (\mathcal{X}, \mathcal{A}, \mathsf{P}, r, \gamma)$, where $\mathcal{X}$ represents the state-space, $\mathcal{A}$ the finite action space, $\mathsf{P}$ represents the transition probabilities, $r$ is the reward function and $\gamma$ is a positive discount factor. $\mathsf{P}(y \mid x, a)$ denotes the probability of transitioning from state $x$ to state $y$ when action $a$ is taken, i.e.,

$$\mathsf{P}(y \mid x, a) = \mathbb{P}[X_{t+1} = y \mid X_t = x, A_t = a],$$

where each $X_t$, $t = 1, \ldots$, is a random variable (r.v.) denoting the state of the process at time-step $t$ and $A_t$ is a r.v. denoting the action of the agent at time-step $t$.

A policy is a mapping $\pi : \mathcal{X} \times \mathcal{A} \to [0, 1]$, where $\pi(x, a)$ is the probability of choosing action $a \in \mathcal{A}$ in state $x \in \mathcal{X}$. Formally, $\pi(x, a) = \mathbb{P}[A_t = a \mid X_t = x]$. It is possible to associate with any such policy $\pi$ a value function,

$$V^\pi(x) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r(X_t, A_t) \;\Big|\; X_0 = x\right],$$

where the expectation is taken over the possible trajectories of $\{X_t\}$ induced by policy $\pi$. The purpose of the agent is then to select a policy $\pi^*$ such that $V^{\pi^*}(x) \geq V^\pi(x)$ for all $x \in \mathcal{X}$. Any such policy is an optimal policy for that MDP, and the corresponding value function is denoted by $V^*$.

Given any policy $\pi$, the following recursion holds:

$$V^\pi(x) = r_\pi(x) + \gamma \sum_{y \in \mathcal{X}} \mathsf{P}_\pi(x, y) V^\pi(y),$$

where $\mathsf{P}_\pi(x, y) = \sum_{a \in \mathcal{A}} \pi(x, a) \mathsf{P}_a(x, y)$ and $r_\pi(x) = \sum_{a \in \mathcal{A}} \pi(x, a) r(x, a)$. For the particular case of the optimal policy $\pi^*$, the above recursion becomes

$$V^*(x) = \max_{a \in \mathcal{A}} \left[ r(x, a) + \gamma \sum_{y \in \mathcal{X}} \mathsf{P}_a(x, y) V^*(y) \right].$$

We also define the $Q$-function associated with a policy $\pi$ as

$$Q^\pi(x, a) = r(x, a) + \gamma \sum_{y \in \mathcal{X}} \mathsf{P}_a(x, y) V^\pi(y),$$

which, in the case of the optimal policy, becomes

$$Q^*(x, a) = r(x, a) + \gamma \sum_{y \in \mathcal{X}} \mathsf{P}_a(x, y) V^*(y) = r(x, a) + \gamma \sum_{y \in \mathcal{X}} \mathsf{P}_a(x, y) \max_{b \in \mathcal{A}} Q^*(y, b). \tag{1}$$
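To make the quantities in (1) concrete, the short sketch below computes $Q^*$ by value iteration for a small synthetic MDP; the array shapes, the random example data and the convergence tolerance are illustrative choices, not part of the paper.

```python
import numpy as np

def optimal_q(P, r, gamma, tol=1e-8):
    """Compute Q* for a finite MDP by value iteration.

    P : array of shape (|A|, |X|, |X|), P[a, x, y] = P_a(x, y)
    r : array of shape (|X|, |A|), r[x, a] = reward for taking a in x
    gamma : discount factor in (0, 1)
    """
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)                       # V*(x) = max_b Q*(x, b)
        # Fixed-point update of equation (1): Q(x,a) = r(x,a) + gamma * sum_y P_a(x,y) V(y)
        Q_new = r + gamma * np.einsum("axy,y->xa", P, V)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# Tiny example: 3 states, 2 actions, random transition kernel.
rng = np.random.default_rng(0)
P = rng.random((2, 3, 3)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((3, 2))
Q_star = optimal_q(P, r, gamma=0.95)
greedy_actions = Q_star.argmax(axis=1)          # one greedy action per state
```

The greedy actions extracted on the last line are exactly the objects used in the next section to define the action sets associated with each candidate reward.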
2.2 Bayesian Inverse Reinforcement Learning

As seen above, an MDP describes a sequential decision problem in which an agent must choose its actions so as to maximize the total discounted reward. In this sense, the reward function in an MDP encodes the task of the agent. Inverse reinforcement learning (IRL) deals with the problem of recovering the task representation (i.e., the reward function) given a demonstration of the behavior to be learned (i.e., the desired policy).

In this paper we adopt the formulation of Ramachandran and Amir (2007), where IRL is cast as a Bayesian inference problem in which the agent is provided with samples of the desired policy, $\pi^*$, and must identify the target reward function, $r^*$, from a general set of possible functions $\mathcal{R}$. Prior to the observation of any policy sample and given any measurable set $R \subset \mathcal{R}$, the initial belief that $r^* \in R$ is encoded in the form of a probability density function $\rho$ defined on $\mathcal{R}$, i.e.,

$$\mathbb{P}[r^* \in R] = \int_R \rho(r)\, dr.$$

As discussed by Ramachandran and Amir (2007) and Lopes et al (2009b), it is generally impractical to explicitly maintain and update $\rho$. Instead, as in the aforementioned works, we work with a finite (but potentially very large) sample of $\mathcal{R}$ obtained according to $\rho$. We denote this sample by $\mathcal{R}_\rho$, and associate with each element $r_k \in \mathcal{R}_\rho$ a prior probability $p_0(r_k)$ given by

$$p_0(r_k) = \frac{\rho(r_k)}{\sum_i \rho(r_i)}.$$

Associated with each reward $r_k \in \mathcal{R}_\rho$ and each $x \in \mathcal{X}$, we define the set of greedy actions at $x$ with respect to $r_k$ as

$$\mathcal{A}_k(x) = \{a \in \mathcal{A} \mid a \in \operatorname{argmax}_b Q_k(x, b)\},$$

where $Q_k$ is the $Q$-function associated with the optimal policy for $r_k$, as defined in (1). From the sets $\mathcal{A}_k(x)$, $x \in \mathcal{X}$, we define the greedy policy with respect to $r_k$ as the mapping $\pi_k : \mathcal{X} \times \mathcal{A} \to [0, 1]$ given by

$$\pi_k(x, a) = \frac{\mathbb{I}_{\mathcal{A}_k(x)}(a)}{|\mathcal{A}_k(x)|},$$

where we write $\mathbb{I}_U$ to denote the indicator function of a set $U$. In other words, for each $x \in \mathcal{X}$, the greedy policy with respect to $r_k$ is defined as a probability distribution that is uniform on $\mathcal{A}_k(x)$ and zero on its complement. We assume, without loss of generality, that for any $r_i, r_j \in \mathcal{R}_\rho$, $\mathcal{A}_i(x) \neq \mathcal{A}_j(x)$ for at least one $x \in \mathcal{X}$.[2]

For any $r_k \in \mathcal{R}_\rho$, consider a perturbed version of $\pi_k$ where, for each $x \in \mathcal{X}$, action $a \in \mathcal{A}$ is selected with probability

$$\hat{\pi}_k(x, a) = \begin{cases} \beta_k(x) & \text{if } a \notin \mathcal{A}_k(x) \\ \gamma_k(x) & \text{if } a \in \mathcal{A}_k(x), \end{cases} \tag{2}$$

where, typically, $\beta_k(x) < \gamma_k(x)$.[3] We note that both $\pi_k$ and the uniform policy can be obtained as limits of $\hat{\pi}_k$, by setting $\beta_k(x) = 0$ or $\beta_k(x) = \gamma_k(x)$, respectively.

Following the Bayesian IRL paradigm, the likelihood of observing an action $a$ by the demonstrator at state $x$, given that the target task is $r_k$, is now given by

$$\ell_k(x, a) = \mathbb{P}[A_t = a \mid X_t = x, r^* = r_k] = \hat{\pi}_k(x, a). \tag{3}$$

[2] This assumption merely ensures that there are no redundant rewards in $\mathcal{R}_\rho$. If two such rewards $r_i, r_j$ existed in $\mathcal{R}_\rho$, we could safely discard one of the two, say $r_j$, setting $p_0(r_i) \leftarrow p_0(r_i) + p_0(r_j)$.

[3] Policy $\hat{\pi}_k$ assigns the same probability, $\gamma_k(x)$, to all actions that, for the particular reward $r_k$, are optimal in state $x$. Similarly, it assigns the same probability, $\beta_k(x)$, to all corresponding sub-optimal actions. This perturbed version of $\pi_k$ is convenient both for its simplicity and because it facilitates our analysis. However, other versions of perturbed policies have been considered in the IRL literature (see, for example, the works of Ramachandran and Amir, 2007; Neu and Szepesvari, 2007; Lopes et al, 2009b).
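As one concrete reading of (2)-(3), the sketch below builds the greedy action sets and the perturbed-policy likelihood for a list of candidate rewards; it assumes the Q-functions were computed beforehand (e.g., with a value-iteration routine like the one sketched in Section 2.1) and, for illustration only, treats the noise levels as state-independent constants.

```python
import numpy as np

def greedy_sets(Q, tol=1e-9):
    """A_k(x): boolean mask of optimal actions per state for one candidate reward.
    Q has shape (|X|, |A|)."""
    return Q >= Q.max(axis=1, keepdims=True) - tol

def perturbed_policy(opt_mask, beta, gamma):
    """Perturbed demonstrator model of equation (2): probability gamma for each
    greedy action and beta for each sub-optimal action (state-independent here)."""
    return np.where(opt_mask, gamma, beta)

# Toy example: 3 candidate rewards for an MDP with 4 states and 3 actions.
rng = np.random.default_rng(1)
Q_list = [rng.random((4, 3)) for _ in range(3)]   # e.g. from value iteration on each r_k
likelihoods = [perturbed_policy(greedy_sets(Q), beta=0.05, gamma=0.9)
               for Q in Q_list]
# likelihoods[k][x, a] = l_k(x, a) in equation (3)
```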
Given a history of $t$ (independent) observations, $\mathcal{F}_t = \{(x_\tau, a_\tau), \tau = 0, \ldots, t\}$, the likelihood in (3) can now be used in a standard Bayesian update to compute, for every $r_k \in \mathcal{R}_\rho$, the posterior probability

$$p_t(r_k) = \frac{\mathbb{P}[r^* = r_k]\, \mathbb{P}[\mathcal{F}_t \mid r^* = r_k]}{Z} = \frac{p_0(r_k) \prod_{\tau=0}^{t} \ell_k(x_\tau, a_\tau)}{Z},$$

where $Z$ is a normalization constant.

For the particular case of $r^*$ we write the corresponding perturbed policy as

$$\hat{\pi}^*(x, a) = \begin{cases} \beta^*(x) & \text{if } a \notin \mathcal{A}^*(x) \\ \gamma^*(x) & \text{if } a \in \mathcal{A}^*(x), \end{cases}$$

and denote the maximum noise level by the positive constant $\alpha$ defined as $\alpha = \sup_{x \in \mathcal{X}} \beta^*(x)$.

3 Multi-class Active Learning for Inverse Reinforcement Learning

In this section we introduce our active learning approach to IRL.

3.1 Preliminaries

To develop an active learning algorithm for this setting, we convert the problem of determining $r^*$ into an equivalent classification problem. This mostly amounts to rewriting the Bayesian IRL problem from Section 2 using a different notation.

We define the hypothesis space $\mathcal{H}$ as follows. For every $r_k \in \mathcal{R}_\rho$, the $k$th hypothesis $h_k : \mathcal{X} \to \{-1, 1\}^{|\mathcal{A}|}$ is defined as the function

$$h_k(x, a) = 2\, \mathbb{I}_{\mathcal{A}_k(x)}(a) - 1,$$

where we write $h_k(x, a)$ to denote the $a$th component of $h_k(x)$. Intuitively, $h_k(x)$ identifies (with a value of $1$) the greedy actions in $x$ with respect to $r_k$, assigning a value of $-1$ to all other actions. We take $\mathcal{H}$ as the set of all such functions $h_k$. Note that, since every reward prescribes at least one optimal action per state, it holds that for every $h \in \mathcal{H}$ and every $x \in \mathcal{X}$ there is at least one $a \in \mathcal{A}$ such that $h(x, a) = 1$. We write $h^*$ to denote the target hypothesis, corresponding to $r^*$. As before, given a hypothesis $h \in \mathcal{H}$, we define the set of greedy actions at $x$ according to $h$ as

$$\mathcal{A}_h(x) = \{a \in \mathcal{A} \mid h(x, a) = 1\}.$$

For an indexed set of samples, $\{(x_\lambda, a_\lambda), \lambda \in \Lambda\}$, we write $h_\lambda$ to denote $h(x_\lambda, a_\lambda)$ when the index set is clear from the context.

The prior distribution $p_0$ over $\mathcal{R}_\rho$ induces an equivalent distribution over $\mathcal{H}$, which we abusively also denote by $p_0$, and which is such that $p_0(h_k) = p_0(r_k)$. We let the history of observations up to time-step $t$ be $\mathcal{F}_t = \{(x_\tau, a_\tau), \tau = 0, \ldots, t\}$, and let $\beta_h$ and $\gamma_h$ be the estimates of $\beta^*$ and $\gamma^*$ associated with the hypothesis $h$. Then, the distribution over $\mathcal{H}$ after observing $\mathcal{F}_t$ can be updated using Bayes rule as

$$p_t(h) \triangleq \mathbb{P}[h^* = h \mid \mathcal{F}_t] \propto \mathbb{P}[a_t \mid x_t, h^* = h, \mathcal{F}_{t-1}]\, \mathbb{P}[h^* = h \mid \mathcal{F}_{t-1}] = \mathbb{P}[a_t \mid x_t, h^* = h]\, \mathbb{P}[h^* = h \mid \mathcal{F}_{t-1}] \approx \gamma_h(x_t)^{(1+h_t)/2}\, \beta_h(x_t)^{(1-h_t)/2}\, p_{t-1}(h), \tag{4}$$

where we assume, for all $x \in \mathcal{X}$,

$$|\mathcal{A}_h(x)|\, \gamma_h(x) \leq |\mathcal{A}^*(x)|\, \gamma^*(x), \tag{5}$$

and $p_t(h)$ is normalized so that $\sum_{h \in \mathcal{H}} p_t(h) = 1$. Note that, in (4), we accommodate the possibility of having access (for each hypothesis) only to inaccurate estimates $\beta_h$ and $\gamma_h$ of $\beta^*$ and $\gamma^*$, respectively.

We consider a partition of the state-space $\mathcal{X}$ into a disjoint family of $N$ sets, $\Xi = \{\mathcal{X}_1, \ldots, \mathcal{X}_N\}$, such that all hypotheses $h \in \mathcal{H}$ are constant in each set $\mathcal{X}_i$, $i = 1, \ldots, N$. In other words, any two states $x, y$ lying in the same $\mathcal{X}_i$ are indistinguishable, since $h(x, a) = h(y, a)$ for all $a \in \mathcal{A}$ and all $h \in \mathcal{H}$. This means that our hypothesis space $\mathcal{H}$ induces an equivalence relation in $\mathcal{X}$ in which two elements $x, y \in \mathcal{X}$ are equivalent if $\{x, y\} \subset \mathcal{X}_i$. We write $[x]_i$ to denote a(ny) representative of the set $\mathcal{X}_i$.[4]

[4] While this partition is, perhaps, of little relevance in problems with a small state-space $\mathcal{X}$, it is central in problems with a large (or infinite) state-space, since the state to be queried has to be selected from a set of $N$ alternatives, instead of the (much larger) set of $|\mathcal{X}|$ alternatives.
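The sketch below shows one possible encoding of the hypotheses $h_k$ as $\pm 1$ arrays and the multiplicative Bayesian update (4); the state-independent noise estimates and the synthetic data are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np

def reward_to_hypothesis(opt_mask):
    """h_k(x, a) = 2*I[a in A_k(x)] - 1, from a boolean optimal-action mask (|X|, |A|)."""
    return 2 * opt_mask.astype(int) - 1

def bayes_update(p, H, x, a, beta_hat, gamma_hat):
    """One step of update (4) with state-independent noise estimates.

    p : current posterior over hypotheses, shape (|H|,)
    H : stacked hypotheses, shape (|H|, |X|, |A|), entries in {-1, +1}
    (x, a) : queried state and observed expert action
    """
    h_t = H[:, x, a]                                  # +1 if a is greedy under h, else -1
    lik = gamma_hat ** ((1 + h_t) / 2) * beta_hat ** ((1 - h_t) / 2)
    p_new = p * lik
    return p_new / p_new.sum()                        # normalize so the posterior sums to 1

# Toy example: 3 hypotheses over 4 states and 3 actions.
rng = np.random.default_rng(2)
masks = rng.random((3, 4, 3)) > 0.6
masks[..., 0] |= ~masks.any(axis=2)                   # ensure at least one greedy action per state
H = reward_to_hypothesis(masks)
p = np.full(3, 1 / 3)
p = bayes_update(p, H, x=1, a=2, beta_hat=0.1, gamma_hat=0.9)
```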
The following definitions extend those of Nowak (2011).

Definition 1 ($k$-neighborhood). Two sets $\mathcal{X}_i, \mathcal{X}_j \in \Xi$ are said to be $k$-neighbors if the set $\{h \in \mathcal{H} \mid \mathcal{A}_h([x]_i) \neq \mathcal{A}_h([x]_j)\}$ has, at most, $k$ elements, i.e., if there are $k$ or fewer hypotheses in $\mathcal{H}$ that output different optimal actions in $\mathcal{X}_i$ and $\mathcal{X}_j$.

Definition 2. The pair $(\mathcal{X}, \mathcal{H})$ is $k$-neighborly if, for any two sets $\mathcal{X}_i, \mathcal{X}_j \in \Xi$, there is a sequence $\{\mathcal{X}_{\ell_0}, \ldots, \mathcal{X}_{\ell_n}\} \subset \Xi$ such that

• $\mathcal{X}_{\ell_0} = \mathcal{X}_i$ and $\mathcal{X}_{\ell_n} = \mathcal{X}_j$;

• for any $m$, $\mathcal{X}_{\ell_m}$ and $\mathcal{X}_{\ell_{m+1}}$ are $k$-neighbors.

The notion of $k$-neighborhood structures the state-space $\mathcal{X}$ in terms of the hypothesis space $\mathcal{H}$, and this structure can be exploited for active learning purposes.

3.2 Active IRL using GBS

In defining our active IRL algorithm, we first consider a simplified setting in which the following assumption holds. We postpone to Section 3.3 the discussion of the more general case.

Assumption 1. For every $h \in \mathcal{H}$ and every $x \in \mathcal{X}$, $|\mathcal{A}_h(x)| = 1$.

In other words, we focus on the case where all hypotheses considered prescribe a unique optimal action per state. A single optimal action per state implies that the noise model can be simplified. In particular, the noise model can now be constant across hypotheses, since every $h \in \mathcal{H}$ prescribes the same number of optimal actions in each state (namely, one). We denote by $\hat{\gamma}(x)$ and $\hat{\beta}(x)$ the estimates of $\gamma^*$ and $\beta^*$, respectively, and consider a Bayesian update of the form

$$p_t(h) \propto \frac{1}{Z}\, \hat{\gamma}(x_t)^{(1+h_t)/2}\, \hat{\beta}(x_t)^{(1-h_t)/2}\, p_{t-1}(h), \tag{6}$$

with $1 - \hat{\gamma}(x) = (|\mathcal{A}| - 1)\, \hat{\beta}(x)$ and $Z$ an adequate normalization constant. For this simpler case, (5) becomes

$$\hat{\beta}(x) \geq \beta^*(x) \quad \text{and} \quad \hat{\gamma}(x) \leq \gamma^*(x), \tag{7}$$

where, as before, we overestimate the noise rate $\beta^*(x)$.

For a given probability distribution $p$, define the weighted prediction in $x$ as

$$W(p, x) = \max_{a \in \mathcal{A}} \sum_{h \in \mathcal{H}} p(h)\, h(x, a),$$

and the predicted action at $x$ as

$$A^*(p, x) = \operatorname{argmax}_{a \in \mathcal{A}} \sum_{h \in \mathcal{H}} p(h)\, h(x, a).$$

We are now in position to introduce a first version of our active learning algorithm for inverse reinforcement learning, which we dub Generalized Binary Search for IRL (GBS-IRL). GBS-IRL is summarized in Algorithm 1; this first version of the algorithm relies critically on Assumption 1.

Algorithm 1 GBS-IRL (version 1)
Require: MDP parameters $M \setminus r$
Require: Reward space $\mathcal{R}_\rho$
Require: Prior distribution $p_0$ over $\mathcal{R}$
1: Compute $\mathcal{H}$ from $\mathcal{R}_\rho$
2: Determine the partition $\Xi = \{\mathcal{X}_1, \ldots, \mathcal{X}_N\}$ of $\mathcal{X}$
3: Set $\mathcal{F}_0 = \emptyset$
4: for all $t = 0, \ldots$ do
5:   Set $c_t = \min_{i=1,\ldots,N} W(p_t, [x]_i)$
6:   if there are $1$-neighbor sets $\mathcal{X}_i, \mathcal{X}_j$ such that $W(p_t, [x]_i) > c_t$, $W(p_t, [x]_j) > c_t$ and $A^*(p_t, [x]_i) \neq A^*(p_t, [x]_j)$ then
7:     Sample $x_{t+1}$ from $\mathcal{X}_i$ or $\mathcal{X}_j$ with probability $1/2$
8:   else
9:     Sample $x_{t+1}$ from the set $\mathcal{X}_i$ that minimizes $W(p_t, [x]_i)$
10:  end if
11:  Obtain noisy response $a_{t+1}$
12:  Set $\mathcal{F}_{t+1} \leftarrow \mathcal{F}_t \cup \{(x_{t+1}, a_{t+1})\}$
13:  Update $p_{t+1}$ from $p_t$ using (6)
14: end for
15: return $\hat{h}_t = \operatorname{argmax}_{h \in \mathcal{H}} p_t(h)$
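The following sketch implements the query-selection rule of Algorithm 1 for hypotheses stored as $\pm 1$ arrays over one representative state per partition cell; the brute-force 1-neighbor test and the toy data are illustrative choices and not an optimized implementation of the paper's algorithm.

```python
import numpy as np

def weighted_prediction(p, H):
    """W(p, x) and A*(p, x) for every representative state.
    H has shape (|H|, N, |A|), entries in {-1, +1}; p has shape (|H|,)."""
    scores = np.einsum("h,hxa->xa", p, H)          # sum_h p(h) h(x, a)
    return scores.max(axis=1), scores.argmax(axis=1)

def select_query(p, H, rng):
    """Steps 5-10 of Algorithm 1 (with a brute-force 1-neighbor check)."""
    W, A_pred = weighted_prediction(p, H)
    c_t = W.min()
    greedy = H > 0                                  # A_h(x) as boolean masks
    n_cells = H.shape[1]
    for i in range(n_cells):
        for j in range(i + 1, n_cells):
            # X_i and X_j are 1-neighbors if at most one hypothesis disagrees between them
            disagreements = (greedy[:, i, :] != greedy[:, j, :]).any(axis=1).sum()
            if disagreements <= 1 and W[i] > c_t and W[j] > c_t and A_pred[i] != A_pred[j]:
                return rng.choice([i, j])           # query either cell with probability 1/2
    return int(W.argmin())                          # otherwise query the most uncertain cell

# Toy example with 4 hypotheses, 5 partition cells, 3 actions.
rng = np.random.default_rng(3)
masks = np.eye(3, dtype=bool)[rng.integers(0, 3, size=(4, 5))]   # one greedy action per cell
H = 2 * masks.astype(int) - 1
p = np.full(4, 0.25)
x_query = select_query(p, H, rng)
```

On the full state-space, the query would be issued to any state in the chosen cell; the sketch works directly with one representative per cell.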
In Section 3.3, we discuss how Algorithm 1 can be modified to accommodate situations in which Assumption 1 does not hold.

Our analysis of GBS-IRL relies on the following fundamental lemma, which generalizes Lemma 3 of Nowak (2011) to multi-class settings.

Lemma 1. Let $\mathcal{H}$ denote a hypothesis space defined over a set $\mathcal{X}$, where $(\mathcal{X}, \mathcal{H})$ is assumed $k$-neighborly. Define the coherence parameter for $(\mathcal{X}, \mathcal{H})$ as

$$c^*(\mathcal{X}, \mathcal{H}) \triangleq \max_{a \in \mathcal{A}} \min_{\mu} \max_{h \in \mathcal{H}} \sum_{i=1}^{N} h([x]_i, a)\, \mu(\mathcal{X}_i),$$

where $\mu$ is a probability measure over $\mathcal{X}$. Then, for any probability distribution $p$ over $\mathcal{H}$, one of the two statements below holds:

1. There is a set $\mathcal{X}_i \in \Xi$ such that $W(p, [x]_i) \leq c^*$.

2. There are two $k$-neighbor sets $\mathcal{X}_i$ and $\mathcal{X}_j$ such that $W(p, [x]_i) > c^*$, $W(p, [x]_j) > c^*$, and $A^*(p, [x]_i) \neq A^*(p, [x]_j)$.

Proof. See Appendix A.1.

This lemma states that, given any distribution over the set of hypotheses, either there is a state $[x]_i$ for which there is great uncertainty concerning the optimal action or, alternatively, there are two $k$-neighboring states $[x]_i$ and $[x]_j$ in which all but a few hypotheses predict the same action, yet the predicted optimal action is strikingly different in the two states. In either case, it is possible to select a query that is highly informative.

The coherence parameter $c^*$ is the multi-class equivalent of the coherence parameter introduced by Nowak (2011), and quantifies the informativeness of queries. That $c^*$ always exists can be established by noting that the partition of $\mathcal{X}$ is finite (since $\mathcal{H}$ is finite) and, therefore, the minimization can be conducted exactly. On the other hand, if $\mathcal{H}$ does not include trivial hypotheses that are constant all over $\mathcal{X}$, it holds that $c^* < 1$.

We are now in position to establish the convergence properties of Algorithm 1. Let $\mathbb{P}[\cdot]$ and $\mathbb{E}[\cdot]$ denote the probability measure and corresponding expectation governing the underlying probability over the noise and the possible algorithm randomizations in query selection.

Theorem 1 (Consistency of GBS-IRL). Let $\mathcal{F}_t = \{(x_\tau, a_\tau), \tau = 1, \ldots, t\}$ denote a possible history of observations obtained with GBS-IRL. If, in the update (6), $\hat{\beta}(x)$ and $\hat{\gamma}(x)$ verify (7), then

$$\lim_{t \to \infty} \mathbb{P}\big[\hat{h}_t \neq h^*\big] = 0.$$

Proof. See Appendix A.2.

Theorem 1 establishes the consistency of active learning for multi-class classification. The proof relies on a fundamental lemma that, roughly speaking, ensures that the sequence $p_t(h^*)$ is increasing in expectation. This fundamental lemma (Lemma 2 in Appendix A.2) generalizes a related result of Nowak (2011) that, due to the consideration of multiple classes in GBS-IRL, does not apply directly. Our generalization requires, in particular, stronger assumptions on the noise, $\hat{\beta}(x)$, and implies a different rate of convergence, as will soon become apparent. It is also worth mentioning that the statement in Theorem 1 could alternatively be proved using an adaptive submodularity argument (again relying on Lemma 2 in Appendix A.2), using the results of Golovin and Krause (2011).

Theorem 1 ensures that, as the number of samples increases, the probability mass concentrates on the correct hypothesis $h^*$. However, it does not provide any information concerning the rate at which $\mathbb{P}[\hat{h}_t \neq h^*] \to 0$. The convergence rate for our active sampling approach is established in the following result.
Theorem 2 (Convergence Rate of GBS-IRL). Let $\mathcal{H}$ denote our hypothesis space, defined over $\mathcal{X}$, and assume that $(\mathcal{X}, \mathcal{H})$ is 1-neighborly. If, in the update (6), $\hat{\beta}(x) > \alpha$ for all $x \in \mathcal{X}$, then

$$\mathbb{P}\big[\hat{h}_t \neq h^*\big] \leq |\mathcal{H}|\, (1 - \lambda)^t, \quad t = 0, \ldots \tag{8}$$

where $\lambda = \varepsilon \cdot \min\left\{\frac{1 - c^*}{2}, \frac{1}{4}\right\} < 1$ and

$$\varepsilon = \min_{x} \left[ \gamma^*(x)\, \frac{\hat{\gamma}(x) - \hat{\beta}(x)}{\hat{\gamma}(x)} + \beta^*(x)\, \frac{\hat{\beta}(x) - \hat{\gamma}(x)}{\hat{\beta}(x)} \right]. \tag{9}$$

Proof. See Appendix A.4.

Theorem 2 extends Theorem 4 of Nowak (2011) to the multi-class case. However, due to the existence of multiple actions (classes), the constants obtained in the above bounds differ from those obtained in the aforementioned work (Nowak, 2011). Interestingly, for $c^*$ close to zero, the convergence rate obtained is near-optimal, exhibiting a logarithmic dependence on the dimension of the hypothesis space. In fact, we have the following straightforward corollary of Theorem 2.

Corollary 1 (Sample Complexity of GBS-IRL). Under the conditions of Theorem 2, for any given $\delta > 0$, $\mathbb{P}[\hat{h}_t = h^*] > 1 - \delta$ as long as

$$t \geq \frac{1}{\lambda} \log \frac{|\mathcal{H}|}{\delta}.$$
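As a purely illustrative instance of Corollary 1 (the numbers below are assumed, not taken from the paper's experiments): with $|\mathcal{H}| = 500$ candidate hypotheses, confidence $\delta = 0.05$, coherence $c^* = 0.5$ and noise constants giving $\varepsilon = 0.8$, we get $\lambda = 0.8 \cdot \min\{0.25, 0.25\} = 0.2$, so it suffices to have $t \geq \frac{1}{0.2}\,\ln\frac{500}{0.05} = 5 \ln 10^4 \approx 46.1$, i.e., about 47 expert queries (using the natural logarithm).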
To conclude this section, we note that our reduction of IRL to a standard (multi-class) classification problem implies that Algorithm 1 is not specialized in any particular way to IRL problems; in particular, it can be used in general classification problems. Additionally, the guarantees in Theorems 1 and 2 are also generally applicable to any multi-class classification problem verifying the corresponding assumptions.

3.3 Discussion and Extensions

We now discuss the general applicability of our results from Section 3.2. In particular, we discuss two assumptions considered in Theorem 2, namely the 1-neighborly condition on $(\mathcal{X}, \mathcal{H})$ and Assumption 1. We also discuss how additional forms of expert feedback may be integrated in a seamless manner in our GBS-IRL approach, so that the learner is able to recover the target task efficiently.

1-Neighborly Assumption: This assumption is formulated in Theorem 2, and states that $(\mathcal{X}, \mathcal{H})$ is 1-neighborly, meaning that it is possible to "structure" the state-space $\mathcal{X}$ in a manner that is coherent with the hypothesis space $\mathcal{H}$. To assess the validity of this assumption in general, we start by recalling that two sets $\mathcal{X}_i, \mathcal{X}_j \in \Xi$ are 1-neighbors if there is at most a single hypothesis $h_0 \in \mathcal{H}$ that prescribes different optimal actions in $\mathcal{X}_i$ and $\mathcal{X}_j$. Then, $(\mathcal{X}, \mathcal{H})$ is 1-neighborly if every two sets $\mathcal{X}_i, \mathcal{X}_j$ can be "connected" by a sequence of 1-neighbor sets.

In general, given a multi-class classification problem with hypothesis space $\mathcal{H}$, the 1-neighborly assumption can be investigated by verifying the connectivity of the 1-neighborhood graph induced by $\mathcal{H}$ on $\mathcal{X}$. We refer to the work of Nowak (2011) for a detailed discussion of this case, as similar arguments carry over to our multi-class extension.

In the particular case of inverse reinforcement learning, it is important to assess whether the 1-neighborly assumption is reasonable. Given a finite state-space $\mathcal{X}$ and a finite action-space $\mathcal{A}$, it is possible to build a total of $|\mathcal{A}|^{|\mathcal{X}|}$ different hypotheses (this number is even larger if multiple optimal actions are allowed). As shown in the work of Melo et al (2010), for any such hypothesis it is always possible to build a non-degenerate reward function that yields that hypothesis as the optimal policy. Therefore, a sufficiently rich reward space ensures that the corresponding hypothesis space $\mathcal{H}$ includes all $|\mathcal{A}|^{|\mathcal{X}|}$ possible policies already alluded to. This trivially implies that $(\mathcal{X}, \mathcal{H})$ is not 1-neighborly.

Unfortunately, as also shown in the aforementioned work (Melo et al, 2010), the consideration of $\mathcal{H}$ as the set of all possible policies also implies that all states must be sufficiently sampled, since no generalization across states is possible. This observation supports the option in most IRL research to focus on problems in which rewards/policies are selected from some restricted set (for example, Abbeel and Ng, 2004; Ramachandran and Amir, 2007; Neu and Szepesvari, 2007; Syed and Schapire, 2008). For the particular case of active learning approaches, the consideration of a full set of rewards/policies also implies that there is little hope that any active sampling will provide anything but a negligible improvement in sample complexity. A related observation can be found in the work of Dasgupta (2005), in the context of active learning for binary classification.

In situations where the 1-neighborly assumption may not be verified, Lemma 1 cannot be used to ensure the selection of highly informative queries once $W(p, [x]_i) > c^*$ for all $\mathcal{X}_i$. However, it should still be possible to use the main approach in GBS-IRL, as detailed in Algorithm 2. For this situation, we can specialize our sample complexity results in the following immediate corollary.

Algorithm 2 GBS-IRL (version 2)
Require: MDP parameters $M \setminus r$
Require: Reward space $\mathcal{R}_\rho$
Require: Prior distribution $p_0$ over $\mathcal{R}$
1: Compute $\mathcal{H}$ from $\mathcal{R}_\rho$
2: Determine the partition $\Xi = \{\mathcal{X}_1, \ldots, \mathcal{X}_N\}$ of $\mathcal{X}$
3: Set $\mathcal{F}_0 = \emptyset$
4: for all $t = 0, \ldots$ do
5:   Sample $x_{t+1}$ from the set $\mathcal{X}_i$ that minimizes $W(p_t, [x]_i)$
6:   Obtain noisy response $a_{t+1}$
7:   Set $\mathcal{F}_{t+1} \leftarrow \mathcal{F}_t \cup \{(x_{t+1}, a_{t+1})\}$
8:   Update $p_{t+1}$ from $p_t$ using (6)
9: end for
10: return $\hat{h}_t = \operatorname{argmax}_{h \in \mathcal{H}} p_t(h)$

Corollary 2 (Convergence Rate of GBS-IRL, version 2). Let $\mathcal{H}$ denote our hypothesis space, defined over $\mathcal{X}$, and let $\hat{\beta}(x) > \alpha$ in the update (6). Then, for all $t$ such that $W(p_t, [x]_i) \leq c^*$ for some $\mathcal{X}_i$,

$$\mathbb{P}\big[\hat{h}_t \neq h^*\big] \leq |\mathcal{H}|\, (1 - \lambda)^t, \quad t = 0, \ldots$$

where $\lambda = \varepsilon\, \frac{1 - c^*}{2}$ and $\varepsilon$ is defined in (9).

Multiple Optimal Actions: In our presentation so far, we assumed that $\mathcal{R}_\rho$ is such that, for any $r \in \mathcal{R}_\rho$ and any $x \in \mathcal{X}$, $|\mathcal{A}_r(x)| = 1$ (Assumption 1). Informally, this corresponds to assuming that, for every reward function considered, there is a single optimal action, $\pi^*(x)$, at each $x \in \mathcal{X}$. This assumption has been considered, either explicitly or implicitly, in several previous works on learning from demonstration (see, for example, the work of Chernova and Veloso, 2009). Closer to our own work on active IRL, several works recast IRL as a classification problem, focusing on deterministic policies $\pi_k : \mathcal{X} \to \mathcal{A}$ (Ng and Russell, 2000; Cohn et al, 2011; Judah et al, 2011; Ross and Bagnell, 2010; Ross et al, 2011) and therefore, although not explicitly, also consider a single optimal action in each state.
In this situation, the prop erties of the re- sulting algorithm do not follo w from our previous anal- ysis, since the existence of multiple optimal actions necessarily requires a more general noise mo del. The immediate extension of our noise mo del to a scenario where multiple optimal actions are allow ed p oses sev- eral difficulties, as optimal actions across policies ma y b e sampled with different probabilities. In order to o vercome such difficulty , we consider a more conserv ativ e Ba yesian update, that enables a seamless generalization of our results to scenarios that admit m ultiple optimal actions in eac h state. Our up- date now arises from considering that the likelihoo d of observing an action from a set A h ( x ) at state x is giv en b y γ h ( x ) . Equiv alen tly , the lik eliho o d of observing an action from A − A h ( x ) is giv en by β h ( x ) = 1 − γ h ( x ) . As b efore, γ ∗ and β ∗ corresp ond to the v alues of γ h and β h for the target hypothesis, and w e again let α = sup x ∈X β ∗ ( x ) . Suc h aggr e gate d noise model again enables the consid- eration of an appro ximate noise mo del that is constant across hypothesis, and is defined in terms of estimates ˆ γ ( x ) and ˆ β ( x ) of γ ∗ ( x ) and β ∗ ( x ) . Giv en the noise mo del just described, w e get the Bay esian update p t ( h ) , P [ h ∗ = h | F t ] ∝ P [ a t ∈ A h | x t , F t − 1 ] P [ h ∗ = h | F t − 1 ] = P [ a t ∈ A h | x t ] P [ h = h ∗ | F t − 1 ] ≈ ˆ γ ( x ) (1+ h t ) / 2 ˆ β ( x ) (1 − h t ) / 2 p t − 1 ( h ) , (10) with ˆ γ ( x ) and ˆ β ( x ) v erifying (7). This revised form ula- tion implies that the updates to p t are more conserv a- tiv e, in the sense that they are slow er to “eliminate” h y- p othesis from H . Ho wev er, all results for Algorithm 1 remain v alid with the new v alues for ˆ γ and ˆ β . Unfortunately , b y allo wing m ultiple optimal actions p er state, it is also muc h easier to find (non-degenerate) situations where c ∗ = 1 , in whic h case our b ounds are v oid. How ev er, if we fo cus on iden tifying, in each state, at least one optimal action , we are able to retriev e some guaran tees on the sample complexit y of our active learning approach. W e thus consider yet another v er- sion of GBS-IRL, describ ed in Algorithm 3, that uses uses a threshold ˆ c < 1 such that, if W ( p t , [ x ] i ) > ˆ c , w e consider that (at least) one optimal action at [ x ] i has b een iden tified. Once this is done, it outputs the most lik ely hypothesis. Once at least one optimal action has b een identified in all states, the algorithm stops. Algorithm 3 GBS-IRL (version 3) Require: MDP parameters M\ r Require: Reward space R ρ Require: Prior distribution p 0 o ver R 1: Compute H from R ρ 2: Determine partition Ξ = X 1 , . . . X N of X 3: Set F 0 = ∅ 4: for all t = 0 , . . . do 5: Set c t = min i =1 ,...,N W ( p t , [ x ] i ) 6: if c t < ˆ c then 7: Sample x t +1 from the set X i that minimizes W ( p t , [ x ] i ) . 8: else 9: Return ˆ h t = argmax h ∈H p t ( h ) . 10: end if 11: Obtain noisy response a t +1 12: Set F t +1 ← F t ∪ { ( x t +1 , a t +1 ) } 13: Up date p t +1 from p t using (10) 14: end for T o analyze the performance of this version of GBS- IRL, let the set of pr e dicte d optimal actions at x b e defined as A ˆ c ( p, x ) = ( a ∈ A | X h p ( h ) h ( x, a ) > ˆ c ) . W e ha ve the follo wing results. Theorem 3 (Consistency of GBS-IRL, ver- sion 3) . Consider any history of observations F t = { ( x τ , a τ ) , τ = 1 , . . . , t } fr om GBS-IRL. 
We have the following results.

Theorem 3 (Consistency of GBS-IRL, version 3). Consider any history of observations $\mathcal{F}_t = \{(x_\tau, a_\tau), \tau = 1, \ldots, t\}$ from GBS-IRL. If, in the update (6), $\hat{\beta}$ and $\hat{\gamma}$ verify (7) for all $h \in \mathcal{H}$, then for any $a \in \mathcal{A}_{\hat{c}}(p, [x]_i)$,

$$\lim_{t \to \infty} \mathbb{P}\big[h^*([x]_i, a) \neq 1\big] = 0.$$

Proof. See Appendix A.5.

Note that the above result is no longer formulated in terms of the identification of the correct hypothesis, but in terms of the identification of the set of optimal actions. We also have the following result on the sample complexity of version 3 of GBS-IRL.

Corollary 3 (Convergence Rate of GBS-IRL, version 3). Let $\mathcal{H}$ denote our hypothesis space, defined over $\mathcal{X}$, and let $\hat{\beta}(x) > \alpha$ in the update (10). Then, for all $t$ such that $W(p_t, [x]_i) \leq c^*$ for some $\mathcal{X}_i$, and all $a \in \mathcal{A}_{\hat{c}}(p, [x]_i)$,

$$\mathbb{P}\big[h^*([x]_i, a) \neq 1\big] \leq |\mathcal{H}|\, (1 - \lambda)^t, \quad t = 0, \ldots$$

where $\lambda = \varepsilon\, \frac{1 - c^*}{2}$ and $\varepsilon$ is defined in (9) with the new values for $\hat{\gamma}$ and $\hat{\beta}$.

Different Query Types: Finally, it is worth noting that the presentation so far admits queries such as "What is the optimal action in state $x$?" However, it is possible to devise different types of queries (such as "Is action $a$ optimal in state $x$?") that enable us to recover the stronger results in Theorem 2. In fact, a query such as the one exemplified reduces the IRL problem to a binary classification problem over $\mathcal{X} \times \mathcal{A}$, for which existing active learning methods such as the one of Nowak (2011) can readily be applied.

Integrating Reward Feedback: So far, we discussed one possible approach to IRL, where the agent is provided with a demonstration $\mathcal{F}_t = \{(x_\tau, a_\tau), \tau = 1, \ldots, t\}$ consisting of pairs $(x_\tau, a_\tau)$ of states and corresponding actions. From this demonstration the agent must identify the underlying target task, represented as a reward function, $r^*$. We now depart from the Bayesian formalism introduced above and describe how reward information can also be integrated.

With the addition of reward information, our demonstrations may now include state-reward pairs $(x_\tau, u_\tau)$, indicating that the reward in state $x_\tau$ takes the value $u_\tau$. This can be seen as an approach similar to those of Thomaz and Breazeal (2008) and Knox and Stone (2010) for reinforcement learning. The main difference is that, in the aforementioned works, actions are experienced by the learner, who then receives rewards both from the environment and from the teacher. Another related approach is introduced by Regan and Boutilier (2011), in the context of reward design for MDPs.

As with action information, the demonstrator would ideally provide exact values for $r^*$. However, we generally allow the demonstration to include some level of noise, where

$$\mathbb{P}[u_\tau = u \mid x_\tau, r^*] \propto e^{-(u - r^*(x_\tau))^2 / \sigma}, \tag{11}$$

where $\sigma$ is a non-negative constant. As with policy information, reward information can be used to update $p_t(r_k)$ as

$$p_t(r_k) \triangleq \mathbb{P}[r^* = r_k \mid \mathcal{F}_t] \propto \mathbb{P}[u_t \mid x_t, r^* = r_k, \mathcal{F}_{t-1}]\, \mathbb{P}[r^* = r_k \mid \mathcal{F}_{t-1}] = \mathbb{P}[u_t \mid x_t, r^* = r_k]\, \mathbb{P}[r^* = r_k \mid \mathcal{F}_{t-1}] \approx e^{-(u_t - r_k(x_t))^2 / \hat{\sigma}}\, p_{t-1}(r_k),$$

where, as before, we allow for an inaccurate estimate $\hat{\sigma}$ of $\sigma$ such that $\hat{\sigma} \geq \sigma$. Given the correspondence between the rewards in $\mathcal{R}_\rho$ and the hypotheses in $\mathcal{H}$, the above Bayesian update can be used to seamlessly integrate reward information in our Bayesian IRL setting.
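A minimal sketch of the reward-feedback update, assuming each candidate reward is stored as a vector over states and using a state-independent noise scale; the Gaussian-like form follows (11), and the explicit normalization is an implementation convenience.

```python
import numpy as np

def reward_feedback_update(p, R, x, u, sigma_hat):
    """Bayesian update of the posterior over candidate rewards from one
    state-reward observation (x, u), following the likelihood in (11).

    p : posterior over candidate rewards, shape (|R_rho|,)
    R : candidate rewards stacked as an array of shape (|R_rho|, |X|)
    """
    lik = np.exp(-((u - R[:, x]) ** 2) / sigma_hat)
    p_new = p * lik
    return p_new / p_new.sum()

# Toy example: 4 candidate rewards on 5 states; the expert reports reward 1.0 at state 3.
rng = np.random.default_rng(4)
R = rng.random((4, 5))
p = np.full(4, 0.25)
p = reward_feedback_update(p, R, x=3, u=1.0, sigma_hat=0.5)
```

Because each candidate reward $r_k$ corresponds to a hypothesis $h_k$, the same posterior can be read as a posterior over $\mathcal{H}$, so action and reward observations can be interleaved freely.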
To adapt our active learning approach to accommodate reward feedback, let

$$x_{t+1} = \operatorname{argmin}_{\mathcal{X}_i,\, i=1,\ldots,N} W(p_t, [x]_i),$$

i.e., $x_{t+1}$ is the state that would be queried by Algorithm 1 at time-step $t+1$. If the user instead wishes to provide reward information, we would like to replace the query $x_{t+1}$ by some alternative query $x'_{t+1}$ that disambiguates as much as possible the actions in state $x_{t+1}$, much like a direct query to $x_{t+1}$ would. To this purpose, we partition the space of rewards, $\mathcal{R}_\rho$, into $|\mathcal{A}|$ or fewer disjoint sets $\mathcal{R}_1, \ldots, \mathcal{R}_{|\mathcal{A}|}$, where each set $\mathcal{R}_a$ contains precisely those rewards $r \in \mathcal{R}_\rho$ for which $\pi_r(x_{t+1}) = a$. We then select the state $x'_{t+1} \in \mathcal{X}$ whose reward best discriminates between the sets $\mathcal{R}_1, \ldots, \mathcal{R}_{|\mathcal{A}|}$. The algorithm then queries the demonstrator for the reward at this new state.

In many situations, the rewards in $\mathcal{R}_\rho$ allow only poor discrimination between the sets $\mathcal{R}_1, \ldots, \mathcal{R}_{|\mathcal{A}|}$. This is particularly evident if the reward is sparse, since after a couple of informative reward samples, all other states contain similar reward information. In Section 4 we illustrate this inconvenience, comparing the performance of our active method in the presence of both sparse and dense reward functions.
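The text above does not spell out the discrimination criterion, so the sketch below uses one plausible choice (the spread of the per-group mean rewards at each state) purely for illustration; the grouping of candidate rewards by their greedy action at $x_{t+1}$ follows the description above.

```python
import numpy as np

def reward_query_state(R, greedy_action_at_xq):
    """Pick a state whose reward value best separates the groups R_1, ..., R_|A|.

    R : candidate rewards, shape (|R_rho|, |X|)
    greedy_action_at_xq : greedy action of each candidate reward at the state x_{t+1}
        that Algorithm 1 would have queried, shape (|R_rho|,)
    The score used here (variance of the group means of r(x)) is only one
    possible discrimination measure, not the one prescribed by the paper.
    """
    actions = np.unique(greedy_action_at_xq)
    group_means = np.stack([R[greedy_action_at_xq == a].mean(axis=0) for a in actions])
    return int(group_means.var(axis=0).argmax())      # state with the most separated groups

# Toy example: 6 candidate rewards on 4 states, grouped by their greedy action at x_{t+1}.
rng = np.random.default_rng(5)
R = rng.random((6, 4))
greedy_action_at_xq = rng.integers(0, 2, size=6)
x_prime = reward_query_state(R, greedy_action_at_xq)
```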
As for EMG, this algorithm queries the exp ert for the correct action in those states where the exp ected gain of information is p oten tially larger (Cohn et al, 2011). This requires ev aluating, for each state x ∈ X and eac h p ossible outcome, the asso ciated gain of informa- tion. Such method is, therefore, fundamentally differ- en t from GBS-IRL and we exp ect this metho d to yield crisp er differences from our o wn approach. A ddition- ally , the ab o ve estimation is computationally heavy , as (in the w orst case) requires the ev aluation of an MDP p olicy for eac h state-action pair. Small-sized random MDPs In the first set of exp eriments, we ev aluate the p erfor- mance of GBS-IRL in several small-sized MDPs with no particular structure (both in terms of transitions and in terms of rewards). Sp ecifically , w e considered MDPs where |X | = 10 and either |A| = 5 or |A| = 10 . F or each MDP size, we consider 10 random and inde- p enden tly generated MDPs, in each of whic h we con- ducted 200 indep endent learning trials. This first set of exp eriments serv es t wo purp oses. On one hand, it 0 5 10 15 20 25 30 35 40 45 50 5 6 7 8 9 10 Steps Accuracy ( × n. states) Random − 10S − 5A GBS IQBC EMG (a) 0 5 10 15 20 25 30 35 40 45 50 0 3 6 9 12 Steps Value loss Random − 10S − 5A GBS IQBC EMG (b) Figure 1: P erformance of all methods in random MDPs with |X | = 10 and |A| = 5 . 0 10 20 30 40 50 60 70 80 90 100 2 4 6 8 Steps Accuracy ( × n. states) Random − 10S − 10A GBS IQBC EMG (a) 0 10 20 30 40 50 60 70 80 90 100 0 4 8 12 16 Steps Value loss Random − 10S − 10A GBS IQBC EMG (b) Figure 2: P erformance of all methods in random MDPs with |X | = 10 and |A| = 10 . illustrates the applicabilit y of GBS-IRL in arbitrary settings, b y ev aluating the p erformance of our metho d in random MDPs with no particular structure. On the other hand, these initial exp eriments also enable a quic k comparative analysis of GBS-IRL against other relev ant metho ds from the active IRL literature. Figures 1(a) and 2(a) depict the learning curve for all three metho ds in terms of p olicy accuracy . The p erformance of all three methods is essen tially similar in the early stages of the learning pro cess. How ever, GBS-IRL slightly outp erforms the other t wo metho ds, although the differences from IQBC are, as exp ected, smaller than those from EMG. While p olicy accuracy gives a clear view of the learn- ing performance of the algorithms, it con veys a less clear idea on the ability of the learned p olicies to com- plete the task in tended b y the demonstrator. T o ev al- uate the performance of the three learning algorithms in terms of the target task, we also measured the loss of the learned p olicies with resp ect to the optimal p ol- icy . Results are depicted in Figs. 1(b) and 2(b). These results also confirm that the performance of GBS-IRL is essen tially similar. In particular, the differences ob- serv ed in terms of p olicy accuracy hav e little impact in terms of the abilit y to p erform the target task comp e- 12 10 1 10 2 10 3 10 4 10 5 0 240 480 720 960 1200 Dimension (n. states × n. actions) Total computation time (sec.) Comparative computational performance GBS IQBC EMG Figure 3: A v erage (total) computational time for prob- lems of differen t dimensions. ten tly . T o conclude this section, we also compare the com- putation time for all metho ds in these smaller prob- lems. The results are depicted in Fig. 3. 
We emphasize that the results portrayed herein are only indicative, as all algorithms were implemented in a relatively straightforward manner, with no particular concern for optimization. Still, the comparison does confirm that the computational complexity associated with EMG is many times superior to that of the remaining methods. As discussed earlier, this is due to the heavy computations involved in the estimation of the expected myopic gain, which grows directly with the size of |X| × |A|. This observation is also in line with the discussion already found in the original work of Cohn et al (2011).

Medium-sized random MDPs

In the second set of experiments, we investigate how the performance of GBS-IRL is affected by the dimension of the domain considered. To this purpose, we evaluate the performance of GBS-IRL in arbitrary medium-sized MDPs with no particular structure (both in terms of transitions and in terms of rewards). Specifically, we now consider MDPs with either |X| = 50 or |X| = 100, and again take either |A| = 5 or |A| = 10. For each MDP size, we consider 10 random and independently generated MDPs, in each of which we conducted 200 independent learning trials.

Given the results in the first set of experiments and the computation time already associated with EMG, in the remaining experiments we opted to compare GBS-IRL with IQBC only. The learning curves, both in terms of policy accuracy and in terms of task execution, are depicted in Fig. 4.

Figure 4: Classification and value performance of GBS-IRL and IQBC in medium-sized random MDPs. Solid lines correspond to GBS-IRL, and dotted lines correspond to IQBC. Panels (a), (b), (c) and (g) show classification performance; panels (d), (e), (f) and (h) show value performance. The indicated values correspond to the dimensions |X| × |A| of the MDPs (50 × 5, 100 × 5, 50 × 10 and 100 × 10).

In this set of experiments we can observe that the performance of IQBC appears to deteriorate more severely with the number of actions than that of GBS-IRL. Although not significant, this tendency could already be observed in the smaller environments (see, for example, Fig. 2(b)). This dependence on the number of actions is not completely unexpected.
In fact, IQBC queries states x that maximize the vote entropy

$$V_E(x) = -\sum_{a \in \mathcal{A}} \frac{n_{\mathcal{H}}(x,a)}{|\mathcal{H}|} \log \frac{n_{\mathcal{H}}(x,a)}{|\mathcal{H}|},$$

where $n_{\mathcal{H}}(x,a)$ is the number of hypotheses h ∈ H such that a ∈ A_h(x). Since the disagreement is taken over the set of all possible actions, the performance of IQBC depends to some extent on the number of actions. GBS-IRL, on the other hand, is more focused toward identifying one optimal action per state. This renders our approach less sensitive to the number of actions, as can be seen in Corollaries 1 through 3 and as illustrated in Fig. 4.

Large-sized structured domains

So far, we have analyzed the performance of GBS-IRL in random MDPs with no particular structure, both in terms of transition probabilities and in terms of the reward function. In the third set of experiments, we look further into the scalability of GBS-IRL by considering large-sized domains. We consider more structured problems selected from the IRL literature; in particular, we evaluate the performance of GBS-IRL in the trap-world, puddle-world and driver domains.

The puddle-world domain was introduced in the work of Boyan and Moore (1995), and is depicted in Fig. 5. It consists of a 20 × 20 grid-world containing two "puddles" (corresponding to the darker cells). When in a puddle, the agent receives a penalty that is proportional to the squared distance to the nearest edge of the puddle and ranges between 0 and −1. The agent must reach the goal state in the top-right corner of the environment, upon which it receives a reward of +1. We refer to the original description of Boyan and Moore (1995) for further details.

Figure 7: The driver-world domain (Abbeel and Ng, 2004).

This domain can be described by an MDP with |X| = 400 and |A| = 4, where the four actions correspond to motion commands in the four possible directions. Transitions are stochastic, and can be described as follows. After selecting the action corresponding to moving in direction d, the agent rolls back one cell (i.e., moves in direction −d) with probability 0.06. With probability 0.24 the action fails and the agent remains in the same position. The agent moves to the adjacent position in direction d with probability 0.4. With probability 0.24 it moves two cells in direction d, and with probability 0.06 it moves three cells in direction d. We used a discount γ = 0.95 for the MDP (not to be confused with the noise parameters γ̂(x)). A minimal construction sketch of this transition model is given after the trap-world description below.

The trap-world domain was introduced in the work of Judah et al (2011), and is depicted in Fig. 6. It consists of a 30 × 30 grid-world divided into 9 rooms. Darker rooms correspond to trap rooms, which the agent can only leave by reaching the corresponding bottom-left cell (marked with a "×"). Dark lines correspond to walls that the agent cannot traverse. Dotted lines are used to delimit the trap rooms from the safe rooms, but are otherwise meaningless. The agent must reach the goal state in the bottom-right corner of the environment. We refer to the work of Judah et al (2011) for a more detailed description.

This domain can be described by an MDP with |X| = 900 and |A| = 4, where the four actions correspond to motion commands in the four possible directions. Transitions are deterministic. The target reward function r* is everywhere 0 except at the goal, where r*(x_goal) = 1. We again used a discount γ = 0.95 for the MDP.
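To make the two grid-world specifications above concrete, here is the minimal construction sketch referenced in the puddle-world paragraph; the state indexing, direction encoding, border handling and goal coordinates are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def grid_transitions(width=20, height=20):
    """Puddle-world-style transition kernel: for each of the 4 directions the
    agent moves -1, 0, +1, +2 or +3 cells along that direction with
    probabilities 0.06, 0.24, 0.4, 0.24 and 0.06 (clipped at the grid border)."""
    n_states = width * height
    moves = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}   # assumed direction encoding
    steps = [(-1, 0.06), (0, 0.24), (1, 0.4), (2, 0.24), (3, 0.06)]
    P = np.zeros((4, n_states, n_states))
    for a, (dx, dy) in moves.items():
        for x in range(width):
            for y in range(height):
                s = y * width + x
                for k, prob in steps:
                    nx = min(max(x + k * dx, 0), width - 1)
                    ny = min(max(y + k * dy, 0), height - 1)
                    P[a, s, ny * width + nx] += prob
    return P

def sparse_goal_reward(width=30, height=30, goal=(29, 0)):
    """Trap-world-style target reward: zero everywhere except the goal cell."""
    r = np.zeros(width * height)
    r[goal[1] * width + goal[0]] = 1.0
    return r
```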
Finally, the driver domain was introduced in the work of Abbeel and Ng (2004), and an instance of it is depicted in Fig. 7. In this environment, the agent corresponds to the driver of the blue car at the bottom, moving at a speed greater than that of all other cars. All other cars move at constant speed and are scattered across the three central lanes. The goal of the agent is to drive as safely as possible, i.e., to avoid crashing into other cars, to avoid turning too suddenly and, if possible, to avoid driving in the shoulder lanes.

Figure 8: Classification and value performance of GBS-IRL (solid lines) and IQBC (dotted lines) in the three large domains: (a)–(c) policy accuracy and (d)–(f) value performance, for the puddle-world, trap-world and driver domains, respectively.

For the purposes of our tests, we represented the driver domain as an MDP with |X| = 16,875 and |A| = 5, where the five actions correspond to driving the car into each of the 5 lanes. Transitions are deterministic. The target reward function r* penalizes the agent with a value of −10 for every crash and with a value of −1 for driving in the shoulder lanes. Additionally, each lane change costs the agent a penalty of −0.1. As in the previous scenarios, we used a discount γ = 0.95 for the MDP.

As in the previous experiments, we conducted 200 independent learning trials in each of the three environments, and evaluated the performance of both GBS-IRL and IQBC. The results are depicted in Fig. 8.

Figure 9: The grid-world used to illustrate the combined use of action and reward feedback.

We can observe that, as in the previous scenarios, the performance of both methods is very similar. All scenarios feature a relatively small number of actions, which attenuates the negative dependence of IQBC on the number of actions observed in the previous experiments.

It is also interesting to observe that the trap-world domain seems to be harder to learn than the other two domains, in spite of the differences in dimension. For example, while the driver domain required only around 10 samples for GBS-IRL to single out the correct hypothesis, the trap-world required around 20 to attain a similar performance. This may be due to the fact that the trap-world domain features the sparsest reward. Since the other rewards in the hypothesis space were selected to be similarly sparse, it is possible that many of them lead to similar policies in large parts of the state space, making the identification of the correct hypothesis harder.
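As a concrete illustration of the driver-domain reward r* described earlier in this section, the sketch below encodes the −10 crash, −1 shoulder-lane and −0.1 lane-change penalties; the state fields (lane index, crash flag) are a hypothetical simplification, not the paper's actual state representation.

```python
def driver_reward(prev_lane, lane, crashed, n_lanes=5):
    """Target reward r* for the driver domain as described in the text:
    -10 per crash, -1 for being in a shoulder lane, -0.1 per lane change.
    Lanes are indexed 0..n_lanes-1, with 0 and n_lanes-1 assumed to be the shoulders."""
    r = 0.0
    if crashed:
        r -= 10.0
    if lane in (0, n_lanes - 1):   # shoulder lanes
        r -= 1.0
    if lane != prev_lane:          # lane-change penalty
        r -= 0.1
    return r

# Example: changing into a shoulder lane without crashing costs 1.1.
assert abs(driver_reward(prev_lane=2, lane=0, crashed=False) + 1.1) < 1e-9
```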
To conclude, it is still interesting to observe that, in spite of the dimension of the problems considered, both methods were effectively able to single out the correct hypothesis after only a few samples. In fact, the overall performance is superior to that observed in the medium-sized domains, which indicates that the domain structure present in these scenarios greatly contributes to disambiguating between hypotheses, given the expert demonstration.

4.2 Using Action and Reward Feedback

To conclude the empirical validation of our approach, we conduct a final set of experiments that aims at illustrating the applicability of our approach in the presence of both action and reward feedback.

A first experiment illustrates the integration of both reward and policy information in the Bayesian IRL setting described in Section 3.3. We consider the simple 19 × 10 grid-world depicted in Fig. 9, where the agent must navigate to the top-right corner of the environment. In this first experiment, we use random sampling, in which, at each time step t, the expert adds one (randomly selected) sample to the demonstration F_t, which can be of either form (x_t, a_t) or (x_t, u_t).

Figure 10: Bayesian IRL using reward and action feedback (value loss for demonstrations consisting of random actions, random rewards, and random actions and rewards).

Figure 11: Active IRL using reward feedback, sparse vs. dense rewards: (a) sparse reward; (b) shaped reward. Each panel compares GBS-IRL with random sampling in terms of value loss.

Figure 10 compares the performance of Bayesian IRL for demonstrations consisting of state-action pairs only, of state-reward pairs only, and of both state-action and state-reward pairs. We first observe that all demonstration types enable the learner to slowly improve its performance in the target task. This indicates that all three sources of information (action, reward, and action+reward) provide useful information to accurately identify the target task (or, equivalently, the target reward function).

Another important observation is that a direct comparison between the learning performance obtained with the different demonstration types may be misleading, since the ability of the agent to extract useful information from the reward samples greatly depends on the sparsity of the reward function. Except in those situations in which the reward is extremely informative, an action-based demonstration will generally be more informative.

In a second experiment, we analyze the performance of our active learning method when querying only reward information in the same grid-world environment. In particular, we analyze the dependence of the performance on the sparsity of the reward function, testing GBS-IRL in two distinct conditions. The first condition, depicted in Fig. 11(a), corresponds to a reward function r* that is sparse, i.e., such that r*(x) = 0 for all states x except the goal states, where r*(x_goal) = 1. As discussed in Section 3.3, sparsity of the reward greatly impacts the learning performance of our Bayesian IRL approach. This phenomenon, however, is not exclusive to the active learning approach.
In fact, as seen from Fig. 11(a), random sampling also exhibits poor performance. It is still possible, nonetheless, to detect some advantage in using an active sampling approach.

In contrast, it is possible to design very informative rewards by resorting to a technique proposed in the reinforcement learning literature under the designation of reward shaping (Ng et al, 1999). By considering a shaped version of that same reward, we obtain the learning performance depicted in Fig. 11(b). Note how, in the latter case, convergence is extremely fast even in the presence of random sampling.

We conclude by noting that, in the case of reward information, our setting is essentially equivalent to a standard reinforcement learning setting, for which efficient exploration techniques have been proposed and may provide fruitful avenues for future research.

5 Discussion

In this paper we introduce GBS-IRL, a novel active IRL algorithm that allows an agent to learn a task from a demonstration by an "expert". Using a generalization of binary search, our algorithm greedily queries the expert for demonstrations in highly informative states. As seen in Section 1.1, and following the designation of Dasgupta (2011), GBS-IRL is an aggressive active learning algorithm. Additionally, given our consideration of noisy samples, GBS-IRL is naturally designed to handle non-separable data. As pointed out by Dasgupta (2011), few aggressive active learning algorithms exist with provable complexity bounds for the non-separable case. GBS-IRL comes with such guarantees, summarized in Corollary 1: under suitable conditions and for any given δ > 0, $\mathbb{P}[\hat{h}_t \neq h^*] \leq \delta$ as long as

$$t \geq \frac{1}{\lambda} \log \frac{|\mathcal{H}|}{\delta},$$

where λ is a constant that does not depend on the dimension of the hypothesis space, but only on the sample noise.

Additionally, as briefly remarked in Section 3.2, it is possible to use an adaptive submodularity argument to establish the near-optimality of GBS-IRL. In fact, given the target hypothesis h*, consider the objective function

$$f(\mathcal{F}_t) = \mathbb{P}\big[\hat{h}_t \neq h^* \mid \mathcal{F}_t\big] = 1 - p_t(h^*).$$

From Theorem 1 and its proof, it can be shown that f is strongly adaptive monotone and adaptive submodular, so the results of Golovin and Krause (2011) provide a similar bound on the sample complexity of GBS-IRL. To our knowledge, GBS-IRL is the first active IRL algorithm with provable sample complexity bounds. Additionally, as discussed in Section 3.2, our reduction of IRL to a standard (multi-class) classification problem implies that Algorithm 1 is not specialized in any particular way to IRL problems. In particular, our results are generally applicable to any multi-class classification problem verifying the corresponding assumptions.

Finally, our main contributions focus on the simplest form of interaction, in which the demonstration consists of examples of the right action to take in different situations. However, we also discuss how other forms of expert feedback (beyond policy information) may be integrated in a seamless manner in our GBS-IRL framework. In particular, we discussed how to combine both policy and reward information in our learning algorithm. Our approach thus provides an interesting bridge between reinforcement learning (or learning by trial and error) and imitation learning (or learning from demonstration). In particular, it brings to the forefront existing results on efficient exploration in reinforcement learning (Jaksch et al, 2010).
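As a rough illustration of how such mixed feedback could enter a Bayesian posterior over reward hypotheses, the sketch below updates hypothesis weights from either a state-action sample or a state-reward sample; the likelihood models, their parameters and the hypothesis interface are illustrative assumptions, not the paper's exact update rule.

```python
import numpy as np

def update_posterior(p, hypotheses, sample, kind, n_actions=4,
                     gamma_hat=0.9, sigma=0.1):
    """One Bayesian update of an array p of posterior weights over reward hypotheses.

    kind == "action": sample = (x, a); a noisy-expert likelihood in the spirit of
        the gamma-hat / beta-hat model used in the paper, with the demonstrated
        action being optimal with probability gamma_hat and any other action
        having probability beta_hat = (1 - gamma_hat) / (n_actions - 1).
    kind == "reward": sample = (x, u); an illustrative Gaussian likelihood around
        the hypothesis reward r_h(x) with standard deviation sigma.
    The hypothesis interface (optimal_action, reward) is assumed for the sketch.
    """
    beta_hat = (1.0 - gamma_hat) / (n_actions - 1)
    likelihood = np.empty(len(hypotheses))
    if kind == "action":
        x, a = sample
        for i, h in enumerate(hypotheses):
            likelihood[i] = gamma_hat if h.optimal_action(x) == a else beta_hat
    else:  # kind == "reward"
        x, u = sample
        for i, h in enumerate(hypotheses):
            likelihood[i] = np.exp(-0.5 * ((u - h.reward(x)) / sigma) ** 2)
    posterior = p * likelihood
    return posterior / posterior.sum()
```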
Additionally, the general Bayesian IRL framework used in this paper is also amenable to the integration of additional information sources. For example, the human agent may provide trajectory information, or indicate states that are frequently visited when following the optimal path. From the MDP parameters it is generally possible to associate a likelihood with such feedback, which can in turn be integrated in the Bayesian task estimation setting. However, extending the active learning approach to such sources of information is less straightforward, and is left as an important avenue for future research.

A Proofs

In this appendix we collect the proofs of all statements throughout the paper.

A.1 Proof of Lemma 1

The method of proof is related to that of Nowak (2011). We want to show that either

• W(p, [x]_i) ≤ c* for some X_i ∈ Ξ, or, alternatively,

• there are two k-neighbor sets X_i, X_j ∈ Ξ such that W(p, [x]_i) > c* and W(p, [x]_j) > c*, while A*(p, [x]_i) ≠ A*(p, [x]_j).

We have that, for any a ∈ A,

$$\sum_{h \in \mathcal{H}} p(h) \sum_{i=1}^{N} h([x]_i, a)\, \mu(\mathcal{X}_i) \leq \sum_{h \in \mathcal{H}} p(h)\, c^* = c^*.$$

The above expression can be written equivalently as

$$\mathbb{E}_{\mu}\!\left[\sum_{h \in \mathcal{H}} p(h)\, h([x]_i, a)\right] \leq c^*. \tag{12}$$

Suppose that there is no x ∈ X such that W(p, x) ≤ c*; in other words, suppose that, for every x ∈ X, W(p, x) > c*. Then, for (12) to hold, there must be X_i, X_j ∈ Ξ and a ∈ A such that

$$\sum_{h \in \mathcal{H}} p(h)\, h([x]_i, a) > c^*, \qquad \sum_{h \in \mathcal{H}} p(h)\, h([x]_j, a) < -c^*.$$

Since (X, H) is k-neighborly by assumption, there is a sequence {X_{k_1}, …, X_{k_ℓ}} such that X_{k_1} = X_i, X_{k_ℓ} = X_j, and every two consecutive sets X_{k_m}, X_{k_{m+1}} are k-neighbors. Additionally, at some point in this sequence, the sign of $\sum_{h \in \mathcal{H}} p(h)\, h([x], a)$ must change. This implies that there are two k-neighboring sets X_{k_i} and X_{k_j} such that

$$\sum_{h \in \mathcal{H}} p(h)\, h([x]_{k_i}, a) > c^*, \qquad \sum_{h \in \mathcal{H}} p(h)\, h([x]_{k_j}, a) < -c^*,$$

which implies that A*(p_t, [x]_{k_i}) ≠ A*(p_t, [x]_{k_j}), and the proof is complete.

A.2 Proof of Theorem 1

Let C_t denote the amount of probability mass placed on incorrect hypotheses by p_t, i.e.,

$$C_t = \frac{1 - p_t(h^*)}{p_t(h^*)}.$$

The proof of Theorem 1 relies on the following fundamental lemma, whose proof can be found in Appendix A.3.

Lemma 2. Under the conditions of Theorem 1, the process {C_t, t = 1, …} is a non-negative supermartingale with respect to the filtration {F_t, t = 1, …}. In other words, E[C_{t+1} | F_t] ≤ C_t for all t ≥ 0.

The proof now replicates the steps in the proof of Theorem 3 of Nowak (2011). In order to keep the paper as self-contained as possible, we repeat those steps here. We have that

$$\mathbb{P}\big[\hat{h}_t \neq h^*\big] \leq \mathbb{P}[p_t(h^*) < 1/2] = \mathbb{P}[C_t > 1] \leq \mathbb{E}[C_t],$$

where the last inequality follows from the Markov inequality. Explicit computations yield

$$\mathbb{P}\big[\hat{h}_t \neq h^*\big] \leq \mathbb{E}[C_t] = \mathbb{E}\!\left[\frac{C_t}{C_{t-1}}\, C_{t-1}\right] = \mathbb{E}\!\left[\mathbb{E}\!\left[\frac{C_t}{C_{t-1}}\, C_{t-1} \,\Big|\, \mathcal{F}_{t-1}\right]\right] = \mathbb{E}\!\left[\mathbb{E}\!\left[\frac{C_t}{C_{t-1}} \,\Big|\, \mathcal{F}_{t-1}\right] C_{t-1}\right] \leq \max_{\mathcal{F}_{t-1}} \mathbb{E}\!\left[\frac{C_t}{C_{t-1}} \,\Big|\, \mathcal{F}_{t-1}\right] \mathbb{E}[C_{t-1}].$$

Finally, expanding the recursion,

$$\mathbb{P}\big[\hat{h}_t \neq h^*\big] \leq C_0 \left(\max_{\tau = 1, \ldots, t-1} \mathbb{E}\!\left[\frac{C_\tau}{C_{\tau-1}} \,\Big|\, \mathcal{F}_{\tau-1}\right]\right)^{t}. \tag{13}$$
Since, from Lemma 2, $\mathbb{E}\big[\tfrac{C_t}{C_{t-1}} \mid \mathcal{F}_{t-1}\big] < 1$ for all t, the conclusion follows.

A.3 Proof of Lemma 2

The structure of the proof is similar to that of the proof of Lemma 2 of Nowak (2011). We start by explicitly writing the expression for the Bayesian update in (6). For all a ∈ A, let

$$\delta(a) \triangleq \mathbb{P}_{p_t}[h(x_{t+1}, a) = 1] = \frac{1}{2}\left(1 + \sum_{h \in \mathcal{H}} p_t(h)\, h(x_{t+1}, a)\right), \tag{14}$$

and we abusively write δ_{t+1} to denote δ(a_{t+1}). The quantity δ(a) corresponds to the fraction of probability mass concentrated on hypotheses prescribing action a as optimal in state x_{t+1}. The normalizing factor in the update (6) is given by

$$\sum_{h \in \mathcal{H}} p_t(h)\, \hat\gamma(x_{t+1})^{(1+h_{t+1})/2}\, \hat\beta(x_{t+1})^{(1-h_{t+1})/2} = \sum_{h : h_{t+1} = 1} p_t(h)\, \hat\gamma(x_{t+1}) + \sum_{h : h_{t+1} = -1} p_t(h)\, \hat\beta(x_{t+1}) = \delta_{t+1}\, \hat\gamma(x_{t+1}) + (1 - \delta_{t+1})\, \hat\beta(x_{t+1}).$$

We can now write the Bayesian update of p_t(h) as

$$p_{t+1}(h) = \frac{p_t(h)\, \hat\gamma(x_{t+1})^{(1+h_{t+1})/2}\, \hat\beta(x_{t+1})^{(1-h_{t+1})/2}}{\delta_{t+1}\, \hat\gamma(x_{t+1}) + (1 - \delta_{t+1})\, \hat\beta(x_{t+1})}. \tag{15}$$

Let

$$\eta(a) = \frac{\delta(a)\, \hat\gamma(x_{t+1}) + (1 - \delta(a))\, \hat\beta(x_{t+1})}{\hat\gamma(x_{t+1})^{(1+h^*(x_{t+1},a))/2}\, \hat\beta(x_{t+1})^{(1-h^*(x_{t+1},a))/2}}, \tag{16}$$

where, as with δ, we abusively write η_{t+1} to denote η(a_{t+1}). Then, for h*, we can write the update (15) simply as $p_{t+1}(h^*) = p_t(h^*)/\eta_{t+1}$, and

$$\frac{C_{t+1}}{C_t} = \frac{\big(1 - p_t(h^*)/\eta_{t+1}\big)\, \eta_{t+1}}{1 - p_t(h^*)} = \frac{\eta_{t+1} - p_t(h^*)}{1 - p_t(h^*)}.$$

The conclusion of the lemma holds as long as E[η_{t+1} | F_t] ≤ 1. Conditioning the expectation E[η_{t+1} | F_t] on x_{t+1}, we have that

$$\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1}] = \sum_{a \in \mathcal{A}} \eta(a)\, \mathbb{P}[a_{t+1} = a \mid \mathcal{F}_t, x_{t+1}] = \sum_{a \in \mathcal{A}} \eta(a)\, \gamma^*(x_{t+1})^{(1+h^*(x_{t+1},a))/2}\, \beta^*(x_{t+1})^{(1-h^*(x_{t+1},a))/2}.$$

Let a* denote the action in A such that h*(x_{t+1}, a*) = 1. This leads to

$$\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1}] = \eta(a^*)\, \gamma^*(x_{t+1}) + \sum_{a \neq a^*} \eta(a)\, \beta^*(x_{t+1}). \tag{17}$$

For simplicity of notation, we temporarily drop the explicit dependence of β*, β̂, γ* and γ̂ on x_{t+1}. Explicit computations now yield

$$\eta(a^*)\, \gamma^* + \sum_{a \neq a^*} \eta(a)\, \beta^* = \big[\delta(a^*)\hat\gamma + (1 - \delta(a^*))\hat\beta\big]\frac{\gamma^*}{\hat\gamma} + \sum_{a \neq a^*} \big[\delta(a)\hat\gamma + (1 - \delta(a))\hat\beta\big]\frac{\beta^*}{\hat\beta} = \delta(a^*)\gamma^* + (1 - \delta(a^*))\frac{\hat\beta\gamma^*}{\hat\gamma} + \sum_{a \neq a^*}\left[\delta(a)\frac{\hat\gamma\beta^*}{\hat\beta} + (1 - \delta(a))\beta^*\right].$$

Since $\sum_{a \neq a^*} \delta(a) = 1 - \delta(a^*)$,

$$\eta(a^*)\gamma^* + \sum_{a \neq a^*} \eta(a)\beta^* = (1 - \delta(a^*))\left[\frac{\hat\beta\gamma^*}{\hat\gamma} + \frac{\hat\gamma\beta^*}{\hat\beta}\right] + \delta(a^*)\gamma^* + \sum_{a \neq a^*}(1 - \delta(a))\beta^* = (1 - \delta(a^*))\left[\frac{\hat\beta\gamma^*}{\hat\gamma} + \frac{\hat\gamma\beta^*}{\hat\beta}\right] + \delta(a^*)\gamma^* + (|\mathcal{A}| - 1)\beta^* - (1 - \delta(a^*))\beta^* = (1 - \delta(a^*))\left[\frac{\hat\beta\gamma^*}{\hat\gamma} + \frac{\hat\gamma\beta^*}{\hat\beta}\right] + \delta(a^*)(\gamma^* + \beta^*) + 1 - \gamma^* - \beta^*,$$

where we have used the fact that $(|\mathcal{A}| - 1)\beta^* + \gamma^* = 1$. Finally, we have

$$\eta(a^*)\gamma^* + \sum_{a \neq a^*} \eta(a)\beta^* = (1 - \delta(a^*))\left[\frac{\hat\beta\gamma^*}{\hat\gamma} + \frac{\hat\gamma\beta^*}{\hat\beta} - \gamma^* - \beta^*\right] + 1 = 1 - (1 - \delta(a^*))\left[\gamma^* + \beta^* - \frac{\hat\beta\gamma^*}{\hat\gamma} - \frac{\hat\gamma\beta^*}{\hat\beta}\right] = 1 - (1 - \delta(a^*))\left[\frac{\gamma^*(\hat\gamma - \hat\beta)}{\hat\gamma} + \frac{\beta^*(\hat\beta - \hat\gamma)}{\hat\beta}\right].$$

Letting ρ = 1 − (|A| − 1)α, we have that E[η_{t+1} | F_t, x_{t+1}] < 1 as long as

$$\frac{\gamma^*(x)\,(\hat\gamma(x) - \hat\beta(x))}{\hat\gamma(x)} + \frac{\beta^*(x)\,(\hat\beta(x) - \hat\gamma(x))}{\hat\beta(x)} > 0$$

for all x ∈ X.
Since, for all x ∈ X, β*(x) < α and γ*(x) > ρ, we have

$$\frac{\gamma^*(x)\,(\hat\gamma(x) - \hat\beta(x))}{\hat\gamma(x)} + \frac{\beta^*(x)\,(\hat\beta(x) - \hat\gamma(x))}{\hat\beta(x)} = \big(\gamma^*(x) - \beta^*(x)\big)\left[\frac{\gamma^*(x)}{\hat\gamma(x)} - \frac{\beta^*(x)}{\hat\beta(x)}\right] > (\rho - \alpha)\left[\frac{\rho}{\hat\gamma(x)} - \frac{\alpha}{\hat\beta(x)}\right] \geq 0,$$

where the inequality is strict if β̂(x) > α for all x ∈ X.

A.4 Proof of Theorem 2

To prove Theorem 2, we depart from (13):

$$\mathbb{P}\big[\hat{h}_t \neq h^*\big] \leq C_0 \left(\max_{\tau = 1, \ldots, t-1} \mathbb{E}\!\left[\frac{C_\tau}{C_{\tau-1}} \,\Big|\, \mathcal{F}_{\tau-1}\right]\right)^{t}.$$

Letting

$$\lambda_t = \max_{\tau = 1, \ldots, t-1} \mathbb{E}\!\left[\frac{C_\tau}{C_{\tau-1}} \,\Big|\, \mathcal{F}_{\tau-1}\right],$$

the desired result can be obtained by bounding the sequence {λ_t, t = 0, …} by some value λ < 1. To show that such a λ exists, we consider separately the two possible queries in Algorithm 1.

Let then $c_t = \min_{i=1,\ldots,N} W(p_t, [x]_i)$, and suppose that there are no 1-neighbor sets X_i and X_j such that

$$W(p_t, [x]_i) > c_t, \qquad W(p_t, [x]_j) > c_t, \tag{18}$$
$$A^*(p_t, [x]_i) \neq A^*(p_t, [x]_j). \tag{19}$$

Then, from Algorithm 1, the queried state x_{t+1} will be such that $x_{t+1} \in \operatorname{argmin}_i W(p_t, [x]_i)$. Since, from the definition of c*, c_t < c*, it follows that $\delta(a) \leq \tfrac{1 + c^*}{2}$ for all a ∈ A, where δ(a) is defined in (14). Then, from the proof of Lemma 2,

$$\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1}] \leq 1 - \varepsilon\,(1 - \delta(a^*)) \leq 1 - \varepsilon\,\frac{1 - c^*}{2},$$

where a* denotes the action in A such that h*(x, a*) = 1.

Consider now the case where there are 1-neighboring sets X_i and X_j such that (18) and (19) hold. In this case, according to Algorithm 1, x_{t+1} is selected randomly as either [x]_i or [x]_j, with probability 1/2 each. Moreover, since X_i and X_j are 1-neighbors, there is a single hypothesis, say h_0, that prescribes different optimal actions in X_i and X_j. Let a*_i denote the optimal action at [x]_i, and a*_j the optimal action at [x]_j, as prescribed by h*. Three situations are possible:

Situation 1. A*(p_t, [x]_i) ≠ a*_i and A*(p_t, [x]_j) = a*_j, or A*(p_t, [x]_i) = a*_i and A*(p_t, [x]_j) ≠ a*_j.

Situation 2. A*(p_t, [x]_i) ≠ a*_i and A*(p_t, [x]_j) ≠ a*_j.

Situation 3. A*(p_t, [x]_i) = a*_i and A*(p_t, [x]_j) = a*_j.

We consider Situation 1 first. From the proof of Lemma 2,

$$\mathbb{E}\big[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1} \in \{[x]_i, [x]_j\}\big] \leq 1 - \frac{\varepsilon}{2}\left[1 - \frac{1}{2}\sum_{h \in \mathcal{H}} p_t(h)\big(h([x]_i, a^*_i) + h([x]_j, a^*_j)\big)\right],$$

where we explicitly replaced the definition of δ(a). If A*(p_t, [x]_i) = a*_i and A*(p_t, [x]_j) ≠ a*_j (the alternative is treated similarly), we have that

$$\sum_{h \in \mathcal{H}} p_t(h)\, h([x]_i, a^*_i) \leq 1 \qquad \text{and} \qquad \sum_{h \in \mathcal{H}} p_t(h)\, h([x]_j, a^*_j) \leq 0,$$

yielding

$$\mathbb{E}\big[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1} \in \{[x]_i, [x]_j\}\big] \leq 1 - \frac{\varepsilon}{4}.$$

Considering Situation 2, we again have

$$\mathbb{E}\big[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1} \in \{[x]_i, [x]_j\}\big] \leq 1 - \frac{\varepsilon}{2}\left[1 - \frac{1}{2}\sum_{h \in \mathcal{H}} p_t(h)\big(h([x]_i, a^*_i) + h([x]_j, a^*_j)\big)\right],$$

where, now,

$$\sum_{h \in \mathcal{H}} p_t(h)\, h([x]_i, a^*_i) \leq 0 \qquad \text{and} \qquad \sum_{h \in \mathcal{H}} p_t(h)\, h([x]_j, a^*_j) \leq 0.$$

This immediately implies

$$\mathbb{E}\big[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1} \in \{[x]_i, [x]_j\}\big] \leq 1 - \frac{\varepsilon}{2}.$$

Finally, concerning Situation 3, h_0 = h*. Since X_i and X_j are 1-neighbors, h([x]_i, a*_i) = h([x]_j, a*_i) for all hypotheses other than h*; equivalently, h([x]_i, a*_i) = −h([x]_j, a*_j) for all hypotheses other than h*. This implies that

$$\mathbb{E}\big[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1} \in \{[x]_i, [x]_j\}\big] \leq 1 - \frac{\varepsilon}{2}\big(1 - p_t(h^*)\big).$$
Putting everything together,

$$\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t] \leq \max\left\{1 - \frac{\varepsilon}{4},\; 1 - \frac{\varepsilon}{2}\big(1 - p_t(h^*)\big),\; 1 - \frac{\varepsilon}{2}\big(1 - c^*\big)\right\}$$

and

$$\mathbb{E}\!\left[\frac{C_\tau}{C_{\tau-1}} \,\Big|\, \mathcal{F}_{\tau-1}\right] \leq \frac{\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t] - p_t(h^*)}{1 - p_t(h^*)} \leq 1 - \min\left\{\frac{\varepsilon}{4},\; \frac{\varepsilon}{2}\big(1 - c^*\big)\right\}.$$

The proof is complete.

A.5 Proof of Theorem 3

Let ε = 1 − ĉ and

$$C_t = \frac{\varepsilon - p_t(h^*)}{p_t(h^*)}.$$

Let a denote an arbitrary action in A_ĉ(p_t, [x]_i), for some [x]_i, i = 1, …, N. Then

$$\mathbb{P}[h^*([x]_i, a) = -1] = \mathbb{P}\!\left[\sum_{h \neq h^*} p_t(h)\, h([x]_i, a) > \hat{c} + p_t(h^*)\right] \leq \mathbb{P}\!\left[\sum_{h \neq h^*} p_t(h) > \hat{c} + p_t(h^*)\right] = \mathbb{P}\big[1 - p_t(h^*) > \hat{c} + p_t(h^*)\big] = \mathbb{P}[C_t > 1] \leq \mathbb{E}[C_t],$$

where, again, the last inequality follows from the Markov inequality. We can now replicate the steps in the proof of Theorem 1 in Appendix A.2 to establish the desired result, for which we need only prove that E[C_{t+1} | F_t] ≤ C_t. From Lemma 2, the result follows.

Acknowledgements

This work was partially supported by the Portuguese Fundação para a Ciência e a Tecnologia (INESC-ID multiannual funding) under project PEst-OE/EEI/LA0021/2011. Manuel Lopes is with the Flowers Team, a joint INRIA ENSTA-ParisTech lab.

References

Abbeel P, Ng A (2004) Apprenticeship learning via inverse reinforcement learning. In: Proc. 21st Int. Conf. Machine Learning, pp 1–8

Argall B, Chernova S, Veloso M (2009) A survey of robot learning from demonstration. Robotics and Autonomous Systems 57(5):469–483

Babes M, Marivate V, Littman M, Subramanian K (2011) Apprenticeship learning about multiple intentions. In: Proc. 28th Int. Conf. Machine Learning, pp 897–904

Barto A, Rosenstein M (2004) Supervised actor-critic reinforcement learning. In: Si J, Barto A, Powell W, Wunsch D (eds) Handbook of Learning and Approximate Dynamic Programming, Wiley-IEEE Press, chap 14, pp 359–380

Boyan J, Moore A (1995) Generalization in reinforcement learning: Safely approximating the value function. In: Adv. Neural Information Proc. Systems, vol 7, pp 369–376

Breazeal C, Brooks A, Gray J, Hoffman G, Lieberman J, Lee H, Thomaz A, Mulanda D (2004) Tutelage and collaboration for humanoid robots. Int J Humanoid Robotics 1(2)

Cakmak M, Thomaz A (2010) Optimality of human teachers for robot learners. In: Proc. 2010 IEEE Int. Conf. Development and Learning, pp 64–69

Chernova S, Veloso M (2009) Interactive policy learning through confidence-based autonomy. J Artificial Intelligence Res 34:1–25

Cohn R, Durfee E, Singh S (2011) Comparing action-query strategies in semi-autonomous agents. In: Proc. 25th AAAI Conf. Artificial Intelligence, pp 1102–1107

Dasgupta S (2005) Analysis of a greedy active learning strategy. In: Adv. Neural Information Proc. Systems, vol 17, pp 337–344

Dasgupta S (2011) Two faces of active learning. J Theoretical Computer Science 412(19):1767–1781

Golovin D, Krause A (2011) Adaptive submodularity: Theory and applications in active learning and stochastic optimization. J Artificial Intelligence Res 42:427–486

Grollman D, Jenkins O (2007) Dogged learning for robots. In: Proc. 2007 IEEE Int. Conf. Robotics and Automation, pp 2483–2488

Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J Machine Learning Res 11:1563–1600
Judah K, Fern A, Dietterich T (2011) Active imitation learning via state queries. In: Proc. ICML Workshop on Combining Learning Strategies to Reduce Label Cost

Judah K, Fern A, Dietterich T (2012) Active imitation learning via reduction to I.I.D. active learning. In: Proc. 28th Conf. Uncertainty in Artificial Intelligence, pp 428–437

Knox W, Stone P (2009) Interactively shaping agents via human reinforcement. In: Proc. 5th Int. Conf. Knowledge Capture, pp 9–16

Knox W, Stone P (2010) Combining manual feedback with subsequent MDP reward signals for reinforcement learning. In: Proc. 9th Int. Conf. Autonomous Agents and Multiagent Systems, pp 5–12

Knox W, Stone P (2011) Augmenting reinforcement learning with human feedback. In: IJCAI Workshop on Agents Learning Interactively from Human Teachers

Lopes M, Melo F, Kenward B, Santos-Victor J (2009a) A computational model of social-learning mechanisms. Adaptive Behavior 467(17)

Lopes M, Melo F, Montesano L (2009b) Active learning for reward estimation in inverse reinforcement learning. In: Proc. Eur. Conf. Machine Learning and Princ. Practice of Knowledge Disc. Databases, pp 31–46

Lopes M, Melo F, Montesano L, Santos-Victor J (2010) Abstraction levels for robotic imitation: Overview and computational approaches. In: Sigaud O, Peters J (eds) From Motor to Interaction Learning in Robots, Springer, pp 313–355

Melo F, Lopes M (2010) Learning from demonstration using MDP induced metrics. In: Proc. European Conf. Machine Learning and Practice of Knowledge Discovery in Databases, pp 385–401

Melo F, Lopes M, Santos-Victor J, Ribeiro M (2007) A unified framework for imitation-like behaviours. In: Proc. 4th Int. Symp. Imitation in Animals and Artifacts, pp 241–250

Melo F, Lopes M, Ferreira R (2010) Analysis of inverse reinforcement learning with perturbed demonstrations. In: Proc. 19th European Conf. Artificial Intelligence, pp 349–354

Neu G, Szepesvari C (2007) Apprenticeship learning using inverse reinforcement learning and gradient methods. In: Proc. 23rd Conf. Uncertainty in Artificial Intelligence, pp 295–302

Ng A, Russel S (2000) Algorithms for inverse reinforcement learning. In: Proc. 17th Int. Conf. Machine Learning, pp 663–670

Ng A, Harada D, Russell S (1999) Policy invariance under reward transformations: Theory and application to reward shaping. In: Proc. 16th Int. Conf. Machine Learning, pp 278–294

Nowak R (2011) The geometry of generalized binary search. IEEE Trans Information Theory 57(12):7893–7906

Price B, Boutilier C (1999) Implicit imitation in multiagent reinforcement learning. In: Proc. 16th Int. Conf. Machine Learning, pp 325–334

Price B, Boutilier C (2003) Accelerating reinforcement learning through implicit imitation. J Artificial Intelligence Res 19:569–629

Ramachandran D, Amir E (2007) Bayesian inverse reinforcement learning. In: Proc. 20th Int. Joint Conf. Artificial Intelligence, pp 2586–2591

Regan K, Boutilier C (2011) Robust online optimization of reward-uncertain MDPs. In: Proc. 22nd Int. Joint Conf. Artificial Intelligence, pp 2165–2171

Ross S, Bagnell J (2010) Efficient reductions for imitation learning. In: Proc. 13th Int. Conf. Artificial Intelligence and Statistics, pp 661–668

Ross S, Gordon G, Bagnell J (2011) Reduction of imitation learning and structured prediction to no-regret online learning. In: Proc. 14th Int. Conf. Artificial Intelligence and Statistics, pp 627–635
Schaal S (1999) Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3(6):233–242

Settles B (2009) Active learning literature survey. Comp. Sciences Techn. Rep. 1648, Univ. Wisconsin-Madison

Sutton R, Barto A (1998) Reinforcement Learning: An Introduction. MIT Press

Syed U, Schapire R (2008) A game-theoretic approach to apprenticeship learning. In: Adv. Neural Information Proc. Systems, vol 20, pp 1449–1456

Syed U, Schapire R, Bowling M (2008) Apprenticeship learning using linear programming. In: Proc. 25th Int. Conf. Machine Learning, pp 1032–1039

Thomaz A, Breazeal C (2008) Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence 172:716–737

Ziebart B, Maas A, Bagnell J, Dey A (2008) Maximum entropy inverse reinforcement learning. In: Proc. 23rd AAAI Conf. Artificial Intelligence, pp 1433–1438
