Feature Dynamic Bayesian Networks

Marcus Hutter
RSISE @ ANU and SML @ NICTA
Canberra, ACT, 0200, Australia
marcus@hutter1.net    www.hutter1.net

24 December 2008

Abstract

Feature Markov Decision Processes (ΦMDPs) [Hut09] are well-suited for learning agents in general environments. Nevertheless, unstructured (Φ)MDPs are limited to relatively simple environments. Structured MDPs like Dynamic Bayesian Networks (DBNs) are used for large-scale real-world problems. In this article I extend ΦMDP to ΦDBN. The primary contribution is a cost criterion that allows the most relevant features to be extracted automatically from the environment, leading to the "best" DBN representation. I discuss all building blocks required for a complete general learning algorithm.

Keywords: Reinforcement learning; dynamic Bayesian network; structure learning; feature learning; global vs. local reward; explore-exploit.

1 Introduction

Agents. The agent-environment setup, in which an Agent interacts with an Environment, is a very general and prevalent framework for studying intelligent learning systems [RN03]. In cycles t = 1,2,3,..., the environment provides a (regular) observation o_t ∈ O (e.g. a camera image) to the agent; then the agent chooses an action a_t ∈ A (e.g. a limb movement); finally the environment provides a real-valued reward r_t ∈ ℝ to the agent. The reward may be very scarce, e.g. just +1 (−1) for winning (losing) a chess game, and 0 at all other times [Hut05, Sec.6.3]. Then the next cycle t+1 starts. The agent's objective is to maximize its reward.

Environments. For example, sequence prediction is concerned with environments that do not react to the agent's actions (e.g.
a weather-forecasting "action") [Hut03], planning deals with the case where the environmental function is known [RPPCd08], classification and regression is for conditionally independent observations [Bis06], Markov Decision Processes (MDPs) assume that o_t and r_t only depend on a_{t−1} and o_{t−1} [SB98], POMDPs deal with Partially Observable MDPs [KLC98], and Dynamic Bayesian Networks (DBNs) with structured MDPs [BDH99].

Feature MDPs [Hut09]. Concrete real-world problems can often be modeled as MDPs. For this purpose, a designer extracts relevant features from the history (e.g. position and velocity of all objects), i.e. the history h_t = a_1 o_1 r_1 ... a_{t−1} o_{t−1} r_{t−1} o_t is summarized by a feature vector s_t := Φ(h_t). The feature vectors are regarded as states of an MDP and are assumed to be (approximately) Markov.

Artificial General Intelligence (AGI) [GP07] is concerned with designing agents that perform well in a very large range of environments [LH07], including all of those mentioned above and more. In this general situation, it is not a priori clear what the useful features are. Indeed, any observation in the (far) past may be relevant in the future. A solution suggested in [Hut09] is to learn Φ itself. If Φ keeps too much of the history (e.g. Φ(h_t) = h_t), the resulting MDP is too large (infinite) and cannot be learned. If Φ keeps too little, the resulting state sequence is not Markov. The Cost criterion I develop formalizes this tradeoff and is minimized for the "best" Φ. At any time n, the best Φ is the one that minimizes the Markov code length of s_1...s_n and r_1...r_n. This is reminiscent of, but actually quite different from, MDL, which minimizes model+data code length [Grü07].

Dynamic Bayesian networks. The use of "unstructured" MDPs [Hut09], even our Φ-optimal ones, is clearly limited to relatively simple tasks.
Real-world problems are structured and can often be represented by dynamic Bayesian networks (DBNs) with a reasonable number of nodes [DK89]. Bayesian networks in general, and DBNs in particular, are powerful tools for modeling and solving complex real-world problems. Advances in theory and increases in computational power constantly broaden their range of applicability [BDH99, SDL07].

Main contribution. The primary contribution of this work is to extend the Φ selection principle developed in [Hut09] for MDPs to the conceptually much more demanding DBN case. The major extra complications are approximating, learning, and coding the rewards, the dependence of the Cost criterion on the DBN structure, learning the DBN structure, and how to store and find the optimal value function and policy.

Although this article is self-contained, it is recommended to read [Hut09] first.

2 Feature Dynamic Bayesian Networks (ΦDBN)

In this section I recapitulate the definition of ΦMDP from [Hut09] and adapt it to DBNs. While formally a DBN is just a special case of an MDP, exploiting the additional structure efficiently is a challenge. For generic MDPs, typical algorithms are polynomial and can at best be linear in the number of states |S|. For DBNs we want algorithms that are polynomial in the number of features m. Such DBNs have exponentially many states (2^{O(m)}), hence the standard MDP algorithms are exponential, not polynomial, in m. Deriving poly-time (and poly-space!) algorithms for DBNs by exploiting the additional DBN structure is the challenge. The gain is that we can handle exponentially large structured MDPs efficiently.

Notation. Throughout this article, log denotes the binary logarithm, and δ_{x,y} = δ_{xy} = 1 if x = y and 0 else is the Kronecker symbol. I generally omit separating commas if no confusion arises, in particular in indices.
For any z of suitable type (string, vector, set), I define string z = z_{1:l} = z_1...z_l, sum z_+ = Σ_j z_j, union z_* = ∪_j z_j, and vector z_• = (z_1,...,z_l), where j ranges over the full range {1,...,l} and l = |z| is the length, dimension, or size of z. ẑ denotes an estimate of z. The characteristic function 𝟙_B = 1 if B = true and 0 else. P(·) denotes a probability over states and rewards or parts thereof. I do not distinguish between random variables Z and realizations z, and the abbreviation P(z) := P[Z = z] never leads to confusion. More specifically, m ∈ ℕ denotes the number of features, i ∈ {1,...,m} any feature, n ∈ ℕ the current time, and t ∈ {1,...,n} any time. Further, in order not to get distracted, at several places I gloss over initial conditions or special cases where inessential. Also 0·undefined := 0 and 0·infinity := 0.

ΦMDP definition. A ΦMDP consists of a 7-tuple (O, A, R, Agent, Env, Φ, S) = (observation space, action space, reward space, agent, environment, feature map, state space). Without much loss of generality, I assume that A and O are finite and R ⊆ ℝ. Implicitly I assume A to be small, while O may be huge.

Agent and Env are a pair or triple of interlocking functions of the history H := (O × A × R)* × O:

  Env: H × A × R ⇝ O,   o_n = Env(h_{n−1} a_{n−1} r_{n−1}),
  Agent: H ⇝ A,         a_n = Agent(h_n),
  Env: H × A ⇝ R,       r_n = Env(h_n a_n),

where ⇝ indicates that the mappings → might be stochastic. The informal goal of AI is to design an Agent() that achieves high (expected) reward over the agent's lifetime in a large range of Env()ironments.
The feature map Φ maps histories to states:

  Φ: H → S,   s_t = Φ(h_t),   h_t = oar_{1:t−1} o_t ∈ H.

The idea is that Φ shall extract the "relevant" aspects of the history in the sense that the "compressed" history sar_{1:n} ≡ s_1 a_1 r_1 ... s_n a_n r_n can well be described as a sample from some MDP (S, A, T, R) = (state space, action space, transition probability, reward function).

(Φ) Dynamic Bayesian Networks are structured (Φ)MDPs. The state space is S = {0,1}^m, and each state s ≡ x ≡ (x^1,...,x^m) ∈ S is interpreted as a feature vector x = Φ(h), where x^i = Φ^i(h) is the value of the i-th binary feature. In the following I will also refer to x^i as feature i, although strictly speaking it is its value. Since non-binary features can be realized as a list of binary features, I restrict myself to the latter. Given x_{t−1} = x, I assume that the features (x^1_t,...,x^m_t) = x' at time t are independent, and that each x'^i depends only on a subset of "parent" features u^i ⊆ {x^1,...,x^m}, i.e. the transition matrix has the structure

$$T^a_{xx'} \;=\; P(x_t = x' \mid x_{t-1} = x,\, a_{t-1} = a) \;=\; \prod_{i=1}^m P^a(x'^i \mid u^i) \qquad (1)$$

This defines our ΦDBN model. It is just a ΦMDP with special S and T. Explaining ΦDBN on an example is easier than staying general.

3 ΦDBN Example

Consider an instantiation of the simple vacuum world [RN03, Sec.3.6]. There are two rooms, A and B, and a vacuum Robot that can observe whether the room he is in is Clean or Dirty; Move to the other room; Suck, i.e. clean the room he is in; or do Nothing. After 3 days a room gets dirty again. Every clean room gives a reward 1, but a moving or sucking robot costs and hence reduces the reward by 1. Hence O = {A,B}×{C,D}, A = {N,S,M}, R = {−1,0,1,2}, and the dynamics Env() (possible histories) is clear from the above description.

Dynamics as a DBN.
We can model the dynamics by a DBN as follows: The state is modeled by 3 features. Feature R ∈ {A,B} stores which room the robot is in, and feature A/B ∈ {0,1,2,3} remembers (capped at 3) how long ago the robot last cleaned room A/B, hence S = {0,1,2,3} × {A,B} × {0,1,2,3}. The state/feature transition is as follows:

  if (x_R = A and a = S) then x'_A = 0 else x'_A = min{x_A + 1, 3};
  if (x_R = B and a = S) then x'_B = 0 else x'_B = min{x_B + 1, 3};
  if a = M then (if x_R = B then x'_R = A else x'_R = B) else x'_R = x_R.

A DBN can be viewed as a two-layer Bayesian network [BDH99]. The dependency structure of our example is as follows: the left layer holds the features x = (A, R, B) at time t−1, the right layer holds x' = (A', R', B') at time t, with edges A→A', R→A', R→R', R→B', and B→B'. Each feature consists of a (left,right) pair of nodes, and a node i ∈ {1,2,3 = m} ≙ {A,R,B} on the right is connected to all and only the parent features u^i on the left. The reward is

  r = 𝟙_{x_A<3} + 𝟙_{x_B<3} − 𝟙_{a≠N}.

The feature map Φ = (Φ_A, Φ_R, Φ_B) can also be written down explicitly. It depends on the actions and observations of the last 3 time steps.

Discussion. Note that all nodes x'^i can implicitly also depend on the chosen action a. The optimal policies are repetitions of the action sequence S,N,M or S,M,N. One might think that binary features x_{A/B} ∈ {C,D} are sufficient, but this would result in a POMDP (Partially Observable MDP), since the cleanness of room A is not observed while the robot is in room B. That is, x' would not be a (probabilistic) function of x and a alone. The quaternary feature x_A ∈ {0,1,2,3} can easily be converted into two binary features, and similarly x_B. The purely deterministic example can easily be made stochastic. For instance, Sucking and Moving may fail with a certain probability.
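The deterministic dynamics and reward just described are compact enough to state directly in code. A minimal sketch (the function name and state encoding are illustrative; the reward is evaluated on the resulting feature values x', cf. Section 4):

```python
def vacuum_step(x, a):
    """One ΦDBN transition of the two-room vacuum world.
    State x = (xA, xR, xB): xA/xB = days since room A/B was last
    cleaned (capped at 3), xR = room the robot is in.
    Actions: 'S' = Suck, 'M' = Move, 'N' = Nothing."""
    xA, xR, xB = x
    nA = 0 if (xR == 'A' and a == 'S') else min(xA + 1, 3)  # x'_A
    nB = 0 if (xR == 'B' and a == 'S') else min(xB + 1, 3)  # x'_B
    nR = ('A' if xR == 'B' else 'B') if a == 'M' else xR    # x'_R
    # r = 1[x'_A < 3] + 1[x'_B < 3] - 1[a != N]
    r = int(nA < 3) + int(nB < 3) - int(a != 'N')
    return (nA, nR, nB), r
```

For example, sucking in a freshly cleaned room A yields `vacuum_step((0,'A',0), 'S') = ((0,'A',1), 1)`: both rooms are still clean (+2), minus 1 for the action.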
Possible, but more complicated, is to model a probabilistic transition from Clean to Dirty. In the randomized versions the agent needs to use its observations.

4 ΦDBN Coding and Evaluation

I now construct a code for s_{1:n} given a_{1:n}, and for r_{1:n} given s_{1:n} and a_{1:n}, which is optimal (minimal) if s_{1:n} r_{1:n} given a_{1:n} is sampled from some MDP. It constitutes our cost function for Φ and is used to define the Φ selection principle for DBNs. Compared to the MDP case, reward coding is more complex, and there is an extra dependence on the graphical structure of the DBN.

Recall [Hut09] that a sequence z_{1:n} with counts n = (n_1,...,n_m) can, within an additive constant, be coded in

$$\text{CL}(\boldsymbol{n}) := n\,H(\boldsymbol{n}/n) + \frac{m'-1}{2}\log n \ \text{ if } n>0, \text{ and } 0 \text{ else} \qquad (2)$$

bits, where n = n_+ = n_1 + ... + n_m, m' = |{i : n_i > 0}| ≤ m is the number of non-empty categories, and H(p) := −Σ_{i=1}^m p_i log p_i is the entropy of probability distribution p. The code is optimal (within +O(1)) for all i.i.d. sources.

State/Feature Coding. Similarly to the ΦMDP case, we need to code the temporal "observed" state=feature sequence x_{1:n}. I do this by a frequency estimate of the state/feature transition probability. (Within an additive constant, MDL, MML, combinatorial, incremental, and Bayesian coding all lead to the same result.) In the following I will drop the prime in (u^i, a, x'^i) tuples and related situations if/since it does not lead to confusion. Let T^{ia}_{u^i x^i} = {t ≤ n : u^i_{t−1} = u^i, a_{t−1} = a, x^i_t = x^i} be the set of times t−1 at which the features that influence x^i have values u^i and the action is a, and which lead to feature i having value x^i. Let n^{ia}_{u^i x^i} = |T^{ia}_{u^i x^i}| be their number (n^{i+}_{++} = n ∀i). I estimate each feature probability separately by P̂^a(x^i | u^i) = n^{ia}_{u^i x^i} / n^{ia}_{u^i +}.
Using (1), this yields

$$\hat P(x_{1:n}|a_{1:n}) = \prod_{t=1}^n \hat T^{a_{t-1}}_{x_{t-1}x_t} = \prod_{t=1}^n\prod_{i=1}^m \hat P^{a_{t-1}}(x^i_t \mid u^i_{t-1}) = \ldots = 2^{-\sum_{i,u^i,a} n^{ia}_{u^i+}\,H\!\big(n^{ia}_{u^i\bullet}/n^{ia}_{u^i+}\big)}$$

The length of the Shannon-Fano code of x_{1:n} is just the negative logarithm of this expression. We also need to code each non-zero count n^{ia}_{u^i x^i} to accuracy O(1/√(n^{ia}_{u^i +})), which needs (1/2) log(n^{ia}_{u^i +}) bits each. Together this gives a complete code of length

$$\text{CL}(x_{1:n}|a_{1:n}) = \sum_{i,u^i,a} \text{CL}(n^{ia}_{u^i \bullet}) \qquad (3)$$

The rewards are more complicated.

Reward structure. Let R^a_{xx'} be (a model of) the observed reward when action a in state x results in state x'. It is natural to assume that the structure of the rewards R^a_{xx'} is related to the transition structure T^a_{xx'}. Indeed, this is not restrictive, since one can always consider a DBN with the union of transition and reward dependencies. Usually it is assumed that the "global" reward is a sum of "local" rewards R^{ia}_{u^i x'^i}, one for each feature i [KP99]. For simplicity of exposition I assume that the local reward R^i only depends on the feature value x'^i, and not on u^i and a. Even this is not restrictive, and actually may be advantageous, as discussed in [Hut09] for MDPs. So I assume

$$R^a_{xx'} = \sum_{i=1}^m R^i_{x'^i} =: R(x')$$

For instance, in the example of Section 3, two local rewards (R^A_{x'_A} = 𝟙_{x'_A<3} and R^B_{x'_B} = 𝟙_{x'_B<3}) depend on x' only, but the third reward depends on the action (R^R = −𝟙_{a≠N}). Often it is assumed that the local rewards are directly observed or known [KP99], but we neither want nor can do this here: having to specify many local rewards is an extra burden for the environment (e.g. the teacher), which preferably should be avoided.
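Instead, as derived next, the localization of the global reward boils down to an ordinary least-squares fit of local reward weights to the observed global rewards. A minimal sketch, assuming NumPy (the function name and data layout are illustrative):

```python
import numpy as np

def fit_local_rewards(X, r):
    """Fit weights w so that R(x) = w0 + sum_i w_i * x^i approximates
    the observed global rewards r_t in least squares, i.e. w = A^+ b
    with A = sum_t x_t x_t^T and b = sum_t r_t x_t."""
    X1 = np.hstack([np.ones((len(X), 1)), np.asarray(X, float)])  # x^0 := 1
    A = X1.T @ X1                     # (m+1)x(m+1) feature co-occurrence matrix
    b = X1.T @ np.asarray(r, float)   # feature-reward correlations
    return np.linalg.pinv(A) @ b      # pseudo-inverse handles singular A
```

With feature rows X = [[1,0],[0,1],[1,1],[0,0]] and rewards generated by weights (0.5, 1.0, −2.0), the fit recovers these weights exactly, since the system is consistent.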
In our case, it is not even possible to pre-specify a local reward for each feature, since the features Φ^i themselves are learned by the agent and are not statically available. They are agent-internal and not part of the ΦDBN interface. In case multiple rewards are available, they can be modeled as part of the regular observations o, and r only holds the overall reward. The agent must and can learn to interpret and exploit the local rewards in o by itself.

Learning the reward function. In analogy to the MDP case for R and the DBN case for T above, it is tempting to estimate R^i_{x^i} by Σ_{r'} r' n^{ir'}_{+x^i} / n^{i+}_{+x^i}, but this makes no sense. For instance, if r_t = 1 ∀t, then R̂^i_{x^i} ≡ 1, and R̂^a_{xx'} ≡ m is a gross mis-estimation of r_t ≡ 1. The localization of the global reward is somewhat more complicated. The goal is to choose R^1_{x^1},...,R^m_{x^m} such that r_t = R(x_t) ∀t. Without loss of generality we can set R^i_0 ≡ 0, since we can subtract a constant from each local reward and absorb them into an overall constant w_0. This allows us to write

$$R(x) = w_0 x^0 + w_1 x^1 + \ldots + w_m x^m = \boldsymbol{w}^\top x, \quad \text{where } w_i := R^i_1 \text{ and } x^0 :\equiv 1.$$

In practice, the ΦDBN model will not be perfect, and an approximate solution, e.g. a least squares fit, is the best we can achieve. The square loss can be written as

$$\text{Loss}(\boldsymbol{w}) := \sum_{t=1}^n (R(x_t) - r_t)^2 = \boldsymbol{w}^\top A \boldsymbol{w} - 2\boldsymbol{b}^\top \boldsymbol{w} + c \qquad (4)$$
$$A_{ij} := \sum_{t=1}^n x^i_t x^j_t, \qquad b_i := \sum_{t=1}^n r_t x^i_t, \qquad c := \sum_{t=1}^n r_t^2.$$

Note that A_{ij} counts the number of times features i and j are "on" (=1) simultaneously, and b_i sums all rewards for which feature i is on. The loss is minimized for

$$\hat{\boldsymbol{w}} := \arg\min_{\boldsymbol{w}} \text{Loss}(\boldsymbol{w}) = A^{-1}\boldsymbol{b}, \qquad \hat R(x) = \hat{\boldsymbol{w}}^\top x,$$

which involves an inversion of the (m+1)×(m+1) matrix A. For singular A we take the pseudo-inverse.

Reward coding.
The quadratic loss function suggests a Gaussian model for the rewards:

$$P(r_{1:n}|\hat{\boldsymbol{w}},\sigma) := \exp(-\text{Loss}(\hat{\boldsymbol{w}})/2\sigma^2)\,/\,(2\pi\sigma^2)^{n/2}.$$

Maximizing this w.r.t. the variance σ² yields the maximum likelihood estimate

$$-\log P(r_{1:n}|\hat{\boldsymbol{w}},\hat\sigma) = \frac{n}{2}\log(\text{Loss}(\hat{\boldsymbol{w}})) - \frac{n}{2}\log\frac{n}{2\pi e}, \quad \text{where } \hat\sigma^2 = \text{Loss}(\hat{\boldsymbol{w}})/n.$$

Given ŵ and σ̂, this can be regarded as the (Shannon-Fano) code length of r_{1:n} (there are actually a few subtleties here which I gloss over). Each weight ŵ_k and σ̂ also needs to be coded to accuracy O(1/√n), which needs (m+2)·(1/2) log n bits in total. Together this gives a complete code of length

$$\text{CL}(r_{1:n}|x_{1:n},a_{1:n}) = \frac{n}{2}\log(\text{Loss}(\hat{\boldsymbol{w}})) + \frac{m+2}{2}\log n - \frac{n}{2}\log\frac{n}{2\pi e} \qquad (5)$$

ΦDBN evaluation and selection is similar to the MDP case. Let G denote the graphical structure of the DBN, i.e. the set of parents Pa^i ⊆ {1,...,m} of each feature i. (Remember u^i are the parent values.) Similarly to the MDP case, the cost of (Φ,G) on h_n is defined as

$$\text{Cost}(\Phi,G|h_n) := \text{CL}(x_{1:n}|a_{1:n}) + \text{CL}(r_{1:n}|x_{1:n},a_{1:n}), \qquad (6)$$

and the best (Φ,G) minimizes this cost:

$$(\Phi^{best}, G^{best}) := \arg\min_{\Phi,G}\, \text{Cost}(\Phi,G|h_n).$$

A general discussion of why this is a good criterion can be found in [Hut09]. In the following section I mainly highlight the differences from the MDP case, in particular the additional dependence on, and optimization over, G.

5 DBN Structure Learning & Updating

This section briefly discusses minimization of (6) w.r.t. G given Φ, and even more briefly minimization w.r.t. Φ. For the moment regard Φ as given and fixed.

Cost and DBN structure. For general structured local rewards R^{ia}_{u^i x'^i}, (3) and (5) both depend on G, and (6) represents a novel DBN structure learning criterion that includes the rewards. For our simple reward model R^i_{x^i}, (5) is independent of G, hence only (3) needs to be considered.
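A minimal sketch of the count-based code lengths (2) and (3) (the function names and the count bookkeeping are illustrative assumptions; constants glossed over in the text are omitted here too):

```python
import math
from collections import Counter

def CL(counts):
    """Code length (2) in bits: n*H(n/n) + (m'-1)/2 * log n,
    with m' the number of non-empty categories (log = log2)."""
    n = sum(counts)
    if n == 0:
        return 0.0
    nz = [c for c in counts if c > 0]
    H = -sum(c / n * math.log2(c / n) for c in nz)
    return n * H + (len(nz) - 1) / 2 * math.log2(n)

def CL_states(xs, acts, parents):
    """Code length (3): sum of CL over the count vectors n^{ia}_{u.},
    one per (feature i, parent values u, action a) context."""
    ctx = {}
    for t in range(1, len(xs)):
        for i, pa in enumerate(parents):
            u = tuple(xs[t - 1][j] for j in pa)  # parent values u^i at t-1
            ctx.setdefault((i, u, acts[t - 1]), Counter())[xs[t][i]] += 1
    return sum(CL(list(c.values())) for c in ctx.values())
```

For a single category, CL((n,)) = 0 bits; for a balanced binary count vector, CL((2,2)) = 4·1 + (1/2)·log 4 = 5 bits. A perfectly deterministic feature sequence likewise costs 0 bits (up to the glossed constants).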
This is a standard MDL criterion, but I have not seen it used in DBNs before. Further, the features i are independent in the sense that we can search for the optimal parent sets Pa^i ⊆ {1,...,m} for each feature i separately.

Complexity of structure search. Even in this case, finding the optimal DBN structure is generally hard. In principle we could rely on off-the-shelf heuristic search methods for finding good G, but it is probably better to use or develop some special purpose optimizer. One may even restrict the space of considered graphs G to those for which (6) can be minimized w.r.t. G efficiently, as long as this restriction can be compensated by a "smarter" Φ.

A brute-force exhaustive search algorithm for Pa^i is to consider all 2^m subsets of {1,...,m} and select the one that minimizes Σ_{u^i,a} CL(n^{ia}_{u^i •}). A reasonable and often employed assumption is to limit the number of parents to some small value p, which reduces the search space size to O(m^p). Indeed, since the Cost is exponential in the maximal number of parents of a feature, but only linear in n, a Cost-minimizing Φ can usually not have more than a logarithmic number of parents, which leads to a search space that is pseudo-polynomial in m.

Heuristic structure search. We could also replace the well-founded criterion (3) by some heuristic. One such heuristic has been developed in [SDL07]. The mutual information is another popular criterion for determining the dependency of two random variables, so we could add j as a parent of feature i if the mutual information of x^j and x'^i is above a certain threshold. Overall this takes time O(m²) to determine G. An MDL-inspired threshold for binary random variables is (1/2n) log n.
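The mutual-information heuristic with this MDL-inspired threshold can be sketched as follows (function names are illustrative; the empirical mutual information is computed in bits):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits of two aligned sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def select_parents(states, i):
    """Add j as parent of feature i iff I(x^j_{t-1}; x^i_t) > (1/2n) log n."""
    n = len(states) - 1                      # number of observed transitions
    threshold = math.log2(n) / (2 * n)
    succ_i = [x[i] for x in states[1:]]      # successor values x^i_t
    return [j for j in range(len(states[0]))
            if mutual_information([x[j] for x in states[:-1]], succ_i) > threshold]
```

For instance, if feature 0 at time t deterministically copies feature 1 at time t−1, the mutual information is the full entropy of feature 1 and easily clears the threshold, while a constant feature has mutual information 0 and is excluded.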
Since the mutual information treats parents independently, T̂ has to be estimated accordingly, essentially as in naive Bayes classification [Lew98] with feature selection, where x'^i represents the class label and u^i are the selected features. The improved Tree-Augmented naive Bayes (TAN) classifier [FGG97] could be used to model synchronous feature dependencies (i.e. within a time slice). The Chow-Liu [CL68] minimum spanning tree algorithm allows determining G in time O(m³). A tree becomes a forest if we employ a lower threshold for the mutual information.

Φ search is even harder than structure search, and remains an art. Nevertheless, the reduction of the complex (ill-defined) reinforcement learning problem to an internal feature search problem with a well-defined objective is a clear conceptual advance. In principle (but not in practice) we could consider the set of all (computable) functions {Φ : H → {0,1}}. We then compute Cost(Φ|h) for every finite subset Φ = {Φ^{i_1},...,Φ^{i_m}} and take the minimum (note that the order is irrelevant).

Most practical search algorithms require the specification of some neighborhood function, here for Φ. For instance, stochastic search algorithms suggest and accept a neighbor of Φ with a probability that depends on the Cost reduction. See [Hut09] for more details. Here I will only present some very simplistic ideas for features and neighborhoods. Assume binary observations O = {0,1} and consider the last m observations as features, i.e. Φ^i(h_n) = o_{n−i+1} and Φ(h_n) = (Φ^1(h_n),...,Φ^m(h_n)) = o_{n−m+1:n}. So the states are the same as for Φ_m MDP in [Hut09], but now S = {0,1}^m is structured as m binary features. In the example here, m = 5 leads to a perfect ΦDBN.
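The last-m-observations feature map just described is easy to state concretely (a sketch; the function name and the zero-padding of the first cycles are illustrative assumptions):

```python
def phi_window(obs, m):
    """Feature map Φ(h_n) = o_{n-m+1:n}: the last m binary observations,
    zero-padded while fewer than m observations exist."""
    padded = [0] * m + list(obs)
    return tuple(padded[-m:])
```

Widening the window by one older observation, or dropping the oldest one, changes m by one, which is exactly the neighborhood move discussed next.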
We can add a new feature o_{n−m} (m ⇝ m+1) or remove the last feature (m ⇝ m−1), which defines a natural neighborhood structure. Note that the context trees of [McC96, Hut09] are more flexible. To achieve this flexibility here, we either have to use smarter features within our framework (simply interpret s = Φ_S(h) as a feature vector of length m = ⌈log|S|⌉), or use smarter (non-tabular) estimates of P^a(x^i|u^i), extending our framework (to tree dependencies).

For general purpose intelligent agents we clearly need more powerful features. Logical expressions, (non)accepting Turing machines, or recursive sets can map histories or parts thereof into true/false, accept/reject, or in/out, respectively, and hence naturally represent binary features. Randomly generating such expressions or programs with an appropriate bias towards simple ones is a universal feature generator that eventually finds the optimal feature map. The idea is known as Universal Search [Gag07].

6 Value & Policy Learning in ΦDBN

Given an estimate Φ̂ of Φ^{best}, the next step is to determine a good action for our agent. I mainly concentrate on the difficulties one faces in adapting MDP algorithms, and discuss state-of-the-art DBN algorithms. Value and policy learning in known finite state MDPs is easy, provided one is satisfied with a polynomial time algorithm. Since a DBN is just a special (structured) MDP, its (Q) Value function respects the same Bellman equations [Hut09, Eq.(6)], and the optimal policy is still given by a_{n+1} := arg max_a Q*_{a x_{n+1}}. Nevertheless, their solution is now a nightmare, since the state space is exponential in the number of features. We need algorithms that are polynomial in the number of features, i.e. logarithmic in the number of states.

Value function approximation.
The first problem is that the optimal value and policy do not respect the structure of the DBN. They are usually complex functions of the (exponentially many) states, which cannot even be stored, not to mention computed [KP99]. It has been suggested that the value can often be approximated well as a sum of local values, similarly to the rewards. Such a value function can at least be stored.

Model-based learning. The default quality measure for the approximate value is the ρ-weighted squared difference, where ρ is the stationary distribution. Even for a fixed policy, value iteration does not converge to the best approximation, but usually converges to a fixed point close to it [BT96]. Value iteration requires ρ explicitly. Since ρ is also too large to store, one has to approximate ρ as well. Another problem, as pointed out in [KP00], is that policy iteration may not converge, since different policies have different (misleading) stationary distributions. Koller and Parr [KP00] devised algorithms for general factored ρ, and Guestrin et al. [GKPV03] for max-norm, alleviating this problem. Finally, general policies cannot be stored exactly, and another restriction or approximation is necessary.

Model-free learning. Given the difficulties above, I suggest to (re)consider a very simple class of algorithms, without suggesting that it is better. The model-based algorithms above exploit T̂ and R̂ directly. An alternative is to sample from T̂ and use model-free "Temporal Difference (TD)" learning algorithms based only on this internal virtual sample [SB98]. We could use TD(λ) or Q-value variants with linear value function approximation. Besides their simplicity, another advantage is that neither the stationary distribution nor the policy needs to be stored or approximated.
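A minimal sketch of such a model-free variant: Q-learning with a linear value function, trained only on virtual transitions sampled from the learned internal model. The paper leaves the concrete TD variant open, so all names, the ε-greedy schedule, and the learning-rate choices below are illustrative assumptions:

```python
import random

def q_learn_linear(sample_step, phi, actions, x0, steps=5000,
                   alpha=0.05, gamma=0.95, eps=0.1):
    """Q-learning with linear approximation Q(x,a) = w_a . phi(x),
    using only virtual transitions (x', r) = sample_step(x, a)
    drawn from the estimated model (T^, R^)."""
    k = len(phi(x0))
    w = {a: [0.0] * k for a in actions}
    def Q(x, a):
        return sum(wi * fi for wi, fi in zip(w[a], phi(x)))
    x = x0
    for _ in range(steps):
        a = (random.choice(actions) if random.random() < eps
             else max(actions, key=lambda b: Q(x, b)))   # eps-greedy
        x2, r = sample_step(x, a)
        td = r + gamma * max(Q(x2, b) for b in actions) - Q(x, a)  # TD error
        w[a] = [wi + alpha * td * fi for wi, fi in zip(w[a], phi(x))]
        x = x2
    return w
```

Neither ρ nor an explicit policy is stored: the greedy action arg max_a Q(x,a) is recomputed on demand from the weights.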
Once an approximation Q̂* has been obtained, it is trivial to determine the optimal (w.r.t. Q̂*) action via a_{n+1} = arg max_a Q̂*_{a x_{n+1}} for any state of interest (namely x_{n+1}) exactly.

Exploration. Optimal actions based on approximate rather than exact values can lead to very poor behavior due to lack of exploration. There are polynomially optimal algorithms (Rmax, E3, OIM) for the exploration-exploitation dilemma. For model-based learning, extending E3 to DBNs is straightforward, but E3 needs an oracle for planning in a given DBN [KK99]. Recently, Strehl et al. [SDL07] accomplished the same for Rmax. They even learn the DBN structure, albeit in a very simplistic way. Algorithm OIM [SL08], which I described in [Hut09] for MDPs, can also likely be generalized to DBNs, and I can imagine a model-free version.

7 Incremental Updates

As discussed in Section 5, most search algorithms are local in the sense that they produce a chain of "slightly" modified candidate solutions, here Φ's. This suggests a potential speedup by computing quantities of interest incrementally.

Cost. Computing CL(x|a) in (3) takes at most time O(m·2^k·|A|), where k is the maximal number of parents of a feature. If we remove feature i, we can simply remove/subtract the contributions from i in the sum. If we add a new feature m+1, we only need to search for the best parent set u^{m+1} for this new feature, and add the corresponding code length. In practice, many transitions don't occur, i.e. n^{ia}_{u^i x^i} = 0, so CL(x|a) can actually be computed much faster, in time O(|{n^{ia}_{u^i x^i} > 0}|), and incrementally even faster.

Rewards. When adding a new feature, the current local reward estimates may not change much. If we reassign a fraction α ≤ 1 of the reward to the new feature x^{m+1}, we get the following ansatz¹:
$$\hat R(x^1,...,x^{m+1}) = (1-\alpha)\hat R(x) + w_{m+1} x^{m+1} =: \boldsymbol{v}^\top \psi(x),$$
$$\boldsymbol{v} := (1-\alpha,\, w_{m+1})^\top, \qquad \psi := (\hat R(x),\, x^{m+1})^\top.$$

Minimizing Σ_{t=1}^n (R̂(x^1_t ... x^{m+1}_t) − r_t)² w.r.t. v, analogously to (4), just requires a trivial 2×2 matrix inversion. The minimum ṽ results in an initial new estimate w̃ = ((1−α̃)ŵ_0, ..., (1−α̃)ŵ_m, w̃_{m+1})ᵀ, which can be improved by some first-order gradient descent algorithm in time O(m), compared to the exact O(m³) algorithm. When removing a feature, we simply redistribute its local reward to the other features, e.g. uniformly, followed by improvement steps that cost O(m) time.

Value. All iteration algorithms described in Section 6 for computing (Q) Values need an initial value for V or Q. We can take the estimate V̂ from a previous Φ as an initial value for the new Φ. Similarly to the rewards, we can redistribute a fraction of the values by solving relatively small systems of equations. The result is then used as an initial value for the iteration algorithms in Section 6. A further speedup can be obtained by using prioritized iteration algorithms that concentrate their time on badly estimated parameters, which in our case are the new values [SB98]. Similarly, results from time t can be (re)used as initial estimates for the next cycle t+1, followed by a fast improvement step.

8 Outlook

ΦDBN leaves many more questions open, and more room for modifications and improvements, than ΦMDP. Here are a few.

• The cost function can be improved by integrating out the states analogously to the ΦMDP case [Hut09]: the likelihood P(r_{1:n}|a_{1:n}, Û) is unchanged, except that Û ≡ T̂R̂ is now estimated locally, and the complexity penalty becomes (1/2)(M + m + 2) log n, where M is (essentially) the number of non-zero counts n^{ia}_{u^i x^i}; but an efficient algorithm has yet to be found.
¹ An ansatz is an initial mathematical or physical model with some free parameters to be determined subsequently. [http://en.wikipedia.org/wiki/Ansatz]

• It may be necessary to impose and exploit structure on the conditional probability tables P^a(x^i|u^i) themselves [BDH99].

• Real-valued observations and beliefs suggest extending the binary feature model to [0,1] interval-valued features rather than coding them in binary. Since any continuous semantics that preserves the role of 0 and 1 is acceptable, there should be an efficient way to generalize the Cost and Value estimation procedures.

• I assumed that the reward/value is linear in local rewards/values. Is this sufficient for all practical purposes? I also assumed a least squares and Gaussian model for the local rewards. There are efficient algorithms for much more flexible models. The least we could do is to code w.r.t. the proper covariance A.

• I also barely discussed synchronous (within time-slice) dependencies.

• I guess ΦDBN will often be able to work around too restrictive DBN models by finding features Φ that are more compatible with the DBN and reward structure.

• Extra edges in the DBN can improve the linear value function approximation. To give ΦDBN incentives to do so, the Value would have to be included in the Cost criterion.

• Implicitly I assumed that the action space A is small. It is possible to extend ΦDBN to large structured action spaces.

• Apart from the Φ-search, all parts of ΦDBN seem to be poly-time approximable, which is satisfactory in theory. In practice, this needs to be improved to essentially linear time in n and m.

• Developing smart Φ generation and smart stochastic search algorithms for Φ are the major open challenges.
• A more Bayesian Cost criterion would be desirable: a likelihood of h given Φ and a prior over Φ, leading to a posterior of Φ given h, or so. Monte Carlo (search) algorithms like Metropolis-Hastings could sample from such a posterior. Currently, probabilities (=̂ 2^{−CL}) are assigned only to rewards and states, but not to observations and feature maps.
Summary. In this work I introduced a powerful framework (ΦDBN) for general-purpose intelligent learning agents, and presented algorithms for all required building blocks. The introduced cost criterion reduces the informal reinforcement learning problem to an internal, well-defined search for "relevant" features.
References
[BDH99] C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.
[Bis06] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[BT96] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[CL68] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3):462–467, 1968.
[DK89] T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5(3):142–150, 1989.
[FGG97] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29(2):131–163, 1997.
[Gag07] M. Gaglio. Universal search. Scholarpedia, 2(11):2575, 2007.
[GKPV03] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research (JAIR), 19:399–468, 2003.
[GP07] B. Goertzel and C. Pennachin, editors. Artificial General Intelligence. Springer, 2007.
[Grü07] P. D. Grünwald. The Minimum Description Length Principle. The MIT Press, Cambridge, 2007.
[Hut03] M. Hutter. Optimality of universal Bayesian prediction for general loss and alphabet. Journal of Machine Learning Research, 4:971–1000, 2003.
[Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005. 300 pages, http://www.hutter1.net/ai/uaibook.htm.
[Hut09] M. Hutter. Feature Markov decision processes. In Artificial General Intelligence (AGI'09). Atlantis Press, 2009.
[KK99] M. Kearns and D. Koller. Efficient reinforcement learning in factored MDPs. In Proc. 16th International Joint Conference on Artificial Intelligence (IJCAI-99), pages 740–747, San Francisco, 1999. Morgan Kaufmann.
[KLC98] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.
[KP99] D. Koller and R. Parr. Computing factored value functions for policies in structured MDPs. In Proc. 16th International Joint Conference on Artificial Intelligence (IJCAI'99), pages 1332–1339, Edinburgh, 1999.
[KP00] D. Koller and R. Parr. Policy iteration for factored MDPs. In Proc. 16th Conference on Uncertainty in Artificial Intelligence (UAI-00), pages 326–334, San Francisco, CA, 2000. Morgan Kaufmann.
[Lew98] D. D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proc. 10th European Conference on Machine Learning (ECML'98), pages 4–15, Chemnitz, DE, 1998. Springer.
[LH07] S. Legg and M. Hutter. Universal intelligence: A definition of machine intelligence. Minds & Machines, 17(4):391–444, 2007.
[McC96] A. K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, Department of Computer Science, University of Rochester, 1996.
[RN03] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 2nd edition, 2003.
[RPPCd08] S. Ross, J. Pineau, S. Paquet, and B. Chaib-draa. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32:663–704, 2008.
[SB98] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[SDL07] A. L. Strehl, C. Diuk, and M. L. Littman. Efficient structure learning in factored-state MDPs. In Proc. 22nd AAAI Conference on Artificial Intelligence, pages 645–650, Vancouver, BC, 2007. AAAI Press.
[SL08] I. Szita and A. Lőrincz. The many faces of optimism: a unifying approach. In Proc. 25th International Conference on Machine Learning (ICML 2008), volume 307, Helsinki, Finland, June 2008.