Time Consistent Discounting

Tor Lattimore¹ and Marcus Hutter¹,²
¹Research School of Computer Science, Australian National University
²ETH Zürich
{tor.lattimore, marcus.hutter}@anu.edu.au

15 July 2011

Abstract

A possibly immortal agent tries to maximise its summed discounted rewards over time, where discounting is used to avoid infinite utilities and encourage the agent to value current rewards more than future ones. Some commonly used discount functions lead to time-inconsistent behavior where the agent changes its plan over time. These inconsistencies can lead to very poor behavior. We generalise the usual discounted utility model to one where the discount function changes with the age of the agent. We then give a simple characterisation of time-(in)consistent discount functions and show the existence of a rational policy for an agent that knows its discount function is time-inconsistent.

Contents
1 Introduction
2 Notation and Problem Setup
3 Examples
4 Theorems
5 Game Theoretic Approach
6 Discussion
A Technical Proofs
B Table of Notation

Keywords: rational agents; sequential decision theory; general discounting; time-consistency; game theory.

1 Introduction

The goal of an agent is to maximise its expected utility; but how do we measure utility? One method is to assign an instantaneous reward to particular events, such as having a good meal, or a pleasant walk. It would be natural to measure the utility of a plan (policy) by simply summing the expected instantaneous rewards, but for immortal agents this may lead to infinite utility and also assumes rewards are equally valuable irrespective of the time at which they are received.
One solution, the discounted utility (DU) model introduced by Samuelson in [Sam37], is to take a weighted sum of the rewards with earlier rewards usually valued more than later ones. There have been a number of criticisms of the DU model, which we will not discuss. For an excellent summary, see [FOO02]. Despite the criticisms, the DU model is widely used in both economics and computer science.

A discount function is time-inconsistent if plans chosen to maximise expected discounted utility change over time. For example, many people express a preference for $110 in 31 days over $100 in 30 days, but reverse that preference 30 days later when given a choice between $110 tomorrow or $100 today [GFM94]. This behavior can be caused by a rational agent with a time-inconsistent discount function. Unfortunately, time-inconsistent discount functions can lead to extremely bad behavior and so it becomes important to ask which discount functions are time-inconsistent.

Previous work has focussed on a continuous model where agents can take actions at any time in a continuous time-space. We consider a discrete model where agents act in finite time-steps. In general this is not a limitation since any continuous environment can be approximated arbitrarily well by a discrete one. The discrete setting has the advantage of easier analysis, which allows us to consider a very general setup where environments are arbitrary finite or infinite Markov decision processes.

Traditionally, the DU model has assumed a sliding discount function. Formally, a sequence of instantaneous utilities (rewards) $R = (r_k, r_{k+1}, r_{k+2}, \cdots)$ starting at time $k$ is given utility equal to $\sum_{t=k}^{\infty} d_{t-k} r_t$ where $d \in [0,1]^\infty$. We generalise this model as in [Hut06] by allowing the discount function to depend on the age of the agent.
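The sliding weighted sum above can be illustrated with a minimal sketch (the function name and the particular reward/discount values are mine, not the paper's):

```python
# Sliding discounted utility: a reward sequence (r_k, r_{k+1}, ...) is valued
# as sum_i d_i * r_{k+i}.  The weight depends only on the delay i = t - k,
# never on the agent's age k -- the defining property of a "sliding" discount.

def sliding_utility(rewards, d):
    """rewards[i] is r_{k+i}; d[i] is the weight for a delay of i steps."""
    return sum(w * r for w, r in zip(d, rewards))

# Example: geometric weights d_i = 0.9**i on a constant reward stream.
d = [0.9 ** i for i in range(4)]
u = sliding_utility([1.0, 1.0, 1.0, 1.0], d)   # 1 + 0.9 + 0.81 + 0.729 = 3.439
```

Because the weights ignore the age $k$, every age re-evaluates plans with the same delay profile; as Section 4 shows, only geometric weights make those re-evaluations agree.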
The new utility is given by $\sum_{t=k}^{\infty} d^k_t r_t$. This generalisation is consistent with how some agents tend to behave; for example, humans becoming temporally less myopic as they grow older.

Strotz [Str55] showed that the only time-consistent sliding discount function is geometric discounting. We extend this result to a full characterisation of time-consistent discount functions where the discount function is permitted to change over time. We also show that discount functions that are "nearly" time-consistent give rise to low regret in the anticipated future changes of the policy over time.

Another important question is what policy should be adopted by an agent that knows it is time-inconsistent. For example, if it knows it will become temporarily myopic in the near future then it may benefit from paying a price to pre-commit to following a particular policy. A number of authors have examined this question in special continuous cases, including [Gol80, PY73, Pol68, Str55]. We modify their results to our general, but discrete, setting using game theory.

The paper is structured as follows. First the required notation is introduced (Section 2). Example discount functions and the consequences of time-inconsistent discount functions are then presented (Section 3). We next state and prove the main theorems, the complete classification of discount functions and the continuity result (Section 4). The game theoretic view of what an agent should do if it knows its discount function is changing is analyzed (Section 5). Finally we offer some discussion and concluding remarks (Section 6).

2 Notation and Problem Setup

The general reinforcement learning (RL) setup involves an agent interacting sequentially with an environment where in each time-step $t$ the agent chooses some action $a_t \in \mathcal{A}$, whereupon it receives a reward $r_t \in \mathcal{R} \subseteq \mathbb{R}$ and observation $o_t \in \mathcal{O}$.
The environment can be formally defined as a probability distribution $\mu$ where $\mu(r_t o_t \mid a_1 r_1 o_1 a_2 r_2 o_2 \cdots a_{t-1} r_{t-1} o_{t-1} a_t)$ is the probability of receiving reward $r_t$ and observation $o_t$ having taken action $a_t$ after history $h_{<t}$.

[…] A discount vector $d^k$ is a non-negative vector with $d^k_t > 0$ for at least one $t \ge k$. The apparently superfluous superscript $k$ will be useful later when we allow the discount vector to change with time. We do not insist that the discount vector be summable, $\sum_{t=k}^{\infty} d^k_t < \infty$.

Definition 5 (Expected Values). The expected discounted reward (or utility, or value) when using policy $\pi$ starting in history $h_{<k}$ is $V^\pi_{d^k}(h_{<k})$.

Definition 6 (Optimal Policy/Value). In general, our agent will try to choose a policy $\pi^*_{d^k}$ to maximise $V^\pi_{d^k}(h_{<k})$.

[…]

3 Examples

Geometric. $d^k_t = \gamma^t$ with $\gamma \in (0,1)$. Geometric discounting is the most commonly used discount matrix. Philosophically it can be justified by assuming an agent will die (and not care about the future after death) with probability $1-\gamma$ at each time-step. Another justification for geometric discounting is its analytic simplicity: it is summable and leads to time-consistent policies. It also models fixed interest rates.

No Discounting. $d^k_t = 1$, for all $k, t$. [LH07] and [Leg08] point out that discounting future rewards via an explicit discount matrix is unnecessary since the environment can capture both temporal preferences for early (or late) consumption, as well as the risk associated with delaying consumption. Of course, this "discount matrix" is not summable, but can be made to work by insisting that all environments satisfy Assumption 7. This approach is elegant in the sense that it eliminates the need for a discount matrix, essentially admitting far more complex preferences regarding inter-temporal rewards than a discount matrix allows.

¹ $[[\text{expr}]] = 1$ if expr is true and 0 otherwise.
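The discount matrices above can be sketched as functions of the agent's age $k$ and the reward's time $t$. This is my own encoding (1-based time, finite truncation), not code from the paper; the utility computed is the age-$k$ value $\sum_{t \ge k} d^k_t r_t$:

```python
# Example discount matrices d^k_t as functions of age k and time t (t >= k >= 1).

def geometric(gamma):
    return lambda k, t: gamma ** t          # d^k_t = gamma^t: summable, time-consistent

def no_discounting():
    return lambda k, t: 1.0                 # d^k_t = 1: not summable on its own

def constant_horizon(H):
    return lambda k, t: 1.0 if t < k + H else 0.0   # only the next H steps matter

def utility(d, rewards, k):
    """Age-k utility sum_{t>=k} d(k,t) * r_t over a finite reward list (1-based t)."""
    return sum(d(k, t) * rewards[t - 1] for t in range(k, len(rewards) + 1))

rewards = [1.0, 1.0, 1.0, 1.0]
u_geo = utility(geometric(0.5), rewards, k=1)       # 0.5 + 0.25 + 0.125 + 0.0625
u_hor = utility(constant_horizon(2), rewards, k=2)  # counts only time-steps 2 and 3
```

Note how the constant-horizon row depends on $k$ while the geometric row does not; this difference is exactly what the characterisation in Section 4 formalises.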
On the other hand, a discount matrix gives the "controller" an explicit way to adjust the myopia of the agent.

To illustrate the potential consequences of time-inconsistent discount matrices we consider the policies of several agents acting in the following environment. [Figure: chain environment; the agent starts at S and repeatedly chooses right (reward 0) or up, where the up-rewards along the chain are 1/2, 2/3, 3/4, 4/5, …, followed by 0 forever.] Let agent A use a constant horizon discount matrix with H = 2 and agent B a geometric discount matrix with some discount rate $\gamma$.

In the first time-step agent A prefers to move right with the intention of moving up in the second time-step for a reward of 2/3. However, once in the second time-step, it will change its plan by moving right again. This continues indefinitely, so agent A will always delay moving up and receives zero reward forever.

Agent B acts very differently. Let $\pi_t$ be the policy in which the agent moves right until time-step $t$, then up and right indefinitely. Then $V^{\pi_t}_{d^k}(h_{<1}) = \gamma^t \frac{t+1}{t+2}$. This value does not depend on $k$ and so the agent will move right until $t = \arg\max_t \left\{ \gamma^t \frac{t+1}{t+2} \right\} < \infty$, when it will move up and receive a reward.

The actions of agent A are an example of the worst possible behavior arising from time-inconsistent discounting. Nevertheless, agents with a constant horizon discount matrix are used in all kinds of problems; in particular, agents in zero-sum games, where fixed-depth mini-max searches are common. In practice, serious time-inconsistent behavior for game-playing agents seems rare, presumably because most strategic games don't have a reward structure similar to the example above.

4 Theorems

The main theorem of this paper is a complete characterisation of time-consistent discount matrices.

Theorem 13 (Characterisation). Let $d$ be a discount matrix, then the following are equivalent:
1. $d$ is time-consistent (Definition 12);
2.
For each $k$ there exists an $\alpha_k \in \mathbb{R}$ such that $d^k_t = \alpha_k d^1_t$ for all $t \ge k \in \mathbb{N}$.

Recall that a discount matrix is sliding if $d^k_t = d^1_{t-k+1}$. Theorem 13 can be used to show that if a sliding discount matrix is used, as in [Str55], then the only time-consistent discount matrix is geometric. Let $d$ be a time-consistent sliding discount matrix. By Theorem 13 and the definition of sliding, $\alpha_1 d^1_{t+1} = d^2_{t+1} = d^1_t$. Therefore $d^1_2 = \frac{1}{\alpha_1} d^1_1$ and $d^1_3 = \frac{1}{\alpha_1} d^1_2 = \frac{1}{\alpha_1^2} d^1_1$ and similarly, $d^1_t = \frac{1}{\alpha_1^{t-1}} d^1_1 \propto \gamma^t$ with $\gamma = 1/\alpha_1$, which is geometric discounting. This is the analogue of the results of [Str55] converted to our setting.

The theorem can also be used to construct time-consistent discount rates. Let $d^1$ be a discount vector, then the discount matrix defined by $d^k_t := d^1_t$ for all $t \ge k$ will always be time-consistent; for example, the fixed lifetime discount matrix with $d^k_t = 1$ if $t \le H$ for some horizon $H$. Indeed, all time-consistent discount rates can be constructed in this way (up to scaling).

Proof of Theorem 13. 2 ⟹ 1: This direction follows easily from linearity of the scalar product. 1 ⟹ 2: […] By the previous argument we have that $[d^k_{t_i}, d^k_{t_i+1}] = \alpha_k [d^1_{t_i}, d^1_{t_i+1}]$ and $[d^k_{t_i+1}, d^k_{t_i+2}] = \tilde{\alpha}_k [d^1_{t_i+1}, d^1_{t_i+2}]$. Therefore $\alpha_k = \tilde{\alpha}_k$, and by induction, $d^k_{t_i} = \alpha_k d^1_{t_i}$ for all $i$. Now if $t \ge k$ and $d^1_t = 0$ then $d^k_t = 0$ by equation (6). By symmetry, $d^k_t = 0 \implies d^1_t = 0$. Therefore $d^k_t = \alpha_k d^1_t$ for all $t \ge k$ as required.

In Section 3 we saw an example where time-inconsistency led to very bad behavior. The discount matrix causing this was very time-inconsistent. Is it possible that an agent using a "nearly" time-consistent discount matrix can exhibit similar bad behavior?
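Theorem 13 suggests a finite check for time-consistency of a truncated discount matrix: each row $k$ must be a scalar multiple of row 1 on the entries $t \ge k$. A small sketch of this check (my own encoding; rows and columns are 0-based, and proportionality is tested via vanishing 2×2 minors so that zero entries need no division):

```python
def is_time_consistent(d, tol=1e-12):
    """d[k][t] plays the role of d^{k+1}_{t+1}.  Row k is proportional to
    row 0 on columns >= k iff every 2x2 minor against row 0 vanishes there."""
    T = len(d[0])
    for k in range(1, len(d)):
        for s in range(k, T):
            for t in range(s + 1, T):
                if abs(d[k][s] * d[0][t] - d[k][t] * d[0][s]) > tol:
                    return False
    return True

gamma = 0.5
geometric = [[gamma ** t for t in range(6)] for _ in range(4)]   # d^k_t = gamma^t
horizon   = [[1.0 if k <= t < k + 2 else 0.0 for t in range(6)]
             for k in range(4)]                                   # constant horizon H = 2

consistent_geo = is_time_consistent(geometric)   # rows identical -> consistent
consistent_hor = is_time_consistent(horizon)     # rows shift with k -> inconsistent
```

The geometric matrix passes (every row equals row 1, so $\alpha_k = 1$), while the constant-horizon matrix from Section 3 fails, matching the behavior of agents A and B there.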
For example, could rounding errors when using a geometric discount matrix seriously affect the agent's behavior? The following theorem shows that this is not possible. First we require a measure of the cost of time-inconsistent behavior. The regret experienced by the agent at time zero from following policy $\pi_d$ rather than $\pi^*_{d^1}$ is $V^*_{d^1}(h_{<1}) - V^{\pi_d}_{d^1}(h_{<1})$. We also need a distance measure on the space of discount vectors.

Definition 14 (Distance Measure). Let $d^k, d^j$ be discount vectors, then define a distance measure $D$ by
$$D(d^k, d^j) := \sum_{i=\max\{k,j\}}^{\infty} |d^k_i - d^j_i|.$$
Note that this is almost the taxicab metric, but the sum is restricted to $i \ge \max\{k, j\}$.

Theorem 15 (Continuity). Suppose $\epsilon \ge 0$ and $D_{k,j} := D(d^k, d^j)$, then
$$V^*_{d^1}(h_{<1}) - V^{\pi_d}_{d^1}(h_{<1}) \le \epsilon + D_{1,t} + \sum_{k=1}^{t-1} D_{k,k+1}$$
with $t = \min\{t : \cdots\}$, which is guaranteed to exist by Assumption 7.

Theorem 15 implies that the regret of the agent at time zero in its future time-inconsistent actions is bounded by the sum of the differences between the discount vectors used at different times. If these differences are small then the regret is also small. For example, it implies that small perturbations (such as rounding errors) in a time-consistent discount matrix lead to minimal bad behavior.

The proof is omitted due to limitations in space. It relies on proving the result for finite horizon environments and showing that this extends to the infinite case by using the horizon $t$, after which the actions of the agent are no longer important.

The bound in Theorem 15 is tight in the following sense.

Theorem 16.
For $\delta > 0$ and $t \in \mathbb{N}$ and any sufficiently small $\epsilon > 0$ there exists an environment and discount matrix such that
$$(t-2)(1-\epsilon)\delta < V^*_{d^1}(h_{<1}) - V^{\pi_d}_{d^1}(h_{<1}) < (t+1)\delta \equiv D_{1,t} + \sum_{i=1}^{t-1} D_{i,i+1}$$
where $t = \min\{t : \cdots\}$ as in Theorem 15. That is, there exists an environment where the regret due to time-inconsistency is nearly equal to the bound given by Theorem 15.

Proof of Theorem 16. Define $d$ by
$$d^k_i = \begin{cases} \delta & \text{if } k < i < t \\ 0 & \text{otherwise.} \end{cases}$$
Observe that $D(d^k, d^j) = \delta$ for all $k < j < t$ since $d^j_i = d^k_i$ for all $i$ except $i = j$. Now consider the environment below. [Figure: chain environment starting at S; moving right gives reward 0, while moving down at successive time-steps gives rewards $1-\epsilon$, $1-\epsilon^2$, …, $1-\epsilon^{t-1}$, after which all rewards are 0.]

For sufficiently small $\epsilon$, the agent at time zero will plan to move right and then down, leading to $R^*_{d^1}(h_{<1}) = [0, 1-\epsilon, 1-\epsilon, \cdots]$ and $V^*_{d^1}(h_{<1}) = (t-1)\delta(1-\epsilon)$. To compute $R_d$ note that $d^k_k = 0$ for all $k$. Therefore the agent in time-step $k$ doesn't care about the next instantaneous reward, so prefers to move right with the intention of moving down in the next time-step when the rewards are slightly better. This leads to $R_d(h_{<1}) = [0, 0, \cdots, 1-\epsilon^{t-1}, 0, 0, \cdots]$. Therefore,
$$V^*_{d^1}(h_{<1}) - V^{\pi_d}_{d^1}(h_{<1}) = (t-1)\delta(1-\epsilon) - (1-\epsilon^{t-1})\delta \ge (t-2)\delta(1-\epsilon)$$
as required.

5 Game Theoretic Approach

What should an agent do if it knows it is time-inconsistent? One option is to treat its future selves as "opponents" in an extensive game. The game has one player per time-step who chooses the action for that time-step only. At the end of the game the agent will have received a reward sequence $r \in \mathcal{R}^\infty$. The utility given to the $k$th player is then $r \cdot d^k$. So each player in this game wishes to maximise the discounted reward with respect to a different discounting vector.

For example, let $d^1 = [2, 1, 2, 0, 0, \cdots]$ and $d^2 = [*, 3, 1, 0, 0, \cdots]$ and consider the following environment. [Figure: from S the agent can move down for reward 4, or right for reward 1 followed by a second choice leading to reward sequence $(3, 0, \cdots)$ or $(1, 3, 0, \cdots)$.]
Initially, the agent has two choices. It can either move down to guarantee a reward sequence of $r = [4, 0, 0, \cdots]$, which has utility $d^1 \cdot [4, 0, 0, \cdots] = 8$, or it can move right, in which case it will receive a reward sequence of either $r' = [1, 3, 0, 0, \cdots]$ with utility 5 or $r'' = [1, 1, 3, 0, 0, \cdots]$ with utility 9. Which of these two reward sequences it receives is determined by the action taken in the second time-step. However, this action is chosen to maximise utility with respect to discount sequence $d^2$, and $d^2 \cdot r' > d^2 \cdot r''$. This means that if at time 1 the agent chooses to move right, the final reward sequence will be $[1, 3, 0, 0, \cdots]$ and the final utility with respect to $d^1$ will be 5. Therefore the rational thing to do in time-step 1 is to move down immediately for a utility of 8.

The technique above is known as backwards induction, which is used to find sub-game perfect equilibria in finite extensive games. A variant of Kuhn's theorem proves that backwards induction can be used to find such equilibria in finite extensive games [OR94]. For arbitrary extensive games (possibly infinite) a sub-game perfect equilibrium need not exist, but we prove a theorem for our particular class of infinite games. A sub-game perfect equilibrium policy is one the players could agree to play, and subsequently have no incentive to renege on their agreement during play. It isn't always philosophically clear that a sub-game perfect equilibrium policy should be played. For a deeper discussion, including a number of good examples, see [OR94].

Definition 17 (Sub-game Perfect Equilibria). A policy $\pi^*_d$ is a sub-game perfect equilibrium policy if and only if for each $t$, $V^{\pi^*_d}_{d^t}(h_{<t}) \ge \cdots$ […] That is, the environment which gives zero reward always after time $t$.
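The backwards-induction computation in the example above can be sketched directly: the time-2 player picks the continuation maximising $r \cdot d^2$, and the time-1 player then evaluates its own options under $d^1$ against that actual outcome. The environment encoding below is my own (the `*` entry of $d^2$, irrelevant to player 2, is set to 0):

```python
# Reward sequences reachable in the example environment.
r_down  = [4, 0, 0]        # move down immediately
r_right = [[1, 3, 0],      # move right; player 2 then chooses one of these
           [1, 1, 3]]

d1 = [2, 1, 2]
d2 = [0, 3, 1]             # the '*' entry is irrelevant to player 2; use 0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Player 2 (time-step 2) picks the continuation maximising r . d2 ...
r2 = max(r_right, key=lambda r: dot(r, d2))     # utilities 9 vs 6 under d2

# ... so player 1 compares moving down against the *actual* outcome of moving
# right (the value of r2 under d1), not the best-case utility 9.
u_down, u_right = dot(r_down, d1), dot(r2, d1)
best_first_move = "down" if u_down >= u_right else "right"
```

Running this reproduces the reasoning in the text: player 2 selects $[1, 3, 0]$, so moving right is worth only 5 to player 1, and the sub-game perfect first move is down for a utility of 8.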
We can assume without loss of generality that $\pi_t(h_{<t})$ […]