Time Consistent Discounting

Tor Lattimore¹ and Marcus Hutter¹,²
¹Research School of Computer Science, Australian National University
²ETH Zürich
{tor.lattimore, marcus.hutter}@anu.edu.au

15 July 2011

Abstract

A possibly immortal agent tries to maximise its summed discounted rewards over time, where discounting is used to avoid infinite utilities and encourage the agent to value current rewards more than future ones. Some commonly used discount functions lead to time-inconsistent behavior where the agent changes its plan over time. These inconsistencies can lead to very poor behavior. We generalise the usual discounted utility model to one where the discount function changes with the age of the agent. We then give a simple characterisation of time-(in)consistent discount functions and show the existence of a rational policy for an agent that knows its discount function is time-inconsistent.

Contents
1 Introduction
2 Notation and Problem Setup
3 Examples
4 Theorems
5 Game Theoretic Approach
6 Discussion
A Technical Proofs
B Table of Notation

Keywords: rational agents; sequential decision theory; general discounting; time-consistency; game theory.

1 Introduction

The goal of an agent is to maximise its expected utility; but how do we measure utility? One method is to assign an instantaneous reward to particular events, such as having a good meal, or a pleasant walk. It would be natural to measure the utility of a plan (policy) by simply summing the expected instantaneous rewards, but for immortal agents this may lead to infinite utility and also assumes rewards are equally valuable irrespective of the time at which they are received.
One solution, the discounted utility (DU) model introduced by Samuelson in [Sam37], is to take a weighted sum of the rewards with earlier rewards usually valued more than later ones. There have been a number of criticisms of the DU model, which we will not discuss. For an excellent summary, see [FOO02]. Despite the criticisms, the DU model is widely used in both economics and computer science.

A discount function is time-inconsistent if plans chosen to maximise expected discounted utility change over time. For example, many people express a preference for $110 in 31 days over $100 in 30 days, but reverse that preference 30 days later when given a choice between $110 tomorrow or $100 today [GFM94]. This behavior can be caused by a rational agent with a time-inconsistent discount function. Unfortunately, time-inconsistent discount functions can lead to extremely bad behavior and so it becomes important to ask which discount functions are time-inconsistent.

Previous work has focussed on a continuous model where agents can take actions at any time in a continuous time-space. We consider a discrete model where agents act in finite time-steps. In general this is not a limitation since any continuous environment can be approximated arbitrarily well by a discrete one. The discrete setting has the advantage of easier analysis, which allows us to consider a very general setup where environments are arbitrary finite or infinite Markov decision processes.

Traditionally, the DU model has assumed a sliding discount function. Formally, a sequence of instantaneous utilities (rewards) $R = (r_k, r_{k+1}, r_{k+2}, \cdots)$ starting at time $k$ is given utility equal to $\sum_{t=k}^{\infty} d_{t-k} r_t$ where $d \in [0,1]^\infty$. We generalise this model as in [Hut06] by allowing the discount function to depend on the age of the agent.
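The sliding weighted sum above can be illustrated with a minimal sketch (the function name and the particular reward/discount values are mine, not the paper's):

```python
# Sliding discounted utility: a reward sequence (r_k, r_{k+1}, ...) is valued
# as sum_i d_i * r_{k+i}.  The weight depends only on the delay i = t - k,
# never on the agent's age k -- the defining property of a "sliding" discount.

def sliding_utility(rewards, d):
    """rewards[i] is r_{k+i}; d[i] is the weight for a delay of i steps."""
    return sum(w * r for w, r in zip(d, rewards))

# Example: geometric weights d_i = 0.9**i on a constant reward stream.
d = [0.9 ** i for i in range(4)]
u = sliding_utility([1.0, 1.0, 1.0, 1.0], d)   # 1 + 0.9 + 0.81 + 0.729 = 3.439
```

Because the weights ignore the age $k$, every age re-evaluates plans with the same delay profile; as Section 4 shows, only geometric weights make those re-evaluations agree.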
The new utility is given by $\sum_{t=k}^{\infty} d^k_t r_t$. This generalisation is consistent with how some agents tend to behave; for example, humans becoming temporally less myopic as they grow older.

Strotz [Str55] showed that the only time-consistent sliding discount function is geometric discounting. We extend this result to a full characterisation of time-consistent discount functions where the discount function is permitted to change over time. We also show that discount functions that are "nearly" time-consistent give rise to low regret in the anticipated future changes of the policy over time.

Another important question is what policy should be adopted by an agent that knows it is time-inconsistent. For example, if it knows it will become temporarily myopic in the near future then it may benefit from paying a price to pre-commit to following a particular policy. A number of authors have examined this question in special continuous cases, including [Gol80, PY73, Pol68, Str55]. We modify their results to our general, but discrete, setting using game theory.

The paper is structured as follows. First the required notation is introduced (Section 2). Example discount functions and the consequences of time-inconsistent discount functions are then presented (Section 3). We next state and prove the main theorems, the complete classification of discount functions and the continuity result (Section 4). The game theoretic view of what an agent should do if it knows its discount function is changing is analyzed (Section 5). Finally we offer some discussion and concluding remarks (Section 6).

2 Notation and Problem Setup

The general reinforcement learning (RL) setup involves an agent interacting sequentially with an environment where in each time-step $t$ the agent chooses some action $a_t \in \mathcal{A}$, whereupon it receives a reward $r_t \in \mathcal{R} \subseteq \mathbb{R}$ and observation $o_t \in \mathcal{O}$.
The environment can be formally defined as a probability distribution $\mu$ where $\mu(r_t o_t \mid a_1 r_1 o_1 a_2 r_2 o_2 \cdots a_{t-1} r_{t-1} o_{t-1} a_t)$ is the probability of receiving reward $r_t$ and observation $o_t$ having taken action $a_t$ after history $h_{<t}$.

[…] A discount vector $d^k$ is a non-negative vector with $d^k_t > 0$ for at least one $t \ge k$. The apparently superfluous superscript $k$ will be useful later when we allow the discount vector to change with time. We do not insist that the discount vector be summable, $\sum_{t=k}^{\infty} d^k_t < \infty$.

Definition 5 (Expected Values). The expected discounted reward (or utility, or value) when using policy $\pi$ starting in history $h_{<k}$ is $V^\pi_{d^k}(h_{<k})$.

Definition 6 (Optimal Policy/Value). In general, our agent will try to choose a policy $\pi^*_{d^k}$ to maximise $V^\pi_{d^k}(h_{<k})$.

[…]

3 Examples

Geometric. $d^k_t = \gamma^t$ with $\gamma \in (0,1)$. Geometric discounting is the most commonly used discount matrix. Philosophically it can be justified by assuming an agent will die (and not care about the future after death) with probability $1-\gamma$ at each time-step. Another justification for geometric discounting is its analytic simplicity: it is summable and leads to time-consistent policies. It also models fixed interest rates.

No Discounting. $d^k_t = 1$, for all $k, t$. [LH07] and [Leg08] point out that discounting future rewards via an explicit discount matrix is unnecessary since the environment can capture both temporal preferences for early (or late) consumption, as well as the risk associated with delaying consumption. Of course, this "discount matrix" is not summable, but can be made to work by insisting that all environments satisfy Assumption 7. This approach is elegant in the sense that it eliminates the need for a discount matrix, essentially admitting far more complex preferences regarding inter-temporal rewards than a discount matrix allows.

¹ $[[\text{expr}]] = 1$ if expr is true and 0 otherwise.
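The discount matrices above can be sketched as functions of the agent's age $k$ and the reward's time $t$. This is my own encoding (1-based time, finite truncation), not code from the paper; the utility computed is the age-$k$ value $\sum_{t \ge k} d^k_t r_t$:

```python
# Example discount matrices d^k_t as functions of age k and time t (t >= k >= 1).

def geometric(gamma):
    return lambda k, t: gamma ** t          # d^k_t = gamma^t: summable, time-consistent

def no_discounting():
    return lambda k, t: 1.0                 # d^k_t = 1: not summable on its own

def constant_horizon(H):
    return lambda k, t: 1.0 if t < k + H else 0.0   # only the next H steps matter

def utility(d, rewards, k):
    """Age-k utility sum_{t>=k} d(k,t) * r_t over a finite reward list (1-based t)."""
    return sum(d(k, t) * rewards[t - 1] for t in range(k, len(rewards) + 1))

rewards = [1.0, 1.0, 1.0, 1.0]
u_geo = utility(geometric(0.5), rewards, k=1)       # 0.5 + 0.25 + 0.125 + 0.0625
u_hor = utility(constant_horizon(2), rewards, k=2)  # counts only time-steps 2 and 3
```

Note how the constant-horizon row depends on $k$ while the geometric row does not; this difference is exactly what the characterisation in Section 4 formalises.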
On the other hand, a discount matrix gives the "controller" an explicit way to adjust the myopia of the agent.

To illustrate the potential consequences of time-inconsistent discount matrices we consider the policies of several agents acting in the following environment. [Figure: chain environment; the agent starts at S and repeatedly chooses right (reward 0) or up, where the up-rewards along the chain are 1/2, 2/3, 3/4, 4/5, …, followed by 0 forever.] Let agent A use a constant horizon discount matrix with H = 2 and agent B a geometric discount matrix with some discount rate $\gamma$.

In the first time-step agent A prefers to move right with the intention of moving up in the second time-step for a reward of 2/3. However, once in the second time-step, it will change its plan by moving right again. This continues indefinitely, so agent A will always delay moving up and receives zero reward forever.

Agent B acts very differently. Let $\pi_t$ be the policy in which the agent moves right until time-step $t$, then up and right indefinitely. Then $V^{\pi_t}_{d^k}(h_{<1}) = \gamma^t \frac{t+1}{t+2}$. This value does not depend on $k$ and so the agent will move right until $t = \arg\max_t \left\{ \gamma^t \frac{t+1}{t+2} \right\} < \infty$, when it will move up and receive a reward.

The actions of agent A are an example of the worst possible behavior arising from time-inconsistent discounting. Nevertheless, agents with a constant horizon discount matrix are used in all kinds of problems; in particular, agents in zero-sum games, where fixed-depth mini-max searches are common. In practice, serious time-inconsistent behavior for game-playing agents seems rare, presumably because most strategic games don't have a reward structure similar to the example above.

4 Theorems

The main theorem of this paper is a complete characterisation of time-consistent discount matrices.

Theorem 13 (Characterisation). Let $d$ be a discount matrix, then the following are equivalent:
1. $d$ is time-consistent (Definition 12);
2.
For each $k$ there exists an $\alpha_k \in \mathbb{R}$ such that $d^k_t = \alpha_k d^1_t$ for all $t \ge k \in \mathbb{N}$.

Recall that a discount matrix is sliding if $d^k_t = d^1_{t-k+1}$. Theorem 13 can be used to show that if a sliding discount matrix is used, as in [Str55], then the only time-consistent discount matrix is geometric. Let $d$ be a time-consistent sliding discount matrix. By Theorem 13 and the definition of sliding, $\alpha_1 d^1_{t+1} = d^2_{t+1} = d^1_t$. Therefore $d^1_2 = \frac{1}{\alpha_1} d^1_1$ and $d^1_3 = \frac{1}{\alpha_1} d^1_2 = \frac{1}{\alpha_1^2} d^1_1$ and similarly, $d^1_t = \frac{1}{\alpha_1^{t-1}} d^1_1 \propto \gamma^t$ with $\gamma = 1/\alpha_1$, which is geometric discounting. This is the analogue of the results of [Str55] converted to our setting.

The theorem can also be used to construct time-consistent discount rates. Let $d^1$ be a discount vector, then the discount matrix defined by $d^k_t := d^1_t$ for all $t \ge k$ will always be time-consistent; for example, the fixed lifetime discount matrix with $d^k_t = 1$ if $t \le H$ for some horizon $H$. Indeed, all time-consistent discount rates can be constructed in this way (up to scaling).

Proof of Theorem 13. 2 ⟹ 1: This direction follows easily from linearity of the scalar product. 1 ⟹ 2: […] By the previous argument we have that $[d^k_{t_i}, d^k_{t_i+1}] = \alpha_k [d^1_{t_i}, d^1_{t_i+1}]$ and $[d^k_{t_i+1}, d^k_{t_i+2}] = \tilde{\alpha}_k [d^1_{t_i+1}, d^1_{t_i+2}]$. Therefore $\alpha_k = \tilde{\alpha}_k$, and by induction, $d^k_{t_i} = \alpha_k d^1_{t_i}$ for all $i$. Now if $t \ge k$ and $d^1_t = 0$ then $d^k_t = 0$ by equation (6). By symmetry, $d^k_t = 0 \implies d^1_t = 0$. Therefore $d^k_t = \alpha_k d^1_t$ for all $t \ge k$ as required.

In Section 3 we saw an example where time-inconsistency led to very bad behavior. The discount matrix causing this was very time-inconsistent. Is it possible that an agent using a "nearly" time-consistent discount matrix can exhibit similar bad behavior?
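Theorem 13 suggests a finite check for time-consistency of a truncated discount matrix: each row $k$ must be a scalar multiple of row 1 on the entries $t \ge k$. A small sketch of this check (my own encoding; rows and columns are 0-based, and proportionality is tested via vanishing 2×2 minors so that zero entries need no division):

```python
def is_time_consistent(d, tol=1e-12):
    """d[k][t] plays the role of d^{k+1}_{t+1}.  Row k is proportional to
    row 0 on columns >= k iff every 2x2 minor against row 0 vanishes there."""
    T = len(d[0])
    for k in range(1, len(d)):
        for s in range(k, T):
            for t in range(s + 1, T):
                if abs(d[k][s] * d[0][t] - d[k][t] * d[0][s]) > tol:
                    return False
    return True

gamma = 0.5
geometric = [[gamma ** t for t in range(6)] for _ in range(4)]   # d^k_t = gamma^t
horizon   = [[1.0 if k <= t < k + 2 else 0.0 for t in range(6)]
             for k in range(4)]                                   # constant horizon H = 2

consistent_geo = is_time_consistent(geometric)   # rows identical -> consistent
consistent_hor = is_time_consistent(horizon)     # rows shift with k -> inconsistent
```

The geometric matrix passes (every row equals row 1, so $\alpha_k = 1$), while the constant-horizon matrix from Section 3 fails, matching the behavior of agents A and B there.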
For example, could rounding errors when using a geometric discount matrix seriously affect the agent's behavior? The following theorem shows that this is not possible. First we require a measure of the cost of time-inconsistent behavior. The regret experienced by the agent at time zero from following policy $\pi_d$ rather than $\pi^*_{d^1}$ is $V^*_{d^1}(h_{<1}) - V^{\pi_d}_{d^1}(h_{<1})$. We also need a distance measure on the space of discount vectors.

Definition 14 (Distance Measure). Let $d^k, d^j$ be discount vectors, then define a distance measure $D$ by
$$D(d^k, d^j) := \sum_{i=\max\{k,j\}}^{\infty} |d^k_i - d^j_i|.$$
Note that this is almost the taxicab metric, but the sum is restricted to $i \ge \max\{k, j\}$.

Theorem 15 (Continuity). Suppose $\epsilon \ge 0$ and $D_{k,j} := D(d^k, d^j)$, then
$$V^*_{d^1}(h_{<1}) - V^{\pi_d}_{d^1}(h_{<1}) \le \epsilon + D_{1,t} + \sum_{k=1}^{t-1} D_{k,k+1}$$
with $t = \min\{t : \cdots\}$, which is guaranteed to exist by Assumption 7.

Theorem 15 implies that the regret of the agent at time zero in its future time-inconsistent actions is bounded by the sum of the differences between the discount vectors used at different times. If these differences are small then the regret is also small. For example, it implies that small perturbations (such as rounding errors) in a time-consistent discount matrix lead to minimal bad behavior.

The proof is omitted due to limitations in space. It relies on proving the result for finite horizon environments and showing that this extends to the infinite case by using the horizon $t$, after which the actions of the agent are no longer important.

The bound in Theorem 15 is tight in the following sense.

Theorem 16.
For $\delta > 0$ and $t \in \mathbb{N}$ and any sufficiently small $\epsilon > 0$ there exists an environment and discount matrix such that
$$(t-2)(1-\epsilon)\delta < V^*_{d^1}(h_{<1}) - V^{\pi_d}_{d^1}(h_{<1}) < (t+1)\delta \equiv D_{1,t} + \sum_{i=1}^{t-1} D_{i,i+1}$$
where $t = \min\{t : \cdots\}$ as in Theorem 15. That is, there exists an environment where the regret due to time-inconsistency is nearly equal to the bound given by Theorem 15.

Proof of Theorem 16. Define $d$ by
$$d^k_i = \begin{cases} \delta & \text{if } k < i < t \\ 0 & \text{otherwise.} \end{cases}$$
Observe that $D(d^k, d^j) = \delta$ for all $k < j < t$ since $d^j_i = d^k_i$ for all $i$ except $i = j$. Now consider the environment below. [Figure: chain environment starting at S; moving right gives reward 0, while moving down at successive time-steps gives rewards $1-\epsilon$, $1-\epsilon^2$, …, $1-\epsilon^{t-1}$, after which all rewards are 0.]

For sufficiently small $\epsilon$, the agent at time zero will plan to move right and then down, leading to $R^*_{d^1}(h_{<1}) = [0, 1-\epsilon, 1-\epsilon, \cdots]$ and $V^*_{d^1}(h_{<1}) = (t-1)\delta(1-\epsilon)$. To compute $R_d$ note that $d^k_k = 0$ for all $k$. Therefore the agent in time-step $k$ doesn't care about the next instantaneous reward, so prefers to move right with the intention of moving down in the next time-step when the rewards are slightly better. This leads to $R_d(h_{<1}) = [0, 0, \cdots, 1-\epsilon^{t-1}, 0, 0, \cdots]$. Therefore,
$$V^*_{d^1}(h_{<1}) - V^{\pi_d}_{d^1}(h_{<1}) = (t-1)\delta(1-\epsilon) - (1-\epsilon^{t-1})\delta \ge (t-2)\delta(1-\epsilon)$$
as required.

5 Game Theoretic Approach

What should an agent do if it knows it is time-inconsistent? One option is to treat its future selves as "opponents" in an extensive game. The game has one player per time-step who chooses the action for that time-step only. At the end of the game the agent will have received a reward sequence $r \in \mathcal{R}^\infty$. The utility given to the $k$th player is then $r \cdot d^k$. So each player in this game wishes to maximise the discounted reward with respect to a different discounting vector.

For example, let $d^1 = [2, 1, 2, 0, 0, \cdots]$ and $d^2 = [*, 3, 1, 0, 0, \cdots]$ and consider the following environment. [Figure: from S the agent can move down for reward 4, or right for reward 1 followed by a second choice leading to reward sequence $(3, 0, \cdots)$ or $(1, 3, 0, \cdots)$.]
Initially, the agent has two choices. It can either move down to guarantee a reward sequence of $r = [4, 0, 0, \cdots]$, which has utility $d^1 \cdot [4, 0, 0, \cdots] = 8$, or it can move right, in which case it will receive a reward sequence of either $r' = [1, 3, 0, 0, \cdots]$ with utility 5 or $r'' = [1, 1, 3, 0, 0, \cdots]$ with utility 9. Which of these two reward sequences it receives is determined by the action taken in the second time-step. However, this action is chosen to maximise utility with respect to discount sequence $d^2$, and $d^2 \cdot r' > d^2 \cdot r''$. This means that if at time 1 the agent chooses to move right, the final reward sequence will be $[1, 3, 0, 0, \cdots]$ and the final utility with respect to $d^1$ will be 5. Therefore the rational thing to do in time-step 1 is to move down immediately for a utility of 8.

The technique above is known as backwards induction, which is used to find sub-game perfect equilibria in finite extensive games. A variant of Kuhn's theorem proves that backwards induction can be used to find such equilibria in finite extensive games [OR94]. For arbitrary extensive games (possibly infinite) a sub-game perfect equilibrium need not exist, but we prove a theorem for our particular class of infinite games. A sub-game perfect equilibrium policy is one the players could agree to play, and subsequently have no incentive to renege on their agreement during play. It isn't always philosophically clear that a sub-game perfect equilibrium policy should be played. For a deeper discussion, including a number of good examples, see [OR94].

Definition 17 (Sub-game Perfect Equilibria). A policy $\pi^*_d$ is a sub-game perfect equilibrium policy if and only if for each $t$, $V^{\pi^*_d}_{d^t}(h_{<t}) \ge \cdots$ […] That is, the environment which gives zero reward always after time $t$.
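The backwards-induction computation in the example above can be sketched directly: the time-2 player picks the continuation maximising $r \cdot d^2$, and the time-1 player then evaluates its own options under $d^1$ against that actual outcome. The environment encoding below is my own (the `*` entry of $d^2$, irrelevant to player 2, is set to 0):

```python
# Reward sequences reachable in the example environment.
r_down  = [4, 0, 0]        # move down immediately
r_right = [[1, 3, 0],      # move right; player 2 then chooses one of these
           [1, 1, 3]]

d1 = [2, 1, 2]
d2 = [0, 3, 1]             # the '*' entry is irrelevant to player 2; use 0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Player 2 (time-step 2) picks the continuation maximising r . d2 ...
r2 = max(r_right, key=lambda r: dot(r, d2))     # utilities 9 vs 6 under d2

# ... so player 1 compares moving down against the *actual* outcome of moving
# right (the value of r2 under d1), not the best-case utility 9.
u_down, u_right = dot(r_down, d1), dot(r2, d1)
best_first_move = "down" if u_down >= u_right else "right"
```

Running this reproduces the reasoning in the text: player 2 selects $[1, 3, 0]$, so moving right is worth only 5 to player 1, and the sub-game perfect first move is down for a utility of 8.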
We can assume without loss of generality that $\pi_t(h_{<t})$ […]