Online Learning under Delayed Feedback
Authors: Pooria Joulani, András György, Csaba Szepesvári
Pooria Joulani (pooria@ualberta.ca), András György (gyorgy@ualberta.ca), Csaba Szepesvári (szepesva@ualberta.ca)
Dept. of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada

Abstract

Online learning with delayed feedback has received increasing attention recently due to its several applications in distributed, web-based learning problems. In this paper we provide a systematic study of the topic, and analyze the effect of delay on the regret of online learning algorithms. Somewhat surprisingly, it turns out that delay increases the regret in a multiplicative way in adversarial problems, and in an additive way in stochastic problems. We give meta-algorithms that transform, in a black-box fashion, algorithms developed for the non-delayed case into ones that can handle the presence of delays in the feedback loop. Modifications of the well-known UCB algorithm are also developed for the bandit problem with delayed feedback, with the advantage over the meta-algorithms that they can be implemented with lower complexity.

1. Introduction

In this paper we study sequential learning when the feedback about the predictions made by the forecaster is delayed. This is the case, for example, in web advertisement, where the information whether a user has clicked on a certain ad may come back to the engine in a delayed fashion: after an ad is selected, while waiting for the information whether the user clicks or not, the engine has to provide ads to other users. Also, the click information may be aggregated and then periodically sent to the module that decides about the ads, resulting in further delays (Li et al., 2010; Dudik et al., 2011). Another example is parallel, distributed learning, where propagating information among nodes causes delays (Agarwal & Duchi, 2011).
Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

While online learning has proved to be successful in many machine learning problems and is applied in practice in situations where the feedback is delayed, the theoretical results for the non-delayed setup are not applicable when delays are present. Previous work concerning the delayed setting focused on specific online learning settings and delay models (mostly with constant delays). Thus, a comprehensive understanding of the effects of delays is missing. In this paper, we provide a systematic study of online learning problems with delayed feedback. We consider the partial monitoring setting, which covers all settings previously considered in the literature, extending, unifying, and often improving upon existing results. In particular, we give general meta-algorithms that transform, in a black-box fashion, algorithms developed for the non-delayed case into algorithms that can handle delays efficiently. We analyze how the delay affects the regret of the algorithms. One interesting, perhaps somewhat surprising, result is that the delay inflates the regret in a multiplicative way in adversarial problems, while this effect is only additive in stochastic problems. While our general meta-algorithms are useful, their time- and space-complexity may be unnecessarily large. To resolve this problem, we work out modifications of variants of the UCB algorithm (Auer et al., 2002) for stochastic bandit problems with delayed feedback that have much smaller complexity than the black-box algorithms.

The rest of the paper is organized as follows. The problem of online learning with delayed feedback is defined in Section 2.
The adversarial and stochastic problems are analyzed in Sections 3.1 and 3.2, while the modification of the UCB algorithm is given in Section 4. Some proofs, as well as results about the KL-UCB algorithm (Garivier & Cappé, 2011) under delayed feedback, are provided in the appendix.

2. The delayed feedback model

We consider a general model of online learning, which we call the partial monitoring problem with side information.

Parameters: forecaster's prediction set $\mathcal{A}$, set of outcomes $\mathcal{B}$, side information set $\mathcal{X}$, reward function $r : \mathcal{X} \times \mathcal{A} \times \mathcal{B} \to \mathbb{R}$, feedback function $h : \mathcal{X} \times \mathcal{A} \times \mathcal{B} \to \mathcal{H}$, time horizon $n$ (optional).

At each time instant $t = 1, 2, \ldots, n$:
1. The environment chooses some side information $x_t \in \mathcal{X}$ and an outcome $b_t \in \mathcal{B}$.
2. The side information $x_t$ is presented to the forecaster, who makes a prediction $a_t \in \mathcal{A}$, which results in the reward $r(x_t, a_t, b_t)$ (unknown to the forecaster).
3. The feedback $h_t = h(x_t, a_t, b_t)$ is scheduled to be revealed after $\tau_t$ time instants.
4. The agent observes $H_t = \{(t', h_{t'}) : t' \le t,\ t' + \tau_{t'} = t\}$, i.e., all the feedback values scheduled to be revealed at time step $t$, together with their timestamps.

Figure 1: Partial monitoring under delayed, time-stamped feedback.

In this model, the forecaster (decision maker) has to make a sequence of predictions (actions), possibly based on some side information, and for each prediction it receives some reward and feedback, where the feedback is delayed. More formally, given a set of possible side information values $\mathcal{X}$, a set of possible predictions $\mathcal{A}$, a set of reward functions $\mathcal{R} \subset \{r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}\}$, and a set of possible feedback values $\mathcal{H}$, at each time instant $t = 1, 2, \ldots$
, the forecaster receives some side information $x_t \in \mathcal{X}$; then, possibly based on the side information, the forecaster predicts some value $a_t \in \mathcal{A}$ while the environment simultaneously chooses a reward function $r_t \in \mathcal{R}$; finally, the forecaster receives reward $r_t(x_t, a_t)$ and some time-stamped feedback set $H_t \subset \mathbb{N} \times \mathcal{H}$. In particular, each element of $H_t$ is a pair of a time index and a feedback value, the time index indicating the time instant whose decision the associated feedback corresponds to. Note that the forecaster may or may not receive any direct information about the rewards it receives (i.e., the rewards may be hidden). In standard online learning, the feedback set $H_t$ is a singleton and the feedback in this set depends on $r_t, a_t$. In the delayed model, however, the feedback that concerns the decision at time $t$ is received at the end of the time period $t + \tau_t$, after the prediction is made, i.e., it is delayed by $\tau_t$ time steps. Note that $\tau_t \equiv 0$ corresponds to the non-delayed case. Due to the delays, multiple feedbacks may arrive at the same time, hence the definition of $H_t$.

The goal of the forecaster is to maximize its cumulative reward $\sum_{t=1}^n r_t(x_t, a_t)$ ($n \ge 1$). The performance of the forecaster is measured relative to the best static strategy selected from some set $\mathcal{F} \subset \{f \mid f : \mathcal{X} \to \mathcal{A}\}$ in hindsight. In particular, the forecaster's performance is measured through the regret, defined by
\[
R_n = \sup_{a \in \mathcal{F}} \sum_{t=1}^n r_t(x_t, a(x_t)) - \sum_{t=1}^n r_t(x_t, a_t).
\]
A forecaster is consistent if it achieves, asymptotically, the average reward of the best static strategy, that is, $\mathbb{E}[R_n]/n \to 0$, and we are interested in how fast the average regret can be made to converge to 0. The above general problem formulation includes most scenarios considered in online learning.
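As a concrete illustration, the delayed protocol and the regret just defined can be simulated. Everything in the following sketch (the two-action bandit-style environment, the constant delay of 3, and the naive explore-then-exploit forecaster) is an illustrative assumption, not a construction from the paper:

```python
# Minimal simulation of the delayed-feedback protocol and the regret above.
# The environment, reward function, delays, and forecaster are all made-up
# assumptions chosen so the effect of delay on learning is visible.

def run_delayed_bandit(n, delays, outcomes):
    pending = {}                          # arrival time t' + tau_t' -> [(t', action, reward)]
    count, rsum = [0, 0], [0.0, 0.0]      # statistics built from *observed* feedback only
    total = 0.0
    for t in range(1, n + 1):
        # Observe H_t: all feedback values scheduled to be revealed at step t.
        for (_, a_s, rew) in pending.pop(t, []):
            count[a_s] += 1
            rsum[a_s] += rew
        # Naive forecaster: alternate actions early, then play the empirical best.
        if t <= 10:
            a_t = t % 2
        else:
            m0 = rsum[0] / count[0] if count[0] else 0.0
            m1 = rsum[1] / count[1] if count[1] else 0.0
            a_t = 0 if m0 >= m1 else 1
        r_t = 1.0 if a_t == outcomes[t - 1] else 0.0
        total += r_t
        # Feedback for time t is delayed by tau_t steps.
        pending.setdefault(t + delays[t - 1], []).append((t, a_t, r_t))
    return total

n = 1000
total_reward = run_delayed_bandit(n, delays=[3] * n, outcomes=[1] * n)
regret = n * 1.0 - total_reward    # the best static strategy always plays action 1
print(regret)                      # prints 5.0: only the 5 exploration rounds on action 0 are lost
```

Note that the forecaster's statistics lag behind its actions by the delay, which is exactly why the analysis below tracks the number of outstanding feedbacks.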
In the full information case, the feedback is the reward function itself, that is, $\mathcal{H} = \mathcal{R}$ and $H_t = \{(t, r_t)\}$ (in the non-delayed case). In the bandit case, the forecaster only learns the rewards of its own prediction, i.e., $\mathcal{H} = \mathbb{R}$ and $H_t = \{(t, r_t(x_t, a_t))\}$. In the partial monitoring case, the forecaster is given a reward function $r : \mathcal{X} \times \mathcal{A} \times \mathcal{B} \to \mathbb{R}$ and a feedback function $h : \mathcal{X} \times \mathcal{A} \times \mathcal{B} \to \mathcal{H}$, where $\mathcal{B}$ is a set of choices (outcomes) of the environment. Then, for each time instant the environment picks an outcome $b_t \in \mathcal{B}$, and the reward becomes $r_t(x_t, a_t) = r(x_t, a_t, b_t)$, while $H_t = \{(t, h(x_t, a_t, b_t))\}$. This interaction protocol is shown in Figure 1 in the delayed case. Note that the bandit and full information problems can also be treated as special partial monitoring problems. Therefore, we will use this last formulation of the problem.

When no stochastic assumption is made on how the sequence $b_t$ is generated, we talk about the adversarial model. In the stochastic setting we will consider the case when $b_t$ is a sequence of independent, identically distributed (i.i.d.) random variables. Side information may or may not be present in a real problem; in its absence, $\mathcal{X}$ is a singleton set.

Finally, we may have different assumptions on the delays. Most often, we will assume that $(\tau_t)_{t \ge 1}$ is an i.i.d. sequence, which is independent of the past predictions $(a_s)_{s \le t}$ of the forecaster. In the stochastic setting, we also allow the distribution of $\tau_t$ to depend on $a_t$. Note that the delays may change the order of observing the feedbacks, with the feedback of a more recent prediction being observed before the feedback of an earlier one.

2.1. Related work

The effect of delayed feedback has been studied in recent years under different online learning scenarios and different assumptions on the delay. A concise summary, together with the contributions of this paper, is given in Table 1.

Table 1. Summary of work on online learning under delayed feedback.
- Full information, no side information. Stochastic: $R(n) \le R'(n) + O(\mathbb{E}[\tau_t^2])$ (Agarwal & Duchi, 2011; Langford et al., 2009). Adversarial: $R(n) \le O(\tau_{const}) \times R'(n/\tau_{const})$ [L] (Weinberger & Ordentlich, 2002; Agarwal & Duchi, 2011).
- Full information, side information. Stochastic: $R(n) \le R'(n) + O(D^*)$ [L] (Mesterharm, 2007). Adversarial: $R(n) \le O(\bar{D}) \times R'(n/\bar{D})$ [L] (Mesterharm, 2007).
- Bandit, no side information. Stochastic: $R(n) \le C_1 R'(n) + C_2 \tau_{\max} \log(\tau_{\max})$ (Desautels et al., 2012). Adversarial: $R(n) \le O(\tau_{const}) \times R'(n/\tau_{const})$ (Neu et al., 2010).
- Bandit, side information. Stochastic: $R(n) \le R'(n) + O(\tau_{const}\sqrt{\log n})$ (Dudik et al., 2011).
- Partial monitoring, no side information (this paper). Stochastic: $R_n \le R'(n) + O(G^*_n)$. Adversarial: $R_n \le (1 + \mathbb{E}[G^*_n]) \times R'\left(\frac{n}{1 + \mathbb{E}[G^*_n]}\right)$.
- Partial monitoring, side information (this paper). Adversarial: $R_n \le (1 + \mathbb{E}[G^*_n]) \times R'\left(\frac{n}{1 + \mathbb{E}[G^*_n]}\right)$.

$R(n)$ shows the (expected) regret in the delayed setting, while $R'(n)$ shows the (upper bound on the) (expected) regret in the non-delayed setting. [L] denotes a matching lower bound. $D^*$ and $\bar{D}$ indicate the maximum and average gap, respectively, where a gap is a number of consecutive time steps in which the agent does not get any feedback (in the adversarial delay formulation used by Mesterharm (2005; 2007)). The term $\tau_{const}$ indicates that the results are for constant delays only. For the work of Desautels et al. (2012), $C_1$ and $C_2$ are positive constants, with $C_1 > 1$, and $\tau_{\max}$ denotes the maximum delay. The results of this paper are the entries marked "this paper", where $G^*_t$ is the maximum number of outstanding feedbacks during the first $t$ time steps. In particular, $G^*_n \le \tau_{\max}$ when the delays have an upper bound $\tau_{\max}$, and we show that $G^*_n = O(\mathbb{E}[\tau_t] + \sqrt{\mathbb{E}[\tau_t]\log n} + \log n)$ when the delays $\tau_t$ are i.i.d. The new bounds for the partial monitoring problem are automatically applicable in the other, special, cases, and give improved results in most cases.

To the best of our knowledge, Weinberger & Ordentlich (2002) were the first to analyze the delayed feedback problem; they considered the adversarial full information setting with a fixed, known delay $\tau_{const}$. They showed that the minimax optimal solution is to run $\tau_{const} + 1$ independent optimal predictors on the subsampled reward sequences: $\tau_{const} + 1$ prediction strategies are used such that the $i$th predictor is used at time instants $t$ with $(t \bmod (\tau_{const} + 1)) + 1 = i$. This approach forms the basis of our method devised for the adversarial case (see Section 3.1). Langford et al. (2009) showed that under the usual conditions, a sufficiently slowed-down version of the mirror descent algorithm achieves the optimal decay rate of the average regret. Mesterharm (2005; 2007) considered another variant of the full information setting, using an adversarial model on the delays in the label prediction setting, where the forecaster has to predict the label corresponding to a side information vector $x_t$. While in the full information online prediction problem Weinberger & Ordentlich (2002) showed that the regret increases by a multiplicative factor of $\tau_{const}$, in the work of Mesterharm (2005; 2007) the important quantity becomes the maximum/average gap, defined as the length of the largest time interval in which the forecaster does not receive feedback.
Mesterharm (2005; 2007) also shows that the minimax regret in the adversarial case increases multiplicatively by the average gap, while it increases only in an additive fashion in the stochastic case, by the maximum gap. Agarwal & Duchi (2011) considered the problem of online stochastic optimization and showed that, for i.i.d. random delays, the regret increases with an additive factor of order $\mathbb{E}[\tau^2]$.

Qualitatively similar results were obtained in the bandit setting. Considering a fixed and known delay $\tau_{const}$, Dudik et al. (2011) showed an additive $O(\tau_{const}\sqrt{\log n})$ penalty in the regret for the stochastic setting (with side information), while Neu et al. (2010) showed a multiplicative regret increase for the adversarial bandit case. The problem of delayed feedback has also been studied for Gaussian process bandit optimization (Desautels et al., 2012), resulting in a multiplicative increase in the regret that is independent of the delay and an additive term depending on the maximum delay. In the rest of the paper we generalize the above results to the partial monitoring setting, extending, unifying, and often improving existing results.

3. Black-Box Algorithms for Delayed Feedback

In this section we provide black-box algorithms for the delayed feedback problem. We assume that there exists a base algorithm Base for solving the prediction problem without delay. We often do not specify the assumptions underlying the regret bounds of these algorithms, and assume that the problem we consider only differs from the original problem because of the delays. For example, in the adversarial setting, Base may build on the assumption that the reward functions are selected in an oblivious or non-oblivious way (i.e., independently of the predictions of the forecaster or not).
First we consider the adversarial case in Section 3.1. Then, in Section 3.2, we provide tighter bounds for the stochastic case.

3.1. Adversarial setting

We say that a prediction algorithm enjoys a regret or expected regret bound $f : [0, \infty) \to \mathbb{R}$ under the given assumptions in the non-delayed setting if (i) $f$ is nondecreasing, concave, and $f(0) = 0$; and (ii) $\sup_{b_1,\ldots,b_n \in \mathcal{B}} R_n \le f(n)$ or, respectively, $\sup_{b_1,\ldots,b_n \in \mathcal{B}} \mathbb{E}[R_n] \le f(n)$ for all $n$. The algorithm of Weinberger & Ordentlich (2002) for the adversarial full information setting subsamples the reward sequence by the constant delay $\tau_{const} + 1$, and runs a base algorithm Base on each of the $\tau_{const} + 1$ subsampled sequences. Weinberger & Ordentlich (2002) showed that if Base enjoys a regret bound $f$, then their algorithm in the fixed-delay case enjoys a regret bound $(\tau_{const} + 1) f(n/(\tau_{const} + 1))$. Furthermore, when Base is minimax optimal in the non-delayed setting, the subsampling algorithm is also minimax optimal in the (full information) delayed setting, as can be seen by constructing a reward sequence that changes only in every $\tau_{const} + 1$ time steps. Note that Weinberger & Ordentlich (2002) do not require condition (i) on $f$. However, these conditions imply that $y f(x/y)$ is a concave function of $y$ for any fixed $x$ (a fact which will turn out to be useful in the analysis later), and they are satisfied by all regret bounds we are aware of (e.g., for multi-armed bandits, contextual bandits, partial monitoring, etc.), which all have a regret upper bound of the form $\tilde{O}(n^\alpha)$ for some $0 \le \alpha \le 1$, with, typically, $\alpha = 1/2$ or $2/3$.[1]

In this section we extend the algorithm of Weinberger & Ordentlich (2002) to the case when the delays are not constant, and to the partial monitoring setting.
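The subsampling scheme of Weinberger & Ordentlich (2002) described above can be sketched in a few lines. The trivial follow-the-leader base learner and the two-action reward sequence below are illustrative assumptions standing in for a real full-information algorithm:

```python
# Sketch of the fixed-delay subsampling scheme: tau_const + 1 independent copies
# of a base algorithm are used round-robin, so each copy has received its own
# previous feedback before it predicts again. The FTL base learner is an
# illustrative assumption, not part of the paper.

class FTLBase:
    """Toy full-information base learner over two actions (assumed)."""
    def __init__(self):
        self.cum = [0.0, 0.0]
    def predict(self):
        return 0 if self.cum[0] >= self.cum[1] else 1
    def update(self, rewards):            # full information: both rewards revealed
        self.cum[0] += rewards[0]
        self.cum[1] += rewards[1]

def subsampled_play(n, tau_const, reward_seq):
    copies = [FTLBase() for _ in range(tau_const + 1)]
    total, buffer = 0.0, {}
    for t in range(1, n + 1):
        for (j, rew) in buffer.pop(t, []):    # deliver feedback that is now observable
            copies[j].update(rew)
        i = (t - 1) % (tau_const + 1)         # copy i serves times t with (t mod (tau+1)) + 1 = i + 1
        total += reward_seq[t - 1][copies[i].predict()]
        # feedback for time t is observed at the end of t + tau_const,
        # hence usable from round t + tau_const + 1 on
        buffer.setdefault(t + tau_const + 1, []).append((i, reward_seq[t - 1]))
    return total

n, tau = 100, 2
reward_seq = [(0.0, 1.0)] * n                 # action 1 is always better
total = subsampled_play(n, tau, reward_seq)
print(total)                                  # prints 97.0: each of the 3 copies loses exactly one round
```

Each copy sees a non-delayed problem of length roughly $n/(\tau_{const}+1)$, which is where the multiplicative regret inflation comes from.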
The idea is that we run several instances of a non-delayed algorithm Base as needed: an instance is "free" if it has received the feedback corresponding to its previous prediction; before this, we say that the instance is "busy", waiting for the feedback. When we need to make a prediction, we use one of the existing instances that is free, and is hence ready to make another prediction. If no such instance exists, we create a new one to be used (a new instance is always "free", as it is not waiting for the feedback of a previous prediction). The resulting algorithm, which we call Black-Box Online Learning under Delayed feedback (BOLD), is shown below (note that when the delays are constant, BOLD reduces to the algorithm of Weinberger & Ordentlich (2002)):

Algorithm 1 Black-box Online Learning under Delayed feedback (BOLD)
  for each time instant t = 1, 2, ..., n do
    Prediction: Pick a free instance of Base (independently of past predictions), or create a new instance if all existing instances are busy. Feed the instance picked with x_t and use its prediction.
    Update: for each (s, h_s) ∈ H_t do
      Update the instance used at time instant s with the feedback h_s.
    end for
  end for

Clearly, the performance of BOLD depends on how many instances of Base we need to create, and how many times each instance is used. Let $M_t$ denote the number of Base instances created by BOLD up to and including time $t$. That is, $M_1 = 1$, and we create a new instance at the beginning of any time instant when all instances are waiting for their feedback. Let $G_t = \sum_{s=1}^{t-1} \mathbb{I}\{s + \tau_s \ge t\}$ be the total number of outstanding (missing) feedbacks when the forecaster is making a prediction at time instant $t$. Then we have $G_t$ algorithms waiting for their feedback, and so $M_t \ge G_t + 1$.
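BOLD's instance bookkeeping can be sketched as follows. The random delay sequence is an illustrative assumption; the assertion checks that the number of created instances equals the maximum number of outstanding feedbacks plus one:

```python
# Sketch of BOLD's free/busy instance management (Algorithm 1). Base itself is
# irrelevant for the bookkeeping, so only instance ids are tracked; the random
# delays are an illustrative assumption.
import random

def bold_instance_count(delays):
    """Return (instances created, max outstanding feedbacks) for a delay sequence."""
    free, busy = [], {}        # busy maps feedback-arrival step -> instance ids
    created, g_star = 0, 0
    for t in range(1, len(delays) + 1):
        # feedback with s + tau_s = t - 1 was observed at the end of step t - 1,
        # so those instances are free again when predicting at step t
        free.extend(busy.pop(t - 1, []))
        g_t = created - len(free)          # G_t: outstanding feedbacks at prediction time
        g_star = max(g_star, g_t)
        if free:
            inst = free.pop()              # reuse a free instance
        else:
            inst = created                 # all instances busy: create a new one
            created += 1
        busy.setdefault(t + delays[t - 1], []).append(inst)
    return created, g_star

random.seed(1)
delays = [random.randint(0, 9) for _ in range(2000)]
m_n, g_star_n = bold_instance_count(delays)
print(m_n == g_star_n + 1)    # True: the relation M_n = G*_n + 1
```

The assertion holds for any delay sequence, which matches the intuition that an instance is created exactly when a new maximum of outstanding feedbacks is reached.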
Since we only introduce new instances when it is necessary (and at most one new instance is created in each time instant), it is easy to see that
\[
M_t = G^*_t + 1 \tag{1}
\]
for any $t$, where $G^*_t = \max_{1 \le s \le t} G_s$.

[1] $u_n = \tilde{O}(v_n)$ means that there is a $\beta \ge 0$ such that $\lim_{n \to \infty} u_n/(v_n \log^\beta n) = 0$.

We can use the result above to transfer the regret guarantee of the non-delayed base algorithm Base to a guarantee on the regret of BOLD.

Theorem 1. Suppose that the non-delayed algorithm Base used in BOLD enjoys an (expected) regret bound $f_{\mathrm{Base}}$. Assume, furthermore, that the delays $\tau_t$ are independent of the forecaster's predictions $a_t$. Then the expected regret of BOLD after $n$ time steps satisfies
\[
\mathbb{E}[R_n] \le \mathbb{E}\left[(G^*_n + 1)\, f_{\mathrm{Base}}\!\left(\frac{n}{G^*_n + 1}\right)\right] \le (\mathbb{E}[G^*_n] + 1)\, f_{\mathrm{Base}}\!\left(\frac{n}{\mathbb{E}[G^*_n] + 1}\right).
\]

Proof. As the second inequality follows from the concavity of $y \mapsto y f_{\mathrm{Base}}(x/y)$ ($x, y > 0$), it remains to prove the first one. For any $1 \le j \le M_n$, let $L_j$ denote the list of time instants in which BOLD has used the prediction chosen by instance $j$, and let $n_j = |L_j|$ be the number of time instants this happens. Furthermore, let $R^j_{n_j}$ denote the regret incurred during the time instants $t$ with $t \in L_j$:
\[
R^j_{n_j} = \sup_{a \in \mathcal{F}} \sum_{t \in L_j} r_t(x_t, a(x_t)) - \sum_{t \in L_j} r_t(x_t, a_t),
\]
where $a_t$ is the prediction made by BOLD (and instance $j$) at time instant $t$. By construction, instance $j$ does not experience any delays. Hence, $R^j_{n_j}$ is its regret in a non-delayed online learning problem.[2] Then,
\[
R_n = \sup_{a \in \mathcal{F}} \sum_{t=1}^n r_t(x_t, a(x_t)) - \sum_{t=1}^n r_t(x_t, a_t)
= \sup_{a \in \mathcal{F}} \sum_{j=1}^{M_n} \sum_{t \in L_j} r_t(x_t, a(x_t)) - \sum_{j=1}^{M_n} \sum_{t \in L_j} r_t(x_t, a_t)
\le \sum_{j=1}^{M_n} \left( \sup_{a \in \mathcal{F}} \sum_{t \in L_j} r_t(x_t, a(x_t)) - \sum_{t \in L_j} r_t(x_t, a_t) \right)
= \sum_{j=1}^{M_n} R^j_{n_j}.
\]
[2] Note that $L_j$ is a function of the delay sequence and is not a function of the predictions $(a_t)_{t \ge 1}$. Hence, the reward sequence that instance $j$ is evaluated on is chosen obliviously whenever the adversary of BOLD is oblivious.

Now, using the fact that $f_{\mathrm{Base}}$ is an (expected) regret bound, we obtain
\[
\mathbb{E}[R_n \mid \tau_1, \ldots, \tau_n] \le \sum_{j=1}^{M_n} \mathbb{E}\left[R^j_{n_j} \,\middle|\, \tau_1, \ldots, \tau_n\right] \le \sum_{j=1}^{M_n} f_{\mathrm{Base}}(n_j) = M_n \sum_{j=1}^{M_n} \frac{1}{M_n} f_{\mathrm{Base}}(n_j) \le M_n f_{\mathrm{Base}}\!\left(\sum_{j=1}^{M_n} \frac{n_j}{M_n}\right) = M_n f_{\mathrm{Base}}\!\left(\frac{n}{M_n}\right),
\]
where the first inequality follows since $M_n$ is a deterministic function of the delays, while the last inequality follows from Jensen's inequality and the concavity of $f_{\mathrm{Base}}$. Substituting $M_n$ from (1) and taking the expectation concludes the proof.

Now we need to bound $G^*_n$ to make the theorem meaningful. When all delays are the same constant $\tau_{const}$, for $n > \tau_{const}$ we get $G^*_n = \tau_{const}$, and we recover the regret bound
\[
\mathbb{E}[R_n] \le (\tau_{const} + 1)\, f_{\mathrm{Base}}\!\left(\frac{n}{\tau_{const} + 1}\right)
\]
of Weinberger & Ordentlich (2002), thus generalizing their result to partial monitoring. We do not know whether this bound is tight even when Base is minimax optimal, as the argument of Weinberger & Ordentlich (2002) for the lower bound does not work in the partial information setting (the forecaster can gain extra information in each block with the same reward functions).

Assuming the delays are i.i.d., we can give an interesting bound on $G^*_n$. The result is based on the fact that although $G_t$ can be as large as $t$, both its expectation and variance are upper bounded by $\mathbb{E}[\tau_1]$.

Lemma 2. Assume $\tau_1, \ldots, \tau_n$ is a sequence of i.i.d. random variables with finite expected value, and let $B(n, t) = t + 2\log n + \sqrt{4 t \log n}$. Then $\mathbb{E}[G^*_n] \le B(n, \mathbb{E}[\tau_1]) + 1$.

Proof. First consider the expectation and the variance of $G_t$.
For any $t$,
\[
\mathbb{E}[G_t] = \mathbb{E}\left[\sum_{s=1}^{t-1} \mathbb{I}\{s + \tau_s \ge t\}\right] = \sum_{s=1}^{t-1} \mathbb{P}\{s + \tau_s \ge t\} = \sum_{s=0}^{t-2} \mathbb{P}\{\tau_1 > s\} \le \mathbb{E}[\tau_1],
\]
and, similarly,
\[
\sigma^2[G_t] = \sum_{s=1}^{t-1} \sigma^2\left[\mathbb{I}\{s + \tau_s \ge t\}\right] \le \sum_{s=1}^{t-1} \mathbb{P}\{s + \tau_s \ge t\},
\]
so $\sigma^2[G_t] \le \mathbb{E}[\tau_1]$ in the same way as above. By Bernstein's inequality (Cesa-Bianchi & Lugosi, 2006, Corollary A.3), for any $0 < \delta < 1$ and any $t$ we have, with probability at least $1 - \delta$,
\[
G_t - \mathbb{E}[G_t] \le \log\frac{1}{\delta} + \sqrt{2\sigma^2[G_t]\log\frac{1}{\delta}}.
\]
Applying the union bound with $\delta = 1/n^2$, together with our previous bounds on the variance and expectation of $G_t$, we obtain that with probability at least $1 - 1/n$,
\[
\max_{1 \le t \le n} G_t \le \mathbb{E}[\tau_1] + 2\log n + \sqrt{4\,\mathbb{E}[\tau_1]\log n}.
\]
Taking into account that $\max_{1 \le t \le n} G_t \le n$, we get the statement of the lemma.

Corollary 3. Under the conditions of Theorem 1, if the sequence of delays is i.i.d., then
\[
\mathbb{E}[R_n] \le (B(n, \mathbb{E}[\tau_1]) + 2)\, f_{\mathrm{Base}}\!\left(\frac{n}{B(n, \mathbb{E}[\tau_1]) + 2}\right).
\]
Note that although the delays can be arbitrarily large, whenever their expected value is finite, the bound only increases by a $\log n$ factor.

3.2. Finite stochastic setting

In this section, we consider the case when the prediction set $\mathcal{A}$ of the forecaster is finite; without loss of generality we assume $\mathcal{A} = \{1, 2, \ldots, K\}$. We also assume that there is no side information (that is, $x_t$ is constant for all $t$ and will hence be omitted; the results can be extended easily to the case of a finite side information set, where we can repeat the procedures described below for each value of the side information separately). The main assumption in this section is that the outcomes $(b_t)_{t \ge 1}$ form an i.i.d. sequence, which is also independent of the predictions of the forecaster. When $\mathcal{B}$ is finite, this leads to the standard i.i.d.
partial monitoring (IPM) setting, while the conventional multi-armed bandit (MAB) setting is recovered when the feedback is the reward of the last prediction, that is, $h_t = r_t(a_t, b_t)$. As in the previous section, we will assume that the feedback delays are independent of the outcomes of the environment. The main result of this section shows that under these assumptions, the penalty in the regret grows in an additive fashion due to the delays, as opposed to the multiplicative penalty that we have seen in the adversarial case.

By the independence assumption on the outcomes, the sequences of potential rewards $r_t(i) \doteq r(i, b_t)$ and feedbacks $h_t(i) \doteq h(i, b_t)$ are i.i.d., respectively, for the same prediction $i \in \mathcal{A}$. In this setting we also assume that the feedback and reward sequences of different predictions are independent of each other. Let $\mu_i = \mathbb{E}[r_t(i)]$ denote the expected reward of predicting $i$, $\mu^* = \max_{i \in \mathcal{A}} \mu_i$ the optimal reward, and $i^*$ with $\mu_{i^*} = \mu^*$ the optimal prediction. Moreover, let $T_i(n) = \sum_{t=1}^n \mathbb{I}\{a_t = i\}$ denote the number of times $i$ is predicted by the end of time instant $n$. Then, defining the "gaps" $\Delta_i = \mu^* - \mu_i$ for all $i \in \mathcal{A}$, the expected regret of the forecaster becomes
\[
\mathbb{E}[R_n] = \mathbb{E}\left[\sum_{t=1}^n \left(\mu^* - \mu_{a_t}\right)\right] = \sum_{i=1}^K \Delta_i\, \mathbb{E}[T_i(n)]. \tag{2}
\]
Similarly to the adversarial setting, we build on a base algorithm Base for the non-delayed case. The advantage in the IPM setting (and of considering expected regret) is that here Base can consider a permuted order of rewards and feedbacks, and so we do not have to wait for the actual feedback; it is enough to receive a feedback for the same prediction. This is the idea at the core of our algorithm, Queued Partial Monitoring with Delayed Feedback (QPM-D):

Algorithm 2 Queued Partial Monitoring with Delays (QPM-D)
  Create an empty FIFO buffer Q[i] for each i ∈ A.
  Let I be the first prediction of Base.
  for each time instant t = 1, 2, ..., n do
    Predict:
      while Q[I] is not empty do
        Update Base with a feedback from Q[I].
        Let I be the next prediction of Base.
      end while
      There are no buffered feedbacks for I, so predict a_t = I at time instant t to get a feedback.
    Update:
      for each (s, h_s) ∈ H_t do
        Add the feedback h_s to the buffer Q[a_s].
      end for
  end for

Here we have a Base partial monitoring algorithm for the non-delayed case, which is run inside the algorithm. The feedback information coming from the environment is stored in separate queues for each prediction value. The outer algorithm constantly queries Base: while feedbacks for the predictions made are available in the queues, only the inner algorithm Base runs (that is, this happens within a single time instant in the real prediction problem). When no feedback is available, the outer algorithm keeps sending the same prediction to the real environment until a feedback for that prediction arrives. In this way, Base is run in a simulated non-delayed environment. The next lemma implies that the inner algorithm Base actually runs in a non-delayed version of the problem, as it experiences the same distributions:

Lemma 4. Consider a delayed stochastic IPM problem as defined above. For any prediction $i$ and any $s \in \mathbb{N}$, let $h'_{i,s}$ denote the $s$th feedback QPM-D receives for predicting $i$. Then the sequence $(h'_{i,s})_{s \in \mathbb{N}}$ is an i.i.d. sequence with the same distribution as the sequence of feedbacks $(h_{t,i})_{t \in \mathbb{N}}$ for prediction $i$.

To relate the non-delayed performance of Base to the regret of QPM-D, we need a few definitions. For any $t$, let $S_i(t)$ denote the number of feedbacks for prediction $i$ that are received by the end of time instant $t$.
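Before relating these quantities, the buffering mechanics of QPM-D can be sketched in code. The greedy Base learner, the deterministic two-arm rewards, and the constant delay below are illustrative assumptions, not part of the paper:

```python
# Sketch of QPM-D (Algorithm 2): feedback is buffered in per-prediction FIFO
# queues, and the environment is only queried when Base's current prediction
# has no buffered feedback. The toy greedy Base stands in for a real
# non-delayed algorithm (an illustrative assumption).
from collections import deque

class GreedyBase:
    """Toy non-delayed bandit base: try each arm once, then play the empirical best."""
    def __init__(self, k):
        self.count = [0] * k
        self.rsum = [0.0] * k
    def predict(self):
        for i, c in enumerate(self.count):
            if c == 0:
                return i
        means = [s / c for s, c in zip(self.rsum, self.count)]
        return means.index(max(means))
    def update(self, i, reward):
        self.count[i] += 1
        self.rsum[i] += reward

def qpm_d(n, k, reward, delays):
    base = GreedyBase(k)
    queues = [deque() for _ in range(k)]
    pending = {}                        # arrival time -> list of (action, feedback)
    total = 0.0
    I = base.predict()
    for t in range(1, n + 1):
        while queues[I]:                # feed Base from the buffer until it blocks
            base.update(I, queues[I].popleft())
            I = base.predict()
        a_t = I                         # no buffered feedback for I: query the environment
        r_t = reward(a_t, t)
        total += r_t
        pending.setdefault(t + delays[t - 1], []).append((a_t, r_t))
        for (a_s, h_s) in pending.pop(t, []):
            queues[a_s].append(h_s)     # buffer arriving feedback per prediction
    return total

n = 50
total = qpm_d(n, k=2, reward=lambda a, t: 1.0 if a == 1 else 0.0, delays=[2] * n)
print(total)    # prints 47.0: arm 0 is repeated only while its first feedback is in flight
```

Note how Base consumes feedback in a permuted, non-delayed order, which is exactly what Lemma 4 formalizes.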
Then the number of missing feedbacks for $i$ when making a prediction at time instant $t$ is $G_{i,t} = T_i(t-1) - S_i(t-1)$. Let $G^*_{i,n} = \max_{1 \le t \le n} G_{i,t}$. Furthermore, for each $i \in \mathcal{A}$, let $T'_i(t')$ be the number of times algorithm Base has predicted $i$ while being queried $t'$ times. Let $n'$ denote the number of steps the inner algorithm Base makes in $n$ steps of the real IPM problem. Next we relate $n$ and $n'$, as well as the number of times QPM-D and Base (in its simulated environment) make a specific prediction.

Lemma 5. Suppose QPM-D is run for $n \ge 1$ time instants, and has queried Base $n'$ times. Then $n' \le n$ and
\[
0 \le T_i(n) - T'_i(n') \le G^*_{i,n}. \tag{3}
\]

Proof. Since Base can take at most one step for each feedback that arrives, and QPM-D has to make at least one step for each arriving feedback, $n' \le n$. Now, fix a prediction $i \in \mathcal{A}$. If Base, and hence QPM-D, has not predicted $i$ by time instant $n$, then (3) trivially holds. Otherwise, let $t_{n,i}$ denote the last time instant (up to time $n$) when QPM-D predicts $i$. Then $T_i(n) = T_i(t_{n,i}) = T_i(t_{n,i} - 1) + 1$. Suppose Base has been queried $n'' \le n'$ times by time instant $t_{n,i}$ (inclusive). At this time instant, the buffer $Q[i]$ must be empty and Base must be predicting $i$; otherwise QPM-D would not predict $i$ in the real environment. This means that all the $S_i(t_{n,i} - 1)$ feedbacks that have arrived before this time instant have been fed to the base algorithm, which has also made an extra step, that is, $T'_i(n') \ge T'_i(n'') = S_i(t_{n,i} - 1) + 1$. Therefore,
\[
T_i(n) - T'_i(n') \le T_i(t_{n,i} - 1) + 1 - (S_i(t_{n,i} - 1) + 1) \le G_{i,t_{n,i}} \le G^*_{i,n}.
\]

We can now give an upper bound on the expected regret of Algorithm 2.

Theorem 6. Suppose the non-delayed Base algorithm is used in QPM-D in a delayed stochastic IPM environment.
Then the exp e cte d r e gr et of QPM-D is upp er-b ounde d by E [ R n ] ≤ E R Base n + K X i =1 ∆ i E G ∗ i,n , (4) wher e E R Base n is the ex p e cte d r e gr et of Base when run in the same envir onment without delays. When the delay τ t is b ounded by τ max for all t , we also hav e G ∗ i,n ≤ τ max , and E [ R n ] ≤ E R Base n + O ( τ max ). When the sequence of delays for ea ch prediction is i.i.d. with a finite expected v alue but unbounded sup- po rt, we can use Lemma 2 to bo und G ∗ i,n , and o btain a b ound E R Base n + O ( E [ τ 1 ] + p E [ τ 1 ] log n + log n ). Pr o of. Assume that QP M-D is r un longer so that Base is quer ied for n times (i.e., it is quer ied n − n ′ more times). Then, since n ′ ≤ n , the n umber of times i is predicted by the base algo rithm, namely T ′ i ( n ), can only increase , that is, T ′ i ( n ′ ) ≤ T ′ i ( n ). Com bining this with the exp ectatio n o f ( 3 ) gives E [ T i ( n )] ≤ E [ T ′ i ( n )] + E G ∗ i,n , which in tur n gives, K X i =1 ∆ i E [ T i ( n )] ≤ K X i =1 ∆ i E [ T ′ i ( n )] + K X i =1 ∆ i E G ∗ i,n . (5) As shown in Lemma 4 , the r eordered r ewards and feed- backs h ′ i, 1 , h ′ i, 2 , . . . , h ′ i,T ′ i ( n ′ ) , . . . h ′ i,T i ( n ) are i.i.d. with the same distribution a s the or iginal feedba ck seq uence ( h t,i ) t ∈ N . The base alg orithm Base has work ed o n the first T ′ i ( n ) of these feedbacks for each i (in its extended run), and ha s ther efore op erated for n s teps in a simu- lated environmen t with the same rew ard and feedback distributions, but without delay . Hence, the firs t sum- mation in the right hand side of ( 5 ) is in fact E R Base n , the exp ected reg r et of the base algo rithm in a non- delay ed environmen t. This concludes the pro of. 4. 
4. UCB for the Multi-Armed Bandit Problem with Delayed Feedback

While the algorithms in the previous section provide an easy way to convert algorithms devised for the non-delayed case to ones that can handle delays in the feedback, improvements can be achieved if one makes modifications inside the existing non-delayed algorithms while retaining their theoretical guarantees. This can be viewed as a "white-box" approach to extending online learning algorithms to the delayed setting, and enables us to escape the high memory requirements of black-box algorithms that arise, for both of our methods in the previous section, when the delays are large. We consider the stochastic multi-armed bandit problem, and extend the UCB family of algorithms (Auer et al., 2002; Garivier & Cappé, 2011) to the delayed setting. The modification proposed is quite natural, and the common characteristics of UCB-type algorithms enable a unified way of extending their performance guarantees to the delayed setting (up to an additive penalty due to delays).

Recall that in the stochastic MAB setting, which is a special case of the stochastic IPM problem of Section 3.2, the feedback at time instant $t$ is $h_t = r(a_t, b_t)$, and there is a distribution $\nu_i$ from which the rewards of each prediction $i$ are drawn in an i.i.d. manner. Here we assume that the rewards of different predictions are independent of each other. We use the same notation as in Section 3.2.

Several algorithms devised for the non-delayed stochastic MAB problem are based on upper confidence bounds (UCBs), which are optimistic estimates of the expected reward of different predictions. Different UCB-type algorithms use different upper confidence bounds, and choose, at each time instant, a prediction with the largest UCB.
Let $B_{i,s,t}$ denote the UCB for prediction $i$ at time instant $t$, where $s$ is the number of reward samples used in computing the estimate. In a non-delayed setting, the prediction of a UCB-type algorithm at time instant $t$ is given by $a_t = \operatorname{argmax}_{i \in \mathcal{A}} B_{i,T_i(t-1),t}$. In the presence of delays, one can simply use the same upper confidence bounds, only with the rewards that are observed, and predict

$$a_t = \operatorname{argmax}_{i \in \mathcal{A}} B_{i,S_i(t-1),t} \qquad (6)$$

at time instant $t$ (recall that $S_i(t-1)$ is the number of rewards that can be observed for prediction $i$ before time instant $t$). Note that if the delays are zero, this algorithm reduces to the corresponding non-delayed version of the algorithm.

The algorithms defined by (6) can easily be shown to enjoy the same regret guarantees as their non-delayed versions, up to an additive penalty depending on the delays. This is because the analyses of the regrets of UCB algorithms follow the same pattern of upper bounding the number of trials of a suboptimal prediction using concentration inequalities suitable for the specific form of UCBs they use.

As an example, the UCB1 algorithm (Auer et al., 2002) uses UCBs of the form $B_{i,s,t} = \hat{\mu}_{i,s} + \sqrt{2\log(t)/s}$, where $\hat{\mu}_{i,s} = \frac{1}{s}\sum_{t=1}^{s} h'_{i,t}$ is the average of the first $s$ observed rewards. Using this UCB in our decision rule (6), we can bound the regret of the resulting algorithm (called Delayed-UCB1) in the delayed setting:

Theorem 7. For any $n \ge 1$, the expected regret of the Delayed-UCB1 algorithm is bounded by

$$\mathbb{E}[R_n] \le \sum_{i:\Delta_i > 0} \left( \frac{8\log n}{\Delta_i} + 3.5\,\Delta_i \right) + \sum_{i=1}^{K} \Delta_i\, \mathbb{E}\big[G^*_{i,n}\big].$$

Note that the last term in the bound is the additive penalty, and, under different assumptions, it can be bounded in the same way as after Theorem 6.
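A minimal sketch of Delayed-UCB1, i.e., rule (6) instantiated with UCB1's confidence bounds, may make the bookkeeping concrete. The code is ours: the `play` interface, the constant-delay model, and the tie-breaking toward less-played arms are illustrative assumptions not specified in the text.

```python
import math
import random

def delayed_ucb1(K, play, n):
    """Sketch of Delayed-UCB1: UCB1 indices computed from observed rewards only.
    `play(i, t)` performs prediction i at time t and returns (reward, delay);
    the reward becomes observable at time t + delay."""
    sums = [0.0] * K   # sum of *observed* rewards per prediction
    S = [0] * K        # S_i(t-1): number of observed rewards per prediction
    T = [0] * K        # T_i(t-1): times predicted (used here only for tie-breaking)
    pending, plays = [], []
    for t in range(1, n + 1):
        # Deliver feedbacks whose delay has expired.
        for arrival, i, r in [p for p in pending if p[0] <= t]:
            sums[i] += r
            S[i] += 1
        pending = [p for p in pending if p[0] > t]
        # a_t = argmax_i B_{i, S_i(t-1), t}; an arm with no observed reward has
        # an infinite index, and ties go to less-played arms (our choice).
        a = max(range(K), key=lambda i: (
            math.inf if S[i] == 0
            else sums[i] / S[i] + math.sqrt(2 * math.log(t) / S[i]),
            -T[i]))
        r, delay = play(a, t)
        T[a] += 1
        plays.append(a)
        pending.append((t + delay, a, r))
    return plays

# Two Bernoulli arms (means 0.9 and 0.1) with a constant feedback delay of 3.
random.seed(1)
plays = delayed_ucb1(2, lambda i, t: (float(random.random() < [0.9, 0.1][i]), 3), 400)
```

With zero delay the `pending` list is consumed on the next step and the rule coincides with standard UCB1, as noted after (6).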
The proof of this theorem, as well as a similar regret bound for the delayed version of the KL-UCB algorithm (Garivier & Cappé, 2011), can be found in Appendix B.

5. Conclusion and future work

We analyzed the effect of feedback delays in online learning problems. We examined the partial monitoring case (which also covers the full information and the bandit settings), and provided general algorithms that transform forecasters devised for the non-delayed case into ones that handle delayed feedback. It turns out that the price of delay is a multiplicative increase in the regret in adversarial problems, and only an additive increase in stochastic problems. While we believe that these findings are qualitatively correct, we do not have lower bounds to prove this (matching lower bounds are available for the full information case only).

It also turns out that the most important quantity that determines the performance of our algorithms is $G^*_n$, the maximum number of missing rewards. It is interesting to note that $G^*_n$ is the maximum number of servers used in a multi-server queuing system with infinitely many servers and deterministic arrival times. It is also the maximum deviation of a certain type of Markov chain. While we have not found any immediately applicable results in these fields, we think that applying techniques from these areas could lead to an improved understanding of $G^*_n$, and hence an improved analysis of online learning under delayed feedback.

6. Acknowledgements

This work was supported by Alberta Innovates Technology Futures and NSERC.

References

Agarwal, Alekh and Duchi, John. Distributed delayed stochastic optimization. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24 (NIPS), pp.
873–881, 2011.

Auer, Peter, Cesa-Bianchi, Nicolò, and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, May 2002.

Cesa-Bianchi, Nicolò and Lugosi, Gábor. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006. ISBN 0521841089.

Desautels, Thomas, Krause, Andreas, and Burdick, Joel. Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization. In Proceedings of the 29th International Conference on Machine Learning (ICML), Edinburgh, Scotland, UK, 2012. Omnipress.

Doob, Joseph L. Stochastic Processes. John Wiley & Sons, 1953.

Dudik, Miroslav, Hsu, Daniel, Kale, Satyen, Karampatziakis, Nikos, Langford, John, Reyzin, Lev, and Zhang, Tong. Efficient optimal learning for contextual bandits. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 169–178, Corvallis, Oregon, 2011. AUAI Press.

Garivier, Aurélien and Cappé, Olivier. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), volume 19, pp. 359–376, Budapest, Hungary, July 2011.

Hoeffding, Wassily. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Langford, John, Smola, Alexander, and Zinkevich, Martin. Slow learners are fast. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 22, pp. 2331–2339. 2009.

Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW), pp.
661–670, New York, NY, USA, 2010. ACM.

Mesterharm, Chris J. On-line learning with delayed label feedback. In Jain, Sanjay, Simon, Hans Ulrich, and Tomita, Etsuji (eds.), Algorithmic Learning Theory, volume 3734 of Lecture Notes in Computer Science, pp. 399–413. Springer Berlin Heidelberg, 2005.

Mesterharm, Chris J. Improving On-line Learning. PhD thesis, Department of Computer Science, Rutgers University, New Brunswick, NJ, 2007.

Neu, Gergely, György, András, Szepesvári, Csaba, and Antos, András. Online Markov decision processes under bandit feedback. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R.S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23 (NIPS), pp. 1804–1812, 2010.

Titchmarsh, Edward Charles and Heath-Brown, David Rodney. The Theory of the Riemann Zeta-Function. Oxford University Press, second edition, January 1987.

Weinberger, Marcelo J. and Ordentlich, Erik. On delayed prediction of individual sequences. IEEE Transactions on Information Theory, 48(7):1959–1976, September 2002.

A. Proof of Lemma 4

In this appendix we prove Lemma 4, which was used in the i.i.d. partial monitoring setting (Section 3.2). To that end, we will first need two other lemmas. The first lemma shows that the i.i.d. property of a sequence of random variables is preserved under an independent random reordering of that sequence.

Lemma 8. Let $(X_t)_{t \in \mathbb{N}}$ be a sequence of independent, identically distributed random variables. If we reorder this sequence according to an independent random permutation, then the resulting sequence is i.i.d. with the same distribution as $(X_t)_{t \in \mathbb{N}}$.

Proof. Let the reordered sequence be denoted by $(Z_t)_{t \in \mathbb{N}}$. It is sufficient to show that for all $n \in \mathbb{N}$ and all $y_1, y_2, \ldots, y_n$, we have

$$\mathbb{P}\{Z_1 \le y_1, Z_2 \le y_2, \ldots, Z_n \le y_n\} = \mathbb{P}\{X_1 \le y_1, X_2 \le y_2, \ldots, X_n \le y_n\}.$$

Since $(X_t)_{t \in \mathbb{N}}$ is i.i.d., for any fixed permutation the equation above holds, as both sides are equal to $\prod_{t=1}^{n} \mathbb{P}\{X_1 \le y_t\}$. Since the permutations are independent of the sequence $(X_t)_{t \in \mathbb{N}}$, using the law of total probability this extends to the general case as well.

We also need the following result (Doob, 1953, Page 145, Chapter III, Theorem 5.2).

Lemma 9. Let $(X_t)_{t \in \mathbb{N}}$ be a sequence of i.i.d. random variables, and $(X'_t)_{t \in \mathbb{N}}$ be a subsequence of it such that the decision whether to include $X_t$ in the subsequence is independent of future values in the sequence, i.e., of $X_s$ for $s \ge t$. Then the sequence $(X'_t)_{t \in \mathbb{N}}$ is an i.i.d. sequence with the same distribution as $(X_t)_{t \in \mathbb{N}}$.

We can now proceed to the proof of Lemma 4.

Proof of Lemma 4. Let $(Z_{i,t})_{t \in \mathbb{N}}$ be the sequence resulting from sorting the variables $h_{i,t}$ by their possible observation times $t + \tau_{i,t}$ (that is, $Z_{i,1}$ is the earliest feedback that can be observed if $i$ is predicted at the appropriate time, and so on). Since delays are independent of the outcomes, they define an independent reordering of the sequence of feedbacks. Hence, by Lemma 8, $(Z_{i,t})_{t \in \mathbb{N}}$ is an i.i.d. sequence with the same distribution as $(h_{i,t})_{t \in \mathbb{N}}$. Note that $(h'_{i,s})_{s \in \mathbb{N}}$, the sequence of feedbacks (sorted by their observation times) that the agent observes for predicting $i$, is a subsequence of $(Z_{i,t})_{t \in \mathbb{N}}$, where the decision whether to include each $Z_{i,t}$ in the subsequence cannot depend on future possible observations $Z_{i,t'}$, $t' \ge t$. Also, the feedbacks of other predictions that are used in this decision were assumed to be independent of $(Z_{i,t})_{t \in \mathbb{N}}$. Hence, by Lemma 9, $(h'_{i,s})_{s \in \mathbb{N}}$ is an i.i.d.
sequence with the same distribution as $(Z_{i,t})_{t \in \mathbb{N}}$, which in turn has the same distribution as $(h_{i,t})_{t \in \mathbb{N}}$.

B. UCB for the Multi-Armed Bandit Problem with Delayed Feedback

This appendix details the framework we described in Section 4 for analyzing UCB-type algorithms in the delayed setting, and provides the missing proofs.

The regret of a UCB algorithm is usually analyzed by upper bounding the (expected) number of times a suboptimal prediction is made, and then using Equation (2) to get an expected regret bound. Consider a UCB algorithm with upper confidence bounds $B_{i,s,t}$, and fix a suboptimal prediction $i$. The typical analysis (e.g., by Auer et al. (2002)) considers the case when this prediction is made at least $\ell > 1$ times (for a large enough $\ell$), and uses concentration inequalities suitable for the specific form of the upper confidence bound to show that it is unlikely to make this suboptimal prediction more than $\ell$ times, because observing $\ell$ samples from its reward distribution suffices to distinguish it from the optimal prediction with high confidence. This value $\ell$ thus gives an upper bound on the expected number of times $i$ is predicted. Examples of such concentration inequalities include Hoeffding's inequality (Hoeffding, 1963) and Theorem 10 of Garivier & Cappé (2011), which are used for the UCB1 and KL-UCB algorithms, respectively.

More precisely, the general analysis of UCB-type algorithms in the non-delayed setting works as follows: for $\ell > 1$, we have $T_i(n) \le \ell + \sum_{t=1}^{n} \mathbb{I}\{a_t = i,\ T_i(t) > \ell\}$, where the sum on the right-hand side captures how much larger than $\ell$ the value of $T_i(n)$ is (recall that $T_i(t)$ is the number of times $i$ is predicted up to and including time $t$).
Whenever $i$ is predicted, its UCB, $B_{i,T_i(t-1),t}$, must have been greater than that of an optimal prediction, $B_{i^*,T_{i^*}(t-1),t}$, which implies

$$T_i(n) \le \ell + \sum_{t=1}^{n} \mathbb{I}\big\{B_{i,T_i(t-1),t} \ge B_{i^*,T_{i^*}(t-1),t},\ a_t = i,\ T_i(t-1) \ge \ell\big\}. \qquad (7)$$

The expected value of the summation on the right-hand side is then bounded using concentration inequalities as mentioned above.

In the delayed-feedback setting, if we use upper confidence bounds $B_{i,S_i(t-1),t}$ instead (where $S_i(t)$ was defined to be the number of rewards observed up to and including time instant $t$), in the same way as above we can write

$$T_i(n) \le \ell + \sum_{t=1}^{n} \mathbb{I}\big\{B_{i,S_i(t-1),t} \ge B_{i^*,S_{i^*}(t-1),t},\ a_t = i,\ T_i(t-1) \ge \ell\big\}.$$

Since $T_i(t-1) = G_{i,t} + S_i(t-1)$, with $\ell' = \ell - G^*_{i,n}$ we get

$$T_i(n) \le \ell' + G^*_{i,n} + \sum_{t=1}^{n} \mathbb{I}\big\{B_{i,S_i(t-1),t} \ge B_{i^*,S_{i^*}(t-1),t},\ a_t = i,\ S_i(t-1) \ge \ell'\big\}. \qquad (8)$$

Now the same concentration inequalities used to bound (7) in the analysis of the non-delayed setting can be used to upper bound the expected value of the sum in (8). Putting this into (2), we see that one can reuse the same upper confidence bound in the delayed setting (with only the observed rewards) and get a performance similar to the non-delayed setting, with only an additive penalty that depends on the delays. The following two sections demonstrate the use of this method on two UCB-type algorithms.

B.1. UCB1 under delayed feedback: Proof of Theorem 7

Below comes the proof of Theorem 7 for the Delayed-UCB1 algorithm (Section 4).

Proof of Theorem 7. Following the outline of the previous section, we can bound the summation in (8) using the same analysis as in the original UCB1 paper (Auer et al., 2002).
In particular, for any prediction $i$ we can write

$$\sum_{t=1}^{n} \mathbb{I}\big\{B_{i,S_i(t-1),t} \ge B_{i^*,S_{i^*}(t-1),t},\ S_i(t-1) \ge \ell'\big\} \le \sum_{t=1}^{n} \mathbb{I}\big\{B_{i^*,S_{i^*}(t-1),t} \le \mu_{i^*},\ S_i(t-1) \ge \ell'\big\} + \sum_{t=1}^{n} \mathbb{I}\big\{B_{i,S_i(t-1),t} \ge \mu_{i^*},\ S_i(t-1) \ge \ell'\big\}. \qquad (9)$$

The event in the second summation implies that either

$$\mu_i + 2\sqrt{\frac{2\log(t)}{S_i(t-1)}} > \mu_{i^*} \qquad \text{or} \qquad \hat{\mu}_{i,S_i(t-1)} - \sqrt{\frac{2\log(t)}{S_i(t-1)}} \ge \mu_i$$

(otherwise we would have $B_{i,S_i(t-1),t} < \mu_{i^*}$). Hence,

$$(9) \le \sum_{t=1}^{n} \mathbb{I}\left\{\hat{\mu}_{i^*,S_{i^*}(t-1)} + \sqrt{\frac{2\log(t)}{S_{i^*}(t-1)}} \le \mu_{i^*}\right\} + \sum_{t=1}^{n} \mathbb{I}\left\{\hat{\mu}_{i,S_i(t-1)} - \sqrt{\frac{2\log(t)}{S_i(t-1)}} \ge \mu_i\right\} + \sum_{t=1}^{n} \mathbb{I}\left\{\mu_i + 2\sqrt{\frac{2\log(t)}{S_i(t-1)}} > \mu_{i^*},\ S_i(t-1) \ge \ell'\right\}. \qquad (10)$$

Choosing $\ell' = \frac{8\log(n)}{\Delta_i^2}$ makes the events in the last summation above impossible, because $S_i(t-1) \ge \ell' \ge \frac{8\log(n)}{\Delta_i^2}$, which implies

$$2\sqrt{\frac{2\log(t)}{S_i(t-1)}} \le 2\sqrt{\frac{2\log(n)}{\ell'}} \le \Delta_i.$$

Therefore, combining with (8), we can write

$$T_i(n) \le \frac{8\log(n)}{\Delta_i^2} + G^*_{i,n} + \sum_{t=1}^{n} \sum_{s=1}^{t} \left( \mathbb{I}\left\{\hat{\mu}_{i^*,s} + \sqrt{\frac{2\log(t)}{s}} \le \mu_{i^*}\right\} + \mathbb{I}\left\{\hat{\mu}_{i,s} - \sqrt{\frac{2\log(t)}{s}} \ge \mu_i\right\} \right).$$

Taking expectations gives

$$\mathbb{E}[T_i(n)] \le \frac{8\log(n)}{\Delta_i^2} + \mathbb{E}\big[G^*_{i,n}\big] + \sum_{t=1}^{n} \sum_{s=1}^{t} \left( \mathbb{P}\left\{\hat{\mu}_{i^*,s} + \sqrt{\frac{2\log(t)}{s}} \le \mu_{i^*}\right\} + \mathbb{P}\left\{\hat{\mu}_{i,s} - \sqrt{\frac{2\log(t)}{s}} \ge \mu_i\right\} \right).$$

As in the original analysis, Hoeffding's inequality (Hoeffding, 1963) can be used to bound each of the probabilities in the summation, to get

$$\mathbb{P}\left\{\hat{\mu}_{i^*,s} + \sqrt{\frac{2\log(t)}{s}} \le \mu_{i^*}\right\} \le e^{-4\log(t)} = t^{-4}, \qquad \mathbb{P}\left\{\hat{\mu}_{i,s} - \sqrt{\frac{2\log(t)}{s}} \ge \mu_i\right\} \le e^{-4\log(t)} = t^{-4}.$$

Therefore, we have

$$\mathbb{E}[T_i(n)] \le \frac{8\log(n)}{\Delta_i^2} + \mathbb{E}\big[G^*_{i,n}\big] + \sum_{t=1}^{\infty} 2t^{-3} \le \frac{8\log(n)}{\Delta_i^2} + 1 + \mathbb{E}\big[G^*_{i,n}\big] + 2\zeta(3),$$

where $\zeta(3) < 1.21$ is the Riemann zeta function evaluated at $3$.³ Combining with (2) proves the theorem.

B.2. KL-UCB under delayed feedback

The KL-UCB algorithm was introduced by Garivier & Cappé (2011).
The upper confidence bound used by KL-UCB for predicting $i$ at time $t$ is $B_{i,T_i(t-1),t}$, where

$$B_{i,s,t} = \max\big\{q \in [\hat{\mu}_{i,s}, 1] : s\, d(\hat{\mu}_{i,s}, q) \le \log t + 3\log(\log t)\big\},$$

with $d(p,q) = p\log\big(\tfrac{p}{q}\big) + (1-p)\log\big(\tfrac{1-p}{1-q}\big)$ the KL-divergence of two Bernoulli random variables with parameters $p$ and $q$. In their Theorem 2, Garivier & Cappé (2011) show that there exists a constant $C_1 \le 10$, as well as functions $0 \le C_2(\epsilon) = O(\epsilon^{-2})$ and $0 \le \beta(\epsilon) = O(\epsilon^2)$, such that for any $\epsilon > 0$, the expected regret of the KL-UCB algorithm (in the non-delayed setting) satisfies

$$\mathbb{E}[R_n] \le \sum_{i:\Delta_i > 0} \Delta_i \left[ \frac{\log(n)}{d(\mu_i, \mu_{i^*})}(1+\epsilon) + C_1 \log(\log n) + \frac{C_2(\epsilon)}{n^{\beta(\epsilon)}} \right]. \qquad (11)$$

Using this upper confidence bound with (6), we arrive at the Delayed-KL-UCB algorithm. For this algorithm, we can prove the following regret bound using the general scheme described above together with the same techniques used by Garivier & Cappé (2011), again obtaining an additive penalty compared to the non-delayed setting.

Theorem 10. For any $\epsilon > 0$, the expected regret of the Delayed-KL-UCB algorithm after $n$ time instants satisfies

$$\mathbb{E}[R_n] \le \sum_{i:\Delta_i > 0} \Delta_i \left( \frac{\log(n)}{d(\mu_i, \mu_{i^*})}(1+\epsilon) + C_1 \log(\log(n)) \right) + \sum_{i=1}^{K} \Delta_i \left( \frac{C_2(\epsilon)}{n^{\beta(\epsilon)}}\, \mathbb{E}\big[G^*_{i,n}\big] + \mathbb{E}\big[G^*_{i,n}\big] + 1 \right),$$

where $C_1$, $C_2$, and $\beta$ are the same as in (11).

³For properties and theory of the Riemann zeta function, see the book of Titchmarsh & Heath-Brown (1987).

In this case, working out the proof and reusing the analysis is somewhat more complicated compared to UCB1. In particular, we will need an adaptation of Lemma 7 of Garivier & Cappé (2011), which is captured by the following lemma.

Lemma 11. Let $d^+(x,y) = d(x,y)\,\mathbb{I}\{x < y\}$.
Then for any $n \ge 1$,

$$\sum_{t=1}^{n} \mathbb{I}\big\{a_t = i,\ \mu_{i^*} \le B_{i^*,S_{i^*}(t-1),t},\ S_i(t-1) \ge \ell'\big\} \le G^*_{i,n} \sum_{s=\ell'}^{n} \mathbb{I}\big\{s\, d^+(\hat{\mu}_{i,s}, \mu_{i^*}) < \log(n) + 3\log(\log(n))\big\}.$$

Proof of Lemma 11. We start in the same way as the original proof. Note that $d^+(p,q)$ is non-decreasing in its second parameter, and that $a_t = i$ and $\mu_{i^*} \le B_{i^*,S_{i^*}(t-1),t}$ together imply $B_{i,S_i(t-1),t} \ge B_{i^*,S_{i^*}(t-1),t} \ge \mu_{i^*}$, which in turn gives

$$S_i(t-1)\, d^+(\hat{\mu}_{i,S_i(t-1)}, \mu_{i^*}) \le S_i(t-1)\, d(\hat{\mu}_{i,S_i(t-1)}, B_{i,S_i(t-1),t}) \le \log(t) + 3\log(\log(t)).$$

Therefore, we have

$$\begin{aligned}
\sum_{t=1}^{n} \mathbb{I}\big\{a_t = i,\ \mu_{i^*} \le B_{i^*,S_{i^*}(t-1),t},\ t > S_i(t-1) \ge \ell'\big\}
&\le \sum_{t=\ell'}^{n} \mathbb{I}\big\{a_t = i,\ S_i(t-1)\, d^+(\hat{\mu}_{i,S_i(t-1)}, \mu_{i^*}) \le \log(t) + 3\log(\log(t)),\ S_i(t-1) \ge \ell'\big\} \\
&\le \sum_{t=\ell'}^{n} \mathbb{I}\big\{a_t = i,\ S_i(t-1)\, d^+(\hat{\mu}_{i,S_i(t-1)}, \mu_{i^*}) \le \log(n) + 3\log(\log(n)),\ S_i(t-1) \ge \ell'\big\} \\
&\le \sum_{t=\ell'}^{n} \sum_{s=\ell'}^{t} \mathbb{I}\{a_t = i,\ S_i(t-1) = s\}\, \mathbb{I}\big\{s\, d^+(\hat{\mu}_{i,s}, \mu_{i^*}) \le \log(n) + 3\log(\log(n))\big\} \\
&= \sum_{s=\ell'}^{n} \sum_{t=s}^{n} \mathbb{I}\{a_t = i,\ S_i(t-1) = s\}\, \mathbb{I}\big\{s\, d^+(\hat{\mu}_{i,s}, \mu_{i^*}) \le \log(n) + 3\log(\log(n))\big\} \\
&= \sum_{s=\ell'}^{n} \mathbb{I}\big\{s\, d^+(\hat{\mu}_{i,s}, \mu_{i^*}) \le \log(n) + 3\log(\log(n))\big\} \left( \sum_{t=s}^{n} \mathbb{I}\{a_t = i,\ S_i(t-1) = s\} \right).
\end{aligned}$$

But note that the second summation is bounded by $G^*_{i,n}$, because for each $s$, there cannot be more than $G^*_{i,n}$ time instants at which $i$ is predicted while $S_i(t) = s$ remains constant; otherwise for some $t' \in \{s, \ldots, n\}$ we would have $T_i(t'-1) - S_i(t'-1) = G_{i,t'} > G^*_{i,n}$, which is not possible. Substituting this bound in the last expression proves the lemma.

We also recall the following two results from the original paper.

Theorem 12 (Theorem 10 of Garivier & Cappé (2011)). Let $(Y_t)$, $t \ge 1$, be a sequence of independent random variables bounded in $[0,1]$, with common expectation $\mu = \mathbb{E}[Y_t]$.
Consider a sequence $(\epsilon_t)$, $t \ge 1$, of Bernoulli variables such that for all $t > 0$, $\epsilon_t$ is a random function of $Y_1, \ldots, Y_{t-1}$,⁴ and is independent of $Y_s$, $s \ge t$. Let $\delta > 0$ and for every $1 \le t \le n$, let

$$S_t = \sum_{s=1}^{t} \epsilon_s \qquad \text{and} \qquad \hat{\mu}_t = \frac{\sum_{s=1}^{t} \epsilon_s Y_s}{S_t},$$

with $\hat{\mu}_t = 0$ when $S_t = 0$, and $B_n = \max\{q > \hat{\mu}_n : S_n\, d(\hat{\mu}_n, q) \le \delta\}$. Then

$$\mathbb{P}\{B_n < \mu\} \le e \lceil \delta \log(n) \rceil e^{-\delta}.$$

⁴That is, a function of $Y_1, \ldots, Y_{t-1}$ together with possibly an extra, independent randomization.

Lemma 13 (Lemma 8 of Garivier & Cappé (2011)). For a suboptimal prediction $i$, for every $\epsilon > 0$, let

$$K_n = \left\lfloor \frac{1+\epsilon}{d^+(\mu_i, \mu_{i^*})} \big( \log(n) + 3\log(\log(n)) \big) \right\rfloor.$$

Then there exist $C_2(\epsilon) > 0$ and $\beta(\epsilon) > 0$ such that

$$\sum_{s=K_n+1}^{\infty} \mathbb{P}\left\{ d^+(\hat{\mu}_{i,s}, \mu_{i^*}) < \frac{d(\mu_i, \mu_{i^*})}{1+\epsilon} \right\} \le \frac{C_2(\epsilon)}{n^{\beta(\epsilon)}}.$$

Now we are ready to prove Theorem 10 by reusing the same techniques as in the original paper.

Proof of Theorem 10. For a suboptimal prediction $i$, bounding the terms in (8) gives

$$\begin{aligned}
\sum_{t=1}^{n} \mathbb{I}\big\{a_t = i,\ B_{i,S_i(t-1),t} \ge B_{i^*,S_{i^*}(t-1),t},\ S_i(t-1) \ge \ell'\big\}
&\le \sum_{t=1}^{n} \mathbb{I}\big\{B_{i^*,S_{i^*}(t-1),t} < \mu_{i^*}\big\} + \sum_{t=1}^{n} \mathbb{I}\big\{a_t = i,\ \mu_{i^*} \le B_{i^*,S_{i^*}(t-1),t},\ S_i(t-1) \ge \ell'\big\} \\
&\le \sum_{t=1}^{n} \mathbb{I}\big\{B_{i^*,S_{i^*}(t-1),t} < \mu_{i^*}\big\} + G^*_{i,n} \sum_{s=\ell'}^{n} \mathbb{I}\big\{s\, d^+(\hat{\mu}_{i,s}, \mu_{i^*}) < \log(n) + 3\log(\log(n))\big\}, \qquad (12)
\end{aligned}$$

where the last inequality follows from Lemma 11. Let

$$K_n = \left\lfloor \frac{1+\epsilon}{d(\mu_i, \mu_{i^*})} \big( \log(n) + 3\log(\log(n)) \big) \right\rfloor, \qquad (13)$$

and note that $d(\mu_i, \mu_{i^*}) = d^+(\mu_i, \mu_{i^*})$. Let $\ell' = 1 + K_n$. Then we have:

$$\sum_{s=\ell'}^{n} \mathbb{I}\big\{s\, d^+(\hat{\mu}_{i,s}, \mu_{i^*}) \le \log(n) + 3\log(\log(n))\big\} \le \sum_{s=K_n+1}^{\infty} \mathbb{I}\big\{(K_n+1)\, d^+(\hat{\mu}_{i,s}, \mu_{i^*}) \le \log(n) + 3\log(\log(n))\big\} \le \sum_{s=K_n+1}^{\infty} \mathbb{I}\left\{ d^+(\hat{\mu}_{i,s}, \mu_{i^*}) < \frac{d(\mu_i, \mu_{i^*})}{1+\epsilon} \right\}.$$
(14)

Putting the value of $\ell'$ and inequalities (13) and (14) back into (12) and combining with (8), we get

$$\mathbb{E}[T_i(n)] \le \frac{1+\epsilon}{d(\mu_i, \mu_{i^*})} \big( \log(n) + 3\log(\log(n)) \big) + \mathbb{E}\big[G^*_{i,n}\big] + 1 + \sum_{t=1}^{n} \mathbb{P}\big\{B_{i^*,S_{i^*}(t-1),t} < \mu_{i^*}\big\} + \mathbb{E}\big[G^*_{i,n}\big] \sum_{s=K_n+1}^{\infty} \mathbb{P}\left\{ d^+(\hat{\mu}_{i,s}, \mu_{i^*}) < \frac{d(\mu_i, \mu_{i^*})}{1+\epsilon} \right\},$$

where the last term is a result of the delays being independent of the rewards. The first summation can be bounded using Theorem 12, for which it suffices to set $\epsilon_t = 1$ for $1 \le t \le n$, and use the sequence of observed rewards $(h'_{i,t})$ for the arm under consideration as the sequence $(Y_t)$ in the theorem. In the same way as in the analysis of Garivier & Cappé (2011), this gives an upper bound of the form $C'_1 \log(\log n)$ with the same value of $C'_1 \le 7$ as in the non-delayed setting. The second summation can be bounded by Lemma 13. Therefore, the expected number of times a suboptimal prediction is made is bounded by

$$\mathbb{E}[T_i(n)] \le \frac{1+\epsilon}{d(\mu_i, \mu_{i^*})} \big( \log(n) + 3\log(\log(n)) \big) + C'_1 \log(\log(n)) + \frac{C_2(\epsilon)}{n^{\beta(\epsilon)}}\, \mathbb{E}\big[G^*_{i,n}\big] + \mathbb{E}\big[G^*_{i,n}\big] + 1.$$

Combining this with (2) and letting $C_1 = C'_1 + 3$ finishes the proof.
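As a practical aside, the KL-UCB index of Section B.2 has no closed form, but since $d(\hat{\mu}_{i,s}, q)$ is non-decreasing in $q$ on $[\hat{\mu}_{i,s}, 1]$, it can be computed to any accuracy by bisection. The following is a minimal sketch of that computation (our code, not the authors'; the Bernoulli KL is clamped for numerical safety):

```python
import math

def bernoulli_kl(p, q):
    """d(p, q): KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # clamp away from 0 and 1 to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, s, t, iters=50):
    """B_{i,s,t} = max{q in [mu_hat, 1] : s * d(mu_hat, q) <= log t + 3 log log t},
    found by bisection, since d(mu_hat, .) is non-decreasing on [mu_hat, 1]."""
    threshold = math.log(t) + 3 * math.log(math.log(t))
    lo, hi = mu_hat, 1.0   # invariant: lo satisfies the constraint, hi may not
    for _ in range(iters):
        mid = (lo + hi) / 2
        if s * bernoulli_kl(mu_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

# The index shrinks toward the empirical mean as the sample count s grows.
b_small = kl_ucb_index(0.5, 10, 100)
b_large = kl_ucb_index(0.5, 1000, 100)
```

In the delayed setting of Theorem 10, this index would simply be evaluated at $s = S_i(t-1)$, the number of observed rewards, as in rule (6).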