Decentralized Restless Bandit with Multiple Players and Unknown Dynamics


Authors: Haoyang Liu, Keqin Liu, Qing Zhao

Department of Electrical and Computer Engineering
University of California, Davis, CA 95616
{liu, kqliu, qzhao}@ucdavis.edu

This work was supported by the Army Research Office under Grant W911NF-08-1-0467 and by the National Science Foundation under Grant CCF-0830685.

Abstract: We consider decentralized restless multi-armed bandit problems with unknown dynamics and multiple players. The reward state of each arm transits according to an unknown Markovian rule when it is played and evolves according to an arbitrary unknown random process when it is passive. Players activating the same arm at the same time collide and suffer from reward loss. The objective is to maximize the long-term reward by designing a decentralized arm selection policy to address unknown reward models and collisions among players. A decentralized policy is constructed that achieves a regret with logarithmic order when an arbitrary nontrivial bound on certain system parameters is known. When no knowledge about the system is available, we extend the policy to achieve a regret arbitrarily close to the logarithmic order. The result finds applications in communication networks, financial investment, and industrial engineering.

I. INTRODUCTION

A. The Classic MAB with a Single Player

In the classic MAB, there are N independent arms and a single player. Each arm, when played, offers an i.i.d. random reward to the player. The reward distribution of each arm is unknown. At each time, the player chooses one arm to play, aiming to maximize the total expected reward in the long run. This problem involves the well-known tradeoff between exploitation and exploration. For exploitation, the player should select the arm with the largest sample mean of reward. For exploration, the player should select an under-played arm to learn its reward statistics.

Under the non-Bayesian formulation, the performance measure of an arm selection policy is the so-called regret, or the cost of learning, defined as the reward loss with respect to the case with known reward models [1]. In 1985, Lai and Robbins showed that the minimum regret grows at a logarithmic order under certain regularity conditions [1]. The best leading constant was also obtained, and an optimal policy was constructed to achieve the minimum regret growth rate (both the logarithmic order and the best leading constant). In 1987, Anantharam et al. extended Lai and Robbins's results to accommodate multiple simultaneous plays [2] and a Markovian reward model where the reward of each arm evolves as an unknown Markov process over successive plays and remains frozen when the arm is passive (the so-called rested Markovian reward model) [3]. Several other, simpler policies have been developed to achieve logarithmic regret for the classic MAB under an i.i.d. reward model [4], [5]. In particular, the index policy Upper Confidence Bound 1 (UCB-1) proposed in [5] achieves logarithmic regret with a uniform bound on the leading constant over time. In [6], UCB-1 was extended to the rested Markovian reward model adopted in [3].
B. Decentralized MAB with Distributed Multiple Players

In [7], Liu and Zhao formulated and studied a decentralized version of the classic MAB with M (M < N) distributed players under the i.i.d. reward model. Different arms can have different reward distributions, and these distributions are unknown to the players. At each time, a player chooses one arm to play based on its local observation and decision history, without exchanging information with other players. Collisions occur when multiple players choose the same arm, and, depending on the collision model, either no one receives reward or the colliding players share the reward in an arbitrary way. The objective is to maximize the long-term sum reward of all players. Another desired feature of policies for decentralized MAB is fairness, i.e., different players have the same expected reward growth rate. Liu and Zhao proposed the Time Division Fair Sharing (TDFS) framework, which achieves the same logarithmic regret order as the centralized case in which all players share their observations in learning and collisions are eliminated through centralized perfect scheduling [7]. Assuming a Bernoulli reward model, decentralized MAB was also addressed in [8], where the single-player policy UCB-1 was extended to the multi-player setting.

C. Main Results

In this paper, we consider the decentralized MAB with a restless Markovian reward model. In a single-player restless MAB, the reward state of each arm transits according to an unknown Markovian rule when played and according to an arbitrary unknown random process when passive, as addressed in our prior work [9]. In [9], we proposed a policy, Restless UCB (RUCB), which achieves a logarithmic order of the weak regret, defined as the reward loss compared to the case when the player knows which arm is the most rewarding and always plays the best arm. RUCB borrows the index form of UCB-1 given in [5] and has a deterministic epoch structure with carefully chosen epoch lengths to balance exploration and exploitation. The concept of weak regret was first used in [10]; it measures the reward loss with respect to the optimal single-arm policy, which, while optimal under the i.i.d. and rested Markovian reward models (up to an O(1) loss term for the latter), is no longer optimal in general under a known restless reward model. Analysis of the strict regret of restless MAB is in general intractable, given that finding the optimal policy of a restless bandit under a known model is itself PSPACE-hard in general [11].

In this paper, we extend RUCB proposed in our prior work [9] to a decentralized setting of restless MAB with multiple players. We consider two types of restless reward models: the exogenous restless model and the endogenous restless model. In the former, the system itself is rested: the state of an arm does not change when the arm is not engaged. However, from each individual player's perspective, arms are restless due to actions of other players that are unobservable and uncontrollable. Under the endogenous restless model, the state of an arm evolves according to an arbitrary unknown random process even when the arm is not played. Under both restless models, we extend RUCB to achieve a logarithmic order of the regret.
The result for the exogenous restless model, however, is stronger in the sense that the regret is defined with respect to the optimal policy under known reward models. This is possible due to the inherent rested nature of the system.

There are a couple of parallel works to [9] on the single-player restless MAB. In [12], Tekin and Liu adopted the weak regret and proposed a policy that achieves logarithmic (weak) regret when certain knowledge about the system parameters is available. The policy proposed in [12] also uses the index form of UCB-1 given in [5], but its structure is different from that of RUCB proposed in [9]. Specifically, under the policy proposed in [12], an arm is played consecutively for a random number of times determined by the regenerative cycle of a particular state, and observations obtained outside the regenerative cycle are not used in learning. RUCB, however, has a deterministic epoch structure, and all observations are used in learning. In [13], the strict regret was considered for a special class of restless MAB. Specifically, when arms are governed by stochastically identical two-state Markov chains, a policy was constructed in [13] to achieve a regret with an order arbitrarily close to logarithmic.

Notation: For two positive integers k and l, define $k \oslash l \triangleq ((k-1) \bmod l) + 1$, which is an integer taking values in $1, 2, \cdots, l$.

II. PROBLEM FORMULATION

In the decentralized MAB problem, we have M players and N independent arms. At each time, each player chooses one arm to play. Each arm, when played (activated), offers a certain amount of reward that models the current state of the arm. Let $s_j(t)$ and $\mathcal{S}_j$ denote the state of arm j at time t and the state space of arm j, respectively. Different arms can have different state spaces. When arm j is played, its state changes according to a Markovian rule with transition matrix $P_j$. The transition matrices are assumed to be irreducible, aperiodic, and reversible. For the state transition of passive arms, we consider two models: the endogenous restless model and the exogenous restless model. In the endogenous restless model, arm states change in arbitrary ways when not played. In the exogenous restless model, arm states remain frozen when not engaged. The players do not know the transition matrices of the arms and do not communicate with each other. Conflicts occur when different players select the same arm to play. Under different conflict models, either the players in conflict share the reward or no one obtains any reward. The objective is to maximize the expected total reward collected in the long run.

Let $\vec{\pi}_j = \{\pi^j_s\}_{s \in \mathcal{S}_j}$ denote the stationary distribution of arm j (under $P_j$), where $\pi^j_s$ is the stationary probability (under $P_j$) that arm j is in state s. The stationary mean reward $\mu_j$ is given by $\mu_j = \sum_{s \in \mathcal{S}_j} s\,\pi^j_s$. Let $\sigma$ be a permutation of $\{1, \cdots, N\}$ such that $\mu_{\sigma(1)} \ge \mu_{\sigma(2)} \ge \cdots \ge \mu_{\sigma(N)}$.

A policy $\Phi$ is a rule that specifies an arm to play based on the local observation history. Let $t_j(n)$ denote the time index of the n-th play on arm j, and $T_j(t)$ the total number of plays on arm j by time t. Note that both $t_j(n)$ and $T_j(t)$ are random variables with distributions determined by the policy $\Phi$.
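To make these quantities concrete, the short sketch below (an illustration, not part of the paper) computes the stationary distribution $\vec{\pi}_j$ and the stationary mean reward $\mu_j$ for a hypothetical two-state arm, along with the offset operator $k \oslash l$ defined above.

```python
import numpy as np

def circ_offset(k: int, l: int) -> int:
    """The operator k (circled-slash) l = ((k - 1) mod l) + 1, taking values in 1..l."""
    return ((k - 1) % l) + 1

def stationary_distribution(P: np.ndarray) -> np.ndarray:
    """Stationary distribution of an irreducible, aperiodic transition matrix P
    (rows sum to 1), obtained from the left eigenvector for eigenvalue 1."""
    w, v = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(w - 1.0))
    pi = np.real(v[:, idx])
    return pi / pi.sum()

def stationary_mean_reward(P: np.ndarray, states: np.ndarray) -> float:
    """mu_j = sum_s s * pi_s, where the reward of a state equals its value s."""
    return float(stationary_distribution(P) @ states)

# A hypothetical two-state arm: states (rewards) 1 and 2.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
states = np.array([1.0, 2.0])
print(stationary_mean_reward(P, states))          # about 1.333 (pi = (2/3, 1/3))
print([circ_offset(k, 3) for k in range(1, 7)])   # [1, 2, 3, 1, 2, 3]
```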
Under the conflict model where players in conflict share the reward, the total reward by time t is given by

$$R(t) = \sum_{j=1}^{N} \sum_{n=1}^{T_j(t)} s_j(t_j(n)). \tag{1}$$

Under the conflict model where no player in conflict obtains any reward, the total reward by time t is given by

$$R(t) = \sum_{j=1}^{N} \sum_{n=1}^{T_j(t)} s_j(t_j(n))\, \mathbb{I}_j(t_j(n)), \tag{2}$$

where $\mathbb{I}_j(t_j(n)) = 1$ if arm j is played by one and only one player at time $t_j(n)$, and $\mathbb{I}_j(t_j(n)) = 0$ otherwise.

As mentioned in Sec. I, for both restless models, the performance of a policy $\Phi$ is evaluated using the regret $r_\Phi(t)$, defined as the reward loss with respect to having the M best arms constantly engaged. Specifically, for both restless models, the regret is defined as

$$r_\Phi(t) = t \sum_{i=1}^{M} \mu_{\sigma(i)} - \mathbb{E}_\Phi[R(t)] + O(1), \tag{3}$$

where the O(1) constant is caused by the transient effects of playing the M best arms, and $\mathbb{E}_\Phi$ denotes the expectation with respect to the random process induced by policy $\Phi$. The objective is to minimize the growth rate of the regret. Note that the O(1) term can be ignored when studying the growth rate of the regret.

[Fig. 1. Epoch structures of decentralized RUCB: for player 1, the n-th exploration epoch consists of N subepochs of $4^{n-1}$ slots each, and the n-th exploitation epoch consists of M subepochs of $2 \times 4^{n-1}$ slots each, each subepoch targeting one of the arms $a^*(1), \ldots, a^*(M)$ with the M highest indexes; the two types of epochs are not always interleaving.]

III. THE DECENTRALIZED RUCB POLICY

The proposed decentralized RUCB is based on an epoch structure: time is divided into disjoint epochs. There are two types of epochs, exploitation epochs and exploration epochs (see the illustration in Fig. 1). In the exploitation epochs, the players calculate the indexes of all arms and play the arms with the M highest indexes, which are believed to be the M best arms. In the exploration epochs, the players obtain information about all arms by playing them equally many times. The purpose of the exploration epochs is to make the decisions in the exploitation epochs sufficiently accurate.

As shown in Fig. 1, in the n-th exploration epoch, each player plays every arm $4^{n-1}$ times. At the beginning of the n-th exploitation epoch, each player calculates an index for every arm (see (5) in Fig. 2) and selects the arms with the M highest indexes (denoted as arm $a^*(1)$ to arm $a^*(M)$). Each exploitation epoch is divided into M subepochs, each having a length of $2 \times 4^{n-1}$. Player k plays arm $a^*((m-k+M+1) \oslash M)$ in the m-th subepoch of each exploitation epoch (see the illustrative sketch below). The details on interleaving the two types of epochs are given in Step 2 of Fig. 2. Specifically, whenever sufficiently many ($D \ln t$, see (4)) observations have been obtained from every arm in the exploration epochs, the player is ready to proceed with a new exploitation epoch. Otherwise, another exploration epoch is required to gain more information about each arm.
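The following snippet (an illustration, not part of the paper) evaluates the offset rule above for M = 3 players. Within every subepoch the players target M distinct arms, so no collisions occur under the pre-agreement, and over the M subepochs each player targets every one of the M selected arms exactly once.

```python
def target_index(m: int, k: int, M: int) -> int:
    """Index (into the ranked list a*(1..M)) that player k targets in subepoch m."""
    return ((m - k + M + 1 - 1) % M) + 1   # (m - k + M + 1) "circled-slash" M

M = 3
for m in range(1, M + 1):
    row = [target_index(m, k, M) for k in range(1, M + 1)]
    print(f"subepoch {m}: players 1..{M} target a*{row}")
# subepoch 1: players 1..3 target a*[1, 3, 2]
# subepoch 2: players 1..3 target a*[2, 1, 3]
# subepoch 3: players 1..3 target a*[3, 2, 1]
```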
It is also implied in (4) that only logarithmically many plays are spent in the exploration epochs, which is one of the key reasons for the logarithmic regret of decentralized RUCB. This also implies that the exploration epochs are much less frequent than the exploitation epochs. Though the exploration epochs can be understood as the "information gathering" phase and the exploitation epochs as the "information utilization" phase, observations obtained in the exploitation epochs are also used in learning the arm dynamics. This can be seen in Step 3 of Fig. 2. The epoch structure (i.e., the starting and ending points of the epochs) is determined in advance and depends only on the parameter D. This is one of the key reasons why different players can be coordinated (i.e., enter the same epoch at the same time) without intercommunication.

Decentralized RUCB

Time is divided into epochs. There are two types of epochs, exploration epochs and exploitation epochs. At the beginning of the n-th exploitation epoch, we choose M arms to play, each of them for $2 \times 4^{n-1}$ times. In the n-th exploration epoch, we play every arm $4^{n-1}$ times. Let $n_O(t)$ denote the number of exploration epochs played by time t and $n_I(t)$ the number of exploitation epochs played by time t.

1. At t = 1, start the first exploration epoch, in which every arm is played once. Set $n_O(N+1) = 1$ and $n_I(N+1) = 0$. Then go to Step 2.

2. Let $X_1(t) = (4^{n_O(t)} - 1)/3$ be the time spent on each arm in exploration epochs by time t. Choose L and D according to (6) and (7). If

$$X_1(t) > D \ln t, \tag{4}$$

go to Step 3 (start an exploitation epoch). Otherwise, go to Step 4 (start an exploration epoch).

3. Calculate the indexes $d_{i,t}$ for all arms using the formula

$$d_{i,t} = \bar{s}_i(t) + \sqrt{\frac{L \ln t}{T_i(t)}}, \tag{5}$$

where t is the current time, $\bar{s}_i(t)$ is the sample mean of the observations from arm i by time t, L is chosen according to (6), and $T_i(t)$ is the number of times arm i has been played by time t. Then choose the arms with the M highest indexes (arm $a^*(1)$ to arm $a^*(M)$). Each exploitation epoch is divided into M subepochs, each having a length of $2 \times 4^{n-1}$, where n is the index of the current exploitation epoch. Player k plays arm $a^*((m-k+M+1) \oslash M)$ in the m-th subepoch of the exploitation epoch. After arm $a^*(1)$ to arm $a^*(M)$ are played, increase $n_I$ by one and go to Step 2.

4. In the n-th exploration epoch, play each arm for $4^{n-1}$ slots. The epoch is divided into N subepochs, each having a length of $4^{n-1}$. Player k plays arm $(m-k+N+1) \oslash N$ in the m-th subepoch of the exploration epoch. After all the arms are played, increase $n_O$ by one and go to Step 2.

Fig. 2. The decentralized RUCB policy.
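To make the epoch logic of Fig. 2 concrete, here is a minimal single-player sketch under our reading of the policy; it is an illustration, not the authors' implementation. The environment callback `play(arm)`, which returns the observed reward state, and the inputs L, D, and the horizon are assumptions supplied by the caller.

```python
import math

def decentralized_rucb(play, N, M, k, L, D, horizon):
    """Illustrative single-player view of decentralized RUCB (Fig. 2), player k of M."""
    circ = lambda a, l: ((a - 1) % l) + 1          # the "circled-slash" operator
    T = [0] * (N + 1)                               # number of plays of each arm so far
    S = [0.0] * (N + 1)                             # sum of observed states per arm
    t = 0                                           # global time (slots used)
    n_O, n_I = 0, 0                                 # exploration / exploitation epochs played

    def pull(arm):
        nonlocal t
        t += 1
        s = play(arm)                               # assumed environment callback
        T[arm] += 1
        S[arm] += s

    while t < horizon:                              # epochs may slightly overshoot the horizon
        X1 = (4 ** n_O - 1) / 3                     # per-arm time spent in exploration epochs
        if n_O > 0 and X1 > D * math.log(max(t, 2)):
            # Exploitation epoch: rank arms by the index (5) and play M subepochs.
            n_I += 1
            idx = {i: S[i] / T[i] + math.sqrt(L * math.log(t) / T[i])
                   for i in range(1, N + 1)}
            ranked = sorted(idx, key=idx.get, reverse=True)[:M]   # a*(1), ..., a*(M)
            for m in range(1, M + 1):
                arm = ranked[circ(m - k + M + 1, M) - 1]
                for _ in range(2 * 4 ** (n_I - 1)):
                    pull(arm)
        else:
            # Exploration epoch: every arm is played 4^(n-1) times, staggered across players.
            n_O += 1
            for m in range(1, N + 1):
                arm = circ(m - k + N + 1, N)
                for _ in range(4 ** (n_O - 1)):
                    pull(arm)
```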
A. Eliminating Pre-Agreement

So far we have assumed a pre-agreement among the players: they target the M best arms with different offsets to avoid excessive collisions. In this subsection, we show that this pre-agreement can be eliminated while maintaining the logarithmic order of the system regret. Furthermore, players can join the system at different times without any global synchronization. Specifically, at each player, the structure of the exploration and exploitation epochs is the same as in the local RUCB policy with pre-agreement. The only difference is that in each exploitation epoch, the player randomly chooses one of the M arms considered to be the best whenever a collision with other players is observed. If no collision is observed, the player keeps playing the same arm. This simple elimination of pre-agreement leads to complete decentralization among players while achieving the same logarithmic order of the system regret. Besides joining the system according to its local schedule, a player can also leave the system for an arbitrary finite time period.

IV. THE LOGARITHMIC REGRET OF DECENTRALIZED RUCB

In this section, we show that the regret achieved by the decentralized RUCB policy has a logarithmic order. This is given in the following theorem.

Theorem 1: Under the exogenous restless Markovian reward model, assume that when arms are engaged, they can be modeled as finite-state, irreducible, aperiodic, and reversible Markov chains, and that all states (rewards) are positive. Let $\pi_{\min} = \min_{s \in \mathcal{S}_i, 1 \le i \le N} \pi^i_s$, $\epsilon_{\max} = \max_{1 \le i \le N} \epsilon_i$, $\epsilon_{\min} = \min_{1 \le i \le N} \epsilon_i$, $s_{\max} = \max_{s \in \mathcal{S}_i, 1 \le i \le N} s$, $s_{\min} = \min_{s \in \mathcal{S}_i, 1 \le i \le N} s$, and $|\mathcal{S}|_{\max} = \max_{1 \le i \le N} |\mathcal{S}_i|$, where $\epsilon_i = 1 - \lambda_i$ and $\lambda_i$ is the second largest eigenvalue of the matrix $P_i$. Assume that different arms have different $\mu$ values.¹ Set the policy parameters L and D to satisfy the following conditions:

$$L \ge \frac{1}{\epsilon_{\min}} \left( \frac{420\, s_{\max}^2 |\mathcal{S}|_{\max}^2}{3 - 2\sqrt{2}} + 10\, s_{\max}^2 \right), \tag{6}$$

$$D \ge \frac{4L}{\big(\min_{j \le M} (\mu_{\sigma(j)} - \mu_{\sigma(j+1)})\big)^2}. \tag{7}$$

Under the conflict model where players in conflict share the reward, the regret of decentralized RUCB at the end of any epoch can be upper bounded by

$$\begin{aligned} r_\Phi(t) \le\ & \frac{1}{3}\big[4(3D\ln t + 1) - 1\big] \left( \sum_{i=1}^{M} \mu_{\sigma(i)} - \frac{M}{N} \sum_{i=1}^{N} \mu_{\sigma(i)} \right) \\ & + 3 \Big\lceil \log_4\big(\tfrac{3}{2}(t-N)+1\big) \Big\rceil \left( 1 + \frac{\epsilon_{\max}\sqrt{L}}{10\, s_{\min}} \right) \sum_{i=1}^{M-1} \sum_{j=1, j \ne i}^{N} \mu_{\sigma(i)} \frac{|\mathcal{S}_{\sigma(i)}| + |\mathcal{S}_{\sigma(j)}|}{\pi_{\min}} \\ & + 3 \Big\lceil \log_4\big(\tfrac{3}{2}(t-N)+1\big) \Big\rceil \left( 1 + \frac{\epsilon_{\max}\sqrt{L}}{10\, s_{\min}} \right) \sum_{j=M+1}^{N} (\mu_{\sigma(M)} - \mu_{\sigma(j)}) \frac{|\mathcal{S}_{\sigma(M)}| + |\mathcal{S}_{\sigma(j)}|}{\pi_{\min}} \\ & + 3 \Big\lceil \log_4\big(\tfrac{3}{2}(t-N)+1\big) \Big\rceil \left( 1 + \frac{\epsilon_{\max}\sqrt{L}}{10\, s_{\min}} \right) \sum_{j=1}^{M-1} \mu_{\sigma(M)} \frac{|\mathcal{S}_{\sigma(M)}| + |\mathcal{S}_{\sigma(j)}|}{\pi_{\min}} \\ & + \sum_{i=1}^{N} \Big( \min_{s \in \mathcal{S}_i} \pi_s \Big)^{-1} \sum_{s \in \mathcal{S}_i} s. \end{aligned} \tag{8}$$

Under the conflict model where no player in conflict gets any reward, the regret of decentralized RUCB at the end of any epoch can be upper bounded by

$$\begin{aligned} r_\Phi(t) \le\ & 3 \Big\lceil \log_4\big(\tfrac{3}{2}(t-N)+1\big) \Big\rceil \left( 1 + \frac{\epsilon_{\max}\sqrt{L}}{10\, s_{\min}} \right) \left( \sum_{i=1}^{M} \mu_{\sigma(i)} \right) \sum_{i=1}^{M} \sum_{j=1, j \ne i}^{N} \frac{|\mathcal{S}_{\sigma(i)}| + |\mathcal{S}_{\sigma(j)}|}{\pi_{\min}} \\ & + \frac{1}{3}\big[4(3D\ln t + 1) - 1\big] \left( \sum_{i=1}^{M} \mu_{\sigma(i)} - \frac{M}{N} \sum_{i=1}^{N} \mu_{\sigma(i)} \right) \\ & + \sum_{i=1}^{N} \Big( \min_{s \in \mathcal{S}_i} \pi_s \Big)^{-1} \sum_{s \in \mathcal{S}_i} s. \end{aligned} \tag{9}$$

¹This assumption can be relaxed by utilizing the shared index set; it is made only for simplicity of the presentation.

We point out that the upper bounds on the regret in Theorem 1 can be extended to any time t instead of only the ending points of epochs. They can also be extended to the endogenous restless model in terms of weak regret. The no-pre-agreement version of decentralized RUCB also achieves a regret with logarithmic order.

Proof: See Appendix A for details.
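As a concrete illustration of how conditions (6) and (7) translate into parameter choices, the small helper below is ours (not from the paper); it follows the reconstruction of (6) and (7) given above, and the input bounds are assumptions the designer must supply.

```python
import math

def rucb_parameters(eps_min, s_max, S_max, min_gap):
    """Smallest L and D consistent with conditions (6) and (7) as written above,
    given (nontrivial) bounds on eps_min, s_max, |S|_max, and the smallest mean gap
    among the top M+1 arms."""
    L = (420.0 * s_max**2 * S_max**2 / (3.0 - 2.0 * math.sqrt(2.0))
         + 10.0 * s_max**2) / eps_min
    D = 4.0 * L / min_gap**2
    return L, D

# Hypothetical bounds: eps_min >= 0.05, rewards <= 2, at most 4 states per arm, gap >= 0.1.
print(rucb_parameters(0.05, 2.0, 4, 0.1))
```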
Theorem 1 requires an arbitrary (nontrivial) bound on $s_{\max}^2$, $|\mathcal{S}|_{\max}$, $\epsilon_{\min}$, and $\min_{j \le M} (\mu_{\sigma(j)} - \mu_{\sigma(j+1)})$. In the case where these bounds are unavailable, D and L can be chosen to increase with time to achieve a regret order arbitrarily close to the logarithmic order. This is formally stated in the following theorem.

Theorem 2: Assume the exogenous restless model and that all arms, when engaged, can be modeled as finite-state, irreducible, aperiodic, and reversible Markov chains. For any increasing sequence f(t) (with $f(t) \to \infty$ as $t \to \infty$), if L(t) and D(t) are chosen such that $L(t) \to \infty$, $f(t)/D(t) \to \infty$, and $D(t)/L(t) \to \infty$ as $t \to \infty$, then

$$r_\Phi(t) = o\big(f(t)\log t\big). \tag{10}$$

For instance, for $f(t) = \log\log t$, the choices $D(t) = (\log\log t)^{2/3}$ and $L(t) = (\log\log t)^{1/3}$ satisfy these conditions. We point out that the conclusion in Theorem 2 still holds for the endogenous restless model, though the proof needs to be modified.

Proof: See Appendix B for details.

V. CONCLUSION

In this paper, we studied decentralized restless multi-armed bandit problems, where distributed players aim to accrue the maximum long-term reward without knowing the system reward statistics. Under the exogenous model, where the arm reward state remains static when not engaged, we proposed a policy that achieves the optimal logarithmic order of the system regret. Under the endogenous model, where the arm reward state evolves according to an arbitrary random process when not engaged, we showed that the proposed policy achieves a logarithmic (weak) regret. Furthermore, we showed that the proposed policy achieves complete decentralization, where no pre-agreement or global synchronization among players is required.

APPENDIX A. PROOF OF THEOREM 1

We first rewrite the definition of the regret as

$$\begin{aligned} r_\Phi(t) &= t \sum_{i=1}^{M} \mu_{\sigma(i)} - \mathbb{E}_\Phi[R(t)] \\ &= \sum_{i=1}^{N} \left( \mu_i \mathbb{E}[T_i(t)] - \mathbb{E}\Big[ \sum_{n=1}^{T_i(t)} s_i(t_i(n)) \Big] \right) + \left( \sum_{i=1}^{N} \mathbb{E}\Big[ \sum_{n=1}^{T_i(t)} s_i(t_i(n)) \Big] - \mathbb{E}_\Phi[R(t)] \right) \\ &\quad + \left( t \sum_{i=1}^{M} \mu_{\sigma(i)} - \sum_{i=1}^{N} \mu_i \mathbb{E}[T_i(t)] \right). \end{aligned} \tag{11}$$

To bound the first term in (11), we introduce the following lemma.

Lemma 1 [3]: Let $Y_1, Y_2, \cdots$ be a Markov chain with state space $\mathcal{S}$, transition probability matrix P, initial distribution $\vec{q}$, and stationary distribution $\vec{\pi}$ ($\pi_s$ is the stationary probability of state s). Let $\mathcal{F}_t$ be the $\sigma$-algebra generated by $Y_1, \cdots, Y_t$, and let $\mathcal{G}$ be a $\sigma$-algebra independent of $\mathcal{F}_\infty = \vee_t \mathcal{F}_t$. Let T be a stopping time with respect to $\{\mathcal{F}_t \vee \mathcal{G}\}$. Denote the state (reward) at time t by s(t) and let $\mu$ denote the mean reward. Then for any such stopping time T, there exists a value $A_P \le (\min_{s \in \mathcal{S}} \pi_s)^{-1} \sum_{s \in \mathcal{S}} s$ such that $\mathbb{E}\big[\sum_{t=1}^{T} s(t) - \mu T\big] \le A_P$.

Using Lemma 1, the first term in (11) can be bounded by the following constant:

$$\sum_{i=1}^{N} \Big( \min_{s \in \mathcal{S}_i} \pi_s \Big)^{-1} \sum_{s \in \mathcal{S}_i} s. \tag{12}$$

To show that the regret has a logarithmic order, it is then sufficient to show that the sum of the second and third terms in (11) has a logarithmic order. These terms can be understood as regret caused by two reasons: engaging bad arms in the exploration epochs, and not playing the expected arms in the exploitation epochs. It thus suffices to show that the regret caused by each of these two reasons has a logarithmic order.

Let $T^O(t)$ denote the time spent on each arm in the exploration epochs by time t. An upper bound on $T^O(t)$ is

$$T^O(t) \le \frac{1}{3}\big[4(3D\ln t + 1) - 1\big]. \tag{13}$$
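A short justification of (13), filling in the geometric-series step implied by the policy description (this step is not spelled out in the original text): the $n_O$-th exploration epoch is started at some time $\tau \le t$ only if $X_1(\tau) = \frac{4^{\,n_O-1}-1}{3} \le D\ln\tau$, and that epoch adds $4^{\,n_O-1}$ plays per arm, so

$$T^O(t) \le \frac{4^{\,n_O}-1}{3} = \frac{4\cdot 4^{\,n_O-1}-1}{3} = \frac{4\big(3\cdot\frac{4^{\,n_O-1}-1}{3}+1\big)-1}{3} \le \frac{1}{3}\big[4(3D\ln\tau + 1)-1\big] \le \frac{1}{3}\big[4(3D\ln t + 1)-1\big],$$

where $n_O$ is the number of exploration epochs started by time t.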
Consequently, the regret caused by engaging bad arms in the exploration epochs by time t is upper bounded by

$$\frac{1}{3}\big[4(3D\ln t + 1) - 1\big] \left( \sum_{i=1}^{M} \mu_{\sigma(i)} - \frac{M}{N} \sum_{i=1}^{N} \mu_{\sigma(i)} \right). \tag{14}$$

The second source of regret in (11) is not playing the expected arms in the exploitation epochs. Let $t_n$ denote the beginning of the n-th exploitation epoch, and let $\Pr[i,j,n]$ denote the probability that arm i has a higher index than arm j at $t_n$, where $\mu_i < \mu_j$ and $\mu_j \ge \mu_{\sigma(M)}$. It can be shown that

$$\Pr[i,j,n] \le \frac{|\mathcal{S}_i| + |\mathcal{S}_j|}{\pi_{\min}} \left( 1 + \frac{\epsilon_{\max}\sqrt{L}}{10\, s_{\min}} \right) t_n^{-1}. \tag{15}$$

Since different subepochs in the exploitation epochs are symmetric, the regret in different subepochs is the same. In the first subepoch, player k aims at arm $\sigma(k)$. In the model where players in conflict share the reward, player k failing to identify arm $\sigma(k)$ in the first subepoch of the n-th exploitation epoch leads to a regret of no more than $\mu_{\sigma(k)}\, 2 \times 4^{n-1}$. In calculating the upper bound on the regret, for player M, we can assume that playing arm $\sigma(M+1)$ to arm $\sigma(N)$ contributes to the total reward. Thus an upper bound on the regret in the n-th exploitation epoch is

$$2M\, 4^{n-1} \left( 1 + \frac{\epsilon_{\max}\sqrt{L}}{10\, s_{\min}} \right) t_n^{-1} \left[ \sum_{i=1}^{M-1} \sum_{j=1, j \ne i}^{N} \mu_{\sigma(i)} \frac{|\mathcal{S}_{\sigma(i)}| + |\mathcal{S}_{\sigma(j)}|}{\pi_{\min}} + \sum_{j=1}^{M-1} \mu_{\sigma(M)} \frac{|\mathcal{S}_{\sigma(M)}| + |\mathcal{S}_{\sigma(j)}|}{\pi_{\min}} + \sum_{j=M+1}^{N} (\mu_{\sigma(M)} - \mu_{\sigma(j)}) \frac{|\mathcal{S}_{\sigma(M)}| + |\mathcal{S}_{\sigma(j)}|}{\pi_{\min}} \right]. \tag{16}$$

By time t, at most $t - N$ time slots have been spent on the exploitation epochs. Thus

$$n_I(t) \le \Big\lceil \log_4\big(\tfrac{3}{2}(t-N)+1\big) \Big\rceil. \tag{17}$$

From the upper bound on the number of exploitation epochs given in (17), and the fact that $t_n \ge \tfrac{2}{3}\, 4^{n-1}$, we have the following upper bound on the regret caused in the exploitation epochs by time t (denoted by $r_{\Phi,I}(t)$):

$$\begin{aligned} r_{\Phi,I}(t) \le\ & 3 \Big\lceil \log_4\big(\tfrac{3}{2}(t-N)+1\big) \Big\rceil \left( 1 + \frac{\epsilon_{\max}\sqrt{L}}{10\, s_{\min}} \right) \sum_{i=1}^{M-1} \sum_{j=1, j \ne i}^{N} \mu_{\sigma(i)} \frac{|\mathcal{S}_{\sigma(i)}| + |\mathcal{S}_{\sigma(j)}|}{\pi_{\min}} \\ & + 3 \Big\lceil \log_4\big(\tfrac{3}{2}(t-N)+1\big) \Big\rceil \left( 1 + \frac{\epsilon_{\max}\sqrt{L}}{10\, s_{\min}} \right) \sum_{j=M+1}^{N} (\mu_{\sigma(M)} - \mu_{\sigma(j)}) \frac{|\mathcal{S}_{\sigma(M)}| + |\mathcal{S}_{\sigma(j)}|}{\pi_{\min}} \\ & + 3 \Big\lceil \log_4\big(\tfrac{3}{2}(t-N)+1\big) \Big\rceil \left( 1 + \frac{\epsilon_{\max}\sqrt{L}}{10\, s_{\min}} \right) \sum_{j=1}^{M-1} \mu_{\sigma(M)} \frac{|\mathcal{S}_{\sigma(M)}| + |\mathcal{S}_{\sigma(j)}|}{\pi_{\min}}. \end{aligned} \tag{18}$$

Combining (11), (12), (14), and (18), we obtain the upper bound on the regret:

$$\begin{aligned} r_\Phi(t) \le\ & \frac{1}{3}\big[4(3D\ln t + 1) - 1\big] \left( \sum_{i=1}^{M} \mu_{\sigma(i)} - \frac{M}{N} \sum_{i=1}^{N} \mu_{\sigma(i)} \right) \\ & + 3 \Big\lceil \log_4\big(\tfrac{3}{2}(t-N)+1\big) \Big\rceil \left( 1 + \frac{\epsilon_{\max}\sqrt{L}}{10\, s_{\min}} \right) \sum_{i=1}^{M-1} \sum_{j=1, j \ne i}^{N} \mu_{\sigma(i)} \frac{|\mathcal{S}_{\sigma(i)}| + |\mathcal{S}_{\sigma(j)}|}{\pi_{\min}} \\ & + 3 \Big\lceil \log_4\big(\tfrac{3}{2}(t-N)+1\big) \Big\rceil \left( 1 + \frac{\epsilon_{\max}\sqrt{L}}{10\, s_{\min}} \right) \sum_{j=M+1}^{N} (\mu_{\sigma(M)} - \mu_{\sigma(j)}) \frac{|\mathcal{S}_{\sigma(M)}| + |\mathcal{S}_{\sigma(j)}|}{\pi_{\min}} \\ & + 3 \Big\lceil \log_4\big(\tfrac{3}{2}(t-N)+1\big) \Big\rceil \left( 1 + \frac{\epsilon_{\max}\sqrt{L}}{10\, s_{\min}} \right) \sum_{j=1}^{M-1} \mu_{\sigma(M)} \frac{|\mathcal{S}_{\sigma(M)}| + |\mathcal{S}_{\sigma(j)}|}{\pi_{\min}} \\ & + \sum_{i=1}^{N} \Big( \min_{s \in \mathcal{S}_i} \pi_s \Big)^{-1} \sum_{s \in \mathcal{S}_i} s. \end{aligned} \tag{19}$$

Next we consider the model where no player in conflict gets any reward. In the first subepoch of the n-th exploitation epoch, each mistake by player k can cause a regret of more than $\mu_{\sigma(k)}\, 2 \times 4^{n-1}$, since a collision also destroys the reward of another player. Assuming that each mistake causes a regret of $\big(\sum_{i=1}^{M} \mu_{\sigma(i)}\big)\, 2 \times 4^{n-1}$ leads to the following upper bound on the regret under this conflict model:

$$\begin{aligned} r_\Phi(t) \le\ & 3 \Big\lceil \log_4\big(\tfrac{3}{2}(t-N)+1\big) \Big\rceil \left( 1 + \frac{\epsilon_{\max}\sqrt{L}}{10\, s_{\min}} \right) \left( \sum_{i=1}^{M} \mu_{\sigma(i)} \right) \sum_{i=1}^{M} \sum_{j=1, j \ne i}^{N} \frac{|\mathcal{S}_{\sigma(i)}| + |\mathcal{S}_{\sigma(j)}|}{\pi_{\min}} \\ & + \frac{1}{3}\big[4(3D\ln t + 1) - 1\big] \left( \sum_{i=1}^{M} \mu_{\sigma(i)} - \frac{M}{N} \sum_{i=1}^{N} \mu_{\sigma(i)} \right) \\ & + \sum_{i=1}^{N} \Big( \min_{s \in \mathcal{S}_i} \pi_s \Big)^{-1} \sum_{s \in \mathcal{S}_i} s. \end{aligned} \tag{20}$$

APPENDIX B. PROOF OF THEOREM 2

The choice of L(t) and D(t) implies that $D(t) \to \infty$ as $t \to \infty$.
The regret has three parts: the transient effect of the arms, the regret caused by playing bad arms in the exploration epochs, and the regret caused by mistakes in the exploitation epochs. We show that each part of the regret is of a lower order than $f(t)\log t$.

The transient effect of the arms is the same as in Theorem 1. Thus it is upper bounded by a constant independent of the time t and is of a lower order than $f(t)\log t$.

The regret caused by playing bad arms in the exploration epochs is bounded by

$$\frac{1}{3}\big[4(3D(t)\ln t + 1) - 1\big] \left( \sum_{i=1}^{M} \mu_{\sigma(i)} - \frac{M}{N} \sum_{i=1}^{N} \mu_{\sigma(i)} \right). \tag{21}$$

Since $f(t)/D(t) \to \infty$ as $t \to \infty$, the part of the regret in (21) is of a lower order than $f(t)\log t$.

For the regret caused by playing bad arms in the exploitation epochs, we show below that the time spent on a bad arm can be bounded by a constant independent of t. Since $D(t)/L(t) \to \infty$ as $t \to \infty$, there exists a time $t_1$ such that for all $t \ge t_1$, $D(t) \ge 4L(t)/\big(\min_{j \le M}(\mu_{\sigma(j)} - \mu_{\sigma(j+1)})\big)^2$. There also exists a time $t_2$ such that for all $t \ge t_2$, $L(t) \ge \frac{1}{\epsilon_{\min}}\big(\frac{720\, s_{\max}^2 |\mathcal{S}|_{\max}^2}{3 - 2\sqrt{2}} + 10\, s_{\max}^2\big)$. The time spent on playing bad arms before $t_3 = \max(t_1, t_2)$ is at most $t_3$, and the resulting regret is at most $\big(\sum_{j=1}^{M} \mu_{\sigma(j)}\big) t_3$. The regret caused by mistakes after $t_3$ is upper bounded by $6\big(1 + \frac{\epsilon_{\max}\sqrt{L}}{10\, s_{\min}}\big)\big(\sum_{i=1}^{M} \mu_{\sigma(i)}\big) \sum_{i=1}^{M} \sum_{j=1, j \ne i}^{N} \frac{|\mathcal{S}_{\sigma(i)}| + |\mathcal{S}_{\sigma(j)}|}{\pi_{\min}}$. Thus the regret caused by mistakes in the exploitation epochs is of a lower order than $f(t)\log t$.

Since each part of the regret is of a lower order than $f(t)\log t$, the total regret is also of a lower order than $f(t)\log t$.

REFERENCES

[1] T. Lai and H. Robbins, "Asymptotically Efficient Adaptive Allocation Rules," Advances in Applied Mathematics, vol. 6, no. 1, pp. 4-22, 1985.
[2] V. Anantharam, P. Varaiya, and J. Walrand, "Asymptotically Efficient Allocation Rules for the Multiarmed Bandit Problem with Multiple Plays - Part I: I.I.D. Rewards," IEEE Transactions on Automatic Control, vol. AC-32, no. 11, pp. 968-976, Nov. 1987.
[3] V. Anantharam, P. Varaiya, and J. Walrand, "Asymptotically Efficient Allocation Rules for the Multiarmed Bandit Problem with Multiple Plays - Part II: Markovian Rewards," IEEE Transactions on Automatic Control, vol. AC-32, no. 11, pp. 977-982, Nov. 1987.
[4] R. Agrawal, "Sample Mean Based Index Policies with O(log n) Regret for the Multi-armed Bandit Problem," Advances in Applied Probability, vol. 27, pp. 1054-1078, 1995.
[5] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time Analysis of the Multiarmed Bandit Problem," Machine Learning, vol. 47, pp. 235-256, 2002.
[6] C. Tekin and M. Liu, "Online Algorithms for the Multi-Armed Bandit Problem with Markovian Rewards," Proc. of the Allerton Conference on Communication, Control, and Computing, Sep. 2010.
[7] K. Liu and Q. Zhao, "Distributed Learning in Multi-Armed Bandit with Multiple Players," IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667-5681, Nov. 2010.
[8] A. Anandkumar, N. Michael, A. K. Tang, and A. Swami, "Distributed Algorithms for Learning and Cognitive Medium Access with Logarithmic Regret," submitted to IEEE JSAC on Advances in Cognitive Radio Networking and Communications.
[9] H. Liu, K. Liu, and Q. Zhao, "Logarithmic Weak Regret of Non-Bayesian Restless Multi-Armed Bandit," Proc. of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2011.
[10] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "The Nonstochastic Multiarmed Bandit Problem," SIAM Journal on Computing, vol. 32, pp. 48-77, 2002.
[11] C. Papadimitriou and J. Tsitsiklis, "The Complexity of Optimal Queueing Network Control," Mathematics of Operations Research, vol. 24, no. 2, pp. 293-305, May 1999.
[12] C. Tekin and M. Liu, "Online Learning in Opportunistic Spectrum Access: A Restless Bandit Approach," arXiv preprint, http://arxiv.org/abs/1010.0056, Oct. 2010.
[13] W. Dai, Y. Gai, B. Krishnamachari, and Q. Zhao, "The Non-Bayesian Restless Multi-armed Bandit: A Case of Near-Logarithmic Regret," Proc. of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2011.
[14] D. Gillman, "A Chernoff Bound for Random Walks on Expander Graphs," SIAM Journal on Computing, vol. 27, no. 4, 1998.
