Regenerative Particle Thompson Sampling


Authors: Zeyu Zhou, Bruce Hajek, Nakjung Choi, Anwar Walid

Zeyu Zhou†, Bruce Hajek‡, Nakjung Choi§, Anwar Walid¶

January 24, 2024

∗ Parts of this work have been published as two papers in Proceedings of the 57th Annual Conference on Information Sciences and Systems (CISS), Baltimore, MD, USA, March 2023, titled "Particle Thompson Sampling with Static Particles" and "Improving Particle Thompson Sampling with Regenerative Particles."
† Department of Radiology, Mayo Clinic, Rochester, Minnesota, USA. Email: zeyuzhou91@gmail.com
‡ Department of Electrical and Computer Engineering and the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign. Email: b-hajek@illinois.edu
§ Network System and Security Research, Nokia Bell Labs, Murray Hill, New Jersey, USA. Email: nakjung.choi@nokia-bell-labs.com
¶ Amazon, New York, USA. Email: acmanwar@acm.org

Abstract

This paper proposes regenerative particle Thompson sampling (RPTS), a flexible variation of Thompson sampling. Thompson sampling itself is a Bayesian heuristic for solving stochastic bandit problems, but it is hard to implement in practice due to the intractability of maintaining a continuous posterior distribution. Particle Thompson sampling (PTS) is an approximation of Thompson sampling obtained by replacing the continuous distribution with a discrete distribution supported at a set of weighted static particles. We observe that in PTS, the weights of all but a few fit particles converge to zero. RPTS is based on the heuristic: delete the decaying unfit particles and regenerate new particles in the vicinity of the fit surviving particles. Empirical evidence shows uniform improvement from PTS to RPTS, as well as the flexibility and efficacy of RPTS across a set of representative bandit problems, including an application to 5G network slicing.

1 Introduction

A bandit problem is a sequential decision problem that elegantly captures the fundamental trade-off between the exploitation of actions with high rewards in the past and the exploration of actions that may produce higher rewards in the future. Thompson sampling (TS) is a Bayesian heuristic for solving bandit problems under the assumption that the rewards are generated according to a given distribution with a fixed unknown parameter. TS maintains a posterior distribution on the parameter and selects an action according to the posterior probability that the action is optimal. The biggest advantage of TS is its ability to automatically handle setups with a complex information structure, where knowing the performance of one action may inform properties of other actions. TS also has strong empirical performance [5].

Theoretical performance guarantees of TS have been established for some bandit problems [12, 1, 2, 8]. However, efficiently updating, storing, and sampling from the posterior distribution in TS is only feasible in some special cases (e.g., conjugate distributions). For general bandit problems, one has to resort to various approximations, most of which are complicated and rely on restrictive assumptions.

Particle Thompson sampling (PTS) is an approximation of TS based on the following idea: replace the continuous posterior distribution by a discrete distribution supported at a set of weighted static particles. Updating the posterior distribution then becomes updating the particles' weights by Bayes' formula, followed by normalization. PTS is flexible: it applies to very general bandit setups. It is also very easy to implement. On the surface, however, this crude approximation might seem to degrade the performance of TS significantly, because the set of particles in PTS is finite and static and may not contain the actual parameter. Intuitively, the performance of PTS can be improved by using more particles.
However, that comes with an increasing computational cost.

The main contributions of this paper:

• We provide an analysis of PTS for general bandit problems, without assuming that the set of particles contains the hidden system parameter. The main result is a drift-based sample-path necessary condition on the surviving particles, illuminating the phenomenon that fit particles survive and unfit particles decay.

• We propose an algorithm, regenerative particle Thompson sampling (RPTS), to improve PTS. The heuristic is: periodically replace the decaying unfit particles in PTS with newly generated particles in the vicinity of the survivors. Empirical results show that RPTS algorithms outperform PTS uniformly on a set of representative bandit problems. RPTS is very flexible and easy to implement.

• We show an application of PTS and RPTS to network slicing, a 5G communication network problem, and demonstrate their efficacy through simulation.

The remainder of this paper is organized as follows. Section 2 lists some related work. Section 3 introduces the general setup and notation of stochastic bandit problems and PTS. Section 4 provides a sample-path analysis of PTS. Section 5 introduces RPTS and presents some simulation results. Section 6 shows an application of PTS and RPTS to network slicing. Section 7 concludes the paper and mentions some potential future work.

2 Related Work

See [4] and [15] for a survey and recent developments in bandit problems.

Upper-confidence-bound (UCB) algorithms [3, 7] have certain theoretical guarantees for some simple bandit models. KL-UCB [7] even meets a lower bound on regret established in [14]. Empirically, UCB algorithms are not very competitive in the non-asymptotic regime due to their inefficient exploration and their inability to take advantage of the problem structure in complex bandit problems.
Reward-biased maximum likelihood estimation (RBMLE) [16, 11] reduces to an indexed policy like UCB and performs well compared to state-of-the-art algorithms. But for many problems in which the actions give information about the parameter in complicated ways, there is no efficient implementation of RBMLE.

Thompson sampling (TS) [20] has strong empirical performance [5] and can handle rather general and complex stochastic bandit problems [8, 19]. Note that there are certain problems for which TS does not work well [19], and it is still an active area of research to identify such problems and design algorithms to solve them. TS can be implemented efficiently in setups where a conjugate prior exists for the reward distribution. In cases where a conjugate prior is not available, one needs to resort to approximations of TS, such as Gibbs sampling, Laplace approximation, Langevin Monte Carlo, and bootstrapping [19]. These approximations are either complicated or rely on restrictive assumptions.

[17] proposes ensemble sampling, which is related to the idea of PTS because it aims to maintain a set of particles (called "models" in the paper) independently and identically sampled from the posterior distribution in order to approximate TS. Particles in ensemble sampling are unweighted. A major restriction of the algorithm is that it requires Gaussian noise in the observation. Also, except in special setups, updating the particles in ensemble sampling requires solving an optimization problem that accounts for all the data from the start to the current time.

To the best of our knowledge, the term particle Thompson sampling first appeared in [13], where the authors apply PTS as an efficient approximation of TS to solve a matrix-factorization recommendation problem. Note that in their work, the particles are not static, but are incrementally re-sampled at each step through an MCMC kernel.
The re-sampling method relies heavily on the specific problem structure, and it is not clear how it can be generalized to other bandit problems.

[8] analyzes TS for general stochastic bandit problems. The main result is that with high probability the number of plays of non-optimal actions is upper bounded by B + C log T, where B and C are problem-dependent constants and T is the time horizon. For technical tractability, the paper assumes the prior distribution of the parameter is supported over a finite (possibly huge) set instead of a continuum. Therefore, TS in that paper is tantamount to PTS, with the finite prior support set equivalent to a set of particles. The result of the paper relies on a realizability assumption (called "grain of truth" in the paper): the finite support set of the prior includes the true system parameter. However, for PTS when the true parameter lies in a continuum, the realizability assumption is unreasonable. In fact, without the realizability assumption, PTS may be inconsistent, i.e., the running average regret may not converge to zero. In this paper, PTS is analyzed without the realizability assumption. The analysis is inspired by [8], especially in how KL divergence comes into play in measuring the fitness of particles.

3 Setup and Preliminaries

3.1 Stochastic bandit problem

A stochastic bandit problem contains the following elements: an action set A, an observation space Y, a parameter space Θ, a known observation model P_θ(·|a), and a reward function R : Y → R. Consider a player who acts at steps t = 1, 2, .... At step t, the player takes an action A_t ∈ A, then observes Y_t ∈ Y according to the observation model P_θ*(·|A_t) for some fixed and unknown θ* ∈ Θ, independently of past observations. The observation Y_t then incurs a reward R_t = R(Y_t). The goal of the player is to maximize the cumulative reward.
For notational convenience, we denote an instance of the stochastic bandit problem by StochasticBandit(A, Y, Θ, P_θ(·|a), R, θ*).¹ Let H_t = (A_1, ..., A_t, Y_1, ..., Y_t) denote the history of actions and observations up to time t. An algorithm is a (possibly randomized) mapping from H_{t−1} to A, for each step t.

The performance of an algorithm is measured by regret. Let a* ≜ argmax_{a∈A} E_θ*[R(Y)|a] denote the optimal action that maximizes the mean reward, assuming complete knowledge of θ*. Let R* ≜ E_θ*[R(Y)|a*] denote the maximum expected reward. The regret of an algorithm that selects A_t at time t is reg_t ≜ R* − E_θ*[R(Y)|A_t], the difference between the expected reward of an optimal action and that of the action selected by the algorithm. The cumulative regret and running average regret up to time t are Σ_{τ=1}^t reg_τ and (1/t) Σ_{τ=1}^t reg_τ, respectively.

¹ The problem can be made more general by adding contexts. Let C be a context set. The observation model becomes P_θ(·|a, c). At each step of the game, the player receives an arbitrary context c_t ∈ C before taking action A_t. The observation Y_t follows distribution P_θ*(·|A_t, c_t). This is known as the contextual stochastic bandit model, for which PTS still works. The reason we do not use this more general model here is that we want to emphasize the key word stochastic, not contextual.

Example 1 (Bernoulli bandit). Let K be a positive integer. A Bernoulli bandit problem depicts a player who picks an arm indexed by a ∈ {1, ..., K} at each step, which generates a reward of either 0 or 1 according to a Bernoulli distribution parameterized by θ*_a ∈ [0, 1], fixed and unknown. This is a stochastic bandit problem with A = {1, 2, ..., K}, Y = {0, 1}, Θ = [0, 1]^K, P_θ(·|a) ∼ Bernoulli(θ_a), and R(y) = y.
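To make these definitions concrete, here is a minimal sketch (ours, not from the paper; the class name and seed are illustrative) of the Bernoulli bandit of Example 1 together with its per-step regret reg_t = R* − E_θ*[R(Y)|A_t]:

```python
import numpy as np

rng = np.random.default_rng(0)

class BernoulliBandit:
    """StochasticBandit with A = {0,...,K-1}, Y = {0,1}, Theta = [0,1]^K (Example 1)."""

    def __init__(self, theta_star):
        self.theta_star = np.asarray(theta_star, dtype=float)

    def pull(self, a):
        # Y_t ~ Bernoulli(theta*_a); the reward is R(y) = y.
        return int(rng.binomial(1, self.theta_star[a]))

    def regret(self, a):
        # reg_t = R* - E_{theta*}[R(Y) | A_t = a] = max_a' theta*_a' - theta*_a
        return self.theta_star.max() - self.theta_star[a]

bandit = BernoulliBandit([0.51, 0.52, 0.60])
print(bandit.regret(0))  # 0.60 - 0.51 = 0.09 (up to float rounding)
```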
This is a bandit problem with separable actions: the observation distribution for each action is parametrized by a corresponding coordinate of θ*.

Example 2 (Max-Bernoulli bandit). Let K and M be positive integers with K ≥ 2 and M < K. A max-Bernoulli bandit problem is similar to the Bernoulli bandit, with arms indexed by {1, ..., K} and each arm associated with a Bernoulli distribution with a fixed and unknown parameter θ*_a. The difference is that, in a max-Bernoulli bandit problem, the player picks M different arms at each step instead of one. The reward is the maximum of the M binary values generated by the M selected arms. This problem can be formulated as a stochastic bandit problem with Θ = [0, 1]^K, A = {S ⊂ [K] : |S| = M} (the set of M-subsets of [K]), and Y = {0, 1}. Given a = (a_1, ..., a_M) ∈ A, observe Y = max_{m∈[M]} X_m, where X_m ∼ Bernoulli(θ*_{a_m}). That is, the observation model is P_θ(·|a) ∼ Bernoulli(1 − Π_{m∈[M]} (1 − θ_{a_m})). The reward function is R(y) = y. Actions in the max-Bernoulli bandit problem are not separable. The number of actions, (K choose M), can be much larger than K, the dimension of the parameter space. The problem is considered in [8].

Example 3 (Linear bandit). A linear bandit problem has two parameters: a positive integer K and σ²_W > 0. It is a stochastic bandit problem with Θ = R^K, A = S^{K−1} = {x ∈ R^K : ∥x∥_2 = 1}, the surface of the unit sphere in R^K, Y = R, and R(y) = y. Given an action a ∈ A, we observe Y = ⟨θ*, a⟩ + W, where θ* ∈ Θ is fixed and unknown and W ∼ N(0, σ²_W) is Gaussian noise. That is, the observation model is P_θ(·|a) ∼ N(⟨θ, a⟩, σ²_W). The problem is named "linear" because the expected reward in each round is an unknown linear function of the action taken. This is an example of a bandit problem in which the dimension of the parameter space is finite, but the number of actions is infinite.
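For instance, the success probability 1 − Π_{m∈[M]}(1 − θ_{a_m}) of the max-Bernoulli observation model in Example 2 can be evaluated directly; this small helper (ours, with an arbitrary name) is only a sketch of that model:

```python
import numpy as np

def max_bernoulli_prob(theta, arms):
    """P(Y = 1 | a) = 1 - prod_{m in [M]} (1 - theta_{a_m}) (Example 2)."""
    theta = np.asarray(theta, dtype=float)
    return 1.0 - np.prod(1.0 - theta[list(arms)])

# Picking M = 2 arms with success probability 0.5 each: 1 - 0.5 * 0.5.
print(max_bernoulli_prob([0.5, 0.5, 0.2], (0, 1)))  # 0.75
```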
3.2 Particle Thompson sampling (PTS)

Thompson sampling (TS) is the algorithm for solving stochastic bandit problems shown in Algorithm 1.

Algorithm 1 Thompson sampling (TS)
Inputs: A, Y, Θ, P_θ(·|a), R, θ*
Initialization: prior π_0 over Θ
1: for t = 1, 2, ... do
2:   Sample θ_t ∼ π_{t−1}
3:   Play A_t ← argmax_{a∈A} E_{θ_t}[R(Y) | A_t = a]
4:   Observe Y_t ∼ P_θ*(·|A_t)
5:   Update π_t: π_t(θ) = P_θ(Y_t|A_t) π_{t−1}(θ) / ∫_Θ P_θ(Y_t|A_t) π_{t−1}(θ) dθ, for all θ ∈ Θ
6: end for

TS is often difficult to implement in practice because π_t may not have a closed form. Even if a closed form can be obtained, it is not clear how it can be efficiently stored and sampled from. The idea of particle Thompson sampling (PTS) (Algorithm 2) is to approximate π_t by a discrete distribution w_t = (w_{t,1}, ..., w_{t,N}) supported on a finite set of fixed particles P_N = {θ^(1), ..., θ^(N)} ⊂ Θ, where N is the number of particles.

Algorithm 2 Particle Thompson sampling (PTS)
Inputs: A, Y, Θ, P_θ(·|a), R, θ*, P_N
Initialization: w_0 ← (1/N, ..., 1/N)
1: for t = 1, 2, ... do
2:   Generate θ_t from P_N according to weights w_{t−1}
3:   Play A_t ← argmax_{a∈A} E_{θ_t}[R(Y) | A_t = a]
4:   Observe Y_t ∼ P_θ*(·|A_t)
5:   for i ∈ {1, 2, ..., N} do
6:     w̃_{t,i} = w_{t−1,i} P_{θ^(i)}(Y_t|A_t)
7:   end for
8:   w_t ← normalize w̃_t
9: end for

In practice, one can use a pre-determined set of points P_N in Θ, or randomly generate some points from Θ. Here w̃_{t,i} is the unnormalized weight of particle i at time t. Step 6 can alternatively be implemented as w̃_{t,i} = w̃_{t−1,i} P_{θ^(i)}(Y_t|A_t), with the initialization w̃_0 = w_0, because it yields the same normalized vectors w_t. PTS is very flexible because it does not require any structure on the observation model P_θ(·|a), as long as the model is given.
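As an illustration, Algorithm 2 specializes to the Bernoulli bandit as below (our sketch; the seed and problem sizes are arbitrary). The likelihood in step 6 is P_{θ^(i)}(Y_t|A_t) = θ^(i)_{A_t} if Y_t = 1 and 1 − θ^(i)_{A_t} otherwise:

```python
import numpy as np

rng = np.random.default_rng(1)

def pts_bernoulli(theta_star, particles, T):
    """Particle Thompson sampling (Algorithm 2) for a Bernoulli bandit.

    particles: (N, K) array; row i is the static particle theta^(i).
    Returns the final normalized weight vector w_T.
    """
    theta_star = np.asarray(theta_star, dtype=float)
    N = len(particles)
    w = np.full(N, 1.0 / N)                  # w_0 <- (1/N, ..., 1/N)
    for _ in range(T):
        i = rng.choice(N, p=w)               # step 2: choose a particle ~ w_{t-1}
        a = int(np.argmax(particles[i]))     # step 3: A_t = argmax_a E_{theta_t}[R|a]
        y = rng.binomial(1, theta_star[a])   # step 4: Y_t ~ P_{theta*}(.|A_t)
        # steps 5-8: Bayes reweighting, then normalization
        lik = particles[:, a] if y == 1 else 1.0 - particles[:, a]
        w = w * lik
        w = w / w.sum()
    return w

theta_star = [0.3, 0.7]
P = rng.uniform(size=(50, 2))                # random static particles in [0,1]^2
w = pts_bernoulli(theta_star, P, T=2000)
# The weights typically concentrate on a few particles that fit the played arm well.
print(np.sort(w)[-3:])
```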
Steps 5-7 in Algorithm 2 are easy to implement: they require only multiplication and normalization.

For notational convenience, we denote an instance of particle Thompson sampling with particle set P_N by PTS(P_N).

4 A Sample-Path Analysis of PTS

We provide an analysis of PTS in this section. The main result is a sample-path necessary condition for surviving particles based on drift information.

Notation: Let I_t ∈ [N] be the index of the particle chosen at time t. Thus, I_t ∼ w_{t−1}. Let A_t ∈ A be the arm chosen at time t. Let A : Θ → A be the function mapping a particle to the corresponding optimal arm, defined by A(θ) = argmax_{a∈A} E_θ[R(Y)|a]. If there are multiple maximizers, let A(θ) be one of them selected deterministically. With a slight abuse of notation, we sometimes abbreviate A(θ^(i)) by A(i). So A_t = A(I_t). For any x ∈ R^N, define supp(x) ≜ {i ∈ [N] : x_i ≠ 0} and argmax x ≜ {i ∈ [N] : x_i = max_{j∈[N]} x_j}.

Recall from Algorithm 2 that the unnormalized weights of the particles evolve by the equation w̃_{t,i} = w̃_{t−1,i} P_{θ^(i)}(Y_t|A_t), where Y_t ∼ P_θ*(·|A_t).

Definition 1 (Drift matrix). For a given StochasticBandit(A, Y, Θ, P_θ(·|a), R, θ*) problem and a set of particles P_N ⊂ Θ, the drift matrix D is an N × N matrix, where

D_ij ≜ E[ln w̃_{t,j} − ln w̃_{t−1,j} | I_t = i] = E[ln P_{θ^(j)}(Y_t|A_t) | I_t = i] = E_{Y∼P_θ*(·|A(i))}[ln P_{θ^(j)}(Y|A(i))],

for i, j ∈ [N]. In words, D_ij is the (exponential) drift of particle j when particle i is chosen. The following properties of D are readily verified: 1) the entries of D are non-positive; 2) D is independent of time, fundamentally because {w̃_t} is a time-homogeneous Markov process; 3) row i_1 and row i_2 of D are the same if A(i_1) = A(i_2). Therefore D can have at most |A| distinct rows.
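For the Bernoulli bandit, the expectation in Definition 1 has a closed form, D_ij = θ*_a ln θ^(j)_a + (1 − θ*_a) ln(1 − θ^(j)_a) with a = A(i). The following sketch (ours) uses this form and checks the first and third properties numerically:

```python
import numpy as np

def drift_matrix(theta_star, particles):
    """Drift matrix D of Definition 1 for a Bernoulli bandit.

    D[i, j] = E_{Y ~ Bernoulli(theta*_a)}[ln P_{theta^(j)}(Y | a)], with a = A(i),
    where A(i) is the optimal arm of particle i (the arm with the largest mean).
    """
    theta_star = np.asarray(theta_star, dtype=float)
    particles = np.asarray(particles, dtype=float)
    N = len(particles)
    A = particles.argmax(axis=1)  # A(i) for each particle
    D = np.empty((N, N))
    for i in range(N):
        p = theta_star[A[i]]
        D[i] = p * np.log(particles[:, A[i]]) + (1 - p) * np.log(1 - particles[:, A[i]])
    return D

theta_star = [0.2, 0.8]
P = np.array([[0.3, 0.7], [0.6, 0.4], [0.25, 0.75]])
D = drift_matrix(theta_star, P)
print(np.all(D <= 0))           # property 1: entries are non-positive
print(np.allclose(D[0], D[2]))  # property 3: A(0) = A(2), so rows 0 and 2 agree
```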
In what follows we consider drift matrices D and D′ to be equivalent if each row of D′ is equal to the corresponding row of D up to an additive constant. Therefore, D remains in the same equivalence class if, for each i, the constant −E[ln P_θ*(Y|A(i))] is added to row i. A representative choice of D is thus the following:

D_ij ≡ −E_{Y∼P_θ*(·|A(i))}[ln (P_θ*(Y|A(i)) / P_{θ^(j)}(Y|A(i)))] = −KL(P_θ*(·|A(i)) ∥ P_{θ^(j)}(·|A(i))).

Here D_ij is the negative of the KL divergence between the distributions P_θ*(·|A(i)) and P_{θ^(j)}(·|A(i)). In this sense, the ith row of D gives the relative fitness of the particles for action A(i), and the jth column of D gives the fitness of particle j for action A(i) as i varies.

We need the following two assumptions before the main result.

Assumption 1 (Sample path assumptions). Consider the problem StochasticBandit(A, Y, Θ, P_θ(·|a), R, θ*) and suppose PTS(P_N) is run for a set of N particles P_N ⊂ Θ. Assume that the sample path satisfies the following: there exists a non-empty set S ⊂ [N] that satisfies²

(a) (Non-zero decay-rate gap) For any i ∉ S and j ∈ S, limsup_{t→∞} (1/t)(ln w̃_{t,i} − ln w̃_{t,j}) < 0, and

(b) (Existence of survivor limiting distribution) G_t = (ln w̃_{t,i} − ln w̃_{t,j} : i, j ∈ S) ∈ R^{|S|×|S|} has a limiting empirical distribution µ_G. In other words, for any bounded continuous function h on R^{|S|×|S|}, (1/t) Σ_{τ=0}^t h(G_τ) → E_{µ_G}[h].

The set S can be thought of as the set of surviving particles. Assumption 1(a) says the (unnormalized) weight decay rate of a non-surviving particle is strictly less than that of a surviving particle. Consequently, the weight of a non-surviving particle converges to 0 exponentially fast. Assumption 1(b) says that the process G_t has some ergodicity property.
It is similar to saying that G_t is Harris recurrent, except that G_t is not Markov, because it excludes information about particles not in S. Note that knowing any row of G_t determines all the other entries of G_t.

Assumption 2 (Boundedness of observation model). Assume that the observation model P_θ(·|a) satisfies: there exist constants b_0, B_0 > 0 such that for any θ, θ′ ∈ Θ,

b_0 ≤ P_θ(y|a) / P_{θ′}(y|a) ≤ B_0 for any y ∈ Y, a ∈ A.

The assumption can be easily verified for problems in which |Y| < ∞ and |A| < ∞, for example, the Bernoulli bandit and max-Bernoulli bandit problems.

Define a probability vector π over [N] by π_i = lim_{t→∞} (1/(t+1)) Σ_{τ=0}^t w_{τ,i}. That is, π_i is the limiting running average weight of particle i, if it exists. The following proposition shows the relationship between π and the drift matrix D and provides a necessary condition for surviving particles in a sample path.

Proposition 1 (Sample-path necessary surviving condition). Let StochasticBandit(A, Y, Θ, P_θ(·|a), R, θ*) be a given problem and P_N ⊂ Θ a given set of N particles. Suppose P_θ(·|a) satisfies Assumption 2. Consider running PTS(P_N) for the problem. Let D be the drift matrix. For a sample path of the algorithm under Assumption 1, π is well defined and satisfies

argmax(πD) = supp(π) = S,   (1)

where S is the set in Assumption 1.

² There are two additional technical assumptions on the sample path, which are put in Appendix Section A to save space.

The proposition says that, if a set of particles S were to survive in a sample path, they must have a limiting average selection distribution π that satisfies (1). The jth coordinate of πD, (πD)_j, is equal to ⟨π, D_{·j}⟩, where D_{·j} = (D_{1j}, ..., D_{Nj}) is the jth column of D, the drifts of particle j when particles 1, 2, ..., N are chosen, which we recall can be interpreted as the fitness of particle j.
Thus, (πD)_j is the average fitness of particle j, assuming distribution π is used to select a random action A(i). Therefore, (1) means that, with respect to distribution π, each surviving particle has the same average fitness, and the average fitness of each non-surviving particle is strictly smaller. This aligns with our observation in experiments: fit particles survive, unfit particles decay.

Note the following caveat: Proposition 1 provides a sample-path condition for surviving particles. The actual set of survivors may be random. Thus, there may be more than one π that satisfies (1).

Applying Proposition 1 to the Bernoulli bandit with randomly generated particles in PTS yields the following corollary, which says that not many particles can survive.

Corollary 2. Let P_N be a set of N points generated independently and uniformly at random from [0, 1]^K. Consider running PTS(P_N) for a given Bernoulli bandit problem with K arms and with θ* ∈ [0, 1]^K. Suppose that any sample path satisfies Assumption 1. Then with probability one, at most K particles can survive, i.e., |supp(π)| ≤ K.

We suspect that something similar can be said about the fewness of survivors for other bandit problems in which the parameter space has a finite dimension K (the number of actions may be much larger), but we do not have a proof.

Proofs of Proposition 1 and Corollary 2 can be found in Appendix Section A. For more evidence and intuition on the assumptions and conclusions of Proposition 1 and Corollary 2, see Appendix Section B, where a thorough analysis of PTS for the two-arm Bernoulli bandit is provided.

5 RPTS: Regenerative Particle Thompson Sampling

This section proposes regenerative particle Thompson sampling (RPTS) and demonstrates its performance by simulation. Recall that, in PTS, fit particles survive, unfit particles decay, and most particles eventually decay.
When the weights of the decaying particles become so small that the particles are essentially inactive, continuing to use these particles is a waste of computational resources. A natural thing to do is to delete those decaying particles and use the saved computational resources to improve the algorithm. RPTS (Algorithm 3) is based on the following heuristic inspired by biological evolution: delete unfit decaying particles, regenerate new particles in the vicinity of the fit surviving particles.

Algorithm 3 Regenerative particle Thompson sampling (RPTS)
Inputs: A, Y, Θ ⊂ R^K, P_θ(·|a), R, θ*, P_N
Parameters: N, f_del ∈ (0, 1), w_inact ∈ (0, 1), w_new ∈ (0, 1)
Initialization: w_0 ← (1/N, ..., 1/N)
1: for t = 1, 2, ... do
2:   Generate θ_t from P_N according to weights w_{t−1}
3:   Play A_t ← argmax_{a∈A} E_{θ_t}[R(Y) | A_t = a]
4:   Observe Y_t ∼ P_θ*(·|A_t)
5:   for i ∈ {1, 2, ..., N} do
6:     w̃_{t,i} = w_{t−1,i} P_{θ^(i)}(Y_t|A_t)
7:   end for
8:   w_t ← normalize w̃_t
9:   if CONDITION(w_t, N, f_del, w_inact) = True then
10:    I_del ← the indices of the lowest-weighted ⌈f_del N⌉ particles in P_N
11:    {θ^(i) : i ∈ I_del} ← RPTS-Exploration
12:    w_{t,i} ← w_new / ⌈f_del N⌉ for each i ∈ I_del
13:    normalize w_t
14:  end if
15: end for

CONDITION(w_t, N, f_del, w_inact):
  w′_t ← sort w_t in ascending order
  If Σ_{i=1}^{⌈f_del N⌉} w′_{t,i} ≤ w_inact: Return True; Else: Return False

RPTS-Exploration:
  µ_t ← E_{θ∼w_t}[θ], Σ_t ← E_{θ∼w_t}[(θ − µ_t)(θ − µ_t)^T]
  Generate ⌈f_del N⌉ particles i.i.d. ∼ N(µ_t, (1/K) tr(Σ_t) I_K), project to Θ

Steps 1-8 of RPTS are the same as in PTS (Algorithm 2). The difference is that RPTS adds steps 9-14. Three new hyper-parameters are introduced: f_del, the fraction of particles to delete; w_inact, the weight threshold for deciding which particles are inactive; and w_new, the new (aggregate) weight of the regenerated particles. The CONDITION in step 9 checks whether an f_del fraction of the particles has become inactive. If so, we find the lowest-weighted f_del fraction of the particles (step 10), delete them, and regenerate the same number of particles through RPTS-Exploration (step 11). In RPTS-Exploration, we first calculate the empirical mean µ_t and covariance matrix Σ_t of all the particles based on their current weights w_t,³ i.e., µ_t = Σ_{i=1}^N w_{t,i} θ^(i) and Σ_t = Σ_{i=1}^N w_{t,i} (θ^(i) − µ_t)(θ^(i) − µ_t)^T, then generate the new particles according to a multivariate Gaussian distribution. I_K is the K × K identity matrix. We use (1/K) tr(Σ_t) I_K as the covariance matrix instead of Σ_t, in case Σ_t is singular or close to singular. This particle regeneration strategy requires that the parameter space Θ is a subset of R^K. If a newly generated particle falls outside of Θ, we project it to Θ in any natural way.⁴ Step 12 means that the newly generated ⌈f_del N⌉ particles are assigned a total weight of w_new, split equally among them. Typical values of the three hyperparameters are f_del = 0.8, w_inact = 0.001, and w_new = 0.01. Section C in the appendix elaborates on the choice of these values.

³ According to the RPTS heuristic, one might expect to calculate µ_t and Σ_t based on the weights of the surviving particles only, instead of all the particles. But because the surviving particles have a total weight of at least 1 − w_inact, close to 1, the difference is negligible.

We run simulations⁵ to compare RPTS with PTS and TS. Selected results are shown in Figure 1. For the Bernoulli bandit problem, TS is implemented as a benchmark. For the max-Bernoulli bandit, it is not clear how TS can be implemented. Each curve is obtained by averaging over 200 independent simulations.
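The regeneration logic above (steps 9-14 of Algorithm 3, with CONDITION and RPTS-Exploration) can be sketched as follows. This is our illustration, not the authors' code: hyperparameter defaults follow the typical values stated above, and clipping is one natural choice of projection for Θ = [0, 1]^K.

```python
import numpy as np

rng = np.random.default_rng(2)

def rpts_regenerate(particles, w, f_del=0.8, w_inact=0.001, w_new=0.01):
    """One pass of steps 9-14 of Algorithm 3 for Theta = [0,1]^K.

    If the lowest-weighted ceil(f_del * N) particles carry total weight at most
    w_inact (CONDITION), replace them by draws from N(mu_t, tr(Sigma_t)/K * I_K)
    and assign them total weight w_new; then renormalize.
    """
    particles = np.asarray(particles, dtype=float).copy()
    w = np.asarray(w, dtype=float).copy()
    N, K = particles.shape
    n_del = int(np.ceil(f_del * N))
    order = np.argsort(w)                        # indices by ascending weight
    if w[order[:n_del]].sum() > w_inact:
        return particles, w                      # CONDITION is False: no regeneration
    idx_del = order[:n_del]
    # RPTS-Exploration: weighted empirical mean and covariance of all particles.
    mu = w @ particles
    centered = particles - mu
    sigma = (w[:, None] * centered).T @ centered
    std = np.sqrt(np.trace(sigma) / K)           # isotropic (1/K) tr(Sigma_t) I_K
    new = rng.normal(mu, std, size=(n_del, K))
    particles[idx_del] = np.clip(new, 0.0, 1.0)  # project onto Theta
    w[idx_del] = w_new / n_del                   # step 12: equal share of w_new
    return particles, w / w.sum()                # step 13: renormalize

# Example: two fit particles carry almost all the weight, eight have decayed.
P = rng.uniform(size=(10, 2))
w = np.full(10, 1e-5)
w[0] = w[1] = (1.0 - 8e-5) / 2
P_new, w_new_vec = rpts_regenerate(P, w)
print(np.isclose(w_new_vec.sum(), 1.0))  # True
```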
In each simulation of PTS or RPTS, the initial particles are generated uniformly at random from [0, 1]^K.

⁴ Alternatively, we can reject it and regenerate until it is in Θ.
⁵ Code is available if the paper is accepted.

Figure 1: Simulations. (a) Bernoulli bandit with K = 10, θ* = [0.51, 0.52, ..., 0.60]. (b) Bernoulli bandit with K = 100, θ* consisting of 100 points uniformly spaced over [0.5, 0.7]. (c) Max-Bernoulli bandit with K = 100, M = 5, θ* consisting of 100 points uniformly spaced over [0.3, 0.8].

6 Application to Network Slicing

In this section, we describe an application of PTS and RPTS to 5G network slicing. Network slicing is the partition of a network infrastructure into logically independent networks across multiple technology domains, in order to support independent vertical services with heterogeneous requirements. A network slice is an end-to-end virtual network, formed by stitching resources across different domains. Although network slicing is a promising technology, many challenges remain at both the system level and the theory level; see [18] for a detailed account. One main challenge is the complexity of coordinating and integrating resources across different domains, which necessitates centralized control for resource allocation and cross-domain coordination for stitching the slice. We propose a high-level model that captures the main features and challenges of the network slicing process and solve it using PTS and RPTS.

6.1 Model

At a high level, a mobile operator creates network slices across domains on demand, which are then put into use and exhibit certain performance. The system observes the behavior of each domain, e.g., latency, to make better decisions in the future. We formulate the problem as a contextual stochastic bandit problem by specifying the following elements: (C, A, Y, Θ, θ*, P_θ(·|a, c), R). See Figure 2.
Figure 2: A network slicing model.

Context set C. Let C = [0, 1]². A context vector c = (c_1, c_2) represents a slice request, characterizing the load and latency requirements of the intended service. Specifically, c_1 ∈ [0, 1] is the scaled offered load, relative to some maximum load that the mobile operator can support. For example, if the maximum supportable load is 20 Gbps and c_1 = 0.5, then the requested load is 20 · c_1 = 10 Gbps. Let c_2 ∈ [0, 1] be the inverse end-to-end latency requirement, scaled by the minimum possible. For example, if the minimum latency the network can support is 1 ms and c_2 = 0.5, then the latency required by the service provider is 1/c_2 = 2 ms.

Action space A. Let A = [B_1] × ... × [B_D], where D is the number of domains, B_i is the number of resource blocks in domain i, and [n] is short for {1, 2, ..., n}. That is, an action a = (a_1, ..., a_D) is a stitched chain of resource blocks, one from each domain, that forms an end-to-end network slice. The resource blocks model the resources available in each domain, regardless of their specific types. Block j in domain i is denoted Block_ij. At time t, the mobile operator selects an action A_t ∈ A through the central orchestrator. In Figure 2, D = 3, (B_1, B_2, B_3) = (2, 3, 3), and the action selected is (1, 2, 1). In practice, D and the B_i's are not large.

Parameter space Θ and parameter θ*. The parameter space is Θ = Θ_1 × ... × Θ_D, where Θ_i = ([0, 1]²)^{B_i} is the parameter space of domain i. Thus, the dimension of Θ is Σ_{i=1}^D 2B_i. The system parameter is θ* = (θ*_ij)_{i∈[D], j∈[B_i]}, where θ*_ij = (θ*_ij1, θ*_ij2) ∈ [0, 1]² reflects some intrinsic properties of Block_ij.

Observation space Y. Let Y = Y_1 × ... × Y_D be the observation space of the whole system, where Y_i = [0, ∞) for each i.
Given that action a = (a_1, ..., a_D) is taken, the resource blocks (Block_{1,a_1}, ..., Block_{D,a_D}) are selected. Y_i ∈ Y_i is the observed latency in domain i, exhibited by Block_{i,a_i}. Assume that Y_i is observable by domain manager i for each i. Y_t = (Y_{t,1}, ..., Y_{t,D}) ∈ Y is the system performance observed in all D domains at time t.

Observation model P_θ(·|a, c). Given action a = (a_1, ..., a_D) and context c = (c_1, c_2), the observation Y = (Y_1, ..., Y_D) is generated by the following distribution: the Y_i's are independent, and each Y_i follows an exponential distribution with E[Y_i] = c_1 θ*_ij1 + θ*_ij2, where j = a_i. An interpretation of this expression is that the expected latency E[Y_i] exhibited by domain i is positively related to the offered load c_1 of the requested service, due to queueing effects. θ*_ij1 is the rate at which the latency scales with the offered load at Block_ij, and θ*_ij2 is the baseline latency at Block_ij.

Reward function R. The reward function R : Y × C → R is defined by R((Y_1, Y_2, Y_3), (c_1, c_2)) = g_{c_2}(Y_1 + Y_2 + Y_3), where g_d for 0 ≤ d ≤ 1 is defined by

g_d(y) = y/d if 0 ≤ y ≤ d, and g_d(y) = 0 if y > d.

This reward function is based on two ideas. First, the minimum latency requirement c_2 in the context serves as a Service Level Agreement (SLA) between the mobile operator and the service provider. If the actual end-to-end latency is larger than c_2, the SLA is violated and the mobile operator gets a huge penalty (zero reward). Second, minimizing the latency as much as possible might be overkill, which could be costly. The mobile operator would be content with an observed latency that just meets the target.
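Under the model above, the observation and reward steps can be sketched as follows (our illustration; the function names, seed, and numeric values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

def observe(theta_blocks, c1):
    """Per-domain latencies: Y_i ~ Exponential with E[Y_i] = c1*theta_ij1 + theta_ij2.

    theta_blocks: (D, 2) array; row i holds (theta*_ij1, theta*_ij2) of the
    chosen block j = a_i in domain i.
    """
    means = c1 * theta_blocks[:, 0] + theta_blocks[:, 1]
    return rng.exponential(means)

def reward(y, c2):
    """R(Y, c) = g_{c2}(Y_1 + ... + Y_D): linear up to the SLA threshold c2, else 0."""
    total = float(np.sum(y))
    return total / c2 if total <= c2 else 0.0

theta = np.array([[0.2, 0.1], [0.3, 0.05], [0.1, 0.1]])  # one chosen block per domain
y = observe(theta, c1=0.5)
print(reward(y, c2=0.8))
```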
6.2 Algorithm

Algorithm 4 PTS for contextual stochastic bandit (per-system particles)
Inputs: $\mathcal{C}$, $\mathcal{A}$, $\mathcal{Y}$, $\Theta$, $\theta^*$, $P_\theta(\cdot \mid a, c)$, $R$, $\mathcal{P}_N \subset \Theta$
Initialization: $w_0 \leftarrow \left(\frac{1}{N}, \cdots, \frac{1}{N}\right)$
1: for $t = 1, 2, \cdots$ do
2:   Get $c_t$
3:   Generate $\theta_t$ from $\mathcal{P}_N$ according to weights $w_{t-1}$
4:   Play $A_t \leftarrow \arg\max_{a \in \mathcal{A}} \mathbb{E}_{\theta_t}[R(Y) \mid A_t = a, c_t]$
5:   Observe $Y_t \sim P_{\theta^*}(\cdot \mid A_t, c_t)$
6:   for $k \in \{1, 2, \cdots, N\}$ do
7:     $\tilde{w}_{t,k} = w_{t-1,k} P_{\theta^{(k)}}(Y_t \mid A_t, c_t)$
8:   end for
9:   $w_t \leftarrow$ normalize $\tilde{w}_t$
10: end for

PTS (Algorithm 2) can easily be updated to include contexts, as shown above in Algorithm 4. RPTS (Algorithm 3) can be similarly updated to include contexts: just update steps 1-8 of Algorithm 3 to the corresponding steps of Algorithm 4. In Algorithm 4, each particle in $\mathcal{P}_N$ has the same dimension as $\theta^* \in \Theta$. However, due to the independence and availability of observations across the domains for this particular model, there is a more effective way to construct the particles and update their weights, called per-block particles, as follows (see Figure 3 for an illustration). For each Block$_{ij}$, we generate a set of $N$ particles $\mathcal{P}_{ij} = \{\theta^{(1)}_{ij}, \cdots, \theta^{(N)}_{ij}\} \subset [0,1]^2$, which have weights $w_{t,ij} = (w_{t,ij,1}, \cdots, w_{t,ij,N})$ at time $t$. In step 3 of Algorithm 4, we generate $\theta_t = \{\theta_{t,ij}\}_{i \in [D], j \in [B_i]}$ by generating each $\theta_{t,ij}$ from $\mathcal{P}_{ij}$ according to weights $w_{t,ij}$.

Figure 3: Per-block particles implementation.

Steps 6-8 of Algorithm 4 then become:
for $i \in \{1, 2, \cdots, D\}$ do:
  for $k \in \{1, \cdots, N\}$ do:
    $\tilde{w}_{t,i,A_{t,i},k} = w_{t-1,i,A_{t,i},k} P_{\theta^{(k)}_{i,A_{t,i}}}(Y_{t,i} \mid A_{t,i}, c_t)$
  $w_{t,i,A_{t,i}} \leftarrow$ normalize $\tilde{w}_{t,i,A_{t,i}}$
due to the independence of observations across domains.
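The per-block reweighting above can be sketched in code (our own illustrative data layout, assuming the exponential observation model of this section; only the chosen block's particle weights are touched):

```python
import math

def per_block_update(weights, particles, action, obs, c1):
    """Per-block PTS weight update (per-block form of steps 6-8): in each
    domain i, reweight only the particles of the chosen block j = a_i by the
    likelihood of the observed latency Y_i under the exponential model."""
    for i, (j, y) in enumerate(zip(action, obs)):
        new = []
        for (t1, t2), w in zip(particles[i][j], weights[i][j]):
            mean = c1 * t1 + t2                   # E[Y_i] = c1*theta_ij1 + theta_ij2
            lik = math.exp(-y / mean) / mean      # exponential density at y
            new.append(w * lik)
        total = sum(new)
        weights[i][j] = [v / total for v in new]  # renormalize the chosen block only
    return weights
```

The weights of all unused blocks are left untouched, mirroring the statement that only the chosen block in each domain is updated.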
In essence, we maintain a set of particles for each block, and in each time step we update only the weights of the particles of the chosen block in each domain, while keeping unchanged the weights of the particles of the unused blocks. The per-block particle implementation stores the same number of parameter values in the system, $2N \sum_{i=1}^D B_i$, but the effective number of per-system particles is $N^{\sum_{i=1}^D B_i}$ (although these particles are not independent). For this model, the expectation in step 4 of Algorithm 4 can be approximately calculated; see Appendix Section D.

6.3 Simulation

Simulation setup: $D = 3$ and $(B_1, B_2, B_3) = (3, 3, 3)$. In practice, $D$ and the $B_i$'s are often small. Results are in Figure 4. Each curve is averaged over 100 independent simulations. In each simulation, the system parameter $\theta^*$ and the initial set of particles are randomly generated in the parameter space. Both PTS and RPTS work poorly with 10 per-block particles and are subject to much randomness. With 100 per-block particles, both algorithms are effective, although the improvement of RPTS compared to PTS is not obvious at the shown time scale.

Figure 4: Simulation for network slicing.

7 Conclusions and Future Work

This paper provides a practical variation of Thompson sampling. An analysis of PTS for general stochastic bandit problems is provided, by which we show that fit particles survive and unfit particles decay. We propose RPTS to improve PTS based on a simple heuristic that periodically deletes essentially inactive particles and regenerates new particles in the vicinity of survivors. We show empirically that RPTS significantly outperforms PTS in a set of representative bandit problems. Finally, we show an application of PTS and RPTS to network slicing and demonstrate through simulations that the algorithms are effective. Some directions for future work are as follows.
First, the necessary survival condition in Proposition 1 may be further explored to provide insight on which particles can survive for some specific bandit problems. Second, while the particle regeneration strategy we used in RPTS is simple and effective, there may be other, more principle-guided strategies that have some theoretical guarantees.

References

[1] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pages 39.1-39.26, Edinburgh, Scotland, 25-27 Jun 2012. PMLR.

[2] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML'13, pages III-1220-III-1228. JMLR.org, 2013.

[3] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235-256, May 2002.

[4] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. CoRR, abs/1204.5721, 2012.

[5] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pages 2249-2257. Curran Associates, Inc., 2011.

[6] S.S. Dragomir, M.L. Scholz, and J. Sunde. Some upper bounds for relative entropy and applications. Computers and Mathematics with Applications, 39(9):91-100, 2000.

[7] Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 359-376, Budapest, Hungary, 09-11 Jun 2011. PMLR.
[8] Aditya Gopalan, Shie Mannor, and Yishay Mansour. Thompson sampling for complex online problems. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML'14, pages I-100-I-108. JMLR.org, 2014.

[9] Bruce Hajek. Hitting-time and occupation-time bounds implied by drift analysis with applications. Advances in Applied Probability, 14(3):502-525, 1982.

[10] Bruce Hajek. Notes for ECE567 Communication Network Analysis. https://hajek.ece.illinois.edu/ECE567Notes.html, 2006. Accessed: 2020-08-12.

[11] Yu-Heng Hung, Ping-Chun Hsieh, Xi Liu, and P. R. Kumar. Reward-biased maximum likelihood estimation for linear stochastic bandits. arXiv e-prints, page arXiv:2010.04091, October 2020.

[12] Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, pages 199-213, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.

[13] Jaya Kawale, Hung H Bui, Branislav Kveton, Long Tran-Thanh, and Sanjay Chawla. Efficient Thompson sampling for online matrix-factorization recommendation. In Advances in Neural Information Processing Systems 28, pages 1297-1305. Curran Associates, Inc., 2015.

[14] T.L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.

[15] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 1st edition, 2019.

[16] Xi Liu, Ping-Chun Hsieh, Anirban Bhattacharya, and P. R. Kumar. Exploration through reward biasing: reward-biased maximum likelihood estimation for stochastic multi-armed bandits. arXiv e-prints, page arXiv:1907.01287, July 2019.

[17] Xiuyuan Lu and Benjamin Van Roy. Ensemble sampling.
In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 3260-3268, USA, 2017. Curran Associates Inc.

[18] Gianfranco Nencioni, Rosario G. Garroppo, Andres J. Gonzalez, Bjarne E. Helvik, and Gregorio Procissi. Orchestration and control in software-defined 5G networks: Research challenges. Wireless Communications and Mobile Computing, 2018, 2018.

[19] Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling. CoRR, abs/1707.02038, 2017.

[20] William R. Thompson. On the theory of apportionment. American Journal of Mathematics, 57(2):450-456, 1935.

A Proofs of Proposition 1 and Corollary 2

This section contains the proofs of Proposition 1 and Corollary 2. Let $L_{t,i} \triangleq \ln \tilde{w}_{t,i} - \ln \tilde{w}_{t-1,i}$. Assumption 1 has two additional assumptions:

(c) $\left| \frac{1}{t} \sum_{\tau=1}^t \mathbb{1}\{I_\tau = i\} - \frac{1}{t} \sum_{\tau=0}^{t-1} w_{\tau,i} \right| \to 0$ as $t \to \infty$ for any $i \in [N]$.

(d) For any $i \in [N]$ that is used infinitely many times, $\frac{1}{M} \sum_{m=1}^M L_{t_i(m)} \to D_i$ as $M \to \infty$, where $t_i(m)$ is the $m$th time particle $i$ is chosen and $D_i$ is the $i$th row of the drift matrix $D$.

In Assumption 1(c), $\mathbb{1}\{I_\tau = i\}$ is a Bernoulli random variable with mean $w_{\tau-1,i}$ for each $\tau$. Therefore (c) holds with probability one by the Azuma-Hoeffding inequality. Assumption 1(d) holds with probability one by the definition of $D$ and the strong law of large numbers.

The proof of Proposition 1 starts with the following lemma. All the lemmas in the rest of this proof deal with a sample path under Assumption 1.

Lemma 3. The probability vector $\pi$ is well defined. In addition, $\operatorname{supp}(\pi) = S$. That is, if $i \notin S$, then $\pi_i = 0$; if $i \in S$, then $\pi_i > 0$.

Proof. For $i \notin S$,
$$w_{t,i} = \frac{\tilde{w}_{t,i}}{\sum_{j=1}^N \tilde{w}_{t,j}} = \frac{e^{\ln \tilde{w}_{t,i}}}{\sum_{j=1}^N e^{\ln \tilde{w}_{t,j}}} \le \frac{e^{\ln \tilde{w}_{t,i}}}{e^{\ln \tilde{w}_{t,j_0}}}$$
for any $j_0 \in S$. By Assumption 1(a), $w_{t,i} \to 0$.
Hence $\pi_i = \lim_{t \to \infty} \frac{1}{t+1} \sum_{\tau=0}^t w_{\tau,i} = 0$. Next, define
$$w'_{t,i} \triangleq \begin{cases} 0 & \text{if } i \notin S \\ \frac{w_{t,i}}{\sum_{j \in S} w_{t,j}} & \text{if } i \in S. \end{cases}$$
Fix $i \in S$. Then
$$w'_{t,i} - w_{t,i} = w_{t,i} \left( \frac{1}{\sum_{j \in S} w_{t,j}} - 1 \right) = \frac{w_{t,i}}{\sum_{j \in S} w_{t,j}} \left( 1 - \sum_{j \in S} w_{t,j} \right) = \frac{w_{t,i}}{\sum_{j \in S} w_{t,j}} \sum_{j \notin S} w_{t,j}.$$
Since the set $[N] \setminus S$ is finite, $\sum_{j \notin S} w_{t,j} \to 0$. It follows that $w'_{t,i} - w_{t,i} \to 0$. Hence
$$\frac{1}{t+1} \sum_{\tau=0}^t w'_{\tau,i} - \frac{1}{t+1} \sum_{\tau=0}^t w_{\tau,i} \to 0. \quad (2)$$
Now, observe that $w'_{t,i}$ can be determined from $\{\ln \tilde{w}_{t,j}\}_{j \in S}$ by $w'_{t,i} = \frac{e^{\ln \tilde{w}_{t,i}}}{\sum_{j \in S} e^{\ln \tilde{w}_{t,j}}}$. Therefore, $w'_{t,i}$ is a continuous and bounded function of $\{\ln \tilde{w}_{t,j}\}_{j \in S}$, and hence of $G_t$. We write this as $w'_{t,i} = w'_i(G_t)$. According to Assumption 1(b),
$$\frac{1}{t+1} \sum_{\tau=0}^t w'_{\tau,i} \to \mathbb{E}_{\mu_G}[w'_i]. \quad (3)$$
Combining (2) and (3), we obtain $\pi_i = \mathbb{E}_{\mu_G}[w'_i]$. Since $w'_i$ is a positive function and $\mu_G$ is a distribution, we conclude that $\pi_i > 0$ for $i \in S$.

Finally,
$$\sum_{i \in [N]} \pi_i = \sum_{i \in [N]} \lim_{t \to \infty} \frac{1}{t+1} \sum_{\tau=0}^t w_{\tau,i} \stackrel{(i)}{=} \lim_{t \to \infty} \sum_{i \in [N]} \frac{1}{t+1} \sum_{\tau=0}^t w_{\tau,i} = \lim_{t \to \infty} \frac{1}{t+1} \sum_{\tau=0}^t \sum_{i \in [N]} w_{\tau,i} = \lim_{t \to \infty} 1 = 1,$$
where in step (i) we switch the limit and summation because all summands are non-negative and $N$ is finite. Thus $\pi$ is well defined.

Lemma 4. $\frac{1}{t} \sum_{\tau=1}^t L_\tau \to \pi D$ as $t \to \infty$.

Proof. Let $M_i(t)$ be the number of times particle $i$ has been played up to time $t$. Let $\tau_i(m)$ be the $m$th time that particle $i$ is played. Then
$$\frac{1}{t} \sum_{\tau=1}^t L_\tau = \frac{1}{t} \sum_{i=1}^N \sum_{m=1}^{M_i(t)} L_{\tau_i(m)} = \sum_{i=1}^N \frac{M_i(t)}{t} \cdot \frac{1}{M_i(t)} \sum_{m=1}^{M_i(t)} L_{\tau_i(m)}.$$
Since $M_i(t) = \sum_{\tau=1}^t \mathbb{1}\{I_\tau = i\}$, by Assumption 1(c) and the definition of $\pi_i$, $\frac{M_i(t)}{t} \to \pi_i$ for all $i \in [N]$. If particle $i$ is played infinitely many times in the sample path, then $\frac{1}{M_i(t)} \sum_{m=1}^{M_i(t)} L_{\tau_i(m)} \to D_i$ as $t \to \infty$ by Assumption 1(d).
If particle $i$ is played finitely many times, so that $M_i(t) \le C$ for some constant $C$ and all $t$, then $\frac{M_i(t)}{t} \to 0$ and $\limsup_{t \to \infty} \left| \frac{1}{M_i(t)} \sum_{m=1}^{M_i(t)} L_{\tau_i(m)} \right| < \infty$. In either case, we have
$$\frac{M_i(t)}{t} \cdot \frac{1}{M_i(t)} \sum_{m=1}^{M_i(t)} L_{\tau_i(m)} \to \pi_i D_i \quad \text{as } t \to \infty.$$
It follows that
$$\frac{1}{t} \sum_{\tau=1}^t L_\tau \to \sum_{i=1}^N \pi_i D_i = \pi D \quad \text{as } t \to \infty.$$

Lemma 5. If a real-valued sequence $\{x_t\}_{t \ge 1}$ satisfies (1) $\{x_t\}$ has a limiting distribution $\mu$, and (2) $\{x_t\}$ is $B$-Lipschitz, i.e., there exists some constant $B$ such that $|x_t - x_s| \le B|t - s|$ for all $t, s \in \mathbb{N}_+$, then $\lim_{t \to \infty} \frac{1}{t} x_t = 0$.

Proof. We show $\limsup_{t \to \infty} \frac{1}{t} x_t \le \delta$ for any $\delta > 0$. Suppose there exists $\delta > 0$ such that $\limsup_{t \to \infty} \frac{1}{t} x_t > \delta$. Condition (1) implies that there exists $c \in \mathbb{R}$ such that
$$\frac{1}{t} \sum_{\tau=1}^t \mathbb{1}\{x_\tau \ge c\} \le \frac{\delta}{2B} \quad \text{for all } t \text{ sufficiently large}. \quad (4)$$
Let $\{t_1, t_2, \cdots, t_n, \cdots\}$ be a sequence of positive integers such that $\lim_{n \to \infty} t_n = \infty$ and $\frac{1}{t_n} x_{t_n} \ge \delta$ for all $n$. Thus $x_{t_n} \ge \delta t_n$ for all $n$. Since $\{x_t\}$ is $B$-Lipschitz, for any $t \in [1, t_n]$,
$$x_t \ge x_{t_n} - B(t_n - t) \ge \delta t_n - B(t_n - t) = Bt - (B - \delta) t_n.$$
It follows that, if $t \ge \frac{c}{B} + \left(1 - \frac{\delta}{B}\right) t_n$, then $x_t \ge c$. Therefore, for $t_n > \frac{2c}{\delta}$,
$$\frac{1}{t_n} \sum_{\tau=1}^{t_n} \mathbb{1}\{x_\tau \ge c\} \ge \frac{1}{t_n} \sum_{\tau=1}^{t_n} \mathbb{1}\left\{\tau \ge \frac{c}{B} + \left(1 - \frac{\delta}{B}\right) t_n\right\} = \frac{1}{t_n} \left( t_n - \left( \frac{c}{B} + \left(1 - \frac{\delta}{B}\right) t_n \right) \right) = \frac{\delta}{B} - \frac{c}{B t_n} > \frac{\delta}{2B},$$
which contradicts (4). Therefore, $\limsup_{t \to \infty} \frac{1}{t} x_t \le \delta$ for any $\delta > 0$. Similarly, we can show that $\liminf_{t \to \infty} \frac{1}{t} x_t \ge -\delta$ for any $\delta > 0$. We conclude that $\lim_{t \to \infty} \frac{1}{t} x_t = 0$.

Lemma 6. If $i, j \in S$, then $(\pi D)_i = (\pi D)_j$.

Proof. Consider $i, j \in S$. Then
$$\frac{1}{t} \sum_{\tau=1}^t L_{\tau,i} - \frac{1}{t} \sum_{\tau=1}^t L_{\tau,j} = \frac{1}{t} \sum_{\tau=1}^t (L_{\tau,i} - L_{\tau,j}) = \frac{1}{t} \sum_{\tau=1}^t \left[ (\ln \tilde{w}_{\tau,i} - \ln \tilde{w}_{\tau-1,i}) - (\ln \tilde{w}_{\tau,j} - \ln \tilde{w}_{\tau-1,j}) \right] = \frac{1}{t} \left[ (\ln \tilde{w}_{t,i} - \ln \tilde{w}_{0,i}) - (\ln \tilde{w}_{t,j} - \ln \tilde{w}_{0,j}) \right] = \frac{1}{t} (\ln \tilde{w}_{t,i} - \ln \tilde{w}_{t,j}) = \frac{1}{t} G_t(i,j).$$
The third equality above used $\ln \tilde{w}_{0,i} = \ln \tilde{w}_{0,j} = 0$ by initialization (although that is not important, as long as the difference is finite). By the dynamics of the weights $\{w_{t,i}\}$ and $\{w_{t,j}\}$, we have that
$$G_{t+1}(i,j) = G_t(i,j) + \ln \frac{P_{\theta^{(i)}}(Y_{t+1} \mid A_{t+1})}{P_{\theta^{(j)}}(Y_{t+1} \mid A_{t+1})}.$$
By Assumption 2, $|G_{t+1}(i,j) - G_t(i,j)| \le B$, where $B = \max\{|\ln b_0|, |\ln B_0|\}$. Thus $\{G_t(i,j)\}_{t \ge 1}$ is a $B$-Lipschitz sequence. Therefore,
$$(\pi D)_i - (\pi D)_j \stackrel{(i)}{=} \lim_{t \to \infty} \left( \frac{1}{t} \sum_{\tau=1}^t L_{\tau,i} - \frac{1}{t} \sum_{\tau=1}^t L_{\tau,j} \right) = \lim_{t \to \infty} \frac{1}{t} G_t(i,j) \stackrel{(ii)}{=} 0,$$
where equality (i) is due to Lemma 4 and equality (ii) is due to Lemma 5 and Assumption 1(b).

Lemma 7. If $i \notin S$ and $j \in S$, then $(\pi D)_i < (\pi D)_j$.

Proof. Similar to the proof of Lemma 6, we have
$$\frac{1}{t} \sum_{\tau=1}^t L_{\tau,i} - \frac{1}{t} \sum_{\tau=1}^t L_{\tau,j} = \frac{1}{t} (\ln \tilde{w}_{t,i} - \ln \tilde{w}_{t,j}).$$
The left-hand side converges to $(\pi D)_i - (\pi D)_j$ as $t \to \infty$ by Lemma 4. The right-hand side converges to a strictly negative value as $t \to \infty$ by Assumption 1(a). Thus $(\pi D)_i < (\pi D)_j$.

Proof of Proposition 1. Lemma 3 shows $\operatorname{supp}(\pi) = S$. Lemma 6 and Lemma 7 show $\arg\max(\pi D) = S$. Proposition 1 is thus proved.

Proof of Corollary 2. If $N \le K$, then $|\operatorname{supp}(\pi)| \le N \le K$ trivially. Let $N > K$. The observation model of a Bernoulli bandit problem satisfies Assumption 2 trivially. By Proposition 1, with probability one, for any sample path, the probability vector $\pi$ is well-defined and $\pi$ and $S$ satisfy $\arg\max(\pi D) = \operatorname{supp}(\pi) = S$, which implies the following constraints on $\pi$:
$$\pi_i = 0 \text{ for } i \notin S, \qquad (\pi D)_i = (\pi D)_j \text{ for all } i, j \in S, \quad (5)$$
where $S$ is the subset of $[N]$ in Assumption 1. Suppose $|S| > K$. The remainder of the proof shows that, with probability one, any $\pi$ that satisfies (5) is the all-zero vector (thus $\pi$ cannot be a probability vector). This leads to a contradiction with $|S| > K$, and therefore we conclude that $|S| \le K$.
We construct a matrix $\tilde{D} \in \mathbb{R}^{K \times N}$ and a probability (row) vector $\tilde{\pi} \in [0,1]^K$ from $D$ and $\pi$, as follows. Recall that row $i_1$ and row $i_2$ of $D$ are the same if $A(i_1) = A(i_2)$. Since there are $K$ arms, there can be at most $K$ unique rows in $D$. Let $\tilde{D}$ be $D$ reduced to its unique $K$ rows. That is, $\tilde{D}_k = \mathbb{E}[L_t \mid A_t = k]$ (which is independent of $t$) for $k \in [K]$. For $k \in [K]$, let $\tilde{\pi}_k = \sum_{i : i \in S, A(i) = k} \pi_i$. That is, $\tilde{\pi}_k$ is the sum of the asymptotic weights of surviving particles with the optimal arm $k$. If no $i \in S$ satisfies $A(i) = k$, then $\tilde{\pi}_k = 0$. It is easy to verify that $\tilde{\pi}_1 + \cdots + \tilde{\pi}_K = 1$. Now, observe that
$$\pi D = \sum_{i=1}^N \pi_i D_i = \sum_{i \in S} \pi_i D_i = \sum_{k=1}^K \sum_{i : i \in S, A(i) = k} \pi_i D_i = \sum_{k=1}^K \sum_{i : i \in S, A(i) = k} \pi_i \tilde{D}_k = \sum_{k=1}^K \left( \sum_{i : i \in S, A(i) = k} \pi_i \right) \tilde{D}_k = \sum_{k=1}^K \tilde{\pi}_k \tilde{D}_k = \tilde{\pi} \tilde{D}.$$
Therefore, the constraints (5) on $\pi$ imply the following constraints on $\tilde{\pi}$:
$$(\tilde{\pi} \tilde{D})_i = (\tilde{\pi} \tilde{D})_j \text{ for all } i, j \in S. \quad (6)$$
Let $\tilde{D}^i$ be the $i$th column of $\tilde{D}$. Then $(\tilde{\pi} \tilde{D})_i = \langle \tilde{\pi}, \tilde{D}^i \rangle$. Constraints (6) can thus be re-written as
$$\langle \tilde{\pi}, \tilde{D}^i - \tilde{D}^j \rangle = 0 \text{ for all } i, j \in S. \quad (7)$$
For a Bernoulli bandit problem, the entries in $\tilde{D} = [\tilde{D}_{kj}]_{1 \le k \le K, 1 \le j \le N}$ are of the form $\tilde{D}_{kj} = -d(\theta^*_k \| \theta^{(j)}_k)$, where $d(x \| y) = x \ln \frac{x}{y} + (1-x) \ln \frac{1-x}{1-y}$ for $x, y \in [0,1]$, and $\theta^{(j)}_k$ is uniformly distributed in $[0,1]$ and independent across $k \in [K]$ and $j \in [N]$. Therefore, since $|S| > K$, with probability one the set of vectors $\{\tilde{D}^i - \tilde{D}^j : i, j \in S\}$ spans $\mathbb{R}^K$, in which case the only $\tilde{\pi} \in \mathbb{R}^K$ that satisfies (7) is the all-zero vector. By the construction of $\tilde{\pi}$, with probability one, the only vector $\pi \in \mathbb{R}^N$ that satisfies (5) is the all-zero vector.

B Analysis of PTS for Two-Arm Bernoulli Bandit

This section considers perhaps the most simple bandit problem in more depth than Proposition 1.
The results provide further intuition about PTS and about the assumptions and conclusions of Proposition 1 and its corollary. Specifically, we analyze PTS for the two-arm Bernoulli bandit problem. The section is organized as follows. Subsection B.1 provides a general analysis of the weight dynamics for $N$ given particles. Subsection B.2 takes a closer look at the case of two given particles, including, in particular, the counter-reinforcing pair and the self-reinforcing pair. Subsection B.3 discusses the asymptotic behavior of $N$ given particles. Subsection B.4 discusses the performance of PTS for $N$ randomly generated particles, including two ways of generation: coordinate-wise and whole-particle. Subsection B.5 summarizes the results in this section. Subsection B.6 includes for reference two known bounds that are used in this section.

For a two-arm Bernoulli bandit problem, $\mathcal{A} = \{1, 2\}$, $\mathcal{Y} = \{0, 1\}$, $\Theta = [0,1]^2$, $R(y) = y$. PTS (Algorithm 2) is then reduced to Algorithm 5 below.

Algorithm 5 PTS for two-arm Bernoulli bandit
Input: $\theta^*$, $\mathcal{P}_N$
Initialization: weights $w_0 \leftarrow \left(\frac{1}{N}, \cdots, \frac{1}{N}\right)$, unnormalized weights $\tilde{w}_0 \leftarrow (1, \cdots, 1)$.
1: for $t = 1, 2, \cdots$ do
2:   Generate $\theta_t$ from $\mathcal{P}_N$ according to weights $w_{t-1}$
3:   Play $A_t \leftarrow \arg\max_{a \in \{1,2\}} \theta_{t,a}$
4:   Observe reward $R_t \sim \text{Bernoulli}(\theta^*_{A_t})$
5:   for $i \in \{1, 2, \cdots, N\}$ do
6:     $\tilde{w}_{t,i} = \tilde{w}_{t-1,i} P_{\theta^{(i)}_{A_t}}(R_t) = \begin{cases} \tilde{w}_{t-1,i}\, \theta^{(i)}_{A_t} & \text{if } R_t = 1 \\ \tilde{w}_{t-1,i}\, (1 - \theta^{(i)}_{A_t}) & \text{if } R_t = 0 \end{cases}$ (8)
7:   end for
8:   $w_t \leftarrow$ normalize $\tilde{w}_t$
9: end for

Notation: Let $w_{t,i}$, $\tilde{w}_{t,i}$, $\bar{w}_{t,i}$ be the normalized, unnormalized, and running-average weight of particle $i \in [N]$ at time $t$, respectively. Let $w_t = (w_{t,1}, \cdots, w_{t,N})$. Let $I_t \in [N]$ be the index of the particle chosen at time $t$; $I_t \sim w_{t-1}$. Let $q_{t,i}$ be the fraction of time particle $i$ has been played up to time $t$, i.e., $q_{t,i} = \frac{1}{t} \sum_{\tau=1}^t \mathbb{1}\{I_\tau = i\}$.
Let $A_t \in \mathcal{A} = \{1, 2\}$ be the action/arm taken at time $t$. Let $A : [0,1]^2 \to \{1, 2\}$ be the function mapping a particle to the corresponding best action/arm, defined by $A(\theta) = \arg\max_{a \in \{1,2\}} \theta_a$. In the case $\theta_1 = \theta_2$, we let $A(\theta)$ equal either arm 1 or arm 2 deterministically. With a slight abuse of notation, we sometimes abbreviate $A(\theta^{(i)})$ by $A(i)$. Thus $A_t = A(I_t)$.

Let $r_t \in [0,1]$ be the usage frequency of arm 1 at time $t$, namely, the fraction of time that arm 1 has been pulled up to and including time $t$. It follows that $1 - r_t$ is the usage frequency of arm 2 at time $t$.

Let $d(x \| y) \triangleq x \ln \frac{x}{y} + (1-x) \ln \frac{1-x}{1-y}$ denote the KL-divergence between two Bernoulli distributions parameterized by $x$ and $y$, respectively. Let
$$D_i(r) \triangleq r\, d(\theta^*_1 \| \theta^{(i)}_1) + (1-r)\, d(\theta^*_2 \| \theta^{(i)}_2)$$
denote the convex combination of the KL divergences between $\theta^*$ and $\theta^{(i)}$ at the two arms, with weight $r$ on arm 1 and weight $1-r$ on arm 2, for some $r \in [0,1]$. For brevity, we shall call $D_i(r)$ the divergence of particle $i$ at $r$. Let an instance of a two-arm Bernoulli bandit problem with parameter $\theta^*$ be denoted as BernoulliBandit($K = 2$, $\theta^*$).

B.1 $N$ given particles, weight dynamics

We start with some informal analysis to provide some high-level intuition. Consider the process in Algorithm 5 and a given particle $\theta^{(i)} \in \mathcal{P}_N$. By (8), the unnormalized weight of particle $i$ at time $t$ can be written as
$$\tilde{w}_{t,i} = \prod_{\tau=1}^t P_{\theta^{(i)}_{A_\tau}}(R_\tau) = \exp\left( \sum_{\tau=1}^t \ln P_{\theta^{(i)}_{A_\tau}}(R_\tau) \right) = \exp\left( \sum_{\tau \in \mathcal{T}_1} \ln P_{\theta^{(i)}_1}(R_\tau) + \sum_{\tau \in \mathcal{T}_2} \ln P_{\theta^{(i)}_2}(R_\tau) \right),$$
where $\mathcal{T}_a \triangleq \{\tau \in \{1, \cdots, t\} : A_\tau = a\}$ for $a = 1, 2$, i.e., $\mathcal{T}_a$ is the set of time instances up to time $t$ at which arm $a$ is played. By the definition of $r_t$, $|\mathcal{T}_1| = t r_t$ and $|\mathcal{T}_2| = t(1 - r_t)$. Suppose both $|\mathcal{T}_1|$ and $|\mathcal{T}_2|$ are non-zero and grow with $t$.
For large $t$, we have
$$\frac{1}{t} \ln \tilde{w}_{t,i} = r_t \frac{1}{t r_t} \sum_{\tau \in \mathcal{T}_1} \ln P_{\theta^{(i)}_1}(R_\tau) + (1 - r_t) \frac{1}{t(1 - r_t)} \sum_{\tau \in \mathcal{T}_2} \ln P_{\theta^{(i)}_2}(R_\tau) \approx r_t\, \mathbb{E}_{\theta^*}\left[ \ln P_{\theta^{(i)}_1}(R_1) \right] + (1 - r_t)\, \mathbb{E}_{\theta^*}\left[ \ln P_{\theta^{(i)}_2}(R_1) \right] = r_t \left( -d(\theta^*_1 \| \theta^{(i)}_1) - H(\theta^*_1) \right) + (1 - r_t) \left( -d(\theta^*_2 \| \theta^{(i)}_2) - H(\theta^*_2) \right) = -D_i(r_t) - \left( r_t H(\theta^*_1) + (1 - r_t) H(\theta^*_2) \right).$$
The term $r_t H(\theta^*_1) + (1 - r_t) H(\theta^*_2)$ doesn't depend on $i$. Therefore, for large $t$, $\tilde{w}_{t,i} \overset{\propto}{\sim} e^{-t D_i(r_t)}$.

The above discussion can be made formal by the following proposition.

Proposition 8. Given a problem BernoulliBandit($K = 2$, $\theta^*$) and a particle set $\mathcal{P}_N \subset [0,1]^2$, consider the process of running PTS($\mathcal{P}_N$) as in Algorithm 5. For any $i \in \{1, \cdots, N\}$ and $t \ge 1$,
$$\frac{1}{t} \ln \tilde{w}_{t,i} = -D_i(r_t) + \epsilon_{t,i} + C(r_t), \quad (9)$$
where $C(r_t)$ is a given function of $r_t$ that does not depend on $i$, and $\{\epsilon_{t,i}\}_{t \ge 1}$ is a random sequence that converges to zero in probability.^6 More specifically, for some positive constant $B_{\theta^{(i)}}$ depending on $\theta^{(i)}$,
$$\mathbb{P}\{|\epsilon_{t,i}| > \delta\} \le 4 t e^{-B_{\theta^{(i)}} \delta^2 t} \quad (10)$$
for any $\delta > 0$ and $t \ge 1$.

Proof. Let $N_{t,a}$ be the number of times action $a$ has been played up to time $t$, $a \in \{1, 2\}$; $N_{t,1} + N_{t,2} = t$. Consider the following alternative construction of the reward generation process. Before the game starts, we generate a value $Z_a(k)$ for each action $a \in \{1, 2\}$ and each time $k = 1, 2, \cdots$ independently according to the distribution Bernoulli($\theta^*_a$). At each step $t$, playing action $A_t = a$ yields reward $R_t = Z_a(N_{t,a})$. That is, step 4 of Algorithm 5 becomes $R_t = Z_{A_t}(N_{t,A_t})$. It is easy to see that the distributions of any given sample path seen by the algorithm in both constructions are identical. Therefore, we can equivalently work with the alternative construction whenever it is more convenient.
We have
$$\tilde{w}_{t,i} = \exp\left( \sum_{\tau=1}^t \ln P_{\theta^{(i)}_{A_\tau}}(R_\tau) \right) = \exp\left( \sum_{a \in \{1,2\}} \sum_{\tau=1}^t \mathbb{1}\{A_\tau = a\} \ln P_{\theta^{(i)}_a}(R_\tau) \right) = \exp\left( \sum_{a \in \{1,2\}} \sum_{\tau=1}^t \mathbb{1}\{A_\tau = a\} \ln P_{\theta^{(i)}_a}(Z_a(N_{\tau,a})) \right) = \exp\left( \sum_{a \in \{1,2\}} \sum_{k=1}^{N_{t,a}} \ln P_{\theta^{(i)}_a}(Z_a(k)) \right)$$
for any time $t$ and particle $i \in \{1, \cdots, N\}$. The values in $\{\ln P_{\theta^{(i)}_1}(Z_1(k))\}_{k=1}^{N_{t,1}}$ are i.i.d. random variables, each equal to $\ln \theta^{(i)}_1$ with probability $\theta^*_1$ or $\ln(1 - \theta^{(i)}_1)$ with probability $1 - \theta^*_1$, with mean $-d(\theta^*_1 \| \theta^{(i)}_1) - H(\theta^*_1)$. Similarly, the values in $\{\ln P_{\theta^{(i)}_2}(Z_2(k))\}_{k=1}^{N_{t,2}}$ are i.i.d. random variables with mean $-d(\theta^*_2 \| \theta^{(i)}_2) - H(\theta^*_2)$. It follows after some simple algebraic re-arrangements that
$$\frac{1}{t} \ln \tilde{w}_{t,i} = \frac{1}{t} \left( \sum_{k=1}^{N_{t,1}} \ln P_{\theta^{(i)}_1}(Z_1(k)) + \sum_{k=1}^{N_{t,2}} \ln P_{\theta^{(i)}_2}(Z_2(k)) \right) = -D_i(r_t) + \epsilon_{t,i} \underbrace{- r_t H(\theta^*_1) - (1 - r_t) H(\theta^*_2)}_{\triangleq C(r_t)},$$
where
$$\epsilon_{t,i} = \frac{1}{t} \underbrace{\sum_{k=1}^{N_{t,1}} \left[ \ln P_{\theta^{(i)}_1}(Z_1(k)) - \left( -d(\theta^*_1 \| \theta^{(i)}_1) - H(\theta^*_1) \right) \right]}_{\triangleq E_1(N_{t,1})} + \frac{1}{t} \underbrace{\sum_{k=1}^{N_{t,2}} \left[ \ln P_{\theta^{(i)}_2}(Z_2(k)) - \left( -d(\theta^*_2 \| \theta^{(i)}_2) - H(\theta^*_2) \right) \right]}_{\triangleq E_2(N_{t,2})}.$$
$E_1(N_{t,1})$ is the sum of $N_{t,1}$ i.i.d. random variables, each of which has mean zero and is contained in an interval of length $\left| \ln \theta^{(i)}_1 - \ln(1 - \theta^{(i)}_1) \right|$. $N_{t,1}$ is a random variable that takes values in $\{1, \cdots, t\}$.

^6 It can be further shown that this convergence is almost sure by using the Borel-Cantelli lemma. We state the convergence-in-probability result here because it will be used later.
Therefore, for any $\gamma > 0$,
$$\mathbb{P}\{|E_1(N_{t,1})| > \gamma\} = \sum_{n=1}^t \mathbb{P}\{|E_1(n)| > \gamma \mid N_{t,1} = n\}\, \mathbb{P}\{N_{t,1} = n\} \le \sum_{n=1}^t \mathbb{P}\{|E_1(n)| > \gamma\} \le \sum_{n=1}^t 2 \exp\left( \frac{-2\gamma^2}{n \left( \ln \theta^{(i)}_1 - \ln(1 - \theta^{(i)}_1) \right)^2} \right) \le \sum_{n=1}^t 2 \exp\left( \frac{-2\gamma^2}{t \left( \ln \theta^{(i)}_1 - \ln(1 - \theta^{(i)}_1) \right)^2} \right) = 2t \exp\left( \frac{-2\gamma^2}{t \left( \ln \theta^{(i)}_1 - \ln(1 - \theta^{(i)}_1) \right)^2} \right). \quad (11)$$
The second inequality above is due to the Azuma-Hoeffding inequality. Similarly,
$$\mathbb{P}\{|E_2(N_{t,2})| > \gamma\} \le 2t \exp\left( \frac{-2\gamma^2}{t \left( \ln \theta^{(i)}_2 - \ln(1 - \theta^{(i)}_2) \right)^2} \right). \quad (12)$$
Using (11) and (12), we have
$$\mathbb{P}\{|\epsilon_{t,i}| \ge \delta\} \le \sum_{a \in \{1,2\}} \mathbb{P}\left\{ |E_a(N_{t,a})| \ge \frac{t\delta}{2} \right\} \le \sum_{a \in \{1,2\}} 2t \exp\left( -\frac{\delta^2 t}{2} \left( \ln \frac{\theta^{(i)}_a}{1 - \theta^{(i)}_a} \right)^{-2} \right) \le 4 t e^{-B_{\theta^{(i)}} \delta^2 t},$$
where
$$B_{\theta^{(i)}} = \frac{1}{2} \min\left\{ \left( \ln \frac{\theta^{(i)}_1}{1 - \theta^{(i)}_1} \right)^{-2}, \left( \ln \frac{\theta^{(i)}_2}{1 - \theta^{(i)}_2} \right)^{-2} \right\}.$$

Let us discuss the implications of Proposition 8. Since $C(r_t)$ does not depend on $i$, it follows from (9) that $\tilde{w}_{t,i} \propto \exp(-t(D_i(r_t) - \epsilon_{t,i}))$. We make two observations here:

• For large $t$, the term $\epsilon_{t,i}$ becomes insignificant. The particle $i$ with the lowest $D_i(r_t)$ at time $t$ is more likely to have the largest normalized weight. In this sense, the divergence $D_i(r_t)$ reflects the fitness of particle $i$ for survival: the smaller $D_i(r_t)$ is, the more fit particle $i$ is. However, we cannot simply say one particle is more fit than another without mentioning $r_t$, which is a random process. It is not clear at this point how $r_t$ evolves.

• Obviously, $r_t$ is affected by the history of the particles' weights $\{\tilde{w}_{\tau,i} : 1 \le \tau \le t-1,\ 1 \le i \le N\}$. To investigate the interplay between the particles' weights $w_t$ (or $\tilde{w}_t$) and their usage frequencies $(r_t, 1 - r_t)$, we take a look at the simplest case: two given particles.
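Algorithm 5 is short enough to simulate directly. The sketch below (our own code, following the pseudocode of Algorithm 5) illustrates the fitness effect: when one particle has strictly smaller divergence on both arms, its weight takes over.

```python
import random

def pts_two_arm(theta_star, particles, horizon, rng):
    """Run PTS (Algorithm 5): sample a particle by weight, pull its preferred
    arm, then reweight every particle by the Bernoulli likelihood of the reward."""
    n = len(particles)
    w = [1.0 / n] * n
    for _ in range(horizon):
        i = rng.choices(range(n), weights=w)[0]          # theta_t ~ weights
        arm = 0 if particles[i][0] >= particles[i][1] else 1
        r = 1 if rng.random() < theta_star[arm] else 0   # R_t ~ Bernoulli(theta*_At)
        for k in range(n):
            p = particles[k][arm]
            w[k] *= p if r == 1 else (1.0 - p)           # update (8)
        total = sum(w)
        w = [v / total for v in w]                       # normalize
    return w
```

For example, with $\theta^* = (0.9, 0.1)$ and particles $(0.9, 0.1)$ and $(0.2, 0.8)$, the first particle has zero divergence at every $r$, and its normalized weight approaches 1 over a few thousand steps.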
B.2 Two given particles

Before we discuss possible configurations of two given particles, we introduce a helpful graphical tool called the divergence diagram. A divergence diagram example is drawn in Figure 5, with the divergence of a particle $i$, $D_i(r)$ for $0 \le r \le 1$, represented by a line segment. The right (respectively, left) endpoint of the line segment is highlighted by a dot if $A(\theta^{(i)}) = 1$ (respectively, if $A(\theta^{(i)}) = 2$), that is, arm 1 (respectively, arm 2) is the optimal arm if $\theta^{(i)}$ is the true parameter. Informally speaking, the closer the line segment is to the bottom, the more fit the corresponding particle is. A line segment that coincides with the bottom line segment represents $\theta^*$ itself, because the KL divergences on both arms are zero. Note that not every line segment in the diagram corresponds to a unique particle in $[0,1]^2$, because in general it is possible to have $d(x \| y_1) = d(x \| y_2)$ with $y_1 \ne y_2$.

Figure 5: A divergence diagram example.

Consider the possible configurations of two particles in terms of their relative positions in the divergence diagram (see Figure 6).

• In case (a), the line segment of one particle is completely below that of the other particle. In this case, with probability one, the lower particle will gain all the weight. This is a trivial case.

• In case (b), the line segments of the two particles cross each other. This case can be further divided into three sub-cases, shown in (c), (d), and (e) respectively, depending on the optimal arm for each particle. In case (e), the optimal arm for both particles is the same; the problem essentially degenerates to a one-arm Bernoulli bandit problem, which is not so interesting. We will take a closer look at the remaining two cases: (c) the counter-reinforcing pair and (d) the self-reinforcing pair.

Figure 6: Possible two-particle configurations in the divergence diagram.
B.2.1 Counter-reinforcing pair

Definition 2 (Counter-reinforcing pair). For a given BernoulliBandit($K = 2$, $\theta^*$) problem, we say that two particles $\{\theta^{(1)}, \theta^{(2)}\} \subset [0,1]^2$ form a counter-reinforcing pair (CR pair) if they can be re-labeled such that the following conditions hold:
$$d(\theta^*_1 \| \theta^{(1)}_1) > d(\theta^*_1 \| \theta^{(2)}_1), \quad d(\theta^*_2 \| \theta^{(1)}_2) < d(\theta^*_2 \| \theta^{(2)}_2), \quad A(1) = \{1\}, \quad A(2) = \{2\}. \quad (13)$$
Note: the only way to re-label the two particles is to switch their labels. Without loss of generality, in the rest of this section, when we say $\{\theta^{(1)}, \theta^{(2)}\}$ form a CR pair, we mean that they have already been properly re-labeled to meet the conditions (13).

A CR pair example is shown in Figure 7. Figure 7(a) depicts the positions of $\theta^*$, $\theta^{(1)}$, and $\theta^{(2)}$ in $[0,1]^2$. Figure 7(b) depicts the divergences of the two particles. Let $\bar{r} \in (0,1)$ be such that $D_1(\bar{r}) = D_2(\bar{r})$, i.e., the point at which these two lines intersect. The definition of a CR pair guarantees that $\bar{r}$ exists and is unique.

Figure 7: A counter-reinforcing pair example. (a) Particle positions. (b) Divergences.

Consider a large time $t$ and suppose $r_t > \bar{r}$. Since $\tilde{w}_{t,i} \overset{\propto}{\sim} e^{-t D_i(r_t)}$ and $D_2(r_t) < D_1(r_t)$, we expect $w_{t,2}$ to be larger than $w_{t,1}$; thus particle 2 will be selected more often, which causes arm 2 to be pulled more often. But pulling arm 2 makes $r_t$ decrease. If $r_t$ decreases to a value less than $\bar{r}$, then by a similar argument we expect $w_{t,1}$ to become larger than $w_{t,2}$; then particle 1 will be selected more often, which makes arm 1 be pulled more often and $r_t$ increase. Therefore, these two particles are counter-reinforcing each other: selecting one particle will likely increase the weight of the other particle, and vice versa. We expect to observe that $r_t$ cannot stay too far away either above or below $\bar{r}$: the drift of $r_t$ is always toward $\bar{r}$.
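For a concrete CR pair, the equilibrium point $\bar{r}$ can be computed by intersecting the two divergence lines; the sketch below (our own code and example numbers) solves $D_1(r) = D_2(r)$ in closed form, since both divergences are linear in $r$:

```python
import math

def kl(x, y):
    """Bernoulli KL divergence d(x||y)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def r_bar(theta_star, p1, p2):
    """Solve D_1(r) = D_2(r), where D_i(r) = r*d(ts1||pi_1) + (1-r)*d(ts2||pi_2).
    Writing D_2(r) - D_1(r) = alpha*r + beta gives r_bar = -beta/alpha."""
    ts1, ts2 = theta_star
    beta = kl(ts2, p2[1]) - kl(ts2, p1[1])
    alpha = (kl(ts1, p2[0]) - kl(ts1, p1[0])) - beta
    return -beta / alpha
```

For instance, $\theta^* = (0.5, 0.5)$ with $\theta^{(1)} = (0.9, 0.45)$ and $\theta^{(2)} = (0.55, 0.8)$ satisfies the CR conditions (our own example values), giving $\bar{r}$ strictly inside $(0,1)$.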
However, we also observe through simulations that the weights of the two particles keep oscillating. The random oscillations are so strong that the drift does not make the weights converge; the weights bounce around too much to converge, but they are stochastically bounded. The above observations are formally stated in the following proposition.

Proposition 9. Given a BernoulliBandit($K = 2$, $\theta^*$) problem, suppose a given particle set $\mathcal{P}_2 = \{\theta^{(1)}, \theta^{(2)}\}$ forms a CR pair for the problem. Consider the process of running PTS($\mathcal{P}_2$) as in Algorithm 5. Let $\bar{r} \in (0,1)$ be the solution to $D_1(r) = D_2(r)$. Then $r_t \to \bar{r}$ almost surely. Also, $q_t \to (\bar{r}, 1 - \bar{r})$ and $\bar{w}_t \to (\bar{r}, 1 - \bar{r})$ almost surely.

The remainder of this section is dedicated to the proof of Proposition 9. The proof starts with constructing a sequence $\{X_t\}$, defined by
$$X_t \triangleq \ln \frac{\tilde{w}_{t,1}}{\tilde{w}_{t,2}} = \ln \frac{w_{t,1}}{w_{t,2}}.$$
Recall that, for $i = 1, 2$,
$$\tilde{w}_{t+1,i} = \tilde{w}_{t,i} P_{\theta^{(i)}_{A_{t+1}}}(R_{t+1}) = \begin{cases} \tilde{w}_{t,i}\, \theta^{(i)}_{A_{t+1}} & \text{if } R_{t+1} = 1 \\ \tilde{w}_{t,i}\, (1 - \theta^{(i)}_{A_{t+1}}) & \text{if } R_{t+1} = 0. \end{cases}$$
By the conditions in (13) that $A(1) = \{1\}$ and $A(2) = \{2\}$, $A_{t+1} = i$ iff particle $\theta^{(i)}$ is selected at time $t+1$, which occurs with probability $w_{t,i}$. So for $i = 1, 2$,
$$\tilde{w}_{t+1,i} = \begin{cases} \tilde{w}_{t,i}\, \theta^{(i)}_1 & \text{w.p. } w_{t,1} \theta^*_1 \\ \tilde{w}_{t,i}\, (1 - \theta^{(i)}_1) & \text{w.p. } w_{t,1} (1 - \theta^*_1) \\ \tilde{w}_{t,i}\, \theta^{(i)}_2 & \text{w.p. } w_{t,2} \theta^*_2 \\ \tilde{w}_{t,i}\, (1 - \theta^{(i)}_2) & \text{w.p. } w_{t,2} (1 - \theta^*_2). \end{cases}$$
Since $w_{t,1} + w_{t,2} = 1$, if we are given that $x = \ln \frac{\tilde{w}_{t,1}}{\tilde{w}_{t,2}} = \ln \frac{w_{t,1}}{w_{t,2}}$, then $w_{t,1} = \frac{e^x}{1 + e^x}$ and $w_{t,2} = \frac{1}{1 + e^x}$. It follows that
$$X_{t+1} = X_t + \begin{cases} \ln \frac{\theta^{(1)}_1}{\theta^{(2)}_1} & \text{w.p. } \frac{e^{X_t}}{1 + e^{X_t}} \theta^*_1 \\ \ln \frac{1 - \theta^{(1)}_1}{1 - \theta^{(2)}_1} & \text{w.p. } \frac{e^{X_t}}{1 + e^{X_t}} (1 - \theta^*_1) \\ \ln \frac{\theta^{(1)}_2}{\theta^{(2)}_2} & \text{w.p. } \frac{1}{1 + e^{X_t}} \theta^*_2 \\ \ln \frac{1 - \theta^{(1)}_2}{1 - \theta^{(2)}_2} & \text{w.p. } \frac{1}{1 + e^{X_t}} (1 - \theta^*_2). \end{cases} \quad (14)$$
Note that $X_0 = 0$ since $w_{0,1} = w_{0,2} = \frac{1}{2}$.
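The transition rule (14) is easy to simulate directly; the sketch below (our own code) mirrors the four increments and their probabilities, which is a convenient way to observe the oscillating-but-stochastically-bounded behavior described above:

```python
import math
import random

def simulate_X(theta_star, p1, p2, horizon, rng):
    """Simulate the chain (14): X_t = ln(w_t1 / w_t2) for two-particle PTS."""
    x = 0.0                                      # X_0 = 0 since w_0 = (1/2, 1/2)
    path = [x]
    for _ in range(horizon):
        w1 = 0.5 * (1.0 + math.tanh(x / 2.0))    # logistic(x) = e^x/(1+e^x), overflow-safe
        arm = 0 if rng.random() < w1 else 1      # particle i selected => arm i pulled
        r = 1 if rng.random() < theta_star[arm] else 0
        if r == 1:
            x += math.log(p1[arm] / p2[arm])
        else:
            x += math.log((1.0 - p1[arm]) / (1.0 - p2[arm]))
        path.append(x)
    return path
```

Plotting the returned path for a CR pair shows $X_t$ oscillating around a bounded region rather than converging, matching Lemma 10 below in spirit.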
$\{X_t\}_{t \ge 0}$ is a time-homogeneous Markov process living in a state space of infinite cardinality. Note that (14) is derived using only the conditions $A(1) = \{1\}$ and $A(2) = \{2\}$ in (13); therefore it holds even if the two particles do not form a CR pair. The dynamics of $X_t$ in (14) will be used again in the next section in the case of a self-reinforcing pair. In the next lemma, we show that $\{X_t\}$ is stochastically bounded given the CR pair conditions.

Lemma 10. Consider the process described in Proposition 9. Let $X_t \triangleq \ln\frac{\widetilde w_{t,1}}{\widetilde w_{t,2}} = \ln\frac{w_{t,1}}{w_{t,2}}$. Then, for some constants $A_0$ and $B_0$ depending on $\theta^*$ and $\mathcal{P}_2 = \{\theta^{(1)}, \theta^{(2)}\}$,
$$P\{|X_t| \ge x\} \le A_0 e^{-B_0 x} \quad \forall t \ge 1 \text{ and } x > 0.$$

Proof. The proof essentially relies on a drift-implied bound in [9] (copied as Proposition 20 in Section B.6.1 for reference). We check the two conditions of Proposition 20 for $\{X_t\}$. By (14), the drift of the process $\{X_t\}$ at time $t$ is
$$\begin{aligned}
E[X_{t+1} - X_t \mid X_t = x] &= \frac{e^x}{1+e^x}\theta^*_1 \ln\frac{\theta^{(1)}_1}{\theta^{(2)}_1} + \frac{e^x}{1+e^x}(1-\theta^*_1)\ln\frac{1-\theta^{(1)}_1}{1-\theta^{(2)}_1} + \frac{1}{1+e^x}\theta^*_2 \ln\frac{\theta^{(1)}_2}{\theta^{(2)}_2} + \frac{1}{1+e^x}(1-\theta^*_2)\ln\frac{1-\theta^{(1)}_2}{1-\theta^{(2)}_2} \\
&= \left(\frac{e^x}{1+e^x}\, d(\theta^*_1\|\theta^{(2)}_1) + \frac{1}{1+e^x}\, d(\theta^*_2\|\theta^{(2)}_2)\right) - \left(\frac{e^x}{1+e^x}\, d(\theta^*_1\|\theta^{(1)}_1) + \frac{1}{1+e^x}\, d(\theta^*_2\|\theta^{(1)}_2)\right) \\
&= D_2\!\left(\frac{e^x}{1+e^x}\right) - D_1\!\left(\frac{e^x}{1+e^x}\right) \triangleq h(x).
\end{aligned}$$
Let $f(r) \triangleq D_2(r) - D_1(r)$. Then $h(x) = f\!\left(\frac{e^x}{1+e^x}\right)$. $f(r)$ is a linear function of $r$: $f(r) = \alpha r + \beta$, where
$$\alpha = \left(d(\theta^*_1\|\theta^{(2)}_1) - d(\theta^*_2\|\theta^{(2)}_2)\right) - \left(d(\theta^*_1\|\theta^{(1)}_1) - d(\theta^*_2\|\theta^{(1)}_2)\right), \qquad \beta = d(\theta^*_2\|\theta^{(2)}_2) - d(\theta^*_2\|\theta^{(1)}_2). \tag{15}$$
Since the two particles form a CR pair, $\alpha < 0$ and $\beta > 0$. Let $\bar r = -\beta/\alpha$, which is the solution to $f(r) = 0$. It can be verified that Condition C1 of Proposition 20 is satisfied with $a = \ln\frac{1+\bar r}{1-\bar r}$ and $\epsilon_0 = \frac{1}{2}\left(d(\theta^*_1\|\theta^{(1)}_1) - d(\theta^*_1\|\theta^{(2)}_1)\right)$.
This corresponds to solving $\frac{e^a}{1+e^a} = \frac{\bar r + 1}{2}$, so $h(a) = f\!\left(\frac{\bar r+1}{2}\right) = \frac{1}{2}\left(f(\bar r) + f(1)\right) = \frac{1}{2} f(1) = -\epsilon_0$. Note that $a > 0$. To check Condition C2 of Proposition 20, let
$$x^* \triangleq \max\left\{\left|\ln\frac{\theta^{(1)}_1}{\theta^{(2)}_1}\right|, \left|\ln\frac{1-\theta^{(1)}_1}{1-\theta^{(2)}_1}\right|, \left|\ln\frac{\theta^{(1)}_2}{\theta^{(2)}_2}\right|, \left|\ln\frac{1-\theta^{(1)}_2}{1-\theta^{(2)}_2}\right|\right\},$$
and let the random variable $Z = x^*$ with probability 1. Then obviously $(|X_{t+1} - X_t| \mid X_t) \prec Z$. Choose $\lambda = 1$ (any positive value works); then
$$D = E[e^{\lambda Z}] = e^{x^*} = \max\left(\frac{\theta^{(1)}_1}{\theta^{(2)}_1}, \frac{\theta^{(2)}_1}{\theta^{(1)}_1}, \frac{1-\theta^{(1)}_1}{1-\theta^{(2)}_1}, \frac{1-\theta^{(2)}_1}{1-\theta^{(1)}_1}, \frac{\theta^{(1)}_2}{\theta^{(2)}_2}, \frac{\theta^{(2)}_2}{\theta^{(1)}_2}, \frac{1-\theta^{(1)}_2}{1-\theta^{(2)}_2}, \frac{1-\theta^{(2)}_2}{1-\theta^{(1)}_2}\right). \tag{16}$$
Note that $D > 1$. Condition C2 of Proposition 20 is satisfied. Since $c \ge \frac{E[e^{\lambda Z}] - 1 - \lambda E[Z]}{\lambda^2} = D - 1 - x^*$, we can choose the following constants:
$$c = D, \qquad \eta = \min\left\{1, \frac{\epsilon_0}{2c}\right\}, \qquad \rho = 1 - \tfrac{1}{2}\eta\epsilon_0.$$
Note that $0 = X_0 \le a$. Applying Proposition 20, we have
$$P\{X_t \ge x\} \le \frac{D}{1-\rho}\, e^{-\eta(x-a)} = A_1 e^{-B_1 x} \quad \forall t,\, x > 0, \tag{17}$$
where $A_1 = \frac{D}{1-\rho}\, e^{\eta a} = \frac{2D}{\eta\epsilon_0}\, e^{\eta a} = \frac{2D}{\eta\epsilon_0}\left(\frac{1+\bar r}{1-\bar r}\right)^{\eta}$ and $B_1 = \eta$. Applying the same analysis to the sequence $\{-X_t\}_{t \ge 0}$ with the constants $a' = \ln\frac{2-\bar r}{\bar r}$, $\epsilon'_0 = \frac{1}{2}\left(d(\theta^*_2\|\theta^{(2)}_2) - d(\theta^*_2\|\theta^{(1)}_2)\right)$, $\lambda = 1$, $D$ as in (16), $c = D$, $\eta' = \min\left\{\lambda, \frac{\epsilon'_0}{2c}\right\}$ and $\rho' = 1 - \frac{1}{2}\eta'\epsilon'_0$, we get
$$P\{-X_t \ge x\} \le \frac{D}{1-\rho'}\, e^{-\eta'(x-a')} = A_2 e^{-B_2 x} \quad \forall t,\, x > 0, \tag{18}$$
where $A_2 = \frac{D}{1-\rho'}\, e^{\eta' a'} = \frac{2D}{\eta'\epsilon'_0}\, e^{\eta' a'} = \frac{2D}{\eta'\epsilon'_0}\left(\frac{2-\bar r}{\bar r}\right)^{\eta'}$ and $B_2 = \eta'$. Letting $A_0 = 2\max\{A_1, A_2\}$ and $B_0 = \min\{B_1, B_2\}$ and combining (17) and (18), we get
$$P\{|X_t| \ge x\} \le A_0 e^{-B_0 x} \quad \forall t \text{ and } x > 0.$$

We are now ready to prove Proposition 9. Roughly speaking, since $\ln \widetilde w_{t,i} \approx -t D_i(r_t)$, we have $X_t = \ln\frac{\widetilde w_{t,1}}{\widetilde w_{t,2}} \approx t\left(D_2(r_t) - D_1(r_t)\right)$.
The stochastic boundedness of $X_t$ then implies the stochastic boundedness of $t\,|D_2(r_t) - D_1(r_t)|$. So for large $t$, $D_2(r_t) - D_1(r_t)$ is close to zero and hence $r_t$ is close to $\bar r$. We show that $r_t$ converges to $\bar r$ in probability, which, combined with the Borel-Cantelli lemma, yields almost sure convergence. The convergence of $q_t$ and $\bar w_t$ follows naturally.

Proof of Proposition 9. Recall that $f(r) = D_2(r) - D_1(r) = \alpha r + \beta$ for $\alpha$ and $\beta$ given in (15), and $f(\bar r) = 0$. So
$$|f(r_t)| = |f(r_t) - f(\bar r)| = |(\alpha r_t + \beta) - (\alpha \bar r + \beta)| = |\alpha|\,|r_t - \bar r|.$$
Therefore, for any $\delta > 0$,
$$P\{|r_t - \bar r| \ge \delta\} = P\{|f(r_t)| \ge |\alpha|\delta\} \le P\left\{|f(r_t) + \epsilon_{t,1} - \epsilon_{t,2}| \ge \tfrac{|\alpha|\delta}{3}\right\} + P\left\{|\epsilon_{t,1}| \ge \tfrac{|\alpha|\delta}{3}\right\} + P\left\{|\epsilon_{t,2}| \ge \tfrac{|\alpha|\delta}{3}\right\}.$$
But
$$f(r_t) + \epsilon_{t,1} - \epsilon_{t,2} = D_2(r_t) - D_1(r_t) + \epsilon_{t,1} - \epsilon_{t,2} = \left(-D_1(r_t) + \epsilon_{t,1} + C(r_t)\right) - \left(-D_2(r_t) + \epsilon_{t,2} + C(r_t)\right) \overset{(i)}{=} \frac{1}{t}\ln\widetilde w_{t,1} - \frac{1}{t}\ln\widetilde w_{t,2} = \frac{1}{t}\ln\frac{\widetilde w_{t,1}}{\widetilde w_{t,2}} = \frac{1}{t} X_t,$$
where step (i) is due to Proposition 8. Therefore, by Proposition 8 and Lemma 10,
$$P\{|r_t - \bar r| \ge \delta\} \le P\left\{|X_t| \ge \tfrac{|\alpha|\delta t}{3}\right\} + P\left\{|\epsilon_{t,1}| \ge \tfrac{|\alpha|\delta}{3}\right\} + P\left\{|\epsilon_{t,2}| \ge \tfrac{|\alpha|\delta}{3}\right\} \le A_0 e^{-\frac{B_0|\alpha|\delta t}{3}} + 4te^{-\frac{B_{\theta^{(1)}}|\alpha|^2\delta^2}{9}t} + 4te^{-\frac{B_{\theta^{(2)}}|\alpha|^2\delta^2}{9}t} \le Ate^{-B\delta^2 t},$$
where $A = 3\max\{A_0, 4\}$ and $B = \min\left\{\frac{B_0|\alpha|}{3}, \frac{B_{\theta^{(1)}}|\alpha|^2}{9}, \frac{B_{\theta^{(2)}}|\alpha|^2}{9}\right\}$. It follows that
$$\sum_{t=1}^{\infty} P\{|r_t - \bar r| \ge \delta\} \le \sum_{t=1}^{\infty} Ate^{-B\delta^2 t} = Ae^{-B\delta^2}\sum_{t=1}^{\infty} te^{-B\delta^2(t-1)} = \frac{Ae^{-B\delta^2}}{\left(1 - e^{-B\delta^2}\right)^2} < \infty.$$
By the Borel-Cantelli lemma, $P\{|r_t - \bar r| \ge \delta \text{ i.o.}\} = 0$ for any $\delta > 0$. It follows that $r_t \to \bar r$ almost surely as $t \to \infty$. Since arm 1 (resp. arm 2) is chosen iff particle 1 (resp. particle 2) is chosen, $q_t = (r_t, 1-r_t)$. So $q_t \to (\bar r, 1-\bar r)$.
Finally, since $I_t \sim w_{t-1} = (w_{t-1,1}, \cdots, w_{t-1,N})$, we have $\mathbf{1}\{I_t = i\} \sim \text{Bernoulli}(w_{t-1,i})$. For $i = 1, 2$, by the Azuma-Hoeffding inequality, for any $\gamma > 0$,
$$\Pr\{|q_{t,i} - \bar w_{t-1,i}| \ge \gamma\} = \Pr\left\{\left|\frac{1}{t}\sum_{\tau=1}^{t}\mathbf{1}\{I_\tau = i\} - \frac{1}{t}\sum_{\tau=0}^{t-1} w_{\tau,i}\right| \ge \gamma\right\} = \Pr\left\{\left|\sum_{\tau=1}^{t}\left(\mathbf{1}\{I_\tau = i\} - w_{\tau-1,i}\right)\right| \ge t\gamma\right\} \le 2\exp\left(-\frac{2(t\gamma)^2}{t}\right) = 2e^{-2\gamma^2 t},$$
which is summable in $t$. Applying the Borel-Cantelli lemma again, we get $|q_t - \bar w_{t-1}| \to 0$ with probability one. So $\bar w_t \to (\bar r, 1-\bar r)$.

B.2.2 Self-reinforcing pair

Definition 3. (Self-reinforcing pair) For a given BernoulliBandit($K=2$, $\theta^*$) problem, we say two particles $\theta^{(1)}, \theta^{(2)} \in [0,1]^2$ form a self-reinforcing pair (SR pair) if they can be relabeled such that the following conditions hold:
$$d(\theta^*_1\|\theta^{(1)}_1) < d(\theta^*_1\|\theta^{(2)}_1), \qquad d(\theta^*_2\|\theta^{(1)}_2) > d(\theta^*_2\|\theta^{(2)}_2), \qquad A(1) = \{1\}, \qquad A(2) = \{2\}. \tag{19}$$
Without loss of generality, in this section when we say particles $\theta^{(1)}$ and $\theta^{(2)}$ are an SR pair, we assume they have already been properly labeled so that they satisfy (19). An SR pair example is drawn in Figure 8.

Consider a large time $t$. Since $\widetilde w_{t,i} \mathrel{\overset{\propto}{\sim}} e^{-tD_i(r_t)}$, if $r_t > \bar r$, with high probability particle 1 will be selected more often, which will cause $r_t$ to increase further. If $r_t < \bar r$, then with high probability particle 2 will be selected more often, which will cause $r_t$ to decrease further. Therefore, each of the two particles is self-reinforcing: selecting one particle will likely increase the weight of that particle, which makes it more likely to be selected again. Each particle behaves like a black hole. We expect that, in the end, either particle 1 or particle 2 gains all the weight. Which of the two particles wins out is random and is influenced by the initial condition.
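This winner-take-all behavior can be seen in simulation. The sketch below uses a hypothetical pair of particles satisfying the SR conditions (19) (particle 1 close to $\theta^*$ on arm 1 and far on arm 2, particle 2 the other way around) and runs the log-odds chain starting from the unstable equilibrium $x^* = \ln\frac{\bar r}{1-\bar r}$: some sample paths escape to $+\infty$ and others to $-\infty$.

```python
import math, random

def kl(p, q):
    """Bernoulli KL divergence d(p||q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def sigmoid(x):
    """Numerically stable e^x / (1 + e^x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Hypothetical SR-pair instance: illustrative values satisfying (19).
theta_star = (0.7, 0.4)
p1 = (0.72, 0.2)   # close on arm 1, far on arm 2; optimal arm 1
p2 = (0.3, 0.42)   # far on arm 1, close on arm 2; optimal arm 2

alpha = (kl(theta_star[0], p2[0]) - kl(theta_star[0], p1[0])) \
      + (kl(theta_star[1], p1[1]) - kl(theta_star[1], p2[1]))
beta = kl(theta_star[1], p2[1]) - kl(theta_star[1], p1[1])
r_bar = -beta / alpha                      # unstable zero of the drift
x_star = math.log(r_bar / (1 - r_bar))

def run(T, seed):
    """Run the log-odds chain from the unstable point; return X_T."""
    rng = random.Random(seed)
    X = x_star
    for _ in range(T):
        w1 = sigmoid(X)
        arm = 1 if rng.random() < w1 else 2
        reward = rng.random() < theta_star[arm - 1]
        a, b = p1[arm - 1], p2[arm - 1]
        X += math.log(a / b) if reward else math.log((1 - a) / (1 - b))
    return X

ends = [run(3000, seed) for seed in range(100)]
ups = sum(x > 0 for x in ends)
downs = sum(x < 0 for x in ends)
```

Across seeds, every run leaves the neighborhood of $x^*$, and both absorbing outcomes occur, matching the two cases of Proposition 11 below.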
We state this observation more formally in the following proposition.

Proposition 11. Given a problem BernoulliBandit($K=2$, $\theta^*$) and a particle set $\mathcal{P}_2 = \{\theta^{(1)}, \theta^{(2)}\}$, suppose $\{\theta^{(1)}, \theta^{(2)}\}$ forms an SR pair for the problem. Consider the process of running PTS($\mathcal{P}_2$) as in Algorithm 5. Let $X_t = \ln\frac{\widetilde w_{t,1}}{\widetilde w_{t,2}} = \ln\frac{w_{t,1}}{w_{t,2}}$ for $t \ge 0$. Then, with probability one, one of the following two cases happens:

1. $X_t \to \infty$, $q_t \to (1, 0)$, $w_t \to (1, 0)$ and $r_t \to 1$.

2. $X_t \to -\infty$, $q_t \to (0, 1)$, $w_t \to (0, 1)$ and $r_t \to 0$.

Figure 8: A self-reinforcing pair example. (a) Particle positions. (b) Divergences.

The remainder of this section is dedicated to the proof of Proposition 11. We first define the notion of stochastic asymptotic stability, which will be used in the proof.

Definition 4. Let $\{X_n\}_{n \ge 0}$ be a discrete-time Markov process with state space $\mathbb{R}$.

1. We say that $x \in \mathbb{R}$ is stochastically asymptotically stable (SAS) for $\{X_n\}$ if for any $\epsilon > 0$, there exists $\delta > 0$ such that if $|X_{n_0} - x| \le \delta$ for some $n_0$, then $\Pr\{|X_n - x| \le \epsilon\ \forall n \ge n_0 \mid X_{n_0}\} \ge 1-\epsilon$ and $\Pr\{\{|X_n - x| \le \epsilon\ \forall n \ge n_0\} \setminus \{X_n \to x\} \mid X_{n_0}\} = 0$.

2. We say that $-\infty$ is SAS for $\{X_n\}$ if for any $L \in \mathbb{R}$ and $\epsilon > 0$, there exists $L_0 \in \mathbb{R}$ such that if $X_{n_0} \le L_0$ for some $n_0$, then $\Pr\{X_n \le L\ \forall n \ge n_0 \mid X_{n_0}\} \ge 1-\epsilon$ and $\Pr\{\{X_n \le L\ \forall n \ge n_0\} \setminus \{X_n \to -\infty\} \mid X_{n_0}\} = 0$.

3. We say that $+\infty$ is SAS for $\{X_n\}$ if for any $L \in \mathbb{R}$ and $\epsilon > 0$, there exists $L_0 \in \mathbb{R}$ such that if $X_{n_0} \ge L_0$ for some $n_0$, then $\Pr\{X_n \ge L\ \forall n \ge n_0 \mid X_{n_0}\} \ge 1-\epsilon$ and $\Pr\{\{X_n \ge L\ \forall n \ge n_0\} \setminus \{X_n \to \infty\} \mid X_{n_0}\} = 0$.

The second condition in the 1st (resp. 2nd or 3rd) definition above means that, given $X_{n_0}$, if $X_n$ stays close to $x$ (resp. $-\infty$, $+\infty$) from $n_0$ onward, then $X_n$ converges to $x$ (resp. $-\infty$, $+\infty$).
Intuitively, an SAS point is like a black hole: if the process gets close enough to the point, then with high probability it will be trapped around the point and eventually pulled into it. We start the proof of Proposition 11 with the following lemma.

Lemma 12. The process $\{X_t\}$ described in Proposition 11 is a Markov process. Moreover, it can be represented as $X_{t+1} = X_t + U_{t+1}$, where the distribution of $U_{t+1}$ is determined by $X_t$ and it satisfies:

(a) $|U_t| \le C$ for all $t \ge 1$,

(b) $E[U_{t+1} \mid X_t = x] \le -\mu_1$ whenever $x \le C_1$,

(c) $E[U_{t+1} \mid X_t = x] \ge \mu_2$ whenever $x \ge C_2$,

for some constants $\mu_1 > 0$, $\mu_2 > 0$, $C$, $C_1$ and $C_2$ that depend on $\theta^*$ and $\mathcal{P}_2$.

Proof. By the recursive update formula for $\widetilde w_t$ in (8) and the conditions $A(1) = \{1\}$ and $A(2) = \{2\}$ in (19), we obtain the same dynamics for $X_t$ as in (14), so that $X_{t+1} = X_t + U_{t+1}$, where $U_{t+1}$, the increment of the process $\{X_t\}$ at time $t$, is given by
$$U_{t+1} = \begin{cases} \ln\frac{\theta^{(1)}_1}{\theta^{(2)}_1} & \text{w.p. } \frac{e^{X_t}}{1+e^{X_t}}\,\theta^*_1, \\ \ln\frac{1-\theta^{(1)}_1}{1-\theta^{(2)}_1} & \text{w.p. } \frac{e^{X_t}}{1+e^{X_t}}\,(1-\theta^*_1), \\ \ln\frac{\theta^{(1)}_2}{\theta^{(2)}_2} & \text{w.p. } \frac{1}{1+e^{X_t}}\,\theta^*_2, \\ \ln\frac{1-\theta^{(1)}_2}{1-\theta^{(2)}_2} & \text{w.p. } \frac{1}{1+e^{X_t}}\,(1-\theta^*_2) \end{cases} \tag{20}$$
for $t \ge 0$. Clearly, $\{X_t\}_{t \ge 0}$ is a Markov process and the distribution of $U_{t+1}$ is determined by $X_t$. Property (a) is easily satisfied by setting
$$C \triangleq \max\left\{\left|\ln\frac{\theta^{(1)}_1}{\theta^{(2)}_1}\right|, \left|\ln\frac{1-\theta^{(1)}_1}{1-\theta^{(2)}_1}\right|, \left|\ln\frac{\theta^{(1)}_2}{\theta^{(2)}_2}\right|, \left|\ln\frac{1-\theta^{(1)}_2}{1-\theta^{(2)}_2}\right|\right\}.$$
Let $h(x) \triangleq E[U_{t+1} \mid X_t = x]$. It can be shown that $h(x) = \alpha\frac{e^x}{1+e^x} + \beta$, where
$$\alpha = \left(d(\theta^*_1\|\theta^{(2)}_1) - d(\theta^*_1\|\theta^{(1)}_1)\right) + \left(d(\theta^*_2\|\theta^{(1)}_2) - d(\theta^*_2\|\theta^{(2)}_2)\right), \qquad \beta = d(\theta^*_2\|\theta^{(2)}_2) - d(\theta^*_2\|\theta^{(1)}_2).$$
By conditions (19), $\alpha > 0$ and $\beta < 0$. Let $f(r) = \alpha r + \beta$, $0 \le r \le 1$.
The graph of $f(r)$ is shown below. At $r = \bar r = -\beta/\alpha$, $f(r) = 0$. Let
$$\mu_1 = \frac{d(\theta^*_2\|\theta^{(1)}_2) - d(\theta^*_2\|\theta^{(2)}_2)}{2} \qquad \text{and} \qquad \mu_2 = \frac{d(\theta^*_1\|\theta^{(2)}_1) - d(\theta^*_1\|\theta^{(1)}_1)}{2}.$$
Then $f(r) \le -\mu_1$ whenever $0 \le r \le \frac{\bar r}{2}$ and $f(r) \ge \mu_2$ whenever $\frac{\bar r + 1}{2} \le r \le 1$. Setting $\frac{e^{C_1}}{1+e^{C_1}} = \frac{\bar r}{2}$ and $\frac{e^{C_2}}{1+e^{C_2}} = \frac{\bar r + 1}{2}$, we get
$$C_1 = \ln\frac{\bar r}{2 - \bar r} = \ln\frac{-\beta}{2\alpha + \beta} \qquad \text{and} \qquad C_2 = \ln\frac{1 + \bar r}{1 - \bar r} = \ln\frac{\alpha - \beta}{\alpha + \beta}.$$
Since $h(x) = f\!\left(\frac{e^x}{1+e^x}\right)$ and $h(x)$ is monotonically increasing in $x$, we have $h(x) \le -\mu_1$ whenever $x \le C_1$ and $h(x) \ge \mu_2$ whenever $x \ge C_2$.

Lemma 13. The process $\{X_t\}$ described in Proposition 11 has $+\infty$ and $-\infty$ as two SAS points.

Proof. First, we show that $-\infty$ is SAS for $\{X_t\}$. Consider any given $L \in \mathbb{R}$ and $\epsilon > 0$. Without loss of generality, we can assume $L \le C_1$ and choose $L_0 = L - \frac{C^2}{2\mu_1}\ln\frac{1}{\epsilon}$, where $C_1$ and $C$ are given in Lemma 12.⁷ Define $T \triangleq \min\{t > 0 : X_t > L\}$ to be the crossing time, the first time the process $\{X_t\}$ crosses above the threshold $L$. By convention, if $\{X_t > L\}$ never happens, $T = \infty$. Define a random sequence $\{\widetilde X_t\}_{t \ge 0}$ by $\widetilde X_0 = X_0$ and
$$\widetilde X_t = \begin{cases} X_t & \text{if } 1 \le t \le T, \\ \widetilde X_{t-1} - \mu_1 & \text{if } t > T. \end{cases}$$
Let $\widetilde U_{t+1} = \widetilde X_{t+1} - \widetilde X_t$; then
$$\widetilde U_t = \begin{cases} U_t & \text{if } 1 \le t \le T, \\ -\mu_1 & \text{if } t > T. \end{cases}$$
By Lemma 12 and the above construction, $E[\widetilde U_{t+1} \mid \widetilde X_t] \le -\mu_1 < 0$ and $|\widetilde U_t| \le C$ for all $t$. It immediately follows from the law of large numbers that $\widetilde X_t \to -\infty$ with probability one. Also, if $\widetilde X_0 \le L_0$, then
$$\Pr\{\widetilde X_t \le L\ \forall t \mid \widetilde X_0\} = \Pr\left\{\max_{t \ge 0} \widetilde X_t \le L \,\Big|\, \widetilde X_0\right\} = \Pr\left\{\max_{t \ge 0}(\widetilde X_t - L_0) \le L - L_0 \,\Big|\, \widetilde X_0\right\} = \Pr\left\{\max_{t \ge 0}(\widetilde X_t - L_0) \le \frac{C^2}{2\mu_1}\ln\frac{1}{\epsilon} \,\Big|\, \widetilde X_0\right\} \overset{(i)}{\ge} 1 - \exp\left(-\frac{2\mu_1}{C^2}\cdot\frac{C^2}{2\mu_1}\ln\frac{1}{\epsilon}\right) = 1 - \epsilon,$$
where inequality (i) is due to Proposition 23 (see Appendix B.6.2). Note that $\{X_t \le L\ \forall t\} = \{\widetilde X_t \le L\ \forall t\}$, and under this event, $\{X_t\}_{t \ge 0} = \{\widetilde X_t\}_{t \ge 0}$.
It follows that $\Pr\{X_t \le L\ \forall t \mid X_0\} = \Pr\{\widetilde X_t \le L\ \forall t \mid \widetilde X_0\} \ge 1 - \epsilon$ and
$$\Pr\left\{\{X_t \le L\ \forall t\} \setminus \{X_t \to -\infty\} \,\big|\, X_0\right\} = \Pr\left\{\{X_t \le L\ \forall t\} \cap \{X_t \to -\infty\}^c \,\big|\, X_0\right\} = \Pr\left\{\{X_t \le L\ \forall t\} \cap \{\widetilde X_t \to -\infty\}^c \,\big|\, \widetilde X_0\right\} \le \Pr\left\{\{\widetilde X_t \to -\infty\}^c \,\big|\, \widetilde X_0\right\} = 0.$$

[Footnote 7: If $L > C_1$, we can choose $L_0 = C_1 - \frac{C^2}{2\mu_1}\ln\frac{1}{\epsilon}$. Then by the same argument in this proof, we can show that $\Pr\{X_t \le C_1\ \forall t \mid X_0\} \ge 1-\epsilon$, which still implies $\Pr\{X_t \le L\ \forall t \mid X_0\} \ge 1-\epsilon$.]

We conclude that $-\infty$ is SAS for $\{X_t\}$. By a similar argument, using properties (a) and (c) of Lemma 12 and Corollary 24 (see Appendix B.6.2), we can show that $+\infty$ is SAS for $\{X_t\}$.

We are now ready to prove Proposition 11.

Proof of Proposition 11. Fix $\epsilon = 0.5$ (any positive $\epsilon$ will do) and some $L_1, R_1 \in \mathbb{R}$ such that $L_1 \le C_1 \le C_2 \le R_1$. By Lemma 13, there exist $L_2 < L_1$ and $R_2 > R_1$ such that:

(1) If $X_{t_0} \le L_2$ for some $t_0$, then $\Pr\{X_t \le L_1\ \forall t \ge t_0 \mid X_{t_0}\} \ge 0.5$, and $X_t \le L_1\ \forall t \ge t_0$ implies $X_t \to -\infty$.

(2) If $X_{t_0} \ge R_2$ for some $t_0$, then $\Pr\{X_t \ge R_1\ \forall t \ge t_0 \mid X_{t_0}\} \ge 0.5$, and $X_t \ge R_1\ \forall t \ge t_0$ implies $X_t \to \infty$.

For a better illustration, see the figure below. Two observations:

• If $X_{t_0}$ ever moves outside of the interval $(L_2, R_2)$ for some $t_0$, then with probability at least $0.5$, $X_t$ stays $\le L_1$ or $\ge R_1$ for all $t \ge t_0$ and converges to $-\infty$ or $\infty$.

• If $X_{t_0}$ is inside the interval $(L_2, R_2)$ for some $t_0$, then within a fixed number $M$ of subsequent steps, with a strictly positive probability $\delta$, $X_t$ will move outside of $[L_2, R_2]$. To see this, consider the following. Since the two particles form an SR pair, $\theta^{(1)}_1 \ne \theta^{(2)}_1$. We can assume without loss of generality that $\theta^{(1)}_1 > \theta^{(2)}_1$.
By the form of the distribution of the step $U_{t+1}$ in (20), if $X_t \in (L_2, R_2)$, then within the next $M = \left\lceil \frac{R_2 - L_2}{\ln\left(\theta^{(1)}_1/\theta^{(2)}_1\right)} \right\rceil$ steps, with probability at least $\delta = \left(\frac{e^{L_2}}{1+e^{L_2}}\,\theta^*_1\right)^M > 0$, $X_t$ will become $\ge R_2$. Consider the following:

(a) Observe the process $\{X_t\}$ from $t = 0$. If $X_t$ always stays below $L_1$ or above $R_1$, then it converges to $-\infty$ or $\infty$.

(b) If $X_t$ ever moves into the interval $(L_1, R_1)$ (hence also into $(L_2, R_2)$), we start the following trial: observe whether $X_t$ becomes $\le L_2$ or $\ge R_2$ within the next $M$ steps, and if it does, observe whether it stays $\le L_1$ or $\ge R_1$ onward forever. The trial fails if $X_t$ does not become $\le L_2$ or $\ge R_2$ within the next $M$ steps, or if it does but afterwards enters the interval $(L_1, R_1)$ at some time. By the above two observations, this trial is successful with probability at least $0.5\delta > 0$. The failure of the trial, if it ever happens, can be detected in a finite number of steps.

(c) If the above trial fails, we start the next trial, same as the one in (b), which is also successful with probability at least $0.5\delta$. Repeat this trial process whenever a trial fails.

(d) Since $0.5\delta > 0$, one trial will eventually be successful with probability one.

We conclude that $X_t$ converges either to $-\infty$ or $\infty$ with probability one. In either case, the convergences of $q_t$, $w_t$ and $r_t$ are obvious.

B.3 N given particles: asymptotic behavior

We now turn to the case of $N$ given particles. The question is: which particles can survive? Let us start with a discussion of a representative example of a four-particle configuration in Figure 9. We discuss how the weights of the particles change, based on our understanding of the two-particle case from the previous section.

Figure 9: An example of four particles.
In the divergence diagram in Figure 9, we divide the bottom interval $[0,1]$ into three intervals, $[0, r]$, $[r, s]$ and $[s, 1]$, based on the intersections of the line segments of particles 1, 2 and 3 (it will soon be clear why we ignore particle 4). Recalling Proposition 8 again, we have $\widetilde w_{t,i} \mathrel{\overset{\propto}{\sim}} e^{-tD_i(r_t)}$. For large $t$, if $r_t \in (0, r)$, particle 1 will tend to dominate, and $r_t$ will drift to the right; if $r_t \in (r, s)$, particle 2 will tend to dominate, and $r_t$ will drift to the left; if $r_t \in (s, 1)$, particle 3 will tend to dominate, and $r_t$ will drift to the right.

• If $r_t$ stays around $r$ for a long time, then the weights of particles 3 and 4 will eventually become negligible. The system essentially reduces to particles 1 and 2, which form a CR pair. By the discussion and results in Section B.2.1, we expect that $\ln\frac{w_{t,1}}{w_{t,2}}$ oscillates but is stochastically bounded, while $\ln\frac{w_{t,1}}{w_{t,3}} \to \infty$ and $\ln\frac{w_{t,1}}{w_{t,4}} \to \infty$. Also, we expect that $q_t \to (r, 1-r, 0, 0)$, $\bar w_t \to (r, 1-r, 0, 0)$ and $r_t \to r$.

• If $r_t$ stays close to 1 for a long time, then the weights of particles 1, 2 and 4 become negligible and the system essentially reduces to the single particle 3. Thus, when $r_t > s$, particle 3 is self-reinforcing. We expect that $q_t \to (0, 0, 1, 0)$, $w_t \to (0, 0, 1, 0)$ and $r_t \to 1$.

Therefore, we expect that $r_t$ converges to either $r$ or 1. In either case, we expect only one or two particles to survive in the end.

We now state the ideas in the above discussion more formally for general $N$ fixed particles. Consider a two-arm Bernoulli bandit problem with parameter $\theta^*$ and a given set of $N$ particles $\mathcal{P}_N$. Define $D_o(r) \triangleq \min_{i \in \{1,\cdots,N\}} D_i(r)$. Let $D_o$ be an abbreviation of the curve $\{D_o(r) : r \in [0,1]\}$ and let $D_i$ be an abbreviation of the line segment $\{D_i(r) : r \in [0,1]\}$.
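Numerically, $D_o$ and the intervals on which each particle achieves the minimum are easy to compute. The sketch below uses a hypothetical $\theta^*$ and four illustrative particles (not the values behind Figure 9): following the form of $D_i$ used in the drift computation of Lemma 10, each $D_i$ is the line segment from $(0,\, d(\theta^*_2\|\theta^{(i)}_2))$ to $(1,\, d(\theta^*_1\|\theta^{(i)}_1))$, and scanning the argmin along a grid recovers the pieces of the lower envelope and the points where the minimizing particle changes (the breakpoints and dominant particles defined below).

```python
import math

def kl(p, q):
    """Bernoulli KL divergence d(p||q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Hypothetical instance: four illustrative particles, chosen so that the
# lower envelope has three pieces and particle 4 is never on it.
theta_star = (0.7, 0.4)
particles = [(0.4, 0.41), (0.75, 0.55), (0.72, 0.6), (0.5, 0.6)]

def D(i, r):
    """D_i(r): line segment of particle i in the divergence diagram,
    from (0, d(theta*_2 || theta^(i)_2)) to (1, d(theta*_1 || theta^(i)_1))."""
    th = particles[i]
    return r * kl(theta_star[0], th[0]) + (1 - r) * kl(theta_star[1], th[1])

# Scan the lower envelope D_o(r) = min_i D_i(r) on a fine grid and record
# where the minimizing (dominant) particle changes: the interior breakpoints.
grid = [k / 1000 for k in range(1, 1000)]
dominant_path = [min(range(len(particles)), key=lambda i: D(i, r)) for r in grid]

breakpoints, dominants = [], [dominant_path[0]]
for r, prev, cur in zip(grid[1:], dominant_path, dominant_path[1:]):
    if cur != prev:
        breakpoints.append(r)
        dominants.append(cur)
```

For this instance the envelope has two interior breakpoints (near $0.20$ and $0.87$), with particles 1, 2, 3 (0-indexed as 0, 1, 2) dominant on the three intervals and particle 4 never dominant, mirroring the role of particle 4 in the Figure 9 discussion.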
Graphically, $D_o$ is the bottom piecewise-linear curve formed by the line segments of the involved particles in the divergence diagram. We make the following assumptions about the particles.

Assumption 3. Assume that $\theta^* \in [0,1]^2$ and $\mathcal{P}_N \subset [0,1]^2$ satisfy:

1. There do not exist two different particles $i, j$ such that $D_i = D_j$.

2. $|\{i : D_i(r) = D_o(r)\}| \le 2$ for all $r \in (0,1)$.

The first assumption above means that each line segment in the divergence diagram represents one unique particle. The second assumption means that no point on the curve $D_o$ is shared by more than two particles, except possibly at the boundaries. Both assumptions hold with probability one if the $N$ particles are generated uniformly at random. For the rest of this section, we assume Assumption 3 holds.⁸

The breakpoints and their associated particles for $D_o$ are defined as follows.

Definition 5. A point $r \in [0,1]$ is a breakpoint for $D_o$ if it is a boundary point (i.e., 0 or 1), or it is where two different particles intersect on $D_o$ (i.e., $D_o(r) = D_i(r) = D_j(r)$ for some $i \ne j$). Each breakpoint is associated with a set of one or two particles:

• If $r \in (0,1)$ is a breakpoint where $D_o(r) = D_i(r) = D_j(r)$ for some $i \ne j$, then its associated particles are $\{i, j\}$.

• The breakpoint 0 has one associated particle $i_0$, which is the particle such that there exists some $\epsilon > 0$ such that $D_{i_0}(\delta) < D_i(\delta)$ for all $i \ne i_0$ and all $\delta \in (0, \epsilon)$.

• The breakpoint 1 has one associated particle $i_1$, which is the particle such that there exists some $\epsilon > 0$ such that $D_{i_1}(1-\delta) < D_i(1-\delta)$ for all $i \ne i_1$ and all $\delta \in (0, \epsilon)$.

Definition 6. Let $\xi \in (0,1)$ be a non-breakpoint for $D_o$. The dominant particle at $\xi$ for the process $\{r_t\}$ is the particle $i$ such that $D_i(\xi) = \min_{j \in [N]} D_j(\xi)$, i.e., $D_i(\xi) = D_o(\xi)$.
If $\xi$ is contained in $(r, s)$, where $r, s$ are two neighboring breakpoints for $D_o$, we also say $i$ is the dominant particle for the interval $(r, s)$ for the process $\{r_t\}$. By Proposition 8, if $r_t$ stays around a non-breakpoint $\xi \in (0,1)$ for a long time, the weight of the corresponding dominant particle tends to increase exponentially. In that sense the particle dominates the other particles.

Example 4. To illustrate the above definitions, see the example of six particles in the divergence diagram in Figure 10. In this example, the breakpoints are $\{0, r, s, 1\}$ and their associated particles are $0 \to \{1\}$, $r \to \{1, 2\}$, $s \to \{2, 3\}$ and $1 \to \{3\}$, respectively. The dominant particles for the intervals $(0, r)$, $(r, s)$, $(s, 1)$ are particles 1, 2, 3, respectively.

[Footnote 8: Even if Assumption 3 does not hold, i.e., if two different particles have the same line segment or if more than two particles intersect at some point on $D_o$, we expect that Conjecture 14 is still true, perhaps with some minor modifications of the related definitions. But since we do not have any rigorous results for these scenarios, and since they are not useful in practice, we deem it reasonable to proceed with Assumption 3.]

Figure 10: An example of six particles.

Definition 7. The contraction set for the $\{r_t\}$ process, denoted by $\mathcal{R}$, is defined as follows. A value $r \in [0,1]$ is in $\mathcal{R}$ if one of the following is true:

1. $r = 0$ and $A(i_0) = 2$, where $i_0$ is the associated particle for breakpoint 0.

2. $r = 1$ and $A(i_1) = 1$, where $i_1$ is the associated particle for breakpoint 1.

3. $r \in (0,1)$ is a breakpoint and particles $\{i, j\}$ form a CR pair, where $i, j$ are the associated particles for $r$.

For the example in Figure 10, $\mathcal{R} = \{r, 1\}$.

Remark. Note that once $\theta^*$ and $\mathcal{P}_N$ are given, $\mathcal{R}$ is determined, even before PTS runs.

Conjecture 14.
Consider a given problem BernoulliBandit($K=2$, $\theta^*$) and a particle set $\mathcal{P}_N$ that satisfy Assumption 3. Consider the process of running PTS($\mathcal{P}_N$) as in Algorithm 5. Let $\mathcal{R}$ be the contraction set for the $\{r_t\}$ process. Then $\mathcal{R}$ is non-empty and, with probability one, $r_t \to r$ for some $r \in \mathcal{R}$, and the one or two particles associated with the breakpoint $r$ survive, while all other particles' weights converge to zero.

A proof of this conjecture might begin with analyzing a properly defined $(N-1)$-dimensional Markov process of the particles' weights (just as for the two-particle case we analyzed a one-dimensional Markov process). We do not have a proof of the conjecture, although its truth is strongly indicated by the discussion at the beginning of this section and by empirical evidence.

The major take-away lesson of this section is that, under Assumption 3, no more than two particles can survive in the asymptotic regime, and the possible surviving particles can be found by drawing the divergence diagram, as discussed. Informally speaking, the line segments of the surviving particles should be low in the divergence diagram. This is a special case of the sample-path necessary survival condition for general stochastic bandit problems in Section 4.

B.4 N random particles

Up to this point, we have been considering fixed given particles. In practice, particles are not given at the very beginning. One can use a pre-determined set of particles, or randomly generate some particles. In this section, we evaluate the performance of PTS with $N$ randomly generated particles. We will consider two different methods for particle generation. The following lemma is useful for the analysis of both cases.

Definition 8. We say that a particle $\theta \in [0,1]^2$ is action-optimal for a given problem BernoulliBandit($K=2$, $\theta^*$) if $A(\theta) = A(\theta^*)$.
In particular, if $\theta^*_1 = \theta^*_2$, then any $\theta \in [0,1]^2$ is action-optimal.

Lemma 15. Consider a given BernoulliBandit($K=2$, $\theta^*$) problem and assume $\theta^*_1 \ne \theta^*_2$. There exist $\theta^*$-dependent positive constants $\bar d_1$ and $\bar d_2$ such that, if a particle $\theta \in [0,1]^2$ satisfies $d(\theta^*_1\|\theta_1) < \bar d_1$ and $d(\theta^*_2\|\theta_2) < \bar d_2$, then $\theta$ is action-optimal. In particular, $\bar d_1 = d\left(\theta^*_1 \,\Big\|\, \frac{\theta^*_1+\theta^*_2}{2}\right)$ and $\bar d_2 = d\left(\theta^*_2 \,\Big\|\, \frac{\theta^*_1+\theta^*_2}{2}\right)$ work.

The lemma provides us with a useful divergence-based sufficient condition under which a particle is action-optimal.

Proof. Without loss of generality, assume $\theta^*_1 > \theta^*_2$. It is clear that if $\theta$ satisfies $\frac{\theta^*_1+\theta^*_2}{2} < \theta_1 \le 1$ and $0 \le \theta_2 < \frac{\theta^*_1+\theta^*_2}{2}$, then $A(\theta^*) = A(\theta)$. See the region highlighted in red in Figure 11.

Figure 11: Any $\theta$ in the red region is consistent.

The function $g(y) = d(x\|y)$ for $x \in (0,1)$ is monotonically decreasing for $y \in (0, x)$ and monotonically increasing for $y \in (x, 1)$. Therefore a sufficient condition for $\frac{\theta^*_1+\theta^*_2}{2} < \theta_1 \le 1$ is $d(\theta^*_1\|\theta_1) < d\left(\theta^*_1 \,\Big\|\, \frac{\theta^*_1+\theta^*_2}{2}\right)$, and a sufficient condition for $0 \le \theta_2 < \frac{\theta^*_1+\theta^*_2}{2}$ is $d(\theta^*_2\|\theta_2) < d\left(\theta^*_2 \,\Big\|\, \frac{\theta^*_1+\theta^*_2}{2}\right)$. Letting $\bar d_1 = d\left(\theta^*_1 \,\Big\|\, \frac{\theta^*_1+\theta^*_2}{2}\right)$ and $\bar d_2 = d\left(\theta^*_2 \,\Big\|\, \frac{\theta^*_1+\theta^*_2}{2}\right)$, the proof is done.

B.4.1 Coordinate-wise random generation

Method 1 (coordinate-wise random generation): Generate two sets $A$ and $B$, each containing $\sqrt{N}$ values generated independently and uniformly at random from $[0,1]$. Let $\mathcal{P}_N = A \times B = \{(a, b) : a \in A, b \in B\}$.

Figure 12: An example of 16 particles produced by coordinate-wise random generation. (a) Particle positions. (b) Divergence diagram.

An example of 16 particles produced by Method 1 is shown in Figure 12. The particles form a grid in the $[0,1]^2$ square (Fig. 12a). The line segments of the particles form a complete bipartite graph in the divergence diagram (Fig. 12b).
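Method 1 and the bottom-particle argument used in the proof of Proposition 16 below can be checked empirically. The sketch below uses a hypothetical $\theta^*$: each trial draws the two coordinate sets, locates the bottom particle $(a_0, b_0)$ (the coordinates minimizing the per-arm divergences), tests whether it is action-optimal, and compares the empirical rate against the bound $1 - 2e^{-|\theta^*_1-\theta^*_2|\sqrt{N}/2}$.

```python
import math, random

def kl(p, q):
    """Bernoulli KL divergence d(p||q); q clamped away from {0, 1}."""
    q = min(max(q, 1e-9), 1 - 1e-9)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Hypothetical two-arm instance; theta_star is an illustrative choice.
theta_star = (0.7, 0.4)
m = 15            # sqrt(N) values per coordinate, so N = 225 particles
N = m * m
rng = random.Random(1)

def trial():
    """One draw of Method 1: sample A and B, locate the bottom particle
    (a0, b0), and report whether it is action-optimal."""
    A = [rng.random() for _ in range(m)]
    B = [rng.random() for _ in range(m)]
    a0 = min(A, key=lambda a: kl(theta_star[0], a))  # best arm-1 coordinate
    b0 = min(B, key=lambda b: kl(theta_star[1], b))  # best arm-2 coordinate
    # Action-optimal iff the bottom particle prefers the same arm as theta*.
    return (a0 > b0) == (theta_star[0] > theta_star[1])

trials = 200
frac = sum(trial() for _ in range(trials)) / trials
bound = 1 - 2 * math.exp(-abs(theta_star[0] - theta_star[1]) * math.sqrt(N) / 2)
```

For this instance the empirical success rate comfortably exceeds the (rather loose) exponential bound, consistent with Proposition 16.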
By the discussion in Section B.3, the weight of the particle represented by the lowest line segment will converge to one with probability one. Call this the bottom particle. For particles generated by Method 1, the bottom particle always exists and is unique. The running average regret of PTS will converge to zero if and only if the bottom particle is action-optimal. If $N$ is large, we expect that, with high probability, the KL divergences of the bottom particle at the two arms will be below $\bar d_1$ and $\bar d_2$ respectively, and hence the bottom particle is action-optimal.

Definition 9. For a given stochastic bandit problem, we say that an algorithm is consistent for a given sample path if the running average regret converges to zero. In particular, for a given BernoulliBandit($K=2$, $\theta^*$) problem, the running average regret is $\frac{1}{T}\sum_{t=1}^{T}\left(\max_{a\in\{1,2\}}\theta^*_a - \theta^*_{A_t}\right)$. Therefore, PTS is consistent for a given sample path if $w_{t,i} \to 1$ and $\left|\frac{1}{T}\sum_{t=1}^{T} w_{t,i} - \frac{1}{T}\sum_{t=1}^{T}\mathbf{1}\{I_t = i\}\right| \to 0$ for some action-optimal particle $i$.

Proposition 16. Let $\mathcal{P}_N$ be a set of $N$ particles generated by Method 1. Consider the process of running PTS($\mathcal{P}_N$) for a given problem BernoulliBandit($K=2$, $\theta^*$) as in Algorithm 5. Let $E$ denote the event that the algorithm is consistent. Assume Conjecture 14 is true. Then, for $N$ sufficiently large,
$$\Pr\{E\} \ge 1 - 2e^{-\frac{|\theta^*_1 - \theta^*_2|\sqrt{N}}{2}}.$$

The above result says that, with coordinate-wise random particle generation, PTS is consistent with high probability. Observe that if $|\theta^*_1 - \theta^*_2|$ is large, it is more likely for the algorithm to be consistent; in other words, it is easier for the algorithm to identify the optimal arm. That makes sense.

Proof. Let $A, B \subset [0,1]$ be the two random sets of $\sqrt{N}$ values generated by Method 1.
Let $a_0 = \arg\min_{a \in A} d(\theta^*_1\|a)$ and $b_0 = \arg\min_{b \in B} d(\theta^*_2\|b)$, and let particle $i_0 \in [N]$ be the one with $\theta^{(i_0)} = (\theta^{(i_0)}_1, \theta^{(i_0)}_2) = (a_0, b_0)$. Particle $i_0$ is the bottom particle in our previous discussion. With probability one, $a_0$, $b_0$ and $i_0$ are unique. By construction, the contraction set $\mathcal{R}$ of the $\{r_t\}$ process contains only one point, either 0 or 1, depending on the optimal arm for particle $i_0$. By Conjecture 14, the algorithm is consistent if and only if particle $i_0$ is action-optimal. We show that particle $i_0$ is action-optimal with high probability.

If $\theta^*_1 = \theta^*_2$, any algorithm is consistent and there is nothing to prove. Without loss of generality, assume $\theta^*_1 > \theta^*_2$. Let $X$ and $Y$ be two independent uniform random variables in $[0,1]$. Let $p_1 \triangleq \Pr\left\{d(\theta^*_1\|X) \le \bar d_1\right\}$ and $p_2 \triangleq \Pr\left\{d(\theta^*_2\|Y) \le \bar d_2\right\}$ for $\bar d_1 = d\left(\theta^*_1 \,\Big\|\, \frac{\theta^*_1+\theta^*_2}{2}\right)$, $\bar d_2 = d\left(\theta^*_2 \,\Big\|\, \frac{\theta^*_1+\theta^*_2}{2}\right)$ as in Lemma 15. Since a sufficient condition for $d(\theta^*_1\|X) \le \bar d_1$ is $X \in \left[\frac{\theta^*_1+\theta^*_2}{2}, \theta^*_1\right]$ and a sufficient condition for $d(\theta^*_2\|Y) \le \bar d_2$ is $Y \in \left[\theta^*_2, \frac{\theta^*_1+\theta^*_2}{2}\right]$, we have
$$p_1 \ge \Pr\left\{X \in \left[\tfrac{\theta^*_1+\theta^*_2}{2}, \theta^*_1\right]\right\} = \frac{\theta^*_1 - \theta^*_2}{2} \qquad \text{and} \qquad p_2 \ge \Pr\left\{Y \in \left[\theta^*_2, \tfrac{\theta^*_1+\theta^*_2}{2}\right]\right\} = \frac{\theta^*_1 - \theta^*_2}{2}.$$
It follows that
$$\begin{aligned}
\Pr\{E\} &\ge \Pr\left\{d(\theta^*_1\|\theta^{(i_0)}_1) \le \bar d_1 \text{ and } d(\theta^*_2\|\theta^{(i_0)}_2) \le \bar d_2\right\} \\
&= 1 - \Pr\left\{d(\theta^*_1\|\theta^{(i_0)}_1) > \bar d_1 \text{ or } d(\theta^*_2\|\theta^{(i_0)}_2) > \bar d_2\right\} \\
&\ge 1 - \Pr\left\{d(\theta^*_1\|\theta^{(i_0)}_1) > \bar d_1\right\} - \Pr\left\{d(\theta^*_2\|\theta^{(i_0)}_2) > \bar d_2\right\} \\
&= 1 - \Pr\left\{d(\theta^*_1\|a) > \bar d_1\ \forall a \in A\right\} - \Pr\left\{d(\theta^*_2\|b) > \bar d_2\ \forall b \in B\right\} \\
&= 1 - (1-p_1)^{\sqrt N} - (1-p_2)^{\sqrt N} \ge 1 - 2\left(1 - \frac{\theta^*_1-\theta^*_2}{2}\right)^{\sqrt N} \ge 1 - 2e^{-\frac{(\theta^*_1-\theta^*_2)\sqrt N}{2}}.
\end{aligned}$$

Despite this nice performance guarantee of PTS for the two-arm Bernoulli bandit, coordinate-wise random particle generation has two major limitations.
First, for problems in which the parameter space does not have a product topology, it is not clear how particles can be generated coordinate-wise. Second, the method does not scale well for problems with a high-dimensional parameter space. For example, for the $K$-arm Bernoulli bandit problem, even if we only generate two values per coordinate, we get $2^K$ particles, which raises concerns about computational cost.

B.4.2 Whole-particle random generation

Method 2 (whole-particle random generation): Let $\mathcal{P}_N$ be a set of $N$ particles generated independently and uniformly at random from $[0,1]^2$.

Let us discuss the performance of PTS($\mathcal{P}_N$) at a high level when $\mathcal{P}_N$ is generated by Method 2. Suppose $\theta^*$ is given, and so are $\bar d_1$ and $\bar d_2$ in Lemma 15. If $N$ is large enough, with high probability we expect that the line segment of at least one particle is low and flat enough that its two ends are below $\bar d_1$ and $\bar d_2$ respectively, which makes the particle action-optimal. Let us call it particle 1. Without loss of generality, suppose $a(1) = 1$. See Figure 13 for an illustration.

Figure 13: How things could go wrong with whole-particle random generation. (a) Particle positions. (b) Divergence diagram.

However, unlike coordinate-wise random generation, here the existence of particle 1 does not guarantee that the algorithm is consistent. Things could go wrong in two ways.

• There could be a non-action-optimal particle that is close to $\theta^*$ on arm 2 but far from $\theta^*$ on arm 1. Call this a type-1 bad particle, exemplified by particle 2 in Fig. 13. Particles 1 and 2 form an SR pair, producing an interval $(0, s)$ in which the process $r_t$ would drift to the wrong side.

• There could also be a non-action-optimal particle that is close to $\theta^*$ on arm 1 but far from $\theta^*$ on arm 2. Call this a type-2 bad particle, exemplified by particle 3 in Fig. 13. Particles 1 and 3 form a CR pair.
If $r_t$ moves anywhere in $(s, 1)$, it will drift toward $r$ and stay around $r$, not converging to 1. In other words, for the particle configuration in Fig. 13, the process $\{r_t\}$ has contraction set $R = \{0, r\}$. Since $R$ does not contain 1, PTS cannot be consistent. No matter how large $N$ is, the probability that there exist at least one type-1 bad particle and one type-2 bad particle like particles 2 and 3 in Fig. 13 is non-zero.

However, a bad particle of either type cannot be too flat in the divergence diagram. For example, the right end of the line segment of a type-1 bad particle cannot be below $\bar d_1$. Therefore, even in the presence of bad particles, a sufficiently good particle creates an interval in $[0,1]$ (e.g., $(s, t)$ in Fig. 13) in which $r_t$ always drifts in the right direction. For large $N$, we expect to have at least one good particle. And as $N$ increases, the line segment of that good particle becomes lower and flatter, making the aforementioned interval expand toward $(0, 1)$. We formally state these ideas as follows.

Proposition 17. Consider a given BernoulliBandit($K = 2$, $\theta^*$) problem and let $P_N$ be a random set of $N$ particles generated by Method 2. Let $R$ be the contraction set for the process $\{r_t\}$ defined in Definition 7. Then for sufficiently large $N$, with probability at least $1 - e^{-N^{1/3}}$, the following statements are true:

(a) Any $r \in R$ satisfies either $r \le s_0$ or $r \ge r_0$ for some $s_0, r_0 \in [0,1]$ satisfying $s_0 \le C_1 N^{-1/3}$ and $r_0 \ge 1 - C_2 N^{-1/3}$, where $C_1, C_2$ are $\theta^*$-dependent constants.

(b) For any $\xi \in (s_0, r_0)$, the corresponding dominant particle is action-optimal.

An illustration of Proposition 17 is shown in Figure 14.

Figure 14: An illustration of Proposition 17.

Before we prove this result, let us discuss its implication. Suppose without loss of generality that arm 1 is the optimal arm, i.e., $\theta_1^* > \theta_2^*$.
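The geometry behind Proposition 17 can be sketched numerically. The snippet below is our own illustration, not the authors' code: it samples $N$ particles by Method 2, evaluates each particle's divergence line $D_i(r) = r\, d(\theta_1^* \| \theta_1^{(i)}) + (1-r)\, d(\theta_2^* \| \theta_2^{(i)})$ (the convex combination of per-arm KL divergences from the divergence diagram), and checks that the particle minimizing $D_i(\xi)$ — the dominant particle at $\xi$, in the sense used in the proof of Proposition 17 — is action-optimal for $\xi$ away from the endpoints. The helper names and the specific $\theta^*$ are ours.

```python
import math
import random

def kl(p, q):
    """Bernoulli KL divergence d(p || q), with clipping to avoid log(0)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    val = 0.0
    if p > 0:
        val += p * math.log(p / q)
    if p < 1:
        val += (1 - p) * math.log((1 - p) / (1 - q))
    return val

def divergence_line(particle, theta_star, r):
    """D_i(r): convex combination of the per-arm KL divergences."""
    t1, t2 = theta_star
    a, b = particle
    return r * kl(t1, a) + (1 - r) * kl(t2, b)

def dominant_particle(particles, theta_star, xi):
    """Index of the particle minimizing D_i(xi)."""
    return min(range(len(particles)),
               key=lambda i: divergence_line(particles[i], theta_star, xi))

random.seed(1)
theta_star = (0.7, 0.3)   # arm 1 is optimal
N = 2000                  # Method 2: N particles uniform on [0,1]^2
particles = [(random.random(), random.random()) for _ in range(N)]

# For xi in the interior of (s_0, r_0), Proposition 17(b) says the
# dominant particle should be action-optimal (here: theta_1 > theta_2).
for xi in (0.2, 0.5, 0.8):
    j = dominant_particle(particles, theta_star, xi)
    assert particles[j][0] > particles[j][1]
```

With $N$ this large, the minimizer of $D_i(\xi)$ sits close to $\theta^*$, so the check passes with overwhelming probability; rerunning with small $N$ exhibits the failure modes of Figure 13.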
Let
\[
E_1 \triangleq \Big\{ \lim_{t \to \infty} \overline{reg}_t \ge \Big(1 - \frac{C_1}{N^{1/3}}\Big) |\theta_1^* - \theta_2^*| \Big\},
\]
a bad event in which the running average regret is large, and let
\[
E_2 \triangleq \Big\{ \lim_{t \to \infty} \overline{reg}_t \le \frac{C_2}{N^{1/3}} |\theta_1^* - \theta_2^*| \Big\},
\]
a good event in which the running average regret is small, i.e., the algorithm is almost consistent. According to Proposition 17 and Conjecture 14, with high probability $r_t$ eventually converges to some $r \in [0,1]$ with either $r \le s_0$ or $r \ge r_0$; the former implies $E_1$ and the latter implies $E_2$. Thus
\[
\Pr\{E_1 \cup E_2\} \ge 1 - e^{-N^{1/3}}. \tag{21}
\]
If event $E_1$ could be ruled out, (21) would say that PTS is probably approximately consistent (PAC). But because we cannot exclude the possibility of $E_1$, we cannot say that PTS is PAC. However, as $N$ increases, the interval $(0, s_0]$ shrinks, and we expect the probability that $r_t$ is trapped somewhere in $[0, s_0]$ to become smaller. That is, we expect that $\Pr\{E_1\} \to 0$ as $N \to \infty$, although we do not have a proof. If that is indeed true, then Proposition 17 implies that, with whole-particle random generation, PTS is PAC.

We now prove Proposition 17, starting with the following lemma.

Lemma 18. Let $\theta^* \in [0,1]^2$ be given. Let $\bar d_1$ and $\bar d_2$ be the constants in Lemma 15. In the divergence diagram, let $L_1$ be the line segment from the point 0 on the horizontal axis to height $\bar d_1$ at $r = 1$ (so $L_1(r) = \bar d_1 r$), and let $L_2$ be the line segment from the point 1 on the horizontal axis to height $\bar d_2$ at $r = 0$ (so $L_2(r) = \bar d_2 (1-r)$). See Fig. 15. Let $\delta_0$ be the height at which $L_1$ and $L_2$ intersect. For any $\delta \in [0, \delta_0)$, let $L = \{L(r) = \delta : 0 \le r \le 1\}$ be the horizontal line of height $\delta$. Let $s_0$ be such that $L(s_0) = L_1(s_0)$ and let $r_0$ be such that $L(r_0) = L_2(r_0)$. Then $s_0 < r_0$. The following are true:

(a) If there exists a particle $i$ that satisfies $D_i(r) \le L(r) = \delta$ for some $r \in (s_0, r_0)$ (i.e., $D_i$ intersects the red rectangle in Fig. 15), then particle $i$ must be action-optimal.

Figure 15: An illustration of Lemma 18.
(b) If there exists a particle $j$ such that $D_j$ lies entirely below $L$, then any $r \in R$ must satisfy $r \le s_0$ or $r \ge r_0$.

Proof. The proof is geometric; see Figure 15. It is obvious that $s_0 < r_0$. We show part (a) by proving its contrapositive. Consider a particle $i$ associated with a line $D_i$ in the diagram, and suppose particle $i$ is not action-optimal. Then by Lemma 15, either $D_i(0) \ge \bar d_2$ or $D_i(1) \ge \bar d_1$. Without loss of generality, assume $D_i(1) \ge \bar d_1$. Then $D_i$ must lie entirely above $L_1$, so $D_i$ cannot intersect the red rectangle in Fig. 15.

Next, we show part (b). Suppose particle $j$ has $D_j$ entirely below $L$. Then particle $j$ is clearly action-optimal. For any $\xi \in (s_0, r_0)$, the dominant particle must be either particle $j$ itself or a particle below particle $j$ at $\xi$; in the latter case, the dominant particle must be action-optimal by part (a). Thus the dominant particle for any $\xi \in (s_0, r_0)$ must be action-optimal. Therefore, if $r_t$ is in $(s_0, r_0)$, it always drifts toward the optimal-arm side, and $R$ does not contain any points in $(s_0, r_0)$.

Lemma 19. Let $U$ be a random variable uniformly distributed in $[0,1]$. Then for any $\epsilon \in (0,1)$ and any fixed and given $x \in [0,1]$,
\[
\Pr\{ d(x \| U) \le \epsilon \} \ge \frac{\epsilon}{2}.
\]

Proof. By Theorem 1 in [6],
\[
d(x \| u) \le \frac{x^2}{u} + \frac{(1-x)^2}{1-u} - 1.
\]
Therefore, if $u$ satisfies
\[
u \ge \frac{x}{1+\epsilon} \quad \text{and} \quad 1 - u \ge \frac{1-x}{1+\epsilon},
\]
then
\[
d(x \| u) \le (1+\epsilon)x + (1+\epsilon)(1-x) - 1 = \epsilon.
\]
It follows that
\[
\Pr\{ d(x \| U) \le \epsilon \} \ge \Pr\Big\{ \frac{x}{1+\epsilon} \le U \le 1 - \frac{1-x}{1+\epsilon} \Big\} = 1 - \frac{1-x}{1+\epsilon} - \frac{x}{1+\epsilon} = \frac{\epsilon}{1+\epsilon} \ge \frac{\epsilon}{2}.
\]

Proof of Proposition 17. Consider a fixed large $N$ and let $\delta(N) = 2 N^{-1/3}$. Without loss of generality, suppose $N$ is large enough that $\delta(N) < \delta_0$ as in Lemma 18. Let $L(N)$, $s_0(N)$, $r_0(N)$ be defined for $\delta(N)$ as $L$, $s_0$, $r_0$ are defined for $\delta$ in Lemma 18.
If a particle $i$ satisfies that $D_i$ is entirely below the line $L(N)$, we say that particle $i$ is good. Let $E$ be the event that there exists at least one good particle in $P_N$. It follows that
\[
\begin{aligned}
\Pr\{E\} &= 1 - \big(1 - \Pr\{\text{particle 1 is good}\}\big)^N \\
&= 1 - \Big( 1 - \Pr\big\{ d(\theta_1^* \| \theta_1^{(1)}) \le \delta(N) \big\} \cdot \Pr\big\{ d(\theta_2^* \| \theta_2^{(1)}) \le \delta(N) \big\} \Big)^N \\
&\overset{(i)}{\ge} 1 - \big( 1 - N^{-1/3}\, N^{-1/3} \big)^N \ \ge\ 1 - e^{-N^{-2/3} N} = 1 - e^{-N^{1/3}},
\end{aligned}
\]
where $(i)$ is due to $\Pr\{ d(\theta_i^* \| \theta_i^{(1)}) \le \delta(N) \} \ge N^{-1/3}$ by Lemma 19 for $i = 1, 2$.

Suppose event $E$ is true, and let $i_0$ be one good particle. Then by Lemma 18 part (b), any $r \in R$ must satisfy $r \le s_0(N)$ or $r \ge r_0(N)$. Simple geometry shows that
\[
s_0(N) = \frac{\delta(N)}{\bar d_1} = \frac{2}{\bar d_1} N^{-1/3} \quad \text{and} \quad r_0(N) = 1 - \frac{\delta(N)}{\bar d_2} = 1 - \frac{2}{\bar d_2} N^{-1/3}.
\]
Letting $C_1 = 2/\bar d_1$ and $C_2 = 2/\bar d_2$ proves part (a) of Proposition 17. Consider any $\xi \in (s_0, r_0)$ and let the corresponding dominant particle be $j$. Then $D_j(\xi) \le D_{i_0}(\xi) \le \delta(N)$, so by Lemma 18 part (a), particle $j$ must be action-optimal. This proves part (b) of Proposition 17.

B.5 Summary

In this section we analyzed PTS for the two-arm Bernoulli bandit problem. Our key findings are the following.

• Fit particles survive, unfit particles decay, in the sense described in Proposition 8 and Conjecture 14. The fitness of a particle $i$ is measured in terms of its closeness to $\theta^*$ by the divergence $D_i(r_t)$, a convex combination of the KL divergences on the two arms. Unfortunately, we cannot directly compare the fitness of particles because $D_i(r_t)$ depends on the random process $r_t$. It is possible that the weights of the surviving particles oscillate forever due to the counter-reinforcing effect. Also, the weights of the decaying particles decay exponentially fast.

• The set of surviving particles is random. This is mainly due to the self-reinforcing effect.
One way to find the possible sets of surviving particles is to draw the divergence diagram described in Section B.3.

• Most particles decay. Under Assumption 3, we expect that all except at most two particles eventually decay.

• Roughly speaking, with randomly generated particles, PTS is consistent or near-consistent with high probability. See Propositions 16 and 17.

We believe these findings and some related concepts can be extended to other, more general kinds of stochastic bandit problems. For example, for the $K$-arm Bernoulli bandit problem with $K \ge 3$, we expect to observe counter-reinforcing sets (not just pairs) of particles in PTS, in which the particles reinforce each other in some way. Proposition 1 provides a generalized method to identify surviving particles, including counter-reinforcing particles, for general stochastic bandit problems and for any finite number of particles.

B.6 Useful Drift-Implied Bounds

This section includes, for reference, two useful drift-implied bounds.

B.6.1 One drift-implied bound with stochastic dominance

The following result (Proposition 20) is reproduced from [9] for convenience of reference. Let $X_0, X_1, \dots$ be a sequence of random variables. The drift at time $t$ is defined as $E[X_{t+1} - X_t \mid \mathcal{F}_t]$, where $\mathcal{F}_t = \sigma(X_0, \dots, X_t)$. Consider the following two conditions.

Condition C1: For some constants $-\infty \le a < \infty$ and $\epsilon_0 > 0$,
\[
E[X_{t+1} - X_t \mid \mathcal{F}_t] \le -\epsilon_0 \quad \text{on the event } \{X_t \ge a\}, \qquad t \ge 0. \tag{22}
\]
That is, the drift at time $t$ is strictly negative whenever $X_t \ge a$.

Condition C2: There exists a random variable $Z$ with $E[e^{\lambda Z}] = D$ for some constants $\lambda > 0$ and $D > 0$ such that $(|X_{t+1} - X_t| \mid \mathcal{F}_t) \prec Z$. That is, given $\mathcal{F}_t$, $|X_{t+1} - X_t|$ is stochastically dominated by a random variable with an exponential tail.

Let $c, \eta, \rho$ be constants such that
\[
c \ge \frac{E[e^{\lambda Z}] - (1 + \lambda E[Z])}{\lambda^2}, \qquad 0 < \eta \le \lambda, \qquad \eta < \epsilon_0 / c, \qquad \rho = 1 - \epsilon_0 \eta + c \eta^2.
\]
Then $\rho < 1$.

Proposition 20 (Theorem 2.3 in [9]). Conditions C1 and C2 imply that
\[
P\{X_t \ge b \mid X_0\} \le \rho^t e^{\eta(X_0 - b)} + \frac{1 - \rho^t}{1 - \rho}\, D\, e^{-\eta(b - a)}.
\]
In particular, if $X_0 \le a$, then
\[
P\{X_t \ge b \mid X_0\} \le \frac{D}{1 - \rho}\, e^{-\eta(b - a)}.
\]

B.6.2 Another drift-implied bound with bounded steps

Two lemmas are stated first.

Lemma 21 (Hoeffding's Lemma). Suppose $Y$ is a random variable such that $\Pr\{Y \in [a, b]\} = 1$. Then
\[
E\big[ e^{\theta(Y - E[Y])} \big] \le e^{\theta^2 (b-a)^2 / 8}.
\]

Lemma 22. Suppose $(M_k : k \ge 0)$ is a non-negative supermartingale. Then for any $n \ge 0$ and $\gamma > 0$,
\[
\Pr\Big\{ \max_{0 \le k \le n} M_k > \gamma \Big\} \le \frac{E[M_0]}{\gamma}.
\]
A proof of Lemma 22 can be found in Section 3.4 (page 69) of [10].

Proposition 23. Consider a random sequence $(U_n : n \ge 1)$ and define $\mathcal{F}_0 = \{\emptyset, \Omega\}$ and $\mathcal{F}_k = \sigma(U_1, \dots, U_k)$. Suppose $E[U_{k+1} \mid \mathcal{F}_k] \le -\mu < 0$ for $k \ge 0$ and $\Pr\{|U_k| \le C\} = 1$ for $k \ge 1$, for some constants $\mu, C > 0$. Let $X_n \triangleq U_1 + \dots + U_n$ for $n \ge 1$ and $X_0 = 0$. Let $G_n \triangleq \max_{0 \le k \le n} X_k$ and $G \triangleq \max_{k \ge 0} X_k$. Then for any $b > 0$,
\[
\Pr\{G > b\} \le e^{-\frac{2\mu b}{C^2}}.
\]

Proof. By Hoeffding's lemma (Lemma 21),
\[
E\big[ e^{\theta(U_k - E[U_k \mid \mathcal{F}_{k-1}])} \mid \mathcal{F}_{k-1} \big] \le e^{\theta^2 (2C)^2 / 8} = e^{\theta^2 C^2 / 2}.
\]
Therefore, for all $k \ge 1$,
\[
E\big[ e^{\theta U_k} \mid \mathcal{F}_{k-1} \big] \le e^{\theta E[U_k \mid \mathcal{F}_{k-1}]}\, e^{\theta^2 C^2 / 2} \le e^{-\theta\mu + \theta^2 C^2 / 2}.
\]
The exponent $-\theta\mu + \theta^2 C^2 / 2$ is quadratic in $\theta$ and is less than or equal to zero for all $\theta \in [0, 2\mu/C^2]$. Let $\theta^* = 2\mu/C^2$. Then $E[e^{\theta^* U_k} \mid \mathcal{F}_{k-1}] \le 1$ for all $k \ge 1$. Next, define $M_0 = 1$ and $M_k = e^{\theta^* X_k}$ for $k \ge 1$. Then $(M_k : k \ge 0)$ is a supermartingale because
\[
E[M_{k+1} \mid \mathcal{F}_k] = E\big[ e^{\theta^*(U_1 + \dots + U_{k+1})} \mid \mathcal{F}_k \big] = e^{\theta^*(U_1 + \dots + U_k)}\, E\big[ e^{\theta^* U_{k+1}} \mid \mathcal{F}_k \big] \le M_k.
\]
It follows that, for any $n \ge 0$ and $b > 0$,
\[
\Pr\{G_n > b\} = \Pr\Big\{ \max_{0 \le k \le n} X_k > b \Big\} = \Pr\Big\{ \max_{0 \le k \le n} e^{\theta^* X_k} > e^{\theta^* b} \Big\} = \Pr\Big\{ \max_{0 \le k \le n} M_k > e^{\theta^* b} \Big\} \overset{(i)}{\le} \frac{E[M_0]}{e^{\theta^* b}} = e^{-\theta^* b}.
\]
Step $(i)$ is due to Lemma 22.
Finally, since $G_n$ is non-decreasing in $n$ and $G_n \to G$ for each sample path, $1_{\{G_n > b\}}$ is non-negative, non-decreasing in $n$, and $1_{\{G_n > b\}} \to 1_{\{G > b\}}$ for each sample path. So by the monotone convergence theorem,
\[
\Pr\{G > b\} = E\big[ 1_{\{G > b\}} \big] = \lim_{n \to \infty} E\big[ 1_{\{G_n > b\}} \big] = \lim_{n \to \infty} \Pr\{G_n > b\} \le e^{-\theta^* b} = e^{-\frac{2\mu b}{C^2}}.
\]

Corollary 24. Consider a random sequence $(U_n : n \ge 1)$ and define $\mathcal{F}_0 = \{\emptyset, \Omega\}$ and $\mathcal{F}_k = \sigma(U_1, \dots, U_k)$. Suppose $E[U_{k+1} \mid \mathcal{F}_k] \ge \mu > 0$ for $k \ge 0$ and $\Pr\{|U_k| \le C\} = 1$ for $k \ge 1$, for some constants $\mu, C > 0$. Let $X_n \triangleq U_1 + \dots + U_n$ for $n \ge 1$ and $X_0 = 0$. Let $G_n \triangleq \min_{0 \le k \le n} X_k$ and $G \triangleq \min_{k \ge 0} X_k$. Then for any $b > 0$,
\[
\Pr\{G < -b\} \le e^{-\frac{2\mu b}{C^2}}.
\]

Proof. Apply Proposition 23 to the sequence $\{-X_n\}$.

C Regenerative particle Thompson sampling: choice of hyper-parameters and more simulations

The recommended numerical values of the three hyper-parameters for RPTS (Algorithm 3) are $f_{del} = 0.8$, $w_{inact} = 0.001$, and $w_{new} = 0.01$. The behavior of the algorithm is relatively insensitive to these values, but further tuning may be beneficial in a given application. In this section we comment on how these values influence the performance of the algorithm.

• Analysis for Bernoulli bandits (Section B) and empirical evidence for other bandit models indicate that with high probability all but a few particles eventually decay in PTS. Hence it may be tempting to make $f_{del}$ very large. However, since the set of decaying particles is random, it may happen that some fit particles end up decaying. Also, a not-so-bad particle may have an oscillating weight due to counter-reinforcing effects and thus may have low weight at times. Making $f_{del}$ not too large gives those unfortunate fit and not-so-bad particles a chance to survive. We have tried $f_{del} = 0.8$ and $f_{del} = 0.5$, and both work fine.
• The value of $w_{inact}$ should be small, but if it is too small, it may take a long time for the CONDITION in Step 9 to become true, especially when the particles become concentrated in a small subset of the parameter space.

• The value of $w_{new}$ should be small but strictly larger than $w_{inact}$. There are three considerations here. First, it is desirable that the weight re-balancing in Step 13 due to normalization has minimal effect on the weights of the surviving particles. We discovered through experiments that it is good for heavy-weight particles to remain heavily weighted; therefore $w_{new}$ should be small. Second, $w_{new}$ should be larger than $w_{inact}$, because otherwise the newly generated particles in one step would be immediately deleted in the next step. Third, the purpose of setting the value of $w_{new}$ is to give some initial weight to the new particles so that they can participate in the weight updating in the subsequent steps. If a new particle is fit, its weight will boost up exponentially fast; if a new particle is unfit, its weight will decay exponentially fast. Therefore, the initial weights assigned to new particles should not significantly affect their chance of survival or their long-term weight dynamics. Thus, as long as $w_{new}$ is fairly small and larger than $w_{inact}$, the choice of its actual value may not make much difference qualitatively.

More simulations are shown in Figure 16. For the linear bandit problem, TS can also be exactly implemented by a Kalman filter. The initial set of particles of PTS and RPTS for linear bandits is generated uniformly at random from the unit ball in $\mathbb{R}^K$. This is based on the assumption that we already know $\theta^*$ is in the unit ball before running the algorithm. In practice, such knowledge may not be available, and a common practice is to use a distribution that spreads out widely enough to cover $\theta^*$.
For the purpose of demonstrating the performance of PTS and RPTS here, our practice should be acceptable.

D Approximation of expected reward for the network slicing model

In Section 6, in Step 4 of Algorithm 4, the expected reward $E_{\theta_t}[R(Y) \mid A_t = a, c_t]$ becomes $E_{\theta_t}[g_{c_{t,2}}(Y_t) \mid a]$ for the network slicing model, where $Y_t = Y_{t,1} + Y_{t,2} + Y_{t,3}$. Since $Y_{t,1}, Y_{t,2}, Y_{t,3}$ are coupled through the non-linear function $g_d$, it is not clear whether the expectation can be calculated exactly in closed form. We propose the following approximation. Consider a random variable $Y = Y_1 + Y_2 + Y_3$, where the $Y_i$ are independent exponentially distributed random variables with means $\mu_i$. Suppose we approximate $Y$ by a Gaussian random variable $\tilde Y$ with mean $\mu = \mu_1 + \mu_2 + \mu_3$ and variance $\sigma^2 = \mu_1^2 + \mu_2^2 + \mu_3^2$. Then
\[
\begin{aligned}
E[g_d(Y)] \approx E[g_d(\tilde Y)] &= \int_0^d \frac{y}{d}\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \, dy \\
&= \int_{-\mu}^{d-\mu} \frac{z+\mu}{d}\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{z^2}{2\sigma^2}} \, dz \qquad (\text{with } z = y - \mu) \\
&= \frac{1}{d\sqrt{2\pi\sigma^2}} \int_{-\mu}^{d-\mu} z\, e^{-\frac{z^2}{2\sigma^2}} \, dz + \frac{\mu}{d} \int_{-\mu}^{d-\mu} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{z^2}{2\sigma^2}} \, dz \\
&= \frac{\sigma}{d\sqrt{2\pi}} \Big( e^{-\frac{\mu^2}{2\sigma^2}} - e^{-\frac{(d-\mu)^2}{2\sigma^2}} \Big) + \frac{\mu}{d} \Big( \Phi\Big(\frac{d-\mu}{\sigma}\Big) - \Phi\Big(-\frac{\mu}{\sigma}\Big) \Big),
\end{aligned}
\]
where $\Phi(x) \triangleq P(N \le x)$ for a standard Gaussian random variable $N$. Then
\[
E_{\theta_t}\big[ g_{c_{t,2}}(Y_t) \mid a \big] \approx \frac{\sigma_t}{c_{t,2}\sqrt{2\pi}} \Big( e^{-\frac{\mu_t^2}{2\sigma_t^2}} - e^{-\frac{(c_{t,2}-\mu_t)^2}{2\sigma_t^2}} \Big) + \frac{\mu_t}{c_{t,2}} \Big( \Phi\Big(\frac{c_{t,2}-\mu_t}{\sigma_t}\Big) - \Phi\Big(-\frac{\mu_t}{\sigma_t}\Big) \Big), \tag{23}
\]
where $\mu_t = \mu_{t,1} + \mu_{t,2} + \mu_{t,3}$, $\sigma_t^2 = \mu_{t,1}^2 + \mu_{t,2}^2 + \mu_{t,3}^2$, and $\mu_{t,i} = c_{t,1}\, \theta_{t,i,a_i,1} + \theta_{t,i,a_i,2}$ for $i = 1, 2, 3$. Step 4 of Algorithm 4 can then be approximately solved by looping over all possible $a \in [B_1] \times [B_2] \times [B_3]$ and finding the one that maximizes (23).

(a) Bernoulli bandit, $K = 10$, $\theta^* = [0.05, 0.10, \dots, 0.50]$. (b) Bernoulli bandit, $K = 100$, $\theta^*$ consists of $N = 100$ points uniformly spaced over $[0.3, 0.8]$.
(c) Max-Bernoulli bandit, $K = 10$, $M = 3$, $\theta^* = [0.51, 0.52, \dots, 0.60]$. (d) Max-Bernoulli bandit, $K = 10$, $M = 3$, $\theta^* = [0.05, 0.10, \dots, 0.50]$. (e) Linear bandit, $K = 10$, $\sigma_W^2 = 0.1$, $\theta^* = [0.2, \dots, 0.2]$. (f) Linear bandit, $K = 100$, $\sigma_W^2 = 0.1$, $\theta^* = [0.08, \dots, 0.08]$.

Figure 16: More simulations.
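The closed-form approximation (23) is straightforward to check against Monte Carlo. Below is a minimal sketch (our own code; the function names are ours) of the Gaussian approximation of $E[g_d(Y)]$ derived above, assuming $g_d(y) = (y/d)\,1\{0 \le y \le d\}$, which is the form implied by the integral over $[0, d]$:

```python
import math
import random

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def g_approx(d, mus):
    """Gaussian approximation of E[g_d(Y)], Y = sum of Exp(mean mu_i)."""
    mu = sum(mus)
    sigma2 = sum(m * m for m in mus)
    sigma = math.sqrt(sigma2)
    term1 = sigma / (d * math.sqrt(2.0 * math.pi)) * (
        math.exp(-mu ** 2 / (2.0 * sigma2))
        - math.exp(-(d - mu) ** 2 / (2.0 * sigma2)))
    term2 = mu / d * (phi((d - mu) / sigma) - phi(-mu / sigma))
    return term1 + term2

def g_mc(d, mus, n=200000, seed=0):
    """Monte Carlo estimate of E[g_d(Y)] with g_d(y) = (y/d) 1{0 <= y <= d}."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        y = sum(rng.expovariate(1.0 / m) for m in mus)  # Exp with mean m
        if y <= d:
            total += y / d
    return total / n
```

For example, with means $(1.0, 0.5, 0.3)$ and $d = 3.0$, the closed form and the Monte Carlo estimate agree closely; the small residual gap comes from replacing the skewed sum of exponentials by a Gaussian.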
