Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process


Authors: Aristide Tossou, Christos Dimitrakakis, Debabrota Basu

Department of Computer Science and Engineering, Chalmers University of Technology, Göteborg, Sweden
(aristide,chrdimi,basud)@chalmers.se

Abstract

We tackle the problem of acting in an unknown finite and discrete Markov Decision Process (MDP) for which the expected shortest path from any state to any other state is bounded by a finite number $D$. An MDP consists of $S$ states and $A$ possible actions per state. Upon choosing an action $a_t$ at state $s_t$, one receives a real-valued reward $r_t$, then one transits to a next state $s_{t+1}$. The reward $r_t$ is generated from a fixed reward distribution depending only on $(s_t, a_t)$ and, similarly, the next state $s_{t+1}$ is generated from a fixed transition distribution depending only on $(s_t, a_t)$. The objective is to maximize the accumulated rewards after $T$ interactions. In this paper, we consider the case where the reward distributions, the transitions, $T$ and $D$ are all unknown. We derive the first polynomial-time Bayesian algorithm, BUCRL, that achieves, up to logarithmic factors, a regret (i.e. the difference between the accumulated rewards of the optimal policy and those of our algorithm) of the optimal order $\tilde{O}(\sqrt{DSAT})$. Importantly, our result holds with high probability for the worst-case (frequentist) regret and not the weaker notion of Bayesian regret. We perform experiments in a variety of environments that demonstrate the superiority of our algorithm over previous techniques. Our work also illustrates several results that will be of independent interest. In particular, we derive a sharper upper bound for the KL-divergence of Bernoulli random variables. We also derive sharper upper and lower bounds for Beta and Binomial quantiles. All the bounds are very simple and only use elementary functions.
Preprint. Under review.

1 Introduction

Markov Decision Process (MDP) is a framework of central importance in computer science. Indeed, MDPs are a generalization of (stochastic) shortest-path problems and can thus be used for routing problems [19] and scheduling and resource-allocation problems [9]. One of their most successful applications is in reinforcement learning, where MDPs have been used to achieve human-level performance in a variety of games such as Go [25] and Chess [24]. MDPs are also a generalization of online learning problems (such as multi-armed bandit problems) and as such have been used for online advertisement [14] and movie recommendations [22].

Problem Formulation. In this paper, we focus on the problem of online learning of a near-optimal policy for an unknown Markov Decision Process. An MDP consists of $S$ states and $A$ possible actions per state. Upon choosing an action $a_t$ at state $s_t$, one receives a real-valued reward $r_t$, then one transits to a next state $s_{t+1}$. The reward $r_t$ is generated from a fixed reward distribution depending only on $(s_t, a_t)$ and, similarly, the next state $s_{t+1}$ is generated from a fixed transition distribution $p(\cdot \mid s_t, a_t)$ depending only on $(s_t, a_t)$. The objective is to maximize the accumulated (and undiscounted) rewards after $T$ interactions.

An MDP is characterized by a quantity (called $D$) known as the diameter. It indicates an upper bound on the expected shortest path from any state to any other state. When this diameter (formally defined by Definition 1) is finite, the MDP is called communicating.

Definition 1 (Diameter of an MDP). The diameter $D$ of an MDP $M$ is defined through the minimum expected number of rounds needed to go from one state $s$ to any other state $s'$ while acting using some deterministic policy.
Formally,
$$D(M) = \max_{s \neq s',\; s, s' \in \mathcal{S}} \;\; \min_{\pi : \mathcal{S} \to \mathcal{A}} T(s' \mid s, \pi)$$
where $T(s' \mid s, \pi)$ is the expected number of rounds it takes to reach state $s'$ from $s$ using policy $\pi$.

In this paper, we consider the case where the reward distributions $r$, the transitions $p$, $T$ and $D$ are all unknown. Given that the rewards are undiscounted, a good measure of performance is the gain, i.e. the infinite-horizon average reward. The gain of a policy $\pi$ starting from state $s$ is defined by:
$$V(s \mid \pi) \triangleq \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=1}^{T} r(s_t, \pi(s_t)) \;\middle|\; s_1 = s \right].$$
Puterman [20] shows that there is a policy $\pi^*$ whose gain, $V^*$, is greater than that of any other policy. In addition, this gain is the same for all states in a communicating MDP. We can then characterize the performance of the agent by its regret, defined as:
$$\mathrm{Regret}(T) \triangleq \sum_{t=1}^{T} \left( V^* - r(s_t, a_t) \right).$$
Thus our goal is equivalent to obtaining a regret as low as possible.

Related Work. It has been shown that any algorithm must incur a regret of $\Omega(\sqrt{DSAT})$ in the worst case [10]. Since the establishment of this lower bound on the regret, there have been numerous algorithms for the problem. They can be classified in two ways: frequentist and Bayesian. Frequentist algorithms usually construct explicit confidence intervals, while Bayesian algorithms start with a prior distribution and use the posterior derived from Bayes' theorem. Following a long line of algorithms (KL-UCRL [7], REGAL.C [4], UCBVI [2], SCAL [8]), the authors of [27] derived a frequentist algorithm that achieves the lower bound up to logarithmic factors. In contrast, the situation is different for Bayesian algorithms. One of the first works to prove theoretical guarantees for posterior sampling is Osband et al. [17], for their PSRL algorithm.
However, they only consider reinforcement learning problems with a finite and known episode length¹ and prove an upper bound of $O(HS\sqrt{TA})$ on the expected Bayesian regret, where $H$ is the length of the episode. Ouyang et al. [18] generalise the results of Osband et al. [17] to weakly communicating MDPs and prove a bound of $O(H_S S\sqrt{TA})$ on the expected Bayesian regret, where $H_S$ is a bound on the span of the MDP. Other Bayesian algorithms have also been derived in the literature; however, none of them is able to attain the lower bound for the general communicating MDPs considered in this paper. Moreover, many of the previous Bayesian algorithms only provide guarantees about the Bayesian regret (i.e., the regret under the assumption that the true MDP is sampled from the prior). It was thus an open question whether or not one can design Bayesian algorithms with optimal worst-case regret guarantees [16, 15]. In this work, we provide guarantees for the worst-case (frequentist) regret.

We solve this challenge by designing the first Bayesian algorithm with a provable upper bound on the regret that matches the lower bound up to logarithmic factors. Our algorithm starts with a prior over MDPs and computes the posterior similarly to previous works. However, instead of sampling from the posterior, we compute a quantile of the posterior. We then use all the MDPs possible under the quantile as a set of statistically plausible MDPs, and then follow the same steps as the state-of-the-art UCRL-V [27]. The idea of using quantiles has already been explored in the algorithm named Bayes-UCB [13] for multi-armed bandits (a special case of MDPs with only a single state). Our work can thus also be considered a generalization of Bayes-UCB.

¹ Informally, this means the MDP resets to a starting state after a fixed number of steps.

Our Contributions. Hereby, we summarise the contributions of this paper, which we elaborate in the upcoming sections.
• We provide a conceptually simple Bayesian algorithm, BUCRL, for reinforcement learning that achieves near-optimal worst-case regret. Rather than actually sampling from the posterior distribution, we simply construct upper confidence bounds through Bayesian quantiles.
• Based on our analysis, we explain why Bayesian approaches are often superior in performance to approaches based on concentration inequalities.
• We perform experiments in a variety of environments that validate the theoretical bounds as well as show BUCRL to be better than the state-of-the-art algorithms (Section 3).

We conclude by summarising the techniques involved in this paper and discussing the possible future works they can lead to (Section 4).

2 Algorithm Description and Analysis

In this section, we describe our Bayesian algorithm BUCRL. We combine Bayesian priors and posteriors together with optimism in the face of uncertainty to achieve a high-probability upper bound of $\tilde{O}(\sqrt{DSAT})$² on the worst-case regret in any finite communicating MDP. Our algorithm can be summarized as follows:

1. Consider a prior distribution over MDPs and update it after each observation.
2. Construct a set of statistically plausible MDPs: the set of all MDPs inside a quantile of the posterior distribution.
3. Compute a policy (called optimistic) whose gain is the maximum among all MDPs in the plausible set. We use a modified extended value iteration algorithm derived in [27].
4. Play the computed optimistic policy for an artificial episode that lasts until the average number of times the state-action pairs have been doubled reaches 1. This is known as the extended doubling trick [27].

There are multiple variants of quantile definitions for MDPs (since an unknown MDP can be viewed as a multivariate random variable). In this paper, we adopt a specific definition of quantiles for multivariate random variables called marginal quantiles.
More precisely:

Definition 2 (Marginal Quantile [3]). Let $X = (X_1, \ldots, X_m)$ be a multivariate random vector with joint d.f. (distribution function) $F$ and $i$-th marginal d.f. $F_i$. We denote the $i$-th marginal quantile function by:
$$Q_i(F, q) = \inf\{x : F_i(x) \ge q\}, \qquad 0 \le q \le 1.$$

² $\tilde{O}$ is used to hide logarithmic factors.

Unless otherwise specified, we will refer to a marginal quantile simply as a quantile. For univariate distributions, the subscript $i$ can be omitted, as the quantile and the marginal quantile coincide.

Algorithm 1 BUCRL
Input: a prior distribution $\mu_1$ over MDPs; a confidence level $1 - \delta$.
Initialization: Set $t \leftarrow 1$ and observe the initial state $s_1$. Set $N_k$, $N_k(s,a)$, $N_{t_k}(s,a)$ to zero for all $k \ge 0$ and $(s,a)$.
for episodes $k = 1, 2, \ldots$ do
    $t_k \leftarrow t$
    $N_{t_{k+1}}(s,a) \leftarrow N_{t_k}(s,a)$ for all $(s,a)$
    Compute the optimistic policy $\tilde{\pi}_k$:
        /* Update the bounds on statistically plausible MDPs */
        $\check{r}(s,a) \leftarrow Q_{s,a,r}(\mu_t, \delta^k_r)$ (lower quantile)
        $\hat{r}(s,a) \leftarrow Q_{s,a,r}(\mu_t, 1 - \delta^k_r)$ (upper quantile)
        where $Q_{s,a,r}$ is the $i$-th marginal quantile function, with $i$ the component corresponding to $(s,a)$ for the rewards.
        For any $S_c \subseteq \mathcal{S}$ use:
        $\check{p}(S_c \mid s,a) \leftarrow Q_{s,a,S_c,p}(\mu_t, \delta^k_p)$ (lower quantile)
        $\hat{p}(S_c \mid s,a) \leftarrow Q_{s,a,S_c,p}(\mu_t, 1 - \delta^k_p)$ (upper quantile)
        where $Q_{s,a,S_c,p}$ is the $i$-th marginal quantile function, with $i$ the component corresponding to $(s,a)$ and the subset $S_c$ for the transitions.
        /* Find $\tilde{\pi}_k$ with value $\tfrac{1}{\sqrt{t_k}}$-close to the optimal */
        $\tilde{\pi}_k \leftarrow \textsc{ExtendedValueIteration}(\check{r}, \hat{r}, \check{p}, \hat{p}, \tfrac{1}{\sqrt{t_k}})$ (Algorithm 2 in Tossou et al. [27])
    Execute the policy $\tilde{\pi}_k$:
    while $\sum_{s,a} \frac{N_k(s,a)}{\max\{1,\, N_{t_k}(s,a)\}} < 1$ do
        Play action $a_t$ and observe $r_t$, $s_{t+1}$; let $r_t \leftarrow \mathrm{Bern}(r_t)$
        Increase $N_k$, $N_k(s_t, a_t)$ and $N_{t_{k+1}}(s_t, a_t)$ by 1
        Update the posterior $\mu_{t+1}$ using Bayes' rule
        $t \leftarrow t + 1$
    end while
end for

Our analysis is based on the choice of a specific prior distribution for MDPs with bounded rewards.

Prior Distribution. We consider two different prior distributions: one for computing lower bounds on rewards/transitions, that is, when computing the $\delta$-marginal quantile; and one for computing upper bounds on rewards/transitions, that is, when computing the $(1-\delta)$-marginal quantile. For the lower bound, we use independent distributions for the rewards and transitions: an independent distribution for the rewards of each state-action pair $(s,a)$, and an independent distribution for the transition from any state-action pair $(s,a)$ to any subset of next states $S_c$. The prior distribution for any of those components is a beta distribution with parameters $(0, 1)$: $\mathrm{Beta}(0, 1)$³. The situation is similar for the upper bound; however, here the prior distribution for any component is a beta distribution with parameters $(1, 0)$: $\mathrm{Beta}(1, 0)$³.

³ Technically, beta distributions are only defined for parameters strictly greater than 0. In this paper, when a parameter $\epsilon$ is 0, we compute the posterior and the quantiles by considering the limit as $\epsilon$ tends to 0.

Posterior Distribution. Let's start by assuming that the rewards come from a Bernoulli distribution. Using Bayes' rule, the posteriors at round $t_k$ are as follows. For the rewards of any $(s,a)$:
$$\mathrm{Beta}\Bigg( \alpha + \sum_{t \le t_k :\, s_t = (s,a)} r_t, \;\; \beta + N_{t_k}(s,a) - \sum_{t \le t_k :\, s_t = (s,a)} r_t \Bigg)$$
For the transitions from any $(s,a)$ to any subset of next states $S_c$:
$$\mathrm{Beta}\Bigg( \alpha + \sum_{t \le t_k :\, s_t = (s,a)} p_t, \;\; \beta + N_{t_k}(s,a) - \sum_{t \le t_k :\, s_t = (s,a)} p_t \Bigg)$$
where $p_t = 1$ if $s_{t+1} \in S_c$ and $p_t = 0$ otherwise; $\alpha = 1, \beta = 0$ for the upper posteriors and $\alpha = 0, \beta = 1$ for the lower posteriors.

Dealing with non-Bernoulli rewards. We deal with non-Bernoulli rewards by performing a Bernoulli trial on each observed reward.
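To make these updates concrete, the per-component posterior bookkeeping can be sketched as follows (a hypothetical standard-library-only illustration, not the authors' code; the class name and structure are ours). The Bernoulli trial on the observed reward is the rounding step just mentioned:

```python
import random

class BetaPosterior:
    """Hypothetical helper: tracks the Beta(alpha, beta) posterior
    parameters for one reward component, i.e. one (s, a) pair."""

    def __init__(self, alpha: float, beta: float):
        self.alpha = alpha  # "success" parameter (alpha = 1 for the upper prior)
        self.beta = beta    # "failure" parameter (beta = 1 for the lower prior)

    def update(self, reward: float, rng: random.Random) -> None:
        # Bernoulli trial on a reward in [0, 1]: drawing b ~ Bern(reward)
        # keeps the Beta posterior valid for non-Bernoulli rewards.
        b = 1 if rng.random() < reward else 0
        self.alpha += b
        self.beta += 1 - b

# Toy usage: the upper posterior (prior Beta(1, 0)) after the three
# already-Bernoulli rewards 1, 0, 1 becomes Beta(3, 1).
rng = random.Random(0)
post = BetaPosterior(alpha=1.0, beta=0.0)
for r in (1.0, 0.0, 1.0):
    post.update(r, rng)
print(post.alpha, post.beta)  # prints: 3.0 1.0
```

For rewards that are already 0 or 1 the trial is deterministic, so the posterior parameters are exactly the counts in the formulas above.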
In other words, upon observing $r_t$ we use $\mathrm{Bern}(r_t)$, a sample from the Bernoulli distribution with parameter $r_t$. This technique was already used in [1] and ensures that our prior remains valid.

Quantiles. When $N_{t_k}(s,a) = 0$, the lower and upper quantiles are respectively 0 and 1. When the first parameter of the posterior is 0, the lower quantile is 0. When the second parameter of the posterior is 0, the upper quantile is 1. In all other cases, the $\delta$ quantile corresponds to the inverse cumulative distribution function of the posterior at the point $\delta$. To achieve a high-probability bound of $1 - \delta$ on our regret, we use the following parameters, respectively for the rewards and transitions:
$$\delta^k_r = \frac{\delta}{4 S A \ln(2t)}, \qquad \delta^k_p = \frac{\delta}{8 S^2 A \ln(2t)},$$
where $1 - \delta$ is the desired confidence level of the set of plausible MDPs.

Theorem 1 (Upper Bound on the Regret of BUCRL). With probability at least $1 - \delta$, for any $\delta \in \,]0, 1[$ and any $T \ge 1$, the regret of BUCRL is bounded by:
$$R(T) \le 20 \cdot \sqrt{\min\{S,\, \log_2^2(2D)\}\; D\, T\, S\, A \log(T) \ln\!\left(\frac{B}{\delta}\right)} \;+\; 9\, D S A \ln\!\left(\frac{B}{\delta}\right)$$
for $B = 9 S \sqrt{T D S A}\, \ln(T S A)$.

Proof. Our proof is based on the generic proof provided in Tossou et al. [27]. To apply that generic proof, we need to show that, with high probability, the true rewards/transitions of any state-action pair are contained between the lower and upper quantiles of the Bayesian posterior. In other words, we need to show that the Bayesian quantiles provide exact coverage probabilities. For that, we notice that our priors lead to the same confidence interval as the Clopper-Pearson interval (see Lemma 1). Furthermore, we need to provide upper and lower bounds for the maximum deviation of the Bayesian posterior quantiles from the empirical values. This is a direct consequence of Propositions 2 and 3.

The following results were all useful in establishing our main result in Theorem 1.
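Before turning to those results, the quantile step described above can be illustrated in code (a hypothetical sketch, not the paper's implementation). The posterior quantiles are inverse Beta CDFs, with the stated edge cases for zero parameters; here the Beta CDF is computed through the classical Beta-Binomial identity that the appendix also relies on (Fact 1), so only the standard library is needed, and the inverse is taken by bisection. A real implementation would call a library inverse-CDF routine instead.

```python
from math import comb

def binom_cdf(m: int, n: int, x: float) -> float:
    """P[Binom(n, x) <= m], computed directly from the pmf."""
    if m < 0:
        return 0.0
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) for k in range(min(m, n) + 1))

def beta_cdf(p: float, a: int, b: int) -> float:
    """P[Beta(a, b) <= p] for integers a, b > 0, via the Beta-Binomial
    identity P[Y_{a,b} <= p] = P[Binom(a+b-1, 1-p) <= b-1]."""
    return binom_cdf(b - 1, a + b - 1, 1 - p)

def beta_quantile(q: float, a: int, b: int, tol: float = 1e-12) -> float:
    """q-th quantile of a Beta(a, b) posterior by bisection on its CDF.
    Edge cases mirror the text: a first parameter of 0 gives a lower
    quantile of 0; a second parameter of 0 gives an upper quantile of 1."""
    if a == 0:
        return 0.0
    if b == 0:
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Beta(1, 1) is uniform on [0, 1], so its median is 0.5.
print(round(beta_quantile(0.5, 1, 1), 6))  # prints: 0.5
```

The direct pmf summation costs $O(n)$ per CDF evaluation, which is fine for illustration; at the scale of the experiments one would use an incomplete-beta routine instead.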
Our main contribution in Proposition 4 is the upper bound (the first term of the upper bound) on the KL-divergence of two Bernoulli random variables. The last term of the upper bound is a direct derivation from the upper bounds in [6]. Our result in Proposition 4 shows a factor-of-2 improvement in the leading term of the upper bound. The KL-divergence of Bernoulli random variables is useful for many online learning problems, and we use it here to bound the quantiles of Binomial distributions in terms of simple functions.

Proposition 4 (Bernoulli KL-Divergence). For any numbers $p$ and $x$ such that $0 < p < 1$ and $0 \le x \le q$, where $q = 1 - p$, we have:
$$\frac{x^2}{2(pq + x(q-p)/3)} \;\le\; D(p + x \,\|\, p) \;\le\; \frac{x^2}{2(pq - xp/2)} \;\le\; \frac{x^2}{pq},$$
where $D(p + x \,\|\, p)$ denotes the KL-divergence between two Bernoulli random variables with parameters $p + x$ and $p$.

Proof Sketch. The main idea for proving the upper bound is to study the sign of the function of $x$ obtained by taking the difference of the KL-divergence and the upper bound. We use Sturm's theorem to show, essentially, that this function starts as a decreasing function and then, after a point, becomes increasing for the remainder of its domain. This, together with the observation that at the end of its domain the function is non-positive, concludes our proof. Full details are available in the appendix.

Proposition 1 provides tight lower and upper bounds for the quantile of the binomial distribution in the same simple form as Bernstein inequalities. Binomial distributions and their quantiles are useful in many applications, and we use them here to derive the bounds for the quantiles of a Beta distribution in Propositions 2 and 3.

Proposition 1 (Lower and Upper Bounds on the Binomial Quantile). Let $X_{n,p} \sim \mathrm{Binom}(n, p)$. For any $\delta$ such that $0.5 \le 1 - \delta < 1$, the quantile $Q(\mathrm{Binom}(n,p), 1-\delta)$ of $X_{n,p}$ obeys:
$$\left\lceil np + C_l\big(p, \Phi^{-1}(1-\delta)\big) \right\rceil \;\le\; Q(\mathrm{Binom}(n,p), 1-\delta) \;\le\; \left\lceil np + C_u\big(p, \Phi^{-1}(1-\delta)\big) \right\rceil$$
where
$$C_u(x, y) = \min\left\{ n(1-x),\; \sqrt{y^2\left( nx(1-x) + \frac{(1-2x)^2 y^2}{36} \right)} + \frac{(1-2x)\, y^2}{6} \right\} \tag{1}$$
$$C_l(x, y) = \max\left\{ 0,\; \min\left\{ n(1-x) - 1,\; \sqrt{y^2\left( nx(1-x) + \frac{x^2 y^2}{16} \right)} - \frac{x y^2}{4} - 1 \right\} \right\} \tag{2}$$
with $\Phi^{-1}$ the quantile function of the standard normal distribution.

Proof Sketch. We use the tight bounds for the CDF of the Binomial in [29]. We invert those bounds and then use the upper and lower bounds for the KL-divergence in Proposition 4 to conclude. Full details are available in the appendix.

Propositions 2 and 3 provide lower and upper bounds for Beta quantiles in terms of simple functions, similar to those of Bernstein inequalities. We use them to prove our main result in Theorem 1.

Proposition 2 (Upper Bound on the Beta Quantile). Let $Y_{x+1,\, n-x}$ be $\mathrm{Beta}(x+1, n-x)$ for integers $x, n$ such that $0 \le x < n$ and $n > 0$. The $(1-\delta)$-th quantile of $Y$, denoted by $Q(\mathrm{Beta}(x+1, n-x), 1-\delta)$ with $0.5 \le 1-\delta < 1$, satisfies:
$$Q(\mathrm{Beta}(x+1, n-x), 1-\delta) \;\le\; \frac{x}{n} + \sqrt{\frac{x}{n}\left(1 - \frac{x}{n}\right)\frac{y^2}{n}} + \frac{1}{n}\left( y^2\left( \frac{5}{6} + \sqrt{\frac{7}{12}} \right) + 2y + 2 \right) \tag{3}$$
where $y = \Phi^{-1}(1-\delta)$, with $\Phi^{-1}$ the quantile function of the standard normal distribution.

Proof Sketch. These bounds come directly from the relation between Beta and Binomial CDFs. We apply Proposition 1, which gives a bound for the quantile $p$ in terms of $p(1-p)$. We then apply Proposition 1 again to bound $p(1-p)$ in terms of $\frac{x}{n}(1 - \frac{x}{n})$. The full proof is available in the appendix.

Proposition 3 (Lower Bound on the Beta Quantile). Let $Y_{x,\, n-x+1}$ be $\mathrm{Beta}(x, n-x+1)$ for integers $x, n$ such that $0 < x \le n$ and $n > 0$. The $\delta$-th quantile of $Y_{x,n-x+1}$, denoted by $Q(\mathrm{Beta}(x, n-x+1), \delta)$ with $0.5 \le 1-\delta < 1$, satisfies:
$$Q(\mathrm{Beta}(x, n-x+1), \delta) \;\ge\; \frac{x}{n} - \sqrt{\frac{x}{n}\left(1 - \frac{x}{n}\right)\frac{y^2}{n}} - \frac{1}{n}\left( y^2\left( \frac{5}{6} + \sqrt{\frac{7}{12}} \right) + 2y + 2 \right) \tag{4}$$
where $y = \Phi^{-1}(1-\delta)$, with $\Phi^{-1}$ the quantile function of the standard normal distribution.

Proof Sketch. The proof follows almost exactly the same steps as the proof of Proposition 2.

3 Experimental Analysis

We empirically evaluate the performance of BUCRL in comparison with that of UCRL-V [27], KL-UCRL [7] and UCRL2 [10]. We also compare against TSDE [18], a variant of posterior sampling for reinforcement learning suited to infinite-horizon problems. We use the environments Bandits, RiverSwim, GameOfSkill-v1 and GameOfSkill-v2 as described in Tossou et al. [27]. We also eliminate unintentional bias and variance in the exact way described in [27]. Figure 1 illustrates the evolution of the average regret along with a confidence region (standard deviation). Figure 1 is a log-log plot where the ticks represent the actual values.

Experimental Setup. The confidence hyper-parameter $\delta$ of UCRL-V, KL-UCRL and UCRL2 is set to 0.05. TSDE is initialized with independent $\mathrm{Beta}(\frac{1}{2}, \frac{1}{2})$ priors for each reward $r(s,a)$ and a Dirichlet prior with parameters $(\alpha_1, \ldots, \alpha_S)$, where $\alpha_i = \frac{1}{S}$, for the transition functions $p(\cdot \mid s,a)$. We plot the average regret of each algorithm over $T = 2^{24}$ rounds, computed using 40 independent trials.

Implementation Notes on BUCRL. We note here that the quantiles for any subset of next states can be computed efficiently, with a complexity linear in $SA$ rather than the naive exponential complexity. This is because the posterior for any subset of next states only depends on the sum of the rewards of its constituents.

Results and Discussion. We can see that BUCRL outperforms UCRL-V in all environments except Bandits.
This is in line with the theoretical regret, whereby we can see that using the Bernstein bound is worse than the Bayesian quantile by a multiplicative factor. Note that this is not an artifact of the proof. Indeed, pure optimism can be seen as using the proof inside the algorithm, whereas the Bayesian version provides a general algorithm that has to be proven separately. Consequently, the actual performance of the Bayesian algorithm can often be much better than the bounds provided.

4 Conclusion

In conclusion, using Bayesian quantiles leads to an algorithm with strong performance while enjoying the best of both the frequentist and Bayesian views. It also provides a conceptually simple and very general algorithm for different scenarios. Although we were only able to prove its performance for bounded rewards in $[0, 1]$ and a specific prior, we believe it should be possible to provide proofs for other reward distributions and priors, such as Gaussian. As future work, it would be interesting to explore how one can re-use the idea of BUCRL in non-tabular settings, such as with linear function approximation or deep learning.

Figure 1: Time evolution of the average regret for BUCRL (Bayes-UCRL), UCRL-V, TSDE, KL-UCRL, and UCRL2 on (a) RiverSwim, (b) GameOfSkill-v1, (c) Bandits, (d) GameOfSkill-v2. [Log-log plots of average regret versus round; plot data omitted.]

References

[1] Agrawal, S. and Goyal, N. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39.1-39.26, 2012.
[2] Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning.
arXiv preprint arXiv:1703.05449, 2017.
[3] Babu, G. J. and Rao, C. R. Joint asymptotic distribution of marginal quantiles and quantile functions in samples from a multivariate population. In Multivariate Statistics and Probability, pp. 15-23. Elsevier, 1989.
[4] Bartlett, P. L. and Tewari, A. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI '09, pp. 35-42. AUAI Press, 2009.
[5] Chiani, M., Dardari, D., and Simon, M. K. New exponential bounds and approximations for the computation of error probability in fading channels. IEEE Transactions on Wireless Communications, 2(4):840-845, 2003.
[6] Dragomir, S. S., Scholz, M., and Sunde, J. Some upper bounds for relative entropy and applications. Computers & Mathematics with Applications, 39(9-10):91-100, 2000.
[7] Filippi, S., Cappé, O., and Garivier, A. Optimism in reinforcement learning and Kullback-Leibler divergence. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pp. 115-122. IEEE, 2010.
[8] Fruit, R., Pirotta, M., Lazaric, A., and Ortner, R. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. arXiv preprint, 2018.
[9] Gocgun, Y., Bresnahan, B. W., Ghate, A., and Gunn, M. L. A Markov decision process approach to multi-category patient scheduling in a diagnostic facility. Artificial Intelligence in Medicine, 53(2):73-81, 2011.
[10] Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563-1600, 2010.
[11] Janson, S. Large deviation inequalities for sums of indicator variables. arXiv preprint arXiv:1609.00533, 2016.
[12] Kaas, R. and Buhrman, J. M. Mean, median and mode in binomial distributions. Statistica Neerlandica, 34(1):13-18, 1980.
[13] Kaufmann, E., Cappé, O., and Garivier, A. On Bayesian upper confidence bounds for bandit problems. In AISTATS, 2012.
[14] Lu, T., Pál, D., and Pál, M. Showing relevant ads via context multi-armed bandits. In Proceedings of AISTATS, 2009.
[15] Osband, I. and Van Roy, B. Posterior sampling for reinforcement learning without episodes. arXiv preprint arXiv:1608.02731, 2016.
[16] Osband, I. and Van Roy, B. Why is posterior sampling better than optimism for reinforcement learning? In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2701-2710, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/osband17a.html.
[17] Osband, I., Russo, D., and Van Roy, B. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003-3011, 2013.
[18] Ouyang, Y., Gagrani, M., Nayyar, A., and Jain, R. Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems, pp. 1333-1342, 2017.
[19] Psaraftis, H. N., Wen, M., and Kontovas, C. A. Dynamic vehicle routing problems: Three decades and counting. Networks, 67(1):3-31, 2016.
[20] Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[21] Pébay, P., Rojas, J., and Thompson, D. C. Sturm's theorem with endpoints. July 2019.
[22] Qin, L., Chen, S., and Zhu, X. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 461-469. SIAM, 2014.
[23] Short, M. Improved inequalities for the Poisson and binomial distribution and upper tail quantile functions. ISRN Probability and Statistics, 2013, 2013.
[24] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint, 2017.
[25] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
[26] Thulin, M. The cost of using exact confidence intervals for a binomial proportion. Electronic Journal of Statistics, 8(1):817-840, 2014.
[27] Tossou, A., Basu, D., and Dimitrakakis, C. Near-optimal optimistic reinforcement learning using empirical Bernstein inequalities. arXiv e-prints, arXiv:1905.12425, May 2019.
[28] Yap, C.-K. Fundamental Problems of Algorithmic Algebra, volume 49. Oxford University Press, 2000.
[29] Zubkov, A. M. and Serov, A. A. A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications, 57(3):539-544, 2013.

A Proofs

A.1 Proof of Theorem 1

Our proof is a direct application of the generic proof provided in Section B.2 of Tossou et al. [27]. To use that generic proof, we need to show that, with high probability, the true rewards/transitions of any state-action pair are contained in the lower and upper intervals of the Bayesian posterior. This is a direct consequence of Lemma 1 and the fact that our posterior matches the Beta distribution used in Lemma 1. Furthermore, we need to provide lower and upper bounds for the maximum deviation of the Bayesian posteriors from their empirical values. This comes directly from using Proposition 2 and Proposition 3, and bounding $\Phi^{-1}$ using equation (15) in Chiani et al. [5].

Lemma 1 (Coverage Probability of the Beta Quantile for Bernoulli Random Variables). Let $X_1, \ldots, X_n$ be $n$ independent Bernoulli random variables with common parameter $\mu$ such that $0 < \mu < 1$ and $n \ge 1$. Let $X = \sum_{i=1}^n X_i$ denote the corresponding Binomial random variable. Let $U_{1-\delta}(X)$ be the (random) $(1-\delta)$-th quantile of the distribution $\mathrm{Beta}(X+1, n-X)$ and $U_\delta(X)$ the $\delta$-th quantile of the distribution $\mathrm{Beta}(X, n-X+1)$. If $0 < X < n$, we have:
$$P\left[ U_\delta(X) \le \mu \le U_{1-\delta}(X) \,\middle|\, \mu \right] \ge 1 - 2\delta.$$

Proof. Since each $X_i$ is a Bernoulli random variable with parameter $\mu$, $X = \sum_{i=1}^n X_i$ is a Binomial random variable with parameters $(n, \mu)$. According to equation (4) of Thulin [26], the quantiles of the Beta distributions used in this lemma correspond exactly to the upper one-sided Clopper-Pearson interval (for the Binomial distribution), whose coverage probability is at least $1 - \delta$ by construction [26]. The same argument holds for the lower one-sided Clopper-Pearson interval. Combining them concludes the proof.

Proposition 1 (Lower and Upper Bounds on the Binomial Quantile). Let $X_{n,p} \sim \mathrm{Binom}(n, p)$. For any $\delta$ such that $0.5 \le 1 - \delta < 1$, the quantile $Q(\mathrm{Binom}(n,p), 1-\delta)$ of $X_{n,p}$ obeys:
$$\left\lceil np + C_l\big(p, \Phi^{-1}(1-\delta)\big) \right\rceil \;\le\; Q(\mathrm{Binom}(n,p), 1-\delta) \;\le\; \left\lceil np + C_u\big(p, \Phi^{-1}(1-\delta)\big) \right\rceil$$
where
$$C_u(x, y) = \min\left\{ n(1-x),\; \sqrt{y^2\left( nx(1-x) + \frac{(1-2x)^2 y^2}{36} \right)} + \frac{(1-2x)\, y^2}{6} \right\} \tag{5}$$
$$C_l(x, y) = \max\left\{ 0,\; \min\left\{ n(1-x) - 1,\; \sqrt{y^2\left( nx(1-x) + \frac{x^2 y^2}{16} \right)} - \frac{x y^2}{4} - 1 \right\} \right\} \tag{6}$$
with $\Phi^{-1}$ the quantile function of the standard normal distribution.

Proof. Using basic computation, we can verify that the bounds hold trivially for $p = 0$, for $p = 1$ and for $n = 0$. Furthermore, it is known that any median $m$ of the Binomial satisfies $\lfloor np \rfloor \le m \le \lceil np \rceil$ [12], so our bounds also hold for $1 - \delta = 0.5$. As a result, we can focus the proof on the case where $0 < p < 1$, $n > 0$ and $0.5 \le 1 - \delta < 1$.

From equation (1) in [29] we have:
$$\Phi\left( \mathrm{sgn}\Big(\tfrac{k}{n} - p\Big) \sqrt{2 n D\Big(\tfrac{k}{n} \,\Big\|\, p\Big)} \right) \;\le\; P\{X_{n,p} \le k\} \;\le\; \Phi\left( \mathrm{sgn}\Big(\tfrac{k+1}{n} - p\Big) \sqrt{2 n D\Big(\tfrac{k+1}{n} \,\Big\|\, p\Big)} \right) \tag{7}$$
for $0 \le k < n$. Let's also observe that when $k = n$, the lower bound in (7) trivially holds, since $P\{X_{n,p} \le k\} = 1 \ge \Phi\big( \mathrm{sgn}(\tfrac{k}{n} - p) \sqrt{2 n D(\tfrac{k}{n} \| p)} \big)$.

Proof of the upper bound. Our upper bound provides a correction to Theorem 5 in Short [23]. Consider any $k$ ($0 \le k \le n$) such that:
$$\Phi\left( \mathrm{sgn}\Big(\tfrac{k}{n} - p\Big) \sqrt{2 n D\Big(\tfrac{k}{n} \,\Big\|\, p\Big)} \right) \ge 1 - \delta. \tag{8}$$
Combining (8) with the left side of (7), we have that $P\{X_{n,p} \le k\} \ge 1 - \delta$ and, as a result:
$$Q(\mathrm{Binom}(n,p), 1-\delta) = \inf\{x : P\{X_{n,p} \le x\} \ge 1 - \delta\} \le k. \tag{9}$$
So we just need to find a value $k$ satisfying (8). Since $\Phi^{-1}$, the inverse of the normal CDF, is continuous and increasing, applying $\Phi^{-1}$ to (8) gives:
$$\mathrm{sgn}\Big(\tfrac{k}{n} - p\Big) \sqrt{2 n D\Big(\tfrac{k}{n} \,\Big\|\, p\Big)} \ge \Phi^{-1}(1-\delta). \tag{10}$$

The sign of $\frac{k}{n} - p$: Assume that $Q(\mathrm{Binom}(n,p), 1-\delta) \le \lfloor np \rfloor$. In that case, our upper bound trivially holds, since
$$C_u(x, y) \ge \sqrt{\frac{(1-2x)^2 y^4}{36}} + \frac{(1-2x) y^2}{6} \ge \left| \frac{(1-2x) y^2}{6} \right| + \frac{(1-2x) y^2}{6} \ge 0.$$
We can then focus on the case where $Q(\mathrm{Binom}(n,p), 1-\delta) > \lfloor np \rfloor$. Since the Binomial distribution is discrete with the integers as its domain, $Q(\mathrm{Binom}(n,p), 1-\delta) > \lfloor np \rfloor$ implies that $Q(\mathrm{Binom}(n,p), 1-\delta) \ge \lfloor np \rfloor + 1$. As a result we have $k \ge Q(\mathrm{Binom}(n,p), 1-\delta) \ge \lfloor np \rfloor + 1 > np$ and $\mathrm{sgn}(\frac{k}{n} - p) = 1$.

Let $x$ be a number such that $\frac{k}{n} = p + x$. Using this in (10), we thus need to find an $x$ such that:
$$D(p + x \,\|\, p) \ge \frac{\Phi^{-1}(1-\delta)^2}{2n}.$$
Consider a function $g$ such that $D(p + x \,\|\, p) \ge g(x)$. If we find an $x$ such that $g(x) \ge \frac{\Phi^{-1}(1-\delta)^2}{2n}$, then it follows that $D(p + x \,\|\, p) \ge \frac{\Phi^{-1}(1-\delta)^2}{2n}$. We pick $g$ to be the lower bound on $D(p + x \,\|\, p)$ in Theorem 4. Now let's observe that, since $\mathrm{sgn}(\frac{k}{n} - p) = 1$, we have $x \ge 0$.
Also, $q - x = 1 - p - x = 1 - \frac{k}{n} \geq 0$, so that $x \leq q$. The conditions of Proposition 4 are thus satisfied, and our goal becomes finding an $x \geq 0$ such that:

$$\frac{x^2}{2\big(pq + x(q-p)/3\big)} \geq \frac{\Phi^{-1}(1-\delta)^2}{2n}.$$

Solving this inequality leads to the upper bound part of the Theorem.

Proof of the lower bound. If $Q(\mathrm{Binom}(n,p),\, 1-\delta) = n$, it is easy to verify that our lower bound trivially holds. So we can focus on the case where $Q(\mathrm{Binom}(n,p),\, 1-\delta) < n$. Consider any $k$ ($0 \leq k < n$) such that:

$$\Phi\left(\operatorname{sgn}\Big(\tfrac{k+1}{n}-p\Big)\sqrt{2n\, D\big(\tfrac{k+1}{n}\,\big\|\, p\big)}\right) \leq 1-\delta. \tag{11}$$

Combining (11) with the right side of (7), we have $\mathbb{P}\{X_{n,p} \leq k\} \leq 1-\delta$ and, since the CDF of a Binomial is an increasing function:

$$Q\big(\mathrm{Binom}(n,p),\, 1-\delta\big) \geq k. \tag{12}$$

The sign of $\frac{k+1}{n}-p$: Note that the quantile function of the Binomial distribution is increasing (it is the inverse of the CDF, and the CDF is increasing). So we have $Q(\mathrm{Binom}(n,p),\, 1-\delta) \geq Q(\mathrm{Binom}(n,p),\, \frac{1}{2})$. As a result, there exists a number $k$ satisfying both (12) and $Q(\mathrm{Binom}(n,p),\, \frac{1}{2}) \leq k$; we will try to find this number. Observe that $Q(\mathrm{Binom}(n,p),\, \frac{1}{2})$ is the (smallest) median of the Binomial distribution, and thus $Q(\mathrm{Binom}(n,p),\, \frac{1}{2}) \geq \lfloor np \rfloor$ [12]. So,

$$k \;\geq\; Q\big(\mathrm{Binom}(n,p),\, \tfrac{1}{2}\big) \;\geq\; \lfloor np \rfloor. \tag{13}$$

As a result, we have $k + 1 \geq \lfloor np \rfloor + 1 > np$ and $\operatorname{sgn}(\tfrac{k+1}{n}-p) = 1$. Our objective is then to find a $k \geq \lfloor np \rfloor$ satisfying (11). Let $x$ be a number such that $\frac{k+1}{n} = p + x$. Applying the inverse $\Phi^{-1}$ to (11) and replacing $\frac{k+1}{n}$ by $p + x$, our objective becomes finding an $x \geq 0$ such that:

$$D(p+x \,\|\, p) \leq \frac{\Phi^{-1}(1-\delta)^2}{2n}.$$

This objective is equivalent to finding an $x$ such that $g(x) \leq \frac{\Phi^{-1}(1-\delta)^2}{2n}$ for a function $g$ such that $D(p+x\|p) \leq g(x)$. We can easily verify that $x \geq 0$ and $x \leq q$ (since $q - x = 1 - p - x = 1 - \frac{k+1}{n} \geq 0$).
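The bound being derived here — Proposition 1's two-sided control of the Binomial quantile — can be sanity-checked against the exact quantile. A small pure-Python sketch (helper names are ours; `Cu`/`Cl` transcribe (5)–(6)):

```python
import math
from statistics import NormalDist

def Cu(n, x, y):
    """C_u(x, y) from (5)."""
    return min(n * (1 - x),
               math.sqrt(y * y * (n * x * (1 - x) + (1 - 2 * x) ** 2 * y * y / 36))
               + (1 - 2 * x) * y * y / 6)

def Cl(n, x, y):
    """C_l(x, y) from (6)."""
    return max(0.0,
               min(n * (1 - x) - 1,
                   math.sqrt(y * y * (n * x * (1 - x) + x * x * y * y / 16))
                   - x * y * y / 4 - 1))

def exact_binom_quantile(n, p, level):
    """Smallest k with P[X <= k] >= level for X ~ Binom(n, p)."""
    cdf = 0.0
    for k in range(n + 1):
        cdf += math.comb(n, k) * p**k * (1 - p) ** (n - k)
        if cdf >= level:
            return k
    return n

def quantile_bounds(n, p, delta):
    """(np + C_l, np + C_u) sandwiching Q(Binom(n, p), 1 - delta), 0 < delta <= 0.5."""
    y = NormalDist().inv_cdf(1 - delta)
    return n * p + Cl(n, p, y), n * p + Cu(n, p, y)
```

Since the proposition rounds with ceilings, the exact quantile should lie between $np + C_l$ and $np + C_u + 1$.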
As a result, we pick $g$ to be the first upper bound on $D(p+x\|p)$ in Proposition 4. Our objective is thus to find an $x$ ($0 \leq x \leq 1-p$) such that:

$$\frac{x^2}{2\big(pq - xp/2\big)} \leq \frac{\Phi^{-1}(1-\delta)^2}{2n}.$$

Solving this inequality and picking a value for $x$ such that $0 \leq x \leq 1-p$ and $k \geq \lfloor np \rfloor$ leads to the lower bound part of the Theorem.

Fact 1 (See [13]). Let $Y_{a,b} \sim \mathrm{Beta}(a,b)$ be a random variable from the Beta distribution, where $a$ and $b$ are integers such that $a > 0$, $b > 0$. Then, for any $p \in [0,1]$:

$$\mathbb{P}(Y_{a,b} \leq p) = \mathbb{P}(X_{a+b-1,\, 1-p} \leq b-1) \tag{16}$$

$$\mathbb{P}(Y_{a,b} \geq p) = \mathbb{P}(X_{a+b-1,\, p} \leq a-1) \tag{17}$$

where $X_{n,x}$ denotes a random variable distributed according to the Binomial distribution with parameters $(n,x)$ (i.e. $X_{n,x} \sim \mathrm{Binom}(n,x)$).

Proposition 2 (Upper bound on the Beta quantile). Let $Y_{x+1,\,n-x} \sim \mathrm{Beta}(x+1,\, n-x)$ for integers $x, n$ such that $0 \leq x < n$ and $n > 0$. The $(1-\delta)$-th quantile of $Y_{x+1,\,n-x}$, denoted by $Q(\mathrm{Beta}(x+1, n-x),\, 1-\delta)$, with $0.5 \leq 1-\delta < 1$, satisfies:

$$Q\big(\mathrm{Beta}(x+1, n-x),\, 1-\delta\big) \;\leq\; \frac{x}{n} + \sqrt{\frac{x}{n}\Big(1-\frac{x}{n}\Big)\frac{y^2}{n}} + \frac{1}{n}\left(y^2\left(\frac{5}{6} + \sqrt{\frac{7}{12}}\right) + 2y + 2\right) \tag{18}$$

where $y = \Phi^{-1}(1-\delta)$, with $\Phi^{-1}$ the quantile function of the standard normal distribution.

Proof. For simplicity, in this proof we write $p = Q(\mathrm{Beta}(x+1, n-x),\, 1-\delta)$ and $y = \Phi^{-1}(1-\delta)$. Using Equation (16), we have $\mathbb{P}[Y_{x+1,\,n-x} \leq p] = \mathbb{P}[X_{n,\,1-p} \leq n-x-1]$. Since the CDF of the Beta distribution is continuous, we know that $\mathbb{P}[Y_{x+1,\,n-x} \leq p] = 1-\delta$. So we have $\mathbb{P}[X_{n,\,1-p} \leq n-x-1] = 1-\delta$. Using the upper bound for the Binomial quantile in Proposition 1, we have:

$$n - x - 1 \;\leq\; n(1-p) + C_u\big(1-p,\, \Phi^{-1}(1-\delta)\big) + 1$$

where $C_u$ is the function defined in (5). This leads to:

$$p \;\leq\; \frac{x}{n} + \frac{C_u(1-p,\, y) + 2}{n} = \frac{x}{n} + \sqrt{\frac{p(1-p)\,y^2}{n} + \frac{(2p-1)^2 y^4}{36 n^2}} + \frac{(2p-1)\,y^2}{6n} + \frac{2}{n}. \tag{19}$$

We would like to find an upper bound for $p(1-p)$ in (19) that depends on $\frac{x}{n}\big(1-\frac{x}{n}\big)$.
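Fact 1's Beta–Binomial duality, which the surrounding proofs use through (16) and (17), can be checked numerically. The sketch below (helper names are ours) integrates the Beta density with Simpson's rule, using the closed-form normalizing constant for integer parameters, and compares against the Binomial sums on the right-hand sides of (16)–(17):

```python
import math

def beta_cdf(a, b, p, steps=2000):
    """P[Y <= p] for Y ~ Beta(a, b), integer a, b >= 1, via Simpson's rule."""
    norm = math.factorial(a + b - 1) / (math.factorial(a - 1) * math.factorial(b - 1))
    f = lambda t: norm * t ** (a - 1) * (1 - t) ** (b - 1)
    h = p / steps  # 'steps' must be even
    total = f(0.0) + f(p)
    total += 4 * sum(f((2 * i - 1) * h) for i in range(1, steps // 2 + 1))
    total += 2 * sum(f(2 * i * h) for i in range(1, steps // 2))
    return total * h / 3

def binom_cdf(n, x, k):
    """Exact P[X <= k] for X ~ Binom(n, x)."""
    return sum(math.comb(n, i) * x**i * (1 - x) ** (n - i) for i in range(k + 1))
```

For instance, with $a = 3$, $b = 5$, $p = 0.4$, both sides of (16) agree to integration accuracy.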
Using Equation (16) with the lower bound for the Binomial quantile in Proposition 1, we have:

$$p \;\geq\; \frac{x}{n} + \frac{\max\Big(0,\; \min\big(np - 1,\; C_l(1-p,\, \Phi^{-1}(1-\delta))\big)\Big)}{n} \;\geq\; \frac{x}{n}. \tag{20}$$

Multiplying equations (20) and (19) together (both sides are all positive) leads to:

$$p(1-p) \;\leq\; \Big(1 - \frac{x}{n}\Big)\left(\frac{x}{n} + \sqrt{\frac{p(1-p)\,y^2}{n} + \frac{(2p-1)^2 y^4}{36 n^2}} + \frac{(2p-1)\,y^2}{6n} + \frac{2}{n}\right). \tag{21}$$

Using the fact that $\sqrt{a+b} \leq \sqrt{a} + \sqrt{b}$ and $2p-1 \leq 1$, and using $1 - \frac{x}{n} \leq 1$ for the terms not involving $\frac{x}{n}$, we have:

$$p(1-p) \;\leq\; \frac{x}{n}\Big(1-\frac{x}{n}\Big) + \sqrt{\frac{p(1-p)\,y^2}{n}} + \frac{y^2 + 6}{3n}. \tag{22}$$

Letting $z = \sqrt{p(1-p)}$ in (22) leads to an inequality involving a polynomial of degree $2$ in $z$. Solving this inequality and then using $\sqrt{a+b} \leq \sqrt{a} + \sqrt{b}$:

$$\sqrt{p(1-p)} \;\leq\; \sqrt{\frac{x}{n}\Big(1-\frac{x}{n}\Big)} + \frac{1}{\sqrt{n}}\left(\sqrt{\frac{7y^2+24}{12}} + \sqrt{\frac{y^2}{4}}\right). \tag{23}$$

Replacing (23) into (19) and using the fact that $2p-1 \leq 1$, we obtain the desired upper bound of the proposition.

Proposition 3 (Lower bound on the Beta quantile). Let $Y_{x,\,n-x+1} \sim \mathrm{Beta}(x,\, n-x+1)$ for integers $x, n$ such that $0 < x \leq n$ and $n > 0$. The $\delta$-th quantile of $Y_{x,\,n-x+1}$, denoted by $Q(\mathrm{Beta}(x, n-x+1),\, \delta)$, with $0.5 \leq 1-\delta < 1$, satisfies:

$$Q\big(\mathrm{Beta}(x, n-x+1),\, \delta\big) \;\geq\; \frac{x}{n} - \sqrt{\frac{x}{n}\Big(1-\frac{x}{n}\Big)\frac{y^2}{n}} - \frac{1}{n}\left(y^2\left(\frac{5}{6} + \sqrt{\frac{7}{12}}\right) + 2y + 2\right) \tag{24}$$

where $y = \Phi^{-1}(1-\delta)$, with $\Phi^{-1}$ the quantile function of the standard normal distribution.

Proof. Let us denote $p = Q(\mathrm{Beta}(x, n-x+1),\, \delta)$. Using (17), we have that $\mathbb{P}(Y_{x,\,n-x+1} \geq p) = \mathbb{P}(X_{n,p} \leq x-1)$. Since the Beta distribution is continuous and has a continuous CDF, there exists a unique $p$ such that $\mathbb{P}(Y_{x,\,n-x+1} \geq p) = 1 - \mathbb{P}(Y_{x,\,n-x+1} \leq p) = 1-\delta$. As a result, we have $\mathbb{P}(X_{n,p} \leq x-1) = 1-\delta$.
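The closed-form bound (18) just proved can be compared against the exact Beta quantile. Rather than relying on an external library, the sketch below obtains the exact quantile by bisecting the Beta CDF computed through identity (16) (all names are ours):

```python
import math
from statistics import NormalDist

def beta_cdf(x, n, p):
    """P[Y <= p] for Y ~ Beta(x + 1, n - x), via identity (16)."""
    return sum(math.comb(n, i) * (1 - p) ** i * p ** (n - i) for i in range(n - x))

def beta_quantile(x, n, level):
    """(level)-quantile of Beta(x + 1, n - x), by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if beta_cdf(x, n, mid) < level:
            lo = mid
        else:
            hi = mid
    return hi

def prop2_upper(x, n, delta):
    """Right-hand side of (18)."""
    y = NormalDist().inv_cdf(1 - delta)
    r = x / n
    return (r + math.sqrt(r * (1 - r) * y * y / n)
            + (y * y * (5 / 6 + math.sqrt(7 / 12)) + 2 * y + 2) / n)
```

On a few $(n, x, \delta)$ triples the exact quantile sits below the bound, with the slack of order $1/n$ predicted by (18).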
Using the upper and lower bounds for the Binomial quantile in Proposition 1, we have, respectively:

$$p \;\geq\; \frac{x}{n} - \frac{C_u(p,\, y) + 2}{n} = \frac{x}{n} - \sqrt{\frac{p(1-p)\,y^2}{n} + \frac{(1-2p)^2 y^4}{36 n^2}} - \frac{(1-2p)\,y^2}{6n} - \frac{2}{n} \tag{25}$$

$$p \;\leq\; \frac{x}{n} - \frac{C_l\big(p,\, \Phi^{-1}(1-\delta)\big)}{n} \;\leq\; \frac{x}{n}. \tag{26}$$

We would like to find a lower bound for $p(1-p)$ in (25) that depends on $\frac{x}{n}\big(1-\frac{x}{n}\big)$. Inequality (26) implies that:

$$1 - p \;\geq\; 1 - \frac{x}{n}. \tag{27}$$

Note that we can multiply (27) by (25) to get a lower bound for $p(1-p)$ even if the right-hand side of (25) is negative, since both $p(1-p)$ and $1 - \frac{x}{n}$ are always non-negative. After this multiplication, we follow exactly the same steps as in the corresponding part of the proof of Proposition 2. We can do so because, even though we are now looking for a lower bound, all the terms previously upper bounded in Proposition 2 are here multiplied by $-1$. This completes the proof of this proposition.

A.2 Useful Results

Proposition 4 (Lower and upper bounds on the Bernoulli KL-divergence). For any numbers $p$ and $x$ such that $0 < p < 1$ and $0 \leq x \leq q$, where $q = 1-p$, we have:

$$\frac{x^2}{2\big(pq + x(q-p)/3\big)} \;\leq\; D(p+x \,\|\, p) \;\leq\; \frac{x^2}{2\big(pq - xp/2\big)} \;\leq\; \frac{x^2}{pq},$$

where $D(p+x\|p)$ denotes the KL-divergence between two Bernoulli random variables with parameters $p+x$ and $p$.

Proof. The proof of the lower bound already appears in Janson [11] (after equation (2.1)). First, let us observe that:

$$D(p+x \,\|\, p) = p\Big(1+\frac{x}{p}\Big)\ln\Big(1+\frac{x}{p}\Big) + h(x), \qquad h(x) = \begin{cases} q\big(1-\frac{x}{q}\big)\ln\big(1-\frac{x}{q}\big) & \text{if } x < q \\ 0 & \text{if } x = q. \end{cases}$$

Note that this is a valid definition of the KL-divergence since, for any $q \in\, ]0,1[$, $\lim_{x \to q^-} \big(1-\frac{x}{q}\big)\ln\big(1-\frac{x}{q}\big) = 0$.

Let $g$ be a parametric function defined by:

$$g(x) = p\Big(1+\frac{x}{p}\Big)\ln\Big(1+\frac{x}{p}\Big) + h(x) - \frac{x^2}{2\big(pq + x(q-p-a)/b\big)}$$

where $a$ and $b$ are constants (independent of $x$, but possibly depending on $p$) such that $pq + x(q-p-a)/b > 0$ for all $q \in\, ]0,1[$, $x \in [0,q]$.
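Before the formal argument via $g$, the three-way sandwich of Proposition 4 is easy to confirm numerically; a minimal pure-Python sketch (function names are ours):

```python
import math

def bernoulli_kl(a, p):
    """D(a || p) for Bernoulli parameters, with the 0*log(0) = 0 convention."""
    out = 0.0
    if a > 0:
        out += a * math.log(a / p)
    if a < 1:
        out += (1 - a) * math.log((1 - a) / (1 - p))
    return out

def kl_bounds(p, x):
    """Lower bound, tight upper bound, and loose upper bound of Proposition 4."""
    q = 1 - p
    lower = x * x / (2 * (p * q + x * (q - p) / 3))
    upper = x * x / (2 * (p * q - x * p / 2))
    loose = x * x / (p * q)
    return lower, upper, loose
```

For example, with $p = 0.3$ and $x = 0.2$: $D(0.5\|0.3) \approx 0.0872$, while the bounds evaluate to roughly $0.0845 \leq 0.0872 \leq 0.1111 \leq 0.1905$.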
We can immediately see that $g$ is continuous and differentiable on its domain $[0,q]$, since it is a sum of continuous and differentiable functions. For any $x \in [0,q[$, the derivative $g'(x)$ of $g(x)$ is:

$$g'(x) = \ln\Big(1+\frac{x}{p}\Big) - \ln\Big(1-\frac{x}{q}\Big) - \frac{4pqx + 2x^2(q-p-a)/b}{\big(2(pq + x(q-p-a)/b)\big)^2}$$

and $g'(0) = 0$. We can see that $g'$ is continuous and differentiable on $[0,q[$. The second derivative, for any $x \in [0,q[$, is:

$$g''(x) = \frac{1}{x+p} + \frac{1}{q-x} - \frac{p^2 q^2}{\big(pq + x(q-p-a)/b\big)^3} \tag{28}$$

$$= \frac{\dfrac{x^3(q-p-a)^3}{b^3} + \dfrac{3pq\,x^2(q-p-a)^2}{b^2} + p^2q^2x^2 - p^2q^2xa + \big(\tfrac{3}{b}-1\big)\,p^2q^2x\,(q-p-a)}{(x+p)(q-x)\big(pq + x(q-p-a)/b\big)^3}. \tag{29}$$

Proof of the lower bound. Let us set $a = 0$ and $b = 3$. In that case, we have, for any $x \in [0,q[$:

$$g''(x) = \frac{x^3(q-p)^3/27 + pq\,x^2(q-p)^2/3 + p^2q^2x^2}{(x+p)(q-x)\big(pq + x(q-p)/3\big)^3} \tag{30}$$

$$\geq \frac{-p\,x^3(q-p)^2/27 + pq\,x^2(q-p)^2/3 + p^2q^2x^2}{(x+p)(q-x)\big(pq + x(q-p)/3\big)^3} \tag{31}$$

$$\geq \frac{-pq\,x^2(q-p)^2/27 + pq\,x^2(q-p)^2/3 + p^2q^2x^2}{(x+p)(q-x)\big(pq + x(q-p)/3\big)^3} \;\geq\; 0 \tag{32}$$

where (31) uses $q-p \geq -p$ and (32) uses $x \leq q$. Furthermore, elementary calculations lead to $g(0) = g'(0) = 0$. Since $g'$ is continuous on $[0,q[$ and $g''$ is positive on $[0,q[$, we can conclude that $g'$ is increasing on $[0,q[$. Since $g'(0) = 0$, this means $g'(x) \geq 0$ for any $x \in [0,q[$. Since $g$ is continuous, $g'(x) \geq 0$ for any $x \in [0,q[$ means that $g$ is increasing on $[0,q[$. Using the fact that $g(0) = 0$, we have $g(x) \geq 0$ for any $x \in [0,q[$. We will now show that $g(x)$ is non-negative at $q$ too. Since $g$ is continuous at $q$, we have $\lim_{x \to q^-} g(x) = g(q)$. Also, $\lim_{x \to q^-} g(x) \geq 0$, since $g(x) \geq 0$ for any $x \in [0,q[$. As a result, $g(q) = \lim_{x \to q^-} g(x) \geq 0$, which concludes the proof of the lower bound.

Proof of the first upper bound. Let $a = q$ and $b = 2$.
We want to analyze the sign of the resulting $g''$ over its domain $[0,q[$. For that, observe that the denominator of $g''$ is always strictly positive. This means that the sign of $g''$ is the same as the sign of its numerator, which we denote $g''_0$. We have:

$$g''_0(x) = \frac{-p^3 x^3}{8} + \frac{3 p^3 q\, x^2}{4} + p^2q^2x^2 - p^2q^3x - \frac{p^3q^2 x}{2} \tag{33}$$

$$= x \cdot \left( \frac{-p^3 x^2}{8} + \frac{3 p^3 q\, x}{4} + p^2q^2 x - p^2q^3 - \frac{p^3q^2}{2} \right). \tag{34}$$

Let us denote $f(x) = \frac{-p^3 x^2}{8} + \frac{3 p^3 q\, x}{4} + p^2q^2 x - p^2q^3 - \frac{p^3q^2}{2}$. We will use Sturm's theorem (Theorem 2) to find the number of roots of $f$ in $]0,q]$. The Sturm sequence of $f$ is $\{f_0, f_1, f_2\}$ with:

$$f_0(x) = f(x) \tag{35}$$

$$f_1(x) = \frac{-p^3 x}{4} + \frac{3 p^3 q}{4} + p^2q^2 \tag{36}$$

$$f_2(x) = \frac{p\,\big({-5}p^4 + 26p^3 - 53p^2 + 48p - 16\big)}{8}. \tag{37}$$

We have:

$$f_0(0) = -p^2q^3 - \frac{p^3q^2}{2} < 0, \qquad f_1(0) = \frac{3 p^3 q}{4} + p^2q^2 > 0,$$

$$f_0(q) = \frac{p^3q^2}{8} > 0, \qquad f_1(q) = \frac{p^3 q + 2p^2q^2}{2} > 0,$$

and $f_2(q) = f_2(0)$ since $f_2$ is constant. The number of sign alternations in $\{f_0(0), f_1(0), f_2(0)\}$ is $1 + \mathbb{1}_{f_2(0) < 0}$, where $\mathbb{1}_{f_2(0) < 0} = 1$ if $f_2(0) < 0$ and $\mathbb{1}_{f_2(0) < 0} = 0$ otherwise. The number of sign alternations in $\{f_0(q), f_1(q), f_2(q)\}$ is $\mathbb{1}_{f_2(0) < 0}$. Observing that neither $0$ nor $q$ is a root of $f$, we can conclude by Sturm's theorem (Theorem 2) that the number of roots of $f$ in $[0,q]$ is exactly $1$. Since $f$ is a polynomial, this means that the sign of $f$ changes at most once in the interval $[0,q]$. Let $\alpha$ ($0 < \alpha < q$) be the unique root of $f$ in $[0,q]$. Then (ignoring zero values) $f$ has the same sign for all values in $[0,\alpha]$, and $f$ has the same sign for all values in $[\alpha,q]$. Observing that $f(0) < 0$, we get $f(x) \leq 0$ for any $x \in [0,\alpha]$; observing that $f(q) > 0$, we get $f(x) \geq 0$ for any $x \in [\alpha,q]$.
Since the second derivative $g''$ is the product of non-negative terms with $f$, it follows that $g''(x) \leq 0$ for any $x \in [0,\alpha]$ and $g''(x) \geq 0$ for any $x \in [\alpha,q[$. We will now derive the sign of $g'$ and $g$ over their domain. Since $g'(0) = 0$, $g''(x) \leq 0$ on $[0,\alpha]$ and $g'$ is continuous on $[0,\alpha]$, we can conclude that $g'$ is decreasing on $[0,\alpha]$ and that $g'(x) \leq 0$ for any $x \in [0,\alpha]$. A similar argument for $g$ allows us to conclude that $g(x) \leq 0$ for any $x \in [0,\alpha]$.

Since $g''(x) \geq 0$ for any $x \in [\alpha,q[$, the function $g'$ is increasing on the interval $[\alpha,q[$. Let $\alpha'$ be the lowest value in $[\alpha,q]$ such that $g'(\alpha') = 0$. This means that $g'(x) \leq 0$ for any $x \in [\alpha,\alpha']$ and $g'(x) \geq 0$ for any $x \in [\alpha',q]$. Since $g'$ is non-positive on $[\alpha,\alpha']$ and $g$ is continuous, $g$ is decreasing on $[\alpha,\alpha']$, so that $g(x) \leq g(\alpha) \leq 0$ there. If $\alpha' = q$, our proof is essentially done, since this implies $g(x) \leq 0$ for all $x \in [0,q]$.

Assume then that $\alpha' < q$. We want to identify the sign of $g$ on $[\alpha',q]$. In this case, we know that $g'(x) \geq 0$, so that $g$ is increasing on $[\alpha',q]$. Now let us observe that:

$$g(q) = p\Big(1+\frac{q}{p}\Big)\ln\Big(1+\frac{q}{p}\Big) + h(q) - \frac{q^2}{pq} \tag{38}$$

$$= \ln\Big(1+\frac{q}{p}\Big) - \frac{q}{p} \tag{39}$$

$$\leq \frac{q}{p} - \frac{q}{p} = 0 \tag{40}$$

where (40) uses $\ln(1+u) \leq u$. We will now show by contradiction that $g(x) \leq 0$ for all $x \in [\alpha',q]$. Assume that there exists a number $c \in [\alpha',q]$ for which $g(c) > 0$. Since $g$ is increasing (and continuous) on $[\alpha',q]$, this would imply $g(q) > 0$, which contradicts (40). As a result, there is no value $c \in [\alpha',q]$ such that $g(c) > 0$, and this concludes the proof of the first upper bound.

Proof of the second upper bound. The second upper bound comes directly from the first upper bound and the fact that:

$$2\big(pq - xp/2\big) = pq + p(q-x) \geq pq.$$

B Previously known results

B.1 Sturm's Theorem

Definition 3 (Sturm sequence of a univariate polynomial).
For any univariate polynomial $f(x)$ of degree $d$ with real coefficients, the Sturm sequence for $f$ is a sequence of polynomials $\bar{f} = \{f_0, f_1, \ldots, f_d\}$ such that:

$$f_0 = f, \qquad f_1 = f', \qquad f_{i+1} = -(f_{i-1} \bmod f_i) \quad \forall i \in \{1, \ldots, d-1\},$$

where $f_{i-1} \bmod f_i$ denotes the remainder of the Euclidean division of $f_{i-1}$ by $f_i$.

Definition 4 (Number of sign alternations in an arbitrary sequence). Let $\bar{\alpha} = \{\alpha_0, \alpha_1, \ldots, \alpha_n\}$ be a sequence of real numbers. We say that there is a sign alternation at position $i \in \{1, \ldots, n\}$ if there exists some $j \in \{0, \ldots, i-1\}$ such that the following two conditions are satisfied: (i) $\alpha_i \alpha_j < 0$; (ii) $j = i-1$ or $\alpha_{j+1} = \alpha_{j+2} = \ldots = \alpha_{i-1} = 0$. The number of sign alternations of $\bar{\alpha}$ is the number of positions at which there is a sign alternation.

Definition 5 (Multiplicity of a root of a function). Let $r \geq 0$ be a non-negative integer. Let $f^{(0)} = f$ and let $f^{(n)}$ denote the $n$-th derivative of a function $f$ differentiable up to $n$ times. A real number $\alpha$ is called a root of multiplicity $r$ for $f$ if:

$$f^{(0)}(\alpha) = \ldots = f^{(r-1)}(\alpha) = 0 \quad \text{and} \quad f^{(r)}(\alpha) \neq 0.$$

We say a real number is not a multiple root if its multiplicity is at most $1$ (i.e. it is either a non-root or a root of multiplicity $1$).

Theorem 2 (Sturm's Theorem [28, 21]). For any non-zero univariate polynomial $f(x)$ with real coefficients and any numbers $a \leq b$, let $N_f\,]a,b]$ denote the number of distinct real roots of $f$ in $]a,b]$. If neither $a$ nor $b$ is a multiple root, we have:

$$N_f\,]a,b] = S_f(a) - S_f(b)$$

where $S_f(x)$ denotes the number of sign alternations in the sequence $\{f_0(x), f_1(x), \ldots, f_d(x)\}$, with $\{f_0, f_1, \ldots, f_d\}$ the Sturm sequence of the polynomial $f$.
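A direct implementation of Definition 3 and Theorem 2 makes the root-counting argument used in Appendix A concrete. The sketch below (pure Python, coefficient lists ordered from the leading coefficient; all names are ours) is a naive floating-point version, adequate for well-conditioned polynomials such as the $f$ analyzed above:

```python
def poly_deriv(f):
    """Derivative of a polynomial given as coefficients [a_d, ..., a_1, a_0]."""
    d = len(f) - 1
    return [c * (d - i) for i, c in enumerate(f[:-1])]

def poly_rem(f, g):
    """Remainder of the Euclidean division of f by g."""
    r = list(f)
    while len(r) >= len(g):
        k = r[0] / g[0]
        for i in range(len(g)):
            r[i] -= k * g[i]  # cancel the current leading coefficient
        r.pop(0)
    return r

def poly_eval(f, x):
    """Evaluate f at x by Horner's rule."""
    v = 0.0
    for c in f:
        v = v * x + c
    return v

def sturm_sequence(f):
    """Sturm sequence {f_0 = f, f_1 = f', f_{i+1} = -(f_{i-1} mod f_i)}."""
    seq = [list(f), poly_deriv(f)]
    while len(seq[-1]) > 1:
        rem = [-c for c in poly_rem(seq[-2], seq[-1])]
        while rem and abs(rem[0]) < 1e-12:  # strip numerically-zero leading terms
            rem.pop(0)
        if not rem:
            break
        seq.append(rem)
    return seq

def sign_variations(seq, x):
    """Sign alternations of the sequence evaluated at x (zeros skipped)."""
    signs = [v for v in (poly_eval(p, x) for p in seq) if abs(v) > 1e-12]
    return sum(1 for a, b in zip(signs, signs[1:]) if a * b < 0)

def count_roots(f, a, b):
    """Distinct real roots of f in ]a, b], a and b not multiple roots (Theorem 2)."""
    seq = sturm_sequence(f)
    return sign_variations(seq, a) - sign_variations(seq, b)
```

For $f(x) = x^3 - 3x^2 + 2x = x(x-1)(x-2)$, `count_roots([1, -3, 2, 0], 0.5, 2.5)` returns `2`, matching the two roots $1$ and $2$ in $]0.5, 2.5]$.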
