PAC-Bayesian Inequalities for Martingales


Authors: Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, Peter Auer

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. XX, NO. Y, MONTH 201X

PAC-Bayesian Inequalities for Martingales

Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, Peter Auer

Abstract: We present a set of high-probability inequalities that control the concentration of weighted averages of multiple (possibly uncountably many) simultaneously evolving and interdependent martingales. Our results extend the PAC-Bayesian analysis in learning theory from the i.i.d. setting to martingales, opening the way for its application to importance-weighted sampling, reinforcement learning, and other interactive learning domains, as well as many other domains in probability theory and statistics where martingales are encountered. We also present a comparison inequality that bounds the expectation of a convex function of a martingale difference sequence shifted to the $[0,1]$ interval by the expectation of the same function of independent Bernoulli variables. This inequality is applied to derive a tighter analog of Hoeffding-Azuma's inequality.

Index Terms: Martingales, Hoeffding-Azuma's inequality, Bernstein's inequality, PAC-Bayesian bounds.

I. INTRODUCTION

Martingales are one of the fundamental tools in probability theory and statistics for modeling and studying sequences of random variables. Some of the most well-known and widely used concentration inequalities for individual martingales are Hoeffding-Azuma's and Bernstein's inequalities [1], [2], [3]. We present a comparison inequality that bounds the expectation of a convex function of a martingale difference sequence shifted to the $[0,1]$ interval by the expectation of the same function of independent Bernoulli variables. We apply this inequality to derive a tighter analog of Hoeffding-Azuma's inequality for martingales.
More importantly, we present a set of inequalities that make it possible to control weighted averages of multiple simultaneously evolving and interdependent martingales (see Fig. 1 for an illustration). The inequalities are especially interesting when the number of martingales is uncountably infinite and the standard union bound over the individual martingales cannot be applied. The inequalities hold with high probability simultaneously for a large class of averaging laws $\rho$. In particular, $\rho$ can depend on the sample.

One possible application of our inequalities is an analysis of importance-weighted sampling. Importance-weighted sampling is a general and widely used technique for estimating properties of a distribution by drawing samples from a different distribution. Via proper reweighting of the samples, the

Yevgeny Seldin is with the Max Planck Institute for Intelligent Systems, Tübingen, Germany, and University College London, London, UK. E-mail: seldin@tuebingen.mpg.de
François Laviolette is with Université Laval, Québec, Canada. E-mail: francois.laviolette@ift.ulaval.ca
Nicolò Cesa-Bianchi is with the Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy. E-mail: nicolo.cesa-bianchi@unimi.it
John Shawe-Taylor is with University College London, London, UK. E-mail: jst@cs.ucl.ac.uk
Peter Auer is with the Chair for Information Technology, Montanuniversität Leoben, Leoben, Austria. E-mail: auer@unileoben.ac.at

[Figure 1: a grid of martingale values $\bar M_i(h)$, with rows indexed by $h \in H$ and columns by time $i = 1, \ldots, n$; arrows indicate the dependencies between the values.]
Fig. 1. Illustration of an infinite set of simultaneously evolving and interdependent martingales.
$H$ is a space that indexes the individual martingales. For a fixed point $h \in H$, the sequence $\bar M_1(h), \bar M_2(h), \ldots, \bar M_n(h)$ is a single martingale. The arrows represent the dependencies between the values of the martingales: the value of martingale $h$ at time $i$, denoted by $\bar M_i(h)$, depends on $\bar M_j(h')$ for all $j \le i$ and $h' \in H$ (everything that is "before" and "concurrent" with $\bar M_i(h)$ in time; some of the arrows are omitted for clarity). A mean value of the martingales with respect to a probability distribution $\rho$ over $H$ is given by $\langle \bar M_n, \rho \rangle$. Our high-probability inequalities bound $|\langle \bar M_n, \rho \rangle|$ simultaneously for a large class of $\rho$.

expectation of the desired statistics based on the reweighted samples from the controlled distribution can be made identical to the expectation of the same statistics based on unweighted samples from the desired distribution. Thus, the difference between the observed statistics and its expected value forms a martingale difference sequence. Our inequalities can be applied to control the deviation of the observed statistics from its expected value. Furthermore, since the averaging law $\rho$ can depend on the sample, the controlled distribution can be adapted based on its outcomes from the preceding rounds, for example, for denser sampling in data-dependent regions of interest. See [4] for an example of an application of this technique in reinforcement learning.

Our concentration inequalities for weighted averages of martingales are based on a combination of Donsker-Varadhan's variational formula for relative entropy [5], [6], [7] with bounds on certain moment-generating functions of martingales, including Hoeffding-Azuma's and Bernstein's inequalities, as well as the new inequality derived in this paper.
In a nutshell, Donsker-Varadhan's variational formula implies that for a probability space $(H, \mathcal B)$, a bounded real-valued random variable $\Phi$, and any two probability distributions $\pi$ and $\rho$ over $H$ (or, if $H$ is uncountably infinite, two probability density functions), the expected value $\mathbb E_\rho[\Phi]$ is bounded as

  $\mathbb E_\rho[\Phi] \le \mathrm{KL}(\rho \| \pi) + \ln \mathbb E_\pi[e^\Phi]$,   (1)

where $\mathrm{KL}(\rho \| \pi)$ is the KL divergence (relative entropy) between the two distributions [8]. We can also think of $\Phi$ as $\Phi = \phi(h)$, where $\phi : H \to \mathbb R$ is a measurable function. Inequality (1) can then be written using the dot-product notation

  $\langle \phi, \rho \rangle \le \mathrm{KL}(\rho \| \pi) + \ln \langle e^\phi, \pi \rangle$,   (2)

and $\mathbb E_\rho[\phi] = \langle \phi, \rho \rangle$ can be thought of as a weighted average of $\phi$ with respect to $\rho$ (for countable $H$ it is defined as $\langle \phi, \rho \rangle = \sum_{h \in H} \phi(h) \rho(h)$, and for uncountable $H$ as $\langle \phi, \rho \rangle = \int_H \phi(h) \rho(h)\,dh$).^1

The weighted averages $\langle \phi, \rho \rangle$ on the left-hand side of (2) are the quantities of interest, and the inequality allows us to relate all possible averaging laws $\rho$ to a single "reference" distribution $\pi$. (Sometimes $\pi$ is also called a "prior" distribution, since it has to be selected before observing the sample.) We emphasize that inequality (2) is a deterministic relation. Thus, by a single application of Markov's inequality to $\langle e^\phi, \pi \rangle$ we obtain a statement that holds with high probability for all $\rho$ simultaneously. The quantity $\ln \langle e^\phi, \pi \rangle$, known as the cumulant-generating function of $\phi$, is closely related to the moment-generating function of $\phi$. The bound on $\ln \langle e^\phi, \pi \rangle$, after some manipulations, is achieved via bounds on moment-generating functions, which are identical to those used in the proofs of Hoeffding-Azuma's, Bernstein's, or our new inequality, depending on the choice of $\phi$.
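To make the change-of-measure step concrete, here is a small numerical check of inequality (2) on a finite $H$. The function values and distributions below are arbitrary illustrative choices, not taken from the paper:

```python
import math

def kl_divergence(rho, pi):
    """KL(rho || pi) for distributions given as lists of probabilities."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)

def change_of_measure_gap(phi, rho, pi):
    """Right-hand side minus left-hand side of <phi, rho> <= KL(rho||pi) + ln <e^phi, pi>."""
    lhs = sum(f * r for f, r in zip(phi, rho))
    rhs = kl_divergence(rho, pi) + math.log(sum(math.exp(f) * p for f, p in zip(phi, pi)))
    return rhs - lhs

phi = [0.3, -1.2, 2.0]   # a bounded function on H = {h1, h2, h3} (illustrative)
pi  = [1/3, 1/3, 1/3]    # reference ("prior") distribution
rho = [0.7, 0.2, 0.1]    # an arbitrary averaging law
assert change_of_measure_gap(phi, rho, pi) >= 0  # inequality (2) holds

# Equality is attained by the Gibbs distribution rho(h) proportional to pi(h) e^{phi(h)},
# matching the tightness remark phi(h) = ln(rho(h)/pi(h)) in the footnote:
z = sum(math.exp(f) * p for f, p in zip(phi, pi))
gibbs = [math.exp(f) * p / z for f, p in zip(phi, pi)]
assert abs(change_of_measure_gap(phi, gibbs, pi)) < 1e-12
```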
Donsker-Varadhan's variational formula for relative entropy laid the basis for PAC-Bayesian analysis in statistical learning theory [9], [10], [11], [12], where PAC is an abbreviation for the Probably Approximately Correct learning model introduced by Valiant [13]. PAC-Bayesian analysis provides high-probability bounds on the deviation of weighted averages of empirical means of sets of independent random variables from their expectations. In the learning theory setting, the space $H$ usually corresponds to a hypothesis space; the function $\phi(h)$ is related to the difference between the expected and empirical error of a hypothesis $h$; the distribution $\pi$ is a prior distribution over the hypothesis space; and the distribution $\rho$ defines a randomized classifier. The randomized classifier draws a hypothesis $h$ from $H$ according to $\rho$ at each round of the game and applies it to make the prediction on the next sample. PAC-Bayesian analysis supplied generalization guarantees for many influential machine learning algorithms, including support vector machines [14], [15], linear classifiers [16], and clustering-based models [17], to name just a few.

We show that PAC-Bayesian analysis can be extended to martingales. A combination of PAC-Bayesian analysis with Hoeffding-Azuma's inequality was applied by Lever et al. [18] in the analysis of U-statistics. The results presented here are both tighter and more general, and make it possible to apply

^1 The complete statement of Donsker-Varadhan's variational formula for relative entropy states that under appropriate conditions $\mathrm{KL}(\rho \| \pi) = \sup_\phi \left( \langle \phi, \rho \rangle - \ln \langle e^\phi, \pi \rangle \right)$, where the supremum is achieved by $\phi(h) = \ln \frac{\rho(h)}{\pi(h)}$. However, in our case the choice of $\phi$ is directly related to the values of the martingales of interest, and the free parameters in the inequality are the choices of $\rho$ and $\pi$.
Therefore, we are looking at the inequality in the form of equation (1), and a more appropriate name for it is "change of measure inequality".

PAC-Bayesian analysis in new domains, such as, for example, reinforcement learning [4].

II. MAIN RESULTS

We first present our new inequalities for individual martingales, and then present the inequalities for weighted averages of martingales. All the proofs are provided in the appendix.

A. Inequalities for Individual Martingales

Our first lemma is a comparison inequality that bounds expectations of convex functions of martingale difference sequences shifted to the $[0,1]$ interval by expectations of the same functions of independent Bernoulli random variables. The lemma generalizes a previous result by Maurer for independent random variables [19]. The lemma uses the following notation: for a sequence of random variables $X_1, \ldots, X_n$ we use $X_1^i := X_1, \ldots, X_i$ to denote the first $i$ elements of the sequence.

Lemma 1: Let $X_1, \ldots, X_n$ be a sequence of random variables, such that $X_i \in [0,1]$ with probability 1 and $\mathbb E[X_i \mid X_1^{i-1}] = b_i$ for $i = 1, \ldots, n$. Let $Y_1, \ldots, Y_n$ be independent Bernoulli random variables, such that $\mathbb E[Y_i] = b_i$. Then for any convex function $f : [0,1]^n \to \mathbb R$:

  $\mathbb E[f(X_1, \ldots, X_n)] \le \mathbb E[f(Y_1, \ldots, Y_n)]$.

Let $\mathrm{kl}(p \| q) = p \ln \frac{p}{q} + (1-p) \ln \frac{1-p}{1-q}$ be an abbreviation for $\mathrm{KL}([p, 1-p] \| [q, 1-q])$, where $[p, 1-p]$ and $[q, 1-q]$ are Bernoulli distributions with biases $p$ and $q$, respectively. By Pinsker's inequality [8], $|p - q| \le \sqrt{\mathrm{kl}(p \| q)/2}$, which means that a bound on $\mathrm{kl}(p \| q)$ implies a bound on the absolute difference between the biases of the Bernoulli distributions.

We apply Lemma 1 to derive the following inequality, which is an interesting generalization of an analogous result for i.i.d. variables. The result is based on the method of types in information theory [8].
Lemma 2: Let $X_1, \ldots, X_n$ be a sequence of random variables, such that $X_i \in [0,1]$ with probability 1 and $\mathbb E[X_i \mid X_1^{i-1}] = b$. Let $S_n := \sum_{i=1}^n X_i$. Then:

  $\mathbb E\left[ e^{n\,\mathrm{kl}\left(\frac{1}{n} S_n \,\|\, b\right)} \right] \le n + 1$.   (3)

Note that in Lemma 2 the conditional expectation $\mathbb E[X_i \mid X_1^{i-1}]$ is identical for all $i$, whereas in Lemma 1 there is no such restriction. Combining Lemma 2 with Markov's inequality leads to the following analog of Hoeffding-Azuma's inequality.

Corollary 3: Let $X_1, \ldots, X_n$ be as in Lemma 2. Then, for any $\delta \in (0,1)$, with probability greater than $1 - \delta$:

  $\mathrm{kl}\left( \tfrac{1}{n} S_n \,\middle\|\, b \right) \le \frac{1}{n} \ln \frac{n+1}{\delta}$.   (4)

$S_n$ is the terminal point of a random walk with bias $b$ after $n$ steps. By combining Corollary 3 with Pinsker's inequality we can obtain a more explicit bound on the deviation of the terminal point from its expected value, $|S_n - bn| \le \sqrt{\frac{n}{2} \ln \frac{n+1}{\delta}}$, which is similar to the result we can obtain by applying Hoeffding-Azuma's inequality. However, in certain situations the less explicit bound in the form of kl is significantly tighter than Hoeffding-Azuma's inequality, and it can also be tighter than Bernstein's inequality. A detailed comparison is provided in Section III.

B. PAC-Bayesian Inequalities for Weighted Averages of Martingales

Next, we present several inequalities that control the concentration of weighted averages of multiple simultaneously evolving and interdependent martingales. The first result shows that the classical PAC-Bayesian theorem for independent random variables [12] holds in the same form for martingales. The result is based on a combination of Donsker-Varadhan's variational formula for relative entropy with Lemma 2.

In order to state the theorem we need a few definitions. Let $(H, \mathcal B)$ be a probability space. Let $\bar X_1, \ldots, \bar X_n$ be a sequence of random functions, such that $\bar X_i : H \to [0,1]$.
Assume that $\mathbb E[\bar X_i \mid \bar X_1, \ldots, \bar X_{i-1}] = \bar b$, where $\bar b : H \to [0,1]$ is a deterministic function (possibly unknown). This means that $\mathbb E[\bar X_i(h) \mid \bar X_1, \ldots, \bar X_{i-1}] = \bar b(h)$ for each $i$ and $h$. Note that for each $h \in H$ the sequence $\bar X_1(h), \ldots, \bar X_n(h)$ satisfies the condition of Lemma 2. Let $\bar S_n := \sum_{i=1}^n \bar X_i$. In the following theorem we bound the mean of $\bar S_n$ with respect to any probability measure $\rho$ over $H$.

Theorem 4 (PAC-Bayes-kl Inequality): Fix a reference distribution $\pi$ over $H$. Then, for any $\delta \in (0,1)$, with probability greater than $1 - \delta$ over $\bar X_1, \ldots, \bar X_n$, for all distributions $\rho$ over $H$ simultaneously:

  $\mathrm{kl}\left( \left\langle \tfrac{1}{n} \bar S_n, \rho \right\rangle \,\middle\|\, \langle \bar b, \rho \rangle \right) \le \frac{\mathrm{KL}(\rho \| \pi) + \ln \frac{n+1}{\delta}}{n}$.   (5)

By Pinsker's inequality, Theorem 4 implies that

  $\left| \left\langle \tfrac{1}{n} \bar S_n, \rho \right\rangle - \langle \bar b, \rho \rangle \right| = \left| \left\langle \tfrac{1}{n} \bar S_n - \bar b, \rho \right\rangle \right| \le \sqrt{\frac{\mathrm{KL}(\rho \| \pi) + \ln \frac{n+1}{\delta}}{2n}}$;   (6)

however, if $\langle \frac{1}{n} \bar S_n, \rho \rangle$ is close to zero or one, inequality (5) is significantly tighter than (6).

The next result is based on a combination of Donsker-Varadhan's variational formula for relative entropy with Hoeffding-Azuma's inequality. This time let $\bar Z_1, \ldots, \bar Z_n$ be a sequence of random functions, such that $\bar Z_i : H \to \mathbb R$. Let $\bar Z_1^i$ be an abbreviation for the subsequence of the first $i$ random functions in the sequence. We assume that $\mathbb E[\bar Z_i \mid \bar Z_1^{i-1}] = \bar 0$. In other words, for each $h \in H$ the sequence $\bar Z_1(h), \ldots, \bar Z_n(h)$ is a martingale difference sequence. Let $\bar M_i := \sum_{j=1}^i \bar Z_j$. Then, for each $h \in H$ the sequence $\bar M_1(h), \ldots, \bar M_n(h)$ is a martingale. In the following theorems we bound the mean of $\bar M_n$ with respect to any probability measure $\rho$ on $H$.

Theorem 5: Assume that $\bar Z_i : H \to [\alpha_i, \beta_i]$. Fix a reference distribution $\pi$ over $H$ and $\lambda > 0$.
Then, for any $\delta \in (0,1)$, with probability greater than $1 - \delta$ over $\bar Z_1^n$, for all distributions $\rho$ over $H$ simultaneously:

  $|\langle \bar M_n, \rho \rangle| \le \frac{\mathrm{KL}(\rho \| \pi) + \ln \frac{2}{\delta}}{\lambda} + \frac{\lambda}{8} \sum_{i=1}^n (\beta_i - \alpha_i)^2$.   (7)

We note that we cannot minimize inequality (7) simultaneously for all $\rho$ by a single value of $\lambda$. In the following theorem we take a grid of $\lambda$-s in the form of a geometric sequence, and for each value of $\mathrm{KL}(\rho \| \pi)$ we pick the value of $\lambda$ from the grid that is closest to the one minimizing (7). The result is almost as good as what we could achieve by minimizing the bound for a single value of $\rho$.

Theorem 6 (PAC-Bayes-Hoeffding-Azuma Inequality): Assume that $\bar Z_1^n$ is as in Theorem 5. Fix a reference distribution $\pi$ over $H$ and take an arbitrary number $c > 1$. Then, for any $\delta \in (0,1)$, with probability greater than $1 - \delta$ over $\bar Z_1^n$, for all distributions $\rho$ over $H$ simultaneously:

  $|\langle \bar M_n, \rho \rangle| \le \frac{1 + c}{2\sqrt 2} \sqrt{\left( \mathrm{KL}(\rho \| \pi) + \ln \tfrac{2}{\delta} + \epsilon(\rho) \right) \sum_{i=1}^n (\beta_i - \alpha_i)^2}$,   (8)

where $\epsilon(\rho) = \frac{\ln 2}{2 \ln c} \left( 1 + \ln \frac{\mathrm{KL}(\rho \| \pi)}{\ln \frac{2}{\delta}} \right)$.

Our last result is based on a combination of Donsker-Varadhan's variational formula with a Bernstein-type inequality for martingales. Let $\bar V_i : H \to \mathbb R$ be such that $\bar V_i(h) := \sum_{j=1}^i \mathbb E\left[ \bar Z_j(h)^2 \,\middle|\, \bar Z_1^{j-1} \right]$. In other words, $\bar V_i(h)$ is the variance of the martingale $\bar M_i(h)$ defined earlier. Let $\|\bar Z_i\|_\infty = \sup_{h \in H} |\bar Z_i(h)|$ be the $L_\infty$ norm of $\bar Z_i$.

Theorem 7: Assume that $\|\bar Z_i\|_\infty \le K$ for all $i$ with probability 1, and pick $\lambda$ such that $\lambda \le 1/K$. Fix a reference distribution $\pi$ over $H$. Then, for any $\delta \in (0,1)$, with probability greater than $1 - \delta$ over $\bar Z_1^n$, for all distributions $\rho$ over $H$ simultaneously:

  $|\langle \bar M_n, \rho \rangle| \le \frac{\mathrm{KL}(\rho \| \pi) + \ln \frac{2}{\delta}}{\lambda} + (e - 2) \lambda \langle \bar V_n, \rho \rangle$.   (9)

As in the previous case, the right-hand side of (9) cannot be minimized for all $\rho$ simultaneously by a single value of $\lambda$. Furthermore, $\bar V_n$ is a random function.
In the following theorem we take a similar grid of $\lambda$-s as in Theorem 6, and a union bound over the grid. Picking the value of $\lambda$ from the grid closest to the value that minimizes the right-hand side of (9) yields almost as good a result as we would get by minimizing (9) for a single choice of $\rho$. In this approach the variance $\bar V_n$ can be replaced by a sample-dependent upper bound. For example, in importance-weighted sampling such an upper bound is derived from the reciprocal of the sampling distribution at each round [4].

Theorem 8 (PAC-Bayes-Bernstein Inequality): Assume that $\|\bar Z_i\|_\infty \le K$ for all $i$ with probability 1. Fix a reference distribution $\pi$ over $H$ and pick an arbitrary number $c > 1$. Then, for any $\delta \in (0,1)$, with probability greater than $1 - \delta$ over $\bar Z_1^n$, simultaneously for all distributions $\rho$ over $H$ that satisfy

  $\sqrt{\frac{\mathrm{KL}(\rho \| \pi) + \ln \frac{2\nu}{\delta}}{(e - 2) \langle \bar V_n, \rho \rangle}} \le \frac{1}{K}$   (10)

we have

  $|\langle \bar M_n, \rho \rangle| \le (1 + c) \sqrt{(e - 2) \langle \bar V_n, \rho \rangle \left( \mathrm{KL}(\rho \| \pi) + \ln \tfrac{2\nu}{\delta} \right)}$,   (11)

where

  $\nu = \left\lceil \frac{1}{\ln c} \ln \sqrt{\frac{(e-2)\,n}{\ln \frac{2}{\delta}}} \right\rceil + 1$,   (12)

and for all other $\rho$

  $|\langle \bar M_n, \rho \rangle| \le 2K \left( \mathrm{KL}(\rho \| \pi) + \ln \tfrac{2\nu}{\delta} \right)$.   (13)

($\lceil x \rceil$ denotes the smallest integer larger than $x$.)

III. COMPARISON OF THE INEQUALITIES

In this section we remind the reader of Hoeffding-Azuma's and Bernstein's inequalities for individual martingales and compare them with our new kl-form inequality. Then we compare the inequalities for weighted averages of martingales with the inequalities for individual martingales.

A. Background

We first recall Hoeffding-Azuma's inequality [1], [2]. For a sequence of random variables $Z_1, \ldots, Z_n$ we use $Z_1^i := Z_1, \ldots, Z_i$ to denote the first $i$ elements of the sequence.

Lemma 9 (Hoeffding-Azuma's Inequality): Let $Z_1, \ldots$
$, Z_n$ be a martingale difference sequence, such that $Z_i \in [\alpha_i, \beta_i]$ with probability 1 and $\mathbb E[Z_i \mid Z_1^{i-1}] = 0$. Let $M_i = \sum_{j=1}^i Z_j$ be the corresponding martingale. Then for any $\lambda \in \mathbb R$:

  $\mathbb E[e^{\lambda M_n}] \le e^{(\lambda^2 / 8) \sum_{i=1}^n (\beta_i - \alpha_i)^2}$.

By combining Hoeffding-Azuma's inequality with Markov's inequality and taking

  $\lambda = \sqrt{\frac{8 \ln \frac{2}{\delta}}{\sum_{i=1}^n (\beta_i - \alpha_i)^2}}$

it is easy to obtain the following corollary.

Corollary 10: For $M_n$ defined in Lemma 9 and $\delta \in (0,1)$, with probability greater than $1 - \delta$:

  $|M_n| \le \sqrt{\frac{1}{2} \ln\left(\frac{2}{\delta}\right) \sum_{i=1}^n (\beta_i - \alpha_i)^2}$.

The next lemma is a Bernstein-type inequality [3], [20]. We provide the proof of this inequality in Appendix C; it is a part of the proof of [21, Theorem 1].

Lemma 11 (Bernstein's Inequality): Let $Z_1, \ldots, Z_n$ be a martingale difference sequence, such that $|Z_i| \le K$ with probability 1 and $\mathbb E[Z_i \mid Z_1^{i-1}] = 0$. Let $M_i := \sum_{j=1}^i Z_j$ and let $V_i := \sum_{j=1}^i \mathbb E[Z_j^2 \mid Z_1^{j-1}]$. Then for any $\lambda \in [0, \frac{1}{K}]$:

  $\mathbb E\left[ e^{\lambda M_n - (e-2) \lambda^2 V_n} \right] \le 1$.

By combining Lemma 11 with Markov's inequality we obtain that for any $\lambda \in [0, \frac{1}{K}]$ and $\delta \in (0,1)$, with probability greater than $1 - \delta$:

  $|M_n| \le \frac{1}{\lambda} \ln \frac{2}{\delta} + \lambda (e - 2) V_n$.   (14)

$V_n$ is a random variable and can be replaced by an upper bound. Inequality (14) is minimized by $\lambda^* = \sqrt{\frac{\ln \frac{2}{\delta}}{(e-2) V_n}}$. Note that $\lambda^*$ depends on $V_n$ and is not accessible until we observe the entire sample. We can bypass this problem by constructing the same grid of $\lambda$-s as the one used in the proof of Theorem 8, and taking a union bound over it. Picking the value of $\lambda$ closest to $\lambda^*$ from the grid leads to the following corollary. In this bounding technique the upper bound on $V_n$ can be sample-dependent, since the bound holds simultaneously for all $\lambda$-s in the grid. Despite being a relatively simple consequence of Lemma 11, we have not seen this result in the literature.
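The grid-and-union-bound technique just described can be sketched numerically: take the geometric grid of $\lambda$-s with the grid size $\nu$ of (12), apply (14) with confidence $\delta/\nu$ per grid point, and keep the best value. All numeric settings below are illustrative choices, not from the paper:

```python
import math

def nu(n, delta, c):
    # Grid size from Eq. (12)
    return math.ceil(math.log(math.sqrt((math.e - 2) * n / math.log(2 / delta))) / math.log(c)) + 1

def grid_bernstein_bound(n, K, V_n, delta, c):
    """Best value of (1/lam) ln(2 nu/delta) + lam (e-2) V_n over the geometric lambda-grid."""
    v = nu(n, delta, c)
    log_term = math.log(2 * v / delta)
    lam0 = math.sqrt(math.log(2 / delta) / ((math.e - 2) * n)) / K  # lower end of the range
    lams = [min(lam0 * c**i, 1 / K) for i in range(v)]
    return min(log_term / lam + lam * (math.e - 2) * V_n for lam in lams)

def hoeffding_azuma_bound(n, K, delta):
    # Corollary 10 with ranges [alpha_i, beta_i] = [-K, K]
    return math.sqrt(0.5 * math.log(2 / delta) * n * (2 * K) ** 2)

n, K, delta, c = 10000, 1.0, 0.05, 2.0
V_small = 0.01 * n  # small-variance regime: V_n much smaller than n*K^2
assert grid_bernstein_bound(n, K, V_small, delta, c) < hoeffding_azuma_bound(n, K, delta)
```

As the comparison in Section III-B notes, the advantage appears exactly when the variance bound is much smaller than $n K^2$; for $V_n$ close to $n K^2$ the two bounds are comparable up to constants and logarithmic factors.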
The corollary is tighter than an analogous result by Beygelzimer et al. [21, Theorem 1].

Corollary 12: For $M_n$ and $V_n$ as defined in Lemma 11, $c > 1$, and $\delta \in (0,1)$, with probability greater than $1 - \delta$: if

  $\sqrt{\frac{\ln \frac{2\nu}{\delta}}{(e-2) V_n}} \le \frac{1}{K}$   (15)

then

  $|M_n| \le (1 + c) \sqrt{(e - 2) V_n \ln \frac{2\nu}{\delta}}$,

where $\nu$ is defined in (12), and otherwise

  $|M_n| \le 2K \ln \frac{2\nu}{\delta}$.

The technical condition (15) follows from the requirement of Lemma 11 that $\lambda \in [0, \frac{1}{K}]$.

B. Comparison

We first compare the inequalities for individual martingales in Corollaries 3, 10, and 12.

Comparison of Inequalities for Individual Martingales: The comparison between Corollaries 10 and 12 is relatively straightforward. We note that the assumption $\mathbb E[Z_i \mid Z_1^{i-1}] = 0$ implies that $\alpha_i \le 0$, and that $V_n \le \sum_{i=1}^n \max\{\alpha_i^2, \beta_i^2\} \le \sum_{i=1}^n (\beta_i - \alpha_i)^2$. Hence, Corollary 12 (derived from Bernstein's inequality) matches Corollary 10 (derived from Hoeffding-Azuma's inequality) up to minor constants and logarithmic factors in the general case, and can be much tighter when the variance is small.

The comparison with the kl inequality in Corollary 3 is a bit more involved. As we mentioned after Corollary 3, its combination with Pinsker's inequality implies that $|S_n - bn| \le \sqrt{\frac{n}{2} \ln \frac{n+1}{\delta}}$, where $S_n - bn$ is the martingale corresponding to the martingale difference sequence $Z_i = X_i - b$. Thus, Corollary 3 is at least as tight as Hoeffding-Azuma's inequality in Corollary 10, up to a factor of $\sqrt{\ln \frac{n+1}{2}}$. This is also true if $X_i \in [\alpha_i, \beta_i]$ (rather than $[0,1]$), as long as we can simultaneously project all $X_i$-s to the $[0,1]$ interval without losing too much. Tighter upper bounds on the kl divergence show that in certain situations Corollary 3 is actually much tighter than Hoeffding-Azuma's inequality.
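This tightness near the boundary can be checked directly: since $\mathrm{kl}(p \| q)$ is increasing in $q$ for $q \ge p$, the kl bound (4) can be inverted by bisection to get an upper confidence limit on the drift. The sample values below are hypothetical:

```python
import math

def kl_bernoulli(p, q):
    """kl(p||q) for Bernoulli biases, with the 0*log(0) = 0 convention."""
    t = 0.0
    if p > 0:
        t += p * math.log(p / q)
    if p < 1:
        t += (1 - p) * math.log((1 - p) / (1 - q))
    return t

def kl_upper_inverse(p_hat, eps, tol=1e-10):
    """Largest q in [p_hat, 1) with kl(p_hat || q) <= eps, found by bisection."""
    lo, hi = p_hat, 1.0 - tol
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(p_hat, mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

n, delta = 1000, 0.05   # hypothetical sample size and confidence level
s_n = 30                # empirical count close to zero: the kl form shines here
eps = math.log((n + 1) / delta) / n          # right-hand side of (4)
kl_ucb = kl_upper_inverse(s_n / n, eps)      # exact inversion of the kl bound
pinsker_ucb = s_n / n + math.sqrt(eps / 2)   # the Pinsker/Hoeffding-style relaxation
assert kl_ucb < pinsker_ucb
```

Near the boundary the inverted kl limit is markedly below the Pinsker relaxation, matching the discussion that follows.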
One possible application of Corollary 3 is estimation of the value of the drift $b$ of a random walk from the empirical observation $S_n$. If $S_n$ is close to zero, it is possible to use a tighter bound on kl, which states that for $p > q$ we have $p \le q + \sqrt{2 q\,\mathrm{kl}(q \| p)} + 2\,\mathrm{kl}(q \| p)$ [15]. From this inequality we obtain that with probability greater than $1 - \delta$:

  $b \le \frac{1}{n} S_n + \sqrt{\frac{2 S_n \ln \frac{n+1}{\delta}}{n^2}} + \frac{2 \ln \frac{n+1}{\delta}}{n}$.

The above inequality is tighter than Hoeffding-Azuma's inequality whenever $\frac{1}{n} S_n < 1/8$. Since kl is convex in each of its parameters, it is actually easy to invert it numerically, and thus avoid the need to resort to approximations in practice. In a similar manner, tighter bounds can be obtained when $S_n$ is close to $n$.

The comparison of the kl inequality in Corollary 3 with Bernstein's inequality in Corollary 12 is not as clear-cut as the comparison with Hoeffding-Azuma's inequality. If there is a bound on $V_n$ that is significantly tighter than $n$, Bernstein's inequality can be significantly tighter than the kl inequality, but otherwise the opposite can also be the case. In the example of estimating the drift of a random walk without prior knowledge of its variance, if the empirical drift is close to zero or to $n$, the kl inequality is tighter. In this case the kl inequality is comparable with empirical Bernstein bounds [22], [23], [24].

Comparison of Inequalities for Individual Martingales with PAC-Bayesian Inequalities for Weighted Averages of Martingales: The "price" paid for considering weighted averages of multiple martingales is the KL divergence $\mathrm{KL}(\rho \| \pi)$ between the desired mixture weights $\rho$ and the reference mixture weights $\pi$. (In the case of the PAC-Bayes-Hoeffding-Azuma inequality, Theorem 6, there is also an additional minor term originating from the union bound over the grid of $\lambda$-s.) Note that for $\rho = \pi$ the KL term vanishes.
IV. DISCUSSION

We presented a comparison inequality that bounds the expectation of a convex function of martingale-difference-type variables by the expectation of the same function of independent Bernoulli variables. This inequality makes it possible to reduce the problem of studying continuous dependent random variables on a bounded interval to the much simpler problem of studying independent Bernoulli random variables.

As an example of an application of our lemma, we derived an analog of Hoeffding-Azuma's inequality for martingales. Our result is always comparable to Hoeffding-Azuma's inequality up to a logarithmic factor, and in cases where the empirical drift of the corresponding random walk is close to the region boundaries it is tighter than Hoeffding-Azuma's inequality by an order of magnitude. It can also be tighter than Bernstein's inequality for martingales, unless there is a tight bound on the martingale variance.

Finally, and most importantly, we presented a set of inequalities on the concentration of weighted averages of multiple simultaneously evolving and interdependent martingales. These inequalities are especially useful for controlling uncountably many martingales, where standard union bounds cannot be applied. Martingales are one of the most basic and important tools for studying time-evolving processes, and we believe that our results will be useful in multiple domains. One such application, in the analysis of importance-weighted sampling in reinforcement learning, was already presented in [4].

APPENDIX A
PROOFS OF THE RESULTS FOR INDIVIDUAL MARTINGALES

Proof of Lemma 1: The proof follows the lines of the proof of Maurer [19, Lemma 3]. Any point $\bar x = (x_1, \ldots, x_n) \in [0,1]^n$ can be written as a convex combination of the extreme points $\bar \eta = (\eta_1, \ldots, \eta_n) \in \{0,1\}^n$ in the following way:

  $\bar x = \sum_{\bar \eta \in \{0,1\}^n} \left( \prod_{i=1}^n \left[ (1 - x_i)(1 - \eta_i) + x_i \eta_i \right] \right) \bar \eta$.
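As a quick numerical sanity check (purely illustrative, not part of the proof), the weights in this decomposition are the Bernoulli product probabilities with biases $x_i$; they sum to one and the combination recovers $\bar x$:

```python
import itertools

def vertex_weights(x):
    """Yield (eta, weight) pairs for the convex decomposition of x over {0,1}^n."""
    for eta in itertools.product([0, 1], repeat=len(x)):
        w = 1.0
        for xi, ei in zip(x, eta):
            w *= (1 - xi) * (1 - ei) + xi * ei
        yield eta, w

x = (0.2, 0.7, 0.5)  # an arbitrary point of [0,1]^3
total_w = sum(w for _, w in vertex_weights(x))
recon = [sum(w * eta[i] for eta, w in vertex_weights(x)) for i in range(len(x))]
assert abs(total_w - 1.0) < 1e-12                               # weights form a distribution
assert all(abs(r - xi) < 1e-12 for r, xi in zip(recon, x))      # combination recovers x
```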
Convexity of $f$ therefore implies

  $f(\bar x) \le \sum_{\bar \eta \in \{0,1\}^n} \left( \prod_{i=1}^n \left[ (1 - x_i)(1 - \eta_i) + x_i \eta_i \right] \right) f(\bar \eta)$   (16)

with equality if $\bar x \in \{0,1\}^n$. Let $X_1^i := X_1, \ldots, X_i$ be the first $i$ elements of the sequence $X_1, \ldots, X_n$. Let $W_i(\eta_i) = (1 - X_i)(1 - \eta_i) + X_i \eta_i$ and let $w_i(\eta_i) = (1 - b_i)(1 - \eta_i) + b_i \eta_i$. Note that by the assumption of the lemma:

  $\mathbb E[W_i(\eta_i) \mid X_1^{i-1}] = \mathbb E[(1 - X_i)(1 - \eta_i) + X_i \eta_i \mid X_1^{i-1}] = (1 - b_i)(1 - \eta_i) + b_i \eta_i = w_i(\eta_i)$.

By taking expectations of both sides of (16) we obtain:

  $\mathbb E_{X_1^n}[f(X_1^n)] \le \mathbb E_{X_1^n}\left[ \sum_{\bar \eta \in \{0,1\}^n} \left( \prod_{i=1}^n W_i(\eta_i) \right) f(\bar \eta) \right]$
  $= \sum_{\bar \eta \in \{0,1\}^n} \mathbb E_{X_1^n}\left[ \prod_{i=1}^n W_i(\eta_i) \right] f(\bar \eta)$
  $= \sum_{\bar \eta \in \{0,1\}^n} \mathbb E_{X_1^{n-1}}\left[ \mathbb E_{X_n}\left[ \prod_{i=1}^n W_i(\eta_i) \,\middle|\, X_1^{n-1} \right] \right] f(\bar \eta)$
  $= \sum_{\bar \eta \in \{0,1\}^n} \mathbb E_{X_1^{n-1}}\left[ \prod_{i=1}^{n-1} W_i(\eta_i)\; \mathbb E_{X_n}\left[ W_n(\eta_n) \mid X_1^{n-1} \right] \right] f(\bar \eta)$
  $= \sum_{\bar \eta \in \{0,1\}^n} \mathbb E_{X_1^{n-1}}\left[ \prod_{i=1}^{n-1} W_i(\eta_i) \right] w_n(\eta_n) f(\bar \eta)$
  $= \cdots$   (17)
  $= \sum_{\bar \eta \in \{0,1\}^n} \left( \prod_{i=1}^n w_i(\eta_i) \right) f(\bar \eta) = \sum_{\bar \eta \in \{0,1\}^n} \left( \prod_{i=1}^n \left[ (1 - b_i)(1 - \eta_i) + b_i \eta_i \right] \right) f(\bar \eta) = \mathbb E_{Y_1^n}[f(Y_1^n)]$.

In (17) we apply induction in order to replace the $X_i$ by $b_i$, one by one from the last to the first, the same way we did it for $X_n$.

Lemma 2 follows from the following concentration result for independent Bernoulli variables, which is based on the method of types in information theory [8]. Its proof can be found in [25], [17].

Lemma 13: Let $Y_1, \ldots, Y_n$ be i.i.d. Bernoulli random variables, such that $\mathbb E[Y_i] = b$. Then:

  $\mathbb E\left[ e^{n\,\mathrm{kl}\left( \frac{1}{n} \sum_{i=1}^n Y_i \,\|\, b \right)} \right] \le n + 1$.   (18)

For $n \ge 8$ it is possible to prove the even stronger result $\sqrt n \le \mathbb E\left[ e^{n\,\mathrm{kl}\left( \frac{1}{n} \sum_{i=1}^n Y_i \,\|\, b \right)} \right] \le 2 \sqrt n$ using Stirling's approximation of the factorial [19].
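The left-hand side of (18) can be evaluated exactly by summing over the binomial distribution of $S_n = \sum_i Y_i$, which makes the bound easy to check numerically (the choice of $n$ and $b$ below is arbitrary):

```python
import math

def kl_bernoulli(p, q):
    """kl(p||q) for Bernoulli biases, with the 0*log(0) = 0 convention."""
    t = 0.0
    if p > 0:
        t += p * math.log(p / q)
    if p < 1:
        t += (1 - p) * math.log((1 - p) / (1 - q))
    return t

def mgf_of_kl(n, b):
    """E[exp(n * kl(S_n/n || b))] computed exactly over the binomial pmf."""
    total = 0.0
    for k in range(n + 1):
        p_k = math.comb(n, k) * b**k * (1 - b)**(n - k)
        total += p_k * math.exp(n * kl_bernoulli(k / n, b))
    return total

n, b = 20, 0.3  # illustrative values
value = mgf_of_kl(n, b)
assert math.sqrt(n) <= value <= n + 1  # (18) together with the lower bound for n >= 8
```

A side observation (easily verified by expanding the terms) is that $p_k\, e^{n\,\mathrm{kl}(k/n \| b)} = \binom{n}{k} (k/n)^k ((n-k)/n)^{n-k}$, so the exact value does not depend on $b$.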
For the sake of simplicity we restrict ourselves to the slightly weaker bound (18), although all results that are based on Lemma 2 can be slightly improved by using the tighter bound.

Proof of Lemma 2: Since the KL divergence is a convex function [8] and the exponent function is convex and non-decreasing, $e^{n\,\mathrm{kl}(p \| q)}$ is also a convex function. Therefore, Lemma 2 follows from Lemma 13 by Lemma 1.

Corollary 3 follows from Lemma 2 by Markov's inequality.

Lemma 14 (Markov's Inequality): For $\delta \in (0,1)$ and a random variable $X \ge 0$, with probability greater than $1 - \delta$:

  $X \le \frac{1}{\delta} \mathbb E[X]$.   (19)

Proof of Corollary 3: By Markov's inequality and Lemma 2, with probability greater than $1 - \delta$:

  $e^{n\,\mathrm{kl}\left( \frac{1}{n} S_n \,\|\, b \right)} \le \frac{1}{\delta} \mathbb E\left[ e^{n\,\mathrm{kl}\left( \frac{1}{n} S_n \,\|\, b \right)} \right] \le \frac{n+1}{\delta}$.

Taking the logarithm of both sides of the inequality and normalizing by $n$ completes the proof.

APPENDIX B
PROOFS OF PAC-BAYESIAN THEOREMS FOR MARTINGALES

In this appendix we provide the proofs of Theorems 4, 7, and 8. The proof of Theorem 5 is very similar to the proof of Theorem 7 and is therefore omitted. The proof of Theorem 6 is very similar to the proof of Theorem 8, so we only show how to choose the grid of $\lambda$-s in that theorem.

The proofs of all PAC-Bayesian theorems are based on the following lemma, which is obtained by changing sides in Donsker-Varadhan's variational definition of relative entropy. The lemma has its roots in information theory and statistical physics [5], [6], [7]. It provides a deterministic relation between the averages of $\phi$ with respect to all possible distributions $\rho$ and the cumulant-generating function $\ln \langle e^\phi, \pi \rangle$ with respect to a single reference distribution $\pi$. A single application of Markov's inequality, combined with the bounds on moment-generating functions in Lemmas 2, 9, and 11, is then used to bound the last term in (20) in the proofs of Theorems 4, 5, and 7, respectively.
Lemma 15 (Change of Measure Inequality): For any probability space $(H, \mathcal B)$, a measurable function $\phi : H \to \mathbb R$, and any distributions $\pi$ and $\rho$ over $H$, we have:

  $\langle \phi, \rho \rangle \le \mathrm{KL}(\rho \| \pi) + \ln \langle e^\phi, \pi \rangle$.   (20)

Since the KL divergence is infinite when the support of $\rho$ exceeds the support of $\pi$, inequality (20) is interesting only when $\pi \gg \rho$ (that is, when $\rho$ is absolutely continuous with respect to $\pi$). For a similar reason, it is interesting only when $\langle e^\phi, \pi \rangle$ is finite. We note that the inequality is tight in the same sense as Jensen's inequality is tight: for $\phi(h) = \ln \frac{\rho(h)}{\pi(h)}$ it becomes an equality.

Proof of Theorem 4: Take $\phi(h) := n\,\mathrm{kl}\left( \frac{1}{n} \bar S_n(h) \,\middle\|\, \bar b(h) \right)$; more compactly, denote $\phi = n\,\mathrm{kl}\left( \frac{1}{n} \bar S_n \,\middle\|\, \bar b \right) : H \to \mathbb R$. Then with probability greater than $1 - \delta$, for all $\rho$:

  $n\,\mathrm{kl}\left( \left\langle \tfrac{1}{n} \bar S_n, \rho \right\rangle \,\middle\|\, \langle \bar b, \rho \rangle \right) \le n \left\langle \mathrm{kl}\left( \tfrac{1}{n} \bar S_n \,\middle\|\, \bar b \right), \rho \right\rangle$   (21)
  $\le \mathrm{KL}(\rho \| \pi) + \ln \left\langle e^{n\,\mathrm{kl}\left( \frac{1}{n} \bar S_n \,\|\, \bar b \right)}, \pi \right\rangle$   (22)
  $\le \mathrm{KL}(\rho \| \pi) + \ln \left( \frac{1}{\delta}\, \mathbb E_{\bar X_1^n}\left[ \left\langle e^{n\,\mathrm{kl}\left( \frac{1}{n} \bar S_n \,\|\, \bar b \right)}, \pi \right\rangle \right] \right)$   (23)
  $= \mathrm{KL}(\rho \| \pi) + \ln \left( \frac{1}{\delta} \left\langle \mathbb E_{\bar X_1^n}\left[ e^{n\,\mathrm{kl}\left( \frac{1}{n} \bar S_n \,\|\, \bar b \right)} \right], \pi \right\rangle \right)$   (24)
  $\le \mathrm{KL}(\rho \| \pi) + \ln \frac{n+1}{\delta}$,   (25)

where (21) is by convexity of the kl divergence [8]; (22) is by the change of measure inequality (Lemma 15); (23) holds with probability greater than $1 - \delta$ by Markov's inequality; in (24) we can take the expectation inside the dot product due to the linearity of both operations and since $\pi$ is deterministic; and (25) is by Lemma 2.^2 Normalization by $n$ completes the proof of the theorem.

Proof of Theorem 7: For the proof of Theorem 7 we take $\phi(h) := \lambda \bar M_n(h) - (e-2)\lambda^2 \bar V_n(h)$, or, more compactly, $\phi = \lambda \bar M_n - (e-2)\lambda^2 \bar V_n$.
Then with probability greater than $1-\frac{\delta}{2}$, for all $\rho$:
$$\lambda \langle \bar{M}_n, \rho \rangle - (e-2)\lambda^2 \langle \bar{V}_n, \rho \rangle = \langle \lambda \bar{M}_n - (e-2)\lambda^2 \bar{V}_n, \rho \rangle$$
$$\le \mathrm{KL}(\rho\|\pi) + \ln\left\langle e^{\lambda \bar{M}_n - (e-2)\lambda^2 \bar{V}_n}, \pi \right\rangle$$
$$\le \mathrm{KL}(\rho\|\pi) + \ln\left(\frac{2}{\delta}\,\mathbb{E}_{\bar{Z}_1^n}\left[\left\langle e^{\lambda \bar{M}_n - (e-2)\lambda^2 \bar{V}_n}, \pi \right\rangle\right]\right) \qquad (26)$$
$$= \mathrm{KL}(\rho\|\pi) + \ln\left(\frac{2}{\delta}\left\langle \mathbb{E}_{\bar{Z}_1^n}\left[e^{\lambda \bar{M}_n - (e-2)\lambda^2 \bar{V}_n}\right], \pi \right\rangle\right)$$
$$\le \mathrm{KL}(\rho\|\pi) + \ln\frac{2}{\delta}, \qquad (27)$$
where (27) is by Lemma 11 and the other steps are justified in the same way as in the previous proof. (In the previous proof, the justification of (25) uses the fact that, by Lemma 2, for each $h \in \mathcal{H}$ we have $\mathbb{E}_{\bar{X}_1^n}\left[e^{n\,\mathrm{kl}(\frac{1}{n}\bar{S}_n(h) \,\|\, \bar{b}(h))}\right] \le n+1$ and, therefore, $\left\langle \mathbb{E}_{\bar{X}_1^n}\left[e^{n\,\mathrm{kl}(\frac{1}{n}\bar{S}_n \| \bar{b})}\right], \pi \right\rangle \le n+1$.)

By applying the same argument to $-\bar{M}_n$, taking a union bound over the two results, moving $(e-2)\lambda^2 \langle \bar{V}_n, \rho \rangle$ to the other side of the inequality, and normalizing by $\lambda$, we obtain the statement of the theorem.

Proof of Theorem 8: The value of $\lambda$ that minimizes (9) depends on $\rho$, whereas we would like to have a result that holds for all possible distributions $\rho$ simultaneously. This requires considering multiple values of $\lambda$ simultaneously, and we have to take a union bound over the $\lambda$-s in step (26) of the proof of Theorem 7. We cannot take all possible values of $\lambda$, since there are uncountably many possibilities. Instead, we determine the relevant range of $\lambda$ and take a union bound over a grid of $\lambda$-s that forms a geometric sequence over this range. Since the range is finite, the grid is also finite. The upper bound on the relevant range of $\lambda$ is determined by the constraint $\lambda \le \frac{1}{K}$. For the lower bound we note that since $\mathrm{KL}(\rho\|\pi) \ge 0$, the value of $\lambda$ that minimizes (9) is lower bounded by $\sqrt{\frac{\ln\frac{2}{\delta}}{(e-2)\langle \bar{V}_n, \rho \rangle}}$. We also note that $\langle \bar{V}_n, \rho \rangle \le K^2 n$, since $|Z_i(h)| \le K$ for all $h$ and $i$.
Hence, $\lambda \ge \frac{1}{K}\sqrt{\frac{\ln\frac{2}{\delta}}{(e-2)n}}$ and the range of $\lambda$ we are interested in is
$$\lambda \in \left[\frac{1}{K}\sqrt{\frac{\ln\frac{2}{\delta}}{(e-2)n}},\ \frac{1}{K}\right].$$
We cover this range with a grid of $\lambda_i$-s, such that $\lambda_i := c^i \frac{1}{K}\sqrt{\frac{\ln\frac{2}{\delta}}{(e-2)n}}$ for $i = 0, \dots, m-1$. It is easy to see that in order to cover the interval of relevant $\lambda$ we need
$$m = \left\lceil \frac{1}{\ln c} \ln\sqrt{\frac{(e-2)n}{\ln\frac{2}{\delta}}} \right\rceil.$$
($\lambda_{m-1}$ is the last value that is strictly less than $1/K$, and we take $\lambda_m := 1/K$ for the case when the technical condition (10) is not satisfied.) This defines the value of $\nu$ in (12).

Finally, we note that (9) has the form $g(\lambda) = \frac{U}{\lambda} + \lambda V$. For the relevant range of $\lambda$, there is a $\lambda_{i^*}$ that satisfies $\sqrt{U/V} \le \lambda_{i^*} < c\sqrt{U/V}$, and for this value of $\lambda$ we have $g(\lambda_{i^*}) \le (1+c)\sqrt{UV}$. Therefore, whenever (10) is satisfied, we pick the highest value of $\lambda_i$ that does not exceed the left-hand side of (10), substitute it into (9), and obtain (11), where the $\ln \nu$ factor comes from the union bound over the $\lambda_i$-s. If (10) is not satisfied, we know that $\langle \bar{V}_n, \rho \rangle < K^2\left(\mathrm{KL}(\rho\|\pi) + \ln\frac{2\nu}{\delta}\right)/(e-2)$, and by taking $\lambda = 1/K$ and substituting it into (9) we obtain (13).

Proof of Theorem 6: Theorem 6 follows from Theorem 5 in the same way as Theorem 8 follows from Theorem 7. The only difference is that the relevant range of $\lambda$ is unbounded from above. If $\mathrm{KL}(\rho\|\pi) = 0$, the bound is minimized by
$$\lambda = \sqrt{\frac{8\ln\frac{2}{\delta}}{\sum_{i=1}^n (\beta_i - \alpha_i)^2}};$$
hence, we are interested in $\lambda$ larger than or equal to this value. We take a grid of $\lambda_i$-s of the form $\lambda_i := c^i \sqrt{\frac{8\ln\frac{2}{\delta}}{\sum_{i=1}^n (\beta_i - \alpha_i)^2}}$ for $i \ge 0$. Then for a given value of $\mathrm{KL}(\rho\|\pi)$ we have to pick $\lambda_i$, such that
$$i = \left\lfloor \frac{\ln\left(\frac{\mathrm{KL}(\rho\|\pi)}{\ln\frac{2}{\delta}} + 1\right)}{2\ln c} \right\rfloor,$$
where $\lfloor x \rfloor$ is the largest integer smaller than $x$. Taking a weighted union bound over the $\lambda_i$-s with weights $2^{-(i+1)}$ completes the proof. (In the weighted union bound we take $\delta_i = \delta 2^{-(i+1)}$.
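As a numerical aside (our sketch; the values of $\delta$, $c$, and the variance sum are illustrative), the weighted union bound device can be checked directly: the individual failure probabilities $\delta_i = \delta\,2^{-(i+1)}$ sum to $\delta$ over the whole infinite grid, and the index formula above always selects a valid grid point at or above the $\mathrm{KL}(\rho\|\pi) = 0$ minimizer.

```python
import math

delta, c = 0.05, 2.0
S = 1.0  # illustrative value of sum_i (beta_i - alpha_i)^2
lam0 = math.sqrt(8 * math.log(2 / delta) / S)  # grid base: lambda_i = c**i * lam0

# The weights 2^{-(i+1)} make the individual failure probabilities
# delta_i = delta * 2^{-(i+1)} sum to delta over the infinite grid.
assert abs(sum(delta * 2 ** -(i + 1) for i in range(200)) - delta) < 1e-12

# Index selection: i = floor( ln(KL/ln(2/delta) + 1) / (2 ln c) ).
for KL in (0.0, 5.0, 20.0, 100.0):
    i = math.floor(math.log(KL / math.log(2 / delta) + 1) / (2 * math.log(c)))
    assert i >= 0                 # the formula never points below the grid
    lam_i = c**i * lam0
    assert lam_i >= lam0          # and never below the KL = 0 minimizer
```

Note that larger $\mathrm{KL}(\rho\|\pi)$ values select larger grid indices, at the price of the extra $\ln\frac{1}{\delta_i} = \ln\frac{2^{\,i+1}}{\delta}$ term absorbed by the weighted union bound.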
Then, by substituting $\delta_i$ for $\delta$, (7) holds with probability greater than $1-\delta_i$ for each $\lambda_i$ individually, and with probability greater than $1 - \sum_{i=0}^{\infty} \delta_i = 1 - \delta$ for all $\lambda_i$ simultaneously.)

APPENDIX C
BACKGROUND

In this appendix we provide a proof of Lemma 11. The proof reproduces an intermediate step in the proof of [21, Theorem 1].

Proof of Lemma 11: First, we have:
$$\mathbb{E}_{Z_i}\left[e^{\lambda Z_i} \,\middle|\, Z_1^{i-1}\right] \le \mathbb{E}_{Z_i}\left[1 + \lambda Z_i + (e-2)\lambda^2 Z_i^2 \,\middle|\, Z_1^{i-1}\right] \qquad (28)$$
$$= 1 + (e-2)\lambda^2\,\mathbb{E}_{Z_i}\left[Z_i^2 \,\middle|\, Z_1^{i-1}\right] \qquad (29)$$
$$\le e^{(e-2)\lambda^2\,\mathbb{E}_{Z_i}[Z_i^2 \mid Z_1^{i-1}]}, \qquad (30)$$
where (28) uses the fact that $e^x \le 1 + x + (e-2)x^2$ for $x \le 1$ (this restricts the choice of $\lambda$ to $\lambda \le \frac{1}{K}$, which leads to the technical conditions (10) and (15) in Theorem 8 and Corollary 12, respectively); (29) uses the martingale property $\mathbb{E}_{Z_i}[Z_i \mid Z_1^{i-1}] = 0$; and (30) uses the fact that $1 + x \le e^x$ for all $x$.

We apply inequality (30) in the following way:
$$\mathbb{E}_{Z_1^n}\left[e^{\lambda M_n - (e-2)\lambda^2 V_n}\right] = \mathbb{E}_{Z_1^n}\left[e^{\lambda M_{n-1} - (e-2)\lambda^2 V_{n-1} + \lambda Z_n - (e-2)\lambda^2 \mathbb{E}[Z_n^2 \mid Z_1^{n-1}]}\right]$$
$$= \mathbb{E}_{Z_1^{n-1}}\left[e^{\lambda M_{n-1} - (e-2)\lambda^2 V_{n-1}} \times \mathbb{E}_{Z_n}\left[e^{\lambda Z_n} \,\middle|\, Z_1^{n-1}\right] \times e^{-(e-2)\lambda^2 \mathbb{E}[Z_n^2 \mid Z_1^{n-1}]}\right]$$
$$\le \mathbb{E}_{Z_1^{n-1}}\left[e^{\lambda M_{n-1} - (e-2)\lambda^2 V_{n-1}}\right] \qquad (31)$$
$$\le \dots \qquad (32)$$
$$\le 1.$$
Inequality (31) applies inequality (30), and inequality (32) proceeds recursively with $Z_{n-1}, \dots, Z_1$ (in reverse order).

Note that conditioning on additional variables in the proof of the lemma does not change the result. This fact is exploited in the proof of Theorem 7, where we allow interdependence between multiple martingales.

ACKNOWLEDGMENTS

The authors would like to thank Andreas Maurer for his comments on Lemma 1. We are also very grateful to the anonymous reviewers for their valuable comments, which helped to improve the presentation of our work.
This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886, and by the European Community's Seventh Framework Programme (FP7/2007-2013), under grant agreement No 270327. This publication only reflects the authors' views.

REFERENCES

[1] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, pp. 13-30, 1963.
[2] K. Azuma, "Weighted sums of certain dependent random variables," Tôhoku Mathematical Journal, vol. 19, no. 3, 1967.
[3] S. N. Bernstein, Probability Theory, 4th ed., Moscow-Leningrad, 1946, in Russian.
[4] Y. Seldin, P. Auer, F. Laviolette, J. Shawe-Taylor, and R. Ortner, "PAC-Bayesian analysis of contextual bandits," in Advances in Neural Information Processing Systems (NIPS), 2011.
[5] M. D. Donsker and S. S. Varadhan, "Asymptotic evaluation of certain Markov process expectations for large time," Communications on Pure and Applied Mathematics, vol. 28, 1975.
[6] P. Dupuis and R. S. Ellis, A Weak Convergence Approach to the Theory of Large Deviations. Wiley-Interscience, 1997.
[7] R. M. Gray, Entropy and Information Theory, 2nd ed. Springer, 2011.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 1991.
[9] J. Shawe-Taylor and R. C. Williamson, "A PAC analysis of a Bayesian estimator," in Proceedings of the International Conference on Computational Learning Theory (COLT), 1997.
[10] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony, "Structural risk minimization over data-dependent hierarchies," IEEE Transactions on Information Theory, vol. 44, no. 5, 1998.
[11] D. McAllester, "Some PAC-Bayesian theorems," in Proceedings of the International Conference on Computational Learning Theory (COLT), 1998.
[12] M. Seeger, "PAC-Bayesian generalization error bounds for Gaussian process classification," Journal of Machine Learning Research, 2002.
[13] L. G. Valiant, "A theory of the learnable," Communications of the Association for Computing Machinery, vol. 27, no. 11, 1984.
[14] J. Langford and J. Shawe-Taylor, "PAC-Bayes & margins," in Advances in Neural Information Processing Systems (NIPS), 2002.
[15] D. McAllester, "PAC-Bayesian stochastic model selection," Machine Learning, vol. 51, no. 1, 2003.
[16] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand, "PAC-Bayesian learning of linear classifiers," in Proceedings of the International Conference on Machine Learning (ICML), 2009.
[17] Y. Seldin and N. Tishby, "PAC-Bayesian analysis of co-clustering and beyond," Journal of Machine Learning Research, vol. 11, 2010.
[18] G. Lever, F. Laviolette, and J. Shawe-Taylor, "Distribution-dependent PAC-Bayes priors," in Proceedings of the International Conference on Algorithmic Learning Theory (ALT), 2010.
[19] A. Maurer, "A note on the PAC-Bayesian theorem," www.arxiv.org, 2004.
[20] D. A. Freedman, "On tail probabilities for martingales," The Annals of Probability, vol. 3, no. 1, 1975.
[21] A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. Schapire, "Contextual bandit algorithms with supervised learning guarantees," in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
[22] V. Mnih, C. Szepesvári, and J.-Y. Audibert, "Empirical Bernstein stopping," in Proceedings of the International Conference on Machine Learning (ICML), 2008.
[23] J.-Y. Audibert, R. Munos, and C. Szepesvári, "Exploration-exploitation trade-off using variance estimates in multi-armed bandits," Theoretical Computer Science, 2009.
[24] A. Maurer and M. Pontil, "Empirical Bernstein bounds and sample variance penalization," in Proceedings of the International Conference on Computational Learning Theory (COLT), 2009.
[25] M. Seeger, "Bayesian Gaussian process models: PAC-Bayesian generalization error bounds and sparse approximations," Ph.D. dissertation, University of Edinburgh, 2003.

Yevgeny Seldin received his Ph.D. in computer science from the Hebrew University of Jerusalem in 2010. Since 2009 he has been a Research Scientist at the Max Planck Institute for Intelligent Systems in Tübingen, and since 2011 he has also been an Honorary Research Associate at the Department of Computer Science at University College London. His research interests include statistical learning theory, PAC-Bayesian analysis, and reinforcement learning. He has made contributions in PAC-Bayesian analysis, reinforcement learning, clustering-based models in supervised and unsupervised learning, collaborative filtering, image processing, and bioinformatics.

François Laviolette received his Ph.D. in mathematics from Université de Montréal in 1997. His thesis solved a long-standing (60-year-old) conjecture in graph theory and was among the seven finalists of the 1998 Council of Graduate Schools / University Microfilms International Distinguished Dissertation Award of Washington, in the category Mathematics-Physics-Engineering. He then moved to Université Laval, where he works on probabilistic verification of systems, bioinformatics, and machine learning, with a particular interest in PAC-Bayesian analysis, on which he has published more than a dozen scientific papers.

Nicolò Cesa-Bianchi is a faculty member of the Computer Science Department at the Università degli Studi di Milano, Italy. His main research interests include statistical learning theory, game-theoretic learning, and pattern analysis.
He is co-author, with Gábor Lugosi, of the monograph "Prediction, Learning, and Games" (Cambridge University Press, 2006).

John Shawe-Taylor obtained a Ph.D. in Mathematics at Royal Holloway, University of London in 1986. He subsequently completed an M.Sc. in the Foundations of Advanced Information Technology at Imperial College. He was promoted to Professor of Computing Science in 1996. He has published over 200 research papers. In 2006 he was appointed Director of the Center for Computational Statistics and Machine Learning at University College London. He has pioneered the development of well-founded approaches to machine learning inspired by statistical learning theory (including Support Vector Machines, Boosting, and Kernel Principal Components Analysis) and has shown the viability of applying these techniques to document analysis and computer vision. He is co-author of An Introduction to Support Vector Machines, the first comprehensive account of this new generation of machine learning algorithms. A second book, Kernel Methods for Pattern Analysis, was published in 2004.

Peter Auer received his Ph.D. in mathematics from the Vienna University of Technology in 1992, working on probability theory with Pal Revesz and on symbolic computation with Alexander Leitsch. He then moved to Graz University of Technology, working on machine learning with Wolfgang Maass, and was appointed associate professor in 1997. He has also been a research scholar at the University of California, Santa Cruz. In 2003 he accepted the position of full professor for Information Technology at the Montanuniversität Leoben.
He has authored scientific publications in the areas of probability theory, symbolic computation, and machine learning; he is a member of the editorial board of Machine Learning; and he has been principal investigator in several research projects funded by the European Union. His current research interests include machine learning, with a focus on autonomous learning and exploration algorithms.
