PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning
Institute of Mathematical Statistics
Lecture Notes–Monograph Series, Volume 56

PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning

Olivier Catoni

Institute of Mathematical Statistics, Beachwood, Ohio, USA

Institute of Mathematical Statistics Lecture Notes–Monograph Series
Series Editor: Anthony C. Davison
The production of the Institute of Mathematical Statistics Lecture Notes–Monograph Series is managed by the IMS Office: Rong Chen, Treasurer, and Elyse Gustafson, Executive Director.

Library of Congress Control Number: 2007939120
International Standard Book Number (10): 0-940600-72-2
International Standard Book Number (13): 978-0-940600-72-0
International Standard Serial Number: 0749-2170
DOI: 10.1214/074921707000000391
Copyright 2007 Institute of Mathematical Statistics
All rights reserved
Printed in the United States of America

Contents

Preface . . . v
Introduction . . . vii
1. Inductive PAC-Bayesian learning . . . 1
1.1. Basic inequality . . . 2
1.2. Non local bounds . . . 5
1.2.1. Unbiased empirical bounds . . . 5
1.2.2. Optimizing explicitly the exponential parameter λ . . . 8
1.2.3. Non random bounds . . . 10
1.2.4. Deviation bounds . . . 11
1.3. Local bounds . . . 14
1.3.1. Choice of the prior . . . 14
1.3.2. Unbiased local empirical bounds . . . 15
1.3.3. Non random local bounds . . . 18
1.3.4. Local deviation bounds . . . 19
1.3.5. Partially local bounds . . . 23
1.3.6. Two step localization . . . 28
1.4. Relative bounds . . . 34
1.4.1. Basic inequalities . . . 34
1.4.2. Non random bounds . . . 37
1.4.3. Unbiased empirical bounds . . . 41
1.4.4. Relative empirical deviation bounds . . . 44
2. Comparing posterior distributions to Gibbs priors . . . 51
2.1. Bounds relative to a Gibbs distribution . . . 51
2.1.1. Comparing a posterior distribution with a Gibbs prior . . . 52
2.1.2. The effective temperature of a posterior distribution . . . 55
2.1.3. Analysis of an empirical bound for the effective temperature . . . 57
2.1.4. Adaptation to parametric and margin assumptions . . . 62
2.1.5. Estimating the divergence of a posterior with respect to a Gibbs prior . . . 67
2.2. Playing with two posterior and two local prior distributions . . . 68
2.2.1. Comparing two posterior distributions . . . 68
2.2.2. Elaborate uses of relative bounds between posteriors . . . 70
2.2.3. Analysis of relative bounds . . . 75
2.3. Two step localization . . . 89
2.3.1. Two step localization of bounds relative to a Gibbs prior . . . 89
2.3.2. Analysis of two step bounds relative to a Gibbs prior . . . 97
2.3.3. Two step localization between posterior distributions . . . 102
3. Transductive PAC-Bayesian learning . . . 111
3.1. Basic inequalities . . . 111
3.1.1. The transductive setting . . . 111
3.1.2. Absolute bound . . . 113
3.1.3. Relative bounds . . . 114
3.2. Vapnik bounds for transductive classification . . . 115
3.2.1. With a shadow sample of arbitrary size . . . 116
3.2.2. When the shadow sample has the same size as the training sample . . . 118
3.2.3. When moreover the distribution of the augmented sample is exchangeable . . . 119
3.3. Vapnik bounds for inductive classification . . . 121
3.3.1. Arbitrary shadow sample size . . . 121
3.3.2. A better minimization with respect to the exponential parameter . . . 123
3.3.3. Equal shadow and training sample sizes . . . 125
3.3.4. Improvement on the equal sample size bound in the i.i.d. case . . . 126
3.4. Gaussian approximation in Vapnik bounds . . . 127
3.4.1. Gaussian upper bounds of variance terms . . . 127
3.4.2. Arbitrary shadow sample size . . . 128
3.4.3. Equal sample sizes in the i.i.d. case . . . 128
4. Support Vector Machines . . . 131
4.1. How to build them . . . 131
4.1.1. The canonical hyperplane . . . 131
4.1.2. Computation of the canonical hyperplane . . . 132
4.1.3. Support vectors . . . 134
4.1.4. The non-separable case . . . 134
4.1.5. Support Vector Machines . . . 138
4.1.6. Building kernels . . . 140
4.2. Bounds for Support Vector Machines . . . 142
4.2.1. Compression scheme bounds . . . 142
4.2.2. The Vapnik–Cervonenkis dimension of a family of subsets . . . 143
4.2.3. Vapnik–Cervonenkis dimension of linear rules with margin . . . 146
4.2.4. Application to Support Vector Machines . . . 148
4.2.5. Inductive margin bounds . . . 149
Appendix: Classification by thresholding . . . 155
5.1. Description of the model . . . 155
5.2. Computation of inductive bounds . . . 156
5.3. Transductive bounds . . . 158
Bibliography . . . 161

Preface

This monograph deals with adaptive supervised classification, using tools borrowed from statistical mechanics and information theory, stemming from the PAC-Bayesian approach pioneered by David McAllester and applied to a conception of statistical learning theory forged by Vladimir Vapnik. Using convex analysis on the set of posterior probability measures, we show how to get local measures of the complexity of the classification model involving the relative entropy of posterior distributions with respect to Gibbs posterior measures. We then discuss relative bounds, comparing the generalization error of two classification rules, showing how the margin assumption of Mammen and Tsybakov can be replaced with some empirical measure of the covariance structure of the classification model.
We show how to associate to any posterior distribution an effective temperature relating it to the Gibbs prior distribution with the same level of expected error rate, and how to estimate this effective temperature from data, resulting in an estimator whose expected error rate converges according to the best possible power of the sample size adaptively under any margin and parametric complexity assumptions. We describe and study an alternative selection scheme based on relative bounds between estimators, and present a two step localization technique which can handle the selection of a parametric model from a family of those. We show how to extend systematically all the results obtained in the inductive setting to transductive learning, and use this to improve Vapnik's generalization bounds, extending them to the case when the sample is made of independent non-identically distributed pairs of patterns and labels. Finally we review briefly the construction of Support Vector Machines and show how to derive generalization bounds for them, measuring the complexity either through the number of support vectors or through the value of the transductive or inductive margin.

Olivier Catoni
CNRS – Laboratoire de Probabilités et Modèles Aléatoires, Université Paris 6 (site Chevaleret), 4 place Jussieu – Case 188, 75252 Paris Cedex 05.

To my son Nicolas

Introduction

Among the possible approaches to pattern recognition, statistical learning theory has received a lot of attention in the last few years. Although a realistic pattern recognition scheme involves data pre-processing and post-processing that need a theory of their own, a central role is often played by some kind of supervised learning algorithm. This central building block is the subject we are going to analyse in these notes.

Accordingly, we assume that we have prepared in some way or another a sample of $N$ labelled patterns $(X_i, Y_i)_{i=1}^N$, where $X_i$ ranges in some pattern space $\mathcal{X}$ and $Y_i$ ranges in some finite label set $\mathcal{Y}$. We also assume that we have devised our experiment in such a way that the couples of random variables $(X_i, Y_i)$ are independent (but not necessarily equidistributed). Here, randomness should be understood to come from the way the statistician has planned his experiment. He may for instance have drawn the $X_i$'s at random from some larger population of patterns the algorithm is meant to be applied to in a second stage. The labels $Y_i$ may have been set with the help of some external expertise (which may itself be faulty or contain some amount of randomness, so we do not assume that $Y_i$ is a function of $X_i$, and allow the couple of random variables $(X_i, Y_i)$ to follow any kind of joint distribution). In practice, patterns will be extracted from some high dimensional and highly structured data, such as digital images, speech signals, DNA sequences, etc. We will not discuss this pre-processing stage here, although it poses crucial problems dealing with segmentation and the choice of a representation.

The aim of supervised classification is to choose some classification rule $f : \mathcal{X} \to \mathcal{Y}$ which predicts $Y$ from $X$, making as few mistakes as possible on average. The choice of $f$ will be driven by a suitable use of the information provided by the sample $(X_i, Y_i)_{i=1}^N$ on the joint distribution of $X$ and $Y$.
Moreover, considering all the possible measurable functions $f$ from $\mathcal{X}$ to $\mathcal{Y}$ would not be feasible in practice and, maybe more importantly, would not be well founded from a statistical point of view, at least as soon as the pattern space $\mathcal{X}$ is large and little is known in advance about the joint distribution of patterns $X$ and labels $Y$. Therefore, we will consider parametrized subsets of classification rules $\{f_\theta : \mathcal{X} \to \mathcal{Y};\ \theta \in \Theta_m\}$, $m \in M$, which may be grouped to form a big parameter set $\Theta = \bigcup_{m \in M} \Theta_m$.

The subject of this monograph is to introduce to statistical learning theory, and more precisely to the theory of supervised classification, a number of technical tools akin to statistical mechanics and information theory, dealing with the concepts of entropy and temperature. A central task will in particular be to control the mutual information between an estimated parameter and the observed sample. The focus will not be directly on the description of the data to be classified, but on the description of the classification rules. As we want to deal with high dimensional data, we will be bound to consider high dimensional sets of candidate classification rules, and will analyse them with tools very similar to those used in statistical mechanics to describe particle systems with many degrees of freedom. More specifically, the sets of classification rules will be described by Gibbs measures defined on parameter sets and depending on the observed sample value. A Gibbs measure is the special kind of probability measure used in statistical mechanics to describe the state of a particle system driven by a given energy function at some given temperature. Here, Gibbs measures will emerge as minimizers of the average loss value under entropy (or mutual information) constraints. Entropy itself, more precisely the Kullback divergence function between probability measures, will emerge in conjunction with the use of exponential deviation inequalities: indeed, the log-Laplace transform may be seen as the Legendre transform of the Kullback divergence function, as will be stated in Lemma 1.1.3 (page 4).

To fix notation, let $(X_i, Y_i)_{i=1}^N$ be the canonical process on $\Omega = (\mathcal{X} \times \mathcal{Y})^N$ (which means the coordinate process). Let the pattern space be provided with a sigma-algebra $\mathcal{B}$ turning it into a measurable space $(\mathcal{X}, \mathcal{B})$. On the finite label space $\mathcal{Y}$, we will consider the trivial algebra $\mathcal{B}'$ made of all its subsets. Let $\mathcal{M}_+^1\big[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\big]$ be our notation for the set of probability measures (i.e. of positive measures of total mass equal to 1) on the measurable space $\big[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\big]$. Once some probability distribution $P \in \mathcal{M}_+^1\big[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\big]$ is chosen, it turns $(X_i, Y_i)_{i=1}^N$ into the canonical realization of a stochastic process modelling the observed sample (also called the training set). We will assume that $P = \bigotimes_{i=1}^N P_i$, where for each $i = 1, \dots, N$, $P_i \in \mathcal{M}_+^1(\mathcal{X} \times \mathcal{Y}, \mathcal{B} \otimes \mathcal{B}')$, to reflect the assumption that we observe independent pairs of patterns and labels. We will also assume that we are provided with some indexed set of possible classification rules $\mathcal{R}_\Theta = \{f_\theta : \mathcal{X} \to \mathcal{Y};\ \theta \in \Theta\}$, where $(\Theta, \mathcal{T})$ is some measurable index set. Assuming some indexation of the classification rules is just a matter of presentation.
Although it leads to heavier notation, it allows us to integrate over the space of classification rules as well as over $\Omega$, using the usual formalism of multiple integrals. For this matter, we will assume that $(\theta, x) \mapsto f_\theta(x) : (\Theta \times \mathcal{X}, \mathcal{T} \otimes \mathcal{B}) \to (\mathcal{Y}, \mathcal{B}')$ is a measurable function. In many cases, as already mentioned, $\Theta = \bigcup_{m \in M} \Theta_m$ will be a finite (or more generally countable) union of subspaces, dividing the classification model $\mathcal{R}_\Theta = \bigcup_{m \in M} \mathcal{R}_{\Theta_m}$ into a union of sub-models. The importance of introducing such a structure has been put forward by V. Vapnik, as a way to avoid making strong hypotheses on the distribution $P$ of the sample. If neither the distribution of the sample nor the set of classification rules were constrained, it is well known that no kind of statistical inference would be possible. Considering a family of sub-models is a way to provide for adaptive classification where the choice of the model depends on the observed sample. Restricting the set of classification rules is more realistic than restricting the distribution of patterns, since the classification rules are a processing tool left to the choice of the statistician, whereas the distribution of the patterns is not fully under his control, except for some planning of the learning experiment which may enforce some weak properties like independence, but not the precise shapes of the marginal distributions $P_i$, which are as a rule unknown distributions on some high dimensional space.

In these notes, we will concentrate on general issues concerned with a natural measure of risk, namely the expected error rate of each classification rule $f_\theta$, expressed as

(0.1)  $R(\theta) = \dfrac{1}{N} \displaystyle\sum_{i=1}^{N} P\big[f_\theta(X_i) \neq Y_i\big].$

As this quantity is unobserved, we will be led to work with the corresponding empirical error rate

(0.2)  $r(\theta, \omega) = \dfrac{1}{N} \displaystyle\sum_{i=1}^{N} \mathbf{1}\big[f_\theta(X_i) \neq Y_i\big].$

This does not mean that practical learning algorithms will always try to minimize this criterion. They often on the contrary try to minimize some other criterion which is linked with the structure of the problem and has some nice additional properties (like smoothness and convexity, for example). Nevertheless, and independently of the precise form of the estimator $\hat\theta : \Omega \to \Theta$ under study, the analysis of $R(\hat\theta)$ is a natural question, and often corresponds to what is required in practice. Answering this question is not straightforward because, although $R(\theta)$ is the expectation of $r(\theta)$, a sum of independent Bernoulli random variables, $R(\hat\theta)$ is not the expectation of $r(\hat\theta)$, because of the dependence of $\hat\theta$ on the sample, and neither is $r(\hat\theta)$ a sum of independent random variables. To circumvent this unfortunate situation, some uniform control over the deviations of $r$ from $R$ is needed.

We will follow the PAC-Bayesian approach to this problem, originated in the machine learning community and pioneered by McAllester (1998, 1999). It can be seen as some variant of the more classical approach of M-estimators relying on empirical process theory, as described for instance in Van de Geer (2000). It is built on some general principles:

• One idea is to embed the set of estimators of the type $\hat\theta : \Omega \to \Theta$ into the larger set of regular conditional probability measures $\rho : \big(\Omega, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\big) \to \mathcal{M}_+^1(\Theta, \mathcal{T})$.
We will call these conditional probability measures posterior distributions, to follow standard terminology.

• A second idea is to measure the fluctuations of $\rho$ with respect to the sample, using some prior distribution $\pi \in \mathcal{M}_+^1(\Theta, \mathcal{T})$ and the Kullback divergence function $K(\rho, \pi)$. The expectation $P[K(\rho, \pi)]$ measures the randomness of $\rho$. The optimal choice of $\pi$ would be $P(\rho)$, resulting in a measure of the randomness of $\rho$ equal to the mutual information between the sample and the estimated parameter drawn from $\rho$. Anyhow, since $P(\rho)$ is usually not better known than $P$, we will have to be content with some less concentrated prior distribution $\pi$, resulting in some looser measure of randomness, as shown by the identity
$$P\big[K(\rho, \pi)\big] = P\big[K\big(\rho, P(\rho)\big)\big] + K\big(P(\rho), \pi\big).$$

• A third idea is to analyse the fluctuations of the random process $\theta \mapsto r(\theta)$ from its mean process $\theta \mapsto R(\theta)$ through the log-Laplace transform
$$-\frac{1}{\lambda} \log \int\!\!\int \exp\big[-\lambda r(\theta, \omega)\big]\, \pi(d\theta)\, P(d\omega),$$
as would be done in statistical mechanics, where this is called the free energy. This transform is well suited to relate $\min_{\theta \in \Theta} r(\theta)$ to $\inf_{\theta \in \Theta} R(\theta)$, since for large enough values of the parameter $\lambda$, corresponding to low enough values of the temperature, the system has small fluctuations around its ground state.

• A fourth idea deals with localization. It consists of considering a prior distribution $\pi$ depending on the unknown expected error rate function $R$. Thus some central result of the theory will consist in an empirical upper bound for $K\big(\rho, \pi_{\exp(-\beta R)}\big)$, where $\pi_{\exp(-\beta R)}$, defined by its density
$$\frac{d\pi_{\exp(-\beta R)}}{d\pi} = \frac{\exp(-\beta R)}{\pi\big[\exp(-\beta R)\big]},$$
is a Gibbs distribution built from a known prior distribution $\pi \in \mathcal{M}_+^1(\Theta, \mathcal{T})$, some inverse temperature parameter $\beta \in \mathbb{R}_+$ and the expected error rate $R$. This bound will in particular be used when $\rho$ is a posterior Gibbs distribution, of the form $\pi_{\exp(-\beta r)}$.

The general idea will be to show that in the case when $\rho$ is not too random, in the sense that it is possible to find a prior (that is non-random) distribution $\pi$ such that $K(\rho, \pi)$ is small, then $\rho(r)$ can be reliably taken for a good approximation of $\rho(R)$.

This monograph is divided into four chapters. The first deals with the inductive setting presented in these lines. The second is devoted to relative bounds. It shows that it is possible to obtain a tighter estimate of the mutual information between the sample and the estimated parameter by comparing prior and posterior Gibbs distributions. It shows how to use this idea to obtain adaptive model selection schemes under very weak hypotheses. The third chapter introduces the transductive setting of V. Vapnik (Vapnik, 1998), which consists in comparing the performance of classification rules on the learning sample with their performance on a test sample instead of their average performance. The fourth one is a fast introduction to Support Vector Machines. It is the occasion to show the implications of the general results discussed in the three first chapters when some particular choice is made about the structure of the classification rules.

In the first chapter, two types of bounds are shown. Empirical bounds are useful to build, compare and select estimators.
Non random bounds are useful to assess the speed of convergence of estimators, relating this speed to the behaviour of the Gibbs prior expected error rate $\beta \mapsto \pi_{\exp(-\beta R)}(R)$ and to covariance factors related to the margin assumption of Mammen and Tsybakov when a finer analysis is performed. We will proceed from the most straightforward bounds towards more elaborate ones, built to achieve a better asymptotic behaviour. In this course towards more sophisticated inequalities, we will introduce local bounds and relative bounds.

The study of relative bounds is expanded in the second chapter, where tighter comparisons between prior and posterior Gibbs distributions are proved. Theorems 2.1.3 (page 54) and 2.2.4 (page 73) present two ways of selecting some nearly optimal classification rule. They are both proved to be adaptive in all the parameters under Mammen and Tsybakov margin assumptions and parametric complexity assumptions. This is done in Corollary 2.1.17 (page 67) of Theorem 2.1.15 (page 66) and in Theorem 2.2.11 (page 89). In the first approach, the performance of a randomized estimator modelled by a posterior distribution is compared with the performance of a prior Gibbs distribution. In the second approach, posterior distributions are directly compared between themselves (which leads to slightly stronger results, at the price of using a more complex algorithm). When there is more than one parametric model, it is appropriate to use also some doubly localized scheme: two step localization is presented for both approaches, in Theorems 2.3.2 (page 94) and 2.3.9 (page 108), and provides bounds with a decreased influence of the number of empirically inefficient models included in the selection scheme.

We would not like to induce the reader into thinking that the most sophisticated results presented in these first two chapters are necessarily the most useful ones: they are as a rule only more efficient asymptotically, whereas, being more involved, they use looser constants leading to less precision for small sample sizes. In practice, whether a sample is to be considered small is a question of the ratio between the number of examples and the complexity (roughly speaking the number of parameters) of the model used for classification. Since our aim here is to describe methods appropriate for complex data (images, speech, DNA, ...), we suspect that practitioners wanting to make use of our proposals will often be confronted with small sample sizes; thus we would advise them to try the simplest bounds first and only afterwards see whether the asymptotically better ones can bring some improvement.

We would also like to point out that the results of the first two chapters are not of a purely theoretical nature: posterior parameter distributions can indeed be computed effectively, using Monte Carlo techniques, and there is well-established know-how about these computations in Bayesian statistics.
Moreover, non-randomized estimators of the classical form $\hat\theta : \Omega \to \Theta$ can be efficiently approximated by posterior distributions $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$ supported by a fairly narrow neighbourhood of $\hat\theta$, more precisely a neighbourhood of the size of the typical fluctuations of $\hat\theta$, so that this randomized approximation of $\hat\theta$ will most of the time provide the same classification as $\hat\theta$ itself, except for a small amount of dubious examples for which the classification provided by $\hat\theta$ would anyway be unreliable. This is explained on page 7.

As already mentioned, the third chapter is about the transductive setting, that is about comparing the performance of estimators on a training set and on a test set. We show first that this comparison can be based on a set of exponential deviation inequalities which parallels the one used in the inductive case. This gives the opportunity to transport all the results obtained in the inductive case in a systematic way. In the transductive setting, the use of prior distributions can be extended to the use of partially exchangeable posterior distributions depending on the union of training and test patterns, bringing increased possibilities to adapt to the data and giving rise to such crucial notions of complexity as the Vapnik–Cervonenkis dimension. Having done so, we more specifically focus on the small sample case, where local and relative bounds are not expected to be of great help. Introducing a fictitious (that is unobserved) shadow sample, we study Vapnik-type generalization bounds, showing how to tighten and extend them with some original ideas, like making no Gaussian approximation to the log-Laplace transform of Bernoulli random variables, using a shadow sample of arbitrary size, shrinking from the use of any symmetrization trick, and using a suitable subset of the group of permutations to cover the case of independent non-identically distributed data. The culminating result of the third chapter is Theorem 3.3.3 (page 125), subsequent bounds showing the separate influence of the above ideas and providing an easier comparison with Vapnik's original results. Vapnik-type generalization bounds have a broad applicability, not only through the concept of Vapnik–Cervonenkis dimension, but also through the use of compression schemes (Littlestone and Warmuth, 1986), which are briefly described on page 117.

The beginning of the fourth chapter introduces Support Vector Machines, both in the separable and in the non-separable case (using the box constraint). We then describe different types of bounds. We start with compression scheme bounds, to proceed with margin bounds. We begin with transductive margin bounds, recalling on this occasion in Theorem 4.2.2 (page 144) the growth bound for a family of classification rules with given Vapnik–Cervonenkis dimension. In Theorem 4.2.4 (page 146) we give the usual estimate of the Vapnik–Cervonenkis dimension of a family of separating hyperplanes with a given transductive margin (we mean by this that the margin is computed on the union of the training and test sets). We present an original probabilistic proof inspired by a similar one from Cristianini and Shawe-Taylor (2000), whereas other proofs available usually rely on the informal claim that the simplex is the worst case.
We end this short review of Support Vector Machines with a discussion of inductive margin bounds. Here the margin is computed on the training set only, and a more involved combinatorial lemma, due to Alon et al. (1997) and recalled in Lemma 4.2.6 (page 149), is used. We use this lemma and the results of the third chapter to establish a bound depending on the margin of the training set alone.

In the appendix, we finally discuss the textbook example of classification by thresholding: in this setting, each classification rule is built by thresholding a series of measurements and taking a decision based on these thresholded values. This relatively simple example (which can be considered as an introduction to the more technical case of classification trees) can be used to give more flesh to the results of the first three chapters.

It is a pleasure to end this introduction with my greatest thanks to Anthony Davison, for his careful reading of the manuscript and his numerous suggestions.

Chapter 1. Inductive PAC-Bayesian learning

The setting of inductive inference (as opposed to transductive inference, to be discussed later) is the one described in the introduction. When we have to take the expectation of a random variable $Z : \Omega \to \mathbb{R}$, as well as of a function of the parameter $h : \Theta \to \mathbb{R}$, with respect to some probability measure, we will as a rule use short functional notation instead of resorting to the integral sign: thus we will write $P(Z)$ for $\int_\Omega Z(\omega)\, P(d\omega)$ and $\pi(h)$ for $\int_\Theta h(\theta)\, \pi(d\theta)$.

A more traditional statistical approach would focus on estimators $\hat\theta : \Omega \to \Theta$ of the parameter $\theta$ and be interested in the relationship between the empirical error rate $r(\hat\theta)$, defined by equation (0.2, page ix), which is the proportion of errors made on the sample, and the expected error rate $R(\hat\theta)$, defined by equation (0.1, page viii), which is the expected probability of error on new instances of patterns. The PAC-Bayesian approach instead chooses a broader perspective and allows the estimator $\hat\theta$ to be drawn at random using some auxiliary source of randomness to smooth the dependence of $\hat\theta$ on the sample. One way of representing the supplementary randomness allowed in the choice of $\hat\theta$ is to consider what it is usual to call posterior distributions on the parameter space, that is probability measures $\rho : \Omega \to \mathcal{M}_+^1(\Theta, \mathcal{T})$, depending on the sample, or from a technical perspective, regular conditional (or transition) probability measures. Let us recall that we use the model described in the introduction: the training sample is modelled by the canonical process $(X_i, Y_i)_{i=1}^N$ on $\Omega = (\mathcal{X} \times \mathcal{Y})^N$, and a product probability measure $P = \bigotimes_{i=1}^N P_i$ on $\Omega$ is considered to reflect the assumption that the training sample is made of independent pairs of patterns and labels. The transition probability measure $\rho$, along with $P \in \mathcal{M}_+^1(\Omega)$, defines a probability distribution on $\Omega \times \Theta$ and describes the conditional distribution of the estimated parameter $\hat\theta$ knowing the sample $(X_i, Y_i)_{i=1}^N$.

The main subject of this broadened theory becomes to investigate the relationship between $\rho(r)$, the average error rate of $\hat\theta$ on the training sample, and $\rho(R)$, the expected error rate of $\hat\theta$ on new samples.
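Since the rest of the chapter constantly manipulates $\rho(r)$, $\rho(R)$ and the Kullback divergence $K(\rho, \pi)$ through this functional notation, a minimal numerical sketch may help fix ideas. The following toy computation is ours, not part of the monograph: it assumes a finite parameter set, so that prior and posterior become weight vectors, and uses made-up data and hypothetical threshold rules purely as placeholders.

```python
import math

# Toy illustration (not from the monograph): a finite parameter set and a
# tiny sample, so that pi and rho are weight vectors and rho(r), K(rho, pi)
# reduce to finite sums.
sample = [(0.1, 0), (0.4, 0), (0.5, 1), (0.7, 1), (0.9, 1)]   # (X_i, Y_i)
N = len(sample)
thresholds = {0: 0.3, 1: 0.45, 2: 0.8}   # hypothetical rules f_theta(x) = 1{x > t}

def r(theta):
    # empirical error rate r(theta) = (1/N) sum_i 1[f_theta(X_i) != Y_i]
    return sum(int(x > thresholds[theta]) != y for x, y in sample) / N

Theta = [0, 1, 2]
pi = {t: 1 / len(Theta) for t in Theta}   # flat prior
rho = {0: 0.1, 1: 0.8, 2: 0.1}            # some sample-dependent posterior weights

rho_r = sum(rho[t] * r(t) for t in Theta)                    # rho(r)
K = sum(rho[t] * math.log(rho[t] / pi[t]) for t in Theta)    # K(rho, pi)
print(rho_r, K)
```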
The first step towards using some kind of thermodynamics to tackle this question is to consider the Laplace transform of $\rho(R) - \rho(r)$, a well known provider of non-asymptotic deviation bounds. This transform takes the form
$$P\Big\{\exp\Big[\lambda\big(\rho(R) - \rho(r)\big)\Big]\Big\},$$
where some inverse temperature parameter $\lambda \in \mathbb{R}_+$, as a physicist would call it, is introduced. This Laplace transform would be easy to bound if $\rho$ did not depend on $\omega \in \Omega$ (namely on the sample), because $\rho(R)$ would then be non-random, and
$$\rho(r) = \frac{1}{N} \sum_{i=1}^{N} \rho\big[Y_i \neq f_\theta(X_i)\big]$$
would be a sum of independent random variables. It turns out, and this will be the subject of the next section, that this annoying dependence of $\rho$ on $\omega$ can be quantified, using the inequality
$$\rho(R) - \rho(r) \le \lambda^{-1} \log\Big\{\pi\Big[\exp\big(\lambda(R - r)\big)\Big]\Big\} + \lambda^{-1} K(\rho, \pi),$$
which holds for any probability measure $\pi \in \mathcal{M}_+^1(\Theta)$ on the parameter space; for our purpose it will be appropriate to consider a prior distribution $\pi$ that is non-random, as opposed to $\rho$, which depends on the sample. Here, $K(\rho, \pi)$ is the Kullback divergence of $\rho$ from $\pi$, whose definition will be recalled when we come to technicalities; it can be seen as an upper bound for the mutual information between $(X_i, Y_i)_{i=1}^N$ and the estimated parameter $\hat\theta$. This inequality will allow us to relate the penalized difference $\rho(R) - \rho(r) - \lambda^{-1} K(\rho, \pi)$ with the Laplace transform of sums of independent random variables.

1.1. Basic inequality

Let us now come to the details of the investigation sketched above. The first thing we will do is to study the Laplace transform of $R(\theta) - r(\theta)$, as a starting point for the more general study of $\rho(R) - \rho(r)$: it corresponds to the simple case where $\hat\theta$ is not random at all, and therefore where $\rho$ is a Dirac mass at some deterministic parameter value $\theta$. In the setting described in the introduction, let us consider the Bernoulli random variables $\sigma_i(\theta) = \mathbf{1}\big[Y_i \neq f_\theta(X_i)\big]$, which indicate whether the classification rule $f_\theta$ made an error on the $i$-th component of the training sample. Using independence and the concavity of the logarithm function, it is readily seen that for any real constant $\lambda$,
$$\log\Big\{P\big[\exp\big(-\lambda r(\theta)\big)\big]\Big\} = \sum_{i=1}^{N} \log\Big\{P\Big[\exp\Big(-\frac{\lambda}{N}\sigma_i\Big)\Big]\Big\} \le N \log\bigg(\frac{1}{N}\sum_{i=1}^{N} P\Big[\exp\Big(-\frac{\lambda}{N}\sigma_i\Big)\Big]\bigg).$$
The right-hand side of this inequality is the log-Laplace transform of a Bernoulli distribution with parameter $\frac{1}{N}\sum_{i=1}^{N} P(\sigma_i) = R(\theta)$. As any Bernoulli distribution is fully defined by its parameter, this log-Laplace transform is necessarily a function of $R(\theta)$. It can be expressed with the help of the family of functions

(1.1)  $\Phi_a(p) = -a^{-1}\log\big[1 - (1 - \exp(-a))\,p\big], \quad a \in \mathbb{R},\ p \in (0,1).$

It is immediately seen that $\Phi_a$ is an increasing one-to-one mapping of the unit interval onto itself, and that it is convex when $a > 0$, concave when $a < 0$, and can be defined by continuity to be the identity when $a = 0$. Moreover the inverse of $\Phi_a$ is given by the formula
$$\Phi_a^{-1}(q) = \frac{1 - \exp(-aq)}{1 - \exp(-a)}, \quad a \in \mathbb{R},\ q \in (0,1).$$
This formula may be used to extend $\Phi_a^{-1}$ to $q \in \mathbb{R}$, and we will use this extension without further notice when required. Using this notation, the previous inequality becomes
$$\log\Big\{P\big[\exp\big(-\lambda r(\theta)\big)\big]\Big\} \le -\lambda\, \Phi_{\lambda/N}\big(R(\theta)\big),$$
proving Lemma 1.1.1.
Lemma 1.1.1. For any real constant $\lambda$ and any parameter $\theta \in \Theta$,
$$P\exp\Big\{\lambda\Big[\Phi_{\lambda/N}\big(R(\theta)\big) - r(\theta)\Big]\Big\} \le 1.$$

In previous versions of this study, we had used some Bernstein bound instead of this lemma. Anyhow, as it will turn out, keeping the log-Laplace transform of a Bernoulli instead of approximating it provides simpler and tighter results. Lemma 1.1.1 implies that for any constants $\lambda \in \mathbb{R}_+$ and $\epsilon \in (0,1)$,
$$P\Big(\Phi_{\lambda/N}\big(R(\theta)\big) + \frac{\log(\epsilon)}{\lambda} \le r(\theta)\Big) \ge 1 - \epsilon.$$
Choosing $\lambda \in \arg\max_{\mathbb{R}_+} \Phi_{\lambda/N}\big(R(\theta)\big) + \frac{\log(\epsilon)}{\lambda}$, we deduce

Lemma 1.1.2. For any $\epsilon \in (0,1)$, any $\theta \in \Theta$,
$$P\Big(R(\theta) \le \inf_{\lambda \in \mathbb{R}_+} \Phi_{\lambda/N}^{-1}\Big[r(\theta) - \frac{\log(\epsilon)}{\lambda}\Big]\Big) \ge 1 - \epsilon.$$

We will illustrate throughout these notes the bounds we prove with a small numerical example: in the case where $N = 1000$, $\epsilon = 0.01$ and $r(\theta) = 0.2$, we get with a confidence level of $0.99$ that $R(\theta) \le 0.2402$, this being obtained for $\lambda = 234$.

Now, to proceed towards the analysis of posterior distributions, let us put $U_\lambda(\theta, \omega) = \lambda\big[\Phi_{\lambda/N}\big(R(\theta)\big) - r(\theta, \omega)\big]$ for short, and let us consider some prior probability distribution $\pi \in \mathcal{M}_+^1(\Theta, \mathcal{T})$. A proper choice of $\pi$ will be an important question, underlying much of the material presented in this monograph, so for the time being, let us only say that we will let this choice be as open as possible by writing inequalities which hold for any choice of $\pi$. Let us insist on the fact that when we say that $\pi$ is a prior distribution, we mean that it does not depend on the training sample $(X_i, Y_i)_{i=1}^N$. The quantity of interest to obtain the bound we are looking for is $\log\big\{P\big[\pi\big(\exp(U_\lambda)\big)\big]\big\}$. Using Fubini's theorem for non-negative functions, we see that
$$\log\Big\{P\Big[\pi\big(\exp(U_\lambda)\big)\Big]\Big\} = \log\Big\{\pi\Big[P\big(\exp(U_\lambda)\big)\Big]\Big\} \le 0.$$
To relate this quantity to the expectation $\rho(U_\lambda)$ with respect to any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, we will use the properties of the Kullback divergence $K(\rho, \pi)$ of $\rho$ with respect to $\pi$, which is defined as
$$K(\rho, \pi) = \begin{cases} \int \log\big(\tfrac{d\rho}{d\pi}\big)\, d\rho, & \text{when } \rho \text{ is absolutely continuous with respect to } \pi,\\ +\infty, & \text{otherwise.}\end{cases}$$
The following lemma shows in which sense the Kullback divergence function can be thought of as the dual of the log-Laplace transform.

Lemma 1.1.3. For any bounded measurable function $h : \Theta \to \mathbb{R}$, and any probability distribution $\rho \in \mathcal{M}_+^1(\Theta)$ such that $K(\rho, \pi) < \infty$,
$$\log\big\{\pi\big[\exp(h)\big]\big\} = \rho(h) - K(\rho, \pi) + K\big(\rho, \pi_{\exp(h)}\big),$$
where by definition
$$\frac{d\pi_{\exp(h)}}{d\pi} = \frac{\exp[h(\theta)]}{\pi[\exp(h)]}.$$
Consequently
$$\log\big\{\pi\big[\exp(h)\big]\big\} = \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho(h) - K(\rho, \pi).$$

The proof is just a matter of writing down the definition of the quantities involved and using the fact that the Kullback divergence function is non-negative, and can be found in Catoni (2004, page 160). In the duality between measurable functions and probability measures, we thus see that the log-Laplace transform with respect to $\pi$ is the Legendre transform of the Kullback divergence function with respect to $\pi$. Using this, we get
$$P\exp\Big(\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho\big[U_\lambda(\theta)\big] - K(\rho, \pi)\Big) \le 1,$$
which, combined with the convexity of $\lambda \Phi_{\lambda/N}$, proves the basic inequality we were looking for.

Theorem 1.1.4. For any real constant $\lambda$,
$$P\exp\bigg(\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda\Big[\Phi_{\lambda/N}\big(\rho(R)\big) - \rho(r)\Big] - K(\rho, \pi)\bigg) \le P\exp\bigg(\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda\Big[\rho\big(\Phi_{\lambda/N} \circ R\big) - \rho(r)\Big] - K(\rho, \pi)\bigg) \le 1.$$
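Before discussing Theorem 1.1.4 further, here is a short numerical sketch of our own, not taken from the monograph, of the functions $\Phi_a$ and $\Phi_a^{-1}$ and of the bound of Lemma 1.1.2. With the values used in the example above ($N = 1000$, $\epsilon = 0.01$, $r(\theta) = 0.2$) it recovers $R(\theta) \le 0.2402$, the infimum over $\lambda$ being attained near $\lambda = 234$.

```python
import math

def Phi(a, p):
    # Phi_a(p) = -(1/a) * log(1 - (1 - exp(-a)) * p), equation (1.1)
    return -math.log(1.0 - (1.0 - math.exp(-a)) * p) / a

def Phi_inv(a, q):
    # inverse map: Phi_a^{-1}(q) = (1 - exp(-a q)) / (1 - exp(-a))
    return (1.0 - math.exp(-a * q)) / (1.0 - math.exp(-a))

N, eps, r = 1000, 0.01, 0.2

# Lemma 1.1.2: R(theta) <= inf_lambda Phi_{lambda/N}^{-1}( r(theta) - log(eps)/lambda )
bounds = [(Phi_inv(lam / N, r - math.log(eps) / lam), lam) for lam in range(1, 2000)]
best_bound, best_lam = min(bounds)
print(best_bound, best_lam)   # approximately 0.2402, attained around lambda = 234
```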
We insist on the fact that in Theorem 1.1.4 we take a supremum in $\rho \in \mathcal{M}_+^1(\Theta)$ inside the expectation with respect to $P$, the sample distribution. This means that the proved inequality holds for any $\rho$ depending on the training sample, that is for any posterior distribution: indeed, measurability questions set aside,
$$P\exp\Big(\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho\big[U_\lambda(\theta)\big] - K(\rho, \pi)\Big) = \sup_{\rho : \Omega \to \mathcal{M}_+^1(\Theta)} P\exp\Big(\rho\big[U_\lambda(\theta)\big] - K(\rho, \pi)\Big),$$
and more formally,
$$\sup_{\rho : \Omega \to \mathcal{M}_+^1(\Theta)} P\exp\Big(\rho\big[U_\lambda(\theta)\big] - K(\rho, \pi)\Big) \le P\exp\Big(\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho\big[U_\lambda(\theta)\big] - K(\rho, \pi)\Big),$$
where the supremum in $\rho$ taken in the left-hand side is restricted to regular conditional probability distributions. The following sections will show how to use this theorem.

1.2. Non local bounds

At least three sorts of bounds can be deduced from Theorem 1.1.4. The most interesting ones with which to build estimators and tune parameters, as well as the first that have been considered in the development of the PAC-Bayesian approach, are deviation bounds. They provide an empirical upper bound for $\rho(R)$, that is a bound which can be computed from observed data, with some probability $1 - \epsilon$, where $\epsilon$ is a presumably small and tunable parameter setting the desired confidence level. Anyhow, most of the results about the convergence speed of estimators to be found in the statistical literature are concerned with the expectation $P[\rho(R)]$, therefore it is also enlightening to bound this quantity. In order to know at which rate it may be approaching $\inf_\Theta R$, a non-random upper bound is required, which will relate the average of the expected risk $P[\rho(R)]$ with the properties of the contrast function $\theta \mapsto R(\theta)$. Since the values of constants do matter a lot when a bound is to be used to select between various estimators using classification models of various complexities, a third kind of bound, related to the first, may be considered for the sake of its hopefully better constants: we will call them unbiased empirical bounds, to stress the fact that they provide some empirical quantity whose expectation under $P$ can be proved to be an upper bound for $P[\rho(R)]$, the average expected risk. The price to pay for these better constants is of course the lack of formal guarantee given by the bound: two random variables whose expectations are ordered in a certain way may very well be ordered in the reverse way with a large probability, so that basing the estimation of parameters or the selection of an estimator on some unbiased empirical bound is a hazardous business. Anyhow, since it is common practice to use the inequalities provided by mathematical statistical theory while replacing the proven constants with smaller values showing a better practical efficiency, considering unbiased empirical bounds as well as deviation bounds provides an indication about how much the constants may be decreased while not violating the theory too much.

1.2.1. Unbiased empirical bounds

Let $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$ be some fixed (and arbitrary) posterior distribution, describing some randomized estimator $\hat\theta : \Omega \to \Theta$. As we already mentioned, in these notes a posterior distribution will always be a regular conditional probability measure. By this we mean that
• for any $A \in \mathcal{T}$, the map $\omega \mapsto \rho(\omega, A) : \big(\Omega, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\big) \to \mathbb{R}_+$ is assumed to be measurable;
• for any $\omega \in \Omega$, the map $A \mapsto \rho(\omega, A) : \mathcal{T} \to \mathbb{R}_+$ is assumed to be a probability measure.

We will also assume without further notice that the $\sigma$-algebras we deal with are always countably generated. The technical implications of these assumptions are standard and discussed for instance in Catoni (2004, pages 50-54), where, among other things, a detailed proof of the decomposition of the Kullback–Leibler divergence is given.

Let us restrict to the case when the constant $\lambda$ is positive. We get from Theorem 1.1.4 that

(1.2)  $\exp\Big(\lambda\Big\{\Phi_{\lambda/N}\big[P\big(\rho(R)\big)\big] - P\big[\rho(r)\big]\Big\} - P\big[K(\rho, \pi)\big]\Big) \le 1,$

where we have used the convexity of the exp function and of $\Phi_{\lambda/N}$. Since we have restricted our attention to positive values of the constant $\lambda$, equation (1.2) can also be written
$$P\big[\rho(R)\big] \le \Phi_{\lambda/N}^{-1}\Big(P\big[\rho(r)\big] + \lambda^{-1} P\big[K(\rho, \pi)\big]\Big),$$
leading to

Theorem 1.2.1. For any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, for any positive parameter $\lambda$,
$$P\big[\rho(R)\big] \le \frac{1 - \exp\Big(-N^{-1} P\big[\lambda \rho(r) + K(\rho, \pi)\big]\Big)}{1 - \exp(-\lambda/N)} \le P\bigg(\frac{\lambda/N}{1 - \exp(-\lambda/N)}\Big[\rho(r) + \frac{K(\rho, \pi)}{\lambda}\Big]\bigg).$$

The last inequality provides the unbiased empirical upper bound for $\rho(R)$ we were looking for, meaning that the expectation of
$$\frac{\lambda/N}{1 - \exp(-\lambda/N)}\Big[\rho(r) + \frac{K(\rho, \pi)}{\lambda}\Big]$$
is larger than the expectation of $\rho(R)$. Let us notice that
$$1 \le \frac{\lambda/N}{1 - \exp(-\lambda/N)} \le \Big(1 - \frac{\lambda}{2N}\Big)^{-1}$$
and therefore that this coefficient is close to 1 when $\lambda$ is significantly smaller than $N$.

If we are ready to believe in this bound (although this belief is not mathematically well founded, as we already mentioned), we can use it to optimize $\lambda$ and to choose $\rho$. While the optimal choice of $\rho$ when $\lambda$ is fixed is, according to Lemma 1.1.3 (page 4), to take it equal to $\pi_{\exp(-\lambda r)}$, a Gibbs posterior distribution as it is sometimes called, we may for computational reasons be more interested in choosing $\rho$ in some other class of posterior distributions. For instance, our real interest may be to select some non-randomized estimator from a family $\hat\theta_m : \Omega \to \Theta_m$, $m \in M$, of possible ones, where the $\Theta_m$ are measurable subsets of $\Theta$ and where $M$ is an arbitrary (not necessarily countable) index set. We may for instance think of the case when $\hat\theta_m \in \arg\min_{\Theta_m} r$. We may slightly randomize the estimators to start with, considering for any $\theta \in \Theta_m$ and any $m \in M$,
$$\Delta_m(\theta) = \Big\{\theta' \in \Theta_m : \big(f_{\theta'}(X_i)\big)_{i=1}^N = \big(f_\theta(X_i)\big)_{i=1}^N\Big\},$$
and defining $\rho_m$ by the formula
$$\frac{d\rho_m}{d\pi}(\theta) = \frac{\mathbf{1}\big[\theta \in \Delta_m(\hat\theta_m)\big]}{\pi\big[\Delta_m(\hat\theta_m)\big]}.$$
Our posterior minimizes $K(\rho, \pi)$ among those distributions whose support is restricted to the values of $\theta$ in $\Theta_m$ for which the classification rule $f_\theta$ is identical to the estimated one $f_{\hat\theta_m}$ on the observed sample. Presumably, in many practical situations, $f_\theta(x)$ will be $\rho_m$ almost surely identical to $f_{\hat\theta_m}(x)$ when $\theta$ is drawn from $\rho_m$, for the vast majority of the values of $x \in \mathcal{X}$ and all the sub-models $\Theta_m$ not plagued with too much overfitting (since this is by construction the case when $x \in \{X_i : i = 1, \dots, N\}$). Therefore replacing $\hat\theta_m$ with $\rho_m$ can be expected to be a minor change in many situations. This change, by the way, can be estimated in the (admittedly not so common) case when the distribution of the patterns $(X_i)_{i=1}^N$ is known.
Indeed, introducing the pseudo-distance

(1.3)  $D(\theta, \theta') = \dfrac{1}{N} \displaystyle\sum_{i=1}^{N} P\big[f_\theta(X_i) \neq f_{\theta'}(X_i)\big], \quad \theta, \theta' \in \Theta,$

one immediately sees that $R(\theta') \le R(\theta) + D(\theta, \theta')$ for any $\theta, \theta' \in \Theta$, and therefore that $R(\hat\theta_m) \le \rho_m(R) + \rho_m\big[D(\cdot, \hat\theta_m)\big]$. Let us notice also that in the case where $\Theta_m \subset \mathbb{R}^{d_m}$, and $R$ happens to be convex on $\Delta_m(\hat\theta_m)$, then $\rho_m(R) \ge R\big(\int \theta\, \rho_m(d\theta)\big)$, and we can replace $\hat\theta_m$ with $\tilde\theta_m = \int \theta\, \rho_m(d\theta)$ and obtain bounds for $R(\tilde\theta_m)$. This is not a very heavy assumption about $R$, in the case where we consider $\hat\theta_m \in \arg\min_{\Theta_m} r$. Indeed, $\hat\theta_m$, and therefore $\Delta_m(\hat\theta_m)$, will presumably be close to $\arg\min_{\Theta_m} R$, and requiring a function to be convex in the neighbourhood of its minima is not a very strong assumption.

Since $r(\hat\theta_m) = \rho_m(r)$ and $K(\rho_m, \pi) = -\log \pi\big[\Delta_m(\hat\theta_m)\big]$, our unbiased empirical upper bound in this context reads as
$$\frac{\lambda/N}{1 - \exp(-\lambda/N)}\bigg\{r(\hat\theta_m) - \frac{\log \pi\big[\Delta_m(\hat\theta_m)\big]}{\lambda}\bigg\}.$$
Let us notice that we obtain a complexity factor $-\log \pi\big[\Delta_m(\hat\theta_m)\big]$ which may be compared with the Vapnik–Cervonenkis dimension. Indeed, in the case of binary classification, when using a classification model with Vapnik–Cervonenkis dimension not greater than $h_m$, that is when any subset of $\mathcal{X}$ which can be split in any arbitrary way by some classification rule $f_\theta$ of the model $\Theta_m$ has at most $h_m$ points, then $\big\{\Delta_m(\theta) : \theta \in \Theta_m\big\}$ is a partition of $\Theta_m$ with at most $\big(\frac{eN}{h_m}\big)^{h_m}$ components: these facts, if not already familiar to the reader, will be proved in Theorems 4.2.2 and 4.2.3 (page 144). Therefore
$$\inf_{\theta \in \Theta_m} -\log \pi\big[\Delta_m(\theta)\big] \le h_m \log\Big(\frac{eN}{h_m}\Big) - \log \pi(\Theta_m).$$
Thus, if the model and prior distribution are well suited to the classification task, in the sense that there is more "room" (where room is measured with $\pi$) between the two clusters defined by $\hat\theta_m$ than between other partitions of the sample of patterns $(X_i)_{i=1}^N$, then we will have
$$-\log \pi\big[\Delta_m(\hat\theta_m)\big] \le h_m \log\Big(\frac{eN}{h_m}\Big) - \log \pi(\Theta_m).$$
An optimal value $\hat m$ may be selected so that
$$\hat m \in \arg\min_{m \in M}\Bigg\{\inf_{\lambda \in \mathbb{R}_+} \frac{\lambda/N}{1 - \exp(-\lambda/N)}\bigg(r(\hat\theta_m) - \frac{\log \pi\big[\Delta_m(\hat\theta_m)\big]}{\lambda}\bigg)\Bigg\}.$$
Since $\rho_{\hat m}$ is still another posterior distribution, we can be sure that
$$P\Big\{R(\hat\theta_{\hat m}) - \rho_{\hat m}\big[D(\cdot, \hat\theta_{\hat m})\big]\Big\} \le P\big[\rho_{\hat m}(R)\big] \le \inf_{\lambda \in \mathbb{R}_+} P\Bigg\{\frac{\lambda/N}{1 - \exp(-\lambda/N)}\bigg(r(\hat\theta_{\hat m}) - \frac{\log \pi\big[\Delta_{\hat m}(\hat\theta_{\hat m})\big]}{\lambda}\bigg)\Bigg\}.$$
Taking the infimum in $\lambda$ inside the expectation with respect to $P$ would be possible at the price of some supplementary technicalities and a slight increase of the bound, that we prefer to postpone to the discussion of deviation bounds, since they are the only ones to provide a rigorous mathematical foundation to the adaptive selection of estimators.
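To make the selection rule $\hat m$ concrete, here is a small sketch of our own. It uses hypothetical one-dimensional threshold rules $f_\theta(x) = \mathbf{1}\{x > \theta\}$, two finite grids as sub-models, a flat prior giving each sub-model mass $1/2$, and made-up data; none of these choices come from the monograph. For each sub-model it computes $\hat\theta_m$, the complexity term $-\log \pi[\Delta_m(\hat\theta_m)]$ and the $\lambda$-optimized criterion, then picks $\hat m$.

```python
import math

# Toy illustration (ours): two sub-models of threshold rules f_theta(x) = 1{x > theta},
# a coarse grid and a fine grid, with a prior pi giving mass 1/2 to each sub-model,
# spread uniformly inside it.
sample = [(0.08, 0), (0.21, 0), (0.36, 0), (0.44, 1),
          (0.50, 0), (0.58, 1), (0.77, 1), (0.93, 1)]
N = len(sample)
models = {1: [k / 10 for k in range(11)], 2: [k / 200 for k in range(201)]}

def predictions(theta):
    return tuple(int(x > theta) for x, _ in sample)

def r(theta):
    return sum(int(x > theta) != y for x, y in sample) / N

def criterion(m):
    grid = models[m]
    theta_hat = min(grid, key=r)                         # empirical risk minimizer
    delta = [t for t in grid if predictions(t) == predictions(theta_hat)]
    complexity = -math.log(0.5 * len(delta) / len(grid))  # -log pi(Delta_m(theta_hat))
    # unbiased empirical bound, minimized over lambda on a crude grid
    best = min((lam / N) / (1 - math.exp(-lam / N)) * (r(theta_hat) + complexity / lam)
               for lam in range(1, 5 * N))
    return best, theta_hat

m_hat = min(models, key=lambda m: criterion(m)[0])
print(m_hat, criterion(m_hat))
```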
1.2.2. Optimizing explicitly the exponential parameter λ

In this section we address some technical issues we think helpful to the understanding of Theorem 1.2.1 (page 6): namely to investigate how the upper bound it provides could be optimized, or at least approximately optimized, in $\lambda$. It turns out that this can be done quite explicitly. So we will consider in this discussion the posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$ to be fixed, and our aim will be to eliminate the constant $\lambda$ from the bound by choosing its value in some nearly optimal way as a function of $P[\rho(r)]$, the average of the empirical risk, and of $P[K(\rho, \pi)]$, which controls overfitting. Let the bound be written as
$$\varphi(\lambda) = \big[1 - \exp(-\lambda/N)\big]^{-1}\Big\{1 - \exp\Big[-\frac{\lambda}{N} P\big[\rho(r)\big] - N^{-1} P\big[K(\rho, \pi)\big]\Big]\Big\}.$$
We see that
$$N\frac{\partial}{\partial\lambda} \log \varphi(\lambda) = \frac{P\big[\rho(r)\big]}{\exp\Big[\frac{\lambda}{N} P\big[\rho(r)\big] + N^{-1} P\big[K(\rho, \pi)\big]\Big] - 1} - \frac{1}{\exp(\lambda/N) - 1}.$$
Thus, the optimal value for $\lambda$ is such that
$$\big[\exp(\lambda/N) - 1\big]\, P\big[\rho(r)\big] = \exp\Big[\frac{\lambda}{N} P\big[\rho(r)\big] + N^{-1} P\big[K(\rho, \pi)\big]\Big] - 1.$$
Assuming that $1 \gg \frac{\lambda}{N} P[\rho(r)] \gg \frac{P[K(\rho,\pi)]}{N}$, and keeping only higher order terms, we are led to choose
$$\lambda = \sqrt{\frac{2N\, P\big[K(\rho, \pi)\big]}{P\big[\rho(r)\big]\,\big(1 - P\big[\rho(r)\big]\big)}},$$
obtaining

Theorem 1.2.2. For any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,
$$P\big[\rho(R)\big] \le \frac{1 - \exp\bigg(-\sqrt{\dfrac{2\, P[K(\rho,\pi)]\, P[\rho(r)]}{N\,\{1 - P[\rho(r)]\}}} - \dfrac{P[K(\rho,\pi)]}{N}\bigg)}{1 - \exp\bigg(-\sqrt{\dfrac{2\, P[K(\rho,\pi)]}{N\, P[\rho(r)]\,\{1 - P[\rho(r)]\}}}\bigg)}.$$

This result of course is not very useful in itself, since neither of the two quantities $P[\rho(r)]$ and $P[K(\rho,\pi)]$ is easy to evaluate. Anyhow it gives a hint that replacing them boldly with $\rho(r)$ and $K(\rho,\pi)$ could produce something close to a legitimate empirical upper bound for $\rho(R)$. We will see in the subsection about deviation bounds that this is indeed essentially true. Let us remark that in the third chapter of this monograph, we will see another way of bounding $\inf_{\lambda \in \mathbb{R}_+} \Phi_{\lambda/N}^{-1}\big(q + \frac{d}{\lambda}\big)$, leading to

Theorem 1.2.3. For any prior distribution $\pi \in \mathcal{M}_+^1(\Theta)$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,
$$P\big[\rho(R)\big] \le \bigg(1 + \frac{2 P[K(\rho,\pi)]}{N}\bigg)^{-1}\Bigg\{P\big[\rho(r)\big] + \frac{P[K(\rho,\pi)]}{N} + \sqrt{2 P[K(\rho,\pi)]\bigg(\frac{P[\rho(r)]\,\{1 - P[\rho(r)]\}}{N} + \frac{P[K(\rho,\pi)]}{2N^2}\bigg)}\Bigg\},$$
as soon as
$$P\big[\rho(r)\big] + \sqrt{\frac{P[K(\rho,\pi)]}{2N}} \le \frac{1}{2},$$
and
$$P\big[\rho(R)\big] \le P\big[\rho(r)\big] + \sqrt{\frac{P[K(\rho,\pi)]}{2N}}$$
otherwise.

This theorem enlightens the influence of three terms on the average expected risk:
• the average empirical risk, $P[\rho(r)]$, which as a rule will decrease as the size of the classification model increases, acts as a bias term, grasping the ability of the model to account for the observed sample itself;
• a variance term $\frac{1}{N} P[\rho(r)]\,\big(1 - P[\rho(r)]\big)$ is due to the random fluctuations of $\rho(r)$;
• a complexity term $P[K(\rho,\pi)]$, which as a rule will increase with the size of the classification model, eventually acts as a multiplier of the variance term.

We observed numerically that the bound provided by Theorem 1.2.2 is better than the more classical Vapnik-like bound of Theorem 1.2.3. For instance, when $N = 1000$, $P[\rho(r)] = 0.2$ and $P[K(\rho,\pi)] = 10$, Theorem 1.2.2 gives a bound lower than $0.2604$, whereas the more classical Vapnik-like approximation of Theorem 1.2.3 gives a bound larger than $0.2622$. Numerical simulations tend to suggest the two bounds are always ordered in the same way, although this could be a little tedious to prove mathematically.
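The numerical comparison just quoted is easy to reproduce. The following lines are ours, not part of the monograph; they evaluate the bound of Theorem 1.2.2 (as $\Phi_{\lambda/N}^{-1}$ at the near-optimal $\lambda$ derived above) and the bound of Theorem 1.2.3 at $N = 1000$, $P[\rho(r)] = 0.2$ and $P[K(\rho,\pi)] = 10$, recovering the values $0.2604$ and $0.2622$.

```python
import math

N, p, K = 1000, 0.2, 10.0   # sample size, P[rho(r)], P[K(rho, pi)]

# Theorem 1.2.2: plug lambda = sqrt(2 N K / (p (1 - p))) into
# Phi_{lambda/N}^{-1}( p + K / lambda ).
lam = math.sqrt(2 * N * K / (p * (1 - p)))
a = lam / N
bound_1_2_2 = (1 - math.exp(-a * p - K / N)) / (1 - math.exp(-a))

# Theorem 1.2.3 (Vapnik-like form), valid here since p + sqrt(K / (2 N)) <= 1/2
bound_1_2_3 = (p + K / N + math.sqrt(2 * K * (p * (1 - p) / N + K / (2 * N ** 2)))) / (1 + 2 * K / N)

print(round(bound_1_2_2, 4), round(bound_1_2_3, 4))   # 0.2604 and 0.2622
```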
1.2.3. Non random bounds

It is time now to come to less tentative results and see how far the average expected error rate $P[\rho(R)]$ is from its best possible value $\inf_\Theta R$. Let us notice first that
$$\lambda \rho(r) + K(\rho, \pi) = K\big(\rho, \pi_{\exp(-\lambda r)}\big) - \log\big\{\pi\big[\exp(-\lambda r)\big]\big\}.$$
Let us remark moreover that $r \mapsto \log\big\{\pi\big[\exp(-\lambda r)\big]\big\}$ is a convex functional, a property which from a technical point of view can be dealt with in the following way:

(1.4)  $P\Big(\log\big\{\pi\big[\exp(-\lambda r)\big]\big\}\Big) = P\Big(\sup_{\rho \in \mathcal{M}_+^1(\Theta)} -\lambda\rho(r) - K(\rho, \pi)\Big) \ge \sup_{\rho \in \mathcal{M}_+^1(\Theta)} P\Big(-\lambda\rho(r) - K(\rho, \pi)\Big) = \sup_{\rho \in \mathcal{M}_+^1(\Theta)} -\lambda\rho(R) - K(\rho, \pi) = \log\big\{\pi\big[\exp(-\lambda R)\big]\big\} = -\int_0^\lambda \pi_{\exp(-\beta R)}(R)\, d\beta.$

These remarks applied to Theorem 1.2.1 lead to

Theorem 1.2.4. For any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, for any positive parameter $\lambda$,
$$P\big[\rho(R)\big] \le \frac{1 - \exp\Big(-\frac{1}{N}\int_0^\lambda \pi_{\exp(-\beta R)}(R)\, d\beta - \frac{1}{N} P\big[K\big(\rho, \pi_{\exp(-\lambda r)}\big)\big]\Big)}{1 - \exp(-\lambda/N)} \le \frac{N^{-1}}{1 - \exp(-\lambda/N)}\Big\{\int_0^\lambda \pi_{\exp(-\beta R)}(R)\, d\beta + P\big[K\big(\rho, \pi_{\exp(-\lambda r)}\big)\big]\Big\}.$$

This theorem is particularly well suited to the case of the Gibbs posterior distribution $\rho = \pi_{\exp(-\lambda r)}$, where the entropy factor cancels and where $P\big[\pi_{\exp(-\lambda r)}(R)\big]$ is shown to get close to $\inf_\Theta R$ when $N$ goes to $+\infty$, as soon as $\lambda/N$ goes to $0$ while $\lambda$ goes to $+\infty$. We can elaborate on Theorem 1.2.4 and define a notion of dimension of $(\Theta, R)$ with margin $\eta \ge 0$, putting

(1.5)  $d_\eta(\Theta, R) = \sup_{\beta \in \mathbb{R}_+} \beta\Big(\pi_{\exp(-\beta R)}(R) - \operatorname{ess\,inf}_\pi R - \eta\Big) \le -\log\big\{\pi\big(R \le \operatorname{ess\,inf}_\pi R + \eta\big)\big\}.$

This last inequality can be established by the chain of inequalities
$$\beta\, \pi_{\exp(-\beta R)}(R) \le \int_0^\beta \pi_{\exp(-\gamma R)}(R)\, d\gamma = -\log\big\{\pi\big[\exp(-\beta R)\big]\big\} \le \beta\big(\operatorname{ess\,inf}_\pi R + \eta\big) - \log\big[\pi\big(R \le \operatorname{ess\,inf}_\pi R + \eta\big)\big],$$
where we have used successively the fact that $\lambda \mapsto \pi_{\exp(-\lambda R)}(R)$ is decreasing (because it is the derivative of the concave function $\lambda \mapsto -\log\big\{\pi\big[\exp(-\lambda R)\big]\big\}$) and the fact that the exponential function takes positive values. In typical "parametric" situations $d_0(\Theta, R)$ will be finite, and in all circumstances $d_\eta(\Theta, R)$ will be finite for any $\eta > 0$ (this is a direct consequence of the definition of the essential infimum). Using this notion of dimension, we see that
$$\int_0^\lambda \pi_{\exp(-\beta R)}(R)\, d\beta \le \lambda\big(\operatorname{ess\,inf}_\pi R + \eta\big) + \int_0^\lambda \frac{d_\eta}{\beta} \wedge \big(1 - \operatorname{ess\,inf}_\pi R - \eta\big)\, d\beta = \lambda\big(\operatorname{ess\,inf}_\pi R + \eta\big) + d_\eta(\Theta, R)\, \log\bigg(\frac{e\lambda\,\big(1 - \operatorname{ess\,inf}_\pi R - \eta\big)}{d_\eta(\Theta, R)}\bigg).$$
This leads to

Corollary 1.2.5. With the above notation, for any margin $\eta \in \mathbb{R}_+$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,
$$P\big[\rho(R)\big] \le \inf_{\lambda \in \mathbb{R}_+} \Phi_{\lambda/N}^{-1}\bigg[\operatorname{ess\,inf}_\pi R + \eta + \frac{d_\eta}{\lambda}\log\Big(\frac{e\lambda}{d_\eta}\Big) + \frac{P\big[K\big(\rho, \pi_{\exp(-\lambda r)}\big)\big]}{\lambda}\bigg].$$

If one wants a posterior distribution with a small support, the theorem can also be applied to the case when $\rho$ is obtained by truncating $\pi_{\exp(-\lambda r)}$ to some level set to reduce its support: let $\Theta_p = \{\theta \in \Theta : r(\theta) \le p\}$, and let us define for any $q \in [0,1)$ the level $p_q = \inf\{p : \pi_{\exp(-\lambda r)}(\Theta_p) \ge q\}$; let us then define $\rho_q$ by its density
$$\frac{d\rho_q}{d\pi_{\exp(-\lambda r)}}(\theta) = \frac{\mathbf{1}\big(\theta \in \Theta_{p_q}\big)}{\pi_{\exp(-\lambda r)}\big(\Theta_{p_q}\big)};$$
then $\rho_0 = \pi_{\exp(-\lambda r)}$ and for any $q \in (0,1)$,
$$P\big[\rho_q(R)\big] \le \frac{1 - \exp\Big(-\frac{1}{N}\int_0^\lambda \pi_{\exp(-\beta R)}(R)\, d\beta + \frac{\log(q)}{N}\Big)}{1 - \exp(-\lambda/N)} \le \frac{N^{-1}}{1 - \exp(-\lambda/N)}\Big\{\int_0^\lambda \pi_{\exp(-\beta R)}(R)\, d\beta - \log(q)\Big\}.$$

1.2.4. Deviation bounds

They provide results holding under the distribution $P$ of the sample with probability at least $1 - \epsilon$, for any given confidence level set by the choice of $\epsilon \in (0,1)$. Using them is the only way to be quite (i.e. with probability $1 - \epsilon$) sure to do the right thing, although this right thing may be over-pessimistic, since deviation upper bounds are larger than the corresponding unbiased bounds.
Starting again from Theorem 1.1.4 (page 4), and using Markov's inequality $P\big[\exp(h) \ge 1\big] \le P\big[\exp(h)\big]$, we obtain

Theorem 1.2.6. For any positive parameter $\lambda$, with $P$ probability at least $1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,
$$\rho(R) \le \Phi_{\lambda/N}^{-1}\bigg[\rho(r) + \frac{K(\rho, \pi) - \log(\epsilon)}{\lambda}\bigg] = \frac{1 - \exp\Big(-\frac{\lambda \rho(r)}{N} - \frac{K(\rho, \pi) - \log(\epsilon)}{N}\Big)}{1 - \exp(-\lambda/N)} \le \frac{\lambda/N}{1 - \exp(-\lambda/N)}\bigg[\rho(r) + \frac{K(\rho, \pi) - \log(\epsilon)}{\lambda}\bigg].$$

We see that for a fixed value of the parameter $\lambda$, the upper bound is optimized when the posterior is chosen to be the Gibbs distribution $\rho = \pi_{\exp(-\lambda r)}$. In this theorem, we have bounded $\rho(R)$, the average expected risk of an estimator $\hat\theta$ drawn from the posterior $\rho$. This is what we will do most of the time in this study. This is the error rate we will get if we classify a large number of test patterns, drawing a new $\hat\theta$ for each one. However, we can also be interested in the error rate we get if we draw only one $\hat\theta$ from $\rho$ and use this single draw of $\hat\theta$ to classify a large number of test patterns. This error rate is $R(\hat\theta)$. To state a result about its deviations, we can start back from Lemma 1.1.1 (page 3) and integrate it with respect to the prior distribution $\pi$ to get for any real constant $\lambda$,
$$P\Big\{\pi\Big[\exp\Big(\lambda\big[\Phi_{\lambda/N}(R) - r\big]\Big)\Big]\Big\} \le 1.$$
For any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, this can be rewritten as
$$P\Big\{\rho\Big[\exp\Big(\lambda\big[\Phi_{\lambda/N}(R) - r\big] - \log\frac{d\rho}{d\pi} + \log(\epsilon)\Big)\Big]\Big\} \le \epsilon,$$
proving

Theorem 1.2.7. For any positive real parameter $\lambda$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, with $P\rho$ probability at least $1 - \epsilon$,
$$R(\hat\theta) \le \Phi_{\lambda/N}^{-1}\bigg[r(\hat\theta) + \lambda^{-1}\log\Big(\epsilon^{-1}\frac{d\rho}{d\pi}(\hat\theta)\Big)\bigg] \le \frac{\lambda/N}{1 - \exp(-\lambda/N)}\bigg[r(\hat\theta) + \lambda^{-1}\log\Big(\epsilon^{-1}\frac{d\rho}{d\pi}(\hat\theta)\Big)\bigg].$$

Let us remark that the bound provided here is the exact counterpart of the bound of Theorem 1.2.6, since $\log\frac{d\rho}{d\pi}$ appears as a disintegrated version of the divergence $K(\rho, \pi)$. The parallel between the two theorems is particularly striking in the special case when $\rho = \pi_{\exp(-\lambda r)}$. Indeed, Theorem 1.2.6 proves that with $P$ probability at least $1 - \epsilon$,
$$\pi_{\exp(-\lambda r)}(R) \le \Phi_{\lambda/N}^{-1}\bigg[-\frac{\log\big\{\pi\big[\exp(-\lambda r)\big]\big\} + \log(\epsilon)}{\lambda}\bigg],$$
whereas Theorem 1.2.7 proves that with $P\pi_{\exp(-\lambda r)}$ probability at least $1 - \epsilon$,
$$R(\hat\theta) \le \Phi_{\lambda/N}^{-1}\bigg[-\frac{\log\big\{\pi\big[\exp(-\lambda r)\big]\big\} + \log(\epsilon)}{\lambda}\bigg],$$
showing that we get the same deviation bound for $\pi_{\exp(-\lambda r)}(R)$ under $P$ and for $\hat\theta$ under $P\pi_{\exp(-\lambda r)}$.
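As an illustration of this special case, the following sketch (ours, on a made-up finite parameter set with a flat prior and placeholder error rates) evaluates the common deviation bound $\Phi_{\lambda/N}^{-1}\big(-[\log\pi[\exp(-\lambda r)] + \log\epsilon]/\lambda\big)$ for the Gibbs posterior $\pi_{\exp(-\lambda r)}$.

```python
import math

def Phi_inv(a, q):
    return (1.0 - math.exp(-a * q)) / (1.0 - math.exp(-a))

# Toy finite model (ours): flat prior on a grid of parameters with made-up
# empirical error rates, to evaluate the deviation bound of Theorem 1.2.6
# for the Gibbs posterior rho = pi_{exp(-lambda r)}.
N, eps, lam = 1000, 0.01, 200.0
r_values = [0.20, 0.22, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
pi = 1.0 / len(r_values)

log_pi_exp = math.log(sum(pi * math.exp(-lam * r) for r in r_values))
# for rho = pi_{exp(-lambda r)}, lambda rho(r) + K(rho, pi) = -log pi[exp(-lambda r)]
bound = Phi_inv(lam / N, (-log_pi_exp - math.log(eps)) / lam)
print(bound)
```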
Thus with probability at least $1-\epsilon$, for any posterior distribution $\rho$,
\[
\rho(R) \le \inf_{\lambda\in\,]1,\infty[} \Phi^{-1}_{\frac{\lambda}{N}}\bigg\{\rho(r) + \frac{\alpha}{\lambda}\Big[K(\rho,\pi) - \log(\epsilon) + 2\log\Big(\frac{\log(\alpha^2\lambda)}{\log(\alpha)}\Big)\Big]\bigg\}
= \inf_{\lambda\in\,]1,\infty[} \frac{1-\exp\Big\{-\frac{\lambda\rho(r)}{N} - \frac{\alpha}{N}\Big[K(\rho,\pi)-\log(\epsilon) + 2\log\Big(\frac{\log(\alpha^2\lambda)}{\log(\alpha)}\Big)\Big]\Big\}}{1-\exp\big(-\frac{\lambda}{N}\big)}.
\]
Taking the approximately optimal value
\[
\lambda = \sqrt{\frac{2N\alpha\big[K(\rho,\pi)-\log(\epsilon)\big]}{\rho(r)\big[1-\rho(r)\big]}},
\]
we obtain

Theorem 1.2.8. With probability $1-\epsilon$, for any posterior distribution $\rho:\Omega\to\mathcal{M}^1_+(\Theta)$, putting $d(\rho,\epsilon) = K(\rho,\pi) - \log(\epsilon)$,
\[
\rho(R) \le \inf_{k\in\mathbb{N}} \frac{1-\exp\Big\{-\frac{\alpha^k}{N}\rho(r) - \frac{1}{N}\Big[d(\rho,\epsilon) + \log\big((k+1)(k+2)\big)\Big]\Big\}}{1-\exp\big(-\frac{\alpha^k}{N}\big)}
\le \frac{1-\exp\Bigg\{-\sqrt{\dfrac{2\alpha\rho(r)\,d(\rho,\epsilon)}{N\big[1-\rho(r)\big]}} - \dfrac{\alpha}{N}\Bigg[d(\rho,\epsilon) + 2\log\Bigg(\dfrac{\log\Big(\alpha^2\sqrt{\tfrac{2N\alpha\, d(\rho,\epsilon)}{\rho(r)[1-\rho(r)]}}\Big)}{\log(\alpha)}\Bigg)\Bigg]\Bigg\}}{1-\exp\Bigg(-\sqrt{\dfrac{2\alpha\, d(\rho,\epsilon)}{N\rho(r)\big[1-\rho(r)\big]}}\Bigg)}.
\]
Moreover with probability at least $1-\epsilon$, for any posterior distribution $\rho$ such that $\rho(r) = 0$,
\[
\rho(R) \le 1 - \exp\bigg(-\frac{K(\rho,\pi)-\log(\epsilon)}{N}\bigg).
\]
We can also elaborate on the results in another direction by introducing the empirical dimension
\[
(1.6)\qquad
d_e = \sup_{\beta\in\mathbb{R}_+}\beta\Big[\pi_{\exp(-\beta r)}(r) - \operatorname{ess\,inf}_\pi r\Big]
\le -\log\pi\big(r = \operatorname{ess\,inf}_\pi r\big).
\]
There is no need to introduce a margin in this definition, since $r$ takes at most $N$ values, and therefore $\pi\big(r = \operatorname{ess\,inf}_\pi r\big)$ is strictly positive. This leads to

Corollary 1.2.9. For any positive real constant $\lambda$, with $P$ probability at least $1-\epsilon$, for any posterior distribution $\rho:\Omega\to\mathcal{M}^1_+(\Theta)$,
\[
\rho(R) \le \Phi^{-1}_{\frac{\lambda}{N}}\bigg[\operatorname{ess\,inf}_\pi r + \frac{d_e}{\lambda}\log\Big(\frac{e\lambda}{d_e}\Big) + \frac{K\big(\rho,\pi_{\exp(-\lambda r)}\big)-\log(\epsilon)}{\lambda}\bigg].
\]
We could then make the bound uniform in $\lambda$ and optimize this parameter in a way similar to what was done to obtain Theorem 1.2.8.

1.3. Local bounds

In this section, better bounds will be achieved through a better choice of the prior distribution. This better prior distribution turns out to depend on the unknown sample distribution $P$, and some work is required to circumvent this and obtain empirical bounds.

1.3.1. Choice of the prior

As mentioned in the introduction, if one is willing to minimize the bound in expectation provided by Theorem 1.2.1 (page 6), one is led to consider the optimal choice $\pi = P(\rho)$. However, this is only an ideal choice, since $P$ is in all conceivable situations unknown. Nevertheless it shows that it is possible through Theorem 1.2.1 to measure the complexity of the classification model with $P\,K\big(\rho, P(\rho)\big)$, which is nothing but the mutual information between the random sample $(X_i,Y_i)_{i=1}^N$ and the estimated parameter $\hat\theta$, under the joint distribution $P\,\rho$. In practice, since we cannot choose $\pi = P(\rho)$, we have to be content with a flat prior $\pi$, resulting in a bound measuring complexity according to
\[
P\,K(\rho,\pi) = P\,K\big(\rho,P(\rho)\big) + K\big(P(\rho),\pi\big),
\]
larger by the entropy factor $K\big(P(\rho),\pi\big)$ than the optimal one (we are still commenting on Theorem 1.2.1). If we want to base the choice of $\pi$ on Theorem 1.2.4 (page 10), and if we choose $\rho = \pi_{\exp(-\lambda r)}$ to optimize this bound, we will be inclined to choose some $\pi$ such that
\[
\frac{1}{\lambda}\int_0^\lambda \pi_{\exp(-\beta R)}(R)\,d\beta = -\frac{1}{\lambda}\log\big\{\pi\exp(-\lambda R)\big\}
\]
is as far as possible close to $\inf_{\theta\in\Theta} R(\theta)$ in all circumstances.
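As a small numerical illustration of this requirement (with the same kind of assumed toy risk profile as in the sketch above, again not an example from the monograph), one can compute $-\frac{1}{\lambda}\log\{\pi\exp(-\lambda R)\}$ for a flat prior on a finite grid and watch its gap to $\inf_\Theta R$ shrink as $\lambda$ grows while $\lambda/N$ stays small:

import numpy as np

# Gap between -(1/lambda) log pi[exp(-lambda R)] and inf R for a flat prior
# on an assumed finite toy model; this is the quantity that the discussion above
# asks to keep close to inf R when choosing the prior.
rng = np.random.default_rng(1)
R = np.sort(rng.uniform(0.1, 0.5, size=500))
pi = np.full_like(R, 1.0 / R.size)

for lam in (10.0, 50.0, 200.0, 1000.0):
    value = -np.log(np.dot(pi, np.exp(-lam * R))) / lam
    print(f"lambda = {lam:6.0f}   -(1/lambda) log pi[exp(-lambda R)] = {value:.4f}   inf R = {R.min():.4f}")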
T o give a mo r e sp ecific example, in the case when the distribution of the design ( X i ) N i =1 is known, one c an in tro duce on the parameter space Θ the metric D already defined by equation (1.3, page 7) (or some av ailable upp er bo und for this distance). In view of the fact that R ( θ ) − R ( θ ′ ) ≤ D ( θ, θ ′ ), for any θ , θ ′ ∈ Θ , it can b e meaningful, at lea st theoretically , to choose π as π = ∞ X k =1 1 k ( k + 1) π k , where π k is the uniform measur e on some minimal (or close to minimal) 2 − k -net N (Θ , D , 2 − k ) of the metric space (Θ , D ). With this choice 1.3. L o c al b ounds 15 − 1 λ log n π exp( − λR ) o ≤ inf θ ∈ Θ R ( θ ) + inf k 2 − k + log( | N (Θ , D , 2 − k ) | ) + lo g[ k ( k + 1)] λ . Another p ossibilit y , when we hav e to dea l with real v a lued parameters, meaning that Θ ⊂ R d , is to co de eac h real compo nen t θ i ∈ R of θ = ( θ i ) d i =1 to some precision and to use a prior µ which is a tomic on dyadic n umbers. More precisely let us parametrize the set of dyadic real num b ers as D = ( r s, m , p, ( b j ) p j =1 = s 2 m 1 + p X j =1 b j 2 − j : s ∈ {− 1 , +1 } , m ∈ Z , p ∈ N , b j ∈ { 0 , 1 } ) , where, as can b e seen, s co des the sign, m the order of magnitude, p the pre c is ion and ( b j ) p j =1 the binary representation o f the dyadic n umber r s, m , p, ( b j ) p j =1 . W e can for instance consider on D the probability distribution (1.7) µ r s, m , p, ( b j ) p j =1 = h 3( | m | + 1 )( | m | + 2 )( p + 1)( p + 2)2 p i − 1 , and define π ∈ M 1 + ( R d ) as π = µ ⊗ d . This kind of “co ding” prior distribution can be used also to define a prior o n the integers (by renormalizing the restr ic tio n of µ to integers to get a probability distr ibution). Using µ is somehow equiv alent to pickin g up a repr esen tative of each dyadic int er v al, a nd ma k es it p ossible to restrict to the case when t he p osterior ρ is a Dirac mass without losing to o muc h (when Θ = (0 , 1), this approa c h is somewhat equiv alent to consider ing a s prio r distr ibution the L e b esgue measure and using as p osterior distributions the uniform probabil- it y mea sures on dy adic interv als, with the adv antage o f obtaining non-randomized estimators). When one uses in this way an atomic prior and Dirac masse s as po s- terior distributions, the bounds proven so far c a n b e obtained throug h a simpler union bound argument. This is so true that so me of the detra ctors of the P AC- Bay esian appr oach ( which, a s a new comer, ha s sometimes receiv ed a suspicious greeting a mo ng statisticians) hav e ar gued that it cannot bring anything that ele- men tar y union b ound a rgumen ts could not ess e ntially provid e. W e do not share of course this derogator y opinion, and while we think that allowing for non ato mic priors and p osteriors is worthwhile, w e also w ould like to stress that the upco ming lo cal and relative b ounds could hardly b e obtained with the only help of union bo unds. Although the choice of a flat prior seems a t first gla nce to b e the only a lter nativ e when nothing is known ab out the sample distribution P , the previous discussion shows that this type of choice is lacking prop er lo calisation, and namely that we lo ose a factor K P π exp( − λr ) , π , the divergence b et ween the b o und-optimal prio r P π exp( − λr ) , which is concentrated near the minima o f R in favourable situations, and the flat prior π . 
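To make the coding prior of equation (1.7) above concrete, here is a minimal sketch; the truncation ranges for $m$ and $p$ are assumptions chosen only for the numerical check. It verifies that $\mu$ has total mass one and prints the code length $-\log_2\mu$ attached to a dyadic number of given magnitude and precision.

import numpy as np
from itertools import product

def mu(m, p):
    # weight of one dyadic number r_{s,m,p,(b_j)} under the coding prior of equation (1.7);
    # it depends neither on the sign s nor on the particular bits (b_j)
    return 1.0 / (3.0 * (abs(m) + 1) * (abs(m) + 2) * (p + 1) * (p + 2) * 2 ** p)

# numerical check that mu is a probability measure: sum over the two signs, a truncated
# range of magnitudes m, precisions p, and the 2^p bit patterns sharing the same weight
total = sum(2 * 2 ** p * mu(m, p) for m, p in product(range(-200, 201), range(0, 40)))
print("truncated total mass:", total)        # tends to 1 as the truncation grows

# code length (in bits) of a dyadic number with magnitude 2^m, m = -1, and precision p = 10
print("-log2 mu =", -np.log2(mu(-1, 10)), "bits")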
F ortunately , there are technical wa ys to get a round this diffi- cult y a nd to obtain more lo cal empirical bo unds. 1.3.2. Unbi ase d lo c al empiri c al b ounds The idea is to star t with s ome flat pr ior π ∈ M 1 + (Θ), and the p osterior distribution ρ = π exp( − λr ) minimizing the b ound o f Theorem 1.2.1 (pag e 6), when π is used as a 16 Chapter 1. Inductive P AC-Bayesian le arning prior. T o improve the b ound, we would like to use P π exp( − λr ) instead of π , and we are going to make the guess that we could a ppro ximate it with π exp( − β R ) (w e hav e replaced the parameter λ with some distinct parameter β to give so me mor e freedo m to our inv estigation, and also b ecause, in tuitively , P π exp( − λr ) may b e expected to be less concen tra ted than each of the π exp( − λr ) it is mixing, which sugge s ts that the bes t approximation of P π exp( − λr ) b y some π exp( − β R ) may b e o btained for some parameter β < λ ) . W e a re then led to lo ok for some empirica l up p er b ound of K ρ, π exp( − β R ) . This is happily provided by the following computation P K ρ, π exp( − β R ) = P K ( ρ, π ) + β P ρ ( R ) + log n π exp( − β R ) o = P K ρ, π exp( − β r ) + β P ρ ( R − r ) + log n π exp( − β R ) o − P n log π exp( − β r ) o . Using the co n vexit y o f r 7→ log π exp( − β r ) as in equation (1.4) o n page 10, we conclude that 0 ≤ P K ρ, π exp( − β R ) ≤ β P ρ ( R − r ) + P K ρ, π exp( − β r ) . This inequality has an interest of its own, since it provides a low er bound for P ρ ( R ) . Moreov er we can plug it in to Theorem 1.2 .1 (page 6 ) applied to the prior distribution π exp( − β R ) and obta in for an y pos terior distribution ρ and an y pos itiv e parameter λ that Φ λ N P ρ ( R ) ≤ P ρ ( r ) + β λ ρ ( R − r ) + 1 λ P n K ρ, π exp( − β r ) o . In view of this, it it conv enient to in tro duce the function e Φ a,b ( p ) = (1 − b ) − 1 Φ a ( p ) − bp = − (1 − b ) − 1 n a − 1 log 1 − p 1 − e x p ( − a ) + bp o , p ∈ (0 , 1) , a ∈ )0 , ∞ ( , b ∈ (0 , 1( . This is a conv ex function of p , moreov er e Φ ′ a,b (0) = n a − 1 1 − e x p( − a ) − b o (1 − b ) − 1 , showing that it is a n increa sing one to one convex map o f the un it in terv al unto itself as so on as b ≤ a − 1 1 − exp( − a ) . Its co n vexit y , c om bined with the v alue o f its deriv a tiv e at the origin, shows that e Φ a,b ( p ) ≥ a − 1 1 − e x p( − a ) − b 1 − b p. Using this notation and remarks, we can state Theorem 1 . 3.1 . F or any p ositive r e al c onstants β and λ such that 0 ≤ β < N [1 − exp( − λ N )] , for any p osterior distribution ρ : Ω → M 1 + (Θ) , P ρ ( r ) − K ρ, π exp( − β r ) β ≤ P ρ ( R ) 1.3. L o c al b ounds 17 ≤ e Φ − 1 λ N , β λ P ρ ( r ) + K ρ, π exp( − β r ) λ − β ≤ λ − β N [1 − exp( − λ N )] − β P ρ ( r ) + K ρ, π exp( − β r ) λ − β . Thus (t aking λ = 2 β ), for any β such that 0 ≤ β < N 2 , P ρ ( R ) ≤ 1 1 − 2 β N P ρ ( r ) + K ρ, π exp( − β r ) β . Note that the last inequality is obtained using the fact that 1 − exp( − x ) ≥ x − x 2 2 , x ∈ R + . Corollary 1. 3.2 . F or any β ∈ (0 , N ( , P π exp( − β r ) ( r ) ≤ P π exp( − β r ) ( R ) ≤ inf λ ∈ ( − N log(1 − β N ) , ∞ ( λ − β N [1 − exp( − λ N )] − β P π exp( − β r ) ( r ) ≤ 1 1 − 2 β N P π exp( − β r ) ( r ) , the last ine quality holding only when β < N 2 . It is interesting to compare the upper bound provided b y this coro llary wit h Theorem 1.2.1 (pa g e 6) when the pos terior is a Gibbs measure ρ = π exp( − β r ) . 
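The sandwich of Corollary 1.3.2 can be checked by simulation. The following sketch uses an assumed toy model, not one from the monograph: $X$ uniform on $[0,1]$, $Y = \mathbf{1}(X > 1/2)$ flipped with probability $0.3$, and threshold classifiers $f_\theta(x) = \mathbf{1}(x > \theta)$ on a grid, for which $R(\theta) = 0.3 + 0.4\,|\theta - 1/2|$; the two expectations under $P$ are approximated by averaging over many simulated samples.

import numpy as np

# Monte Carlo check of Corollary 1.3.2 on an assumed toy threshold model:
# P pi_{exp(-beta r)}(r) <= P pi_{exp(-beta r)}(R) <= (1 - 2 beta/N)^{-1} P pi_{exp(-beta r)}(r)
rng = np.random.default_rng(0)
thetas = np.linspace(0.0, 1.0, 101)
R = 0.3 + 0.4 * np.abs(thetas - 0.5)                 # exact expected risks of the thresholds
N, beta, n_sim = 200, 40.0, 2000
pi = np.full(thetas.size, 1.0 / thetas.size)

emp = exp_ = 0.0
for _ in range(n_sim):
    X = rng.random(N)
    Y = (X > 0.5) ^ (rng.random(N) < 0.3)            # labels with 30% label noise
    preds = X[None, :] > thetas[:, None]             # predictions of every f_theta
    r = (preds != Y[None, :]).mean(axis=1)           # empirical risks r(theta)
    w = pi * np.exp(-beta * (r - r.min()))
    rho = w / w.sum()                                # Gibbs posterior pi_{exp(-beta r)}
    emp += np.dot(rho, r) / n_sim
    exp_ += np.dot(rho, R) / n_sim

print(emp, "<=", exp_, "<=", emp / (1.0 - 2.0 * beta / N))

With these numerical choices ($2\beta/N = 0.4$) the right-hand factor $(1-2\beta/N)^{-1}$ is still moderate; as the discussion above suggests, letting $\beta$ approach $N/2$ makes it blow up.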
W e see that we hav e got rid of the entrop y ter m K π exp( − β r ) , π , but at the price of an increase of the multipli ca tive factor, which for sma ll v alues of β N grows from (1 − β 2 N ) − 1 (when we take λ = β in Theor e m 1.2.1), to (1 − 2 β N ) − 1 . Therefor e non-lo calized b ounds ha ve an in terest of their own, a nd ar e sup e r seded by lo calized bo unds only in fa vourable circumstances (presumably when the sa mple is large enough when compared with the complexity of the classifica tion mo del). Corollar y 1 .3 .2 shows that when 2 β N is sma ll, π exp( − β r ) ( r ) is a tight approximation of π exp( − β r ) ( R ) in the mean (since we have an upp er bo und and a low er b ound whic h are close together). Another co rollary is obtained b y optimizing the b o und given by Theorem 1 .3.1 in ρ , which is done by taking ρ = π exp( − λr ) . Corollary 1. 3.3 . F or any p ositive r e al c onstants β and λ such that 0 ≤ β < N [1 − exp( − λ N )] , P π exp( − λr ) ( R ) ≤ e Φ − 1 λ N , β λ P 1 λ − β Z λ β π exp( − γ r ) ( r ) dγ ≤ 1 N [1 − exp( − λ N )] − β P h R λ β π exp( − γ r ) ( r ) dγ i . Although this inequality gives by construction a b etter upper b ound for inf λ ∈ R + P π exp( − λr ) ( R ) than Corolla ry 1.3.2, it is no t easy to tell which one of the tw o inequalities is the b est to b ound P π exp( − λr ) ( R ) for a fixed (and p ossibly 18 Chapter 1. Inductive P AC-Bayesian le arning subo ptimal) v alue of λ , b e c a use in this case, one fa c to r is improv ed while the other is worsened. Using the empiric al dimension d e defined by equation (1.6 ) on page 13, w e see that 1 λ − β Z λ β π exp( − γ r ) ( r ) dγ ≤ ess inf π r + d e log λ β . Therefore, in the ca se when we k eep the ratio λ β bo unded, we get a b etter dep en- dence on the empirical dimension d e than in Corolla ry 1.2.9 (page 14). 1.3.3. Non r andom lo c al b ounds Let us come now to the lo calization of the non-ra ndom upper bound given by The- orem 1.2 .4 (pag e 10). According to Theor em 1 .2.1 (page 6) applied to the lo calized prior π exp( − β R ) , λ Φ λ N P ρ ( R ) ≤ P n λρ ( r ) + K ( ρ, π ) + β ρ ( R ) o + log π exp( − β R ) = P n K ρ, π exp( − λr ) − log π exp( − λr ) + β ρ ( R ) o + log π exp( − β R ) ≤ P n K ρ, π exp( − λr ) + β ρ ( R ) o − log π exp( − λR ) + log π exp( − β R ) , where we have used as previously inequalit y (1.4 ) (page 1 0). This prov es Theorem 1 . 3.4 . F or any p osterior distribution ρ : Ω → M 1 + (Θ) , for any r e al p ar ameters β and λ such that 0 ≤ β < N 1 − e x p( − λ N ) , P ρ ( R ) ≤ e Φ − 1 λ N , β λ 1 λ − β Z λ β π exp( − γ R ) ( R ) dγ + P K ρ, π exp( − λr ) λ − β ≤ 1 N 1 − e x p( − λ N ) − β Z λ β π exp( − γ R ) ( R ) dγ + P n K ρ, π exp( − λr ) o . Let us no tice in particular that this theore m contains Theorem 1.2.4 (pag e 10) which corresp onds to the case β = 0. As a corolla r y , we see also, ta k ing ρ = π exp( − λr ) and λ = 2 β , and noticing that γ 7→ π exp( − γ R ) ( R ) is decreasing, that P π exp( − λr ) ( R ) ≤ inf β ,β 1, ϕ ( x ) ≤ ( κ − 1 ) κ − κ κ − 1 ( cx ) − 1 κ − 1 = (1 − κ − 1 )( κcx ) − 1 κ − 1 , th us P π exp( − λr ) R ′ ( · , e θ ) ≤ R λ β π exp( − γ R ) R ′ ( · , e θ ) dγ + (1 − κ − 1 )( κcx ) − 1 κ − 1 λ 2 2 N λ − xλ 2 2 N − β . T a king for instance β = λ 2 , x = N 2 λ , and putting b = (1 − κ − 1 )( cκ ) − 1 κ − 1 , we obtain P π exp( − λr ) ( R ) − inf R ≤ 4 λ Z λ λ/ 2 π exp( − γ R ) R ′ ( · , e θ ) dγ + b 2 λ N κ κ − 1 . 
In the p ar ametric ca se when π exp( − γ R ) R ′ ( · , e θ ) ≤ d γ , we get P π exp( − λr ) ( R ) − inf R ≤ 4 log(2) d λ + b 2 λ N κ κ − 1 . T a king λ = 2 − 1 8 log(2) d κ − 1 2 κ − 1 ( κc ) 1 2 κ − 1 N κ 2 κ − 1 , we obtain P π exp( − λr ) ( R ) − inf R ≤ (2 − κ − 1 )( κc ) − 1 2 κ − 1 8 log(2) d N κ 2 κ − 1 . W e see that this formula coincides with the res ult for κ = 1. W e ca n th us reduce the t wo cases to a s ingle one and state Corollary 1. 4.7 . L et u s assume that for some e θ ∈ Θ , some p ositive r e al c onstant c , some r e al exp onent κ ≥ 1 and for any θ ∈ Θ , R ( θ ) ≥ R ( e θ ) + cD ( θ, e θ ) κ . Le t us also assume that for some p ositive r e al c onstant d and any p ositive r e al p ar ameter γ , π exp( − γ R ) ( R ) − inf R ≤ d γ . Then P h π exp − 2 − 1 [8 log(2) d ] κ − 1 2 κ − 1 ( κc ) 1 2 κ − 1 N κ 2 κ − 1 r ( R ) i ≤ inf R + (2 − κ − 1 )( κc ) − 1 2 κ − 1 8 log(2) d N κ 2 κ − 1 . 1.4. R elative b ounds 41 Let us remark that the exp onent of N in this corollary is known to be the mini- max exp onent under these a ssumptions: it is unimprov able, whatever estimato r is used in place of the Gibbs p osterior shown her e (at least in the w orst case co m- patible with the hypotheses). The interest o f the coro llary is to show not only the minimax exp onent in N , but also an explicit non-asymptotic b ound with reas o n- able a nd simple constan ts. It is also clear that w e could hav e got sligh tly better constants if we had k ept the full stre ng th o f Theorem 1.4 .4 (page 38) instead of using the weak er Corollary 1 .4.6 (page 39). W e will prove in the following empirica l b ounds showing how the c onstan t λ can be estimated from the data instead of being chosen according to some margin a nd complexity assumptions. 1.4.3. Unbi ase d empiric al b ounds W e a re g oing to define an empirica l counterpart fo r the exp e cte d mar gin funct io n ϕ . It will app ear in empir ical bounds having otherwise the same str ucture as the non-random b ound we just pro ved. An yhow, we will not launch in to trying to compare the b eha viour of our pr oposed empiric al mar gin function with the exp e cte d mar gin fun ction , s ince the mar gin function in volv es taking a supremum which is not straightforward to handle. When we will touch the issue of building pr ovably adaptive e s timators, we will instead f or m ulate another t ype of b ounds based on in tegr ated quan tities, ra ther than try to a nalyse the pro perties of the empirical margin function. Let us start as in the previous subsection with the inequalit y β P n ρ R ′ ( · , e θ ) o ≤ P n λρ r ′ ( · , e θ ) + K ( ρ, π ) o + log n π h exp − λ Ψ λ N R ′ ( · , e θ ) , M ′ ( · , e θ ) + β R ′ ( · , e θ ) io . W e hav e alr eady defined by equation (1.22, pag e 37) the empirical pseudo-dista nce m ′ ( θ, e θ ) = 1 N N X i =1 ψ i ( θ, e θ ) 2 . Recalling that P m ′ ( θ, e θ ) = M ′ ( θ, e θ ), and using the conv exit y o f h 7→ log n π exp( h ) o , leads to the following inequalities: log n π h exp − λ Ψ λ N R ′ ( · , e θ ) , M ′ ( · , e θ ) + β R ′ ( · , e θ ) io ≤ log n π h exp − N sinh( λ N ) R ′ ( · , e θ ) + N sinh( λ N ) tanh( λ 2 N ) M ′ ( · , e θ ) + β R ′ ( · , e θ ) io ≤ P log n π h exp − N sinh( λ N ) − β r ′ ( · , e θ ) + N sinh( λ N ) tanh ( λ 2 N ) m ′ ( · , e θ ) io . W e may moreov er r emark that 42 Chapter 1. 
Inductive P AC-Bayesian le arning λρ r ′ ( · , e θ ) + K ( ρ, π ) = β − N sinh( λ N ) + λ ρ r ′ ( · , e θ ) + K ρ, π exp {− [ N sinh( λ N ) − β ] r } − log n π h exp − N sinh( λ N ) − β r ′ ( · , e θ ) io . This establishes Theorem 1 . 4.8 . F or any p ositive r e al p ar ameters β and λ , for any p osterior dis- tribution ρ : Ω → M 1 + (Θ) , P ρ R ′ ( · , e θ ) ≤ P 1 − N sinh( λ N ) − λ β ρ r ′ ( · , e θ ) + K ρ, π exp {− [ N sinh( λ N ) − β ] r } β + β − 1 log n π exp {− [ N sinh( λ N ) − β ] r } h exp N sinh( λ N ) tanh( λ 2 N ) m ′ ( · , e θ ) io . T a king β = N 2 sinh( λ N ), using the fac t tha t s inh( a ) ≥ a , a ≥ 0 and ex pr essing tanh( a 2 ) = a − 1 p 1 + s inh( a ) 2 − 1 and a = log p 1 + s inh( a ) 2 + sinh( a ) , we deduce Corollary 1. 4.9 . F or any p ositive r e al c onstant β and any p osterior distribution ρ : Ω → M 1 + (Θ) , P ρ R ′ ( · , e θ ) ≤ P ( N β log q 1 + 4 β 2 N 2 + 2 β N − 1 | {z } ≤ 1 ρ r ′ ( · , e θ ) + 1 β K ρ, π exp( − β r ) + log π exp( − β r ) n exp h N q 1 + 4 β 2 N 2 − 1 m ′ ( · , e θ ) io ) . This theorem and its corollary a re really analogous to Theorem 1.4.4 (page 38), and it could eas ily b e prov ed that under Mammen and Tsybakov margin assump- tions w e o btain an upper bound of the same order as Corollar y 1.4.7 (page 40). An yhow, in order to obta in an empirical b ound, w e are now going to take a supre- m um ov er all p ossible v alues of e θ , that is ov er Θ 1 . Although w e believe that taking this suprem um will not spoil the bound in cases when o ver-fitting remains un- der control, we will not try to inv estiga te pr ecisely if and when this is actually true, and provide our empirical bound a s such. Let us say only that on qualitative grounds, the v alues o f the ma rgin function quantif y the s teepness of the co n trast function R or its empir ic a l counterpart r , and that the definition of the empir ic a l margin function is o btained b y substituting P , the true sample dis tr ibution, with P = 1 N P N i =1 δ ( X i ,Y i ) ⊗ N , th e empirical sample distribution, in the definition of the e x pected margin function. Therefore, o n qualitative grounds, it seems hopeless to pres ume that R is steep when r is not, or in o ther words that a c la ssification mo del that w ould be inefficient at estimating a bo otstra pped sa mple according to our non-r andom b ound would be by some miracle efficient a t estimating the true 1.4. R elative b ounds 43 sample distribution according to the s ame b ound. T o this extent, w e feel that our empirical bo unds bring a sa tisfactory counterpart o f our non-random bounds. Any- how, we will also produce estimators which can be proved to be a daptiv e using P A C-Bayesian to ols in the next section, a t the price of a more sophisticated co n- struction inv olving comparisons be tw een a p osterior distribution and a Gibbs prior distribution or b et ween tw o po sterior distr ibutions. Let us now r e s trict discussio n to the imp ortant case when e θ ∈ arg min Θ 1 R . T o obtain an obser v able bo und, let b θ ∈ ar g min θ ∈ Θ r ( θ ) a nd let us intro duce the empiric al mar gin functions ϕ ( x ) = sup θ ∈ Θ m ′ ( θ, b θ ) − x r ( θ ) − r ( b θ ) , x ∈ R + , e ϕ ( x ) = s up θ ∈ Θ 1 m ′ ( θ, b θ ) − x r ( θ ) − r ( b θ ) , x ∈ R + . Using the fact that m ′ ( θ, e θ ) ≤ m ′ ( θ, b θ ) + m ′ ( b θ , e θ ), we get Corollary 1. 4.10 . 
F or any p ositive r e al p ar ameters β and λ , fo r any p osterior distribution ρ : Ω → M 1 + (Θ) , P ρ ( R ) − inf Θ 1 R ≤ P h 1 − N sinh( λ N ) − λ β i ρ ( r ) − r ( b θ ) + K ρ, π exp {− [ N sinh( λ N ) − β ] r } β + β − 1 log n π exp {− [ N sinh( λ N ) − β ] r } h exp N sinh λ N tanh λ 2 N m ′ ( · , b θ ) io + β − 1 N sinh( λ N ) tanh( λ 2 N ) e ϕ β N sinh( λ N ) tanh ( λ 2 N ) 1 − N sinh( λ N ) − λ β ! . T aking β = N 2 sinh( λ N ) , we also obtain P ρ ( R ) − inf Θ 1 R ≤ P ( N β log q 1 + 4 β 2 N 2 + 2 β N − 1 | {z } ≤ 1 ρ ( r ) − r ( b θ ) + 1 β K ρ, π exp( − β r ) + log π exp( − β r ) n exp h N q 1 + 4 β 2 N 2 − 1 m ′ ( · , b θ ) io + N β q 1 + 4 β 2 N 2 − 1 e ϕ " log q 1 + 4 β 2 N 2 + 2 β N − β N q 1 + 4 β 2 N 2 − 1 #) . Note that we could a lso us e the upp er b ound m ′ ( θ, b θ ) ≤ x r ( θ ) − r ( b θ ) + ϕ ( x ) and put α = N sinh( λ N ) 1 − x tanh( λ 2 N ) − β , to obtain Corollary 1. 4.11 . F or any non-ne gative r e al p ar ameters x , α and λ , such that α < N sinh( λ N ) 1 − x tanh( λ 2 N ) , for any p osterior distribution ρ : Ω → M 1 + (Θ) , 44 Chapter 1. Inductive P AC-Bayesian le arning P ρ ( R ) − inf Θ 1 R ≤ P ( 1 − N sinh( λ N ) 1 − x tanh( λ 2 N ) − λ N sinh( λ N ) 1 − x tanh( λ 2 N ) − α ρ ( r ) − r ( b θ ) + K ρ, π exp( − αr ) N sinh( λ N ) 1 − x tanh( λ 2 N ) − α + N sinh( λ N ) tanh( λ 2 N ) N sinh( λ N ) 1 − x tanh( λ 2 N ) − α × ϕ ( x ) + e ϕ λ − α N sinh( λ N ) tanh( λ 2 N ) ) . Let us notice that in the ca se when Θ 1 = Θ, the upper b o und provided b y this corollar y ha s the sa me gener al form as the upper b ound provided by Co rollary 1.4.5 (page 39), with the s ample distribution P replaced with the empirical distribution of the sample P = 1 N P N i =1 δ ( X i ,Y i ) ⊗ N . Therefor e, our empirical b ound can b e of a larg er order of ma gnitude than our non-random bound only in the case when our non-random bo und applied to the b oo tstrapped sample distribution P w ould be of a la r ger or der of magnitude than w hen applied to the true sa mple distribution P . In other words, we can say that o ur empirica l b ound is close to our non-ra ndom b ound in every situation where the b ootstra pped sample distr ibut ion P is not harder to bo und than the true sample distribution P . Although this do es no t pr o ve that our empirical bo und is alwa ys of the same order as our non-random b ound, this is a go o d qualitative hin t that this will b e the ca se in most practical s ituatio ns of interest, since in situations o f “under-fitting”, if they exist, it is lik ely that the c hoice o f the classification mo del is inappropriate to the da ta and should b e mo dified. Another reass uring r emark is that the empirica l margin functions ϕ and e ϕ b ehav e well in the case when inf Θ r = 0. Indeed in this case m ′ ( θ, b θ ) = r ′ ( θ, b θ ) = r ( θ ), θ ∈ Θ, a nd th us ϕ (1) = e ϕ (1) = 0, and e ϕ ( x ) ≤ − ( x − 1 ) inf Θ 1 r , x ≥ 1. This shows that in this ca se we recover the same accura cy as with non-relative lo cal empirical bo unds. Thus the b ound of Corollary 1.4.11 do es not collapse in pr e sence of massive ov er-fitting in the larger mo del, causing r ( b θ ) = 0, which is another hint that this may b e an accurate b o und in many s ituatio ns. 1.4.4. R elative empiric al deviation b ounds It is natural to mak e use of Theorem 1 .4.3 (page 37) to obtain empirical deviation bo unds, since this theor em provides an empirical v aria nce term. 
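Before turning to these deviation bounds, let us note that the empirical quantities entering the bounds above, the disagreement rate $m'(\theta,\hat\theta)$ and the empirical margin function $\overline\varphi$, are cheap to compute once the loss matrix is in memory. A minimal sketch on the same assumed toy threshold model as before (taking $\Theta_1 = \Theta$, so that $\widetilde\varphi = \overline\varphi$):

import numpy as np

# Empirical margin function phi_bar(x) = sup_theta { m'(theta, theta_hat) - x [ r(theta) - r(theta_hat) ] }
# on the assumed toy threshold model; m'(theta, theta_hat) is the empirical disagreement rate,
# since psi_i^2 = 1(f_theta(X_i) != f_theta_hat(X_i)).
rng = np.random.default_rng(2)
thetas = np.linspace(0.0, 1.0, 101)
N = 500
X = rng.random(N)
Y = (X > 0.5) ^ (rng.random(N) < 0.3)
preds = X[None, :] > thetas[:, None]
r = (preds != Y[None, :]).mean(axis=1)
i_hat = int(np.argmin(r))                                  # hat{theta} minimizes the empirical risk
m_prime = (preds != preds[i_hat][None, :]).mean(axis=1)    # m'(theta, hat{theta})

def phi_bar(x):
    return float(np.max(m_prime - x * (r - r[i_hat])))

for x in (0.5, 1.0, 2.0, 4.0):
    print(x, phi_bar(x))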
Theorem 1.4 .3 is written in a way which exploits the fact that ψ i takes o nly the three v alues − 1, 0 a nd +1. How ever, it will b e mor e conv enien t for the following computations to use it in its mo re general form, which only makes use o f the fact that ψ i ∈ ( − 1 , 1). With no ta tion to b e explained hereafter, it can indeed also be written as (1.25) P ( exp " sup ρ ∈ M 1 + (Θ) − N ρ n log h 1 − λP ( ψ ) io + N ρ n P h log(1 − λψ ) io − K ( ρ, π ) #) ≤ 1 . 1.4. R elative b ounds 45 W e have used the following notatio n in this inequality . W e ha ve put P = 1 N N X i =1 δ ( X i ,Y i ) , so that P is our no tation for the empirical distr ibution of the pro cess ( X i , Y i ) N i =1 . Moreov er we hav e als o used P = P ( P ) = 1 N N X i =1 P i , where it sho uld b e remembered that the join t distribution of the pro cess ( X i , Y i ) N i =1 is P = N N i =1 P i . W e hav e co nsidered ψ ( θ, e θ ) as a function defined on X × Y as ψ ( θ, e θ )( x, y ) = 1 y 6 = f θ ( x ) − 1 y 6 = f e θ ( x ) , ( x, y ) ∈ X × Y so that it sho uld b e understo od that P ( ψ ) = 1 N N X i =1 P ψ i ( θ, e θ ) = 1 N N X i =1 P n 1 Y i 6 = f θ ( X i ) − 1 Y i 6 = f e θ ( X i ) o = R ′ ( θ, e θ ) . In the same wa y P h log(1 − λψ ) i = 1 N N X i =1 log 1 − λψ i ( θ, e θ ) . Moreov er in tegratio n with resp ect to ρ bear s o n the index θ , so that ρ n log h 1 − λP ( ψ ) io = Z θ ∈ Θ log 1 − λ N N X i =1 P ψ i ( θ, e θ ) ρ ( dθ ) , ρ n P h log(1 − λψ ) io = Z θ ∈ Θ 1 N N X i =1 log 1 − λψ i ( θ, e θ ) ρ ( dθ ) . W e hav e c hose n concise notation, as we did througho ut these notes, in order to make the computations easier to follow. T o get an alternate version of empirica l rela tiv e deviation b ounds, we need to find some conv enient wa y to lo calize the choice o f the prior distribution π in equation (1.25, page 44). Here we pr opose replacing π with µ = π exp {− N log[1+ β P ( ψ )] } , which can also b e written π exp {− N log[1+ β R ′ ( · , e θ )] } . Indeed we see that K ( ρ, µ ) = N ρ n log 1 + β P ( ψ ) o + K ( ρ, π ) + log n π h exp − N log 1 + β P ( ψ ) io . Moreov er, we deduce from our deviation inequa lit y a pplied to − ψ , that (as long as β > − 1), P exp N µ n P log(1 + β ψ ) o − N µ n log 1 + β P ( ψ ) o ≤ 1 . 46 Chapter 1. Inductive P AC-Bayesian le arning Thu s P exp log n π h exp − N lo g 1 + β P ( ψ ) io − log n π h exp − N P log(1 + β ψ ) io ≤ P exp − N µ n log 1 + β P ( ψ ) o − K ( µ, π ) + N µ n P log(1 + β ψ ) o + K ( µ, π ) ≤ 1 . This can be used to handle K ( ρ, µ ), making use of the Cauch y–Sch warz inequa lit y as follows P ( exp " 1 2 − N lo g n 1 − λρ P ( ψ ) 1 + β ρ P ( ψ ) o + N ρ n P h log(1 − λψ ) io − K ( ρ, π ) − log n π h exp − N P log(1 + β ψ ) io #) ≤ P ( exp " − N lo g n 1 − λρ P ( ψ ) o + N ρ n P h log(1 − λψ ) io − K ( ρ, µ ) #) 1 / 2 × P ( exp " log n π h exp − N log 1 + β P ( ψ ) io − log n π h exp − N P log(1 + β ψ ) io #) 1 / 2 ≤ 1 . This implies that with P probability at least 1 − ǫ , − N log n 1 − λρ P ( ψ ) 1 + β ρ P ( ψ ) o ≤ − N ρ n P h log(1 − λψ ) io + K ( ρ, π ) + log n π h exp − N P log(1 + β ψ ) io − 2 log( ǫ ) . It is now conven ient to remember that P h log(1 − λψ ) i = 1 2 log 1 − λ 1 + λ r ′ ( θ, e θ ) + 1 2 log(1 − λ 2 ) m ′ ( θ, e θ ) . W e thus can write the previous inequalit y as − N log n 1 − λρ R ′ ( · , e θ ) 1 + β ρ R ′ ( · , e θ ) o ≤ N 2 log 1 + λ 1 − λ ρ r ′ ( · , e θ ) − N 2 log(1 − λ 2 ) ρ m ′ ( · , e θ ) + K ( ρ, π ) 1.4. 
R elative b ounds 47 + log π exp n − N 2 log 1 + β 1 − β r ′ ( · , e θ ) − N 2 log(1 − β 2 ) m ′ ( · , e θ ) o − 2 log( ǫ ) . Let us assume now that e θ ∈ arg min Θ 1 R . Let us intro duce b θ ∈ arg min Θ r . Decom- po sing r ′ ( θ, e θ ) = r ′ ( θ, b θ ) + r ′ ( b θ, e θ ) a nd co nsidering that m ′ ( θ, e θ ) ≤ m ′ ( θ, b θ ) + m ′ ( b θ, e θ ), we see that with P pro babilit y at least 1 − ǫ , for any p osterior distribution ρ : Ω → M 1 + (Θ), − N log n 1 − λρ R ′ ( · , e θ ) 1 + β ρ R ′ ( · , e θ ) o ≤ N 2 log 1 + λ 1 − λ ρ r ′ ( · , b θ ) − N 2 log(1 − λ 2 ) ρ m ′ ( · , b θ ) + K ( ρ, π ) + log π exp n − N 2 log 1+ β 1 − β r ′ ( · , b θ ) − N 2 log(1 − β 2 ) m ′ ( · , b θ ) o + N 2 log h (1+ λ )(1 − β ) (1 − λ )(1+ β ) i r ( b θ ) − r ( e θ ) − N 2 log (1 − λ 2 )(1 − β 2 ) m ′ ( b θ , e θ ) − 2 log( ǫ ) . Let us now define for simplicity the p osterior ν : Ω → M 1 + (Θ) by the identit y dν dπ ( θ ) = exp n − N 2 log 1+ λ 1 − λ r ′ ( θ, b θ ) + N 2 log(1 − λ 2 ) m ′ ( θ, b θ ) o π exp n − N 2 log 1+ λ 1 − λ r ′ ( · , b θ ) + N 2 log(1 − λ 2 ) m ′ ( · , b θ ) o . Let us also int ro duce the random b ound B = 1 N log ν exp h N 2 log h (1+ λ )(1 − β ) (1 − λ )(1+ β ) i r ′ ( · , b θ ) − N 2 log (1 − λ 2 )(1 − β 2 ) m ′ ( · , b θ ) i + sup θ ∈ Θ 1 1 2 log h (1 − λ )(1+ β ) (1+ λ )(1 − β ) i r ′ ( θ, b θ ) − 1 2 log (1 − λ 2 )(1 − β 2 ) m ′ ( θ, b θ ) − 2 N log( ǫ ) . Theorem 1 . 4.12 . Using the ab ove notation, for any r e al c onstants 0 ≤ β < λ < 1 , for any prior distribution π ∈ M 1 + (Θ) , for any subset Θ 1 ⊂ Θ , with P pr ob ability at le ast 1 − ǫ , for any p osterior distribution ρ : Ω → M 1 + (Θ) , − log n 1 − λ ρ ( R ) − inf Θ 1 R 1 + β ρ ( R ) − inf Θ 1 R o ≤ K ( ρ, ν ) N + B . Ther efor e, 48 Chapter 1. Inductive P AC-Bayesian le arning ρ ( R ) − inf Θ 1 R ≤ λ − β 2 λβ s 1 + 4 λβ ( λ − β ) 2 1 − e x p − B − K ( ρ, ν ) N − 1 ! ≤ 1 λ − β B + K ( ρ, ν ) N . Let us define the po sterior b ν by the iden tity d b ν dπ ( θ ) = exp h − N 2 log 1+ β 1 − β r ′ ( θ, b θ ) − N 2 log(1 − β 2 ) m ′ ( θ, b θ ) i π n exp h − N 2 log 1+ β 1 − β r ′ ( · , b θ ) − N 2 log(1 − β 2 ) m ′ ( · , b θ ) io . It is useful to remark that 1 N log ν exp h N 2 log (1 + λ )(1 − β ) (1 − λ )(1 + β ) r ′ ( · , b θ ) − N 2 log (1 − λ 2 )(1 − β 2 ) m ′ ( · , b θ ) i ≤ b ν 1 2 log (1 + λ )(1 − β ) (1 − λ )(1 + β ) r ′ ( · , b θ ) − 1 2 log (1 − λ 2 )(1 − β 2 ) m ′ ( · , b θ ) . This inequality is a sp ecial case of log n π exp( g ) o − log n π exp( h ) o = Z 1 α =0 π exp[ h + α ( g − h ) ] ( g − h ) dα ≤ π exp( g ) ( g − h ) , which is a cons equence of the conv exity of α 7→ log n π h exp h + α ( g − h ) io . Let us in tro duce a s pre v iously ϕ ( x ) = sup θ ∈ Θ m ′ ( θ, b θ ) − x r ′ ( θ, b θ ), x ∈ R + . Let us moreov er consider e ϕ ( x ) = sup θ ∈ Θ 1 m ′ ( θ, b θ ) − x r ′ ( θ, b θ ), x ∈ R + . These functions can be us ed to pro duce a result which is slightly w eaker, but maybe easier to read and understand. Indeed, we s ee that, for any x ∈ R + , with P probability a t least 1 − ǫ , for a n y po sterior distr ibution ρ , − N log n 1 − λρ R ′ ( · , e θ ) 1 + β ρ R ′ ( · , e θ ) o ≤ N 2 log (1 + λ ) (1 − λ )(1 − λ 2 ) x ρ r ′ ( · , b θ ) − N 2 log (1 − λ 2 )(1 − β 2 ) ϕ ( x ) + K ( ρ, π ) + log π exp n − N 2 log h (1+ β ) (1 − β )(1 − β 2 ) x i r ′ ( · , b θ ) o − N 2 log (1 − λ 2 )(1 − β 2 ) e ϕ log h (1+ λ )(1 − β ) (1 − λ )(1+ β ) i − log [(1 − λ 2 )(1 − β 2 )] − 2 log( ǫ ) 1.4. 
R elative b ounds 49 = Z N 2 log (1+ λ ) (1 − λ )(1 − λ 2 ) x N 2 log (1+ β ) (1 − β )(1 − β 2 ) x π exp( − αr ) r ′ ( · , b θ ) dα + K ( ρ, π exp {− N 2 log[ (1+ λ ) (1 − λ )(1 − λ 2 ) x ] r } ) − 2 log( ǫ ) − N 2 log (1 − λ 2 )(1 − β 2 ) ϕ ( x ) + e ϕ log h (1+ λ )(1 − β ) (1 − λ )(1+ β ) i − log[(1 − λ 2 )(1 − β 2 )] . Theorem 1 . 4.13 . With the pr evious notation, for any r e al c onstants 0 ≤ β < λ < 1 , for any p ositive r e al c onstant x , for any prior pr ob ability distribution π ∈ M 1 + (Θ) , for any subset Θ 1 ⊂ Θ , with P pr ob ability at le ast 1 − ǫ , for any p osterior distribution ρ : Ω → M 1 + (Θ) , putting B ( ρ ) = 1 N ( λ − β ) Z N 2 log (1+ λ ) (1 − λ )(1 − λ 2 ) x N 2 log (1+ β ) (1 − β )(1 − β 2 ) x π exp( − αr ) r ′ ( · , b θ ) dα + K ( ρ, π exp {− N 2 log[ (1+ λ ) (1 − λ )(1 − λ 2 ) x ] r } ) − 2 log ( ǫ ) N ( λ − β ) − 1 2( λ − β ) log (1 − λ 2 )(1 − β 2 ) ϕ ( x ) + e ϕ log h (1+ λ )(1 − β ) (1 − λ )(1+ β ) i − log[(1 − λ 2 )(1 − β 2 )] ≤ 1 N ( λ − β ) d e log log h (1+ λ ) (1 − λ )(1 − λ 2 ) x i log (1+ β ) (1 − β )(1 − β 2 ) x + K ( ρ, π exp {− N 2 log[ (1+ λ ) (1 − λ )(1 − λ 2 ) x ] r } ) − 2 log ( ǫ ) N ( λ − β ) − 1 2( λ − β ) log (1 − λ 2 )(1 − β 2 ) ϕ ( x ) + e ϕ log h (1+ λ )(1 − β ) (1 − λ )(1+ β ) i − log[(1 − λ 2 )(1 − β 2 )] , the fol lowing b ounds hold true: ρ ( R ) − inf Θ 1 R ≤ λ − β 2 λβ s 1 + 4 λβ ( λ − β ) 2 n 1 − e x p − ( λ − β ) B ( ρ ) o − 1 ! ≤ B ( ρ ) . Let us remark that this alternative wa y of handling relative deviation b ounds made it po s sible to carr y on with non-linear bounds up to the final result. F or instance, if λ = 0 . 5 , β = 0 . 2 a nd B ( ρ ) = 0 . 1, the non-linear bo und gives ρ ( R ) − inf Θ 1 R ≤ 0 . 096. 50 Chapter 1. Inductive P AC-Bayesian le arning Chapter 2 Comparing p osterior distributions to Gibbs priors 2.1. Bounds relativ e to a Gibbs di stribution W e now come to an approa c h to r elativ e b ounds who se p erformance can b e analysed with P A C-Bayesian tools . The empirical bounds a t the end of the previous chapter inv olve taking suprema in θ ∈ Θ, and replacing the exp e cte d mar gin function ϕ with some empirical coun- terparts ϕ or e ϕ , whic h ma y pro ve unsafe when using v ery complex cla ssification mo dels. W e a re now going to fo cus on the cont ro l of the divergence K ρ, π exp( − β R ) . It is already ob vious, we hop e, that co n trolling this div erge nce is the cr ux of the matter, a nd that it is a wa y to upp er bound the m utual inf or mation betw een the tr aining sample and the parameter, which can be expressed a s K ρ, P ( ρ ) = K ρ, π exp( − β R ) − K P ( ρ ) , π exp( − β R ) , as explained on page 14. Through the iden tity (2.1) K ρ, π exp( − β R ) = β ρ ( R ) − π exp( − β R ) ( R ) + K ( ρ, π ) − K π exp( − β R ) , π , we see that the control of this div erge nce is related to the control o f the difference ρ ( R ) − π exp( − β R ) ( R ). This is the route we will follow fir st. Thu s compar ing any p osterior distribution with a Gibbs prior distribution will provide a first way to build an es timator whic h ca n be prov ed to rea c h adaptively the be s t p ossible asymptotic erro r rate under Mammen a nd Ts ybak ov margin as- sumptions and pa rametric complexity a ssumptions (at least as long as orders of magnitude are c o ncerned, we will not discuss the question o f asymptotically opti- mal constants). Then w e will provide a n empirical b ound for the Kullback divergence K ρ, π exp( − β R ) itself. 
This will ser v e to address the question of mo del selection, which will be achiev ed by comparing the pe r formance of tw o po sterior distributions p ossi- bly supp orted by tw o different mo dels. This will also provide a seco nd wa y to build estimators whic h can b e pr o ved to b e a daptiv e under Mammen a nd Tsybakov ma r- gin assumptions and par ametric complexity assumptions (somewhat weak er than with the first metho d). 51 52 Chapter 2. Comp aring p osterior distributions to Gibbs priors Finally , we will present tw o-s tep lo calization strategies , in which the p erformance of the p osterior distribution to b e analy s ed is compar ed with a t wo-st ep Gibbs pr ior. 2.1.1. C omp aring a p osterior distribution with a Gibbs pr i or Similarly to Theor em 1.4.3 (pa g e 37) w e can prove that for any prior distribution e π ∈ M 1 + (Θ), (2.2) P ( e π ⊗ e π exp − N log(1 − N tanh γ N R ′ ) − γ r ′ − N log cosh( γ N ) m ′ ) ≤ 1 . Replacing e π with π exp( − β R ) and considering the p osterior distribution ρ ⊗ π exp( − β R ) , provides a starting p oin t in the co mparison of ρ with π exp( − β R ) ; we can indeed state with P probability at least 1 − ǫ that (2.3) − N lo g n 1 − ta nh γ N h ρ ( R ) − π exp( − β R ) ( R ) io ≤ γ ρ ( r ) − π exp( − β R ) ( r ) + N log cosh( γ N ) ρ ⊗ π exp( − β R ) ( m ′ ) + K ρ, π exp( − β R ) − lo g( ǫ ) . Using equation (2.1, page 51) to handle the en tropy term, we get (2.4) − N lo g n 1 − ta nh( γ N ) h ρ ( R ) − π exp( − β R ) ( R ) io − β ρ ( R ) − π exp( − β R ) ( R ) ≤ γ ρ ( r ) − π exp( − β R ) ( r ) + N log cosh γ N ρ ⊗ π exp( − β R ) ( m ′ ) + K ( ρ, π ) − K π exp( − β R ) , π − lo g( ǫ ) . W e ca n then decomp ose in the right -ha nd side γ ρ ( r ) − π exp( − β R ) ( r ) in to ( γ − λ ) ρ ( r ) − π exp( − β R ) ( r ) + λ ρ ( r ) − π exp( − β R ) ( r ) for some pa rameter λ to be set later on and use the fact that λ ρ ( r ) − π exp( − β R ) ( r ) + N log cosh( γ N ) ρ ⊗ π exp( − β R ) ( m ′ ) + K ( ρ, π ) − K π exp( − β R ) , π ≤ λρ ( r ) + K ( ρ, π ) + log n π h exp − λr + N log cosh( γ N ) ρ ( m ′ ) io = K ρ, π exp( − λr ) + log n π exp( − λr ) h exp N log cosh( γ N ) ρ ( m ′ ) io , to get rid o f the a ppeara nce of the unobserved Gibbs prior π exp( − β R ) in most places of the right-hand side of our inequality , leading to Theorem 2 . 1.1 . F or any r e al c onstants β and γ , with P pr ob ability at le ast 1 − ǫ , for any p osterior distribution ρ : Ω → M 1 + (Θ) , for any r e al c onstant λ , N tanh( γ N ) − β ρ ( R ) − π exp( − β R ) ( R ) ≤ − N log n 1 − ta nh( γ N ) h ρ ( R ) − π exp( − β R ) ( R ) io 2.1. Bounds r elative t o a Gibbs distribution 53 − β ρ ( R ) − π exp( − β R ) ( R ) ≤ ( γ − λ ) ρ ( r ) − π exp( − β R ) ( r ) + K ρ, π exp( − λr ) + log n π exp( − λr ) h exp N log cosh( γ N ) ρ ( m ′ ) io − log( ǫ ) = K ρ, π exp( − γ r ) + log n π exp( − γ r ) h exp ( γ − λ ) r + N log cosh( γ N ) ρ ( m ′ ) io − ( γ − λ ) π exp( − β R ) ( r ) − log( ǫ ) . W e would like to have a fully empirical upp er b ound even in the case when λ 6 = γ . This can be done b y using the theorem twice. W e will need a lemma. Lemma 2. 1.2 F or any pr ob ability distribution π ∈ M 1 + (Θ) , for any b ounde d me a- sur able functions g , h : Θ → R , π exp( − g ) ( g ) − π exp( − h ) ( g ) ≤ π exp( − g ) ( h ) − π exp( − h ) ( h ) . Proof. 
Let us no tice that 0 ≤ K ( π exp( − g ) , π exp( − h ) ) = π exp( − g ) ( h ) + log π exp( − h ) + K ( π exp( − g ) , π ) = π exp( − g ) ( h ) − π exp( − h ) ( h ) − K ( π exp( − h ) , π ) + K ( π exp( − g ) , π ) = π exp( − g ) ( h ) − π exp( − h ) ( h ) − K ( π exp( − h ) , π ) − π exp( − g ) ( g ) − lo g π exp( − g ) . Moreov er − log π exp( − g ) ≤ π exp( − h ) ( g ) + K ( π exp( − h ) , π ) , which ends the pro of. F or any p ositiv e real constants β and λ , w e can then apply Theorem 2.1.1 to ρ = π exp( − λr ) , and use the inequality (2.5) λ β π exp( − λr ) ( r ) − π exp( − β R ) ( r ) ≤ π exp( − λr ) ( R ) − π exp( − β R ) ( R ) provided by the prev ious lemma. W e thus obtain with P pro ba bilit y at least 1 − ǫ − N log n 1 − ta nh( γ N ) λ β h π exp( − λr ) ( r ) − π exp( − β R ) ( r ) io − γ π exp( − λr ) ( r ) − π exp( − β R ) ( r ) ≤ log n π exp( − λr ) h exp N log cosh( γ N ) π exp( − λr ) ( m ′ ) io − lo g( ǫ ) . Let us introduce the conv ex function F γ ,α ( x ) = − N log 1 − ta nh( γ N ) x − αx ≥ N tanh( γ N ) − α x. With P probability at least 1 − ǫ , − π exp( − β R ) ( r ) ≤ inf λ ∈ R ∗ + − π exp( − λr ) ( r ) + β λ F − 1 γ , β γ λ log n π exp( − λr ) h exp N log cosh( γ N ) π exp( − λr ) ( m ′ ) io 54 Chapter 2. Comp aring p osterior distributions to Gibbs priors − log ( ǫ ) . Since Theor em 2.1.1 holds uniformly for any po s terior distribution ρ , we can apply it aga in to s ome a rbitrary po sterior distr ibution ρ . W e can moreov er make the res ult uniform in β and γ by consider ing some a to mic measur e ν ∈ M 1 + ( R ) on the real line and using a union b ound. This lea ds to Theorem 2 . 1.3 . F or any atomic pr ob ability distribution on t he p ositive r e al line ν ∈ M 1 + ( R + ) , with P pr ob ability at le ast 1 − ǫ , for any p osterior distribution ρ : Ω → M 1 + (Θ) , for any p ositive r e al c onstant s β and γ , N tanh( γ N ) − β ρ ( R ) − π exp( − β R ) ( R ) ≤ F γ ,β ρ ( R ) − π exp( − β R ) ( R ) ≤ B ( ρ, β , γ ) , wher e B ( ρ, β , γ ) = inf λ 1 ∈ R + ,λ 1 ≤ γ λ 2 ∈ R ,λ 2 > β γ N tanh( γ N ) − 1 ( K ρ, π exp( − λ 1 r ) + ( γ − λ 1 ) ρ ( r ) − π exp( − λ 2 r ) ( r ) + log n π exp( − λ 1 r ) h exp N log cosh( γ N ) ρ ( m ′ ) io − log ǫν ( β ) ν ( γ ) + ( γ − λ 1 ) β λ 2 F − 1 γ , β γ λ 2 log n π exp( − λ 2 r ) h exp N log cosh( γ N ) π exp( − λ 2 r ) ( m ′ ) io − log ǫν ( β ) ν ( γ ) ) ≤ inf λ 1 ∈ R + ,λ 1 ≤ γ λ 2 ∈ R ,λ 2 > β γ N tanh( γ N ) − 1 ( K ρ, π exp( − λ 1 r ) + ( γ − λ 1 ) ρ ( r ) − π exp( − λ 2 r ) ( r ) + log n π exp( − λ 1 r ) h exp N log cosh( γ N ) ρ ( m ′ ) io + β λ 2 (1 − λ 1 γ ) N γ tanh( γ N ) − β λ 2 log n π exp( − λ 2 r ) h exp N log cosh( γ N ) π exp( − λ 2 r ) ( m ′ ) io − n 1 + β λ 2 (1 − λ 1 γ ) [ N γ tanh( γ N ) − β λ 2 ] o log ǫν ( β ) ν ( γ ) ) , wher e we have written for short ν ( β ) and ν ( γ ) inst e ad of ν ( { β } ) and ν ( { γ } ) . Let us notice that B ( ρ, β , γ ) = + ∞ when ν ( β ) = 0 or ν ( γ ) = 0, the uniformit y in β and γ of the theorem there fo r e necessar ily be a rs on a co un table num b er of v alues of these parameters. W e can typically choose distributions for ν such as the one used in T heo rem 1.2.8 (page 13): namely w e can put for some p ositiv e rea l ratio α > 1 ν ( α k ) = 1 ( k + 1 )( k + 2) , k ∈ N , 2.1. Bounds r elative t o a Gibbs distribution 55 or alternatively , since we are in teres ted in v alues of the parameters les s than N , we can prefer ν ( α k ) = log( α ) log( αN ) , 0 ≤ k < log( N ) log( α ) . 
W e c a n also use such a co ding distribution on dyadic num b ers as the one defined b y e q uation (1.7, pag e 15). F ollowing the s ame ro ute as for Theore m 1.3 .1 5 (pag e 31), we ca n a lso prove the following result ab out the deviations under any p osterior distribution ρ : Theorem 2 . 1.4 F or any ǫ ∈ )0 , 1( , with P pr ob ability at le ast 1 − ǫ , for any p osterior distribution ρ : Ω → M 1 + (Θ) , with ρ pr ob ability at le ast 1 − ξ , F γ ,β R ( b θ ) − π exp( − β R ) ( R ) ≤ inf λ 1 ∈ R + ,λ 1 ≤ γ , λ 2 ∈ R ,λ 2 > β γ N tanh( γ N ) − 1 ( log " dρ dπ exp( − λ 1 r ) ( b θ ) # + ( γ − λ 1 ) r ( b θ ) − π exp( − λ 2 r ) ( r ) + log n π exp( − λ 1 r ) h exp N log cosh( γ N ) m ′ ( · , b θ ) io − log ǫξ ν ( β ) ν ( γ ) + ( γ − λ 1 ) β λ 2 F − 1 γ , β γ λ 2 log n π exp( − λ 2 r ) h exp N log cosh( γ N ) π exp( − λ 2 r ) ( m ′ ) io − log ǫν ( β ) ν ( γ ) ) . The only tricky p oin t is to justify that we ca n still tak e an infimum in λ 1 without using a union b ound. T o justify this, we hav e to notice that the following v ariant of Theorem 2.1 .1 (pag e 52 ) ho lds: with P probability a t lea st 1 − ǫ , for any p osterior distribution ρ : Ω → M 1 + (Θ), for any real constant λ , ρ n F γ ,β R − π exp( − β R ) ( R ) o ≤ K ρ, π exp( − γ r ) + ρ inf λ ∈ R log n π exp( − γ r ) h exp ( γ − λ ) r + N log cosh( γ N m ′ ( · , b θ ) io − ( γ − λ ) π exp( − β R ) ( r ) − lo g( ǫ ) . W e leave the details as an exercise. 2.1.2. The effe ctive temp er atu r e of a p osterior di stribution Using the parametric approximation π exp( − αr ) ( r ) − inf Θ r ≃ d e α , we get as an or der of magnitude B ( π exp( − λ 1 r ) , β , γ ) . − ( γ − λ 1 ) d e λ − 1 2 − λ − 1 1 56 Chapter 2. Comp aring p osterior distributions to Gibbs priors + 2 d e log λ 1 λ 1 − N log cosh( γ N ) x + 2 β λ 2 (1 − λ 1 γ ) N γ tanh( γ N ) − β λ 2 d e log λ 2 λ 2 − N log cosh( γ N ) x ! + 2 N lo g cosh( γ N ) 1 + β λ 2 (1 − λ 1 γ ) N γ tanh( γ N ) − β λ 2 e ϕ ( x ) − n 1 + β λ 2 (1 − λ 1 γ ) [ N γ tanh( γ N ) − β λ 2 ] o log ν ( β ) ν ( γ ) ǫ . Therefore, if the empir ica l dimension d e stays b ounded when N increases, we a re going to obtain a nega tiv e upp er b ound for any v alues of the consta nts λ 1 > λ 2 > β , as so on as γ and N γ are chosen to b e la rge enough. This ability to obtain negative v alues for the b ound B ( π exp( − λ 1 r ) , γ , β ), and mo re g enerally B ( ρ, γ , β ), leads the wa y to in tro ducing the new concept of the effe ctive temp er atu r e of a n estimator . Definition 2. 1.1 F or any p osterior distribution ρ : Ω → M 1 + (Θ) we define the effectiv e temp erature T ( ρ ) ∈ R ∪ {− ∞ , + ∞} of ρ by the e quation ρ ( R ) = π exp( − R T ( ρ ) ) ( R ) . Note that β 7→ π exp( − β R ) ( R ) : R ∪ {− ∞ , + ∞} → (0 , 1 ) is co n tinuous and strictly decreasing from es s sup π R to ess inf π R (as s oon as t hese t wo b ounds do not co - incide). This sho ws t hat the effectiv e temperature T ( ρ ) is a well-defined ra ndom v ar iable. Theorem 2.1.3 provides a bo und for T ( ρ ), indeed: Prop osition 2. 1 .5 . L et b β ( ρ ) = sup β ∈ R ; inf γ ,N tanh( γ N ) >β B ( ρ, β , γ ) ≤ 0 , wher e B ( ρ, β , γ ) is as in The or em 2.1.3 (p age 54). Then with P pr ob ability at le ast 1 − ǫ , for any p osterior distribution ρ : Ω → M 1 + (Θ) , T ( ρ ) ≤ b β ( ρ ) − 1 , or e quivalently ρ ( R ) ≤ π exp[ − b β ( ρ ) R ] ( R ) . 
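Definition 2.1.1 can be turned into a numerical procedure when $\Theta$ is finite: since $\beta \mapsto \pi_{\exp(-\beta R)}(R)$ is continuous and strictly decreasing, the effective inverse temperature $1/T(\rho)$ solving $\rho(R) = \pi_{\exp(-R/T(\rho))}(R)$ can be found by bisection. Here is a sketch under assumed toy values of $R$, $\pi$ and $\rho$ (again an illustration, not the monograph's construction):

import numpy as np

# Effective temperature of Definition 2.1.1 on a finite parameter set: solve
# rho(R) = pi_{exp(-beta R)}(R) in beta = 1/T(rho) by bisection (the map is decreasing in beta).
rng = np.random.default_rng(3)
R = np.sort(rng.uniform(0.2, 0.6, size=300))
pi = np.full_like(R, 1.0 / R.size)
w = pi * np.exp(-80.0 * (R + 0.03 * rng.standard_normal(R.size)))   # some randomized estimator rho
rho = w / w.sum()

def gibbs_risk(beta):
    g = pi * np.exp(-beta * (R - R.min()))
    return float(np.dot(g / g.sum(), R))

target = float(np.dot(rho, R))
lo, hi = 0.0, 1e6          # gibbs_risk(0) = mean of R >= target >= min of R = limit of gibbs_risk
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if gibbs_risk(mid) > target else (lo, mid)
print("rho(R) =", target, "  effective inverse temperature 1/T(rho) ~", 0.5 * (lo + hi))

Proposition 2.1.5 then bounds this unobservable quantity from the data: with $P$ probability at least $1-\epsilon$, $T(\rho) \le \widehat{\beta}(\rho)^{-1}$.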
This notion of effe ctive temp er atur e of a (rando mized) estimator ρ is interesting for t wo reasons: • the differe nce ρ ( R ) − π exp( − β R ) ( R ) ca n be estimated with b etter a ccuracy than ρ ( R ) itself, due to the use of r elativ e devia tion inequalities, leading to conv ergence r ates up to 1 / N in favourable situations, even when inf Θ R is not close to zero; • and of course π exp( − β R ) ( R ) is a decreasing function of β , th us b eing able to estimate ρ ( R ) − π exp( − β R ) ( R ) with so me given accuracy , means b eing able to discriminate betw een v alues of ρ ( R ) with the sa me a ccuracy , a lthough doing so throug h the parametrizatio n β 7→ π exp( − β R ) ( R ), which ca n neither be observed nor estimated with the same precision! 2.1. Bounds r elative t o a Gibbs distribution 57 2.1.3. Analysis of an empiric al b ound fo r the effe ctive tem p er atur e W e are no w going to launch into a ma thematically rigor ous a nalysis of the b ound B ( π exp( − λ 1 r ) ,β ,γ ) provided by Theo rem 2 .1 .3 (page 54), to show that inf ρ ∈ M 1 + (Θ) π exp[ − b β ( ρ ) R ] ( R ) co n verges indeed to inf Θ R at some optimal rate in fav oura ble sit- uations. It is more co n venien t for this purp ose to use deviation inequalities in volving M ′ rather than m ′ . It is straightforw ard to extend Theo rem 1 .4.2 (page 36) to Theorem 2 . 1.6 . F or any r e al c onstants β and γ , for any prior distributions π , µ ∈ M 1 + (Θ) , with P pr ob ability at le ast 1 − η , for any p osterior distribution ρ : Ω → M 1 + (Θ) , γ ρ ⊗ π exp( − β R ) Ψ γ N ( R ′ , M ′ ) ≤ γ ρ ⊗ π exp( − β R ) ( r ′ ) + K ( ρ, µ ) − log( η ) . In or der to transform the left-hand side into a linear expr ession and in the same time lo calize this theo rem, let us cho ose µ defined by its densit y dµ dπ ( θ 1 ) = C − 1 exp − β R ( θ 1 ) − γ Z Θ n Ψ γ N R ′ ( θ 1 , θ 2 ) , M ′ ( θ 1 , θ 2 ) − N γ sinh( γ N ) R ′ ( θ 1 , θ 2 ) o π exp( − β R ) ( dθ 2 ) , where C is such that µ (Θ) = 1 . W e get K ( ρ, µ ) = β ρ ( R ) + γ ρ ⊗ π exp( − β R ) Ψ γ N ( R ′ , M ′ ) − N γ sinh( γ N ) R ′ + K ( ρ, π ) + log Z Θ exp − β R ( θ 1 ) − γ Z Θ n Ψ γ N R ′ ( θ 1 , θ 2 ) , M ′ ( θ 1 , θ 2 ) − N γ sinh( γ N ) R ′ ( θ 1 , θ 2 ) o π exp( − β R ) ( dθ 2 ) π ( dθ 1 ) = β ρ ( R ) − π exp( − β R ) ( R ) + γ ρ ⊗ π exp( − β R ) Ψ γ N ( R ′ , M ′ ) − N γ sinh( γ N ) R ′ + K ( ρ, π ) − K ( π exp( − β R ) , π ) + log Z Θ exp − γ Z Θ n Ψ γ N R ′ ( θ 1 , θ 2 ) , M ′ ( θ 1 , θ 2 ) − N γ sinh( γ N ) R ′ ( θ 1 , θ 2 ) o π exp( − β R ) ( dθ 2 ) π exp( − β R ) ( dθ 1 ) . Thu s with P pro babilit y at least 1 − η , (2.6) N sinh( γ N ) − β ρ ( R ) − π exp( − β R ) ( R ) ≤ γ ρ ( r ) − π exp( − β R ) ( r ) + K ( ρ, π ) − K ( π exp( − β R ) , π ) − log ( η ) + C ( β , γ ) where C ( β , γ ) = log Z Θ exp − γ Z Θ n Ψ γ N R ′ ( θ 1 , θ 2 ) , M ′ ( θ 1 , θ 2 ) 58 Chapter 2. Comp aring p osterior distributions to Gibbs priors − N γ sinh( γ N ) R ′ ( θ 1 , θ 2 ) o π exp( − β R ) ( dθ 2 ) π exp( − β R ) ( dθ 1 ) . Remarking that K ρ, π exp( − β R ) = β ρ ( R ) − π exp( − β R ) ( R ) + K ( ρ, π ) − K ( π exp( − β R ) , π ) , we deduce from the prev io us inequality Theorem 2 . 1.7 . F or any r e al c onstants β and γ , with P pr ob ability at le ast 1 − η , for any p osterior distribution ρ : Ω → M 1 + (Θ) , N sinh( γ N ) ρ ( R ) − π exp( − β R ) ( R ) ≤ γ ρ ( r ) − π exp( − β R ) ( r ) + K ρ, π exp( − β R ) − log( η ) + C ( β , γ ) . 
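For completeness, the identity (2.1), which was invoked again just above to pass from equation (2.6) to Theorem 2.1.7, follows from a short computation:
\[
K\big(\rho,\pi_{\exp(-\beta R)}\big)
 = \rho\Big(\log\frac{d\rho}{d\pi}\Big) + \beta\,\rho(R) + \log\pi\big[\exp(-\beta R)\big]
 = K(\rho,\pi) + \beta\,\rho(R) + \log\pi\big[\exp(-\beta R)\big],
\]
and applying the same computation to $\rho = \pi_{\exp(-\beta R)}$ gives $\log\pi\big[\exp(-\beta R)\big] = -\beta\,\pi_{\exp(-\beta R)}(R) - K\big(\pi_{\exp(-\beta R)},\pi\big)$; substituting the latter into the former yields
\[
K\big(\rho,\pi_{\exp(-\beta R)}\big) = \beta\big[\rho(R) - \pi_{\exp(-\beta R)}(R)\big] + K(\rho,\pi) - K\big(\pi_{\exp(-\beta R)},\pi\big),
\]
which is equation (2.1, page 51).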
W e can also go in to a slightly different direction, star ting bac k aga in fro m equa- tion (2.6, page 57) and remarking that for any rea l constant λ , λ ρ ( r ) − π exp( − β R ) ( r ) + K ( ρ, π ) − K ( π exp( − β R ) , π ) ≤ λρ ( r ) + K ( ρ, π ) + lo g π exp( − λr ) = K ρ, π exp( − λr ) . This leads to Theorem 2 . 1.8 . F or any r e al c onstants β and γ , with P pr ob ability at le ast 1 − η , for any r e al c onstant λ , N sinh( γ N ) − β ρ ( R ) − π exp( − β R ) ( R ) ≤ ( γ − λ ) ρ ( r ) − π exp( − β R ) ( r ) + K ρ, π exp( − λr ) − log( η ) + C ( β , γ ) , wher e t he definition of C ( β , γ ) is given by e quation (2.6, p age 57). W e c a n now us e this inequa lit y in the c ase when ρ = π exp( − λr ) and combine it with Inequality (2.5, pag e 53) to obtain Theorem 2 . 1.9 F or any r e al c onstants β and γ , with P pr ob ability at le ast 1 − η , for any r e al c onstant λ , N λ β sinh( γ N ) − γ π exp( − λr ) ( r ) − π exp( − β R ) ( r ) ≤ C ( β , γ ) − log( η ) . W e deduce from this theo rem Prop osition 2. 1 .10 F or any r e al p ositive c onstants β 1 , β 2 and γ , with P pr ob abil- ity at le ast 1 − η , for any r e al c onstant s λ 1 and λ 2 , su ch that λ 2 < β 2 γ N sinh( γ N ) − 1 and λ 1 > β 1 γ N sinh( γ N ) − 1 , π exp( − λ 1 r ) ( r ) − π exp( − λ 2 r ) ( r ) ≤ π exp( − β 1 R ) ( r ) − π exp( − β 2 R ) ( r ) + C ( β 1 , γ ) + log (2 /η ) N λ 1 β 1 sinh( γ N ) − γ + C ( β 2 , γ ) + log(2 /η ) γ − N λ 2 β 2 sinh( γ N ) . 2.1. Bounds r elative t o a Gibbs distribution 59 Moreov er, π exp( − β 1 R ) and π exp( − β 2 R ) being prior distributions, with P pr obabilit y at least 1 − η , γ π exp( − β 1 R ) ( r ) − π exp( − β 2 R ) ( r ) ≤ γ π exp( − β 1 R ) ⊗ π exp( − β 2 R ) Ψ − γ N ( R ′ , M ′ ) − log( η ) . Hence Prop osition 2. 1 .11 F or any p ositive r e al c onstants β 1 , β 2 and γ , wi th P pr ob- ability at le ast 1 − η , for any p ositive r e al c onstants λ 1 and λ 2 such tha t λ 2 < β 2 γ N sinh( γ N ) − 1 and λ 1 > β 1 γ N sinh( γ N ) − 1 , π exp( − λ 1 r ) ( r ) − π exp( − λ 2 r ) ( r ) ≤ π exp( − β 1 R ) ⊗ π exp( − β 2 R ) Ψ − γ N ( R ′ , M ′ ) + log( 3 η ) γ + C ( β 1 , γ ) + log ( 3 η ) N λ 1 β 1 sinh( γ N ) − γ + C ( β 2 , γ ) + log( 3 η ) γ − N λ 2 β 2 sinh( γ N ) . In order to achieve the ana lysis of the bound B ( π exp( − λ 1 r ) , β , γ ) given by Theo- rem 2.1.3 (page 54), it now remains to b ound quant ities of the genera l form log n π exp( − λr ) h exp N log cosh( γ N ) π exp( − λr ) ( m ′ ) io = sup ρ ∈ M 1 + (Θ) N log cosh( γ N ) ρ ⊗ π exp( − λ ) ( m ′ ) − K ρ, π exp( − λr ) . Let us consider the prior distribution µ ∈ M 1 + (Θ × Θ) on couples of parameters defined b y the dens ity dµ d ( π ⊗ π ) ( θ 1 , θ 2 ) = C − 1 exp n − β R ( θ 1 ) − β R ( θ 2 ) + α Φ − α N M ′ ( θ 1 , θ 2 ) o , where the norma lizing constant C is such that µ (Θ × Θ) = 1. Since for fixed v alues of the pa rameters θ and θ ′ ∈ Θ, m ′ ( θ, θ ′ ), like r ( θ ), is a sum of indep enden t Bernoulli random v ariables, w e ca n easily adapt the proo f of Theor em 1.1.4 on page 4, to establish that with P probability at lea st 1 − η , for any p osterior dis tr ibution ρ and any real consta nt λ , αρ ⊗ π exp( − λr ) ( m ′ ) ≤ αρ ⊗ π exp( − λr ) Φ − α N ( M ′ ) + K ( ρ ⊗ π exp( − λr ) , µ ) − lo g( η ) = K ρ, π exp( − β R ) + K π exp( − λr ) , π exp( − β R ) + log n π exp( − β R ) ⊗ π exp( − β R ) h exp α Φ − α N ◦ M ′ io − log( η ) . 
Thu s for any rea l constant β and an y positive real constan ts α and γ , with P probability at least 1 − η , for any real constant λ , (2.7) log n π exp( − λr ) h exp N log cosh( γ N ) π exp( − λr ) ( m ′ ) io ≤ sup ρ ∈ M 1 + (Θ) N α log cosh( γ N ) n K ρ, π exp( − β R ) + K π exp( − λr ) , π exp( − β R ) 60 Chapter 2. Comp aring p osterior distributions to Gibbs priors + log π exp( − β R ) ⊗ π exp( − β R ) exp( α Φ − α N ◦ M ′ ) − log ( η ) o − K ρ, π exp( − λr ) . T o finish, we need some appropr iate upp er b ound for the entrop y K ρ, π exp( − β R ) . This question ca n be handled in the follo wing wa y: using The- orem 2.1 .7 (page 5 8), we see that for a n y p ositiv e real c o nstan ts γ and β , with P probability at least 1 − η , for any p osterior distribution ρ , K ρ, π exp( − β R ) = β ρ ( R ) − π exp( − β R ) ( R ) + K ( ρ, π ) − K ( π exp( − β R ) , π ) ≤ β N sinh( γ N ) γ ρ ( r ) − π exp( − β R ) ( r ) + K ρ, π exp( − β R ) − log( η ) + C ( β , γ ) + K ( ρ, π ) − K ( π exp( − β R ) , π ) ≤ K ρ, π exp( − β γ N sinh( γ N ) r ) + β N sinh( γ N ) n K ρ, π exp( − β R ) + C ( β , γ ) − log ( η ) o . In other words, Theorem 2 . 1.12 . F or any p ositive r e al c onstants β and γ su ch that β < N × sinh( γ N ) , with P pr ob ability at le ast 1 − η , fo r any p osterior di stribution ρ : Ω → M 1 + (Θ) , K ρ, π exp( − β R ) ≤ K ρ, π exp[ − β γ N sinh( γ N ) − 1 r ] 1 − β N sinh( γ N ) + C ( β , γ ) − log ( η ) N sinh( γ N ) β − 1 , wher e the quantity C ( β , γ ) is define d by e quation (2.6, p age 57). Equivalently, it wil l b e in some c ases mor e c onvenient to u se this r esult in the form: for any p ositive r e al c onstants λ and γ , with P pr ob ability at le ast 1 − η , for any p osterior distribution ρ : Ω → M 1 + (Θ) , K ρ, π exp[ − λ N γ sinh( γ N ) R ] ≤ K ρ, π exp( − λr ) 1 − λ γ + C ( λ N γ sinh( γ N ) , γ ) − log( η ) λ β − 1 . Cho osing in equation (2.7, pag e 59) α = N log cosh( γ N ) 1 − β N sinh( γ N ) and β = λ N γ sinh( γ N ), so that α = N log cosh( γ N ) 1 − λ γ , we obtain with P probability at leas t 1 − η , log n π exp( − λr ) h exp N log cosh( γ N ) π exp( − λr ) ( m ′ ) io ≤ 2 λ γ C ( β , γ ) + log ( 2 η ) + 1 − λ γ log n π exp( − β R ) ⊗ π exp( − β R ) exp( α Φ − α N ◦ M ′ ) o 2.1. Bounds r elative t o a Gibbs distribution 61 + log( 2 η ) . This prov es Prop osition 2. 1 .13 . F or any p ositive r e al c onstants λ < γ , with P pr ob ability at le ast 1 − η , log n π exp( − λr ) h exp N log cosh( γ N ) π exp( − λr ) ( m ′ ) io ≤ 2 λ γ C ( N λ γ sinh( γ N ) , γ ) + log( 2 η ) + 1 − λ γ log π ⊗ 2 exp[ − N λ γ sinh( γ N ) R ] exp N log [cosh( γ N )] 1 − λ γ Φ − log[cosh( γ N )] 1 − λ γ ◦ M ′ + 1 − λ γ log( 2 η ) . W e a re no w rea dy to analyse the bound B ( π exp( − λ 1 r ) , β , γ ) of Theo rem 2.1.3 (page 54). Theorem 2 . 1.14 . 
F or any p ositive r e al c onstants λ 1 , λ 2 , β 1 , β 2 , β and γ , su ch that λ 1 < γ , β 1 < N λ 1 γ sinh( γ N ) , λ 2 < γ , β 2 > N λ 2 γ sinh( γ N ) , β < N λ 2 γ tanh( γ N ) , with P pr ob ability 1 − η , t he b ound B ( π exp( − λ 1 r ) , β , γ ) of The or em 2.1.3 (p age 54) satisfies B ( π exp( − λ 1 r ) , β , γ ) ≤ ( γ − λ 1 ) ( π exp( − β 1 R ) ⊗ π exp( − β 2 R ) Ψ − γ N ( R ′ , M ′ ) + log( 7 η ) γ + C ( β 1 , γ ) + log( 7 η ) N λ 1 β 1 sinh( γ N ) − γ + C ( β 2 , γ ) + log( 7 η ) γ − N λ 2 β 2 sinh( γ N ) ) + 2 λ 1 γ h C N λ 1 γ sinh( γ N ) , γ + log ( 7 η ) i + 1 − λ 1 γ log π ⊗ 2 exp[ − N λ 1 γ sinh( γ N ) R ] exp N log[cosh( γ N )] 1 − λ 1 γ Φ − log[cosh( γ N )] 1 − λ 1 γ ◦ M ′ + 1 − λ 1 γ log( 7 η ) − log ν ( { β } ) ν ( { γ } ) ǫ + ( γ − λ 1 ) β λ 2 F − 1 γ , β γ λ 2 ( 2 λ 2 γ h C N λ 2 γ sinh( γ N ) , γ + log 7 η i + 1 − λ 2 γ log π ⊗ 2 exp[ − N λ 2 γ sinh( γ N ) R ] 62 Chapter 2. Comp aring p osterior distributions to Gibbs priors exp N log [cosh( γ N )] 1 − λ 2 γ Φ − log[cosh( γ N )] 1 − λ 2 γ ◦ M ′ + 1 − λ 2 γ log 7 η − log ν ( { β } ) ν ( { γ } ) ǫ ) , wher e t he function C ( β , γ ) is define d by e quation (2.6, p age 57). 2.1.4. A daptation to p ar ametric and mar gin assumptions T o help under s tand the previo us theorem, it ma y b e useful to giv e linear upp er- bo unds to the factors a ppearing in the r igh t-hand side of the previo us inequalit y . In tro ducing e θ such that R ( e θ ) = inf Θ R (assuming that such a parameter exists) and remembering that Ψ − a ( p, m ) ≤ a − 1 sinh( a ) p + 2 a − 1 sinh( a 2 ) 2 m, a ∈ R + , Φ − a ( p ) ≤ a − 1 exp( a ) − 1 p, a ∈ R + , Ψ a ( p, m ) ≥ a − 1 sinh( a ) p − 2 a − 1 sinh( a 2 ) 2 m, a ∈ R + , M ′ ( θ 1 , θ 2 ) ≤ M ′ ( θ 1 , e θ ) + M ′ ( θ 2 , e θ ) , θ 1 , θ 2 ∈ Θ , M ′ ( θ 1 , e θ ) ≤ xR ′ ( θ 1 , e θ ) + ϕ ( x ) , x ∈ R + , θ 1 ∈ Θ , the la st inequality b eing rather a consequence of the definition of ϕ than a prop ert y of M ′ , w e eas ily see that π exp( − β 1 R ) ⊗ π exp( − β 2 R ) Ψ − γ N ( R ′ , M ′ ) ≤ N γ sinh( γ N ) π exp( − β 1 R ) ( R ) − π exp( − β 2 R ) ( R ) + 2 N γ sinh( γ 2 N ) 2 π exp( − β 1 R ) ⊗ π exp( − β 2 R ) ( M ′ ) ≤ N γ sinh( γ N ) π exp( − β 1 R ) ( R ) − π exp( − β 2 R ) ( R ) + 2 xN γ sinh( γ 2 N ) 2 n π exp( − β 1 R ) R ′ ( · , e θ ) + π exp( − β 2 R ) R ′ ( · , e θ ) o + 4 N γ sinh( γ 2 N ) 2 ϕ ( x ) , that C ( β , γ ) ≤ log π exp( − β R ) n exp h 2 N sinh γ 2 N 2 π exp( − β R ) ( M ′ ) io ≤ log π exp( − β R ) n exp h 2 N sinh γ 2 N 2 M ′ ( · , e θ ) io + 2 N s inh( γ 2 N ) 2 π exp( − β R ) M ′ ( · , e θ ) ≤ log π exp( − β R ) n exp h 2 xN sinh( γ 2 N ) 2 R ′ ( · , e θ ) io + 2 xN sinh( γ 2 N ) 2 π exp( − β R ) R ′ ( · , e θ ) + 4 N s inh( γ 2 N ) 2 ϕ ( x ) = Z β β − 2 xN sinh( γ 2 N ) 2 π exp( − αR ) R ′ ( · , e θ ) dα 2.1. Bounds r elative t o a Gibbs distribution 63 + 2 xN sinh( γ 2 N ) 2 π exp( − β R ) R ′ ( · , e θ ) + 4 N s inh( γ 2 N ) 2 ϕ ( x ) ≤ 4 xN sinh( γ 2 N ) 2 π exp[ − ( β − 2 xN sinh( γ 2 N ) 2 ) R ] R ′ ( · , e θ ) + 4 N s inh( γ 2 N ) 2 ϕ ( x ) , and that log n π ⊗ 2 exp( − β R ) h exp N α Φ − α ◦ M ′ io ≤ 2 log n π exp( − β R ) h exp N exp( α ) − 1 M ′ ( · , e θ ) io ≤ 2 xN exp( α ) − 1 π exp[ − ( β − xN [e xp ( α ) − 1]) R ] R ′ ( · , e θ ) + 2 xN exp( α ) − 1 ϕ ( x ) . 
Let us push further the inv estigation under the pa rametric assumption that for some po sitiv e rea l constant d (2.8) lim β → + ∞ β π exp( − β R ) R ′ ( · , e θ ) = d, This as sumption will for instance hold true with d = n 2 when R : Θ → (0 , 1) is a smo oth function defined on a compact s ubset Θ o f R n that reaches its minimum v alue on a finite num b er of non-degener ate (i.e. with a p ositiv e definite H ess ia n) in terio r po ints of Θ , and π is absolutely co n tin uous with res pect to the Leb esgue measure on Θ and has a smo oth density . In case of a ssumption (2.8), if we restrict our selv es to sufficiently large v alues of the constants β , β 1 , β 2 , λ 1 , λ 2 and γ (the smaller of which is as a rule β , a s we will see), we can use the fact that for some (small) p ositive constant δ , and some (large) p o sitiv e constant A , (2.9) d α (1 − δ ) ≤ π exp( − αR ) R ′ ( · , e θ ) ≤ d α (1 + δ ) , α ≥ A. Under this assumption, π exp( − β 1 R ) ⊗ π exp( − β 2 R ) Ψ − γ N ( R ′ , M ′ ) ≤ N γ sinh( γ N ) d β 1 (1 + δ ) − d β 2 (1 − δ ) + 2 xN γ sinh( γ 2 N ) 2 (1 + δ ) d β 1 + d β 2 + 4 N γ sinh( γ 2 N ) 2 ϕ ( x ) . C ( β , γ ) ≤ d (1 + δ ) log β β − 2 xN sinh( γ 2 N ) 2 + 2 xN sinh( γ 2 N ) 2 (1+ δ ) d β + 4 N s inh( γ 2 N ) 2 ϕ ( x ) . log n π ⊗ 2 exp( − β R ) h exp N α Φ − α ◦ M ′ io ≤ 2 xN exp( α ) − 1 d (1 + δ ) β − xN [exp( α ) − 1] + 2 N exp( α ) − 1 ϕ ( x ) . Thu s with P pro babilit y at least 1 − η , B ( π exp( − λ 1 r ) , β , γ ) ≤ − ( γ − λ 1 ) N γ sinh( γ N ) d β 2 (1 − δ ) + ( γ − λ 1 ) N γ sinh( γ N ) (1+ δ ) d β 1 + 2 xN γ sinh( γ 2 N ) 2 (1 + δ ) d β 1 + d β 2 + 4 N γ sinh( γ 2 N ) 2 ϕ ( x ) + log( 7 η ) γ 64 Chapter 2. Comp aring p osterior distributions to Gibbs priors + 4 xN sinh( γ 2 N ) 2 (1+ δ ) d β 1 − 2 xN sinh( γ 2 N ) 2 + 4 N sinh( γ 2 N ) 2 ϕ ( x ) + log( 7 η ) N λ 1 β 1 sinh( γ N ) − γ + 4 xN sinh( γ 2 N ) 2 (1+ δ ) d β 2 − 2 xN sinh( γ 2 N ) 2 + 4 N s inh( γ 2 N ) 2 ϕ ( x ) + log( 7 η ) γ − N λ 2 β 2 sinh( γ N ) + 2 λ 1 γ 4 xN sinh( γ 2 N ) 2 (1+ δ ) d N λ 1 γ sinh( γ N ) − 2 xN sinh( γ 2 N ) 2 + 4 N s inh( γ 2 N ) 2 ϕ ( x ) + log( 7 η ) + 1 − λ 1 γ ( 2 d (1 + δ ) λ 1 sinh γ N xγ h exp log[cosh( γ N )] 1 − λ 1 γ − 1 i − 1 ! − 1 + 2 N h exp log[cosh( γ N )] 1 − λ 1 γ − 1 i ϕ ( x ) ) + 1 − λ 1 γ log( 7 η ) − log ν ( { β } ) ν ( { γ } ) ǫ + 1 − λ 1 γ N λ 2 β γ tanh( γ N ) − 1 ( 2 λ 2 γ 4 xN sinh( γ 2 N ) 2 (1+ δ ) d N λ 2 γ sinh( γ N ) − 2 xN sinh( γ 2 N ) 2 + 4 N s inh( γ 2 N ) 2 ϕ ( x ) + log( 7 η ) + 1 − λ 2 γ " 2 d (1 + δ ) λ 2 sinh γ N xγ h exp log[cosh( γ N )] 1 − λ 2 γ − 1 i − 1 ! − 1 + 2 N h exp log[cosh( γ N )] 1 − λ 2 γ − 1 i ϕ ( x ) # + 1 − λ 2 γ log( 7 η ) − log ν ( β ) ν ( γ ) ǫ ) . Now let us choo se for simplicity β 2 = 2 λ 2 = 4 β , β 1 = λ 1 / 2 = γ / 4, and let us in tro duce the notation C 1 = N γ sinh( γ N ) , C 2 = N γ tanh( γ N ) , C 3 = N 2 γ 2 exp( γ 2 N 2 ) − 1 and C 4 = 2 N 2 (1 − 2 β γ ) γ 2 h exp γ 2 2 N 2 (1 − 2 β γ ) − 1 i , to obtain B ( π exp( − λ 1 r ) , β , γ ) ≤ − C 1 γ 8 β (1 − δ ) d 2.1. 
Bounds r elative t o a Gibbs distribution 65 + C 1 γ 2 4(1+ δ ) d γ + x γ 2 N (1 + δ ) 4 d γ + d 4 β + γ N ϕ ( x ) + 1 2 log 7 η + 1 2 C 1 − 1 h (1 + δ ) d N 2 xC 1 γ − 1 − 1 + C 1 γ 2 2 N ϕ ( x ) + 1 2 log( 7 η ) i + 1 2 − C 1 2(1 + δ ) d 8 N β xC 1 γ 2 − 1 − 1 + C 1 γ 2 N ϕ ( x ) + log( 7 η ) + 2 xγ (1 + δ ) d N − xγ + C 1 γ 2 N ϕ ( x ) + log( 7 η ) + d (1 + δ ) xγ N C 1 2 C 3 − xγ N − 1 + γ 2 N C 3 ϕ ( x ) + log( 7 η ) 2 − log ν ( β ) ν ( γ ) ǫ + 4 C 2 − 2 − 1 ( 4 β γ x γ 2 N C 1 (1 + δ ) d 2 β C 1 − xC 1 γ 2 2 N − 1 + γ 2 N ϕ ( x ) + log( 7 η ) + 1 − 2 β γ 2 d (1 + δ ) xγ N 4 β C 1 γ C 4 1 − 2 β γ − xγ N − 1 + γ 2 N (1 − 2 β γ ) C 4 ϕ ( x ) + 1 − 2 β γ log( 7 η ) − log ν ( β ) ν ( γ ) ǫ ) . This simplifies to B ( π exp( − λ 1 r ) , β , γ ) ≤ − C 1 8 (1 − δ ) d γ β + 2 C 1 (1 + δ ) d + log( 7 η ) 2 + 3 C 1 (4 C 1 − 2)(2 − C 1 ) + 1 + 2 β γ 4 C 2 − 2 − 1 + 1 4 C 2 − 2 log ν ( β ) ν ( γ ) ǫ + (1 + δ ) dxγ N C 1 + 1 2 C 1 − 1 1 2 C 1 − γ x N − 1 + 2 1 − γ x N − 1 + C 1 2 C 3 − γ x N − 1 + 4 C 1 β γ (4 C 2 − 2) + (1 + δ ) dxγ 2 N β C 1 16 + 2 2 − C 1 8 C 1 − xγ 2 N β − 1 + 1 − 2 β γ 1 2 C 2 − 1 h 4 C 1 C 4 1 − 2 β γ − γ 2 x β N i − 1 + γ 2 N ϕ ( x ) 3 C 1 2 + C 1 4 C 1 − 2 + C 1 2 − C 1 + C 3 + 4 β γ (4 C 2 − 2) + C 4 4 C 2 − 2 . This shows that there exist univ ers al p ositiv e real constants A 1 , A 2 , B 1 , B 2 , B 3 , and B 4 such that as so on as γ max { x, 1 } N ≤ A 1 β γ ≤ A 2 , B ( π exp( − λ 1 r ) , β , γ ) ≤ − B 1 (1 − δ ) d γ β + B 2 (1 + δ ) d 66 Chapter 2. Comp aring p osterior distributions to Gibbs priors − B 3 log ν ( β ) ν ( γ ) ǫ η + B 4 γ 2 N ϕ ( x ) . Thu s π exp( − λ 1 r ) ( R ) ≤ π exp( − β R ) ( R ) ≤ inf Θ R + (1+ δ ) d β as so on a s β γ ≤ B 1 B 2 (1+ δ ) (1 − δ ) + B 4 γ 2 N ϕ ( x ) − B 3 log[ ν ( β ) ν ( γ ) ǫη ] (1 − δ ) d . Cho osing some rea l ra tio α > 1, we can now make the a bov e re s ult uniform for any (2.10) β , γ ∈ Λ α def = n α k ; k ∈ N , 0 ≤ k < log( N ) log( α ) o , b y substituting ν ( β ) and ν ( γ ) with log( α ) log( αN ) and − log( η ) with − log( η ) + 2 × log h log( αN ) log( α ) i . T a king η = ǫ for simplicity , we can summarize our result in Theorem 2 . 1.15 . Ther e exist p ositive r e al universal c onstants A , B 1 , B 2 , B 3 and B 4 such tha t for any p ositive r e al c onstants α > 1 , d and δ , for any pri or distribution π ∈ M 1 + (Θ) , with P pr ob ability at le ast 1 − ǫ , for any β , γ ∈ Λ α (wher e Λ α is define d by e quation (2.1 0) ab ove) such that sup β ′ ∈ R ,β ′ ≥ β β ′ d π exp( − β ′ R ) ( R ) − inf Θ R − 1 ≤ δ and such that also for some p ositive r e al p ar ameter x γ max { x, 1 } N ≤ Aβ γ and β γ ≤ B 1 B 2 (1+ δ ) (1 − δ ) + B 4 γ 2 N ϕ ( x ) − 2 B 3 log( ǫ )+4 B 3 log log( N ) log( α ) (1 − δ ) d , the b ound B ( π exp( − γ 2 r ) , β , γ ) given by The or em 2.1.3 on p age 54 in the c ase wher e we have chosen ν to b e the un ifo rm pr ob ability me asur e on Λ α , satisfies B ( π exp( − γ 2 r ) , β , γ ) ≤ 0 , pr oving that b β ( π exp( − γ 2 r ) ) ≥ β and ther efor e that π exp( − γ r 2 ) ( R ) ≤ π exp( − β R ) ( R ) ≤ inf Θ R + (1 + δ ) d β . What is important in this r e s ult is that we do not only bo und π exp( − γ 2 r ) ( R ), but also B ( π exp( − γ 2 r ) , β , γ ), and that w e do it unifor mly on a g r id of v alues o f β and γ , showing that w e ca n indeed set the co nstan ts β and γ adaptiv ely using the empirical bo und B ( π exp( − γ 2 r ) , β , γ ). Let us se e what w e get under the margin assumption (1 .24, page 39). When κ = 1, we hav e ϕ ( c − 1 ) ≤ 0, leading to Corollary 2. 1.16 . 
Assuming that the mar gin assumption (1.2 4, p age 39 ) is sat- isfie d for κ = 1 , that R : Θ → (0 , 1) is indep endent of N (which is the c ase for instanc e when P = P ⊗ N ), and is such that lim β ′ → + ∞ β ′ π exp( − β ′ R ) ( R ) − inf Θ R = d, 2.1. Bounds r elative t o a Gibbs distribution 67 ther e ar e universal p ositive r e al c onstant s B 5 and B 6 and N 1 ∈ N such that for any N ≥ N 1 , with P pr ob ability at le ast 1 − ǫ π exp( − b γ r 2 ) ( R ) ≤ inf Θ R + B 5 d cN 1 + B 6 d log log( N ) ǫ 2 , wher e b γ ∈ arg max γ ∈ Λ 2 max β ∈ Λ 2 ; B ( π exp( − γ r 2 ) , β , γ ) ≤ 0 , wher e Λ 2 is define d by e qu ation (2.10, p age 66), and B is the b ound of The or em 2.1.3 (p age 54). When κ > 1, ϕ ( x ) ≤ (1 − κ − 1 ) κcx − 1 κ − 1 , and we can c ho ose γ and x such that γ 2 N ϕ ( x ) ≃ d to prov e Corollary 2. 1.17 . Ass u ming that the mar gin assum ption (1.24, p age 39) is sat- isfie d for some exp onent κ > 1 , that R : Θ → (0 , 1) is indep en dent of N (which is for instanc e the c ase when P = P ⊗ N ), and is such t hat lim β ′ → + ∞ β ′ π exp( − β ′ R ) ( R ) − inf Θ R = d, ther e ar e un iversal p ositive c onstants B 7 and B 8 and N 1 ∈ N such that for any N ≥ N 1 , with P pr ob ability at le ast 1 − ǫ , π exp( − b γ r 2 ) ( R ) ≤ inf Θ R + B 7 c − 1 2 κ − 1 1 + B 8 d log log( N ) ǫ 2 κ 2 κ − 1 d N κ 2 κ − 1 , wher e b γ ∈ arg max γ ∈ Λ 2 max β ∈ Λ 2 ; B ( π exp( − γ r 2 ) , β , γ ) ≤ 0 , Λ 2 b eing define d by e quation (2.10, p age 66) and B by The or em 2.1.3 (p age 54). W e find the s ame rate o f conv ergence as in Cor ollary 1.4.7 (page 40), but this time, we were able to pr o vide an empirical pos terior distribution π exp( − b γ r 2 ) which achiev es this rate ada ptively in all the parameters (meaning in particular that we do not need to know d , c or κ ). Mor eo ver, as alr eady mentioned, the power of N in this rate of conv erg ence is known to b e optimal in the worst ca se (see Mammen et al. (1999); Tsybako v (2004); Tsybako v et a l. (200 5), and mo re sp ecifically in Audiber t (2004b) — downloadable from its author’s web pag e — Theor em 3.3, page 132). 2.1.5. E stimating the diver genc e of a p osterior with r esp e ct to a Gibbs prior Another in teresting question is to es timate K ρ, π exp( − β R ) using relative deviation inequalities. W e follow here an idea to b e found fir st in (Audiber t, 2004b, pa ge 9 3). Indeed, c o m bining eq ua tion (2.3, pa ge 5 2) with equation (2.1, page 5 1), we s ee that for any p ositiv e real parameters β and λ , with P probability at least 1 − ǫ , for any po sterior distribution ρ : Ω → M 1 + (Θ), K ρ, π exp( − β R ) ≤ β N tanh( γ N ) γ ρ ( r ) − π exp( − β R ) ( r ) + N log cosh( γ N ) ρ ⊗ π exp( − β R ) ( m ′ ) + K ρ, π exp( − β R ) − log ( ǫ ) + K ( ρ, π ) − K π exp( − β R ) , π 68 Chapter 2. Comp aring p osterior distributions to Gibbs priors ≤ K ρ, π exp[ − β γ N tanh( γ N ) r ] + β N tanh( γ N ) n K ρ, π exp( − β R ) − log( ǫ ) o + log π exp[ − β γ N tanh( γ N ) r ] n exp h β tanh( γ N ) log cosh( γ N ) ρ ( m ′ ) io . W e thus obtain Theorem 2 . 1.18 . F or any p ositive r e al c onst ants β and γ such that β < N × tanh( γ N ) , with P pr ob ability at le ast 1 − ǫ , for any p osterior distribution ρ : Ω → M 1 + (Θ) , K ρ, π exp( − β R ) ≤ 1 − β N tanh γ N − 1 − 1 × ( K ρ, π exp[ − β γ N tanh( γ N ) − 1 r ] − β N tanh( γ N ) log( ǫ ) + log n π exp[ − β γ N tanh( γ N ) − 1 r ] h exp β tanh( γ N ) − 1 log[cosh( γ N )] ρ ( m ′ ) io ) . 
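Before commenting further, let us illustrate the adaptive choice of the constants $\beta$ and $\gamma$ described in Theorem 2.1.15 and Corollaries 2.1.16–2.1.17 above: both constants are restricted to the grid $\Lambda_\alpha$ of equation (2.10), and $\hat\gamma$ maximizes the largest $\beta$ on the grid for which the empirical bound is non-positive. The sketch below assumes a user-supplied callable `empirical_bound(gamma, beta)` standing for the bound $B(\pi_{\exp(-\gamma r/2)}, \beta, \gamma)$ of Theorem 2.1.3 evaluated on the data; it is a schematic rendering of the selection rule, not of the bound itself.

```python
import math

def Lambda(N, alpha=2.0):
    # the grid Lambda_alpha = {alpha**k : k in N, 0 <= k < log(N)/log(alpha)} of eq. (2.10)
    K = int(math.ceil(math.log(N) / math.log(alpha)))
    return [alpha ** k for k in range(K)]

def adaptive_temperature(empirical_bound, N, alpha=2.0):
    # For each gamma on the grid, beta_hat(gamma) is the largest grid value of beta with
    # empirical_bound(gamma, beta) <= 0; gamma_hat maximizes beta_hat(gamma).
    # `empirical_bound(gamma, beta)` is an assumed user-supplied callable evaluating the
    # empirical bound B(pi_{exp(-gamma*r/2)}, beta, gamma) of Theorem 2.1.3 on the data.
    grid = Lambda(N, alpha)
    gamma_hat, beta_hat = None, 0.0
    for gamma in grid:
        feasible = [beta for beta in grid if empirical_bound(gamma, beta) <= 0.0]
        if feasible and max(feasible) > beta_hat:
            gamma_hat, beta_hat = gamma, max(feasible)
    return gamma_hat, beta_hat

# usage (with a hypothetical bound): gamma_hat, beta_hat = adaptive_temperature(my_bound, N=10**4)
```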
Theorem 2.1.18 provides another way of measuring over-fitting, since it gives an upper bound for $K\big(\pi_{\exp[-\beta\gamma (N\tanh(\gamma/N))^{-1} r]}, \pi_{\exp(-\beta R)}\big)$. It may be used in combination with Theorem 1.2.6 (page 11) as an alternative to Theorem 1.3.7 (page 21). It will also be used in the next section. An alternative parametrization of the same result, providing a simpler right-hand side, is also useful:

Corollary 2.1.19. For any positive real constants $\beta$ and $\gamma$ such that $\beta < \gamma$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}^1_+(\Theta)$,
\[
K\Big(\rho,\; \pi_{\exp[-\frac{N\beta}{\gamma}\tanh(\frac{\gamma}{N}) R]}\Big)
\;\le\; \Big(1 - \tfrac{\beta}{\gamma}\Big)^{-1}
\Big\{ K\big(\rho, \pi_{\exp(-\beta r)}\big) - \tfrac{\beta}{\gamma}\log(\epsilon)
+ \log \pi_{\exp(-\beta r)}\Big[\exp\Big(\tfrac{N\beta}{\gamma}\log\cosh\big(\tfrac{\gamma}{N}\big)\, \rho(m')\Big)\Big] \Big\}.
\]

2.2. Playing with two posterior and two local prior distributions

2.2.1. Comparing two posterior distributions

Estimating the effective temperature of an estimator provides an efficient way to tune parameters in a model with parametric behaviour. On the other hand, it is not suited to choosing between different models, especially when they are nested, because, as we already saw in the case when $\Theta$ is a union of nested models, the prior distribution $\pi_{\exp(-\beta R)}$ does not provide an efficient localization of the parameter: $\pi_{\exp(-\beta R)}(R)$ does not go down to $\inf_\Theta R$ at the desired rate when $\beta$ goes to $+\infty$, requiring a resort to partial localization. Once some estimator (in the form of a posterior distribution) has been chosen in each sub-model, these estimators can be compared with one another with the help of the relative bounds that we will establish in this section. It is also possible to choose several estimators in each sub-model, so as to tune parameters at the same time (like the inverse temperature parameter, if we decide to use Gibbs posterior distributions in each sub-model).

From equation (2.2, page 52) (slightly modified by replacing $\pi \otimes \pi$ with $\pi_1 \otimes \pi_2$), we easily obtain

Theorem 2.2.1. For any positive real constant $\lambda$, for any prior distributions $\pi_1, \pi_2 \in \mathcal{M}^1_+(\Theta)$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distributions $\rho_1$ and $\rho_2 : \Omega \to \mathcal{M}^1_+(\Theta)$,
\[
- N \log\Big\{ 1 - \tanh\big(\tfrac{\lambda}{N}\big)\big[\rho_2(R) - \rho_1(R)\big] \Big\}
\;\le\; \lambda\big[\rho_2(r) - \rho_1(r)\big]
+ N \log\cosh\big(\tfrac{\lambda}{N}\big)\, (\rho_1 \otimes \rho_2)(m')
+ K(\rho_1, \pi_1) + K(\rho_2, \pi_2) - \log(\epsilon).
\]

This is where the entropy bound of the previous section enters the game, providing a localized version of Theorem 2.2.1 (page 69). We will use the notation
\[
(2.11) \qquad \Xi_a(q) = \tanh(a)^{-1}\big[1 - \exp(-aq)\big] \;\le\; \frac{a}{\tanh(a)}\, q, \qquad a, q \in \mathbb{R}.
\]

Theorem 2.2.2.
F or any ǫ ∈ )0 , 1( , any se quenc e of prior distributions ( π i ) i ∈ N ∈ M 1 + (Θ) N , any pr ob ability distribution µ on N , any atomic pr ob abili ty distribution ν on R + , with P pr ob ability at le ast 1 − ǫ , for any p osterior distributions ρ 1 , ρ 2 : Ω → M 1 + (Θ) , ρ 2 ( R ) − ρ 1 ( R ) ≤ B ( ρ 1 , ρ 2 ) , wher e B ( ρ 1 , ρ 2 ) = inf λ,β 1 <γ 1 ,β 2 <γ 2 ∈ R + ,i,j ∈ N Ξ λ N ( ρ 2 ( r ) − ρ 1 ( r ) + N λ log cosh( λ N ) ρ 1 ⊗ ρ 2 ( m ′ ) + 1 λ 1 − β 1 γ 1 K ρ 1 , π i exp( − β 1 r ) + log n π i exp( − β 1 r ) h exp β 1 N γ 1 log cosh( γ 1 N ) ρ 1 ( m ′ ) io − β 1 γ 1 log ν ( γ 1 ) + 1 λ 1 − β 2 γ 2 K ρ 2 , π j exp( − β 2 r ) + log n π j exp( − β 2 r ) h exp β 2 N γ 2 log cosh( γ 2 N ) ρ 2 ( m ′ ) io − β 2 γ 2 log ν ( γ 2 ) − h γ 1 β 1 − 1 − 1 + γ 2 β 2 − 1 − 1 + 1 i log 3 − 1 ν ( β 1 ) ν ( β 2 ) ν ( λ ) µ ( i ) µ ( j ) ǫ λ ) . 70 Chapter 2. Comp aring p osterior distributions to Gibbs priors The sequence of prior distributions ( π i ) i ∈ N should b e understo od to b e typically suppo rted by subs e ts of Θ co r respo nding to parametric sub- mo dels, that is sub- mo dels for whic h it is reaso nable to exp ect that lim β → + ∞ β π i exp( − β R ) ( R ) − es s inf π i R exists and is po s itiv e a nd finite. As there is no rea son wh y the b ound B ( ρ 1 , ρ 2 ) pr o- vided by the previous theorem should b e sub-additive (in the sense that B ( ρ 1 , ρ 3 ) ≤ B ( ρ 1 , ρ 2 ) + B ( ρ 2 , ρ 3 )), it is adequa te to co nsider s ome work able subset P of p osterior distributions (for instance the distributions of the form π i exp( − β r ) , i ∈ N , β ∈ R + ), and to define the sub-additive chained b ound (2.12) e B ( ρ, ρ ′ ) = inf ( n − 1 X k =0 B ( ρ k , ρ k +1 ); n ∈ N ∗ , ( ρ k ) n k =0 ∈ P n +1 , ρ 0 = ρ, ρ n = ρ ′ ) , ρ, ρ ′ ∈ P . Prop osition 2. 2 .3 . With P pr ob ability at le ast 1 − ǫ , for any p osterior distribu- tions ρ 1 , ρ 2 ∈ P , ρ 2 ( R ) − ρ 1 ( R ) ≤ e B ( ρ 1 , ρ 2 ) . Mor e over for any p osterior distribution ρ 1 ∈ P , any p osterior distribution ρ 2 ∈ P such that e B ( ρ 1 , ρ 2 ) = inf ρ 3 ∈ P e B ( ρ 1 , ρ 3 ) is unimpr ovable with the help of e B in P in the sense that inf ρ 3 ∈ P e B ( ρ 2 , ρ 3 ) ≥ 0 . Proof. The first asse rtion is a direct consequence o f the pre v ious theorem, so only the second asser tion re quires a pro of: for any ρ 3 ∈ P , we deduce fro m the optimality of ρ 2 and the sub-additivit y of e B that e B ( ρ 1 , ρ 2 ) ≤ e B ( ρ 1 , ρ 3 ) ≤ e B ( ρ 1 , ρ 2 ) + e B ( ρ 2 , ρ 3 ) . This prop osition provides a wa y to improv e a p osterior distribution ρ 1 ∈ P by choosing ρ 2 ∈ arg min ρ ∈ P e B ( ρ 1 , ρ ) whenever e B ( ρ 1 , ρ 2 ) < 0. This improv ement is prov ed by Prop osition 2.2.3 to be one-step: the obtained improv ed p osterior ρ 2 cannot be impro ved a gain using the same technique. Let us give so me examples of p ossible starting distributions ρ 1 for this improv e- men t scheme: ρ 1 may be chosen as the be st p osterior Gibbs distribution according to Prop osition 2.1.5 (page 56). More precise ly , we ma y build from the pr ior dis tr i- butions π i , i ∈ N , a g lobal pr ior π = P i ∈ N µ ( i ) π i . W e can then define the estimator of the in verse effective temp erature as in P ropo sition 2.1.5 (page 5 6) a nd choos e ρ 1 ∈ arg min ρ ∈ P b β ( ρ ), where P is as sugg e s ted ab ov e the se t of p osterior distribu- tions P = n π i exp( − β r ) ; i ∈ N , β ∈ R + o . 
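The chained bound of equation (2.12) and the one-step improvement of Proposition 2.2.3 are straightforward to carry out on a finite working set $\mathcal{P}$ once the matrix of pairwise bounds $B(\rho_i, \rho_j)$ has been computed. The following sketch (names are ours; the pairwise bounds are assumed to be given) obtains $\widetilde{B}$ as an all-pairs shortest path computation and applies the improvement step only when it is certified by a negative chained bound.

```python
import numpy as np

def chained_bound(B):
    # Sub-additive chained bound of eq. (2.12): all-pairs shortest paths with the
    # pairwise bounds B[i, j] = B(rho_i, rho_j) as (possibly negative) edge lengths.
    # We assume the bound matrix induces no negative cycle (otherwise the infimum
    # over chains of arbitrary length is -infinity).
    Bt = np.array(B, dtype=float)
    np.fill_diagonal(Bt, 0.0)
    n = Bt.shape[0]
    for k in range(n):                       # Floyd-Warshall relaxation
        Bt = np.minimum(Bt, Bt[:, k][:, None] + Bt[k, :][None, :])
    return Bt

def one_step_improvement(B, start):
    # Proposition 2.2.3: move from rho_start to the posterior minimizing the chained
    # bound, but only when that bound certifies an improvement (negative value).
    Bt = chained_bound(B)
    best = int(np.argmin(Bt[start]))
    return best if Bt[start, best] < 0 else start
```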
This starting po int ρ 1 should already be pretty goo d, at least in an asymptotic per spective, the only gain in the rate of con vergence to b e exp ected bear ing o n spurious log( N ) facto r s. 2.2.2. E lab or ate uses of r elative b ounds b etwe en p osteriors More elab orate uses of relative b ounds are described in the third section of the second chapter of Audib ert (200 4b) , where an alg orithm is prop osed and analysed, 2.2. Playing with two p osterior and two lo c al prior distributions 71 which allows one to use relative b ounds b et ween tw o p osterior distributions as a stand-alone estimation to ol. Let us give here so me alternative way to address this issue. W e will assume for simplicit y a nd without great loss of generality that the working set of poster ior distributions P is finite (so that among other things a n y or dering of it has a first element ). It is natural to define the estimated complexity of any g iv en p osterior dis tr ibution ρ ∈ P in our working set a s the b ound for inf i ∈ N K ( ρ, π i ) used in Theo r em 2.2.1 (page 69). This leads to set (given some confidence level 1 − ǫ ) C ( ρ ) = inf β <γ ∈ R + ,i ∈ N 1 − β γ − 1 K ρ, π i exp( − β r ) + log n π i exp( − β r ) h exp β N γ log cosh( γ N ) ρ ( m ′ ) io − β γ log 3 − 1 ν ( γ ) ν ( β ) µ ( i ) ǫ . Let us moreov er call γ ( ρ ), β ( ρ ) and i ( ρ ) the v alues achieving this infim um, or nearly achieving it, whic h requires a slight change of the definition o f C ( ρ ) to take this mo dification into account. F or the sake of simplicity , we can assume without substantial loss of generality that the suppo rts o f ν and µ are large but finite, and th us that the minimum is reached. T o unders tand how this notion of complexity comes in to play , it may b e inter- esting to keep in mind that for any p osterior distr ibutions ρ and ρ ′ we can write the bo und in T heo rem 2 .2.2 (page 69) as (2.13) B ( ρ, ρ ′ ) = inf λ ∈ R + Ξ λ N ρ ′ ( r ) − ρ ( r ) + S λ ( ρ, ρ ′ ) , where S λ ( ρ, ρ ′ ) = S λ ( ρ ′ , ρ ) ≤ N λ log cosh( λ N ) ρ ⊗ ρ ′ ( m ′ ) + C ( ρ ) + C ( ρ ′ ) λ − log(3 − 1 ǫ ) λ − log ν β ( ρ ) µ i ( ρ ) λ 1 − β ( ρ ′ ) γ ( ρ ′ ) − log ν β ( ρ ′ ) µ i ( ρ ′ ) λ 1 − β ( ρ ) γ ( ρ ) − h γ ( ρ ) β ( ρ ) − 1 − 1 + γ ( ρ ′ ) β ( ρ ′ ) − 1 − 1 + 1 i log ν ( λ ) λ . (Let us recall that the function Ξ is defined by equa tion (2.11, page 69).) T hus for any ρ, ρ ′ such that B ( ρ ′ , ρ ) > 0, we ca n deduce fro m the monotonicity of Ξ λ N that ρ ′ ( r ) − ρ ( r ) ≤ inf λ ∈ R + S λ ( ρ, ρ ′ ) , proving that the left-hand side is small, and consequently that B ( ρ, ρ ′ ) a nd its chained counterpart defined b y equation (2.12, page 70) are small: e B ( ρ, ρ ′ ) ≤ B ( ρ, ρ ′ ) ≤ inf λ ∈ R + Ξ λ N 2 S λ ( ρ, ρ ′ ) . It is a lso worth noticing that B ( ρ, ρ ′ ) a nd e B ( ρ, ρ ′ ) a re upp er bounded in terms o f v ar iance a nd complexity o nly . 72 Chapter 2. Comp aring p osterior distributions to Gibbs priors The presence of the r atios γ ( ρ ) β ( ρ ) should not b e obnoxious, since their v a lues should be automatically tamed by the fact that β ( ρ ) a nd γ ( ρ ) should make the estimate of the complexity of ρ optimal. As an a lternativ e, it is p ossible to restrict to set of pa rameter v alues β and γ such that, for so me fix e d constant ζ > 1, the r atio γ β is b ounded aw ay from 1 by the inequality γ β ≥ ζ . 
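On a finite parameter grid the estimated complexity $C(\rho)$ introduced above can be evaluated directly, optimizing over the supports of $\nu$ and $\mu$; the restriction $\gamma \ge \zeta\beta$ discussed just above would simply shrink the search grid. The sketch below is our reading of the definition for a single sub-model $i$, with the empirical risks, the pairwise variance term $m'$ and the prior weights supplied as arrays; it is illustrative only, and all names are ours.

```python
import numpy as np
from scipy.special import logsumexp

def complexity(rho, r, mprime, prior, N, betas, gammas, log_nu, log_mu_i, eps):
    # Sketch of the estimated complexity C(rho) for a single sub-model i on a finite
    # parameter grid.  Assumed inputs: rho and prior are weight vectors, r holds the
    # empirical risks r(theta), mprime the pairwise variance term m'(theta, theta'),
    # log_nu maps a grid value to log nu({.}), log_mu_i = log mu(i), eps the
    # confidence parameter.
    rho_m = mprime @ rho                              # theta -> rho( m'(theta, .) )
    supp = rho > 0
    best = np.inf
    for beta in betas:
        log_g = np.log(prior) - beta * r
        log_g -= logsumexp(log_g)                     # log pi^i_{exp(-beta r)}
        ent = float(np.sum(rho[supp] * (np.log(rho[supp]) - log_g[supp])))
        for gamma in gammas:
            if gamma <= beta:
                continue
            scale = beta * (N / gamma) * np.log(np.cosh(gamma / N))
            moment = float(logsumexp(log_g + scale * rho_m))
            penalty = -(beta / gamma) * (np.log(eps / 3.0)
                                         + log_nu[gamma] + log_nu[beta] + log_mu_i)
            best = min(best, (ent + moment + penalty) / (1.0 - beta / gamma))
    return best
```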
This leads to a n alterna tiv e definition of C ( ρ ): C ( ρ ) = inf γ ≥ ζ β ∈ R + ,i ∈ N 1 − β γ − 1 K ρ, π i exp( − β r ) + log n π i exp( − β r ) h exp β N γ log cosh( γ N ) ρ ( m ′ ) io − β γ log 3 − 1 ν ( γ ) ν ( β ) µ ( i ) ǫ − log ν ( β ) µ ( i ) (1 − ζ − 1 ) − log(3 − 1 ǫ ) 2 . W e can even push simplification a s tep further, p ostponing the optimization of the ratio γ β , and setting it to the fixed v alue ζ . This leads us to adopt the definition (2.14) C ( ρ ) = inf β ∈ R + ,i ∈ N 1 − ζ − 1 − 1 K ρ, π i exp( − β r ) + log n π i exp( − β r ) h exp N ζ log cosh( ζ β N ) ρ ( m ′ ) io − ζ + 1 ζ − 1 log ν ( β ) µ ( i ) + 2 − 1 log(3 − 1 ǫ ) . With either of these mo dified definitions of the complexity C ( ρ ), we get the upp er bo und (2.15) S λ ( ρ, ρ ′ ) ≤ e S λ ( ρ, ρ ′ ) def = N λ log cosh( λ N ) ρ ⊗ ρ ′ ( m ′ ) + 1 λ C ( ρ ) + C ( ρ ′ ) − ζ + 1 ζ − 1 log ν ( λ ) . With these definitions, we hav e for any p osterior distributions ρ and ρ ′ B ( ρ, ρ ′ ) ≤ inf λ ∈ R + Ξ λ N n ρ ′ ( r ) − ρ ( r ) + e S λ ( ρ, ρ ′ ) o . Consequently in the case when B ( ρ ′ , ρ ) > 0 , we ge t e B ( ρ, ρ ′ ) ≤ B ( ρ, ρ ′ ) ≤ inf λ ∈ R + Ξ λ N 2 e S λ ( ρ, ρ ′ ) . T o selec t some nearly optimal p osterior distribution in P , it is appropr iate to or- der the po sterior distributions of P accor ding to increasing v alues of their complex- it y C ( ρ ) a nd consider so me indexation P = { ρ 1 , . . . , ρ M } , where C ( ρ k ) ≤ C ( ρ k +1 ), 1 ≤ k < M . Let us now consider for each ρ k ∈ P the first p osterior distribution in P which cannot be prov ed to b e worse than ρ k according to the bound e B : (2.16) t ( k ) = min n j ∈ { 1 , . . . M } : e B ( ρ j , ρ k ) > 0 o . 2.2. Playing with two p osterior and two lo c al prior distributions 73 In th is definition, which uses the c hained bound defined by equation (2.12, page 70), it is appropr iate to ass ume b y conv enti on that e B ( ρ, ρ ) = 0, for any pos terior distribution ρ . Let us now define our estimated b est ρ ∈ P as ρ b k , where (2.17) b k = min(arg max t ) . Thu s we take the po sterior with sma llest complexity which can b e prov ed to be b e t- ter than the la rgest starting interv al of P in terms of estimated r elativ e cla s sification error. The following theorem is a simple consequence of the chosen optimisation scheme. It is v a lid for any a rbitrary choice of the complexit y function ρ 7→ C ( ρ ). Theorem 2 . 2.4 . L et us put b t = t ( b k ) , wher e t is define d by e quation (2.16) and b k is define d by e quation (2.17) . With P pr ob ability at le ast 1 − ǫ , ρ b k ( R ) ≤ ρ j ( R ) + 0 , 1 ≤ j < b t, e B ( ρ j , ρ t ( j ) ) , b t ≤ j < b k , e B ( ρ j , ρ b t ) + e B ( ρ b t , ρ b k ) , j ∈ (arg max t ) , e B ( ρ j , ρ b k ) , j ∈ b k + 1 , . . . , M \ (arg max t ) , wher e the chaine d b oun d e B is define d fr om t he b ound of The or em 2. 2.2 (p age 69) by e quation (2.12, p age 70). In the me an time, for any j such t hat b t ≤ j < b k , t ( j ) < b t = ma x t , b e c ause j 6∈ (arg ma x t ) . Thus ρ b k ( R ) ≤ ρ t ( j ) ( R ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λ N 2 S λ ( ρ j , ρ t ( j ) ) while ρ t ( j ) ( r ) ≤ ρ j ( r ) + inf λ ∈ R + S λ ( ρ j , ρ t ( j ) ) , wher e the function Ξ is define d by e quation (2.11, p age 69) and S λ is define d by e quation (2.13, p age 71). 
F or any j ∈ (arg max t ) , (including notably b k ), B ( ρ b t , ρ j ) ≥ e B ( ρ b t , ρ j ) > 0 , B ( ρ j , ρ b t ) ≥ e B ( ρ j , ρ b t ) > 0 , so in this c ase ρ b k ( R ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λ N h S λ ( ρ j , ρ b t ) + S λ ( ρ b t , ρ b k ) + S λ ( ρ j , ρ b k ) i , while ρ b t ( r ) ≤ ρ j ( r ) + inf λ ∈ R + S λ ( ρ j , ρ b t ) , ρ b k ( r ) ≤ ρ b t ( r ) + inf λ ∈ R + S λ ( ρ b t , ρ b k ) , and ρ b t ( R ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λ N 2 S λ ( ρ j , ρ b t ) . Final ly in the c ase when j ∈ b k + 1 , . . . , M \ (arg max t ) , due to the fact that in p articular j 6∈ (arg max t ) , B ( ρ b k , ρ j ) ≥ e B ( ρ b k , ρ j ) > 0 . Thus in this last c ase ρ b k ( R ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λ N 2 S λ ( ρ j , ρ b k ) , while ρ b k ( r ) ≤ ρ j ( r ) + inf λ ∈ R + S λ ( ρ j , ρ b k ) . 74 Chapter 2. Comp aring p osterior distributions to Gibbs priors Thus for any j = 1 , . . . , M , ρ b k ( R ) − ρ j ( R ) is b ounde d fr om ab ove by an empiri c al quantity involving only varianc e and entr opy terms of p osterior distributions ρ ℓ such that ℓ ≤ j , and ther efor e such that C ( ρ ℓ ) ≤ C ( ρ j ) . Mor e over, t hese distributions ρ ℓ ar e su ch t ha t ρ ℓ ( r ) − ρ j ( r ) and ρ ℓ ( R ) − ρ j ( R ) have an empiri c al upp er b oun d of the same or der as the b ound state d for ρ b k ( R ) − ρ j ( R ) — namely the b ound for ρ ℓ ( r ) − ρ j ( r ) is in al l ci r cumstanc es not gr e ater than Ξ − 1 λ N applie d to the b ound state d for ρ b k ( R ) − ρ j ( R ) , wher e as the b ound for ρ ℓ ( R ) − ρ j ( R ) is always smal ler than two times the b ound state d for ρ b k ( R ) − ρ j ( R ) . This shows that varianc e t erms ar e b etwe en p osterior distributions whose empiric al a s wel l as exp e cte d err or r ates c annot b e much lar ger than those of ρ j . Let us remark that the estimation scheme described in this theorem is v ery general, the same metho d can be used as so on as some c onfidenc e interval for the relative exp ected risks − B ( ρ 2 , ρ 1 ) ≤ ρ 2 ( R ) − ρ 1 ( R ) ≤ B ( ρ 1 , ρ 2 ) with P probability at least 1 − ǫ, is av ailable. The definition o f the complexity is arbitra ry , and co uld in an abstra ct context b e chosen as C ( ρ 1 ) = inf ρ 2 6 = ρ 1 B ( ρ 1 , ρ 2 ) + B ( ρ 2 , ρ 1 ) . Proof. The case when 1 ≤ j < b t is straightf or w ard from the definitions: when j < b t , e B ( ρ j , ρ b k ) ≤ 0 and ther efore ρ b k ( R ) ≤ ρ j ( R ). In the second case , that is when b t ≤ j < b k , j ca nnot b e in ar g max t , b ecause of the sp e cial choice of b k in arg max t . Thus t ( j ) < b t and we deduce from the first case that ρ b k ( R ) ≤ ρ t ( j ) ( R ) ≤ ρ j ( R ) + e B ( ρ j , ρ t ( j ) ) . Moreov er, w e see from the defintion o f t that e B ( ρ t ( j ) , ρ j ) > 0, implying ρ t ( j ) ( r ) ≤ ρ j ( r ) + inf λ ∈ R + S λ ( ρ j , ρ t ( j ) ) , and therefore that ρ b k ( R ) ≤ ρ j ( R ) + inf λ Ξ λ N 2 S λ ( ρ j , ρ t ( j ) ) . In the third case j b elongs to a rg ma x t . In this case, we are no t sure that e B ( ρ b k , ρ j ) > 0 , and it is appro priate to inv olve b t , whic h is the index of the first po sterior distr ibution which canno t b e improved b y ρ b k , implying notably that e B ( ρ b t , ρ k ) > 0 for any k ∈ arg ma x t . On the other hand, ρ b t cannot either improv e any po sterior distribution ρ k with k ∈ (ar g max t ), b ecause this would imply for a n y ℓ < b t that e B ( ρ ℓ , ρ b t ) ≤ e B ( ρ ℓ , ρ k ) + e B ( ρ k , ρ b t ) ≤ 0, and therefore that t ( b t ) ≥ b t + 1, in contradiction of the fact that b t = max t . 
Thus e B ( ρ k , ρ b t ) > 0, and these t wo remarks imply that ρ b t ( r ) ≤ ρ j ( r ) + inf λ ∈ R + S λ ( ρ j , ρ b t ) , ρ b k ( r ) ≤ ρ b t ( r ) + inf λ ∈ R + S λ ( ρ b t , ρ b k ) ≤ ρ j ( r ) + inf λ ∈ R + S λ ( ρ j , ρ b t ) + inf λ ∈ R + S λ ( ρ b t , ρ b k ) , 2.2. Playing with two p osterior and two lo c al prior distributions 75 and consequently also that ρ b k ( R ) ≤ ρ j ( R ) + e B ( ρ j , ρ b k ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λ N h S λ ( ρ j , ρ b t ) + S λ ( ρ b t , ρ b k ) + S λ ( ρ j , ρ b k ) i and that ρ b t ( R ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λ N 2 S λ ( ρ j , ρ b t ) ≤ ρ j ( R ) + 2 inf λ ∈ R + 2Ξ λ N S λ ( ρ j , ρ b t ) , the last inequa lit y being due to the fact that Ξ λ N is a concave function. Let us notice that it may b e the case that b k < b t , but that only the case when j ≥ b t is to be considered, since otherwise w e alrea dy know that ρ b k ( R ) ≤ ρ j ( R ). In the fo urth case, j is g reater than b k , a nd the complexity o f ρ j is larger than the complexity of ρ b k . Mor e over, j is not in arg max t , and thus e B ( ρ b k , ρ j ) > 0, beca use otherwise, the sub-additivity o f e B would imply that e B ( ρ ℓ , ρ j ) ≤ 0 for any ℓ ≤ b t and therefore that t ( j ) ≥ b t = max t . Ther efore ρ b k ( r ) ≤ ρ j ( r ) + inf λ ∈ R + S λ ( ρ j , ρ b k ) , and ρ b k ( R ) ≤ ρ j ( R ) + e B ( ρ j , ρ b k ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λ N 2 S λ ( ρ j , ρ b k ) . 2.2.3. Analysis of r elative b ou n ds Let us start our inv estigation of the theoretical prop erties of the a lgorithm describ ed in Theor em 2.2.4 (page 73) by co mputing some non-random upper bo unds for B ( ρ, ρ ′ ), the bound of Theorem 2.2.2 (page 69), a nd C ( ρ ), the complexity factor defined b y e q uation (2.1 4 , page 72), for any ρ, ρ ′ ∈ P . This analysis will b e done in the c a se when P = n π i exp( − β r ) : ν ( β ) > 0 , µ ( i ) > 0 o , in which it will b e p ossible to get s ome cont ro l on the randomness of any ρ ∈ P , in addition to con trolling the other random ex pr essions appear ing in the definition of B ( ρ, ρ ′ ), ρ, ρ ′ ∈ P . W e will also use a simpler choice of complexity function, removing from equation (2.14 pag e 72) the optimization in i and β and using instead the definition (2.18) C ( π i exp( − β r ) ) def = 1 − ζ − 1 − 1 log π i exp( − β r ) exp n N ζ log cosh ζ β N π i exp( − β r ) ( m ′ ) o + ζ + 1 ζ − 1 log ν ( β ) µ ( i ) . With this definition, 76 Chapter 2. Comp aring p osterior distributions to Gibbs priors S λ ( π i exp( − β r ) , π j exp( − β ′ r ) ) ≤ N λ log cosh( λ N ) π i exp( − β r ) ⊗ π j exp( − β ′ r ) ( m ′ ) + C π i exp( − β r ) + C π j exp( − β ′ r ) λ + ( ζ + 1) ( ζ − 1) λ log 3 − 1 ν ( λ ) ǫ , where S λ is defined b y equation (2.13, page 71), so that B π i exp( − β r ) , π j exp( − β ′ r ) = inf λ ∈ R + Ξ λ N n π j exp( − β ′ r ) ( r ) − π i exp( − β r ) ( r ) + S λ π i exp( − β r ) , π j exp( − β r ) o . Let us succe s siv ely b ound the v arious r andom factors en tering into the defini- tion of B π i exp( − β r ) , π j exp( − β ′ r ) . The quantit y π j exp( − β ′ r ) ( r ) − π i exp( − β r ) ( r ) can be bo unded using a slight adaptation of Prop osition 2 .1.11 (page 59). Prop osition 2. 2 .5 . 
F or any p ositive r e al c onstants λ, λ ′ and γ , with P pr ob abili ty at le ast 1 − η , for any p ositive r e al c onstants β , β ′ such tha t β < λ γ N sinh( γ N ) − 1 and β ′ > λ ′ γ N sinh( γ N ) − 1 , π j exp( − β ′ r ) ( r ) − π i exp( − β r ) ( r ) ≤ π j exp( − λ ′ R ) ⊗ π i exp( − λR ) Ψ − γ N ( R ′ , M ′ ) + log 3 η γ + C j ( λ ′ , γ ) + log ( 3 η ) N β ′ λ ′ sinh( γ N ) − γ + C i ( λ, γ ) + log( 3 η ) γ − N β λ sinh( γ N ) , wher e C i ( λ, γ ) def = lo g Z Θ exp − γ Z Θ n Ψ γ N R ′ ( θ 1 , θ 2 ) , M ′ ( θ 1 , θ 2 ) − N γ sinh( γ N ) R ′ ( θ 1 , θ 2 ) o π i exp( − λR ) ( dθ 2 ) π i exp( − λR ) ( dθ 1 ) ≤ log π i exp( − λR ) exp n 2 N sinh γ 2 N 2 π i exp( − λR ) M ′ o . As for π i exp( − β r ) ⊗ π j exp( − β ′ r ) ( m ′ ), we can write with P probability at lea st 1 − η , for any p osterior distributions ρ and ρ ′ : Ω → M 1 + (Θ), γ ρ ⊗ ρ ′ ( m ′ ) ≤ log h π i exp( − λR ) ⊗ π j exp( − λ ′ R ) exp γ Φ − γ N ( M ′ ) i + K ρ, π i exp( − λR ) + K ρ ′ , π j exp( − λ ′ R ) − log( η ) . W e ca n then replace λ with β N λ sinh( λ N ) a nd use Theorem 2.1.12 (page 60) to get Prop osition 2. 2 .6 . F or any p ositive r e al c onstants γ , λ , λ ′ , β and β ′ , with P pr ob ability 1 − η , γ ρ ⊗ ρ ′ ( m ′ ) 2.2. Playing with two p osterior and two lo c al prior distributions 77 ≤ log h π i exp[ − β N λ sinh( λ N ) R ] ⊗ π j exp[ − β ′ N λ ′ sinh( λ ′ N ) R ] exp γ Φ − γ N ( M ′ ) i + K ρ, π i exp( − β r ) 1 − β λ + C i β N λ sinh( λ N ) , λ − log( η 3 ) λ β − 1 + K ρ ′ , π j exp( − β ′ r ) 1 − β ′ λ ′ + C j β ′ N λ ′ sinh( λ ′ N ) , λ ′ − log( η 3 ) λ β ′ − 1 − log( η 3 ) . The last random factor in B ( ρ, ρ ′ ) that w e need to upp er bo und is log n π i exp( − β r ) h exp β N γ log cosh( γ N ) π i exp( − β r ) ( m ′ ) io . A s ligh t a daptation of Prop osition 2.1.13 (page 61) shows that with P pro ba bilit y at least 1 − η , log n π i exp( − β r ) h exp β N γ log cosh( γ N ) π i exp( − β r ) ( m ′ ) io ≤ 2 β γ C i N β γ sinh( γ N ) , γ + 1 − β γ log π i exp[ − N β γ sinh( γ N ) R ] ⊗ 2 exp N log cosh( γ N ) γ β − 1 Φ − log[cosh( γ N )] γ β − 1 ◦ M ′ + 1 + β γ log( 2 η ) , where as us ua l Φ is the function defined by equatio n (1.1, page 2). This lea ds us to define for any i, j ∈ N , a n y β , β ′ ∈ R + , (2.19) C ( i, β ) def = 2 ζ − 1 C i h N ζ sinh( ζ β N ) , ζ β i + log π i exp[ − N ζ sinh( ζ β N ) R ] ⊗ 2 exp N log cosh( ζ β N ) ζ − 1 Φ − log[cosh( ζ β N )] ζ − 1 ◦ M ′ − ζ + 1 ζ − 1 2 log ν ( β ) µ ( i ) + log η 2 . Recall that the definition of C i ( λ, γ ) is to be found in Prop osition 2.2.5, pag e 76. Let us remark that, since exp N a Φ − a ( p ) = exp n N log h 1 + exp( a ) − 1 p io ≤ exp n N exp( a ) − 1 p o , p ∈ (0 , 1) , a ∈ R , we have C ( i, β ) ≤ 2 ζ − 1 log π i exp[ − N ζ sinh( ζ β N ) R ] exp n 2 N s inh ζ β 2 N 2 π i exp[ − N ζ sinh( ζ β N ) R ] M ′ o 78 Chapter 2. Comp aring p osterior distributions to Gibbs priors + log π i exp[ − N ζ sinh( ζ β N ) R ] ⊗ 2 exp n N h exp ( ζ − 1) − 1 log cosh ζ β N − 1 i M ′ o − ζ + 1 ζ − 1 2 log ν ( β ) µ ( i ) + log η 2 . Let us put S λ ( i, β ) , ( j, β ′ ) def = N λ log cosh( λ N ) inf γ ∈ R + γ − 1 log π i exp[ − N ζ sinh( ζ β N ) R ] ⊗ π j exp[ − N ζ sinh( ζ β ′ N ) R ] n exp γ Φ − γ N ( M ′ ) o + C i N ζ sinh( ζ β N ) , ζ β − log ( η 3 ) ζ − 1 + C j N ζ sinh( ζ β ′ N ) , ζ β ′ − log ( η 3 ) ζ − 1 − log( η 3 ) + 1 λ C ( i, β ) + C ( j, β ′ ) − ζ + 1 ζ − 1 log 3 − 1 ν ( λ ) ǫ , where η = ν ( γ ) ν ( β ) ν ( β ′ ) µ ( i ) µ ( j ) η . 
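Before bounding the remaining random factors, it may help to keep in mind the selection scheme of Theorem 2.2.4 that this analysis is aimed at. The following sketch (0-based indices, names ours) summarizes it: the candidates are ordered by increasing estimated complexity, $t(k)$ of equation (2.16) is the first index whose chained bound against $\rho_k$ is positive, and $\hat{k}$ is the smallest index achieving $\max t$, as in equation (2.17).

```python
import numpy as np

def select_by_complexity(C, B_tilde):
    # Selection scheme of Theorem 2.2.4.  C holds the estimated complexities of
    # rho_1, ..., rho_M sorted in increasing order; B_tilde is the matrix of chained
    # relative bounds, with B_tilde[k, k] = 0 by convention.
    M = len(C)
    assert all(C[k] <= C[k + 1] for k in range(M - 1)), "sort P by increasing complexity"
    t = np.empty(M, dtype=int)
    for k in range(M):
        # eq. (2.16): first j whose chained bound against rho_k is positive, i.e. the
        # first posterior that cannot be proved to be worse than rho_k
        qualifying = np.flatnonzero(B_tilde[:, k] > 0)
        t[k] = qualifying[0] if qualifying.size else M   # M: rho_k beats every competitor
                                                         # (our convention for this corner case)
    k_hat = int(np.flatnonzero(t == t.max())[0])         # eq. (2.17): min of argmax t
    return k_hat, t
```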
Let us remark that S λ ( i, β ) , ( j, β ′ ) ≤ inf γ ∈ R + λ 2 N γ log π i exp[ − N ζ sinh( ζ β N ) R ] ⊗ π j exp[ − N ζ sinh( ζ β ′ N ) R ] n exp h N exp γ N − 1 M ′ io + λ 2 N γ ( ζ − 1) + 2 λ ( ζ − 1) log π i exp[ − N ζ sinh( ζ β N ) R ] exp n 2 N s inh ζ β 2 N 2 π i exp[ − N ζ sinh( ζ β N ) R ] M ′ o + λ − 1 log π i exp[ − N ζ sinh( ζ β N ) R ] ⊗ 2 exp n N h exp ( ζ − 1) − 1 log cosh ζ β N − 1 i M ′ o + λ 2 N γ ( ζ − 1) + 2 λ ( ζ − 1) log π j exp[ − N ζ sinh( ζ β ′ N ) R ] exp n 2 N sinh ζ β ′ 2 N 2 π j exp[ − N ζ sinh( ζ β ′ N ) R ] M ′ o + λ − 1 log π j exp[ − N ζ sinh( ζ β ′ N ) R ] ⊗ 2 exp n N h exp ( ζ − 1 ) − 1 log cosh ζ β ′ N − 1 i M ′ o 2.2. Playing with two p osterior and two lo c al prior distributions 79 − ( ζ + 1) λ 2 N ( ζ − 1) γ log 3 − 1 ν ( γ ) ν ( β ) ν ( β ′ ) µ ( i ) µ ( j ) η − ( ζ + 1) ( ζ − 1) λ 2 log 2 − 1 ν ( β ) ν ( β ′ ) µ ( i ) µ ( j ) η + log 3 − 1 ν ( λ ) ǫ . Let us define accordingly B ( i, β ) , ( j, β ′ ) def = inf λ Ξ λ N ( inf α,γ ,α ′ ,γ ′ π j exp( − α ′ R ) ⊗ π i exp( − αR ) Ψ − λ N ( R ′ , M ′ ) − log e η 3 λ + C j ( α ′ , γ ′ ) − log e η 3 N β ′ α ′ sinh( γ ′ N ) − γ ′ + C i ( α, γ ) − log e η 3 γ − N β α sinh( γ N ) + S λ ( i, β ) , ( j, β ′ ) ) , where e η = ν ( λ ) ν ( α ) ν ( γ ) ν ( β ) ν ( α ′ ) ν ( γ ′ ) ν ( β ′ ) µ ( i ) µ ( j ) η . Prop osition 2. 2 .7 . • With P pr ob ability at le ast 1 − η , for any β ∈ R + and i ∈ N , C ( π i exp( − β r ) ) ≤ C ( i, β ) ; • With P pr ob ability at le ast 1 − 3 η , for any λ, β , β ′ ∈ R + , any i, j ∈ N , S λ ( i, β ) , ( j, β ′ ) ≤ S λ ( i, β ) , ( j, β ′ ) ; • With P pr ob ability at le ast 1 − 4 η , for any i, j ∈ N , any β , β ′ ∈ R + , B ( π i exp( − β r ) , π j exp( − β ′ r ) ) ≤ B ( i, β ) , ( j, β ′ ) . It is also int eres ting to find a non-ra ndo m low er b ound for C ( π i exp( − β r ) ). Let us start from the fact that with P probability at least 1 − η , π i exp( − αR ) ⊗ π i exp( − αR ) Φ γ ′ N ( M ′ ) ≤ π i exp( − αR ) ⊗ π i exp( − αR ) ( m ′ ) − log( η ) γ ′ . On the other hand, we alr eady proved that with P pro babilit y at least 1 − η , 0 ≤ 1 − α N tanh( λ N ) K ρ, π i exp( − αR ) ≤ α N tanh( λ N ) λ ρ ( r ) − π i exp( αR ) ( r ) + N log cosh( λ N ) ρ ⊗ π i exp( − αR ) ( m ′ ) − log( η ) + K ρ, π i − K π i exp( − αR ) , π i . Thu s fo r a n y ξ > 0, putting β = αλ N tanh( λ N ) , with P probability at least 1 − η , ξ π i exp( − αR ) ⊗ π i exp( − αR ) Φ γ ′ N ( M ′ ) 80 Chapter 2. Comp aring p osterior distributions to Gibbs priors ≤ π i exp( − αR ) log π i exp( − β r ) n exp h β N λ log cosh( λ N ) π i exp( − β r ) ( m ′ ) + ξ m ′ io − β λ + ξ γ ′ log η 2 ≤ log π i exp( − β r ) exp n β N λ log cosh( λ N ) π i exp( − β r ) ( m ′ ) o × π i exp( − β r ) n exp h β N λ log cosh( λ N ) π i exp( − β r ) ( m ′ ) + ξ m ′ io − 2 β λ + ξ γ ′ log η 2 ≤ 2 log π i exp( − β r ) exp nh ξ + β N λ log cosh( λ N ) i π i exp( − β r ) ( m ′ ) o − 2 β λ + ξ γ ′ log η 2 ≤ 2 log π i exp( − β r ) exp nh ξ + β λ 2 N i π i exp( − β r ) ( m ′ ) o − 2 β λ + ξ γ ′ log η 2 . T a king ξ = β λ 2 N , w e get with P pro babilit y a t lea st 1 − η β λ 4 N π i exp[ − β N λ tanh( λ N ) R ] ⊗ 2 h Φ γ ′ N M ′ i ≤ log π i exp( − β r ) exp n β λ N π i exp( − β r ) ( m ′ ) o − 2 β λ + β λ 2 N γ ′ log η 2 . 
Putting λ = N 2 γ log cosh( γ N ) and Υ( γ ) def = γ tanh N γ log cosh( γ N ) N log cosh( γ N ) ∼ γ → 0 1 , this can be rewritten as β N 4 γ log cosh( γ N ) π i exp( − β Υ( γ ) R ) ⊗ 2 h Φ γ ′ N M ′ i ≤ log π i exp( − β r ) exp n β N γ log cosh( γ N ) π i exp( − β r ) ( m ′ ) o − 2 β γ N 2 log cosh( γ N ) + β N log cosh( γ N ) 2 γ γ ′ log η 2 . It is now tempting to simplify the picture a little bit by setting γ ′ = γ , leading to 2.2. Playing with two p osterior and two lo c al prior distributions 81 Prop osition 2. 2 .8 . With P pr ob ability at le ast 1 − η , for any i ∈ N , any β ∈ R + , C π i exp( − β r ) ≥ C ( i, β ) def = 1 ζ − 1 ( N 4 log cosh( ζ β N ) π i exp( − β Υ( ζ β ) R ) ⊗ 2 h Φ ζ β N M ′ i + 2 ζ 2 β 2 N 2 log cosh( ζ β N ) + N log cosh( ζ β N ) 2 ζ β ! log 2 − 1 ν ( β ) µ ( i ) η − ( ζ + 1 ) n log ν ( β ) µ ( i ) + 2 − 1 log 3 − 1 ǫ o ) , wher e C π i exp( − β r ) is define d by e quation (2.18, p age 75). W e are now going to analys e Theor em 2.2.4 (page 73). F o r this, we will also need an upp er b ound for S λ ( ρ, ρ ′ ), defined by equation (2.13, pag e 71), using M ′ and empirical complexities, b ecause of the sp e c ia l relations b etw een empirical complex- ities induced by the selection algorithm. T o this purpose, a us eful alternative to Prop osition 2 .2.6 (page 76) is to write, with P probabilit y at least 1 − η , γ ρ ⊗ ρ ′ ( m ′ ) ≤ γ ρ ⊗ ρ ′ Φ − γ N M ′ + K ρ, π i exp( − λR ) + K ρ ′ , π j exp( − λ ′ R ) − log( η ) , and thus at least with P pro ba bilit y 1 − 3 η , γ ρ ⊗ ρ ′ ( m ′ ) ≤ γ ρ ⊗ ρ ′ Φ − γ N M ′ + (1 − ζ − 1 ) − 1 K ρ, π i exp( − β r ) + log n π i exp( − β r ) h exp N ζ log cosh ζ β N ρ ( m ′ ) io − ζ − 1 log( η ) + (1 − ζ − 1 ) − 1 K ρ, π j exp( − β ′ r ) + log n π j exp( − β ′ r ) h exp N ζ log cosh ζ β ′ N ρ ( m ′ ) io − ζ − 1 log( η ) − log( η ) . When ρ = π i exp( − β r ) and ρ ′ = π j exp( − β ′ r ) , w e get with P pr obabilit y at least 1 − η , for any β , β ′ , γ ∈ R + , any i , j ∈ N , γ ρ ⊗ ρ ′ ( m ′ ) ≤ γ ρ ⊗ ρ ′ Φ − γ N M ′ + C ( ρ ) + C ( ρ ′ ) − ζ + 1 ζ − 1 log 3 − 1 ν ( γ ) η . Prop osition 2. 2 .9 . With P pr ob ability at le ast 1 − η , for any ρ = π i exp( − β r ) , any ρ ′ = π j exp( − β ′ r ) ∈ P , 82 Chapter 2. Comp aring p osterior distributions to Gibbs priors S λ ( ρ, ρ ′ ) ≤ N λ log cosh( λ N ) ρ ⊗ ρ ′ Φ − γ N M ′ + 1 + N γ log cosh( λ N ) λ C ( ρ ) + C ( ρ ′ ) − ( ζ + 1) ( ζ − 1) λ log 3 − 1 ν ( λ ) ǫ + N γ log cosh λ N log 3 − 1 ν ( γ ) η . In or der to ana ly s e Theor em 2.2.4 (pag e 73), we need to index P = ρ 1 , . . . , ρ M in order of increasing empirical complexit y C ( ρ ). T o deal in a conv enient wa y with this indexation, we will wr ite C ( i, β ) as C π i exp( − β r ) , C ( i, β ) as C π i exp( − β r ) , and S ( i, β ) , ( j, β ′ ) as S π i exp( − β r ) , π j exp( − β ′ r ) . With P probability at least 1 − ǫ , when b t ≤ j < b k , a s we already s a w, ρ b k ( R ) ≤ ρ i ( R ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λ N 2 S λ ( ρ j , ρ i ) , where i = t ( j ) < b t . Therefore, with P probability at least 1 − ǫ − η , ρ i ( R ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λ N ( 2 N λ log cosh λ N ρ j ⊗ ρ i Φ − γ N M ′ + 4 1 + N γ log cosh λ N λ C ( ρ j ) − ( ζ + 1) ( ζ − 1) λ log 3 − 1 ν ( λ ) ǫ + N γ log cosh λ N log 3 − 1 ν ( γ ) η ) . W e ca n no w remark that Ξ a ( p + q ) ≤ Ξ a ( p ) + q Ξ ′ a ( p ) q ≤ Ξ a ( p ) + Ξ ′ a (0) q = Ξ a ( p ) + a tanh( a ) q and that Φ − a ( p + q ) ≤ Φ − a ( p ) + Φ ′ − a (0) q = Φ − a ( p ) + exp( a ) − 1 a q . 
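The two first-order upper bounds just invoked follow from the concavity of $\Xi_a$ and $\Phi_{-a}$; the short check below makes them concrete, taking $\Phi_{-a}(p) = a^{-1}\log[1+(e^a-1)p]$, which is our reading of the function $\Phi$ of equation (1.1) and is consistent with the manipulations above.

```python
import numpy as np

def Xi(a, q):
    # Xi_a(q) = [1 - exp(-a q)] / tanh(a), eq. (2.11)
    return (1.0 - np.exp(-a * q)) / np.tanh(a)

def Phi_minus(a, p):
    # Phi_{-a}(p) = log(1 + (e^a - 1) p) / a  (our reading of the function of eq. (1.1))
    return np.log1p(np.expm1(a) * p) / a

# numeric check of the two concavity inequalities invoked just above
rng = np.random.default_rng(1)
a = 0.3
p, q = rng.uniform(0, 0.5, 1000), rng.uniform(0, 0.5, 1000)
assert np.all(Xi(a, p + q) <= Xi(a, p) + (a / np.tanh(a)) * q + 1e-12)
assert np.all(Phi_minus(a, p + q) <= Phi_minus(a, p) + (np.expm1(a) / a) * q + 1e-12)
print("both first-order upper bounds hold on the sample")
```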
Moreov er, a ssuming as usual without substantial loss of generality that there ex ists e θ ∈ arg min Θ R , we can split M ′ ( θ, θ ′ ) ≤ M ′ ( θ, e θ ) + M ′ ( e θ, θ ′ ). L e t us then consider the exp e cte d mar gin funct io n defined b y ϕ ( y ) = s up θ ∈ Θ M ′ ( θ, e θ ) − y R ′ ( θ, e θ ) , y ∈ R + , and let us write for any y ∈ R + , ρ j ⊗ ρ i Φ − λ N M ′ ≤ ρ j ⊗ ρ i Φ − γ N M ′ ( ., e θ ) + y R ′ ( ., e θ ) + ϕ ( y ) ≤ ρ j Φ − λ N M ′ ( ., e θ ) + ϕ ( y ) + N y exp( γ N ) − 1 γ ρ i ( R ) − R ( e θ ) and 1 − 2 y N exp( γ N ) − 1 log cosh λ N γ tanh λ N ! ρ i ( R ) − R ( e θ ) 2.2. Playing with two p osterior and two lo c al prior distributions 83 ≤ ρ j ( R ) − R ( e θ ) + Ξ λ N ( 2 N λ log cosh λ N ρ j Φ − γ N M ′ ( ., e θ ) + ϕ ( y ) + 4 1 + N γ log cosh λ N λ C ( ρ j ) − 2( ζ + 1) ( ζ − 1) λ log 3 − 1 ν ( λ ) ǫ + N γ log cosh λ N log 3 − 1 ν ( γ ) η ) . With P pro babilit y at leas t 1 − ǫ − η , for any λ , γ , x , y ∈ R + , any j ∈ b t, . . . , b k − 1 , ρ b k ( R ) − R ( e θ ) ≤ ρ i ( R ) − R ( e θ ) ≤ 1 − 2 y N exp( γ N ) − 1 log cosh λ N γ tanh λ N ! − 1 ( 1 + 2 xN exp( γ N ) − 1 log cosh λ N γ tanh λ N ! ρ j ( R ) − R ( e θ ) + Ξ λ N 2 N λ log cosh λ N Φ − γ N ϕ ( x ) + ϕ ( y ) + 4 1 + N γ log cosh λ N λ C ( ρ j ) − 2( ζ + 1) ( ζ − 1) λ n log 3 − 1 ν ( λ ) ǫ + N γ log cosh λ N log 3 − 1 ν ( γ ) η o ) . Now we have to get an upp er bo und fo r ρ j ( R ). W e can write ρ j = π ℓ exp( − β ′ r ) , as we assumed that all the po sterior distributions in P are of this sp ecial for m. Moreover, we already kno w from Theorem 2 .1.8 (page 58) that with P probability a t le a st 1 − η , N sinh β ′ N − β ′ ζ − 1 π ℓ exp( − β ′ r ) ( R ) − π ℓ exp( − β ′ ζ − 1 R ) ( R ) ≤ C ℓ ( β ′ ζ − 1 , β ′ ) − log ν ( β ′ ) µ ( ℓ ) η . This prov es that with P probability at least 1 − ǫ − 2 η , ρ b k ( R ) ≤ R ( e θ ) + 1 − 2 y N exp γ N − 1 log cosh λ N γ tanh λ N − 1 ( 1 + 2 xN exp γ N − 1 log cosh λ N γ tanh λ N × π ℓ exp( − ζ − 1 β ′ R ) ( R ) − R ( e θ ) + C ℓ ( ζ − 1 β ′ , β ′ ) − log ν ( β ′ ) µ ( ℓ ) η N sinh( β ′ N ) − ζ − 1 β ′ ! + Ξ λ N 2 N λ log cosh λ N Φ − γ N ϕ ( x ) + ϕ ( y ) + 4 1 + N γ log cosh λ N λ C ( ℓ, β ′ ) 84 Chapter 2. Comp aring p osterior distributions to Gibbs priors − 2( ζ + 1) ( ζ − 1) λ n log 3 − 1 ν ( λ ) ǫ + N γ log cosh λ N log 3 − 1 ν ( γ ) η o ) . The case when j ∈ b k + 1 , . . . , M \ (arg max t ) is dealt with exactly in the same wa y , with i = t ( j ) replaced directly with b k itself, lea ding to the s ame inequality . The cas e when j ∈ (arg max t ) is dealt with bo unding first ρ b k ( R ) − R ( e θ ) in terms of ρ b t ( R ) − R ( e θ ), a nd this latter in terms o f ρ j ( R ) − R ( e θ ). L e t us put A ( λ, γ ) = 1 − 2 xN exp γ N − 1 log cosh λ N γ tanh λ N ! , B ( λ, γ ) = 1 + 2 y N exp γ N − 1 log cosh λ N γ tanh λ N , D ( λ, γ , ρ j ) = Ξ λ N 2 N λ log cosh λ N Φ − γ N ϕ ( x ) + ϕ ( y ) +4 1 + N γ log cosh λ N λ C ( ρ j ) − 2( ζ + 1) ( ζ − 1) λ n log 3 − 1 ν ( λ ) ǫ + N γ log cosh λ N log 3 − 1 ν ( γ ) η o , (2.20) where C ( ρ j ) = C ( ℓ, β ′ ) is defined, when ρ j = π ℓ exp( − β ′ r ) , b y equatio n (2.19, page 77). W e o btain, s till with P proba bilit y 1 − ǫ − 2 η , ρ b k ( R ) − R ( e θ ) ≤ B ( λ, γ ) A ( λ, γ ) ρ b t ( R ) − R ( e θ ) + D ( λ, γ , ρ j ) A ( λ, γ ) , ρ b t ( R ) − R ( e θ ) ≤ B ( λ, γ ) A ( λ, γ ) ρ j ( R ) − R ( e θ ) + D ( λ, γ , ρ j ) A ( λ, γ ) . The use of the factor D ( λ, γ , ρ j ) in the first of these tw o inequa lities, instead of D ( λ, γ , ρ b t ), is justified b y the fact that C ( ρ b t ) ≤ C ( ρ j ). 
Combinin g the tw o we get ρ b k ( R ) ≤ R ( e θ ) + B ( λ, γ ) 2 A ( λ, γ ) 2 ρ j ( R ) − R ( e θ ) + B ( λ, γ ) A ( λ, γ ) + 1 D ( λ, γ , ρ j ) A ( λ, γ ) . Since it is the worst b ound of all cases, it holds for any v alue of j , proving Theorem 2 . 2.10 . With P pr ob ability at le ast 1 − ǫ − 2 η , ρ b k ( R ) ≤ R ( e θ ) + inf i,β ,λ,γ ,x,y ( B ( λ, γ ) 2 A ( λ, γ ) 2 h π i exp( − β r ) ( R ) − R ( e θ ) i + B ( λ, γ ) A ( λ, γ ) + 1 D ( λ, γ , π i exp( − β r ) ) A ( λ, γ ) ) 2.2. Playing with two p osterior and two lo c al prior distributions 85 ≤ R ( e θ ) + inf i,β ,λ,γ ,x,y ( B ( λ, γ ) 2 A ( λ, γ ) 2 π i exp( − ζ − 1 β R ) ( R ) − R ( e θ ) + C i ( ζ − 1 β , β ) − log ν ( β ) µ ( i ) η N sinh β N − ζ − 1 β ! + B ( λ, γ ) A ( λ, γ ) + 1 D ( λ, γ , π i exp( − β r ) ) A ( λ, γ ) ) , wher e the notation A ( λ, γ ) , B ( λ, γ ) and D ( λ, γ , ρ ) is define d by e quation (2.20 p age 84) and wher e the notation C i ( β , γ ) is define d in Pr op osition 2.2.5 (p age 76). The b ound is a little inv olved, but as we will prov e next, it gives the sa me rate as T heo rem 2.1.1 5 (page 66) and its corolla ries, when we w ork with a single model (meaning that the s upp ort of µ is reduced to o ne p o in t) and the g oal is to choose adaptively the temperature of the Gibbs pos terior, except for the app earance o f the union b ound factor − log ν ( β ) which can b e made of or der log log( N ) without sp o iling the o r der o f magnitude of the b ound. W e will encompa s s the case when one m ust choose betw een po ssibly sev era l parametric mo dels. Let us ass ume that each π i is supp orted by some mea surable parameter subset Θ i ( meaning that π i (Θ i ) = 1), let us also a ssume that the behaviour o f π i is par ametric in the sense that there exists a dimension d i ∈ R + such that (2.21) sup β ∈ R + β π i exp( − β R ) ( R ) − inf Θ i R ≤ d i . Then C i ( λ, γ ) ≤ log π i exp( − λR ) exp n 2 N sinh γ 2 N 2 M ′ ( ., e θ ) o + 2 N s inh γ 2 N 2 π i exp( − λR ) M ′ ( ., e θ ) ≤ log π i exp( − λR ) exp 2 xN sinh γ 2 N 2 R − R ( e θ ) o + 2 xN sinh γ 2 N 2 π i exp( − λR ) R − R ( e θ ) + 4 N s inh γ 2 N 2 ϕ ( x ) ≤ 2 xN sinh γ 2 N 2 π i exp {− [ λ − 2 xN sinh( γ 2 N ) 2 ] R } R − R ( e θ ) + 2 xN sinh γ 2 N 2 π i exp( − λR ) R − R ( e θ ) + 4 N s inh γ 2 N 2 ϕ ( x ) . Thu s C i ( λ, γ ) ≤ 4 N sinh γ 2 N 2 x inf Θ i R − R ( e θ ) + ϕ ( x ) + xd i 2 λ + xd i 2 λ − 4 xN sinh γ 2 N 2 ! . In the same wa y , 86 Chapter 2. Comp aring p osterior distributions to Gibbs priors C ( i, β ) ≤ 8 N ζ − 1 sinh ζ β 2 N 2 " x inf Θ i R − R ( e θ ) + ϕ ( x ) + ζ xd i 2 N s inh ζ β N 1 + 1 1 − xζ tanh ζ β 2 N # + 2 N h exp ζ 2 β 2 2 N 2 ( ζ − 1) − 1 i ϕ ( x ) + x inf Θ i R − R ( e θ ) + xζ d i N sinh ζ β N − xζ N exp ζ 2 β 2 2 N 2 ( ζ − 1) − 1 ! − ( ζ + 1) ( ζ − 1) 2 log ν ( β ) µ ( i ) + log η 2 . In or der to keep the right o rder of ma gnitude while simplifying the b ound, let us consider (2.22) C 1 = max ζ − 1 , 2 N ζ β max 2 sinh ζ β max 2 N 2 , 2 N 2 ( ζ − 1) ζ 2 β 2 max h exp ζ 2 β 2 max 2 N 2 ( ζ − 1) − 1 i . Then, for any β ∈ (0 , β max ), C ( i, β ) ≤ inf y ∈ R + 3 C 1 ζ 2 β 2 ( ζ − 1) N " y inf Θ i R − R ( e θ ) + ϕ ( y ) + y d i β 1 − y C 1 ζ 2 β 2( ζ − 1) N # − ( ζ + 1) ( ζ − 1) 2 log ν ( β ) µ ( i ) + log η 2 . Thu s D λ, γ , π i exp( − β r ) ≤ λ N tanh λ N ( λ exp γ N − 1 γ ϕ ( x ) + ϕ ( y ) + 4 1 + λ 2 2 N γ λ " 3 C 1 ζ 2 β 2 ( ζ − 1) N z inf Θ i R − R ( e θ ) + ϕ ( z ) + z d i β 1 − z C 1 ζ 2 β 2( ζ − 1) N ! 
− ( ζ + 1 ) ( ζ − 1 ) 2 log ν ( β ) µ ( i ) + log η 2 # − 2( ζ + 1) ( ζ − 1) λ log 3 − 1 ν ( λ ) ǫ + λ 2 2 N γ log 3 − 1 ν ( γ ) η ) If w e are not seeking tight co nstan ts, w e can take for the sake o f simplicity λ = γ = β , x = y and ζ = 2. Let us put (2.23) C 2 = max C 1 , N exp β max N − 1 β max , 2 N log cosh β max N β max tanh β max N , β max N tanh β max N , 2.2. Playing with two p osterior and two lo c al prior distributions 87 so that A ( β , β ) − 1 ≤ 1 − C 2 xβ N − 1 , B ( β , β ) ≤ 1 + C 2 xβ N , D β , β , π i exp( − β r ) ≤ C 2 2 2 β N ϕ ( x ) + 4 + 2 β N C 2 β " 12 C 1 β 2 N z inf Θ i R − R ( e θ ) + ϕ ( z ) + z d i β 1 − 2 z C 1 β N ! − 6 log ν ( β ) µ ( i ) − 3 log η 2 # − 6 C 2 β log 3 − 1 ν ( β ) ǫ + β 2 N log 3 − 1 ν ( β ) η and C i ( ζ − 1 β , β ) ≤ C 1 β 2 N x inf Θ i R − R ( e θ ) + ϕ ( x ) + 2 xd i β 1 − xβ N . This leads to ρ b k ( R ) ≤ R ( e θ ) + inf i,β 1 + C 2 xβ N 1 − C 2 xβ N ! 2 ( 2 d i β + inf Θ i R − R ( e θ ) + 2 β " C 1 β 2 N x inf Θ i R − R ( e θ + ϕ ( x ) + 2 xd i β 1 − xβ N − log ν ( β ) µ ( i ) η #) + 2 1 − C 2 xβ N 2 ( C 2 2 2 β N ϕ ( x ) + 4 + 2 β N C 2 β 12 C 1 β 2 N x inf Θ i R − R ( e θ ) + ϕ ( x ) + xd i β 1 − 2 xC 1 β N − 6 log ν ( β ) µ ( i ) − 3 log η 2 − 6 C 2 β log 3 − 1 ν ( β ) ǫ + β 2 N log 3 − 1 ν ( β ) η ) . W e s e e in this e x pression that, in order to balance the v arious facto r s dep ending on x it is advisable to ch o ose x such that inf Θ i R − R ( e θ ) = ϕ ( x ) x , as long as x ≤ N 4 C 2 β . 88 Chapter 2. Comp aring p osterior distributions to Gibbs priors F ollowing Ma mmen a nd Tsybakov, let us assume that the usual marg in assump- tion holds: for some real constants c > 0 and κ ≥ 1, R ( θ ) − R ( e θ ) ≥ c D ( θ, e θ ) κ . As D ( θ, e θ ) ≥ M ′ ( θ, e θ ), this also implies the w eaker a ssumption R ( θ ) − R ( e θ ) ≥ c M ′ ( θ, e θ ) κ , θ ∈ Θ , which we will r eally need and use. Let us take β max = N and ν = 1 ⌈ log 2 ( N ) ⌉ ⌈ log 2 ( N ) ⌉ X k =1 δ 2 k . Then, as we hav e alrea dy seen, ϕ ( x ) ≤ (1 − κ − 1 ) κcx − 1 κ − 1 . Thus ϕ ( x ) /x ≤ bx − κ κ − 1 , where b = (1 − κ − 1 ) κc − 1 κ − 1 . Let us ch o ose accordingly x = min x 1 def = inf Θ i R − R ( e θ ) b − κ − 1 κ , x 2 def = N 4 C 2 β . Using the fact that when r ∈ (0 , 1 2 ), 1+ r 1 − r 2 ≤ 1 + 16 r ≤ 9 , we get with P pr obabilit y at least 1 − ǫ , for any β ∈ supp ν , in the case when x = x 1 ≤ x 2 , ρ b k ( R ) ≤ inf Θ i R + 538 C 2 2 β N b κ − 1 κ inf Θ i R − R ( e θ ) 1 κ + C 2 β 138 d i + 166 lo g 1 + lo g 2 ( N ) − 134 lo g µ ( i ) − 102 lo g( ǫ ) + 724 , and in the case when x = x 2 ≤ x 1 , ρ b k ( R ) ≤ inf Θ i R + 68 C 1 inf Θ i R − R ( e θ ) + 269 C 2 2 β N ϕ ( x ) + C 2 β 138 d i + 166 lo g 1 + lo g 2 ( N ) − 134 lo g µ ( i ) − 102 lo g ( ǫ ) + 72 4 ≤ inf Θ i R + 54 1 C 2 2 β N ϕ ( x ) + C 2 β 138 d i + 166 lo g 1 + lo g 2 ( N ) − 134 lo g µ ( i ) − 102 lo g( ǫ ) + 724 . Thu s with P pro babilit y at least 1 − ǫ , ρ b k ( R ) ≤ inf Θ i R + inf β ∈ (1 ,N ) 1082 C 2 2 β N max b κ − 1 κ inf Θ i R − R ( e θ ) 1 κ , b 4 C 2 β N 1 κ − 1 + C 2 β 138 d i + 166 lo g 1 + lo g 2 ( N ) − 134 lo g µ ( i ) − 102 lo g( ǫ ) + 724 . 2.3. Two step lo c alization 89 Theorem 2 . 2.11 . 
With pr ob ability at le ast 1 − ǫ , for any i ∈ N , ρ b k ( R ) ≤ inf Θ i R + max 847 C 3 2 2 v u u t b κ − 1 κ inf Θ i R − R ( e θ ) 1 κ n d i + log 1+log 2 ( N ) ǫµ ( i ) + 5 o N , 2 C 2 1082 b κ − 1 2 κ − 1 4 1 2 κ − 1 166 C 2 h d i + log 1+log 2 ( N ) ǫµ ( i ) + 5 i N κ 2 κ − 1 , wher e C 2 , given by e quation (2.23 p age 86), wil l in most c ases b e close to 1 , and in any c ase less than 3 . 2 . This result gives a bo und o f the same form as that given in Theor em 2.1.15 (page 66) in the specia l case when there is only one mo del — that is when µ is a Dirac mass, for instance µ (1) = 1 , implying that R ( e θ 1 ) − R ( e θ ) = 0 . Morov er the para metr ic complexity assumption we ma de for this theorem, given by equation (2.21 pag e 85), is weaker than the one used in Theo rem 2.1 .15 and describ ed by equation (2.8, page 63). When there is more than one mo del, the b ound shows that the estimator makes a trade-o ff b etw een mo del accuracy , represented by inf Θ i R − R ( e θ ), a nd dimension, represented b y d i , and that for o ptimal parametric sub-mo dels, meaning those for which inf Θ i R = inf Θ R , the estimator do es at least as w ell a s the minimax optimal conv ergence s p eed in the b est of these. Another point is that we obtain more explicit constants than in Theor e m 2.1.15. It is a lso clear that a more careful choice of parameter s could hav e broug ht s o me improv ement in the v a lue of these constants. These results show that the sele c tio n s c heme describ ed in this section is a go o d candidate to perfor m temp erature selection o f a Gibbs p osterior distribution built within a single parametric mo del in a rate optimal w ay , as w ell as a prop osal with prov en p erformance b ound for mo del selection. 2.3. Two step lo calization 2.3.1. Tw o step lo c ali zation of b ounds r elative to a Gibbs prior Let us r econsider the case wher e we wan t to choo se adaptively among a family of parametric mo dels. Let us thu s assume that the parameter set is a disjoint unio n of measurable sub-mo dels, s o that w e can wr ite Θ = ⊔ m ∈ M Θ m , where M is some measurable index set. Let us ch o ose some prior probability distribution on the index set µ ∈ M 1 + ( M ), and some reg ular conditional prior distribution π : M → M 1 + (Θ), such that π ( i, Θ i ) = 1, i ∈ M . Let us then study s o me ar bitrary p osterior distributions ν : Ω → M 1 + ( M ) a nd ρ : Ω × M : → M 1 + (Θ), such that ρ ( ω , i, Θ i ) = 1, ω ∈ Ω, i ∈ M . W e would like to co mpare ν ρ ( R ) with some doubly lo calized prior distribution µ exp[ − β 1+ ζ 2 π exp( − βR ) ( R )] π exp( − β R ) ( R ) (where ζ 2 is a positive para meter to b e set as needed later on). T o ease notation w e will define t wo prior distributions 90 Chapter 2. Comp aring p osterior distributions to Gibbs priors (one b eing more precisely a conditional distribution) dep ending on the p ositiv e real parameters β a nd ζ 2 , putting (2.24) π = π exp( − β R ) and µ = µ exp[ − β 1+ ζ 2 π ( R )] . Similarly to Theo rem 1.4.3 on page 37 we can write for any p o sitiv e real co nstan ts β and γ P ( µ π ) ⊗ ( µ π ) exp h − N log 1 − ta nh( γ N ) R ′ − γ r ′ − N log cosh( γ N ) m ′ i ≤ 1 , and deduce, using Lemma 1.1.3 on page 4, that (2.25) P exp sup ν ∈ M 1 + ( M ) sup ρ : M → M 1 + (Θ) n − N log 1 − ta nh( γ N )( ν ρ − µ π )( R ) − γ ( ν ρ − µ π )( r ) − N log cosh( γ N ) ( ν ρ ) ⊗ ( µ π )( m ′ ) − K ( ν, µ ) − ν K ( ρ, π ) o ≤ 1 . This will b e our starting p oint in comparing ν ρ ( R ) with µ π ( R ). 
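To make the doubly localized prior of equation (2.24) concrete, the following sketch builds it on a finite collection of sub-models, each carried by a finite parameter grid: a Gibbs localization $\overline{\pi}_m$ within each model, then a Gibbs localization $\overline{\mu}$ over models driven by the localized risks $\overline{\pi}_m(R)$. All names and the toy inputs are ours; the expected risk $R$ is of course not observable, so this illustrates the structure of the comparison prior, not an estimator.

```python
import numpy as np

def two_level_gibbs(mu, priors, risks, beta, zeta2):
    # Doubly localized prior of eq. (2.24):
    #   within model m:  pi_bar_m  proportional to  pi_m(theta) * exp(-beta * R(theta)),
    #   over models:     mu_bar(m) proportional to  mu(m) * exp(-beta/(1+zeta2) * pi_bar_m(R)).
    # priors[m] and risks[m] are arrays over the finite parameter grid of model m.
    pi_bar, gibbs_risk = [], []
    for pr, R in zip(priors, risks):
        w = pr * np.exp(-beta * R)
        w /= w.sum()
        pi_bar.append(w)
        gibbs_risk.append(float(w @ R))          # pi_bar_m(R), the localized risk of model m
    mu_bar = mu * np.exp(-beta / (1.0 + zeta2) * np.array(gibbs_risk))
    mu_bar /= mu_bar.sum()
    return mu_bar, pi_bar

# toy usage: two sub-models with different best risks and sizes
rng = np.random.default_rng(3)
risks = [0.2 + 0.6 * rng.uniform(size=50) ** 2, 0.15 + 0.7 * rng.uniform(size=500) ** 2]
priors = [np.full(50, 1 / 50), np.full(500, 1 / 500)]
mu_bar, _ = two_level_gibbs(np.array([0.5, 0.5]), priors, risks, beta=40.0, zeta2=1.0)
print(mu_bar)   # the model-level localization favours the better localized risk
```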
Howev er, obtaining an empirical b ound will require some supplementary efforts. F or each index o f the mo del index set M , w e can write in the same wa y P π ⊗ π exp h − N log 1 − ta nh( γ N ) R ′ − γ r ′ − N log cosh( γ N ) m ′ i ≤ 1 . In tegr ating this inequality with respe c t to µ and using F ubini’s lemma for p ositive functions, we get P µ ( π ⊗ π ) exp h − N log 1 − ta nh( γ N ) R ′ − γ r ′ − N log cosh( γ N ) m ′ i ≤ 1 . Note that µ ( π ⊗ π ) is a proba bilit y meas ur e on M × Θ × Θ, whereas ( µ π ) ⊗ ( µ π ) considered previo usly is a probability measure on ( M × Θ) × ( M × Θ). W e get a s previously (2.26) P exp sup ν ∈ M 1 + ( M ) sup ρ : M → M 1 + (Θ) n − N log 1 − ta nh( γ N ) ν ( ρ − π )( R ) − γ ν ( ρ − π )( r ) − N log cosh( γ N ) ν ( ρ ⊗ π )( m ′ ) − K ( ν, µ ) − ν K ( ρ, π ) o ≤ 1 . Let us finally recall that K ( ν, µ ) = β 1+ ζ 2 ( ν − µ ) π ( R ) + K ( ν , µ ) − K ( µ, µ ) , (2.27) K ( ρ, π ) = β ( ρ − π )( R ) + K ( ρ, π ) − K ( π , π ) . (2.28) F ro m equations (2.2 5), (2.2 6) and (2.28) we deduce 2.3. Two step lo c alization 91 Prop osition 2. 3 .1 . F or any p ositive r e al c onstants β , γ and ζ 2 , with P pr ob ability at le ast 1 − ǫ , for any p osterior distribution ν : Ω → M 1 + ( M ) and any c onditional p osterior distribution ρ : Ω × M → M 1 + (Θ) , − N log 1 − ta nh( γ N )( ν ρ − µ π )( R ) − β ν ( ρ − π )( R ) ≤ γ ( ν ρ − µ π )( r ) + N log cosh( γ N ) ( ν ρ ) ⊗ ( µ π )( m ′ ) + K ( ν, µ ) + ν K ( ρ, π ) − ν K ( π , π ) + log 2 ǫ . and − N log 1 − ta nh( γ N ) ν ( ρ − π )( R ) ≤ γ ν ( ρ − π )( r ) + N log cosh( γ N ) ν ( ρ ⊗ π )( m ′ ) + K ( ν, µ ) + ν K ( ρ, π ) + log 2 ǫ , wher e the prior distribution µ π is define d by e quation (2 .24 ) on p age 90 and dep ends on β and ζ 2 . Let us put for short T = tanh( γ N ) and C = N log cosh( γ N ) . W e will use an en tropy comp ensation strategy for whic h w e need a couple of ent ro p y bounds. W e hav e according to Prop osition 2.3.1, wit h P probability at least 1 − ǫ , ν K ( ρ, π ) = β ν ( ρ − π )( R ) + ν K ( ρ, π ) − K ( π , π ) ≤ β N T γ ν ( ρ − π )( r ) + C ν ( ρ ⊗ π )( m ′ ) + K ( ν, µ ) + ν K ( ρ, π ) + log( 2 ǫ ) + ν K ( ρ, π ) − K ( π , π ) . Similarly K ( ν, µ ) = β 1 + ζ 2 ( ν − µ ) π ( R ) + K ( ν , µ ) − K ( µ, µ ) ≤ β (1 + ζ 2 ) N T γ ( ν − µ ) π ( r ) + C ( ν π ) ⊗ ( µ π )( m ′ ) + K ( ν, µ ) + log( 2 ǫ ) + K ( ν, µ ) − K ( µ, µ ) . Thu s, for any po sitiv e real co nstan ts β , γ and ζ i , i = 1 , . . . , 5, with P pro babilit y at least 1 − ǫ , for an y p osterior distributions ν, ν 3 : Ω → M 1 + (Θ), an y pos terior conditional distributions ρ, ρ 1 , ρ 2 , ρ 4 , ρ 5 : Ω × M → M 1 + (Θ), − N log 1 − T ( ν ρ − µ π )( R ) − β ν ( ρ − π )( R ) ≤ γ ( ν ρ − µ π )( r ) + C ( ν ρ ) ⊗ ( µ π )( m ′ ) + K ( ν, µ ) + ν K ( ρ, π ) − K ( π , π ) + log( 2 ǫ ) , 92 Chapter 2. 
Comp aring p osterior distributions to Gibbs priors ζ 1 N T β µ K ( ρ 1 , π ) ≤ ζ 1 γ µ ( ρ 1 − π )( r ) + ζ 1 C µ ( ρ 1 ⊗ π )( m ′ ) + ζ 1 µ K ( ρ 1 , π ) + ζ 1 log( 2 ǫ ) + ζ 1 N T β µ K ( ρ 1 , π ) − K ( π , π ) , ζ 2 N T β ν K ( ρ 2 , π ) ≤ ζ 2 γ ν ( ρ 2 − π )( r ) + ζ 2 C ν ( ρ 2 ⊗ π )( m ′ ) + ζ 2 K ( ν, µ ) + ζ 2 ν K ( ρ 2 , π ) + ζ 2 log( 2 ǫ ) + ζ 2 N T β ν K ( ρ 2 , π ) − K ( π , π ) , ζ 3 (1 + ζ 2 ) N T β K ( ν 3 , µ ) ≤ ζ 3 γ ( ν 3 − µ ) π ( r ) + ζ 3 C ( ν 3 π ) ⊗ ( ν 3 ρ 1 ) + ( ν 3 ρ 1 ) ⊗ ( µ π ) ( m ′ ) + ζ 3 K ( ν 3 , µ ) + ζ 3 log( 2 ǫ ) + ζ 3 (1 + ζ 2 ) N T β K ( ν 3 , µ ) − K ( µ, µ ) , ζ 4 N T β ν 3 K ( ρ 4 , π ) ≤ ζ 4 γ ν 3 ( ρ 4 − π )( r ) + ζ 4 C ν 3 ( ρ 4 ⊗ π )( m ′ ) + ζ 4 K ( ν 3 , µ ) + ζ 4 ν 3 K ( ρ 4 , π ) + ζ 4 log( 2 ǫ ) + ζ 4 N T β ν 3 K ( ρ 4 , π ) − K ( π , π ) , ζ 5 N T β µ K ( ρ 5 , π ) ≤ ζ 5 γ µ ( ρ 5 − π )( r ) + ζ 5 C µ ( ρ 5 ⊗ π )( m ′ ) + ζ 5 µ K ( ρ 5 , π ) + ζ 5 log( 2 ǫ ) + ζ 5 N T β µ K ( ρ 5 , π ) − K ( π , π ) . Adding these six inequalities and assuming that (2.29) ζ 4 ≤ ζ 3 (1 + ζ 2 ) N T β − 1 , we find − N log 1 − T ( ν ρ − µ π )( R ) − β ( ν ρ − µ π )( R ) ≤ − N log 1 − T ( ν ρ − µ π )( R ) − β ( ν ρ − µ π )( R ) + ζ 1 N T β − 1 µ K ( ρ 1 , π ) + ζ 2 N T β − 1 ν K ( ρ 2 , π ) + ζ 3 (1 + ζ 2 ) N T β − ζ 3 − ζ 4 K ( ν 3 , µ ) + ζ 4 N T β − 1 ν 3 K ( ρ 4 , π ) + ζ 5 N T β − 1 µ K ( ρ 5 , π ) ≤ γ ( ν ρ − µ π )( r ) + ζ 1 γ µ ( ρ 1 − π )( r ) + ζ 2 γ ν ( ρ 2 − π )( r ) + ζ 3 γ ( ν 3 − µ ) π ( r ) + ζ 4 γ ν 3 ( ρ 4 − π )( r ) + ζ 5 γ µ ( ρ 5 − π )( r ) + C ( ν ρ ) ⊗ ( µ π ) + ζ 1 µ ( ρ 1 ⊗ π ) + ζ 2 ν ( ρ 2 ⊗ π ) + ζ 3 ( ν 3 π ) ⊗ ( ν 3 ρ 1 ) + ζ 3 ( ν 3 ρ 1 ) ⊗ ( µ π ) + ζ 4 ν 3 ( ρ 4 ⊗ π ) + ζ 5 µ ( ρ 5 ⊗ π ) ( m ′ ) + (1 + ζ 2 ) K ( ν, µ ) − K ( µ, µ ) + ν K ( ρ, π ) − K ( π , π ) + ζ 1 N T β µ K ( ρ 1 , π ) − K ( π , π ) + ζ 2 N T β ν K ( ρ 2 , π ) − K ( π , π ) + ζ 3 (1 + ζ 2 ) N T β K ( ν 3 , µ ) − K ( µ, µ ) + ζ 4 N T β ν 3 K ( ρ 4 , π ) − K ( π , π ) + ζ 5 N T β µ K ( ρ 5 , π ) − K ( π , π ) + (1 + ζ 1 + ζ 2 + ζ 3 + ζ 4 + ζ 5 ) lo g ( 2 ǫ ) , 2.3. Two step lo c alization 93 where we hav e also used the fact (concerning the 11th line of the preceding inequal- ities) that − β ( ν ρ − µ π )( R ) + K ( ν, µ ) + ν K ( ρ, π ) ≤ − β ( ν ρ − µ π )( R ) + (1 + ζ 2 ) K ( ν, µ ) + ν K ( ρ, π ) = (1 + ζ 2 ) K ( ν, µ ) − K ( µ, µ ) + ν K ( ρ, π ) − K ( π , π ) . Let us now apply to π (we shall later do the same with µ ) the following inequalities, holding for a n y rando m functions of the sample and the pa rameters h : Ω × Θ → R and g : Ω × Θ → R , π ( g − h ) − K ( π , π ) ≤ sup ρ :Ω × M → M 1 + (Θ) ρ ( g − h ) − K ( ρ, π ) = lo g π exp( g − h ) = lo g π exp( − h ) + log π exp( − h ) exp( g ) = − π exp( − h ) ( h ) − K ( π exp( − h ) , π ) + log π exp( − h ) exp( g ) . When h and g are observ able, and h is not to o far f ro m β r ≃ β R , this giv es a wa y to repla ce π with a satisfactor y empirical a ppro ximation. W e will apply this method, cho osing ρ 1 and ρ 5 such that µ π is r e pla ced either with µρ 1 , when it comes from the fir st t wo inequa lities or with µρ 5 otherwise, choo sing ρ 2 such that ν π is replaced with ν ρ 2 and ρ 4 such that ν 3 π is repla ced with ν 3 ρ 4 . W e will do so b ecause it leads to a lot o f helpful ca ncellations. 
F or those to happen, we need to cho ose ρ i = π exp( − λ i r ) , i = 1 , 2 , 4, where λ 1 , λ 2 and λ 4 are such that (1 + ζ 1 ) γ = ζ 1 N T β λ 1 , (2.30) ζ 2 γ = 1 + ζ 2 N T β λ 2 , (2.31) ( ζ 4 − ζ 3 ) γ = ζ 4 N T β λ 4 , (2.32) ζ 3 γ = ζ 5 N T β λ 5 , (2.33) and to assume that (2.34) ζ 4 > ζ 3 . W e o btain that with P pro babilit y at least 1 − ǫ , − N log 1 − T ( µρ − µ π )( R ) − β ( ν ρ − µ π )( R ) ≤ γ ( ν ρ − µ ρ 1 )( r ) + ζ 3 γ ( ν 3 ρ 4 − µρ 5 )( r ) + ζ 1 N T β µ ( log " ρ 1 exp C β N T ζ 1 ν ρ + ζ 1 ρ 1 ( m ′ ) #) + 1 + ζ 2 N T β ν ( log ( ρ 2 exp C 1+ ζ 2 N T β ζ 2 ρ 2 ( m ′ ) #) + ζ 4 N T β ν 3 ( log " ρ 4 exp C β N T ζ 4 ζ 3 ν 3 ρ 1 + ζ 4 ρ 4 ( m ′ ) #) + ζ 5 N T β µ ( log " ρ 5 exp C β N T ζ 5 ζ 3 ν 3 ρ 1 + ζ 5 ρ 5 ( m ′ ) #) 94 Chapter 2. Comp aring p osterior distributions to Gibbs priors + (1 + ζ 2 ) K ( ν, µ ) − K ( µ, µ ) + ν K ( ρ, π ) − K ( ρ 2 , π ) + ζ 3 (1 + ζ 2 ) N T β K ( ν 3 , µ ) − K ( µ, µ ) + 1 + 5 X i =1 ζ i log 2 ǫ . In or der to obtain more cancella tions while r eplacing µ by so me p osterior distri- bution, we will choose the constan ts such that λ 5 = λ 4 , which can b e done by choosing (2.35) ζ 5 = ζ 3 ζ 4 ζ 4 − ζ 3 . W e ca n no w replace µ with µ exp − ξ 1 ρ 1 ( r ) − ξ 4 ρ 4 ( r ) , where ξ 1 = γ (1 + ζ 2 ) 1 + N T β ζ 3 , (2.36) ξ 4 = γ ζ 3 (1 + ζ 2 ) 1 + N T β ζ 3 . (2.37) Cho osing moreov er ν 3 = µ exp − ξ 1 ρ 1 ( r ) − ξ 4 ρ 4 ( r ) , to induce s ome mor e ca ncellations, we get Theorem 2 . 3.2 . L et us use the notation intr o duc e d ab ove. F or any p ositive r e al c onstants satisf ying e quations (2.29, p age 92), (2.30, p age 93), (2.31, p age 93), (2.32, p age 93 ) , (2.33, p age 93 ), (2.34, p age 93), (2.35, p age 94), (2.36, p age 94), (2.37, p age 94), with P pr ob ability at le ast 1 − ǫ , fo r any p osterior distribution ν : Ω → M 1 + ( M ) and any c onditional p osterior distribution ρ : Ω × M → M 1 + (Θ) , − N log 1 − T ( ν ρ − µ π )( R ) − β ( ν ρ − µ π )( R ) ≤ B ( ν , ρ, β ) , wher e B ( ν, ρ, β ) def = γ ( ν ρ − ν 3 ρ 1 )( r ) + (1 + ζ 2 ) 1 + N T β ζ 3 × log ( ν 3 " ρ 1 exp C β N T ζ 1 ν ρ + ζ 1 ρ 1 ( m ′ ) ζ 1 N T β (1+ ζ 2 )(1+ N T β ζ 3 ) × ρ 4 exp C β N T ζ 5 ζ 3 ν 3 ρ 1 + ζ 5 ρ 4 ( m ′ ) ζ 5 N T β (1+ ζ 2 )(1+ N T β ζ 3 ) #) + 1 + ζ 2 N T β ν ( log ( ρ 2 exp C 1+ ζ 2 N T β ζ 2 ρ 2 ( m ′ ) #) + ζ 4 N T β ν 3 ( log " ρ 4 exp C β N T ζ 4 ζ 3 ν 3 ρ 1 + ζ 4 ρ 4 ( m ′ ) #) + (1 + ζ 2 ) K ( ν, µ ) − K ( ν 3 , µ ) + ν K ( ρ, π ) − K ( ρ 2 , π ) + 1 + 5 X i =1 ζ i log 2 ǫ . This theorem can b e us e d to find the la rgest v alue b β ( ν ρ ) of β s uch that B ( ν, ρ, β ) ≤ 0, thus pr o viding an estimator for β ( ν ρ ) defined as ν ρ ( R ) = µ β ( ν ρ ) π β ( ν ρ ) ( R ), 2.3. Two step lo c alization 95 where we hav e mentioned explicitly the dep endence of µ a nd π in β , the constant ζ 2 staying fixed. The posterio r distribution ν ρ may then b e chosen to maximize b β ( ν ρ ) within some managea ble subset of p osterior distr ibutions P , thu s gaining the assurance that ν ρ ( R ) ≤ µ b β ( ν ρ ) π b β ( ν ρ ) ( R ), with the largest parameter b β ( ν ρ ) that t his approach c an provide. Maximizing b β ( ν ρ ) is supp orted by th e fact th at lim β → + ∞ µ β π β ( R ) = ess inf µπ R . An yhow, there is no a ssurance (to o ur knowledge) that β 7→ µ β π β ( R ) will b e a decreasing function of β all the w ay , although this may be expected to be the ca se in many pra ctical situations. W e c a n make the b ound more explicit in several wa ys. One p oint of view is to put forward the optimal v alues of ρ and ν . 
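The estimator $\hat{\beta}(\nu\rho)$ described above amounts to a one-dimensional search for the largest $\beta$ at which the empirical bound remains non-positive. The following Python sketch illustrates one way such a search could be organized; here \texttt{bound\_B} is a hypothetical stand-in for $\beta \mapsto B(\nu, \rho, \beta)$, and no claim is made that the true bound is monotone in $\beta$.
\begin{verbatim}
import numpy as np

# Sketch: find the largest beta such that B(nu, rho, beta) <= 0, by scanning
# a geometric grid and refining by bisection.  `bound_B` is a hypothetical
# stand-in for the empirical bound of Theorem 2.3.2.
def largest_nonpositive_beta(bound_B, beta_max=1e6, tol=1e-3):
    betas = np.geomspace(1.0, beta_max, 200)
    feasible = [b for b in betas if bound_B(b) <= 0]
    if not feasible:
        return 0.0
    lo = max(feasible)                       # last grid point where B <= 0
    hi = min(b for b in betas if b > lo) if lo < betas[-1] else beta_max
    while hi - lo > tol * lo:                # bisection refinement
        mid = 0.5 * (lo + hi)
        if bound_B(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return lo

# Usage with a toy monotone bound (purely illustrative):
beta_hat = largest_nonpositive_beta(lambda b: np.log1p(b) - 5.0)
print(beta_hat)   # approximately exp(5) - 1
\end{verbatim}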
W e can thus remar k that ν γ ρ ( r ) + K ( ρ, π ) − K ( ρ 2 , π ) + (1 + ζ 2 ) K ( ν, µ ) = ν K ρ, π exp( − γ r ) + λ 2 ρ 2 ( r ) + Z γ λ 2 π exp( − αr ) ( r ) dα + (1 + ζ 2 ) K ( ν, µ ) = ν K ρ, π exp( − γ r ) + (1 + ζ 2 ) K ν, µ exp − λ 2 ρ 2 ( r ) 1+ ζ 2 − 1 1+ ζ 2 R γ λ 2 π exp( − αr ) ( r ) dα − (1 + ζ 2 ) lo g ( µ " exp − λ 2 1 + ζ 2 ρ 2 ( r ) − 1 1 + ζ 2 Z γ λ 2 π exp( − αr ) ( r ) dα #) . Thu s B ( ν, ρ, β ) = (1 + ζ 2 ) h ξ 1 ν 3 ρ 1 ( r ) + ξ 4 ν 3 ρ 4 ( r ) + log µ exp − ξ 1 ρ 1 ( r ) − ξ 4 ρ 4 ( r ) i − (1 + ζ 2 ) lo g ( µ " exp − λ 2 1 + ζ 2 ρ 2 ( r ) − 1 1 + ζ 2 Z γ λ 2 π exp( − αr ) ( r ) dα #) − γ ν 3 ρ 1 ( r ) + (1 + ζ 2 ) 1 + N T β ζ 3 × log ( ν 3 " ρ 1 exp C β N T ζ 1 ν ρ + ζ 1 ρ 1 ( m ′ ) ζ 1 N T β (1+ ζ 2 )(1+ N T β ζ 3 ) × ρ 4 exp C β N T ζ 5 ζ 3 ν 3 ρ 1 + ζ 5 ρ 4 ( m ′ ) ζ 5 N T β (1+ ζ 2 )(1+ N T β ζ 3 ) #) + 1 + ζ 2 N T β ν ( log ( ρ 2 exp C 1+ ζ 2 N T β ζ 2 ρ 2 ( m ′ ) #) + ζ 4 N T β ν 3 ( log " ρ 4 exp C β N T ζ 4 ζ 3 ν 3 ρ 1 + ζ 4 ρ 4 ( m ′ ) #) + ν K ρ, π exp( − γ r ) + (1 + ζ 2 ) K ν, µ exp − λ 2 ρ 2 ( r ) 1+ ζ 2 − 1 1+ ζ 2 R γ λ 2 π exp( − αr ) ( r ) dα + 1 + 5 X i =1 ζ i log 2 ǫ . This form ula is b etter understo o d when thinking ab out the following upper b ound for the t wo first lines in the expression of B ( ν, ρ, β ): 96 Chapter 2. Comp aring p osterior distributions to Gibbs priors (1 + ζ 2 ) h ξ 1 ν 3 ρ 1 ( r ) + ξ 4 ν 3 ρ 4 ( r ) + log µ exp − ξ 1 ρ 1 ( r ) − ξ 4 ρ 4 ( r ) i − (1 + ζ 2 ) lo g ( µ " exp − λ 2 1 + ζ 2 ρ 2 ( r ) − 1 1 + ζ 2 Z γ λ 2 π exp( − αr ) ( r ) dα #) − γ ν 3 ρ 1 ( r ) ≤ ν 3 λ 2 ρ 2 ( r ) + Z γ λ 2 π exp( − αr ) ( r ) dα − γ ρ 1 ( r ) . Another approa c h to understanding Theo rem 2.3.2 is to put forward ρ 0 = π exp( − λ 0 r ) , for some po sitiv e real constant λ 0 < γ , noticing that ν K ( ρ 0 , π ) − K ( ρ 2 , π ) = λ 0 ν ( ρ 2 − ρ 0 )( r ) − ν K ( ρ 2 , ρ 0 ) . Thu s B ( ν, ρ 0 , β ) ≤ ν 3 ( γ − λ 0 )( ρ 0 − ρ 1 )( r ) + λ 0 ( ρ 2 − ρ 1 )( r ) + (1 + ζ 2 ) 1 + N T β ζ 3 × log ( ν 3 " ρ 1 exp C β N T ζ 1 ν ρ 0 + ζ 1 ρ 1 ( m ′ ) ζ 1 N T β (1+ ζ 2 )(1+ N T β ζ 3 ) × ρ 4 exp C β N T ζ 5 ζ 3 ν 3 ρ 1 + ζ 5 ρ 4 ( m ′ ) ζ 5 N T β (1+ ζ 2 )(1+ N T β ζ 3 ) #) + 1 + ζ 2 N T β ν ( log ( ρ 2 exp C 1+ ζ 2 N T β ζ 2 ρ 2 ( m ′ ) #) + ζ 4 N T β ν 3 ( log " ρ 4 exp C β N T ζ 4 ζ 3 ν 3 ρ 1 + ζ 4 ρ 4 ( m ′ ) #) + (1 + ζ 2 ) K h ν, µ exp − ( γ − λ 0 ) ρ 0 ( r )+ λ 0 ρ 2 ( r ) 1+ ζ 2 i − ν K ( ρ 2 , ρ 0 ) + 1 + 5 X i =1 ζ i log 2 ǫ . In the case when w e want to select a single mo del b m ( ω ), and therefore to set ν = δ b m , the previous inequalit y eng ages us to take b m ∈ arg min m ∈ M ( γ − λ 0 ) ρ 0 ( m, r ) + λ 0 ρ 2 ( m, r ). In parametric situations where π exp( − λr ) ( r ) ≃ r ⋆ ( m ) + d e ( m ) λ , we get ( γ − λ 0 ) ρ 0 ( m, r ) − λ 0 ρ 2 ( m, r ) ≃ γ r ⋆ ( m ) + d e ( m ) 1 λ 0 + λ 0 − λ 2 γ λ 2 , resulting in a linear pena lization o f the empiric a l dimension of the mo dels. 2.3. Two step lo c alization 97 2.3.2. Analysis of two step b ounds r elative to a Gibbs prior W e will not state a fo rmal result, but will nev ertheless give some hints ab out how to establish one. This is a rather technical se c tio n, which can b e skipped at a first reading , since it will not b e used b elow. W e should start from Theorem 1 .4.2 (page 36), which gives a deter ministic v ariance term. 
F ro m Theo rem 1.4.2, after a change of prior distribution, we obtain for any positive constant s α 1 and α 2 , any prior distributions e µ 1 and e µ 2 ∈ M 1 + ( M ), for a n y prior co nditional distributions e π 1 and e π 2 : M → M 1 + (Θ), with P probability at least 1 − η , for any p osterior distributions ν 1 ρ 1 and ν 2 ρ 2 , α 1 ( ν 1 ρ 1 − ν 2 ρ 2 )( R ) ≤ α 2 ( ν 1 ρ 1 − ν 2 ρ 2 )( r ) + K ( ν 1 ρ 1 ) ⊗ ( ν 2 ρ 2 ) , ( e µ 1 e π 1 ) ⊗ ( e µ 2 e π 2 ) + lo g n ( e µ 1 e π 1 ) ⊗ ( e µ 2 e π 2 ) h exp − α 2 Ψ α 2 N ( R ′ , M ′ ) + α 1 R ′ io − log( η ) . Applying this to α 1 = 0, we g et that ( ν ρ − ν 3 ρ 1 )( r ) ≤ 1 α 2 K ( ν ρ ) ⊗ ( ν 3 ρ 1 ) , ( e µ e π ) ⊗ ( e µ 3 e π 1 ) + log n ( e µ e ν ) ⊗ ( e µ 3 e π 1 ) h exp α 2 Ψ − α 2 N ( R ′ , M ′ ) io − log( η ) . In the same wa y , to b ound quantities of the form log ( ν 3 " ρ 1 exp C 1 ( ν ρ + ζ 1 ρ 1 )( m ′ ) p 1 × ρ 4 exp C 2 ζ 3 ν 3 ρ 1 + ζ 5 ρ 4 ( m ′ ) p 2 #) = sup ν 5 p 1 sup ρ 5 n C 1 ( ν ρ ) ⊗ ( ν 5 ρ 5 ) + ζ 1 ν 5 ( ρ 1 ⊗ ρ 5 ) ( m ′ ) − K ( ρ 5 , ρ 1 ) o + p 2 sup ρ 6 n C 2 ζ 3 ( ν 3 ρ 1 ) ⊗ ( ν 5 ρ 6 ) + ζ 5 ν 5 ( ρ 4 ⊗ ρ 6 ) ( m ′ ) − K ( ρ 6 , ρ 4 ) o − K ( ν 5 , ν 3 ) , where C 1 , C 2 , p 1 and p 2 are p ositive constants, a nd similar terms, we need to use inequalities of the t yp e: for an y prio r distr ibutions e µ i e π i , i = 1 , 2, with P probabilit y at least 1 − η , for any p osterior distributions ν i ρ i , i = 1 , 2 , α 3 ( ν 1 ρ 1 ) ⊗ ( ν 2 ρ 2 )( m ′ ) ≤ log n ( e µ 1 e π 1 ) ⊗ ( e µ 2 e π 2 ) exp h α 3 Φ − α 3 N ( M ′ ) io + K ( ν 1 ρ 1 ) ⊗ ( ν 2 ρ 2 ) , ( e µ 1 e π 1 ) ⊗ ( e µ 2 e π 2 ) − log( η ) . W e need also the v ariant: with P pr o babilit y at least 1 − η , for any p osterior dis- tribution ν 1 : Ω → M 1 + ( M ) a nd any conditional po sterior distributions ρ 1 , ρ 2 : Ω × M → M 1 + (Θ), α 3 ν 1 ( ρ 1 ⊗ ρ 2 )( m ′ ) ≤ log n e µ 1 e π 1 ⊗ e π 2 exp h α 3 Φ − α 3 N ( M ′ ) io 98 Chapter 2. Comp aring p osterior distributions to Gibbs priors + K ( ν 1 , e µ 1 ) + ν 1 K ρ 1 ⊗ ρ 2 , e π 1 ⊗ e π 2 − log( η ) . W e deduce that log ( ν 3 " ρ 1 exp C 1 ( ν ρ + ζ 1 ρ 1 )( m ′ ) p 1 × ρ 4 exp C 2 ζ 3 ν 3 ρ 1 + ζ 5 ρ 4 ( m ′ ) p 2 #) ≤ sup ν 5 ( p 1 sup ρ 5 " C 1 α 3 log n ( e µ e π ) ⊗ ( e µ 5 e π 5 ) exp h α 3 Φ − α 3 N ( M ′ ) io + K ( ν ρ ) ⊗ ( ν 5 ρ 5 ) , ( e µ e π ⊗ ( e µ 5 e π 5 ) + log( 2 η ) + ζ 1 log n e µ 5 e π 1 ⊗ e π 5 exp h α 3 Φ − α 3 N ( M ′ ) io + K ( ν 5 , e µ 5 ) + ν 5 K ρ 1 ⊗ ρ 5 , e π 1 ⊗ e π 5 + log 2 η − K ( ρ 5 , ρ 1 ) # + p 2 sup ρ 6 " C 1 α 3 log n ( e µ 3 e π 1 ) ⊗ ( e µ 5 e π 6 ) exp h α 3 Φ − α 3 N ( M ′ ) io + K ( ν 3 ρ 1 ) ⊗ ( ν 5 ρ 6 ) , ( e µ 3 e π 1 ⊗ ( e µ 5 e π 6 ) + log( 2 η ) + ζ 1 log n e µ 5 e π 4 ⊗ e π 6 exp h α 3 Φ − α 3 N ( M ′ ) io + K ( ν 5 , e µ 5 ) + ν 5 K ρ 4 ⊗ ρ 6 , e π 4 ⊗ e π 6 + log 2 η − K ( ρ 6 , ρ 4 ) # − K ( ν 5 , ν 3 ) ) . W e are then left with the need to b ound entrop y terms like K ( ν 3 ρ 1 , e µ 3 e π 1 ), wher e we hav e the choice of e µ 3 and e π 1 , to o btain a useful bound. As could be exp ected, we decomp ose it in to K ( ν 3 ρ 1 , e µ 3 e π 1 ) = K ( ν 3 , e µ 3 ) + ν 3 K ( ρ 1 , e π 1 ) . 
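The last display is the usual chain rule for the Kullback–Leibler divergence of compound distributions, $K(\nu\rho, \mu\pi) = K(\nu, \mu) + \nu[K(\rho, \pi)]$, which is used repeatedly in this section. A small numerical check in Python, on toy finite index and parameter sets (sizes and random draws are arbitrary), may help fix ideas.
\begin{verbatim}
import numpy as np

# Numerical check of the entropy decomposition
#   K(nu rho, mu pi) = K(nu, mu) + nu[K(rho, pi)]
# for a finite model index set M and parameter set Theta (toy sizes).
rng = np.random.default_rng(0)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

M, T = 3, 4
nu = rng.dirichlet(np.ones(M)); mu = rng.dirichlet(np.ones(M))
rho = rng.dirichlet(np.ones(T), size=M)   # conditional posterior rho(m, .)
pi = rng.dirichlet(np.ones(T), size=M)    # conditional prior pi(m, .)

joint_post = nu[:, None] * rho            # nu rho as a joint law on M x Theta
joint_prior = mu[:, None] * pi            # mu pi
lhs = kl(joint_post.ravel(), joint_prior.ravel())
rhs = kl(nu, mu) + sum(nu[m] * kl(rho[m], pi[m]) for m in range(M))
print(abs(lhs - rhs) < 1e-12)             # True: the two sides coincide
\end{verbatim}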
Let us lo ok a fter the s econd term first, cho o sing e π 1 = π exp( − β 1 R ) : ν 3 K ( ρ 1 , e π 1 ) = ν 3 β 1 ( ρ 1 − e π 1 )( R ) + K ( ρ 1 , π ) − K ( e π 1 , π ) ≤ β 1 α 1 α 2 ν 3 ( ρ 1 − e π 1 )( r ) + K ( ν 3 , e µ 3 ) + ν 3 K ( ρ 1 , e π 1 ) + log n e µ 3 e π ⊗ 2 1 h exp − α 2 Ψ α 2 N ( R ′ , M ′ ) + α 1 R ′ io − log( η ) + ν 3 K ( ρ 1 , π ) − K ( e π 1 , π ) ≤ β 1 α 1 K ( ν 3 , e µ 3 ) + ν 3 K ( ρ 1 , e π 1 ) + log n e µ 3 e π ⊗ 2 1 h exp − α 2 Ψ α 2 N ( R ′ , M ′ ) + α 1 R ′ io − log( η ) + ν 3 K ρ 1 , π exp( − β 1 α 2 α 1 r ) . 2.3. Two step lo c alization 99 Thu s, when the constra in t λ 1 = β 1 α 2 α 1 is satisfied, ν 3 K ( ρ 1 , e π 1 ) ≤ 1 − β 1 α 1 − 1 β 1 α 1 K ( ν 3 , e µ 3 ) + log n e µ 3 e π ⊗ 2 1 h exp − α 2 Ψ α 2 N ( R ′ , M ′ ) + α 1 R ′ io − log( η ) . W e ca n further sp ecialize the constants, choo sing α 1 = N sinh( α 2 N ), so that − α 2 Ψ α 2 N ( R ′ , M ′ ) + α 1 R ′ ≤ 2 N sinh α 2 2 N 2 M ′ . W e can for instance choo s e α 2 = γ , α 1 = N sinh( γ N ) and β 1 = λ 1 N γ sinh( γ N ), leading to Prop osition 2. 3 .3 . With the notation of The or em 2.3.2, t he c onstants b eing set as explai ne d ab ove, putting e π 1 = π exp( − λ 1 N γ sinh( γ N ) R ) , with P pr ob ability at le ast 1 − η , ν 3 K ( ρ 1 , e π 1 ) ≤ 1 − λ 1 γ − 1 λ 1 γ K ( ν 3 , e µ 3 ) + log n e µ 3 e π ⊗ 2 1 h exp 2 N sinh( γ 2 N ) 2 M ′ io − log( η ) . Mor e gener al ly ν 3 K ( ρ, e π 1 ) ≤ 1 − λ 1 γ − 1 λ 1 γ K ( ν 3 , e µ 3 ) + log n e µ 3 e π ⊗ 2 1 h exp 2 N sinh( γ 2 N ) 2 M ′ io − log( η ) + 1 − λ 1 γ − 1 ν 3 K ( ρ, ρ 1 ) . In a similar wa y , let us now c ho ose e µ 3 = µ exp[ − α 3 π ( R )] . W e can w r ite K ( ν, e µ 3 ) = α 3 ( ν − e µ 3 ) π ( R ) + K ( ν, µ ) − K ( e µ 3 , µ ) ≤ α 3 α 1 α 2 ( ν − e µ 3 ) π ( r ) + K ( ν, e µ 3 ) + log n ( e µ 3 π ) ⊗ ( e µ 3 π ) h exp − α 2 Ψ α 2 N ( R ′ , M ′ ) + α 1 R ′ io − log( η ) + K ( ν, µ ) − K ( e µ 3 , µ ) . Let us cho ose α 2 = γ , α 1 = N sinh( γ N ), and let us add some o ther en tropy inequal- ities to get rid o f π in a s uitable wa y , the appro ac h of entropy comp ensation b eing the same as that used to obtain the empirical bo und o f Theore m 2.3.2 (page 94). This results with P probability at least 1 − η in 1 − α 3 α 1 K ( ν, e µ 3 ) ≤ α 3 α 1 γ ( ν − e µ 3 ) π ( r ) 100 Chapter 2. Comp aring p osterior distributions to Gibbs priors + log n ( e µ 3 π ) ⊗ ( e µ 3 π ) h exp − γ Ψ γ N ( R ′ , M ′ ) + α 1 R ′ io + log( 2 η ) + K ( ν, µ ) − K ( e µ 3 , µ ) , ζ 6 1 − β α 1 e µ 3 K ( ρ 6 , π ) ≤ ζ 6 β α 1 γ e µ 3 ( ρ 6 − π )( r ) + log n e µ 3 π ⊗ 2 h exp − γ Ψ γ N ( R ′ , M ′ ) + α 1 R ′ io + log( 2 η ) + ζ 6 e µ 3 K ( ρ 6 , π ) − K ( π , π ) , ζ 7 1 − β α 1 e µ 3 K ( ρ 7 , π ) ≤ ζ 7 β α 1 γ e µ 3 ( ρ 7 − π )( r ) + log n e µ 3 π ⊗ 2 h exp − γ Ψ γ N ( R ′ , M ′ ) + α 1 R ′ io + log( 2 η ) + ζ 7 e µ 3 K ( ρ 7 , π ) − K ( π , π ) , ζ 8 1 − β α 1 ν K ( ρ 8 , π ) ≤ ζ 8 β α 1 γ ν ( ρ 8 − π )( r ) + K ( ν, e µ 3 ) + log n e µ 3 π ⊗ 2 h exp − γ Ψ γ N ( R ′ , M ′ ) + α 1 R ′ io + log( 2 η ) + ζ 8 ν K ( ρ 8 , π ) − K ( π , π ) , ζ 9 1 − β α 1 ν K ( ρ 9 , π ) ≤ ζ 9 β α 1 γ ν ( ρ 9 − π )( r ) + K ( ν, e µ 3 ) + log n e µ 3 π ⊗ 2 h exp − γ Ψ γ N ( R ′ , M ′ ) + α 1 R ′ io + log( 2 η ) + ζ 9 ν K ( ρ 9 , π ) − K ( π , π ) , where we hav e introduced a bunch o f constants, assumed to b e p ositive, that we will more precisely set to x 8 + x 9 = 1 , ( ζ 6 β + x 8 α 3 ) γ α 1 = λ 6 , ( ζ 7 β + x 9 α 3 ) γ α 1 = λ 7 , ( ζ 8 β − x 8 α 3 ) γ α 1 = λ 8 , ( ζ 9 β − x 9 α 3 ) γ α 1 = λ 9 . 
W e g et w ith P pro babilit y at least 1 − η , 1 − α 3 α 1 − ( ζ 8 + ζ 9 ) β α 1 K ( ν, e µ 3 ) ≤ α 3 α 1 γ ν ( x 8 ρ 8 + x 9 ρ 9 )( r ) − e µ 3 ( x 8 ρ 6 + x 9 ρ 7 )( r ) + α 3 α 1 log n ( e µ 3 π ) ⊗ ( e µ 3 π ) h exp − γ Ψ γ N ( R ′ , M ′ ) + α 1 R ′ io + ( ζ 6 + ζ 7 + ζ 8 + ζ 9 ) β α 1 log n e µ 3 π ⊗ 2 h exp − γ Ψ γ N ( R ′ , M ′ ) + α 1 R ′ io 2.3. Two step lo c alization 101 + K ( ν, µ ) − K ( e µ 3 , µ ) + α 3 α 1 + ( ζ 6 + ζ 7 + ζ 8 + ζ 9 ) β α 1 log 2 η . Let us choose the constants so that λ 1 = λ 7 = λ 9 , λ 4 = λ 6 = λ 8 , α 3 x 9 γ α 1 = ξ 1 and α 3 x 8 γ α 1 = ξ 4 . This is done b y s e tting x 8 = ξ 4 ξ 1 + ξ 4 , x 9 = ξ 1 ξ 1 + ξ 4 , α 3 = N γ sinh( γ N )( ξ 1 + ξ 4 ) , ζ 6 = N γ sinh( γ N ) ( λ 4 − ξ 4 ) β , ζ 7 = N γ sinh( γ N ) ( λ 1 − ξ 1 ) β , ζ 8 = N γ sinh( γ N ) ( λ 4 + ξ 4 ) β , ζ 9 = N γ sinh( γ N ) ( λ 1 + ξ 1 ) β . The inequality λ 1 > ξ 1 is a lways satisfied. The inequality λ 4 > ξ 4 is r e q uired for the ab o ve choice of constants, and will b e satisfied for a suitable choice of ζ 3 and ζ 4 . Under these assumptions, w e obtain with P proba bilit y at lea st 1 − η 1 − α 3 α 1 − ( ζ 8 + ζ 9 ) β α 1 K ( ν, e µ 3 ) ≤ ( ν − e µ 3 )( ξ 1 ρ 1 + ξ 4 ρ 4 )( r ) + α 3 α 1 log n ( e µ 3 π ) ⊗ ( e µ 3 π ) h exp − γ Ψ γ N ( R ′ , M ′ ) + α 1 R ′ io + ( ζ 6 + ζ 7 + ζ 8 + ζ 9 ) β α 1 log n e µ 3 π ⊗ 2 h exp − γ Ψ γ N ( R ′ , M ′ ) + α 1 R ′ io + K ( ν, µ ) − K ( e µ 3 , µ ) + α 3 α 1 + ( ζ 6 + ζ 7 + ζ 8 + ζ 9 ) β α 1 log 2 η . This prov es Prop osition 2. 3 .4 . The c onstants b eing set as explaine d ab ove, with P pr ob ability at le ast 1 − η , for any p osterior distribution ν : Ω → M 1 + ( M ) , K ( ν, e µ 3 ) ≤ 1 − α 3 α 1 − ( ζ 8 + ζ 9 ) β α 1 − 1 K ( ν, ν 3 ) + α 3 α 1 log n ( e µ 3 π ) ⊗ ( e µ 3 π ) h exp − γ Ψ γ N ( R ′ , M ′ ) + α 1 R ′ io + ( ζ 6 + ζ 7 + ζ 8 + ζ 9 ) β α 1 log n e µ 3 π ⊗ 2 h exp − γ Ψ γ N ( R ′ , M ′ ) + α 1 R ′ io + α 3 α 1 + ( ζ 6 + ζ 7 + ζ 8 + ζ 9 ) β α 1 log 2 η . Thu s 102 Chapter 2. Comp aring p osterior distributions to Gibbs priors K ( ν 3 ρ 1 , e µ 3 e π 1 ) ≤ 1 + 1 − λ 1 γ − 1 λ 1 γ 1 − α 3 α 1 − ( ζ 8 + ζ 9 ) β α 1 × α 3 α 1 log n ( e µ 3 π ⊗ ( e µ 3 π ) h exp − γ Ψ γ N ( R ′ , M ′ ) + α 1 R ′ io + ( ζ 6 + ζ 7 + ζ 8 + ζ 9 ) β α 1 log n e µ 3 π ⊗ 2 h exp − γ Ψ γ N ( R ′ , M ′ ) + α 1 R ′ io + α 3 α 1 + ( ζ 6 + ζ 7 + ζ 8 + ζ 9 ) β α 1 log 2 η + 1 − λ 1 γ − 1 λ 1 γ log n e µ 3 e π ⊗ 2 1 h exp 2 N s inh γ 2 N 2 M ′ io − log( 2 η ) . W e will not go further, lest it ma y be c o me tedious, but we hope w e hav e given sufficient hints to state infor mally that the bound B ( ν, ρ, β ) of Theorem 2.3.2 (page 94) is upper bo unded with P probabilit y clo s e to one by a b ound of the same flav our where the empiric a l quantities r and m ′ hav e b een replaced with their exp e ctations R and M ′ . 2.3.3. Tw o step lo c ali zation b etwe en p osterior distributions Here we work with a family of pr io r distr ibutions describ ed by a reg ular conditional prior distribution π = M → M 1 + (Θ), where M is some measurable index set. This family may typically descr ibe a countable family of par ametric mo dels. In this case M = N , and each of the pr ior distributions π ( i, . ), i ∈ N satisfies some para metric complexity assumption of the type lim sup β → + ∞ β π exp( − β R ) ( i, . )( R ) − ess inf π ( i,. ) R = d i < + ∞ , i ∈ M . Let us consider also a prior distribution µ ∈ M 1 + ( M ) defined on the index set M . 
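The parametric complexity assumption above can be visualized on a toy one-dimensional model: for a Gibbs prior $\pi_{\exp(-\beta R)}(i, \cdot)$, the quantity $\beta\bigl[\pi_{\exp(-\beta R)}(i, \cdot)(R) - \operatorname{ess\,inf} R\bigr]$ stabilizes as $\beta$ grows. The Python sketch below uses a made-up risk profile $R(\theta) = R^* + |\theta|$ on a fine grid, for which the limiting constant is $1$; it is only meant to illustrate the definition of $d_i$.
\begin{verbatim}
import numpy as np

# Toy illustration of the parametric-complexity assumption: beta times the
# excess Gibbs risk approaches a constant ("effective dimension"), here 1.
theta = np.linspace(-1.0, 1.0, 200001)
R = 0.1 + np.abs(theta)                 # made-up risk profile, ess inf R = 0.1
for b in (1e2, 1e3, 1e4):
    w = np.exp(-b * (R - R.min()))
    w /= w.sum()                        # Gibbs measure pi_exp(-beta R) on the grid
    print(b, b * (np.dot(w, R) - R.min()))   # approaches 1 as beta grows
\end{verbatim}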
Our aim here will be to compare the per formance of tw o given p osterior distri- butions ν 1 ρ 1 and ν 2 ρ 2 , where ν 1 , ν 2 : Ω → M 1 + ( M ), a nd wher e ρ 1 , ρ 2 : Ω × M → M 1 + (Θ). More precisely , we would like to establish a bound for ( ν 1 ρ 1 − ν 2 ρ 2 )( R ) which could b e a starting p oin t to implement a selection method similar to the one describ ed in Theo rem 2.2.4 (page 73). T o this purp ose, we can start with Theorem 2.2.1 (page 69), which says that with P pro babilit y a t least 1 − ǫ , − N log n 1 − ta nh( λ N ) ν 1 ρ 1 − ν 2 ρ 2 ( R ) o ≤ λ ( ν 1 ρ 1 − ν 2 ρ 2 )( r ) + N log cosh( λ N ) ( ν 1 ρ 1 ) ⊗ ( ν 2 ρ 2 )( m ′ ) + K ( ν 1 , e µ ) + K ( ν 2 , e µ ) + ν 1 K ( ρ 1 , e π ) + ν 2 K ( ρ 2 , e π ) − lo g( ǫ ) , where e µ ∈ M 1 + ( M ) a nd e π : M → M 1 + (Θ) ar e suitably loca lized prior distributions to b e c hosen later on. T o use these lo calized prior distributions, we need empirical bo unds for the en tropy terms K ( ν i , e µ ) and ν i K ( ρ i , e π ) , i = 1 , 2 . Bounding ν K ( ρ, e π ) can b e done using the following generaliza tio n of Cor ollary 2.1.19 page 68: Corollary 2. 3.5 . F or any p ositive r e al c onstant s γ and λ such that γ < λ , fo r any prior distribution µ ∈ M 1 + ( M ) and any c onditional prior distribution π : M → 2.3. Two step lo c alization 103 M 1 + (Θ) , with P pr ob ability at le ast 1 − ǫ , for any p osterior distribution ν : Ω → M 1 + ( M ) , and any c onditional p osterior distribution ρ : Ω × M → M 1 + (Θ) , ν n K ρ, π exp[ − N γ λ tanh( λ N ) R ] o ≤ K ′ ( ν, ρ, γ , λ, ǫ ) + 1 λ γ − 1 K ( ν, µ ) , wher e K ′ ( ν, ρ, γ , λ, ǫ ) def = 1 − γ λ − 1 ν K ( ρ, π exp( − γ r ) − γ λ log( ǫ ) + ν n log h π exp( − γ r ) exp N γ λ log cosh( λ N ) ρ ( m ′ ) io . T o a pply this cor ollary to our cas e, we have to set e π = π exp[ − N γ λ tanh( λ N ) R ] . Let us also consider fo r some p ositive real constant β t he conditional prior distri- bution π = π exp( − β R ) and the prior distribution µ = µ exp[ − απ ( R )] . Let us se e how we can bound, given any p o sterior distribution ν : Ω → M 1 + ( M ), the divergence K ( ν, µ ). W e can see that K ( ν, µ ) = α ( ν − µ ) π ( R ) + K ( ν, µ ) − K ( µ, µ ) . Now, let us introduce the conditional p osterior distribution b π = π exp( − γ r ) and let us decomp ose ( ν − µ ) π ( R ) = ν π ( R ) − b π ( R ) + ( ν − µ ) b π ( R ) + µ b π ( R ) − π ( R ) . Starting from the expo nen tial inequalit y P µ π ⊗ π exp n − N log 1 − ta nh( γ N ) R ′ − γ r ′ − N log cosh( γ N ) m ′ o ≤ 1 , and reasoning in the same w ay that led to Theorem 2 .1 .1 (page 52) in the simple case when we ta ke in this theorem λ = γ , we get with P probability a t least 1 − ǫ , that − N log 1 − ta nh( γ N ) ν ( π − b π )( R ) + β ν ( π − b π )( R ) ≤ ν log n b π h exp N log cosh( γ N ) b π ( m ′ ) io + K ( ν, µ ) − log( ǫ ) . − N log 1 − ta nh( γ N ) µ ( b π − π )( R ) − β µ ( b π − π )( R ) ≤ µ log n b π h exp N log cosh( γ N ) b π ( m ′ ) io − lo g( ǫ ) . 104 Chapter 2. 
Comp aring p osterior distributions to Gibbs priors In the meantime, using Theorem 2.2 .1 (page 6 9) and Corollar y 2.3 .5 above, we see that with P probabilit y at least 1 − 2 ǫ , for any c onditional p osterior distribution ρ : Ω × M → M 1 + (Θ), − N log n 1 − ta nh( λ N )( ν − µ ) ρ ( R ) o ≤ λ ( ν − µ ) ρ ( r ) + N log cosh( λ N ) ( ν ρ ) ⊗ ( µρ )( m ′ ) + ( ν + µ ) K ( ρ, e π ) + K ( ν, µ ) − log ( ǫ ) ≤ λ ( ν − µ ) ρ ( r ) + N log cosh( λ N ) ( ν ρ ) ⊗ ( µρ )( m ′ ) + K ( ν, µ ) − log ( ǫ ) + 1 − γ λ − 1 ( ν + µ ) K ρ, b π + log n b π h exp N γ λ log cosh( λ N ) ρ ( m ′ ) io + λ γ − 1 − 1 K ( ν, µ ) − 2 log( ǫ ) . Putting all this together , we see that with P proba bility a t le a st 1 − 3 ǫ , for an y po sterior distribution ν ∈ M 1 + ( M ), 1 − α N tanh( γ N ) + β − α N tanh( λ N ) 1 − γ λ K ( ν, µ ) ≤ α h N tanh( γ N ) + β i − 1 ν log n b π h exp N log cosh( γ N ) b π ( m ′ ) io − log( ǫ ) + α h N tanh( γ N ) − β i − 1 µ log n b π h exp N log cosh( γ N ) b π ( m ′ ) io − log( ǫ ) + α N tanh( λ N ) − 1 ( λ ( ν − µ ) b π ( r ) + N log cosh( λ N ) ( ν b π ) ⊗ ( µ b π )( m ′ ) + 1 − γ λ − 1 ( ν + µ ) log n b π h exp N γ λ log cosh( λ N ) b π ( m ′ ) io − 1 + γ λ 1 − γ λ log( ǫ ) ) + K ( ν, µ ) − K ( µ, µ ) . Replacing in the right-hand side of this inequalit y the unobserved prio r distribution µ with the worst p ossible po sterior distr ibution, we obtain Theorem 2 . 3.6 . F or any p ositive r e al c onstants α , β , γ and λ , using the notation, π = π exp( − β R ) , µ = µ exp[ − απ ( R )] , b π = π exp( − γ r ) , b µ = µ exp[ − α λ N tanh( λ N ) − 1 b π ( r )] , with P pr ob ability at le ast 1 − ǫ , for any p osterior distribution ν : Ω → M 1 + ( M ) , 1 − α N tanh( γ N ) + β − α N tanh( λ N ) 1 − γ λ K ( ν, µ ) ≤ K ( ν, b µ ) + α N tanh( γ N ) + β ν log n b π h exp N log cosh( γ N ) b π ( m ′ ) io 2.3. Two step lo c alization 105 + α N tanh( λ N )(1 − γ λ ) ν log n b π h exp N γ λ log cosh( λ N ) b π ( m ′ ) io + log ( b µ " b π n exp h N log cosh( γ N ) b π ( m ′ ) io α N tanh( γ N ) − β × b π n exp h N γ λ log cosh( λ N ) b π ( m ′ ) io α N tanh( λ N )(1 − γ λ ) × exp α lo g[cosh( λ N )] tanh( λ N ) ( ν b π ) ⊗ b π ( m ′ ) #) + 1 N tanh( γ N ) + β + 1 N tanh( γ N ) − β + 1 + γ λ N tanh( λ N ) 1 − γ λ log 3 ǫ . This result is satisfactor y , but in the same time hints a t some po s sible improve- men t in the choice of the lo calized prior µ , which is here somewhat lacking a v ariance term. W e will consider in the remainder of this section the use of (2.38) µ = µ exp[ − α π ( R ) − ξ e π ⊗ e π ( M ′ ) , where ξ is some positive real constant and e π = π exp( − e β R ) is some appropriate conditional prior distribution with p ositive real par ameter e β . With this new ch oice K ( ν, µ ) = α ( ν − µ ) π ( R ) + ξ ( ν − µ )( e π ⊗ e π )( M ′ ) + K ( ν, µ ) − K ( µ, µ ) . W e already know how to deal with the first factor α ( ν − µ ) π ( R ), since the com- putations we made to give it an empirica l upper b ound were v alid for any choice of the lo calized prior distribution µ . Let us now deal with ξ ( ν − µ )( e π ⊗ e π )( M ′ ). Since m ′ ( θ, θ ′ ) is a sum of indep enden t Ber noulli ra ndom v ariables, we can easily generalize the r esult of Theorem 1 .1.4 (page 4) to prove that with P probability a t least 1 − ǫ N 1 − e x p( − ζ N ) ν ( e π ⊗ e π )( M ′ ) ≤ ζ Φ ζ N ν ( e π ⊗ e π )( M ′ ) ≤ ζ ν ( e π ⊗ e π )( m ′ ) + K ( ν, µ ) − log( ǫ ) . 
In the same wa y , with P pro babilit y at least 1 − ǫ , − N exp( ζ N ) − 1 µ ( e π ⊗ e π )( M ′ ) ≤ − ζ Φ − ζ N µ ( e π ⊗ e π )( M ′ ) ≤ − ζ µ ( e π ⊗ e π )( m ′ ) − log( ǫ ) . W e would like now to replace ( e π ⊗ e π )( m ′ ) with an empirica l quantit y . In order to do this, we will use an entrop y b ound. Indeed for any conditional p osterior distribution ρ : Ω × M → M 1 + (Θ), ν K ( ρ, e π ) = e β ν ( ρ − e π )( R ) + ν K ( ρ, π ) − K ( e π , π ) ≤ e β N tanh( γ N ) γ ν ( ρ − e π )( r ) + N log cosh( γ N ) ν ( ρ ⊗ e π )( m ′ ) + K ( ν, µ ) + ν K ( ρ, e π ) − log ( ǫ ) + ν K ( ρ, π ) − K ( e π , π ) . 106 Chapter 2. Comp aring p osterior distributions to Gibbs priors Thu s cho osing e β = N tanh( γ N ), γ ν ( e π − ρ )( r ) + ν K ( e π , π ) − K ( ρ, π ) ≤ N log cosh( γ N ) ν ( ρ ⊗ e π )( m ′ ) + K ( ν, µ ) − log( ǫ ) . Cho osing ρ = b π , we get ν K ( e π , b π ) ≤ N log cosh( γ N ) ν ( b π ⊗ e π )( m ′ ) + K ( ν, µ ) − log( ǫ ) . This implies that ξ ν ( b π ⊗ e π )( m ′ ) = ν n e π ξ b π ( m ′ ) − K ( e π , b π ) o + ν K ( e π , b π ) ≤ ν n log h b π exp ξ b π ( m ′ ) io + N log cosh( γ N ) ν ( b π ⊗ e π )( m ′ ) + K ( ν, µ ) − log( ǫ ) . Thu s ξ − N log cosh( γ N ) ν ( b π ⊗ e π )( m ′ ) ≤ ν n log h b π exp ξ b π ( m ′ ) io + K ( ν, µ ) − log( ǫ ) and ν K ( e π , b π ) ≤ ξ N log [cosh( γ N )] − 1 − 1 ν n log h b π exp ξ b π ( m ′ ) io + K ( ν, µ ) − log( ǫ ) + K ( ν, µ ) − log( ǫ ) . T a king for simplicity ξ = 2 N log cosh( γ N ) and noticing that 2 N log cosh( γ N ) = − N log 1 − e β 2 N 2 , we get Theorem 2 . 3.7 . L et us put e π = π exp( − e β R ) and b π = π exp( − γ r ) , wher e γ is some arbitr ary p ositive r e al c onstant and e β = N tanh( γ N ) , so that γ = N 2 log 1+ e β N 1 − e β N . With P pr ob ability at le ast 1 − ǫ , ν K ( e π , b π ) ≤ ν log n b π h exp 2 N log cosh( γ N ) b π ( m ′ ) io + 2 K ( ν, µ ) − log( ǫ ) . As a consequence ζ ν ( e π ⊗ e π )( m ′ ) = ζ ν ( e π ⊗ e π )( m ′ ) − ν K ( e π ⊗ e π , b π ⊗ b π ) + 2 ν K ( e π , b π ) ≤ ν n log h b π ⊗ b π exp( ζ m ′ ) io + 2 ν log n b π h exp 2 N log cosh( γ N ) b π ( m ′ ) io + 4 K ( ν, µ ) − log( ǫ ) . Let us take for the sa ke o f simplicity ζ = 2 N log cosh( γ N ) , to get ζ ν ( e π ⊗ e π )( m ′ ) ≤ 3 ν n log h b π ⊗ b π exp( ζ m ′ ) io + 4 K ( ν, µ ) − log( ǫ ) . This prov es 2.3. Two step lo c alization 107 Prop osition 2. 3 .8 . L et us c onsider some arbitr ary prior distribution µ ∈ M 1 + ( M ) and some arbitr ary c onditional prior distribution π : M → M 1 + (Θ) . L et e β < N b e some p ositive r e al c onstant. L et u s put e π = π exp( − e β R ) and b π = π exp( − γ r ) , with e β = N tanh( γ N ) . Mor e over let u s put ζ = 2 N lo g cosh( γ N ) . With P p r ob ability at le ast 1 − 2 ǫ , for any p osterior distribution ν ∈ M 1 + ( M ) , ν ( e π ⊗ e π )( M ′ ) ≤ 3 ν n log h b π ⊗ b π exp( ζ m ′ ) io + 5 K ( ν, µ ) − log( ǫ ) N 1 − e x p( − ζ N ) = 1 N tanh( γ N ) 2 3 ν log n b π ⊗ b π h exp 2 N log cosh( γ N ) m ′ io + 5 K ( ν, µ ) − log( ǫ ) . In the same wa y , − ζ µ ( e π ⊗ e π )( m ′ ) ≤ µ n log h b π ⊗ b π exp( − ζ m ′ ) io + 2 µ log n b π h exp 2 N log cosh( γ N ) b π ( m ′ ) io − 4 log( ǫ ) and thus − µ ( e π ⊗ e π )( M ′ ) ≤ 1 N exp( ζ N ) − 1 µ n log h b π ⊗ b π exp( − ζ m ′ ) io + 2 µ log n b π h exp 2 N lo g cosh( γ N ) b π ( m ′ ) io − 5 log( ǫ ) . Here we hav e purpo sely kept ζ as an arbitrary p ositive real c onstan t, to b e tuned later (in order to b e able to strengthen mor e or less the co mpensation of v ariance terms). 
W e are now pr operly equipped to estimate the divergence with resp ect to µ , the choice o f prior distribution ma de in equation (2.3 8, pag e 105). Indeed we can now write 1 − α N tanh( γ N ) + β − α N tanh( λ N ) 1 − γ λ − 5 ξ N tanh( γ N ) 2 K ( ν, µ ) ≤ α N tanh( γ N ) + β ν log n b π h exp N log cosh( γ N ) b π ( m ′ ) io − log( ǫ ) + α N tanh( γ N ) − β µ log n b π h exp N log cosh( γ N ) b π ( m ′ ) io − log( ǫ ) + α N tanh( λ N ) ( λ ( ν − µ ) b π ( r ) + N log cosh( λ N ) ( ν b π ) ⊗ ( µ b π )( m ′ ) + 1 − γ λ − 1 ( ν + µ ) log n b π h exp N γ λ log cosh( λ N ) b π ( m ′ ) io − 1 + γ N 1 − γ N log( ǫ ) ) 108 Chapter 2. Comp aring p osterior distributions to Gibbs priors + ξ N tanh( γ N ) 2 3 ν log n b π ⊗ b π h exp 2 N log cosh( γ N ) m ′ io − 5 log( ǫ ) + ξ N exp( ζ N ) − 1 µ n log h b π ⊗ b π exp( − ζ m ′ ) io + 2 µ log n b π h exp 2 N log cosh( γ N ) b π ( m ′ ) io − 5 log( ǫ ) . + K ( ν, µ ) − K ( µ, µ ) . It remains now only to replace in the right-hand s ide of this inequality µ with the worst p ossible po sterior distribution to obtain Theorem 2 . 3.9 . L et λ > γ > β , ζ , α and ξ b e arbitr ary p ositive r e al c onstants. L et us use t he notation π = π exp( − β R ) , e π = π exp( − N tanh( γ N ) R ) , b π = π exp( − γ r ) , µ = µ exp[ − α π ( R ) − ξ e π ⊗ e π ( M ′ )] and let us define the p osterior distribution b µ : Ω → M 1 + ( M ) by d b µ dµ ∼ exp − αλ N tanh( λ N ) b π ( r ) + ξ N exp( ζ N ) − 1 log n b π ⊗ b π exp( − ζ m ′ ) o . L et us assume mor e over that α N tanh( γ N ) + β + α N tanh( λ N )(1 − γ λ ) + 5 ξ N tanh( γ N ) 2 < 1 . With P pr ob ability at le ast 1 − ǫ , for any p osterior distribution ν : Ω → M 1 + ( M ) , K ( ν, µ ) ≤ 1 − α N tanh( γ N ) + β − α N tanh( λ N ) 1 − γ λ − 5 ξ N tanh( γ N ) 2 − 1 ( K ( ν, b µ ) + α N tanh( γ N ) + β ν log n b π h exp N log cosh( γ N ) b π ( m ′ ) io + α N tanh( λ N ) 1 − γ λ ν log n b π h exp N γ λ log cosh( λ N ) b π ( m ′ ) io + ξ N tanh( γ N ) 2 3 ν log n b π ⊗ b π h exp 2 N log cosh( γ N ) m ′ io + ξ N exp( ζ N ) − 1 ν n log h b π ⊗ b π exp( − ζ m ′ ) io + log b µ n b π h exp N log cosh( γ N ) b π ( m ′ ) io α N tanh( γ N ) − β × n b π h exp N γ λ log cosh( λ N ) b π ( m ′ ) io α N tanh( λ N ) 1 − γ λ × n b π h exp 2 N lo g cosh( γ N ) b π ( m ′ ) io 2 ξ N exp( ζ N ) − 1 2.3. Two step lo c alization 109 × exp n N log cosh( λ N ) ( ν b π ) ⊗ b π ( m ′ ) o + " α N tanh( γ N ) + β + α N tanh( γ N ) − β + 2 α 1 + γ N N tanh( λ N ) 1 − γ λ + 5 ξ N tanh( γ N ) 2 + 5 ξ N exp( ζ N ) − 1 # log 5 ǫ ) . The interest of this theor em lies in the presence of a v ariance term in the lo calized po sterior distribution b µ , which with a s uitable c hoice of parameters seems to b e an in teres ting option in the case when there are nested mo dels: in this situation ther e may be a nee d to preven t integration with r espect to b µ in the r ig h t-hand side to put weight on wild oversized mo dels with la rge v ariance ter ms. Moreover, the right- hand side b eing empirical, pa rameters can be, as usual, optimized from data using a union bo und on a grid o f ca ndidate v alues. If one is only interested in the g e ner al shap e of the result, a s implified inequa lit y as the one below may suffice: Corollary 2. 3.10 . F or any p ositive r e al c onstants λ > γ > β , ζ , α and ξ , let us use the same notation as in The or em 2.3.9 (p age 108). 
L et us put mor e over A 1 = α N tanh( γ N ) + β + α N tanh( λ N ) 1 − γ λ + 5 ξ N tanh( γ N ) 2 , A 2 = α N tanh( γ N ) + β + α N tanh( λ N ) 1 − γ λ + 3 ξ N tanh( γ N ) 2 A 3 = ξ N exp ζ N − 1 A 4 = α N tanh( γ N ) − β + α N tanh( λ N )(1 − γ λ ) + 2 ξ N [exp( ζ N ) − 1] , A 5 = α N tanh( γ N ) + β + α N tanh( γ N ) − β + 2 α 1 + γ N N tanh( λ N ) 1 − γ λ + 5 ξ N tanh( γ N ) 2 + 5 ξ N exp( ζ N ) − 1 , C 1 = 2 N log cosh λ N , C 2 = N log cosh λ N . L et us assume that A 1 < 1 . With P pr ob ability at le ast 1 − ǫ , fo r any p osterior distribution ν : Ω → M 1 + ( M ) , K ( ν, µ ) ≤ K ( ν , α, β , γ , λ, ξ , ζ , ǫ ) def = 1 − A 1 − 1 ( K ( ν, b µ ) + A 2 ν h log b π ⊗ b π exp C 1 m ′ i + A 3 ν h log b π ⊗ b π exp − ζ m ′ i + log b µ h b π exp C 1 b π ( m ′ ) i A 4 exp C 2 ( ν b π ) ⊗ b π ( m ′ ) + A 5 log 5 ǫ ) . 110 Chapter 2. Comp aring p osterior distributions to Gibbs priors Putting this corollary together with Corollary 2.3.5 (page 102), we obtain Theorem 2 . 3.11 . L et us c onsider the n otatio n intr o duc e d in Cor ol lary 2.3.5 (p age 102) and in The or em 2.3.9 (p age 108) and its Cor ol lary 2.3.10 (p age 109). L et us c onsider r e al p ositive p ar ameters λ , γ ′ 1 < λ ′ 1 and γ ′ 2 < λ ′ 2 . L et us c onsider also two sets of p ar ameters α i , β i , γ i , λ i , ξ i , ζ i ,wher e i = 1 , 2 , b oth satisfying the c onditions state d in Cor ol lary 2.3 .10 (p age 109). With P pr ob ability at le ast 1 − ǫ , for any p osterior distributions ν 1 , ν 2 : Ω → M 1 + ( M ) , any c onditional p osterior distributions ρ 1 , ρ 2 : Ω × M → M 1 + Θ , − N log n 1 − ta nh λ N ν 1 ρ 1 − ν 2 ρ 2 ( R ) o ≤ λ ν 1 ρ 1 − ν 2 ρ 2 ( r ) + N log cosh λ N ν 1 ρ 1 ⊗ ν 2 ρ 2 m ′ + K ′ ν 1 , ρ 1 , γ ′ 1 , λ ′ 1 , ǫ 5 + K ′ ν 2 , ρ 2 , γ ′ 2 , λ ′ 2 , ǫ 5 + 1 1 − γ ′ 1 λ ′ 1 K ν 1 , α 1 , β 1 , γ 1 , λ 1 , ξ 1 , ζ 1 , ǫ 5 + 1 1 − γ ′ 2 λ ′ 2 K ν 2 , α 2 , β 2 , γ 2 , λ 2 , ξ 2 , ζ 2 , ǫ 5 − log ǫ 5 . This theor em pr o vides, using a union b ound a rgumen t to further optimize the parameters, an empirical b ound for ν 1 ρ 1 ( R ) − ν 2 ρ 2 ( R ), which can ser v e to build a se le c tio n a lgorithm exactly in the same wa y a s what was done in T heo rem 2 .2.4 (page 7 3). This represents the highes t degree of sophistication that w e will a chieve in this monog raph, as fa r as mo del selection is concerned: this theorem sho ws that it is indeed pos sible to der iv e a selection scheme in whic h loca lization is perfor med in tw o steps and in which the lo calization of the mo del selection itself, as opp osed to the loc a lization of the estimation in e a c h mo del, includes a v ar iance term as w ell as a bias term, so that it should b e po ssible to lo calize the choice of nested mod- els, so mething that would not have b een feas ible with the lo calization techn iques exp o sed in the previous se c tio ns of this study . W e should p oin t out how ever that mor e sophistic ate d does not necessarily mean mor e efficient : as the reader may hav e noticed, sophistication comes at a price, in terms of the complexity of the estimation schemes, with so me p ossible loss of accuracy in the constants that can mar the b enefits of using an asymptotically more efficient method for small sample sizes. W e will do the hurried reader a fav our: we will not launch in to a s tudy of the theoretical pro p erties of this selection a lgorithm, although it is clea r that all the to ols needed a re a t hand! 
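Although we do not develop the selection algorithm here, the way Theorem 2.3.11 would be used can be sketched: for each pair of candidate posterior distributions and each point of a finite grid of parameter values, the theorem supplies an empirical upper bound on the relative error rate, and a union bound over the grid and the pairs pays for the optimization. The Python skeleton below is only a hypothetical illustration of such a tournament; \texttt{relative\_bound} stands in for the empirical bound and is not defined here.
\begin{verbatim}
# Skeleton of a selection scheme in the spirit of Theorem 2.2.4:
# relative_bound(i, j, params, eps) is a hypothetical callable returning an
# empirical upper bound on nu_i rho_i (R) - nu_j rho_j (R) at level eps.
def select(candidates, relative_bound, param_grid, eps=0.05):
    eps_point = eps / (len(param_grid) * len(candidates) ** 2)  # crude union bound
    best = 0
    for j in range(1, len(candidates)):
        # candidate j replaces the current champion if some grid point
        # certifies that its error rate is not larger.
        beats = any(relative_bound(j, best, p, eps_point) <= 0 for p in param_grid)
        if beats:
            best = j
    return candidates[best]

# Toy usage with a fabricated bound (for illustration only):
toy = lambda i, j, p, e: (i - j) * 0.01 + p   # not a real bound
print(select(["A", "B", "C"], toy, param_grid=[-0.05, 0.0], eps=0.05))
\end{verbatim}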
We would like, as a conclusion to this chapter, to put forward a simple idea: this approach of model selection revolves around entropy estimates concerned with the divergence of posterior distributions with respect to localized prior distributions. Moreover, this localization of the prior distribution is more effectively done in several steps in some situations, and it is worth mentioning that these situations include the typical case of selection from a family of parametric models. Finally, the whole story relies upon estimating the relative generalization error rate of one posterior distribution with respect to some local prior distribution as well as with respect to another posterior distribution, because these relative rates can be estimated more accurately than absolute generalization error rates, at least as soon as no classification model of reasonable size provides a good match to the training sample, meaning that the classification problem is either difficult or noisy.

Chapter 3. Transductive PAC-Bayesian learning

3.1. Basic inequalities

3.1.1. The transductive setting

In this chapter the observed sample $(X_i, Y_i)_{i=1}^N$ will be supplemented with a test or shadow sample $(X_i, Y_i)_{i=N+1}^{(k+1)N}$. This point of view, called transductive classification, has been introduced by V. Vapnik. It may be justified in different ways. On the practical side, one interest of the transductive setting is that it is often a lot easier to collect examples than it is to label them, so that it is not unrealistic to assume that we indeed have two training samples, one labelled and one unlabelled. It also covers the case when a batch of patterns is to be classified and we are allowed to observe the whole batch before issuing the classification.

On the mathematical side, considering a shadow sample proves technically fruitful. Indeed, when introducing the Vapnik–Cervonenkis entropy and Vapnik–Cervonenkis dimension concepts, as well as when dealing with compression schemes, albeit the inductive setting is our final concern, the transductive setting is a useful detour. In this second scenario, intermediate technical results involving the shadow sample are integrated with respect to unobserved random variables in a second stage of the proofs.

Let us describe now the changes to be made to previous notation to adapt them to the transductive setting. The distribution $P$ will be a probability measure on the canonical space $\Omega = (\mathcal{X} \times \mathcal{Y})^{(k+1)N}$, and $(X_i, Y_i)_{i=1}^{(k+1)N}$ will be the canonical process on this space (that is the coordinate process). Unless explicitly mentioned, the parameter $k$ indicating the size of the shadow sample will remain fixed. Assuming the shadow sample size is a multiple of the training sample size is convenient without significantly restricting generality. For a while, we will use a weaker assumption than independence, assuming that $P$ is partially exchangeable, since this is all we need in the proofs.

Definition 3.1.1. For $i = 1, \dots, N$, let $\tau_i : \Omega \to \Omega$ be defined for any $\omega = (\omega_j)_{j=1}^{(k+1)N} \in \Omega$ by
\[
\tau_i(\omega)_{i+jN} = \omega_{i+(j-1)N}, \quad j = 1, \dots, k, \qquad \tau_i(\omega)_i = \omega_{i+kN},
\]
and $\tau_i(\omega)_{m+jN} = \omega_{m+jN}$, $m \neq i$, $m = 1, \dots, N$, $j = 0, \dots, k$.
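For concreteness, here is a small Python sketch of the shift $\tau_i$ of Definition 3.1.1, viewing the $(k+1)N$ indices as an $N \times (k+1)$ array whose $i$th row collects the entries $i + jN$, $j = 0, \dots, k$ (indices are 0-based in the code, and the array contents are arbitrary).
\begin{verbatim}
import numpy as np

# Sketch of tau_i: position i+jN receives the content of position i+(j-1)N,
# and position i receives the content of position i+kN (a circular shift of
# row i, the other rows being left unchanged).
def tau(omega, i, N, k):
    out = omega.copy()
    row = [i + j * N for j in range(k + 1)]       # positions of row i
    out[row] = omega[row[-1:] + row[:-1]]         # out[i+jN] = omega[i+(j-1)N]
    return out

# Check on a toy sample: applying tau_i (k+1) times gives back the identity.
N, k = 4, 2
omega = np.arange((k + 1) * N)
x = omega.copy()
for _ in range(k + 1):
    x = tau(x, i=1, N=N, k=k)
print(np.array_equal(x, omega))   # True
\end{verbatim}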
Clearly, if we arrange the $(k+1)N$ samples in an $N \times (k+1)$ array, $\tau_i$ performs a circular permutation of $k+1$ entries on the $i$th row, leaving the other rows unchanged. Moreover, all the circular permutations of the $i$th row have the form $\tau_i^j$, $j$ ranging from $0$ to $k$. The probability distribution $P$ is said to be partially exchangeable if for any $i = 1, \dots, N$, $P \circ \tau_i^{-1} = P$. This means equivalently that for any bounded measurable function $h : \Omega \to \mathbb{R}$, $P(h \circ \tau_i) = P(h)$. In the same way a function $h$ defined on $\Omega$ will be said to be partially exchangeable if $h \circ \tau_i = h$ for any $i = 1, \dots, N$. Accordingly a posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta, \mathcal{T})$ will be said to be partially exchangeable when $\rho(\omega, A) = \rho(\tau_i(\omega), A)$, for any $\omega \in \Omega$, any $i = 1, \dots, N$ and any $A \in \mathcal{T}$.

For any bounded measurable function $h$, let us define $T_i(h) = \frac{1}{k+1} \sum_{j=0}^{k} h \circ \tau_i^j$. Let $T(h) = T_N \circ \cdots \circ T_1(h)$. For any partially exchangeable probability distribution $P$, and for any bounded measurable function $h$, $P[T(h)] = P(h)$.

Let us put
\[
\sigma_i(\theta) = \mathbb{1}\bigl[f_\theta(X_i) \neq Y_i\bigr],
\]
indicating the success or failure of $f_\theta$ to predict $Y_i$ from $X_i$,
\[
r_1(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sigma_i(\theta),
\]
the empirical error rate of $f_\theta$ on the observed sample,
\[
r_2(\theta) = \frac{1}{kN} \sum_{i=N+1}^{(k+1)N} \sigma_i(\theta),
\]
the error rate of $f_\theta$ on the shadow sample,
\[
r(\theta) = \frac{r_1(\theta) + k\, r_2(\theta)}{k+1} = \frac{1}{(k+1)N} \sum_{i=1}^{(k+1)N} \sigma_i(\theta),
\]
the global error rate of $f_\theta$,
\[
R_i(\theta) = P\bigl[f_\theta(X_i) \neq Y_i\bigr],
\]
the expected error rate of $f_\theta$ on the $i$th input, and
\[
R(\theta) = \frac{1}{N} \sum_{i=1}^{N} R_i(\theta) = P\bigl[r_1(\theta)\bigr] = P\bigl[r_2(\theta)\bigr],
\]
the average expected error rate of $f_\theta$ on all inputs.

We will allow for posterior distributions $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$ depending on the shadow sample. The most interesting ones will anyhow be independent of the shadow labels $Y_{N+1}, \dots, Y_{(k+1)N}$. We will be interested in the conditional expected error rate of the randomized classification rule described by $\rho$ on the shadow sample, given the observed sample, that is $P\bigl[\rho(r_2) \mid (X_i, Y_i)_{i=1}^N\bigr]$. This is a natural extension of the notion of generalization error rate: this is indeed the error rate to be expected when the randomized classification rule described by the posterior distribution $\rho$ is applied to the shadow sample (which should in this case more purposefully be called the test sample).

To see the connection with the previously defined generalization error rate, let us comment on the case when $P$ is invariant by any permutation of any row, meaning that $P[h(\omega \circ s)] = P[h(\omega)]$ for all $s \in S(\{i + jN;\ j = 0, \dots, k\})$ and all $i = 1, \dots, N$, where $S(A)$ is the set of permutations of $A$, extended to $\{1, \dots, (k+1)N\}$ so as to be the identity outside of $A$. In other words, $P$ is assumed to be invariant under any permutation which maps each row onto itself. In this case, if $\rho$ is invariant by any permutation of any row of the shadow sample, meaning that $\rho(\omega \circ s) = \rho(\omega) \in \mathcal{M}_+^1(\Theta)$, $s \in S(\{i + jN;\ j = 1, \dots, k\})$, $i = 1, \dots, N$, then
\[
P\bigl[\rho(r_2) \mid (X_i, Y_i)_{i=1}^N\bigr] = \frac{1}{N} \sum_{i=1}^{N} P\bigl[\rho(\sigma_{i+N}) \mid (X_i, Y_i)_{i=1}^N\bigr],
\]
meaning that the expectation can be taken on a restricted shadow sample of the same size as the observed sample.
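The notation just introduced can be illustrated in Python on made-up data with a toy threshold classifier (all numerical values below are arbitrary); the check at the end verifies the identity $r = \frac{r_1 + k r_2}{k+1}$.
\begin{verbatim}
import numpy as np

# Toy illustration of the error rates: sigma_i(theta) = 1{f_theta(X_i) != Y_i},
# r_1 on the first N points, r_2 on the kN shadow points, r the global rate.
rng = np.random.default_rng(1)
N, k = 50, 3
X = rng.normal(size=(k + 1) * N)
Y = (X + 0.5 * rng.normal(size=(k + 1) * N) > 0).astype(int)

def f_theta(x, theta):            # a simple threshold classifier
    return (x > theta).astype(int)

def error_rates(theta):
    sigma = (f_theta(X, theta) != Y).astype(float)
    r1 = sigma[:N].mean()
    r2 = sigma[N:].mean()
    r = (r1 + k * r2) / (k + 1)
    assert np.isclose(r, sigma.mean())     # r is also the global error rate
    return r1, r2, r

print(error_rates(theta=0.0))
\end{verbatim}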
If moreover the rows are equidistributed, meaning that their marginal distributions are equal, then
\[
P\bigl[\rho(r_2) \mid (X_i, Y_i)_{i=1}^N\bigr] = P\bigl[\rho(\sigma_{N+1}) \mid (X_i, Y_i)_{i=1}^N\bigr].
\]
This means that under these quite commonly fulfilled assumptions, the expectation can be taken on a single new object to be classified; our study thus covers the case when only one of the patterns from the shadow sample is to be labelled and one is interested in the expected error rate of this single labelling. Of course, in the case when $P$ is i.i.d. and $\rho$ depends only on the training sample $(X_i, Y_i)_{i=1}^N$, we fall back on the usual criterion of performance $P\bigl[\rho(r_2) \mid (Z_i)_{i=1}^N\bigr] = \rho(R) = \rho(R_1)$.

3.1.2. Absolute bound

Using an obvious factorization, and considering for the moment a fixed value of $\theta$ and any partially exchangeable positive real measurable function $\lambda : \Omega \to \mathbb{R}_+$, we can compute the log-Laplace transform of $r_1$ under $T$, which acts like a conditional probability distribution:
\[
\log\bigl\{ T \exp(-\lambda r_1) \bigr\} = \sum_{i=1}^{N} \log\Bigl\{ T_i \exp\bigl(-\tfrac{\lambda}{N} \sigma_i\bigr) \Bigr\}
\le N \log\Bigl\{ \frac{1}{N} \sum_{i=1}^{N} T_i\bigl[\exp\bigl(-\tfrac{\lambda}{N} \sigma_i\bigr)\bigr] \Bigr\}
= -\lambda\, \Phi_{\frac{\lambda}{N}}(r),
\]
where the function $\Phi_{\frac{\lambda}{N}}$ was defined by equation (1.1, page 2). Remarking that
\[
T\Bigl\{ \exp\Bigl[\lambda\bigl(\Phi_{\frac{\lambda}{N}}(r) - r_1\bigr)\Bigr] \Bigr\} = \exp\bigl[\lambda \Phi_{\frac{\lambda}{N}}(r)\bigr]\, T\bigl[\exp(-\lambda r_1)\bigr],
\]
we obtain

Lemma 3.1.1. For any $\theta \in \Theta$ and any partially exchangeable positive real measurable function $\lambda : \Omega \to \mathbb{R}_+$,
\[
T\Bigl\{ \exp\Bigl[\lambda\bigl(\Phi_{\frac{\lambda}{N}}(r(\theta)) - r_1(\theta)\bigr)\Bigr] \Bigr\} \le 1.
\]

We deduce from this lemma a result analogous to the inductive case:

Theorem 3.1.2. For any partially exchangeable positive real measurable function $\lambda : \Omega \times \Theta \to \mathbb{R}_+$, for any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}_+^1(\Theta)$,
\[
P \exp\Bigl\{ \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho\Bigl[\lambda\bigl(\Phi_{\frac{\lambda}{N}}(r) - r_1\bigr)\Bigr] - K(\rho, \pi) \Bigr\} \le 1.
\]

The proof is deduced from the previous lemma, using the fact that $\pi$ is partially exchangeable:
\[
P \exp\Bigl\{ \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho\Bigl[\lambda\bigl(\Phi_{\frac{\lambda}{N}}(r) - r_1\bigr)\Bigr] - K(\rho, \pi) \Bigr\}
= P\, \pi\Bigl\{ \exp\Bigl[\lambda\bigl(\Phi_{\frac{\lambda}{N}}(r) - r_1\bigr)\Bigr] \Bigr\}
= P\, T\, \pi\Bigl\{ \exp\Bigl[\lambda\bigl(\Phi_{\frac{\lambda}{N}}(r) - r_1\bigr)\Bigr] \Bigr\}
= P\, \pi\Bigl\{ T \exp\Bigl[\lambda\bigl(\Phi_{\frac{\lambda}{N}}(r) - r_1\bigr)\Bigr] \Bigr\} \le 1.
\]

3.1.3. Relative bounds

Introducing in the same way
\[
m'(\theta, \theta') = \frac{1}{N} \sum_{i=1}^{N} \Bigl| \mathbb{1}\bigl[f_\theta(X_i) \neq Y_i\bigr] - \mathbb{1}\bigl[f_{\theta'}(X_i) \neq Y_i\bigr] \Bigr|
\]
and
\[
m(\theta, \theta') = \frac{1}{(k+1)N} \sum_{i=1}^{(k+1)N} \Bigl| \mathbb{1}\bigl[f_\theta(X_i) \neq Y_i\bigr] - \mathbb{1}\bigl[f_{\theta'}(X_i) \neq Y_i\bigr] \Bigr|,
\]
we could prove along the same line of reasoning

Theorem 3.1.3. For any real parameter $\lambda$, any $\tilde{\theta} \in \Theta$, any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}_+^1(\Theta)$,
\[
P \exp\Bigl\{ \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda \Bigl[ \rho\Bigl(\Psi_{\frac{\lambda}{N}}\bigl(r(\cdot) - r(\tilde{\theta}),\, m(\cdot, \tilde{\theta})\bigr)\Bigr) - \bigl(\rho(r_1) - r_1(\tilde{\theta})\bigr) \Bigr] - K(\rho, \pi) \Bigr\} \le 1,
\]
where the function $\Psi_{\frac{\lambda}{N}}$ was defined by equation (1.21, page 35).

Theorem 3.1.4. For any real constant $\gamma$, for any $\tilde{\theta} \in \Theta$, for any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}_+^1(\Theta)$,
\[
P \exp\biggl\{ \sup_{\rho \in \mathcal{M}_+^1(\Theta)} -N \rho\Bigl\{ \log\Bigl[1 - \tanh\bigl(\tfrac{\gamma}{N}\bigr)\bigl(r(\cdot) - r(\tilde{\theta})\bigr)\Bigr] \Bigr\} - \gamma\bigl(\rho(r_1) - r_1(\tilde{\theta})\bigr) - N \log\cosh\bigl(\tfrac{\gamma}{N}\bigr)\, \rho\bigl(m'(\cdot, \tilde{\theta})\bigr) - K(\rho, \pi) \biggr\} \le 1.
\]

This last theorem can be generalized to give

Theorem 3.1.5. For any real constant $\gamma$, for any partially exchangeable posterior distributions $\pi_1, \pi_2 : \Omega \to \mathcal{M}_+^1(\Theta)$,
\[
P \exp\biggl\{ \sup_{\rho_1, \rho_2 \in \mathcal{M}_+^1(\Theta)} -N \log\Bigl\{ 1 - \tanh\bigl(\tfrac{\gamma}{N}\bigr)\bigl(\rho_1(r) - \rho_2(r)\bigr) \Bigr\} - \gamma\bigl(\rho_1(r_1) - \rho_2(r_1)\bigr) - N \log\cosh\bigl(\tfrac{\gamma}{N}\bigr)\, (\rho_1 \otimes \rho_2)(m') - K(\rho_1, \pi_1) - K(\rho_2, \pi_2) \biggr\} \le 1.
\]
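Lemma 3.1.1 can be verified exactly by computation, since $\sigma_i$ depends only on the $i$th row and $T$ therefore factorizes into the one-row averages $T_i$. The following Python sketch draws an arbitrary table of error indicators and evaluates the left-hand side, using for $\Phi_a$ the form $\Phi_a(q) = -\log\bigl[1 - (1 - e^{-a})q\bigr]/a$ consistent with the computation above.
\begin{verbatim}
import numpy as np

# Exact check of Lemma 3.1.1.  Since sigma_i depends only on row i,
#   T[exp(-lambda r_1)] = prod_i T_i[exp(-(lambda/N) sigma_i)]
#                       = prod_i (1 - (1 - e^{-lambda/N}) t_i),  t_i = T_i(sigma_i),
# and r = mean(t_i).
rng = np.random.default_rng(2)
N, k, lam = 30, 4, 70.0

sigma = rng.integers(0, 2, size=(N, k + 1)).astype(float)  # row i = example i
t = sigma.mean(axis=1)                                     # T_i(sigma_i)
r = t.mean()                                               # global error rate
a = lam / N
phi = -np.log(1.0 - (1.0 - np.exp(-a)) * r) / a
lhs = np.exp(lam * phi) * np.prod(1.0 - (1.0 - np.exp(-a)) * t)
print(lhs <= 1.0 + 1e-12, lhs)                             # the lemma: lhs <= 1
\end{verbatim}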
To conclude this section, we see that the basic theorems of transductive PAC-Bayesian classification have exactly the same form as the basic inequalities of inductive classification, Theorems 1.1.4 (page 4), 1.4.2 (page 36) and 1.4.3 (page 37), with $R(\theta)$ replaced with $r(\theta)$, $r(\theta)$ replaced with $r_1(\theta)$ and $M'(\theta, \tilde{\theta})$ replaced with $m(\theta, \tilde{\theta})$. Thus all the results of the first two chapters remain true under the hypotheses of transductive classification, with $R(\theta)$ replaced with $r(\theta)$, $r(\theta)$ replaced with $r_1(\theta)$ and $M'(\theta, \tilde{\theta})$ replaced with $m(\theta, \tilde{\theta})$.

Consequently, in the case when the unlabelled shadow sample is observed, it is possible to improve on the Vapnik bounds to be discussed hereafter by using an explicit partially exchangeable posterior distribution $\pi$ and resorting to localized or to relative bounds (in the case at least of unlimited computing resources, which of course may still be unrealistic in many real world situations, and with the caveat, to be recalled in the conclusion of this study, that for small sample sizes and comparatively complex classification models, the improvement may not be so decisive).

Let us notice also that the transductive setting, when experimentally available, has the advantage that
\[
d(\theta, \theta') = \frac{1}{(k+1)N} \sum_{i=1}^{(k+1)N} \mathbb{1}\bigl[f_{\theta'}(X_i) \neq f_\theta(X_i)\bigr] \ge m(\theta, \theta') \ge r(\theta) - r(\theta'), \qquad \theta, \theta' \in \Theta,
\]
is observable in this context, providing an empirical upper bound for the difference $r(\hat{\theta}) - \rho(r)$ for any non-randomized estimator $\hat{\theta}$ and any posterior distribution $\rho$, namely
\[
r(\hat{\theta}) \le \rho(r) + \rho\bigl(d(\cdot, \hat{\theta})\bigr).
\]
Thus in the setting of transductive statistical experiments, the PAC-Bayesian framework provides fully empirical bounds for the error rate of non-randomized estimators $\hat{\theta} : \Omega \to \Theta$, even when using a non-atomic prior $\pi$ (or more generally a non-atomic partially exchangeable posterior distribution $\pi$), even when $\Theta$ is not a vector space and even when $\theta \mapsto R(\theta)$ cannot be proved to be convex on the support of some useful posterior distribution $\rho$.

3.2. Vapnik bounds for transductive classification

In this section, we will stick to plain unlocalized non-relative bounds. As we have already mentioned (and as it was put forward by Vapnik himself in his seminal works), these bounds are not always superseded by the asymptotically better ones when the sample is of small size: they deserve all our attention for this reason. We will start with the general case of a shadow sample of arbitrary size. We will then discuss the case of a shadow sample of equal size to the training set and the case of a fully exchangeable sample distribution, showing how they can be taken advantage of to sharpen inequalities.

3.2.1. With a shadow sample of arbitrary size

The great thing with the transductive setting is that we are manipulating only $r_1$ and $r$, which can take only a finite number of values and therefore are piecewise constant on $\Theta$. This makes it possible to derive inequalities that will hold uniformly for any value of the parameter $\theta \in \Theta$.
To this purpose, let us consider for any value $\theta \in \Theta$ of the parameter the subset $\Delta(\theta) \subset \Theta$ of parameters $\theta'$ such that the classification rule $f_{\theta'}$ answers the same on the extended sample $(X_i)_{i=1}^{(k+1)N}$ as $f_\theta$. Namely, let us put for any $\theta \in \Theta$
\[
\Delta(\theta) = \bigl\{ \theta' \in \Theta ;\ f_{\theta'}(X_i) = f_\theta(X_i),\ i = 1, \dots, (k+1)N \bigr\}.
\]
We see immediately that $\Delta(\theta)$ is an exchangeable parameter subset on which $r_1$ and $r_2$, and therefore also $r$, take constant values. Thus for any $\theta \in \Theta$ we may consider the posterior $\rho_\theta$ defined by
\[
\frac{d\rho_\theta}{d\pi}(\theta') = \mathbb{1}\bigl[\theta' \in \Delta(\theta)\bigr]\, \pi\bigl(\Delta(\theta)\bigr)^{-1},
\]
and use the fact that $\rho_\theta(r_1) = r_1(\theta)$ and $\rho_\theta(r) = r(\theta)$, to prove

Lemma 3.2.1. For any partially exchangeable positive real measurable function $\lambda : \Omega \times \Theta \to \mathbb{R}_+$ such that
\[
(3.1) \qquad \lambda(\omega, \theta') = \lambda(\omega, \theta), \qquad \theta \in \Theta,\ \theta' \in \Delta(\theta),\ \omega \in \Omega,
\]
and any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}_+^1(\Theta)$, with $P$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,
\[
\Phi_{\frac{\lambda(\theta)}{N}}\bigl(r(\theta)\bigr) + \frac{\log\bigl(\epsilon\, \pi(\Delta(\theta))\bigr)}{\lambda(\theta)} \le r_1(\theta).
\]

We can then remark that for any value of $\lambda$ independent of $\omega$, the left-hand side of the previous inequality is a partially exchangeable function of $\omega \in \Omega$. Thus this left-hand side is maximized by some partially exchangeable function $\lambda$, namely
\[
\arg\max_{\lambda} \Bigl\{ \Phi_{\frac{\lambda}{N}}\bigl(r(\theta)\bigr) + \frac{\log\bigl(\epsilon\, \pi(\Delta(\theta))\bigr)}{\lambda} \Bigr\}
\]
is partially exchangeable, as depending only on partially exchangeable quantities. Moreover this choice of $\lambda(\omega, \theta)$ also satisfies condition (3.1) stated in the previous lemma of being constant on $\Delta(\theta)$, proving

Lemma 3.2.2. For any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}_+^1(\Theta)$, with $P$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$ and any $\lambda \in \mathbb{R}_+$,
\[
\Phi_{\frac{\lambda}{N}}\bigl(r(\theta)\bigr) + \frac{\log\bigl(\epsilon\, \pi(\Delta(\theta))\bigr)}{\lambda} \le r_1(\theta).
\]

Writing $r = \frac{r_1 + k r_2}{k+1}$ and rearranging terms we obtain

Theorem 3.2.3. For any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}_+^1(\Theta)$, with $P$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,
\[
r_2(\theta) \le \frac{k+1}{k} \inf_{\lambda \in \mathbb{R}_+} \frac{1 - \exp\Bigl[-\frac{\lambda}{N}\, r_1(\theta) + \frac{\log\bigl(\epsilon\, \pi(\Delta(\theta))\bigr)}{N}\Bigr]}{1 - \exp\bigl(-\frac{\lambda}{N}\bigr)} - \frac{r_1(\theta)}{k}.
\]

If we have a set of binary classification rules $\{f_\theta;\ \theta \in \Theta\}$ whose Vapnik–Cervonenkis dimension is not greater than $h$, we can choose $\pi$ such that $\pi(\Delta(\theta))$ is independent of $\theta$ and not less than $\bigl[\frac{h}{e(k+1)N}\bigr]^h$, as will be proved further on in Theorem 4.2.2 (page 144).

Another important setting where the complexity term $-\log \pi(\Delta(\theta))$ can easily be controlled is the case of compression schemes, introduced by Littlestone et al. (1986). It goes as follows: we are given for each labelled sub-sample $(X_i, Y_i)_{i \in J}$, $J \subset \{1, \dots, N\}$, an estimator of the parameter
\[
\hat{\theta}\bigl((X_i, Y_i)_{i \in J}\bigr) = \hat{\theta}_J, \qquad J \subset \{1, \dots, N\},\ |J| \le h,
\]
where $\hat{\theta} : \bigsqcup_{j=1}^{N} (\mathcal{X} \times \mathcal{Y})^j \to \Theta$ is an exchangeable function providing estimators for sub-samples of arbitrary size. Let us assume that $\hat{\theta}$ is exchangeable, meaning that for any $j = 1, \dots, N$ and any permutation $\sigma$ of $\{1, \dots, j\}$,
\[
\hat{\theta}\bigl((x_i, y_i)_{i=1}^{j}\bigr) = \hat{\theta}\bigl((x_{\sigma(i)}, y_{\sigma(i)})_{i=1}^{j}\bigr), \qquad (x_i, y_i)_{i=1}^{j} \in (\mathcal{X} \times \mathcal{Y})^{j}.
\]
In this situation, we can introduce the exchangeable subset
\[
\bigl\{ \hat{\theta}_J ;\ J \subset \{1, \dots, (k+1)N\},\ |J| \le h \bigr\} \subset \Theta,
\]
which is seen to contain at most
\[
\sum_{j=0}^{h} \binom{(k+1)N}{j} \le \Bigl[\frac{e(k+1)N}{h}\Bigr]^{h}
\]
classification rules, as will be proved later on in Theorem 4.2.3 (page 144).
Note that we had to extend the range of $J$ to all the subsets of the extended sample, although we will use for estimation only those of the training sample, on which the labels are observed. Thus in this case also we can find a partially exchangeable posterior distribution $\pi$ such that $\pi\bigl(\Delta(\hat{\theta}_J)\bigr) \ge \bigl[\frac{h}{e(k+1)N}\bigr]^h$. We see that the size of the compression scheme plays the same role in this complexity bound as the Vapnik–Cervonenkis dimension for Vapnik–Cervonenkis classes.

In these two cases of binary classification with Vapnik–Cervonenkis dimension not greater than $h$ and compression schemes depending on a compression set with at most $h$ points, we get a bound of
\[
r_2(\theta) \le \frac{k+1}{k} \inf_{\lambda \in \mathbb{R}_+} \frac{1 - \exp\Bigl[-\frac{\lambda}{N}\, r_1(\theta) - \frac{h \log\bigl(\frac{e(k+1)N}{h}\bigr) - \log(\epsilon)}{N}\Bigr]}{1 - \exp\bigl(-\frac{\lambda}{N}\bigr)} - \frac{r_1(\theta)}{k}.
\]
Let us make some numerical application: when $N = 1000$, $h = 10$, $\epsilon = 0.01$, and $\inf_\Theta r_1 = r_1(\hat{\theta}) = 0.2$, we find that $r_2(\hat{\theta}) \le 0.4093$, for $k$ between 15 and 17, and values of $\lambda$ equal respectively to 965, 968 and 971. For $k = 1$, we find only $r_2(\hat{\theta}) \le 0.539$, showing the interest of allowing $k$ to be larger than 1.

3.2.2. When the shadow sample has the same size as the training sample

In the case when $k = 1$, we can improve Theorem 3.1.2 by taking advantage of the fact that $T_i(\sigma_i)$ can take only three values, namely $0$, $0.5$ and $1$. We see thus that $T_i(\sigma_i) - \Phi_{\frac{\lambda}{N}}\bigl(T_i(\sigma_i)\bigr)$ can take only two values, $0$ and $\frac{1}{2} - \Phi_{\frac{\lambda}{N}}\bigl(\frac{1}{2}\bigr)$, because $\Phi_{\frac{\lambda}{N}}(0) = 0$ and $\Phi_{\frac{\lambda}{N}}(1) = 1$. Thus
\[
T_i(\sigma_i) - \Phi_{\frac{\lambda}{N}}\bigl(T_i(\sigma_i)\bigr) = \bigl[1 - |1 - 2 T_i(\sigma_i)|\bigr] \Bigl[\tfrac{1}{2} - \Phi_{\frac{\lambda}{N}}\bigl(\tfrac{1}{2}\bigr)\Bigr].
\]
This shows that in the case when $k = 1$,
\begin{align*}
\log\bigl\{ T \exp(-\lambda r_1) \bigr\}
&= -\lambda r + \frac{\lambda}{N} \sum_{i=1}^{N} \Bigl[ T_i(\sigma_i) - \Phi_{\frac{\lambda}{N}}\bigl(T_i(\sigma_i)\bigr) \Bigr] \\
&= -\lambda r + \frac{\lambda}{N} \sum_{i=1}^{N} \bigl[1 - |1 - 2 T_i(\sigma_i)|\bigr] \Bigl[\tfrac{1}{2} - \Phi_{\frac{\lambda}{N}}\bigl(\tfrac{1}{2}\bigr)\Bigr] \\
&\le -\lambda r + \lambda \Bigl[\tfrac{1}{2} - \Phi_{\frac{\lambda}{N}}\bigl(\tfrac{1}{2}\bigr)\Bigr] \bigl(1 - |1 - 2r|\bigr).
\end{align*}
Noticing that $\frac{1}{2} - \Phi_{\frac{\lambda}{N}}\bigl(\frac{1}{2}\bigr) = \frac{N}{\lambda} \log\cosh\bigl(\frac{\lambda}{2N}\bigr)$, we obtain

Theorem 3.2.4. For any partially exchangeable function $\lambda : \Omega \times \Theta \to \mathbb{R}_+$, for any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}_+^1(\Theta)$,
\[
P \exp\Bigl\{ \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho\Bigl[ \lambda (r - r_1) - N \log\cosh\bigl(\tfrac{\lambda}{2N}\bigr) \bigl(1 - |1 - 2r|\bigr) \Bigr] - K(\rho, \pi) \Bigr\} \le 1.
\]

As a consequence, reasoning as previously, we deduce

Theorem 3.2.5. In the case when $k = 1$, for any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}_+^1(\Theta)$, with $P$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$ and any $\lambda \in \mathbb{R}_+$,
\[
r(\theta) - \frac{N}{\lambda} \log\cosh\bigl(\tfrac{\lambda}{2N}\bigr) \bigl[1 - |1 - 2 r(\theta)|\bigr] + \frac{\log\bigl(\epsilon\, \pi(\Delta(\theta))\bigr)}{\lambda} \le r_1(\theta);
\]
and consequently, for any $\theta \in \Theta$,
\[
r_2(\theta) \le 2 \inf_{\lambda \in \mathbb{R}_+} \frac{r_1(\theta) - \frac{\log(\epsilon\, \pi(\Delta(\theta)))}{\lambda}}{1 - \frac{2N}{\lambda} \log\cosh\bigl(\frac{\lambda}{2N}\bigr)} - r_1(\theta).
\]

In the case of binary classification using a Vapnik–Cervonenkis class of Vapnik–Cervonenkis dimension not greater than $h$, we can choose $\pi$ such that $-\log \pi(\Delta(\theta)) \le h \log\bigl(\frac{2eN}{h}\bigr)$ and obtain the following numerical illustration of this theorem: for $N = 1000$, $h = 10$, $\epsilon = 0.01$ and $\inf_\Theta r_1 = r_1(\hat{\theta}) = 0.2$, we find an upper bound $r_2(\hat{\theta}) \le 0.5033$, which improves on Theorem 3.2.3 but still is not under the significance level $\frac{1}{2}$ (achieved by blind random classification). This indicates that considering shadow samples of arbitrary sizes in some noisy situations yields a significant improvement on bounds obtained with a shadow sample of the same size as the training sample.
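The numerical application quoted above can be reproduced (up to rounding) by optimizing the displayed bound over $\lambda$ on a grid; the short Python script below does this for the Vapnik–Cervonenkis case, with the same values $N = 1000$, $h = 10$, $\epsilon = 0.01$, $r_1 = 0.2$.
\begin{verbatim}
import numpy as np

# Numerical application of the bound displayed above:
# r2 <= (k+1)/k * [1 - exp(-(lam/N) r1 - (h log(e(k+1)N/h) - log eps)/N)]
#               / [1 - exp(-lam/N)]  -  r1/k,   optimized over lam.
def r2_bound(N=1000, h=10, eps=0.01, r1=0.2, k=16):
    lams = np.arange(1.0, 5000.0)
    d = h * np.log(np.e * (k + 1) * N / h) - np.log(eps)
    num = 1.0 - np.exp(-(lams / N) * r1 - d / N)
    den = 1.0 - np.exp(-lams / N)
    vals = (k + 1) / k * num / den - r1 / k
    j = int(np.argmin(vals))
    return vals[j], lams[j]

print(r2_bound(k=16))   # about 0.409 at lambda near 968 (the 0.4093 above)
print(r2_bound(k=1))    # about 0.54 for k = 1 (the 0.539 figure above)
\end{verbatim}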
When mor e over the distribution of the augmente d sample is exchange able When k = 1 and P is exchangeable meaning that for any b ounded mea surable function h : Ω → R and any p erm utation s ∈ S { 1 , . . . , 2 N } P h ( ω ◦ s ) = P h ( ω ) , then we can still improve the bound a s follows. Let T ′ ( h ) = 1 N ! X s ∈ S { N +1 ,..., 2 N } h ( ω ◦ s ) . Then we can write 1 − | 1 − 2 T i ( σ i ) | = ( σ i − σ i + N ) 2 = σ i + σ i + N − 2 σ i σ i + N . Using this identit y , we get for a n y exchangeable function λ : Ω × Θ → R + , T exp λ ( r − r 1 ) − log cosh( λ 2 N ) N X i =1 σ i + σ i + N − 2 σ i σ i + N ≤ 1 . Let us put A ( λ ) = 2 N λ log cosh( λ 2 N ) , (3.2) v ( θ ) = 1 2 N N X i =1 ( σ i + σ i + N − 2 σ i σ i + N ) . (3.3) With this notation T n exp λ r − r 1 − A ( λ ) v o ≤ 1 . Let us notice now that T ′ v ( θ ) = r ( θ ) − r 1 ( θ ) r 2 ( θ ) . Let π : Ω → M 1 + (Θ) be any given exchangeable p osterior distribution. Using the exchangeabilit y of P a nd π and the exchangeability of the expo nen tial function, w e get P n π h exp λ r − r 1 − A ( r − r 1 r 2 ) io = P n π h exp λ r − r 1 − AT ′ ( v ) io 120 Chapter 3. T r ansductive P AC-Bayesi an le arning ≤ P n π h T ′ exp λ r − r 1 − Av io = P n T ′ π h exp λ r − r 1 − Av io = P n π h exp λ r − r 1 − Av io = P n T π h exp λ r − r 1 − Av io = P n π h T exp λ r − r 1 − Av io ≤ 1 . W e a re thus ready to state Theorem 3 . 2.6 . In the c ase when k = 1 , for any exchange able p r ob ability dis- tribution P , for any exchange able p osterior d istribution π : Ω → M 1 + (Θ) , for any exchange able fun ction λ : Ω × Θ → R + , P exp sup ρ ∈ M 1 + (Θ) ρ n λ r − r 1 − A ( λ )( r − r 1 r 2 ) o − K ( ρ, π ) ≤ 1 , wher e A ( λ ) is define d by e quation (3.2, p age 119). W e then deduce a s pre v iously Corollary 3. 2.7 . F or any exchange able p osterior distribution π : Ω → M 1 + (Θ) , for any exchange able pr ob ability me asur e P ∈ M 1 + (Ω) , for any me asur able exchange able function λ : Ω × Θ → R + , with P pr ob ability at le ast 1 − ǫ , for any θ ∈ Θ , r ( θ ) ≤ r 1 ( θ ) + A ( λ ) r ( θ ) − r 1 ( θ ) r 2 ( θ ) − log ǫπ ∆( θ ) λ , wher e A ( λ ) is define d by e quation (3.2, p age 119). In order to deduce an empirical bo und from this theorem, we hav e to make some choice fo r λ ( ω , θ ). F ortunately , it is eas y to show that the b ound holds uni- formly in λ , b ecause the inequality can b e rewritten as a function of only one non-exchangeable quant ity , namely r 1 ( θ ). Indeed, s ince r 2 = 2 r − r 1 , we see that the inequality can b e written as r ( θ ) ≤ r 1 ( θ ) + A ( λ ) r ( θ ) − 2 r ( θ ) r 1 ( θ ) + r 1 ( θ ) 2 − log ǫπ ∆( θ ) λ . It can be solved in r 1 ( θ ), to get r 1 ( θ ) ≥ f λ, r ( θ ) , − log ǫπ ∆( θ ) , where f ( λ, r , d ) = 2 A ( λ ) − 1 2 rA ( λ ) − 1 + r 1 − 2 r A ( λ ) 2 + 4 A ( λ ) n r 1 − A ( λ ) − d λ o . Thu s we can find some exc hangea ble function λ ( ω , θ ), such that f λ ( ω , θ ) , r ( θ ) , − log ǫπ ∆( θ ) = sup β ∈ R + f β , r ( θ ) , − log ǫπ ∆( θ ) . Applying Corolla r y 3.2.7 (page 120) to that c hoice of λ , we see that 3.3. V apnik b oun ds for inductive classific ation 121 Theorem 3 . 2.8 . F or any exchange able pr ob ability me asur e P ∈ M 1 + (Ω) , f or any exchange able p osterior pr ob ability distribution π : Ω → M 1 + (Θ) , with P pr ob abili ty at le ast 1 − ǫ , for any θ ∈ Θ , for any λ ∈ R + , r ( θ ) ≤ r 1 ( θ ) + A ( λ ) r ( θ ) − r 1 ( θ ) r 2 ( θ ) − log ǫπ ∆( θ ) λ , wher e A ( λ ) is define d by e quation (3.2, p age 119). 
Solving the previous inequalit y in r 2 ( θ ), we get Corollary 3. 2.9 . Un der the same assumptions as in the pr evious the or em, with P pr ob ability at le ast 1 − ǫ , for any θ ∈ Θ , r 2 ( θ ) ≤ inf λ ∈ R + r 1 ( θ ) n 1 + 2 N λ log cosh( λ 2 N ) o − 2 log ǫπ ∆( θ ) λ 1 − 2 N λ log cosh( λ 2 N ) 1 − 2 r 1 ( θ ) . Applying this to our usual numerical example of a binary classification mo del with V apnik– Cerv onenkis dimension not greater than h = 10, when N = 10 00, inf Θ r 1 = r 1 ( b θ ) = 10 and ǫ = 0 . 0 1 , we o btain that r 2 ( b θ ) ≤ 0 . 4450. 3.3. V apnik b ounds for inductiv e classification 3.3.1. Arbitr ary shadow sample size W e a ssume in this se c tio n that P = N O i =1 P i ⊗ ∞ ∈ M 1 + n X × Y N N o , where P i ∈ M 1 + X × Y : we consider an infinite i.i.d. sequence of independent non -identically distributed samples of size N , the first one only being o bserv ed. More precise ly , under P eac h sa mple ( X i + jN , Y i + jN ) N i =1 is distributed according to N N i =1 P i , and they a re all indep enden t from ea ch other . Only the first sample ( X i , Y i ) N i =1 is as sumed to b e obse r v ed. The sha do w samples will only appear in the pro ofs. The aim of this section is to prov e b etter V a pnik b ounds, generalizing them in the same time to the indep enden t non-i.i.d. se tting, whic h to our knowledge has not b een done b efore. Let us intro duce the notation P ′ h ( ω ) = P h ( ω ) | ( X i , Y i ) N i =1 , where h may be any suitable (e.g. b ounded) random v ar iable, let us also put Ω = ( X × Y ) N N . Definition 3. 3.1 . F or any subset A ⊂ N of inte gers, let C ( A ) b e t he set of cir cular p ermu tations of the total ly or der e d set A , extende d t o a p ermutation of N by taking it to b e the identity on the c omplement N \ A of A . We wil l say that a r andom function h : Ω → R is k -p artial ly exchange able if h ( ω ◦ s ) = h ( ω ) , s ∈ C { i + j N ; j = 0 , . . . , k } , i = 1 , . . . , N . In the same way, we will say that a p osterior distribution π : Ω → M 1 + (Θ) is k -p artial ly ex change able if π ( ω ◦ s ) = π ( ω ) ∈ M 1 + (Θ) , s ∈ C { i + j N ; j = 0 , . . . , k } , i = 1 , . . . , N . 122 Chapter 3. T r ansductive P AC-Bayesi an le arning Note that P itself is k -par tially exchangeable for an y k in the sense th at for any bo unded mea surable function h : Ω → R P h ( ω ◦ s ) = P h ( ω ) , s ∈ C { i + j N ; j = 0 , . . . , k } , i = 1 , . . . , N . Let ∆ k ( θ ) = n θ ′ ∈ Θ ; f θ ′ ( X i ) ( k +1) N i =1 = f θ ( X i ) ( k +1) N i =1 o , θ ∈ Θ , k ∈ N ∗ , and let also r k ( θ ) = 1 ( k + 1) N ( k +1) N X i =1 1 f θ ( X i ) 6 = Y i . Theorem 3.1 .2 s hows tha t for an y po sitiv e real parameter λ and a n y k -par tially exchangeable p osterior distribution π k : Ω → M 1 + (Θ), P exp sup θ ∈ Θ λ Φ λ N ( r k ) − r 1 + log ǫπ k ∆ k ( θ ) ≤ ǫ. Using the genera l fact that P exp( h ) = P n P ′ exp( h ) o ≥ P n exp P ′ ( h ) o , and the fact that the exp ectation of a supremum is lar ger than the supre mum of an exp ectation, w e see that with P probability at most 1 − ǫ , for any θ ∈ Θ, P ′ n Φ λ N r k ( θ ) o ≤ r 1 ( θ ) − P ′ n log ǫπ k ∆ k ( θ ) o λ . F or s ho rt let us put ¯ d k ( θ ) = − log ǫπ k ∆ k ( θ ) , d ′ k ( θ ) = − P ′ n log ǫπ k ∆ k ( θ ) o , d k ( θ ) = − P n log ǫπ k ∆ k ( θ ) o . W e can use the conv exity of Φ λ N and the fact that P ′ ( r k ) = r 1 + kR k +1 , to establish that P ′ n Φ λ N r k ( θ ) o ≥ Φ λ N r 1 ( θ ) + k R ( θ ) k + 1 . W e have prov ed Theorem 3 . 3.1 . 
Using the ab ove hyp otheses and notation, for any se quenc e π k : Ω → M 1 + (Θ) , wher e π k is a k -p artial ly exchange able p osterior distribution, for any p ositive r e al c onstant λ , any p ositive inte ger k , with P pr ob ability at le ast 1 − ǫ , for any θ ∈ Θ , Φ λ N r 1 ( θ ) + k R ( θ ) k + 1 ≤ r 1 ( θ ) + d ′ k ( θ ) λ . W e ca n make as we did with Theorem 1.2 .6 (page 1 1) the result of this theorem uniform in λ ∈ { α j ; j ∈ N ∗ } a nd k ∈ N ∗ (considering on k the prior 1 k ( k +1) and on j the prior 1 j ( j +1) ), and obtain 3.3. V apnik b oun ds for inductive classific ation 123 Theorem 3 . 3.2 . F or any r e al p ar ameter α > 1 , with P pr ob ability at le ast 1 − ǫ , for any θ ∈ Θ , R ( θ ) ≤ inf k ∈ N ∗ ,j ∈ N ∗ 1 − e x p − α j N r 1 ( θ ) − 1 N n d ′ k ( θ ) + lo g k ( k + 1) j ( j + 1) o k k +1 h 1 − e x p − α j N i − r 1 ( θ ) k . As a sp ecial case we can choo s e π k such that log π k ∆ k ( θ ) is independent of θ and equal to log( N k ), where N k = f θ ( X i ) ( k +1) N i =1 ; θ ∈ Θ is the size of the tra ce of the classification mo del on the extended s ample of s ize ( k + 1) N . With this choice, we obtain a b ound inv olving a new flavour of conditional V apnik en tropy , namely d ′ k ( θ ) = P log( N k ) | ( Z i ) N i =1 − log( ǫ ) . In the case of binary classification using a V apnik–Cer v onenkis class o f V apnik– Cervonenkis dimensio n not greater than h = 10, when N = 1000, inf Θ r 1 = r 1 ( b θ ) = 0 . 2 and ǫ = 0 . 01, c ho osing α = 1 . 1, we o btain R ( b θ ) ≤ 0 . 4271 (for an o ptimal v alue of λ = 1071 . 8 , and an optimal v alue of k = 16). 3.3.2. A b etter minimization with r esp e ct to the exp onential p ar ameter If we are no t pleased with optimizing λ on a discrete subset of the rea l line, w e can use a slightly differen t approach. F rom Theo rem 3.1.2 (page 113), w e see that for a n y p ositive int eger k , for any k -partially exchangeable p ositive r eal measura ble function λ : Ω × Θ → R + satisfying equation (3.1, pa ge 116) — with ∆( θ ) r eplaced with ∆ k ( θ ) — for any ǫ ∈ )0 , 1) and η ∈ )0 , 1), P P ′ exp h sup θ λ Φ λ N ( r k ) − r 1 + log ǫη π k ∆ k ( θ ) ≤ ǫη , therefore with P probability at leas t 1 − ǫ , P ′ exp h sup θ λ Φ λ N ( r k ) − r 1 + log ǫη π k ∆ k ( θ ) i ≤ η , and consequently , with P probabilit y a t least 1 − ǫ , with P ′ probability at least 1 − η , for any θ ∈ Θ, Φ λ N ( r k ) + log ǫη π k ∆ k ( θ ) λ ≤ r 1 . Now we a re entitled to c ho ose λ ( ω , θ ) ∈ arg max λ ′ ∈ R + Φ λ ′ N ( r k ) + log ǫη π k ∆ k ( θ ) λ ′ . 124 Chapter 3. T r ansductive P AC-Bayesi an le arning This shows that with P probability at leas t 1 − ǫ , w ith P ′ probability at lea s t 1 − η , for any θ ∈ Θ, sup λ ∈ R + Φ λ N ( r k ) − ¯ d k ( θ ) − lo g( η ) λ ≤ r 1 , which can also b e written Φ λ N ( r k ) − r 1 − ¯ d k ( θ ) λ ≤ − log( η ) λ , λ ∈ R + . Thu s with P pro babilit y at least 1 − ǫ , for any θ ∈ Θ, any λ ∈ R + , P ′ Φ λ N ( r k ) − r 1 − ¯ d k ( θ ) λ ≤ − log( η ) λ + 1 − r 1 + log( η ) λ η . On the other hand, Φ λ N being a conv ex function, P ′ Φ λ N ( r k ) − r 1 − ¯ d k ( θ ) λ ≥ Φ λ N P ′ ( r k ) − r 1 − d ′ k λ = Φ λ N k R + r 1 k + 1 − r 1 − d ′ k λ . Thu s with P pro babilit y at least 1 − ǫ , for any θ ∈ Θ, k R + r 1 k + 1 ≤ inf λ ∈ R + Φ − 1 λ N r 1 (1 − η ) + η + d ′ k − log( η )(1 − η ) λ . 
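(Returning for a moment to the numerical illustration of Theorem 3.3.2 above, the bound can be reproduced by a brute-force search over the grid $\lambda=\alpha^j$ and over $k$. The sketch below is an added illustration under an assumption: it substitutes the worst-case Vapnik–Cervonenkis estimate $h\log\bigl(e(k+1)N/h\bigr)-\log\epsilon$ for the conditional entropy term $d'_k(\theta)$, which in general is a data-dependent quantity.)

```python
import numpy as np

N, h, eps, r1, alpha = 1000, 10, 0.01, 0.2, 1.1

def d_prime(k):
    # worst-case substitute for d'_k: h log(e (k+1) N / h) - log(eps)
    return h * np.log(np.e * (k + 1) * N / h) - np.log(eps)

best = (np.inf, None, None)
for k in range(1, 60):
    for j in range(1, 200):
        lam = alpha ** j
        num = 1 - np.exp(-lam / N * r1
                         - (d_prime(k) + np.log(k * (k + 1) * j * (j + 1))) / N)
        den = k / (k + 1) * (1 - np.exp(-lam / N))
        bound = num / den - r1 / k
        if bound < best[0]:
            best = (bound, k, lam)

print("R bound = %.4f at k = %d, lambda = %.1f" % best)
# prints roughly 0.427 with k around 16, in line with the value reported above
```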
W e can g eneralize this approa c h by considering a finite decreas ing sequence η 0 = 1 > η 1 > η 2 > · · · > η J > η J +1 = 0 , and the co rresp onding sequence of levels L j = − log( η j ) λ , 0 ≤ j ≤ J, L J +1 = 1 − r 1 − log( J ) − log( ǫ ) λ . T a king a union b ound in j , we see that with P probability at le a st 1 − ǫ , for any θ ∈ Θ, fo r a ny λ ∈ R + , P ′ Φ λ N ( r k ) − r 1 − ¯ d k + log( J ) λ ≥ L j ≤ η j , j = 0 , . . . , J + 1 , and consequently P ′ Φ λ N ( r k ) − r 1 − ¯ d k + log( J ) λ ≤ Z L J +1 0 P ′ Φ λ N ( r k ) − r 1 − ¯ d k + log( J ) λ ≥ α dα ≤ J +1 X j =1 η j − 1 ( L j − L j − 1 ) = η J 1 − r 1 − log( J ) − log( ǫ ) − lo g( η J ) λ − log( η 1 ) λ + J − 1 X j =1 η j λ log η j η j +1 . Let us put 3.3. V apnik b oun ds for inductive classific ation 125 d ′′ k θ, ( η j ) J j =1 = d ′ k ( θ ) + lo g( J ) − log( η 1 ) + J − 1 X j =1 η j log η j η j +1 + log ǫη J J η J . W e have prov ed that for a n y dec r easing seq uence ( η j ) J j =1 , with P probability at least 1 − ǫ , fo r a n y θ ∈ Θ , k R + r 1 k + 1 ≤ inf λ ∈ R + Φ − 1 λ N r 1 (1 − η J ) + η J + d ′′ k θ, ( η j ) J j =1 λ . Remark 3. 3 .1 . We c an for instanc e cho ose J = 2 , η 2 = 1 10 N , η 1 = 1 log(10 N ) , r esulting in d ′′ k = d ′ k + log(2 ) + log log(10 N ) + 1 − log log(10 N ) log(10 N ) − log 20 N ǫ 10 N . In the c ase wher e N = 1000 and for any ǫ ∈ )0 , 1) , we get d ′′ k ≤ d ′ k + 3 . 7 , in the c ase wher e N = 1 0 6 , we get d ′′ k ≤ d ′ k + 4 . 4 , and in the c ase N = 10 9 , we get d ′′ k ≤ d ′ k + 4 . 7 . Ther efor e, for any pr actic al purp ose we c ould take d ′′ k = d ′ k + 4 . 7 and η J = 1 10 N in the ab ove ine qu ali ty. T a king mor eo ver a weigh ted union b ound in k , we get Theorem 3 . 3.3 . F or any ǫ ∈ )0 , 1) , any se quenc e 1 > η 1 > · · · > η J > 0 , a ny se quenc e π k : Ω → M 1 + (Θ) , wher e π k is a k -p artial ly ex cha nge able p osterior distri- bution, with P pr ob ability at le ast 1 − ǫ , for any θ ∈ Θ , R ( θ ) ≤ inf k ∈ N ∗ k + 1 k inf λ ∈ R + Φ − 1 λ N r 1 ( θ ) + η J 1 − r 1 ( θ ) + d ′′ k θ, ( η j ) J j =1 + log k ( k + 1) λ − r 1 ( θ ) k . Corollary 3. 3.4 . F or any ǫ ∈ )0 , 1 ) , for any N ≤ 10 9 , with P pr ob ability at le ast 1 − ǫ , for any θ ∈ Θ , R ( θ ) ≤ inf k ∈ N ∗ inf λ ∈ R + k + 1 k 1 − e x p ( − λ N ) − 1 1 − e x p − λ N r 1 ( θ ) + 1 10 N − P ′ log( N k ) | ( Z i ) N i =1 − log( ǫ ) + lo g k ( k + 1) + 4 . 7 N − r 1 ( θ ) k . Let us end this section with a numerical ex ample: in the case of binar y classi- fication with a V apnik–Cervonenkis clas s of dimension not grea ter than 10, when N = 1000 , inf Θ r 1 = r 1 ( b θ ) = 0 . 2 and ǫ = 0 . 01, we get a b ound R ( b θ ) ≤ 0 . 4 211 (for optimal v a lues o f k = 15 and of λ = 1010 ). 3.3.3. E qual shadow and tr aining sample sizes In the case when k = 1 , we can use Theorem 3.2 .5 (page 118) a nd replace Φ − 1 λ N ( q ) with 1 − 2 N λ × log cosh( λ 2 N ) − 1 q , resulting in 126 Chapter 3. T r ansductive P AC-Bayesi an le arning Theorem 3 . 3.5 . F or any ǫ ∈ )0 , 1) , any N ≤ 10 9 , any one-p artial ly exchange able p osterior distribution π 1 : Ω → M 1 + (Θ) , with P pr ob ability at le ast 1 − ǫ , for any θ ∈ Θ , R ( θ ) ≤ inf λ ∈ R + n 1 + 2 N λ log cosh( λ 2 N ) o r 1 ( θ ) + 1 5 N + 2 d ′ 1 ( θ ) + 4 . 7 λ 1 − 2 N λ log cosh( λ 2 N ) . 3.3.4. Impr ovement on the e qu al sample size b ound in the i. i.d. c ase Finally , in the ca s e when P is i.i.d., mea ning that all the P i are equal, we can impr o ve the pr evious b ound. 
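(Before carrying out this improvement, let us record a quick check of the constants of Remark 3.3.1 above: with $J=2$, $\eta_2=1/(10N)$ and $\eta_1=1/\log(10N)$, and dropping the negative last term of the definition of $d''_k$, the additive overhead is $\log 2+\log\log(10N)+1-\log\log(10N)/\log(10N)$. The short computation below recovers, up to rounding, the values quoted in the remark; it is an added illustration.)

```python
import math

def overhead(N):
    # d''_k - d'_k for J = 2, eta_2 = 1/(10N), eta_1 = 1/log(10N),
    # dropping the negative last term of the definition of d''_k
    L = math.log(10 * N)
    return math.log(2) + math.log(L) + 1 - math.log(L) / L

for N in (10 ** 3, 10 ** 6, 10 ** 9):
    print(f"N = {N:>10}:  d''_k - d'_k <= {overhead(N):.2f}")
# about 3.7, 4.3 and 4.7, matching (up to rounding) the constants of Remark 3.3.1
```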
F or any partially exchangeable function λ : Ω × Θ → R + , w e saw in the discussio n preceding Theorem 3.2.6 (page 120) that T h exp λ ( r k − r 1 ) − A ( λ ) v i ≤ 1 , with the no tation introduced therein. Th us for an y partially exch ang eable po s itiv e real measur able function λ : Ω × Θ → R + satisfying equation (3 .1, page 1 16), any one-partially exchangeable p osterior distribution π 1 : Ω → M 1 + (Θ), P n exp h sup θ ∈ Θ λ r k ( θ ) − r 1 ( θ ) − A ( λ ) v ( θ ) + log ǫπ 1 ∆( θ ) io ≤ 1 . Therefore with P probability at least 1 − ǫ , with P ′ probability 1 − η , r k ( θ ) ≤ r 1 ( θ ) + A ( λ ) v ( θ ) + 1 λ ¯ d 1 ( θ ) − log( η ) . W e ca n then choos e λ ( ω , θ ) ∈ arg min λ ′ ∈ R + A ( λ ′ ) v ( θ ) + ¯ d 1 ( θ ) − lo g( η ) λ ′ , which satis- fies the required conditions, to show that with P probability at least 1 − ǫ , for any θ ∈ Θ, w ith P ′ probability at least 1 − η , for any λ ∈ R + , r k ( θ ) ≤ r 1 ( θ ) + A ( λ ) v ( θ ) + ¯ d 1 ( θ ) − log( η ) λ . W e can then take a unio n b ound o n a decr easing sequence of J v alues η 1 ≥ · · · ≥ η J of η . W eakening the order of qua n tifiers a little, we then o btain the following statement : with P pr obabilit y at lea st 1 − ǫ , for any θ ∈ Θ, for any λ ∈ R + , for an y j = 1 , . . . , J P ′ r k ( θ ) − r 1 ( θ ) − A ( λ ) v ( θ ) − ¯ d 1 ( θ ) + lo g( J ) λ ≥ − log( η j ) λ ≤ η j . Consequently for any λ ∈ R + , P ′ r k ( θ ) − r 1 ( θ ) − A ( λ ) v ( θ ) − ¯ d 1 ( θ ) + log( J ) λ ≤ − log( η 1 ) λ + η J 1 − r 1 ( θ ) − log( J ) − log( ǫ ) − log( η J ) λ + J − 1 X j =1 η j λ log η j η j +1 . 3.4. Gaussian appr oximation in V apnik b ounds 127 Moreov er P ′ v ( θ ) = r 1 + R 2 − r 1 R , (t his is where w e need equidistribution) th us proving that R − r 1 2 ≤ A ( λ ) 2 h R + r 1 − 2 r 1 R i + d ′′ 1 θ, ( η j ) J j =1 λ + η J 1 − r 1 ( θ ) . Keeping track of quantifiers, we obtain Theorem 3 . 3.6 . F or a ny de cr e asing se quenc e ( η j ) J j =1 , any ǫ ∈ )0 , 1 ) , any one- p artial ly exchange able p osterior distribution π : Ω → M 1 + (Θ) , with P pr ob abili ty at le ast 1 − ǫ , for any θ ∈ Θ , R ( θ ) ≤ inf λ ∈ R + n 1 + 2 N λ log cosh( λ 2 N ) o r 1 ( θ ) + 2 d ′′ 1 θ, ( η j ) J j =1 λ + 2 η J 1 − r 1 ( θ ) 1 − 2 N λ log cosh( λ 2 N ) 1 − 2 r 1 ( θ ) . 3.4. Gaussian approximation in V apnik b ounds 3.4.1. Gau ssian upp er b ounds of vari anc e ter ms T o obtain formulas which co uld b e easily compa red with o r iginal V apnik b ounds, we may repla ce p − Φ a ( p ) with a Gaussian upper bound: Lemma 3. 4.1 . F or any p ∈ (0 , 1 2 ) , any a ∈ R + , p − Φ a ( p ) ≤ a 2 p (1 − p ) . F or any p ∈ ( 1 2 , 1 ) , p − Φ a ( p ) ≤ a 8 . Proof. Let us no tice that for any p ∈ (0 , 1), ∂ ∂ a − a Φ a ( p ) = − p exp( − a ) 1 − p + p exp( − a ) , ∂ 2 ∂ 2 a − a Φ a ( p ) = p exp( − a ) 1 − p + p exp( − a ) 1 − p exp( − a ) 1 − p + p exp( − a ) ≤ ( p (1 − p ) p ∈ (0 , 1 2 ) , 1 4 p ∈ ( 1 2 , 1 ) . Thu s ta king a T aylor expa nsion of or der one with integral r emainder: − a Φ( a ) ≤ − ap + Z a 0 p (1 − p )( a − b ) db = − ap + a 2 2 p (1 − p ) , p ∈ (0 , 1 2 ) , − ap + Z a 0 1 4 ( a − b ) db = − ap + a 2 8 , p ∈ ( 1 2 , 1 ) . This ends the pro of o f o ur lemma. 128 Chapter 3. T r ansductive P AC-Bayesi an le arning Lemma 3. 4.2 . L et u s c onsider t he b ound B ( q , d ) = 1 + 2 d N − 1 q + d N + r 2 dq (1 − q ) N + d 2 N 2 , q ∈ R + , d ∈ R + . L et us also put ¯ B ( q , d ) = ( B ( q , d ) B ( q , d ) ≤ 1 2 , q + q d 2 N otherwise . 
F or any p ositive r e al p ar ameters q and d inf λ ∈ R + Φ − 1 λ N q + d λ ≤ ¯ B ( q , d ) . Proof. Let p = inf λ Φ − 1 λ N q + d λ . F o r a n y λ ∈ R + , p − λ 2 N ( p ∧ 1 2 ) 1 − ( p ∧ 1 2 ) ≤ Φ λ N ( p ) ≤ q + d λ . Thu s p ≤ q + inf λ ∈ R + λ 2 N ( p ∧ 1 2 ) 1 − ( p ∧ 1 2 ) + d λ = q + s 2 d ( p ∧ 1 2 ) 1 − ( p ∧ 1 2 ) N ≤ q + r d 2 N . Then let us remark that B ( q , d ) = sup ( p ′ ∈ R + ; p ′ ≤ q + r 2 dp ′ (1 − p ′ ) N ) . If moreov er 1 2 ≥ B ( q , d ), then accor ding to this remar k 1 2 ≥ q + q d 2 N ≥ p . Therefor e p ≤ 1 2 , and consequently p ≤ q + q 2 dp (1 − p ) N , implying that p ≤ B ( q , d ). 3.4.2. Arbitr ary shadow sample size The previous lemma combined with Coro lla ry 3.3.4 (page 125) implies Corollary 3. 4.3 . L et u s use the notation intr o duc e d in L emma 3.4.2 (p age 128). F or any ǫ ∈ )0 , 1) , any inte ger N ≤ 1 0 9 , with P pr ob ability at le ast 1 − ǫ , for any θ ∈ Θ , R ( θ ) ≤ inf k ∈ N ∗ k + 1 k n ¯ B h r 1 ( θ ) + 1 10 N , d ′ k ( θ ) + lo g k ( k + 1) + 4 . 7 io − r 1 ( θ ) k . 3.4.3. E qual sample s iz es in the i .i.d. c ase T o make a link with V apnik’s re sult, it is useful to state the Gaussian approximation to Theorem 3 .3.6 (page 127 ). Indeed, using the upp er b ound A ( λ ) ≤ λ 4 N , where A ( λ ) is defined by equation (3.2) on page 119, we get with P probability a t least 1 − ǫ R − r 1 − 2 η J ≤ inf λ ∈ R + λ 4 N R + r 1 − 2 r 1 R + 2 d ′′ 1 λ = r 2 d ′′ 1 ( R + r 1 − 2 r 1 R ) N , which can b e solved in R to obtain 3.4. Gaussian appr oximation in V apnik b ounds 129 Corollary 3. 4.4 . With P pr ob ability at le ast 1 − ǫ , for any θ ∈ Θ , R ( θ ) ≤ r 1 ( θ ) + d ′′ 1 ( θ ) N 1 − 2 r 1 ( θ ) + 2 η J + s 4 d ′′ 1 ( θ ) 1 − r 1 ( θ ) r 1 ( θ ) N + d ′′ 1 ( θ ) 2 N 2 1 − 2 r 1 ( θ ) 2 + 4 d ′′ 1 ( θ ) N 1 − 2 r 1 ( θ ) η J . This is to b e compa r ed with V apnik’s res ult, a s prov ed in V apnik (1998, page 138): Theorem 3 . 4.5 (V apnik) . F or any i.i.d. pr ob ability distribution P , with P pr ob- ability at le ast 1 − ǫ , for any θ ∈ Θ , putting d V = lo g P ( N 1 ) + log (4 /ǫ ) , R ( θ ) ≤ r 1 ( θ ) + 2 d V N + r 4 d V r 1 ( θ ) N + 4 d 2 V N 2 . Recalling tha t we can c ho ose ( η j ) 2 j =1 such that η J = η 2 = 1 10 N (whic h brings a negligible contribution to the b ound) and suc h that for any N ≤ 1 0 9 , d ′′ 1 ( θ ) ≤ P log( N 1 ) | ( Z i ) N i =1 − log( ǫ ) + 4 . 7 , we see that our complexity term is somehow more sa tisfactory than V apnik’s, since it is in tegrated outside the logar ithm, with a slight ly larger additional constant (remember that log 4 ≃ 1 . 4, which is b etter than o ur 4 . 7, which could presumably be improv ed by working out a better sequence η j , but not down to log(4)). Our v ar iance ter m is b etter, since w e get r 1 (1 − r 1 ), instead of r 1 . W e also hav e d ′′ 1 N instead of 2 d V N , because w e use no symmetrization trick. Let us illustrate these b ounds o n a numerical example, corr esponding to a situ- ation where the sample is noisy or the cla s sification mo del is w eak . Let us ass ume that N = 10 00, inf Θ r 1 = r 1 ( b θ ) = 0 . 2, that we are p erforming binar y classification with a model with V a pnik– Cerv onenkis dimension not greater than h = 1 0, a nd that we w ork at confidence level ǫ = 0 . 01. V apnik’s theorem provides an upper bo und for R ( b θ ) not smaller than 0 . 61 0, whereas Cor ollary 3.4.4 gives R ( b θ ) ≤ 0 . 461 (using the b ound d ′′ 1 ≤ d ′ 1 + 3 . 7 when N = 1000 ). 
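The numerical comparison above is easy to reproduce. The sketch below evaluates Theorem 3.4.5 and Corollary 3.4.4 side by side, replacing the unobserved entropy terms by the worst-case Vapnik–Cervonenkis estimate $h\log(2eN/h)$; this substitution is our assumption, not part of either statement.

```python
import math

N, h, eps, r1 = 1000, 10, 0.01, 0.2
log_N1 = h * math.log(2 * math.e * N / h)   # worst-case log of the trace size on 2N points

# Vapnik's bound (Theorem 3.4.5)
d_V = log_N1 + math.log(4 / eps)
vapnik = r1 + 2 * d_V / N + math.sqrt(4 * d_V * r1 / N + 4 * d_V ** 2 / N ** 2)

# Corollary 3.4.4 with d''_1 <= d'_1 + 3.7 (valid for N = 1000) and eta_J = 1/(10N)
d1 = log_N1 - math.log(eps) + 3.7
eta = 1 / (10 * N)
cor = (r1 + d1 / N * (1 - 2 * r1) + 2 * eta
       + math.sqrt(4 * d1 * (1 - r1) * r1 / N
                   + (d1 / N) ** 2 * (1 - 2 * r1) ** 2
                   + 4 * d1 / N * (1 - 2 * r1) * eta))

print(f"Vapnik, Theorem 3.4.5 : R <= {vapnik:.3f}")   # about 0.610
print(f"Corollary 3.4.4       : R <= {cor:.3f}")      # about 0.461
```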
Now if we go for Theorem 3.3.6 and do not make a Gaussian approximation, we get $R(\widehat{\theta}) \leq 0.453$. It is interesting to remark that this bound is achieved for $\lambda = 1195 > N = 1000$. This explains why the Gaussian approximation in Vapnik's bound can be improved: for such a large value of $\lambda$, $\lambda r_1(\theta)$ does not behave like a Gaussian random variable.

Let us recall in conclusion that the best bound is provided by Theorem 3.3.3 (page 125), giving $R(\widehat{\theta}) \leq 0.4211$ (that is, approximately $2/3$ of Vapnik's bound), for optimal values of $k = 15$ and of $\lambda = 1010$. This bound can be seen to take advantage of the fact that Bernoulli random variables are not Gaussian (its Gaussian approximation, Corollary 3.4.3, gives a bound $R(\theta) \simeq 0.4325$, still with an optimal $k = 15$), and of the fact that the optimal size of the shadow sample is significantly larger than the size of the observed sample. Moreover, Theorem 3.3.3 does not assume that the sample is i.i.d., but only that it is independent, thus generalizing Vapnik's bounds to inhomogeneous data (this will presumably be the case when data are collected from different places where the experimental conditions may not be the same, although they may reasonably be assumed to be independent).

Our little numerical example was chosen to illustrate the case when it is non-trivial to decide whether the chosen classifier does better than the 0.5 error rate of blind random classification. This case is of interest when choosing "weak learners" to be aggregated or combined in some appropriate way in a second stage to reach a better classification rate. This stage of feature selection is unavoidable in many real-world classification tasks. Our little computations are meant to exemplify the fact that Vapnik's bounds, although asymptotically suboptimal, as is obvious by comparison with the first two chapters, can do the job when dealing with moderate sample sizes.

Chapter 4
Support Vector Machines

4.1. How to build them

4.1.1. The canonical hyperplane

Support Vector Machines, of wide use and renown, were conceived by V. Vapnik (Vapnik, 1998). Before introducing them, we will study as a prerequisite the separation of points by hyperplanes in a finite dimensional Euclidean space. Support Vector Machines perform the same kind of linear separation after an implicit change of pattern space. The preceding PAC-Bayesian results provide a fit framework to analyse their generalization properties.

In this section we deal with the classification of points of $\mathbb{R}^d$ into two classes. Let $Z = (x_i, y_i)_{i=1}^N \in \bigl(\mathbb{R}^d \times \{-1, +1\}\bigr)^N$ be some set of labelled examples (called the training set hereafter). Let us split the set of indices $I = \{1, \dots, N\}$ according to the labels into the two subsets
$$I_+ = \{ i \in I : y_i = +1 \}, \qquad I_- = \{ i \in I : y_i = -1 \}.$$
Let us then consider the set of admissible separating directions
$$A_Z = \Bigl\{ w \in \mathbb{R}^d : \sup_{b \in \mathbb{R}} \inf_{i \in I} \bigl( \langle w, x_i \rangle - b \bigr) y_i \geq 1 \Bigr\},$$
which can also be written as
$$A_Z = \Bigl\{ w \in \mathbb{R}^d : \max_{i \in I_-} \langle w, x_i \rangle + 2 \leq \min_{i \in I_+} \langle w, x_i \rangle \Bigr\}.$$
As is easily seen, the optimal value of $b$ for a fixed value of $w$, in other words the value of $b$ which maximizes $\inf_{i \in I} (\langle w, x_i \rangle - b) y_i$, is equal to
$$b_w = \tfrac{1}{2} \Bigl[ \max_{i \in I_-} \langle w, x_i \rangle + \min_{i \in I_+} \langle w, x_i \rangle \Bigr].$$
Lemma 4.1.1.
When A Z 6 = ∅ , inf {k w k 2 : w ∈ A Z } is r e ache d for only one value w Z of w . 131 132 Chapter 4. Supp ort V e ctor Machines Proof. Let w 0 ∈ A Z . The set A Z ∩ { w ∈ R d : k w k ≤ k w 0 k} is a compact conv ex set and w 7→ k w k 2 is strictly con vex and therefore has a unique minimum on this set, which is also obviously its minimum on A Z . Definition 4. 1.1 . When A Z 6 = ∅ , the tr aining set Z is said t o b e line arly sep ar a- ble. The hyp erplane H = { x ∈ R d : h w Z , x i − b Z = 0 } , wher e w Z = a r g min {k w k : w ∈ A Z } , b Z = b w Z , is c al le d the c anonic al sep ar ating hy p erplane of the tr aining set Z . The quantity k w Z k − 1 is c al le d the mar gin of the c anonic al hyp erplane. As min i ∈ I + h w Z , x i i − max i ∈ I − h w Z , x i i = 2, the margin is also equal to half the distance betw een the pr o jections on the direction w Z of the p ositive and nega tiv e patterns. 4.1.2. C omputation of the c anonic al hyp erplane Let us consider the con vex hu lls X + and X − of the positive and neg ativ e patterns: X + = n X i ∈ I + λ i x i : λ i i ∈ I + ∈ R I + + , X i ∈ I + λ i = 1 o , X − = n X i ∈ I − λ i x i : λ i i ∈ I − ∈ R I − + , X i ∈ I − λ i = 1 o . Let us introduce the closed conv ex set V = X + − X − = x + − x − : x + ∈ X + , x − ∈ X − . As v 7→ k v k 2 is strictly conv ex, with compact lower level s ets, there is a unique vector v ∗ such that k v ∗ k 2 = inf v ∈ V k v k 2 : v ∈ V . Lemma 4. 1.2 . The set A Z is non-empty (i.e. the tr aining set Z is line arly sep a- r able) if and only if v ∗ 6 = 0 . In this c ase w Z = 2 k v ∗ k 2 v ∗ , and the mar gin of t he c anonic al hyp erplane is e qual to 1 2 k v ∗ k . This lemma pro ves that the distance b et ween the con vex hulls of th e po sitiv e and negative patterns is equal to twice the mar gin of the canonica l hyperplane. Proof. Let us assume first that v ∗ = 0, or equiv alently that X + ∩ X − 6 = ∅ . F o r any vector w ∈ R d , min i ∈ I + h w, x i i = min x ∈ X + h w, x i , 4.1. How to build them 133 max i ∈ I − h w, x i i = max x ∈ X − h w, x i , so min i ∈ I + h w, x i i − max i ∈ I − h w, x i i ≤ 0 , which shows that w canno t b e in A Z and therefore that A Z is empty . Let us assume now that v ∗ 6 = 0 , or equiv alently that X + ∩ X − = ∅ . Let us put w ∗ = 2 v ∗ / k v ∗ k 2 . Let us remark first that min i ∈ I + h w ∗ , x i i − max i ∈ I − h w ∗ , x i i = inf x ∈ X + h w ∗ , x i − sup x ∈ X − h w ∗ , x i = inf x + ∈ X + ,x − ∈ X − h w ∗ , x + − x − i = 2 k v ∗ k 2 inf v ∈ V h v ∗ , v i . Let us now prove that inf v ∈ V h v ∗ , v i = k v ∗ k 2 . Some arbitrary v ∈ V b eing fixed, consider the function β 7→ k β v + (1 − β ) v ∗ k 2 : [0 , 1] → R . By definition of v ∗ , it reaches its minimum v alue for β = 0 , and ther efore has a non-nega tiv e deriv a tiv e at this p oint. Computing this deriv ative, we find that h v − v ∗ , v ∗ i ≥ 0, as claimed. W e have prov ed that min i ∈ I + h w ∗ , x i i − max i ∈ I − h w ∗ , x i i = 2 , and therefore that w ∗ ∈ A Z . On the other hand, any w ∈ A Z is such that 2 ≤ min i ∈ I + h w, x i i − max i ∈ I − h w, x i i = inf v ∈ V h w, v i ≤ k w k inf v ∈ V k v k = k w k k v ∗ k . This prov es that k w ∗ k = inf k w k : w ∈ A Z , and therefore that w ∗ = w Z as claimed. One w ay to co mpute w Z would therefore b e to compute v ∗ b y minimizing ( X i ∈ I λ i y i x i 2 : ( λ i ) i ∈ I ∈ R I + , X i ∈ I λ i = 2 , X i ∈ I y i λ i = 0 ) . 
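The minimization suggested at the end of the proof is a small quadratic programme, and Lemma 4.1.2 then gives $w_Z = 2v^*/\lVert v^*\rVert^2$. The sketch below solves it with a generic constrained optimizer (scipy's SLSQP) on a toy two-dimensional training set; the data and the choice of solver are illustrative assumptions, not the author's algorithm.

```python
import numpy as np
from scipy.optimize import minimize

# toy linearly separable training set in R^2
x = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 3.0],      # positive patterns
              [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])   # negative patterns
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N = len(y)

obj = lambda lam: np.sum(((lam * y) @ x) ** 2)          # || sum_i lambda_i y_i x_i ||^2
cons = [{"type": "eq", "fun": lambda lam: lam.sum() - 2.0},
        {"type": "eq", "fun": lambda lam: lam @ y}]
lam0 = np.where(y > 0, 1 / (y > 0).sum(), 1 / (y < 0).sum())   # feasible start
res = minimize(obj, lam0, bounds=[(0, None)] * N, constraints=cons, method="SLSQP")

v_star = (res.x * y) @ x                   # closest point of V = X_+ - X_-
w_Z = 2 * v_star / (v_star @ v_star)       # canonical direction (Lemma 4.1.2)
b_Z = 0.5 * (max(x[y < 0] @ w_Z) + min(x[y > 0] @ w_Z))
print("margin 1/||w_Z|| =", 1 / np.linalg.norm(w_Z))              # equals ||v*|| / 2
print("min_i y_i(<w_Z, x_i> - b_Z) =", min(y * (x @ w_Z - b_Z)))  # should be close to 1
```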
Although this is a tra ctable quadratic progra mming problem, a direct co mputation of w Z through the following pro p osition is usually preferred. Prop osition 4. 1 .3 . The c anonic al dir e ction w Z c an b e expr esse d as w Z = N X i =1 α ∗ i y i x i , wher e ( α ∗ i ) N i =1 is obtaine d by minimizing inf F ( α ) : α ∈ A wher e A = n ( α i ) i ∈ I ∈ R I + , X i ∈ I α i y i = 0 o , and F ( α ) = X i ∈ I α i y i x i 2 − 2 X i ∈ I α i . 134 Chapter 4. Supp ort V e ctor Machines Proof. Let w ( α ) = P i ∈ I α i y i x i and let S ( α ) = 1 2 P i ∈ I α i . W e ca n express the function F ( α ) as F ( α ) = k w ( α ) k 2 − 4 S ( α ). Moreov er it is imp ortant to no- tice that for an y s ∈ R + , { w ( α ) : α ∈ A , S ( α ) = s } = s V . This shows that for any s ∈ R + , inf { F ( α ) : α ∈ A , S ( α ) = s } is reached a nd that for a n y α s ∈ { α ∈ A : S ( α ) = s } rea c hing this infimum, w ( α s ) = sv ∗ . As s 7→ s 2 k v ∗ k 2 − 4 s : R + → R reaches its infimum for only o ne v alue s ∗ of s , namely at s ∗ = 2 k v ∗ k 2 , th is sho ws that F ( α ) reaches its infim um on A , and that for any α ∗ ∈ A such that F ( α ∗ ) = inf { F ( α ) : α ∈ A } , w ( α ∗ ) = 2 k v ∗ k 2 v ∗ = w Z . 4.1.3. S u pp ort ve ctors Definition 4. 1.2 . The set of supp ort ve ctors S is define d by S = { x i : h w Z , x i i − b Z = y i } . Prop osition 4. 1 .4 . Any α ∗ minimizing F ( α ) on A is such that { x i : α ∗ i > 0 } ⊂ S . This implies that the r epr esentation w Z = w ( α ∗ ) involves in gener al only a limite d numb er of non-zer o c o efficients and that w Z = w Z ′ , wher e Z ′ = { ( x i , y i ) : x i ∈ S } . Proof. Let us co nsider any given i ∈ I + and j ∈ I − , such that α ∗ i > 0 and α ∗ j > 0 . There exists at least o ne such index in ea ch set I − and I + , since the sum of the comp onen ts of α ∗ on ea c h of these se ts are eq ua l and since P k ∈ I α ∗ k > 0 . F or any t ∈ R , consider α k ( t ) = α ∗ k + t 1 ( k ∈ { i , j } ) , k ∈ I . The vector α ( t ) is in A for a n y v alue of t in so me neigh b o urhoo d of 0, therefor e ∂ ∂ t | t =0 F α ( t ) = 0 . Computing this der iv ative, we find that y i h w ( α ∗ ) , x i i + y j h w ( α ∗ ) , x j i = 2 . As y i = − y j , this can also b e written a s y i h w ( α ∗ ) , x i i − b Z + y j h w ( α ∗ ) , x j i − b Z = 2 . As w ( α ∗ ) ∈ A Z , y k h w ( α ∗ ) , x k i − b Z ≥ 1 , k ∈ I , which implies necessa rily as claimed that y i h w ( α ∗ ) , x i i − b Z = y j h w ( α ∗ ) , x j i − b Z = 1 . 4.1.4. The non-sep ar able c ase In the case when the tr a ining set Z = ( x i , y i ) N i =1 is not linea rly separ a ble, we can define a no isy canonical h yp erplane a s follows: we can ch o ose w ∈ R d and b ∈ R to minimize (4.1) C ( w, b ) = N X i =1 1 − h w, x i i − b y i + + 1 2 k w k 2 , where for any real num b e r r , r + = max { r , 0 } is the p ositive part o f r . 4.1. How to build them 135 Theorem 4 . 1.5 . L et us intr o duc e the dual criterion F ( α ) = N X i =1 α i − 1 2 N X i =1 y i α i x i 2 and t he domain A ′ = α ∈ R N + : α i ≤ 1 , i = 1 , . . . , N , N X i =1 y i α i = 0 . L et α ∗ ∈ A ′ b e su ch that F ( α ∗ ) = sup α ∈ A ′ F ( α ) . L et w ∗ = P N i =1 y i α ∗ i x i . Ther e is a thr eshold b ∗ (whose c onstru ction wil l b e detaile d in the pr o of ), such that C ( w ∗ , b ∗ ) = inf w ∈ R d ,b ∈ R C ( w, b ) . Corollary 4. 1.6 . 
(scaled criterion) F or any p ositive r e al p ar ameter λ let us c onsider the criterion C λ ( w, b ) = λ 2 N X i =1 1 − ( h w , x i i − b ) y i + + 1 2 k w k 2 and the domain A ′ λ = α ∈ R N + : α i ≤ λ 2 , i = 1 , . . . , N , N X i =1 y i α i = 0 . F or any solution α ∗ of the minimization pr oblem F ( α ∗ ) = sup α ∈ A ′ λ F ( α ) , t he ve ctor w ∗ = P N i =1 y i α ∗ i x i is su ch that inf b ∈ R C λ ( w ∗ , b ) = inf w ∈ R d ,b ∈ R C λ ( w, b ) . In the separable cas e , the sca led criterion is minimized by the cano nical h yp er- plane for λ large enough. This extension of the c a nonical hyperplane computation in dual space is often called the b ox c onstr aint , for ob vious reaso ns. Proof. The co r ollary is a straightforw ard consequence of the scale property C λ ( w, b, x ) = λ 2 C ( λ − 1 w, b, λx ), where we hav e made the dep endence of the crite- rion in x ∈ R dN explicit. Let us come now to the pro of of the theorem. The minimization of C ( w , b ) can b e p erformed in dual spa ce extending the couple of parameters ( w , b ) to w = ( w, b, γ ) ∈ R d × R × R N + and in tro ducing the dual m ultipliers α ∈ R N + and the criterion G ( α, w ) = N X i =1 γ i + N X i =1 α i 1 − ( h w , x i i − b ) y i − γ i + 1 2 k w k 2 . W e s e e that C ( w, b ) = inf γ ∈ R N + sup α ∈ R N + G α, ( w , b, γ ) , and therefore, putting W = { ( w, b, γ ) : w ∈ R d , b ∈ R , γ ∈ R N + , w e are led to solve the minimization problem G ( α ∗ , w ∗ ) = inf w ∈ W sup α ∈ R N + G ( α, w ) , 136 Chapter 4. Supp ort V e ctor Machines whose solution w ∗ = ( w ∗ , b ∗ , γ ∗ ) is such that C ( w ∗ , b ∗ ) = inf ( w, b ) ∈ R d +1 C ( w , b ), according to the preceding iden tity . As for any v alue of α ′ ∈ R N + , inf w ∈ W sup α ∈ R N + G ( α, w ) ≥ inf w ∈ W G ( α ′ , w ) , it is immediately seen that inf w ∈ W sup α ∈ R N + G ( α, w ) ≥ sup α ∈ R N + inf w ∈ W G ( α, w ) . W e are going to show that there is no duality gap, meaning that this inequa lit y is indeed an e q ualit y . Mor e imp ortantly , we will do so by exhibiting a sa ddle p oint, which , so lving the dual minimization pro blem will also solve the origina l one. Let us first make explicit the solution of the dual pro blem (the interest of this dual problem precisely lies in the fact that it ca n more easily be so lv ed e x plicitly). In tro ducing the admissible set of v alues of α , A ′ = α ∈ R N : 0 ≤ α i ≤ 1 , i = 1 , . . . , N , N X i =1 y i α i = 0 , it is elementary to chec k that inf w ∈ W G ( α, w ) = ( inf w ∈ R d G α, ( w , 0 , 0) , α ∈ A ′ , −∞ , otherwise . As G α, ( w , 0 , 0) = 1 2 k w k 2 + N X i =1 α i 1 − h w , x i i y i , we see that inf w ∈ R d G α, ( w , 0 , 0) is reached at w α = N X i =1 y i α i x i . This prov es that inf w ∈ W G ( α, w ) = F ( α ) . The cont inuous map α 7→ inf w ∈ W G ( α, w ) rea c hes a maximum α ∗ , not necessar ily unique, on the compact co n vex set A ′ . W e ar e now go ing to exhibit a c hoice of w ∗ ∈ W suc h that ( α ∗ , w ∗ ) is a sadd le p oint . This mea ns that w e a r e going to show that G ( α ∗ , w ∗ ) = inf w ∈ W G ( α ∗ , w ) = sup α ∈ R N + G ( α, w ∗ ) . It will imply that inf w ∈ W sup α ∈ R d + G ( α, w ) ≤ sup α ∈ R N + G ( α, w ∗ ) = G ( α ∗ , w ∗ ) on the one hand and that inf w ∈ W sup α ∈ R d + G ( α, w ) ≥ inf w ∈ W G ( α ∗ , w ) = G ( α ∗ , w ∗ ) 4.1. How to build them 137 on the other hand, proving that G ( α ∗ , w ∗ ) = inf w ∈ W sup α ∈ R N + G ( α, w ) as required. Construct ion of w ∗ . • Let us put w ∗ = w α ∗ . • If there is j ∈ { 1 , . . . 
, N } such that 0 < α ∗ j < 1, let us put b ∗ = h x j , w ∗ i − y j . Otherwise, let us put b ∗ = sup { h x i , w ∗ i − 1 : α ∗ i > 0 , y i = + 1 , i = 1 , . . . , N } . • Let us then put γ ∗ i = ( 0 , α ∗ i < 1 , 1 − ( h w ∗ , x i i − b ∗ ) y i , α ∗ i = 1 . If we can prove that (4.2) 1 − ( h w ∗ , x i i − b ∗ ) y i ≤ 0 , α ∗ i = 0 , = 0 , 0 < α ∗ i < 1 , ≥ 0 , α ∗ i = 1 , it will show that γ ∗ ∈ R N + and therefore that w ∗ = ( w ∗ , b ∗ , γ ∗ ) ∈ W . It will a ls o show that G ( α, w ∗ ) = N X i =1 γ ∗ i + X i,α ∗ i =0 α i 1 − ( h w ∗ , x i i − b ∗ ) y i + 1 2 k w ∗ k 2 , proving that G ( α ∗ , w ∗ ) = sup α ∈ R N + G ( α, w ∗ ). As obviously G ( α ∗ , w ∗ ) = G α ∗ , ( w ∗ , 0 , 0) , we a lready kno w that G ( α ∗ , w ∗ ) = inf w ∈ W G ( α ∗ , w ). This will show that ( α ∗ , w ∗ ) is the saddle p oint we w ere lo oking for, th us ending the pro of of the theo- rem. Proof of equa tion (4.2) . Let us deal first with the case when there is j ∈ { 1 , . . . , N } such that 0 < α ∗ j < 1. F or any i ∈ { 1 , . . . , N } such that 0 < α ∗ i < 1 , ther e is ǫ > 0 such that for any t ∈ ( − ǫ, ǫ ), α ∗ + ty i e i − ty j e j ∈ A ′ , where ( e k ) N k =1 is the cano nical base of R N . Th us ∂ ∂ t | t =0 F ( α ∗ + ty i e i − ty j e j ) = 0. Computing this deriv a tiv e, w e obtain ∂ ∂ t | t =0 F ( α ∗ + ty i e i − ty j e j ) = y i − h w ∗ , x i i + h w ∗ , x j i − y j = y i 1 − h w, x i i − b ∗ y i . Thu s 1 − h w, x i i − b ∗ y i = 0 , as req uir ed. This shows also that the definition of b ∗ do es not dep end on the choice of j such that 0 < α ∗ j < 1. 138 Chapter 4. Supp ort V e ctor Machines F or any i ∈ { 1 , . . . , N } suc h that α ∗ i = 0, there is ǫ > 0 such that for any t ∈ (0 , ǫ ), α ∗ + te i − ty i y j e j ∈ A ′ . Thus ∂ ∂ t | t =0 F ( α ∗ + te i − ty i y j e j ) ≤ 0 , showing that 1 − h w ∗ , x i i − b ∗ y i ≤ 0 as req uired. F or an y i ∈ { 1 , . . . , N } such that α ∗ i = 1 , there is ǫ > 0 such that α ∗ − te i + ty i y j e j ∈ A ′ . Thus ∂ ∂ t | t =0 F ( α ∗ − te i + ty i y j e j ) ≤ 0 , showin g that 1 − h w ∗ , x i i − b ∗ y i ≥ 0 as required. This shows that ( α ∗ , w ∗ ) is a saddle point in this case. Let us dea l now with the case w her e α ∗ ∈ { 0 , 1 } N . If we a re no t in the trivial c a se where the vector ( y i ) N i =1 is constant, the case α ∗ = 0 is ruled o ut. Indeed, in this case, considering α ∗ + te i + te j , where y i y j = − 1, we w ould get the cont ra diction 2 = ∂ ∂ t | t =0 F ( α ∗ + te i + te j ) ≤ 0. Thu s there are v alues of j such that α ∗ j = 1, and since P N i =1 α i y i = 0, b oth classes are present in the set { j : α ∗ j = 1 } . Now for any i , j ∈ { 1 , . . . , N } suc h that α ∗ i = α ∗ j = 1 and such that y i = +1 a nd y j = − 1 , ∂ ∂ t | t =0 F ( α ∗ − te i − te j ) = − 2 + h w ∗ , x i i − h w ∗ , x j i ≤ 0. Th us sup {h w ∗ , x i i − 1 : α ∗ i = 1 , y i = + 1 } ≤ inf {h w ∗ , x j i + 1 : α ∗ j = 1 , y j = − 1 } , showing that 1 − h w ∗ , x k i − b ∗ y k ≥ 0 , α ∗ k = 1 . Finally , for any i such that α ∗ i = 0, for any j suc h that α ∗ j = 1 and y j = y i , we have ∂ ∂ t | t =0 F ( α ∗ + te i − te j ) = y i h w ∗ , x i − x j i ≤ 0 , showing that 1 − h w ∗ , x i i − b ∗ y i ≤ 0 . This shows that ( α ∗ , w ∗ ) is alwa ys a saddle po in t. 4.1.5. S u pp ort V e ctor Machines Definition 4. 1.3 . The symmetric me asur able kernel K : X × X → R is said to b e p ositive (or mor e pr e cisely p ositive semi-definite) if for a ny n ∈ N , a ny ( x i ) n i =1 ∈ X n , inf α ∈ R n n X i =1 n X j =1 α i K ( x i , x j ) α j ≥ 0 . 
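Positive semi-definiteness in the sense of Definition 4.1.3 can be probed numerically through the spectrum of Gram matrices. The snippet below, an illustration only, checks it for the kernel $\exp(-\lVert x-x'\rVert^2)$ on random points (this kernel reappears as $G_g$ in Proposition 4.1.10 below).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))                       # 50 random points of R^3

sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dist)                               # Gram matrix of exp(-||x - x'||^2)

# Definition 4.1.3 asks that alpha^T K alpha >= 0 for every alpha,
# i.e. that the Gram matrix has no negative eigenvalue
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())   # nonnegative up to rounding
```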
Let Z = ( x i , y i ) N i =1 be some training set. Let us consider as previously A = ( α ∈ R N + : N X i =1 α i y i = 0 ) . Let F ( α ) = N X i =1 N X j =1 α i y i K ( x i , x j ) y j α j − 2 N X i =1 α i . Definition 4. 1.4 . L et K b e a p ositive s ymmet ric kernel. The tr aining set Z is said to b e K - sep ar able if inf F ( α ) : α ∈ A > −∞ . 4.1. How to build them 139 Lemma 4. 1.7 . When Z is K -sep ar able, inf { F ( α ) : α ∈ A } is r e ache d. Proof. Consider the tra ining set Z ′ = ( x ′ i , y i ) N i =1 , where x ′ i = n K ( x k , x ℓ ) o N N k =1 ,ℓ =1 1 / 2 ( i, j ) N j =1 ∈ R N . W e see that F ( α ) = k P N i =1 α i y i x ′ i k 2 − 2 P N i =1 α i . W e prov ed in the previo us section that Z ′ is linear ly sepa r able if and only if inf { F ( α ) : α ∈ A } > −∞ , a nd that the infim um is r e ac hed in this case. Prop osition 4. 1 .8 . L et K b e a symmetric p ositive kernel and let Z = ( x i , y i ) N i =1 b e some K -sep ar able tr aining set. L et α ∗ ∈ A b e such t hat F ( α ∗ ) = inf { F ( α ) : α ∈ A } . L et I ∗ − = { i ∈ N : 1 ≤ i ≤ N , y i = − 1 , α ∗ i > 0 } I ∗ + = { i ∈ N : 1 ≤ i ≤ N , y i = +1 , α ∗ i > 0 } b ∗ = 1 2 n N X j =1 α ∗ j y j K ( x j , x i − ) + N X j =1 α ∗ j y j K ( x j , x i + ) o , i − ∈ I ∗ − , i + ∈ I ∗ + , wher e the value of b ∗ do es not dep end on the choic e of i − and i + . The classific ation rule f : X → Y define d by the formula f ( x ) = sign N X i =1 α ∗ i y i K ( x i , x ) − b ∗ ! is indep endent of the choic e of α ∗ and is c al le d the su pp ort ve ctor machine define d by K and Z . The set S = { x j : P N i =1 α ∗ i y i K ( x i , x j ) − b ∗ = y j } is c al le d the set of supp ort ve ctors. F or any choic e of α ∗ , { x i : α ∗ i > 0 } ⊂ S . An imp ortant conse quence of this prop osition is that the supp ort vector machine defined by K and Z is also the suppo r t vector machine defined by K and Z ′ = { ( x i , y i ) : α ∗ i > 0 , 1 ≤ i ≤ N } , since this restriction o f the index set contains the v alue α ∗ where the minim um of F is reached. Proof. The indep endence o f the choice o f α ∗ , which is not necess a rily unique, is seen as follows. Let ( x i ) N i =1 and x ∈ X b e fixed. Let us put for ease of notation x N +1 = x . Let M b e the ( N + 1) × ( N + 1) symmetric semi-definite matrix defined b y M ( i, j ) = K ( x i , x j ), i = 1 , . . . , N + 1, j = 1 , . . . , N + 1. Le t us consider the mapping Ψ : { x i : i = 1 , . . . , N + 1 } → R N +1 defined b y (4.3) Ψ( x i ) = M 1 / 2 ( i, j ) N +1 j =1 ∈ R N +1 . Let us consider the training set Z ′ = Ψ( x i ) , y i N i =1 . Then Z ′ is linearly separable, F ( α ) = N X i =1 α i y i Ψ( x i ) 2 − 2 N X i =1 α i , and we hav e proved that for a n y choice of α ∗ ∈ A minimizing F ( α ), w Z ′ = P N i =1 α ∗ i y i Ψ( x i ). Thus the supp ort vector machine defined by K and Z can also b e expressed b y the formula f ( x ) = sign h h w Z ′ , Ψ ( x ) i − b Z ′ 140 Chapter 4. Supp ort V e ctor Machines which do es not dep end on α ∗ . The definition of S is such that Ψ( S ) is the set of suppo rt vectors defined in the linear cas e, where it s stated property has already been prov ed. W e can in the same w ay use the box co nstrain t and show that any solution α ∗ ∈ arg min { F ( α ) : α ∈ A , α i ≤ λ 2 , i = 1 , . . . , N } minimizes (4.4) inf b ∈ R λ 2 N X i =1 1 − N X j =1 y j α j K ( x j , x i ) − b y i + + 1 2 N X i =1 N X j =1 α i α j y i y j K ( x i , x j ) . 4.1.6. B uilding kernels Except the last, the results of this section are drawn from Cr istianini et al. (2000). 
W e hav e no refere nce for the last propo sition of this section, althoug h w e b elieve it is well known. W e include them for the conv enience of the reader. Prop osition 4. 1 .9 . L et K 1 and K 2 b e p ositive symmetric kernels on X . Then for any a ∈ R + ( aK 1 + K 2 )( x, x ′ ) def = aK 1 ( x, x ′ ) + K 2 ( x, x ′ ) and ( K 1 · K 2 )( x, x ′ ) def = K 1 ( x, x ′ ) K 2 ( x, x ′ ) ar e also p ositive symmetric kernels. Mor e over, for any me asur able function g : X → R , K g ( x, x ′ ) def = g ( x ) g ( x ′ ) is also a p ositive symmetric kernel. Proof. It is enough to prov e the pr oposition in the case when X is finite and kernels ar e just ordinary symmetric matrices. Thus we ca n assume without loss of generality that X = { 1 , . . . , n } . Then for any α ∈ R N , using usual ma tr ix notation, h α, ( aK 1 + K 2 ) α i = a h α, K 1 α i + h α, K 2 α i ≥ 0 , h α, ( K 1 · K 2 ) α i = X i,j α i K 1 ( i, j ) K 2 ( i, j ) α j = X i,j,k α i K 1 / 2 1 ( i, k ) K 1 / 2 1 ( k , j ) K 2 ( i, j ) α j = X k X i,j K 1 / 2 1 ( k , i ) α i K 2 ( i, j ) K 1 / 2 1 ( k , j ) α j | {z } ≥ 0 ≥ 0 , h α, K g α i = X i,j α i g ( i ) g ( j ) α j = X i α i g ( i ) ! 2 ≥ 0 . Prop osition 4. 1 .10 . L et K b e some p ositive symmetric kernel on X . L et p : R → R b e a p olynomial with p ositive c o efficients. L et g : X → R d b e a me asur able func- tion. Then p ( K )( x, x ′ ) def = p K ( x, x ′ ) , 4.1. How to build them 141 exp( K )( x, x ′ ) def = exp K ( x, x ′ ) and G g ( x, x ′ ) def = exp −k g ( x ) − g ( x ′ ) k 2 ar e al l p ositive symmetric kernels. Proof. The first assertion is a direct consequence of the previous propo sition. The second comes fr o m the fact that the expo nen tial function is the p oin twise limit of a sequence o f p olynomial functions with p ositive coefficients. The third is seen from the second and the decomp osition G g ( x, x ′ ) = h exp −k g ( x ) k 2 exp −k g ( x ′ ) k 2 i exp 2 h g ( x ) , g ( x ′ ) i Prop osition 4. 1 .11 . With the n otation of the pr evious pr op osition, any tr aining set Z = ( x i , y i ) N i =1 ∈ X ×{− 1 , +1 } N is G g -sep ar able as so on as g ( x i ) , i = 1 , . . . , N ar e distinct p oints of R d . Proof. It is clea rly enough to prove the c ase when X = R d and g is the ident ity . Let us consider some other g eneric p oin t x N +1 ∈ R d and define Ψ as in (4.3). It is enough to pr o ve that Ψ( x 1 ) , . . . , Ψ ( x N ) are affine indep enden t, since the simplex, and therefor e any affine indep enden t set o f p oints, ca n b e split in any arbitrary wa y b y affine half-spaces. Let us assume that ( x 1 , . . . , x N ) are affine dependent; then for some ( λ 1 , . . . , λ N ) 6 = 0 suc h that P N i =1 λ i = 0 , N X i =1 N X j =1 λ i G ( x i , x j ) λ j = 0 . Thu s, ( λ i ) N +1 i =1 , where we ha ve put λ N +1 = 0 is in the k ernel of the symmetric po sitiv e semi-definite matrix G ( x i , x j ) i,j ∈{ 1 ,...,N +1 } . Therefore N X i =1 λ i G ( x i , x N +1 ) = 0 , for any x N +1 ∈ R d . This w ould mean that the functions x 7→ ex p( −k x − x i k 2 ) are linearly dep endent , which c a n b e easily proved to b e false. Indeed, let n ∈ R d be suc h that k n k = 1 and h n, x i i , i = 1 , . . . , N are distinct (such a vector exists, beca use it has to b e outside the unio n of a finite num ber of h yp erplanes, which is of zer o Lebe sgue measure on the s phere). Let us assume for a while that for some ( λ i ) N i =1 ∈ R N , for any x ∈ R d , N X i =1 λ i exp( −k x − x i k 2 ) = 0 . 
Considering x = tn , for t ∈ R , we would get N X i =1 λ i exp(2 t h n, x i i − k x i k 2 ) = 0 , t ∈ R . Letting t go to infinit y , we see that this is only p ossible if λ i = 0 for all v alues o f i . 142 Chapter 4. Supp ort V e ctor Machines 4.2. Bounds for Supp ort V ector Machines 4.2.1. C ompr ession scheme b ounds W e can use Support V ector Machines in the framework of compression sc hemes and apply Theore m 3.3.3 (page 12 5 ). More pre c is ely , given some p ositive symmetric kernel K on X , we may consider for any training s et Z ′ = ( x ′ i , y ′ i ) h i =1 the c la ssifier ˆ f Z ′ : X → Y which is equal to the Supp ort V ector Machine defined by K a nd Z ′ whenever Z ′ is K -separa ble, and whic h is equal to so me constant cla ssification r ule otherwise; w e tak e this conv ention to stick to the framew or k describ ed on page 1 17, we will only use ˆ f Z ′ in the K -sepa rable case, so this extension of the definition is just a matter of presentation. In the application of Theorem 3.3.3 in the case when the observed sa mple ( X i , Y i ) N i =1 is K -separable, a natura l if p erhaps sub-o ptimal choice of Z ′ is to choos e for ( x ′ i ) the set of s upport vectors defined by Z = ( X i , Y i ) N i =1 and to c ho ose for ( y ′ i ) the corresp onding v a lues of Y . This is justified by the fact that ˆ f Z = ˆ f Z ′ , as sho wn in Prop osition 4.1 .8 (page 1 3 9). If Z is not K -separable, we can train a Supp ort V ector Ma c hine with the b ox constraint, then remov e all the errors to obtain a K -separable sub-sample Z ′ = { ( X i , Y i ) : α ∗ i < λ 2 , 1 ≤ i ≤ N } , using the sa me notation as in equation (4.4) on page 140, and then c o nsider its suppo rt vectors as the compressio n set. Still using the notation of pag e 140, this means we ha ve to compute succe s siv ely α ∗ ∈ a rg min { F ( α ) : α ∈ A , α i ≤ λ 2 } , and α ∗∗ ∈ a rg min { F ( α ) : α ∈ A , α i = 0 when α ∗ i = λ 2 } , to keep the compression set indexed by J = { i : 1 ≤ i ≤ N , α ∗∗ i > 0 } , and the corres ponding Supp ort V ector Machine b f J . Differen t v alues of λ can be used at this stage, producing different candidate co mpression sets: when λ increases, the num b er of erro rs should decr ease, on the other hand when λ decreases, the margin k w k − 1 of the separa ble subset Z ′ increases, supp orting the hop e for a s maller set o f s upport vectors, th us we can use λ to monitor the num b er of errors on the training set we ac cept fro m the compression scheme. As we can use whatever heur istic we w ant while selecting the compres sion set, we can also try to threshold in the previous construction α ∗∗ i at different lev els η ≥ 0 , to pro duce ca ndidate compression sets J η = { i : 1 ≤ i ≤ N , α ∗∗ i > η } of v ar ious s izes. As the size | J | of the compressio n set is random in this co nstruction, w e must use a version of Theorem 3.3.3 (page 125) whic h handles compression sets of arbi- trary sizes. This is done by choosing for each k a k -partially exchangeable p osterior distribution π k which w eights the co mpr ession sets o f all dimensio ns . W e immedi- ately see that we can choo se π k such that − lo g π k (∆ k ( J )) ≤ log | J | ( | J | + 1) + | J | log h ( k +1) eN | J | i . 
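The two-stage construction described above — a box-constrained pass to identify the errors, then a hard-margin pass on the remaining points — and the complexity of the resulting compression set can be sketched in a few lines. Everything below is an illustration under assumptions: a generic solver (scipy's SLSQP) stands in for a proper SVM solver, the toy data and the box level $\lambda$ are arbitrary, and the thresholds used to detect saturated or zero coefficients are numerical conveniences.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
N, k, lam = 40, 1, 2.0
# noisy toy sample: two overlapping clouds, so Z need not be K-separable
x = np.vstack([rng.normal((2.0, 2.0), 1.0, (N // 2, 2)),
               rng.normal((0.0, 0.0), 1.0, (N // 2, 2))])
y = np.r_[np.ones(N // 2), -np.ones(N // 2)]

K = np.exp(-((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))   # Gaussian kernel
F = lambda a: (a * y) @ K @ (a * y) - 2 * a.sum()             # dual criterion
eq = [{"type": "eq", "fun": lambda a: a @ y}]

# first pass: box constraint alpha_i <= lam/2
res1 = minimize(F, np.zeros(N), bounds=[(0, lam / 2)] * N,
                constraints=eq, method="SLSQP")
keep = res1.x < lam / 2 - 1e-6               # remove the saturated (error) points

# second pass: hard-margin dual restricted to the kept points
bounds2 = [(0, None) if kept else (0, 0) for kept in keep]
res2 = minimize(F, np.zeros(N), bounds=bounds2, constraints=eq, method="SLSQP")
J = np.flatnonzero(res2.x > 1e-6)            # compression set

complexity = np.log(len(J) * (len(J) + 1)) + len(J) * np.log(np.e * (k + 1) * N / len(J))
print("compression set size |J| =", len(J))
print("-log pi_k(Delta_k(J)) <= %.2f" % complexity)
```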
If we observe the shadow sample patterns, and if computer resourc e s p ermit, we can o f course use more elab orate b ounds than Theorem 3.3.3, such a s the transduc- tiv e equiv alent for Theorem 1.3.15 (pag e 31) (where w e may consider the submod- els made of all the compressio n sets o f the same size). Theor ems ba sed on r elativ e bo unds, suc h as Theor em 2.2.4 (pag e 73) or Theorem 2.3.9 (page 108) can also b e used. Gibbs distributions can b e a ppro ximated by Monte Carlo techniques, where a Markov chain with the proper inv ariant measure consists in appropriate loc al per turbations of the compression set. Let us mention also that the us e of c o mpression schemes ba s ed on Supp ort V e c to r Machines ca n be tailo r ed to p erform some kind of fe atur e aggr e gation . Imagine that the kernel K is defined as the scalar pro duct in L 2 ( π ), where π ∈ M 1 + (Θ). More precisely let us consider for s o me set of soft clas sification rules f θ : X → R ; θ ∈ Θ 4.2. Bounds for Supp ort V e ctor Machi nes 143 the kernel K ( x, x ′ ) = Z θ ∈ Θ f θ ( x ) f θ ( x ′ ) π ( dθ ) . In this setting, the Supp ort V ector Machine applied to the training set Z = ( x i , y i ) N i =1 has the form f Z ( x ) = sign Z θ ∈ Θ f θ ( x ) N X i =1 y i α i f θ ( x i ) π ( dθ ) − b ! and, if this is to o burdensome to compute, w e can replace it with some finite approximation e f Z ( x ) = sign 1 m m X k =1 f θ k ( x ) w k − b ! , where the s e t { θ k , k = 1 , . . . , m } and the weight s { w k , k = 1 , . . . , m } are computed in some suitable way from the set Z ′ = ( x i , y i ) i,α i > 0 of supp ort vectors of f Z . F or instance, we can draw { θ k , k = 1 , . . . , m } at r a ndom acc ording to the pr obabilit y distribution prop ortional to N X i =1 y i α i f θ ( x i ) π ( dθ ) , define the weigh ts w k b y w k = sig n N X i =1 y i α i f θ k ( x i ) ! Z θ ∈ Θ N X i =1 y i α i f θ ( x i ) π ( dθ ) , and c ho ose the sma llest v alue of m for which this approximation still classifies Z ′ without errors . Let us remark that w e hav e built e f Z in such a wa y that lim m → + ∞ e f Z ( x i ) = f Z ( x i ) = y i , a.s. for any supp ort index i such that α i > 0. Alternatively , given Z ′ , we can select a finite set o f features Θ ′ ⊂ Θ such that Z ′ is K Θ ′ separable, where K Θ ′ ( x, x ′ ) = P θ ∈ Θ ′ f θ ( x ) f θ ( x ′ ) a nd consider the Supp ort V ector Machines f Z ′ built with the kernel K Θ ′ . As so on as Θ ′ is chosen as a function of Z ′ only , Theo rem 3.3.3 (page 125) applies and provides some level of co nfidence for the risk of f Z ′ . 4.2.2. The V apni k–Cervonenkis dimension of a family of subsets Let us consider some set X and some s et S ⊂ { 0 , 1 } X of subsets of X . Let h ( S ) b e the V apnik– Cerv onenkis dimension of S , defined as h ( S ) = max n | A | : A ⊂ X, | A | < ∞ and A ∩ S = { 0 , 1 } A o , where by definition A ∩ S = { A ∩ B : B ∈ S } and | A | is the num b er of points in A . Let us no tice that this definition do es no t dep end on the c hoice of the r eference se t X . Indeed X can b e c hosen to b e S S , the unio n of all the sets in S or any bigg er 144 Chapter 4. Supp ort V e ctor Machines set. Let us notice also that for any set B , h ( B ∩ S ) ≤ h ( S ), the reaso n b eing that A ∩ ( B ∩ S ) = B ∩ ( A ∩ S ). This notion of V apnik–Cervonenkis dimension is useful b ecause, as we will see for Supp ort V ector Machines, it can b e computed in some important sp ecial ca s es. 
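As a concrete illustration of the last remark (and of Proposition 4.2.1 below, which states that the half-spaces of $\mathbb{R}^d$ have Vapnik–Cervonenkis dimension $d+1$), the sketch below tests by linear programming whether every labelling of a small point set can be realized by a half-space; the point sets and the use of scipy's linprog are illustrative choices, not part of the text.

```python
import numpy as np
from scipy.optimize import linprog

def separable(x, y):
    # feasibility of y_i (<w, x_i> - b) >= 1 in the variables (w, b)
    n, d = x.shape
    A_ub = -(y[:, None] * np.c_[x, -np.ones(n)])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def shattered(points):
    n = len(points)
    return all(separable(points,
                         np.array([1.0 if (m >> i) & 1 else -1.0 for i in range(n)]))
               for m in range(2 ** n))

simplex = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # d + 1 points of R^2
square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # d + 2 points of R^2

print("3 points shattered by half-spaces of R^2:", shattered(simplex))  # True
print("4 points shattered by half-spaces of R^2:", shattered(square))   # False (XOR labelling fails)
```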
Let us prov e her e as an illustr a tion that h ( S ) = d + 1 when X = R d and S is made of all the half spaces: S = { A w, b : w ∈ R d , b ∈ R } , where A w, b = { x ∈ X : h w , x i ≥ b } . Prop osition 4. 2 .1 . With the pr evious notation, h ( S ) = d + 1 . Proof. Let ( e i ) d +1 i =1 be the canonical base o f R d +1 , and let X be the a ffine subspace it gener ates, whic h can b e identified with R d . F o r any ( ǫ i ) d +1 i =1 ∈ {− 1 , +1 } d +1 , let w = P d +1 i =1 ǫ i e i and b = 0. The half space A w, b ∩ X is such that { e i ; i = 1 , . . . , d + 1 } ∩ ( A w, b ∩ X ) = { e i ; ǫ i = + 1 } . This prov es that h ( S ) ≥ d + 1. T o prove tha t h ( S ) ≤ d + 1 , we hav e to show that for any set A ⊂ R d of size | A | = d + 2, there is B ⊂ A such that B 6∈ ( A ∩ S ). Obviously this will b e the case if the conv ex hull s of B and A \ B have a non-empty in tersectio n: indeed if a hyper pla ne separates tw o sets of p oint s, it a lso separ ates their conv ex hulls. As | A | > d + 1, A is affine dependent: there is ( λ x ) x ∈ A ∈ R d +2 \ { 0 } such that P x ∈ A λ x x = 0 and P x ∈ A λ x = 0 . The set B = { x ∈ A : λ x > 0 } and its co mplement A \ B are non- empt y , bec a use P x ∈ A λ x = 0 and λ 6 = 0. Mor eo ver P x ∈ B λ x = P x ∈ A \ B − λ x > 0 . The relation 1 P x ∈ B λ x X x ∈ B λ x x = 1 P x ∈ B λ x X x ∈ A \ B − λ x x shows that the convex h ulls of B and A \ B hav e a non-void in terse c tio n. Let us in tro duce the function of t wo in tegers Φ h n = h X k =0 n k , which can alternatively b e defined by the r elations Φ h n = ( 2 n when n ≤ h, Φ h − 1 n − 1 + Φ h n − 1 when n > h. Theorem 4 . 2.2 . Whenever S S is finite, | S | ≤ Φ [ S , h ( S ) . Theorem 4 . 2.3 . F or any h ≤ n , Φ h n ≤ exp nH h n ≤ exp h log( n h ) + 1 , wher e H ( p ) = − p lo g( p ) − (1 − p ) lo g (1 − p ) is t he Shannon ent r opy of t he Bernoul li distribution with p ar ameter p . 4.2. Bounds for Supp ort V e ctor Machi nes 145 Proof of theorem 4.2.2. Let us prove this theorem by induction on | S S | . It is ea sy to chec k that it holds true when | S S | = 1 . Let X = S S , let x ∈ X a nd X ′ = X \ { x } . Define ( △ denoting the symmetric difference of t wo sets) S ′ = { A ∈ S : A △ { x } ∈ S } , S ′′ = { A ∈ S : A △ { x } 6∈ S } . Clearly , ⊔ deno ting the disjoint union, S = S ′ ⊔ S ′′ and S ∩ X ′ = ( S ′ ∩ X ′ ) ⊔ ( S ′′ ∩ X ′ ). Moreov er | S ′ | = 2 | S ′ ∩ X ′ | and | S ′′ | = | S ′′ ∩ X ′ | . Thus | S | = | S ′ | + | S ′′ | = 2 | S ′ ∩ X ′ | + | S ′′ | = | S ∩ X ′ | + | S ′ ∩ X ′ | . Obviously h ( S ∩ X ′ ) ≤ h ( S ). Mor e over h ( S ′ ∩ X ′ ) = h ( S ′ ) − 1, b ecause if A ⊂ X ′ is shattered by S ′ (or equiv alently by S ′ ∩ X ′ ), then A ∪ { x } is shattered by S ′ (w e say that A is shattered by S when A ∩ S = { 0 , 1 } A ). Using the induction hypothesis, we then see that | S ∩ X ′ | ≤ Φ h ( S ) | X ′ | + Φ h ( S ) − 1 | X ′ | . But as | X ′ | = | X | − 1, the right-hand side of this inequa lit y is equal to Φ h ( S ) | X | , according to the recurrence equa tion satisfied b y Φ. Proof of theorem 4. 2.3: This is the well-known Chernoff bound for the deviation of sums of Bernoulli random v ariables : let ( σ 1 , . . . , σ n ) b e i.i.d. Bernoulli random v a riables with para meter 1 / 2. Let us notice that Φ h n = 2 n P n X i =1 σ i ≤ h ! . F or a n y po sitiv e real num ber λ , P n X i =1 σ i ≤ h ≤ exp( λh ) E exp − λ n X i =1 σ i = exp n λh + n log E exp − λσ 1 o . 
Differen tiating the right-hand side in λ shows that its minimal v alue is exp − n K ( h n , 1 2 ) , where K ( p, q ) = p log( p q ) + (1 − p ) lo g ( 1 − p 1 − q ) is the Kullback diver- gence function b etw een tw o Bernoulli distributions B p and B q of parameters p and q . Indeed the optimal v alue λ ∗ of λ is such that h = n E σ 1 exp( − λ ∗ σ 1 ) E exp( − λ ∗ σ 1 ) = nB h/n ( σ 1 ) . Therefore, using the fact that t wo Bernoulli distributions with the same exp ecta- tions are equal, log E exp( − λ ∗ σ 1 ) = − λ ∗ B h/n ( σ 1 ) − K ( B h/n , B 1 / 2 ) = − λ ∗ h n − K ( h n , 1 2 ) . The announced result then follows from the identit y H ( p ) = log(2 ) − K ( p, 1 2 ) = p log( p − 1 ) + (1 − p ) log (1 + p 1 − p ) ≤ p log( p − 1 ) + 1 . 146 Chapter 4. Supp ort V e ctor Machines 4.2.3. V apnik–Cer vonenkis dimension of line ar ru les with mar gi n The pro of of the follo wing theorem was suggested to us by a similar pro of pre s en ted in Cristianini et al. (2000). Theorem 4 . 2.4 . Consider a family of p oints ( x 1 , . . . , x n ) in some Euclide an ve c- tor sp ac e E and a family of affine functions H = g w, b : E → R ; w ∈ E , k w k = 1 , b ∈ R , wher e g w, b ( x ) = h w, x i − b, x ∈ E . Assume that ther e is a set of thr esholds ( b i ) n i =1 ∈ R n such that for any ( y i ) n i =1 ∈ {− 1 , +1 } n , t her e is g w, b ∈ H su ch that n inf i =1 g w, b ( x i ) − b i y i ≥ γ . L et us also intr o duc e the empiri c al varianc e of ( x i ) n i =1 , V ar( x 1 , . . . , x n ) = 1 n n X i =1 x i − 1 n n X j =1 x j 2 . In this c ase and with this notation, (4.5) V ar( x 1 , . . . , x n ) γ 2 ≥ ( n − 1 when n is even, ( n − 1) n 2 − 1 n 2 when n is o dd. Mor e over, e quality is re ache d when γ is optimal, b i = 0 , i = 1 , . . . , n and ( x 1 , . . . , x n ) is a r e gular simplex (i.e. when 2 γ is the minimum distanc e b etwe en the c onvex hul ls of any two subsets of { x 1 , . . . , x n } and k x i − x j k do es not dep end on i 6 = j ). Proof. Let ( s i ) n i =1 ∈ R n be such that P n i =1 s i = 0. Let σ b e a uniformly distributed random v aria ble with v a lues in S n , the set of p erm utations of the first n integers { 1 , . . . , n } . By ass umption, for any v alue of σ , there is an affine function g w, b ∈ H such that min i =1 ,...,n g w, b ( x i ) − b i 2 1 ( s σ ( i ) > 0 ) − 1 ≥ γ . As a consequence * n X i =1 s σ ( i ) x i , w + = n X i =1 s σ ( i ) h x i , w i − b − b i + n X i =1 s σ ( i ) b i ≥ n X i =1 γ | s σ ( i ) | + s σ ( i ) b i . Therefore, using the fact that the map x 7→ max 0 , x 2 is conv ex, E n X i =1 s σ ( i ) x i 2 ! ≥ E max ( 0 , n X i =1 γ | s σ ( i ) | + s σ ( i ) b i )! 2 4.2. Bounds for Supp ort V e ctor Machi nes 147 ≥ max ( 0 , n X i =1 γ E | s σ ( i ) | + E s σ ( i ) b i )! 2 = γ 2 n X i =1 | s i | ! 2 , where E is the e x pectation with resp ect to the rando m per m utation σ . On the other hand E n X i =1 s σ ( i ) x i 2 ! = n X i =1 E ( s 2 σ ( i ) ) k x i k 2 + X i 6 = j E ( s σ ( i ) s σ ( j ) ) h x i , x j i . Moreov er E ( s 2 σ ( i ) ) = 1 n E n X i =1 s 2 σ ( i ) ! = 1 n n X i =1 s 2 i . In the same wa y , for a n y i 6 = j , E s σ ( i ) s σ ( j ) = 1 n ( n − 1) E X i 6 = j s σ ( i ) s σ ( j ) = 1 n ( n − 1) X i 6 = j s i s j = 1 n ( n − 1) " n X i =1 s i | {z } =0 ! 2 − n X i =1 s 2 i # = − 1 n ( n − 1) n X i =1 s 2 i . Thu s E n X i =1 s σ ( i ) x i 2 ! = n X i =1 s 2 i ! 1 n n X i =1 k x i k 2 − 1 n ( n − 1) X i 6 = j h x i , x j i = n X i =1 s 2 i ! 
" 1 n + 1 n ( n − 1) n X i =1 k x i k 2 − 1 n ( n − 1) n X i =1 x i 2 # = n n − 1 n X i =1 s 2 i ! V ar( x 1 , . . . , x n ) . W e have prov ed that V ar( x 1 , . . . , x n ) γ 2 ≥ ( n − 1) n X i =1 | s i | 2 n n X i =1 s 2 i . This ca n b e used with s i = 1 ( i ≤ n 2 ) − 1 ( i > n 2 ) in the case when n is even and s i = 2 ( n − 1) 1 ( i ≤ n − 1 2 ) − 2 n +1 1 ( i > n − 1 2 ) in the case when n is o dd, to establish the first inequality (4.5) o f the theo r em. 148 Chapter 4. Supp ort V e ctor Machines Checking that equality is reached for the s implex is an eas y computation when the simplex ( x i ) n i =1 ∈ ( R n ) n is parametrize d in such a wa y that x i ( j ) = ( 1 if i = j, 0 o therwise. Indeed the distance b et ween the conv ex hulls of any t wo subsets of the simplex is the distance b et ween their mean v alues (i.e. cent ers of ma ss). 4.2.4. A pplic ation to Su pp ort V e ctor Machines W e are going to apply Theorem 4.2 .4 (pag e 14 6 ) to Supp ort V ector Machines in the tr a nsductiv e case. Let ( X i , Y i ) ( k +1) N i =1 be distributed according to some pa rtially exchangeable distribution P and ass ume that ( X i ) ( k +1) N i =1 and ( Y i ) N i =1 are obse r v ed. Let us consider some p ositive kernel K on X . F or any K -separable tra ining set of the form Z ′ = ( X i , y ′ i ) ( k +1) N i =1 , where ( y ′ i ) ( k +1) N i =1 ∈ Y ( k +1) N , let ˆ f Z ′ be the Supp ort V ector Machine defined by K and Z ′ and let γ ( Z ′ ) be its margin. L e t R 2 = max i =1 ,..., ( k +1) N K ( X i , X i ) + 1 ( k + 1 ) 2 N 2 ( k +1) N X j =1 ( k +1) N X k =1 K ( X j , X k ) − 2 ( k + 1 ) N ( k +1) N X j =1 K ( X i , X j ) . This is a n eas ily computable upp er-b ound for the radius of some ball co n taining the image of ( X 1 , . . . , X ( k +1) N ) in feature space. Let us define for any integer h the marg ins (4.6) γ 2 h = (2 h − 1) − 1 / 2 and γ 2 h +1 = 2 h 1 − 1 (2 h + 1) 2 − 1 / 2 . Let us consider for any h = 1 , . . . , N the exchangeable mo del R h = ˆ f Z ′ : Z ′ = ( X i , y ′ i ) ( k +1) N i =1 is K -separa ble and γ ( Z ′ ) ≥ R γ h . The family of mo dels R h , h = 1 , . . . , N is nested, and we know from Theorem 4.2.4 (page 146) and Theorems 4.2.2 (page 144) and 4.2.3 (page 144) that log | R h | ≤ h log ( k +1) eN h . W e c a n then consider o n the large model R = F N h =1 R h (the disjoin t union of the sub-mo dels) an exc hangea ble prior π whic h is uniform on each R h and is suc h that π ( R h ) ≥ 1 h ( h +1) . Applying Theorem 3.2.3 (page 117) w e get Prop osition 4. 2 .5 . With P pr ob ability at le ast 1 − ǫ , for any h = 1 , . . . , N , any Supp ort V e ct or Machine f ∈ R h , r 2 ( f ) ≤ k + 1 k inf λ ∈ R + 1 − e x p h − λ N r 1 ( f ) − h N log e ( k +1) N h − log[ h ( h +1)] − log( ǫ ) N i 1 − e x p ( − λ N ) − r 1 ( f ) k . 4.2. Bounds for Supp ort V e ctor Machi nes 149 Searching the whole mo del R h to optimize the b ound may r equire mor e computer resources than a r e av ailable, but an y heuristic can b e applied to choo se f , since the bo und is uniform. F or instance, a Supp ort V ector Machine f ′ using a b o x constraint can b e trained fro m the tr a ining set ( X i , Y i ) N i =1 and then ( y ′ i ) ( k +1) N i =1 can b e set to y ′ i = sign( f ′ ( X i )), i = 1 , . . . , ( k + 1) N . 4.2.5. Indu ctive mar gin b ounds for Supp ort V e ctor Machines In order to establish inductive margin b ounds, we will need a different combinatorial lemma. It is due to Alon et al. (1997). 
4.2.5. Inductive margin bounds for Support Vector Machines

In order to establish inductive margin bounds, we will need a different combinatorial lemma. It is due to Alon et al. (1997). We will reproduce their proof with some tiny improvements on the values of constants.

Let us consider the finite case when $\mathcal{X} = \{1,\dots,n\}$, $\mathcal{Y} = \{1,\dots,b\}$ and $b \ge 3$. The question we will study would be meaningless when $b \le 2$. Assume as usual that we are dealing with a prescribed set of classification rules $\mathcal{R} \subset \{ f : \mathcal{X} \to \mathcal{Y} \}$. Let us say that a pair $(A,s)$, where $A \subset \mathcal{X}$ is a non-empty set of shapes and $s : A \to \{2,\dots,b-1\}$ a threshold function, is shattered by the set of functions $F \subset \mathcal{R}$ if for any $(\sigma_x)_{x\in A} \in \{-1,+1\}^A$ there exists some $f \in F$ such that
\[
\min_{x \in A} \sigma_x \bigl[ f(x) - s(x) \bigr] \ge 1 .
\]

Definition 4.2.1. Let the fat shattering dimension of $(\mathcal{X}, \mathcal{R})$ be the maximal size $|A|$ of the first component of the pairs which are shattered by $\mathcal{R}$.

Let us say that a subset of classification rules $F \subset \mathcal{Y}^{\mathcal{X}}$ is separated whenever for any pair $(f,g) \in F^2$ such that $f \ne g$, $\|f - g\|_\infty = \max_{x\in\mathcal{X}} |f(x) - g(x)| \ge 2$. Let $M(\mathcal{R})$ be the maximum size $|F|$ of separated subsets $F$ of $\mathcal{R}$. Note that if $F$ is a separated subset of $\mathcal{R}$ such that $|F| = M(\mathcal{R})$, then it is a $1$-net for the $L^\infty$ distance: for any function $f \in \mathcal{R}$ there exists $g \in F$ such that $\|f - g\|_\infty \le 1$ (otherwise $f$ could be added to $F$ to create a larger separated set).

Lemma 4.2.6. With the above notation, whenever the fat shattering dimension of $(\mathcal{X}, \mathcal{R})$ is not greater than $h$,
\[
\log M(\mathcal{R}) < \log\bigl[ (b-1)(b-2)n \bigr] \Biggl( \frac{\log\Bigl[ \sum_{i=1}^h \binom{n}{i} (b-2)^i \Bigr]}{\log(2)} + 1 \Biggr) + \log(2)
\le \log\bigl[ (b-1)(b-2)n \bigr] \Biggl( \frac{h \Bigl\{ \log\bigl[ \frac{(b-2)n}{h} \bigr] + 1 \Bigr\}}{\log(2)} + 1 \Biggr) + \log(2) .
\]
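Before the proof, the first expression in the bound of Lemma 4.2.6 is easy to evaluate; a minimal sketch, in which the example values of $n$, $b$ and $h$ are arbitrary:

```python
import math

# Evaluation of the first bound of Lemma 4.2.6 on log M(R) (illustrative sketch).
def log_covering_bound(n, b, h):
    pairs = sum(math.comb(n, i) * (b - 2) ** i for i in range(1, h + 1))
    r_bound = math.log(pairs) / math.log(2) + 1     # upper bound on r in the proof
    return math.log((b - 1) * (b - 2) * n) * r_bound + math.log(2)

print(log_covering_bound(n=1000, b=11, h=5))
```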
Proof. For any set of functions $F \subset \mathcal{Y}^{\mathcal{X}}$, let $t(F)$ be the number of pairs $(A,s)$ shattered by $F$. Let $t(m,n)$ be the minimum of $t(F)$ over all separated sets of functions $F \subset \mathcal{Y}^{\mathcal{X}}$ of size $|F| = m$ ($n$ is here to recall that the shape space $\mathcal{X}$ is made of $n$ shapes). For any $m$ such that $t(m,n) > \sum_{i=1}^h \binom{n}{i} (b-2)^i$, it is clear that any separated set of functions of size $|F| \ge m$ shatters at least one pair $(A,s)$ such that $|A| > h$. Indeed, from its definition $t(m,n)$ is clearly a non-decreasing function of $m$, so that $t(|F|,n) > \sum_{i=1}^h \binom{n}{i} (b-2)^i$. Moreover there are only $\sum_{i=1}^h \binom{n}{i} (b-2)^i$ pairs $(A,s)$ such that $|A| \le h$. As a consequence, whenever the fat shattering dimension of $(\mathcal{X},\mathcal{R})$ is not greater than $h$ we have $M(\mathcal{R}) < m$. It is clear that for any $n \ge 1$, $t(2,n) = 1$.

Lemma 4.2.7. For any $m \ge 1$,
\[
t\bigl( m\,n(b-1)(b-2),\, n \bigr) \ge 2\, t(m, n-1),
\qquad\text{and therefore}\qquad
t\bigl( 2\, n(n-1)\cdots(n-r+1)\, (b-1)^r (b-2)^r,\, n \bigr) \ge 2^r .
\]

Proof. Let $F = \{ f_1, \dots, f_{m n (b-1)(b-2)} \}$ be some separated set of functions of size $m n (b-1)(b-2)$. For any pair $(f_{2i-1}, f_{2i})$, $i = 1,\dots, m n (b-1)(b-2)/2$, there is $x_i \in \mathcal{X}$ such that $|f_{2i-1}(x_i) - f_{2i}(x_i)| \ge 2$. Since $|\mathcal{X}| = n$, there is $x \in \mathcal{X}$ such that $\sum_{i=1}^{m n (b-1)(b-2)/2} \mathbb{1}(x_i = x) \ge m (b-1)(b-2)/2$. Let $I = \{ i : x_i = x \}$. Since there are $(b-1)(b-2)/2$ pairs $(y_1, y_2) \in \mathcal{Y}^2$ such that $1 \le y_1 < y_2 - 1 \le b-1$, there is some pair $(y_1, y_2)$ such that $1 \le y_1 < y_2 \le b$ and such that $\sum_{i \in I} \mathbb{1}\bigl( \{y_1, y_2\} = \{ f_{2i-1}(x), f_{2i}(x) \} \bigr) \ge m$. Let $J = \bigl\{ i \in I : \{ f_{2i-1}(x), f_{2i}(x) \} = \{ y_1, y_2 \} \bigr\}$. Let
\[
F_1 = \{ f_{2i-1} : i \in J,\ f_{2i-1}(x) = y_1 \} \cup \{ f_{2i} : i \in J,\ f_{2i}(x) = y_1 \},
\qquad
F_2 = \{ f_{2i-1} : i \in J,\ f_{2i-1}(x) = y_2 \} \cup \{ f_{2i} : i \in J,\ f_{2i}(x) = y_2 \}.
\]
Obviously $|F_1| = |F_2| = |J| = m$. Moreover the restrictions of the functions of $F_1$ to $\mathcal{X}\setminus\{x\}$ are separated, and it is the same with $F_2$. Thus $F_1$ shatters at least $t(m, n-1)$ pairs $(A,s)$ such that $A \subset \mathcal{X}\setminus\{x\}$, and it is the same with $F_2$. Finally, if the pair $(A,s)$, where $A \subset \mathcal{X}\setminus\{x\}$, is shattered both by $F_1$ and by $F_2$, then $F_1 \cup F_2$ also shatters $(A \cup \{x\}, s')$, where $s'(x') = s(x')$ for any $x' \in A$ and $s'(x) = \lfloor \frac{y_1 + y_2}{2} \rfloor$. Thus $F_1 \cup F_2$, and therefore $F$, shatters at least $2\,t(m, n-1)$ pairs $(A,s)$.

Resuming the proof of Lemma 4.2.6, let us choose for $r$ the smallest integer such that $2^r > \sum_{i=1}^h \binom{n}{i} (b-2)^i$, which is no greater than
\[
\frac{\log\Bigl[ \sum_{i=1}^h \binom{n}{i} (b-2)^i \Bigr]}{\log(2)} + 1 .
\]
In the case when $1 \le n \le r$,
\[
\log M(\mathcal{R}) < |\mathcal{X}| \log(|\mathcal{Y}|) = n \log(b) \le r \log(b) \le r \log\bigl[ (b-1)(b-2)n \bigr] + \log(2),
\]
which proves the lemma. In the remaining case $n > r$,
\[
t\bigl( 2\, n^r (b-1)^r (b-2)^r,\, n \bigr) \ge t\bigl( 2\, n(n-1)\cdots(n-r+1)\, (b-1)^r (b-2)^r,\, n \bigr) > \sum_{i=1}^h \binom{n}{i} (b-2)^i .
\]
Thus $M(\mathcal{R}) < 2 \bigl[ (b-2)(b-1) n \bigr]^r$ as claimed.

In order to apply this combinatorial lemma to Support Vector Machines, let us consider now the case of separating hyperplanes in $\mathbb{R}^d$ (the generalization to Support Vector Machines being straightforward). Assume that $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{-1,+1\}$. For any sample $(X_i)_{i=1}^{(k+1)N}$, let $R\bigl( X_1^{(k+1)N} \bigr) = \max\{ \|X_i\| : 1 \le i \le (k+1)N \}$. Let us consider the set of parameters $\Theta = \{ (w,b) \in \mathbb{R}^d \times \mathbb{R} : \|w\| = 1 \}$. For any $(w,b) \in \Theta$, let $g_{w,b}(x) = \langle w, x \rangle - b$. Let $h$ be some fixed integer and let $\gamma = R\bigl( X_1^{(k+1)N} \bigr)\, \gamma_h$, where $\gamma_h$ is defined by equation (4.6, page 148). Let us define $\zeta : \mathbb{R} \to \mathbb{Z}$ by $\zeta(r) = -5$ when $r \le -4\gamma$, $-3$ when $-4\gamma$ [...]
\[
\frac{1}{kN} \sum_{i=N+1}^{(k+1)N} \mathbb{1}\bigl[ g_{w,b}(X_i)\, Y_i < 0 \bigr]
\le \frac{k+1}{k} \biggl( 1 - \exp\biggl\{ -\frac{\log\bigl[ 20(k+1)N \bigr]}{N} \Bigl[ \frac{16 R^2 + 2\gamma^2\log(2)}{\gamma^2}\, \log\Bigl( \frac{e(k+1)N\gamma^2}{4R^2} \Bigr) + 1 \Bigr] + \frac1N \log\Bigl( \frac{\epsilon}{2} \Bigr) \biggr\} \biggr).
\]
This inequality compares favourably with similar inequalities in Cristianini et al. (2000), which moreover do not extend to the margin quantile case as this one does. Let us also mention that it is easy to circumvent the fact that $R$ is not observed when the test set $X_{N+1}^{(k+1)N}$ is not observed. Indeed, we can consider the sample obtained by projecting $X_1^{(k+1)N}$ on some ball of fixed radius $R_{\max}$, putting
\[
t_{R_{\max}}(X_i) = \min\Bigl( 1, \frac{R_{\max}}{\|X_i\|} \Bigr) X_i .
\]
We can further consider an atomic prior distribution $\nu \in \mathcal{M}_+^1(\mathbb{R}_+)$ bearing on $R_{\max}$, to obtain a uniform result through a union bound.
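The projection $t_{R_{\max}}$ and an atomic prior $\nu$ over candidate radii can be set up as follows; a minimal sketch in which the geometric grid and the uniform weights are arbitrary illustrative choices, not prescriptions of the text.

```python
import numpy as np

# Projection of a pattern onto the ball of radius R_max, and an atomic prior
# nu over a grid of candidate radii (grid and weights chosen arbitrarily).
def project_to_ball(x, r_max):
    norm = np.linalg.norm(x)
    return x if norm <= r_max else (r_max / norm) * x

radii = [2.0 ** j for j in range(-5, 6)]             # atomic support of nu
nu = {r: 1.0 / len(radii) for r in radii}            # uniform atomic prior
# In Corollary 4.2.9 below, the confidence level enters through eps * nu(R_max),
# which is how the union bound over the grid of radii is accounted for.
```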
As a consequence of the previous theorem, we have

Corollary 4.2.9. For any atomic prior $\nu \in \mathcal{M}_+^1(\mathbb{R}_+)$, for any partially exchangeable probability measure $\mathbb{P} \in \mathcal{M}_+^1(\Omega)$, with $\mathbb{P}$ probability at least $1-\epsilon$, for any $(w,b) \in \Theta$, any $R_{\max} \in \mathbb{R}_+$,
\[
\frac{1}{kN} \sum_{i=N+1}^{(k+1)N} \mathbb{1}\Bigl\{ 2\,\mathbb{1}\bigl[ g_{w,b}\circ t_{R_{\max}}(X_i) \ge 0 \bigr] - 1 \ne Y_i \Bigr\}
\le \frac{k+1}{k} \inf_{\lambda \in \mathbb{R}_+,\, h \in \mathbb{N}^*}
\Bigl[ 1 - \exp\Bigl( -\frac{\lambda}{N} \Bigr) \Bigr]^{-1}
\Biggl( 1 - \exp\biggl[ -\frac{\lambda}{N^2} \sum_{i=1}^{N} \mathbb{1}\bigl[ g_{w,b}\circ t_{R_{\max}}(X_i)\, Y_i \le 4 R_{\max}\gamma_h \bigr]
- \frac{\log\bigl[ 20(k+1)N \bigr] \Bigl\{ \frac{h}{\log(2)} \log\bigl( \frac{4e(k+1)N}{h} \bigr) + 1 \Bigr\} + \log\Bigl[ \frac{2h(h+1)}{\epsilon\,\nu(R_{\max})} \Bigr]}{N} \biggr] \Biggr)
- \frac{1}{kN} \sum_{i=1}^{N} \mathbb{1}\bigl[ g_{w,b}\circ t_{R_{\max}}(X_i)\, Y_i \le 4 R_{\max}\gamma_h \bigr].
\]
Let us remark that $t_{R_{\max}}(X_i) = X_i$, $i = N+1,\dots,(k+1)N$, as soon as we consider only the values of $R_{\max}$ not smaller than $\max_{i=N+1,\dots,(k+1)N} \|X_i\|$ in this corollary. Thus we obtain a bound on the transductive generalization error of the unthresholded classification rule $2\,\mathbb{1}[g_{w,b}(X_i) \ge 0] - 1$, as well as some incitation to replace it with a thresholded rule when the value of $R_{\max}$ minimizing the bound falls below $\max_{i=N+1,\dots,(k+1)N} \|X_i\|$.

Appendix: Classification by thresholding

In this appendix, we show how the bounds given in the first section of this monograph can be computed in practice on a simple example: the case when the classification is performed by comparing a series of measurements to threshold values. Let us mention that our description covers the case when the same measurement is compared to several thresholds, since it is enough to repeat a measurement in the list of measurements describing a pattern to cover this case.

5.1. Description of the model

Let us assume that the patterns we want to classify are described through $h$ real valued measurements normalized in the range $(0,1)$. In this setting the pattern space can thus be defined as $\mathcal{X} = (0,1)^h$. Consider the threshold set $T = (0,1)^h$ and the response set $\mathcal{R} = \mathcal{Y}^{\{0,1\}^h}$. For any $t \in (0,1)^h$ and any $a : \{0,1\}^h \to \mathcal{Y}$, let
\[
f_{(t,a)}(x) = a\Bigl( \bigl[ \mathbb{1}(x^j \ge t_j) \bigr]_{j=1}^h \Bigr), \qquad x \in \mathcal{X},
\]
where $x^j$ is the $j$th coordinate of $x \in \mathcal{X}$. Thus our parameter set here is $\Theta = T \times \mathcal{R}$. Let us consider the Lebesgue measure $\mathcal{L}$ on $T$ and the uniform probability distribution $\mathcal{U}$ on $\mathcal{R}$. Let our prior distribution be $\pi = \mathcal{L} \otimes \mathcal{U}$. Let us define for any threshold sequence $t \in T$
\[
\Delta_t = \bigl\{ t' \in T : (t'_j, t_j) \cap \{ X_i^j ;\ i = 1,\dots,N \} = \emptyset,\ j = 1,\dots,h \bigr\},
\]
where $X_i^j$ is the $j$th coordinate of the sample pattern $X_i$, and where the interval $(t'_j, t_j)$ of the real line is defined as the convex hull of the two point set $\{t'_j, t_j\}$, whether $t'_j \le t_j$ or not. We see that $\Delta_t$ is the set of thresholds giving the same response as $t$ on the training patterns. Let us consider for any $t \in T$ the middle
\[
m(\Delta_t) = \frac{\int_{\Delta_t} t'\, \mathcal{L}(dt')}{\mathcal{L}(\Delta_t)}
\]
of $\Delta_t$. The set $\Delta_t$ being a product of intervals, its middle is the point whose coordinates are the middles of these intervals. Let us introduce the finite set $\overline{T}$ composed of the middles of the cells $\Delta_t$, which can be defined as $\overline{T} = \{ t \in T : t = m(\Delta_t) \}$. It is easy to see that $|\overline{T}| \le (N+1)^h$ and that $|\mathcal{R}| = |\mathcal{Y}|^{2^h}$.

5.2. Computation of inductive bounds

For any parameter $(t,a) \in T \times \mathcal{R} = \Theta$, let us consider the posterior distribution defined by its density
\[
\frac{d\rho_{(t,a)}}{d\pi}(t',a') = \frac{\mathbb{1}(t' \in \Delta_t)\, \mathbb{1}(a' = a)}{\pi\bigl( \Delta_t \times \{a\} \bigr)} .
\]
In fact we are considering a finite number of posterior distributions, since $\rho_{(t,a)} = \rho_{(m(\Delta_t),a)}$, where $m(\Delta_t) \in \overline{T}$.
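To make the construction concrete, here is a minimal sketch of the finite set $\overline{T}$ of cell middles and of the evaluation of $f_{(t,a)}$; the variable and function names are invented for the example, and `X` stands for an $N \times h$ array of training patterns in $(0,1)^h$.

```python
import numpy as np

# Per-coordinate middles of the N+1 intervals delimited by the training
# values: combining one middle per coordinate gives the (at most (N+1)^h)
# representatives t = m(Delta_t) making up the finite set \bar{T}.
def cell_middles(X):
    N, h = X.shape
    middles = []
    for j in range(h):
        v = np.concatenate(([0.0], np.sort(X[:, j]), [1.0]))
        middles.append((v[:-1] + v[1:]) / 2.0)       # N + 1 middles per coordinate
    return middles                                   # list of h arrays

def classify(x, t, a):
    """f_{(t,a)}(x) = a(c), where c = (1(x_j >= t_j))_{j=1..h}."""
    c = tuple(int(x_j >= t_j) for x_j, t_j in zip(x, t))
    return a[c]                                      # a maps {0,1}^h to labels
```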
Moreover, for any exchangeable sample distribution $\mathbb{P} \in \mathcal{M}_+^1\bigl[ (\mathcal{X}\times\mathcal{Y})^{N+1} \bigr]$ and any thresholds $t \in T$,
\[
\mathbb{P}\bigl[ (X_{N+1}^j, t_j) \cap \{ X_i^j ,\ i = 1,\dots,N \} = \emptyset \bigr] \le \frac{2}{N+1} .
\]
Thus, for any $(t,a) \in \Theta$,
\[
\mathbb{P}\Bigl\{ \rho_{(t,a)}\bigl[ f_{\cdot}(X_{N+1}) \ne f_{(t,a)}(X_{N+1}) \bigr] \Bigr\} \le \frac{2h}{N+1},
\]
showing that the classification produced by $\rho_{(t,a)}$ on new examples is typically non-random; this result is only indicative, since it is concerned with a non-random choice of $(t,a)$.

Let us compute the various quantities needed to apply the results of the first section, focussing our attention on Theorem 2.1.3 (page 54). First note that $\rho_{(t,a)}(r) = r[(t,a)]$. The entropy term is such that
\[
\mathcal{K}(\rho_{(t,a)}, \pi) = -\log\bigl[ \pi\bigl( \Delta_t \times \{a\} \bigr) \bigr] = -\log \mathcal{L}(\Delta_t) + 2^h \log|\mathcal{Y}| .
\]
Let us notice accordingly that
\[
\min_{(t,a)\in\Theta} \mathcal{K}(\rho_{(t,a)}, \pi) \le h \log(N+1) + 2^h \log|\mathcal{Y}| .
\]
Let us introduce the counters
\[
b^t_y(c) = \frac1N \sum_{i=1}^N \mathbb{1}\Bigl\{ Y_i = y \text{ and } \bigl[ \mathbb{1}(X_i^j \ge t_j) \bigr]_{j=1}^h = c \Bigr\},
\qquad t \in T,\ c \in \{0,1\}^h,\ y \in \mathcal{Y},
\]
\[
b^t(c) = \sum_{y\in\mathcal{Y}} b^t_y(c) = \frac1N \sum_{i=1}^N \mathbb{1}\Bigl\{ \bigl[ \mathbb{1}(X_i^j \ge t_j) \bigr]_{j=1}^h = c \Bigr\},
\qquad t \in T,\ c \in \{0,1\}^h .
\]
Since $r[(t,a)] = \sum_{c\in\{0,1\}^h} \bigl[ b^t(c) - b^t_{a(c)}(c) \bigr]$, the partition function of the Gibbs estimator can be computed as
\[
\pi\bigl[ \exp(-\lambda r) \bigr]
= \sum_{t\in\overline{T}} \mathcal{L}(\Delta_t) \sum_{a\in\mathcal{R}} \frac{1}{|\mathcal{Y}|^{2^h}} \exp\Bigl( -\frac{\lambda}{N} \sum_{i=1}^N \mathbb{1}\bigl[ Y_i \ne f_{(t,a)}(X_i) \bigr] \Bigr)
= \sum_{t\in\overline{T}} \mathcal{L}(\Delta_t) \sum_{a\in\mathcal{R}} \frac{1}{|\mathcal{Y}|^{2^h}} \exp\Bigl( -\lambda \sum_{c\in\{0,1\}^h} \bigl[ b^t(c) - b^t_{a(c)}(c) \bigr] \Bigr)
= \sum_{t\in\overline{T}} \mathcal{L}(\Delta_t) \prod_{c\in\{0,1\}^h} \biggl[ \frac{1}{|\mathcal{Y}|} \sum_{y\in\mathcal{Y}} \exp\Bigl( -\lambda \bigl[ b^t(c) - b^t_y(c) \bigr] \Bigr) \biggr].
\]
We see that the number of operations needed to compute $\pi[\exp(-\lambda r)]$ is proportional to $|\overline{T}| \times 2^h \times |\mathcal{Y}| \le (N+1)^h\, 2^h\, |\mathcal{Y}|$. An exact computation will therefore be feasible only for small values of $N$ and $h$. For higher values, a Monte Carlo approximation of this sum will have to be performed instead.

If we want to compute the bound provided by Theorem 2.1.3 (page 54) or by Theorem 2.2.2 (page 69), we also need to compute, for any fixed parameter $\theta \in \Theta$, quantities of the type
\[
\pi_{\exp(-\lambda r)}\Bigl\{ \exp\bigl[ \xi\, m'(\cdot,\theta) \bigr] \Bigr\} = \pi_{\exp(-\lambda r)}\Bigl\{ \exp\bigl[ \xi\, \rho_\theta(m') \bigr] \Bigr\}, \qquad \lambda, \xi \in \mathbb{R}_+ .
\]
We need to introduce
\[
b^t_y(\theta, c) = \frac1N \sum_{i=1}^N \bigl[ \mathbb{1}\bigl( f_\theta(X_i) \ne Y_i \bigr) - \mathbb{1}( y \ne Y_i ) \bigr]\, \mathbb{1}\Bigl\{ \bigl[ \mathbb{1}(X_i^j \ge t_j) \bigr]_{j=1}^h = c \Bigr\} .
\]
Similarly to what has been done previously, we obtain
\[
\pi\bigl[ \exp\bigl( -\lambda r + \xi\, m'(\cdot,\theta) \bigr) \bigr]
= \sum_{t\in\overline{T}} \mathcal{L}(\Delta_t) \prod_{c\in\{0,1\}^h} \biggl[ \frac{1}{|\mathcal{Y}|} \sum_{y\in\mathcal{Y}} \exp\Bigl( -\lambda \bigl[ b^t(c) - b^t_y(c) \bigr] + \xi\, b^t_y(\theta,c) \Bigr) \biggr].
\]
We can then compute
\[
\pi_{\exp(-\lambda r)}(r) = -\frac{\partial}{\partial\lambda} \log \pi\bigl[ \exp(-\lambda r) \bigr],
\qquad
\pi_{\exp(-\lambda r)}\Bigl\{ \exp\bigl[ \xi\, \rho_\theta(m') \bigr] \Bigr\} = \frac{\pi\bigl[ \exp\bigl( -\lambda r + \xi\, m'(\cdot,\theta) \bigr) \bigr]}{\pi\bigl[ \exp(-\lambda r) \bigr]},
\qquad
\pi_{\exp(-\lambda r)}\bigl[ m'(\cdot,\theta) \bigr] = \frac{\partial}{\partial\xi}\Big|_{\xi=0} \log \pi\bigl[ \exp\bigl( -\lambda r + \xi\, m'(\cdot,\theta) \bigr) \bigr].
\]
This is all we need to compute $B(\rho_\theta, \beta, \gamma)$ (and also $B(\pi_{\exp(-\lambda r)}, \beta, \gamma)$) in Theorem 2.1.3 (page 54), using the approximation
\[
\log\Bigl\{ \pi_{\exp(-\lambda_1 r)}\Bigl[ \exp\bigl( \xi\, \pi_{\exp(-\lambda_2 r)}(m') \bigr) \Bigr] \Bigr\}
\le \log\Bigl\{ \pi_{\exp(-\lambda_1 r)}\Bigl[ \exp\bigl( \xi\, m'(\cdot,\theta) \bigr) \Bigr] \Bigr\} + \xi\, \pi_{\exp(-\lambda_2 r)}\bigl[ m'(\cdot,\theta) \bigr], \qquad \xi \ge 0 .
\]
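The factorized sum above can be coded directly. The following is a minimal sketch of an exact computation of $\log \pi[\exp(-\lambda r)]$, feasible for small $N$ and $h$ as noted; all names are illustrative, `X` is an $N \times h$ array of patterns, `Y` the array of labels and `labels` the label alphabet.

```python
import numpy as np
from itertools import product

# Exact factorized computation of log pi[exp(-lambda r)] of Section 5.2
# (illustrative sketch; cost is of order (N+1)^h * 2^h * |Y|).
def log_partition(X, Y, labels, lam):
    N, h = X.shape
    grids = []   # per coordinate: (middles, widths) of the N+1 sample intervals
    for j in range(h):
        v = np.concatenate(([0.0], np.sort(X[:, j]), [1.0]))
        grids.append(((v[:-1] + v[1:]) / 2.0, v[1:] - v[:-1]))
    total = 0.0
    for idx in product(range(N + 1), repeat=h):      # one cell middle t per tuple
        t = [grids[j][0][idx[j]] for j in range(h)]
        vol = np.prod([grids[j][1][idx[j]] for j in range(h)])   # L(Delta_t)
        b, b_y = {}, {}                              # counters b^t(c), b^t_y(c)
        for x_i, y_i in zip(X, Y):
            c = tuple(int(x_i[j] >= t[j]) for j in range(h))
            b[c] = b.get(c, 0.0) + 1.0 / N
            b_y[(c, y_i)] = b_y.get((c, y_i), 0.0) + 1.0 / N
        prod_c = 1.0                                 # cells with b^t(c) = 0 give factor 1
        for c in b:
            prod_c *= np.mean([np.exp(-lam * (b[c] - b_y.get((c, y), 0.0)))
                               for y in labels])
        total += vol * prod_c
    return np.log(total)
```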
Let us also explain how to apply the posterior distribution $\rho_{(t,a)}$, in other words our randomized estimated classification rule, to a new pattern $X_{N+1}$:
\[
\rho_{(t,a)}\bigl[ f_{\cdot}(X_{N+1}) = y \bigr]
= \mathcal{L}(\Delta_t)^{-1} \int_{\Delta_t} \mathbb{1}\Bigl[ a\Bigl( \bigl[ \mathbb{1}(X_{N+1}^j \ge t'_j) \bigr]_{j=1}^h \Bigr) = y \Bigr]\, \mathcal{L}(dt')
= \mathcal{L}(\Delta_t)^{-1} \sum_{c\in\{0,1\}^h} \mathcal{L}\Bigl( \bigl\{ t' \in \Delta_t : \bigl[ \mathbb{1}(X_{N+1}^j \ge t'_j) \bigr]_{j=1}^h = c \bigr\} \Bigr)\, \mathbb{1}\bigl[ a(c) = y \bigr].
\]
Let us define for short
\[
\Delta_t(c) = \bigl\{ t' \in \Delta_t : \bigl[ \mathbb{1}(X_{N+1}^j \ge t'_j) \bigr]_{j=1}^h = c \bigr\}, \qquad c \in \{0,1\}^h .
\]
With this notation
\[
\rho_{(t,a)}\bigl[ f_{\cdot}(X_{N+1}) = y \bigr] = \mathcal{L}(\Delta_t)^{-1} \sum_{c\in\{0,1\}^h} \mathcal{L}\bigl[ \Delta_t(c) \bigr]\, \mathbb{1}\bigl[ a(c) = y \bigr].
\]
We can compute in the same way the probabilities for the label of the new pattern under the Gibbs posterior distribution:
\[
\pi_{\exp(-\lambda r)}\bigl[ f_{\cdot}(X_{N+1}) = y' \bigr]
= \Biggl\{ \sum_{t\in\overline{T}} \prod_{c\in\{0,1\}^h} \biggl[ \frac{1}{|\mathcal{Y}|} \sum_{y\in\mathcal{Y}} \exp\Bigl( -\lambda\bigl[ b^t(c) - b^t_y(c) \bigr] \Bigr) \biggr]
\times \sum_{c\in\{0,1\}^h} \mathcal{L}\bigl[ \Delta_t(c) \bigr]\, \frac{\sum_{y\in\mathcal{Y}} \mathbb{1}(y = y') \exp\bigl( -\lambda[ b^t(c) - b^t_y(c) ] \bigr)}{\sum_{y\in\mathcal{Y}} \exp\bigl( -\lambda[ b^t(c) - b^t_y(c) ] \bigr)} \Biggr\}
\times \Biggl\{ \sum_{t\in\overline{T}} \mathcal{L}(\Delta_t) \prod_{c\in\{0,1\}^h} \biggl[ \frac{1}{|\mathcal{Y}|} \sum_{y\in\mathcal{Y}} \exp\Bigl( -\lambda\bigl[ b^t(c) - b^t_y(c) \bigr] \Bigr) \biggr] \Biggr\}^{-1}.
\]

5.3. Transductive bounds

In the case when we observe the patterns of a shadow sample $(X_i)_{i=N+1}^{(k+1)N}$ on top of the training sample $(X_i, Y_i)_{i=1}^N$, we can introduce the set of thresholds responding as $t$ on the extended sample $(X_i)_{i=1}^{(k+1)N}$,
\[
\Delta_t = \bigl\{ t' \in T : (t'_j, t_j) \cap \{ X_i^j ;\ i = 1,\dots,(k+1)N \} = \emptyset,\ j = 1,\dots,h \bigr\},
\]
consider the set $\overline{T} = \{ t \in T : t = m(\Delta_t) \}$ of the middle points of the cells $\Delta_t$, $t \in T$, and replace the Lebesgue measure $\mathcal{L} \in \mathcal{M}_+^1\bigl[ (0,1)^h \bigr]$ of the previous section with the uniform probability measure $\overline{\mathcal{L}}$ on $\overline{T}$. We can then consider $\pi = \overline{\mathcal{L}} \otimes \mathcal{U}$, where $\mathcal{U}$ is as previously the uniform probability measure on $\mathcal{R}$. This gives obviously an exchangeable posterior distribution and therefore qualifies $\pi$ for transductive bounds. Let us notice that $|\overline{T}| \le \bigl[ (k+1)N + 1 \bigr]^h$, and therefore that
\[
\pi\bigl( (t,a) \bigr) \ge \bigl[ (k+1)N + 1 \bigr]^{-h}\, |\mathcal{Y}|^{-2^h}, \qquad (t,a) \in \overline{T} \times \mathcal{R} .
\]
For any $(t,a) \in \overline{T} \times \mathcal{R}$ we may, similarly to the inductive case, consider the posterior distribution $\rho_{(t,a)}$ defined by
\[
\frac{d\rho_{(t,a)}}{d\pi}(t',a') = \frac{\mathbb{1}(t' \in \Delta_t)\, \mathbb{1}(a' = a)}{\pi\bigl( \Delta_t \times \{a\} \bigr)},
\]
but we may also consider $\delta_{(m(\Delta_t),a)}$, which is such that $r_i\bigl[ (m(\Delta_t), a) \bigr] = r_i[(t,a)]$, $i = 1, 2$, whereas only $\rho_{(t,a)}(r_1) = r_1[(t,a)]$, while
\[
\rho_{(t,a)}(r_2) = \frac{1}{|\overline{T} \cap \Delta_t|} \sum_{t' \in \overline{T}\cap\Delta_t} r_2[(t',a)] .
\]
We get
\[
\mathcal{K}(\rho_{(t,a)}, \pi) = -\log \overline{\mathcal{L}}(\Delta_t) + 2^h \log|\mathcal{Y}|
\le \log|\overline{T}| + 2^h \log|\mathcal{Y}|
= \mathcal{K}\bigl( \delta_{(m(\Delta_t),a)}, \pi \bigr)
\le h \log\bigl[ (k+1)N + 1 \bigr] + 2^h \log|\mathcal{Y}|,
\]
whereas we had no such uniform bound in the inductive case. Similarly to the inductive case,
\[
\pi\bigl[ \exp(-\lambda r_1) \bigr] = \sum_{t\in\overline{T}} \overline{\mathcal{L}}(\Delta_t) \prod_{c\in\{0,1\}^h} \biggl[ \frac{1}{|\mathcal{Y}|} \sum_{y\in\mathcal{Y}} \exp\Bigl( -\lambda\bigl[ b^t(c) - b^t_y(c) \bigr] \Bigr) \biggr].
\]
Moreover, for any $\theta \in \Theta$,
\[
\pi\bigl[ \exp\bigl( -\lambda r_1 + \xi\, \rho_\theta(m') \bigr) \bigr]
= \pi\bigl[ \exp\bigl( -\lambda r_1 + \xi\, m'(\cdot,\theta) \bigr) \bigr]
= \sum_{t\in\overline{T}} \overline{\mathcal{L}}(\Delta_t) \prod_{c\in\{0,1\}^h} \biggl[ \frac{1}{|\mathcal{Y}|} \sum_{y\in\mathcal{Y}} \exp\Bigl( -\lambda\bigl[ b^t(c) - b^t_y(c) \bigr] + \xi\, b^t_y(\theta,c) \Bigr) \biggr].
\]
The bound for the transductive counterpart to Theorems 2.1.3 (page 54) or 2.2.2 (page 69), obtained as explained page 115, can be computed as in the inductive case from these two partition functions and the above entropy computation. Let us mention finally that, using the same notation as in the inductive case,
\[
\pi_{\exp(-\lambda r_1)}\bigl[ f_{\cdot}(X_{N+1}) = y' \bigr]
= \Biggl\{ \sum_{t\in\overline{T}} \prod_{c\in\{0,1\}^h} \biggl[ \frac{1}{|\mathcal{Y}|} \sum_{y\in\mathcal{Y}} \exp\Bigl( -\lambda\bigl[ b^t(c) - b^t_y(c) \bigr] \Bigr) \biggr]
\times \sum_{c\in\{0,1\}^h} \overline{\mathcal{L}}\bigl[ \Delta_t(c) \bigr]\, \frac{\sum_{y\in\mathcal{Y}} \mathbb{1}(y = y') \exp\bigl( -\lambda[ b^t(c) - b^t_y(c) ] \bigr)}{\sum_{y\in\mathcal{Y}} \exp\bigl( -\lambda[ b^t(c) - b^t_y(c) ] \bigr)} \Biggr\}
\times \Biggl\{ \sum_{t\in\overline{T}} \overline{\mathcal{L}}(\Delta_t) \prod_{c\in\{0,1\}^h} \biggl[ \frac{1}{|\mathcal{Y}|} \sum_{y\in\mathcal{Y}} \exp\Bigl( -\lambda\bigl[ b^t(c) - b^t_y(c) \bigr] \Bigr) \biggr] \Biggr\}^{-1}.
\]
To conclude this appendix on classification by thresholding, note that similar factorized computations are feasible in the important case of classification trees. This can be achieved using some variant of the context tree weighting method discovered by Willems et al. (1995) and successfully used in lossless compression theory. The interested reader can find a description of this algorithm applied to classification trees in Catoni (2004, page 62).
Bibliography

Alon, N., Ben-David, S., Cesa-Bianchi, N. and Haussler, D. (1997). Scale sensitive dimensions, uniform convergence and learnability. J. ACM 44 615-631. MR1481318
Audibert, J.-Y. (2004a). Aggregated estimators and empirical complexity for least square regression. Ann. Inst. H. Poincaré Probab. Statist. 40 685-736. MR2096215
Audibert, J.-Y. (2004b). PAC-Bayesian statistical learning theory. Ph.D. thesis, Univ. Paris 6. Available at http://cermics.enpc.fr/~audibert/.
Barron, A. (1987). Are Bayes rules consistent in information? In Open Problems in Communication and Computation (T. M. Cover and B. Gopinath, eds.) 85-91. Springer, New York. MR0922073
Barron, A. and Yang, Y. (1999). Information-theoretic determination of minimax rates of convergence. Ann. Statist. 27 1564-1599. MR1742500
Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection by penalization. Probab. Theory Related Fields 113 301-413. MR1679028
Blanchard, G. (1999). The "progressive mixture" estimator for regression trees. Ann. Inst. H. Poincaré Probab. Statist. 35 793-820. MR1725711
Blanchard, G. (2001). Mixture and aggregation of estimators for pattern recognition. Application to decision trees. Ph.D. thesis, Univ. Paris 13. Available at http://ida.first.fraunhofer.de/~blanchard/.
Blanchard, G. (2004). Un algorithme accéléré d'échantillonnage Bayésien pour le modèle CART. Rev. Intell. Artificielle 18 383-410.
Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for Lucien Le Cam (D. Pollard, ed.) 55-87. Springer, New York. MR1462939
Birgé, L. and Massart, P. (1998). Minimum contrast estimators on sieves. Bernoulli 4 329-375. MR1653272
Birgé, L. and Massart, P. (2001a). A generalized C_p criterion for Gaussian model selection. Preprint. Available at http://www.math.u-psud.fr/~massart/. MR1848946
Birgé, L. and Massart, P. (2001b). Gaussian model selection. J. Eur. Math. Soc. 3 203-268. MR1848946
Blum, A. and Langford, J. (2003). PAC-MDL bounds. In Computational Learning Theory and Kernel Machines. 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003, Proceedings. Lecture Notes in Comput. Sci. 2777 344-357. Springer, New York.
Catoni, O. (2002). Data compression and adaptive histograms. In Foundations of Computational Mathematics. Proceedings of the Smalefest 2000 (F. Cucker and J. M. Rojas, eds.) 35-60. World Scientific. MR2021977
Catoni, O. (2003). Laplace transform estimates and deviation inequalities. Ann. Inst. H. Poincaré Probab. Statist. 39 1-26. MR1959840
Catoni, O. (2004). Statistical learning theory and stochastic optimization. École d'Été de Probabilités de Saint-Flour XXXI—2001. Lecture Notes in Math. 1851 1-270. Springer, New York. MR2163920
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel Based Learning Methods. Cambridge Univ. Press.
Feder, M. and Merhav, N. (1996). Hierarchical universal coding. IEEE Trans. Inform. Theory 42 1354-1364.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Springer, New York. MR1851606
Langford, J. and McAllester, D. (2004). Computable shell decomposition bounds. J. Machine Learning Research 5 529-547. MR2247990
Langford, J. and Seeger, M. (2001a). Bounds for averaging classifiers. Technical report CMU-CS-01-102, Carnegie Mellon Univ. Available at http://www.cs.cmu.edu/~jcl.
Langford, J., Seeger, M. and Megiddo, N. (2001b). An improved predictive accuracy bound for averaging classifiers. International Conference on Machine Learning 18 290-297.
Littlestone, N. and Warmuth, M. (1986). Relating data compression and learnability. Technical report, Univ. California, Santa Cruz. Available at http://www.soe.ucsc.edu/~manfred/pubs.html.
McAllester, D. A. (1998). Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (Madison, WI, 1998) 230-234. ACM, New York. MR1811587
McAllester, D. A. (1999). PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory (Santa Cruz, CA, 1999) 164-170. ACM, New York. MR1811612
McDiarmid, C. (1998). Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics (M. Habib, C. McDiarmid and B. Reed, eds.) 195-248. Springer, New York. MR1678578
Mammen, E. and Tsybakov, A. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808-1829. MR1765618
Ryabko, B. Y. (1984). Twice-universal coding. Problems Inform. Transmission 20 24-28. MR0791732
Seeger, M. (2002). PAC-Bayesian generalization error bounds for Gaussian process classification. J. Machine Learning Research 3 233-269. MR1971338
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C. and Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inform. Theory 44 1926-1940. MR1664055
Shawe-Taylor, J. and Cristianini, N. (2002). On the generalization of soft margin algorithms. IEEE Trans. Inform. Theory 48 2721-2735. MR1930339
Tsybakov, A. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135-166. MR2051002
Tsybakov, A. and Van de Geer, S. (2005). Square root penalty: Adaptation to the margin in classification and in edge estimation. Ann. Statist. 33 1203-1224. MR2195633
Van de Geer, S. (2000). Applications of Empirical Process Theory. Cambridge Univ. Press. MR1739079
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York. MR1641250
Vert, J.-P. (2000). Double mixture and universal inference. Preprint. Available at http://cbio.ensmp.fr/~vert/publi/.
Vert, J.-P. (2001a). Adaptive context trees and text clustering. IEEE Trans. Inform. Theory 47 1884-1901. MR1842525
Vert, J.-P. (2001b). Text categorization using adaptive context trees. In Proceedings of the CICLing-2001 Conference (A. Gelbukh, ed.) 423-436. Lecture Notes in Comput. Sci. 2004. Springer, New York. MR1888793
Willems, F. M. J., Shtarkov, Y. M. and Tjalkens, T. J. (1995). The context-tree weighting method: Basic properties. IEEE Trans. Inform. Theory 41 653-664.
Willems, F. M. J., Shtarkov, Y. M. and Tjalkens, T. J. (1996). Context weighting for general finite-context sources. IEEE Trans. Inform. Theory 42 1514-1520.
Zhang, T. (2006a). From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. Ann. Statist. 34 2180-2210. MR2291497
Zhang, T. (2006b). Information-theoretic upper and lower bounds for statistical estimation. IEEE Trans. Inform. Theory 52 1307-1321. MR2241190