Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning
📝 Original Info
- Title: Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning
- ArXiv ID: 0712.0248
- Date: 2007-12-04
- Authors: Researchers mentioned in the ArXiv original paper
📝 Abstract
This monograph deals with adaptive supervised classification, using tools borrowed from statistical mechanics and information theory, stemming from the PAC-Bayesian approach pioneered by David McAllester and applied to a conception of statistical learning theory forged by Vladimir Vapnik. Using convex analysis on the set of posterior probability measures, we show how to get local measures of the complexity of the classification model involving the relative entropy of posterior distributions with respect to Gibbs posterior measures. We then discuss relative bounds, comparing the generalization error of two classification rules, showing how the margin assumption of Mammen and Tsybakov can be replaced with some empirical measure of the covariance structure of the classification model. We show how to associate to any posterior distribution an effective temperature relating it to the Gibbs prior distribution with the same level of expected error rate, and how to estimate this effective temperature from data, resulting in an estimator whose expected error rate converges according to the best possible power of the sample size adaptively under any margin and parametric complexity assumptions. We describe and study an alternative selection scheme based on relative bounds between estimators, and present a two step localization technique which can handle the selection of a parametric model from a family of those. We show how to extend systematically all the results obtained in the inductive setting to transductive learning, and use this to improve Vapnik's generalization bounds, extending them to the case when the sample is made of independent non-identically distributed pairs of patterns and labels. Finally we review briefly the construction of Support Vector Machines and show how to derive generalization bounds for them, measuring the complexity either through the number of support vectors or through the value of the transductive or inductive margin.
📄 Full Content
Accordingly, we assume that we have prepared in some way or another a sample of $N$ labelled patterns $(X_i, Y_i)_{i=1}^N$, where $X_i$ ranges in some pattern space $\mathcal{X}$ and $Y_i$ ranges in some finite label set $\mathcal{Y}$. We also assume that we have devised our experiment in such a way that the couples of random variables $(X_i, Y_i)$ are independent (but not necessarily equidistributed). Here, randomness should be understood to come from the way the statistician has planned his experiment. He may for instance have drawn the $X_i$s at random from some larger population of patterns the algorithm is meant to be applied to in a second stage. The labels $Y_i$ may have been set with the help of some external expertise (which may itself be faulty or contain some amount of randomness, so we do not assume that $Y_i$ is a function of [...]
[...] to describe particle systems with many degrees of freedom. More specifically, the sets of classification rules will be described by Gibbs measures defined on parameter sets and depending on the observed sample value. A Gibbs measure is the special kind of probability measure used in statistical mechanics to describe the state of a particle system driven by a given energy function at some given temperature. Here, Gibbs measures will emerge as minimizers of the average loss value under entropy (or mutual information) constraints. Entropy itself, more precisely the Kullback divergence function between probability measures, will emerge in conjunction with the use of exponential deviation inequalities: indeed, the log-Laplace transform may be seen as the Legendre transform of the Kullback divergence function, as will be stated in Lemma 1.1.3 (page 4).
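To fix notation, let $(X_i, Y_i)_{i=1}^N$ be the canonical process on $\Omega = (\mathcal{X} \times \mathcal{Y})^N$ (which means the coordinate process). Let the pattern space be provided with a sigma-algebra $\mathcal{B}$ turning it into a measurable space $(\mathcal{X}, \mathcal{B})$. On the finite label space $\mathcal{Y}$, we will consider the trivial algebra $\mathcal{B}'$ made of all its subsets. Let $\mathcal{M}_+^1\bigl[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\bigr]$ be our notation for the set of probability measures (i.e. of positive measures of total mass equal to 1) on the measurable space $\bigl[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\bigr]$. Once some probability distribution $\mathbb{P} \in \mathcal{M}_+^1\bigl[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\bigr]$ is chosen, it turns $(X_i, Y_i)_{i=1}^N$ into the canonical realization of a stochastic process modelling the observed sample (also called the training set). We will assume that $\mathbb{P} = \bigotimes_{i=1}^N P_i$, where for each $i = 1, \dots, N$, $P_i \in \mathcal{M}_+^1(\mathcal{X} \times \mathcal{Y}, \mathcal{B} \otimes \mathcal{B}')$, to reflect the assumption that we observe independent pairs of patterns and labels. We will also assume that we are provided with some indexed set of possible classification rules
$$\mathcal{R}_\Theta = \bigl\{ f_\theta : \mathcal{X} \to \mathcal{Y} ;\ \theta \in \Theta \bigr\},$$
where $(\Theta, \mathcal{T})$ is some measurable index set. Assuming some indexation of the classification rules is just a matter of presentation. Although it leads to heavier notation, it allows us to integrate over the space of classification rules as well as over $\Omega$, using the usual formalism of multiple integrals. For this matter, we will assume that $(\theta, x) \mapsto f_\theta(x) : (\Theta \times \mathcal{X}, \mathcal{T} \otimes \mathcal{B}) \to (\mathcal{Y}, \mathcal{B}')$ is a measurable function.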
In many cases, as already mentioned, $\Theta = \bigcup_{m \in M} \Theta_m$ will be a finite (or more generally countable) union of subspaces, dividing the classification model $\mathcal{R}_\Theta = \bigcup_{m \in M} \mathcal{R}_{\Theta_m}$ into a union of sub-models. The importance of introducing such a structure has been put forward by V. Vapnik, as a way to avoid making strong hypotheses on the distribution $\mathbb{P}$ of the sample. If neither the distribution of the sample nor the set of classification rules were constrained, it is well known that no kind of statistical inference would be possible. Considering a family of sub-models is a way to provide for adaptive classification where the choice of the model depends on the observed sample. Restricting the set of classification rules is more realistic than restricting the distribution of patterns, since the classification rules are a processing tool left to the choice of the statistician, whereas the distribution of the patterns is not fully under his control, except for some planning of the learning experiment which may enforce some weak properties like independence, but not the precise shapes of the marginal distributions $P_i$, which are as a rule unknown distributions on some high dimensional space.
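In these notes, we will concentrate on general issues concerned with a natural measure of risk, namely the expected error rate of each classification rule $f_\theta$, expressed as
$$R(\theta) = \frac{1}{N} \sum_{i=1}^N \mathbb{P}\bigl[ f_\theta(X_i) \neq Y_i \bigr].$$
As this quantity is unobserved, we will be led to work with the corresponding empirical error rate
$$r(\theta, \omega) = \frac{1}{N} \sum_{i=1}^N \mathbb{1}\bigl[ f_\theta(X_i) \neq Y_i \bigr].$$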
This does not mean that practical learning algorithms will always try to minimize this criterion. On the contrary, they often try to minimize some other criterion which is linked with the structure of the problem and has some nice additional properties (like smoothness and convexity, for example). Nevertheless, and independently of the precise form of the estimator $\widehat{\theta} : \Omega \to \Theta$ under study, the analysis of $R(\widehat{\theta})$ is a natural question, and often corresponds to what is required in practice.
Answering this question is not straightforward because, although $R(\theta)$ is the expectation of $r(\theta)$, a sum of independent Bernoulli random variables, $R(\widehat{\theta})$ is not the expectation of $r(\widehat{\theta})$, because of the dependence of $\widehat{\theta}$ on the sample, and neither is $r(\widehat{\theta})$ a sum of independent random variables. To circumvent this unfortunate situation, some uniform control over the deviations of $r$ from $R$ is needed.
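The gap between the empirical error rate of a selected rule and its expected error rate can be seen on a toy simulation. The sketch below is only an illustration under assumptions that are not in the text: labels are pure noise, so that every rule has expected error rate 0.5, and the hypothetical model is made of 50 rules whose predictions on the sample behave like independent fair coins.

```python
import random

def empirical_error(predictions, labels):
    """Empirical error rate r(theta): fraction of misclassified sample points."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)

rng = random.Random(0)          # fixed seed so that the run is reproducible
N, n_rules = 100, 50            # illustrative sizes (assumptions)

# Pure-noise labels: whatever the rule, its expected error rate R(theta) is 0.5.
labels = [rng.randint(0, 1) for _ in range(N)]
# Each hypothetical rule is represented by its predictions on the sample.
rules = [[rng.randint(0, 1) for _ in range(N)] for _ in range(n_rules)]

errors = [empirical_error(preds, labels) for preds in rules]
r_hat = min(errors)             # empirical error of the empirical risk minimizer
R_hat = 0.5                     # its expected error: 0.5, since labels are noise

print(f"r(theta_hat) = {r_hat:.2f}, R(theta_hat) = {R_hat:.2f}")
```

Each individual rule has an unbiased empirical error rate, yet the selected one looks better than it is: this is the dependence on the sample that the uniform control is meant to handle.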
We will follow the PAC-Bayesian approach to this problem, originated in the machine learning community and pioneered by McAllester (1998, 1999). It can be seen as some variant of the more classical approach of M-estimators relying on empirical process theory, as described for instance in Van de Geer (2000).
It is built on some general principles:
• One idea is to embed the set of estimators of the type $\widehat{\theta} : \Omega \to \Theta$ into the larger set of regular conditional probability measures $\rho : \bigl[\Omega, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\bigr] \to \mathcal{M}_+^1(\Theta, \mathcal{T})$. We will call these conditional probability measures posterior distributions, to follow standard terminology.
• A second idea is to measure the fluctuations of $\rho$ with respect to the sample, using some prior distribution $\pi \in \mathcal{M}_+^1(\Theta, \mathcal{T})$, and the Kullback divergence function $\mathcal{K}(\rho, \pi)$. The expectation $\mathbb{P}\bigl[\mathcal{K}(\rho, \pi)\bigr]$ measures the randomness of $\rho$. The optimal choice of $\pi$ would be $\mathbb{P}(\rho)$, resulting in a measure of the randomness of $\rho$ equal to the mutual information between the sample and the estimated parameter drawn from $\rho$. Anyhow, since $\mathbb{P}(\rho)$ is usually not better known than $\mathbb{P}$, we will have to be content with some less concentrated prior distribution $\pi$, resulting in some looser measure of randomness, as shown by the identity $\mathbb{P}\bigl[\mathcal{K}(\rho, \pi)\bigr] = \mathbb{P}\bigl\{\mathcal{K}\bigl[\rho, \mathbb{P}(\rho)\bigr]\bigr\} + \mathcal{K}\bigl[\mathbb{P}(\rho), \pi\bigr]$.
• A third idea is to analyse the fluctuations of the random process $\theta \mapsto r(\theta)$ from its mean process $\theta \mapsto R(\theta)$ through the log-Laplace transform
$$-\frac{1}{\lambda} \log \biggl\{ \iint \exp\bigl[-\lambda r(\theta, \omega)\bigr] \, \pi(d\theta) \, \mathbb{P}(d\omega) \biggr\},$$
as would be done in statistical mechanics, where this is called the free energy. This transform is well suited to relate $\min_{\theta \in \Theta} r(\theta)$ to $\inf_{\theta \in \Theta} R(\theta)$, since for large enough values of the parameter $\lambda$, corresponding to low enough values of the temperature, the system has small fluctuations around its ground state.
• A fourth idea deals with localization. It consists of considering a prior distribution $\pi$ depending on the unknown expected error rate function $R$. Thus some central result of the theory will consist in an empirical upper bound for $\mathcal{K}\bigl[\rho, \pi_{\exp(-\beta R)}\bigr]$, where $\pi_{\exp(-\beta R)}$, defined by its density
$$\frac{d\pi_{\exp(-\beta R)}}{d\pi} = \frac{\exp(-\beta R)}{\pi\bigl[\exp(-\beta R)\bigr]},$$
is a Gibbs distribution built from a known prior distribution $\pi \in \mathcal{M}_+^1(\Theta, \mathcal{T})$, some inverse temperature parameter $\beta \in \mathbb{R}_+$ and the expected error rate $R$. This bound will in particular be used when $\rho$ is a posterior Gibbs distribution, of the form $\pi_{\exp(-\beta r)}$.

The general idea will be to show that in the case when $\rho$ is not too random, in the sense that it is possible to find a prior (that is non-random) distribution $\pi$ such that $\mathcal{K}(\rho, \pi)$ is small, then $\rho(r)$ can be reliably taken for a good approximation of $\rho(R)$.
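The role of the inverse temperature can be made concrete on a finite parameter set. The following is a minimal sketch, assuming a hypothetical four-point parameter set with a uniform prior and made-up empirical error rates: it computes the Gibbs weights proportional to $\pi(\theta)\exp[-\beta r(\theta)]$ and the free energy $-\beta^{-1}\log \pi[\exp(-\beta r)]$, which approaches $\min r$ as $\beta$ grows.

```python
import math

def gibbs(prior, energy, beta):
    """Weights of the Gibbs distribution pi_{exp(-beta * energy)} on a finite set."""
    e_min = min(energy)                      # shift exponents for numerical stability
    w = [p * math.exp(-beta * (e - e_min)) for p, e in zip(prior, energy)]
    z = sum(w)
    return [x / z for x in w]

def free_energy(prior, energy, beta):
    """-(1/beta) * log pi[exp(-beta * energy)]."""
    e_min = min(energy)
    z = sum(p * math.exp(-beta * (e - e_min)) for p, e in zip(prior, energy))
    return e_min - math.log(z) / beta

prior = [0.25, 0.25, 0.25, 0.25]             # uniform prior (assumption)
r = [0.30, 0.22, 0.25, 0.40]                 # made-up empirical error rates

for beta in (1.0, 10.0, 100.0, 1000.0):
    rho = gibbs(prior, r, beta)
    print(beta, round(free_energy(prior, r, beta), 4), [round(w, 3) for w in rho])

cold = gibbs(prior, r, 1000.0)               # low temperature: mass near argmin r
```

At low temperature (large $\beta$) the Gibbs weights concentrate on the rule with the smallest error rate, while at high temperature they stay close to the prior.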
This monograph is divided into four chapters. The first deals with the inductive setting presented in these lines. The second is devoted to relative bounds. It shows that it is possible to obtain a tighter estimate of the mutual information between the sample and the estimated parameter by comparing prior and posterior Gibbs distributions. It shows how to use this idea to obtain adaptive model selection schemes under very weak hypotheses.
The third chapter introduces the transductive setting of V. Vapnik (Vapnik, 1998), which consists in comparing the performance of classification rules on the learning sample with their performance on a test sample instead of their average performance. The fourth one is a brief introduction to Support Vector Machines. It is the occasion to show the implications of the general results discussed in the first three chapters when some particular choice is made about the structure of the classification rules.
In the first chapter, two types of bounds are shown. Empirical bounds are useful to build, compare and select estimators. Non-random bounds are useful to assess the speed of convergence of estimators, relating this speed to the behaviour of the Gibbs prior expected error rate $\beta \mapsto \pi_{\exp(-\beta R)}(R)$ and to covariance factors related to the margin assumption of Mammen and Tsybakov when a finer analysis is performed. We will proceed from the most straightforward bounds towards more elaborate ones, built to achieve a better asymptotic behaviour. In this course towards more sophisticated inequalities, we will introduce local bounds and relative bounds.
The study of relative bounds is expanded in the second chapter, where tighter comparisons between prior and posterior Gibbs distributions are proved. Theorems 2.1.3 (page 54) and 2.2.4 (page 73) present two ways of selecting some nearly optimal classification rule. They are both proved to be adaptive in all the parameters under Mammen and Tsybakov margin assumptions and parametric complexity assumptions. This is done in Corollary 2.1.17 (page 67) of Theorem 2.1.15 (page 66) and in Theorem 2.2.11 (page 89). In the first approach, the performance of a randomized estimator modelled by a posterior distribution is compared with the performance of a prior Gibbs distribution. In the second approach, posterior distributions are directly compared between themselves, which leads to slightly stronger results, at the price of using a more complex algorithm. When there is more than one parametric model, it is also appropriate to use some doubly localized scheme: two step localization is presented for both approaches, in Theorems 2.3.2 (page 94) and 2.3.9 (page 108), and provides bounds with a decreased influence of the number of empirically inefficient models included in the selection scheme.
We would not like to induce the reader into thinking that the most sophisticated results presented in these first two chapters are necessarily the most useful ones: they are as a rule only more efficient asymptotically, whereas, being more involved, they use looser constants leading to less precision for small sample sizes. In practice, whether a sample is to be considered small is a question of the ratio between the number of examples and the complexity (roughly speaking the number of parameters) of the model used for classification. Since our aim here is to describe methods appropriate for complex data (images, speech, DNA, ...), we suspect that practitioners wanting to make use of our proposals will often be confronted with small sample sizes; thus we would advise them to try the simplest bounds first and only afterwards see whether the asymptotically better ones can bring some improvement.
We would also like to point out that the results of the first two chapters are not of a purely theoretical nature: posterior parameter distributions can indeed be computed effectively, using Monte Carlo techniques, and there is well-established know-how about these computations in Bayesian statistics. Moreover, non-randomized estimators of the classical form $\widehat{\theta} : \Omega \to \Theta$ can be efficiently approximated by posterior distributions $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$ supported by a fairly narrow neighbourhood of $\widehat{\theta}$, more precisely a neighbourhood of the size of the typical fluctuations of $\widehat{\theta}$, so that this randomized approximation of $\widehat{\theta}$ will most of the time provide the same classification as $\widehat{\theta}$ itself, except for a small amount of dubious examples for which the classification provided by $\widehat{\theta}$ would anyway be unreliable. This is explained on page 7.
As already mentioned, the third chapter is about the transductive setting, that is about comparing the performance of estimators on a training set and on a test set. We show first that this comparison can be based on a set of exponential deviation inequalities which parallels the one used in the inductive case. This gives the opportunity to transport all the results obtained in the inductive case in a systematic way. In the transductive setting, the use of prior distributions can be extended to the use of partially exchangeable posterior distributions depending on the union of training and test patterns, bringing increased possibilities to adapt to the data and giving rise to such crucial notions of complexity as the Vapnik-Cervonenkis dimension.
Having done so, we more specifically focus on the small sample case, where local and relative bounds are not expected to be of great help. Introducing a fictitious (that is unobserved) shadow sample, we study Vapnik-type generalization bounds, showing how to tighten and extend them with some original ideas, like making no Gaussian approximation to the log-Laplace transform of Bernoulli random variables, using a shadow sample of arbitrary size, shrinking from the use of any symmetrization trick, and using a suitable subset of the group of permutations to cover the case of independent non-identically distributed data. The culminating result of the third chapter is Theorem 3.3.3 (page 125), subsequent bounds showing the separate influence of the above ideas and providing an easier comparison with Vapnik's original results. Vapnik-type generalization bounds have a broad applicability, not only through the concept of Vapnik-Cervonenkis dimension, but also through the use of compression schemes (Little et al., 1986), which are briefly described on page 117.
The beginning of the fourth chapter introduces Support Vector Machines, both in the separable and in the non-separable case (using the box constraint). We then describe different types of bounds. We start with compression scheme bounds, to proceed with margin bounds. We begin with transductive margin bounds, recalling on this occasion in Theorem 4.2.2 (page 144) the growth bound for a family of classification rules with given Vapnik-Cervonenkis dimension. In Theorem 4.2.4 (page 146) we give the usual estimate of the Vapnik-Cervonenkis dimension of a family of separating hyperplanes with a given transductive margin (we mean by this that the margin is computed on the union of the training and test sets). We present an original probabilistic proof inspired by a similar one from Cristianini et al. (2000), whereas other proofs available usually rely on the informal claim that the simplex is the worst case. We end this short review of Support Vector Machines with a discussion of inductive margin bounds. Here the margin is computed on the training set only, and a more involved combinatorial lemma, due to Alon et al. (1997) and recalled in Lemma 4.2.6 (page 149) is used. We use this lemma and the results of the third chapter to establish a bound depending on the margin of the training set alone.
In the appendix, we finally discuss the textbook example of classification by thresholding: in this setting, each classification rule is built by thresholding a series of measurements and taking a decision based on these thresholded values. This relatively simple example (which can be considered as an introduction to the more technical case of classification trees) can be used to give more flesh to the results of the first three chapters.
It is a pleasure to end this introduction with my greatest thanks to Anthony Davison, for his careful reading of the manuscript and his numerous suggestions.
Chapter 1
The setting of inductive inference (as opposed to transductive inference to be discussed later) is the one described in the introduction.
When we have to take the expectation of a random variable $Z : \Omega \to \mathbb{R}$ as well as of a function of the parameter $h : \Theta \to \mathbb{R}$ with respect to some probability measure, we will as a rule use short functional notation instead of resorting to the integral sign: thus we will write $\mathbb{P}(Z)$ for $\int_\Omega Z(\omega) \, \mathbb{P}(d\omega)$ and $\pi(h)$ for $\int_\Theta h(\theta) \, \pi(d\theta)$.
A more traditional statistical approach would focus on estimators $\widehat{\theta} : \Omega \to \Theta$ of the parameter $\theta$ and be interested in the relationship between the empirical error rate $r(\widehat{\theta})$, defined by equation (0.1, page viii), which is the proportion of errors made on the sample, and the expected error rate $R(\widehat{\theta})$, defined by equation (0.2, page ix), which is the expected probability of error on new instances of patterns. The PAC-Bayesian approach instead chooses a broader perspective and allows the estimator $\widehat{\theta}$ to be drawn at random using some auxiliary source of randomness to smooth the dependence of $\widehat{\theta}$ on the sample. One way of representing the supplementary randomness allowed in the choice of $\widehat{\theta}$ is to consider what it is usual to call posterior distributions on the parameter space, that is probability measures $\rho : \Omega \to \mathcal{M}_+^1(\Theta, \mathcal{T})$, depending on the sample, or from a technical perspective, regular conditional (or transition) probability measures. Let us recall that we use the model described in the introduction: the training sample is modelled by the canonical process $(X_i, Y_i)_{i=1}^N$ on $\Omega = (\mathcal{X} \times \mathcal{Y})^N$, and a product probability measure $\mathbb{P} = \bigotimes_{i=1}^N P_i$ on $\Omega$ is considered to reflect the assumption that the training sample is made of independent pairs of patterns and labels. The transition probability measure $\rho$, along with $\mathbb{P} \in \mathcal{M}_+^1(\Omega)$, defines a probability distribution on $\Omega \times \Theta$ and describes the conditional distribution of the estimated parameter $\widehat{\theta}$ knowing the sample $(X_i, Y_i)_{i=1}^N$. The main subject of this broadened theory becomes to investigate the relationship between $\rho(r)$, the average error rate of $\widehat{\theta}$ on the training sample, and $\rho(R)$, the expected error rate of $\widehat{\theta}$ on new samples. The first step towards using some kind of thermodynamics to tackle this question is to consider the Laplace transform of $\rho(R) - \rho(r)$, a well known provider of non-asymptotic deviation bounds.
This transform takes the form
$$\mathbb{P}\Bigl\{ \exp\Bigl[ \lambda \bigl( \rho(R) - \rho(r) \bigr) \Bigr] \Bigr\},$$
where some inverse temperature parameter $\lambda \in \mathbb{R}_+$, as a physicist would call it, is introduced. This Laplace transform would be easy to bound if $\rho$ did not depend on $\omega \in \Omega$ (namely on the sample), because $\rho(R)$ would then be non-random, and
$$\rho(r) = \frac{1}{N} \sum_{i=1}^N \rho\bigl( \mathbb{1}\bigl[ f_\theta(X_i) \neq Y_i \bigr] \bigr)$$
would be a sum of independent random variables. It turns out, and this will be the subject of the next section, that this annoying dependence of $\rho$ on $\omega$ can be quantified, using the inequality
$$\lambda \bigl[ \rho(R) - \rho(r) \bigr] \leq \mathcal{K}(\rho, \pi) + \log \Bigl\{ \pi \Bigl[ \exp\bigl( \lambda (R - r) \bigr) \Bigr] \Bigr\},$$
which holds for any probability measure $\pi \in \mathcal{M}_+^1(\Theta)$ on the parameter space; for our purpose it will be appropriate to consider a prior distribution $\pi$ that is non-random, as opposed to $\rho$, which depends on the sample. Here, $\mathcal{K}(\rho, \pi)$ is the Kullback divergence of $\rho$ from $\pi$, whose definition will be recalled when we come to technicalities; it can be seen as an upper bound for the mutual information between the sample $(X_i, Y_i)_{i=1}^N$ and the estimated parameter $\widehat{\theta}$. This inequality will allow us to relate the penalized difference $\rho(R) - \rho(r) - \lambda^{-1} \mathcal{K}(\rho, \pi)$ with the Laplace transform of sums of independent random variables.
Let us now come to the details of the investigation sketched above. The first thing we will do is to study the Laplace transform of R(θ) -r(θ), as a starting point for the more general study of ρ(R) -ρ(r): it corresponds to the simple case where θ is not random at all, and therefore where ρ is a Dirac mass at some deterministic parameter value θ.
In the setting described in the introduction, let us consider the Bernoulli random variables $\sigma_i(\theta) = \mathbb{1}\bigl[ Y_i \neq f_\theta(X_i) \bigr]$, which indicate whether the classification rule $f_\theta$ made an error on the $i$th component of the training sample. Using independence and the concavity of the logarithm function, it is readily seen that for any real constant $\lambda$
$$\log \Bigl\{ \mathbb{P} \Bigl[ \exp\bigl( -\lambda r(\theta) \bigr) \Bigr] \Bigr\} = \sum_{i=1}^N \log \Bigl\{ \mathbb{P} \Bigl[ \exp\Bigl( -\frac{\lambda}{N} \sigma_i \Bigr) \Bigr] \Bigr\} \leq N \log \Bigl\{ \frac{1}{N} \sum_{i=1}^N \mathbb{P} \Bigl[ \exp\Bigl( -\frac{\lambda}{N} \sigma_i \Bigr) \Bigr] \Bigr\}.$$
The right-hand side of this inequality is the log-Laplace transform of a Bernoulli distribution with parameter $\frac{1}{N} \sum_{i=1}^N \mathbb{P}(\sigma_i) = R(\theta)$. As any Bernoulli distribution is fully defined by its parameter, this log-Laplace transform is necessarily a function of $R(\theta)$. It can be expressed with the help of the family of functions
$$\Phi_a(p) = -a^{-1} \log \Bigl\{ 1 - \bigl[ 1 - \exp(-a) \bigr] p \Bigr\}, \qquad a \in \mathbb{R},\ p \in (0, 1). \tag{1.1}$$
It is immediately seen that $\Phi_a$ is an increasing one-to-one mapping of the unit interval onto itself, and that it is convex when $a > 0$, concave when $a < 0$ and can be defined by continuity to be the identity when $a = 0$. Moreover the inverse of $\Phi_a$ is given by the formula
$$\Phi_a^{-1}(q) = \frac{1 - \exp(-aq)}{1 - \exp(-a)}, \qquad a \in \mathbb{R},\ q \in (0, 1).$$
This formula may be used to extend $\Phi_a^{-1}$ to $q \in \mathbb{R}$, and we will use this extension without further notice when required.
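The function $\Phi_a$ of equation (1.1) and its inverse are easy to evaluate numerically. The sketch below is one possible implementation (the function names `phi` and `phi_inv` are ours); it uses `math.expm1` and `math.log1p` so that small values of $a$ do not suffer from cancellation, and treats $a = 0$ as the identity.

```python
import math

def phi(a, p):
    """Phi_a(p) = -(1/a) * log(1 - (1 - exp(-a)) * p), with Phi_0 = identity."""
    if a == 0.0:
        return p
    # expm1(-a) = exp(-a) - 1, so 1 - (1 - exp(-a)) * p = 1 + expm1(-a) * p
    return -math.log1p(math.expm1(-a) * p) / a

def phi_inv(a, q):
    """Inverse of Phi_a: (1 - exp(-a*q)) / (1 - exp(-a)), identity when a = 0."""
    if a == 0.0:
        return q
    return math.expm1(-a * q) / math.expm1(-a)

p = 0.3
for a in (-1.0, 1e-9, 1.0):
    q = phi(a, p)
    print(a, q, phi_inv(a, q))   # the last column recovers p
```

Since $\Phi_a$ fixes 0 and 1, its convexity for $a > 0$ shows up as $\Phi_a(p) \leq p$, and its concavity for $a < 0$ as $\Phi_a(p) \geq p$.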
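Using this notation, the previous inequality becomes $\log \bigl\{ \mathbb{P} \bigl[ \exp\bigl( -\lambda r(\theta) \bigr) \bigr] \bigr\} \leq -\lambda \Phi_{\lambda/N}\bigl[ R(\theta) \bigr]$, proving
Lemma 1.1.1. For any real constant $\lambda$ and any parameter $\theta \in \Theta$,
$$\mathbb{P} \Bigl\{ \exp \Bigl[ \lambda \Bigl( \Phi_{\lambda/N}\bigl[ R(\theta) \bigr] - r(\theta) \Bigr) \Bigr] \Bigr\} \leq 1.$$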
In previous versions of this study, we had used some Bernstein bound, instead of this lemma. Anyhow, as it will turn out, keeping the log-Laplace transform of a Bernoulli instead of approximating it provides simpler and tighter results.
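The inequality behind Lemma 1.1.1 can be checked exactly when the error indicators are independent Bernoulli variables with known parameters. The sketch below uses made-up error probabilities $p_i$ (an assumption for illustration only) and compares the exact log-Laplace transform of $r(\theta)$ with $-\lambda \Phi_{\lambda/N}[R(\theta)]$.

```python
import math

def log_laplace_exact(lam, probs):
    """log E[exp(-lam * r)] where r is the mean of independent Bernoulli(p_i)."""
    n = len(probs)
    # E[exp(-(lam/n) * sigma_i)] = 1 + p_i * (exp(-lam/n) - 1)
    return sum(math.log1p(p * math.expm1(-lam / n)) for p in probs)

def lemma_rhs(lam, probs):
    """-lam * Phi_{lam/N}(R), with R the average of the p_i."""
    n = len(probs)
    big_r = sum(probs) / n
    return n * math.log1p(big_r * math.expm1(-lam / n))

probs = [0.05, 0.10, 0.20, 0.40, 0.25]      # made-up error probabilities
for lam in (-3.0, 0.5, 5.0):
    print(lam, log_laplace_exact(lam, probs), lemma_rhs(lam, probs))
```

The left column never exceeds the right one, and the two coincide when all the $p_i$ are equal, which is the identically distributed case.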
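Lemma 1.1.1 implies that for any constants $\lambda \in \mathbb{R}_+$ and $\epsilon \in (0, 1)$,
$$\mathbb{P} \biggl[ \Phi_{\lambda/N}\bigl[ R(\theta) \bigr] \leq r(\theta) + \frac{\log(\epsilon^{-1})}{\lambda} \biggr] \geq 1 - \epsilon.$$
Choosing $\lambda \in \arg\max_{\lambda \in \mathbb{R}_+} \Bigl\{ \Phi_{\lambda/N}\bigl[ R(\theta) \bigr] - \frac{\log(\epsilon^{-1})}{\lambda} \Bigr\}$, which is a non-random value, we deduce
Lemma 1.1.2. For any $\epsilon \in (0, 1)$, any $\theta \in \Theta$,
$$\mathbb{P} \biggl[ R(\theta) \leq \inf_{\lambda \in \mathbb{R}_+} \Phi_{\lambda/N}^{-1} \biggl( r(\theta) + \frac{\log(\epsilon^{-1})}{\lambda} \biggr) \biggr] \geq 1 - \epsilon.$$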
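The bound of Lemma 1.1.2 can be evaluated numerically. The sketch below assumes that the bound reads $R(\theta) \leq \inf_\lambda \Phi^{-1}_{\lambda/N}\bigl(r(\theta) + \log(\epsilon^{-1})/\lambda\bigr)$, which is what Lemma 1.1.1 and Markov's inequality give, and searches $\lambda$ over an integer grid (the grid is our choice). It uses the values of the numerical example discussed in the text: $N = 1000$, $\epsilon = 0.01$ and $r(\theta) = 0.2$.

```python
import math

def phi_inv(a, q):
    """Inverse of Phi_a for a != 0: (1 - exp(-a*q)) / (1 - exp(-a))."""
    return math.expm1(-a * q) / math.expm1(-a)

def deviation_bound(r_emp, n, eps, lam):
    """Upper bound on R(theta) at a given lambda > 0."""
    return phi_inv(lam / n, r_emp + math.log(1.0 / eps) / lam)

N, eps, r_emp = 1000, 0.01, 0.2
lambdas = range(1, 2001)                      # integer grid (assumption)
best_lam = min(lambdas, key=lambda lam: deviation_bound(r_emp, N, eps, lam))
best = deviation_bound(r_emp, N, eps, best_lam)
print(best_lam, round(best, 4))
```

The minimizing $\lambda$ and the value of the bound can be compared with the figures quoted in the text ($\lambda = 234$ and $R(\theta) \leq 0.2402$).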
We will illustrate throughout these notes the bounds we prove with a small numerical example: in the case where $N = 1000$, $\epsilon = 0.01$ and $r(\theta) = 0.2$, we get with a confidence level of 0.99 that $R(\theta) \leq 0.2402$, this being obtained for $\lambda = 234$.
Now, to proceed towards the analysis of posterior distributions, let us put $U_\lambda(\theta, \omega) = \lambda \bigl\{ \Phi_{\lambda/N}\bigl[ R(\theta) \bigr] - r(\theta, \omega) \bigr\}$ for short, and let us consider some prior probability distribution $\pi \in \mathcal{M}_+^1(\Theta, \mathcal{T})$. A proper choice of $\pi$ will be an important question, underlying much of the material presented in this monograph, so for the time being, let us only say that we will let this choice be as open as possible by writing inequalities which hold for any choice of $\pi$. Let us insist on the fact that when we say that $\pi$ is a prior distribution, we mean that it does not depend on the training sample $(X_i, Y_i)_{i=1}^N$. The quantity of interest to obtain the bound we are looking for is $\log \bigl\{ \mathbb{P} \bigl[ \pi \bigl( \exp(U_\lambda) \bigr) \bigr] \bigr\}$. Using Fubini's theorem for non-negative functions, we see that
$$\log \Bigl\{ \mathbb{P} \Bigl[ \pi \bigl( \exp(U_\lambda) \bigr) \Bigr] \Bigr\} = \log \Bigl\{ \pi \Bigl[ \mathbb{P} \bigl( \exp(U_\lambda) \bigr) \Bigr] \Bigr\} \leq 0.$$
To relate this quantity to the expectation $\rho(U_\lambda)$ with respect to any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, we will use the properties of the Kullback divergence $\mathcal{K}(\rho, \pi)$ of $\rho$ with respect to $\pi$, which is defined as
$$\mathcal{K}(\rho, \pi) = \begin{cases} \displaystyle\int \log\Bigl( \frac{d\rho}{d\pi} \Bigr) \, d\rho, & \text{when } \rho \text{ is absolutely continuous with respect to } \pi, \\ +\infty, & \text{otherwise.} \end{cases}$$
The following lemma shows in which sense the Kullback divergence function can be thought of as the dual of the log-Laplace transform.
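Lemma 1.1.3. For any bounded measurable function $h : \Theta \to \mathbb{R}$, and any probability distribution $\rho \in \mathcal{M}_+^1(\Theta)$ such that $\mathcal{K}(\rho, \pi) < \infty$,
$$\log \bigl\{ \pi \bigl[ \exp(h) \bigr] \bigr\} = \rho(h) - \mathcal{K}(\rho, \pi) + \mathcal{K}\bigl( \rho, \pi_{\exp(h)} \bigr),$$
where by definition
$$\frac{d\pi_{\exp(h)}}{d\pi}(\theta) = \frac{\exp\bigl[ h(\theta) \bigr]}{\pi\bigl[ \exp(h) \bigr]}.$$
Consequently
$$\log \bigl\{ \pi \bigl[ \exp(h) \bigr] \bigr\} = \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho(h) - \mathcal{K}(\rho, \pi).$$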
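The duality between the log-Laplace transform and the Kullback divergence can be checked numerically on a finite parameter set. The sketch below is an illustration only: the prior, the function $h$ and the random trial distributions are made up. It verifies that $\rho(h) - \mathcal{K}(\rho, \pi)$ never exceeds $\log \pi[\exp(h)]$ and that the Gibbs distribution $\pi_{\exp(h)}$ attains it.

```python
import math
import random

def kl(rho, pi):
    """Kullback divergence K(rho, pi) on a finite set (with 0 * log 0 = 0)."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0.0)

def expect(rho, h):
    """rho(h): expectation of h under rho."""
    return sum(r * x for r, x in zip(rho, h))

pi = [0.1, 0.2, 0.3, 0.4]            # prior on a 4-point parameter set (made up)
h = [1.0, -0.5, 0.3, 2.0]            # bounded function h (made up)

z = sum(p * math.exp(x) for p, x in zip(pi, h))
log_laplace = math.log(z)            # log pi[exp(h)]

gibbs = [p * math.exp(x) / z for p, x in zip(pi, h)]   # pi_{exp(h)}
value_at_gibbs = expect(gibbs, h) - kl(gibbs, pi)

rng = random.Random(1)

def random_distribution(k):
    """A random probability vector of length k (illustration only)."""
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

trial_values = [expect(rho, h) - kl(rho, pi)
                for rho in (random_distribution(4) for _ in range(1000))]
print(log_laplace, value_at_gibbs, max(trial_values))
```

The gap between $\log \pi[\exp(h)]$ and any trial value is exactly the divergence of the trial distribution from $\pi_{\exp(h)}$, which is why the Gibbs distribution is the maximizer.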
The proof is just a matter of writing down the definition of the quantities involved and using the fact that the Kullback divergence function is non-negative, and can be found in Catoni (2004, page 160). In the duality between measurable functions and probability measures, we thus see that the log-Laplace transform with respect to π is the Legendre transform of the Kullback divergence function with respect to π. Using this, we get
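$$\mathbb{P} \Bigl\{ \exp \Bigl[ \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho(U_\lambda) - \mathcal{K}(\rho, \pi) \Bigr] \Bigr\} \leq 1,$$
which, combined with the convexity of $\lambda \Phi_{\lambda/N}$, proves the basic inequality we were looking for.
Theorem 1.1.4. For any real constant $\lambda$,
$$\mathbb{P} \Bigl\{ \exp \Bigl[ \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda \Bigl( \Phi_{\lambda/N}\bigl[ \rho(R) \bigr] - \rho(r) \Bigr) - \mathcal{K}(\rho, \pi) \Bigr] \Bigr\} \leq 1.$$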
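We insist on the fact that in this theorem, we take a supremum in $\rho \in \mathcal{M}_+^1(\Theta)$ inside the expectation with respect to $\mathbb{P}$, the sample distribution. This means that the proved inequality holds for any $\rho$ depending on the training sample, that is for any posterior distribution: indeed, measurability questions set aside,
$$\mathbb{P} \Bigl\{ \exp \Bigl[ \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda \Bigl( \Phi_{\lambda/N}\bigl[ \rho(R) \bigr] - \rho(r) \Bigr) - \mathcal{K}(\rho, \pi) \Bigr] \Bigr\} = \sup_{\rho : \Omega \to \mathcal{M}_+^1(\Theta)} \mathbb{P} \Bigl\{ \exp \Bigl[ \lambda \Bigl( \Phi_{\lambda/N}\bigl[ \rho(R) \bigr] - \rho(r) \Bigr) - \mathcal{K}(\rho, \pi) \Bigr] \Bigr\},$$
and more formally,
$$\sup_{\rho : \Omega \to \mathcal{M}_+^1(\Theta)} \mathbb{P} \Bigl\{ \exp \Bigl[ \lambda \Bigl( \Phi_{\lambda/N}\bigl[ \rho(R) \bigr] - \rho(r) \Bigr) - \mathcal{K}(\rho, \pi) \Bigr] \Bigr\} \leq \mathbb{P} \Bigl\{ \exp \Bigl[ \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda \Bigl( \Phi_{\lambda/N}\bigl[ \rho(R) \bigr] - \rho(r) \Bigr) - \mathcal{K}(\rho, \pi) \Bigr] \Bigr\},$$
where the supremum in $\rho$ taken in the left-hand side is restricted to regular conditional probability distributions.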
The following sections will show how to use this theorem.
At least three sorts of bounds can be deduced from Theorem 1.1.4. The most interesting ones with which to build estimators and tune parameters, as well as the first that have been considered in the development of the PAC-Bayesian approach, are deviation bounds. They provide an empirical upper bound for $\rho(R)$ (that is, a bound which can be computed from observed data) with some probability $1 - \epsilon$, where $\epsilon$ is a presumably small and tunable parameter setting the desired confidence level.
Anyhow, most of the results about the convergence speed of estimators to be found in the statistical literature are concerned with the expectation $\mathbb{P}\bigl[ \rho(R) \bigr]$, therefore it is also enlightening to bound this quantity. In order to know at which rate it may be approaching $\inf_\Theta R$, a non-random upper bound is required, which will relate the average of the expected risk $\mathbb{P}\bigl[ \rho(R) \bigr]$ with the properties of the contrast function $\theta \mapsto R(\theta)$.
Since the values of constants do matter a lot when a bound is to be used to select between various estimators using classification models of various complexities, a third kind of bound, related to the first, may be considered for the sake of its hopefully better constants: we will call them unbiased empirical bounds, to stress the fact that they provide some empirical quantity whose expectation under $\mathbb{P}$ can be proved to be an upper bound for $\mathbb{P}\bigl[ \rho(R) \bigr]$, the average expected risk. The price to pay for these better constants is of course the lack of formal guarantee given by the bound: two random variables whose expectations are ordered in a certain way may very well be ordered in the reverse way with a large probability, so that basing the estimation of parameters or the selection of an estimator on some unbiased empirical bound is a hazardous business. Anyhow, since it is common practice to use the inequalities provided by mathematical statistical theory while replacing the proven constants with smaller values showing a better practical efficiency, considering unbiased empirical bounds as well as deviation bounds provides an indication about how much the constants may be decreased while not violating the theory too much.
Let $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$ be some fixed (and arbitrary) posterior distribution, describing some randomized estimator $\widehat{\theta} : \Omega \to \Theta$. As we already mentioned, in these notes a posterior distribution will always be a regular conditional probability measure. By this we mean that
• for any $A \in \mathcal{T}$, the map $\omega \mapsto \rho(\omega, A) : \bigl[\Omega, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\bigr] \to \mathbb{R}_+$ is assumed to be measurable;
• for any $\omega \in \Omega$, the map $A \mapsto \rho(\omega, A) : \mathcal{T} \to \mathbb{R}_+$ is assumed to be a probability measure.
We will also assume without further notice that the σ-algebras we deal with are always countably generated. The technical implications of these assumptions are standard and discussed for instance in Catoni (2004, pages 50-54), where, among other things, a detailed proof of the decomposition of the Kullback-Leibler divergence is given.
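Let us restrict to the case when the constant $\lambda$ is positive. We get from Theorem 1.1.4 that
$$\exp \Bigl\{ \lambda \Bigl[ \Phi_{\lambda/N}\bigl( \mathbb{P}\bigl[ \rho(R) \bigr] \bigr) - \mathbb{P}\bigl[ \rho(r) \bigr] \Bigr] - \mathbb{P}\bigl[ \mathcal{K}(\rho, \pi) \bigr] \Bigr\} \leq 1, \tag{1.2}$$
where we have used the convexity of the exp function and of $\Phi_{\lambda/N}$. Since we have restricted our attention to positive values of the constant $\lambda$, equation (1.2) can also be written
$$\mathbb{P}\bigl[ \rho(R) \bigr] \leq \Phi_{\lambda/N}^{-1} \Bigl\{ \mathbb{P}\bigl[ \rho(r) + \lambda^{-1} \mathcal{K}(\rho, \pi) \bigr] \Bigr\}.$$
For any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, for any positive parameter $\lambda$,
$$\mathbb{P}\bigl[ \rho(R) \bigr] \leq \frac{1 - \exp\Bigl\{ -\frac{\lambda}{N} \mathbb{P}\bigl[ \rho(r) \bigr] - \frac{1}{N} \mathbb{P}\bigl[ \mathcal{K}(\rho, \pi) \bigr] \Bigr\}}{1 - \exp\bigl( -\frac{\lambda}{N} \bigr)} \leq \frac{\lambda}{N \bigl[ 1 - \exp\bigl( -\frac{\lambda}{N} \bigr) \bigr]} \, \mathbb{P} \Bigl[ \rho(r) + \frac{\mathcal{K}(\rho, \pi)}{\lambda} \Bigr].$$
The last inequality provides the unbiased empirical upper bound for $\rho(R)$ we were looking for, meaning that the expectation of
$$\frac{\lambda}{N \bigl[ 1 - \exp\bigl( -\frac{\lambda}{N} \bigr) \bigr]} \Bigl[ \rho(r) + \frac{\mathcal{K}(\rho, \pi)}{\lambda} \Bigr]$$
is larger than the expectation of $\rho(R)$. Let us notice that
$$1 \leq \frac{\lambda}{N \bigl[ 1 - \exp\bigl( -\frac{\lambda}{N} \bigr) \bigr]} \leq \Bigl[ 1 - \frac{\lambda}{2N} \Bigr]^{-1}$$
and therefore that this coefficient is close to 1 when $\lambda$ is significantly smaller than $N$.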
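The size of the multiplicative coefficient is easy to inspect numerically. The sketch below assumes that the coefficient is $\lambda / \bigl(N[1 - \exp(-\lambda/N)]\bigr)$ and that the empirical quantity is this coefficient times $\rho(r) + \mathcal{K}(\rho, \pi)/\lambda$; the function names and the input values ($\rho(r) = 0.2$, $\mathcal{K}(\rho, \pi) = 5$, $\lambda = 200$, $N = 1000$) are made up for illustration.

```python
import math

def coefficient(lam, n):
    """lam / (N * (1 - exp(-lam / N))), for lam > 0."""
    return lam / (-n * math.expm1(-lam / n))

def unbiased_bound(rho_r, kl, lam, n):
    """coefficient(lam, n) * (rho(r) + K(rho, pi) / lam)."""
    return coefficient(lam, n) * (rho_r + kl / lam)

N = 1000
for lam in (10.0, 100.0, 500.0, 1000.0):
    c = coefficient(lam, N)
    upper = 1.0 / (1.0 - lam / (2.0 * N))     # the bound [1 - lam/(2N)]^{-1}
    print(lam, round(c, 4), round(upper, 4))

example = unbiased_bound(rho_r=0.2, kl=5.0, lam=200.0, n=N)
print(round(example, 4))
```

For $\lambda$ of the order of a tenth of $N$ the coefficient inflates the empirical quantity by only a few percent, while for $\lambda$ close to $N$ it exceeds 1.5.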
If we are ready to believe in this bound (although this belief is not mathematically well founded, as we already mentioned), we can use it to optimize $\lambda$ and to choose $\rho$. While the optimal choice of $\rho$ when $\lambda$ is fixed is, according to Lemma 1.1.3 (page 4), to take it equal to $\pi_{\exp(-\lambda r)}$, a Gibbs posterior distribution, as it is sometimes called, we may for computational reasons be more interested in choosing $\rho$ in some other class of posterior distributions.
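For instance, our real interest may be to select some non-randomized estimator from a family $\widehat{\theta}_m : \Omega \to \Theta_m$, $m \in M$, of possible ones, where $\Theta_m$ are measurable subsets of $\Theta$ and where $M$ is an arbitrary (not necessarily countable) index set. We may for instance think of the case when $\widehat{\theta}_m \in \arg\min_{\Theta_m} r$. We may slightly randomize the estimators to start with, considering for any $\theta \in \Theta_m$ and any $m \in M$,
$$\Delta_m(\theta) = \bigl\{ \theta' \in \Theta_m : f_{\theta'}(X_i) = f_\theta(X_i),\ i = 1, \dots, N \bigr\},$$
and defining $\rho_m$ by the formula
$$\frac{d\rho_m}{d\pi}(\theta) = \frac{\mathbb{1}\bigl[ \theta \in \Delta_m(\widehat{\theta}_m) \bigr]}{\pi\bigl[ \Delta_m(\widehat{\theta}_m) \bigr]}.$$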
Our posterior minimizes K(ρ, π) among those distributions whose support is restricted to the values of θ in Θ m for which the classification rule f θ is identical to the estimated one f θm on the observed sample. Presumably, in many practical situations, f θ (x) will be ρ m almost surely identical to f θm (x) when θ is drawn from ρ m , for the vast majority of the values of x ∈ X and all the sub-models Θ m not plagued with too much overfitting (since this is by construction the case when x ∈ {X i : i = 1, . . . , N }). Therefore replacing θ m with ρ m can be expected to be a minor change in many situations. This change by the way can be estimated in the (admittedly not so common) case when the distribution of the patterns (X i ) N i=1 is known. Indeed, introducing the pseudo distance
Let us notice also that in the case where Θ_m ⊂ R^{d_m}, and R happens to be convex on ∆_m(θ_m), then ρ_m(R) ≥ R(∫ θ ρ_m(dθ)), and we can replace θ_m with θ̃_m = ∫ θ ρ_m(dθ), and obtain bounds for R(θ̃_m). This is not a very heavy assumption about R, in the case where we consider θ_m ∈ arg min_{Θ_m} r. Indeed, θ_m, and therefore ∆_m(θ_m), will presumably be close to arg min_{Θ_m} R, and requiring a function to be convex in the neighbourhood of its minima is not a very strong assumption. Since r(θ_m) = ρ_m(r), and K(ρ_m, π) = -log π[∆_m(θ_m)], our unbiased empirical upper bound in this context reads as
Let us notice that we obtain a complexity factor -log π ∆ m ( θ m ) which may be compared with the Vapnik-Cervonenkis dimension. Indeed, in the case of binary classification, when using a classification model with Vapnik-Cervonenkis dimension not greater than h m , that is when any subset of X which can be split in any arbitrary way by some classification rule
is a partition of Θ_m with at most (eN/h_m)^{h_m} components: these facts, if not already familiar to the reader, will be proved in Theorems 4.2.2 and 4.2.3 (page 144). Therefore
Thus, if the model and prior distribution are well suited to the classification task, in the sense that there is more “room” (where room is measured with π) between the two clusters defined by θ_m than between other partitions of the sample of patterns (X_i)_{i=1}^N, then we will have
An optimal value m may be selected so that
Since ρ m is still another posterior distribution, we can be sure that
Taking the infimum in λ inside the expectation with respect to P would be possible at the price of some supplementary technicalities and a slight increase of the bound that we prefer to postpone to the discussion of deviation bounds, since they are the only ones to provide a rigorous mathematical foundation to the adaptive selection of estimators.
In this section we address some technical issues we think helpful to the understanding of Theorem 1.2.1 (page 6): namely to investigate how the upper bound it provides could be optimized, or at least approximately optimized, in λ. It turns out that this can be done quite explicitly. So we will consider in this discussion the posterior distribution ρ : Ω → M 1 + (Θ) to be fixed, and our aim will be to eliminate the constant λ from the bound by choosing its value in some nearly optimal way as a function of P ρ(r) , the average of the empirical risk, and of P K(ρ, π) , which controls overfitting.
Let the bound be written as
We see that
Thus, the optimal value for λ is such that
Assuming that 1 ≫ (λ/N) P[ρ(r)] ≫ P[K(ρ, π)]/N, and keeping only higher order terms, we are led to choose
This result is of course not very useful in itself, since neither of the two quantities P[ρ(r)] and P[K(ρ, π)] is easy to evaluate. Nevertheless, it gives a hint that boldly replacing them with ρ(r) and K(ρ, π) could produce something close to a legitimate empirical upper bound for ρ(R). We will see in the subsection about deviation bounds that this is indeed essentially true.
Let us remark that in the third chapter of this monograph, we will see another way of bounding
as soon as P ρ(r)
and P ρ(R) ≤ P ρ(r)
This theorem highlights the influence of three terms on the average expected risk:
• the average empirical risk, P ρ(r) , which as a rule will decrease as the size of the classification model increases, acts as a bias term, grasping the ability of the model to account for the observed sample itself;
• a variance term (1/N) P[ρ(r)] (1 - P[ρ(r)]) is due to the random fluctuations of ρ(r);
• a complexity term P[K(ρ, π)], which as a rule will increase with the size of the classification model, eventually acts as a multiplier of the variance term.
We observed numerically that the bound provided by Theorem 1.2.2 is better than the more classical Vapnik-like bound of Theorem 1.2.3. For instance, when N = 1000, P ρ(r) = 0.2 and P K(ρ, π) = 10, Theorem 1.2.2 gives a bound lower than 0.2604, whereas the more classical Vapnik-like approximation of Theorem 1.2.3 gives a bound larger than 0.2622. Numerical simulations tend to suggest the two bounds are always ordered in the same way, although this could be a little tedious to prove mathematically.
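The order of magnitude quoted above can be reproduced with a crude grid search in λ. The sketch below rests on an assumption: it takes the bound to be optimized to have the nonlinear form (1 - exp(-[λ P ρ(r) + P K(ρ, π)]/N)) / (1 - exp(-λ/N)), which is not reproduced in the surrounding text, so it should be read as an illustration rather than as the statement of the theorem. Function names and the grid are hypothetical choices.

```python
import math

def nonlinear_bound(lam, n, mean_emp_risk, mean_kl):
    """Assumed form of the bound: (1 - exp(-(lam*r + K)/n)) / (1 - exp(-lam/n))."""
    num = 1.0 - math.exp(-(lam * mean_emp_risk + mean_kl) / n)
    den = 1.0 - math.exp(-lam / n)
    return num / den

def best_lambda(n, mean_emp_risk, mean_kl, grid_size=20000):
    """Grid search over lam in (0, 2n]; returns (lam, bound value)."""
    candidates = (2.0 * n * (i + 1) / grid_size for i in range(grid_size))
    return min(
        ((lam, nonlinear_bound(lam, n, mean_emp_risk, mean_kl)) for lam in candidates),
        key=lambda pair: pair[1],
    )

# Values taken from the numerical example in the text.
lam_star, bound_star = best_lambda(n=1000, mean_emp_risk=0.2, mean_kl=10.0)
print(round(lam_star, 1), round(bound_star, 5))
```

Under this assumed form, the minimum over the grid falls just below 0.2604, in line with the figure reported for Theorem 1.2.2.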
Chapter 1. Inductive PAC-Bayesian learning
It is time now to come to less tentative results and see how far the average expected error rate P[ρ(R)] is from its best possible value inf_Θ R.
Let us notice first that
Let us remark moreover that r → log π exp(-λr) is a convex functional, a property which from a technical point of view can be dealt with in the following way:
(1.4) P log π exp(-λr) = P sup
These remarks applied to Theorem 1.2.1 lead to Theorem 1.2.4. For any posterior distribution ρ : Ω → M 1 + (Θ), for any positive parameter λ,
This theorem is particularly well suited to the case of the Gibbs posterior distribution ρ = π exp (-λr) , where the entropy factor cancels and where P π exp(-λr) (R) is shown to get close to inf Θ R when N goes to +∞, as soon as λ/N goes to 0 while λ goes to +∞.
We can elaborate on Theorem 1.2.4 and define a notion of dimension of (Θ, R), with margin η ≥ 0 putting (1.5)
This last inequality can be established by the chain of inequalities:
where we have used successively the fact that λ → π exp(-λR) (R) is decreasing (because it is the derivative of the concave function λ → -log π exp(-λR) ) and the fact that the exponential function takes positive values.
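The monotonicity argument used here can be checked numerically on a finite parameter set. The following sketch is an illustration under assumptions (toy prior and risk values, hypothetical function names): it verifies that λ ↦ π_{exp(-λR)}(R) is non-increasing and that it matches a finite-difference derivative of λ ↦ -log π[exp(-λR)].

```python
import math

def gibbs_mean_risk(prior, risks, lam):
    """Mean of R under the Gibbs measure with weights prior * exp(-lam * R)."""
    weights = [p * math.exp(-lam * r) for p, r in zip(prior, risks)]
    z = sum(weights)
    return sum(w * r for w, r in zip(weights, risks)) / z

def neg_log_partition(prior, risks, lam):
    """The concave function lam -> -log pi[exp(-lam * R)]."""
    return -math.log(sum(p * math.exp(-lam * r) for p, r in zip(prior, risks)))

# Toy data (assumption): non-uniform prior and made-up expected risks.
prior = [0.4, 0.3, 0.2, 0.1]
risks = [0.45, 0.30, 0.25, 0.10]

lams = [0.0, 1.0, 2.0, 5.0, 10.0, 50.0]
means = [gibbs_mean_risk(prior, risks, lam) for lam in lams]
print(means)  # a non-increasing sequence

# Central finite difference of the concave function at lam = 2.
h = 1e-4
slope = (neg_log_partition(prior, risks, 2.0 + h)
         - neg_log_partition(prior, risks, 2.0 - h)) / (2.0 * h)
print(slope, gibbs_mean_risk(prior, risks, 2.0))
```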
In typical “parametric” situations d 0 (Θ, R) will be finite, and in all circumstances d η (Θ, R) will be finite for any η > 0 (this is a direct consequence of the definition of the essential infimum). Using this notion of dimension, we see that
Corollary 1.2.5 With the above notation, for any margin η ∈ R + , for any posterior distribution ρ :
If one wants a posterior distribution with a small support, the theorem can also be applied to the case when ρ is obtained by truncating π exp(-λr) to some level set to reduce its support: let Θ p = {θ ∈ Θ : r(θ) ≤ p}, and let us define for any q ∈)0, 1) the level p q = inf{p : π exp(-λr) (Θ p ) ≥ q}, let us then define ρ q by its density dρ q dπ exp(-λr)
π exp(-λr) (Θ pq ) , then ρ 0 = π exp(-λr) and for any q ∈ (0, 1(,
They provide results holding under the distribution P of the sample with probability at least 1 - ε, for any given confidence level, set by the choice of ε ∈ )0, 1(. Using them is the only way to be quite (i.e. with probability 1 - ε) sure to do the right thing, although this right thing may be over-pessimistic, since deviation upper bounds are larger than corresponding non-biased bounds.
Starting again from Theorem 1.1.4 (page 4), and using Markov’s inequality P[exp(h) ≥ 1] ≤ P[exp(h)], we obtain Theorem 1.2.6. For any positive parameter λ, with P probability at least 1 - ε, for any posterior distribution ρ :
We see that for a fixed value of the parameter λ, the upper bound is optimized when the posterior is chosen to be the Gibbs distribution ρ = π exp (-λr) .
In this theorem, we have bounded ρ(R), the average expected risk of an estimator θ drawn from the posterior ρ. This is what we will do most of the time in this study. This is the error rate we will get if we classify a large number of test patterns, drawing a new θ for each one. However, we can also be interested in the error rate we get if we draw only one θ from ρ and use this single draw of θ to classify a large number of test patterns. This error rate is R( θ). To state a result about its deviations, we can start back from Lemma 1.1.1 (page 3) and integrate it with respect to the prior distribution π to get for any real constant λ
For any posterior distribution ρ : Ω → M 1 + (Θ), this can be rewritten as
proving Theorem 1.2.7 For any positive real parameter λ, for any posterior distribution
Let us remark that the bound provided here is the exact counterpart of the bound of Theorem 1.2.6, since log(dρ/dπ) appears as a disintegrated version of the divergence K(ρ, π). The parallel between the two theorems is particularly striking in the special case when ρ = π_{exp(-λr)}. Indeed Theorem 1.2.6 proves that with P probability at least 1 - ε,
whereas Theorem 1.2.7 proves that with Pπ_{exp(-λr)} probability at least 1 - ε,
showing that we get the same deviation bound for π exp(-λr) (R) under P and for θ under Pπ exp(-λr) .
We would like to show now how to optimize with respect to λ the bound given by Theorem 1.2.6 (the same discussion would apply to Theorem 1.2.7). Let us notice first that values of λ less than 1 are not interesting (because they provide a bound larger than one, at least as soon as ε ≤ exp(-1)). Let us consider some real parameter α > 1, and the set Λ = {α^k ; k ∈ N}, on which we put the probability measure ν(α^k) = [(k+1)(k+2)]^{-1}. Applying Theorem 1.2.6 to λ = α^k at confidence level 1 - ε/[(k+1)(k+2)], and using a union bound, we see that with probability at least 1 - ε, for any posterior distribution ρ,
Now we can remark that for any λ ∈ (1, +∞(, there is
Thus with probability at least 1 - ε, for any posterior distribution ρ,
.
Taking the approximately optimal value
we obtain Theorem 1.2.8. With probability at least 1 - ε, for any posterior distribution ρ :
Moreover with probability at least 1 - ε, for any posterior distribution ρ such that ρ(r) = 0,
We can also elaborate on the results in another direction by introducing the empirical dimension (1.6)
There is no need to introduce a margin in this definition, since r takes at most N values, and therefore π[r = ess inf_π r] is strictly positive. This leads to Corollary 1.2.9. For any positive real constant λ, with P probability at least 1 - ε, for any posterior distribution ρ :
We could then make the bound uniform in λ and optimize this parameter in a way similar to what was done to obtain Theorem 1.2.8.
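The bookkeeping behind such a uniform-in-λ argument is easy to check numerically. The sketch below is an illustration only: it verifies that the weights ν(α^k) = [(k+1)(k+2)]^{-1} used for Theorem 1.2.8 sum to one (the sum telescopes), so that the confidence budgets ε ν(α^k) add up to at most ε. The helper `grid_index`, which picks the largest grid point below a given λ, is a hypothetical convenience and is not taken from the text.

```python
import math

def grid_weight(k):
    """Weight nu(alpha**k) = 1 / ((k + 1) * (k + 2)), k = 0, 1, 2, ..."""
    return 1.0 / ((k + 1) * (k + 2))

def partial_mass(n_terms):
    """Sum of the first n_terms weights; telescopes to 1 - 1 / (n_terms + 1)."""
    return sum(grid_weight(k) for k in range(n_terms))

def grid_index(lam, alpha):
    """Hypothetical helper: largest k with alpha**k <= lam (needs lam >= 1, alpha > 1)."""
    k = 0
    while alpha ** (k + 1) <= lam:
        k += 1
    return k

def log_penalty(k, eps):
    """-log of the confidence budget eps * nu(alpha**k) spent on grid point k."""
    return math.log((k + 1) * (k + 2) / eps)

print(partial_mass(10), 1 - 1 / 11)  # telescoping identity
k = grid_index(10.0, 2.0)            # grid point 2**k just below lam = 10
print(k, log_penalty(k, 0.01))
```

The extra term `log_penalty(k, eps)` grows only like 2 log(k), that is like log log λ, which is the price paid for uniformity over the grid.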
In this section, better bounds will be achieved through a better choice of the prior distribution. This better prior distribution turns out to depend on the unknown sample distribution P, and some work is required to circumvent this and obtain empirical bounds.
As mentioned in the introduction, if one is willing to minimize the bound in expectation provided by Theorem 1.2.1 (page 6), one is led to consider the optimal choice π = P(ρ). However, this is only an ideal choice, since P is in all conceivable situations unknown. Nevertheless it shows that it is possible through Theorem 1.2.1 to measure the complexity of the classification model with P[K(ρ, P(ρ))], which is nothing but the mutual information between the random sample (X_i, Y_i)_{i=1}^N and the estimated parameter θ, under the joint distribution Pρ.
In practice, since we cannot choose π = P(ρ), we have to be content with a flat prior π, resulting in a bound measuring complexity according to P[K(ρ, π)] = P[K(ρ, P(ρ))] + K(P(ρ), π), larger by the entropy factor K(P(ρ), π) than the optimal one (we are still commenting on Theorem 1.2.1).
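The decomposition of P[K(ρ, π)] into a mutual information term and the entropy factor K(P(ρ), π) can be verified on a small discrete example. The sketch below is an illustration under assumptions: the three equally likely "samples", the posteriors attached to them and the flat prior are made-up toy values.

```python
import math

def kl(q, p):
    """Kullback-Leibler divergence K(q, p) between two finite distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Toy setting (assumption): three equally likely samples, three parameter values.
sample_probs = [1 / 3, 1 / 3, 1 / 3]
posteriors = [
    [0.7, 0.2, 0.1],  # rho(omega_1, .)
    [0.1, 0.8, 0.1],  # rho(omega_2, .)
    [0.2, 0.2, 0.6],  # rho(omega_3, .)
]
flat_prior = [1 / 3, 1 / 3, 1 / 3]

# P(rho): the posterior averaged over the sample distribution.
mixture = [sum(w * rho[j] for w, rho in zip(sample_probs, posteriors)) for j in range(3)]

mean_kl_flat = sum(w * kl(rho, flat_prior) for w, rho in zip(sample_probs, posteriors))
mutual_information = sum(w * kl(rho, mixture) for w, rho in zip(sample_probs, posteriors))
entropy_factor = kl(mixture, flat_prior)

print(mean_kl_flat, mutual_information + entropy_factor)  # equal up to rounding
```

Since the entropy factor is non-negative, the choice π = P(ρ) gives the smallest possible complexity term, as stated in the text.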
If we want to base the choice of π on Theorem 1.2.4 (page 10), and if we choose ρ = π exp(-λr) to optimize this bound, we will be inclined to choose some π such that 1 λ
is as close as possible to inf_{θ∈Θ} R(θ) in all circumstances. To give a more specific example, in the case when the distribution of the design (X_i)_{i=1}^N is known, one can introduce on the parameter space Θ the metric D already defined by equation (1.3, page 7) (or some available upper bound for this distance). In view of the fact that R(θ) - R(θ′) ≤ D(θ, θ′), for any θ, θ′ ∈ Θ, it can be meaningful, at least theoretically, to choose π as
where π k is the uniform measure on some minimal (or close to minimal)
Another possibility, when we have to deal with real valued parameters, meaning that
i=1 to some precision and to use a prior µ which is atomic on dyadic numbers. More precisely let us parametrize the set of dyadic real numbers as
where, as can be seen, s codes the sign, m the order of magnitude, p the precision and (b j ) p j=1 the binary representation of the dyadic number r s, m, p, (b j ) p j=1 . We can for instance consider on D the probability distribution
and define π ∈ M 1 + (R d ) as π = µ ⊗d . This kind of “coding” prior distribution can be used also to define a prior on the integers (by renormalizing the restriction of µ to integers to get a probability distribution). Using µ is somehow equivalent to picking up a representative of each dyadic interval, and makes it possible to restrict to the case when the posterior ρ is a Dirac mass without losing too much (when Θ = (0, 1), this approach is somewhat equivalent to considering as prior distribution the Lebesgue measure and using as posterior distributions the uniform probability measures on dyadic intervals, with the advantage of obtaining non-randomized estimators). When one uses in this way an atomic prior and Dirac masses as posterior distributions, the bounds proven so far can be obtained through a simpler union bound argument. This is so true that some of the detractors of the PAC-Bayesian approach (which, as a newcomer, has sometimes received a suspicious greeting among statisticians) have argued that it cannot bring anything that elementary union bound arguments could not essentially provide. We do not share of course this derogatory opinion, and while we think that allowing for non atomic priors and posteriors is worthwhile, we also would like to stress that the upcoming local and relative bounds could hardly be obtained with the only help of union bounds.
Although the choice of a flat prior seems at first glance to be the only alternative when nothing is known about the sample distribution P, the previous discussion shows that this type of choice is lacking proper localisation, and namely that we lose a factor K(P(π_{exp(-λr)}), π), the divergence between the bound-optimal prior P(π_{exp(-λr)}), which is concentrated near the minima of R in favourable situations, and the flat prior π. Fortunately, there are technical ways to get around this difficulty and to obtain more local empirical bounds.
The idea is to start with some flat prior π ∈ M^1_+(Θ), and the posterior distribution ρ = π_{exp(-λr)} minimizing the bound of Theorem 1.2.1 (page 6), when π is used as a prior. To improve the bound, we would like to use P(π_{exp(-λr)}) instead of π, and we are going to make the guess that we could approximate it with π_{exp(-βR)} (we have replaced the parameter λ with some distinct parameter β to give some more freedom to our investigation, and also because, intuitively, P(π_{exp(-λr)}) may be expected to be less concentrated than each of the π_{exp(-λr)} it is mixing, which suggests that the best approximation of P(π_{exp(-λr)}) by some π_{exp(-βR)} may be obtained for some parameter β < λ). We are then led to look for some empirical upper bound of K(ρ, π_{exp(-βR)}). This is happily provided by the following computation
- log π exp(-βR) -P log π exp(-βr) .
Using the convexity of r → log π exp(-βr) as in equation (1.4) on page 10, we conclude that 0 ≤ P K ρ, π exp(-βR) ≤ βP ρ(R -r) + P K ρ, π exp (-βr) .
This inequality has an interest of its own, since it provides a lower bound for P ρ(R) . Moreover we can plug it into Theorem 1.2.1 (page 6) applied to the prior distribution π exp(-βR) and obtain for any posterior distribution ρ and any positive parameter λ that
In view of this, it is convenient to introduce the function
showing that it is an increasing one to one convex map of the unit interval onto itself as soon as b ≤ a^{-1}[1 - exp(-a)]. Its convexity, combined with the value of its derivative at the origin, shows that
Using this notation and remarks, we can state Theorem 1.3.1. For any positive real constants β and λ such that
Thus (taking λ = 2β), for any
Note that the last inequality is obtained using the fact that 1 - exp(-x) ≥ x - x²/2, x ∈ R_+.
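The elementary inequality 1 - exp(-x) ≥ x - x²/2 on R_+ can be sanity-checked numerically; the gap behaves like x³/6 near the origin. The sketch below is a simple grid check, not a proof, and the function name and grid are arbitrary choices.

```python
import math

def gap(x):
    """(1 - exp(-x)) - (x - x**2 / 2); expm1 keeps precision for small x."""
    return -math.expm1(-x) - (x - 0.5 * x * x)

# Check the sign of the gap on a grid of [0, 10]; a tiny tolerance absorbs rounding.
worst = min(gap(i / 100.0) for i in range(0, 1001))
print(worst >= -1e-15)
print(gap(0.1), 0.1 ** 3 / 6)  # the gap is close to x**3 / 6 for small x
```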
Corollary 1.3.2. For any β ∈ (0, N (,
the last inequality holding only when β < N/2.
It is interesting to compare the upper bound provided by this corollary with Theorem 1.2.1 (page 6) when the posterior is a Gibbs measure ρ = π_{exp(-βr)}. We see that we have got rid of the entropy term K(π_{exp(-βr)}, π), but at the price of an increase of the multiplicative factor, which for small values of β/N grows from (1 - β/(2N))^{-1} (when we take λ = β in Theorem 1.2.1), to (1 - 2β/N)^{-1}. Therefore non-localized bounds have an interest of their own, and are superseded by localized bounds only in favourable circumstances (presumably when the sample is large enough when compared with the complexity of the classification model).
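To get a feeling for the price paid in the multiplicative factor, the two factors quoted above can be tabulated for a few values of β/N. This is only a numerical illustration of the two expressions (1 - β/(2N))^{-1} and (1 - 2β/N)^{-1}; the function names and the chosen ratios are arbitrary.

```python
def non_localized_factor(beta, n):
    """Factor (1 - beta / (2 n))**-1 of the non-localized bound with lam = beta."""
    return 1.0 / (1.0 - beta / (2.0 * n))

def localized_factor(beta, n):
    """Factor (1 - 2 beta / n)**-1 of the localized bound (requires beta < n / 2)."""
    return 1.0 / (1.0 - 2.0 * beta / n)

n = 1000
for beta in (10, 50, 100, 200):
    print(beta / n, non_localized_factor(beta, n), localized_factor(beta, n))
```

For β/N = 0.1 the factor grows from about 1.05 to 1.25, so the localized bound only wins when the entropy term it removes is large enough to compensate.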
Corollary 1.3.2 shows that when 2β/N is small, π_{exp(-βr)}(r) is a tight approximation of π_{exp(-βr)}(R) in the mean (since we have an upper bound and a lower bound which are close together).
Another corollary is obtained by optimizing the bound given by Theorem 1.3.1 in ρ, which is done by taking ρ = π exp(-λr) .
Corollary 1.3.3. For any positive real constants β and λ such that
Although this inequality gives by construction a better upper bound for inf λ∈R+ P π exp(-λr) (R) than Corollary 1.3.2, it is not easy to tell which one of the two inequalities is the best to bound P π exp(-λr) (R) for a fixed (and possibly suboptimal) value of λ, because in this case, one factor is improved while the other is worsened.
Using the empirical dimension d e defined by equation (1.6) on page 13, we see that
Therefore, in the case when we keep the ratio λ/β bounded, we get a better dependence on the empirical dimension d_e than in Corollary 1.2.9 (page 14).
Let us come now to the localization of the non-random upper bound given by Theorem 1.2.4 (page 10). According to Theorem 1.2.1 (page 6) applied to the localized prior π exp(-βR) ,
where we have used as previously inequality (1.4) (page 10). This proves Theorem 1.3.4. For any posterior distribution ρ : Ω → M 1 + (Θ), for any real parameters β and λ such that 0
Let us notice in particular that this theorem contains Theorem 1.2.4 (page 10) which corresponds to the case β = 0. As a corollary, we see also, taking ρ = π exp(-λr) and λ = 2β, and noticing that γ → π exp(-γR) (R) is decreasing, that
We can use this inequality in conjunction with the notion of dimension with margin η introduced by equation (1.5) on page 10, to see that the Gibbs posterior achieves for a proper choice of λ and any margin parameter η ≥ 0 (which can be chosen to be equal to zero in parametric situations)
Deviation bounds to come next will show that the optimal λ can be estimated from empirical data. Let us propose a little numerical example as an illustration: assuming that d 0 = 10, N = 1000 and ess inf π R = 0.2, we obtain from equation (1.8) that inf λ P π exp(-λr) (R) ≤ 0.373.
When it comes to deviation bounds, for technical reasons we will choose a slightly more involved change of prior distribution and apply Theorem 1.2.6 (page 11) to the prior π exp[-βΦ
The advantage of tweaking R with the nonlinear function Φ_{-β/N} will appear in the search for an empirical upper bound of the local entropy term. Theorem 1.1.4 (page 4), used with the above-mentioned local prior, shows that (1.9) P sup
which is an invitation to find an upper bound for log π exp -βΦ -λ N
• R log π exp(-βr) . For conciseness, let us call our localized prior distribution π, thus defined by its density
.
Applying once again Theorem 1.1.4 (page 4), but this time to -β, we see that
Combining equations (1.10) and (1.11) and using the concavity of Φ_{-β/N}, we see that with P probability at least 1 - ε, for any posterior distribution ρ :
We have proved a lower deviation bound:
Theorem 1.3.5. For any positive real constant β, with P probability at least 1 - ε, for any posterior distribution ρ :
We can also obtain a lower deviation bound for θ. Indeed equation (1.11) can also be written as
This means that for any posterior distribution ρ : Ω → M 1 + (Θ),
We have proved Theorem 1.3.6 For any positive real constant β, for any posterior distribution
.
Let us now resume our investigation of the upper deviations of ρ(R). Using the Cauchy-Schwarz inequality to combine equations (1.9, page 19) and (1.11, page 19), we obtain (1.12)
Thus with P probability at least 1 - ε, for any posterior distribution ρ,
(It would have been more straightforward to use a union bound on deviation inequalities instead of the Cauchy-Schwarz inequality on exponential moments; however, this would have replaced -2 log(ε) with the worse factor 2 log(2/ε).) Let us now recall that
and let us put
Let us consider moreover the change of variables α = 1 - exp(-λ/N) and γ = exp(β/N) - 1. We obtain [1 - αρ(R)] / [1 + γρ(R)] ≥ exp(-B/N), leading to Theorem 1.3.7. For any positive constants α, γ, such that 0 ≤ γ < α < 1, with P probability at least 1 - ε, for any posterior distribution ρ : Ω → M^1_+(Θ), the bound
π exp(-ξr) (r)dξ -2 log(ǫ)
Let us now give an upper bound for R( θ ). Equation (1.12 page 20) can also be written as
This means that for any posterior distribution ρ : Ω → M 1 + (Θ),
Using the concavity of the square root function, this inequality can be weakened to
We have proved Theorem 1.3.8. For any positive real constants λ and β and for any posterior distribution ρ : Ω → M^1_+(Θ), with Pρ probability at least 1 - ε,
we can also, in the case when γ < α, write this inequality as
It may be enlightening to introduce the empirical dimension d e defined by equation (1.6) on page 13. It provides the upper bound
which shows that in Theorem 1.3.7 (page 21),
Similarly, in Theorem 1.3.8 above,
Let us give a little numerical illustration: assuming that d_e = 10, N = 1000, and ess inf_π r = 0.2, taking ε = 0.01, α = 0.5 and γ = 0.1, we obtain from Theorem 1.3.7 π_{exp[N log(1-α) r]}(R) ≃ π_{exp(-693r)}(R) ≤ 0.332 ≤ 0.372, where we have given respectively the non-linear and the linear bound. This shows the practical interest of keeping the non-linearity. Optimizing the values of the parameters α and γ would not have yielded a significantly lower bound.
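The change of variables used in Theorem 1.3.7 can be made concrete with a few lines of code. The sketch below is an illustration with hypothetical function names: it converts (α, γ) back to (λ, β), which recovers the inverse temperature 693 of the numerical example, and it inverts the inequality [1 - αρ(R)] / [1 + γρ(R)] ≥ exp(-B/N) for ρ(R), treating B as a given number (its definition is not reproduced here, and the value used below is arbitrary).

```python
import math

def to_lambda_beta(alpha, gamma, n):
    """Invert alpha = 1 - exp(-lam / n) and gamma = exp(beta / n) - 1."""
    lam = -n * math.log(1.0 - alpha)
    beta = n * math.log(1.0 + gamma)
    return lam, beta

def risk_bound_from_b(alpha, gamma, n, b):
    """Largest x with (1 - alpha * x) / (1 + gamma * x) >= exp(-b / n).

    Rearranging gives x <= (1 - t) / (alpha + gamma * t) with t = exp(-b / n).
    """
    t = math.exp(-b / n)
    return (1.0 - t) / (alpha + gamma * t)

lam, beta = to_lambda_beta(alpha=0.5, gamma=0.1, n=1000)
print(round(lam, 1), round(beta, 1))  # lam is about 693, beta about 95

# Arbitrary illustrative value of B (assumption, not computed from the theorem).
x = risk_bound_from_b(alpha=0.5, gamma=0.1, n=1000, b=200.0)
print(x, (1.0 - 0.5 * x) / (1.0 + 0.1 * x), math.exp(-0.2))
```

The value β ≈ 95 obtained from γ = 0.1 is close to the round value β = 100 used in the numerical example that follows Corollary 1.3.9.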
The following corollary is obtained by taking λ = 2β and keeping only the linear bound; we give it for the sake of its simplicity: Corollary 1.3.9. For any positive real constant β such that exp(β/N) + exp(-2β/N) < 2, which is the case when β < 0.48N, with P probability at least 1 - ε, for any posterior distribution ρ :
.
Let us mention that this corollary applied to the above numerical example gives π exp(-200r) (R) ≤ 0.475 (when we take β = 100, consistently with the choice γ = 0.1).
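The threshold β < 0.48N quoted in Corollary 1.3.9 can be recovered by solving exp(x) + exp(-2x) = 2 for x = β/N. The sketch below uses a plain bisection; the function names and the bracketing interval are arbitrary choices made for this illustration.

```python
import math

def excess(x):
    """exp(x) + exp(-2x) - 2, negative exactly when the corollary's condition holds."""
    return math.exp(x) + math.exp(-2.0 * x) - 2.0

def threshold(lo=0.1, hi=1.0, iters=60):
    """Bisection for the positive root of excess on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if excess(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

root = threshold()
print(root)          # slightly above 0.48
print(excess(0.1))   # negative: beta = 100, N = 1000 satisfies the condition
```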
Local bounds are suitable when the lowest values of the empirical error rate r are reached only on a small part of the parameter set Θ. When Θ is the disjoint union of sub-models of different complexities, the minimum of r will as a rule not be “localized” in a way that calls for the use of local bounds. Just think for instance of the case when Θ = ⋃_{m=1}^M Θ_m, where the sets
In this case we will have inf Θ1 r ≥ inf Θ2 r ≥ • • • ≥ inf ΘM r, although Θ M may be too large to be the right model to use. In this situation, we do not want to localize the bound completely. Let us make a more specific fanciful but typical pseudo computation. Just imagine we have a countable collection (Θ m ) m∈M of sub-models. Let us assume we are interested in choosing between the estimators θ m ∈ arg min Θm r, maybe randomizing them (e.g. replacing them with π m exp(-λr) ). Let us imagine moreover that we are in a typically parametric situation, where, for some priors
M ) be some distribution on the index set M . It is easy to see that (µπ) exp(-λr) will typically not be properly local, in the sense that typically (µπ) exp(-λr) (r) = µ π exp(-λr) (r)π exp(-λr)
where we have used the approximations
These approximations make no claim to be rigorous or very accurate, but they nevertheless give the best order of magnitude we can expect in typical situations, and show that this order of magnitude is not what we are looking for: mixing different models with the help of µ spoils the localization, introducing a multiplier log(λ/d_m) to the dimension d_m, which is precisely what we would have got if we had not localized the bound at all. What we would really like to do in such situations is to use a partially localized posterior distribution, such as π^{m̂}_{exp(-λr)}, where m̂ is an estimator of the best sub-model to be used. While the most straightforward way to do this is to use a union bound on results obtained for each sub-model Θ_m, here we are going to show how to allow arbitrary posterior distributions on the index set (corresponding to a randomization of the choice of m).
Let us consider the framework we just mentioned: let the measurable parameter set (Θ, T) be a union of measurable sub-models, Θ = ⋃_{m∈M} Θ_m. Let the index set (M, M) be some measurable space (most of the time it will be a countable set). Let µ ∈ M^1_+(M) be a prior probability distribution on (M, M). Let π : M → M^1_+(Θ) be a regular conditional probability measure such that π(m, Θ_m) = 1, for any m ∈ M. Let µπ ∈ M^1_+(M × Θ) be the product probability measure defined for any bounded measurable function h :
h(m, θ)π(m, dθ) µ(dm).
- (Θ) be the regular conditional posterior probability measure defined by
where consistently with previous notation π(m, h) = ∫_Θ h(m, θ) π(m, dθ) (we will also often use the less explicit notation π(h)). For short, let
Integrating with respect to µ equation (1.12, page 20), written in each sub-model Θ m using the prior distribution π(m, •), we see that
This proves that (1.13) P exp 1 2 sup
Introducing the optimal value of r on each sub-model r ⋆ (m) = ess inf π(m,•) r and the empirical dimensions
we can thus state Theorem 1.3.10. For any positive real constants β < λ, with P probability at least 1 - ε, for any posterior distribution ν :
where
and therefore
Thus, for any real constants α and γ such that 0 ≤ γ < α < 1, with P probability at least 1 - ε, for any posterior distribution ν : Ω → M^1_+(M) and any conditional posterior distribution ρ : Ω × M → M^1_+(Θ), the bound
If one is willing to bound the deviations with respect to Pνρ, it is enough to remark that the equation preceding equation (1.13, page 25) can also be written as
Thus for any posterior distributions ν :
Using the concavity of the square root function to pull the integration with respect to ρ out of the square root, we get
Theorem 1.3.11. For any positive real constants β < λ, for any posterior distributions ν :
Another way to state the same inequality is to say that for any real constants α and γ such that 0 ≤ γ < α < 1, with Pνρ probability at least 1 - ε, R(m, θ)
where
Let us remark that in the case when
1/2 and ρ = π (1-α) N r , we get as desired a bound that is adaptively local in all the Θ m (at least when M is countable and µ is atomic):
.
The penalization by the empirical dimension d e (m) in each sub-model is as desired linear in d e (m). Non random partially local bounds could be obtained in a way that is easy to imagine. We leave this investigation to the reader.
We have seen that the bound optimal choice of the posterior distribution ν on the index set in Theorem 1.3.10 (page 25) is such that
This suggests replacing the prior distribution µ with µ defined by its density
The use of Φ_{-η/N} ∘ R instead of R is motivated by technical reasons which will appear in subsequent computations. Indeed, we will need to bound
In the spirit of equation (1.9, page 19), starting back from Theorem 1.1.4 (page 4), applied in each sub-model Θ_m to the prior distribution π_{exp(-γ Φ_{-η/N} ∘ R)} and integrated with respect to µ, we see that for any positive real constants λ, γ and η, with P probability at least 1 - ε, for any posterior distribution ν : Ω → M^1_+(M) on the index set and any conditional posterior distribution ρ :
Thus if we put
, we obtain that f (x) ≥ 0, x ∈ R, and therefore that the left-hand side of equation (1.15) is non-negative. We can moreover introduce the prior conditional distribution π defined by
.
With P probability at least 1 - ε, for any posterior distributions ν :
Thus, coming back to equation (1.15), we see that under condition (1.16), with P probability at least 1 - ε,
Noticing moreover that
and choosing ρ = π_{exp(-λr)}, we have proved Theorem 1.3.12. For any positive real constants β, γ and η, such that γ < η [exp(η/N) - 1]^{-1}, defining λ by condition (1.16), so that
Let us remark that this theorem does not require that β < γ, and thus provides both an upper and a lower bound for the quantity of interest: Corollary 1.3.13. For any positive real constants β, γ and η such that max{β, γ} < η [exp(η/N) - 1]^{-1}, with P probability at least 1 - ε, for any posterior distributions
to conclude that, putting
and
the divergence of ν with respect to the local prior µ is bounded by
For any positive constants β, γ and η such that max{β, γ} < η [exp(η/N) - 1]^{-1}, with P probability at least 1 - ε, for any posterior distributions
where the local prior µ is defined by equation (1.14, page 28) and the local posterior ν and the function G η are defined by equation (1.18, page 30).
We can then use this theorem to give a local version of Theorem 1.3.10 (page 25). To get something pleasing to read, we can apply Theorem 1.3.14 with constants β ′ , γ ′ and η chosen so that
where
and where the function G η is defined by equation (1.17, page 30).
A first remark: if we had the stamina to use Cauchy-Schwarz inequalities (or more generally Hölder inequalities) on exponential moments instead of using weighted union bounds on deviation inequalities, we could have replaced log(4/ε) with -log(ε) in the above inequalities.
We see that we have achieved the desired kind of localization of Theorem 1.3.10 (page 25), since the new empirical entropy term
] cancels for a value of the posterior distribution on the index set ν which is of the same form as the one minimizing the bound B 1 (ν, ρ) of Theorem 1.3.10 (with a decreased constant, as could be expected). In a typical parametric setting, we will have
and therefore, if we choose for ν the Dirac mass at
and ρ(m, •) = π exp(-λr) (m, •), we will get, in the case when the index set M is countable,
This shows that the impact on the bound of the addition of supplementary models depends on their penalized minimum empirical risk r ⋆ (m) +
. More precisely the adaptive and local complexity factor
replaces in this bound the non local factor
which appears when applying Theorem 1.3.10 (page 25) to the Dirac mass ν = δ m . Thus in the local bound, the influence of models decreases exponentially fast when their penalized empirical risk increases.
One can deduce a result about the deviations with respect to the posterior νρ from Theorem 1.3.15 (page 31) without much supplementary work: it is enough for that purpose to remark that with P probability at least 1 - ε, for any posterior distribution ν :
this inequality being obtained by taking a supremum in ρ in Theorem 1.3.15 (page 31). One can then take a supremum in ν, to get, still with P probability at least
Using the fact that x → x α is concave when
We can thus state Theorem 1.3.16. For any ε ∈ )0, 1(, with P probability at least 1 - ε, for any posterior distribution ν : Ω → M^1_+(M) and conditional posterior distribution ρ :
Note that the given bound consequently holds with Pνρ probability at least
The behaviour of the minimum of the empirical process θ → r(θ) is known to depend on the covariances between pairs r(θ), r(θ ′ ) , θ, θ ′ ∈ Θ. In this respect, our previous study, based on the analysis of the variance of r(θ) (or technically on some exponential moment playing quite the same role), loses some accuracy in some circumstances (namely when inf Θ R is not close enough to zero).
In this section, instead of bounding the expected risk ρ(R) of any posterior distribution, we are going to upper bound the difference ρ(R) - inf_Θ R, and more generally ρ(R) - R(θ̃), where θ̃ ∈ Θ is some fixed parameter value.
In the next section we will analyse ρ(R) -π exp(-βR) (R), allowing us to compare the expected error rate of a posterior distribution ρ with the error rate of a Gibbs prior distribution. We will also analyse ρ 1 (R) -ρ 2 (R), where ρ 1 and ρ 2 are two arbitrary posterior distributions, using comparison with a Gibbs prior distribution as a tool, and in particular as a tool to establish the required Kullback divergence bounds.
Relative bounds do not provide the same kind of results as direct bounds on the error rate: it is not possible to estimate ρ(R) with an order of precision higher than (ρ(R)/N ) 1/2 , so that relative bounds cannot of course achieve that, but they provide a way to reach a faster rate for ρ(R) -inf Θ R, that is for the relative performance of the estimator within a restricted model.
The study of PAC-Bayesian relative bounds was initiated in the second and third parts of J.-Y. Audibert’s dissertation (Audibert, 2004b).
In this section and the next, we will suggest a series of possible uses of relative bounds. As usual, we will start with the simplest inequalities and proceed towards more sophisticated techniques with better theoretical properties but less precise constants, so that which one is best suited will depend on the size of the training sample.
The first thing we will do is to compute for any posterior distribution ρ : Ω → M^1_+(Θ) a relative performance bound bearing on ρ(R) - inf_Θ R. We will also compare the classification model indexed by Θ with a sub-model indexed by one of its measurable subsets Θ_1 ⊂ Θ. For this purpose we will form the difference ρ(R) - R(θ̃), where θ̃ ∈ Θ_1 is some possibly unobservable value of the parameter in the sub-model defined by Θ_1, typically chosen in arg min_{Θ_1} R. If this is so and ρ(R) - R(θ̃) = ρ(R) - inf_{Θ_1} R, a negative upper bound indicates that it is definitely worth using a randomized estimator ρ supported by the larger parameter set Θ instead of using only the classification model defined by the smaller set Θ_1.
Relative bounds in this section are based on the control of r(θ) - r(θ̃), where θ, θ̃ ∈ Θ. These differences are related to the random variables
Some supplementary technical difficulties, as compared to the previous sections, come from the fact that ψ_i(θ, θ̃) takes three values, whereas σ_i(θ) takes only two. Let
The right-hand side of this inequality is a function of C. On the other hand, C being a probability measure on a three point set, is defined by two parameters, that we may take equal to ψ C(dψ) and ψ 2 C(dψ). To this purpose, let us introduce
It is a pseudo distance (meaning that it is symmetric and satisfies the triangle inequality), since it can also be written as
It is readily seen that
where
Thus plugging this equality into inequality (1.20, page 35) we get Theorem 1.4.1. For any real parameter λ,
where r ′ is defined by equation (1.19, page 34) and Ψ and M ′ are defined just above.
To make a link with previous work of Mammen and Tsybakov (see e.g. Mammen et al. (1999) and Tsybakov (2004)), we may consider the pseudo-distance D on Θ defined by equation (1.3, page 7). This distance only depends on the distribution of the patterns. It is often used to formulate margin assumptions, in the sense of Mammen and Tsybakov. Here we are going to work rather with M′: as it is dominated by D in the sense that M′(θ, θ̃) ≤ D(θ, θ̃), θ, θ̃ ∈ Θ, with equality in the important case of binary classification, hypotheses formulated on D induce hypotheses on M′, and working with M′ may only sharpen the results when compared to working with D.
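To make the objects of this subsection concrete, here is a minimal Python sketch. It assumes, as our reading of displays that are not shown in the text, that ψ_i(θ, θ̃) = σ_i(θ) - σ_i(θ̃) where σ_i(θ) is the error indicator of the classifier indexed by θ on the i-th example, that r′ is the empirical mean of ψ_i and that m′ is the empirical mean of ψ_i². The classifiers, labels and sample size are hypothetical random data.

```python
import random

def sigma(pred, labels):
    """Error indicators: 1 where the prediction differs from the label."""
    return [int(p != y) for p, y in zip(pred, labels)]

def r_prime(s_a, s_b):
    """Assumed relative empirical risk r'(a, b) = r(a) - r(b) = mean of psi_i."""
    return sum(a - b for a, b in zip(s_a, s_b)) / len(s_a)

def m_prime(s_a, s_b):
    """Assumed empirical pseudo-distance m'(a, b) = mean of psi_i ** 2."""
    return sum((a - b) ** 2 for a, b in zip(s_a, s_b)) / len(s_a)

random.seed(0)
N = 200
labels = [random.randint(0, 1) for _ in range(N)]
# Three hypothetical binary classifiers, known only through their predictions.
preds = [[random.randint(0, 1) for _ in range(N)] for _ in range(3)]
s = [sigma(p, labels) for p in preds]

# psi_i takes values in {-1, 0, +1}, whereas sigma_i takes values in {0, 1}.
psi_values = {a - b for a, b in zip(s[0], s[1])}

# In binary classification, psi_i != 0 exactly when the two classifiers disagree.
disagreement = sum(p != q for p, q in zip(preds[0], preds[1])) / N
```

In this binary setting `m_prime(s[0], s[1])` coincides with the empirical disagreement frequency, which is the empirical analogue of the equality case between M′ and D mentioned above.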
Using the same reasoning as in the previous section, we deduce Theorem 1.4.2. For any real parameter λ, any θ ∈ Θ, any prior distribution π ∈ M 1 + (Θ),
We are now going to derive some other type of relative exponential inequality. In Theorem 1.4.2 we obtained an inequality comparing one observed quantity
This may be inconvenient when looking for an empirical bound for ρ R ′ (•, θ) , and we are going now to seek an inequality comparing ρ R ′ (•, θ ) with empirical quantities only. This is possible by considering the log-Laplace transform of some modified random variable χ i (θ, θ). We may consider more precisely the change of variable defined by the equation
which is possible when λ N ∈ )-1, 1( and leads to define
We may then work on the log-Laplace transform
We may now follow the same route as previously, writing
Let us also introduce the random pseudo distance
With this notation, we can conveniently write the previous inequality as
Integrating with respect to a prior probability measure π ∈ M 1 + (Θ), we obtain Theorem 1.4.3. For any real parameter γ, for any θ ∈ Θ, for any prior probability distribution π ∈ M 1 + (Θ),
Let us first deduce a non-random bound from Theorem 1.4.2 (page 36). This theorem can be conveniently taken advantage of by throwing the non-linearity into a localized prior, considering the prior probability measure µ defined by its density
.
Indeed, for any posterior distribution ρ :
Plugging this into Theorem 1.4.2 (page 36) and using the convexity of the exponential function, we see that for any posterior probability distribution ρ :
We can then recall that
and notice moreover that
Putting these two remarks together, we obtain Theorem 1.4.4. For any real positive parameter λ, for any prior distribution
It may be interesting to derive some more suggestive (but slightly weaker) bound in the important case when Θ 1 = Θ and R( θ) = inf Θ R. In this case, it is convenient to introduce the expected margin function
We see that ϕ is convex and non-negative on R + . Using the bound M ′ (θ, θ ) ≤ xR ′ (θ, θ ) + ϕ(x), we obtain
Let us make the change of variable
Corollary 1.4.5. For any real positive parameters x, γ and λ such that
Let us remark that these results, although well suited to study Mammen and Tsybakov’s margin assumptions, hold in the general case: introducing the convex expected margin function ϕ is a substitute for making hypotheses about the relations between R and D.
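As an illustration, the following sketch computes a margin function on a finite parameter set. It assumes that ϕ(x) is the supremum over θ of M′(θ, θ̃) - x R′(θ, θ̃), which is the reading suggested by the bound M′(θ, θ̃) ≤ x R′(θ, θ̃) + ϕ(x) used above; the numerical values of M′ and R′ are hypothetical and chosen so that R′ = c M′ (a margin assumption with exponent 1).

```python
def margin_function(x, M, R_rel):
    """phi(x) = max over the finite parameter set of M'(theta) - x * R'(theta)."""
    return max(m - x * r for m, r in zip(M, R_rel))

c = 0.5
# Hypothetical values of M'(theta, theta_ref); the entry 0 is theta_ref itself.
M = [k / 20 for k in range(21)]
# Hypothetical relative risks R'(theta, theta_ref) = c * M'(theta, theta_ref).
R_rel = [c * m for m in M]

# phi is a maximum of affine functions of x, hence convex; it is non-negative
# because the reference parameter contributes the value 0.
phi_values = {x: margin_function(x, M, R_rel) for x in (0.0, 1.0, 2.0, 3.0)}
```

With these values ϕ(1/c) = ϕ(2) = 0, and ϕ decreases linearly from 1 to 0 on the interval from 0 to 1/c.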
Using the fact that R′(θ, θ̃) ≥ 0, θ ∈ Θ, and that ϕ(x) ≥ 0, x ∈ R_+, we can weaken and simplify the preceding corollary even more to get Corollary 1.4.6. For any real parameters β, λ and x such that x ≥ 0 and 0 ≤ β < λ - xλ²/(2N), for any posterior distribution ρ :
Let us apply this bound under the margin assumption first considered by Mammen and Tsybakov (Mammen et al., 1999;Tsybakov, 2004), which says that for some real positive constant c and some real exponent κ ≥ 1,
In the case when κ = 1, then ϕ(c -1 ) = 0, proving that
β , for some positive real constant d linked with the dimension of the classification model, then
In the case when κ > 1,
In the parametric case when π_{exp(-γR)}[R′(•, θ̃)] ≤ d/γ, we get
We see that this formula coincides with the result for κ = 1. We can thus reduce the two cases to a single one and state Corollary 1.4.7. Let us assume that for some θ̃ ∈ Θ, some positive real constant c, some real exponent κ ≥ 1 and for any θ ∈ Θ, R(θ) ≥ R(θ̃) + c D(θ, θ̃)^κ. Let us also assume that for some positive real constant d and any positive real parameter
Let us remark that the exponent of N in this corollary is known to be the minimax exponent under these assumptions: it is unimprovable, whatever estimator is used in place of the Gibbs posterior shown here (at least in the worst case compatible with the hypotheses). The interest of the corollary is to show not only the minimax exponent in N , but also an explicit non-asymptotic bound with reasonable and simple constants. It is also clear that we could have got slightly better constants if we had kept the full strength of Theorem 1.4.4 (page 38) instead of using the weaker Corollary 1.4.6 (page 39).
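The case κ > 1 rests on an elementary optimization, which can be checked numerically. Assuming (our reading, since the displays are lost) that the margin function is the supremum of M′ - x R′, the margin assumption R′ ≥ c D^κ together with M′ ≤ D gives ϕ(x) ≤ sup over t ≥ 0 of t - x c t^κ. The sketch below compares a grid search with the closed form (1 - 1/κ)(κ c x)^{-1/(κ-1)}; the function names and parameter values are hypothetical.

```python
def sup_on_grid(x, c, kappa, n=10000):
    """Grid search for sup of t - x * c * t**kappa over t in [0, 1]."""
    return max(k / n - x * c * (k / n) ** kappa for k in range(n + 1))

def closed_form(x, c, kappa):
    """Value of the supremum over t >= 0, attained at t = (kappa*c*x)**(-1/(kappa-1)).

    The maximizer lies in [0, 1] only when kappa * c * x >= 1, which is
    the regime used in the comparison below.
    """
    return (1.0 - 1.0 / kappa) * (kappa * c * x) ** (-1.0 / (kappa - 1.0))

# Two hypothetical parameter settings (x, c, kappa) with kappa * c * x >= 1.
checks = [(4.0, 0.5, 2.0), (3.0, 1.0, 3.0)]
gaps = [abs(sup_on_grid(x, c, k) - closed_form(x, c, k)) for x, c, k in checks]
```

The bound decreases like x^{-1/(κ-1)}, which is what drives the trade-off between x and the other constants when κ > 1.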
In what follows, we will prove empirical bounds showing how the constant λ can be estimated from the data instead of being chosen according to some margin and complexity assumptions.
We are going to define an empirical counterpart for the expected margin function ϕ. It will appear in empirical bounds having otherwise the same structure as the non-random bound we just proved. Anyhow, we will not launch into trying to compare the behaviour of our proposed empirical margin function with the expected margin function, since the margin function involves taking a supremum which is not straightforward to handle. When we come to the issue of building provably adaptive estimators, we will instead formulate another type of bound based on integrated quantities, rather than try to analyse the properties of the empirical margin function.
Let us start as in the previous subsection with the inequality
We have already defined by equation (1.22, page 37) the empirical pseudo-distance
Recalling that P m ′ (θ, θ ) = M ′ (θ, θ ), and using the convexity of h → log π exp(h) , leads to the following inequalities:
We may moreover remark that
This establishes Theorem 1.4.8. For any positive real parameters β and λ, for any posterior distribution ρ : Ω → M 1 + (Θ),
Taking β = (N/2) sinh(λ/N), using the fact that sinh(a) ≥ a, a ≥ 0, and expressing tanh(a/2) = sinh(a)^{-1} [√(1 + sinh(a)²) - 1] and a = log[√(1 + sinh(a)²) + sinh(a)], we deduce Corollary 1.4.9. For any positive real constant β and any posterior distribution
This theorem and its corollary are really analogous to Theorem 1.4.4 (page 38), and it could easily be proved that under Mammen and Tsybakov margin assumptions we obtain an upper bound of the same order as Corollary 1.4.7 (page 40). Anyhow, in order to obtain an empirical bound, we are now going to take a supremum over all possible values of θ, that is over Θ 1 . Although we believe that taking this supremum will not spoil the bound in cases when over-fitting remains under control, we will not try to investigate precisely if and when this is actually true, and provide our empirical bound as such. Let us say only that on qualitative grounds, the values of the margin function quantify the steepness of the contrast function R or its empirical counterpart r, and that the definition of the empirical margin function is obtained by substituting P, the true sample distribution, with
⊗N , the empirical sample distribution, in the definition of the expected margin function. Therefore, on qualitative grounds, it seems hopeless to presume that R is steep when r is not, or in other words that a classification model that would be inefficient at estimating a bootstrapped sample according to our non-random bound would be by some miracle efficient at estimating the true sample distribution according to the same bound. To this extent, we feel that our empirical bounds bring a satisfactory counterpart of our non-random bounds. Anyhow, we will also produce estimators which can be proved to be adaptive using PAC-Bayesian tools in the next section, at the price of a more sophisticated construction involving comparisons between a posterior distribution and a Gibbs prior distribution or between two posterior distributions.
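The change of parameter leading to Corollary 1.4.9 uses two elementary hyperbolic identities, expressing tanh(a/2) and a as functions of s = sinh(a), together with the inequality sinh(a) ≥ a for a ≥ 0. They can be checked numerically as follows; the helper names are ours.

```python
import math

def tanh_half_from_sinh(s):
    """tanh(a/2) written as a function of s = sinh(a), for s > 0."""
    return (math.sqrt(1.0 + s * s) - 1.0) / s

def arg_from_sinh(s):
    """a written as a function of s = sinh(a), i.e. the inverse hyperbolic sine."""
    return math.log(math.sqrt(1.0 + s * s) + s)

test_points = [0.01, 0.5, 2.0]
errors = []
for a in test_points:
    s = math.sinh(a)
    errors.append(abs(math.tanh(a / 2.0) - tanh_half_from_sinh(s)))
    errors.append(abs(a - arg_from_sinh(s)))
```

These identities are what allows a bound stated in terms of λ to be rewritten in terms of the single parameter β once sinh(λ/N) has been expressed through β.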
Let us now restrict discussion to the important case when θ̃ ∈ arg min_{Θ_1} R. To obtain an observable bound, let θ̂ ∈ arg min_{θ∈Θ} r(θ) and let us introduce the empirical margin functions
Using the fact that m′(θ, θ̃) ≤ m′(θ, θ̂) + m′(θ̂, θ̃), we get Corollary 1.4.10. For any positive real parameters β and λ, for any posterior distribution ρ :
Taking β = (N/2) sinh(λ/N), we also obtain
.
Note that we could also use the upper bound m
Corollary 1.4.11. For any non-negative real parameters x, α and λ, such that
.
Let us notice that in the case when Θ 1 = Θ, the upper bound provided by this corollary has the same general form as the upper bound provided by Corollary 1.4.5 (page 39), with the sample distribution P replaced with the empirical distribution of the sample
⊗N . Therefore, our empirical bound can be of a larger order of magnitude than our non-random bound only in the case when our non-random bound applied to the bootstrapped sample distribution P would be of a larger order of magnitude than when applied to the true sample distribution P. In other words, we can say that our empirical bound is close to our non-random bound in every situation where the bootstrapped sample distribution P is not harder to bound than the true sample distribution P. Although this does not prove that our empirical bound is always of the same order as our non-random bound, this is a good qualitative hint that this will be the case in most practical situations of interest, since in situations of “under-fitting”, if they exist, it is likely that the choice of the classification model is inappropriate to the data and should be modified. Another reassuring remark is that the empirical margin functions ϕ and ϕ behave well in the case when inf Θ r = 0. Indeed in this case m ′ (θ, θ) = r ′ (θ, θ) = r(θ), θ ∈ Θ, and thus ϕ(1) = ϕ(1) = 0, and ϕ(x) ≤ -(x -1) inf Θ1 r, x ≥ 1. This shows that in this case we recover the same accuracy as with non-relative local empirical bounds. Thus the bound of Corollary 1.4.11 does not collapse in presence of massive over-fitting in the larger model, causing r( θ) = 0, which is another hint that this may be an accurate bound in many situations.
It is natural to make use of Theorem 1.4.3 (page 37) to obtain empirical deviation bounds, since this theorem provides an empirical variance term.
Theorem 1.4.3 is written in a way which exploits the fact that ψ i takes only the three values -1, 0 and +1. However, it will be more convenient for the following computations to use it in its more general form, which only makes use of the fact that ψ i ∈ (-1, 1). With notation to be explained hereafter, it can indeed also be written as (1.25) P exp sup
We have used the following notation in this inequality. We have put
so that P is our notation for the empirical distribution of the process (X i , Y i ) N i=1 . Moreover we have also used
where it should be remembered that the joint distribution of the process
In the same way
Moreover integration with respect to ρ bears on the index θ, so that
We have chosen concise notation, as we did throughout these notes, in order to make the computations easier to follow.
To get an alternate version of empirical relative deviation bounds, we need to find some convenient way to localize the choice of the prior distribution π in equation (1.25, page 44). Here we propose replacing π with µ = π exp{-N log[1+βP (ψ)]} , which can also be written π exp{-N log[1+βR ′ (•, θ)]} . Indeed we see that
.
Moreover, we deduce from our deviation inequality applied to -ψ, that (as long as β > -1),
-log π exp -N P log(1 + βψ)
This can be used to handle K(ρ, µ), making use of the Cauchy-Schwarz inequality as follows
This implies that with P probability at least 1 -ǫ,
It is now convenient to remember that
We thus can write the previous inequality as
Let us assume now that θ̃ ∈ arg min_{Θ_1} R. Let us introduce θ̂ ∈ arg min_Θ r. Decomposing r′(θ, θ̃) = r′(θ, θ̂) + r′(θ̂, θ̃) and considering that m′(θ, θ̃) ≤ m′(θ, θ̂) + m′(θ̂, θ̃), we see that with P probability at least 1 - ǫ, for any posterior distribution ρ :
Let us now define for simplicity the posterior ν : Ω → M 1 + (Θ) by the identity
Let us also introduce the random bound
Theorem 1.4.12. Using the above notation, for any real constants 0 ≤ β < λ < 1, for any prior distribution π ∈ M 1 + (Θ), for any subset Θ 1 ⊂ Θ, with P probability at least 1 -ǫ, for any posterior distribution ρ :
Therefore,
Let us define the posterior ν by the identity
.
This inequality is a special case of log π exp(g) -log π exp(h)
which is a consequence of the convexity of α → log π exp h + α(g -h) .
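The convexity of α → log π[exp(h + α(g - h))] can be explored numerically on a finite parameter set. Since a convex function has increments bounded below by its derivative at the left endpoint and above by its derivative at the right endpoint, the difference log π[exp(g)] - log π[exp(h)] is sandwiched between π_{exp(h)}(g - h) and π_{exp(g)}(g - h). The sketch below checks this two-sided form, which is what convexity yields; the prior and the functions g and h are hypothetical.

```python
import math

def log_pi_exp(prior, f):
    """log of the prior expectation of exp(f) on a finite parameter set."""
    return math.log(sum(p * math.exp(fi) for p, fi in zip(prior, f)))

def gibbs_mean(prior, f, u):
    """Expectation of u under the distribution with density ~ exp(f) w.r.t. prior."""
    w = [p * math.exp(fi) for p, fi in zip(prior, f)]
    return sum(wi * ui for wi, ui in zip(w, u)) / sum(w)

prior = [0.25, 0.25, 0.25, 0.25]
g = [-1.0, 0.5, 0.2, -0.3]
h = [0.1, -0.4, 0.7, 0.0]
diff = [gi - hi for gi, hi in zip(g, h)]

gap = log_pi_exp(prior, g) - log_pi_exp(prior, h)
lower = gibbs_mean(prior, h, diff)   # derivative of the convex map at alpha = 0
upper = gibbs_mean(prior, g, diff)   # derivative of the convex map at alpha = 1
```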
Let us introduce as previously ϕ
These functions can be used to produce a result which is slightly weaker, but maybe easier to read and understand. Indeed, we see that, for any x ∈ R + , with P probability at least 1 -ǫ, for any posterior distribution ρ,
Theorem 1.4.13. With the previous notation, for any real constants 0 ≤ β < λ < 1, for any positive real constant x, for any prior probability distribution π ∈ M 1 + (Θ), for any subset Θ 1 ⊂ Θ, with P probability at least 1-ǫ, for any posterior distribution
the following bounds hold true:
Let us remark that this alternative way of handling relative deviation bounds made it possible to carry on with non-linear bounds up to the final result. For instance, if λ = 0.5, β = 0.2 and B(ρ) = 0.1, the non-linear bound gives ρ(R) - inf_{Θ_1} R ≤ 0.096.
Comparing posterior distributions to Gibbs priors
We now come to an approach to relative bounds whose performance can be analysed with PAC-Bayesian tools.
The empirical bounds at the end of the previous chapter involve taking suprema in θ ∈ Θ, and replacing the expected margin function ϕ with some empirical counterparts ϕ or ϕ, which may prove unsafe when using very complex classification models.
We are now going to focus on the control of the divergence K ρ, π exp(-βR) . It is already obvious, we hope, that controlling this divergence is the crux of the matter, and that it is a way to upper bound the mutual information between the training sample and the parameter, which can be expressed as K ρ, P(ρ) = K ρ, π exp(-βR) -K P(ρ), π exp(-βR) , as explained on page 14.
Through the identity
we see that the control of this divergence is related to the control of the difference ρ(R) -π exp(-βR) (R). This is the route we will follow first.
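The link between the divergence and the risk difference can be checked on a finite parameter set. The sketch below assumes that the identity in question is the standard decomposition K(ρ, π_{exp(-βR)}) = β[ρ(R) - π_{exp(-βR)}(R)] + K(ρ, π) - K(π_{exp(-βR)}, π), which follows from writing out the density of the Gibbs distribution; the prior, risk values and posterior are hypothetical.

```python
import math

def kl(p, q):
    """Kullback divergence K(p, q) between two finite distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def gibbs(prior, risk, beta):
    """Distribution with density proportional to exp(-beta * risk) w.r.t. prior."""
    w = [p * math.exp(-beta * r) for p, r in zip(prior, risk)]
    z = sum(w)
    return [wi / z for wi in w]

def mean(dist, f):
    """Expectation of f under dist."""
    return sum(d * fi for d, fi in zip(dist, f))

prior = [0.4, 0.3, 0.2, 0.1]
risk = [0.45, 0.30, 0.25, 0.10]   # hypothetical values of R on four parameters
rho = [0.1, 0.2, 0.3, 0.4]        # hypothetical posterior distribution
beta = 5.0

g = gibbs(prior, risk, beta)
lhs = kl(rho, g)
rhs = beta * (mean(rho, risk) - mean(g, risk)) + kl(rho, prior) - kl(g, prior)
```

Both sides agree up to rounding, which shows how a bound on ρ(R) - π_{exp(-βR)}(R) translates into a bound on the divergence once K(ρ, π) is known.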
Thus comparing any posterior distribution with a Gibbs prior distribution will provide a first way to build an estimator which can be proved to reach adaptively the best possible asymptotic error rate under Mammen and Tsybakov margin assumptions and parametric complexity assumptions (at least as long as orders of magnitude are concerned, we will not discuss the question of asymptotically optimal constants).
Then we will provide an empirical bound for the Kullback divergence K ρ, π exp(-βR) itself. This will serve to address the question of model selection, which will be achieved by comparing the performance of two posterior distributions possibly supported by two different models. This will also provide a second way to build estimators which can be proved to be adaptive under Mammen and Tsybakov margin assumptions and parametric complexity assumptions (somewhat weaker than with the first method).
Finally, we will present two-step localization strategies, in which the performance of the posterior distribution to be analysed is compared with a two-step Gibbs prior.
Similarly to Theorem 1.4.3 (page 37) we can prove that for any prior distribution
Replacing π with π exp(-βR) and considering the posterior distribution ρ⊗π exp(-βR) , provides a starting point in the comparison of ρ with π exp(-βR) ; we can indeed state with P probability at least 1 -ǫ that
Using equation (2.1, page 51) to handle the entropy term, we get
We can then decompose in the right-hand side γ[ρ(r) - π_{exp(-βR)}(r)] into (γ - λ)[ρ(r) - π_{exp(-βR)}(r)] + λ[ρ(r) - π_{exp(-βR)}(r)] for some parameter λ to be set later on and use the fact that
to get rid of the appearance of the unobserved Gibbs prior π exp(-βR) in most places of the right-hand side of our inequality, leading to Theorem 2.1.1. For any real constants β and γ, with P probability at least 1 -ǫ, for any posterior distribution ρ :
We would like to have a fully empirical upper bound even in the case when λ = γ. This can be done by using the theorem twice. We will need a lemma.
Lemma 2.1.2 For any probability distribution π ∈ M 1 + (Θ), for any bounded measurable functions g, h : Θ → R,
which ends the proof.
For any positive real constants β and λ, we can then apply Theorem 2.1.1 to ρ = π exp(-λr) , and use the inequality
provided by the previous lemma. We thus obtain with P probability at least 1 -ǫ
Let us introduce the convex function
With P probability at least 1 -ǫ,
Since Theorem 2.1.1 holds uniformly for any posterior distribution ρ, we can apply it again to some arbitrary posterior distribution ρ. We can moreover make the result uniform in β and γ by considering some atomic measure ν ∈ M 1 + (R) on the real line and using a union bound. This leads to Theorem 2.1.3. For any atomic probability distribution on the positive real line ν ∈ M 1 + (R + ), with P probability at least 1 -ǫ, for any posterior distribution ρ : Ω → M 1 + (Θ), for any positive real constants β and γ,
where
where we have written for short ν(β) and ν(γ) instead of ν({β}) and ν({γ}).
Let us notice that B(ρ, β, γ) = +∞ when ν(β) = 0 or ν(γ) = 0; the uniformity in β and γ of the theorem therefore necessarily bears on a countable number of values of these parameters. We can typically choose distributions for ν such as the one used in Theorem 1.2.8 (page 13): namely we can put for some positive real ratio
or alternatively, since we are interested in values of the parameters less than N , we can prefer
We can also use such a coding distribution on dyadic numbers as the one defined by equation (1.7, page 15). Following the same route as for Theorem 1.3.15 (page 31), we can also prove the following result about the deviations under any posterior distribution ρ: Theorem 2.1.4 For any ǫ ∈)0, 1(, with P probability at least 1-ǫ, for any posterior distribution ρ :
The only tricky point is to justify that we can still take an infimum in λ 1 without using a union bound. To justify this, we have to notice that the following variant of Theorem 2.1.1 (page 52) holds: with P probability at least 1 -ǫ, for any posterior distribution ρ :
We leave the details as an exercise.
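To give an idea of the price of the uniformity in β and γ obtained through an atomic measure ν, here is a sketch using the uniform distribution on a geometric grid of values below N. The grid {α^k : k ≥ 0, α^k < N} is a hypothetical choice made for this illustration; the exact coding distributions of the text are not reproduced.

```python
import math

def geometric_grid(N, alpha):
    """Hypothetical grid of parameter values alpha**k, k >= 0, with alpha**k < N."""
    grid = []
    value = 1.0
    while value < N:
        grid.append(value)
        value *= alpha
    return grid

N, alpha = 1000, 2.0
grid = geometric_grid(N, alpha)

# Uniform atomic measure on the grid: each point gets mass 1 / len(grid).
mass = 1.0 / len(grid)

# The grid has at most log(alpha * N) / log(alpha) points, so each mass is at
# least log(alpha) / log(alpha * N).
mass_lower_bound = math.log(alpha) / math.log(alpha * N)

# Union bound over pairs (beta, gamma): additional term -log(mass) twice.
union_cost = -2.0 * math.log(mass)
union_cost_bound = 2.0 * math.log(math.log(alpha * N) / math.log(alpha))
```

The additional term grows only like log log N, so making the bound uniform over such a grid costs very little.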
Using the parametric approximation π_{exp(-αr)}(r) - inf_Θ r ≃ d_e/α, we get as an order of magnitude
Therefore, if the empirical dimension d_e stays bounded when N increases, we are going to obtain a negative upper bound for any values of the constants λ_1 > λ_2 > β, as soon as γ and N/γ are chosen to be large enough. This ability to obtain negative values for the bound B(π_{exp(-λ_1 r)}, γ, β), and more generally B(ρ, γ, β), leads the way to introducing the new concept of the effective temperature of an estimator.
) is continuous and strictly decreasing from ess sup π R to ess inf π R (as soon as these two bounds do not coincide). This shows that the effective temperature T (ρ) is a well-defined random variable. Theorem 2.1.3 provides a bound for T (ρ), indeed:
Proposition 2.1.5. Let
where B(ρ, β, γ) is as in Theorem 2.1.3 (page 54). Then with P probability at least 1 -ǫ, for any posterior distribution ρ :
This notion of effective temperature of a (randomized) estimator ρ is interesting for two reasons:
• the difference ρ(R) - π_{exp(-βR)}(R) can be estimated with better accuracy than ρ(R) itself, due to the use of relative deviation inequalities, leading to convergence rates up to 1/N in favourable situations, even when inf_Θ R is not close to zero;
• and of course π_{exp(-βR)}(R) is a decreasing function of β, thus being able to estimate ρ(R) - π_{exp(-βR)}(R) with some given accuracy means being able to discriminate between values of ρ(R) with the same accuracy, although doing so through the parametrization β → π_{exp(-βR)}(R), which can neither be observed nor estimated with the same precision!
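The notion of effective temperature can be illustrated on a finite parameter set where R is known, which is of course never the case in practice. The sketch below assumes that the effective inverse temperature of ρ is the value of β solving π_{exp(-βR)}(R) = ρ(R), and finds it by bisection using the fact that β → π_{exp(-βR)}(R) is decreasing. The prior, the risk values, the posterior and the search bracket are hypothetical.

```python
import math

def gibbs_risk(prior, risk, beta):
    """Expected risk under the Gibbs distribution with density ~ exp(-beta * R)."""
    exponents = [-beta * r for r in risk]
    shift = max(exponents)  # subtracted for numerical stability
    w = [p * math.exp(e - shift) for p, e in zip(prior, exponents)]
    return sum(wi * r for wi, r in zip(w, risk)) / sum(w)

def effective_beta(target, prior, risk, lo=-200.0, hi=200.0, iters=200):
    """Bisection for the beta such that gibbs_risk(prior, risk, beta) = target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gibbs_risk(prior, risk, mid) > target:
            lo = mid   # risk still too large: increase beta
        else:
            hi = mid
    return 0.5 * (lo + hi)

prior = [0.25, 0.25, 0.25, 0.25]
risk = [0.1, 0.2, 0.3, 0.5]
rho = [0.6, 0.3, 0.1, 0.0]
rho_risk = sum(p * r for p, r in zip(rho, risk))

beta_eff = effective_beta(rho_risk, prior, risk)
```

Here ρ has a smaller expected risk than the prior, so the effective inverse temperature is positive; a posterior worse than the prior would get a negative one.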
We are now going to launch into a mathematically rigorous analysis of the bound B(π exp(-λ1r),β,γ ) provided by Theorem 2.1.3 (page 54), to show that inf ρ∈M 1
π exp[-β(ρ)R] (R) converges indeed to inf Θ R at some optimal rate in favourable situations.
It is more convenient for this purpose to use deviation inequalities involving M ′ rather than m ′ . It is straightforward to extend Theorem 1.4.2 (page 36) to Theorem 2.1.6. For any real constants β and γ, for any prior distributions π, µ ∈ M 1 + (Θ), with P probability at least 1 -η, for any posterior distribution ρ :
In order to transform the left-hand side into a linear expression and at the same time localize this theorem, let us choose µ defined by its density
where C is such that µ(Θ) = 1. We get
Thus with P probability at least 1 -η,
Remarking that
we deduce from the previous inequality Theorem 2.1.7. For any real constants β and γ, with P probability at least 1 -η, for any posterior distribution ρ :
We can also go into a slightly different direction, starting back again from equation (2.6, page 57) and remarking that for any real constant λ,
Theorem 2.1.8. For any real constants β and γ, with P probability at least 1 -η, for any real constant λ,
where the definition of C(β, γ) is given by equation (2.6, page 57).
We can now use this inequality in the case when ρ = π exp(-λr) and combine it with Inequality (2.5, page 53) to obtain Theorem 2.1.9 For any real constants β and γ, with P probability at least 1 -η, for any real constant λ,
We deduce from this theorem Proposition 2.1.10 For any real positive constants β 1 , β 2 and γ, with P probability at least 1 -η, for any real constants λ 1 and λ 2 , such that
.
Moreover, π exp(-β1R) and π exp(-β2R) being prior distributions, with P probability at least 1 -η, γ π exp(-β1R) (r) -π exp(-β2R) (r)
. Hence Proposition 2.1.11 For any positive real constants β 1 , β 2 and γ, with P probability at least 1 -η, for any positive real constants λ 1 and λ 2 such that
.
In order to complete the analysis of the bound B(π exp(-λ1r) , β, γ) given by Theorem 2.1.3 (page 54), it now remains to bound quantities of the general form
Let us consider the prior distribution µ ∈ M 1 + (Θ × Θ) on couples of parameters defined by the density
where the normalizing constant C is such that µ(Θ×Θ) = 1. Since for fixed values of the parameters θ and θ ′ ∈ Θ, m ′ (θ, θ ′ ), like r(θ), is a sum of independent Bernoulli random variables, we can easily adapt the proof of Theorem 1.1.4 on page 4, to establish that with P probability at least 1 -η, for any posterior distribution ρ and any real constant λ,
Thus for any real constant β and any positive real constants α and γ, with P probability at least 1 -η, for any real constant λ,
To finish, we need some appropriate upper bound for the entropy K ρ, π exp (-βR) . This question can be handled in the following way: using Theorem 2.1.7 (page 58), we see that for any positive real constants γ and β, with P probability at least 1 -η, for any posterior distribution ρ,
In other words, Theorem 2.1.12. For any positive real constants β and γ such that β < N sinh(γ/N), with P probability at least 1 -η, for any posterior distribution ρ :
, where the quantity C(β, γ) is defined by equation (2.6, page 57). Equivalently, it will be in some cases more convenient to use this result in the form: for any positive real constants λ and γ, with P probability at least 1 -η, for any posterior distribution
and
, we obtain with P probability at least 1 -η,
This proves Proposition 2.1.13. For any positive real constants λ < γ, with P probability at least 1 -η,
). We are now ready to analyse the bound B(π exp(-λ1r) , β, γ) of Theorem 2.1.3 (page 54).
Theorem 2.1.14. For any positive real constants λ 1 , λ 2 , β 1 , β 2 , β and γ, such that
γ tanh( γ N ), with P probability 1 -η, the bound B(π exp(-λ1r) , β, γ) of Theorem 2.1.3 (page 54) satisfies B(π exp(-λ1r) , β, γ)
where the function C(β, γ) is defined by equation (2.6, page 57).
To help understand the previous theorem, it may be useful to give linear upper bounds for the factors appearing in the right-hand side of the previous inequality.
Introducing θ such that R( θ) = inf Θ R (assuming that such a parameter exists) and remembering that
the last inequality being rather a consequence of the definition of ϕ than a property of M ′ , we easily see that
Let us push further the investigation under the parametric assumption that for some positive real constant d
This assumption will for instance hold true with d = n/2 when R : Θ → (0, 1) is a smooth function defined on a compact subset Θ of ℝ^n that reaches its minimum value on a finite number of non-degenerate (i.e. with a positive definite Hessian) interior points of Θ, and π is absolutely continuous with respect to the Lebesgue measure on Θ and has a smooth density.
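A one-dimensional numerical check may help to see where the value of d comes from. The sketch below reads the parametric assumption as a bound of order d/β on π_{exp(-βR)}(R) - inf_Θ R (an assumption of this sketch) and takes the hypothetical example R(θ) = 0.1 + θ²/2 on Θ = [-1, 1] with the uniform prior. By the Laplace approximation, β times the excess risk should then be close to n/2 = 1/2.

```python
import math

def scaled_excess_risk(beta, n_points=20001):
    """beta * (Gibbs expected risk - inf R) for R(theta) = 0.1 + theta**2 / 2.

    The prior is uniform on [-1, 1]; integrals are approximated by sums over
    a regular grid, whose common weight cancels in the ratio.
    """
    thetas = [-1.0 + 2.0 * k / (n_points - 1) for k in range(n_points)]
    excess = [0.5 * t * t for t in thetas]          # R(theta) - inf R
    w = [math.exp(-beta * e) for e in excess]       # Gibbs weights
    return beta * sum(wi * e for wi, e in zip(w, excess)) / sum(w)

scaled = {beta: scaled_excess_risk(beta) for beta in (50.0, 200.0)}
```

For large β the Gibbs distribution is essentially Gaussian with variance 1/β around the minimizer, which is why the scaled excess risk stabilizes near one half.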
In case of assumption (2.8), if we restrict ourselves to sufficiently large values of the constants β, β 1 , β 2 , λ 1 , λ 2 and γ (the smaller of which is as a rule β, as we will see), we can use the fact that for some (small) positive constant δ, and some (large) positive constant A,
Under this assumption,
Thus with P probability at least 1 -η,
, and let us introduce the notation
This simplifies to
This shows that there exist universal positive real constants A 1 , A 2 , B 1 , B 2 , B 3 , and B 4 such that as soon as γ max{x,1}
as soon as
.
Choosing some real ratio α > 1, we can now make the above result uniform for any
by substituting ν(β) and ν(γ) with log(α)/log(αN) and -log(η) with -log(η) + 2 log[log(αN)/log(α)]. Taking η = ǫ for simplicity, we can summarize our result in Theorem 2.1.15. There exist positive real universal constants A, B 1 , B 2 , B 3 and B 4 such that for any positive real constants α > 1, d and δ, for any prior distribution π ∈ M 1 + (Θ), with P probability at least 1 -ǫ, for any β, γ ∈ Λ α (where Λ α is defined by equation (2.10) above) such that
and such that also for some positive real parameter x γ max{x, 1}
, the bound B(π exp(-γ 2 r) , β, γ) given by Theorem 2.1.3 on page 54 in the case where we have chosen ν to be the uniform probability measure on Λ α , satisfies B(π exp(-γ 2 r) , β, γ) ≤ 0, proving that β(π exp(-γ 2 r) ) ≥ β and therefore that
What is important in this result is that we do not only bound π exp(-γ 2 r) (R), but also B(π exp(-γ 2 r) , β, γ), and that we do it uniformly on a grid of values of β and γ, showing that we can indeed set the constants β and γ adaptively using the empirical bound B(π exp(-γ 2 r) , β, γ). Let us see what we get under the margin assumption (1.24, page 39). When κ = 1, we have ϕ(c -1 ) ≤ 0, leading to Corollary 2.1.16. Assuming that the margin assumption (1.24, page 39) is satisfied for κ = 1, that R : Θ → (0, 1) is independent of N (which is the case for instance when P = P ⊗N ), and is such that
there are universal positive real constants B 5 and B 6 and N 1 ∈ N such that for any N ≥ N 1 , with P probability at least 1 -ǫ
where γ̂ ∈ arg max_{γ∈Λ_2} max{β ∈ Λ_2 ; B(π_{exp(-γ r/2)}, β, γ) ≤ 0}, where Λ_2 is defined by equation (2.10, page 66), and B is the bound of Theorem 2.1.3 (page 54).
κ-1 , and we can choose γ and x such that
Corollary 2.1.17. Assuming that the margin assumption (1.24, page 39) is satisfied for some exponent κ > 1, that R : Θ → (0, 1) is independent of N (which is for instance the case when P = P ⊗N ), and is such that
there are universal positive constants B 7 and B 8 and N 1 ∈ N such that for any N ≥ N 1 , with P probability at least 1 -ǫ,
, where γ̂ ∈ arg max_{γ∈Λ_2} max{β ∈ Λ_2 ; B(π_{exp(-γ r/2)}, β, γ) ≤ 0}, Λ_2 being defined by equation (2.10, page 66) and B by Theorem 2.1.3 (page 54).
We find the same rate of convergence as in Corollary 1.4.7 (page 40), but this time, we were able to provide an empirical posterior distribution π exp(-γ r
2 ) which achieves this rate adaptively in all the parameters (meaning in particular that we do not need to know d, c or κ). Moreover, as already mentioned, the power of N in this rate of convergence is known to be optimal in the worst case (see Mammen et al. (1999); Tsybakov (2004); Tsybakov et al. (2005), and more specifically in Audibert (2004b) -downloadable from its author’s web page -Theorem 3.3, page 132).
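The adaptive choice of the inverse temperature described in these corollaries is a simple search over a finite grid: for each γ, find the largest β in the grid for which the empirical bound is non-positive, and keep the γ for which this β is largest. The sketch below shows this selection logic only. The function `toy_bound` is a hypothetical stand-in with no relation to the actual bound B of Theorem 2.1.3, and the grid is a hypothetical set of powers of 2.

```python
def select_gamma(grid, bound):
    """Return (gamma_hat, beta_hat), where beta_hat is the largest beta in the
    grid with bound(gamma, beta) <= 0, maximized over gamma in the grid.
    Returns None when no pair (gamma, beta) satisfies the constraint."""
    best = None
    for gamma in grid:
        feasible = [beta for beta in grid if bound(gamma, beta) <= 0]
        if not feasible:
            continue
        beta = max(feasible)
        if best is None or beta > best[1]:
            best = (gamma, beta)
    return best

def toy_bound(gamma, beta):
    """Hypothetical bound, used only to exercise the selection rule."""
    return beta - gamma / 2.0 + gamma ** 2 / 256.0

grid = [2.0 ** k for k in range(10)]   # 1, 2, 4, ..., 512
selection = select_gamma(grid, toy_bound)
```

With this toy bound the rule selects γ = 64 and certifies β = 16: smaller values of γ certify less, and larger ones make the quadratic penalty dominate.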
Another interesting question is to estimate $K(\rho, \pi_{\exp(-\beta R)})$ using relative deviation inequalities. We follow here an idea first found in (Audibert, 2004b, page 93). Indeed, combining equation (2.3, page 52) with equation (2.1, page 51), we see that for any positive real parameters $\beta$ and $\lambda$, with $P$ probability at least $1 - \epsilon$, for any posterior distribution $\rho$ :
We thus obtain
Theorem 2.1.18. For any positive real constants β and γ such that β < N tanh(γ/N), with P probability at least 1 -ǫ, for any posterior distribution ρ :
This theorem provides another way of measuring over-fitting, since it gives an upper bound for $K\bigl(\pi_{\exp[-\beta \frac{\gamma}{N} \tanh(\frac{\gamma}{N})^{-1} r]}, \pi_{\exp(-\beta R)}\bigr)$. It may be used in combination with Theorem 1.2.6 (page 11) as an alternative to Theorem 1.3.7 (page 21). It will also be used in the next section.
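The constraint β < N tanh(γ/N) of Theorem 2.1.18 can be explored numerically. The following is a minimal sketch, not part of the original text: the function names `max_beta` and `gamma_for_beta` are hypothetical, and the inverse map simply uses the elementary identity atanh(tanh(x)) = x.

```python
import math

def max_beta(gamma: float, n: int) -> float:
    """Upper limit N * tanh(gamma / N) appearing in the constraint on beta."""
    return n * math.tanh(gamma / n)

def gamma_for_beta(beta: float, n: int) -> float:
    """Inverse map: the gamma for which N * tanh(gamma / N) equals beta."""
    if not 0.0 <= beta < n:
        raise ValueError("beta must lie in [0, N)")
    return n * math.atanh(beta / n)

if __name__ == "__main__":
    n = 1000  # sample size, chosen arbitrarily for illustration
    for gamma in (10.0, 100.0, 1000.0):
        # The admissible range for beta is (0, max_beta(gamma, n)).
        print(gamma, max_beta(gamma, n))
```

Since tanh(x) < x for x > 0, the limit is always below γ, and it is close to γ as long as γ/N is small.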
An alternative parametrization of the same result providing a simpler right-hand side is also useful:
Corollary 2.1.19. For any positive real constants β and γ such that β < γ, with P probability at least 1 -ǫ, for any posterior distribution ρ :
2.2. Playing with two posterior and two local prior distributions
Estimating the effective temperature of an estimator provides an efficient way to tune parameters in a model with parametric behaviour. On the other hand, it is not suited to choosing between different models, especially when they are nested: as we already saw in the case when Θ is a union of nested models, the prior distribution π exp(-βR) does not provide an efficient localization of the parameter in this case, in the sense that π exp(-βR) (R) does not go down to inf Θ R at the desired rate when β goes to +∞, so that one has to resort to partial localization.
Once some estimator (in the form of a posterior distribution) has been chosen in each sub-model, these estimators can be compared with one another with the help of the relative bounds that we will establish in this section. It is also possible to choose several estimators in each sub-model, in order to tune parameters at the same time (like the inverse temperature parameter if we decide to use Gibbs posterior distributions in each sub-model).
From equation (2.2, page 52) (slightly modified by replacing $\pi \otimes \pi$ with $\pi^1 \otimes \pi^2$), we easily obtain
Theorem 2.2.1. For any positive real constant $\lambda$, for any prior distributions $\pi^1, \pi^2 \in \mathcal{M}^1_+(\Theta)$, with $P$ probability at least $1 - \epsilon$, for any posterior distributions $\rho_1$ and $\rho_2$ :
This is where the entropy bound of the previous section enters into the game, providing a localized version of Theorem 2.2.1 (page 69). We will use the notation
q, a, q ∈ R.
Theorem 2.2.2. For any $\epsilon \in (0, 1)$, any sequence of prior distributions $(\pi^i)_{i \in \mathbb{N}} \in \mathcal{M}^1_+(\Theta)^{\mathbb{N}}$, any probability distribution $\mu$ on $\mathbb{N}$, any atomic probability distribution $\nu$ on $\mathbb{R}_+$, with $P$ probability at least $1 - \epsilon$, for any posterior distributions $\rho_1, \rho_2 : \Omega \to \mathcal{M}^1_+(\Theta)$,
, where
The sequence of prior distributions (π i ) i∈N should be understood to be typically supported by subsets of Θ corresponding to parametric sub-models, that is submodels for which it is reasonable to expect that lim β→+∞
exists and is positive and finite. As there is no reason why the bound B(ρ 1 , ρ 2 ) provided by the previous theorem should be sub-additive (in the sense that B(ρ 1 , ρ 3 ) ≤ B(ρ 1 , ρ 2 )+B(ρ 2 , ρ 3 )), it is adequate to consider some workable subset P of posterior distributions (for instance the distributions of the form π i exp(-βr) , i ∈ N, β ∈ R + ), and to define the sub-additive chained bound
Proposition 2.2.3. With P probability at least 1 -ǫ, for any posterior distributions
. Moreover for any posterior distribution ρ 1 ∈ P, any posterior distribution ρ 2 ∈ P such that B(ρ 1 , ρ 2 ) = inf ρ3∈P B(ρ 1 , ρ 3 ) is unimprovable with the help of B in P in the sense that inf ρ3∈P B(ρ 2 , ρ 3 ) ≥ 0.
Proof. The first assertion is a direct consequence of the previous theorem, so only the second assertion requires a proof: for any ρ 3 ∈ P, we deduce from the optimality of ρ 2 and the sub-additivity of B that
This proposition provides a way to improve a posterior distribution ρ 1 ∈ P by choosing ρ 2 ∈ arg min ρ∈P B(ρ 1 , ρ) whenever B(ρ 1 , ρ 2 ) < 0. This improvement is proved by Proposition 2.2.3 to be one-step: the obtained improved posterior ρ 2 cannot be improved again using the same technique.
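The one-step property can be checked on a finite working set. The sketch below is an illustration under stated assumptions rather than the author's construction: it assumes that the chained bound is the smallest sum of pairwise bounds along any chain of posteriors in P (which makes it sub-additive by construction), and that the pairwise bounds are given as a matrix `B`, where `B[i][j]` bounds ρ j (R) - ρ i (R). Under that assumption the chained bound is a min-plus closure, computable with the Floyd-Warshall recursion. The function names are hypothetical.

```python
from typing import List

def chained_bound(B: List[List[float]]) -> List[List[float]]:
    """Smallest sum of B along any chain i -> ... -> j (min-plus closure).

    Assumes there is no chain returning to its start with a negative total,
    which holds whenever every entry of B is a valid upper bound.
    """
    m = len(B)
    chained = [row[:] for row in B]
    for i in range(m):
        chained[i][i] = 0.0  # convention: the bound of rho against itself is 0
    for k in range(m):
        for i in range(m):
            for j in range(m):
                via_k = chained[i][k] + chained[k][j]
                if via_k < chained[i][j]:
                    chained[i][j] = via_k
    return chained

def one_step_improvement(chained: List[List[float]], start: int) -> int:
    """Index minimizing the chained bound from `start`, or `start` if none is negative."""
    best = min(range(len(chained)), key=lambda j: chained[start][j])
    return best if chained[start][best] < 0 else start

if __name__ == "__main__":
    B = [[0.0, 0.3, -0.1],
         [0.2, 0.0, -0.5],
         [0.4, 0.6, 0.0]]
    c = chained_bound(B)
    first = one_step_improvement(c, 0)
    # Applying the same step again from the improved posterior changes nothing.
    print(first, one_step_improvement(c, first))
```

Running the improvement step a second time from the selected posterior returns the same index, which is the content of the second assertion of Proposition 2.2.3.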
Let us give some examples of possible starting distributions ρ 1 for this improvement scheme: ρ 1 may be chosen as the best posterior Gibbs distribution according to Proposition 2.1.5 (page 56). More precisely, we may build from the prior distributions π i , i ∈ N, a global prior π = Σ i∈N µ(i)π i . We can then define the estimator of the inverse effective temperature as in Proposition 2.1.5 (page 56) and choose ρ 1 ∈ arg min ρ∈P β(ρ), where P is as suggested above the set of posterior distributions P = {π i exp(-βr) ; i ∈ N, β ∈ R + }. This starting point ρ 1 should already be pretty good, at least in an asymptotic perspective, the only gain in the rate of convergence to be expected bearing on spurious log(N ) factors.
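On a finite parameter set, both the global prior and the Gibbs posteriors making up P are easy to compute. The following sketch assumes a finite Θ indexed by list position and priors given as lists of weights; the function names are hypothetical, and the weights are normalized in log space only for numerical stability.

```python
import math
from typing import List, Sequence

def mixture_prior(mu: Sequence[float], priors: Sequence[Sequence[float]]) -> List[float]:
    """Global prior sum_i mu(i) * pi_i on a common finite parameter set."""
    size = len(priors[0])
    return [sum(m * p[k] for m, p in zip(mu, priors)) for k in range(size)]

def gibbs_posterior(prior: Sequence[float], emp_error: Sequence[float],
                    beta: float) -> List[float]:
    """Weights proportional to prior[k] * exp(-beta * emp_error[k])."""
    logs = [(math.log(p) - beta * r) if p > 0 else -math.inf
            for p, r in zip(prior, emp_error)]
    top = max(logs)  # subtract the maximum before exponentiating
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    # Two hypothetical sub-model priors on a four-point parameter set.
    pi = mixture_prior([0.5, 0.5], [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]])
    r = [0.1, 0.2, 0.3, 0.4]  # made-up empirical error rates
    for beta in (0.0, 10.0, 200.0):
        print(beta, gibbs_posterior(pi, r, beta))
```

At β = 0 the posterior coincides with the prior, and as β grows it concentrates on the parameters with the smallest empirical error.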
More elaborate uses of relative bounds are described in the third section of the second chapter of Audibert (2004b), where an algorithm is proposed and analysed, which allows one to use relative bounds between two posterior distributions as a stand-alone estimation tool.
Let us give here an alternative way to address this issue. We will assume for simplicity and without great loss of generality that the working set of posterior distributions P is finite (so that among other things any ordering of it has a first element).
It is natural to define the estimated complexity of any given posterior distribution ρ ∈ P in our working set as the bound for inf i∈N K(ρ, π i ) used in Theorem 2.2.1 (page 69). This leads to set (given some confidence level 1 -ǫ)
Let us moreover call γ(ρ), β(ρ) and i(ρ) the values achieving this infimum, or nearly achieving it, which requires a slight change of the definition of C(ρ) to take this modification into account. For the sake of simplicity, we can assume without substantial loss of generality that the supports of ν and µ are large but finite, and thus that the minimum is reached.
To understand how this notion of complexity comes into play, it may be interesting to keep in mind that for any posterior distributions ρ and ρ ′ we can write the bound in Theorem 2.2.2 (page 69) as
where
(Let us recall that the function Ξ is defined by equation (2.11, page 69).) Thus for any ρ, ρ ′ such that B(ρ ′ , ρ) > 0, we can deduce from the monotonicity of Ξ
proving that the left-hand side is small, and consequently that B(ρ, ρ ′ ) and its chained counterpart defined by equation (2.12, page 70) are small:
It is also worth noticing that B(ρ, ρ ′ ) and B(ρ, ρ ′ ) are upper bounded in terms of variance and complexity only.
The presence of the ratios γ(ρ)/β(ρ) should not be a concern, since their values should be automatically tamed by the fact that β(ρ) and γ(ρ) should make the estimate of the complexity of ρ optimal.
As an alternative, it is possible to restrict to the set of parameter values β and γ such that, for some fixed constant ζ > 1, the ratio γ/β is bounded away from 1 by the inequality γ/β ≥ ζ. This leads to an alternative definition of C(ρ):
We can even push simplification a step further, postponing the optimization of the ratio γ/β, and setting it to the fixed value ζ. This leads us to adopt the definition
With either of these modified definitions of the complexity C(ρ), we get the upper bound
With these definitions, we have for any posterior distributions ρ and ρ ′
Consequently in the case when B(ρ ′ , ρ) > 0, we get
To select some nearly optimal posterior distribution in P, it is appropriate to order the posterior distributions of P according to increasing values of their complexity C(ρ) and consider some indexation
Let us now consider for each ρ k ∈ P the first posterior distribution in P which cannot be proved to be worse than ρ k according to the bound B:
In this definition, which uses the chained bound defined by equation (2.12, page 70), it is appropriate to assume by convention that $\tilde B(\rho, \rho) = 0$, for any posterior distribution $\rho$. Let us now define our estimated best posterior in $\mathcal{P}$ as $\rho_{\hat k}$, where
(2.17) $\hat k = \min(\arg\max t)$.
Thus we take the posterior with smallest complexity which can be proved to be better than the largest starting interval of P in terms of estimated relative classification error.
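The selection rule can be sketched on a finite working set as follows. This is an illustration built from the verbal description above, not a transcription of equation (2.16): it assumes that the posteriors are already indexed by increasing complexity, that `chained[j][k]` is the chained bound for ρ k (R) - ρ j (R) with zero diagonal, and that t(k) is set to M (one past the last index, in 0-based numbering) when no entry of column k is positive. The function names are hypothetical.

```python
from typing import List

def first_not_improved(chained: List[List[float]], k: int) -> int:
    """t(k): smallest j with chained[j][k] > 0, or M if there is none."""
    m = len(chained)
    for j in range(m):
        if chained[j][k] > 0:
            return j
    return m

def select_index(chained: List[List[float]]) -> int:
    """Smallest index k (hence smallest complexity) maximizing t(k)."""
    t = [first_not_improved(chained, k) for k in range(len(chained))]
    return t.index(max(t))  # list.index returns the first maximizer

if __name__ == "__main__":
    chained = [[0.0, 0.3, -0.2],
               [-0.1, 0.0, -0.5],
               [0.4, 0.6, 0.0]]
    print(select_index(chained))
```

When no posterior can be proved better than any other (all off-diagonal entries positive), the rule falls back on index 0, the posterior with the smallest complexity.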
The following theorem is a simple consequence of the chosen optimisation scheme. It is valid for any arbitrary choice of the complexity function ρ → C(ρ).
Theorem 2.2.4. Let us put $\hat t = t(\hat k)$, where $t$ is defined by equation (2.16) and $\hat k$ is defined by equation (2.17). With $P$ probability at least $1 - \epsilon$,
where the chained bound $\tilde B$ is defined from the bound of Theorem 2.2.2 (page 69) by equation (2.12, page 70). In the meantime, for any $j$ such that $\hat t \le j < \hat k$, $t(j) < \hat t = \max t$, because $j \notin (\arg\max t)$. Thus
where the function $\Xi$ is defined by equation (2.11, page 69) and $S_\lambda$ is defined by equation (2.13, page 71). For any $j \in (\arg\max t)$ (including notably $\hat k$), $B(\rho_{\hat t}, \rho_j) \ge \tilde B(\rho_{\hat t}, \rho_j) > 0$,
Finally in the case when $j \in \{\hat k + 1, \dots, M\} \setminus (\arg\max t)$, due to the fact that in particular $j \notin (\arg\max t)$,
Thus in this last case
Thus for any $j = 1, \dots, M$, $\rho_{\hat k}(R) - \rho_j(R)$ is bounded from above by an empirical quantity involving only variance and entropy terms of posterior distributions $\rho_\ell$ such that $\ell \le j$, and therefore such that $C(\rho_\ell) \le C(\rho_j)$. Moreover, these distributions $\rho_\ell$ are such that $\rho_\ell(r) - \rho_j(r)$ and $\rho_\ell(R) - \rho_j(R)$ have an empirical upper bound of the same order as the bound stated for $\rho_{\hat k}(R) - \rho_j(R)$: the bound for $\rho_\ell(r) - \rho_j(r)$ is in all circumstances not greater than $\Xi^{-1}_{\lambda/N}$ applied to the bound stated for $\rho_{\hat k}(R) - \rho_j(R)$, whereas the bound for $\rho_\ell(R) - \rho_j(R)$ is always smaller than two times the bound stated for $\rho_{\hat k}(R) - \rho_j(R)$. This shows that the variance terms are taken between posterior distributions whose empirical as well as expected error rates cannot be much larger than those of $\rho_j$.
Let us remark that the estimation scheme described in this theorem is very general: the same method can be used as soon as some confidence interval for the relative expected risks
is available. The definition of the complexity is arbitrary, and could in an abstract context be chosen as
Proof. The case when $1 \le j < \hat t$ is straightforward from the definitions: when $j < \hat t$, $\tilde B(\rho_j, \rho_{\hat k}) \le 0$ and therefore $\rho_{\hat k}(R) \le \rho_j(R)$.
In the second case, that is when $\hat t \le j < \hat k$, $j$ cannot be in $\arg\max t$, because of the special choice of $\hat k$ in $\arg\max t$. Thus $t(j) < \hat t$ and we deduce from the first case that $\rho_{\hat k}(R) \le \rho_{t(j)}(R) \le \rho_j(R) + \tilde B(\rho_j, \rho_{t(j)})$.
Moreover, we see from the definition of $t$ that $\tilde B(\rho_{t(j)}, \rho_j) > 0$, implying
and therefore that
In the third case $j$ belongs to $\arg\max t$. In this case, we are not sure that $\tilde B(\rho_{\hat k}, \rho_j) > 0$, and it is appropriate to involve $\hat t$, which is the index of the first posterior distribution which cannot be improved by $\rho_{\hat k}$, implying notably that $\tilde B(\rho_{\hat t}, \rho_k) > 0$ for any $k \in \arg\max t$. On the other hand, $\rho_{\hat t}$ cannot improve any posterior distribution $\rho_k$ with $k \in (\arg\max t)$ either, because this would imply for any $\ell < \hat t$ that $\tilde B(\rho_\ell, \rho_{\hat t}) \le \tilde B(\rho_\ell, \rho_k) + \tilde B(\rho_k, \rho_{\hat t}) \le 0$, and therefore that $t(\hat t) \ge \hat t + 1$, in contradiction with the fact that $\hat t = \max t$. Thus $\tilde B(\rho_k, \rho_{\hat t}) > 0$, and these two remarks imply that
and consequently also that
the last inequality being due to the fact that $\Xi_{\lambda/N}$ is a concave function. Let us notice that it may be the case that $\hat k < \hat t$, but that only the case when $j \ge \hat t$ is to be considered, since otherwise we already know that $\rho_{\hat k}(R) \le \rho_j(R)$.
In the fourth case, $j$ is greater than $\hat k$, and the complexity of $\rho_j$ is larger than the complexity of $\rho_{\hat k}$. Moreover, $j$ is not in $\arg\max t$, and thus $\tilde B(\rho_{\hat k}, \rho_j) > 0$, because otherwise the sub-additivity of $\tilde B$ would imply that $\tilde B(\rho_\ell, \rho_j) \le 0$ for any $\ell < \hat t$ and therefore that $t(j) \ge \hat t = \max t$. Therefore
Let us start our investigation of the theoretical properties of the algorithm described in Theorem 2.2.4 (page 73) by computing some non-random upper bounds for B(ρ, ρ ′ ), the bound of Theorem 2.2.2 (page 69), and C(ρ), the complexity factor defined by equation (2.14, page 72), for any ρ, ρ ′ ∈ P. This analysis will be done in the case when
in which it will be possible to get some control on the randomness of any ρ ∈ P, in addition to controlling the other random expressions appearing in the definition of B(ρ, ρ ′ ), ρ, ρ ′ ∈ P. We will also use a simpler choice of complexity function, removing from equation (2.14 page 72) the optimization in i and β and using instead the definition
With this definition,
where S λ is defined by equation ( 2.13, page 71), so that
.
Let us successively bound the various random factors entering into the definition of $B(\pi^i_{\exp(-\beta r)}, \pi^j_{\exp(-\beta' r)})$. The quantity $\pi^j_{\exp(-\beta' r)}(r) - \pi^i_{\exp(-\beta r)}(r)$ can be bounded using a slight adaptation of Proposition 2.1.11 (page 59).
Proposition 2.2.5. For any positive real constants λ, λ ′ and γ, with P probability at least 1 -η, for any positive real constants β, β ′ such that
, where
As for $\pi^i_{\exp(-\beta r)} \otimes \pi^j_{\exp(-\beta' r)}(m')$, we can write with $P$ probability at least $1 - \eta$, for any posterior distributions $\rho$ and $\rho'$
We can then replace λ with β N λ sinh( λ N ) and use Theorem 2.1.12 (page 60) to get Proposition 2.2.6. For any positive real constants γ, λ, λ ′ , β and β ′ , with P probability 1 -η,
The last random factor in B(ρ, ρ ′ ) that we need to upper bound is
A slight adaptation of Proposition 2.1.13 (page 61) shows that with P probability at least 1 -η,
where as usual Φ is the function defined by equation (1.1, page 2). This leads us to define for any i, j ∈ N, any β,
Recall that the definition of C i (λ, γ) is to be found in Proposition 2.2.5, page 76. Let us remark that, since
we have
Let us put
Let us remark that
Let us define accordingly
Proposition 2.2.7.
• With P probability at least 1 -η, for any
It is also interesting to find a non-random lower bound for C(π i exp(-βr) ). Let us start from the fact that with P probability at least 1 -η,
On the other hand, we already proved that with P probability at least 1 -η,
Thus for any ξ > 0, putting β = αλ N tanh( λ N ) , with P probability at least 1 -η,
Taking ξ = βλ 2N , we get with P probability at least 1 -
and Υ(γ
this can be rewritten as
It is now tempting to simplify the picture a little bit by setting γ ′ = γ, leading to
Proposition 2.2.8. With P probability at least 1 -η, for any i ∈ N, any β ∈ R + ,
where C π i exp(-βr) is defined by equation ( 2.18, page 75).
We are now going to analyse Theorem 2.2.4 (page 73). For this, we will also need an upper bound for S λ (ρ, ρ ′ ), defined by equation ( 2.13, page 71), using M ′ and empirical complexities, because of the special relations between empirical complexities induced by the selection algorithm. To this purpose, a useful alternative to Proposition 2.2.6 (page 76) is to write, with P probability at least 1 -η,
and thus at least with P probability 1 -3η,
-log(η).
When ρ = π i exp(-βr) and ρ ′ = π j exp(-β ′ r) , we get with P probability at least 1 -η, for any β, β ′ , γ ∈ R + , any i, j ∈ N,
Proposition 2.2.9. With P probability at least 1 -η, for any ρ = π i exp(-βr) , any
In order to analyse Theorem 2.2.4 (page 73), we need to index P = ρ 1 , . . . , ρ M in order of increasing empirical complexity C(ρ). To deal in a convenient way with this indexation, we will write C(i, β) as C π i exp(-βr) , C(i, β) as C π i exp(-βr) , and S (i, β), (j, β ′ ) as S π i exp(-βr) , π j exp(-β ′ r) .
With $P$ probability at least $1 - \epsilon$, when $\hat t \le j < \hat k$, as we already saw,
where $i = t(j) < \hat t$. Therefore, with $P$ probability at least $1 - \epsilon - \eta$,
We can now remark that
Moreover, assuming as usual without substantial loss of generality that there exists $\tilde\theta \in \arg\min_\Theta R$, we can split $M'(\theta, \theta') \le M'(\theta, \tilde\theta) + M'(\tilde\theta, \theta')$. Let us then consider the expected margin function defined by
and let us write for any y ∈ R + ,
With $P$ probability at least $1 - \epsilon - \eta$, for any $\lambda, \gamma, x, y \in \mathbb{R}_+$, any $j \in \{\hat t, \dots, \hat k - 1\}$,
Now we have to get an upper bound for ρ j (R). We can write ρ j = π ℓ exp(-β ′ r) , as we assumed that all the posterior distributions in P are of this special form. Moreover, we already know from Theorem 2.1.8 (page 58) that with P probability at least 1 -η,
This proves that with P probability at least 1 -ǫ -2η,
The case when $j \in \{\hat k + 1, \dots, M\} \setminus (\arg\max t)$ is dealt with in exactly the same way, with $i = t(j)$ replaced directly with $\hat k$ itself, leading to the same inequality.
The case when $j \in (\arg\max t)$ is dealt with by bounding first $\rho_{\hat k}(R) - R(\tilde\theta)$ in terms of $\rho_{\hat t}(R) - R(\tilde\theta)$, and the latter in terms of $\rho_j(R) - R(\tilde\theta)$. Let us put (2.20) where $C(\rho_j) = C(\ell, \beta')$ is defined, when $\rho_j = \pi^\ell_{\exp(-\beta' r)}$, by equation (2.19, page 77). We obtain, still with $P$ probability $1 - \epsilon - 2\eta$,
The use of the factor $D(\lambda, \gamma, \rho_j)$ in the first of these two inequalities, instead of $D(\lambda, \gamma, \rho_{\hat t})$, is justified by the fact that $C(\rho_{\hat t}) \le C(\rho_j)$. Combining the two we get
Since it is the worst bound of all cases, it holds for any value of j, proving
Theorem 2.2.10. With P probability at least 1 -ǫ -2η,
where the notation A(λ, γ), B(λ, γ) and D(λ, γ, ρ) is defined by equation (2.20, page 84) and where the notation C i (β, γ) is defined in Proposition 2.2.5 (page 76).
The bound is a little involved, but as we will prove next, it gives the same rate as Theorem 2.1.15 (page 66) and its corollaries, when we work with a single model (meaning that the support of µ is reduced to one point) and the goal is to choose adaptively the temperature of the Gibbs posterior, except for the appearance of the union bound factor -log ν(β) which can be made of order log log(N ) without spoiling the order of magnitude of the bound.
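The order of magnitude of the union bound factor can be illustrated with a concrete grid. The sketch below rests on assumptions that are not taken from the text: it uses a geometric grid of inverse temperatures between 1 and N and takes ν uniform on that grid, so that -log ν(β) is the logarithm of the number of grid points. The function names and the ratio 2 are hypothetical choices.

```python
import math
from typing import List

def geometric_grid(n: float, ratio: float = 2.0) -> List[float]:
    """Inverse temperatures 1, ratio, ratio^2, ... below n, followed by n itself."""
    grid = []
    beta = 1.0
    while beta < n:
        grid.append(beta)
        beta *= ratio
    grid.append(float(n))
    return grid

def union_bound_cost(grid: List[float]) -> float:
    """-log nu(beta) when nu is the uniform distribution on the grid."""
    return math.log(len(grid))

if __name__ == "__main__":
    for n in (10**3, 10**6, 10**9):
        grid = geometric_grid(n)
        # The grid has about log2(n) points, so the cost grows like log log n.
        print(n, len(grid), union_bound_cost(grid), math.log(math.log(n)))
```

Because the number of grid points grows like log N, the price paid for selecting β from data stays of order log log(N).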
We will also cover the case when one must choose between several parametric models. Let us assume that each $\pi^i$ is supported by some measurable parameter subset $\Theta_i$ (meaning that $\pi^i(\Theta_i) = 1$), and that the behaviour of $\pi^i$ is parametric in the sense that there exists a dimension $d_i \in \mathbb{R}_+$ such that (2.21) sup
Then
Thus
In the same way,
In order to keep the right order of magnitude while simplifying the bound, let us consider
Then, for any β ∈ (0, β max ),
If we are not seeking tight constants, we can take for the sake of simplicity λ = γ = β, x = y and ζ = 2.
Let us put
This leads to
We see in this expression that, in order to balance the various factors depending on x it is advisable to choose x such that
as long as x ≤ N/(4 C 2 β).
Following Mammen and Tsybakov, let us assume that the usual margin assumption holds: for some real constants c > 0 and κ ≥ 1,
As D(θ, θ ) ≥ M ′ (θ, θ ), this also implies the weaker assumption
which we will really need and use. Let us take β max = N and
Then, as we have already seen, ϕ(x) ≤ (1-κ -1 ) κcx
Using the fact that when r ∈ (0, 1/2), ((1+r)/(1-r))² ≤ 1+16r ≤ 9, we get with P probability at least 1 -ǫ, for any β ∈ supp ν, in the case when
and in the case when x = x 2 ≤ x 1 ,
Thus with P probability at least 1 -ǫ,
Theorem 2.2.11. With probability at least 1 -ǫ, for any i ∈ N,
where $C_2$, given by equation (2.23, page 86), will in most cases be close to 1, and in any case less than 3.2.
This result gives a bound of the same form as that given in Theorem 2.1.15 (page 66) in the special case when there is only one model, that is when $\mu$ is a Dirac mass, for instance $\mu(1) = 1$, implying that $R(\tilde\theta_1) - R(\tilde\theta) = 0$. Moreover the parametric complexity assumption we made for this theorem, given by equation (2.21, page 85), is weaker than the one used in Theorem 2.1.15 and described by equation (2.8, page 63). When there is more than one model, the bound shows that the estimator makes a trade-off between model accuracy, represented by $\inf_{\Theta_i} R - R(\tilde\theta)$, and dimension, represented by $d_i$, and that for optimal parametric sub-models, meaning those for which $\inf_{\Theta_i} R = \inf_\Theta R$, the estimator does at least as well as the minimax optimal convergence speed in the best of these.
Another point is that we obtain more explicit constants than in Theorem 2.1.15. It is also clear that a more careful choice of parameters could have brought some improvement in the value of these constants.
These results show that the selection scheme described in this section is a good candidate to perform temperature selection of a Gibbs posterior distribution built within a single parametric model in a rate optimal way, as well as a proposal with proven performance bound for model selection.
Let us reconsider the case where we want to choose adaptively among a family of parametric models. Let us thus assume that the parameter set is a disjoint union of measurable sub-models, so that we can write Θ = ⊔ m∈M Θ m , where M is some measurable index set. Let us choose some prior probability distribution on the index set µ ∈ M 1 + (M ), and some regular conditional prior distribution π :
Let us then study some arbitrary posterior distributions $\nu : \Omega \to \mathcal{M}^1_+(M)$ and $\rho : \Omega \times M \to \mathcal{M}^1_+(\Theta)$, such that $\rho(\omega, i, \Theta_i) = 1$, $\omega \in \Omega$, $i \in M$. We would like to compare $\nu\rho(R)$ with some doubly localized prior distribution $\mu_{\exp[-\frac{\beta}{1+\zeta_2} \pi_{\exp(-\beta R)}(R)]} \pi_{\exp(-\beta R)}(R)$ (where $\zeta_2$ is a positive parameter to be set as needed later on). To ease notation we will define two prior distributions (one being more precisely a conditional distribution) depending on the positive real parameters $\beta$ and $\zeta_2$, putting (2.24) $\bar\pi = \pi_{\exp(-\beta R)}$ and $\bar\mu$
Similarly to Theorem 1.4.3 on page 37 we can write for any positive real constants β and γ
and deduce, using Lemma 1.1.3 on page 4, that (2.25) P exp sup
This will be our starting point in comparing νρ(R) with µ π(R). However, obtaining an empirical bound will require some supplementary efforts. For each index of the model index set M , we can write in the same way
Integrating this inequality with respect to µ and using Fubini’s lemma for positive functions, we get
Note that µ(π ⊗ π) is a probability measure on M × Θ × Θ, whereas (µ π) ⊗ (µ π) considered previously is a probability measure on (M × Θ) × (M × Θ). We get as previously (2.26) P exp sup
Let us finally recall that
From equations (2.25), (2.26) and (2.28) we deduce
and
where the prior distribution µ π is defined by equation (2.24) on page 90 and depends on β and ζ 2 .
Let us put for short
We will use an entropy compensation strategy for which we need a couple of entropy bounds. We have according to Proposition 2.3.1, with P probability at least 1 -ǫ,
Thus, for any positive real constants β, γ and ζ i , i = 1, . . . , 5, with P probability at least 1 -ǫ, for any posterior distributions ν, ν 3 :
Adding these six inequalities and assuming that (2.29)
we find
where we have also used the fact (concerning the 11th line of the preceding inequalities) that
Let us now apply to π (we shall later do the same with µ) the following inequalities, holding for any random functions of the sample and the parameters h
When h and g are observable, and h is not too far from βr ≃ βR, this gives a way to replace π with a satisfactory empirical approximation. We will apply this method, choosing ρ 1 and ρ 5 such that µ π is replaced either with µρ 1 , when it comes from the first two inequalities or with µρ 5 otherwise, choosing ρ 2 such that νπ is replaced with νρ 2 and ρ 4 such that ν 3 π is replaced with ν 3 ρ 4 . We will do so because it leads to a lot of helpful cancellations. For those to happen, we need to choose ρ i = π exp(-λir) , i = 1, 2, 4, where λ 1 , λ 2 and λ 4 are such that (2.33) and to assume that (2.34) ζ 4 > ζ 3 .
We obtain that with P probability at least 1 -ǫ,
In order to obtain more cancellations while replacing µ by some posterior distribution, we will choose the constants such that λ 5 = λ 4 , which can be done by choosing (2.35)
We can now replace $\mu$ with $\mu_{\exp[-\xi_1 \rho_1(r) - \xi_4 \rho_4(r)]}$, where
Choosing moreover $\nu_3 = \mu_{\exp[-\xi_1 \rho_1(r) - \xi_4 \rho_4(r)]}$, to induce some more cancellations, we get
Theorem 2.3.2. Let us use the notation introduced above. For any positive real constants satisfying equations (2.29, page 92), (2.30, page 93), (2.31, page 93), (2.32, page 93), (2.33, page 93), (2.34, page 93), (2.35, page 94), (2.36, page 94), (2.37, page 94), with $P$ probability at least $1 - \epsilon$, for any posterior distribution $\nu : \Omega \to \mathcal{M}^1_+(M)$ and any conditional posterior distribution $\rho$ :
where B(ν, ρ, β)
This theorem can be used to find the largest value $\hat\beta(\nu\rho)$ of $\beta$ such that $B(\nu, \rho, \beta) \le 0$, thus providing an estimator for $\beta(\nu\rho)$ defined by $\nu\rho(R) = \bar\mu_{\beta(\nu\rho)} \bar\pi_{\beta(\nu\rho)}(R)$, where we have mentioned explicitly the dependence of $\bar\mu$ and $\bar\pi$ on $\beta$, the constant $\zeta_2$ staying fixed. The posterior distribution $\nu\rho$ may then be chosen to maximize $\hat\beta(\nu\rho)$ within some manageable subset of posterior distributions $\mathcal{P}$, thus gaining the assurance that $\nu\rho(R) \le \bar\mu_{\hat\beta(\nu\rho)} \bar\pi_{\hat\beta(\nu\rho)}(R)$, with the largest parameter $\hat\beta(\nu\rho)$ that this approach can provide. Maximizing $\hat\beta(\nu\rho)$ is supported by the fact that $\lim_{\beta \to +\infty} \bar\mu_\beta \bar\pi_\beta(R) = \operatorname{ess\,inf}_{\mu\pi} R$. However, there is no assurance (to our knowledge) that $\beta \mapsto \bar\mu_\beta \bar\pi_\beta(R)$ will be a decreasing function of $\beta$ all the way, although this may be expected to be the case in many practical situations.
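The behaviour of the doubly localized prior as β grows can be observed on a toy example. The sketch below is only an illustration: it assumes finite sub-models with known expected risks R (which are not observable in practice), reads the localization of µ as a Gibbs reweighting with inverse temperature β/(1+ζ 2 ) applied to the localized risk of each sub-model, and uses hypothetical function names and made-up risk values.

```python
import math
from typing import List, Sequence, Tuple

def gibbs(prior: Sequence[float], risk: Sequence[float], beta: float) -> List[float]:
    """Weights proportional to prior[k] * exp(-beta * risk[k])."""
    logs = [(math.log(p) - beta * x) if p > 0 else -math.inf
            for p, x in zip(prior, risk)]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def doubly_localized(mu: Sequence[float],
                     priors: Sequence[Sequence[float]],
                     risks: Sequence[Sequence[float]],
                     beta: float,
                     zeta2: float) -> Tuple[List[float], float]:
    """Return the localized model weights and the resulting average risk."""
    # Localize inside each sub-model with inverse temperature beta.
    local = [gibbs(p, x, beta) for p, x in zip(priors, risks)]
    model_risk = [sum(w * v for w, v in zip(q, x)) for q, x in zip(local, risks)]
    # Localize across sub-models with inverse temperature beta / (1 + zeta2).
    mu_bar = gibbs(mu, model_risk, beta / (1.0 + zeta2))
    return mu_bar, sum(m * v for m, v in zip(mu_bar, model_risk))

if __name__ == "__main__":
    mu = [0.5, 0.5]
    priors = [[0.5, 0.5], [1 / 3, 1 / 3, 1 / 3]]
    risks = [[0.3, 0.4], [0.2, 0.5, 0.25]]  # made-up expected risks
    for beta in (0.0, 10.0, 100.0, 2000.0):
        print(beta, doubly_localized(mu, priors, risks, beta, zeta2=1.0)[1])
```

In this toy setting the average risk approaches the smallest risk over all sub-models as β increases; the example does not establish monotonicity in β, which the text notes is not guaranteed in general.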
We can make the bound more explicit in several ways. One point of view is to put forward the optimal values of ρ and ν. We can thus remark that
This formula is better understood when thinking about the following upper bound for the two first lines in the expression of B(ν, ρ, β):
Chapter 2. Comparing posterior distributions to Gibbs priors
π exp(-αr) (r)dα -γρ 1 (r) .
Another approach to understanding Theorem 2.3.2 is to put forward ρ 0 = π exp(-λ0r) , for some positive real constant λ 0 < γ, noticing that
In the case when we want to select a single model m(ω), and therefore to set ν = δ m , the previous inequality suggests taking m ∈ arg min
In parametric situations where
resulting in a linear penalization of the empirical dimension of the models.
We will not state a formal result, but will nevertheless give some hints about how to establish one. This is a rather technical section, which can be skipped at a first reading, since it will not be used below. We should start from Theorem 1.4.2 (page 36), which gives a deterministic variance term. From Theorem 1.4.2, after a change of prior distribution, we obtain for any positive constants α 1 and α 2 , any prior distributions µ 1 and µ 2 ∈ M 1 + (M ), any conditional prior distributions π 1 and π 2 : M → M 1 + (Θ), with P probability at least 1 -η, for any posterior distributions ν 1 ρ 1 and ν 2 ρ 2 ,
Applying this to α 1 = 0, we get that
In the same way, to bound quantities of the form
where C 1 , C 2 , p 1 and p 2 are positive constants, and similar terms, we need to use inequalities of the type: for any prior distributions µ i π i , i = 1, 2, with P probability at least 1 -η, for any posterior distributions
We need also the variant: with P probability at least 1 -η, for any posterior distribution ν 1 : Ω → M 1 + (M ) and any conditional posterior distributions ρ 1 , ρ 2 :
We deduce that
We are then left with the need to bound entropy terms like K(ν 3 ρ 1 , µ 3 π 1 ), where we have the choice of µ 3 and π 1 , to obtain a useful bound. As could be expected, we decompose it into
Let us look after the second term first, choosing π 1 = π exp(-β1R) :
Thus, when the constraint λ 1 = β 1 α 2 /α 1 is satisfied,
We can further specialize the constants, choosing α 1 = N sinh(α 2 /N), so that
We can for instance choose α 2 = γ, α 1 = N sinh(γ/N) and
With the notation of Theorem 2.3.2, the constants being set as explained above, putting
More generally
In a similar way, let us now choose
Let us choose α 2 = γ, α 1 = N sinh(γ/N), and let us add some other entropy inequalities to get rid of π in a suitable way, the approach of entropy compensation being the same as that used to obtain the empirical bound of Theorem 2.3.2 (page 94). This results with P probability at least 1 -η in
where we have introduced a number of constants, assumed to be positive, that we will more precisely set to
We get with P probability at least 1 -η,
Let us choose the constants so that λ 1 = λ 7 = λ 9 , λ 4 = λ 6 = λ 8 , α 3 x 9 γ α1 = ξ 1 and α 3 x 8 γ α1 = ξ 4 . This is done by setting
The inequality λ 1 > ξ 1 is always satisfied. The inequality λ 4 > ξ 4 is required for the above choice of constants, and will be satisfied for a suitable choice of ζ 3 and ζ 4 . Under these assumptions, we obtain with P probability at least 1 -η
This proves Proposition 2.3.4. The constants being set as explained above, with P probability at least 1 -η, for any posterior distribution ν :
We will not go further, lest it become tedious, but we hope we have given sufficient hints to state informally that the bound B(ν, ρ, β) of Theorem 2.3.2 (page 94) is upper bounded with P probability close to one by a bound of the same flavour where the empirical quantities r and m ′ have been replaced with their expectations R and M ′ .
Here we work with a family of prior distributions described by a regular conditional prior distribution π : M → M 1 + (Θ), where M is some measurable index set. This family may typically describe a countable family of parametric models. In this case M = N, and each of the prior distributions π(i, .), i ∈ N satisfies some parametric complexity assumption of the type lim sup
Let us consider also a prior distribution µ ∈ M 1 + (M ) defined on the index set M . Our aim here will be to compare the performance of two given posterior distributions ν 1 ρ 1 and ν 2 ρ 2 , where ν 1 , ν 2 : Ω → M 1 + (M ), and where ρ 1 , ρ 2 : Ω × M → M 1 + (Θ). More precisely, we would like to establish a bound for (ν 1 ρ 1 -ν 2 ρ 2 )(R) which could be a starting point to implement a selection method similar to the one described in Theorem 2.2.4 (page 73). To this purpose, we can start with Theorem 2.2.1 (page 69), which says that with P probability at least 1 -ǫ,
where µ ∈ M 1 + (M ) and π : M → M 1 + (Θ) are suitably localized prior distributions to be chosen later on. To use these localized prior distributions, we need empirical bounds for the entropy terms K(ν i , µ) and ν i K(ρ i , π) , i = 1, 2.
Bounding ν K(ρ, π) can be done using the following generalization of Corollary 2.1.19 (page 68):
Corollary 2.3.5. For any positive real constants γ and λ such that γ < λ, for any prior distribution µ ∈ M 1 + (M ) and any conditional prior distribution π : M → M 1 + (Θ), with P probability at least 1 -ǫ, for any posterior distribution ν : Ω → M 1 + (M ), and any conditional posterior distribution ρ :
where
To apply this corollary to our case, we have to set
Let us also consider for some positive real constant β the conditional prior distribution π = π exp(-βR)
and the prior distribution
Let us see how we can bound, given any posterior distribution ν : Ω → M 1 + (M ), the divergence K(ν, µ). We can see that
Starting from the exponential inequality
and reasoning in the same way that led to Theorem 2.1.1 (page 52) in the simple case when we take in this theorem λ = γ, we get with P probability at least 1 -ǫ, that
In the meantime, using Theorem 2.2.1 (page 69) and Corollary 2.3.5 above, we see that with P probability at least 1 -2ǫ, for any conditional posterior distribution
Putting all this together, we see that with P probability at least 1 -3ǫ, for any
Replacing in the right-hand side of this inequality the unobserved prior distribution µ with the worst possible posterior distribution, we obtain
Theorem 2.3.6. For any positive real constants α, β, γ and λ, using the notation,
with P probability at least 1 -ǫ, for any posterior distribution ν :
This result is satisfactory, but at the same time hints at some possible improvement in the choice of the localized prior µ, which is here somewhat lacking a variance term. We will consider in the remainder of this section the use of
where ξ is some positive real constant and π = π exp(-βR) is some appropriate conditional prior distribution with positive real parameter β. With this new choice
We already know how to deal with the first factor α(νµ)π(R), since the computations we made to give it an empirical upper bound were valid for any choice of the localized prior distribution µ. Let us now deal with ξ(ν -µ)( π ⊗ π)(M ′ ). Since m ′ (θ, θ ′ ) is a sum of independent Bernoulli random variables, we can easily generalize the result of Theorem 1.1.4 (page 4) to prove that with P probability at least
In the same way, with P probability at least 1 -ǫ,
We would like now to replace ( π ⊗ π)(m ′ ) with an empirical quantity. In order to do this, we will use an entropy bound. Indeed for any conditional posterior distribution
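Taking for simplicity $\xi = 2N \log\cosh\left(\frac{\gamma}{N}\right)$ and noticing that $2N \log\cosh\left(\frac{\gamma}{N}\right) = -N \log\left(1 - \frac{\beta^2}{N^2}\right)$, we get Theorem 2.3.7. Let us put π = π exp(-βR) and π = π exp(-γr) , where γ is some arbitrary positive real constant and $\beta = N \tanh\left(\frac{\gamma}{N}\right)$, so that $\gamma = \frac{N}{2} \log\left(\frac{1 + \beta/N}{1 - \beta/N}\right)$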
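The relation between β and γ used here can be checked numerically. The sketch below only assumes the two formulas stated in the text, β = N tanh(γ/N) and 2N log cosh(γ/N) = -N log(1 - β²/N²); the function names and the values of N and γ are illustrative choices, not taken from the monograph.

```python
import math

def beta_from_gamma(gamma: float, n: int) -> float:
    """beta = N tanh(gamma / N), as in the text."""
    return n * math.tanh(gamma / n)

def gamma_from_beta(beta: float, n: int) -> float:
    """Inverse map: gamma = (N / 2) log((1 + beta/N) / (1 - beta/N))."""
    x = beta / n
    return 0.5 * n * math.log((1 + x) / (1 - x))

n, gamma = 1000, 300.0            # illustrative values only
beta = beta_from_gamma(gamma, n)

# Both sides of the identity quoted in the text.
lhs = 2 * n * math.log(math.cosh(gamma / n))
rhs = -n * math.log(1 - (beta / n) ** 2)
```

The identity follows from cosh² = 1 / (1 - tanh²), and the inverse map is the usual logarithmic form of the inverse hyperbolic tangent.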
As a consequence
Let us take for the sake of simplicity ζ = 2N log cosh( γ N ) , to get
This proves
It remains now only to replace in the right-hand side of this inequality µ with the worst possible posterior distribution to obtain Theorem 2.3.9. Let λ > γ > β, ζ, α and ξ be arbitrary positive real constants. Let us use the notation
Let us assume moreover that
With P probability at least 1 -ǫ, for any posterior distribution ν :
The interest of this theorem lies in the presence of a variance term in the localized posterior distribution µ, which with a suitable choice of parameters seems to be an interesting option in the case when there are nested models: in this situation there may be a need to prevent integration with respect to µ in the right-hand side from putting weight on wild oversized models with large variance terms. Moreover, the right-hand side being empirical, parameters can be, as usual, optimized from data using a union bound on a grid of candidate values.
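The optimization of parameters by a union bound on a grid can be sketched as follows. This is a minimal illustration, not the monograph's procedure: `toy_bound` is a hypothetical stand-in for an empirical bound that holds with probability at least 1 - ε for one fixed parameter value, and the grid is an arbitrary choice. With K candidates, each one is evaluated at level ε/K, so that all K bounds hold simultaneously with probability at least 1 - ε.

```python
import math
from typing import Callable, Sequence, Tuple

def best_on_grid(bound: Callable[[float, float], float],
                 grid: Sequence[float], eps: float) -> Tuple[float, float]:
    """Return (parameter, bound value) minimizing the bound over the grid.

    Each candidate is evaluated at confidence level eps / len(grid), so the
    union bound keeps the overall failure probability below eps.
    """
    eps_each = eps / len(grid)
    return min(((lam, bound(lam, eps_each)) for lam in grid), key=lambda t: t[1])

def toy_bound(lam: float, eps: float, r: float = 0.2,
              n: int = 1000, d: float = 50.0) -> float:
    # Hypothetical bound: empirical error + (complexity - log eps)/lam + lam/(8n).
    return r + (d - math.log(eps)) / lam + lam / (8 * n)

grid = [100 * 1.2 ** j for j in range(20)]   # geometric grid of candidates
lam_star, value = best_on_grid(toy_bound, grid, eps=0.01)
```

The price of uniformity over the grid is an additional log K in the complexity term, which is why geometric grids of moderate size are usually sufficient.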
If one is only interested in the general shape of the result, a simplified inequality as the one below may suffice: Corollary 2.3.10. For any positive real constants λ > γ > β, ζ, α and ξ, let us use the same notation as in Theorem 2.3.9 (page 108). Let us put moreover
Let us assume that A 1 < 1. With P probability at least 1 -ǫ, for any posterior distribution ν :
Chapter 3
Transductive PAC-Bayesian learning
3.1. Basic inequalities
In this chapter the observed sample $(X_i, Y_i)_{i=1}^{N}$ will be supplemented with a test or shadow sample $(X_i, Y_i)_{i=N+1}^{(k+1)N}$. This point of view, called transductive classification, was introduced by V. Vapnik. It may be justified in different ways.
On the practical side, one interest of the transductive setting is that it is often a lot easier to collect examples than it is to label them, so that it is not unrealistic to assume that we indeed have two training samples, one labelled and one unlabelled. It also covers the case when a batch of patterns is to be classified and we are allowed to observe the whole batch before issuing the classification.
On the mathematical side, considering a shadow sample proves technically fruitful. Indeed, when introducing the Vapnik-Cervonenkis entropy and Vapnik-Cervonenkis dimension concepts, as well as when dealing with compression schemes, the transductive setting is a useful detour, although the inductive setting is our final concern. In this second scenario, intermediate technical results involving the shadow sample are integrated with respect to unobserved random variables in a second stage of the proofs.
Let us now describe the changes to be made to the previous notation to adapt it to the transductive setting. The distribution P will be a probability measure on the canonical space $\Omega = (\mathcal{X} \times \mathcal{Y})^{(k+1)N}$, and
will be the canonical process on this space (that is, the coordinate process). Unless explicitly mentioned, the parameter k indicating the size of the shadow sample will remain fixed. Assuming the shadow sample size is a multiple of the training sample size is convenient without significantly restricting generality. For a while, we will use a weaker assumption than independence, assuming that P is partially exchangeable, since this is all we need in the proofs.
Definition 3.1.1. For i = 1, . . . , N , let $\tau_i : \Omega \to \Omega$ be defined for any
To see the connection with the previously defined generalization error rate, let us comment on the case when P is invariant by any permutation of any row, meaning that $P[h(\omega \circ s)] = P[h(\omega)]$ for all $s \in S(\{i + jN ; j = 0, \dots, k\})$ and all i = 1, . . . , N , where S(A) is the set of permutations of A, extended to {1, . . . , (k + 1)N } so as to be the identity outside of A. In other words, P is assumed to be invariant under any permutation which keeps the rows unchanged. In this case, if ρ is invariant by any permutation of any row of the shadow sample, meaning that ρ(ω
, meaning that the expectation can be taken on a restricted shadow sample of the same size as the observed sample. If moreover the rows are equidistributed, meaning that their marginal distributions are equal, then
. This means that under these quite commonly fulfilled assumptions, the expectation can be taken on a single new object to be classified; our study thus covers the case when only one of the patterns from the shadow sample is to be labelled and one is interested in the expected error rate of this single labelling. Of course, in the case when P is i.i.d. and ρ depends only on the training sample $(X_i, Y_i)_{i=1}^{N}$, we fall back on the usual criterion of performance
Using an obvious factorization, and considering for the moment a fixed value of θ and any partially exchangeable positive real measurable function λ : Ω → R + , we can compute the log-Laplace transform of r 1 under T , which acts like a conditional probability distribution:
where the function Φ λ N was defined by equation (1.1, page 2). Remarking that
We deduce from this lemma a result analogous to the inductive case:
Theorem 3.1.2. For any partially exchangeable positive real measurable function λ : Ω × Θ → R + , for any partially exchangeable posterior distribution π : Ω → M 1 + (Θ), P exp sup
The proof is deduced from the previous lemma, using the fact that π is partially exchangeable:
Introducing in the same way
we could prove along the same line of reasoning Theorem 3.1.3. For any real parameter λ, any θ ∈ Θ, any partially exchangeable posterior distribution π : Ω → M 1 + (Θ), P exp sup
where the function Ψ λ N was defined by equation (1.21, page 35).
Theorem 3.1.4. For any real constant γ, for any θ ∈ Θ, for any partially exchangeable posterior distribution π : Ω → M 1 + (Θ), P exp sup
This last theorem can be generalized to give Theorem 3.1.5. For any real constant γ, for any partially exchangeable posterior distributions π 1 , π 2 : Ω → M 1 + (Θ),
To conclude this section, we see that the basic theorems of transductive PAC-Bayesian classification have exactly the same form as the basic inequalities of inductive classification, Theorems 1.1.4 (page 4), 1.4.2 (page 36) and 1.4.3 (page 37) with R(θ) replaced with r(θ), r(θ) replaced with r 1 (θ) and M ′ (θ, θ ′ ) replaced with m(θ, θ ′ ).
Thus all the results of the first two chapters remain true under the hypotheses of transductive classification, with R(θ) replaced with r(θ), r(θ) replaced with r 1 (θ) and M ′ (θ, θ ′ ) replaced with m(θ, θ ′ ).
Consequently, in the case when the unlabelled shadow sample is observed, it is possible to improve on the Vapnik bounds to be discussed hereafter by using an explicit partially exchangeable posterior distribution π and resorting to localized or to relative bounds (in the case at least of unlimited computing resources, which of course may still be unrealistic in many real world situations, and with the caveat, to be recalled in the conclusion of this study, that for small sample sizes and comparatively complex classification models, the improvement may not be so decisive).
Let us notice also that the transductive setting, when experimentally available, has the advantage that
is observable in this context, providing an empirical upper bound for the difference r( θ) -ρ(r) for any non-randomized estimator θ and any posterior distribution ρ, namely r( θ) ≤ ρ(r) + ρ d(•, θ) .
Thus in the setting of transductive statistical experiments, the PAC-Bayesian framework provides fully empirical bounds for the error rate of non-randomized estimators θ : Ω → Θ, even when using a non-atomic prior π (or more generally a non-atomic partially exchangeable posterior distribution π), even when Θ is not a vector space and even when θ → R(θ) cannot be proved to be convex on the support of some useful posterior distribution ρ.
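The inequality r(θ̂) ≤ ρ(r) + ρ[d(•, θ̂)] quoted above can be illustrated on synthetic data. The sketch below assumes that d(θ, θ′) is the proportion of points of the extended sample on which the two classification rules disagree (the displayed definition is missing from this section, so this reading is an assumption). The labels, the predictions and the uniform posterior ρ are all made up for the illustration.

```python
import random

random.seed(0)
n_points, n_rules = 200, 5

# Synthetic labels on the extended sample (in practice partly unobserved).
labels = [random.randint(0, 1) for _ in range(n_points)]
# Hypothetical classification rules, given by their predictions on the sample.
preds = [[random.randint(0, 1) for _ in range(n_points)] for _ in range(n_rules)]

def error(p):
    """Error rate of a rule on the extended sample (needs the labels)."""
    return sum(a != y for a, y in zip(p, labels)) / n_points

def disagreement(p, q):
    """Proportion of patterns on which two rules differ (needs no labels)."""
    return sum(a != b for a, b in zip(p, q)) / n_points

rho = [1 / n_rules] * n_rules      # uniform posterior over the rules
theta_hat = preds[0]               # a non-randomized estimator

lhs = error(theta_hat)
rhs = sum(w * (error(p) + disagreement(p, theta_hat)) for w, p in zip(rho, preds))
```

The inequality lhs ≤ rhs holds pointwise, since a mistake of θ̂ on a pattern implies either a mistake of θ or a disagreement between θ and θ̂ on that pattern. The disagreement term only involves patterns, which is why it is observable when the unlabelled shadow sample is available.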
In this section, we will stick to plain unlocalized non-relative bounds. As we have already mentioned (and as was put forward by Vapnik himself in his seminal works), these bounds are not always superseded by the asymptotically better ones when the sample is of small size: they deserve all our attention for this reason. We will start with the general case of a shadow sample of arbitrary size. We will then discuss the case of a shadow sample of equal size to the training set and the case of a fully exchangeable sample distribution, showing how they can be taken advantage of to sharpen inequalities.
The great thing with the transductive setting is that we are manipulating only r 1 and r which can take only a finite number of values and therefore are piecewise constant on Θ. This makes it possible to derive inequalities that will hold uniformly for any value of the parameter θ ∈ Θ. To this purpose, let us consider for any value θ ∈ Θ of the parameter the subset ∆(θ) ⊂ Θ of parameters θ ′ such that the classification rule f θ ′ answers the same on the extended sample (X i )
We see immediately that ∆(θ) is an exchangeable parameter subset on which r 1 and r 2 and therefore also r take constant values. Thus for any θ ∈ Θ we may consider the posterior ρ θ defined by
and use the fact that ρ θ (r 1 ) = r 1 (θ) and ρ θ (r) = r(θ), to prove that Lemma 3.2.1. For any partially exchangeable positive real measurable function
and any partially exchangeable posterior distribution π : Ω → M 1 + (Θ), with P probability at least 1 -ǫ, for any θ ∈ Θ,
We can then remark that for any value of λ independent of ω, the left-hand side of the previous inequality is a partially exchangeable function of ω ∈ Ω. Thus this left-hand side is maximized by some partially exchangeable function λ, namely arg max
is partially exchangeable as depending only on partially exchangeable quantities. Moreover this choice of λ(ω, θ) satisfies also condition (3.1) stated in the previous lemma of being constant on ∆(θ), proving Lemma 3.2.2. For any partially exchangeable posterior distribution π : Ω → M 1 + (Θ), with P probability at least 1 -ǫ, for any θ ∈ Θ and any λ ∈ R + ,
Writing $r = \frac{r_1 + k r_2}{k+1}$ and rearranging terms, we obtain Theorem 3.2.3. For any partially exchangeable posterior distribution π : Ω → M 1 + (Θ), with P probability at least 1 - ε, for any θ ∈ Θ,
If we have a set of binary classification rules $\{f_\theta ; \theta \in \Theta\}$ whose Vapnik-Cervonenkis dimension is not greater than h, we can choose π such that $\pi[\Delta(\theta)]$ is independent of θ and not less than $\left(\frac{h}{e(k+1)N}\right)^{h}$, as will be proved further on in Theorem 4.2.2 (page 144).
Another important setting where the complexity term -log π ∆(θ) can easily be controlled is the case of compression schemes, introduced by Little et al. (1986). It goes as follows: we are given for each labelled sub-sample (X i , Y i ) i∈J , J ⊂ {1, . . . , N }, an estimator of the parameter
is an exchangeable function providing estimators for sub-samples of arbitrary size. Let us assume that θ is exchangeable, meaning that for any k = 1, . . . , N and any permutation σ of {1, . . . , k}
In this situation, we can introduce the exchangeable subset
which is seen to contain at most $\sum_{j=0}^{h} \binom{(k+1)N}{j} \le \left(\frac{e(k+1)N}{h}\right)^{h}$ classification rules, as will be proved later on in Theorem 4.2.3 (page 144). Note that we had to extend the range of J to all the subsets of the extended sample, although we will use for estimation only those of the training sample, on which the labels are observed. Thus in this case also we can find a partially exchangeable posterior distribution π such that
We see that the size of the compression scheme plays the same role in this complexity bound as the Vapnik-Cervonenkis dimension for Vapnik-Cervonenkis classes.
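The combinatorial bound behind both complexity terms, namely that the sum of the binomial coefficients of (k+1)N over j = 0, ..., h is at most (e(k+1)N/h)^h, is easy to check numerically. The sketch below uses illustrative values (N = 1000, k = 1, h = 10) and hypothetical function names; it is a sanity check, not a proof.

```python
import math

def subset_count(n: int, h: int) -> int:
    """Number of subsets of size at most h of a set of n points (exact integer)."""
    return sum(math.comb(n, j) for j in range(h + 1))

def log_upper_bound(n: int, h: int) -> float:
    """Logarithm of the bound (e n / h)^h, that is h log(e n / h)."""
    return h * math.log(math.e * n / h)

N, k, h = 1000, 1, 10
n = (k + 1) * N                      # size of the extended sample

log_count = math.log(subset_count(n, h))
log_bound = log_upper_bound(n, h)    # plays the role of -log pi(Delta(theta))
```

Working with logarithms avoids overflow for larger values of h and matches the way the term enters the bounds, through -log π[∆(θ)].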
In these two cases of binary classification with Vapnik-Cervonenkis dimension not greater than h and compression schemes depending on a compression set with at most h points, we get a bound of
Let us make some numerical application: when N = 1000, h = 10, ǫ = 0.01, and inf Θ r 1 = r 1 ( θ) = 0.2, we find that r 2 ( θ) ≤ 0.4093, for k between 15 and 17, and values of λ equal respectively to 965, 968 and 971. For k = 1, we find only r 2 ( θ) ≤ 0.539, showing the interest of allowing k to be larger than 1.
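This numerical application can be reproduced with a short script. The sketch rests on an assumption, since the displayed bound is missing from this section: it supposes that Theorem 3.2.3 amounts to r ≤ [1 - exp(-(λ r₁ + d)/N)] / [1 - exp(-λ/N)], with d = h log(e(k+1)N/h) - log ε, which is what inverting Φ gives if Φ_a(p) = -a⁻¹ log(1 - (1 - e^{-a})p) as in equation (1.1). The function names and the integer grid for λ are hypothetical choices.

```python
import math

def transductive_bound(r1: float, n: int, h: int, k: int,
                       eps: float, lam: float) -> float:
    """Assumed form of the bound on r_2 for a shadow sample of size k*n."""
    # Complexity term -log(eps * pi(Delta(theta))) for a VC class of dimension h.
    d = h * math.log(math.e * (k + 1) * n / h) - math.log(eps)
    # Assumed inversion of Phi_{lam/n}: bound on the error rate over all (k+1)n points.
    r_bar = (1 - math.exp(-(lam * r1 + d) / n)) / (1 - math.exp(-lam / n))
    # r = (r1 + k r2)/(k+1) rearranged as r2 = ((k+1) r - r1)/k.
    return ((k + 1) * r_bar - r1) / k

def best_bound(r1: float, n: int, h: int, k: int, eps: float) -> float:
    """Minimize over integer values of lambda (a simplification of the infimum)."""
    return min(transductive_bound(r1, n, h, k, eps, lam) for lam in range(100, 3001))

b16 = best_bound(0.2, 1000, 10, 16, 0.01)   # shadow sample 16 times larger
b1 = best_bound(0.2, 1000, 10, 1, 0.01)     # shadow sample of equal size
```

Under these assumptions the two values should come out close to the figures 0.4093 (k = 16) and 0.539 (k = 1) quoted above.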
In the case when k = 1, we can improve Theorem 3.1.2 by taking advantage of the fact that T i (σ i ) can take only 3 values, namely 0, 0.5 and 1. We see thus that
.
This shows that in the case when k = 1, log T exp(-λr 1 ) = -λr
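Noticing that $\frac{1}{2} - \Phi_{\lambda/N}\left(\frac{1}{2}\right) = \frac{N}{\lambda} \log\cosh\left(\frac{\lambda}{2N}\right)$, we obtain Theorem 3.2.4. For any partially exchangeable function λ : Ω × Θ → R + , for any partially exchangeable posterior distribution π : Ω → M 1 + (Θ), P exp sup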
As a consequence, reasoning as previously, we deduce Theorem 3.2.5. In the case when k = 1, for any partially exchangeable posterior distribution π : Ω → M 1 + (Θ), with P probability at least 1 -ǫ, for any θ ∈ Θ and any λ ∈ R + ,
and consequently for any θ ∈ Θ,
-r 1 (θ).
In the case of binary classification using a Vapnik-Cervonenkis class of Vapnik-Cervonenkis dimension not greater than h, we can choose π such that $-\log \pi[\Delta(\theta)] \le h \log\left(\frac{2eN}{h}\right)$ and obtain the following numerical illustration of this theorem: for N = 1000, h = 10, ε = 0.01 and $\inf_\Theta r_1 = r_1(\hat\theta) = 0.2$, we find an upper bound $r_2(\hat\theta) \le 0.5033$, which improves on Theorem 3.2.3 but is still not under the significance level 1/2 (achieved by blind random classification). This indicates that considering shadow samples of arbitrary sizes yields, in some noisy situations, a significant improvement on bounds obtained with a shadow sample of the same size as the training sample.
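The figure 0.5033 can be checked in the same spirit. The sketch below assumes that, for k = 1, the bound takes the form r ≤ (r₁ + d/λ) / [1 - (2N/λ) log cosh(λ/(2N))] with d = h log(2eN/h) - log ε, followed by r₂ = 2r - r₁. This form is inferred from the log cosh factor appearing in this section and is therefore an assumption; the function name and the integer grid for λ are hypothetical.

```python
import math

def equal_size_bound(r1: float, n: int, h: int, eps: float, lam: float) -> float:
    """Assumed form of the k = 1 bound on r_2."""
    # Complexity term for a VC class of dimension h on 2n points.
    d = h * math.log(2 * math.e * n / h) - math.log(eps)
    # Factor replacing the inversion of Phi in the equal sample size case.
    shrink = 1 - (2 * n / lam) * math.log(math.cosh(lam / (2 * n)))
    r_bar = (r1 + d / lam) / shrink
    # With k = 1, r = (r1 + r2)/2, hence r2 = 2 r - r1.
    return 2 * r_bar - r1

best = min(equal_size_bound(0.2, 1000, 10, 0.01, lam) for lam in range(100, 3001))
```

Under these assumptions the minimum should be close to the value 0.5033 quoted above, reached for λ somewhat below N.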
When k = 1 and P is exchangeable meaning that for any bounded measurable function h : Ω → R and any permutation s ∈ S {1, . . . , 2N } P h(ω•s) = P h(ω) , then we can still improve the bound as follows. Let
Then we can write
Using this identity, we get for any exchangeable function λ :
Let us put
Let us notice now that
Let π : Ω → M 1 + (Θ) be any given exchangeable posterior distribution. Using the exchangeability of P and π and the exchangeability of the exponential function, we get
We are thus ready to state Theorem 3.2.6. In the case when k = 1, for any exchangeable probability distribution P, for any exchangeable posterior distribution π : Ω → M 1 + (Θ), for any exchangeable function λ : Ω × Θ → R + , P exp sup
where A(λ) is defined by equation (3.2, page 119).
We then deduce as previously Corollary 3.2.7. For any exchangeable posterior distribution π : Ω → M 1 + (Θ), for any exchangeable probability measure P ∈ M 1 + (Ω), for any measurable exchangeable function λ : Ω × Θ → R + , with P probability at least 1 -ǫ, for any θ ∈ Θ,
where A(λ) is defined by equation (3.2, page 119).
In order to deduce an empirical bound from this theorem, we have to make some choice for λ(ω, θ). Fortunately, it is easy to show that the bound holds uniformly in λ, because the inequality can be rewritten as a function of only one non-exchangeable quantity, namely r 1 (θ). Indeed, since r 2 = 2r -r 1 , we see that the inequality can be written as
It can be solved in r 1 (θ), to get
where
Thus we can find some exchangeable function λ(ω, θ), such that f λ(ω, θ), r(θ), -log ǫπ ∆(θ) = sup β∈R+ f β, r(θ), -log ǫπ ∆(θ) .
Applying Corollary 3.2.7 (page 120) to that choice of λ, we see that Theorem 3.2.8. For any exchangeable probability measure P ∈ M 1 + (Ω), for any exchangeable posterior probability distribution π : Ω → M 1 + (Θ), with P probability at least 1 -ǫ, for any θ ∈ Θ, for any λ ∈ R + ,
where A(λ) is defined by equation (3.2, page 119).
Solving the previous inequality in r 2 (θ), we get Corollary 3.2.9. Under the same assumptions as in the previous theorem, with P probability at least 1 -ǫ, for any θ ∈ Θ,
.
Applying this to our usual numerical example of a binary classification model with Vapnik-Cervonenkis dimension not greater than h = 10, when N = 1000, $\inf_\Theta r_1 = r_1(\hat\theta) = 0.2$ and ε = 0.01, we obtain that $r_2(\hat\theta) \le 0.4450$.
We assume in this section that
where $P_i \in \mathcal{M}_+^1(\mathcal{X} \times \mathcal{Y})$: we consider an infinite i.i.d. sequence of independent non-identically distributed samples of size N , the first one only being observed. More precisely, under P each sample $(X_{i+jN}, Y_{i+jN})_{i=1}^{N}$ is distributed according to $\bigotimes_{i=1}^{N} P_i$, and they are all independent from each other. Only the first sample $(X_i, Y_i)_{i=1}^{N}$ is assumed to be observed. The shadow samples will only appear in the proofs. The aim of this section is to prove better Vapnik bounds, generalizing them at the same time to the independent non-i.i.d. setting, which to our knowledge has not been done before.
Let us introduce the notation
, where h may be any suitable (e.g. bounded) random variable, let us also put Ω = (X × Y) N N . Definition 3.3.1. For any subset A ⊂ N of integers, let C(A) be the set of circular permutations of the totally ordered set A, extended to a permutation of N by taking it to be the identity on the complement N \ A of A. We will say that a random function h : Ω → R is k-partially exchangeable if h(ω • s) = h(ω), s ∈ C {i + jN ; j = 0, . . . , k} , i = 1, . . . , N.
In the same way, we will say that a posterior distribution π :
Note that P itself is k-partially exchangeable for any k in the sense that for any bounded measurable function h : Ω → R P h(ω • s) = P h(ω) , s ∈ C {i + jN ; j = 0, . . . , k} , i = 1, . . . , N. 1.2 shows that for any positive real parameter λ and any k-partially exchangeable posterior distribution
Using the general fact that $P[\exp(h)] = P[P' \exp(h)] \ge P[\exp P'(h)]$, and the fact that the expectation of a supremum is larger than the supremum of an expectation, we see that with P probability at least 1 - ε, for any θ ∈ Θ,
For short let us put
We can use the convexity of $\Phi_{\lambda/N}$ and the fact that $P'(r_k) = \frac{r_1 + kR}{k+1}$, to establish that
We have proved Theorem 3.3.1. Using the above hypotheses and notation, for any sequence π k :
, where π k is a k-partially exchangeable posterior distribution, for any positive real constant λ, any positive integer k, with P probability at least 1 -ǫ, for any θ ∈ Θ,
As we did with Theorem 1.2.6 (page 11), we can make the result of this theorem uniform in $\lambda \in \{\alpha^{j} ; j \in \mathbb{N}^*\}$ and $k \in \mathbb{N}^*$ (considering on k the prior $\frac{1}{k(k+1)}$ and on j the prior $\frac{1}{j(j+1)}$), and obtain Theorem 3.3.2. For any real parameter α > 1, with P probability at least 1 - ε, for any θ ∈ Θ,
As a special case we can choose π k such that log π k ∆ k (θ) is independent of θ and equal to log(N k ), where
; θ ∈ Θ is the size of the trace of the classification model on the extended sample of size (k+1)N . With this choice, we obtain a bound involving a new flavour of conditional Vapnik entropy, namely
In the case of binary classification using a Vapnik-Cervonenkis class of Vapnik-Cervonenkis dimension not greater than h = 10, when N = 1000, inf Θ r 1 = r 1 ( θ) = 0.2 and ǫ = 0.01, choosing α = 1.1, we obtain R( θ) ≤ 0.4271 (for an optimal value of λ = 1071.8, and an optimal value of k = 16).
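The priors 1/(k(k+1)) and 1/(j(j+1)) used to make the bound uniform in k and λ are probability distributions on the positive integers, since the partial sums telescope to 1 - 1/(J+1). The sketch below checks this and computes the extra complexity paid for a given index under this weighting; the helper names are hypothetical and the value k = 16 is taken from the numerical example above.

```python
import math

def weight(j: int) -> float:
    """Prior weight 1 / (j (j + 1)) on the positive integer j."""
    return 1.0 / (j * (j + 1))

def penalty(j: int) -> float:
    """Extra term -log weight(j) added to the complexity by the union bound."""
    return math.log(j * (j + 1))

J = 10000
partial_sum = sum(weight(j) for j in range(1, J + 1))   # telescopes to 1 - 1/(J+1)
penalty_k16 = penalty(16)                               # cost of selecting k = 16
```

The penalty grows only like 2 log k, so that optimizing over k and over a geometric grid of values of λ costs little compared with the main complexity term.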
If we are not pleased with optimizing λ on a discrete subset of the real line, we can use a slightly different approach. From Theorem 3.1.2 (page 113), we see that for any positive integer k, for any k-partially exchangeable positive real measurable function λ : Ω × Θ → R + satisfying equation (3.1, page 116), with ∆(θ) replaced with ∆ k (θ), for any ε ∈ (0, 1) and η ∈ (0, 1),
≤ ǫη, therefore with P probability at least 1 -ǫ,
and consequently, with P probability at least 1-ǫ, with P ′ probability at least 1-η, for any θ ∈ Θ,
Now we are entitled to choose λ(ω, θ) ∈ arg max
This shows that with P probability at least 1 -ǫ, with P ′ probability at least 1 -η, for any θ ∈ Θ,
which can also be written
Thus with P probability at least 1 -ǫ, for any θ ∈ Θ, any λ ∈ R + ,
On the other hand, Φ λ N being a convex function,
Thus with P probability at least 1 -ǫ, for any θ ∈ Θ,
We can generalize this approach by considering a finite decreasing sequence η 0 = 1 > η 1 > η 2 > • • • > η J > η J+1 = 0, and the corresponding sequence of levels
Taking a union bound in j, we see that with P probability at least 1 -ǫ, for any θ ∈ Θ, for any λ ∈ R + ,
and consequently
Let us put
We have proved that for any decreasing sequence (η j ) J j=1 , with P probability at least 1 -ǫ, for any θ ∈ Θ,
In the case where N = 1000 and for any ǫ ∈)0, in the above inequality.
Taking moreover a weighted union bound in k, we get Theorem 3.3.3. For any ε ∈ (0, 1), any sequence
, where π k is a k-partially exchangeable posterior distribution, with P probability at least 1 -ǫ, for any θ ∈ Θ,
Corollary 3.3.4. For any ε ∈ (0, 1), for any $N \le 10^{9}$, with P probability at least 1 - ε, for any θ ∈ Θ,
Let us end this section with a numerical example: in the case of binary classification with a Vapnik-Cervonenkis class of dimension not greater than 10, when N = 1000, inf Θ r 1 = r 1 ( θ) = 0.2 and ǫ = 0.01, we get a bound R( θ) ≤ 0.4211 (for optimal values of k = 15 and of λ = 1010).
In the case when k = 1, we can use Theorem 3.2.5 (page 118) and replace $\Phi^{-1}_{\lambda/N}(q)$ with $\left[1 - \frac{2N}{\lambda} \log\cosh\left(\frac{\lambda}{2N}\right)\right]^{-1} q$, resulting in Theorem 3.3.5. For any ε ∈ (0, 1), any $N \le 10^{9}$, any one-partially exchangeable posterior distribution π 1 : Ω → M 1 + (Θ), with P probability at least 1 - ε, for any θ ∈ Θ,
.
3.3.4. Improvement on the equal sample size bound in the i.i.d. case
Finally, in the case when P is i.i.d., meaning that all the P i are equal, we can improve the previous bound. For any partially exchangeable function λ : Ω × Θ → R + , we saw in the discussion preceding Theorem 3.2.6 (page 120) that
with the notation introduced therein. Thus for any partially exchangeable positive real measurable function λ : Ω × Θ → R + satisfying equation (3.1, page 116), any one-partially exchangeable posterior distribution π
Therefore with P probability at least 1 - ε, with P ′ probability at least 1 - η,
We can then choose λ(ω, θ) ∈ arg min
λ ′ , which satisfies the required conditions, to show that with P probability at least 1 -ǫ, for any θ ∈ Θ, with P ′ probability at least 1 -η, for any λ ∈ R + ,
We can then take a union bound on a decreasing sequence of J values η 1 ≥ • • • ≥ η J of η. Weakening the order of quantifiers a little, we then obtain the following statement: with P probability at least 1 -ǫ, for any θ ∈ Θ, for any λ ∈ R + , for any j = 1, . . . , J
Consequently for any λ ∈ R + ,
Keeping track of quantifiers, we obtain Theorem 3.3.6. For any decreasing sequence $(\eta_j)_{j=1}^{J}$, any ε ∈ (0, 1), any one-partially exchangeable posterior distribution π : Ω → M 1 + (Θ), with P probability at least 1 - ε, for any θ ∈ Θ,
.
To obtain formulas which could be easily compared with original Vapnik bounds, we may replace p -Φ a (p) with a Gaussian upper bound:
For any $p \in \left(\frac{1}{2}, 1\right)$, $p - \Phi_a(p) \le \frac{a}{8}$.
Proof. Let us notice that for any p ∈ (0, 1),
Thus taking a Taylor expansion of order one with integral remainder:
-aΦ(a) ≤
This ends the proof of our lemma.
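The Gaussian upper bound p - Φ_a(p) ≤ a/8 can be checked numerically on a grid. The sketch assumes that Φ_a(p) = -a⁻¹ log(1 - (1 - e^{-a})p), the form of equation (1.1), which is not reproduced in this section; the grid of values of a and p is an arbitrary illustrative choice.

```python
import math

def phi(a: float, p: float) -> float:
    """Assumed form of Phi_a(p): -log(1 - (1 - exp(-a)) p) / a."""
    # expm1(-a) = exp(-a) - 1, so 1 + expm1(-a) * p = 1 - (1 - exp(-a)) p.
    return -math.log1p(math.expm1(-a) * p) / a

# Largest value of (p - Phi_a(p)) - a/8 over the grid; it should not be positive.
worst_gap = max(
    (p - phi(a, p)) - a / 8
    for a in (0.01, 0.1, 0.5, 1.0, 2.0, 5.0)
    for p in (i / 100 for i in range(51, 100))
)
```

The bound is tightest for small a and p close to 1/2, where p - Φ_a(p) behaves like (a/2) p(1 - p), which is the variance term the Gaussian approximation discards when it is replaced with its maximum 1/8.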
Lemma 3.4.2. Let us consider the bound
Let us also put
For any positive real parameters q and d inf
Then let us remark that B(q, d)
If moreover 1 2 ≥ B(q, d), then according to this remark 1 2 ≥ q + d 2N ≥ p. Therefore p ≤ 1 2 , and consequently p ≤ q + 2dp(1-p)
, implying that p ≤ B(q, d).
The previous lemma combined with Corollary 3.3.4 implies Corollary 3.4.3. For any ε ∈ (0, 1), any integer $N \le 10^{9}$, with P probability at least 1 - ε, for any θ ∈ Θ,
To make a link with Vapnik’s result, it is useful to state the Gaussian approximation to Theorem 3.3.6 (page 127). Indeed, using the upper bound A(λ) ≤ λ 4N , where A(λ) is defined by equation (3.2) on page 119, we get with P probability at least
which can be solved in R to obtain Corollary 3.4.4. With P probability at least 1 -ǫ, for any θ ∈ Θ,
This is to be compared with Vapnik’s result, as proved in Vapnik (1998, page 138): Theorem 3.4.5 (Vapnik). For any i.i.d. probability distribution P, with P probability at least 1 -ǫ, for any θ ∈ Θ, putting
Recalling that we can choose $(\eta_j)_{j=1}^{2}$ such that $\eta_J = \eta_2 = \frac{1}{10N}$ (which brings a negligible contribution to the bound) and such that for any $N \le 10^{9}$,
we see that our complexity term is somehow more satisfactory than Vapnik’s, since it is integrated outside the logarithm, with a slightly larger additional constant (remember that log 4 ≃ 1.4, which is better than our 4.7, which could presumably be improved by working out a better sequence η j , but not down to log(4)). Our variance term is better, since we get r 1 (1 -r 1 ), instead of r 1 . We also have
, because we use no symmetrization trick.
Let us illustrate these bounds on a numerical example, corresponding to a situation where the sample is noisy or the classification model is weak. Let us assume that N = 1000, inf Θ r 1 = r 1 ( θ) = 0.2, that we are performing binary classification with a model with Vapnik-Cervonenkis dimension not greater than h = 10, and that we work at confidence level ǫ = 0.01. Vapnik’s theorem provides an upper bound for R( θ) not smaller than 0.610, whereas Corollary 3.4.4 gives R( θ) ≤ 0.461 (using the bound d ′′ 1 ≤ d ′ 1 + 3.7 when N = 1000). Now if we go for Theorem 3.3.6 and do not make a Gaussian approximation, we get R( θ) ≤ 0.453. It is interesting to remark that this bound is achieved for λ = 1195 > N = 1000. This explains why the Gaussian approximation in Vapnik’s bound can be improved: for such a large value of λ, λr 1 (θ) does not behave like a Gaussian random variable.
Let us recall in conclusion that the best bound is provided by Theorem 3.3.3 (page 125), giving R( θ) ≤ 0.4211, (that is approximately 2/3 of Vapnik’s bound), for optimal values of k = 15, and of λ = 1010. This bound can be seen to take advantage of the fact that Bernoulli random variables are not Gaussian (its Gaussian approximation, Corollary 3.4.3, gives a bound R(θ) ≃ 0.4325, still with an optimal k = 15), and of the fact that the optimal size of the shadow sample is significantly larger than the size of the observed sample. Moreover, Theorem 3.3.3 does not assume that the sample is i.i.d., but only that it is independent, thus generalizing Vapnik’s bounds to inhomogeneous data (this will presumably be the case when data are collected from different places where the experimental conditions may not be the same, although they may reasonably be assumed to be independent).
Our little numerical example was chosen to illustrate the case when it is nontrivial to decide whether the chosen classifier does better than the 0.5 error rate of blind random classification. This case is of interest to choose “weak learners” to be aggregated or combined in some appropriate way in a second stage to reach a better classification rate. This stage of feature selection is unavoidable in many real world classification tasks. Our little computations are meant to exemplify the fact that Vapnik’s bounds, although asymptotically suboptimal, as is obvious by comparison with the first two chapters, can do the job when dealing with moderate sample sizes.
Proof. Let $w_0 \in A_Z$. The set $A_Z \cap \{w \in \mathbb{R}^d : \|w\| \le \|w_0\|\}$ is a compact convex set and $w \mapsto \|w\|^2$ is strictly convex and therefore has a unique minimum on this set, which is also obviously its minimum on $A_Z$. As $\min_{i \in I_+} \langle w_Z, x_i \rangle - \max_{i \in I_-} \langle w_Z, x_i \rangle = 2$, the margin is also equal to half the distance between the projections on the direction $w_Z$ of the positive and negative patterns.
Let us consider the convex hulls X + and X -of the positive and negative patterns:
As v → v 2 is strictly convex, with compact lower level sets, there is a unique vector
The set $A_Z$ is non-empty (i.e. the training set Z is linearly separable) if and only if $v^* \neq 0$. In this case
and the margin of the canonical hyperplane is equal to $\frac{1}{2} \|v^*\|$. This lemma proves that the distance between the convex hulls of the positive and negative patterns is equal to twice the margin of the canonical hyperplane.
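A two-point example makes the relation between the hulls and the margin concrete. With a single pattern per class, each convex hull is reduced to a point and v* is simply the difference of the two patterns. The sketch assumes the expression w_Z = 2v*/‖v*‖², which is what the relations w(α_s) = s v* and s* = 2/‖v*‖² from the proof of Proposition 4.1.3 suggest; the coordinates are arbitrary.

```python
import math

x_pos = (2.0, 1.0)    # the only positive pattern
x_neg = (0.0, -1.0)   # the only negative pattern

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Shortest vector between the two hulls (here, between the two points).
v_star = tuple(p - q for p, q in zip(x_pos, x_neg))
norm_v = math.sqrt(dot(v_star, v_star))

# Assumed expression of the canonical direction: w_Z = 2 v* / ||v*||^2.
w = tuple(2 * c / norm_v ** 2 for c in v_star)

gap = dot(w, x_pos) - dot(w, x_neg)        # canonical normalization, equal to 2
margin = 1 / math.sqrt(dot(w, w))          # half the distance between projections
```

Here the margin equals ‖v*‖/2, so the distance between the two hulls is twice the margin, in agreement with the lemma.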
Proof. Let us assume first that v * = 0, or equivalently that X + ∩ X -= ∅. For any vector w
w, x , so min i∈I+ w, x i -max i∈I-w, x i ≤ 0, which shows that w cannot be in A Z and therefore that A Z is empty.
Let us assume now that v * = 0, or equivalently that
Let us now prove that $\inf_{v \in V} \langle v^*, v \rangle = \|v^*\|^2$. Some arbitrary v ∈ V being fixed, consider the function
By definition of $v^*$, it reaches its minimum value for β = 0, and therefore has a non-negative derivative at this point. Computing this derivative, we find that $\langle v - v^*, v^* \rangle \ge 0$, as claimed. We have proved that
and therefore that w * ∈ A Z . On the other hand, any w
This proves that $\|w^*\| = \inf\{\|w\| : w \in A_Z\}$, and therefore that $w^* = w_Z$ as claimed.
One way to compute w Z would therefore be to compute v * by minimizing
Although this is a tractable quadratic programming problem, a direct computation of w Z through the following proposition is usually preferred.
Proposition 4.1.3. The canonical direction w Z can be expressed as
Proof. Let $w(\alpha) = \sum_{i \in I} \alpha_i y_i x_i$ and let $S(\alpha) = \frac{1}{2} \sum_{i \in I} \alpha_i$. We can express the function F (α) as $F(\alpha) = \|w(\alpha)\|^2 - 4S(\alpha)$. Moreover it is important to notice that for any $s \in \mathbb{R}_+$, $\{w(\alpha) : \alpha \in A, S(\alpha) = s\} = sV$. This shows that for any $s \in \mathbb{R}_+$, $\inf\{F(\alpha) : \alpha \in A, S(\alpha) = s\}$ is reached and that for any $\alpha_s \in \{\alpha \in A : S(\alpha) = s\}$ reaching this infimum, $w(\alpha_s) = s v^*$. As $s \mapsto s^2 \|v^*\|^2 - 4s : \mathbb{R}_+ \to \mathbb{R}$ reaches its infimum for only one value $s^*$ of s, namely at $s^* = \frac{2}{\|v^*\|^2}$, this shows that F (α) reaches its infimum on A, and that for any
This implies that the representation w Z = w(α * ) involves in general only a limited number of non-zero coefficients and that w Z = w Z ′ , where Z ′ = {(x i , y i ) : x i ∈ S}.
Proof. Let us consider any given $i \in I_+$ and $j \in I_-$, such that $\alpha^*_i > 0$ and $\alpha^*_j > 0$. There exists at least one such index in each of the sets $I_-$ and $I_+$, since the sums of the components of $\alpha^*$ on each of these sets are equal and since $\sum_{k \in I} \alpha^*_k > 0$. For any t ∈ R, consider $\alpha_k(t) = \alpha^*_k + t \mathbb{1}(k \in \{i, j\})$, k ∈ I. The vector α(t) is in A for any value of t in some neighbourhood of 0, therefore $\frac{\partial}{\partial t}\big|_{t=0} F[\alpha(t)] = 0$. Computing this derivative, we find that $y_i \langle w(\alpha^*), x_i \rangle + y_j \langle w(\alpha^*), x_j \rangle = 2$.
As y i = -y j , this can also be written as
which implies necessarily as claimed that
In the case when the training set Z = (x i , y i ) N i=1 is not linearly separable, we can define a noisy canonical hyperplane as follows: we can choose w ∈ R d and b ∈ R to minimize
where for any real number r, $r_+ = \max\{r, 0\}$ is the positive part of r.
Theorem 4.1.5. Let us introduce the dual criterion
There is a threshold b * (whose construction will be detailed in the proof ), such that
Corollary 4.1.6. (scaled criterion) For any positive real parameter λ let us consider the criterion
and the domain
For any solution $\alpha^*$ of the maximization problem $F(\alpha^*) = \sup_{\alpha \in A'_\lambda} F(\alpha)$, the vector
In the separable case, the scaled criterion is minimized by the canonical hyperplane for λ large enough. This extension of the canonical hyperplane computation in dual space is often called the box constraint, for obvious reasons.
Proof. The corollary is a straightforward consequence of the scale property $C_\lambda(w, b, x) = \lambda^2 C(\lambda^{-1} w, b, \lambda x)$, where we have made the dependence of the criterion in $x \in \mathbb{R}^{dN}$ explicit. Let us come now to the proof of the theorem.
The minimization of C(w, b) can be performed in dual space extending the couple of parameters (w, b) to w
- and introducing the dual multipliers α ∈ R N + and the criterion
We see that C(w, b) = inf We are going to show that there is no duality gap, meaning that this inequality is indeed an equality. More importantly, we will do so by exhibiting a saddle point, which, solving the dual minimization problem will also solve the original one.
Let us first make explicit the solution of the dual problem (the interest of this dual problem precisely lies in the fact that it can more easily be solved explicitly). Introducing the admissible set of values of α,
we see that inf w∈R d G α, (w, 0, 0) is reached at
This proves that inf w∈W G(α, w) = F (α).
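The continuous map $\alpha \mapsto \inf_{\mathbf{w} \in W} G(\alpha, \mathbf{w})$ reaches a maximum $\alpha^*$, not necessarily unique, on the compact convex set $A'$. We are now going to exhibit a choice of $\mathbf{w}^* \in W$ such that $(\alpha^*, \mathbf{w}^*)$ is a saddle point. This means that we are going to show that $G(\alpha^*, \mathbf{w}^*) = \inf_{\mathbf{w} \in W} G(\alpha^*, \mathbf{w}) = \sup_{\alpha \in \mathbb{R}_+^N} G(\alpha, \mathbf{w}^*)$.
• Let us put $w^* = w_{\alpha^*}$.
• If there is $j \in \{1, \dots, N\}$ such that $0 < \alpha_j^* < 1$, let us put $b^* = \langle x_j, w^* \rangle - y_j$.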
Otherwise, let us put
• Let us then put
If we can prove that
it will show that $\gamma^* \in \mathbb{R}_+^N$ and therefore that $\mathbf{w}^* = (w^*, b^*, \gamma^*) \in W$. It will also show that
proving that $G(\alpha^*, \mathbf{w}^*) = \sup_{\alpha \in \mathbb{R}_+^N} G(\alpha, \mathbf{w}^*)$. As obviously $G(\alpha^*, \mathbf{w}^*) = G(\alpha^*, (w^*, 0, 0))$, we already know that $G(\alpha^*, \mathbf{w}^*) = \inf_{\mathbf{w} \in W} G(\alpha^*, \mathbf{w})$. This will show that $(\alpha^*, \mathbf{w}^*)$ is the saddle point we were looking for, thus ending the proof of the theorem.
Proof of equation (4.2). Let us deal first with the case when there is $j \in \{1, \dots, N\}$ such that $0 < \alpha_j^* < 1$. For any $i \in \{1, \dots, N\}$ such that $0 < \alpha_i^* < 1$, there is $\epsilon > 0$ such that for any $t \in (-\epsilon, \epsilon)$, $\alpha^* + t y_i e_i - t y_j e_j \in A'$, where $(e_k)_{k=1}^N$ is the canonical basis of $\mathbb{R}^N$. Thus $\frac{\partial}{\partial t}\big|_{t=0} F(\alpha^* + t y_i e_i - t y_j e_j) = 0$. Computing this derivative, we obtain
Thus $1 - (\langle w^*, x_i \rangle - b^*) y_i = 0$, as required. This also shows that the definition of $b^*$ does not depend on the choice of $j$ such that $0 < \alpha_j^* < 1$.
Lemma 4.1.7. When Z is K-separable, inf{F (α) : α ∈ A} is reached.
Proof. Consider the training set Z ′ = (x ′ i , y i ) N i=1 , where
We see that
We proved in the previous section that Z ′ is linearly separable if and only if inf{F (α) : α ∈ A} > -∞, and that the infimum is reached in this case.
Proposition 4.1.8. Let K be a symmetric positive kernel and let Z
where the value of b * does not depend on the choice of i -and i + . The classification rule f : X → Y defined by the formula
is independent of the choice of $\alpha^*$ and is called the support vector machine defined by $K$ and $Z$. The set $S = \{x_j : \sum_{i=1}^N \alpha_i^* y_i K(x_i, x_j) - b^* = y_j\}$ is called the set of support vectors. For any choice of $\alpha^*$, $\{x_i : \alpha_i^* > 0\} \subset S$. An important consequence of this proposition is that the support vector machine defined by $K$ and $Z$ is also the support vector machine defined by $K$ and $Z' = \{(x_i, y_i) : \alpha_i^* > 0, 1 \le i \le N\}$, since this restriction of the index set contains the value $\alpha^*$ where the minimum of $F$ is reached.
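To fix ideas, here is a minimal sketch of how such a rule is evaluated, assuming it has the form $f(x) = \mathrm{sign}\bigl(\sum_{i=1}^N \alpha_i^* y_i K(x_i, x) - b^*\bigr)$, which is the expression appearing in the definition of $S$. The Gaussian kernel, the toy sample and the coefficients are hypothetical and are not the output of an actual optimization; the sketch only illustrates that the evaluation involves the indices with $\alpha_i^* > 0$ and no others.

```python
import math


def gaussian_kernel(x, z):
    """Gaussian kernel exp(-||x - z||^2), used here as an example of K."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)))


def svm_score(alpha, xs, ys, b_star, x, kernel=gaussian_kernel):
    """Real-valued score sum_i alpha_i y_i K(x_i, x) - b_star."""
    return sum(a * y * kernel(xi, x) for a, xi, y in zip(alpha, xs, ys)) - b_star


def svm_classify(alpha, xs, ys, b_star, x, kernel=gaussian_kernel):
    """Label in {-1, +1} given by the sign of the score (ties sent to +1)."""
    return 1 if svm_score(alpha, xs, ys, b_star, x, kernel) >= 0 else -1


def active_subsample(alpha, xs, ys):
    """Keep only the pairs whose coefficient is positive."""
    keep = [i for i, a in enumerate(alpha) if a > 0]
    return [alpha[i] for i in keep], [xs[i] for i in keep], [ys[i] for i in keep]


# Hypothetical coefficients: they satisfy sum_i y_i alpha_i = 0 but are not optimized.
xs = [(0.0, 0.0), (2.0, 0.0), (0.1, 0.1), (2.0, 0.1)]
ys = [-1, 1, -1, 1]
alpha = [1.0, 1.0, 0.0, 0.0]
b_star = 0.0

query = (0.2, 0.0)
full_score = svm_score(alpha, xs, ys, b_star, query)
a_sub, x_sub, y_sub = active_subsample(alpha, xs, ys)
reduced_score = svm_score(a_sub, x_sub, y_sub, b_star, query)
```

The two scores coincide because the discarded terms have zero weight. The proposition makes the stronger statement that retraining on the reduced sample gives back the same rule, which this sketch does not attempt to verify.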
Proof. The independence of the choice of $\alpha^*$, which is not necessarily unique, is seen as follows. Let $(x_i)_{i=1}^N$ and $x \in X$ be fixed. Let us put for ease of notation $x_{N+1} = x$. Let $M$ be the $(N+1) \times (N+1)$ symmetric positive semi-definite matrix defined by $M(i, j) = K(x_i, x_j)$, $i = 1, \dots, N+1$, $j = 1, \dots, N+1$. Let us consider the mapping $\Psi$ :
and we have proved that for any choice of α * ∈ A minimizing F (α),
). Thus the support vector machine defined by K and Z can also be expressed by the formula
which does not depend on α * . The definition of S is such that Ψ(S) is the set of support vectors defined in the linear case, where its stated property has already been proved.
We can in the same way use the box constraint and show that any solution
Except the last, the results of this section are drawn from Cristianini et al. (2000).
We have no reference for the last proposition of this section, although we believe it is well known. We include them for the convenience of the reader.
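Proposition 4.1.9. Let $K_1$ and $K_2$ be positive symmetric kernels on $X$. Then for any $a \in \mathbb{R}_+$, $(aK_1 + K_2)(x, x') = aK_1(x, x') + K_2(x, x')$ and $(K_1 \cdot K_2)(x, x') = K_1(x, x') K_2(x, x')$
are also positive symmetric kernels. Moreover, for any measurable function $g : X \to \mathbb{R}$,
$K_g(x, x') = g(x) g(x')$ is also a positive symmetric kernel.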
Proof. It is enough to prove the proposition in the case when X is finite and kernels are just ordinary symmetric matrices. Thus we can assume without loss of generality that X = {1, . . . , n}. Then for any α ∈ R N , using usual matrix notation,
Proposition 4.1.10. Let $K$ be some positive symmetric kernel on $X$. Let $p : \mathbb{R} \to \mathbb{R}$ be a polynomial with positive coefficients. Let $g : X \to \mathbb{R}^d$ be a measurable function. Then $p(K)(x, x') = p(K(x, x'))$, $\exp(K)(x, x') = \exp(K(x, x'))$ and $G_g(x, x') = \exp(-\|g(x') - g(x)\|^2)$
are all positive symmetric kernels.
Proof. The first assertion is a direct consequence of the previous proposition. The second comes from the fact that the exponential function is the pointwise limit of a sequence of polynomial functions with positive coefficients. The third is seen from the second and the decomposition $G_g(x, x') = \exp(-\|g(x)\|^2) \exp(2 \langle g(x), g(x') \rangle) \exp(-\|g(x')\|^2)$.
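Closure properties of this kind can be probed numerically on a finite set of points, where a kernel is just a symmetric matrix. The sketch below tests the sign of random quadratic forms for a few combinations (a positive combination, a product, a polynomial with positive coefficients, an exponential and the Gaussian kernel). The base kernels, the points and the helper names are assumptions chosen for illustration, and a random test of this kind can only refute positivity, not prove it.

```python
import math
import random


def gram(kernel, points):
    """Gram matrix (K(x_i, x_j))_{i,j} of a kernel on a finite list of points."""
    return [[kernel(x, z) for z in points] for x in points]


def quad_form(matrix, v):
    """Quadratic form sum_{i,j} v_i M(i, j) v_j."""
    n = len(v)
    return sum(v[i] * matrix[i][j] * v[j] for i in range(n) for j in range(n))


def looks_positive(matrix, trials=200, seed=0, tol=1e-9):
    """Return False as soon as a random vector gives a clearly negative quadratic form."""
    rng = random.Random(seed)
    n = len(matrix)
    for _ in range(trials):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if quad_form(matrix, v) < -tol:
            return False
    return True


def k1(x, z):
    """Linear kernel on the real line."""
    return x * z


def k2(x, z):
    """Kernel of the form g(x) g(z), with g = cos."""
    return math.cos(x) * math.cos(z)


candidates = {
    "a K1 + K2": lambda x, z: 3.0 * k1(x, z) + k2(x, z),
    "K1 . K2": lambda x, z: k1(x, z) * k2(x, z),
    "p(K1)": lambda x, z: 1.0 + 2.0 * k1(x, z) + 0.5 * k1(x, z) ** 2,
    "exp(K1)": lambda x, z: math.exp(k1(x, z)),
    "Gaussian": lambda x, z: math.exp(-(x - z) ** 2),
}
points = [-1.5, -0.3, 0.0, 0.7, 2.0]
results = {name: looks_positive(gram(k, points)) for name, k in candidates.items()}
# Negative control: minus the linear kernel is not a positive kernel.
control = looks_positive(gram(lambda x, z: -x * z, points))
```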
Proposition 4.1.11. With the notation of the previous proposition, any training set $Z = (x_i, y_i)_{i=1}^N$ is $G_g$-separable as soon as $g(x_i)$, $i = 1, \dots, N$, are distinct points of $\mathbb{R}^d$.
Proof. It is clearly enough to prove the case when $X = \mathbb{R}^d$ and $g$ is the identity. Let us consider some other generic point $x_{N+1} \in \mathbb{R}^d$ and define $\Psi$ as in (4.3). It is enough to prove that $\Psi(x_1), \dots, \Psi(x_N)$ are affine independent, since the simplex, and therefore any affine independent set of points, can be split in any arbitrary way by affine half-spaces. Let us assume that $(\Psi(x_1), \dots, \Psi(x_N))$ are affine dependent; then for some $(\lambda_1, \dots, \lambda_N) \neq 0$ such that
Thus, $(\lambda_i)_{i=1}^{N+1}$, where we have put $\lambda_{N+1} = 0$, is in the kernel of the symmetric positive semi-definite matrix $(G(x_i, x_j))_{i,j \in \{1, \dots, N+1\}}$. Therefore $\sum_{i=1}^N \lambda_i G(x_i, x_{N+1}) = 0$, for any $x_{N+1} \in \mathbb{R}^d$. This would mean that the functions $x \mapsto \exp(-\|x - x_i\|^2)$ are linearly dependent, which can be easily proved to be false. Indeed, let $n \in \mathbb{R}^d$ be such that $\|n\| = 1$ and $\langle n, x_i \rangle$, $i = 1, \dots, N$, are distinct (such a vector exists, because it has to be outside the union of a finite number of hyperplanes, which is of zero Lebesgue measure on the sphere). Let us assume for a while that for some $(\lambda_i)_{i=1}^N \in \mathbb{R}^N$, for any $x \in \mathbb{R}^d$, $\sum_{i=1}^N \lambda_i \exp(-\|x - x_i\|^2) = 0$.
Considering $x = tn$, for $t \in \mathbb{R}$, we would get $\sum_{i=1}^N \lambda_i \exp(2t \langle n, x_i \rangle - \|x_i\|^2) = 0$, $t \in \mathbb{R}$.
Letting $t$ go to infinity, we see that this is only possible if $\lambda_i = 0$ for all values of $i$.
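A numerical counterpart of this linear independence is that the Gaussian Gram matrix of distinct points is invertible, so that any labelling can be interpolated by a function of the form $x \mapsto \sum_j c_j \exp(-\|x - x_j\|^2)$. The sketch below checks this on a small hypothetical point set, using a threshold equal to zero. It is an illustration only and does not reproduce the affine independence argument of the proof; the solver and the function names are ours.

```python
import itertools
import math


def gaussian_gram(points):
    """Gram matrix of the kernel exp(-||x - z||^2) on a list of points."""
    return [
        [math.exp(-sum((a - b) ** 2 for a, b in zip(p, q))) for q in points]
        for p in points
    ]


def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (a[r][n] - tail) / a[r][r]
    return x


def separates_all_labelings(points):
    """Check that every labelling in {-1, +1}^N is realized by the sign of some kernel expansion."""
    g = gaussian_gram(points)
    n = len(points)
    for labels in itertools.product([-1, 1], repeat=n):
        coeffs = solve(g, list(labels))
        scores = [sum(coeffs[j] * g[i][j] for j in range(n)) for i in range(n)]
        if any(s * y <= 0 for s, y in zip(scores, labels)):
            return False
    return True


# Five distinct points of the plane (hypothetical toy configuration).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
all_separated = separates_all_labelings(points)
```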
We can use Support Vector Machines in the framework of compression schemes and apply Theorem 3.3.3 (page 125). More precisely, given some positive symmetric kernel $K$ on $X$, we may consider for any training set $Z' = (x_i', y_i')_{i=1}^h$ the classifier $f_{Z'} : X \to Y$ which is equal to the Support Vector Machine defined by $K$ and $Z'$ whenever $Z'$ is $K$-separable, and which is equal to some constant classification rule otherwise. We take this convention to stick to the framework described on page 117; we will only use $f_{Z'}$ in the $K$-separable case, so this extension of the definition is just a matter of presentation. In the application of Theorem 3.3.3 in the case when the observed sample $(X_i, Y_i)_{i=1}^N$ is $K$-separable, a natural if perhaps sub-optimal choice of $Z'$ is to choose for $(x_i')$ the set of support vectors defined by $Z = (X_i, Y_i)_{i=1}^N$ and to choose for $(y_i')$ the corresponding values of $Y$. This is justified by the fact that $f_Z = f_{Z'}$, as shown in Proposition 4.1.8 (page 139).
If $Z$ is not $K$-separable, we can train a Support Vector Machine with the box constraint, then remove all the errors to obtain a $K$-separable sub-sample $Z' = \{(X_i, Y_i) : \alpha_i^* < \lambda^2, 1 \le i \le N\}$, using the same notation as in equation (4.4) on page 140, and then consider its support vectors as the compression set. Still using the notation of page 140, this means we have to compute successively $\alpha^* \in \arg\min\{F(\alpha) : \alpha \in A, \alpha_i \le \lambda^2\}$ and $\alpha^{**} \in \arg\min\{F(\alpha) : \alpha \in A, \alpha_i = 0 \text{ when } \alpha_i^* = \lambda^2\}$, and to keep the compression set indexed by $J = \{i : 1 \le i \le N, \alpha_i^{**} > 0\}$, together with the corresponding Support Vector Machine $f_J$.
Different values of $\lambda$ can be used at this stage, producing different candidate compression sets. When $\lambda$ increases, the number of errors should decrease; on the other hand, when $\lambda$ decreases, the margin $\|w\|^{-1}$ of the separable subset $Z'$ increases, supporting the hope for a smaller set of support vectors. Thus we can use $\lambda$ to monitor the number of errors on the training set that we accept from the compression scheme. As we can use whatever heuristic we want while selecting the compression set, we can also try to threshold $\alpha_i^{**}$ in the previous construction at different levels $\eta \ge 0$, to produce candidate compression sets $J_\eta = \{i : 1 \le i \le N, \alpha_i^{**} > \eta\}$ of various sizes.
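The thresholding step can be sketched in a few lines. The coefficient values below are hypothetical and stand for the second-stage solution; indices are counted from 0 in the code, whereas the text counts them from 1.

```python
def candidate_compression_sets(alpha_second_stage, thresholds):
    """Map each level eta to the index set {i : alpha_i > eta} (0-based indices)."""
    return {
        eta: [i for i, a in enumerate(alpha_second_stage) if a > eta]
        for eta in thresholds
    }


# Hypothetical second-stage coefficients, not produced by an actual optimization.
alpha_second_stage = [0.0, 0.4, 1.2, 0.05]
candidates = candidate_compression_sets(alpha_second_stage, [0.0, 0.1, 1.0])
sizes = {eta: len(indices) for eta, indices in candidates.items()}
```

Each candidate set can then be fed to the compression scheme bound, the choice between them being made by comparing the resulting bounds.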
As the size $|J|$ of the compression set is random in this construction, we must use a version of Theorem 3.3.3 (page 125) which handles compression sets of arbitrary sizes. This is done by choosing for each $k$ a $k$-partially exchangeable posterior distribution $\pi_k$ which weights the compression sets of all dimensions. We immediately see that we can choose $\pi_k$ such that $-\log \pi_k(\Delta_k(J)) \le \log\bigl[|J|(|J| + 1)\bigr] + |J| \log\bigl(\frac{(k+1)eN}{|J|}\bigr)$. If we observe the shadow sample patterns, and if computer resources permit, we can of course use more elaborate bounds than Theorem 3.3.3, such as the transductive equivalent for Theorem 1.3.15 (page 31) (where we may consider the submodels made of all the compression sets of the same size). Theorems based on relative bounds, such as Theorem 2.2.4 (page 73) or Theorem 2.3.9 (page 108) can also be used. Gibbs distributions can be approximated by Monte Carlo techniques, where a Markov chain with the proper invariant measure consists in appropriate local perturbations of the compression set.
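The complexity term above is easy to evaluate. The sketch below computes it and compares it with one possible way of realizing it, which is our assumption and not necessarily the construction intended here: give total mass $\frac{1}{h(h+1)}$ to the compression sets of size $h$ (these masses sum to one over $h \ge 1$) and spread it uniformly over the $\binom{(k+1)N}{h}$ subsets of that size.

```python
import math


def compression_complexity(size, n, k=1):
    """Upper bound log(|J|(|J|+1)) + |J| log((k+1) e N / |J|) for a compression set of given size."""
    if not 1 <= size <= (k + 1) * n:
        raise ValueError("size must lie between 1 and (k + 1) * n")
    return math.log(size * (size + 1)) + size * math.log((k + 1) * math.e * n / size)


def uniform_by_size_cost(size, n, k=1):
    """Minus log of the mass given to one subset by the assumed uniform-by-size weighting."""
    return math.log(size * (size + 1)) + math.log(math.comb((k + 1) * n, size))


# Example with N = 50 training points and k = 1 (hypothetical values).
costs = [(h, uniform_by_size_cost(h, 50), compression_complexity(h, 50)) for h in range(1, 21)]
```

The comparison relies on the classical inequality $\binom{M}{h} \le (eM/h)^h$, applied with $M = (k+1)N$.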
Let us mention also that the use of compression schemes based on Support Vector Machines can be tailored to perform some kind of feature aggregation. Imagine that the kernel $K$ is defined as the scalar product in $L_2(\pi)$, where $\pi \in \mathcal{M}_+^1(\Theta)$. More precisely let us consider for some set of soft classification rules $\{f_\theta : X \to \mathbb{R} ; \theta \in \Theta\}$ the kernel $K(x, x') = \int_{\theta \in \Theta} f_\theta(x) f_\theta(x') \pi(d\theta)$.
In this setting, the Support Vector Machine applied to the training set Z = (x i , y i ) N i=1 has the form
and, if this is too burdensome to compute, we can replace it with some finite approximation
where the set {θ k , k = 1, . . . , m} and the weights {w k , k = 1, . . . , m} are computed in some suitable way from the set Z ′ = (x i , y i ) i,αi>0 of support vectors of f Z . For instance, we can draw {θ k , k = 1, . . . , m} at random according to the probability distribution proportional to for any support index i such that α i > 0.
Alternatively, given $Z'$, we can select a finite set of features $\Theta' \subset \Theta$ such that $Z'$ is $K_{\Theta'}$-separable, where $K_{\Theta'}(x, x') = \sum_{\theta \in \Theta'} f_\theta(x) f_\theta(x')$, and consider the Support Vector Machines $f_{Z'}$ built with the kernel $K_{\Theta'}$. As soon as $\Theta'$ is chosen as a function of $Z'$ only, Theorem 3.3.3 (page 125) applies and provides some level of confidence for the risk of $f_{Z'}$.
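A kernel built from a finite set of features is simply the scalar product of the corresponding feature vectors, as the following minimal sketch shows. The soft rules $f_\theta(x) = \tanh(x - \theta)$ and the values of $\theta$ are hypothetical and serve only as an example.

```python
import math


def feature_kernel(features):
    """Return the kernel (x, x') -> sum over the features f of f(x) f(x')."""
    def kernel(x, x_prime):
        return sum(f(x) * f(x_prime) for f in features)
    return kernel


def make_feature(theta):
    """Hypothetical soft classification rule on the real line."""
    return lambda x: math.tanh(x - theta)


selected_thetas = (-1.0, 0.0, 1.0)
features = [make_feature(theta) for theta in selected_thetas]
kernel = feature_kernel(features)

value = kernel(0.3, -0.7)
feature_vector = lambda x: [f(x) for f in features]
dot_value = sum(a * b for a, b in zip(feature_vector(0.3), feature_vector(-0.7)))
```

Being a scalar product in the feature space, such a kernel is automatically symmetric and positive.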
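Let us consider some set $X$ and some set $S \subset \{0, 1\}^X$ of subsets of $X$. Let $h(S)$ be the Vapnik-Cervonenkis dimension of $S$, defined as $h(S) = \max\{|A| : A \subset X, |A| < \infty, A \cap S = \{0, 1\}^A\}$, where by definition $A \cap S = \{A \cap B : B \in S\}$ and $|A|$ is the number of points in $A$. Let us notice that this definition does not depend on the choice of the reference set $X$. Indeed $X$ can be chosen to be $\bigcup S$, the union of all the sets in $S$, or any bigger set. Let us notice also that for any set $B$, $h(B \cap S) \le h(S)$, the reason being that $A \cap (B \cap S) = B \cap (A \cap S)$.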
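For a finite family of subsets of a finite set, the dimension can be computed by brute force directly from the notion of shattering (a set $A$ is shattered when the traces $A \cap B$, $B \in S$, exhaust all the subsets of $A$). The sketch below does so and applies it to two toy families, intervals and initial segments of $\{0, \dots, 4\}$; the function names are ours, and the enumeration is exponential in the size of the ground set.

```python
from itertools import combinations


def trace(family, subset):
    """Set of traces {A ∩ B : B in the family} of the family on the subset A."""
    a = frozenset(subset)
    return {a & member for member in family}


def is_shattered(family, subset):
    """A is shattered when its trace contains all 2^|A| subsets of A."""
    return len(trace(family, subset)) == 2 ** len(subset)


def vc_dimension(family, ground_set):
    """Largest size of a shattered subset of the ground set (0 if no point is shattered)."""
    best = 0
    for size in range(1, len(ground_set) + 1):
        if any(is_shattered(family, c) for c in combinations(ground_set, size)):
            best = size
        else:
            # Subsets of shattered sets are shattered, so larger sizes cannot succeed.
            break
    return best


ground_set = list(range(5))
intervals = [frozenset(range(i, j)) for i in range(6) for j in range(i, 6)]
initial_segments = [frozenset(range(j)) for j in range(6)]

h_intervals = vc_dimension(intervals, ground_set)
h_segments = vc_dimension(initial_segments, ground_set)
# Restricting every set of the family to B cannot increase the dimension.
b = frozenset({0, 2, 4})
h_restricted = vc_dimension([b & member for member in intervals], ground_set)
```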
This notion of Vapnik-Cervonenkis dimension is useful because, as we will see for Support Vector Machines, it can be computed in some important special cases. Let us prove here as an illustration that $h(S) = d + 1$ when $X = \mathbb{R}^d$ and $S$ is made of all the half spaces:
Proof. Let $(e_i)_{i=1}^{d+1}$ be the canonical basis of $\mathbb{R}^{d+1}$, and let $X$ be the affine subspace it generates, which can be identified with $\mathbb{R}^d$. For any $(\epsilon_i)_{i=1}^{d+1} \in \{-1, +1\}^{d+1}$, let $w = \sum_{i=1}^{d+1} \epsilon_i e_i$ and $b = 0$. The half space $A_{w,b} \cap X$ is such that $\{e_i ; i = 1, \dots, d+1\} \cap (A_{w,b} \cap X) = \{e_i ; \epsilon_i = +1\}$. This proves that $h(S) \ge d + 1$.
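The construction used for this lower bound can be replayed on a computer. The sketch below assumes that the half space is $A_{w,b} = \{x : \langle w, x \rangle \ge b\}$ (the sign convention is our assumption) and checks that the $2^{d+1}$ sign vectors produce $2^{d+1}$ different traces on the canonical basis.

```python
from itertools import product


def halfspace_trace(signs):
    """Indices i such that e_i lies in {x : <w, x> >= 0}, where w = sum_i signs_i e_i."""
    size = len(signs)
    w = list(signs)
    basis = [[1 if j == i else 0 for j in range(size)] for i in range(size)]
    return frozenset(
        i for i, e in enumerate(basis) if sum(a * b for a, b in zip(w, e)) >= 0
    )


def basis_is_shattered(d):
    """Check that the d + 1 canonical basis vectors are shattered by these half spaces."""
    traces = {halfspace_trace(signs) for signs in product([-1, 1], repeat=d + 1)}
    return len(traces) == 2 ** (d + 1)


example_trace = halfspace_trace((1, -1, 1))
shattered_in_dimension_3 = basis_is_shattered(3)
```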
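To prove that $h(S) \le d + 1$, we have to show that for any set $A \subset \mathbb{R}^d$ of size $|A| = d + 2$, there is $B \subset A$ such that $B \notin (A \cap S)$. Obviously this will be the case if the convex hulls of $B$ and $A \setminus B$ have a non-empty intersection: indeed if a hyperplane separates two sets of points, it also separates their convex hulls. As $|A| > d + 1$, $A$ is affine dependent: there is $(\lambda_x)_{x \in A} \in \mathbb{R}^{d+2} \setminus \{0\}$ such that $\sum_{x \in A} \lambda_x x = 0$ and $\sum_{x \in A} \lambda_x = 0$. Taking $B = \{x \in A : \lambda_x > 0\}$, the two sides of the identity $\sum_{x \in B} \lambda_x x = \sum_{x \in A \setminus B} (-\lambda_x) x$ have the same positive total weight $\sum_{x \in B} \lambda_x$, so that after normalization they define a point common to the convex hulls of $B$ and $A \setminus B$.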
Obviously $h(S \cap X') \le h(S)$. Moreover $h(S' \cap X') = h(S') - 1$, because if $A \subset X'$ is shattered by $S'$ (or equivalently by $S' \cap X'$), then $A \cup \{x\}$ is shattered by $S'$ (we say that $A$ is shattered by $S$ when $A \cap S = \{0, 1\}^A$). Using the induction hypothesis, we then see that $|S \cap X'| \le \Phi$
Differentiating the right-hand side in $\lambda$ shows that its minimal value is $\exp\bigl[-nK(\frac{h}{n}, \frac{1}{2})\bigr]$, where $K(p, q) = p \log(\frac{p}{q}) + (1 - p) \log(\frac{1-p}{1-q})$ is the Kullback divergence function between two Bernoulli distributions $B_p$ and $B_q$ of parameters $p$ and $q$. Indeed the optimal value $\lambda^*$ of $\lambda$ is such that $h = n \frac{E[\sigma_1 \exp(-\lambda^* \sigma_1)]}{E[\exp(-\lambda^* \sigma_1)]} = n B_{h/n}(\sigma_1)$.
Therefore, using the fact that two Bernoulli distributions with the same expectations are equal, $\log E[\exp(-\lambda^* \sigma_1)] = -\lambda^* B_{h/n}(\sigma_1) - K(B_{h/n}, B_{1/2}) = -\lambda^* \frac{h}{n} - K(\frac{h}{n}, \frac{1}{2})$.
The announced result then follows from the identity $H(p) = \log(2) - K(p, \frac{1}{2}) = p \log(p^{-1}) + (1 - p) \log(1 + \frac{p}{1-p}) \le p \bigl[\log(p^{-1}) + 1\bigr]$.
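The chain of inequalities behind this computation can be checked numerically. The sketch below assumes that the quantity being bounded is the binomial sum $\sum_{k=0}^h \binom{n}{k}$, as the Chernoff argument with fair Bernoulli variables $\sigma_i$ suggests, and compares its logarithm with $nH(h/n)$ and with $h[\log(n/h) + 1]$ for $1 \le h \le n/2$. The function names are ours.

```python
import math


def kl_bernoulli(p, q):
    """Kullback divergence K(p, q) between Bernoulli distributions, with 0 log 0 = 0."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total


def entropy(p):
    """Shannon entropy H(p) = log 2 - K(p, 1/2) of a Bernoulli distribution, in nats."""
    return math.log(2) - kl_bernoulli(p, 0.5)


def log_binomial_sum(n, h):
    """Logarithm of sum_{k <= h} binom(n, k)."""
    return math.log(sum(math.comb(n, k) for k in range(h + 1)))


def bounds(n, h):
    """Return (log binomial sum, n H(h/n), h (log(n/h) + 1))."""
    p = h / n
    return log_binomial_sum(n, h), n * entropy(p), h * (math.log(n / h) + 1)


exact, entropy_bound, simple_bound = bounds(20, 5)
```

For $n = 20$ and $h = 5$ the binomial sum is 21700, and the three quantities are increasing in the order in which they are returned.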
The proof of the following theorem was suggested to us by a similar proof presented in Cristianini et al. (2000). Let us also introduce the empirical variance of (x i ) n i=1 ,
In this case and with this notation,
(4.5) $\frac{\mathrm{Var}(x_1, \dots, x_n)}{\gamma^2} \ge \begin{cases} n - 1 & \text{when } n \text{ is even,} \\ (n - 1) \frac{n^2 - 1}{n^2} & \text{when } n \text{ is odd.} \end{cases}$
Moreover, equality is reached when $\gamma$ is optimal, $b_i = 0$, $i = 1, \dots, n$, and $(x_1, \dots, x_n)$ is a regular simplex (i.e. when $2\gamma$ is the minimum distance between the convex hulls of any two subsets of $\{x_1, \dots, x_n\}$ and $\|x_i - x_j\|$ does not depend on $i \neq j$).
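The equality case can be verified exactly on the regular simplex formed by the canonical basis vectors of $\mathbb{R}^n$. The sketch below assumes that the empirical variance is $\frac{1}{n} \sum_{i=1}^n \|x_i - \bar{x}\|^2$, where $\bar{x}$ is the mean of the points, and uses the fact that the squared distance between two complementary faces with $k$ and $n - k$ vertices is $\frac{1}{k} + \frac{1}{n-k}$, reached between their barycentres. Both the definition of the variance and the function names are assumptions made for this illustration.

```python
from fractions import Fraction


def empirical_variance(points):
    """Mean squared distance of the points to their barycentre."""
    n = len(points)
    dim = len(points[0])
    centre = [sum(p[j] for p in points) / n for j in range(dim)]
    return sum(sum((p[j] - centre[j]) ** 2 for j in range(dim)) for p in points) / n


def simplex_ratio(n):
    """Var / gamma^2 for the simplex e_1, ..., e_n, with 2 gamma the smallest distance between faces."""
    vertices = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    variance = empirical_variance(vertices)
    # Squared distance between complementary faces with k and n - k vertices.
    min_sq_dist = min(Fraction(1, k) + Fraction(1, n - k) for k in range(1, n))
    gamma_sq = min_sq_dist / 4
    return variance / gamma_sq


def theorem_value(n):
    """Right-hand side of (4.5)."""
    if n % 2 == 0:
        return Fraction(n - 1)
    return Fraction((n - 1) * (n * n - 1), n * n)


ratios = {n: simplex_ratio(n) for n in range(2, 9)}
```

For instance the equilateral triangle ($n = 3$) gives the ratio $16/9$, which is the value of the right-hand side of (4.5) for $n = 3$.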
Proof. Let $(s_i)_{i=1}^n \in \mathbb{R}^n$ be such that $\sum_{i=1}^n s_i = 0$. Let $\sigma$ be a uniformly distributed random variable with values in $\mathfrak{S}_n$, the set of permutations of the first $n$ integers $\{1, \dots, n\}$. By assumption, for any value of $\sigma$, there is an affine function $g_{w,b} \in H$ such that $\min_{i=1,\dots,n} \bigl(g_{w,b}(x_i) - b_i\bigr) \bigl(2 \mathbb{1}(s_{\sigma(i)} > 0) - 1\bigr) \ge \gamma$.
where E is the expectation with respect to the random permutation σ. On the other hand
i =j E(s σ(i) s σ(j) ) x i , x j .
In the same way, for any $i \neq j$, This can be used with $s_i = \mathbb{1}(i \le \frac{n}{2}) - \mathbb{1}(i > \frac{n}{2})$ in the case when $n$ is even and
2 ) in the case when n is odd, to establish the first inequality (4.5) of the theorem.
Proof. Let F = {f 1 , . . . , f mn(b-1)(b-2) } be some separated set of functions of size mn(b -1)(b -2). For any pair (f 2i-1 , f 2i ), i = 1, . . . , mn(b -1)(b -2)/2, there is x i ∈ X such that |f 2i-1 (x i ) -f 2i (x i )| ≥ 2. Since |X| = n, there is x ∈ X such that is some pair (y 1 , y 2 ), such that 1 ≤ y 1 < y 2 ≤ b and such that i∈I ½({y 1 , y 2 } = {f 2i-1 (x), f 2i (x)}) ≥ m. Let J = i ∈ I : {f 2i-1 (x), f 2i (x)} = {y 1 , y 2 } . Let
Moreover the restrictions of the functions of $F_1$ to $X \setminus \{x\}$ are separated, and it is the same with $F_2$. Thus $F_1$ strongly shatters at least $t(m, n - 1)$ pairs $(A, s)$ such that $A \subset X \setminus \{x\}$ and it is the same with $F_2$. Finally, if the pair $(A, s)$ where $A \subset X \setminus \{x\}$ is both shattered by $F_1$ and $F_2$, then $F_1 \cup F_2$ shatters also $(A \cup \{x\}, s')$ where $s'(x') = s(x')$ for any $x' \in A$ and $s'(x) = \lfloor \frac{y_1 + y_2}{2} \rfloor$. Thus $F_1 \cup F_2$, and therefore $F$, shatters at least $2t(m, n - 1)$ pairs $(A, s)$.
Resuming the proof of Lemma 4.2.6, let us choose for $r$ the smallest integer such that $2^r > \sum_{i=1}^h \binom{n}{i} (b - 2)^i$, which is no greater than
In order to apply this combinatorial lemma to Support Vector Machines, let us consider now the case of separating hyperplanes in $\mathbb{R}^d$ (the generalization to Support Vector Machines being straightforward). Assume that $X = \mathbb{R}^d$ and $Y = \{-1, +1\}$. For any sample $(X)$
in an exchangeable way. Similarly to Theorem 3.2.3 (page 117), it can be proved that for any partially exchangeable probability distribution $P \in \mathcal{M}_+^1(\Omega)$, with $P$ probability at least $1 - \epsilon$, for any $f_{w,b} \in F$,
Let us remark that
This proves the following theorem.
Properly speaking this theorem is not a margin bound, but more precisely a margin quantile bound, since it covers the case where some fraction of the training sample falls within the region defined by the margin parameter γ h which optimizes the bound.
As a consequence though, we get a true (weaker) margin bound: with $P$ probability at least $1 - \epsilon$, for any $(w, b) \in \Theta$ such that $\gamma = \min$
This inequality compares favourably with similar inequalities in Cristianini et al. (2000), which moreover do not extend to the margin quantile case as this one does.
Let us also mention that it is easy to circumvent the fact that $R$ is not observed when the test set $X_{N+1}^{(k+1)N}$ is not observed. Indeed, we can consider the sample obtained by projecting $X_1^{(k+1)N}$ on some ball of fixed radius $R_{\max}$, putting
We can further consider an atomic prior distribution $\nu \in \mathcal{M}_+^1(\mathbb{R}_+)$ bearing on $R_{\max}$, to obtain a uniform result through a union bound. As a consequence of the previous theorem, we have
Corollary 4.2.9. For any atomic prior $\nu \in \mathcal{M}_+^1(\mathbb{R}_+)$, for any partially exchangeable probability measure $P \in \mathcal{M}_+^1(\Omega)$, with $P$ probability at least $1 - \epsilon$, for any $(w, b) \in \Theta$, any $R_{\max} \in \mathbb{R}_+$,
Let us remark that $t_{R_{\max}}(X_i) = X_i$, $i = N+1, \dots, (k+1)N$, as soon as we consider only the values of $R_{\max}$ not smaller than $\max_{i=N+1,\dots,(k+1)N} \|X_i\|$ in this corollary. Thus we obtain a bound on the transductive generalization error of the unthresholded classification rule $2 \mathbb{1}\bigl(g_{w,b}(X_i) \ge 0\bigr) - 1$, as well as some incentive to replace it with a thresholded rule when the value of $R_{\max}$ minimizing the bound falls below $\max_{i=N+1,\dots,(k+1)N} \|X_i\|$.
We see that the number of operations needed to compute $\pi_{\exp(-\lambda r)}$ is proportional to $|T| \times 2^h \times |Y| \le (N+1)^h 2^h |Y|$. An exact computation will therefore be feasible only for small values of $N$ and $h$. For higher values, a Monte Carlo approximation of this sum will have to be performed instead.
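When the parameter set is small enough, the Gibbs posterior can be computed by plain enumeration. The sketch below treats a deliberately simple case which is our own simplification: one real-valued pattern, a finite list of candidate thresholds, two orientations and a uniform prior, so that $\pi_{\exp(-\lambda r)}$ gives each parameter a weight proportional to $\exp(-\lambda r)$, where $r$ is its empirical error rate. All names and data are hypothetical.

```python
import math


def empirical_error(threshold, sign, xs, ys):
    """Error rate of the rule predicting `sign` when x >= threshold and `-sign` otherwise."""
    mistakes = sum(
        1 for x, y in zip(xs, ys) if (sign if x >= threshold else -sign) != y
    )
    return mistakes / len(xs)


def gibbs_posterior(xs, ys, thresholds, lam):
    """Weights proportional to exp(-lam * r) over all (threshold, sign) pairs, uniform prior."""
    params = [(t, s) for t in thresholds for s in (-1, 1)]
    log_weights = [-lam * empirical_error(t, s, xs, ys) for t, s in params]
    shift = max(log_weights)  # subtracting the maximum avoids numerical underflow
    weights = [math.exp(v - shift) for v in log_weights]
    total = sum(weights)
    return {p: w / total for p, w in zip(params, weights)}


# Toy one-dimensional sample (hypothetical).
xs = [0.1, 0.4, 0.6, 0.9]
ys = [-1, -1, 1, 1]
thresholds = [0.0, 0.5, 1.0]

posterior = gibbs_posterior(xs, ys, thresholds, lam=10.0)
best = max(posterior, key=posterior.get)
flat = gibbs_posterior(xs, ys, thresholds, lam=0.0)
```

The cost of this enumeration is the number of parameters times the sample size, which is the reason why it has to be replaced by a Monte Carlo approximation when the number of thresholded coordinates grows.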
If we want to compute the bound provided by Theorem 2.1.3 (page 54) or by Theorem 2.2.2 (page 69), we need also to compute, for any fixed parameter $\theta \in \Theta$, quantities of the type $\pi_{\exp(-\lambda r)}\bigl\{\exp[\xi m'(\cdot, \theta)]\bigr\} = \pi_{\exp(-\lambda r)}\bigl\{\exp[\xi \rho_\theta(m')]\bigr\}$, $\lambda, \xi \in \mathbb{R}_+$.
We need to introduce
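Similarly to what has been done previously, we obtain
This is all we need to compute $B(\rho_\theta, \beta, \gamma)$ (and also $B(\pi_{\exp(-\lambda r)}, \beta, \gamma)$) in Theorem 2.1.3 (page 54), using the approximation
$\log \pi_{\exp(-\lambda_1 r)}\bigl\{\exp[\xi \pi_{\exp(-\lambda_2 r)}(m')]\bigr\} \le \log \pi_{\exp(-\lambda_1 r)}\bigl\{\exp[\xi m'(\cdot, \theta)]\bigr\} + \xi \pi_{\exp(-\lambda_2 r)}\bigl[m'(\cdot, \theta)\bigr]$, $\xi \ge 0$.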
Let us also explain how to apply the posterior distribution ρ (t,a) , in other words our randomized estimated classification rule, to a new pattern X N +1 :
Let us define for short ∆ t (c) = t ′ ∈ ∆ t : ½(X j N +1 ≥ t ′ j ) The bound for the transductive counterpart to Theorems 2.1.3 (page 54) or 2.2.2 (page 69), obtained as explained page 115, can be computed as in the inductive case, from these two partition functions and the above entropy computation.
Let us mention finally that, using the same notation as in the inductive case, .
To conclude this appendix on classification by thresholding, note that similar factorized computations are feasible in the important case of classification trees. This can be achieved using some variant of the context tree weighting method discovered by Willems et al. (1995) and successfully used in lossless compression theory. The interested reader can find a description of this algorithm applied to classification trees in Catoni (2004, page 62).
This theorem provides, using a union bound argument to further optimize the parameters, an empirical bound for $\nu_1 \rho_1(R) - \nu_2 \rho_2(R)$, which can serve to build a selection algorithm exactly in the same way as what was done in Theorem 2.2.4 (page 73). This represents the highest degree of sophistication that we will achieve in this monograph, as far as model selection is concerned: this theorem shows that it is indeed possible to derive a selection scheme in which localization is performed in two steps and in which the localization of the model selection itself, as opposed to the localization of the estimation in each model, includes a variance term as well as a bias term, so that it should be possible to localize the choice of nested models, something that would not have been feasible with the localization techniques exposed in the previous sections of this study.
We should point out however that more sophisticated does not necessarily mean more efficient: as the reader may have noticed, sophistication comes at a price, in terms of the complexity of the estimation schemes, with some possible loss of accuracy in the constants that can mar the benefits of using an asymptotically more efficient method for small sample sizes.
We will do the hurried reader a favour: we will not launch into a study of the theoretical properties of this selection algorithm, although it is clear that all the tools needed are at hand! We would like, as a conclusion to this chapter, to put forward a simple idea: this approach of model selection revolves around entropy estimates concerned with the divergence of posterior distributions with respect to localized prior distributions. Moreover, this localization of the prior distribution is more effectively done in several steps in some situations, and it is worth mentioning that these situations include the typical case of selection from a family of parametric models.
Finally, the whole story relies upon estimating the relative generalization error rate of one posterior distribution with respect to some local prior distribution as well as with respect to another posterior distribution, because these relative rates can be estimated more accurately than absolute generalization error rates, at least as soon as no classification model of reasonable size provides a good match to the training sample, meaning that the classification problem is either difficult or noisy.
