Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning

Reading time: 7 minutes
...

📝 Original Info

  • Title: Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning
  • ArXiv ID: 0712.0248
  • Date: 2007-12-04
  • Authors: Olivier Catoni

📝 Abstract

This monograph deals with adaptive supervised classification, using tools borrowed from statistical mechanics and information theory, stemming from the PAC-Bayesian approach pioneered by David McAllester and applied to a conception of statistical learning theory forged by Vladimir Vapnik. Using convex analysis on the set of posterior probability measures, we show how to obtain local measures of the complexity of the classification model involving the relative entropy of posterior distributions with respect to Gibbs posterior measures. We then discuss relative bounds, comparing the generalization error of two classification rules, and show how the margin assumption of Mammen and Tsybakov can be replaced with an empirical measure of the covariance structure of the classification model. We show how to associate with any posterior distribution an effective temperature relating it to the Gibbs prior distribution with the same level of expected error rate, and how to estimate this effective temperature from data, resulting in an estimator whose expected error rate converges at the best possible power of the sample size, adaptively under any margin and parametric complexity assumptions. We describe and study an alternative selection scheme based on relative bounds between estimators, and present a two-step localization technique which can handle the selection of a parametric model from a family of such models. We show how to extend systematically all the results obtained in the inductive setting to transductive learning, and use this to improve Vapnik's generalization bounds, extending them to the case when the sample is made of independent non-identically distributed pairs of patterns and labels. Finally, we briefly review the construction of Support Vector Machines and show how to derive generalization bounds for them, measuring the complexity either through the number of support vectors or through the value of the transductive or inductive margin.

📄 Full Content

Among the possible approaches to pattern recognition, statistical learning theory has received a lot of attention in the last few years. Although a realistic pattern recognition scheme involves data pre-processing and post-processing that need a theory of their own, a central role is often played by some kind of supervised learning algorithm. This central building block is the subject we are going to analyse in these notes.

Accordingly, we assume that we have prepared in some way or another a sample of $N$ labelled patterns $(X_i, Y_i)_{i=1}^N$, where $X_i$ ranges in some pattern space $\mathcal{X}$ and $Y_i$ ranges in some finite label set $\mathcal{Y}$. We also assume that we have devised our experiment in such a way that the couples of random variables $(X_i, Y_i)$ are independent (but not necessarily equidistributed). Here, randomness should be understood to come from the way the statistician has planned his experiment. He may for instance have drawn the $X_i$s at random from some larger population of patterns the algorithm is meant to be applied to in a second stage. The labels $Y_i$ may have been set with the help of some external expertise (which may itself be faulty or contain some amount of randomness, so we do not assume that $Y_i$ is a function of $X_i$).

We will use tools originally developed in statistical mechanics to describe particle systems with many degrees of freedom. More specifically, the sets of classification rules will be described by Gibbs measures defined on parameter sets and depending on the observed sample value. A Gibbs measure is the special kind of probability measure used in statistical mechanics to describe the state of a particle system driven by a given energy function at some given temperature. Here, Gibbs measures will emerge as minimizers of the average loss value under entropy (or mutual information) constraints. Entropy itself, more precisely the Kullback divergence function between probability measures, will emerge in conjunction with the use of exponential deviation inequalities: indeed, the log-Laplace transform may be seen as the Legendre transform of the Kullback divergence function, as stated in Lemma 1.1.3 (page 4) of the monograph.
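The duality between the log-Laplace transform and the Kullback divergence can be checked numerically on a finite parameter set: the Gibbs posterior $\pi_\beta(\theta) \propto \pi(\theta)\,e^{-\beta r(\theta)}$ minimizes the free energy $\beta\,\rho[r] + \mathcal{K}(\rho, \pi)$ over all distributions $\rho$, and the minimum equals $-\log \pi[\exp(-\beta r)]$. The sketch below uses hypothetical error rates `r` drawn at random purely for illustration; it is not code from the monograph.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite parameter set: K classification rules with
# empirical error rates r(theta) drawn at random for illustration.
K = 8
r = rng.uniform(0.0, 0.5, size=K)        # error rate of each rule
prior = np.full(K, 1.0 / K)              # uniform prior pi on the rules
beta = 5.0                               # inverse temperature

# Gibbs posterior: pi_beta(theta) proportional to pi(theta) * exp(-beta * r(theta))
w = prior * np.exp(-beta * r)
gibbs = w / w.sum()

def kl(rho, pi):
    """Kullback divergence K(rho, pi) for distributions on a finite set."""
    mask = rho > 0
    return float(np.sum(rho[mask] * np.log(rho[mask] / pi[mask])))

def objective(rho):
    """Free energy: beta * rho[r] + K(rho, pi)."""
    return beta * float(rho @ r) + kl(rho, prior)

# The Gibbs measure attains the infimum, whose value is the
# log-Laplace transform -log pi[exp(-beta r)] (Legendre duality).
log_laplace = -np.log(float(prior @ np.exp(-beta * r)))
assert abs(objective(gibbs) - log_laplace) < 1e-10

# Any other distribution has a strictly larger free energy.
for _ in range(100):
    rho = rng.dirichlet(np.ones(K))
    assert objective(rho) >= log_laplace - 1e-10
```

The identity holds because $\beta\,\rho[r] + \mathcal{K}(\rho,\pi) = \mathcal{K}(\rho, \pi_\beta) - \log \pi[\exp(-\beta r)]$, and the divergence term vanishes exactly at $\rho = \pi_\beta$.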

To fix notation, let $(X_i, Y_i)_{i=1}^N$ be the canonical process on $\Omega = (\mathcal{X} \times \mathcal{Y})^N$ (that is, the coordinate process). Let the pattern space be provided with a sigma-algebra $\mathcal{B}$, turning it into a measurable space $(\mathcal{X}, \mathcal{B})$. On the finite label space $\mathcal{Y}$, we will consider the trivial algebra $\mathcal{B}'$ made of all its subsets. Let $\mathcal{M}^1_+\big[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\big]$ be our notation for the set of probability measures (i.e. of positive measures of total mass equal to 1) on the measurable space $\big((\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\big)$. Once some probability distribution $\mathbb{P} \in \mathcal{M}^1_+\big[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\big]$ is chosen, the canonical process turns into the canonical realization of a stochastic process modelling the observed sample (also called the training set).

We will assume that $\mathbb{P} = \bigotimes_{i=1}^N \mathbb{P}_i$, where for each $i = 1, \dots, N$, $\mathbb{P}_i \in \mathcal{M}^1_+(\mathcal{X} \times \mathcal{Y}, \mathcal{B} \otimes \mathcal{B}')$, to reflect the assumption that we observe independent pairs of patterns and labels. We will also assume that we are provided with some indexed set of possible classification rules $\{f_\theta : \mathcal{X} \to \mathcal{Y} \,;\ \theta \in \Theta\}$,

where $(\Theta, \mathcal{T})$ is some measurable index set. Assuming some indexation of the classification rules is just a matter of presentation. Although it leads to heavier notation, it allows us to integrate over the space of classification rules as well as over $\Omega$, using the usual formalism of multiple integrals. For this purpose, we will assume that $(\theta, x) \mapsto f_\theta(x) : (\Theta \times \mathcal{X}, \mathcal{T} \otimes \mathcal{B}) \to (\mathcal{Y}, \mathcal{B}')$ is a measurable function.

In many cases, as already mentioned, $\Theta = \bigcup_{m \in M} \Theta_m$ will be a finite (or more generally countable) union of subspaces, dividing the classification model $\mathcal{R}_\Theta = \bigcup_{m \in M} \mathcal{R}_{\Theta_m}$ into a union of sub-models. The importance of introducing such a structure has been put forward by V. Vapnik as a way to avoid making strong hypotheses on the distribution $\mathbb{P}$ of the sample. If neither the distribution of the sample nor the set of classification rules were constrained, it is well known that no statistical inference would be possible. Considering a family of sub-models is a way to provide for adaptive classification, where the choice of the model depends on the observed sample. Restricting the set of classification rules is more realistic than restricting the distribution of patterns: the classification rules are a processing tool left to the choice of the statistician, whereas the distribution of the patterns is not fully under his control, except for some planning of the learning experiment which may enforce some weak properties like independence, but not the precise shapes of the marginal distributions $\mathbb{P}_i$, which are as a rule unknown distributions on some high-dimensional space.
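A minimal sketch of adaptive selection among nested sub-models $\Theta_m$: histogram classifiers on $2^m$ bins, with the bin count chosen by minimizing the empirical error plus a complexity penalty. The data-generating process, the $\sqrt{2^m/N}$ penalty, and all function names below are illustrative assumptions, not the monograph's localized bounds.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic sample: X uniform on [0, 1], label noisily follows a smooth score.
N = 2000
X = rng.uniform(0.0, 1.0, size=N)
Y = (rng.uniform(size=N) < 0.5 + 0.4 * np.sin(4 * np.pi * X)).astype(int)

def histogram_rule_errors(X, Y, m):
    """Sub-model Theta_m: majority-vote rules on 2**m equal bins of [0, 1].

    Returns the per-sample error indicators of the empirically best rule.
    """
    bins = np.minimum((X * 2**m).astype(int), 2**m - 1)
    votes = np.zeros(2**m)
    counts = np.zeros(2**m)
    np.add.at(votes, bins, Y)
    np.add.at(counts, bins, 1)
    labels = (votes >= counts / 2).astype(int)   # majority label per bin
    return labels[bins] != Y

# Adaptive choice of m: empirical error plus a crude complexity penalty
# (an illustrative sqrt(dim / N) term, standing in for a real bound).
best_m, best_score = None, np.inf
for m in range(8):
    err = histogram_rule_errors(X, Y, m).mean()
    score = err + np.sqrt(2**m / N)
    if score < best_score:
        best_m, best_score = m, score
print(best_m, best_score)
```

The penalized score trades off fit against sub-model size, so the selected $m$ depends on the observed sample, which is exactly the adaptivity discussed above.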

In these notes, we will concentrate on general issues concerned with a natural measure of risk, namely the expected error rate of each classification rule $f_\theta$, expressed as $R(\theta) = \frac{1}{N} \sum_{i=1}^N \mathbb{P}_i\big[f_\theta(X_i) \neq Y_i\big]$.

As this quantity is unobserved, we will be led to work with the corresponding empirical error rate $r(\theta) = \frac{1}{N} \sum_{i=1}^N \mathbf{1}\big[f_\theta(X_i) \neq Y_i\big]$.
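The relationship between the two error rates can be illustrated with a toy model where the expected rate is available in closed form: threshold rules $f_\theta(x) = \mathbf{1}\{x \geq \theta\}$ on $[0,1]$ with labels flipped at a known noise level. Everything below (the distribution, the noise level, the function names) is an assumed setup for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: X ~ Uniform[0, 1], clean label 1{X >= 0.5},
# flipped with probability `noise`; rules f_theta(x) = 1{x >= theta}.
N = 5000
noise = 0.1
X = rng.uniform(size=N)
Y = np.where(rng.uniform(size=N) < noise,
             1 - (X >= 0.5), (X >= 0.5)).astype(int)

def empirical_error(theta):
    """r(theta) = (1/N) sum_i 1{f_theta(X_i) != Y_i} -- observable."""
    return float(np.mean((X >= theta).astype(int) != Y))

def expected_error(theta):
    """R(theta) = P(f_theta(X) != Y), in closed form for this toy model."""
    # The mass between theta and 0.5 is misclassified unless the label
    # was flipped; elsewhere only the label noise contributes.
    gap = abs(theta - 0.5)
    return gap * (1 - noise) + (1 - gap) * noise

# The empirical rate concentrates around the expected rate.
for theta in (0.3, 0.5, 0.7):
    assert abs(empirical_error(theta) - expected_error(theta)) < 0.03
```

Here $r(\theta)$ is an average of $N$ independent Bernoulli indicators with mean $R(\theta)$, so the deviation inequalities invoked above control exactly this gap.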

This does not mean that practical learning algorithms will always …

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
