Predictive Hypothesis Identification

Reading time: 5 minutes

📝 Original Info

  • Title: Predictive Hypothesis Identification
  • ArXiv ID: 0809.1270
  • Date: 2009-12-30
  • Authors: Researchers from original ArXiv paper

📝 Abstract

While statistics focusses on hypothesis testing and on estimating (properties of) the true sampling distribution, in machine learning the performance of learning algorithms on future data is the primary issue. In this paper we bridge the gap with a general principle (PHI) that identifies hypotheses with best predictive performance. This includes predictive point and interval estimation, simple and composite hypothesis testing, (mixture) model selection, and others as special cases. For concrete instantiations we will recover well-known methods, variations thereof, and new ones. PHI nicely justifies, reconciles, and blends (a reparametrization invariant variation of) MAP, ML, MDL, and moment estimation. One particular feature of PHI is that it can genuinely deal with nested hypotheses.


📄 Full Content

Consider data D sampled from some distribution p(D|θ) with unknown θ ∈ Ω. The likelihood function or the posterior contains the complete statistical information of the sample. Often this information needs to be summarized or simplified for various reasons (comprehensibility, communication, storage, computational efficiency, mathematical tractability, etc.). Parameter estimation, hypothesis testing, and model (complexity) selection can all be regarded as ways of summarizing this information, albeit in different ways or contexts. The posterior might either be summarized by a single point Θ = {θ} (e.g. ML or MAP or mean or stochastic model selection), or by a convex set Θ ⊆ Ω (e.g. confidence or credible interval), or by a finite set of points Θ = {θ_1, …, θ_l} (mixture models) or a sample of points (particle filtering), or by the mean and covariance matrix (Gaussian approximation), or by more general density estimation, or in a few other ways [BM98, Bis06]. I have roughly sorted the methods in increasing order of complexity. This paper concentrates on set estimation, which includes (multiple) point estimation and hypothesis testing as special cases, henceforth jointly referred to as "hypothesis identification" (this nomenclature seems uncharged and naturally includes what we will do: estimation and testing of simple and complex hypotheses, but not density estimation). We will briefly comment on generalizations beyond set estimation at the end.
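To make the point-summary options concrete, here is a minimal toy sketch (not from the paper) of three common single-point summaries of a posterior in a Beta-Bernoulli model. The numbers (n = 10 observations, k = 7 successes, Beta(2, 2) prior) are hypothetical choices for illustration.

```python
# Toy Beta-Bernoulli example: three single-point summaries of the posterior.
# Hypothetical setup: n = 10 coin flips with k = 7 heads, Beta(a, b) prior.
n, k = 10, 7
a, b = 2.0, 2.0  # prior pseudo-counts (assumed for illustration)

# Posterior is Beta(k + a, n - k + b); each summary picks a different point.
ml   = k / n                          # maximum likelihood estimate
map_ = (k + a - 1) / (n + a + b - 2)  # posterior mode (MAP)
mean = (k + a) / (n + a + b)          # posterior mean

print(ml, map_, mean)  # three different "summaries" of the same posterior
```

Note that the three summaries disagree (0.7, 0.667, 0.643 here) even in this simple model, which is part of the motivation for asking which summary predicts best.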

Desirable properties. There are many desirable properties any hypothesis identification principle ideally should satisfy. It should

• lead to good predictions (that's what models are ultimately for),
• be broadly applicable,
• be analytically and computationally tractable,
• be defined and make sense also for non-i.i.d. and non-stationary data,
• be reparametrization and representation invariant,
• work for simple and composite hypotheses,
• work for classes containing nested and overlapping hypotheses,
• work in the estimation, testing, and model selection regime,
• reduce in special cases (approximately) to existing other methods.

Here we concentrate on the first item, and will show that the resulting principle nicely satisfies many of the other items.

The main idea. We address the problem of identifying hypotheses (parameters/models) with good predictive performance head on. If θ_0 is the true parameter, then p(x|θ_0) is obviously the best prediction of the m future observations x. If we don't know θ_0 but have prior belief p(θ) about its distribution, the predictive distribution p(x|D) based on the past n observations D (which averages the likelihood p(x|θ) over θ with posterior weight p(θ|D)) is by definition the best Bayesian predictor. Often we cannot use full Bayes (for reasons discussed above) but predict with hypothesis H = {θ ∈ Θ}, i.e. use p(x|Θ) as prediction. The closer p(x|Θ) is to p(x|D) or p(x|θ_0, D), the better is H's prediction (by definition), where we can measure closeness with some distance function d. Since x and θ_0 are (assumed to be) unknown, we have to sum or average over them.

(Un)related work. The general idea of inference by maximizing predictive performance is not new [Gei93]. Indeed, in the context of model (complexity) selection it is prevalent in machine learning and implemented primarily by empirical cross-validation procedures and variations thereof [Zuc00] or by minimizing test and/or train set (generalization) bounds; see [Lan02] and references therein. There are also a number of statistics papers on predictive inference; see [Gei93] for an overview and older references, and [BB04, MGB05] for newer references. Most of them deal with distribution-free methods based on some form of cross-validation discrepancy measure, and often focus on model selection. A notable exception is MLPD [LF82], which maximizes the predictive likelihood including future observations. The full decision-theoretic setup, in which a decision based on D leads to a loss depending on x, and the expected loss is minimized, has been studied extensively [BM98, Hut05], but scarcely in the context of hypothesis identification. On the natural progression of estimation → prediction → action, approximating the predictive distribution by minimizing (1) lies between traditional parameter estimation and optimal decision making. Formulation (1) is quite natural but I haven't seen it elsewhere. Indeed, besides ideological similarities, the papers above bear no resemblance to this work.

Contents. The main purpose of this paper is to investigate the predictive losses above and in particular their minima, i.e. the best predictor in H. Section 2 introduces notation and global assumptions, and illustrates PHI on a simple example. This also shows a shortcoming of MAP and ML estimation. Section 3 formally states PHI, possible distance and loss functions, and their minima. In Section 4, I study exact properties of PHI: invariances, sufficient statistics, and equivalences. Section 5 investigates

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
