Information Criteria for Deciding between Normal Regression Models

Regression models fitted to data can be assessed on their goodness of fit, though models with many parameters should be disfavored to prevent over-fitting. Statisticians' tools for this are little known to physical scientists. These include the Akaike Information Criterion (AIC), a penalized goodness-of-fit statistic, and the AICc, a variant including a small-sample correction. They entered the physical sciences through being used by astrophysicists to compare cosmological models; e.g., predictions of the distance–redshift relation. The AICc is shown to have been mis-applied, being applicable only if error variances are unknown. If error bars accompany the data, the AIC should be used instead. Erroneous applications of the AICc are listed in an appendix. It is also shown how the variability of the AIC difference between models with a known error variance can be estimated. This yields a significance test that can potentially replace the use of 'Akaike weights' for deciding between such models. Additionally, the effects of model mis-specification are examined. For regression models fitted to data sets without (rather than with) error bars, they are major: the AICc may be shifted by an unknown amount. The extent of this in the fitting of physical models remains to be studied.

Author: Robert S. Maier (rsm@math.arizona.edu), Departments of Mathematics and Physics, and Statistics Program, University of Arizona, Tucson, AZ 85721, USA

1. Introduction

(a) Background and overview

Physical scientists are familiar with the task of fitting a parametric model such as a regression model to a data set, by using maximum likelihood estimation (MLE) or other parameter estimation techniques. But they are less familiar with model selection: deciding among two or more models fitted to the same data in a way that for each model takes account of both its goodness of fit and its number of parameters. To what extent should one attempt to prevent over-fitting, i.e., 'fitting the noise,' by penalizing models with too many parameters? This question of parsimony can be viewed not only as a problem in data analysis, but as one in the philosophy of science (Keuzenkamp et al. 2001). The case when the models being compared are incompatible, i.e., are non-nested in that they are not related by parametric restrictions, is especially vexing. So is the case when they are mis-specified, i.e., do not agree with the 'truth' (the unknown and perhaps infinite-dimensional data-generating process), regardless of what values for their parameters are chosen; so that fitting errors of non-zero mean are present.

Techniques for model selection that penalize over-fitting have been applied in the life sciences, social sciences and econometrics, and several book-length expositions of these techniques by statisticians are available (Sakamoto et al. 1986; McQuarrie & Tsai 1998; Claeskens & Hjort 2008; Konishi & Kitagawa 2008). A fruitful concept is the AIC (Akaike Information Criterion), a certain penalized likelihood or goodness-of-fit statistic (Akaike 1973, reprinted as Akaike 1992).
It is an estimate of the discrepancy, in a sense related to MLE and information theory, between a fitted statistical model and the unknown data-generating process; the latter being statistical also, if measurement uncertainties are incorporated. In simple cases the AIC is effectively a penalized sum of squared prediction errors. By comparing AIC's one can compare models with different numbers of parameters, including incompatible models. But in the absence of a systematic theory of the variability of the AIC statistic, using AIC's to decide between fitted versions of parametric models M₁, M₂ cannot be viewed as a procedure in classical statistical inference, i.e., as a significance test. No p-value, as in a frequentist assessment of the evidence against a null hypothesis, is actually calculated. Instead one simply says, e.g., that if ∆₁₂ := AIC₂ − AIC₁ is less than 2.0, the evidence that M₁ is to be preferred over M₂ is weak; and that if ∆₁₂ is greater than 5.0, it is strong. The 'Akaike weight' exp(−AIC_l/2) is often viewed as an unnormalized probability (in some sense) that M_l is to be preferred (Burnham & Anderson 2002), but this interpretation has not been universally accepted.

A few years ago, the AIC and related criteria (such as the AICc, a variant including a small-sample correction) entered the physical sciences by being introduced into astrophysics (Takeuchi 2000; Liddle 2007). They have been used to compare cosmological models, such as regression models of the distance–redshift relation that characterizes the expansion of the Universe. Unfortunately, in many papers a mistake in data analysis has been made. It can perhaps be attributed to a misreading of the expositions of Burnham & Anderson (2002) and Liddle (2007).

The mistake is this. A data set may be accompanied by error bars (i.e., measurement uncertainties), or not. If the latter, regression model fitting will involve the estimation of a 'nuisance parameter,' namely the unknown variance σ² of the measurement errors. The AICc, which was designed to unbias completely the estimate of the Kullback–Leibler information-theoretic discrepancy provided by the AIC, is appropriate only if σ² is unknown. But data in the physical sciences are typically accompanied by error bars. When assessing statistical models that incorporate known error bars, the AIC and not the AICc should be used.

To show this, we first place the AIC in a general framework that can be used to derive many information-theoretic model-selection statistics. (See § 2.) In § 3a we restrict our focus to linear regression and MLE, and show that the applicability of the AICc is limited as claimed. In Appendix B, papers from the astrophysics literature that have erroneously applied the AICc are listed.

In § 3b we obtain a further result: under reasonable conditions of mis-specification, using the correct statistic (the AIC) to decide between normal regression models M₁, M₂ that incorporate error bars can indeed be viewed as a test of significance. That is, the decision can be made in a classical way. One can calculate a p-value associated to the null hypothesis that M₁, M₂ are equidistant in an information-theoretic sense from the true but unknown model M∗, as opposed to the alternative hypothesis that they are not.
This is because the variability of ∆₁₂ = AIC₂ − AIC₁ can be estimated. For data sets with error bars, this can potentially render Akaike weights obsolete. It is explained how the estimation can be carried out for mis-specified normal linear models, and a hypothesis test based on the estimate is proposed. The test can be extended to non-linear models.

In decisions between statistical models M₁, M₂ that incorporate known error bars, the validity of the uncorrected AIC is unaffected if the models are mis-specified, as is shown in § 3a. This result is unexpected, since the usual derivation of the AIC statistic (and indeed of the AICc) requires that there be nesting and no mis-specification; and its widespread application to non-nested, potentially mis-specified models has in fact been somewhat heuristic.

In § 4 we show that if the data set to which M₁, M₂ are fitted is not accompanied by error bars, problems with the AICc can indeed arise. For a normal linear model fitted to a data set without error bars, we discuss the behavior of the AICc under mis-specification, and obtain an asymptotically exact expression for the resulting shift. If the extent of the mis-specification is unknown, this shift may render the AICc of little value. This fact deserves to be better known.

Besides deriving the AIC and AIC corrections from first principles, we briefly discuss the applicability to normal linear models of such variants as the AIC_γ, which suppresses over-fitting to a greater extent than does the AIC. Many additional variants have appeared in the literature, such as the KIC (Kullback Information Criterion) and KICc (Cavanaugh 1999, 2004), but they are beyond the scope of this paper. In the final section (§ 5), we summarize our results.

(b) AIC basics

A regression model fitted to data can be linear or non-linear, according to its parameter dependence. The linear case is familiar (Bevington & Robinson 2003; Draper & Smith 1998; Weisberg 2005). Suppose the data set comprises y₁, …, yₙ ∈ ℝ; which could be, e.g., the values of a response variable corresponding to n distinct values of an explanatory variable x, chosen by an observer or an experimenter. Suppose that y₁, …, yₙ would depend linearly on parameters β₁, …, β_k ∈ ℝ in the absence of measurement errors or other noise. That is, y = Xβ, where y = (y_i), β = (β_j) are column vectors and X is an n × k design matrix. (It will be assumed throughout that n > k and that X is of full rank, i.e., of rank k.) A statistical model M of the data would then be

\[
y_i = \sum_{j=1}^{k} X_{ij}\,\beta_j + \epsilon_i, \tag{1.1}
\]

where ε₁, …, εₙ are residuals, i.e., errors. In the simplest case the residuals would be taken to be independent. In a (homoscedastic) normal model one would also take ε_i ∼ N(0, σ²), i.e., take each ε_i to be normally distributed with mean zero and a common variance σ². The parameter σ² may be known, or it may be an unknown nuisance parameter that needs to be estimated (which is the case if error bars are not supplied). Note that from a data set ȳ = (ȳ_i) including error bars of differing lengths, i.e., known but differing variances σ₁², …, σₙ², one can obtain a data set y = (y_i) with a known common variance σ² by defining y_i := (σ/σ_i) ȳ_i.
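For concreteness, here is a minimal Python sketch of this rescaling (ours, not the paper's; the array names and target variance are illustrative assumptions). Note that, as in weighted least squares, the rows of the design matrix are rescaled by the same factors, so the rescaled data still satisfy a linear model of the same form with i.i.d. errors.

```python
import numpy as np

def rescale_to_common_variance(y_bar, sigma_i, X, sigma=1.0):
    """Rescale data with per-point error bars sigma_i to a data set with
    common error variance sigma**2, via y_i := (sigma/sigma_i) * ybar_i.
    Rows of the design matrix X are scaled by the same factors."""
    w = sigma / np.asarray(sigma_i)           # per-point scale factors
    y = w * np.asarray(y_bar)                 # common-variance data
    X_scaled = w[:, None] * np.asarray(X)     # matching design matrix
    return y, X_scaled

# Illustrative use (all numbers are made up):
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])     # straight-line model
sigma_i = rng.uniform(0.5, 2.0, size=x.size)  # known, differing error bars
y_bar = X @ np.array([1.0, 3.0]) + sigma_i * rng.standard_normal(x.size)
y, Xs = rescale_to_common_variance(y_bar, sigma_i, X, sigma=1.0)
```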
Whether or not σ² is known, an estimate β̂ = (β̂_j) of β can be computed by MLE, which reduces to ordinary least-squares for any normal linear model with independent, identically distributed (i.i.d.) errors. By a standard calculation, β̂ = (XᵗX)⁻¹Xᵗy. Accompanying the observed data vector y there is then a predicted data vector ŷ := Xβ̂ = Py, where the 'hat' matrix P = X(XᵗX)⁻¹Xᵗ projects onto the column space of X (the estimation space L ⊂ ℝⁿ). The residual sum of squares (RSS) for the fit is the sum of squared errors. That is,

\[
\mathrm{RSS} = (y - \hat y)^t (y - \hat y) = (y - Py)^t (y - Py) = y^t Q y, \tag{1.2}
\]

where Q = Iₙ − P is complementary to P and projects onto the left null space of X (the error space L⊥ ⊂ ℝⁿ).

If σ² is known, so that the parameter vector θ of M is simply β, the standard definition of the AIC for the fitted model is

\[
\mathrm{AIC} = \mathrm{RSS}/\sigma^2 + 2k. \tag{1.3}
\]

If alternatively σ² is unknown, so that θ = (β; σ²), it is

\[
\mathrm{AIC} = n \ln\bigl(\hat\sigma^2\bigr) + 2(k+1) = n \ln(\mathrm{RSS}/n) + 2(k+1), \tag{1.4}
\]

where σ̂² = RSS/n is the maximum likelihood estimate of σ². In both (1.3) and (1.4) the first term equals, up to an unimportant constant, the statistic −2 ln L(θ̂ | y), where ln L(θ̂ | y) is the log-likelihood of the fitted model. So the first term is a measure of goodness of fit. In model selection a smaller AIC is preferred; hence the second term (which will be seen to originate as an unbiasing term) penalizes M according to its number of fitted parameters (k, resp. k + 1). It is usually differences of AIC's that are important, so any term not involving k may be added to (1.3) and (1.4).

The choice '2' in (1.3) and (1.4) for the coefficient of k (resp. k + 1) is motivated by information theory, as will be explained. But applied statisticians have long been interested in the effects on model selection of choosing a more general penalty term γk, where γ > 0 may differ from 2. The resulting modified AIC is denoted AIC_γ (McQuarrie & Tsai 1998). Bhansali & Downham (1977) considered the effects of varying γ on order selection in autoregressive models, and showed empirically that it may be useful for γ to range, say, between 2 and 6. The abovementioned KIC, like the AIC, has an information-theoretic justification, and in the n → ∞ limit turns out to be equivalent to AIC₃. Mention should also be made of the AICu (McQuarrie et al. 1997), which is a heuristic modification of (1.4) in which the ML estimate σ̂² is replaced by the unbiased estimator s² := RSS/(n − k) of σ². In the n → ∞ limit, AICu is also equivalent to AIC₃. This can be seen by working to leading order in 1/n and using the asymptotic approximation n ln[n/(n − k)] ∼ k + O(1/n), n → ∞.

The most familiar modified or corrected AIC, the AICc, is a less drastic modification of (1.4), the modification being a major one only for small n. Under the assumption of i.i.d. normal residuals, and the traditional assumption of no mis-specification, it is given by

\[
\mathrm{AICc} = n \ln\bigl(\hat\sigma^2\bigr) + \frac{2(k+1)\,n}{n-k-2} \tag{1.5a}
\]
\[
\phantom{\mathrm{AICc}} \sim \mathrm{AIC} + \frac{2(k+1)(k+2)}{n} + O(1/n^2), \qquad n \to \infty \tag{1.5b}
\]

(Sugiura 1978; Hurvich & Tsai 1989; Cavanaugh 1997).
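The following Python sketch collects formulas (1.2)–(1.5a) for a single least-squares fit. It is illustrative only; the function and variable names are our own, and which criterion applies depends on whether σ² is known, as the paper emphasizes.

```python
import numpy as np

def fit_and_criteria(X, y, sigma2=None):
    """OLS fit of y = X beta + eps; returns AIC per (1.3) if the error
    variance sigma2 is known, else AIC (1.4) and AICc (1.5a)."""
    n, k = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta_hat) ** 2))       # (1.2)
    if sigma2 is not None:                             # known error bars
        return {"AIC": rss / sigma2 + 2 * k}           # (1.3)
    sigma2_hat = rss / n                               # MLE of sigma^2
    aic = n * np.log(sigma2_hat) + 2 * (k + 1)         # (1.4)
    aicc = n * np.log(sigma2_hat) + 2 * (k + 1) * n / (n - k - 2)  # (1.5a)
    return {"AIC": aic, "AICc": aicc}
```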
Why an O(1/n) correction term should be added to (1.4), but not to the expression (1.3) that applies if σ² is known, will be explained.

2. Minimum discrepancy estimation

(a) A general framework

In this section the AIC, a penalized goodness-of-fit statistic, is placed in a model-selection framework that goes well beyond the use of MLE in regression. The AIC for a candidate model can be viewed as an unbiased estimator of its discrepancy, in a certain sense, from the true model. This will eventually lead to the introduction of a null hypothesis that two candidate models are equally discrepant, and to systematic results on AIC corrections. But the theme of this section is the existence of alternatives to the AIC, which have not yet been applied in the physical sciences. It is hoped that interest in this area will be stimulated. A framework resembling the one used here was first developed by Linhart & Zucchini (1986).

Suppose one has a parametric statistical model M_θ that will be used for approximation or fitting purposes, with θ ∈ Θ (a parameter space); and a true, underlying statistical model M∗ of the data-generating process, which is not known explicitly. If each generated datum is an element of a set S, both M_θ and M∗ will be probability distributions on S. (The choice S = ℝⁿ is appropriate for regression, as in § 1b.) Their respective probability density functions (PDF's) will be denoted f_θ(y) and g(y). In general it will not be assumed that g = f_θ for any θ ∈ Θ, i.e., mis-specification will be allowed.

To any random sample y_N of size N from the true distribution g, comprising y⁽¹⁾, …, y⁽ᴺ⁾ ∈ S, there corresponds an empirical distribution g_N = g_{N,y_N} on S. It is defined by

\[
g_{N,\mathbf{y}_N}(\,\cdot\,) = N^{-1} \sum_{i=1}^{N} \delta\bigl(\,\cdot\, - y^{(i)}\bigr), \tag{2.1}
\]

where δ(·) is the Dirac delta function, if S is a Euclidean space such as ℝⁿ. If alternatively S is a discrete space, then instead of a PDF there will be a probability mass function (PMF), defined using a Kronecker delta rather than a delta function. The restriction N = 1, meaning that there is only a single replication, i.e. only one observation of y ∈ S, was implicitly made in § 1b, where y was a random vector in ℝⁿ; but here it will be relaxed.

The definition of MLE, which is an almost universally applicable but hardly unique fitting scheme, is familiar. From the sample y_N one constructs the likelihood function L(θ | y_N) = ∏ᵢ f_θ(y⁽ⁱ⁾), and computes a parameter estimate θ̂_N = θ̂_N(y_N) by maximizing the likelihood, or equivalently by minimizing the negative log-likelihood −ln L(θ | y_N). The best fit to the data is then the model M_{θ̂_N}, with PDF f_{θ̂_N}.

This scheme generalizes to minimum discrepancy estimation, which is itself a generalization of minimum distance estimation. In an abstract description one starts with y_N, or equivalently an N-point empirical distribution g_N on S, and computes θ̂_N by minimizing d(g_N; f_θ) over θ ∈ Θ. That is, θ̂_N = arg min_{θ∈Θ} d(g_N; f_θ). Here d(g; f) signifies some real-valued measure of the discrepancy between the PDF's g, f, which quantifies how difficult it is to discriminate between them.
The case when d satisfies the axioms for a metric can be especially nice (Donoho & Liu 1988; Trosset & Sands 1995), but this will not be assumed. Thus d may be asymmetric, i.e., directed, and may not satisfy the triangle inequality. Also, it may not satisfy d(g; f) ≥ 0. But it is useful to require that d(g; f) ≥ d(g; g), with equality holding only if g = f. Then the normalized discrepancy D(g; f) := d(g; f) − d(g; g) ≥ 0 will satisfy D(g; f) = 0 only if g = f. As will be seen, it is sometimes possible for an unnormalized discrepancy d to be defined on a larger class of PDF's than is the case for a normalized one.

In minimum discrepancy estimation one must distinguish between the model M_θ fitted to an empirical distribution g_N generated by M∗, which has PDF f_{θ̂_N}, and the best approximating model, the PDF of which is some f_{θ∗}. Here θ∗ = arg min_{θ∈Θ} d(g; f_θ) may differ from θ̂_N. The value θ∗ is called the 'pseudo-true' value of θ. (If there is no mis-specification, i.e., g = f_{θ∗} for some θ∗, it is the true value.) The discrepancy due to approximation (AD) is d(g; f_{θ∗}). In the absence of mis-specification this would equal the constant d(g; g), and what would be more important would be the discrepancy due to estimation (ED), i.e. d(f_{θ∗}; f_{θ̂_N}). The overall discrepancy (OD), of the fitted model from the truth, is the quantity d(g; f_{θ̂_N}). If there is no mis-specification, OD reduces to ED. None of these three discrepancies can be calculated if the true model is unknown, though they can be estimated from the data y_N, meaning from g_N. (It is d(g_N; f_{θ̂_N}), the fitted discrepancy (FD) of the model from the data, that can be calculated from the data.)

The following policy is an abstraction of Akaike's.

Selection Policy. Given a discrepancy functional d, data y_N generated by an unknown true model M∗ with PDF g, and candidate models M¹_{θ₁}, M²_{θ₂} with parametric PDF's f¹_{θ₁}, f²_{θ₂} and parameter spaces Θ₁, Θ₂, one should ideally assess the goodness of fit of each model on the basis of its expected overall discrepancy from M∗. That is, if when fitted to data y_N, or equivalently to the empirical distribution g_N = g_{N,y_N}, model Mˡ_{θ_l} would have parameter θ̂ˡ_N = θ̂ˡ_N(y_N), one should select the model with the minimum value of

\[
d'(g, f^l_{\,\cdot\,}) := \mathrm{E}_{\mathbf{y}_N}\, \mathrm{OD}^l_d(\mathbf{y}_N) = \mathrm{E}_{\mathbf{y}_N}\, d\bigl(g,\; f^l_{\hat\theta^l_N(\mathbf{y}_N)}\bigr). \tag{2.2}
\]

The expectation (i.e., averaging) is computed over data y_N generated by the true model, meaning over the PDF g. As an alternative, the double expectation

\[
d''(g, f^l_{\,\cdot\,}) := \mathrm{E}_{\mathbf{y}_N}\, \mathrm{E}_{\mathbf{y}'_N}\, d\bigl(g_{N,\mathbf{y}'_N},\; f^l_{\hat\theta^l_N(\mathbf{y}_N)}\bigr), \tag{2.3}
\]

where data y_N, y′_N are generated independently by g, may be employed.

The expected OD, d′(g, fˡ), is a penalized version of the AD d(g; f_{θ_{l∗}}), where θ_{l∗} is the pseudo-true value of the parameter θ_l ∈ Θ_l. It is at least as large as the AD, and because the number of ways in which the data-dependent fitted PDF fˡ_{θ̂ˡ_N(y_N)} can deviate from the pseudo-true PDF fˡ_{θ_{l∗}} is the dimensionality of the parameter space Θ_l, one expects that relying on the expected OD as a measure of closeness of Mˡ_{θ̂ˡ_N} to M∗ will disfavor over-fitting in an AIC-like way.
It must be stressed that neither d′(g, fˡ) nor d″(g, fˡ) can be calculated directly, though they can be estimated (with bias) by the fitted discrepancy of the l'th model from the data, FDˡ_d(y_N) = d(g_{N,y_N}; fˡ_{θ̂ˡ_N(y_N)}). (This is an RSS-like quantity.) The selection policy can therefore be implemented by choosing the model with the minimum value of a certain function MSC, defined as follows to be an unbiased estimator of the expected overall discrepancy. (It will specialize to the AIC.)

Definition 2.1. The model-selection criterion function based on a discrepancy functional d, denoted MSC_d or simply MSC, is defined so that the value MSC_l/2 for the l'th model equals its fitted discrepancy FDˡ_d(y_N) = d(g_{N,y_N}; fˡ_{θ̂ˡ_N(y_N)}), plus an unbiasing term B_l equal to E_{y_N} ODˡ_d(y_N) minus E_{y_N} FDˡ_d(y_N). That is,

\[
B_l = d'(g; f^l_{\,\cdot\,}) - \mathrm{E}_{\mathbf{y}_N}\, d\bigl(g_{N,\mathbf{y}_N};\; f^l_{\hat\theta^l_N(\mathbf{y}_N)}\bigr)
    = \mathrm{E}_{\mathbf{y}_N}\, d\bigl(g;\; f^l_{\hat\theta^l_N(\mathbf{y}_N)}\bigr) - \mathrm{E}_{\mathbf{y}_N}\, d\bigl(g_{N,\mathbf{y}_N};\; f^l_{\hat\theta^l_N(\mathbf{y}_N)}\bigr), \tag{2.4}
\]

so that in expectation, MSC_l/2 equals the expected overall discrepancy, on which it is posited that model selection should be based.

Remark. Distinct discrepancies could be used for (i) performing the fitting and computing the FD, and (ii) computing the expected OD. This would generalize the selection policy (and turns out to be needed in the definition of the KIC, as will be discussed elsewhere). For practical reasons, one could also perform the fitting using MLE and compute the FD using a discrepancy not related to MLE (Linhart & Zucchini 1986, § 4.4). But this seems less theoretically justifiable.

The alternative d″ to d′ was mentioned for two reasons. First, for MLE it is identical to d′, as will be seen. Second, using d″ makes selection in effect a procedure of cross-validation, in which a fitted model is assessed according to its empirically expected prediction errors (Stone 1978). The definition (2.3) of d″ involves two hypothetical sets of data: one (y_N) used to estimate the parameter θ_l, and one (y′_N) used to assess the fit of the resulting model. The equivalence between choosing models by cross-validation and Akaike's technique of penalizing models by their number of parameters has long been recognized (Stone 1977; Kuha 2004).
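As an illustration of that equivalence (our construction, not an example from the paper; the data-generating curve and all settings are assumptions), the following Python sketch compares the polynomial order chosen by the AIC of (1.4) with the order chosen by leave-one-out cross-validation on simulated data. The two typically agree.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = np.linspace(-1.0, 1.0, n)
y = np.sin(2.0 * x) + 0.3 * rng.standard_normal(n)    # assumed true process

def design(order):
    return np.vander(x, order + 1, increasing=True)    # 1, x, ..., x^order

def aic(order):
    X = design(order)
    rss = np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    return n * np.log(rss / n) + 2 * (X.shape[1] + 1)  # (1.4), sigma^2 unknown

def loo_cv(order):
    X = design(order)
    err = 0.0
    for i in range(n):                                 # leave one point out
        keep = np.arange(n) != i
        b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        err += (y[i] - X[i] @ b) ** 2
    return err

orders = range(1, 8)
print("AIC choice:   ", min(orders, key=aic))
print("LOO-CV choice:", min(orders, key=loo_cv))
```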
(b) Discrepancies, information theory and MLE

The selection policy of § 2a can be applied very generally, to both discrete and continuous models. Given a discrepancy functional d and a sample y_N comprising y⁽¹⁾, …, y⁽ᴺ⁾ ∈ S, which yields an N-point empirical distribution g_N = g_{N,y_N} on S, one would naively decide whether to select parametric model Mˡ_{θ_l} with PDF fˡ_{θ_l} on S on the basis of its fitted discrepancy d(g_{N,y_N}; fˡ_{θ̂ˡ_N(y_N)}) from the sample. But the selection policy modifies this RSS-like quantity by adding an unbiasing term, giving an unbiased estimate MSC/2 of the expected overall discrepancy. Selection based on the latter disfavors over-fitting, as desired.

This is the natural setting for the AIC and AICc statistics. But for regression models, in which S = ℝⁿ (and most often N = 1), the choice of a discrepancy functional closely tied to MLE and in fact to information theory must first be justified. Only for one special functional does the fitted discrepancy turn out to be the negative log-likelihood.

For discrete rather than continuous models, with S a discrete set such as {1, …, m} or {1, 2, 3, …}, a wide variety of discrepancy functionals d(g; f) have been used in statistics and elsewhere. They include the Pearson X² statistic ∑_{n∈S} [g(n) − f(n)]²/f(n), used when S is the set of cells in a contingency table to compare an empirical and a theoretical distribution. The Neyman X² is similar. There are also many discrepancies rooted in information theory, such as the (discrete) Kullback–Leibler (KL) divergence, often called the informational divergence (Csiszár & Körner 2011). Its normalized form is

\[
D_{\mathrm{KL}}(g; f) := \sum_{n\in S} g(n) \ln \frac{g(n)}{f(n)} \;\geq\; 0 \tag{2.5}
\]

and its denormalized form is

\[
d_{\mathrm{KL}}(g; f) := -\sum_{n\in S} g(n) \ln f(n) \;\geq\; d_{\mathrm{KL}}(g; g). \tag{2.6}
\]

They are related by D_KL(g; f) = d_KL(g; f) − d_KL(g; g). The subtracted quantity d_KL(g; g) is the (Shannon) entropy of the distribution g, and in statistical mechanics D_KL(g; f) would therefore be called a relative entropy. The denormalized d_KL(g; f) is sometimes called an 'inaccuracy.'

Discrepancies in information theory are often of the 'φ-divergence' form D_φ(g; f) := ∑_{n∈S} g(n) φ(f(n)/g(n)), for φ some convex function (Pardo 2006). For instance, if φ(u) equals −ln u then D_φ = D_KL. If φ(u) ∝ 1 − u^{(1+α)/2} then D_φ becomes the so-called α-divergence D⁽ᵅ⁾, which reduces to D_KL in a scaled α → −1 limit. This generalized discrepancy arises in the geometry of statistical inference (Amari & Nagaoka 2000), and there is a corresponding denormalized

\[
d^{(\alpha)}(g; f) \;\propto\; \sum_{n\in S} g(n)^{(1-\alpha)/2} \Bigl[ 1 - f(n)^{(1+\alpha)/2} \Bigr], \tag{2.7}
\]

a scaled α → −1 limit of which equals d_KL(g; f). The self-divergence d⁽ᵅ⁾(g; g) of g is called in physics the Tsallis entropy of g (Tsallis 1988). Many other discrepancies have been investigated. (See Kapur (1989, Chap. 7) and Basu et al. (2011, Chap. 11).)

But the KL divergence is perhaps the most important, because of its connection to MLE. From a sample y_N comprising y⁽¹⁾, …, y⁽ᴺ⁾ ∈ S, where S is a discrete set, obtaining a best-fit PMF f_{θ̂_N} by maximizing over θ the likelihood L_f(θ | y_N) = f(y_N | θ) of a candidate PMF f_θ on S is equivalent to minimizing D_KL(g_N; f_θ) or d_KL(g_N; f_θ) over θ, where g_N = g_{N,y_N} is the N-point empirical distribution defined by the data. This is because d_KL(g_N; f_θ) equals −ln L_f(θ | y_N), the negative log-likelihood, as follows from (2.6). In particular, d_KL(g_N; f_{θ̂_N}) equals −ln L_f(θ̂_N | y_N). MLE can thus be interpreted as a minimum discrepancy estimation.
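As a quick numerical check of this equivalence (our illustration; the Bernoulli setup and sample size are assumptions), the sketch below fits a Bernoulli success probability both by maximizing the likelihood and by minimizing the denormalized discrepancy d_KL(g_N; f_θ) of (2.6); the two estimates coincide.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
sample = rng.binomial(1, 0.3, size=200)        # draws from S = {0, 1}
g_N = np.array([np.mean(sample == 0), np.mean(sample == 1)])  # empirical PMF

def d_kl(theta):
    """Denormalized KL discrepancy (2.6) of the Bernoulli(theta) PMF
    from the empirical PMF g_N."""
    f = np.array([1.0 - theta, theta])
    return -np.sum(g_N * np.log(f))

def neg_log_lik(theta):
    return -np.sum(np.log(np.where(sample == 1, theta, 1.0 - theta)))

t1 = minimize_scalar(d_kl, bounds=(1e-6, 1 - 1e-6), method="bounded").x
t2 = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded").x
print(t1, t2)   # both approach the sample mean of `sample`
```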
Now consider the continuous case, when the data lie in a space S that is Euclidean, such as the choice S = ℝⁿ arising in regression. The selection policy of § 2a requires the computation of (an unbiased version of) some fitted discrepancy d(g_N; f_{θ̂_N}). The empirical distribution g_N = g_{N,y_N} is a linear combination of N delta functions computed from the data y_N, as in (2.1), and θ̂_N = θ̂_N(y_N) is computed by minimizing d(g_N; f_θ) over θ, where f_θ is a candidate PDF on S. For the selection policy to be implemented as stated, the discrepancy functional d(g; f) being employed must allow its first argument to be a PDF g_N that is 'atomic,' in the sense that it is a combination of delta functions. This is a stringent requirement (Liese & Vajda 1987, § 10.9).

When S = ℝ, many statistical discrepancies d(g; f) have been employed; e.g., in the robust estimation of location and scale parameters of distributions on S (Sahler 1968; Parr & Schucany 1980). Most are computed from the cumulative distributions (CDF's) G, F corresponding to g, f, so they are well-defined even if one or the other is an empirical distribution. Also, most are metrics, or legitimate distances; in fact minimum discrepancy estimation grew out of minimum distance estimation, which has a long history (Parr 1981). Examples include the Kolmogorov–Smirnov and Cramér–von Mises discrepancies, which are widely used as goodness-of-fit statistics. But their generalizations to the multivariate case, when S = ℝⁿ with n > 1, are not straightforward at all.

When S = ℝⁿ with n arbitrary, the most widely used discrepancy is the (continuous) KL divergence, with its close ties to MLE. Its normalized form is

\[
D_{\mathrm{KL}}(g; f) := \int_{\mathbb{R}^n} g(y) \ln \frac{g(y)}{f(y)}\; d^n y \;\geq\; 0 \tag{2.8}
\]

and its denormalized form is

\[
d_{\mathrm{KL}}(g; f) := -\int_{\mathbb{R}^n} g(y) \ln f(y)\; d^n y \;\geq\; d_{\mathrm{KL}}(g; g). \tag{2.9}
\]

They are related by D_KL(g; f) = d_KL(g; f) − d_KL(g; g), as in the discrete case. A key observation is that the integral in (2.9) is well-defined even if g is a purely atomic function of y, such as an empirical PDF g_N. But the integral in (2.8) is not, as one cannot take the logarithm of a sum of delta functions. The distinction can be viewed as arising from the entropy d_KL(g; g) not being defined when g = g_N: any empirical distribution has undefined entropy. The good behavior of the integral in (2.9) justifies the use of d_KL in model selection, to the exclusion of D_KL.

For any sample size N ≥ 1, averaging over data y′_N sampled from the PDF g yields g itself; which is to say, E_{y′_N} g_{N,y′_N} = g. By the linearity in g of the integral in (2.9), it follows that E_{y′_N} d_KL(g_{N,y′_N}; f) equals d_KL(g; f). This confirms a claim made in § 2a: if d_KL is used as the discrepancy d, the two expected overall discrepancies d′, d″ defined in (2.2), (2.3) are equal, and give rise to identical model-selection policies. For non-KL discrepancies, this may not hold.

In the continuous case as in the discrete, the fitted discrepancy FD_{d_KL}(y_N), i.e., d_KL(g_{N,y_N}; f_{θ̂_N(y_N)}), equals −ln L_f(θ̂_N | y_N), the negative of the fitted log-likelihood. Using d_KL as the discrepancy, as in the following definition, specializes the MSC of Definition 2.1 to what will be called an MSC of AIC type.
Definition 2.2. A d_KL-based model-selection criterion MSC, of AIC type, is defined thus: the value MSC_l/2 for the l'th model, calculated after fitting, equals its negative log-likelihood −ln L_f(θ̂ˡ_N | y_N), plus an unbiasing correction B_l that equals E_{y_N} ODˡ_{d_KL}(y_N) minus E_{y_N} FDˡ_{d_KL}(y_N). That is,

\[
B_l = \mathrm{E}_{\mathbf{y}_N}\, d_{\mathrm{KL}}\bigl(g;\; f^l_{\hat\theta^l_N(\mathbf{y}_N)}\bigr) - \mathrm{E}_{\mathbf{y}_N}\bigl[-\ln L_f(\hat\theta^l_N \,|\, \mathbf{y}_N)\bigr],
\]

so that in expectation, MSC_l/2 equals the d_KL-based expected overall discrepancy, on which it is posited that model selection should be based.

In the following, only MSCs of AIC type will be employed. The choice of the KL divergence as the discrepancy used in model fitting has a clear justification: it allows the selection policy of § 2a to be applied as stated. But it should be noted that there are alternatives that merit examination. By adapting the selection policy it may be possible to employ quite different discrepancies, such as the abovementioned α-divergence. This requires a brief explanation.

The continuous, denormalized version of the α-divergence is

\[
d^{(\alpha)}(g; f) \;\propto\; \int_{\mathbb{R}^n} g(y)^{(1-\alpha)/2} \Bigl[ 1 - f(y)^{(1+\alpha)/2} \Bigr] d^n y, \tag{2.10}
\]

which is undefined if g is an empirical distribution. But the α = 0 case of this, d⁽⁰⁾(g; f), is equivalent to the Hellinger distance, which has long been used in parametric estimation (Beran 1977). From data y_N, or an empirical distribution g_N = g_{N,y_N} defined from it as in (2.1), a fitted parametric model f_{θ̂} can indeed be found by minimizing the Hellinger distance. The fitting, though, involves a preliminary step: replacing the delta functions of (2.1) by approximate deltas. That is, one first engages in kernel density estimation, by convolving g_N with some integral kernel. The integral in (2.10) will be well-defined if the resulting 'smoothed' g is used. Alternatively, the model PDF f as well as the data PDF g_N can be smoothed (Basu et al. 1997, § 3.3). However, the smoothing of g_N can apparently be justified only in the large-sample (N → ∞) limit. In what follows N = 1, and the limit taken (if any) will be n → ∞. The usefulness in this setting of a preliminary smoothing of the empirical distribution remains to be explored.
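To make the smoothing step concrete, here is a small Python sketch (our construction, with arbitrary settings) that estimates a location parameter by minimizing the Hellinger distance between a kernel-smoothed empirical density and a unit-variance normal model.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
sample = rng.normal(1.5, 1.0, size=300)     # data from an assumed true model
smoothed = gaussian_kde(sample)             # kernel-smoothed version of g_N
grid = np.linspace(sample.min() - 4.0, sample.max() + 4.0, 2000)
dx = grid[1] - grid[0]
g_vals = smoothed(grid)

def hellinger_sq(theta):
    """Squared Hellinger distance between the smoothed density and the
    N(theta, 1) model density, by quadrature on the grid."""
    f_vals = norm.pdf(grid, loc=theta, scale=1.0)
    return 0.5 * np.sum((np.sqrt(g_vals) - np.sqrt(f_vals)) ** 2) * dx

theta_hat = minimize_scalar(hellinger_sq, bounds=(-5.0, 5.0), method="bounded").x
print(theta_hat)   # close to the sample mean here
```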
3. Selection with error bars

In this section we specialize to the case when the true model M∗ and the candidate models fitted to a set of n data points are normal regression models, incorporating known error bars as explained in § 1b. We investigate the model-selection criterion function given in Definition 2.2 (a d_KL-based MSC of AIC type). In § 3a it is shown that irrespective of n and the extent of mis-specification, the MSC for a candidate linear model reduces to the standard AIC of (1.3): a sum of squared residuals penalized by 2k, i.e., by twice the number of parameters. Interestingly, it is possible to derive the modified AIC known as the AIC_γ, in which the penalty γk replaces 2k, by slightly modifying the selection policy of § 2. But even when n is small, the standard AIC is never extended by an O(1/n) correction term; thus the use of the AICc is not appropriate here. This realization is new. In Appendix B, papers from the astrophysics literature that have erroneously applied the AICc are listed.

In § 3b the variability of ∆₁₂ := AIC₂ − AIC₁ is determined, and an asymptotically valid hypothesis test for model selection that is based on ∆₁₂ is proposed. At any specified significance level α, the test either rejects or accepts the null hypothesis that the fitted models M¹_{θ̂₁}, M²_{θ̂₂} are equally close to M∗ in the 'expected overall discrepancy' sense of § 2. For mis-specified linear models incorporating error bars, this approach to model selection can potentially replace the rule-of-thumb use of Akaike weights. The variability estimation and the test of significance can be extended to the case of non-linear models, as is sketched.

(a) Expected discrepancies

Consider a true model M∗ and a candidate linear model M that are both normal, as in § 1b. They are y = y₀ + ε₀ and y = Xβ + ε, where β = (β_j) is a column vector of k parameters, and the error vectors ε₀, ε have mean zero and covariance matrices σ₀²Iₙ, σ²Iₙ. Thus ε₀ = σ₀z₀ and ε = σz, where z₀ and z are vectors of independent standard normal variables. It is not assumed that y₀ is in the column space of the n × k design matrix X, i.e., mis-specification is allowed. In this section the variance σ² is specified and not estimated, so the full parameter vector θ of M is simply β. If the true variance σ₀² is known, as is the case when the data are accompanied by error bars, then it is natural to choose σ² = σ₀². But for the moment this will not be assumed: statistical as well as deterministic mis-specification will be allowed.

There is assumed to be only one observation (N = 1), so only one instance of the random vector y ∈ S = ℝⁿ is available as a datum. Thus y will be written for y_N, and the subscript N dropped. MLE is equivalent to choosing β ∈ ℝᵏ so as to minimize the discrepancy (in the d_KL sense) of the 1-point atomic PDF g_y(·) = δ(· − y) on ℝⁿ from M_β. Equivalently, MLE minimizes the negative log-likelihood −ln L(β | y) of M_β. It yields a fitted model M_{β̂}, where the estimated parameter vector β̂ = β̂(y) ∈ ℝᵏ is given by β̂ = (XᵗX)⁻¹Xᵗy. The n × n hat matrix P = X(XᵗX)⁻¹Xᵗ and its complement Q = Iₙ − P project onto the estimation and error subspaces of ℝⁿ, i.e., the column and left null spaces of X. The predicted data vector is ŷ = Py, and the RSS (sum of squared residuals) is (y − ŷ)ᵗ(y − ŷ) = yᵗQy.

The policy of § 2 requires that, to the extent that it can be estimated, the expected overall discrepancy E_y OD(y) should be used for model selection. The AIC-type selection criterion MSC (see Definition 2.2) has the property that MSC/2 for M is an unbiased estimator of E_y OD(y). It is defined by

\[
\mathrm{MSC}/2 = \mathrm{FD}(y) + \mathrm{E}_{y}\bigl[\mathrm{OD}(y) - \mathrm{FD}(y)\bigr], \tag{3.1}
\]

where the fitted discrepancy FD(y), i.e., d_KL(g_y; f_{β̂(y)}), is simply the negative log-likelihood −ln L(β̂ | y) of the fitted model M_{β̂}. Since OD(y) is d_KL(g; f_{β̂(y)}), the second, unbiasing term in (3.1), which was denoted B in the previous section, can be calculated from the Gaussian PDF's g, f_β of M∗, M_β. They are

\[
g(y) = (2\pi\sigma_0^2)^{-n/2} \exp\bigl[-(y - y_0)^t (y - y_0)/2\sigma_0^2\bigr], \tag{3.2}
\]
\[
f_\beta(y) = (2\pi\sigma^2)^{-n/2} \exp\bigl[-(y - X\beta)^t (y - X\beta)/2\sigma^2\bigr]. \tag{3.3}
\]

For convenience we shall write

\[
d_{\mathrm{KL}}(f; f) := d_{\mathrm{KL}}(f_\beta; f_\beta) = \frac{n}{2}\bigl[1 + \ln(2\pi)\bigr] + \frac{n}{2}\ln\bigl(\sigma^2\bigr), \tag{3.4}
\]

since d_KL(f_β; f_β) does not depend on β.
By examination,

\[
\mathrm{FD}(y) = -\ln L_f(\hat\beta \,|\, y) = d_{\mathrm{KL}}(f; f) - \frac{n}{2} + \frac{\mathrm{RSS}}{2\sigma^2} \tag{3.5}
\]

expresses the fitted discrepancy in terms of the RSS, which is yᵗQy. In the following theorem, λ := y₀ᵗQy₀/σ₀² is an M-specific mis-specification parameter, χ²_r is a chi-squared random variable with r degrees of freedom, and χ²_r(λ) is a similar but non-central variable, with non-centrality parameter λ. (For central and non-central chi-squared distributions, see Appendix A.)

Theorem 3.1. Under the model M∗, the overall and fitted discrepancies of the fitted model M_{β̂(y)}, OD = OD(y) and FD = FD(y), are distributed according to

\[
\mathrm{OD} = d_{\mathrm{KL}}(g; f_{\hat\beta(y)}) \sim d_{\mathrm{KL}}(f; f) + \frac{n}{2}\Bigl(\frac{\sigma_0^2}{\sigma^2} - 1\Bigr) + \frac{\sigma_0^2}{2\sigma^2}\,\chi^2_k(\lambda),
\]
\[
\mathrm{FD} = d_{\mathrm{KL}}(g_y; f_{\hat\beta(y)}) \sim d_{\mathrm{KL}}(f; f) - \frac{n}{2} + \frac{\sigma_0^2}{2\sigma^2}\,\chi^2_{n-k}(\lambda),
\]

where β̂(y) = (XᵗX)⁻¹Xᵗy ∈ ℝᵏ is the fitted value of the parameter β. For the discrepancies due to approximation and estimation, AD and ED = ED(y), the corresponding statements are

\[
\mathrm{AD} = d_{\mathrm{KL}}(g; f_{\beta_*}) = d_{\mathrm{KL}}(f; f) + \frac{n}{2}\Bigl(\frac{\sigma_0^2}{\sigma^2} - 1\Bigr) + \frac{\sigma_0^2}{2\sigma^2}\,\lambda,
\]
\[
\mathrm{ED} = d_{\mathrm{KL}}(f_{\beta_*}; f_{\hat\beta(y)}) \sim d_{\mathrm{KL}}(f; f) + \frac{\sigma_0^2}{2\sigma^2}\,\chi^2_k,
\]

where β∗ = (XᵗX)⁻¹Xᵗy₀ ∈ ℝᵏ is the pseudo-true value of the parameter β.

Proof. Use the definition (2.9) of d_KL and the definitions (3.2), (3.3) of g, f_β. Each integral over ℝⁿ in the computation of a d_KL is a normal expectation that can be evaluated in closed form. In the expressions for OD, FD, ED the distributions χ²_k(λ), χ²_{n−k}(λ), χ²_k arise respectively as the distributions of

\[
(Py - y_0)^t (Py - y_0)/\sigma_0^2 = (Pz_0 - Qy_0/\sigma_0)^t (Pz_0 - Qy_0/\sigma_0), \tag{3.6a}
\]
\[
(Py - y)^t (Py - y)/\sigma_0^2 = (z_0 + y_0/\sigma_0)^t\, Q\, (z_0 + y_0/\sigma_0), \tag{3.6b}
\]
\[
(Py - Py_0)^t (Py - Py_0)/\sigma_0^2 = z_0^t P z_0, \tag{3.6c}
\]

if one uses the fact that P, Q are n × n projection matrices of ranks k, n − k. (For distributions of quadratic forms, see Appendix A.) Similarly, the 'λ' in the expression for AD is the non-random value of (Py₀ − y₀)ᵗ(Py₀ − y₀)/σ₀². □

Theorem 3.2. The corresponding expectations over M∗-generated data are

\[
\mathrm{E}_y\,\mathrm{OD}(y) = d_{\mathrm{KL}}(f; f) + \frac{n}{2}\Bigl(\frac{\sigma_0^2}{\sigma^2} - 1\Bigr) + \frac{\sigma_0^2}{2\sigma^2}\,(k + \lambda),
\]
\[
\mathrm{E}_y\,\mathrm{FD}(y) = d_{\mathrm{KL}}(f; f) + \frac{n}{2}\Bigl(\frac{\sigma_0^2}{\sigma^2} - 1\Bigr) + \frac{\sigma_0^2}{2\sigma^2}\,(-k + \lambda),
\]
\[
\mathrm{AD} = d_{\mathrm{KL}}(f; f) + \frac{n}{2}\Bigl(\frac{\sigma_0^2}{\sigma^2} - 1\Bigr) + \frac{\sigma_0^2}{2\sigma^2}\,\lambda,
\]
\[
\mathrm{E}_y\,\mathrm{ED}(y) = d_{\mathrm{KL}}(f; f) + \frac{\sigma_0^2}{2\sigma^2}\,k.
\]

Thus in expectation only, the variable OD − d_KL(f; f) is the sum of AD − d_KL(f; f) and the variable ED − d_KL(f; f).

Proof. Use E[χ²_r(λ)] = r + λ, with χ²_r equalling χ²_r(0). □

Corollary 3.3. In the definition of the AIC-type model-selection criterion MSC for model M, according to which MSC/2 equals FD(y) plus an unbiasing term B, the term B (i.e., E_y[OD(y) − FD(y)]) equals σ₀²/2σ² times 2k. Hence

\[
\mathrm{MSC} = 2\, d_{\mathrm{KL}}(f; f) + \mathrm{RSS}/\sigma^2 + \frac{\sigma_0^2}{\sigma^2}\, 2k.
\]

Proof. Compute E_y[OD(y) − FD(y)] from the theorem, and then use the formula (3.5) for FD(y). □

Thus with the exception of a constant term equal to 2 d_KL(f; f), which does not affect the relative ranking of models, for any normal linear model M_β fitted to data the AIC-type selection criterion reduces to the standard AIC given in (1.3): the usual RSS/σ², penalized by twice the number of parameters. Provided, that is, the model incorporates a σ² equal to the variance parameter σ₀² of the true model M∗. There must be no statistical mis-specification: M must incorporate error bars of the correct length. If so, the formula (1.3) is exact for all n. There is no sign of any small-n correction term, such as appears in the AICc.
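The unbiasing term of Corollary 3.3 is easy to check numerically. The following Python sketch is our own: the true mean vector, design matrix, and sample sizes are arbitrary choices, with a deliberately mis-specified y₀. It averages OD(y) − FD(y) over simulated data and compares the average against B = (σ₀²/σ²)·k; the common (n/2) ln(2πσ²) term cancels in the difference and is omitted.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, sigma0, sigma = 50, 3, 1.0, 1.0   # sigma = sigma0: no statistical mis-spec.
X = rng.standard_normal((n, k))
y0 = X @ np.array([1.0, -2.0, 0.5]) + 0.7 * np.sin(np.arange(n))  # y0 not in col(X)

def od_minus_fd(y):
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    mu = X @ beta_hat
    # OD = -int g ln f_betahat, FD = -ln f_betahat(y); constants cancel below
    od = (np.sum((y0 - mu) ** 2) + n * sigma0**2) / (2 * sigma**2)
    fd = np.sum((y - mu) ** 2) / (2 * sigma**2)
    return od - fd

sims = [od_minus_fd(y0 + sigma0 * rng.standard_normal(n)) for _ in range(20000)]
print(np.mean(sims), "vs  B =", k * sigma0**2 / sigma**2)   # both close to k
```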
It must be stressed that deterministic mis-specification is allowed here. There may be a non-zero value for the mis-specification parameter λ = y₀ᵗQy₀/σ₀², indicating that the constant vector y₀ in the definition of the true model M∗ does not lie in the estimation space, i.e., the column space of the design matrix X; so the model M_β does not agree with the true model M∗ for any value of β. Because the parameter λ appears in both E_y OD(y) and E_y FD(y), it cancels.

The preceding analysis was based entirely on the model-selection policy of § 2, according to which the expected overall discrepancy E_y OD(y) of the true model M∗ from a candidate model M should be used for selection purposes. This policy gives rise to the AIC, but it is interesting to consider the effects of generalizing it slightly. By the formulas of Theorem 3.2, this policy is equivalent to choosing the model with the smallest value of the sum AD + E_y ED(y), the discrepancy due to approximation plus the expected discrepancy due to estimation. Suppose that instead, one assessed the goodness of fit of M to M∗ by employing (any multiple of) the convex combination

\[
\Bigl(\frac{1}{\gamma}\Bigr)\,\mathrm{AD} + \Bigl(1 - \frac{1}{\gamma}\Bigr)\,\mathrm{E}_y\,\mathrm{ED}(y), \tag{3.7}
\]

where γ ≥ 1 is free. By increasing γ one emphasizes the discrepancy due to estimation, rather than the discrepancy of M∗ from M due to approximation (which if there were no mis-specification would be a constant, i.e., would effectively be zero). If there is no statistical mis-specification (σ² = σ₀²), it follows from the formulas in the theorem that an unbiased estimator of this convex combination, obtained by unbiasing FD, is MSC_γ/γ, where

\[
\mathrm{MSC}_\gamma = 2\, d_{\mathrm{KL}}(f; f) + \mathrm{RSS}/\sigma^2 + \gamma k.
\]

With its constant first term dropped, the model-selection criterion MSC_γ becomes what is widely known as the AIC_γ, which penalizes any model by γ times its number of parameters. As γ increases, over-fitting is increasingly disfavored. The conceptual difference between the two sorts of error in statistical model fitting was pointed out by Inagaki (1977), and an AIC_γ-like criterion resembling (3.7) was defined for autoregressive models by Bhansali (1986, Eq. (2.12)). But it seems not to have been noticed that for normal linear models, the AIC_γ arises rather naturally. Of course in applications, domain-specific considerations that are less axiomatic than practical may affect the choice of γ.
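As a simple illustration (ours, on arbitrary simulated data), the sketch below shows how the polynomial order selected by AIC_γ = RSS/σ² + γk shrinks as the penalty coefficient γ grows; the error variance is taken as known, per (1.3).

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma0 = 60, 0.5
x = np.linspace(-1.0, 1.0, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + sigma0 * rng.standard_normal(n)  # assumed truth

def rss(order):
    X = np.vander(x, order + 1, increasing=True)
    return np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)

for gamma in (2.0, 3.0, 4.0, 6.0):
    # AIC_gamma for a model with k = order + 1 parameters, sigma^2 known
    best = min(range(1, 10), key=lambda m: rss(m) / sigma0**2 + gamma * (m + 1))
    print(f"gamma = {gamma}: selected order {best}")
```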
(b) AIC variability and a significance test

The procedure of deciding among candidate normal linear models will now be placed in the classical hypothesis testing framework. A test of significance for the evidence that M₁, M₂ are not equally close to the true data-generating process M∗ will be proposed. The test is valid in the n → ∞ limit, when applied to models that, in a certain precise sense, are separately mis-specified. It is assumed that the data points are accompanied by error bars, i.e., that the residual variance σ₀² is known and is incorporated in M₁, M₂.

The proposed test is based on the statistic ∆₁₂ := AIC₂ − AIC₁ and an expression for its variance, and can potentially replace the traditional use of Akaike weights. Focusing on the variability and hence the significance of an AIC difference has much in common with the approaches of Efron (1984) and Fraser & Gebotys (1987). But unlike Efron we do not use a bootstrap procedure, and unlike Fraser & Gebotys we allow arbitrary mis-specification. The general approach is distinguished from the likelihood ratio testing approach originating with Cox (1962), in that it decides between M₁, M₂ on the basis of which is closer to the truth, not on the basis of which is more likely to be correct.

For regression applications, consider the case when the models are fitted to a random vector y ∈ ℝⁿ that is generated by an unknown true model y = y₀ + ε₀, and N = 1: only one observation of the random vector y is available as a datum. The candidates M_l, l = 1, 2, are defined by y = X⁽ˡ⁾β⁽ˡ⁾ + ε⁽ˡ⁾, where X⁽ˡ⁾ is an n × k_l design matrix of full rank (with k_l < n) and β⁽ˡ⁾ ∈ ℝ^{k_l} is a column vector of parameters. The estimation space L_l ⊂ ℝⁿ (i.e., the column space of X⁽ˡ⁾) is a linear subspace of dimension k_l. Since the models are given, the subspaces L₁, L₂ are specified in advance; thus the analysis below is in a sense conditional. The error vectors ε₀, ε⁽¹⁾, ε⁽²⁾ are taken to be σ₀z₀, σ₀z⁽¹⁾, σ₀z⁽²⁾, where z₀, z⁽¹⁾, z⁽²⁾ are vectors of n independent standard normal variables.

In this setting the d_KL-based MSC of AIC type reduces to the standard AIC, by Corollary 3.3. Dropping the additive constant 2 d_KL(f; f), we write

\[
\mathrm{AIC}_l = \mathrm{RSS}_l/\sigma_0^2 + 2k_l, \tag{3.8a}
\]
\[
\mathrm{RSS}_l/\sigma_0^2 = \bigl(y - \hat y^{(l)}\bigr)^t \bigl(y - \hat y^{(l)}\bigr)/\sigma_0^2 = y^t Q^{(l)} y/\sigma_0^2 = (z_0 + y_0/\sigma_0)^t\, Q^{(l)}\, (z_0 + y_0/\sigma_0), \tag{3.8b}
\]

since ŷ⁽ˡ⁾ := P⁽ˡ⁾y. (Cf. (3.6b).) The n × n matrices P⁽ˡ⁾, Q⁽ˡ⁾ project onto L_l, L_l⊥ ⊂ ℝⁿ, with Q⁽ˡ⁾ = Iₙ − P⁽ˡ⁾; note that tr P⁽ˡ⁾ = k_l and tr Q⁽ˡ⁾ = n − k_l. Being an inhomogeneous quadratic form in z₀, RSS_l/σ₀² has a non-central chi-squared distribution, which is χ²_{n−k_l}(λ⁽ˡ⁾), where λ⁽ˡ⁾ := y₀ᵗQ⁽ˡ⁾y₀/σ₀² characterizes the mis-specification of model M_l against M∗. AIC_l is therefore distributed by

\[
\mathrm{AIC}_l \sim \chi^2_{n-k_l}(\lambda^{(l)}) + 2k_l. \tag{3.9}
\]

It should be noted that as r → ∞, the distribution of χ²_r(λ) is increasingly normal, whether or not λ grows with r; thus as n → ∞, the distribution of AIC_l is increasingly normal. But it is the distribution of ∆₁₂ := AIC₂ − AIC₁ that is of interest in model selection, and this is determined by the joint distribution of AIC₁, AIC₂, and hence by the joint distribution of the quadratic forms yᵗQ⁽¹⁾y and yᵗQ⁽²⁾y. As will be seen, obtaining an n → ∞ limit theorem requires that the relationship between Q⁽¹⁾, Q⁽²⁾ be somewhat restricted.
Theorem 3.4. The expectation and variance of AIC₁, AIC₂ and the difference ∆₁₂ := AIC₂ − AIC₁ are given by

\[
\mathrm{E}\,\mathrm{AIC}_l = \operatorname{tr} Q^{(l)} + y_0^t Q^{(l)} y_0/\sigma_0^2 + 2k_l = (n - k_l) + \lambda^{(l)} + 2k_l,
\]
\[
\operatorname{Var}\mathrm{AIC}_l = 2 \operatorname{tr} Q^{(l)} + 4\, y_0^t Q^{(l)} y_0/\sigma_0^2 = 2(n - k_l) + 4\lambda^{(l)},
\]
\[
\mathrm{E}\,\Delta_{12} = \operatorname{tr}\bigl(Q^{(2)} - Q^{(1)}\bigr) + y_0^t \bigl(Q^{(2)} - Q^{(1)}\bigr) y_0/\sigma_0^2 + 2(k_2 - k_1) = (k_2 - k_1) + \bigl(\lambda^{(2)} - \lambda^{(1)}\bigr),
\]
\[
\operatorname{Var}\Delta_{12} = 2 \operatorname{tr}\bigl[(Q^{(2)} - Q^{(1)})^2\bigr] + 4\, y_0^t \bigl[(Q^{(2)} - Q^{(1)})^2\bigr] y_0/\sigma_0^2.
\]

Proof. The first three of these follow immediately from (3.9) by using E[χ²_r(λ)] = r + λ and Var[χ²_r(λ)] = 2r + 4λ. All four follow from (3.8) by using the known expressions for normal moments (i.e., the moments of the components of the normal random vector y ∈ ℝⁿ). □

The joint distribution of a pair of quadratic forms in a normal vector such as y ∈ ℝⁿ is complicated, and in general can only be expressed in terms of special functions (Mathai & Provost 1992). But some cases can be treated in closed form. For instance, yᵗQ⁽¹⁾y and yᵗQ⁽²⁾y are independent if (and only if) Q⁽¹⁾Q⁽²⁾ = 0, by the Craig–Sakamoto theorem (Ogawa & Olkin 2008).

A case more important in applications is the following. Suppose that normal linear regression models M₁, M₂, such as the pair considered here, satisfy M₂ ⊂ M₁. That is, M₂ is a reduced version of the fuller model M₁, obtained by parametric restriction. Then they are nested: their estimation subspaces L₁, L₂ are related by L₂ ⊂ L₁ and L₂⊥ ⊃ L₁⊥, so that Q⁽¹⁾Q⁽²⁾ = Q⁽¹⁾. If the vector y₀ in the true model M∗ satisfies y₀ ∈ L₂ ⊂ L₁, so that neither of M₁, M₂ is mis-specified and λ⁽¹⁾ = λ⁽²⁾ = 0, then in addition to the distributional statement RSS_l/σ₀² ∼ χ²_{n−k_l}, one has (RSS₂ − RSS₁)/σ₀² ∼ χ²_{k₁−k₂}. If alternatively y₀ ∈ L₁ \ L₂, so that M₂ is mis-specified but the fuller model M₁ is not, then

\[
\Delta_{12} + 2(k_1 - k_2) = (\mathrm{RSS}_2 - \mathrm{RSS}_1)/\sigma_0^2 \;\sim\; \chi^2_{k_1-k_2}(\lambda^{(2)}). \tag{3.10}
\]

Such situations are familiar from multivariate regression, and lead to (partial) F-tests of the significance of linear regressors (Mardia et al. 1979).
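Theorem 3.4's formulas are straightforward to evaluate for concrete designs. The Python sketch below is our own toy setup (two non-nested designs and an arbitrary y₀ fitting neither); it computes E ∆₁₂ and Var ∆₁₂ from the projection matrices and checks them against simulation.

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma0 = 80, 1.0
x = np.linspace(0.0, 1.0, n)
X1 = np.column_stack([np.ones(n), x, x**2])          # candidate M1 (k1 = 3)
X2 = np.column_stack([np.ones(n), np.exp(x)])        # candidate M2 (k2 = 2)
y0 = np.sin(3 * x)                                   # true mean; fits neither

def proj(X):
    return X @ np.linalg.solve(X.T @ X, X.T)

Q1, Q2 = np.eye(n) - proj(X1), np.eye(n) - proj(X2)
D = Q2 - Q1
k1, k2 = X1.shape[1], X2.shape[1]
mean_th = (k2 - k1) + (y0 @ Q2 @ y0 - y0 @ Q1 @ y0) / sigma0**2
var_th = 2 * np.trace(D @ D) + 4 * y0 @ D @ D @ y0 / sigma0**2

def delta12(y):
    return (y @ Q2 @ y - y @ Q1 @ y) / sigma0**2 + 2 * (k2 - k1)

sims = np.array([delta12(y0 + sigma0 * rng.standard_normal(n))
                 for _ in range(20000)])
print(mean_th, sims.mean())     # Theorem 3.4 expectation vs simulation
print(var_th, sims.var())       # Theorem 3.4 variance vs simulation
```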
However, we wish also to handle Q⁽¹⁾, Q⁽²⁾, or equivalently L₁, L₂, that are less closely related: non-nestedness and more general mis-specifications should be allowed. To motivate the proposed hypothesis test, a simple limit theorem will now be proved, on the distribution of ∆₁₂ in a case often encountered in the physical sciences. This is when the models M₁, M₂ are at least slightly mis-specified relative to the (unknown, presumably infinite-dimensional) true model M∗, in the rather consequential sense that each has a non-zero mean fitting error per data point. Hence one expects that in the n → ∞ limit, the mis-specification parameters λ⁽ˡ⁾ := y₀ᵗQ⁽ˡ⁾y₀/σ₀² will grow proportionately to n (generically, at different rates). For further discussion of mis-specification regimes, see § 4.

In the theorem a certain trace condition will appear as a hypothesis. It is motivated by the following consideration. From tr Q⁽ˡ⁾ = n − k_l it follows that tr(Q⁽²⁾ − Q⁽¹⁾) = k₁ − k₂. If there is nestedness and L₂ ⊂ L₁, then Q⁽²⁾ − Q⁽¹⁾ is also a projection, and idempotent; thus for any specified m ≥ 1, tr[(Q⁽²⁾ − Q⁽¹⁾)^m] is O(1), i.e., it does not grow with n. It is reasonable to suppose that this condition will hold if M₁, M₂, even if non-nested, are sufficiently similar to justify their being used as competing models of the same data. (Note that if the condition holds for m = 2 then it holds for all m ≥ 2, by a standard trace norm inequality.) The condition does not hold in the maximally dissimilar case Q⁽¹⁾Q⁽²⁾ = 0, as tr[(Q⁽²⁾ − Q⁽¹⁾)²] then equals tr Q⁽¹⁾ + tr Q⁽²⁾.

Definition 3.5. Consider a sequence of triples (M₁, M₂, M∗) indexed by n (including sequences of vectors y₀ ∈ ℝⁿ and design matrices), with a common error variance σ₀². The n × n projections Q⁽¹⁾, Q⁽²⁾ are defined as usual. If as n → ∞, y₀ᵗ[(Q⁽²⁾ − Q⁽¹⁾)²]y₀ is bounded below by a positive multiple of n, while (much more routinely) the mis-specification parameters λ⁽ˡ⁾ = y₀ᵗQ⁽ˡ⁾y₀/σ₀² of the two models are bounded above by a positive multiple of n, the models are said to be asymptotically separately mis-specified.

Remark. Nested models M₁, M₂ will be asymptotically separately mis-specified if λ⁽¹⁾, λ⁽²⁾, |λ⁽¹⁾ − λ⁽²⁾| all grow proportionately to n.

Theorem 3.6. In this setting of a sequence of triples (M₁, M₂, M∗) indexed by n, if the candidate models are asymptotically separately mis-specified, and also satisfy the trace condition that tr[(Q⁽²⁾ − Q⁽¹⁾)²] is O(1), then the distribution of ∆₁₂ := AIC₂ − AIC₁, the expectation and variance of which are given in Theorem 3.4, is asymptotically normal as n → ∞.

Remark. Under the conditions of this theorem, Var ∆₁₂ will be bounded below by a positive multiple of n. This is because according to Theorem 3.4, Var ∆₁₂ equals a combination of tr[(Q⁽²⁾ − Q⁽¹⁾)²] and y₀ᵗ[(Q⁽²⁾ − Q⁽¹⁾)²]y₀.

Proof. The second cumulant c₂(∆₁₂) = Var ∆₁₂ is bounded below by a positive multiple of n, as just remarked. It is easily seen that each higher cumulant c_m(∆₁₂), m ≥ 3, is O(n). For instance, c₃(∆₁₂) equals a combination of tr[(Q⁽²⁾ − Q⁽¹⁾)³] and y₀ᵗ[(Q⁽²⁾ − Q⁽¹⁾)³]y₀, and these are respectively O(1) and O(n). Therefore the moments of (∆₁₂ − E ∆₁₂)/(Var ∆₁₂)^{1/2} tend to those of N(0, 1), because its higher cumulants tend to zero. □

The hypothesis test is suggested by the following. Recall that up to an unimportant additive constant, AIC_l/2 is an unbiased estimator of the expected overall discrepancy E_y ODˡ(y), i.e., the negative of the expected log-likelihood after fitting, the expectation being over data generated by M∗.

Corollary 3.7. In the above setting, under the null hypothesis that E_y OD¹(y) = E_y OD²(y) for all n, i.e., that M₁, M₂ are equally discrepant from M∗ for all n, the distribution of

\[
\Delta_{12} \Bigm/ \Bigl\{ 2 \operatorname{tr}\bigl[(Q^{(2)} - Q^{(1)})^2\bigr] + 4\, y_0^t \bigl[(Q^{(2)} - Q^{(1)})^2\bigr] y_0/\sigma_0^2 \Bigr\}^{1/2}
\]

tends to N(0, 1) as n → ∞.

Proof. The denominator is (Var ∆₁₂)^{1/2}, as given in Theorem 3.4. □

An unbiased estimator of Var ∆₁₂ is the quantity

\[
-2 \operatorname{tr}\bigl[(Q^{(2)} - Q^{(1)})^2\bigr] + 4\, y^t \bigl[(Q^{(2)} - Q^{(1)})^2\bigr] y/\sigma_0^2, \tag{3.11}
\]

as follows by evaluating its expectation over y. It is not guaranteed to be positive, but the probability of its being so tends to unity as n → ∞, since the second term increasingly dominates the first.
(A maximum likelihood estimator could perhaps be used instead, but even when M₁, M₂ are nested and ∆₁₂ has essentially a non-central chi-squared distribution as in (3.10), MLE is difficult to perform (Anderson 1981).) The expression (3.11) is the key to the following test, which can be applied at any fixed n.

Hypothesis Test. To test the null hypothesis H₀ that M₁, M₂ are equally discrepant from the true model M∗, against the alternative that they are not, calculate what is asymptotically an N(0, 1) test statistic,

\[
z^{(12)} := \Delta_{12} \Bigm/ \Bigl\{ -2 \operatorname{tr}\bigl[(Q^{(2)} - Q^{(1)})^2\bigr] + 4\, y^t \bigl[(Q^{(2)} - Q^{(1)})^2\bigr] y/\sigma_0^2 \Bigr\}^{1/2}.
\]

If |z⁽¹²⁾| > z₀, where P(|Z| > z₀) = α for a standard normal variable Z, the evidence against H₀ is significant at level α. Equivalently, the p-value associated to H₀ is given by the formula p = P(|Z| > |z⁽¹²⁾|). To test against a one-sided alternative that one model is less divergent than the other, proceed similarly.

Estimating the variance of the AIC difference by (3.11) is what makes this z-test possible. (The small probability that the estimated variance may be non-positive should be noted.) It should be stressed that the normality of the test statistic is a good approximation only for large-n models M₁, M₂ that differ appreciably in their mis-specification. In general one would need to exploit the joint distribution of the forms yᵗQ⁽¹⁾y and yᵗQ⁽²⁾y, which is complicated.

This test of the significance of an AIC difference is modelled after a z-test proposed by Linhart (1988). His test is based on the large-sample (N → ∞) properties of minimum discrepancy estimators, and is not restricted to normal regression. Our test applies when N = 1, and is valid in the rather different n → ∞ limit. However, the need for an asymptotic mis-specification occurs in his analysis, as in ours. In its absence, the test statistic could have a limiting non-central chi-squared distribution, rather than a normal one (cf. Steiger et al. 1985).
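Here is a minimal Python implementation of this test (ours; the function name, design matrices, and data are illustrative). It computes ∆₁₂, the variance estimate (3.11), the z statistic, and the two-sided p-value.

```python
import numpy as np
from scipy.stats import norm

def aic_z_test(X1, X2, y, sigma0_sq):
    """Two-sided z-test of H0: the fitted models are equally discrepant
    from the truth, for normal linear models with known error variance.
    Returns (delta12, z, p). Asymptotic; see the caveats in the text."""
    n = y.size
    def Q(X):   # projection onto the error space of a design matrix
        return np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
    Q1, Q2 = Q(X1), Q(X2)
    delta12 = (y @ Q2 @ y - y @ Q1 @ y) / sigma0_sq \
              + 2 * (X2.shape[1] - X1.shape[1])            # AIC_2 - AIC_1
    D2 = (Q2 - Q1) @ (Q2 - Q1)
    var_est = -2 * np.trace(D2) + 4 * y @ D2 @ y / sigma0_sq   # (3.11)
    if var_est <= 0:
        raise ValueError("variance estimate non-positive; test inapplicable")
    z = delta12 / np.sqrt(var_est)
    return delta12, z, 2 * norm.sf(abs(z))

# Illustrative use with made-up competing models:
rng = np.random.default_rng(7)
n, s0 = 100, 1.0
x = np.linspace(0.0, 1.0, n)
y = np.sin(3 * x) + s0 * rng.standard_normal(n)
X1 = np.column_stack([np.ones(n), x, x**2])
X2 = np.column_stack([np.ones(n), np.exp(x)])
print(aic_z_test(X1, X2, y, s0**2))
```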
Throughout this section we have dealt with linear regression models. But it is not difficult to extend the estimation of $\operatorname{Var} \Delta_{12}$, and hence the proposed hypothesis test, to models $M_1, M_2$ that are non-linear. The following is a sketch. Suppose that model $M_l$, $l = 1, 2$, is defined by $y = y_0^{(l)} + \epsilon^{(l)}$, where $y_0^{(l)} = y_0^{(l)}(\beta^{(l)})$ is a sufficiently smooth function, not necessarily linear, of the parameter vector $\beta^{(l)} \in \mathbb{R}^{k_l}$. In the non-linear case the estimation subspace $L_l \subset \mathbb{R}^n$ is replaced by an estimation submanifold of dimension $k_l$, but $M_l$ can be fitted to any datum $y \in \mathbb{R}^n$ by non-linear regression (Bates & Watts 1988). There will be a best-fit choice $\hat\beta^{(l)} \in \mathbb{R}^{k_l}$ for the parameter vector, and a predicted vector $\hat y^{(l)} = y_0^{(l)}(\hat\beta^{(l)})$. As usual, the residual sum of squares $\mathrm{RSS}_l$ equals $(y - \hat y^{(l)})^t (y - \hat y^{(l)})$. The difference from the linear case is this: $\mathrm{RSS}_l$ is no longer quadratic in $y_0$, as in Eq. (3.8b). But it is straightforward to derive a power series in $y_0$ for $\mathrm{RSS}_l$ from a Taylor expansion of $y_0(\beta^{(l)})$ about the point $\beta^{(l)} = \hat\beta^{(l)}$. In much the same way, one can obtain an expansion of $\operatorname{Var} \Delta_{12}$ in powers of $y_0$. From this one can readily construct an unbiased estimator of $\operatorname{Var} \Delta_{12}$ as a power series in $y$, by requiring unbiasedness to each order. By employing a truncation of this series, which is a generalization of the quadratic estimator (3.11), one can extend the proposed test to candidate normal regression models that are non-linear. Thus for non-linear models as for linear ones, it may be possible to employ a decision procedure that relies on the variability of the $\Delta_{12}$ statistic, rather than on Akaike weights.

4. Selection without error bars

The applicability in model selection of the AIC and AICc statistics will now be considered, in the case when the linear regression models being assessed are fitted to a data set without error bars. This is quite different from the case when error (i.e., residual) variances are known and have been incorporated in each model. The calculations below reveal the need for the AICc correction, but also reveal a serious difficulty when a candidate model is mis-specified by an unknown amount. It has long been known that applying the AIC(c) to a mis-specified model is problematic (Sawa 1978; Reschenhofer 1999), but we obtain precise expressions for the asymptotic ($n \to \infty$) shift in the AICc coming from the mis-specification. Our results are similar to those of Noda et al. (1996), but are more explicit.

As in § 3a, take each candidate regression model $M_l$ to be normal linear, of the form $y = X^{(l)} \beta^{(l)} + \epsilon^{(l)}$ where $\epsilon^{(l)} = \sigma z^{(l)}$, with $z^{(l)}$ a column vector of independent standard normals. The parameter $\theta^{(l)} = (\beta^{(l)}; \sigma^2)$ now includes, besides $\beta^{(l)} \in \mathbb{R}^{k_l}$, the residual variance $\sigma^2$, which must also be fitted. By MLE, if a single datum $y \in \mathbb{R}^n$ is available, then $\hat\sigma^2$ equals $\mathrm{RSS}_l / n$. That is, $\hat\sigma^2 = y^t Q^{(l)} y / n$, where $Q^{(l)}$ projects onto the left null space of $X^{(l)}$ (the error space). The true model $M_*$ is $y = y_0 + \epsilon_0$ with $\epsilon_0 = \sigma_0 z_0$, in which both $y_0 \in \mathbb{R}^n$ and $\sigma_0^2$ are unknown. The deterministic mis-specification of $M_l$, if any, is quantified by the parameter $\lambda^{(l)} = y_0^t Q^{(l)} y_0 / \sigma_0^2$, which is a measure of the distance in $\mathbb{R}^n$ between $y_0$ and the column space of $X^{(l)}$ (the estimation space).

In many reasonable data-gathering and regression procedures, $n$ can be taken arbitrarily large; so the large-$n$ behavior of $\lambda^{(l)}$ merits discussion. One possibility is that $\lambda^{(l)}/n$ will tend to a limit as $n \to \infty$, like $\hat\sigma^2 = y^t Q^{(l)} y / n$. That is, in the limit some fraction of the RSS may be attributable to fitting errors of non-zero mean, coming from mis-specification, rather than to the random errors of mean zero and typical size $\sigma_0$ that come from stochasticity in the data-generating process $M_*$. (This possibility was discussed in § 3a.) Another possibility is that $\lambda^{(l)}$ will grow sublinearly in $n$ or even tend to a finite value, for a subtle reason: as $n$ increases, it may be possible to enhance the regression by improving or expanding the model $M_l$, giving an even better fit to $M_*$. But it must be stressed that in the present framework, which does not make explicit the possibility of taking $n$ to infinity or even of varying $n$, there is no way of distinguishing the fractional contribution made to $\mathrm{RSS}_l$ by a non-zero mis-specification $\lambda^{(l)}$, or of estimating its magnitude. Of course, in applications where the components $(y_i)_{i=1}^n$ of the observed vector $y$ are the values of a response variable corresponding to values $(x_i)_{i=1}^n$ of an explanatory one, one may sometimes be able to estimate this fraction by examining a residual plot; the simulation sketched below also illustrates the point.
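The absorption of mis-specification into $\hat\sigma^2$ is easy to exhibit numerically. In the following minimal sketch (Python with NumPy; the straight-line candidate model and the quadratic term in the simulated truth are hypothetical choices made purely for illustration), $E\,\hat\sigma^2$ is approximately $\sigma_0^2 (n - k)/n + y_0^t Q y_0 / n$, and nothing in the single datum $y$ reveals how the total splits between the two contributions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma0 = 1000, 0.5
x = np.linspace(0.0, 1.0, n)
y0 = 1.0 + 2.0 * x + 0.3 * x**2            # truth includes a quadratic term
X = np.column_stack([np.ones(n), x])       # candidate model: straight line
y = y0 + sigma0 * rng.standard_normal(n)   # single datum generated by M*

Q = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # error-space projection
sigma2_hat = y @ Q @ y / n                 # MLE of the variance, RSS/n
lam = y0 @ Q @ y0 / sigma0**2              # mis-specification parameter;
                                           # computable here only because
                                           # the simulation knows y0
print(sigma2_hat)                          # close to the combination below
print(sigma0**2 * (n - 2) / n + lam * sigma0**2 / n)
```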
As in § 3a, where there was no need to estimate $\sigma^2$, an explicit expression for the AIC-type selection criterion MSC (see Definition 2.2) is readily obtained. The MSC is defined so that $\mathrm{MSC}/2$ for any candidate $M = M_\theta$ of the form $y = X\beta + \epsilon$, with $\beta \in \mathbb{R}^k$, is an unbiased estimator of the expected overall discrepancy $E_y\,\mathrm{OD}(y)$ under $M_*$. This is because, according to the policy of § 2, it is the latter that should be used in model selection. For the discrepancy $d_{\mathrm{KL}}$, $\mathrm{OD}(y)$ equals $d_{\mathrm{KL}}(g; f_{\hat\theta(y)})$, in which $\hat\theta = (\hat\beta, \hat\sigma^2)$ is the fitted parameter obtained by MLE. Here $\hat\beta(y) = (X^t X)^{-1} X^t y \in \mathbb{R}^k$ as usual; and now $\hat\sigma^2 = y^t Q y / n$, where $Q = I_n - P$ and $P$ projects onto the column space of $X$. The PDFs $g, f_\theta$ of $M_*, M_\theta$ are
$$g(y) = (2\pi\sigma_0^2)^{-n/2} \exp\big[-(y - y_0)^t (y - y_0) / 2\sigma_0^2\big], \qquad (4.1)$$
$$f_\theta(y) = (2\pi\sigma^2)^{-n/2} \exp\big[-(y - X\beta)^t (y - X\beta) / 2\sigma^2\big], \qquad (4.2)$$
and by direct computation,
$$d_{\mathrm{KL}}(f_\theta, f_\theta) = C_n + \frac{n}{2} \ln(\sigma^2) \qquad (4.3)$$
as in (3.4), where we now write $C_n := (n/2)\,[1 + \ln(2\pi)]$.

What can be calculated from the datum $y \in \mathbb{R}^n$ is not $\mathrm{OD}(y)$ but the fitted discrepancy $\mathrm{FD}(y)$, i.e., $d_{\mathrm{KL}}(g_y; f_{\hat\theta(y)})$, where $g_y$ is a 1-point atomic PDF. This is simply the negative log-likelihood $-\ln L_f(\hat\theta \mid y)$ of the fitted model $M_{\hat\theta}$. As in § 3a, the MSC is given by
$$\mathrm{MSC}/2 = \mathrm{FD}(y) + B := \mathrm{FD}(y) + E_y[\mathrm{OD}(y) - \mathrm{FD}(y)], \qquad (4.4)$$
where the '$B$' term performs the unbiasing. Also much as before (see (3.5)),
$$\mathrm{FD}(y) = -\ln L_f(\hat\theta \mid y) = d_{\mathrm{KL}}(f_{\hat\theta}; f_{\hat\theta}) - \frac{n}{2} + \frac{\mathrm{RSS}}{2\hat\sigma^2} \qquad (4.5)$$
expresses the fitted discrepancy in terms of the RSS, which is $y^t Q y$. But now, under the true model $M_*$, the fitted variance $\hat\sigma^2$ as well as the RSS is a random variable. Since $\hat\sigma^2$ equals $\mathrm{RSS}/n$, (4.5) simplifies to
$$\mathrm{FD}(y) = -\ln L_f(\hat\theta \mid y) = d_{\mathrm{KL}}(f_{\hat\theta}; f_{\hat\theta}) = C_n + \frac{n}{2} \ln(\hat\sigma^2). \qquad (4.6)$$

The following is the counterpart of Theorem 3.1. In the statement, $\lambda := y_0^t Q y_0 / \sigma_0^2$ is the (presumably unknown) $M$-specific mis-specification parameter.

Theorem 4.1. Under the model $M_*$, the overall and fitted discrepancies of the fitted model $M_{\hat\theta(y)}$, $\mathrm{OD} = \mathrm{OD}(y)$ and $\mathrm{FD} = \mathrm{FD}(y)$, are distributed according to
$$\mathrm{OD} = d_{\mathrm{KL}}(g; f_{\hat\theta(y)}) \sim C_n + \frac{n}{2} \ln(\hat\sigma^2) + \frac{n}{2}\left(\frac{\sigma_0^2}{\hat\sigma^2} - 1\right) + \frac{\sigma_0^2}{2\hat\sigma^2}\,\chi^2_k(\lambda),$$
$$\mathrm{FD} = d_{\mathrm{KL}}(g_y; f_{\hat\theta(y)}) \sim C_n + \frac{n}{2} \ln(\hat\sigma^2),$$
where $\hat\sigma^2$ equals $(\sigma_0^2/n)$ times a random variable with distribution $\chi^2_{n-k}(\lambda)$, and $\chi^2_k(\lambda)$ signifies a random variable that is independent of $\hat\sigma^2$.

Proof. That $\hat\sigma^2$ equals $(\sigma_0^2/n)$ times a random variable with non-central chi-squared distribution $\chi^2_{n-k}(\lambda)$ follows from the representation
$$\hat\sigma^2 / (\sigma_0^2/n) = y^t Q y / \sigma_0^2 = (z_0 + y_0/\sigma_0)^t Q (z_0 + y_0/\sigma_0). \qquad (4.7)$$
(Cf. (3.6b).) As in the proof of Theorem 3.1, $\mathrm{OD}(y)$ is calculated by using the definition (2.9) of $d_{\mathrm{KL}}$ and the definitions (4.1), (4.2) of $g, f_\theta$.
The definite integral in the definition of $d_{\mathrm{KL}}$ can be evaluated in closed form, and the resulting quadratic form $(Py - y_0)^t (Py - y_0) / \sigma_0^2$ has distribution $\chi^2_k(\lambda)$. (Cf. (3.6a).) That this and the quadratic form $\hat\sigma^2 = y^t Q y / n$ are independent follows from the 'if' part of the Craig–Sakamoto theorem, mentioned above, since the projection matrices $P, Q$ are complementary: they satisfy $PQ = 0$. $\square$

Theorem 4.2. The unbiasing term '$2B$' in the definition (4.4) of the AIC-type selection criterion MSC is expressed in terms of the moments of non-central chi-squared random variables by
$$2B = 2 E_y[\mathrm{OD}(y) - \mathrm{FD}(y)] = n \left\{ n\, E\big[(\chi^2_{n-k}(\lambda))^{-1}\big] - 1 + E\big(\chi^2_k(\lambda)\big)\, E\big[(\chi^2_{n-k}(\lambda))^{-1}\big] \right\}$$
$$= n \left\{ -1 + (n + k + \lambda) \left[ \frac{1}{n-k-2} - \frac{1}{(n-k)(n-k-2)}\,\lambda + \cdots \right] \right\}.$$

Proof. This comes from the formulas of Theorem 4.1 by exploiting $\hat\sigma^2 \sim (\sigma_0^2/n)\,\chi^2_{n-k}(\lambda)$ and independence. The first moment $E(\chi^2_k(\lambda))$ equals $k + \lambda$, and the series in $\lambda$ for the negative first moment $E[(\chi^2_r(\lambda))^{-1}]$ (where $r = n - k$), appearing in square brackets, is taken from Appendix A. $\square$

The preceding calculation reduces, when $\lambda = 0$ and $M$ is not mis-specified, to a derivation of the AICc that has been given by several authors (Sugiura 1978; Hurvich & Tsai 1989; Cavanaugh 1997). But it is the generalization to non-zero $\lambda$, which is similar to one of Hurvich & Tsai (1991), that is of interest. The chi-squared variables in the formula for the unbiasing term, which if $\lambda$ were zero would be central, become non-central with non-centrality parameter $\lambda$.

As was explained above, in some applications it is reasonable for the mis-specification $\lambda$ of a candidate model to be large, in the sense that it grows linearly in $n$; that is, if the regression procedure is such that $n$ can be taken arbitrarily large. But it is also useful to consider the case of 'medium mis-specification,' when to leading order $\lambda$ grows proportionately to $n^{1/2}$, and that of 'small mis-specification,' when $\lambda$ is bounded in $n$ as $n \to \infty$. Hence $\lambda$ will now be allowed to grow according to
$$\lambda \sim \lambda_1 n + \lambda_{1/2}\, n^{1/2} + \lambda_0 + o(1), \qquad n \to \infty.$$

Theorem 4.3. Let the additive constant $2C_n$ be dropped from the definition of the AIC-type selection criterion MSC. Then if the mis-specification $\lambda$ of a model $M$ equals zero, its MSC reduces to the standard AICc given in (1.5),
$$\mathrm{AICc} = n \ln(\hat\sigma^2) + \frac{2(k+1)\,n}{n-k-2} \sim \mathrm{AIC} + \frac{2(k+1)(k+2)}{n} + O(1/n^2), \qquad n \to \infty,$$
where $\mathrm{AIC} = n \ln(\hat\sigma^2) + 2(k+1)$. In the regime of small mis-specification, when $\lambda \sim \lambda_0 + o(1)$ as $n \to \infty$ with $\lambda_0 > 0$,
$$\mathrm{MSC} \sim \mathrm{AICc} - \frac{\lambda_0 (2k + \lambda_0)}{n} + O(1/n^2), \qquad n \to \infty.$$
In the regime of medium mis-specification, when $\lambda \sim \lambda_{1/2}\, n^{1/2} + o(n^{1/2})$ as $n \to \infty$ with $\lambda_{1/2} > 0$,
$$\mathrm{MSC} \sim \mathrm{AIC} - \lambda_{1/2}^2 + O(1/n^{1/2}), \qquad n \to \infty.$$
In the regime of large mis-specification, when $\lambda \sim \lambda_1 n + o(n)$ as $n \to \infty$ with $\lambda_1 > 0$, the MSC equals AIC plus a $\lambda_1$-dependent quantity growing with $n$.

Proof. Substitute the leading-order behavior of $\lambda$ into the formula given in Theorem 4.2. It should be noted that if $\lambda \sim \lambda_1 n$, all terms of the power series in $\lambda$ for $E[(\chi^2_{n-k}(\lambda))^{-1}]$ will contribute, as each will be of order $1/n$. $\square$
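As a numerical check on these formulas, the exact unbiasing term $2B$ of Theorem 4.2 can be evaluated from the series (A.2) of Appendix A and compared with the $\lambda = 0$ AICc correction $2(k+1)n/(n-k-2)$. A minimal sketch follows (Python with SciPy; the values of $n$, $k$ and $\lambda$ are arbitrary test inputs, not taken from the text).

```python
import numpy as np
from scipy.special import gammaln

def inv_moment_ncx2(r, lam, terms=400):
    """Negative first moment E[(chi^2_r(lam))^{-1}], from the series (A.2)."""
    if lam == 0.0:
        return 1.0 / (r - 2)
    m = np.arange(terms)
    log_w = -lam / 2 + m * np.log(lam / 2) - gammaln(m + 1)   # Poisson weights
    return float(np.sum(np.exp(log_w) / (r - 2 + 2 * m)))

def unbiasing_term(n, k, lam):
    """Exact 2B of Theorem 4.2; reduces to the AICc correction when lam = 0."""
    e_inv = inv_moment_ncx2(n - k, lam)
    return n * ((n + k + lam) * e_inv - 1.0)

n, k = 50, 3
print(unbiasing_term(n, k, 0.0))   # equals 2(k+1)n/(n-k-2) = 8.888...
print(unbiasing_term(n, k, 5.0))   # smaller, by roughly lam*(2k+lam)/n
```

Letting lam grow proportionately to n in this sketch exhibits the large shift described in the final regime of the theorem.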
This theorem has disconcerting implications for the usefulness of the AIC and AICc in model selection. It reveals how different the case of an unknown error variance $\sigma^2$ is from the case of a known $\sigma^2$ (treated in § 3a). If $\lambda = 0$ and the model is not mis-specified, the theorem confirms that including the standard AICc correction term of magnitude $O(1/n)$ is justified. This term may affect the selection procedure if $n$ is small. But if $\lambda = \lambda_0 + O(1/n)$ as $n \to \infty$ with $\lambda_0$ non-zero, the coefficient of $1/n$ in the correction term will deviate from the AICc form. Since $\lambda_0$ is typically not known, this renders difficult any small-$n$ correcting of the AIC. In the regime of medium mis-specification the problem is worse: the $O(1)$ unbiasing term $2(k+1)$ in the AIC itself is shifted by an amount depending on $\lambda_{1/2}$. And in the regime of large mis-specification, in which applications of model selection may well lie, the AIC is shifted by a potentially large amount, growing with $n$. This shift may swamp the term $2(k+1)$.

Theorem 4.3 indicates that when deciding between candidate models that have been fitted to a data set without error bars, it may be unwise to use the AICc or even the AIC, if there is any possibility that the models are mis-specified relative to the true data-generating process $M_*$, and if the amount of mis-specification is unknown but is expected to be substantial. Since in the physical sciences $M_*$ is typically infinite-dimensional, candidate models that are mis-specified, at least to some extent, are expected to occur quite widely.

5. Summary and discussion

In this paper the AIC and AICc were developed from first principles, to clarify their ability to assess competing regression models of the sort common in the physical sciences: ones with normal errors and known error variances, coming from the error bars of a size-$n$ data set. The data set was viewed as providing a single observation ($N = 1$) of an $S$-valued random quantity, the space $S$ being $\mathbb{R}^n$.

In § 2 a model selection policy was formulated, applying to arbitrary $S$ and arbitrary $N$. The Kullback–Leibler divergence $d_{\mathrm{KL}}$ was then chosen as the discrepancy functional, which the policy left unspecified. The choice of $d_{\mathrm{KL}}$ ensures that fitted models are compared on the basis of their fitted log-likelihoods, suitably unbiased (i.e., penalized). It was noted that other measures of the discrepancy between a parametric model and a data set could be used, such as the popular Hellinger distance. This option is worth exploring, since MLE is not robust and may not be the best choice if, say, the regression is non-linear or non-normal errors are present. But as was explained, this will require that the selection policy be modified to include a form of kernel density estimation.

It was shown in § 3a that when fitting a linear regression model to data with error bars, the AIC and not the AICc should be used. (For comments on the recent astrophysics literature, see Appendix B.) If the model incorporates a known error variance $\sigma^2$, its mis-specification, if any, does not affect the validity of the AIC, though it causes certain discrepancy statistics to have non-central rather than central chi-squared distributions.
That no additional unbiasing of the AIC is needed when $\sigma^2$ is known has in fact been noticed (Kuha 2004, p. 209), but seems to have attracted little attention. In applications of the AIC in the physical sciences, it is of considerable importance.

In § 3b, it was shown that in the same setting as that of § 3a, the variability of an AIC difference can be estimated. A test of significance was proposed, which exploits what under reasonable conditions of mis-specification is the asymptotic ($n \to \infty$) normality of this statistic. The significance test is a test for selection, which can potentially replace the use of Akaike weights in deciding between regression models with known error variances.

The approach of § 3b to model selection resembles the approach of Commenges et al. (2008). In a general large-sample ($N \to \infty$) context, not focused on the comparison of regression models, they proposed a test for selection based on a difference of two AICs. They were able to work out the asymptotic distribution of a normalized version of this difference by exploiting large-sample theory for the likelihood ratio statistic. This included the classical result of Wald (1943) on the comparison of nested models, involving a non-central chi-squared distribution, and an asymptotic normality result of Vuong (1989), who dealt with non-nested models. The test proposed in § 3b is similar in spirit, but in the formulation used here the $n \to \infty$ limit of a regression model is rather different from an $N \to \infty$ large-sample limit, and requires its own analysis.

In § 4 the usual derivation of the AICc statistic, applying to linear regression models fitted to data sets without error bars, was extended to models with non-zero mis-specification $\lambda$. The appearance of a non-central chi-squared in the distribution of the overall Kullback–Leibler discrepancy is not unexpected. But the behavior of the unbiasing term in the large-$n$ limit is cause for concern. The case when the model mis-specification $\lambda$ is $o(1)$ as $n \to \infty$ is the nicest. (It has a close analogue in the large-sample theory of the likelihood ratio: the case of 'local alternatives,' when the true value of a model parameter is taken to approach the pseudo-true value as $N \to \infty$.) Except in this case, the shift in the AICc due to the mis-specification may swamp, in the large-$n$ limit, the AICc correction term, and even the usual $2(k+1)$ unbiasing term. For mis-specified regression models fitted to data without error bars, this may well affect the usefulness of the AICc as a tool in model selection. The extent to which this problem occurs in the physical sciences remains to be studied.

Appendix A. Non-central chi-squared distributions and quadratic forms

A (central) $\chi^2$ distribution with $r$ degrees of freedom, denoted $\chi^2_r$, is the distribution of the sum of the squares of $r$ independent standard normal random variables. That is, if $z$ is a column vector of $r$ standard normals, then $z^t z \sim \chi^2_r$. There is a generalization: if $P$ is an $r \times r$ projection matrix of rank $s$ with $0 \le s \le r$, the quadratic form $(Pz)^t (Pz) = z^t P z$ has distribution $\chi^2_s$. A $\chi^2$ distribution with $r$ degrees of freedom and non-centrality parameter $\lambda$, denoted $\chi^2_r(\lambda)$, is the distribution of $(z + u)^t (z + u)$, where $u$ is a fixed column vector.
That is, it is the distribution of the sum of the squares of $r$ independent unit-variance normal variables, not necessarily of mean zero. The parameter $\lambda$ equals $u^t u$. There is a generalization: the quadratic form $[Pz + u]^t [Pz + u]$ has distribution $\chi^2_s(\lambda)$ with $\lambda = u^t u$. A second generalization is that $[P(z + u)]^t [P(z + u)] = (z + u)^t P (z + u)$ has distribution $\chi^2_s(\lambda)$ with $\lambda = u^t P u$.

When $r > 1$, the PDF of $\chi^2_r(\lambda)$ cannot be expressed in terms of elementary functions, though it can in terms of the confluent hypergeometric function ${}_0F_1$, or alternatively a modified Bessel function of the first kind. If $X \sim \chi^2_r(\lambda)$ then $X$ has mean and variance
$$E X = r + \lambda, \qquad \operatorname{Var} X = 2r + 4\lambda, \qquad (A.1)$$
and negative first moment
$$E[X^{-1}] = e^{-\lambda/2} \sum_{m=0}^{\infty} \frac{(\lambda/2)^m}{m!} \frac{1}{r - 2 + 2m}. \qquad (A.2)$$
For details, see Mathai & Provost (1992) and Bock et al. (1984).
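These moment formulas are easily validated by simulation. A brief sketch (Python with SciPy; the values of $r$ and $\lambda$ are arbitrary test inputs) compares (A.1) and the series (A.2) against Monte Carlo estimates.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import ncx2

r, lam = 10, 3.5
x = ncx2.rvs(df=r, nc=lam, size=200_000, random_state=1)

print(x.mean(), r + lam)             # (A.1): E[X] = r + lambda
print(x.var(), 2 * r + 4 * lam)      # (A.1): Var[X] = 2r + 4*lambda

# (A.2): E[1/X] as a Poisson(lam/2)-weighted sum of 1/(r - 2 + 2m)
m = np.arange(200)
w = np.exp(-lam / 2 + m * np.log(lam / 2) - gammaln(m + 1))
print((1.0 / x).mean(), np.sum(w / (r - 2 + 2 * m)))
```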
Appendix B. The AICc in recent papers

The AIC and AICc have recently entered the physical sciences, and in particular astrophysics, by being used to compare cosmological models. Such models have a relatively small number of parameters, and competing models are usually not nested. Models have been compared, e.g., on the basis of their predictions of the distance–redshift relation, which characterizes the expansion of the Universe. After a model is fitted by non-linear regression to observational data, its goodness of fit is assessed by calculating its AIC or AICc. One recent comparison of models, employing the AIC and the Bayesian criterion BIC, is that of Shi et al. (2012).

A search reveals that many though not all publications in this area use the AIC and AICc in a fashion that, on the basis of the present work, can be considered correct. If the observational data are accompanied by error bars, or a common error variance $\sigma^2$ is known or assumed, the AIC should be used; and if the variance is treated as a nuisance parameter to be fitted, the AICc should be used. Davis et al. (2007), Li et al. (2010) and Tan & Biswas (2012) employ the AICc, despite their data sets being accompanied by error bars, which strictly speaking is incorrect; but they observe in their analyses that the AICc correction is of negligible size and does not affect model comparisons. The paper of Tan & Biswas is especially valuable from a statistician's point of view, because they investigate AIC(c) variability empirically rather than theoretically, using a bootstrap procedure.

Unfortunately, a number of papers in the literature are based on data with explicit error bars, but use the AICc without commenting on whether its $O(1/n)$ correction term affects their results. These include the papers of Biesiada & Piórkowska (2009), February et al. (2010), Kelly et al. (2010), Dantas et al. (2011), Basilakos & Pouri (2012), Papageorgiou et al. (2012) and Wang & Zhang (2012). A re-examination of the model comparisons in these papers is surely desirable.

References

Akaike, H. 1973 Information theory and an extension of the maximum likelihood principle. In Second international symposium on information theory (Tsahkadsor, Armenia, 1971) (eds B. N. Petrov & F. Csáki), pp. 267–281. Budapest, Hungary: Akadémiai Kiadó.
Akaike, H. 1992 Information theory and an extension of the maximum likelihood principle. In Breakthroughs in statistics, Volume I (eds S. Kotz & N. L. Johnson), pp. 610–624. New York/Berlin: Springer-Verlag.
Amari, S. & Nagaoka, H. 2000 Methods of information geometry, vol. 191 of Transl. Math. Monographs. Providence, RI: American Mathematical Society (AMS).
Anderson, D. A. 1981 Maximum likelihood estimation in the noncentral chi distribution with unknown scale parameter. Sankhya Ser. B 43, 58–67.
Basilakos, S. & Pouri, A. 2012 The growth index of matter perturbations and modified gravity. Monthly Notices Roy. Astronom. Soc. 423, 3761–3767. Available on-line as arXiv:1203.6724.
Basu, A., Harris, I. R. & Basu, S. 1997 Minimum-distance estimation: The approach using density-based distances. In Robust inference (eds G. S. Maddala & C. R. Rao), vol. 15 of Handbook of Statistics, pp. 21–48. New York/Amsterdam: Elsevier.
Basu, A., Shioya, H. & Park, C. 2011 Statistical inference: The minimum distance approach. Boca Raton, FL: Chapman & Hall/CRC.
Bates, D. M. & Watts, D. G. 1988 Nonlinear regression analysis and its applications. New York: Wiley.
Beran, R. 1977 Minimum Hellinger distance estimates for parametric models. Ann. Statist. 5, 445–463.
Bevington, P. R. & Robinson, D. K. 2003 Data reduction and error analysis for the physical sciences, 3rd edn. Boston: McGraw-Hill.
Bhansali, R. J. 1986 A derivation of the information criteria for selecting autoregressive models. Adv. in Appl. Probab. 18, 360–387.
Bhansali, R. J. & Downham, D. Y. 1977 Some properties of the order of an autoregressive model selected by a generalization of Akaike's EPF criterion. Biometrika 64, 547–551.
Biesiada, M. & Piórkowska, A. 2009 Lorentz invariance violation-induced time delays in GRBs in different cosmological models. Classical Quantum Gravity 26, 125007 (9 pp.). Available on-line as arXiv:1008.2615.
Bock, M. E., Judge, G. G. & Yancey, T. A. 1984 A simple form for the inverse moments of non-central χ² and F random variables and certain confluent hypergeometric functions. J. Econometrics 25, 217–234.
Burnham, K. P. & Anderson, D. R. 2002 Model selection and multimodel inference: A practical information-theoretic approach. New York: Springer-Verlag.
Cavanaugh, J. E. 1997 Unifying the derivations for the Akaike and corrected Akaike information criteria. Statist. Probab. Lett. 33, 201–208.
Cavanaugh, J. E. 1999 A large-sample model selection criterion based on Kullback's symmetric divergence. Statist. Probab. Lett. 42, 333–343.
Cavanaugh, J. E. 2004 Criteria for linear model selection based on Kullback's symmetric divergence. Aust. N. Z. J. Stat. 46, 257–274.
Claeskens, G. & Hjort, N. L. 2008 Model selection and model averaging. Cambridge, UK: Cambridge Univ. Press.
Commenges, D., Sayyareh, A., Letenneur, L., Guedj, J. & Bar-Hen, A. 2008 Estimating a difference of Kullback–Leibler risks using a normalized difference of AIC. Ann. Appl. Statist. 2, 1123–1142. Available on-line as arXiv:0807.4086.
Cox, D. R. 1962 Further results on tests of separate families of hypotheses. J. Roy. Statist. Soc. Ser. B 24, 406–424.
Csiszár, I. & Körner, J. 2011 Information theory: Coding theorems for discrete memoryless systems, 2nd edn. Cambridge, UK: Cambridge Univ. Press.
Dantas, M. A., Alcaniz, J. S., Mania, D. & Ratra, B. 2011 Time and distance constraints on accelerating cosmological models. Phys. Lett. B 699, 239–245. Available on-line as arXiv:1010.0995.
Davis, T. M., Mörtsell, E., Sollerman, J., et al. 2007 Scrutinizing exotic cosmological models using ESSENCE supernova data combined with other cosmological probes. Astrophys. J. 666, 716–725. Available on-line as arXiv:astro-ph/0701510.
Donoho, D. L. & Liu, R. C. 1988 The "automatic" robustness of minimum distance functionals. Ann. Statist. 16, 552–586.
Draper, N. R. & Smith, H. 1998 Applied regression analysis, 3rd edn. New York: Wiley.
Efron, B. 1984 Comparing non-nested linear models. J. Amer. Statist. Assoc. 79, 791–803.
February, S., Larena, J., Smith, M. & Clarkson, C. 2010 Rendering dark energy void. Monthly Notices Roy. Astronom. Soc. 405, 2231–2242. Available on-line as arXiv:0909.1479.
Fraser, D. A. S. & Gebotys, R. J. 1987 Non-nested linear models: A conditional confidence approach. Canad. J. Statist. 15, 375–385.
Hurvich, C. M. & Tsai, C.-L. 1989 Regression and time series model selection in small samples. Biometrika 76, 297–307.
Hurvich, C. M. & Tsai, C.-L. 1991 Bias of the corrected AIC criterion for underfitted regression and time-series models. Biometrika 78, 499–509.
Inagaki, N. 1977 Two errors in statistical model fitting. Ann. Inst. Statist. Math. 29, 131–152.
Kapur, J. N. 1989 Maximum-entropy models in science and engineering, revised edn. New Delhi: Wiley Eastern Ltd.
Kelly, P. L., Hicken, M., Burke, D. L., Mandel, K. S. & Kirshner, R. P. 2010 Hubble residuals of nearby Type Ia supernovae are correlated with host galaxy masses. Astrophys. J. 715, 743–756. Available on-line as arXiv:0912.0929.
Keuzenkamp, H. A., McAleer, M. & Zellner, A. (eds) 2001 Simplicity, inference and modelling. Cambridge, UK: Cambridge Univ. Press.
Konishi, S. & Kitagawa, G. 2008 Information criteria and statistical modeling. New York: Springer.
Kuha, J. 2004 AIC and BIC: Comparison of assumptions and performance. Sociol. Methods Res. 33, 188–229.
Li, M., Li, X. & Zhang, X. 2010 Comparison of dark energy models: A perspective from the latest observational data. Science China: Physics Mechanics Astronomy 53, 1631–1645. Available on-line as arXiv:0912.3988.
Liddle, A. R. 2007 Information criteria for astrophysical model selection. Monthly Notices Roy. Astronom. Soc. 377, L74–L78. Available on-line as arXiv:astro-ph/0701113.
Liese, F. & Vajda, I. 1987 Convex statistical distances. Leipzig, Germany: Teubner.
Linhart, H. 1988 A test whether two AIC's differ significantly. South African Statist. J. 22, 153–161.
Linhart, H. & Zucchini, W. 1986 Model selection. New York: Wiley.
Mardia, K. V., Kent, J. T. & Bibby, J. M. 1979 Multivariate analysis. New York/London: Academic Press.
Mathai, A. M. & Provost, S. B. 1992 Quadratic forms in random variables. New York/Basel: Marcel Dekker.
McQuarrie, A. D. R. & Tsai, C.-L. 1998 Regression and time series model selection. Singapore: World Scientific.
McQuarrie, A., Shumway, R. & Tsai, C.-L. 1997 The model selection criterion AICu. Statist. Probab. Lett. 34, 285–292.
Noda, K., Miyaoka, E. & Itoh, M. 1996 On bias correction of the Akaike information criterion in linear models. Comm. Statist. Theory Methods 25, 1845–1857.
Ogawa, J. & Olkin, I. 2008 A tale of two countries: The Craig–Sakamoto theorem. J. Statist. Plann. Inference 138, 3419–3428.
Papageorgiou, A., Plionis, M., Basilakos, S. & Ragone-Figueroa, C. 2012 A consistent comparison of bias models using observational data. Monthly Notices Roy. Astronom. Soc. 422, 106–116. Available on-line as arXiv:1201.4878.
Pardo, L. 2006 Statistical inference based on divergence measures. Boca Raton, FL: Chapman & Hall/CRC.
Parr, W. C. 1981 Minimum distance estimation: A bibliography. Comm. Statist. Theory Methods 10, 1205–1224.
Parr, W. C. & Schucany, W. R. 1980 Minimum distance and robust estimation. J. Amer. Statist. Assoc. 75, 616–624.
Reschenhofer, E. 1999 Improved estimation of the expected Kullback–Leibler discrepancy in case of misspecification. Econometric Theory 15, 377–387.
Sahler, W. 1968 A survey on distribution-free statistics based on distances between distribution functions. Metrika 13, 149–169.
Sakamoto, Y., Ishiguro, M. & Kitagawa, G. 1986 Akaike information criterion statistics. Tokyo: KTK Scientific Publishers.
Sawa, T. 1978 Information criteria for discriminating among alternative regression models. Econometrica 46, 1273–1291.
Shi, K., Huang, Y. F. & Lu, T. 2012 A comprehensive comparison of cosmological models from the latest observational data. Monthly Notices Roy. Astronom. Soc. 426, 2452–2562. Available on-line as arXiv:1207.5875.
Steiger, J. H., Shapiro, A. & Browne, M. W. 1985 On the multivariate asymptotic distribution of sequential chi-square statistics. Psychometrika 50, 253–264.
Stone, M. 1977 An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. J. Roy. Statist. Soc. Ser. B 39, 44–47.
Stone, M. 1978 Cross-validation: A review. Math. Operationsforsch. Statist. Ser. Statist. 9, 127–139.
Sugiura, N. 1978 Further analysis of the data by Akaike's information criterion and the finite corrections. Comm. Statist. Theory Methods 7, 13–26.
Takeuchi, T. T. 2000 Application of the information criterion to the estimation of galaxy luminosity function. Astrophys. Space Sci. 271, 213–226. Available on-line as arXiv:astro-ph/9909324.
Tan, M. Y. J. & Biswas, R. 2012 The reliability of the Akaike information criterion method in cosmological model selection. Monthly Notices Roy. Astronom. Soc. 419, 3292–3303. Available on-line as arXiv:1105.5745.
Trosset, M. W. & Sands, B. N. 1995 On the choice of a discrepancy functional for model selection. Comm. Statist. Theory Methods 24, 2841–2863.
Tsallis, C. 1988 Possible generalization of Boltzmann–Gibbs statistics. J. Statist. Phys. 52, 479–487.
Vuong, Q. H. 1989 Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57, 307–333.
Wald, A. 1943 Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans. Amer. Math. Soc. 54, 426–482.
Wang, H. & Zhang, T.-J. 2012 Constraints on Lemaître–Tolman–Bondi models from observational Hubble parameter data. Astrophys. J. 748, 111 (13 pp.). Available on-line as arXiv:1111.2400.
Weisberg, S. 2005 Applied linear regression, 3rd edn. New York: Wiley.
