Information Criteria for Deciding between Normal Regression Models

Regression models fitted to data can be assessed on their goodness of fit, though models with many parameters should be disfavored to prevent over-fitting. Statisticians' tools for this are little known to physical scientists. These include the Akaike Information Criterion (AIC), a penalized goodness-of-fit statistic, and the AICc, a variant including a small-sample correction. They entered the physical sciences through being used by astrophysicists to compare cosmological models; e.g., predictions of the distance–redshift relation. The AICc is shown to have been mis-applied, being applicable only if error variances are unknown. If error bars accompany the data, the AIC should be used instead. Erroneous applications of the AICc are listed in an appendix. It is also shown how the variability of the AIC difference between models with a known error variance can be estimated. This yields a significance test that can potentially replace the use of 'Akaike weights' for deciding between such models. Additionally, the effects of model mis-specification are examined. For regression models fitted to data sets without (rather than with) error bars, they are major: the AICc may be shifted by an unknown amount. The extent of this in the fitting of physical models remains to be studied.

Author: Robert S. Maier (rsm@math.arizona.edu), Departments of Mathematics and Physics, and Statistics Program, University of Arizona, Tucson, AZ 85721, USA

1. Introduction

(a) Background and overview

Physical scientists are familiar with the task of fitting a parametric model such as a regression model to a data set, by using maximum likelihood estimation (MLE) or other parameter estimation techniques. But they are less familiar with model selection: deciding among two or more models fitted to the same data in a way that for each model takes account of both its goodness of fit and its number of parameters. To what extent should one attempt to prevent over-fitting, i.e., 'fitting the noise,' by penalizing models with too many parameters? This question of parsimony can be viewed not only as a problem in data analysis, but as one in the philosophy of science (Keuzenkamp et al. 2001). The case when the models being compared are incompatible, i.e., are non-nested in that they are not related by parametric restrictions, is especially vexing. So is the case when they are mis-specified, i.e., do not agree with the 'truth' (the unknown and perhaps infinite-dimensional data-generating process), regardless of what values for their parameters are chosen; so that fitting errors of non-zero mean are present.

Techniques for model selection that penalize over-fitting have been applied in the life sciences, social sciences and econometrics, and several book-length expositions of these techniques by statisticians are available (Sakamoto et al. 1986; McQuarrie & Tsai 1998; Claeskens & Hjort 2008; Konishi & Kitagawa 2008). A fruitful concept is the AIC (Akaike Information Criterion), a certain penalized likelihood or goodness-of-fit statistic (Akaike 1973, reprinted as Akaike 1992).
It is an estimate of the discrepancy, in a sense related to MLE and information theory, between a fitted statistical model and the unknown data-generating process; the latter being statistical also, if measurement uncertainties are incorporated. In simple cases the AIC is effectively a penalized sum of squared prediction errors. By comparing AIC's one can compare models with different numbers of parameters, including incompatible models. But in the absence of a systematic theory of the variability of the AIC statistic, using AIC's to decide between fitted versions of parametric models M₁, M₂ cannot be viewed as a procedure in classical statistical inference, i.e., as a significance test. No p-value, as in a frequentist assessment of the evidence against a null hypothesis, is actually calculated. Instead one simply says, e.g., that if ∆₁₂ := AIC₂ − AIC₁ is less than 2.0, the evidence that M₁ is to be preferred over M₂ is weak; and that if ∆₁₂ is greater than 5.0, it is strong. The 'Akaike weight' exp(−AIC_l/2) is often viewed as an unnormalized probability (in some sense) that M_l is to be preferred (Burnham & Anderson 2002), but this interpretation has not been universally accepted.

A few years ago, the AIC and related criteria (such as the AICc, a variant including a small-sample correction) entered the physical sciences by being introduced into astrophysics (Takeuchi 2000; Liddle 2007). They have been used to compare cosmological models, such as regression models of the distance–redshift relation that characterizes the expansion of the Universe. Unfortunately, in many papers a mistake in data analysis has been made. It can perhaps be attributed to a misreading of the expositions of Burnham & Anderson (2002) and Liddle (2007).

The mistake is this. A data set may be accompanied by error bars (i.e., measurement uncertainties), or not. If the latter, regression model fitting will involve the estimation of a 'nuisance parameter,' namely the unknown variance σ² of the measurement errors. The AICc, which was designed to unbias completely the estimate of the Kullback–Leibler information-theoretic discrepancy provided by the AIC, is appropriate only if σ² is unknown. But data in the physical sciences are typically accompanied by error bars. When assessing statistical models that incorporate known error bars, the AIC and not the AICc should be used.

To show this, we first place the AIC in a general framework that can be used to derive many information-theoretic model-selection statistics. (See § 2.) In § 3a we restrict our focus to linear regression and MLE, and show that the applicability of the AICc is limited as claimed. In Appendix B, papers from the astrophysics literature that have erroneously applied the AICc are listed.

In § 3b we obtain a further result: under reasonable conditions of mis-specification, using the correct statistic (the AIC) to decide between normal regression models M₁, M₂ that incorporate error bars can indeed be viewed as a test of significance. That is, the decision can be made in a classical way. One can calculate a p-value associated to the null hypothesis that M₁, M₂ are equidistant in an information-theoretic sense from the true but unknown model M∗, as opposed to the alternative hypothesis that they are not.
This is because the variability of ∆₁₂ = AIC₂ − AIC₁ can be estimated. For data sets with error bars, this can potentially render Akaike weights obsolete. It is explained how the estimation can be carried out for mis-specified normal linear models, and a hypothesis test based on the estimate is proposed. The test can be extended to non-linear models.

In decisions between statistical models M₁, M₂ that incorporate known error bars, the validity of the uncorrected AIC is unaffected if the models are mis-specified, as is shown in § 3a. This result is unexpected, since the usual derivation of the AIC statistic (and indeed of the AICc) requires that there be nesting and no mis-specification; and its widespread application to non-nested, potentially mis-specified models has in fact been somewhat heuristic.

In § 4 we show that if the data set to which M₁, M₂ are fitted is not accompanied by error bars, problems with the AICc can indeed arise. For a normal linear model fitted to a data set without error bars, we discuss the behavior of the AICc under mis-specification, and obtain an asymptotically exact expression for the resulting shift. If the extent of the mis-specification is unknown, this shift may render the AICc of little value. This fact deserves to be better known.

Besides deriving the AIC and AIC corrections from first principles, we briefly discuss the applicability to normal linear models of such variants as the AIC_γ, which suppresses over-fitting to a greater extent than does the AIC. Many additional variants have appeared in the literature, such as the KIC (Kullback Information Criterion) and KICc (Cavanaugh 1999, 2004), but they are beyond the scope of this paper. In the final section (§ 5), we summarize our results.

(b) AIC basics

A regression model fitted to data can be linear or non-linear, according to its parameter dependence. The linear case is familiar (Bevington & Robinson 2003; Draper & Smith 1998; Weisberg 2005). Suppose the data set comprises y₁, …, yₙ ∈ ℝ; which could be, e.g., the values of a response variable corresponding to n distinct values of an explanatory variable x, chosen by an observer or an experimenter. Suppose that y₁, …, yₙ would depend linearly on parameters β₁, …, β_k ∈ ℝ in the absence of measurement errors or other noise. That is, y = Xβ, where y = (y_i), β = (β_j) are column vectors and X is an n × k design matrix. (It will be assumed throughout that n > k and that X is of full rank, i.e., of rank k.) A statistical model M of the data would then be

\[
y_i = \sum_{j=1}^{k} X_{ij}\,\beta_j + \epsilon_i, \tag{1.1}
\]

where ε₁, …, εₙ are residuals, i.e., errors. In the simplest case the residuals would be taken to be independent. In a (homoscedastic) normal model one would also take ε_i ∼ N(0, σ²), i.e., take each ε_i to be normally distributed with mean zero and a common variance σ². The parameter σ² may be known, or it may be an unknown nuisance parameter that needs to be estimated (which is the case if error bars are not supplied). Note that from a data set ȳ = (ȳ_i) including error bars of differing lengths, i.e., known but differing variances σ₁², …, σₙ², one can obtain a data set y = (y_i) with a known common variance σ² by defining y_i := (σ/σ_i) ȳ_i.
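For concreteness, here is a minimal Python sketch of this rescaling (ours, not the paper's; the array names and target variance are illustrative assumptions). Note that, as in weighted least squares, the rows of the design matrix are rescaled by the same factors, so the rescaled data still satisfy a linear model of the same form with i.i.d. errors.

```python
import numpy as np

def rescale_to_common_variance(y_bar, sigma_i, X, sigma=1.0):
    """Rescale data with per-point error bars sigma_i to a data set with
    common error variance sigma**2, via y_i := (sigma/sigma_i) * ybar_i.
    Rows of the design matrix X are scaled by the same factors."""
    w = sigma / np.asarray(sigma_i)           # per-point scale factors
    y = w * np.asarray(y_bar)                 # common-variance data
    X_scaled = w[:, None] * np.asarray(X)     # matching design matrix
    return y, X_scaled

# Illustrative use (all numbers are made up):
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])     # straight-line model
sigma_i = rng.uniform(0.5, 2.0, size=x.size)  # known, differing error bars
y_bar = X @ np.array([1.0, 3.0]) + sigma_i * rng.standard_normal(x.size)
y, Xs = rescale_to_common_variance(y_bar, sigma_i, X, sigma=1.0)
```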
Whether or not σ² is known, an estimate β̂ = (β̂_j) of β can be computed by MLE, which reduces to ordinary least-squares for any normal linear model with independent, identically distributed (i.i.d.) errors. By a standard calculation, β̂ = (XᵗX)⁻¹Xᵗy. Accompanying the observed data vector y there is then a predicted data vector ŷ := Xβ̂ = Py, where the 'hat' matrix P = X(XᵗX)⁻¹Xᵗ projects onto the column space of X (the estimation space L ⊂ ℝⁿ). The residual sum of squares (RSS) for the fit is the sum of squared errors. That is,

\[
\mathrm{RSS} = (y - \hat y)^t (y - \hat y) = (y - Py)^t (y - Py) = y^t Q y, \tag{1.2}
\]

where Q = Iₙ − P is complementary to P and projects onto the left null space of X (the error space L⊥ ⊂ ℝⁿ).

If σ² is known, so that the parameter vector θ of M is simply β, the standard definition of the AIC for the fitted model is

\[
\mathrm{AIC} = \mathrm{RSS}/\sigma^2 + 2k. \tag{1.3}
\]

If alternatively σ² is unknown, so that θ = (β; σ²), it is

\[
\mathrm{AIC} = n \ln\bigl(\hat\sigma^2\bigr) + 2(k+1) = n \ln(\mathrm{RSS}/n) + 2(k+1), \tag{1.4}
\]

where σ̂² = RSS/n is the maximum likelihood estimate of σ². In both (1.3) and (1.4) the first term equals, up to an unimportant constant, the statistic −2 ln L(θ̂ | y), where ln L(θ̂ | y) is the log-likelihood of the fitted model. So the first term is a measure of goodness of fit. In model selection a smaller AIC is preferred; hence the second term (which will be seen to originate as an unbiasing term) penalizes M according to its number of fitted parameters (k, resp. k + 1). It is usually differences of AIC's that are important, so any term not involving k may be added to (1.3) and (1.4).

The choice '2' in (1.3) and (1.4) for the coefficient of k (resp. k + 1) is motivated by information theory, as will be explained. But applied statisticians have long been interested in the effects on model selection of choosing a more general penalty term γk, where γ > 0 may differ from 2. The resulting modified AIC is denoted AIC_γ (McQuarrie & Tsai 1998). Bhansali & Downham (1977) considered the effects of varying γ on order selection in autoregressive models, and showed empirically that it may be useful for γ to range, say, between 2 and 6. The abovementioned KIC, like the AIC, has an information-theoretic justification, and in the n → ∞ limit turns out to be equivalent to AIC₃. Mention should also be made of the AICu (McQuarrie et al. 1997), which is a heuristic modification of (1.4) in which the ML estimate σ̂² is replaced by the unbiased estimator s² := RSS/(n − k) of σ². In the n → ∞ limit, AICu is also equivalent to AIC₃. This can be seen by working to leading order in 1/n and using the asymptotic approximation n ln[n/(n − k)] ∼ k + O(1/n), n → ∞.

The most familiar modified or corrected AIC, the AICc, is a less drastic modification of (1.4), the modification being a major one only for small n. Under the assumption of i.i.d. normal residuals, and the traditional assumption of no mis-specification, it is given by

\[
\mathrm{AICc} = n \ln\bigl(\hat\sigma^2\bigr) + \frac{2(k+1)\,n}{n-k-2} \tag{1.5a}
\]
\[
\phantom{\mathrm{AICc}} \sim \mathrm{AIC} + \frac{2(k+1)(k+2)}{n} + O(1/n^2), \qquad n \to \infty \tag{1.5b}
\]

(Sugiura 1978; Hurvich & Tsai 1989; Cavanaugh 1997).
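The following Python sketch collects formulas (1.2)–(1.5a) for a single least-squares fit. It is illustrative only; the function and variable names are our own, and which criterion applies depends on whether σ² is known, as the paper emphasizes.

```python
import numpy as np

def fit_and_criteria(X, y, sigma2=None):
    """OLS fit of y = X beta + eps; returns AIC per (1.3) if the error
    variance sigma2 is known, else AIC (1.4) and AICc (1.5a)."""
    n, k = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta_hat) ** 2))       # (1.2)
    if sigma2 is not None:                             # known error bars
        return {"AIC": rss / sigma2 + 2 * k}           # (1.3)
    sigma2_hat = rss / n                               # MLE of sigma^2
    aic = n * np.log(sigma2_hat) + 2 * (k + 1)         # (1.4)
    aicc = n * np.log(sigma2_hat) + 2 * (k + 1) * n / (n - k - 2)  # (1.5a)
    return {"AIC": aic, "AICc": aicc}
```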
Why an O(1/n) correction term should be added to (1.4), but not to the expression (1.3) that applies if σ² is known, will be explained.

2. Minimum discrepancy estimation

(a) A general framework

In this section the AIC, a penalized goodness-of-fit statistic, is placed in a model-selection framework that goes well beyond the use of MLE in regression. The AIC for a candidate model can be viewed as an unbiased estimator of its discrepancy, in a certain sense, from the true model. This will eventually lead to the introduction of a null hypothesis that two candidate models are equally discrepant, and to systematic results on AIC corrections. But the theme of this section is the existence of alternatives to the AIC, which have not yet been applied in the physical sciences. It is hoped that interest in this area will be stimulated. A framework resembling the one used here was first developed by Linhart & Zucchini (1986).

Suppose one has a parametric statistical model M_θ that will be used for approximation or fitting purposes, with θ ∈ Θ (a parameter space); and a true, underlying statistical model M∗ of the data-generating process, which is not known explicitly. If each generated datum is an element of a set S, both M_θ and M∗ will be probability distributions on S. (The choice S = ℝⁿ is appropriate for regression, as in § 1b.) Their respective probability density functions (PDF's) will be denoted f_θ(y) and g(y). In general it will not be assumed that g = f_θ for any θ ∈ Θ, i.e., mis-specification will be allowed.

To any random sample y_N of size N from the true distribution g, comprising y⁽¹⁾, …, y⁽ᴺ⁾ ∈ S, there corresponds an empirical distribution g_N = g_{N,y_N} on S. It is defined by

\[
g_{N,\mathbf{y}_N}(\,\cdot\,) = N^{-1} \sum_{i=1}^{N} \delta\bigl(\,\cdot\, - y^{(i)}\bigr), \tag{2.1}
\]

where δ(·) is the Dirac delta function, if S is a Euclidean space such as ℝⁿ. If alternatively S is a discrete space, then instead of a PDF there will be a probability mass function (PMF), defined using a Kronecker delta rather than a delta function. The restriction N = 1, meaning that there is only a single replication, i.e. only one observation of y ∈ S, was implicitly made in § 1b, where y was a random vector in ℝⁿ; but here it will be relaxed.

The definition of MLE, which is an almost universally applicable but hardly unique fitting scheme, is familiar. From the sample y_N one constructs the likelihood function L(θ | y_N) = ∏ᵢ f_θ(y⁽ⁱ⁾), and computes a parameter estimate θ̂_N = θ̂_N(y_N) by maximizing the likelihood, or equivalently by minimizing the negative log-likelihood −ln L(θ | y_N). The best fit to the data is then the model M_{θ̂_N}, with PDF f_{θ̂_N}.

This scheme generalizes to minimum discrepancy estimation, which is itself a generalization of minimum distance estimation. In an abstract description one starts with y_N, or equivalently an N-point empirical distribution g_N on S, and computes θ̂_N by minimizing d(g_N; f_θ) over θ ∈ Θ. That is, θ̂_N = arg min_{θ∈Θ} d(g_N; f_θ). Here d(g; f) signifies some real-valued measure of the discrepancy between the PDF's g, f, which quantifies how difficult it is to discriminate between them.
The case when d satisfies the axioms for a metric can be especially nice (Donoho & Liu 1988; Trosset & Sands 1995), but this will not be assumed. Thus d may be asymmetric, i.e., directed, and may not satisfy the triangle inequality. Also, it may not satisfy d(g; f) ≥ 0. But it is useful to require that d(g; f) ≥ d(g; g), with equality holding only if g = f. Then the normalized discrepancy D(g; f) := d(g; f) − d(g; g) ≥ 0 will satisfy D(g; f) = 0 only if g = f. As will be seen, it is sometimes possible for an unnormalized discrepancy d to be defined on a larger class of PDF's than is the case for a normalized one.

In minimum discrepancy estimation one must distinguish between the model M_θ fitted to an empirical distribution g_N generated by M∗, which has PDF f_{θ̂_N}, and the best approximating model, the PDF of which is some f_{θ∗}. Here θ∗ = arg min_{θ∈Θ} d(g; f_θ) may differ from θ̂_N. The value θ∗ is called the 'pseudo-true' value of θ. (If there is no mis-specification, i.e., g = f_{θ∗} for some θ∗, it is the true value.) The discrepancy due to approximation (AD) is d(g; f_{θ∗}). In the absence of mis-specification this would equal the constant d(g; g), and what would be more important would be the discrepancy due to estimation (ED), i.e. d(f_{θ∗}; f_{θ̂_N}). The overall discrepancy (OD), of the fitted model from the truth, is the quantity d(g; f_{θ̂_N}). If there is no mis-specification, OD reduces to ED. None of these three discrepancies can be calculated if the true model is unknown, though they can be estimated from the data y_N, meaning from g_N. (It is d(g_N; f_{θ̂_N}), the fitted discrepancy (FD) of the model from the data, that can be calculated from the data.)

The following policy is an abstraction of Akaike's.

Selection Policy. Given a discrepancy functional d, data y_N generated by an unknown true model M∗ with PDF g, and candidate models M¹_{θ₁}, M²_{θ₂} with parametric PDF's f¹_{θ₁}, f²_{θ₂} and parameter spaces Θ₁, Θ₂, one should ideally assess the goodness of fit of each model on the basis of its expected overall discrepancy from M∗. That is, if when fitted to data y_N, or equivalently to the empirical distribution g_N = g_{N,y_N}, model Mˡ_{θ_l} would have parameter θ̂ˡ_N = θ̂ˡ_N(y_N), one should select the model with the minimum value of

\[
d'(g, f^l_{\,\cdot\,}) := \mathrm{E}_{\mathbf{y}_N}\, \mathrm{OD}^l_d(\mathbf{y}_N) = \mathrm{E}_{\mathbf{y}_N}\, d\bigl(g,\; f^l_{\hat\theta^l_N(\mathbf{y}_N)}\bigr). \tag{2.2}
\]

The expectation (i.e., averaging) is computed over data y_N generated by the true model, meaning over the PDF g. As an alternative, the double expectation

\[
d''(g, f^l_{\,\cdot\,}) := \mathrm{E}_{\mathbf{y}_N}\, \mathrm{E}_{\mathbf{y}'_N}\, d\bigl(g_{N,\mathbf{y}'_N},\; f^l_{\hat\theta^l_N(\mathbf{y}_N)}\bigr), \tag{2.3}
\]

where data y_N, y′_N are generated independently by g, may be employed.

The expected OD, d′(g, fˡ), is a penalized version of the AD d(g; f_{θ_{l∗}}), where θ_{l∗} is the pseudo-true value of the parameter θ_l ∈ Θ_l. It is at least as large as the AD, and because the number of ways in which the data-dependent fitted PDF fˡ_{θ̂ˡ_N(y_N)} can deviate from the pseudo-true PDF fˡ_{θ_{l∗}} is the dimensionality of the parameter space Θ_l, one expects that relying on the expected OD as a measure of closeness of Mˡ_{θ̂ˡ_N} to M∗ will disfavor over-fitting in an AIC-like way.
It must be stressed that neither d′(g, fˡ) nor d″(g, fˡ) can be calculated directly, though they can be estimated (with bias) by the fitted discrepancy of the l'th model from the data, FDˡ_d(y_N) = d(g_{N,y_N}; fˡ_{θ̂ˡ_N(y_N)}). (This is an RSS-like quantity.) The selection policy can therefore be implemented by choosing the model with the minimum value of a certain function MSC, defined as follows to be an unbiased estimator of the expected overall discrepancy. (It will specialize to the AIC.)

Definition 2.1. The model-selection criterion function based on a discrepancy functional d, denoted MSC_d or simply MSC, is defined so that the value MSC_l/2 for the l'th model equals its fitted discrepancy FDˡ_d(y_N) = d(g_{N,y_N}; fˡ_{θ̂ˡ_N(y_N)}), plus an unbiasing term B_l equal to E_{y_N} ODˡ_d(y_N) minus E_{y_N} FDˡ_d(y_N). That is,

\[
B_l = d'(g; f^l_{\,\cdot\,}) - \mathrm{E}_{\mathbf{y}_N}\, d\bigl(g_{N,\mathbf{y}_N};\; f^l_{\hat\theta^l_N(\mathbf{y}_N)}\bigr)
    = \mathrm{E}_{\mathbf{y}_N}\, d\bigl(g;\; f^l_{\hat\theta^l_N(\mathbf{y}_N)}\bigr) - \mathrm{E}_{\mathbf{y}_N}\, d\bigl(g_{N,\mathbf{y}_N};\; f^l_{\hat\theta^l_N(\mathbf{y}_N)}\bigr), \tag{2.4}
\]

so that in expectation, MSC_l/2 equals the expected overall discrepancy, on which it is posited that model selection should be based.

Remark. Distinct discrepancies could be used for (i) performing the fitting and computing the FD, and (ii) computing the expected OD. This would generalize the selection policy (and turns out to be needed in the definition of the KIC, as will be discussed elsewhere). For practical reasons, one could also perform the fitting using MLE and compute the FD using a discrepancy not related to MLE (Linhart & Zucchini 1986, § 4.4). But this seems less theoretically justifiable.

The alternative d″ to d′ was mentioned for two reasons. First, for MLE it is identical to d′, as will be seen. Second, using d″ makes selection in effect a procedure of cross-validation, in which a fitted model is assessed according to its empirically expected prediction errors (Stone 1978). The definition (2.3) of d″ involves two hypothetical sets of data: one (y_N) used to estimate the parameter θ_l, and one (y′_N) used to assess the fit of the resulting model. The equivalence between choosing models by cross-validation and Akaike's technique of penalizing models by their number of parameters has long been recognized (Stone 1977; Kuha 2004).
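As an illustration of that equivalence (our construction, not an example from the paper; the data-generating curve and all settings are assumptions), the following Python sketch compares the polynomial order chosen by the AIC of (1.4) with the order chosen by leave-one-out cross-validation on simulated data. The two typically agree.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = np.linspace(-1.0, 1.0, n)
y = np.sin(2.0 * x) + 0.3 * rng.standard_normal(n)    # assumed true process

def design(order):
    return np.vander(x, order + 1, increasing=True)    # 1, x, ..., x^order

def aic(order):
    X = design(order)
    rss = np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    return n * np.log(rss / n) + 2 * (X.shape[1] + 1)  # (1.4), sigma^2 unknown

def loo_cv(order):
    X = design(order)
    err = 0.0
    for i in range(n):                                 # leave one point out
        keep = np.arange(n) != i
        b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        err += (y[i] - X[i] @ b) ** 2
    return err

orders = range(1, 8)
print("AIC choice:   ", min(orders, key=aic))
print("LOO-CV choice:", min(orders, key=loo_cv))
```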
(b) Discrepancies, information theory and MLE

The selection policy of § 2a can be applied very generally, to both discrete and continuous models. Given a discrepancy functional d and a sample y_N comprising y⁽¹⁾, …, y⁽ᴺ⁾ ∈ S, which yields an N-point empirical distribution g_N = g_{N,y_N} on S, one would naively decide whether to select parametric model Mˡ_{θ_l} with PDF fˡ_{θ_l} on S on the basis of its fitted discrepancy d(g_{N,y_N}; fˡ_{θ̂ˡ_N(y_N)}) from the sample. But the selection policy modifies this RSS-like quantity by adding an unbiasing term, giving an unbiased estimate MSC/2 of the expected overall discrepancy. Selection based on the latter disfavors over-fitting, as desired.

This is the natural setting for the AIC and AICc statistics. But for regression models, in which S = ℝⁿ (and most often N = 1), the choice of a discrepancy functional closely tied to MLE and in fact to information theory must first be justified. Only for one special functional does the fitted discrepancy turn out to be the negative log-likelihood.

For discrete rather than continuous models, with S a discrete set such as {1, …, m} or {1, 2, 3, …}, a wide variety of discrepancy functionals d(g; f) have been used in statistics and elsewhere. They include the Pearson X² statistic ∑_{n∈S} [g(n) − f(n)]²/f(n), used when S is the set of cells in a contingency table to compare an empirical and a theoretical distribution. The Neyman X² is similar. There are also many discrepancies rooted in information theory, such as the (discrete) Kullback–Leibler (KL) divergence, often called the informational divergence (Csiszár & Körner 2011). Its normalized form is

\[
D_{\mathrm{KL}}(g; f) := \sum_{n\in S} g(n) \ln \frac{g(n)}{f(n)} \;\geq\; 0 \tag{2.5}
\]

and its denormalized form is

\[
d_{\mathrm{KL}}(g; f) := -\sum_{n\in S} g(n) \ln f(n) \;\geq\; d_{\mathrm{KL}}(g; g). \tag{2.6}
\]

They are related by D_KL(g; f) = d_KL(g; f) − d_KL(g; g). The subtracted quantity d_KL(g; g) is the (Shannon) entropy of the distribution g, and in statistical mechanics D_KL(g; f) would therefore be called a relative entropy. The denormalized d_KL(g; f) is sometimes called an 'inaccuracy.'

Discrepancies in information theory are often of the 'φ-divergence' form D_φ(g; f) := ∑_{n∈S} g(n) φ(f(n)/g(n)), for φ some convex function (Pardo 2006). For instance, if φ(u) equals −ln u then D_φ = D_KL. If φ(u) ∝ 1 − u^{(1+α)/2} then D_φ becomes the so-called α-divergence D⁽ᵅ⁾, which reduces to D_KL in a scaled α → −1 limit. This generalized discrepancy arises in the geometry of statistical inference (Amari & Nagaoka 2000), and there is a corresponding denormalized

\[
d^{(\alpha)}(g; f) \;\propto\; \sum_{n\in S} g(n)^{(1-\alpha)/2} \Bigl[ 1 - f(n)^{(1+\alpha)/2} \Bigr], \tag{2.7}
\]

a scaled α → −1 limit of which equals d_KL(g; f). The self-divergence d⁽ᵅ⁾(g; g) of g is called in physics the Tsallis entropy of g (Tsallis 1988). Many other discrepancies have been investigated. (See Kapur (1989, Chap. 7) and Basu et al. (2011, Chap. 11).)

But the KL divergence is perhaps the most important, because of its connection to MLE. From a sample y_N comprising y⁽¹⁾, …, y⁽ᴺ⁾ ∈ S, where S is a discrete set, obtaining a best-fit PMF f_{θ̂_N} by maximizing over θ the likelihood L_f(θ | y_N) = f(y_N | θ) of a candidate PMF f_θ on S is equivalent to minimizing D_KL(g_N; f_θ) or d_KL(g_N; f_θ) over θ, where g_N = g_{N,y_N} is the N-point empirical distribution defined by the data. This is because d_KL(g_N; f_θ) equals −ln L_f(θ | y_N), the negative log-likelihood, as follows from (2.6). In particular, d_KL(g_N; f_{θ̂_N}) equals −ln L_f(θ̂_N | y_N). MLE can thus be interpreted as a minimum discrepancy estimation.
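As a quick numerical check of this equivalence (our illustration; the Bernoulli setup and sample size are assumptions), the sketch below fits a Bernoulli success probability both by maximizing the likelihood and by minimizing the denormalized discrepancy d_KL(g_N; f_θ) of (2.6); the two estimates coincide.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
sample = rng.binomial(1, 0.3, size=200)        # draws from S = {0, 1}
g_N = np.array([np.mean(sample == 0), np.mean(sample == 1)])  # empirical PMF

def d_kl(theta):
    """Denormalized KL discrepancy (2.6) of the Bernoulli(theta) PMF
    from the empirical PMF g_N."""
    f = np.array([1.0 - theta, theta])
    return -np.sum(g_N * np.log(f))

def neg_log_lik(theta):
    return -np.sum(np.log(np.where(sample == 1, theta, 1.0 - theta)))

t1 = minimize_scalar(d_kl, bounds=(1e-6, 1 - 1e-6), method="bounded").x
t2 = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded").x
print(t1, t2)   # both approach the sample mean of `sample`
```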
Now consider the continuous case, when the data lie in a space S that is Euclidean, such as the choice S = ℝⁿ arising in regression. The selection policy of § 2a requires the computation of (an unbiased version of) some fitted discrepancy d(g_N; f_{θ̂_N}). The empirical distribution g_N = g_{N,y_N} is a linear combination of N delta functions computed from the data y_N, as in (2.1), and θ̂_N = θ̂_N(y_N) is computed by minimizing d(g_N; f_θ) over θ, where f_θ is a candidate PDF on S. For the selection policy to be implemented as stated, the discrepancy functional d(g; f) being employed must allow its first argument to be a PDF g_N that is 'atomic,' in the sense that it is a combination of delta functions. This is a stringent requirement (Liese & Vajda 1987, § 10.9).

When S = ℝ, many statistical discrepancies d(g; f) have been employed; e.g., in the robust estimation of location and scale parameters of distributions on S (Sahler 1968; Parr & Schucany 1980). Most are computed from the cumulative distributions (CDF's) G, F corresponding to g, f, so they are well-defined even if one or the other is an empirical distribution. Also, most are metrics, or legitimate distances; in fact minimum discrepancy estimation grew out of minimum distance estimation, which has a long history (Parr 1981). Examples include the Kolmogorov–Smirnov and Cramér–von Mises discrepancies, which are widely used as goodness-of-fit statistics. But their generalizations to the multivariate case, when S = ℝⁿ with n > 1, are not straightforward at all.

When S = ℝⁿ with n arbitrary, the most widely used discrepancy is the (continuous) KL divergence, with its close ties to MLE. Its normalized form is

\[
D_{\mathrm{KL}}(g; f) := \int_{\mathbb{R}^n} g(y) \ln \frac{g(y)}{f(y)}\; d^n y \;\geq\; 0 \tag{2.8}
\]

and its denormalized form is

\[
d_{\mathrm{KL}}(g; f) := -\int_{\mathbb{R}^n} g(y) \ln f(y)\; d^n y \;\geq\; d_{\mathrm{KL}}(g; g). \tag{2.9}
\]

They are related by D_KL(g; f) = d_KL(g; f) − d_KL(g; g), as in the discrete case. A key observation is that the integral in (2.9) is well-defined even if g is a purely atomic function of y, such as an empirical PDF g_N. But the integral in (2.8) is not, as one cannot take the logarithm of a sum of delta functions. The distinction can be viewed as arising from the entropy d_KL(g; g) not being defined when g = g_N: any empirical distribution has undefined entropy. The good behavior of the integral in (2.9) justifies the use of d_KL in model selection, to the exclusion of D_KL.

For any sample size N ≥ 1, averaging over data y′_N sampled from the PDF g yields g itself; which is to say, E_{y′_N} g_{N,y′_N} = g. By the linearity in g of the integral in (2.9), it follows that E_{y′_N} d_KL(g_{N,y′_N}; f) equals d_KL(g; f). This confirms a claim made in § 2a: if d_KL is used as the discrepancy d, the two expected overall discrepancies d′, d″ defined in (2.2), (2.3) are equal, and give rise to identical model-selection policies. For non-KL discrepancies, this may not hold.

In the continuous case as in the discrete, the fitted discrepancy FD_{d_KL}(y_N), i.e., d_KL(g_{N,y_N}; f_{θ̂_N(y_N)}), equals −ln L_f(θ̂_N | y_N), the negative of the fitted log-likelihood. Using d_KL as the discrepancy, as in the following definition, specializes the MSC of Definition 2.1 to what will be called an MSC of AIC type.
Definition 2.2. A d_KL-based model-selection criterion MSC, of AIC type, is defined thus: the value MSC_l/2 for the l'th model, calculated after fitting, equals its negative log-likelihood −ln L_f(θ̂ˡ_N | y_N), plus an unbiasing correction B_l that equals E_{y_N} ODˡ_{d_KL}(y_N) minus E_{y_N} FDˡ_{d_KL}(y_N). That is,

\[
B_l = \mathrm{E}_{\mathbf{y}_N}\, d_{\mathrm{KL}}\bigl(g;\; f^l_{\hat\theta^l_N(\mathbf{y}_N)}\bigr) - \mathrm{E}_{\mathbf{y}_N}\bigl[-\ln L_f(\hat\theta^l_N \,|\, \mathbf{y}_N)\bigr],
\]

so that in expectation, MSC_l/2 equals the d_KL-based expected overall discrepancy, on which it is posited that model selection should be based.

In the following, only MSCs of AIC type will be employed. The choice of the KL divergence as the discrepancy used in model fitting has a clear justification: it allows the selection policy of § 2a to be applied as stated. But it should be noted that there are alternatives that merit examination. By adapting the selection policy it may be possible to employ quite different discrepancies, such as the abovementioned α-divergence. This requires a brief explanation.

The continuous, denormalized version of the α-divergence is

\[
d^{(\alpha)}(g; f) \;\propto\; \int_{\mathbb{R}^n} g(y)^{(1-\alpha)/2} \Bigl[ 1 - f(y)^{(1+\alpha)/2} \Bigr] d^n y, \tag{2.10}
\]

which is undefined if g is an empirical distribution. But the α = 0 case of this, d⁽⁰⁾(g; f), is equivalent to the Hellinger distance, which has long been used in parametric estimation (Beran 1977). From data y_N, or an empirical distribution g_N = g_{N,y_N} defined from it as in (2.1), a fitted parametric model f_{θ̂} can indeed be found by minimizing the Hellinger distance. The fitting, though, involves a preliminary step: replacing the delta functions of (2.1) by approximate deltas. That is, one first engages in kernel density estimation, by convolving g_N with some integral kernel. The integral in (2.10) will be well-defined if the resulting 'smoothed' g is used. Alternatively, the model PDF f as well as the data PDF g_N can be smoothed (Basu et al. 1997, § 3.3). However, the smoothing of g_N can apparently be justified only in the large-sample (N → ∞) limit. In what follows N = 1, and the limit taken (if any) will be n → ∞. The usefulness in this setting of a preliminary smoothing of the empirical distribution remains to be explored.
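To make the smoothing step concrete, here is a small Python sketch (our construction, with arbitrary settings) that estimates a location parameter by minimizing the Hellinger distance between a kernel-smoothed empirical density and a unit-variance normal model.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
sample = rng.normal(1.5, 1.0, size=300)     # data from an assumed true model
smoothed = gaussian_kde(sample)             # kernel-smoothed version of g_N
grid = np.linspace(sample.min() - 4.0, sample.max() + 4.0, 2000)
dx = grid[1] - grid[0]
g_vals = smoothed(grid)

def hellinger_sq(theta):
    """Squared Hellinger distance between the smoothed density and the
    N(theta, 1) model density, by quadrature on the grid."""
    f_vals = norm.pdf(grid, loc=theta, scale=1.0)
    return 0.5 * np.sum((np.sqrt(g_vals) - np.sqrt(f_vals)) ** 2) * dx

theta_hat = minimize_scalar(hellinger_sq, bounds=(-5.0, 5.0), method="bounded").x
print(theta_hat)   # close to the sample mean here
```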
3. Selection with error bars

In this section we specialize to the case when the true model M∗ and the candidate models fitted to a set of n data points are normal regression models, incorporating known error bars as explained in § 1b. We investigate the model-selection criterion function given in Definition 2.2 (a d_KL-based MSC of AIC type). In § 3a it is shown that irrespective of n and the extent of mis-specification, the MSC for a candidate linear model reduces to the standard AIC of (1.3): a sum of squared residuals penalized by 2k, i.e., by twice the number of parameters. Interestingly, it is possible to derive the modified AIC known as the AIC_γ, in which the penalty γk replaces 2k, by slightly modifying the selection policy of § 2. But even when n is small, the standard AIC is never extended by an O(1/n) correction term; thus the use of the AICc is not appropriate here. This realization is new. In Appendix B, papers from the astrophysics literature that have erroneously applied the AICc are listed.

In § 3b the variability of ∆₁₂ := AIC₂ − AIC₁ is determined, and an asymptotically valid hypothesis test for model selection that is based on ∆₁₂ is proposed. At any specified significance level α, the test either rejects or accepts the null hypothesis that the fitted models M¹_{θ̂₁}, M²_{θ̂₂} are equally close to M∗ in the 'expected overall discrepancy' sense of § 2. For mis-specified linear models incorporating error bars, this approach to model selection can potentially replace the rule-of-thumb use of Akaike weights. The variability estimation and the test of significance can be extended to the case of non-linear models, as is sketched.

(a) Expected discrepancies

Consider a true model M∗ and a candidate linear model M that are both normal, as in § 1b. They are y = y₀ + ε₀ and y = Xβ + ε, where β = (β_j) is a column vector of k parameters, and the error vectors ε₀, ε have mean zero and covariance matrices σ₀²Iₙ, σ²Iₙ. Thus ε₀ = σ₀z₀ and ε = σz, where z₀ and z are vectors of independent standard normal variables. It is not assumed that y₀ is in the column space of the n × k design matrix X, i.e., mis-specification is allowed. In this section the variance σ² is specified and not estimated, so the full parameter vector θ of M is simply β. If the true variance σ₀² is known, as is the case when the data are accompanied by error bars, then it is natural to choose σ² = σ₀². But for the moment this will not be assumed: statistical as well as deterministic mis-specification will be allowed.

There is assumed to be only one observation (N = 1), so only one instance of the random vector y ∈ S = ℝⁿ is available as a datum. Thus y will be written for y_N, and the subscript N dropped. MLE is equivalent to choosing β ∈ ℝᵏ so as to minimize the discrepancy (in the d_KL sense) of the 1-point atomic PDF g_y(·) = δ(· − y) on ℝⁿ from M_β. Equivalently, MLE minimizes the negative log-likelihood −ln L(β | y) of M_β. It yields a fitted model M_{β̂}, where the estimated parameter vector β̂ = β̂(y) ∈ ℝᵏ is given by β̂ = (XᵗX)⁻¹Xᵗy. The n × n hat matrix P = X(XᵗX)⁻¹Xᵗ and its complement Q = Iₙ − P project onto the estimation and error subspaces of ℝⁿ, i.e., the column and left null spaces of X. The predicted data vector is ŷ = Py, and the RSS (sum of squared residuals) is (y − ŷ)ᵗ(y − ŷ) = yᵗQy.

The policy of § 2 requires that, to the extent that it can be estimated, the expected overall discrepancy E_y OD(y) should be used for model selection. The AIC-type selection criterion MSC (see Definition 2.2) has the property that MSC/2 for M is an unbiased estimator of E_y OD(y). It is defined by

\[
\mathrm{MSC}/2 = \mathrm{FD}(y) + \mathrm{E}_{y}\bigl[\mathrm{OD}(y) - \mathrm{FD}(y)\bigr], \tag{3.1}
\]

where the fitted discrepancy FD(y), i.e., d_KL(g_y; f_{β̂(y)}), is simply the negative log-likelihood −ln L(β̂ | y) of the fitted model M_{β̂}. Since OD(y) is d_KL(g; f_{β̂(y)}), the second, unbiasing term in (3.1), which was denoted B in the previous section, can be calculated from the Gaussian PDF's g, f_β of M∗, M_β. They are

\[
g(y) = (2\pi\sigma_0^2)^{-n/2} \exp\bigl[-(y - y_0)^t (y - y_0)/2\sigma_0^2\bigr], \tag{3.2}
\]
\[
f_\beta(y) = (2\pi\sigma^2)^{-n/2} \exp\bigl[-(y - X\beta)^t (y - X\beta)/2\sigma^2\bigr]. \tag{3.3}
\]

For convenience we shall write

\[
d_{\mathrm{KL}}(f; f) := d_{\mathrm{KL}}(f_\beta; f_\beta) = \frac{n}{2}\bigl[1 + \ln(2\pi)\bigr] + \frac{n}{2}\ln\bigl(\sigma^2\bigr), \tag{3.4}
\]

since d_KL(f_β; f_β) does not depend on β.
By examination,

\[
\mathrm{FD}(y) = -\ln L_f(\hat\beta \,|\, y) = d_{\mathrm{KL}}(f; f) - \frac{n}{2} + \frac{\mathrm{RSS}}{2\sigma^2} \tag{3.5}
\]

expresses the fitted discrepancy in terms of the RSS, which is yᵗQy. In the following theorem, λ := y₀ᵗQy₀/σ₀² is an M-specific mis-specification parameter, χ²_r is a chi-squared random variable with r degrees of freedom, and χ²_r(λ) is a similar but non-central variable, with non-centrality parameter λ. (For central and non-central chi-squared distributions, see Appendix A.)

Theorem 3.1. Under the model M∗, the overall and fitted discrepancies of the fitted model M_{β̂(y)}, OD = OD(y) and FD = FD(y), are distributed according to

\[
\mathrm{OD} = d_{\mathrm{KL}}(g; f_{\hat\beta(y)}) \sim d_{\mathrm{KL}}(f; f) + \frac{n}{2}\Bigl(\frac{\sigma_0^2}{\sigma^2} - 1\Bigr) + \frac{\sigma_0^2}{2\sigma^2}\,\chi^2_k(\lambda),
\]
\[
\mathrm{FD} = d_{\mathrm{KL}}(g_y; f_{\hat\beta(y)}) \sim d_{\mathrm{KL}}(f; f) - \frac{n}{2} + \frac{\sigma_0^2}{2\sigma^2}\,\chi^2_{n-k}(\lambda),
\]

where β̂(y) = (XᵗX)⁻¹Xᵗy ∈ ℝᵏ is the fitted value of the parameter β. For the discrepancies due to approximation and estimation, AD and ED = ED(y), the corresponding statements are

\[
\mathrm{AD} = d_{\mathrm{KL}}(g; f_{\beta_*}) = d_{\mathrm{KL}}(f; f) + \frac{n}{2}\Bigl(\frac{\sigma_0^2}{\sigma^2} - 1\Bigr) + \frac{\sigma_0^2}{2\sigma^2}\,\lambda,
\]
\[
\mathrm{ED} = d_{\mathrm{KL}}(f_{\beta_*}; f_{\hat\beta(y)}) \sim d_{\mathrm{KL}}(f; f) + \frac{\sigma_0^2}{2\sigma^2}\,\chi^2_k,
\]

where β∗ = (XᵗX)⁻¹Xᵗy₀ ∈ ℝᵏ is the pseudo-true value of the parameter β.

Proof. Use the definition (2.9) of d_KL and the definitions (3.2), (3.3) of g, f_β. Each integral over ℝⁿ in the computation of a d_KL is a normal expectation that can be evaluated in closed form. In the expressions for OD, FD, ED the distributions χ²_k(λ), χ²_{n−k}(λ), χ²_k arise respectively as the distributions of

\[
(Py - y_0)^t (Py - y_0)/\sigma_0^2 = (Pz_0 - Qy_0/\sigma_0)^t (Pz_0 - Qy_0/\sigma_0), \tag{3.6a}
\]
\[
(Py - y)^t (Py - y)/\sigma_0^2 = (z_0 + y_0/\sigma_0)^t\, Q\, (z_0 + y_0/\sigma_0), \tag{3.6b}
\]
\[
(Py - Py_0)^t (Py - Py_0)/\sigma_0^2 = z_0^t P z_0, \tag{3.6c}
\]

if one uses the fact that P, Q are n × n projection matrices of ranks k, n − k. (For distributions of quadratic forms, see Appendix A.) Similarly, the 'λ' in the expression for AD is the non-random value of (Py₀ − y₀)ᵗ(Py₀ − y₀)/σ₀². □

Theorem 3.2. The corresponding expectations over M∗-generated data are

\[
\mathrm{E}_y\,\mathrm{OD}(y) = d_{\mathrm{KL}}(f; f) + \frac{n}{2}\Bigl(\frac{\sigma_0^2}{\sigma^2} - 1\Bigr) + \frac{\sigma_0^2}{2\sigma^2}\,(k + \lambda),
\]
\[
\mathrm{E}_y\,\mathrm{FD}(y) = d_{\mathrm{KL}}(f; f) + \frac{n}{2}\Bigl(\frac{\sigma_0^2}{\sigma^2} - 1\Bigr) + \frac{\sigma_0^2}{2\sigma^2}\,(-k + \lambda),
\]
\[
\mathrm{AD} = d_{\mathrm{KL}}(f; f) + \frac{n}{2}\Bigl(\frac{\sigma_0^2}{\sigma^2} - 1\Bigr) + \frac{\sigma_0^2}{2\sigma^2}\,\lambda,
\]
\[
\mathrm{E}_y\,\mathrm{ED}(y) = d_{\mathrm{KL}}(f; f) + \frac{\sigma_0^2}{2\sigma^2}\,k.
\]

Thus in expectation only, the variable OD − d_KL(f; f) is the sum of AD − d_KL(f; f) and the variable ED − d_KL(f; f).

Proof. Use E[χ²_r(λ)] = r + λ, with χ²_r equalling χ²_r(0). □

Corollary 3.3. In the definition of the AIC-type model-selection criterion MSC for model M, according to which MSC/2 equals FD(y) plus an unbiasing term B, the term B (i.e., E_y[OD(y) − FD(y)]) equals σ₀²/2σ² times 2k. Hence

\[
\mathrm{MSC} = 2\, d_{\mathrm{KL}}(f; f) + \mathrm{RSS}/\sigma^2 + \frac{\sigma_0^2}{\sigma^2}\, 2k.
\]

Proof. Compute E_y[OD(y) − FD(y)] from the theorem, and then use the formula (3.5) for FD(y). □

Thus with the exception of a constant term equal to 2 d_KL(f; f), which does not affect the relative ranking of models, for any normal linear model M_β fitted to data the AIC-type selection criterion reduces to the standard AIC given in (1.3): the usual RSS/σ², penalized by twice the number of parameters. Provided, that is, the model incorporates a σ² equal to the variance parameter σ₀² of the true model M∗. There must be no statistical mis-specification: M must incorporate error bars of the correct length. If so, the formula (1.3) is exact for all n. There is no sign of any small-n correction term, such as appears in the AICc.
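The unbiasing term of Corollary 3.3 is easy to check numerically. The following Python sketch is our own: the true mean vector, design matrix, and sample sizes are arbitrary choices, with a deliberately mis-specified y₀. It averages OD(y) − FD(y) over simulated data and compares the average against B = (σ₀²/σ²)·k; the common (n/2) ln(2πσ²) term cancels in the difference and is omitted.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, sigma0, sigma = 50, 3, 1.0, 1.0   # sigma = sigma0: no statistical mis-spec.
X = rng.standard_normal((n, k))
y0 = X @ np.array([1.0, -2.0, 0.5]) + 0.7 * np.sin(np.arange(n))  # y0 not in col(X)

def od_minus_fd(y):
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    mu = X @ beta_hat
    # OD = -int g ln f_betahat, FD = -ln f_betahat(y); constants cancel below
    od = (np.sum((y0 - mu) ** 2) + n * sigma0**2) / (2 * sigma**2)
    fd = np.sum((y - mu) ** 2) / (2 * sigma**2)
    return od - fd

sims = [od_minus_fd(y0 + sigma0 * rng.standard_normal(n)) for _ in range(20000)]
print(np.mean(sims), "vs  B =", k * sigma0**2 / sigma**2)   # both close to k
```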
It must be stressed that deterministic mis-specification is allowed here. There may be a non-zero value for the mis-specification parameter λ = y₀ᵗQy₀/σ₀², indicating that the constant vector y₀ in the definition of the true model M∗ does not lie in the estimation space, i.e., the column space of the design matrix X; so the model M_β does not agree with the true model M∗ for any value of β. Because the parameter λ appears in both E_y OD(y) and E_y FD(y), it cancels.

The preceding analysis was based entirely on the model-selection policy of § 2, according to which the expected overall discrepancy E_y OD(y) of the true model M∗ from a candidate model M should be used for selection purposes. This policy gives rise to the AIC, but it is interesting to consider the effects of generalizing it slightly. By the formulas of Theorem 3.2, this policy is equivalent to choosing the model with the smallest value of the sum AD + E_y ED(y), the discrepancy due to approximation plus the expected discrepancy due to estimation. Suppose that instead, one assessed the goodness of fit of M to M∗ by employing (any multiple of) the convex combination

\[
\Bigl(\frac{1}{\gamma}\Bigr)\,\mathrm{AD} + \Bigl(1 - \frac{1}{\gamma}\Bigr)\,\mathrm{E}_y\,\mathrm{ED}(y), \tag{3.7}
\]

where γ ≥ 1 is free. By increasing γ one emphasizes the discrepancy due to estimation, rather than the discrepancy of M∗ from M due to approximation (which if there were no mis-specification would be a constant, i.e., would effectively be zero). If there is no statistical mis-specification (σ² = σ₀²), it follows from the formulas in the theorem that an unbiased estimator of this convex combination, obtained by unbiasing FD, is MSC_γ/γ, where

\[
\mathrm{MSC}_\gamma = 2\, d_{\mathrm{KL}}(f; f) + \mathrm{RSS}/\sigma^2 + \gamma k.
\]

With its constant first term dropped, the model-selection criterion MSC_γ becomes what is widely known as the AIC_γ, which penalizes any model by γ times its number of parameters. As γ increases, over-fitting is increasingly disfavored. The conceptual difference between the two sorts of error in statistical model fitting was pointed out by Inagaki (1977), and an AIC_γ-like criterion resembling (3.7) was defined for autoregressive models by Bhansali (1986, Eq. (2.12)). But it seems not to have been noticed that for normal linear models, the AIC_γ arises rather naturally. Of course in applications, domain-specific considerations that are less axiomatic than practical may affect the choice of γ.
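As a simple illustration (ours, on arbitrary simulated data), the sketch below shows how the polynomial order selected by AIC_γ = RSS/σ² + γk shrinks as the penalty coefficient γ grows; the error variance is taken as known, per (1.3).

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma0 = 60, 0.5
x = np.linspace(-1.0, 1.0, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + sigma0 * rng.standard_normal(n)  # assumed truth

def rss(order):
    X = np.vander(x, order + 1, increasing=True)
    return np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)

for gamma in (2.0, 3.0, 4.0, 6.0):
    # AIC_gamma for a model with k = order + 1 parameters, sigma^2 known
    best = min(range(1, 10), key=lambda m: rss(m) / sigma0**2 + gamma * (m + 1))
    print(f"gamma = {gamma}: selected order {best}")
```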
(b) AIC variability and a significance test

The procedure of deciding among candidate normal linear models will now be placed in the classical hypothesis testing framework. A test of significance for the evidence that M₁, M₂ are not equally close to the true data-generating process M∗ will be proposed. The test is valid in the n → ∞ limit, when applied to models that, in a certain precise sense, are separately mis-specified. It is assumed that the data points are accompanied by error bars, i.e., that the residual variance σ₀² is known and is incorporated in M₁, M₂.

The proposed test is based on the statistic ∆₁₂ := AIC₂ − AIC₁ and an expression for its variance, and can potentially replace the traditional use of Akaike weights. Focusing on the variability and hence the significance of an AIC difference has much in common with the approaches of Efron (1984) and Fraser & Gebotys (1987). But unlike Efron we do not use a bootstrap procedure, and unlike Fraser & Gebotys we allow arbitrary mis-specification. The general approach is distinguished from the likelihood ratio testing approach originating with Cox (1962), in that it decides between M₁, M₂ on the basis of which is closer to the truth, not on the basis of which is more likely to be correct.

For regression applications, consider the case when the models are fitted to a random vector y ∈ ℝⁿ that is generated by an unknown true model y = y₀ + ε₀, and N = 1: only one observation of the random vector y is available as a datum. The candidates M_l, l = 1, 2, are defined by y = X⁽ˡ⁾β⁽ˡ⁾ + ε⁽ˡ⁾, where X⁽ˡ⁾ is an n × k_l design matrix of full rank (with k_l < n) and β⁽ˡ⁾ ∈ ℝ^{k_l} is a column vector of parameters. The estimation space L_l ⊂ ℝⁿ (i.e., the column space of X⁽ˡ⁾) is a linear subspace of dimension k_l. Since the models are given, the subspaces L₁, L₂ are specified in advance; thus the analysis below is in a sense conditional. The error vectors ε₀, ε⁽¹⁾, ε⁽²⁾ are taken to be σ₀z₀, σ₀z⁽¹⁾, σ₀z⁽²⁾, where z₀, z⁽¹⁾, z⁽²⁾ are vectors of n independent standard normal variables.

In this setting the d_KL-based MSC of AIC type reduces to the standard AIC, by Corollary 3.3. Dropping the additive constant 2 d_KL(f; f), we write

\[
\mathrm{AIC}_l = \mathrm{RSS}_l/\sigma_0^2 + 2k_l, \tag{3.8a}
\]
\[
\mathrm{RSS}_l/\sigma_0^2 = \bigl(y - \hat y^{(l)}\bigr)^t \bigl(y - \hat y^{(l)}\bigr)/\sigma_0^2 = y^t Q^{(l)} y/\sigma_0^2 = (z_0 + y_0/\sigma_0)^t\, Q^{(l)}\, (z_0 + y_0/\sigma_0), \tag{3.8b}
\]

since ŷ⁽ˡ⁾ := P⁽ˡ⁾y. (Cf. (3.6b).) The n × n matrices P⁽ˡ⁾, Q⁽ˡ⁾ project onto L_l, L_l⊥ ⊂ ℝⁿ, with Q⁽ˡ⁾ = Iₙ − P⁽ˡ⁾; note that tr P⁽ˡ⁾ = k_l and tr Q⁽ˡ⁾ = n − k_l. Being an inhomogeneous quadratic form in z₀, RSS_l/σ₀² has a non-central chi-squared distribution, which is χ²_{n−k_l}(λ⁽ˡ⁾), where λ⁽ˡ⁾ := y₀ᵗQ⁽ˡ⁾y₀/σ₀² characterizes the mis-specification of model M_l against M∗. AIC_l is therefore distributed by

\[
\mathrm{AIC}_l \sim \chi^2_{n-k_l}(\lambda^{(l)}) + 2k_l. \tag{3.9}
\]

It should be noted that as r → ∞, the distribution of χ²_r(λ) is increasingly normal, whether or not λ grows with r; thus as n → ∞, the distribution of AIC_l is increasingly normal. But it is the distribution of ∆₁₂ := AIC₂ − AIC₁ that is of interest in model selection, and this is determined by the joint distribution of AIC₁, AIC₂, and hence by the joint distribution of the quadratic forms yᵗQ⁽¹⁾y and yᵗQ⁽²⁾y. As will be seen, obtaining an n → ∞ limit theorem requires that the relationship between Q⁽¹⁾, Q⁽²⁾ be somewhat restricted.
Theorem 3.4. The expectation and variance of AIC₁, AIC₂ and the difference ∆₁₂ := AIC₂ − AIC₁ are given by

\[
\mathrm{E}\,\mathrm{AIC}_l = \operatorname{tr} Q^{(l)} + y_0^t Q^{(l)} y_0/\sigma_0^2 + 2k_l = (n - k_l) + \lambda^{(l)} + 2k_l,
\]
\[
\operatorname{Var}\mathrm{AIC}_l = 2 \operatorname{tr} Q^{(l)} + 4\, y_0^t Q^{(l)} y_0/\sigma_0^2 = 2(n - k_l) + 4\lambda^{(l)},
\]
\[
\mathrm{E}\,\Delta_{12} = \operatorname{tr}\bigl(Q^{(2)} - Q^{(1)}\bigr) + y_0^t \bigl(Q^{(2)} - Q^{(1)}\bigr) y_0/\sigma_0^2 + 2(k_2 - k_1) = (k_2 - k_1) + \bigl(\lambda^{(2)} - \lambda^{(1)}\bigr),
\]
\[
\operatorname{Var}\Delta_{12} = 2 \operatorname{tr}\bigl[(Q^{(2)} - Q^{(1)})^2\bigr] + 4\, y_0^t \bigl[(Q^{(2)} - Q^{(1)})^2\bigr] y_0/\sigma_0^2.
\]

Proof. The first three of these follow immediately from (3.9) by using E[χ²_r(λ)] = r + λ and Var[χ²_r(λ)] = 2r + 4λ. All four follow from (3.8) by using the known expressions for normal moments (i.e., the moments of the components of the normal random vector y ∈ ℝⁿ). □

The joint distribution of a pair of quadratic forms in a normal vector such as y ∈ ℝⁿ is complicated, and in general can only be expressed in terms of special functions (Mathai & Provost 1992). But some cases can be treated in closed form. For instance, yᵗQ⁽¹⁾y and yᵗQ⁽²⁾y are independent if (and only if) Q⁽¹⁾Q⁽²⁾ = 0, by the Craig–Sakamoto theorem (Ogawa & Olkin 2008).

A case more important in applications is the following. Suppose that normal linear regression models M₁, M₂, such as the pair considered here, satisfy M₂ ⊂ M₁. That is, M₂ is a reduced version of the fuller model M₁, obtained by parametric restriction. Then they are nested: their estimation subspaces L₁, L₂ are related by L₂ ⊂ L₁ and L₂⊥ ⊃ L₁⊥, so that Q⁽¹⁾Q⁽²⁾ = Q⁽¹⁾. If the vector y₀ in the true model M∗ satisfies y₀ ∈ L₂ ⊂ L₁, so that neither of M₁, M₂ is mis-specified and λ⁽¹⁾ = λ⁽²⁾ = 0, then in addition to the distributional statement RSS_l/σ₀² ∼ χ²_{n−k_l}, one has (RSS₂ − RSS₁)/σ₀² ∼ χ²_{k₁−k₂}. If alternatively y₀ ∈ L₁ \ L₂, so that M₂ is mis-specified but the fuller model M₁ is not, then

\[
\Delta_{12} + 2(k_1 - k_2) = (\mathrm{RSS}_2 - \mathrm{RSS}_1)/\sigma_0^2 \;\sim\; \chi^2_{k_1-k_2}(\lambda^{(2)}). \tag{3.10}
\]

Such situations are familiar from multivariate regression, and lead to (partial) F-tests of the significance of linear regressors (Mardia et al. 1979).
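Theorem 3.4's formulas are straightforward to evaluate for concrete designs. The Python sketch below is our own toy setup (two non-nested designs and an arbitrary y₀ fitting neither); it computes E ∆₁₂ and Var ∆₁₂ from the projection matrices and checks them against simulation.

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma0 = 80, 1.0
x = np.linspace(0.0, 1.0, n)
X1 = np.column_stack([np.ones(n), x, x**2])          # candidate M1 (k1 = 3)
X2 = np.column_stack([np.ones(n), np.exp(x)])        # candidate M2 (k2 = 2)
y0 = np.sin(3 * x)                                   # true mean; fits neither

def proj(X):
    return X @ np.linalg.solve(X.T @ X, X.T)

Q1, Q2 = np.eye(n) - proj(X1), np.eye(n) - proj(X2)
D = Q2 - Q1
k1, k2 = X1.shape[1], X2.shape[1]
mean_th = (k2 - k1) + (y0 @ Q2 @ y0 - y0 @ Q1 @ y0) / sigma0**2
var_th = 2 * np.trace(D @ D) + 4 * y0 @ D @ D @ y0 / sigma0**2

def delta12(y):
    return (y @ Q2 @ y - y @ Q1 @ y) / sigma0**2 + 2 * (k2 - k1)

sims = np.array([delta12(y0 + sigma0 * rng.standard_normal(n))
                 for _ in range(20000)])
print(mean_th, sims.mean())     # Theorem 3.4 expectation vs simulation
print(var_th, sims.var())       # Theorem 3.4 variance vs simulation
```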
However, we wish also to handle Q⁽¹⁾, Q⁽²⁾, or equivalently L₁, L₂, that are less closely related: non-nestedness and more general mis-specifications should be allowed. To motivate the proposed hypothesis test, a simple limit theorem will now be proved, on the distribution of ∆₁₂ in a case often encountered in the physical sciences. This is when the models M₁, M₂ are at least slightly mis-specified relative to the (unknown, presumably infinite-dimensional) true model M∗, in the rather consequential sense that each has a non-zero mean fitting error per data point. Hence one expects that in the n → ∞ limit, the mis-specification parameters λ⁽ˡ⁾ := y₀ᵗQ⁽ˡ⁾y₀/σ₀² will grow proportionately to n (generically, at different rates). For further discussion of mis-specification regimes, see § 4.

In the theorem a certain trace condition will appear as a hypothesis. It is motivated by the following consideration. From tr Q⁽ˡ⁾ = n − k_l it follows that tr(Q⁽²⁾ − Q⁽¹⁾) = k₁ − k₂. If there is nestedness and L₂ ⊂ L₁, then Q⁽²⁾ − Q⁽¹⁾ is also a projection, and idempotent; thus for any specified m ≥ 1, tr[(Q⁽²⁾ − Q⁽¹⁾)^m] is O(1), i.e., it does not grow with n. It is reasonable to suppose that this condition will hold if M₁, M₂, even if non-nested, are sufficiently similar to justify their being used as competing models of the same data. (Note that if the condition holds for m = 2 then it holds for all m ≥ 2, by a standard trace norm inequality.) The condition does not hold in the maximally dissimilar case Q⁽¹⁾Q⁽²⁾ = 0, as tr[(Q⁽²⁾ − Q⁽¹⁾)²] then equals tr Q⁽¹⁾ + tr Q⁽²⁾.

Definition 3.5. Consider a sequence of triples (M₁, M₂, M∗) indexed by n (including sequences of vectors y₀ ∈ ℝⁿ and design matrices), with a common error variance σ₀². The n × n projections Q⁽¹⁾, Q⁽²⁾ are defined as usual. If as n → ∞, y₀ᵗ[(Q⁽²⁾ − Q⁽¹⁾)²]y₀ is bounded below by a positive multiple of n, while (much more routinely) the mis-specification parameters λ⁽ˡ⁾ = y₀ᵗQ⁽ˡ⁾y₀/σ₀² of the two models are bounded above by a positive multiple of n, the models are said to be asymptotically separately mis-specified.

Remark. Nested models M₁, M₂ will be asymptotically separately mis-specified if λ⁽¹⁾, λ⁽²⁾, |λ⁽¹⁾ − λ⁽²⁾| all grow proportionately to n.

Theorem 3.6. In this setting of a sequence of triples (M₁, M₂, M∗) indexed by n, if the candidate models are asymptotically separately mis-specified, and also satisfy the trace condition that tr[(Q⁽²⁾ − Q⁽¹⁾)²] is O(1), then the distribution of ∆₁₂ := AIC₂ − AIC₁, the expectation and variance of which are given in Theorem 3.4, is asymptotically normal as n → ∞.

Remark. Under the conditions of this theorem, Var ∆₁₂ will be bounded below by a positive multiple of n. This is because according to Theorem 3.4, Var ∆₁₂ equals a combination of tr[(Q⁽²⁾ − Q⁽¹⁾)²] and y₀ᵗ[(Q⁽²⁾ − Q⁽¹⁾)²]y₀.

Proof. The second cumulant c₂(∆₁₂) = Var ∆₁₂ is bounded below by a positive multiple of n, as just remarked. It is easily seen that each higher cumulant c_m(∆₁₂), m ≥ 3, is O(n). For instance, c₃(∆₁₂) equals a combination of tr[(Q⁽²⁾ − Q⁽¹⁾)³] and y₀ᵗ[(Q⁽²⁾ − Q⁽¹⁾)³]y₀, and these are respectively O(1) and O(n). Therefore the moments of (∆₁₂ − E ∆₁₂)/(Var ∆₁₂)^{1/2} tend to those of N(0, 1), because its higher cumulants tend to zero. □

The hypothesis test is suggested by the following. Recall that up to an unimportant additive constant, AIC_l/2 is an unbiased estimator of the expected overall discrepancy E_y ODˡ(y), i.e., the negative of the expected log-likelihood after fitting, the expectation being over data generated by M∗.

Corollary 3.7. In the above setting, under the null hypothesis that E_y OD¹(y) = E_y OD²(y) for all n, i.e., that M₁, M₂ are equally discrepant from M∗ for all n, the distribution of

\[
\Delta_{12} \Bigm/ \Bigl\{ 2 \operatorname{tr}\bigl[(Q^{(2)} - Q^{(1)})^2\bigr] + 4\, y_0^t \bigl[(Q^{(2)} - Q^{(1)})^2\bigr] y_0/\sigma_0^2 \Bigr\}^{1/2}
\]

tends to N(0, 1) as n → ∞.

Proof. The denominator is (Var ∆₁₂)^{1/2}, as given in Theorem 3.4. □

An unbiased estimator of Var ∆₁₂ is the quantity

\[
-2 \operatorname{tr}\bigl[(Q^{(2)} - Q^{(1)})^2\bigr] + 4\, y^t \bigl[(Q^{(2)} - Q^{(1)})^2\bigr] y/\sigma_0^2, \tag{3.11}
\]

as follows by evaluating its expectation over y. It is not guaranteed to be positive, but the probability of its being so tends to unity as n → ∞, since the second term increasingly dominates the first.
(A maximum likelihood estimator could perhaps be used instead, but even when M₁, M₂ are nested and ∆₁₂ has essentially a non-central chi-squared distribution as in (3.10), MLE is difficult to perform (Anderson 1981).) The expression (3.11) is the key to the following test, which can be applied at any fixed n.

Hypothesis Test. To test the null hypothesis H₀ that M₁, M₂ are equally discrepant from the true model M∗, against the alternative that they are not, calculate what is asymptotically an N(0, 1) test statistic,

\[
z^{(12)} := \Delta_{12} \Bigm/ \Bigl\{ -2 \operatorname{tr}\bigl[(Q^{(2)} - Q^{(1)})^2\bigr] + 4\, y^t \bigl[(Q^{(2)} - Q^{(1)})^2\bigr] y/\sigma_0^2 \Bigr\}^{1/2}.
\]

If |z⁽¹²⁾| > z₀, where P(|Z| > z₀) = α for a standard normal variable Z, the evidence against H₀ is significant at level α. Equivalently, the p-value associated to H₀ is given by the formula p = P(|Z| > |z⁽¹²⁾|). To test against a one-sided alternative that one model is less divergent than the other, proceed similarly.

Estimating the variance of the AIC difference by (3.11) is what makes this z-test possible. (The small probability that the estimated variance may be non-positive should be noted.) It should be stressed that the normality of the test statistic is a good approximation only for large-n models M₁, M₂ that differ appreciably in their mis-specification. In general one would need to exploit the joint distribution of the forms yᵗQ⁽¹⁾y and yᵗQ⁽²⁾y, which is complicated.

This test of the significance of an AIC difference is modelled after a z-test proposed by Linhart (1988). His test is based on the large-sample (N → ∞) properties of minimum discrepancy estimators, and is not restricted to normal regression. Our test applies when N = 1, and is valid in the rather different n → ∞ limit. However, the need for an asymptotic mis-specification occurs in his analysis, as in ours. In its absence, the test statistic could have a limiting non-central chi-squared distribution, rather than a normal one (cf. Steiger et al. 1985).
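Here is a minimal Python implementation of this test (ours; the function name, design matrices, and data are illustrative). It computes ∆₁₂, the variance estimate (3.11), the z statistic, and the two-sided p-value.

```python
import numpy as np
from scipy.stats import norm

def aic_z_test(X1, X2, y, sigma0_sq):
    """Two-sided z-test of H0: the fitted models are equally discrepant
    from the truth, for normal linear models with known error variance.
    Returns (delta12, z, p). Asymptotic; see the caveats in the text."""
    n = y.size
    def Q(X):   # projection onto the error space of a design matrix
        return np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
    Q1, Q2 = Q(X1), Q(X2)
    delta12 = (y @ Q2 @ y - y @ Q1 @ y) / sigma0_sq \
              + 2 * (X2.shape[1] - X1.shape[1])            # AIC_2 - AIC_1
    D2 = (Q2 - Q1) @ (Q2 - Q1)
    var_est = -2 * np.trace(D2) + 4 * y @ D2 @ y / sigma0_sq   # (3.11)
    if var_est <= 0:
        raise ValueError("variance estimate non-positive; test inapplicable")
    z = delta12 / np.sqrt(var_est)
    return delta12, z, 2 * norm.sf(abs(z))

# Illustrative use with made-up competing models:
rng = np.random.default_rng(7)
n, s0 = 100, 1.0
x = np.linspace(0.0, 1.0, n)
y = np.sin(3 * x) + s0 * rng.standard_normal(n)
X1 = np.column_stack([np.ones(n), x, x**2])
X2 = np.column_stack([np.ones(n), np.exp(x)])
print(aic_z_test(X1, X2, y, s0**2))
```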
Throughout this section we have dealt with linear regression models. But it is not difficult to extend the estimation of $\operatorname{Var} \Delta_{12}$, and hence the proposed hypothesis test, to models $M_1, M_2$ that are non-linear. The following is a sketch. Suppose that model $M_l$, $l = 1, 2$, is defined by $y = y_0^{(l)} + \epsilon^{(l)}$, where $y_0^{(l)} = y_0^{(l)}(\beta^{(l)})$ is a sufficiently smooth function, not necessarily linear, of the parameter vector $\beta^{(l)} \in \mathbb{R}^{k_l}$. In the non-linear case the estimation subspace $L_l \subset \mathbb{R}^n$ is replaced by an estimation submanifold of dimension $k_l$, but $M_l$ can be fitted to any datum $y \in \mathbb{R}^n$ by non-linear regression (Bates & Watts 1988). There will be a best-fit choice $\hat\beta^{(l)} \in \mathbb{R}^{k_l}$ for the parameter vector, and a predicted vector $\hat y^{(l)} = y_0^{(l)}(\hat\beta^{(l)})$. As usual, the residual sum of squares $\mathrm{RSS}_l$ equals $(y - \hat y^{(l)})^t (y - \hat y^{(l)})$. The difference from the linear case is this: $\mathrm{RSS}_l$ is no longer quadratic in $y_0$, as in Eq. (3.8b). But it is straightforward to derive a power series in $y_0$ for $\mathrm{RSS}_l$ from a Taylor expansion of $y_0(\beta^{(l)})$ about the point $\beta^{(l)} = \hat\beta^{(l)}$. In much the same way, one can obtain an expansion of $\operatorname{Var} \Delta_{12}$ in powers of $y_0$. From this one can readily construct an unbiased estimator of $\operatorname{Var} \Delta_{12}$ as a power series in $y$, by requiring unbiasedness to each order. By employing a truncation of this series, which is a generalization of the quadratic estimator (3.11), one can extend the proposed test to candidate normal regression models that are non-linear. Thus for non-linear models as for linear ones, it may be possible to employ a decision procedure that relies on the variability of the $\Delta_{12}$ statistic, rather than on Akaike weights.

4. Selection without error bars

The applicability in model selection of the AIC and AICc statistics will now be considered, in the case when the linear regression models being assessed are fitted to a data set without error bars. This is quite different from the case when error (i.e., residual) variances are known and have been incorporated in each model. The calculations below reveal the need for the AICc correction, but also reveal a serious difficulty when a candidate model is mis-specified by an unknown amount. It has long been known that applying the AIC(c) to a mis-specified model is problematic (Sawa 1978; Reschenhofer 1999), but we obtain precise expressions for the asymptotic ($n \to \infty$) shift in the AICc coming from the mis-specification. Our results are similar to those of Noda et al. (1996), but are more explicit.

As in § 3a, take each candidate regression model $M_l$ to be normal linear, of the form $y = X^{(l)} \beta^{(l)} + \epsilon^{(l)}$ where $\epsilon^{(l)} = \sigma z^{(l)}$, with $z^{(l)}$ a column vector of independent standard normals. The parameter $\theta^{(l)} = (\beta^{(l)}; \sigma^2)$ now includes, besides $\beta^{(l)} \in \mathbb{R}^{k_l}$, the residual variance $\sigma^2$, which must also be fitted. By MLE, if a single datum $y \in \mathbb{R}^n$ is available, then $\hat\sigma^2$ equals $\mathrm{RSS}_l / n$. That is, $\hat\sigma^2 = y^t Q^{(l)} y / n$, where $Q^{(l)}$ projects onto the left null space of $X^{(l)}$ (the error space). The true model $M_*$ is $y = y_0 + \epsilon_0$ with $\epsilon_0 = \sigma_0 z_0$, in which both $y_0 \in \mathbb{R}^n$ and $\sigma_0^2$ are unknown. The deterministic mis-specification of $M_l$, if any, is quantified by the parameter $\lambda^{(l)} = y_0^t Q^{(l)} y_0 / \sigma_0^2$, which is a measure of the distance in $\mathbb{R}^n$ between $y_0$ and the column space of $X^{(l)}$ (the estimation space).

In many reasonable data-gathering and regression procedures, $n$ can be taken arbitrarily large; so the large-$n$ behavior of $\lambda^{(l)}$ merits discussion. One possibility is that $\lambda^{(l)}/n$ will tend to a limit as $n \to \infty$, like $\hat\sigma^2 = y^t Q^{(l)} y / n$. That is, in the limit some fraction of the RSS may be attributable to fitting errors of non-zero mean, coming from mis-specification, rather than to the random errors of mean zero and typical size $\sigma_0$ that come from stochasticity in the data-generating process $M_*$. (This possibility was discussed in § 3a.) Another possibility is that $\lambda^{(l)}$ will grow sublinearly in $n$ or even tend to a finite value, for a subtle reason: as $n$ increases, it may be possible to enhance the regression by improving or expanding the model $M_l$, giving an even better fit to $M_*$. But it must be stressed that in the present framework, which does not make explicit the possibility of taking $n$ to infinity or even of varying $n$, there is no way of distinguishing the fractional contribution made to $\mathrm{RSS}_l$ by a non-zero mis-specification $\lambda^{(l)}$, or of estimating its magnitude. Of course, in applications where the components $(y_i)_{i=1}^n$ of the observed vector $y$ are the values of a response variable corresponding to values $(x_i)_{i=1}^n$ of an explanatory one, one may sometimes be able to estimate this fraction by examining a residual plot; the simulation sketched below also illustrates the point.
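The absorption of mis-specification into $\hat\sigma^2$ is easy to exhibit numerically. In the following minimal sketch (Python with NumPy; the straight-line candidate model and the quadratic term in the simulated truth are hypothetical choices made purely for illustration), $E\,\hat\sigma^2$ is approximately $\sigma_0^2 (n - k)/n + y_0^t Q y_0 / n$, and nothing in the single datum $y$ reveals how the total splits between the two contributions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma0 = 1000, 0.5
x = np.linspace(0.0, 1.0, n)
y0 = 1.0 + 2.0 * x + 0.3 * x**2            # truth includes a quadratic term
X = np.column_stack([np.ones(n), x])       # candidate model: straight line
y = y0 + sigma0 * rng.standard_normal(n)   # single datum generated by M*

Q = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # error-space projection
sigma2_hat = y @ Q @ y / n                 # MLE of the variance, RSS/n
lam = y0 @ Q @ y0 / sigma0**2              # mis-specification parameter;
                                           # computable here only because
                                           # the simulation knows y0
print(sigma2_hat)                          # close to the combination below
print(sigma0**2 * (n - 2) / n + lam * sigma0**2 / n)
```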
As in § 3a, where there was no need to estimate $\sigma^2$, an explicit expression for the AIC-type selection criterion MSC (see Definition 2.2) is readily obtained. The MSC is defined so that $\mathrm{MSC}/2$ for any candidate $M = M_\theta$ of the form $y = X\beta + \epsilon$, with $\beta \in \mathbb{R}^k$, is an unbiased estimator of the expected overall discrepancy $E_y\,\mathrm{OD}(y)$ under $M_*$. This is because, according to the policy of § 2, it is the latter that should be used in model selection. For the discrepancy $d_{\mathrm{KL}}$, $\mathrm{OD}(y)$ equals $d_{\mathrm{KL}}(g; f_{\hat\theta(y)})$, in which $\hat\theta = (\hat\beta, \hat\sigma^2)$ is the fitted parameter obtained by MLE. Here $\hat\beta(y) = (X^t X)^{-1} X^t y \in \mathbb{R}^k$ as usual; and now $\hat\sigma^2 = y^t Q y / n$, where $Q = I_n - P$ and $P$ projects onto the column space of $X$. The PDFs $g, f_\theta$ of $M_*, M_\theta$ are
$$g(y) = (2\pi\sigma_0^2)^{-n/2} \exp\big[-(y - y_0)^t (y - y_0) / 2\sigma_0^2\big], \qquad (4.1)$$
$$f_\theta(y) = (2\pi\sigma^2)^{-n/2} \exp\big[-(y - X\beta)^t (y - X\beta) / 2\sigma^2\big], \qquad (4.2)$$
and by direct computation,
$$d_{\mathrm{KL}}(f_\theta, f_\theta) = C_n + \frac{n}{2} \ln(\sigma^2) \qquad (4.3)$$
as in (3.4), where we now write $C_n := (n/2)\,[1 + \ln(2\pi)]$.

What can be calculated from the datum $y \in \mathbb{R}^n$ is not $\mathrm{OD}(y)$ but the fitted discrepancy $\mathrm{FD}(y)$, i.e., $d_{\mathrm{KL}}(g_y; f_{\hat\theta(y)})$, where $g_y$ is a 1-point atomic PDF. This is simply the negative log-likelihood $-\ln L_f(\hat\theta \mid y)$ of the fitted model $M_{\hat\theta}$. As in § 3a, the MSC is given by
$$\mathrm{MSC}/2 = \mathrm{FD}(y) + B := \mathrm{FD}(y) + E_y[\mathrm{OD}(y) - \mathrm{FD}(y)], \qquad (4.4)$$
where the '$B$' term performs the unbiasing. Also much as before (see (3.5)),
$$\mathrm{FD}(y) = -\ln L_f(\hat\theta \mid y) = d_{\mathrm{KL}}(f_{\hat\theta}; f_{\hat\theta}) - \frac{n}{2} + \frac{\mathrm{RSS}}{2\hat\sigma^2} \qquad (4.5)$$
expresses the fitted discrepancy in terms of the RSS, which is $y^t Q y$. But now, under the true model $M_*$, the fitted variance $\hat\sigma^2$ as well as the RSS is a random variable. Since $\hat\sigma^2$ equals $\mathrm{RSS}/n$, (4.5) simplifies to
$$\mathrm{FD}(y) = -\ln L_f(\hat\theta \mid y) = d_{\mathrm{KL}}(f_{\hat\theta}; f_{\hat\theta}) = C_n + \frac{n}{2} \ln(\hat\sigma^2). \qquad (4.6)$$

The following is the counterpart of Theorem 3.1. In the statement, $\lambda := y_0^t Q y_0 / \sigma_0^2$ is the (presumably unknown) $M$-specific mis-specification parameter.

Theorem 4.1. Under the model $M_*$, the overall and fitted discrepancies of the fitted model $M_{\hat\theta(y)}$, $\mathrm{OD} = \mathrm{OD}(y)$ and $\mathrm{FD} = \mathrm{FD}(y)$, are distributed according to
$$\mathrm{OD} = d_{\mathrm{KL}}(g; f_{\hat\theta(y)}) \sim C_n + \frac{n}{2} \ln(\hat\sigma^2) + \frac{n}{2}\left(\frac{\sigma_0^2}{\hat\sigma^2} - 1\right) + \frac{\sigma_0^2}{2\hat\sigma^2}\,\chi^2_k(\lambda),$$
$$\mathrm{FD} = d_{\mathrm{KL}}(g_y; f_{\hat\theta(y)}) \sim C_n + \frac{n}{2} \ln(\hat\sigma^2),$$
where $\hat\sigma^2$ equals $(\sigma_0^2/n)$ times a random variable with distribution $\chi^2_{n-k}(\lambda)$, and $\chi^2_k(\lambda)$ signifies a random variable that is independent of $\hat\sigma^2$.

Proof. That $\hat\sigma^2$ equals $(\sigma_0^2/n)$ times a random variable with non-central chi-squared distribution $\chi^2_{n-k}(\lambda)$ follows from the representation
$$\hat\sigma^2 / (\sigma_0^2/n) = y^t Q y / \sigma_0^2 = (z_0 + y_0/\sigma_0)^t Q (z_0 + y_0/\sigma_0). \qquad (4.7)$$
(Cf. (3.6b).) As in the proof of Theorem 3.1, $\mathrm{OD}(y)$ is calculated by using the definition (2.9) of $d_{\mathrm{KL}}$ and the definitions (4.1), (4.2) of $g, f_\theta$.
The definite integral in the definition of $d_{\mathrm{KL}}$ can be evaluated in closed form, and the resulting quadratic form $(Py - y_0)^t (Py - y_0) / \sigma_0^2$ has distribution $\chi^2_k(\lambda)$. (Cf. (3.6a).) That this and the quadratic form $\hat\sigma^2 = y^t Q y / n$ are independent follows from the 'if' part of the Craig–Sakamoto theorem, mentioned above, since the projection matrices $P, Q$ are complementary: they satisfy $PQ = 0$. $\square$

Theorem 4.2. The unbiasing term '$2B$' in the definition (4.4) of the AIC-type selection criterion MSC is expressed in terms of the moments of non-central chi-squared random variables by
$$2B = 2 E_y[\mathrm{OD}(y) - \mathrm{FD}(y)] = n \left\{ n\, E\big[(\chi^2_{n-k}(\lambda))^{-1}\big] - 1 + E\big(\chi^2_k(\lambda)\big)\, E\big[(\chi^2_{n-k}(\lambda))^{-1}\big] \right\}$$
$$= n \left\{ -1 + (n + k + \lambda) \left[ \frac{1}{n-k-2} - \frac{1}{(n-k)(n-k-2)}\,\lambda + \cdots \right] \right\}.$$

Proof. This comes from the formulas of Theorem 4.1 by exploiting $\hat\sigma^2 \sim (\sigma_0^2/n)\,\chi^2_{n-k}(\lambda)$ and independence. The first moment $E(\chi^2_k(\lambda))$ equals $k + \lambda$, and the series in $\lambda$ for the negative first moment $E[(\chi^2_r(\lambda))^{-1}]$ (where $r = n - k$), appearing in square brackets, is taken from Appendix A. $\square$

The preceding calculation reduces, when $\lambda = 0$ and $M$ is not mis-specified, to a derivation of the AICc that has been given by several authors (Sugiura 1978; Hurvich & Tsai 1989; Cavanaugh 1997). But it is the generalization to non-zero $\lambda$, which is similar to one of Hurvich & Tsai (1991), that is of interest. The chi-squared variables in the formula for the unbiasing term, which if $\lambda$ were zero would be central, become non-central with non-centrality parameter $\lambda$.

As was explained above, in some applications it is reasonable for the mis-specification $\lambda$ of a candidate model to be large, in the sense that it grows linearly in $n$; that is, if the regression procedure is such that $n$ can be taken arbitrarily large. But it is also useful to consider the case of 'medium mis-specification,' when to leading order $\lambda$ grows proportionately to $n^{1/2}$, and that of 'small mis-specification,' when $\lambda$ is bounded in $n$ as $n \to \infty$. Hence $\lambda$ will now be allowed to grow according to
$$\lambda \sim \lambda_1 n + \lambda_{1/2}\, n^{1/2} + \lambda_0 + o(1), \qquad n \to \infty.$$

Theorem 4.3. Let the additive constant $2C_n$ be dropped from the definition of the AIC-type selection criterion MSC. Then if the mis-specification $\lambda$ of a model $M$ equals zero, its MSC reduces to the standard AICc given in (1.5),
$$\mathrm{AICc} = n \ln(\hat\sigma^2) + \frac{2(k+1)\,n}{n-k-2} \sim \mathrm{AIC} + \frac{2(k+1)(k+2)}{n} + O(1/n^2), \qquad n \to \infty,$$
where $\mathrm{AIC} = n \ln(\hat\sigma^2) + 2(k+1)$. In the regime of small mis-specification, when $\lambda \sim \lambda_0 + o(1)$ as $n \to \infty$ with $\lambda_0 > 0$,
$$\mathrm{MSC} \sim \mathrm{AICc} - \frac{\lambda_0 (2k + \lambda_0)}{n} + O(1/n^2), \qquad n \to \infty.$$
In the regime of medium mis-specification, when $\lambda \sim \lambda_{1/2}\, n^{1/2} + o(n^{1/2})$ as $n \to \infty$ with $\lambda_{1/2} > 0$,
$$\mathrm{MSC} \sim \mathrm{AIC} - \lambda_{1/2}^2 + O(1/n^{1/2}), \qquad n \to \infty.$$
In the regime of large mis-specification, when $\lambda \sim \lambda_1 n + o(n)$ as $n \to \infty$ with $\lambda_1 > 0$, the MSC equals AIC plus a $\lambda_1$-dependent quantity growing with $n$.

Proof. Substitute the leading-order behavior of $\lambda$ into the formula given in Theorem 4.2. It should be noted that if $\lambda \sim \lambda_1 n$, all terms of the power series in $\lambda$ for $E[(\chi^2_{n-k}(\lambda))^{-1}]$ will contribute, as each will be of order $1/n$. $\square$
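As a numerical check on these formulas, the exact unbiasing term $2B$ of Theorem 4.2 can be evaluated from the series (A.2) of Appendix A and compared with the $\lambda = 0$ AICc correction $2(k+1)n/(n-k-2)$. A minimal sketch follows (Python with SciPy; the values of $n$, $k$ and $\lambda$ are arbitrary test inputs, not taken from the text).

```python
import numpy as np
from scipy.special import gammaln

def inv_moment_ncx2(r, lam, terms=400):
    """Negative first moment E[(chi^2_r(lam))^{-1}], from the series (A.2)."""
    if lam == 0.0:
        return 1.0 / (r - 2)
    m = np.arange(terms)
    log_w = -lam / 2 + m * np.log(lam / 2) - gammaln(m + 1)   # Poisson weights
    return float(np.sum(np.exp(log_w) / (r - 2 + 2 * m)))

def unbiasing_term(n, k, lam):
    """Exact 2B of Theorem 4.2; reduces to the AICc correction when lam = 0."""
    e_inv = inv_moment_ncx2(n - k, lam)
    return n * ((n + k + lam) * e_inv - 1.0)

n, k = 50, 3
print(unbiasing_term(n, k, 0.0))   # equals 2(k+1)n/(n-k-2) = 8.888...
print(unbiasing_term(n, k, 5.0))   # smaller, by roughly lam*(2k+lam)/n
```

Letting lam grow proportionately to n in this sketch exhibits the large shift described in the final regime of the theorem.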
This theorem has disconcerting implications for the usefulness of the AIC and AICc in model selection. It reveals how different the case of an unknown error variance $\sigma^2$ is from the case of a known $\sigma^2$ (treated in § 3a). If $\lambda = 0$ and the model is not mis-specified, the theorem confirms that including the standard AICc correction term of magnitude $O(1/n)$ is justified. This term may affect the selection procedure if $n$ is small. But if $\lambda = \lambda_0 + O(1/n)$ as $n \to \infty$ with $\lambda_0$ non-zero, the coefficient of $1/n$ in the correction term will deviate from the AICc form. Since $\lambda_0$ is typically not known, this renders difficult any small-$n$ correcting of the AIC. In the regime of medium mis-specification the problem is worse: the $O(1)$ unbiasing term $2(k+1)$ in the AIC itself is shifted by an amount depending on $\lambda_{1/2}$. And in the regime of large mis-specification, in which applications of model selection may well lie, the AIC is shifted by a potentially large amount, growing with $n$. This shift may swamp the term $2(k+1)$.

Theorem 4.3 indicates that when deciding between candidate models that have been fitted to a data set without error bars, it may be unwise to use the AICc or even the AIC, if there is any possibility that the models are mis-specified relative to the true data-generating process $M_*$, and if the amount of mis-specification is unknown but is expected to be substantial. Since in the physical sciences $M_*$ is typically infinite-dimensional, candidate models that are mis-specified, at least to some extent, are expected to occur quite widely.

5. Summary and discussion

In this paper the AIC and AICc were developed from first principles, to clarify their ability to assess competing regression models of the sort common in the physical sciences: ones with normal errors and known error variances, coming from the error bars of a size-$n$ data set. The data set was viewed as providing a single observation ($N = 1$) of an $S$-valued random quantity, the space $S$ being $\mathbb{R}^n$.

In § 2 a model selection policy was formulated, applying to arbitrary $S$ and arbitrary $N$. The Kullback–Leibler divergence $d_{\mathrm{KL}}$ was then chosen as the discrepancy functional, which the policy left unspecified. The choice of $d_{\mathrm{KL}}$ ensures that fitted models are compared on the basis of their fitted log-likelihoods, suitably unbiased (i.e., penalized). It was noted that other measures of the discrepancy between a parametric model and a data set could be used, such as the popular Hellinger distance. This option is worth exploring, since MLE is not robust and may not be the best choice if, say, the regression is non-linear or non-normal errors are present. But as was explained, this will require that the selection policy be modified to include a form of kernel density estimation.

It was shown in § 3a that when fitting a linear regression model to data with error bars, the AIC and not the AICc should be used. (For comments on the recent astrophysics literature, see Appendix B.) If the model incorporates a known error variance $\sigma^2$, its mis-specification, if any, does not affect the validity of the AIC, though it causes certain discrepancy statistics to have non-central rather than central chi-squared distributions.
That no additional unbiasing of the AIC is needed when $\sigma^2$ is known has in fact been noticed (Kuha 2004, p. 209), but seems to have attracted little attention. In applications of the AIC in the physical sciences, it is of considerable importance.

In § 3b, it was shown that in the same setting as that of § 3a, the variability of an AIC difference can be estimated. A test of significance was proposed, which exploits what under reasonable conditions of mis-specification is the asymptotic ($n \to \infty$) normality of this statistic. The significance test is a test for selection, which can potentially replace the use of Akaike weights in deciding between regression models with known error variances.

The approach of § 3b to model selection resembles the approach of Commenges et al. (2008). In a general large-sample ($N \to \infty$) context, not focused on the comparison of regression models, they proposed a test for selection based on a difference of two AICs. They were able to work out the asymptotic distribution of a normalized version of this difference by exploiting large-sample theory for the likelihood ratio statistic. This included the classical result of Wald (1943) on the comparison of nested models, involving a non-central chi-squared distribution, and an asymptotic normality result of Vuong (1989), who dealt with non-nested models. The test proposed in § 3b is similar in spirit, but in the formulation used here the $n \to \infty$ limit of a regression model is rather different from an $N \to \infty$ large-sample limit, and requires its own analysis.

In § 4 the usual derivation of the AICc statistic, applying to linear regression models fitted to data sets without error bars, was extended to models with non-zero mis-specification $\lambda$. The appearance of a non-central chi-squared in the distribution of the overall Kullback–Leibler discrepancy is not unexpected. But the behavior of the unbiasing term in the large-$n$ limit is cause for concern. The case when the model mis-specification $\lambda$ is $o(1)$ as $n \to \infty$ is the nicest. (It has a close analogue in the large-sample theory of the likelihood ratio: the case of 'local alternatives,' when the true value of a model parameter is taken to approach the pseudo-true value as $N \to \infty$.) Except in this case, the shift in the AICc due to the mis-specification may swamp, in the large-$n$ limit, the AICc correction term, and even the usual $2(k+1)$ unbiasing term. For mis-specified regression models fitted to data without error bars, this may well affect the usefulness of the AICc as a tool in model selection. The extent to which this problem occurs in the physical sciences remains to be studied.

Appendix A. Non-central chi-squared distributions and quadratic forms

A (central) $\chi^2$ distribution with $r$ degrees of freedom, denoted $\chi^2_r$, is the distribution of the sum of the squares of $r$ independent standard normal random variables. That is, if $z$ is a column vector of $r$ standard normals, then $z^t z \sim \chi^2_r$. There is a generalization: if $P$ is an $r \times r$ projection matrix of rank $s$ with $0 \le s \le r$, the quadratic form $(Pz)^t (Pz) = z^t P z$ has distribution $\chi^2_s$. A $\chi^2$ distribution with $r$ degrees of freedom and non-centrality parameter $\lambda$, denoted $\chi^2_r(\lambda)$, is the distribution of $(z + u)^t (z + u)$, where $u$ is a fixed column vector.
That is, it is the distribution of the sum of the squares of $r$ independent unit-variance normal variables, not necessarily of mean zero. The parameter $\lambda$ equals $u^t u$. There is a generalization: the quadratic form $[Pz + u]^t [Pz + u]$ has distribution $\chi^2_s(\lambda)$ with $\lambda = u^t u$. A second generalization is that $[P(z + u)]^t [P(z + u)] = (z + u)^t P (z + u)$ has distribution $\chi^2_s(\lambda)$ with $\lambda = u^t P u$.

When $r > 1$, the PDF of $\chi^2_r(\lambda)$ cannot be expressed in terms of elementary functions, though it can in terms of the confluent hypergeometric function ${}_0F_1$, or alternatively a modified Bessel function of the first kind. If $X \sim \chi^2_r(\lambda)$ then $X$ has mean and variance
$$E X = r + \lambda, \qquad \operatorname{Var} X = 2r + 4\lambda, \qquad (A.1)$$
and negative first moment
$$E[X^{-1}] = e^{-\lambda/2} \sum_{m=0}^{\infty} \frac{(\lambda/2)^m}{m!} \frac{1}{r - 2 + 2m}. \qquad (A.2)$$
For details, see Mathai & Provost (1992) and Bock et al. (1984).
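These moment formulas are easily validated by simulation. A brief sketch (Python with SciPy; the values of $r$ and $\lambda$ are arbitrary test inputs) compares (A.1) and the series (A.2) against Monte Carlo estimates.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import ncx2

r, lam = 10, 3.5
x = ncx2.rvs(df=r, nc=lam, size=200_000, random_state=1)

print(x.mean(), r + lam)             # (A.1): E[X] = r + lambda
print(x.var(), 2 * r + 4 * lam)      # (A.1): Var[X] = 2r + 4*lambda

# (A.2): E[1/X] as a Poisson(lam/2)-weighted sum of 1/(r - 2 + 2m)
m = np.arange(200)
w = np.exp(-lam / 2 + m * np.log(lam / 2) - gammaln(m + 1))
print((1.0 / x).mean(), np.sum(w / (r - 2 + 2 * m)))
```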
Appendix B. The AICc in recent papers

The AIC and AICc have recently entered the physical sciences, and in particular astrophysics, by being used to compare cosmological models. Such models have a relatively small number of parameters, and competing models are usually not nested. Models have been compared, e.g., on the basis of their predictions of the distance–redshift relation, which characterizes the expansion of the Universe. After a model is fitted by non-linear regression to observational data, its goodness of fit is assessed by calculating its AIC or AICc. One recent comparison of models, employing the AIC and the Bayesian criterion BIC, is that of Shi et al. (2012).

A search reveals that many though not all publications in this area use the AIC and AICc in a fashion that, on the basis of the present work, can be considered correct. If the observational data are accompanied by error bars, or a common error variance $\sigma^2$ is known or assumed, the AIC should be used; and if the variance is treated as a nuisance parameter to be fitted, the AICc should be used. Davis et al. (2007), Li et al. (2010) and Tan & Biswas (2012) employ the AICc, despite their data sets being accompanied by error bars, which strictly speaking is incorrect; but they observe in their analyses that the AICc correction is of negligible size and does not affect model comparisons. The paper of Tan & Biswas is especially valuable from a statistician's point of view, because they investigate AIC(c) variability empirically rather than theoretically, using a bootstrap procedure.

Unfortunately, a number of papers in the literature are based on data with explicit error bars, but use the AICc without commenting on whether its $O(1/n)$ correction term affects their results. These include the papers of Biesiada & Piórkowska (2009), February et al. (2010), Kelly et al. (2010), Dantas et al. (2011), Basilakos & Pouri (2012), Papageorgiou et al. (2012) and Wang & Zhang (2012). A re-examination of the model comparisons in these papers is surely desirable.

References

Akaike, H. 1973 Information theory and an extension of the maximum likelihood principle. In Second international symposium on information theory (Tsahkadsor, Armenia, 1971) (eds B. N. Petrov & F. Csáki), pp. 267–281. Budapest, Hungary: Akadémiai Kiadó.
Akaike, H. 1992 Information theory and an extension of the maximum likelihood principle. In Breakthroughs in statistics, Volume I (eds S. Kotz & N. L. Johnson), pp. 610–624. New York/Berlin: Springer-Verlag.
Amari, S. & Nagaoka, H. 2000 Methods of information geometry, vol. 191 of Transl. Math. Monographs. Providence, RI: American Mathematical Society (AMS).
Anderson, D. A. 1981 Maximum likelihood estimation in the noncentral chi distribution with unknown scale parameter. Sankhya Ser. B 43, 58–67.
Basilakos, S. & Pouri, A. 2012 The growth index of matter perturbations and modified gravity. Monthly Notices Roy. Astronom. Soc. 423, 3761–3767. Available on-line as arXiv:1203.6724.
Basu, A., Harris, I. R. & Basu, S. 1997 Minimum-distance estimation: The approach using density-based distances. In Robust inference (eds G. S. Maddala & C. R. Rao), vol. 15 of Handbook of Statistics, pp. 21–48. New York/Amsterdam: Elsevier.
Basu, A., Shioya, H. & Park, C. 2011 Statistical inference: The minimum distance approach. Boca Raton, FL: Chapman & Hall/CRC.
Bates, D. M. & Watts, D. G. 1988 Nonlinear regression analysis and its applications. New York: Wiley.
Beran, R. 1977 Minimum Hellinger distance estimates for parametric models. Ann. Statist. 5, 445–463.
Bevington, P. R. & Robinson, D. K. 2003 Data reduction and error analysis for the physical sciences, 3rd edn. Boston: McGraw-Hill.
Bhansali, R. J. 1986 A derivation of the information criteria for selecting autoregressive models. Adv. in Appl. Probab. 18, 360–387.
Bhansali, R. J. & Downham, D. Y. 1977 Some properties of the order of an autoregressive model selected by a generalization of Akaike's EPF criterion. Biometrika 64, 547–551.
Biesiada, M. & Piórkowska, A. 2009 Lorentz invariance violation-induced time delays in GRBs in different cosmological models. Classical Quantum Gravity 26, 125007 (9 pp.). Available on-line as arXiv:1008.2615.
Bock, M. E., Judge, G. G. & Yancey, T. A. 1984 A simple form for the inverse moments of non-central χ² and F random variables and certain confluent hypergeometric functions. J. Econometrics 25, 217–234.
Burnham, K. P. & Anderson, D. R. 2002 Model selection and multimodel inference: A practical information-theoretic approach. New York: Springer-Verlag.
Cavanaugh, J. E. 1997 Unifying the derivations for the Akaike and corrected Akaike information criteria. Statist. Probab. Lett. 33, 201–208.
Cavanaugh, J. E. 1999 A large-sample model selection criterion based on Kullback's symmetric divergence. Statist. Probab. Lett. 42, 333–343.
Cavanaugh, J. E. 2004 Criteria for linear model selection based on Kullback's symmetric divergence. Aust. N. Z. J. Stat. 46, 257–274.
Claeskens, G. & Hjort, N. L. 2008 Model selection and model averaging. Cambridge, UK: Cambridge Univ. Press.
Commenges, D., Sayyareh, A., Letenneur, L., Guedj, J. & Bar-Hen, A. 2008 Estimating a difference of Kullback–Leibler risks using a normalized difference of AIC. Ann. Appl. Statist. 2, 1123–1142. Available on-line as arXiv:0807.4086.
Cox, D. R. 1962 Further results on tests of separate families of hypotheses. J. Roy. Statist. Soc. Ser. B 24, 406–424.
Csiszár, I. & Körner, J. 2011 Information theory: Coding theorems for discrete memoryless systems, 2nd edn. Cambridge, UK: Cambridge Univ. Press.
Dantas, M. A., Alcaniz, J. S., Mania, D. & Ratra, B. 2011 Time and distance constraints on accelerating cosmological models. Phys. Lett. B 699, 239–245. Available on-line as arXiv:1010.0995.
Davis, T. M., Mörtsell, E., Sollerman, J., et al. 2007 Scrutinizing exotic cosmological models using ESSENCE supernova data combined with other cosmological probes. Astrophys. J. 666, 716–725. Available on-line as arXiv:astro-ph/0701510.
Donoho, D. L. & Liu, R. C. 1988 The "automatic" robustness of minimum distance functionals. Ann. Statist. 16, 552–586.
Draper, N. R. & Smith, H. 1998 Applied regression analysis, 3rd edn. New York: Wiley.
Efron, B. 1984 Comparing non-nested linear models. J. Amer. Statist. Assoc. 79, 791–803.
February, S., Larena, J., Smith, M. & Clarkson, C. 2010 Rendering dark energy void. Monthly Notices Roy. Astronom. Soc. 405, 2231–2242. Available on-line as arXiv:0909.1479.
Fraser, D. A. S. & Gebotys, R. J. 1987 Non-nested linear models: A conditional confidence approach. Canad. J. Statist. 15, 375–385.
Hurvich, C. M. & Tsai, C.-L. 1989 Regression and time series model selection in small samples. Biometrika 76, 297–307.
Hurvich, C. M. & Tsai, C.-L. 1991 Bias of the corrected AIC criterion for underfitted regression and time-series models. Biometrika 78, 499–509.
Inagaki, N. 1977 Two errors in statistical model fitting. Ann. Inst. Statist. Math. 29, 131–152.
Kapur, J. N. 1989 Maximum-entropy models in science and engineering, revised edn. New Delhi: Wiley Eastern Ltd.
Kelly, P. L., Hicken, M., Burke, D. L., Mandel, K. S. & Kirshner, R. P. 2010 Hubble residuals of nearby Type Ia supernovae are correlated with host galaxy masses. Astrophys. J. 715, 743–756. Available on-line as arXiv:0912.0929.
Keuzenkamp, H. A., McAleer, M. & Zellner, A. (eds) 2001 Simplicity, inference and modelling. Cambridge, UK: Cambridge Univ. Press.
Konishi, S. & Kitagawa, G. 2008 Information criteria and statistical modeling. New York: Springer.
Kuha, J. 2004 AIC and BIC: Comparison of assumptions and performance. Sociol. Methods Res. 33, 188–229.
Li, M., Li, X. & Zhang, X. 2010 Comparison of dark energy models: A perspective from the latest observational data. Science China: Physics Mechanics Astronomy 53, 1631–1645. Available on-line as arXiv:0912.3988.
Liddle, A. R. 2007 Information criteria for astrophysical model selection. Monthly Notices Roy. Astronom. Soc. 377, L74–L78. Available on-line as arXiv:astro-ph/0701113.
Liese, F. & Vajda, I. 1987 Convex statistical distances. Leipzig, Germany: Teubner.
Linhart, H. 1988 A test whether two AIC's differ significantly. South African Statist. J. 22, 153–161.
Linhart, H. & Zucchini, W. 1986 Model selection. New York: Wiley.
Mardia, K. V., Kent, J. T. & Bibby, J. M. 1979 Multivariate analysis. New York/London: Academic Press.
Mathai, A. M. & Provost, S. B. 1992 Quadratic forms in random variables. New York/Basel: Marcel Dekker.
McQuarrie, A. D. R. & Tsai, C.-L. 1998 Regression and time series model selection. Singapore: World Scientific.
McQuarrie, A., Shumway, R. & Tsai, C.-L. 1997 The model selection criterion AICu. Statist. Probab. Lett. 34, 285–292.
Noda, K., Miyaoka, E. & Itoh, M. 1996 On bias correction of the Akaike information criterion in linear models. Comm. Statist. Theory Methods 25, 1845–1857.
Ogawa, J. & Olkin, I. 2008 A tale of two countries: The Craig–Sakamoto theorem. J. Statist. Plann. Inference 138, 3419–3428.
Papageorgiou, A., Plionis, M., Basilakos, S. & Ragone-Figueroa, C. 2012 A consistent comparison of bias models using observational data. Monthly Notices Roy. Astronom. Soc. 422, 106–116. Available on-line as arXiv:1201.4878.
Pardo, L. 2006 Statistical inference based on divergence measures. Boca Raton, FL: Chapman & Hall/CRC.
Parr, W. C. 1981 Minimum distance estimation: A bibliography. Comm. Statist. Theory Methods 10, 1205–1224.
Parr, W. C. & Schucany, W. R. 1980 Minimum distance and robust estimation. J. Amer. Statist. Assoc. 75, 616–624.
Reschenhofer, E. 1999 Improved estimation of the expected Kullback–Leibler discrepancy in case of misspecification. Econometric Theory 15, 377–387.
Sahler, W. 1968 A survey on distribution-free statistics based on distances between distribution functions. Metrika 13, 149–169.
Sakamoto, Y., Ishiguro, M. & Kitagawa, G. 1986 Akaike information criterion statistics. Tokyo: KTK Scientific Publishers.
Sawa, T. 1978 Information criteria for discriminating among alternative regression models. Econometrica 46, 1273–1291.
Shi, K., Huang, Y. F. & Lu, T. 2012 A comprehensive comparison of cosmological models from the latest observational data. Monthly Notices Roy. Astronom. Soc. 426, 2452–2562. Available on-line as arXiv:1207.5875.
Steiger, J. H., Shapiro, A. & Browne, M. W. 1985 On the multivariate asymptotic distribution of sequential chi-square statistics. Psychometrika 50, 253–264.
Stone, M. 1977 An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. J. Roy. Statist. Soc. Ser. B 39, 44–47.
Stone, M. 1978 Cross-validation: A review. Math. Operationsforsch. Statist. Ser. Statist. 9, 127–139.
Sugiura, N. 1978 Further analysis of the data by Akaike's information criterion and the finite corrections. Comm. Statist. Theory Methods 7, 13–26.
Takeuchi, T. T. 2000 Application of the information criterion to the estimation of galaxy luminosity function. Astrophys. Space Sci. 271, 213–226. Available on-line as arXiv:astro-ph/9909324.
Tan, M. Y. J. & Biswas, R. 2012 The reliability of the Akaike information criterion method in cosmological model selection. Monthly Notices Roy. Astronom. Soc. 419, 3292–3303. Available on-line as arXiv:1105.5745.
Trosset, M. W. & Sands, B. N. 1995 On the choice of a discrepancy functional for model selection. Comm. Statist. Theory Methods 24, 2841–2863.
Tsallis, C. 1988 Possible generalization of Boltzmann–Gibbs statistics. J. Statist. Phys. 52, 479–487.
Vuong, Q. H. 1989 Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57, 307–333.
Wald, A. 1943 Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans. Amer. Math. Soc. 54, 426–482.
Wang, H. & Zhang, T.-J. 2012 Constraints on Lemaître–Tolman–Bondi models from observational Hubble parameter data. Astrophys. J. 748, 111 (13 pp.). Available on-line as arXiv:1111.2400.
Weisberg, S. 2005 Applied linear regression, 3rd edn. New York: Wiley.
