Focused Relative Risk Information Criterion for Variable Selection in Linear Regression

July 2020

Nils Lid Hjort
Department of Mathematics, University of Oslo

Abstract

This paper motivates and develops a novel and focused approach to variable selection in linear regression models. For estimating the regression mean $\mu = \mathrm{E}(Y \mid x_0)$, for the covariate vector of a given individual, there is a list of competing estimators, say $\hat\mu_S$ for each submodel $S$. Exact expressions are found for the relative mean squared error risks, when compared to the widest model available, say $\mathrm{mse}_S/\mathrm{mse}_{\rm wide}$. The theory of confidence distributions is used for accurate assessments of these relative risks. This leads to certain Focused Relative Risk Information Criterion scores, and associated FRIC plots and FRIC tables, as well as to Confidence plots to exhibit the confidence the data give in the submodels. The machinery is extended to handle many focus parameters at the same time, with appropriate averaged FRIC scores. The particular case where all available covariate vectors have equal importance yields a new overall criterion for variable selection, balancing complexity and fit in a natural fashion. A connection to the Mallows criterion is demonstrated, leading also to natural modifications of the latter. The FRIC and AFRIC strategies are illustrated for real data.

Key words: focused information criteria, FRIC plots, linear regression, Mallows Cp, variable selection

1 Introduction and summary

Mrs. Jones is pregnant (again). She is white, 40 years old, of average weight 60 kg before pregnancy, and a smoker. What is the expected birth weight of her child-to-come? To address this and similar questions, involving comparison and ranking of many submodels of a given wide regression model, we use a dataset on $n = 189$ mothers and babies, discussed and analysed in Claeskens & Hjort (2008, Ch.
2), and apply linear regression for the birth weight $y$ in terms of five covariates $x_1$ (age), $x_2$ (weight in kg before pregnancy), $x_3$ (indicator for smoker or not), $x_4$ (ethnicity indicator 1), $x_5$ (ethnicity indicator 2), where ‘white’ corresponds to these two being zero, i.e. not belonging to the two other ethnic groups in question. The full linear regression model has

$$ y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_5 x_{i,5} + \varepsilon_i \quad \text{for } i = 1, \ldots, n, \tag{1.1} $$

with the $\varepsilon_i$ assumed i.i.d. from a normal $\mathrm{N}(0, \sigma^2)$. The task is to estimate $\mu = \mathrm{E}(Y \mid x_0)$, with $x_0 = (x_{0,1}, \ldots, x_{0,5})$ the covariate vector for Mrs. Jones. For each of the $2^5 = 32$ submodels, corresponding to taking covariates in and out of the above regression equation, there is a point estimate, say $\hat\mu_S$, with $S$ a subset of $\{1, \ldots, 5\}$, and these are plotted on the vertical axis of the FRIC plot of Figure 1.1, along with submodel-specific 80% confidence intervals, to indicate their relative precision. The crucial new and extra aspect of the plot is the set of FRIC scores, plotted on the horizontal axis. These Focused Relative Risk Information Criterion scores are accurately constructed estimates of the relative risk, the ratio of mean squared errors, relative to the wide model, i.e.

$$ \mathrm{rr}_S = \frac{\mathrm{mse}_S}{\mathrm{mse}_{\rm wide}} = \frac{\mathrm{E}\,(\hat\mu_S - \mu)^2}{\mathrm{E}\,(\hat\mu_{\rm wide} - \mu)^2} \quad \text{for } S \text{ a subset of } \{1, \ldots, 5\}. \tag{1.2} $$

Thus models with FRIC scores below 1 are judged to be better than the full wide model of (1.1), for the given purpose of estimating the mean well for the specified covariate vector; FRIC values higher than 1 indicate that the submodel does a worse job than the wide model itself. Only the wide model based estimator $\hat\mu_{\rm wide}$ is guaranteed to have zero bias, so the statistical game is to use the data to hunt for submodels leading to lower variances and biases not far from zero.
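The bookkeeping behind the $2^5 = 32$ candidate models can be sketched in a few lines. The snippet below is an illustration only: it enumerates every subset $S$ of $\{1, \ldots, 5\}$ and encodes it as a 0/1 in-or-out vector. The function names `all_submodels` and `in_or_out` are hypothetical and not taken from the authors' software.

```python
from itertools import combinations

def all_submodels(p):
    """Return every subset S of {1, ..., p} as a tuple, the empty set included."""
    indices = range(1, p + 1)
    return [S for size in range(p + 1) for S in combinations(indices, size)]

def in_or_out(S, p):
    """Encode the subset S as a 0/1 indicator vector of length p."""
    return tuple(1 if j in S else 0 for j in range(1, p + 1))

submodels = all_submodels(5)
print(len(submodels))            # number of candidate models
print(in_or_out((2, 3), 5))      # indicator for the model keeping x2 and x3
```

The empty subset corresponds to the intercept-only model and the full subset to the wide model (1.1).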
[Figure 1.1 plot: horizontal axis FRIC (ticks 0 to 3), vertical axis birthweight estimates (ticks 2.8 to 3.4).]

Figure 1.1: FRIC plot for the $2^5 = 32$ models for estimating the birthweight of the child-to-come, for Mrs. Jones (white, age 40, 60 kg, smoker). The FRIC scores are estimates of the relative risks $\mathrm{rr}_S = \mathrm{mse}_S/\mathrm{mse}_{\rm wide}$ of (1.2); the blue circles are the associated point estimates; and the vertical lines are submodel-based 80% confidence intervals. Here 16 submodels have FRIC scores smaller than 1 and are judged to be better than the wide model. See Table 1.1 for identification of the best models and their FRIC scores and estimates.

1.1 FRIC plots and tables for relative risks

Inside the natural framework of linear regression models, basic formulae for the required quantities are worked out in Section 2. The operating assumption is that the wide model, with all covariates on board, is in force. The denominator of (1.2) is standard, whereas more care is needed to find a fruitful formula for the numerator, since submodels will carry biases. In Section 3 we construct several natural estimators for the relative risk quantities $\mathrm{rr}_S$, and such are indeed used to produce Figure 1.1. Importantly, as part of this development we construct informative and exact confidence distributions for the relative risks. These are data-driven cumulative distribution functions $C_S(\mathrm{rr}_S, \mathrm{data})$ with the property

$$ \mathrm{pr}_{\beta,\sigma}\{\mathrm{rr}_S\colon C_S(\mathrm{rr}_S, \mathrm{data}) \le \alpha\} = \alpha \quad \text{for all } \alpha \in (0, 1). \tag{1.3} $$

In (1.3) the ‘data’ are random, with distribution governed by (1.1), and the identity, being valid for all $(\beta, \sigma)$, secures that accurate confidence intervals at all levels can be read off for the relative risks $\mathrm{mse}_S/\mathrm{mse}_{\rm wide}$. As we show in Section 3, these confidence distributions also start out with certain pointmasses at observable minimum positions, say $\mathrm{rr}_{S,\min}$.
Natural confidence intervals for the relative risks $\mathrm{rr}_S$ therefore do not take the usual form of $\widehat{\mathrm{rr}}_S \pm z_\alpha$, say, an estimate along with a plus-minus estimated error. Instead, confidence intervals induced by the confidence distributions (1.3) are often asymmetric, and sometimes start out at the indicated minimum possible value; see indeed Figure 3.1.

[Figure 1.2 plot: horizontal axis confidence in submodel (ticks 0.0 to 1.0), vertical axis birthweight estimates (ticks 2.8 to 3.4).]

Figure 1.2: Confidence FRIC plots for the $2^5 = 32$ models for estimating the birthweight of the child-to-come, for Mrs. Jones (white, age 40, 60 kg, smoker). The $\mathrm{conf}(S)$ values are also p-values for testing $\mathrm{mse}_S \le \mathrm{mse}_{\rm wide}$, so submodels to the left in the figure are found not useful for the task of estimating the focus parameter, the mean $\mu = \mathrm{E}(Y \mid x_0)$. Submodels to the far right are those in which we place trust in their ability to do better than the wide model. See Table 1.1 for identification of the best models and their $\mathrm{conf}(S)$ scores and estimates.

Two tools flowing from these insights are as follows. First, one may use the median confidence estimators $\widehat{\mathrm{rr}}{}^{0.50}_S = C_S^{-1}(0.50, \mathrm{data})$ as FRIC scores when constructing the FRIC plots. This bypasses certain difficulties otherwise encountered, related to a decision of whether one needs to truncate negative estimates of squared biases to zero or not; see the discussion of Cunen & Hjort (2020). Second, the epistemic confidence in a submodel’s ability to work better than the wide model can be read off separately, i.e.

$$ \mathrm{conf}(S) = C_S(1, \mathrm{data}) = \mathrm{pr}^*\{\mathrm{mse}_S/\mathrm{mse}_{\rm wide} \le 1\} \quad \text{for } S \text{ a subset of } \{1, \ldots, 5\}. \tag{1.4} $$

Here $\mathrm{pr}^*$ denotes the post-data epistemic probability placed on the event $\mathrm{mse}_S \le \mathrm{mse}_{\rm wide}$; see further discussion in Section 3.
Good models, again for the focused purpose of estimating a given mean well, are those for which these $\mathrm{conf}(S)$ scores are high; conversely, candidate models where that score is low can be seen as not doing well for the purpose. Indeed, $\mathrm{conf}(S)$ is the p-value for testing the hypothesis that $\mathrm{mse}_S \le \mathrm{mse}_{\rm wide}$. Figure 1.2 displays such a Confidence plot, with these confidence scores for the case of Mrs. Jones.

        FRIC table                                  Conf table
rank    in-or-out    $\hat\mu_S$    FRIC    conf     in-or-out    $\hat\mu_S$    FRIC    conf
1       0 0 0 0 0    2.945    0.076   0.845    0 0 1 1 1    2.908    0.224   0.943
2       0 1 0 0 0    2.956    0.076   0.815    0 1 1 1 1    2.922    0.226   0.900
3       0 0 0 1 0    2.981    0.088   0.744    0 1 1 1 0    2.844    0.209   0.863
4       0 1 0 1 0    3.008    0.090   0.667    0 0 0 0 0    2.945    0.076   0.845
5       0 0 0 0 1    3.022    0.117   0.623    0 1 1 0 1    2.833    0.203   0.832
6       0 1 0 0 1    3.011    0.118   0.654    0 0 1 0 1    2.829    0.203   0.821
7       0 0 1 0 0    2.773    0.193   0.655    0 1 0 0 0    2.956    0.076   0.815
8       0 1 1 0 0    2.791    0.195   0.707    0 0 1 1 0    2.809    0.205   0.759
9       0 0 1 0 1    2.829    0.203   0.821    0 0 0 1 0    2.981    0.088   0.744
10      0 1 1 0 1    2.833    0.203   0.832    0 1 1 0 0    2.791    0.195   0.707
17      1 1 1 1 1    2.889    1.000   –        –            –        –       –

Table 1.1: $\mathrm{FRIC}_S$ (left part) and $\mathrm{conf}(S)$ (right part) tables for Mrs. Jones, where the task is precise estimation of $\mu = \mathrm{E}(Y \mid x_0)$ for her given covariate vector $x_0$. The left part shows the ten best submodels of the $2^5 = 32$ candidate models, sorted by FRIC scores, see Figure 1.1; the right part the ten best submodels, sorted by the $\mathrm{conf}(S)$ scores, see Figure 1.2. Models with low FRIC scores tend to have high $\mathrm{conf}(S)$ scores, and vice versa. The in-or-out columns indicate submodels by presence-absence of the covariates $x_1, \ldots, x_5$.

The FRIC plot and Confidence plot are already informative, displaying the many submodel estimates along with accurate information about estimated relative risks and about degree of confidence.
One might proceed by taking an average of, say, the ten best point estimates, or with more elaboration form model average estimators of the form $\hat\mu^* = \sum_S v(S)\,\hat\mu_S$, with weights $v(S)$ normalised to sum to one, and chosen higher for the best models, according to either the FRIC or the $\mathrm{conf}(S)$ scores. Further pertinent statistical information may be gleaned from inspection of which models do well, however, and also from examination of those which do badly. Along with FRIC plots and Confidence plots, therefore, our R programme produces tables of the form shown here in Table 1.1. Interestingly, for the covariate information $x_0$ for Mrs. Jones, age doesn’t matter, as the covariate $x_1$ does not enter any of the ten best candidate models.

Naturally, an essential point for the FRIC and Confidence plots and tables is that they are focused; they work for one given covariate vector $x_0$ at a time. They deliver personalised model ranking lists and top estimates for the focused purpose. If Mrs. Jones were a non-smoker, we put $x_3$ equal to 0 instead of 1, run our FRIC programmes again, and find new plots and tables, new top models, and of course new estimates of the pertinent $\mu = \mathrm{E}(Y \mid x_0)$ parameter. For the counterfactual non-smoking Mrs. Jones one then learns that age $x_1$ matters, after all, and that the smoker indicator $x_3$ enters several more of the top ranked models; also, the average estimate over the ten top models becomes $\hat\mu = 3.197$ kg, rather than the 2.915 kg she can expect as a smoker (of the same age, weight, and ethnicity).

These general points are broadly valid for the growing list of FIC methods in the literature, where the Focused Information Criteria, introduced and developed in Claeskens & Hjort (2003, 2008), have such focused aims, of finding the best models for given focus parameters.
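The simple and weighted model averages mentioned above can be sketched as follows, using the ten best FRIC-ranked estimates and their $\mathrm{conf}(S)$ scores as printed in the left part of Table 1.1. The helper `model_average` is a hypothetical name, and weighting proportionally to $\mathrm{conf}(S)$ is only one illustrative choice of $v(S)$, not a scheme prescribed by the paper.

```python
# Ten best FRIC-ranked submodels for Mrs. Jones, copied from the left part of Table 1.1:
# (point estimate in kg, conf(S) score).
top_ten = [
    (2.945, 0.845), (2.956, 0.815), (2.981, 0.744), (3.008, 0.667), (3.022, 0.623),
    (3.011, 0.654), (2.773, 0.655), (2.791, 0.707), (2.829, 0.821), (2.833, 0.832),
]

def model_average(estimates, raw_weights):
    """Weighted average sum_S v(S) * mu_S, with the weights normalised to sum to one."""
    total = sum(raw_weights)
    weights = [w / total for w in raw_weights]
    return sum(v * mu for v, mu in zip(weights, estimates)), weights

estimates = [mu for mu, _ in top_ten]
conf_scores = [c for _, c in top_ten]

# Equal weights: the plain average over the ten top models.
equal_avg, _ = model_average(estimates, [1.0] * len(estimates))
# Weights proportional to conf(S): an illustrative assumption only.
conf_avg, conf_weights = model_average(estimates, conf_scores)

print(round(equal_avg, 3))
print(round(conf_avg, 3))
```

With equal weights the average of the ten tabulated estimates rounds to the 2.915 kg quoted in the text for Mrs. Jones as a smoker.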
These FIC type strategies have later been generalised and extended in several directions, and for new classes of models; see Claeskens, Cunen & Hjort (2019) for a review. The present paper is different in that the FRIC methods developed here are accurate and exact, with no large-sample approximations involved; in a sense finite-sample corrections are not required as such are already built into the paper’s exact formulae. This is partly due to the relative simplicity of the linear normal regression model and its submodels. The exact finite-sample formulae worked out in sections below continue to be valid, as good approximations, in the case of error distributions not being exactly normal; the more vital assumptions are those of the linear mean, the constant variance, and independence. Similar FRIC methods might be set up for addressing, approximating, estimating, and ranking relative risks, say $\mathrm{mse}_S/\mathrm{mse}_{\rm wide}$ for general focus parameters in generalised linear models. Such constructions would need some elements of large-sample approximations, however, for variances, biases, and then the assessment of estimates for these again; see indeed Cunen & Hjort (2020) for such developments.

1.2 The present paper

As explained and exemplified with the case of Mrs. Jones above, the development in Sections 2 and 3 concerns given focus parameters, giving in particular FRIC formulae for estimating relative risks and also focused confidence distributions, for each potential $\mu(x_0) = \mathrm{E}(Y \mid x_0)$. The apparatus can also be used for assessing and ranking all candidate models for their ability to estimate a particular regression parameter well. Importantly, the machinery can be extended to handling an ensemble of focus parameters jointly, as when one needs submodels working well for certain regions of covariates, like the stratum of all pregnant women of age below twenty in the birthweight study.
Such extensions, leading to certain AFRIC scores, averaging over focus parameters, are developed in Section 4. A notable special case is when all available covariate vectors are considered equally important, leading to particularly illuminating AFRIC formulae, balancing overall fit with model complexity. This ‘unfocused AFRIC’ is demonstrated to have a connection to the Mallows Cp criterion in Section 5, which also leads to both a modification of and additional insights into the Mallows statistic. The AFRIC is illustrated using the birthweight dataset. Our paper is then rounded off with a list of concluding remarks in Section 6, some pointing to further research issues. These touch on model averaging, on generalisations to more complex regression models, and on certain simplifications valid under special circumstances.

Whereas the present paper presents novel variable selection methods, via focused assessment of relative risks, it is clear that other FIC based model selection methods also can work well; see the references pointed to above, and the review by Claeskens, Cunen & Hjort (2019). There is an enormous literature on model selection in general and variable and subset selection methods for regressions in particular, see Claeskens & Hjort (2008) and e.g. Miller (2002). For the literature on confidence distributions and their many related themes, see Xie & Singh (2013), Schweder & Hjort (2016), Hjort & Schweder (2018).

2 Submodels and relative risks

Consider the classic linear regression model

$$ y_i = x_i^{\rm t}\beta + \varepsilon_i = \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \varepsilon_i \quad \text{for } i = 1, \ldots, n, \tag{2.1} $$

with the error terms $\varepsilon_i$ being i.i.d. from $\mathrm{N}(0, \sigma^2)$, and with covariate vectors $x_i$ of dimension $p$. We call this the wide model, as we shall be considering the submodels associated with selecting covariates for inclusion and exclusion.
In the present and the following section we develop the necessary formalism for handling all submodels, find explicit formulae for the mean squared errors $\mathrm{mse}_S$ and $\mathrm{mse}_{\rm wide}$ of (1.2), construct several natural estimators for the relative risks $\mathrm{mse}_S/\mathrm{mse}_{\rm wide}$, and also develop the confidence distribution tools pointed to in (1.3). This is the basis for the FRIC and Confidence plots of the type shown in Figures 1.1 and 1.2, and also of FRIC and Confidence tables as in Table 1.1.

The least squares estimator for $\beta$ in the wide model is

$$ \hat\beta_{\rm wide} = \Sigma_n^{-1}\, n^{-1} \sum_{i=1}^n x_i y_i, \quad \text{with } \Sigma_n = n^{-1} \sum_{i=1}^n x_i x_i^{\rm t}, $$

and with this $p \times p$ matrix assumed to have full rank. The traditional and unbiased estimator of the residual variance is $\hat\sigma^2 = (n - p)^{-1} \sum_{i=1}^n (y_i - x_i^{\rm t}\hat\beta_{\rm wide})^2$, and under normality we all know that

$$ \hat\beta_{\rm wide} \sim \mathrm{N}_p(\beta, (\sigma^2/n)\Sigma_n^{-1}), \quad \text{independent of } \hat\sigma^2/\sigma^2 \sim \chi^2_m/m, $$

with the chi-squared having degrees of freedom $m = n - p$.

Now consider a submodel, employing only $x_{i,j}$ for $j \in S$, a subset of $\{1, \ldots, p\}$. We need some notation for handling these submodels and their ensuing estimators. Let $\pi_S$ be the projection function associated with $S$, so that $\pi_S u = u_S$ picks the elements $u_j$ of $u = (u_1, \ldots, u_p)$ for which $j \in S$; $\pi_S$ can then be written as a matrix of size $|S| \times p$ with 0s and 1s, with $|S|$ the number of elements in $S$. The $\pi_S$ matrix can also be seen as containing the rows $j$ of the $p \times p$ identity matrix corresponding to $j \in S$. The submodel in question, indexed by $S$, has mean function $x_{i,S}^{\rm t}\beta_S$, and least squares estimator

$$ \hat\beta_S = \Sigma_{n,S}^{-1}\, n^{-1} \sum_{i=1}^n x_{i,S}\, y_i, \quad \text{with } \Sigma_{n,S} = n^{-1} \sum_{i=1}^n x_{i,S}\, x_{i,S}^{\rm t} = \pi_S \Sigma_n \pi_S^{\rm t}. \tag{2.2} $$

Variable selection may serve several purposes.
Here we shall focus attention on the task of estimating a given focus parameter, namely $\mu = \mathrm{E}(Y \mid x_0) = x_0^{\rm t}\beta$, for a given covariate vector $x_0$, perhaps associated with some given individual or object. This is exemplified by the pregnant Mrs. Jones featured in our introduction section. Thus there are candidate estimators $\hat\mu_S = x_{0,S}^{\rm t}\hat\beta_S$, one for each $S$. The mean squared error (mse) of the wide model estimator is easily written down, since it is unbiased, and one finds

$$ \mathrm{mse}_{\rm wide} = \mathrm{Var}\, x_0^{\rm t}\hat\beta_{\rm wide} = (\sigma^2/n)\, x_0^{\rm t}\Sigma_n^{-1}x_0. \tag{2.3} $$

Setting up a clear formula for the mse of the $S$ based estimator is somewhat more tricky. Its variance is $\mathrm{Var}\,\hat\mu_S = (\sigma^2/n)\, x_{0,S}^{\rm t}\Sigma_{n,S}^{-1}x_{0,S}$, but some algebraic efforts are needed to sort out its bias in a fruitful fashion. Write

$$ \Sigma_n = \begin{pmatrix} \Sigma_{00,S} & \Sigma_{01,S} \\ \Sigma_{10,S} & \Sigma_{11,S} \end{pmatrix}, \quad \text{with inverse } \Sigma_n^{-1} = \begin{pmatrix} \Sigma^{00,S} & \Sigma^{01,S} \\ \Sigma^{10,S} & \Sigma^{11,S} \end{pmatrix}, \tag{2.4} $$

with the relevant blocking into blocks inside and outside $S$. Thus $\Sigma_{00,S}$ is the same as $\Sigma_{n,S}$ of (2.2), etc.; also, the matrix

$$ Q_S = \Sigma^{11,S} = (\Sigma_{11,S} - \Sigma_{10,S}\Sigma_{00,S}^{-1}\Sigma_{01,S})^{-1} \tag{2.5} $$

will be needed below, of size $|S^c| \times |S^c|$, with $S^c$ the complement set of $S$.

Lemma 1. For estimating $\mu = x_0^{\rm t}\beta$, the submodel based estimator $\hat\mu_S = x_{0,S}^{\rm t}\hat\beta_S$ has bias $\omega_S^{\rm t}\beta_{S^c}$, with $\omega_S = \Sigma_{10,S}\Sigma_{00,S}^{-1}x_{0,S} - x_{0,S^c}$, of length $|S^c| = p - |S|$. The identity $x_0^{\rm t}\Sigma_n^{-1}x_0 = x_{0,S}^{\rm t}\Sigma_{n,S}^{-1}x_{0,S} + \omega_S^{\rm t}Q_S\omega_S$ holds. The mean squared error of $\hat\mu_S$ can be expressed as

$$ \begin{aligned} \mathrm{mse}_S &= (\sigma^2/n)\, x_{0,S}^{\rm t}\Sigma_{n,S}^{-1}x_{0,S} + (\omega_S^{\rm t}\beta_{S^c})^2 \\ &= (\sigma^2/n)\{ x_{0,S}^{\rm t}\Sigma_{n,S}^{-1}x_{0,S} + n(\omega_S^{\rm t}\beta_{S^c})^2/\sigma^2 \} \\ &= (\sigma^2/n)\{ x_0^{\rm t}\Sigma_n^{-1}x_0 + n(\omega_S^{\rm t}\beta_{S^c})^2/\sigma^2 - \omega_S^{\rm t}Q_S\omega_S \}. \end{aligned} $$

Proof.
We start from

$$ \mathrm{E}\,\hat\beta_S = \Sigma_{n,S}^{-1}\, n^{-1} \sum_{i=1}^n x_{i,S}\,(x_{i,S}^{\rm t}\beta_S + x_{i,S^c}^{\rm t}\beta_{S^c}) = \Sigma_{00,S}^{-1}(\Sigma_{00,S}\beta_S + \Sigma_{01,S}\beta_{S^c}), $$

which leads to the bias expression

$$ \mathrm{E}\,\hat\mu_S - \mu = x_{0,S}^{\rm t}(\beta_S + \Sigma_{00,S}^{-1}\Sigma_{01,S}\beta_{S^c}) - x_{0,S}^{\rm t}\beta_S - x_{0,S^c}^{\rm t}\beta_{S^c} = (x_{0,S}^{\rm t}\Sigma_{00,S}^{-1}\Sigma_{01,S} - x_{0,S^c}^{\rm t})\beta_{S^c}, $$

proving the first assertion. The identity decomposing $x_0^{\rm t}\Sigma_n^{-1}x_0$ into two parts is proven via the required algebraic manipulations and patience; also, the expressions for $\mathrm{mse}_S$ follow readily.

The lemma, combined with (2.3), leads to the relative risk expression

$$ \mathrm{rr}_S = \frac{\mathrm{mse}_S}{\mathrm{mse}_{\rm wide}} = \frac{x_{0,S}^{\rm t}\Sigma_{n,S}^{-1}x_{0,S} + n\lambda_S^2}{x_0^{\rm t}\Sigma_n^{-1}x_0}, \quad \text{where } \lambda_S = \omega_S^{\rm t}\beta_{S^c}/\sigma, \tag{2.6} $$

the point also being that the $\sigma^2/n$ term cancels out. In particular, submodel $S$ is better than the wide model provided

$$ |\lambda_S| = |\omega_S^{\rm t}\beta_{S^c}|/\sigma < (x_0^{\rm t}\Sigma_n^{-1}x_0 - x_{0,S}^{\rm t}\Sigma_{n,S}^{-1}x_{0,S})^{1/2}/\sqrt{n} = (\omega_S^{\rm t}Q_S\omega_S)^{1/2}/\sqrt{n}. \tag{2.7} $$

Note that with more data it typically becomes increasingly harder for a simple model to beat the wide model, but for moderate datasets it is perfectly possible for (2.7) to hold, depending on the size of the implied bias $\omega_S^{\rm t}\beta_{S^c}$. It is noteworthy that if $\beta_{S^c}$ is close to being orthogonal to $\omega_S$, then $|\lambda_S|$ is small and the submodel in question may do very well, even if $\beta_{S^c}$ is far from zero, i.e. even if the submodel is far from being correct as such. This is a version of the classical bias-versus-variance balancing game, which now actively depends on the given $x_0$ under focus.

Of course the statistician cannot directly know whether (2.7) holds or not, since $\beta_{S^c}/\sigma$ is unknown, but clear estimators may be constructed for the risk ratios (2.6), of the form

$$ \mathrm{FRIC}_S = \widehat{\mathrm{rr}}_S = \frac{x_{0,S}^{\rm t}\Sigma_{n,S}^{-1}x_{0,S} + \hat\Lambda_S}{x_0^{\rm t}\Sigma_n^{-1}x_0}, \quad \text{with } \hat\Lambda_S \text{ estimating } \Lambda_S = n\lambda_S^2. \tag{2.8} $$

There are several natural such, as we come back to below.
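A small numerical check can make Lemma 1 and the relative risk (2.6) concrete. The sketch below treats the simplest nontrivial case, $p = 2$ with $S = \{1\}$, so that every block of (2.4) is a scalar; it assumes the true $\beta_{S^c}$ and $\sigma$ are known, which is of course only possible in a thought experiment. The function name and all numbers are hypothetical.

```python
import math

def relative_risk_two_covariates(Sigma, x0, beta_out, sigma, n):
    """Relative risk rr_S of (2.6) for p = 2 and S = {1}; all blocks are scalars here."""
    (s00, s01), (s10, s11) = Sigma
    x0_in, x0_out = x0
    omega = s10 / s00 * x0_in - x0_out            # omega_S of Lemma 1
    Q = 1.0 / (s11 - s10 * s01 / s00)             # Q_S of (2.5)
    det = s00 * s11 - s01 * s10
    # x_0^t Sigma_n^{-1} x_0, using the explicit inverse of a 2 x 2 matrix
    wide_quad = (s11 * x0_in**2 - 2.0 * s01 * x0_in * x0_out + s00 * x0_out**2) / det
    sub_quad = x0_in**2 / s00                     # x_{0,S}^t Sigma_{n,S}^{-1} x_{0,S}
    lam = omega * beta_out / sigma                # lambda_S of (2.6)
    rr = (sub_quad + n * lam**2) / wide_quad
    threshold = math.sqrt(omega**2 * Q / n)       # right-hand side of (2.7)
    return {"omega": omega, "Q": Q, "wide_quad": wide_quad, "sub_quad": sub_quad,
            "lambda": lam, "rr": rr, "threshold": threshold}

# Hypothetical numbers: an intercept plus one covariate with mean 2 and mean square 5.
out = relative_risk_two_covariates(Sigma=((1.0, 2.0), (2.0, 5.0)), x0=(1.0, 2.5),
                                   beta_out=0.1, sigma=1.0, n=50)
print(out["rr"], abs(out["lambda"]) < out["threshold"])
```

In this toy setting the intercept-only model has relative risk below 1 exactly when condition (2.7) holds, and the two quadratic forms differ by $\omega_S^{\rm t}Q_S\omega_S$, as the lemma states.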
For each of these constructions, we have a FRIC plot and a FRIC table, as explained and exemplified in the introduction section.

It is fruitful to factor in the particular variance $\tau_S^2 = \mathrm{Var}\,(\omega_S^{\rm t}\sqrt{n}\,\hat\beta_{{\rm wide},S^c}) = \sigma^2\,\omega_S^{\rm t}Q_S\omega_S$, with $Q_S = \Sigma^{11,S}$ the matrix of (2.5). Now consider the parameter

$$ \kappa_S = \frac{\omega_S^{\rm t}\sqrt{n}\,\beta_{S^c}}{\tau_S} = \frac{\sqrt{n}\,\omega_S^{\rm t}\beta_{S^c}}{(\omega_S^{\rm t}Q_S\omega_S)^{1/2}\,\sigma}, \tag{2.9} $$

with estimator

$$ \hat\kappa_S = \frac{\omega_S^{\rm t}\sqrt{n}\,\hat\beta_{{\rm wide},S^c}}{\hat\tau_S} = \frac{\omega_S^{\rm t}\sqrt{n}\,\hat\beta_{{\rm wide},S^c}}{(\omega_S^{\rm t}Q_S\omega_S)^{1/2}\,\hat\sigma}, \quad \text{with distribution } \frac{\mathrm{N}(\kappa_S, 1)}{\hat\sigma/\sigma} \sim t_m(\kappa_S), \tag{2.10} $$

the noncentral $t_m$ distribution with excentre parameter $\kappa_S$. Also, $\hat\kappa_S^2 \sim F_{1,m}(\kappa_S^2)$, the noncentral $F$ with degrees of freedom $(1, m)$ and excentre parameter $\kappa_S^2$. With this $\kappa_S$ we have the exact relative risk expression

$$ \mathrm{rr}_S = \frac{\mathrm{mse}_S}{\mathrm{mse}_{\rm wide}} = \frac{x_{0,S}^{\rm t}\Sigma_{n,S}^{-1}x_{0,S} + \omega_S^{\rm t}Q_S\omega_S\,\kappa_S^2}{x_0^{\rm t}\Sigma_n^{-1}x_0} = 1 - \frac{\omega_S^{\rm t}Q_S\omega_S}{x_0^{\rm t}\Sigma_n^{-1}x_0}\,(1 - \kappa_S^2), \tag{2.11} $$

cf. (2.6). In this risk ratio expression all quantities are known, from the $n \times p$ covariate matrix of the $x_i$, apart from $\kappa_S$.

For estimating the relative risk ratios (2.11), for comparing and ranking the different submodels, and in particular to decide on which submodel is the best for the given purpose of estimating $x_0^{\rm t}\beta$, there are a few related options. Using $\hat\kappa_S^2$ directly leads to overshooting, as

$$ \mathrm{E}\,\hat\kappa_S^2 = \mathrm{E}\,\frac{\kappa_S^2 + 1}{\hat\sigma^2/\sigma^2} = \frac{m}{m - 2}\,(\kappa_S^2 + 1). $$

Two natural estimators of the $\kappa_S^2$ term are hence

$$ \frac{m - 2}{m}\,(\hat\kappa_S^2 - 1) \quad \text{and} \quad \max\Bigl\{ \frac{m - 2}{m}\,(\hat\kappa_S^2 - 1),\, 0 \Bigr\}. $$

The first is unbiased for $\kappa_S^2$, the second is the truncated version where negative estimates of the nonnegative quantity are set to zero. This leads to the unbiased FRIC scores, say

$$ \begin{aligned} \mathrm{FRIC}^u_S &= \Bigl\{ x_{0,S}^{\rm t}\Sigma_{n,S}^{-1}x_{0,S} + \frac{m - 2}{m}\,\omega_S^{\rm t}Q_S\omega_S\,(\hat\kappa_S^2 - 1) \Bigr\} \Big/ x_0^{\rm t}\Sigma_n^{-1}x_0 \\ &= \Bigl[ x_{0,S}^{\rm t}\Sigma_{n,S}^{-1}x_{0,S} + \frac{m - 2}{m} \Bigl\{ \frac{n(\omega_S^{\rm t}\hat\beta_{{\rm wide},S^c})^2}{\hat\sigma^2} - \omega_S^{\rm t}Q_S\omega_S \Bigr\} \Bigr] \Big/ x_0^{\rm t}\Sigma_n^{-1}x_0, \end{aligned} \tag{2.12} $$

and its truncated cousin, say $\mathrm{FRIC}^t_S$, where the squared bias estimator component, i.e. the second term of the numerator here, is set to zero in cases where $\hat\kappa_S^2 < 1$. The probability of this event taking place is $F_{1,m}(1, \kappa_S^2)$, which can be as high as $\mathrm{pr}\{\chi^2_1 < 1\} = 0.683$ in the case of zero bias and $m = n - p$ high. The two related sets of scores $\mathrm{FRIC}^u_S$ and $\mathrm{FRIC}^t_S$ otherwise tend to be highly correlated and hence to produce highly related model rankings. For Figures 1.1–1.2 and Table 1.1, we have used the truncated version of (2.12).

3 Confidence distributions for the relative risks

Above we were able to derive clear expressions for the relative risks $\mathrm{rr}_S = \mathrm{mse}_S/\mathrm{mse}_{\rm wide}$, as with (2.11), and then constructed natural FRIC scores for estimating these, essentially via understanding the extent to which $\hat\kappa_S^2$ of (2.10) overshoots the $\kappa_S^2$ parameter of (2.9). These scores have indeed been used for the FRIC Figure 1.1 and FRIC Table 1.1.

[Figure 3.1 plot: horizontal axis relative risk (ticks 0 to 5), vertical axis confidence (ticks 0.0 to 1.0).]

Figure 3.1: Confidence cumulative distribution functions $C_S(\mathrm{rr}_S)$ for the relative risks $\mathrm{risk}_S/\mathrm{risk}_{\rm wide}$, as per (3.2). Those with high confidence in values below 1 are better for the focused prediction job for Mrs. Jones’s baby than the wide model. The median confidence estimates $\mathrm{FRIC}^{0.50}_S$ are also read off from the plot.

Importantly, we may also set up a clear and exact confidence distribution for the $\kappa_S^2$ parameter,

$$ C^*_S(\kappa_S^2, \mathrm{data}) = \mathrm{pr}_{\kappa_S}\{\hat\kappa_S^2 \ge \hat\kappa^2_{S,{\rm obs}}\} = 1 - F_{1,m}(\hat\kappa^2_{S,{\rm obs}}, \kappa_S^2), \tag{3.1} $$

where again $m = n - p$. It may be used for more nuanced inference, complete with accurate confidence intervals at all levels, which we also exploit below on the more canonical $\mathrm{rr}_S$ scale.
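Before the confidence distribution is taken further, the point estimates of (2.12) can be assembled in a few lines once the two quadratic forms and $\hat\kappa_S^2$ are available. The sketch below follows (2.12) as printed, using the identity of Lemma 1 for $\omega_S^{\rm t}Q_S\omega_S$; the function name `fric_scores` and the input values are hypothetical, and $m = 183$ would correspond to $n = 189$ with $p = 6$ if the intercept is counted among the regression coefficients (an assumption).

```python
def fric_scores(sub_quad, wide_quad, kappa2_hat, m):
    """Unbiased and truncated FRIC scores of (2.12).

    sub_quad  : x_{0,S}^t Sigma_{n,S}^{-1} x_{0,S}
    wide_quad : x_0^t Sigma_n^{-1} x_0
    kappa2_hat: squared estimate of kappa_S from (2.10)
    m         : residual degrees of freedom n - p (must exceed 2)
    """
    omega_Q_omega = wide_quad - sub_quad          # identity of Lemma 1
    bias_term = (m - 2) / m * omega_Q_omega * (kappa2_hat - 1.0)
    fric_u = (sub_quad + bias_term) / wide_quad
    fric_t = (sub_quad + max(bias_term, 0.0)) / wide_quad   # truncated when kappa2_hat < 1
    return fric_u, fric_t

# Hypothetical inputs, one with a small and one with a large kappa2_hat.
low = fric_scores(sub_quad=1.0, wide_quad=1.25, kappa2_hat=0.2, m=183)
high = fric_scores(sub_quad=1.0, wide_quad=1.25, kappa2_hat=4.0, m=183)
print(low, high)
```

The two versions differ only when $\hat\kappa_S^2 < 1$, in which case the truncated score equals the minimum possible value, the ratio of the two quadratic forms.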
Note that the confidence distribution here has a pointmass at zero, of size $C^*_S(0, \mathrm{data}) = 1 - F_{1,m}(\hat\kappa_S^2)$, with $F_{1,m}(u)$ the cumulative distribution function for the ordinary $F$ distribution with degrees of freedom $(1, m)$. Furthermore, since

$$ \kappa_S^2 = \frac{x_0^{\rm t}\Sigma_n^{-1}x_0\,\mathrm{rr}_S - x_{0,S}^{\rm t}\Sigma_{n,S}^{-1}x_{0,S}}{\omega_S^{\rm t}Q_S\omega_S}, $$

there is a clear, full, and exact confidence distribution for the relative risk ratio $\mathrm{rr}_S = \mathrm{mse}_S/\mathrm{mse}_{\rm wide}$, namely

$$ C_S(\mathrm{rr}_S, \mathrm{data}) = 1 - F_{1,m}\Bigl( \hat\kappa_S^2,\ \frac{x_0^{\rm t}\Sigma_n^{-1}x_0\,\mathrm{rr}_S - x_{0,S}^{\rm t}\Sigma_{n,S}^{-1}x_{0,S}}{\omega_S^{\rm t}Q_S\omega_S} \Bigr), \tag{3.2} $$

valid above a well-defined minimum point, $\mathrm{rr}_S \ge \mathrm{rr}_{S,\min} = x_{0,S}^{\rm t}\Sigma_{n,S}^{-1}x_{0,S} / x_0^{\rm t}\Sigma_n^{-1}x_0$. As in Cunen & Hjort (2020) it is useful to display all these confidence distributions in a diagram, with their different starting points. It is also useful to read off all

$$ C_S(1, \mathrm{data}) = 1 - F_{1,m}\Bigl( \hat\kappa_S^2,\ \frac{x_0^{\rm t}\Sigma_n^{-1}x_0 - x_{0,S}^{\rm t}\Sigma_{n,S}^{-1}x_{0,S}}{\omega_S^{\rm t}Q_S\omega_S} \Bigr) = 1 - F_{1,m}(\hat\kappa_S^2, 1), $$

the epistemic confidence probability that $\mathrm{mse}_S \le \mathrm{mse}_{\rm wide}$. Good models are those where these probabilities of being better than the wide model are high, which means $|\hat\kappa_S|$ ratios sufficiently small. Figure 1.2 illustrates this, where the statistician learns which submodels can be expected to be particularly well-working when it comes to predicting the birth weight of Mrs. Jones’s child-to-come.

The confidence distributions also invite the median confidence estimators for the relative risks, yielding one more natural FRIC score,

$$ \mathrm{FRIC}^{0.50}_S = \widehat{\mathrm{rr}}{}^{0.50}_S = C_S^{-1}(0.50, \mathrm{data}) = \min\{\mathrm{rr}_S \ge \mathrm{rr}_{S,\min}\colon C_S(\mathrm{rr}_S, \mathrm{data}) \ge 0.50\}. $$

In cases where the confidence pointmass at the minimum value, i.e. $C_S(\mathrm{rr}_{S,\min}, \mathrm{data}) = 1 - F_{1,m}(\hat\kappa_S^2)$, is already above 0.50, the median confidence estimate is equal to this minimum start value $\mathrm{rr}_{S,\min}$. Figure 3.1 shows the 31 confidence curves of (3.2), indicating also the median confidence estimates.
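The confidence curve (3.2), the score $\mathrm{conf}(S) = C_S(1, \mathrm{data})$ and the median confidence estimate can be sketched with the standard library alone. Note the swapped-in technique: instead of calling a library routine for the noncentral $F$, the code below approximates its distribution function by numerical integration, using the representation of $\hat\kappa_S^2$ as $(Z + \kappa_S)^2 / W$ with $Z \sim \mathrm{N}(0, 1)$ independent of $W \sim \chi^2_m/m$. All function names and input values are hypothetical, and $S$ must be a proper submodel so that the two quadratic forms differ.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ncf1_cdf(x, m, nc, steps=2000):
    """Approximate P(F <= x) for the noncentral F with df (1, m), noncentrality nc.

    Integrates P(|Z + sqrt(nc)| <= sqrt(x * w)) against the density of W = chi2_m / m
    with Simpson's rule (steps must be even). Requires m > 2.
    """
    if x <= 0.0:
        return 0.0
    kappa = math.sqrt(max(nc, 0.0))
    half_m = 0.5 * m
    log_const = half_m * math.log(half_m) - math.lgamma(half_m)
    upper = 1.0 + 12.0 * math.sqrt(2.0 / m)       # W has mean 1 and sd sqrt(2/m)

    def integrand(w):
        if w <= 0.0:
            return 0.0
        log_density = log_const + (half_m - 1.0) * math.log(w) - half_m * w
        r = math.sqrt(x * w)
        return (normal_cdf(r - kappa) - normal_cdf(-r - kappa)) * math.exp(log_density)

    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    return min(1.0, total * h / 3.0)

def confidence_rr(rr, kappa2_hat, m, sub_quad, wide_quad):
    """Confidence distribution C_S(rr, data) of (3.2); zero below rr_min."""
    rr_min = sub_quad / wide_quad
    if rr < rr_min:
        return 0.0
    kappa2 = max(0.0, (wide_quad * rr - sub_quad) / (wide_quad - sub_quad))
    return 1.0 - ncf1_cdf(kappa2_hat, m, kappa2)

def median_confidence_fric(kappa2_hat, m, sub_quad, wide_quad, tol=1e-6):
    """FRIC^{0.50}: smallest rr >= rr_min with confidence at least 0.50, by bisection."""
    def C(rr):
        return confidence_rr(rr, kappa2_hat, m, sub_quad, wide_quad)
    lo = sub_quad / wide_quad
    if C(lo) >= 0.5:
        return lo                                  # pointmass at rr_min already covers 0.50
    hi = max(1.0, 2.0 * lo)
    while C(hi) < 0.5:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if C(mid) >= 0.5:
            hi = mid
        else:
            lo = mid
    return hi

m = 183   # hypothetical: n = 189 observations and p = 6 regression coefficients
central_check = ncf1_cdf(1.0, m, 0.0)                      # close to pr{chi2_1 < 1}
conf_small = confidence_rr(1.0, 0.2, m, 1.0, 1.25)         # conf(S) for a small kappa2_hat
median_small = median_confidence_fric(0.2, m, 1.0, 1.25)
median_large = median_confidence_fric(4.0, m, 1.0, 1.25)
print(central_check, conf_small, median_small, median_large)
```

For the small $\hat\kappa_S^2$ the pointmass at $\mathrm{rr}_{S,\min}$ exceeds 0.50, so the median confidence estimate sits at that minimum; for the larger value it lies well above 1. In practice a library routine for the noncentral $F$ distribution would replace `ncf1_cdf`.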
We may also read off confidence intervals, e.g. $\{\mathrm{rr}_S\colon C_S(\mathrm{rr}_S, \mathrm{data}) \le 0.90\}$, to check which models are associated with high confidence for doing a better job than the wide model. Opening the FRIC box in this case, checking which submodels do particularly well for the case of Mrs. Jones, with the confidence distributions and median confidence estimates, one learns that there is high and essential agreement with the list of good models given in Table 1.1. Also, the correlations between the three relative risk estimates proposed in this and the preceding section, i.e. $\mathrm{FRIC}^u_S$, $\mathrm{FRIC}^t_S$, and now $\mathrm{FRIC}^{0.50}_S$, are all above 0.977.

4 AFRIC scores for variable selection

Above we have developed versions of FRIC, with one particular focus parameter $\mathrm{E}(Y \mid x_0) = x_0^{\rm t}\beta$ at a time. Suppose now that we take an interest in many such at the same time, perhaps a stratum in the space of covariates, like all white smoking mothers-to-be in the context of the Mrs. Jones example of our introduction. Consider covariate vectors $x(u)$ of interest, for an index set $u \in \mathcal{U}$, along with a measure of relative importance, say $v(u)$ for $\mathrm{E}(Y \mid x(u)) = x(u)^{\rm t}\beta$. These $x(u)$, and the importance weights $v(u)$, are selected by the statistician, inside the given context, and they may or may not form a subset of the covariate vectors $\{x_1, \ldots, x_n\}$ associated with the observed data. A natural combined measure of loss associated with selecting submodel $S$ for the purpose of estimating all these $x(u)^{\rm t}\beta$ parameters is

$$ L_S = \sum_{u \in \mathcal{U}} v(u)\,\{x_S(u)^{\rm t}\hat\beta_S - x(u)^{\rm t}\beta\}^2. $$

By the efforts of Section 2 the associated risk is

$$ \mathrm{risk}_S = \mathrm{E}\,L_S = (\sigma^2/n) \sum_{u \in \mathcal{U}} v(u)\,[\,x_S(u)^{\rm t}\Sigma_{n,S}^{-1}x_S(u) + n\{\omega_S(u)^{\rm t}\beta_{S^c}\}^2/\sigma^2\,], $$

with $\omega_S(u) = \Sigma_{10,S}\Sigma_{00,S}^{-1}x_S(u) - x_{S^c}(u)$ for $u \in \mathcal{U}$.
In particular, for the wide model the risk is

$$ \mathrm{risk}_{\rm wide} = (\sigma^2/n) \sum_{u \in \mathcal{U}} v(u)\, x(u)^{\rm t}\Sigma_n^{-1}x(u), $$

leading to a clear relative risk ratio of total risks,

$$ \mathrm{rr}_S = \frac{\mathrm{risk}_S}{\mathrm{risk}_{\rm wide}} = \frac{\sum_{u \in \mathcal{U}} v(u)\,[\,x_S(u)^{\rm t}\Sigma_{n,S}^{-1}x_S(u) + n\{\omega_S(u)^{\rm t}\beta_{S^c}\}^2/\sigma^2\,]}{\sum_{u \in \mathcal{U}} v(u)\, x(u)^{\rm t}\Sigma_n^{-1}x(u)}. \tag{4.1} $$

It is useful to work out a clearer expression for the second term in the numerator, also when it comes to the task of estimating the full $\mathrm{rr}_S$ from data. We may write

$$ \gamma_S = \sum_{u \in \mathcal{U}} v(u)\,\frac{n}{\sigma^2}\,\beta_{S^c}^{\rm t}\,\omega_S(u)\,\omega_S(u)^{\rm t}\,\beta_{S^c} = \frac{n}{\sigma^2}\,\beta_{S^c}^{\rm t}A_S\beta_{S^c}, $$

with $A_S = \sum_{u \in \mathcal{U}} v(u)\,\omega_S(u)\,\omega_S(u)^{\rm t}$. Since $\sqrt{n}\,\hat\beta_{{\rm wide},S^c}/\sigma \sim \mathrm{N}_{p - |S|}(\sqrt{n}\,\beta_{S^c}/\sigma, Q_S)$, with $Q_S$ as in (2.5), we find that the natural start estimator $\tilde\gamma_S = n\,\hat\beta_{{\rm wide},S^c}^{\rm t}A_S\hat\beta_{{\rm wide},S^c}/\hat\sigma^2$ has mean

$$ \mathrm{E}\,\tilde\gamma_S = \frac{m}{m - 2}\,\mathrm{Tr}\{A_S(n\beta_{S^c}\beta_{S^c}^{\rm t}/\sigma^2 + Q_S)\} = \frac{m}{m - 2}\,\{\gamma_S + \mathrm{Tr}(A_SQ_S)\}. $$

Two natural estimators of the $\gamma_S$ quantity are therefore

$$ \hat\gamma^u_S = \frac{m - 2}{m}\,\{\tilde\gamma_S - \mathrm{Tr}(A_SQ_S)\} \quad \text{and} \quad \hat\gamma^t_S = \frac{m - 2}{m}\,\max\{0,\ \tilde\gamma_S - \mathrm{Tr}(A_SQ_S)\}, $$

the unbiased estimator and its truncated-to-zero version. This leads to averaged FRIC scores, say

$$ \mathrm{AFRIC}^u_S = \widehat{\mathrm{rr}}{}^u_S = \frac{\sum_{u \in \mathcal{U}} v(u)\, x_S(u)^{\rm t}\Sigma_{n,S}^{-1}x_S(u) + \hat\gamma^u_S}{\sum_{u \in \mathcal{U}} v(u)\, x(u)^{\rm t}\Sigma_n^{-1}x(u)}, \tag{4.2} $$

and an accompanying $\mathrm{AFRIC}^t_S$ with the truncated version $\hat\gamma^t_S$ instead of the unbiased one. For the given collection of $x(u)$ covariate vectors, along with importance numbers $v(u)$, one may now compute the $\mathrm{AFRIC}_S$ scores for all candidate models, with the best models those with smallest such scores.

[Figure 4.1 plot: horizontal axis relative risk rr (ticks 0 to 6), vertical axis confidence (ticks 0.0 to 1.0).]

Figure 4.1: Confidence cumulative distribution functions $C^*_S(\mathrm{rr}_S)$ for the relative risks $\mathrm{risk}_S/\mathrm{risk}_{\rm wide}$, for the 31 submodels for the mothers-and-babies data, for the case of putting equal importance to all available covariate vectors, i.e. using $v(u) = 1/n$ for the $n$ available vectors.
Only the submodel corresponding to keeping $x_2, x_3, x_4, x_5$, but excluding $x_1$, with the black full curve, exhibits a clear confidence for working better than the wide model. The $\mathrm{AFRIC}^{0.50}_S$ scores of (4.6) are read off where the confidence curves cross 0.50.

An important special case of the AFRIC strategy is to declare all available covariate vectors $x_i$ equally important, with importance weight $1/n$. For this case we have $\sum_{i=1}^n n^{-1} x_i^{\rm t}\Sigma_n^{-1}x_i = \mathrm{Tr}(\Sigma_n^{-1}\Sigma_n) = p$, and similarly

$$ \sum_{i=1}^n n^{-1} x_{i,S}^{\rm t}\Sigma_{n,S}^{-1}x_{i,S} = \mathrm{Tr}(\Sigma_{n,S}^{-1}\Sigma_{n,S}) = |S|, $$

the cardinality of $S$. Also, the $\omega_S$ vector for $x_i$ is $\omega_{i,S} = \Sigma_{10,S}\Sigma_{00,S}^{-1}x_{i,S} - x_{i,S^c}$. The second term of the numerator for the relative risk (4.1) may hence be written $\gamma_S = n\beta_{S^c}^{\rm t}A_S\beta_{S^c}/\sigma^2$, with

$$ \begin{aligned} A_S &= \sum_{i=1}^n n^{-1}\,(\Sigma_{10,S}\Sigma_{00,S}^{-1}x_{i,S} - x_{i,S^c})(\Sigma_{10,S}\Sigma_{00,S}^{-1}x_{i,S} - x_{i,S^c})^{\rm t} \\ &= \Sigma_{10,S}\Sigma_{00,S}^{-1}\Sigma_{00,S}\Sigma_{00,S}^{-1}\Sigma_{01,S} - 2\,\Sigma_{10,S}\Sigma_{00,S}^{-1}\Sigma_{01,S} + \Sigma_{11,S} \\ &= \Sigma_{11,S} - \Sigma_{10,S}\Sigma_{00,S}^{-1}\Sigma_{01,S} = Q_S^{-1}, \end{aligned} $$

see (2.5). We learn that with equal importance for all covariate vectors, the relative risk illuminatingly can be written

$$ \mathrm{rr}_S = \frac{\mathrm{risk}_S}{\mathrm{risk}_{\rm wide}} = \frac{|S| + \gamma_S}{p} = \frac{|S| + n\beta_{S^c}^{\rm t}Q_S^{-1}\beta_{S^c}/\sigma^2}{p}. \tag{4.3} $$

With $\tilde\gamma_S = n\,\hat\beta_{{\rm wide},S^c}^{\rm t}Q_S^{-1}\hat\beta_{{\rm wide},S^c}/\hat\sigma^2$ the AFRIC scores above simplify to

$$ \mathrm{AFRIC}^u_S = \widehat{\mathrm{rr}}{}^u_S = \Bigl[\, |S| + \frac{m - 2}{m}\,\{\tilde\gamma_S - (p - |S|)\} \Bigr] \Big/ p, \tag{4.4} $$

with a sister version using truncation to zero for the second term in the case of $\tilde\gamma_S < p - |S|$. These scores are easily computed, yielding a complete ranking of all candidate models, from those with smallest estimated relative risks to the highest. For this special case, when all available covariate vectors are considered equally important, we also learn that a submodel $S$ is better than the wide one provided $\gamma_S = n\beta_{S^c}^{\rm t}Q_S^{-1}\beta_{S^c}/\sigma^2 \le p - |S|$.
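For the equal-importance case the score (4.4) needs only $\tilde\gamma_S$, $|S|$, $p$ and $m$. A minimal sketch follows, with a hypothetical function name and hypothetical numbers; $p = 6$ and $m = 183$ assume that the intercept is counted among the $p$ covariates of the birthweight model.

```python
def afric_equal_importance(gamma_tilde, size_S, p, m, truncate=False):
    """AFRIC score of (4.4) when all n covariate vectors carry weight 1/n.

    gamma_tilde: n * beta_hat^t Q_S^{-1} beta_hat / sigma_hat^2 for the excluded coefficients
    size_S     : number of covariates kept, |S|
    p          : number of covariates in the wide model
    m          : residual degrees of freedom n - p (must exceed 2)
    """
    excess = gamma_tilde - (p - size_S)
    if truncate:
        excess = max(excess, 0.0)     # sister version: second term truncated to zero
    return (size_S + (m - 2) / m * excess) / p

# Hypothetical example: one covariate excluded, with a small gamma_tilde.
score_u = afric_equal_importance(0.3, size_S=5, p=6, m=183)
score_t = afric_equal_importance(0.3, size_S=5, p=6, m=183, truncate=True)
score_wide = afric_equal_importance(0.0, size_S=6, p=6, m=183)
print(score_u, score_t, score_wide)
```

The wide model scores exactly 1 by construction, and the truncated score never falls below the minimum possible relative risk $|S|/p$.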
This can be assessed accurately via

$$ \widetilde\gamma_S = \frac{n \widehat\beta_{{\rm wide},S^c}^{\rm t} Q_S^{-1} \widehat\beta_{{\rm wide},S^c}}{\widehat\sigma^2} \sim \frac{\chi^2_{p-|S|}(\gamma_S)}{\chi^2_m/m} \sim (p - |S|)\, F_{p-|S|,m}(\gamma_S), $$

featuring the noncentral $F$ with degrees of freedom $(p - |S|, m)$ and excentre parameter $\gamma_S$. This leads to a full and exact confidence distribution for the $\gamma_S$ parameter, and via the representation (4.3) above also for the relative risk ${\rm rr}_S$ itself:

$$ C^*_S({\rm rr}_S) = {\rm pr}_{\gamma_S}\{ (|S| + \widetilde\gamma_S)/p \ge (|S| + \widetilde\gamma_{S,{\rm obs}})/p \} = 1 - F_{p-|S|,m}\Bigl( \frac{\widetilde\gamma_{S,{\rm obs}}}{p - |S|},\, p\, {\rm rr}_S - |S| \Bigr) \quad\text{for } {\rm rr}_S \ge |S|/p. \tag{4.5} $$

Here $F_{p-|S|,m}(\cdot, \gamma)$ is the cumulative distribution function of the noncentral $F$ with excentre parameter $\gamma$. Note that these confidence distributions start at the minimal ${\rm rr}_S$ values, namely $|S|/p$, with start confidence pointmasses equal to $1 - F_{p-|S|,m}(\widetilde\gamma_{S,{\rm obs}}/(p - |S|))$.

There are several uses of these confidence distributions for the relative risks. First, we may use the median confidence estimator to create a different AFRIC score:

$$ {\rm AFRIC}^{0.50}_S = (C^*_S)^{-1}(0.50) = \min\{ {\rm rr}_S \ge |S|/p \colon C^*_S({\rm rr}_S) \ge 0.50 \}. \tag{4.6} $$

This version bypasses the difficulties related to truncation-to-zero or not. Second, we may read off the post-data confidence probabilities that given submodels work better than the wide model, via

$$ C^*_S(1) = {\rm pr}^*\{ {\rm rr}_S \le 1 \} = 1 - F_{p-|S|,m}\Bigl( \frac{\widetilde\gamma_{S,{\rm obs}}}{p - |S|},\, p - |S| \Bigr). $$

Figure 4.1 displays the confidence distribution functions $C^*_S({\rm rr}_S)$ for all the relative risks ${\rm risk}_S/{\rm risk}_{\rm wide}$ for the dataset described in the introduction, with five covariates $x_1, \ldots, x_5$ recorded to assess their potential influence on birthweight $y$. Using focused FRIC methods developed in Sections 2–3 we have seen that for various given purposes, like predicting the outcome for a given mother-to-be, there might be several submodels doing far better than the wide model; see Figures 1.1–1.2.
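The confidence distribution (4.5), the probability $C^*_S(1)$, and the median score (4.6) can be evaluated numerically, for instance as in the following sketch. It assumes SciPy is available and uses its noncentral $F$ distribution `scipy.stats.ncf`; the function names and the input numbers ($\widetilde\gamma_{S,{\rm obs}} = 6$, $|S| = 2$, $p = 5$, $m = 50$) are made up for illustration and are not taken from the paper's data.

```python
from scipy import stats

def conf_relative_risk(rr, gamma_obs, size_S, p, m):
    """C*_S(rr) of (4.5); taken as zero below the minimal value |S|/p."""
    df1 = p - size_S
    if df1 <= 0:
        raise ValueError("S must be a proper submodel of the wide model")
    if rr < size_S / p:
        return 0.0
    nc = p * rr - size_S                  # excentre parameter gamma_S implied by rr
    x = gamma_obs / df1
    if nc < 1e-12:                        # start pointmass: central F
        return float(stats.f.sf(x, df1, m))
    return float(1.0 - stats.ncf.cdf(x, df1, m, nc))

def afric_median(gamma_obs, size_S, p, m):
    """Median confidence estimate (4.6), found by plain bisection."""
    conf = lambda r: conf_relative_risk(r, gamma_obs, size_S, p, m)
    lo = size_S / p
    if conf(lo) >= 0.5:
        return lo                         # pointmass at |S|/p already covers 0.50
    hi = lo + 1.0
    while conf(hi) < 0.5:                 # expand until the median is bracketed
        hi = lo + 2.0 * (hi - lo)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if conf(mid) >= 0.5:
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical inputs, for illustration only.
c_one = conf_relative_risk(1.0, gamma_obs=6.0, size_S=2, p=5, m=50)
med = afric_median(gamma_obs=6.0, size_S=2, p=5, m=50)
print(c_one, med)
```

Here `c_one` plays the role of $C^*_S(1)$, the post-data confidence that the submodel does better than the wide model, and `med` plays the role of ${\rm AFRIC}^{0.50}_S$.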
Using the AFRIC apparatus, for the case of all available covariates deemed to have equal importance, Figure 4.1 indicates that most submodels can be expected to do worse than the full wide model, however. Only one model, the one keeping $x_2, x_3, x_4, x_5$ but excluding $x_1$ (age of mother), then scores significantly better than the wide model. This is also shown by inspecting the AFRIC scores, where those of (4.4) and (4.6) turn out to be very highly correlated.

5 AFRIC, the AIC, and the Mallows criterion

The FIC and FRIC methods are indeed focused, set up to work well with a given focus parameter, and are also for that reason different in spirit from overall variable and model selection methods like the AIC, the BIC, the Mallows statistic, etc. The particular AFRIC method of (4.4) is however an intended de-focused overall selection criterion, where all covariate vectors in the data collection of $(x_i, y_i)$ are considered equally important, and below we make some comparisons with other methods. For general material on these selection criteria, see Claeskens & Hjort (2008).

5.1 The AIC

The log-likelihood for the $S$ subset model takes the form

$$ \ell_{n,S}(\beta_S, \sigma_S) = -n \log \sigma_S - \tfrac12 (1/\sigma_S^2) \sum_{i=1}^n (y_i - x_{i,S}^{\rm t} \beta_S)^2 - \tfrac12 n \log(2\pi), $$

with maximisers $\widehat\beta_S$ of (2.2) and

$$ \widehat\sigma_S^2 = \frac{{\rm rss}(S)}{n}, \quad\text{with}\quad {\rm rss}(S) = \sum_{i=1}^n (y_i - x_{i,S}^{\rm t} \widehat\beta_S)^2. \tag{5.1} $$

Often the unbiased version ${\rm rss}(S)/(n - |S|)$ is used instead, but $\widehat\sigma_S^2$ with denominator $n$ is the maximum likelihood estimator. The log-likelihood maximum is hence $\ell_{n,S,\max} = -n \log \widehat\sigma_S - \tfrac12 n - \tfrac12 n \log(2\pi)$, so that the AIC score becomes

$$ {\rm AIC}_S = 2\, \ell_{n,S,\max} - 2(|S| + 1) = -2n \log \widehat\sigma_S - 2|S| + c_n, $$

with $c_n$ an immaterial constant not depending on the data or the candidate model. In this formulation models with higher AIC values are preferred, so the AIC selects the model with the lowest score $n \log \widehat\sigma_S + |S|$.
There are several variations on the AIC scheme, including finite-sample corrections and other approximations, see Claeskens & Hjort (2008, Ch. 2). One version is related to a simple Taylor expansion, valid when the $\widehat\sigma_S/\widehat\sigma_{\rm wide}$ ratios are not far from 1. Starting with

$$ {\rm AIC}_S = -n \log\Bigl( \widehat\sigma_{\rm wide}^2 \frac{\widehat\sigma_S^2}{\widehat\sigma_{\rm wide}^2} \Bigr) - 2|S| + c_n = -n \log\Bigl( 1 + \frac{\widehat\sigma_S^2}{\widehat\sigma_{\rm wide}^2} - 1 \Bigr) - 2|S| + c_n' $$

we reach the approximation ${\rm AIC}_S \doteq {\rm AIC}^*_S$, where

$$ {\rm AIC}^*_S = -n \frac{\widehat\sigma_S^2}{\widehat\sigma_{\rm wide}^2} - 2|S| + c_n'' = -\frac{{\rm rss}(S)}{\widehat\sigma_{\rm wide}^2} - 2|S| + c_n'', \tag{5.2} $$

with $c_n', c_n''$ further immaterial constants. Thus this approximate AIC selects the set $S$ with smallest $n \widehat\sigma_S^2/\widehat\sigma_{\rm wide}^2 + 2|S| = n\, {\rm rss}(S)/{\rm rss(wide)} + 2|S|$. This version of AIC is also arrived at if one starts with the modelling setup that for submodel $S$, one has $y_i \sim {\rm N}(x_{i,S}^{\rm t} \beta_S, \sigma^2)$, with the same $\sigma$ across candidate models. This is arguably conceptually wrong, as models with more covariates ought to have smaller and not identical residual variances, but as the Taylor argument shows the error is slight if the $\widehat\sigma_S/\widehat\sigma_{\rm wide}$ ratios are not too big. There are connections to the AFRIC of (4.4), which we comment on below, after noting that ${\rm AIC}^*$ is also equivalent to the Mallows Cp method.

5.2 The Mallows statistic

There are several equivalent or close-to-equivalent definitions of the Mallows Cp statistic, from Mallows (1973, 1995); see e.g. Hansen (2007) and Efron & Hastie (2016, Ch. 12) for somewhat different perspectives and types of uses. For the present purposes, for each submodel $S$, let

$$ M_S = {\rm rss}(S)/\widehat\sigma^2 - n + 2|S|, \tag{5.3} $$

where the tradition is to use the unbiased $\widehat\sigma^2 = {\rm rss(wide)}/(n - p)$ in the denominator, as opposed to the maximum likelihood version $\widehat\sigma_{\rm wide}^2 = {\rm rss(wide)}/n$ encountered for the AIC above. With this definition, we have $M_{\rm wide} = n - p - n + 2p = p$ for the wide model.
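As a small illustration of (5.2) and (5.3), the sketch below computes the Mallows score $M_S$ and the approximate-AIC criterion $n\,{\rm rss}(S)/{\rm rss(wide)} + 2|S|$ for all subsets. The helper names `rss` and `mallows_table`, and the simulated data, are assumptions introduced here for illustration; they are not from the paper.

```python
import itertools
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares for least squares on the columns in cols."""
    if not cols:
        return float(y @ y)                  # empty model: fitted mean is zero
    XS = X[:, cols]
    beta, *_ = np.linalg.lstsq(XS, y, rcond=None)
    r = y - XS @ beta
    return float(r @ r)

def mallows_table(X, y):
    """Return {subset: {'mallows': M_S, 'aic_star': n rss(S)/rss(wide) + 2|S|}}."""
    n, p = X.shape
    rss_wide = rss(X, y, list(range(p)))
    sigma2_unbiased = rss_wide / (n - p)     # denominator used in (5.3)
    out = {}
    for size in range(p + 1):
        for S in itertools.combinations(range(p), size):
            r = rss(X, y, list(S))
            out[S] = {
                "mallows": r / sigma2_unbiased - n + 2 * len(S),
                "aic_star": n * r / rss_wide + 2 * len(S),  # smaller is better
            }
    return out

# Simulated data, purely for illustration.
rng = np.random.default_rng(2)
n, p = 80, 4
X = rng.normal(size=(n, p))
y = X @ np.array([0.8, 0.0, 0.4, 0.0]) + rng.normal(size=n)
table = mallows_table(X, y)
best = min(table, key=lambda S: table[S]["mallows"])
print(best, table[best])
```

The two criteria differ only in whether ${\rm rss(wide)}$ is divided by $n - p$ or by $n$, so their rankings typically agree when $p$ is small relative to $n$.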
The difference between the two variance estimators is small in the traditional setup with $p$ small or moderate and $n$ moderate or large. So apart from this often minor detail, the Mallows score (5.3) is just a minus sign and a constant away from the AIC approximation (5.2), and high scores ${\rm AIC}^*_S$ are equivalent to low scores $M_S$.

There is a certain connection from the Mallows score and hence also the AIC approximation to the neutral-version AFRIC score, which we now look into. It will become apparent that the AFRIC score (4.4) can be seen as a more sophisticated cousin of Mallows. Note that with $1 - 2/m$ approximated as 1 we have ${\rm AFRIC}_S$ equal to

$$ \widetilde\gamma_S + 2|S| = n \widehat\beta_{{\rm wide},S^c}^{\rm t} Q_S^{-1} \widehat\beta_{{\rm wide},S^c}/\widehat\sigma^2 + 2|S|, \tag{5.4} $$

up to constants, which has a clear structural similarity to the ${\rm rss}(S)/\widehat\sigma^2 + 2|S|$ with Mallows. Also, both $\widetilde\gamma_S$ and ${\rm rss}(S)/\widehat\sigma^2$ relate to the size of the mean parameter modelling bias, the $\beta_{S^c}$, but this modelling bias is being assessed differently by the two methods, and actually more precisely via the AFRIC. We now examine the required details, to learn the extent to which low scores of $\widetilde\gamma_S + 2|S|$ are related to, or sometimes equivalent to, low scores for ${\rm rss}(S)/\widehat\sigma^2 + 2|S|$.

The nutshell story for the neutral AFRIC, as developed in Section 4, is (i) that the crucial quantity to estimate is $|S| + \gamma_S$, with $\gamma_S = n \beta_{S^c}^{\rm t} Q_S^{-1} \beta_{S^c}/\sigma^2$, and (ii) that when using the estimator $\widetilde\gamma_S = n \widehat\beta_{{\rm wide},S^c}^{\rm t} Q_S^{-1} \widehat\beta_{{\rm wide},S^c}/\widehat\sigma^2$, and correcting for bias, the result is $|S| + \widetilde\gamma_S - (p - |S|)$, modulo approximating $1 - 2/m$ by 1. The parallel nutshell story for the Mallows criterion must be presented differently, in that step (ii), the statistic itself, comes before the insight of type (i), uncovering the underlying crucial population parameter being estimated. It turns out to be the same as for the neutral AFRIC.

Lemma 2.
Consider the Mallows variable $M^0_S = {\rm rss}(S)/\sigma^2 - n + 2|S|$, here using the real $\sigma$ rather than the estimated one, as with $M_S$ of (5.3). Its mean is $|S| + \gamma_S$.

Proof. Note first that the textbook formula ${\rm E}\, {\rm rss}(S) = (n - |S|) \sigma^2$ does not apply here, since submodel $S$ does not hold, but has a mean modelling bias. There are several formulae within reasonable reach for the mean of ${\rm rss}(S)$, under the conditions of the wide model (2.1), but certain linear algebra efforts are required to prove that the mean of $M^0_S$ can be expressed as $|S| + \gamma_S$, with the same $\gamma_S$ as for the AFRIC. In Section 2 we wrote the basic model in terms of the individual $y_i = x_i^{\rm t} \beta + \varepsilon_i$, and now it is practical to also work with the linear algebra version, writing $y = X\beta + \varepsilon \sim {\rm N}_n(X\beta, \sigma^2 I)$, with $X$ the $n \times p$ matrix of the covariate vectors. We start with

$$ y - X_S \widehat\beta_S = \{ I - X_S (X_S^{\rm t} X_S)^{-1} X_S^{\rm t} \} y = (I - H_S) y, $$

with $H_S = X_S (X_S^{\rm t} X_S)^{-1} X_S^{\rm t}$, the idempotent hat matrix with $H_S^2 = H_S$ and ${\rm Tr}(H_S) = |S|$. For the mean vector of the data vector $y$, write $\xi = X\beta = X_S \beta_S + X_{S^c} \beta_{S^c}$, so that $b_S = {\rm E}\, y - X_S \beta_S = X_{S^c} \beta_{S^c}$ is the mean modelling error when employing submodel $S$. Since $(I - H_S) X_S = 0$, we now have

$$ \begin{aligned} \xi^{\rm t} (I - H_S) \xi &= (X_S \beta_S + X_{S^c} \beta_{S^c})^{\rm t} (I - H_S) (X_S \beta_S + X_{S^c} \beta_{S^c}) \\ &= (X_S \beta_S + X_{S^c} \beta_{S^c})^{\rm t} (I - H_S) X_{S^c} \beta_{S^c} = b_S^{\rm t} (I - H_S) b_S, \end{aligned} $$

which leads to

$$ {\rm E}\, {\rm rss}(S) = {\rm E}\, y^{\rm t} (I - H_S) y = \xi^{\rm t} (I - H_S) \xi + {\rm Tr}\{ (I - H_S) {\rm Var}\, y \} = b_S^{\rm t} (I - H_S) b_S + \sigma^2 (n - |S|). $$

We have reached ${\rm E}\, M^0_S = n - |S| + \varphi_S - n + 2|S| = |S| + \varphi_S$, with

$$ \varphi_S = b_S^{\rm t} (I - H_S) b_S/\sigma^2 = \beta_{S^c}^{\rm t} X_{S^c}^{\rm t} (I - H_S) X_{S^c} \beta_{S^c}/\sigma^2. $$

But this is verifiably the same as $\gamma_S$, in that the matrix identity

$$ n^{-1} X_{S^c}^{\rm t} (I - H_S) X_{S^c} = Q_S^{-1} = \Sigma_{11,S} - \Sigma_{10,S} \Sigma_{00,S}^{-1} \Sigma_{01,S} \tag{5.5} $$

can be proved separately, with $Q_S$ of (2.5).
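The matrix identity (5.5), and with it the equality $\varphi_S = \gamma_S$ used in the proof, is easy to check numerically. The following sketch does so for a randomly generated design; the dimensions, the subset $S$, and the values of $\beta$ and $\sigma$ are arbitrary choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 5
X = rng.normal(size=(n, p))
S, Sc = [0, 2], [1, 3, 4]                  # arbitrary submodel and its complement
XS, XSc = X[:, S], X[:, Sc]

# Left-hand side of (5.5): n^{-1} X_Sc^t (I - H_S) X_Sc
H = XS @ np.linalg.solve(XS.T @ XS, XS.T)  # hat matrix for submodel S
lhs = XSc.T @ (np.eye(n) - H) @ XSc / n

# Right-hand side of (5.5): Sigma_11 - Sigma_10 Sigma_00^{-1} Sigma_01
Sigma = X.T @ X / n
S00 = Sigma[np.ix_(S, S)]
S10 = Sigma[np.ix_(Sc, S)]
S11 = Sigma[np.ix_(Sc, Sc)]
rhs = S11 - S10 @ np.linalg.solve(S00, S10.T)

# phi_S from the proof versus gamma_S from (4.3), for arbitrary beta and sigma
beta = np.array([1.0, -0.5, 0.3, 0.2, 0.0])
sigma = 1.7
b = XSc @ beta[Sc]                         # mean modelling error b_S
phi = b @ (np.eye(n) - H) @ b / sigma**2
gamma = n * beta[Sc] @ rhs @ beta[Sc] / sigma**2

print(np.allclose(lhs, rhs), np.isclose(phi, gamma))
```

A numerical check of this kind is of course not a proof, but it guards against sign or indexing slips when implementing $Q_S^{-1}$.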
The exact mean of the Mallows statistic $M_S$ of (5.3) is a more complicated expression than for the $M^0_S$ of the lemma, but since $\widehat\sigma/\sigma$ is close to 1 with high probability, for $m = n - p$ moderate or large, we would still have $|S| + \gamma_S$ as the approximate mean of $M_S$.

The neutral AFRIC of (4.4) has a couple of extra sales points when compared to the partly similar Mallows criterion. It is exact, and properly fine-tuned for finite samples, with no approximations involved; also, it has a clear statistical interpretation, as an exactly unbiased estimator of overall relative risk, the ${\rm rr}_S = {\rm risk}_S/{\rm risk}_{\rm wide}$ of (4.3). With the AFRIC formulation, part of the usefulness of the strategy is to only care about submodels with AFRIC scores less than 1. Finally there is a clear, exact, and optimal associated confidence distribution, the $C^*_S({\rm rr}_S)$ of (4.5).

6 Concluding remarks

We round off our paper by offering a list of concluding remarks, pointing to related themes, to further applications of the developed methods, and to a few issues for further studies.

A. The birthweight dataset. The dataset on birthweights for 189 babies, with covariate information for their mothers, stems from Hosmer & Lemeshow (1989), and has later been used by several authors for reanalyses and illustration of new methodology; see e.g. Claeskens & Hjort (2008). In nearly all of these publications the emphasis has been on the event 'lower than 2.50 kg or not', however, i.e. with logistic regressions with variations, whereas we here use the actual birthweights on their continuous scale. Our statistical narrative has focused on Mrs. Jones (age 40, weight 60, white, smoker), to convey that the methodology is intended to work for one person or object at a time; were she not a smoker we could easily re-run all programmes to produce FRIC plots and FRIC tables for sorting and ranking all candidate models again, indeed finding a different 'best model' for that purpose.

B. Candidate models. Our variable selection machinery is able to handle comparisons between all subset driven models, as seen in the introduction example with Mrs. Jones and the ensuing $2^5 = 32$ candidate models for her five covariates. It is useful if the statistician can carry out an initial screening, however, omitting implausible models; one might e.g. allow interaction terms $x_j x_k$ to enter a candidate model only when both $x_j$ and $x_k$ are on board. Sometimes there is also a natural ordering of models, as in nested situations. Screening away implausible models makes the rest of the FRIC and AFRIC analyses simpler and sharper.

A pertinent reminder is that results developed and used in our paper do rely on the starting assumption that the wide model (2.1) holds. This might be checked separately, by any of a battery of goodness-of-fit checks for the linear regression model. The more critical of these underlying assumptions are independence and constancy of error variance; our methods will continue to work well even if residuals are not exactly normally distributed.

C. A single submodel versus the wide. A special case of the FRIC and AFRIC setups is when there is a single submodel $S$ to check against the wide model. Using Lemma 1 we reached the characterisation (2.7), that $S$ does better than the wide if and only if $\sqrt{n}\, |\omega_S^{\rm t} \beta_{S^c}| < (\omega_S^{\rm t} Q_S \omega_S)^{1/2}$. This is an infinite strip in the space of $\beta_{S^c}$, of those not far from perpendicularity to $\omega_S$, and this depends on the $x_0$ under focus.
The FRIC methods of Sections 2–3 are built to take this possibility into account, that model $S$ can perform well even if it is far from correct, and, specifically, even if it might be doing a bad job for another ${\rm E}(Y \,|\, x_0^*)$. This makes the FRIC methods different from those which directly or indirectly involve testing whether $\beta_{S^c}$ is zero or close to zero. The best focused tool for comparing $S$ with the wide is the full confidence curve $C_S({\rm rr}_S, {\rm data})$ of (3.2), see Figure 3.1.

The neutral or de-focused AFRIC scheme developed in Section 4 is different, however, and when all covariate vectors are seen as equally important matters are seen to hinge on the 'global' statistic $\widetilde\gamma_S = n \widehat\beta_{{\rm wide},S^c}^{\rm t} Q_S^{-1} \widehat\beta_{{\rm wide},S^c}/\widehat\sigma^2$. The best tool for comparing $S$ with the wide is now the confidence distribution $C^*_S({\rm rr}_S)$ of (4.5) for the overall relative risk ${\rm rr}_S = {\rm risk}_S/{\rm risk}_{\rm wide}$.

D. Model averaging. A natural follow-up to the present study is that of constructing good model averaging procedures, of the type $\widehat\mu^* = \sum_S w(S)\, \widehat\mu_S$, perhaps with data-driven weights summing to one over the different submodel based choices $\widehat\mu_S$. These weights could reflect how well the different candidate models do according to the ${\rm FRIC}(S)$ or ${\rm conf}_S$ criteria. This would relate to similar strategies studied in Hansen (2007), Claeskens (2016), and Cunen & Hjort (2020).

E. rmse and mse. We have given our estimators and confidence curves for the scale of relative risks, ${\rm rr}_S = {\rm mse}_S/{\rm mse}_{\rm wide}$; also, the ${\rm AFRIC}_S = \widehat{\rm rr}_S$ works by intention on that same scale. It is sometimes more meaningful to work in the root-relative-risk scale of ${\rm rrr}_S = {\rm rmse}_S/{\rm rmse}_{\rm wide}$, however, since the root-mse are on the scale of the focus parameter. This is a matter of choice and convenience; all relevant numbers, and hence also the FRIC and Confidence plots, are easily converted to the ${\rm rrr}_S$ scale if wished for. For the $2^5 = 32$ submodels encountered for Mrs. Jones we chose ${\rm rr}_S$ for visual clarity with the FRIC plot; both the best and the worst candidate models are easier to spot with Figure 1.1 than with a corresponding ${\rm FRIC}^{1/2}$ plot. Incidentally, the Confidence plot of Figure 1.2 is not affected by the scale used for relative risks.

In addition to letting FRIC do accurate assessments of all relative risks ${\rm rmse}_S/{\rm rmse}_{\rm wide}$, an analysis might not be quite complete without also assessing the ${\rm rmse}_{\rm wide}$ itself, the root-mse of the estimator $\widehat\mu_{\rm wide} = x_0^{\rm t} \widehat\beta_{\rm wide}$. A formula for this quantity is ${\rm rmse}_{\rm wide} = (\sigma/\sqrt{n}) (x_0^{\rm t} \Sigma_n^{-1} x_0)^{1/2}$, and both estimation and accurate assessment are now easily available through $\widehat\sigma \sim \sigma (\chi^2_m/m)^{1/2}$. For the case of Mrs. Jones, estimating $\mu = {\rm E}(Y \,|\, x_{\rm jones})$ via the wide model gives $\widehat\mu_{\rm wide} = 2.889$, its estimated root-mse is 0.180, and a full confidence curve is

$$ {\rm cc}({\rm rmse}_{\rm wide}) = \Bigl| 1 - 2\, \Gamma_m\Bigl( m \frac{\widehat{\rm mse}_{\rm wide}}{{\rm mse}_{\rm wide}} \Bigr) \Bigr|, $$

here with $\widehat{\rm mse}_{\rm wide}$ observed at $0.180^2$, and with $\Gamma_m(\cdot)$ the distribution function for the $\chi^2_m$. This is displayed in Figure 6.1, pointing to the median confidence estimate, and with confidence intervals easily read off for each wished-for confidence level. This confidence curve is the optimal one, see Schweder & Hjort (2016, Ch. 6).

Figure 6.1 (horizontal axis: root mse for Mrs Jones's point estimate; vertical axis: confidence curve): Confidence curve for the root mean squared error of the estimator $\widehat\mu_{\rm wide} = x_0^{\rm t} \widehat\beta_{\rm wide}$ for the case of Mrs. Jones. The estimate itself is 2.889 kg, and the estimated root-mse of that estimate is 0.180, with the figure displaying the full ${\rm cc}({\rm rmse}_{\rm wide})$.

The FRIC analysis of Section 1 indicates that there are several submodel based estimates of ${\rm E}(Y \,|\, x_{\rm jones})$ with smaller root-mse than 0.180. It might of course also be of interest to assess the ${\rm rmse}_S$ quantities directly.
This is indeed approachable, via

$$ {\rm mse}_S = {\rm mse}_{\rm wide}\, {\rm rr}_S = (\sigma^2/n) ( x_{0,S}^{\rm t} \Sigma_{n,S}^{-1} x_{0,S} + \omega_S^{\rm t} Q_S \omega_S\, \kappa_S^2 ), $$

using Lemma 1 again. The technical obstacle is that estimators of $\sigma$ and $\kappa_S$ are dependent, see (2.10), and there is no exact confidence curve for the product expression of ${\rm mse}_S$. Good approximations may be worked out, via techniques of Schweder & Hjort (2016, Ch. 3-4), but our focus on the ${\rm rr}_S$ has bypassed the need for such approximations.

F. When the $\Sigma_n$ matrix is diagonal. Suppose the variance matrix $\Sigma_n = n^{-1} \sum_{i=1}^n x_i x_i^{\rm t}$ is equal to the identity matrix. Then the components of $\widehat\beta_{\rm wide} = n^{-1} \sum_{i=1}^n x_i y_i$ are independent and $\widehat\beta_{{\rm wide},j} \sim {\rm N}(\beta_j, \sigma^2/n)$. The structure of the FRIC and AFRIC formulae simplifies and it also becomes easier to examine aspects of performance. Considering first the AFRIC of (4.4), we find

$$ {\rm AFRIC}^u_S = \Bigl[ |S| + (1 - 2/m) \Bigl\{ \sum_{j \notin S} n \widehat\beta_{{\rm wide},j}^2/\widehat\sigma^2 - (p - |S|) \Bigr\} \Bigr] \Big/ p, $$

where the numerator may be written

$$ \sum_{j \in S} 1 + \sum_{j \notin S} \widehat\varphi_j = \sum_{j=1}^p [ I\{j \in S\} + \widehat\varphi_j I\{j \notin S\} ], \quad\text{where}\quad \widehat\varphi_j = \frac{m-2}{m} \Bigl( \frac{n \widehat\beta_{{\rm wide},j}^2}{\widehat\sigma^2} - 1 \Bigr). $$

The winning subset $S^*$, where this becomes smallest, is where $j$ is selected if and only if $\widehat\varphi_j > 1$, which means $n \widehat\beta_{{\rm wide},j}^2/\widehat\sigma^2 > 2 + 2/(m - 2)$. This corresponds to coordinate-wise inspection for non-zero-ness of the $\beta_j$, with a significance level of about ${\rm pr}\{\chi^2_1 > 2\} = 0.157$, for large $n$.

For the FRIC of (2.12) we similarly find

$$ {\rm FRIC}^u_S = \frac{\sum_{j \in S} x_{0,j}^2 + (1 - 2/m) \sum_{j \notin S} x_{0,j}^2 ( n \widehat\beta_{{\rm wide},j}^2/\widehat\sigma^2 - 1 )}{\sum_{j=1}^p x_{0,j}^2}. $$

The numerator can be written $\sum_{j=1}^p x_{0,j}^2 [ I\{j \in S\} + \widehat\varphi_j I\{j \notin S\} ]$, leading to precisely the same winner $S^*$ as just found for the AFRIC case, i.e. independent of the $x_0$ in question. This is a consequence of the diagonal assumption for $\Sigma_n$, however, and in general the best FRIC model very much depends on the $x_0$ under scrutiny.

G.
More general regression models. The present paper has worked inside the wide linear regression model (2.1), leading to precise finite-sample confidence distributions for all relative risks, for any focused mean parameter ${\rm E}(Y \,|\, x_0)$. For more general regression models, as with the generalised linear models variety, and for general focus parameters, like above-threshold probabilities ${\rm pr}\{Y \ge y_0 \,|\, x_0\}$, one must rely on approximations, via large-sample arguments or bootstrapping. Such are worked out in Cunen & Hjort (2020), with applications to logistic and Poisson regression setups; it is in general not possible to reach exact finite-sample coverage outside the linear regression model, however.

Acknowledgments

This article has benefitted from fruitful conversations with Céline Cunen and Emil Stoltenberg, and from discussions with other members of the Norwegian Research Council funded project FocuStat: Focus Driven Statistical Inference with Complex Data, at the Department of Mathematics, University of Oslo.

References

Claeskens, G. (2016). Statistical model choice. Annual Review of Statistics and its Applications 3, 233–256.

Claeskens, G., Cunen, C. & Hjort, N. L. (2019). Model selection via Focused Information Criteria for complex data in ecology and evolution. Frontiers in Ecology and Evolution 7, 415–428.

Claeskens, G. & Hjort, N. L. (2003). The focused information criterion [with discussion and a rejoinder]. Journal of the American Statistical Association 98, 900–916.

Claeskens, G. & Hjort, N. L. (2008). Model Selection and Model Averaging. Cambridge: Cambridge University Press.

Cunen, C. & Hjort, N. L. (2020). Confidence distributions for FIC scores. Econometrics [to appear] xx, 1–28.

Efron, B. & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.

Hansen, B. E. (2007). Least squares model averaging. Econometrica 75, 1175–1189.

Hjort, N. L. & Schweder, T. (2018). Confidence distributions and related themes: introduction to the special issue. Journal of Statistical Planning and Inference 195, 1–13.

Hosmer, D. & Lemeshow, S. (1989). Applied Logistic Regression. New York: Wiley.

Mallows, C. L. (1973). Some comments on Cp. Technometrics 15, 661–675.

Mallows, C. L. (1995). More comments on Cp. Technometrics 37, 362–372.

Miller, A. J. (2002). Subset Selection in Regression [2nd ed.]. Singapore: World Scientific.

Schweder, T. & Hjort, N. L. (2016). Confidence, Likelihood, Probability: Statistical Inference with Confidence Distributions. Cambridge: Cambridge University Press.

Xie, M.-g. & Singh, K. (2013). Confidence distribution, the frequentist distribution estimator of a parameter: A review [with discussion and a rejoinder]. International Statistical Review 81, 3–39.
