Nonparametric Independence Screening in Sparse Ultra-High Dimensional Additive Models
Authors: Jianqing Fan, Yang Feng, Rui Song
October 30, 2018

Abstract

A variable screening procedure via correlation learning was proposed in Fan and Lv (2008) to reduce dimensionality in sparse ultra-high dimensional models. Even when the true model is linear, the marginal regression can be highly nonlinear. To address this issue, we further extend the correlation learning to marginal nonparametric learning. Our nonparametric independence screening is called NIS, a specific member of the sure independence screening family. Several closely related variable screening procedures are proposed. Under general nonparametric models, it is shown that under some mild technical conditions, the proposed independence screening methods enjoy a sure screening property. The extent to which the dimensionality can be reduced by independence screening is also explicitly quantified. As a methodological extension, a data-driven thresholding and an iterative nonparametric independence screening (INIS) are also proposed to enhance the finite-sample performance for fitting sparse additive models. The simulation results and a real data analysis demonstrate that the proposed procedure works well with moderate sample size and large dimension and performs better than competing methods.

Keywords: Additive model, independent learning, nonparametric regression, sparsity, sure independence screening, nonparametric independence screening, variable selection.

* Jianqing Fan is Frederick L. Moore Professor of Finance, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544 (Email: jqfan@princeton.edu). Yang Feng is Assistant Professor, Department of Statistics, Columbia University, New York, NY 10027 (Email: yangfeng@stat.columbia.edu). Rui Song is Assistant Professor, Department of Statistics, Colorado State University, Fort Collins, CO 80523 (Email: song@stat.colostate.edu). The financial support from NSF grants DMS-0714554, DMS-0704337, DMS-1007698 and NIH grant R01-GM072611 is gratefully acknowledged. The authors are indebted to Dr. Lukas Meier for sharing the code of penGAM, and thank the editor, the associate editor, and the referees for their constructive comments.

1 Introduction

With rapid advances of computing power and other modern technology, high-throughput data of unprecedented size and complexity are frequently seen in many contemporary statistical studies. Examples include data from genetics, microarrays, proteomics, fMRI, functional data and high-frequency financial data. In all these examples, the number of variables $p$ can grow much faster than the number of observations $n$. To be more specific, we assume $\log p = O(n^a)$ for some $a \in (0, 1/2)$. Following Fan and Lv (2009), we call this non-polynomial (NP) dimensionality or ultra-high dimensionality. What makes the under-determined statistical inference possible is the sparsity assumption: only a small set of independent variables contribute to the response. Therefore, dimension reduction and feature selection play pivotal roles in these ultra-high dimensional problems.
The statistical literature contains numerous procedures for variable selection in linear models and other parametric models, such as the Lasso (Tibshirani, 1996), the SCAD and other folded-concave penalties (Fan, 1997; Fan and Li, 2001), the Dantzig selector (Candes and Tao, 2007), the Elastic net (Enet) penalty (Zou and Hastie, 2005), the MCP (Zhang, 2010) and related methods (Zou, 2006; Zou and Li, 2008). Nevertheless, due to the "curse of dimensionality", in terms of simultaneous challenges to computational expediency, statistical accuracy and algorithmic stability, these methods meet their limits in ultra-high dimensional problems.

Motivated by these concerns, Fan and Lv (2008) introduced a new framework for variable screening via correlation learning with NP-dimensionality in the context of least squares. Hall et al. (2009) used a different marginal utility, derived from an empirical likelihood point of view. Hall and Miller (2009) proposed a generalized correlation ranking, which allows nonlinear regression. Huang et al. (2008) also investigated marginal bridge regression in the ordinary linear model. These methods focus on studying the marginal pseudo-likelihood and are fast but crude in terms of reducing the NP-dimensionality to a more moderate size. To enhance the performance, Fan and Lv (2008) and Fan et al. (2009) introduced some methodological extensions, including iterative SIS (ISIS) and multi-stage procedures such as SIS-SCAD and SIS-LASSO, to select variables and estimate parameters simultaneously. Nevertheless, these marginal screening methods face some methodological challenges. When the covariates are not jointly normal, even if the linear model holds in the joint regression, the marginal regression can be highly nonlinear. Therefore, sure screening based on nonparametric marginal regression becomes a natural candidate.

In practice, there is often little prior information that the effects of the covariates take a linear form or belong to any other finite-dimensional parametric family. Substantial improvements are sometimes possible by using a more flexible class of nonparametric models, such as the additive model $Y = \sum_{j=1}^{p} m_j(X_j) + \varepsilon$, introduced by Stone (1985). It substantially increases the flexibility of the ordinary linear model and allows a data-analytic transform of the covariates to enter into the linear model. Yet, the literature on variable selection in nonparametric additive models is limited. See, for example, Koltchinskii and Yuan (2008), Ravikumar et al. (2009), Huang et al. (2010) and Meier et al. (2009). Koltchinskii and Yuan (2008) and Ravikumar et al. (2009) are closely related to COSSO, proposed in Lin and Zhang (2006), with fixed minimal signals that do not converge to zero. Huang et al. (2010) can be viewed as an extension of the adaptive lasso to additive models with fixed minimal signals. Meier et al. (2009) proposed a penalty which is a combination of sparsity and smoothness, with a fixed design. Under ultra-high dimensional settings, all these methods still suffer from the aforementioned three challenges, as they can be viewed as extensions of penalized pseudo-likelihood approaches to additive modeling.
The commonly used algorithms in additive modeling, such as backfitting, make the situation even more challenging, as they are quite computationally expensive. In this paper, we consider independence learning by ranking the magnitudes of marginal estimators, nonparametric marginal correlations, and marginal residual sums of squares. That is, we fit $p$ marginal nonparametric regressions of the response $Y$ against each covariate $X_j$ separately and rank their importance to the joint model according to a measure of the goodness of fit of the marginal models. The magnitude of these marginal utilities can preserve the non-sparsity of the joint additive model under some reasonable conditions, even with converging minimum strength of signals. Our work can be regarded as an important and nontrivial extension of the SIS procedures proposed in Fan and Lv (2008) and Fan and Song (2010). Compared with these papers, the minimum distinguishable signal is related not only to the stochastic error in estimating the nonparametric components, but also to the approximation error in modeling the nonparametric components, which depends on the number of basis functions used for the approximation. This brings significant challenges to the theoretical development and leads to an interesting result on the extent to which the dimensionality can be reduced by nonparametric independence screening. We also propose an iterative nonparametric independence screening procedure, INIS-penGAM, to reduce the false positive rate and stabilize the computation. This two-stage procedure can deal with the aforementioned three challenges better than other methods, as will be demonstrated in our empirical studies.

We approximate the nonparametric additive components by using a B-spline basis. Hence, the component selection in additive models can be viewed as a functional version of grouped variable selection. An early reference on group variable selection using group penalized least-squares is Antoniadis and Fan (2001) (see page 966), in which blocks of wavelet coefficients are either killed or selected. Group variable selection was more thoroughly studied in Yuan and Lin (2006), Kim et al. (2006), Wei and Huang (2007) and Meier et al. (2009). Our methods and results have important implications for group variable selection, as in additive regression each component can be expressed as a linear combination of a set of basis functions, whose coefficients have to be either killed or selected simultaneously.

The rest of the paper is organized as follows. In Section 2, we introduce the nonparametric independence screening (NIS) procedure in additive models. The theoretical properties of NIS are presented in Section 3. As a methodological extension, INIS-penGAM and its greedy version g-INIS-penGAM are outlined in Section 4. Monte Carlo simulations and a real data analysis in Section 5 demonstrate the effectiveness of the INIS method. We conclude with a discussion in Section 6 and relegate the proofs to Section 7.

2 Nonparametric independence screening

Suppose that we have a random sample $\{(\mathbf{X}_i, Y_i)\}_{i=1}^{n}$ from the population

$$Y = m(\mathbf{X}) + \varepsilon, \qquad (1)$$

in which $\mathbf{X} = (X_1, \ldots, X_p)^T$ and $\varepsilon$ is the random error with conditional mean zero.
To expeditiously identify important variables in model (1), without the "curse of dimensionality", we consider the following $p$ marginal nonparametric regression problems:

$$\min_{f_j \in L_2(P)} E\{Y - f_j(X_j)\}^2, \qquad (2)$$

where $P$ denotes the joint distribution of $(\mathbf{X}, Y)$ and $L_2(P)$ is the class of square integrable functions under the measure $P$. The minimizer of (2) is $f_j = E(Y \mid X_j)$, the projection of $Y$ onto $X_j$. We rank the utility of the covariates in model (1) according to, for example, $E f_j^2(X_j)$, and select a small group of covariates via thresholding.

To obtain a sample version of the marginal nonparametric regression, we employ a B-spline basis. Let $\mathcal{S}_n$ be the space of polynomial splines of degree $l \ge 1$ and $\{\Psi_{jk}, k = 1, \ldots, d_n\}$ denote a normalized B-spline basis with $\|\Psi_{jk}\|_\infty \le 1$, where $\|\cdot\|_\infty$ is the sup norm. For any $f_{nj} \in \mathcal{S}_n$, we have

$$f_{nj}(x) = \sum_{k=1}^{d_n} \beta_{jk} \Psi_{jk}(x), \quad 1 \le j \le p,$$

for some coefficients $\{\beta_{jk}\}_{k=1}^{d_n}$. Under some smoothness conditions, the nonparametric projections $\{f_j\}_{j=1}^{p}$ can be well approximated by functions in $\mathcal{S}_n$. The sample version of the marginal regression problem can be expressed as

$$\min_{f_{nj} \in \mathcal{S}_n} P_n\{Y - f_{nj}(X_j)\}^2 = \min_{\boldsymbol{\beta}_j \in \mathbb{R}^{d_n}} P_n\{Y - \boldsymbol{\Psi}_j^T \boldsymbol{\beta}_j\}^2, \qquad (3)$$

where $\boldsymbol{\Psi}_j \equiv \boldsymbol{\Psi}_j(X_j) = (\Psi_{j1}(X_j), \ldots, \Psi_{jd_n}(X_j))^T$ denotes the $d_n$-dimensional vector of basis functions, and $P_n g(\mathbf{X}, Y)$ is the expectation with respect to the empirical measure $P_n$, i.e., the sample average of $\{g(\mathbf{X}_i, Y_i)\}_{i=1}^{n}$. This univariate nonparametric smoothing can be rapidly computed, even for NP-dimensional problems. We correspondingly define the population version of the minimizer of the componentwise least-squares regression,

$$f_{nj}(X_j) = \boldsymbol{\Psi}_j^T (E \boldsymbol{\Psi}_j \boldsymbol{\Psi}_j^T)^{-1} E \boldsymbol{\Psi}_j Y, \quad j = 1, \ldots, p,$$

where $E$ denotes the expectation under the true model. We now select the set of variables

$$\widehat{\mathcal{M}}_{\nu_n} = \{1 \le j \le p : \|\hat{f}_{nj}\|_n^2 \ge \nu_n\}, \qquad (4)$$

where $\|\hat{f}_{nj}\|_n^2 = n^{-1} \sum_{i=1}^n \hat{f}_{nj}(X_{ij})^2$ and $\nu_n$ is a predefined threshold value. Such an independence screening ranks importance according to the marginal strength of the marginal nonparametric regression. This screening can also be viewed as ranking by the magnitude of the correlation of the marginal nonparametric estimate $\{\hat{f}_{nj}(X_{ij})\}_{i=1}^{n}$ with the response $\{Y_i\}_{i=1}^{n}$, since $\|\hat{f}_{nj}\|_n^2 = \langle Y, \hat{f}_{nj} \rangle_n$, where $\langle \cdot, \cdot \rangle_n$ denotes the empirical inner product. In this sense, the proposed NIS procedure is related to the correlation learning proposed in Fan and Lv (2008).

Another screening approach is to rank the covariates by the residual sums of squares of the componentwise nonparametric regressions (in increasing order), where we select the set of variables

$$\widehat{\mathcal{N}}_{\gamma_n} = \{1 \le j \le p : u_j \le \gamma_n\},$$

with $u_j = \min_{\boldsymbol{\beta}_j} P_n(Y - \boldsymbol{\Psi}_j^T \boldsymbol{\beta}_j)^2$ the residual sum of squares of the marginal fit and $\gamma_n$ a predefined threshold value. It is straightforward to show that $u_j = P_n(Y^2 - \hat{f}_{nj}^2)$. Hence, the two methods are equivalent.

The nonparametric independence screening reduces the dimensionality from $p$ to a possibly much smaller space with model size $|\widehat{\mathcal{M}}_{\nu_n}|$ or $|\widehat{\mathcal{N}}_{\gamma_n}|$. It is applicable to all models. The question is whether we have mistakenly deleted some active variables in model (1); in other words, whether the procedure has a sure screening property as postulated by Fan and Lv (2008).
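To make the screening step concrete, here is a minimal Python sketch of the NIS statistic $\|\hat{f}_{nj}\|_n^2$. The helper names (`bspline_design`, `nis_screen`), the cubic degree, the quantile knot placement, and the use of `scipy.interpolate.BSpline.design_matrix` are our own illustrative choices, not part of the paper's reference implementation.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, d_n, degree=3):
    """Build an n x d_n B-spline design matrix for one covariate.
    Interior knots at sample quantiles; boundary knots are repeated."""
    inner = np.quantile(x, np.linspace(0, 1, d_n - degree + 1))[1:-1]
    knots = np.r_[[x.min()] * (degree + 1), inner, [x.max()] * (degree + 1)]
    order = np.argsort(x)                        # evaluate on sorted inputs
    B = BSpline.design_matrix(x[order], knots, degree).toarray()
    return B[np.argsort(order)]                  # restore original row order

def nis_screen(X, y, d_n=5, degree=3):
    """Return the NIS statistics ||f_hat_nj||_n^2 for all p covariates."""
    n, p = X.shape
    yc = y - y.mean()
    stats = np.empty(p)
    for j in range(p):
        B = bspline_design(X[:, j], d_n, degree)
        B = B - B.mean(axis=0)                   # mean-zero components
        beta, *_ = np.linalg.lstsq(B, yc, rcond=None)
        stats[j] = np.mean((B @ beta) ** 2)      # empirical norm of the marginal fit
    return stats

# Screening as in (4): keep variables whose statistic exceeds a threshold nu_n,
# or simply rank the statistics and keep the top ones.
# selected = np.flatnonzero(nis_screen(X, y) >= nu_n)
```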
In the next section, we will show that the sure screening property indeed holds for nonparametric additive models, with a limited false selection rate.

3 Sure Screening Properties

In this section, we establish the sure screening properties for additive models, with results presented in three steps.

3.1 Preliminaries

We now assume that the true regression function admits the additive structure

$$m(\mathbf{X}) = \sum_{j=1}^{p} m_j(X_j). \qquad (5)$$

For identifiability, we assume $\{m_j(X_j)\}_{j=1}^{p}$ have mean zero. Consequently, the response $Y$ has zero mean, too. Let $\mathcal{M}_\star = \{j : E\, m_j(X_j)^2 > 0\}$ be the true sparse model with non-sparsity size $s_n = |\mathcal{M}_\star|$. We allow $p$ to grow with $n$ and denote it as $p_n$ whenever needed.

The theoretical basis of the sure screening is that the marginal signal of the active components ($\|f_j\|, j \in \mathcal{M}_\star$) does not vanish, where $\|f_j\|^2 = E f_j^2$. The following conditions make this possible. For simplicity, let $[a, b]$ be the support of $X_j$.

A. The nonparametric marginal projections $\{f_j\}_{j=1}^{p}$ belong to a class of functions $\mathcal{F}$ whose $r$th derivative $f^{(r)}$ exists and is Lipschitz of order $\alpha$:
$$\mathcal{F} = \big\{f(\cdot) : |f^{(r)}(s) - f^{(r)}(t)| \le K|s - t|^\alpha \text{ for } s, t \in [a, b]\big\},$$
for some positive constant $K$, where $r$ is a non-negative integer and $\alpha \in (0, 1]$ such that $d = r + \alpha > 0.5$.

B. The marginal density function $g_j$ of $X_j$ satisfies $0 < K_1 \le g_j(X_j) \le K_2 < \infty$ on $[a, b]$ for $1 \le j \le p$, for some constants $K_1$ and $K_2$.

C. $\min_{j \in \mathcal{M}_\star} E\{E(Y \mid X_j)^2\} \ge c_1 d_n n^{-2\kappa}$, for some $0 < \kappa < d/(2d + 1)$ and $c_1 > 0$.

Under Conditions A and B, the following three facts hold when $l \ge d$, and they will be used throughout the paper. We state them here for readability.

Fact 1. There exists a positive constant $C_1$ such that (Stone, 1985)
$$\|f_j - f_{nj}\|^2 \le C_1 d_n^{-2d}. \qquad (6)$$

Fact 2. There exists a positive constant $C_2$ such that (Stone, 1985; Huang et al., 2010)
$$E \Psi_{jk}^2(X_{ij}) \le C_2 d_n^{-1}. \qquad (7)$$

Fact 3. There exist some positive constants $D_1$ and $D_2$ such that (Zhou et al., 1998)
$$D_1 d_n^{-1} \le \lambda_{\min}(E \boldsymbol{\Psi}_j \boldsymbol{\Psi}_j^T) \le \lambda_{\max}(E \boldsymbol{\Psi}_j \boldsymbol{\Psi}_j^T) \le D_2 d_n^{-1}. \qquad (8)$$

The following lemma shows that the minimum signal of $\{\|f_{nj}\|\}_{j \in \mathcal{M}_\star}$ is at the same level as that of the marginal projection, provided that the approximation error is negligible.

Lemma 1. Under Conditions A–C, we have
$$\min_{j \in \mathcal{M}_\star} \|f_{nj}\|^2 \ge c_1 \xi d_n n^{-2\kappa},$$
provided that $d_n^{-2d-1} \le c_1(1 - \xi) n^{-2\kappa}/C_1$ for some $\xi \in (0, 1)$.

A model selection consistency result can be established for nonparametric independence screening under the partial orthogonality condition, i.e., $\{X_j, j \notin \mathcal{M}_\star\}$ is independent of $\{X_i, i \in \mathcal{M}_\star\}$. In this case, there is a separation between the strength of the marginal signals $\|f_{nj}\|^2$ for the active variables $\{X_j, j \in \mathcal{M}_\star\}$ and the inactive variables $\{X_j, j \notin \mathcal{M}_\star\}$, for which they are zero. When the separation is sufficiently large, these two sets of variables can be easily identified.

3.2 Sure Screening

In this section, we establish the sure screening properties of the nonparametric independence screening (NIS). We need the following additional conditions:

D. $\|m\|_\infty < B_1$ for some positive constant $B_1$, where $\|\cdot\|_\infty$ is the sup norm.

E. The random errors $\{\varepsilon_i\}_{i=1}^{n}$ are i.i.d. with conditional mean zero, and for any $B_2 > 0$ there exists a positive constant $B_3$ such that $E[\exp(B_2|\varepsilon_i|) \mid \mathbf{X}_i] < B_3$.
F. There exist a positive constant $c_1$ and $\xi \in (0, 1)$ such that $d_n^{-2d-1} \le c_1(1 - \xi) n^{-2\kappa}/C_1$.

The following theorem gives the sure screening properties. It reveals that it is only the size of the non-sparse elements, $s_n$, that matters for the purpose of sure screening, not the dimensionality $p_n$. The first result concerns the uniform convergence of $\|\hat{f}_{nj}\|_n^2$ to $\|f_{nj}\|^2$.

Theorem 1. Suppose that Conditions A, B, D and E hold.

(i) For any $c_2 > 0$, there exist some positive constants $c_3$ and $c_4$ such that
$$P\Big(\max_{1 \le j \le p_n} \big|\|\hat{f}_{nj}\|_n^2 - \|f_{nj}\|^2\big| \ge c_2 d_n n^{-2\kappa}\Big) \le p_n d_n \big\{(8 + 2d_n)\exp(-c_3 n^{1-4\kappa} d_n^{-3}) + 6 d_n \exp(-c_4 n d_n^{-3})\big\}. \qquad (9)$$

(ii) If, in addition, Conditions C and F hold, then by taking $\nu_n = c_5 d_n n^{-2\kappa}$ with $c_5 \le c_1\xi/2$, we have
$$P(\mathcal{M}_\star \subset \widehat{\mathcal{M}}_{\nu_n}) \ge 1 - s_n d_n \big\{(8 + 2d_n)\exp(-c_3 n^{1-4\kappa} d_n^{-3}) + 6 d_n \exp(-c_4 n d_n^{-3})\big\}.$$

Note that the second part of the upper bound in Theorem 1 is related to the uniform convergence rates of the minimum eigenvalues of the design matrices. It gives an upper bound on the number of basis functions, $d_n = o(n^{1/3})$, needed to have the sure screening property, whereas Condition F requires $d_n \ge B_4 n^{2\kappa/(2d+1)}$, where $B_4 = (c_1(1 - \xi)/C_1)^{-1/(2d+1)}$. It follows from Theorem 1 that we can handle the NP-dimensionality

$$\log p_n = o(n^{1-4\kappa} d_n^{-3} + n d_n^{-3}). \qquad (10)$$

Under this condition, $P(\mathcal{M}_\star \subset \widehat{\mathcal{M}}_{\nu_n}) \to 1$, i.e., the sure screening property holds. It is worthwhile to point out that the number of spline basis functions $d_n$ affects the order of the dimensionality, in comparison with the results of Fan and Lv (2008) and Fan and Song (2010), in which univariate marginal regression is used. Equation (10) shows that the larger the minimum signal level, or the smaller the number of basis functions, the higher the dimensionality that the nonparametric independence screening (NIS) can handle. This is in line with our intuition. On the other hand, the number of basis functions cannot be too small, since the approximation error cannot be too large. As required by Condition F, $d_n \ge B_4 n^{2\kappa/(2d+1)}$; the smoother the underlying function, the smaller the $d_n$ we can take and the higher the dimension that the NIS can handle. If the minimum signal does not converge to zero, as in Lin and Zhang (2006), Koltchinskii and Yuan (2008) and Huang et al. (2010), then $\kappa = 0$. In this case, $d_n$ can be taken to be finite, as long as it is sufficiently large that the minimum signal in Lemma 1 exceeds the noise level. By taking $d_n = n^{1/(2d+1)}$, the optimal rate for nonparametric regression (Stone, 1985), we have $\log p_n = o(n^{2(d-1)/(2d+1)})$. In other words, the dimensionality can be as high as $\exp\{o(n^{2(d-1)/(2d+1)})\}$.

3.3 Controlling false selection rates

The sure screening property, without control of the false selection rate, is not insightful: it basically states that the NIS has no false negatives. An ideal case for a vanishing false positive rate is that
$$\max_{j \notin \mathcal{M}_\star} \|f_{nj}\|^2 = o(d_n n^{-2\kappa}),$$
so that there is a gap between the active and inactive variables in model (1) when using the marginal nonparametric screener. In this case, by Theorem 1(i), if the bound in (9) tends to zero, then with probability tending to one,
$$\max_{j \notin \mathcal{M}_\star} \|\hat{f}_{nj}\|_n^2 \le c_2 d_n n^{-2\kappa}, \quad \text{for any } c_2 > 0.$$
Hence, by the choice of $\nu_n$ as in Theorem 1(ii), we can achieve model selection consistency:
$$P(\widehat{\mathcal{M}}_{\nu_n} = \mathcal{M}_\star) = 1 - o(1).$$

We now deal with the more general case. The idea is to bound the size of the selected set by using the fact that $\mathrm{var}(Y)$ is bounded. In this part, we show that the correlations among the basis functions, i.e., the design matrix of the basis functions, are related to the size of the selected models.

Theorem 2. Suppose Conditions A–F hold and $\mathrm{var}(Y) = O(1)$. Then, for any $\nu_n = c_5 d_n n^{-2\kappa}$, there exist positive constants $c_3$ and $c_4$ such that
$$P\big[|\widehat{\mathcal{M}}_{\nu_n}| \le O\{n^{2\kappa}\lambda_{\max}(\boldsymbol{\Sigma})\}\big] \ge 1 - p_n d_n \big\{(8 + 2d_n)\exp(-c_3 n^{1-4\kappa} d_n^{-3}) + 6 d_n \exp(-c_4 n d_n^{-3})\big\},$$
where $\boldsymbol{\Sigma} = E \boldsymbol{\Psi}\boldsymbol{\Psi}^T$ and $\boldsymbol{\Psi} = (\boldsymbol{\Psi}_1, \ldots, \boldsymbol{\Psi}_{p_n})^T$.

The significance of the result is that when $\lambda_{\max}(\boldsymbol{\Sigma}) = O(n^\tau)$, the selected model size with the sure screening property is only of polynomial order, whereas the original model size is of NP-dimensionality. In other words, the false selection rate converges to zero exponentially fast. The size of the selected set of variables is of order $O(n^{2\kappa + \tau})$. This is of the same order as in Fan and Lv (2008). Our result is an extension of Fan and Lv (2008), even in this very specific case, without the condition $2\kappa + \tau < 1$. The results are also consistent with those of Fan and Song (2010): the number of selected variables is related to the correlation structure of the covariance matrix. In the specific case where the covariates are independent, the matrix $\boldsymbol{\Sigma}$ is block diagonal with $j$th block $\boldsymbol{\Sigma}_j$. Hence, it follows from (8) that $\lambda_{\max}(\boldsymbol{\Sigma}) = O(d_n^{-1})$.

4 INIS Method

4.1 Description of the Algorithm

After variable screening, the natural next step is to select the variables using more refined techniques in the additive model. For example, the penalized method for additive models (penGAM) in Meier et al. (2009) can be employed to select a subset of active variables; this results in NIS-penGAM. To further enhance the performance of the method in terms of false selection rates, following Fan and Lv (2008) and Fan et al. (2009), we can iteratively employ the large-scale screening and moderate-scale selection strategy, resulting in INIS-penGAM.

Given the data $\{(\mathbf{X}_i, Y_i)\}, i = 1, \ldots, n$, for each component $f_j(\cdot), j = 1, \ldots, p$, we choose the same truncation term $d_n = O(n^{1/5})$. To determine a data-driven threshold for independence screening, we extend the random permutation idea in Zhao and Li (2010), which allows only a $1 - q$ proportion (for a given $q \in [0, 1]$) of the inactive variables to enter the model when $\mathbf{X}$ and $Y$ are not related (the null model). The random permutation is used to decouple $\mathbf{X}_i$ and $Y_i$, so that the resulting data $(\mathbf{X}_{\pi(i)}, Y_i)$ follow a null model, where $\pi(1), \ldots, \pi(n)$ is a random permutation of the indices $1, \ldots, n$. The algorithm works as follows:

Step 1: For every $j \in \{1, \ldots, p\}$, we compute
$$\hat{f}_{nj} = \mathop{\mathrm{argmin}}_{f_{nj} \in \mathcal{S}_n} P_n\{Y - f_{nj}(X_j)\}^2, \quad \text{for } 1 \le j \le p.$$
Randomly permute the rows of $\mathbf{X}$, yielding $\tilde{\mathbf{X}}$. Let $\omega_{(q)}$ be the $q$th quantile of $\{\|\hat{f}_{nj}^*\|_n^2, j = 1, 2, \ldots, p\}$, where
$$\hat{f}_{nj}^* = \mathop{\mathrm{argmin}}_{f_{nj} \in \mathcal{S}_n} P_n\{Y - f_{nj}(\tilde{X}_j)\}^2.$$
Then NIS selects the following variables:
$$\mathcal{A}_1 = \{j : \|\hat{f}_{nj}\|_n^2 \ge \omega_{(q)}\}.$$
In our numerical examples, we use $q = 1$ (i.e., we take the maximum value of the empirical norms of the permuted estimates).
Step 2: We further apply the penalized method for additive models (penGAM) of Meier et al. (2009) on the set $\mathcal{A}_1$ to select a subset $\mathcal{M}_1$. Inside the penGAM algorithm, the penalty parameter is selected by cross-validation.

Step 3: For every $j \in \mathcal{M}_1^c = \{1, \ldots, p\} \setminus \mathcal{M}_1$, we minimize

$$P_n\Big\{Y - \sum_{i \in \mathcal{M}_1} f_{ni}(X_i) - f_{nj}(X_j)\Big\}^2 \qquad (11)$$

with respect to $f_{ni} \in \mathcal{S}_n$ for all $i \in \mathcal{M}_1$ and $f_{nj} \in \mathcal{S}_n$. This regression reflects the additional contribution of the $j$th component conditional on the existence of the variable set $\mathcal{M}_1$. After marginal screening as in the first step, we can pick a set $\mathcal{A}_2$ of indices. Here the size determination is the same as in Step 1, except that only the variables not in $\mathcal{M}_1$ are randomly permuted. Then we further apply the penGAM algorithm on the set $\mathcal{M}_1 \cup \mathcal{A}_2$ to select a subset $\mathcal{M}_2$.

Step 4: We iterate the process until $|\mathcal{M}_l| \ge s_0$ or $\mathcal{M}_l = \mathcal{M}_{l-1}$.

Here are a few comments about the method. In Step 2, we use the penGAM method. In fact, any variable selection method for additive models will work, such as SpAM in Ravikumar et al. (2009) or the adaptive group LASSO for additive models in Huang et al. (2010). A sample-splitting idea similar to that described in Fan et al. (2009) can be applied here to further reduce the false selection rate. (A schematic code sketch of Steps 1 and 3 is given at the end of Section 4.2.)

4.2 Greedy INIS (g-INIS)

We now propose a greedy modification of the INIS algorithm to speed up the computation and to enhance the performance. Specifically, we restrict the size of the set $\mathcal{A}_j$ in the iterative screening steps to be at most $p_0$, a small positive integer, and the algorithm stops when none of the variables is recruited, i.e., none exceeds the threshold for the null model. In the numerical studies, $p_0$ is taken to be one for simplicity. This greedy version of the INIS algorithm is called "g-INIS".

When $p_0 = 1$, the g-INIS method is connected with forward selection (Efroymson, 1960; Draper and Smith, 1966). Recently, Wang (2009) showed that under certain technical conditions, forward selection can also achieve the sure screening property. Both g-INIS and forward selection recruit at most one new variable into the model at a time. The major difference is that, unlike forward selection, which keeps a variable once selected, g-INIS has a deletion step via penalized least-squares that can remove multiple variables. This makes the g-INIS algorithm more attractive, since it is more flexible in recruiting and deleting variables.

The g-INIS is particularly effective when the covariates are highly correlated or conditionally correlated. In this case, the original INIS method tends to select many unimportant variables that have high correlation with important variables, as they, too, have large marginal effects on the response. Although greedy, the g-INIS method is better at choosing true positives, due to more stringent screening, and it improves the chance of the remaining important variables being selected in subsequent stages, due to fewer false positives at each stage. This leads to conditioning on a smaller set of more relevant variables and improves the overall performance.
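To fix ideas, the following minimal sketch (ours, reusing the hypothetical `bspline_design` and `nis_screen` helpers from Section 2) renders the permutation threshold of Step 1 and one conditional screening pass of Step 3, with the greedy cap of g-INIS as an option. For brevity it screens the residuals of the current additive fit rather than re-solving (11) jointly for each candidate $j$, and the moderate-scale selection step (penGAM) is omitted; none of these helpers is the authors' actual code.

```python
import numpy as np

def permutation_threshold(X, y, d_n=5, q=1.0, rng=None):
    """Step 1 threshold omega_(q): refit the marginal regressions after randomly
    permuting the rows of X (decoupling X and Y, i.e., a null model) and take
    the q-th quantile of the null statistics; q = 1 gives their maximum."""
    rng = rng or np.random.default_rng()
    return np.quantile(nis_screen(X[rng.permutation(len(y))], y, d_n), q)

def additive_ls_fit(X, y, idx, d_n=5):
    """Least-squares additive B-spline fit on the variables in idx (a crude
    stand-in for the penGAM refit of Step 2; illustration only)."""
    if not idx:
        return np.full(len(y), y.mean())
    B = np.hstack([bspline_design(X[:, j], d_n) for j in idx])
    B = B - B.mean(axis=0)
    beta, *_ = np.linalg.lstsq(B, y - y.mean(), rcond=None)
    return y.mean() + B @ beta

def conditional_screen(X, y, selected, d_n=5, q=1.0, p0=None, rng=None):
    """One screening pass of Step 3 conditional on `selected`: screen the
    residuals over the remaining variables against a permutation threshold.
    With p0 set (g-INIS), recruit at most the p0 strongest variables; an
    empty return set is the g-INIS stopping signal."""
    resid = y - additive_ls_fit(X, y, sorted(selected), d_n)
    rest = np.array([j for j in range(X.shape[1]) if j not in selected])
    stats = nis_screen(X[:, rest], resid, d_n)
    omega = permutation_threshold(X[:, rest], resid, d_n, q, rng)
    keep = stats >= omega
    recruits, rstats = rest[keep], stats[keep]
    if p0 is not None:
        recruits = recruits[np.argsort(rstats)[::-1][:p0]]
    return set(recruits.tolist())
```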
From our numerical experience, the g-INIS method outperforms the original INIS method in all examples, in terms of a higher true positive rate, a smaller false positive rate and a smaller prediction error.

5 Numerical Results

In this section, we illustrate our method by studying its performance on simulated data and in a real data analysis. Part of the simulation settings are adapted from Fan and Lv (2008), Meier et al. (2009), Huang et al. (2010), and Fan and Song (2010).

5.1 Comparison of Minimum Model Size

We first illustrate the behavior of the NIS procedure under different correlation structures. Following Fan and Song (2010), the minimum model size (MMS) required for the NIS procedure and the penGAM procedure to have the sure screening property, i.e., to contain the true model $\mathcal{M}_\star$, is used as a measure of the effectiveness of a screening method. We also include the correlation screening of Fan and Lv (2008) for comparison. The advantage of the MMS criterion is that we do not need to choose the thresholding parameter or penalization parameters. For NIS, we take $d_n = \lfloor n^{1/5} \rfloor + 2 = 5$. We set $n = 400$ and $p = 1000$ for all examples.

Example 1. Following Fan and Song (2010), let $\{X_k\}_{k=1}^{950}$ be i.i.d. standard normal random variables and
$$X_k = \sum_{j=1}^{s} X_j \frac{(-1)^{j+1}}{5} + \sqrt{1 - \frac{s}{25}}\,\varepsilon_k, \quad k = 951, \ldots, 1000,$$
where $\{\varepsilon_k\}_{k=951}^{1000}$ are standard normally distributed. We consider the following linear model as a specific case of the additive model: $Y = \boldsymbol{\beta}^{*T}\mathbf{X} + \varepsilon$, in which $\varepsilon \sim N(0, 3)$ and $\boldsymbol{\beta}^* = (1, -1, \ldots)^T$ has $s$ non-vanishing components, taking values $\pm 1$ alternately.

Example 2. In this example, the data are generated from the simple linear regression $Y = X_1 + X_2 + X_3 + \varepsilon$, where $\varepsilon \sim N(0, 3)$. However, the covariates are not normally distributed: $\{X_k\}_{k \ne 2}$ are i.i.d. standard normal random variables, whereas $X_2 = -\frac{1}{3}X_1^3 + \tilde{\varepsilon}$, where $\tilde{\varepsilon} \sim N(0, 1)$. In this case, $E(Y \mid X_1)$ and $E(Y \mid X_2)$ are nonlinear. A small data-generating sketch for Example 2, together with the MMS computation, is given below.
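This is a minimal sketch (ours) of the Example 2 data generator and the MMS computation; we read $N(0, 3)$ as a variance-3 normal, and the seed and helper names are arbitrary illustrative choices.

```python
import numpy as np

def example2(n=400, p=1000, seed=0):
    """Y = X1 + X2 + X3 + eps, eps ~ N(0, 3) (variance 3). All covariates are
    standard normal except X2 = -(1/3) X1^3 + N(0, 1), which makes E(Y | X1)
    and E(Y | X2) nonlinear although the joint model is linear."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    X[:, 1] = -X[:, 0] ** 3 / 3 + rng.standard_normal(n)
    y = X[:, 0] + X[:, 1] + X[:, 2] + np.sqrt(3) * rng.standard_normal(n)
    return X, y

def minimum_model_size(stats, true_idx):
    """MMS: the smallest k such that the k top-ranked variables contain the
    true model, computed from any vector of screening statistics."""
    order = np.argsort(stats)[::-1]             # descending importance
    rank = np.empty(len(stats), dtype=int)
    rank[order] = np.arange(1, len(stats) + 1)
    return int(rank[list(true_idx)].max())

# X, y = example2(); mms = minimum_model_size(nis_screen(X, y), [0, 1, 2])
```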
Table 1: Minimum model size and robust estimate of standard deviations (in parentheses).

Model                         NIS        penGAM     SIS
Ex 1 (s = 3,  SNR ≈ 1.01)     3(0)       3(0)       3(0)
Ex 1 (s = 6,  SNR ≈ 1.99)     56(0)      1000(0)    56(0)
Ex 1 (s = 12, SNR ≈ 4.07)     66(7)      1000(0)    62(1)
Ex 1 (s = 24, SNR ≈ 8.20)     269(134)   1000(0)    109(43)
Ex 2 (SNR ≈ 0.83)             3(0)       3(0)       360(361)

The minimum model size (MMS) for each method and its associated robust estimate of the standard deviation (RSD = IQR/1.34) are shown in Table 1. The columns "NIS", "penGAM", and "SIS" summarize the results on the MMS based on 100 simulations, respectively, for the nonparametric independence screening of this paper, the penalized method for additive models of Meier et al. (2009), and the linear correlation ranking method of Fan and Lv (2008). For Example 1, when the non-sparsity size $s > 5$, the irrepresentable condition required for the model selection consistency of LASSO fails. For these cases, penGAM fails even to include the true model until the last step. In contrast, the proposed nonparametric independence screening performs reasonably well. It is also worth noting that SIS performs better than NIS in the first example, particularly for $s = 24$. This is due to the fact that the true model is linear and the covariates are jointly normally distributed, which implies that the marginal projection is also linear. In this case, NIS selects variables from $p d_n$ parameters, whereas SIS selects from only $p$ parameters. However, for a nonlinear problem like Example 2, both nonlinear methods, NIS and penGAM, behave nicely, whereas SIS fails badly, even though the underlying true model is indeed linear.

5.2 Comparison of Model Selection and Estimation

As in the previous section, we set $n = 400$ and $p = 1000$ for all the examples, to demonstrate the power of our newly proposed methods INIS and g-INIS. Here, in the NIS step, we fix $d_n = 5$ as in the last subsection. The number of simulations is 100, and we use five-fold cross-validation in Step 2 of the INIS algorithm. For simplicity of notation, we let
$$g_1(x) = x, \quad g_2(x) = (2x - 1)^2, \quad g_3(x) = \frac{\sin(2\pi x)}{2 - \sin(2\pi x)}$$
and
$$g_4(x) = 0.1\sin(2\pi x) + 0.2\cos(2\pi x) + 0.3\sin^2(2\pi x) + 0.4\cos^3(2\pi x) + 0.5\sin^3(2\pi x).$$

Example 3. Following Meier et al. (2009), we generate the data from the following additive model:
$$Y = 5 g_1(X_1) + 3 g_2(X_2) + 4 g_3(X_3) + 6 g_4(X_4) + \sqrt{1.74}\,\varepsilon.$$
The covariates $\mathbf{X} = (X_1, \ldots, X_p)^T$ are simulated according to the random effect model
$$X_j = \frac{W_j + t U}{1 + t}, \quad j = 1, \ldots, p,$$
where $W_1, \ldots, W_p$ and $U$ are i.i.d. Unif(0, 1) and $\varepsilon \sim N(0, 1)$. When $t = 0$, the covariates are all independent, and when $t = 1$ the pairwise correlation of the covariates is 0.5.

Example 4. Again, we adapt the simulation model from Meier et al. (2009). This example is a more difficult case than Example 3, since it has 12 important variables with different coefficients:
$$Y = g_1(X_1) + g_2(X_2) + g_3(X_3) + g_4(X_4) + 1.5\{g_1(X_5) + g_2(X_6) + g_3(X_7) + g_4(X_8)\} + 2\{g_1(X_9) + g_2(X_{10}) + g_3(X_{11}) + g_4(X_{12})\} + \sqrt{0.5184}\,\varepsilon,$$
where $\varepsilon \sim N(0, 1)$. The covariates are simulated as in Example 3.

Example 5. We follow the simulation model of Fan et al. (2009), in which $Y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$ is simulated, where $\varepsilon \sim N(0, 1)$. The covariates $X_1, \ldots, X_p$ are jointly Gaussian, marginally $N(0, 1)$, with $\mathrm{corr}(X_i, X_4) = 1/\sqrt{2}$ for all $i \ne 4$ and $\mathrm{corr}(X_i, X_j) = 1/2$ if $i$ and $j$ are distinct elements of $\{1, \ldots, p\} \setminus \{4\}$. The coefficients $\beta_1 = 2, \beta_2 = 2, \beta_3 = 2, \beta_4 = -3\sqrt{2}$, and $\beta_j = 0$ for $j > 4$ are chosen so that $X_4$ is independent of $Y$, even though it is the most important variable in the joint model in terms of the regression coefficient. A small data-generating sketch for Example 3 follows.
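Before turning to the results, here is a minimal data-generating sketch for Example 3 (our own rendering of the model stated above; the seed is arbitrary).

```python
import numpy as np

def g1(x): return x
def g2(x): return (2 * x - 1) ** 2
def g3(x): return np.sin(2 * np.pi * x) / (2 - np.sin(2 * np.pi * x))
def g4(x):
    s, c = np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)
    return 0.1 * s + 0.2 * c + 0.3 * s**2 + 0.4 * c**3 + 0.5 * s**3

def example3(n=400, p=1000, t=0.0, seed=0):
    """Y = 5 g1(X1) + 3 g2(X2) + 4 g3(X3) + 6 g4(X4) + sqrt(1.74) eps, with
    X_j = (W_j + t U) / (1 + t): t = 0 gives independent covariates,
    t = 1 gives pairwise correlation 0.5."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(size=(n, p))
    U = rng.uniform(size=(n, 1))
    X = (W + t * U) / (1 + t)
    y = (5 * g1(X[:, 0]) + 3 * g2(X[:, 1]) + 4 * g3(X[:, 2])
         + 6 * g4(X[:, 3]) + np.sqrt(1.74) * rng.standard_normal(n))
    return X, y
```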
For each example, we compare the performances of INIS-penGAM and g-INIS-penGAM proposed in this paper, penGAM (Meier et al., 2009), and ISIS-SCAD (Fan et al., 2009), which is designed for sparse linear models. Their results are shown respectively in the rows "INIS", "g-INIS", "penGAM" and "ISIS" of Table 2, in which the True Positives (TP), False Positives (FP), Prediction Error (PE) and Computation Time (Time) are reported for each method. Here the prediction error is calculated on an independent test data set of size $n/2$.

Table 2: Average values of the numbers of true (TP) and false (FP) positives, prediction error (PE), and computation time (Time, in seconds). Robust standard deviations are given in parentheses.

Model            Method    TP            FP             PE            Time
Ex 3 (t = 0)     INIS      4.00(0.00)    2.58(2.24)     3.02(0.34)    18.50(7.22)
(SNR ≈ 9.02)     g-INIS    4.00(0.00)    0.67(0.75)     2.92(0.30)    25.03(4.87)
                 penGAM    4.00(0.00)    31.86(23.51)   3.30(0.40)    180.63(6.92)
                 ISIS      3.03(0.00)    29.97(0.00)    15.95(1.74)   12.95(4.18)
Ex 3 (t = 1)     INIS      3.98(0.00)    15.76(6.72)    2.97(0.39)    78.80(26.91)
(SNR ≈ 7.58)     g-INIS    4.00(0.00)    0.98(1.49)     2.61(0.26)    33.89(9.99)
                 penGAM    4.00(0.00)    39.21(24.63)   2.97(0.28)    254.06(13.06)
                 ISIS      3.01(0.00)    29.99(0.00)    12.91(1.39)   18.59(4.37)
Ex 4 (t = 0)     INIS      11.97(0.00)   3.22(1.49)     0.97(0.11)    73.60(25.77)
(SNR ≈ 8.67)     g-INIS    12.00(0.00)   0.73(0.75)     0.91(0.10)    160.75(19.94)
                 penGAM    11.99(0.00)   80.10(18.28)   1.27(0.14)    233.72(10.25)
                 ISIS      7.96(0.75)    25.04(0.75)    4.70(0.40)    12.89(5.00)
Ex 4 (t = 1)     INIS      10.01(1.49)   15.56(0.93)    1.03(0.13)    125.11(39.99)
(SNR ≈ 10.89)    g-INIS    10.78(0.75)   1.08(1.49)     0.87(0.11)    156.37(28.58)
                 penGAM    10.51(0.75)   62.11(26.31)   1.13(0.12)    278.61(16.93)
                 ISIS      6.53(0.75)    26.47(0.75)    4.30(0.44)    17.02(4.01)
Ex 5             INIS      3.99(0.00)    21.96(0.00)    1.62(0.18)    94.50(7.12)
(SNR ≈ 6.11)     g-INIS    4.00(0.00)    1.04(1.49)     1.16(0.12)    39.78(12.45)
                 penGAM    3.00(0.00)    195.03(21.08)  1.93(0.28)    1481.12(181.93)
                 ISIS      4.00(0.00)    29.00(0.00)    1.40(0.17)    17.78(3.85)

First of all, for the greedy modification, g-INIS-penGAM, the number of false positive variables is approximately 1 for all examples, and the numbers of false positives for both INIS-penGAM and ISIS-SCAD are much smaller than that for penGAM. In terms of true positives, we can see that in Examples 3 and 4, INIS-penGAM and penGAM have similar performance, whereas penGAM misses one variable most of the time in Example 5. The linear method ISIS-SCAD misses important variables in the nonlinear models of Examples 3 and 4.

One may notice that in Example 4 ($t = 1$), even INIS and g-INIS miss more than one variable on average. To explore the reason, we took a close look at the iterative process for this example and found that the variables $X_1$ and $X_2$ are missed quite often. The explanation is that, although the overall SNR (signal-to-noise ratio) for this example is around 10.89, the individual contributions to the total signal vary significantly. Let us now introduce the notion of individual SNR. For example, $\mathrm{var}(m_1(X_1))/\mathrm{var}(\varepsilon)$ in the additive model $Y = m_1(X_1) + \cdots + m_p(X_p) + \varepsilon$ is the individual SNR for the first component, under the oracle model in which $m_2, \ldots, m_p$ are known. In Example 4 ($t = 1$), the variances of all 12 components are as follows:

Component   1     2     3     4     5     6     7     8     9     10    11    12
Variance    0.08  0.09  0.21  0.26  0.19  0.20  0.47  0.58  0.33  0.36  0.84  1.03

We can see that the variance varies a lot among the 12 components, which leads to very different marginal SNRs. For example, the individual SNR for the first component is merely $0.08/0.518 = 0.154$, which makes it very hard to detect. With the overall SNR fixed, the individual SNRs play an important role in measuring the difficulty of selecting individual variables. A small numerical check of these individual SNRs is sketched at the end of this subsection.

From the perspective of prediction error, INIS-penGAM, g-INIS-penGAM and penGAM outperform ISIS-SCAD in the nonlinear models, whereas their performances are worse than ISIS-SCAD in the linear model, Example 5.
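The individual SNRs can be checked by a quick Monte Carlo computation. The sketch below (ours) reuses $g_1$–$g_4$ from the Example 3 sketch and evaluates the component variances against the error variance 0.5184; with Uniform(0, 1)-distributed component inputs it reproduces the quoted variances closely (e.g., $\mathrm{var}(g_1(U)) = 1/12 \approx 0.08$), up to Monte Carlo error.

```python
import numpy as np

# Reuses g1..g4 from the Example 3 sketch above.
rng = np.random.default_rng(0)
x = rng.uniform(size=10**6)                 # Unif(0,1) component input
coefs = [1] * 4 + [1.5] * 4 + [2] * 4      # coefficients of the 12 components
var_eps = 0.5184                            # error variance in Example 4
for j, (c, g) in enumerate(zip(coefs, [g1, g2, g3, g4] * 3), start=1):
    v = np.var(c * g(x))                    # variance of the j-th component
    print(f"component {j:2d}: var ~ {v:.2f}, individual SNR ~ {v / var_eps:.2f}")
```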
Overall, it is quite clear that the greedy modification g-INIS is a competitive variable selection method in ultra-high dimensional additive models, with a very low false selection rate, small prediction errors, and fast computation.

5.3 $d_n$ and SNR

In this subsection, we conduct a simulation study to investigate the performance of the INIS-penGAM estimator under different SNR settings using different numbers ($d_n$) of basis functions.

Example 6. We generate the data from the following additive model:
$$Y = 3 g_1(X_1) + 3 g_2(X_2) + 2 g_3(X_3) + 2 g_4(X_4) + C\sqrt{3.3843}\,\varepsilon,$$
where the covariates $\mathbf{X} = (X_1, \ldots, X_p)^T$ are simulated according to Example 3. Here $C$ takes a series of different values ($C^2 = 2, 1, 0.5, 0.25$) to make the corresponding SNR $= 0.5, 1, 2, 4$. We report the results of using the numbers of basis functions $d_n = 2, 4, 6, 8$ in Tables 4 and 5 in the Appendix.

From Table 4 in the Appendix, where all the variables are independent, both methods have very good true positives under various SNRs when $d_n$ is not too large. However, for the case of SNR $= 0.5$ and $d_n = 16$, INIS and penGAM perform poorly in terms of a low true positive rate. This is due to the fact that when $d_n$ is large, the estimation variance will be large, and this makes it difficult to differentiate the active variables from the inactive ones when the signals are weak.

Now let us have a look at the more difficult case in Table 5 (in the Appendix), where the pairwise correlation between variables is 0.5. We can see that INIS has a competitive performance under various SNR values, except when $d_n = 16$. When SNR $= 0.5$, we cannot achieve sure screening under the current sample size and configuration, for the aforementioned reasons.

5.4 An Analysis of the Affymetrix GeneChip Rat Genome 230 2.0 Array

We use the data set reported in Scheetz et al. (2006) and analyzed by Huang et al. (2010) to illustrate the application of the proposed method. For this data set, 120 twelve-week-old male rats were selected for tissue harvesting from the eyes and for microarray analysis. The microarrays used to analyze the RNA from the eyes of these animals contain over 31,042 different probe sets (Affymetrix GeneChip Rat Genome 230 2.0 Array). The intensity values were normalized using the robust multi-chip averaging method (Irizarry et al., 2003) to obtain summary expression values for each probe set. Gene expression levels were analyzed on a logarithmic scale.

Following Huang et al. (2010), we are interested in finding the genes that are related to the gene TRIM32, which was recently found to cause Bardet–Biedl syndrome (Chiang et al., 2006), a genetically heterogeneous disease of multiple organ systems including the retina. Although over 30,000 probe sets are represented on the Rat Genome 230 2.0 Array, many of them are not expressed in the eye tissue. We only focus on the 18,975 probes which are expressed in the eye tissue. We use our INIS-penGAM method directly on this data set, where $n = 120$ and $p = 18975$, and denote the method as INIS-penGAM ($p = 18975$). Direct application of the penGAM approach on the whole data set is too slow. Following Huang et al. (2010), we also use the 2000 probe sets that are expressed in the eye and have the highest marginal correlation with TRIM32 in the analysis.
On this subset of the data ($n = 120$, $p = 2000$), we apply INIS-penGAM and penGAM to model the relation between the expression of TRIM32 and those of the 2000 genes. For simplicity, we did not implement g-INIS-penGAM. Prior to the analysis, we standardize each probe to have mean 0 and variance 1. We now have three different estimators: INIS-penGAM ($p = 18975$), INIS-penGAM ($p = 2000$) and penGAM ($p = 2000$). The INIS-penGAM ($p = 18975$) selects the following 8 probes: 1371755_at, 1372928_at, 1373534_at, 1373944_at, 1374669_at, 1376686_at, 1376747_at, 1377880_at. The INIS-penGAM ($p = 2000$) selects the following 8 probes: 1376686_at, 1376747_at, 1378590_at, 1373534_at, 1377880_at, 1372928_at, 1374669_at, 1373944_at. On the other hand, the penGAM ($p = 2000$) selects 32 probes. The residual sums of squares (RSS) of these fits are 0.24, 0.26 and 0.1 for INIS-penGAM ($p = 18975$), INIS-penGAM ($p = 2000$) and penGAM ($p = 2000$), respectively.

[Figure 1: Fitted regression functions $f_1$–$f_8$ for the 8 probes selected by INIS-penGAM ($p = 18975$): 1371755_at, 1372928_at, 1373534_at, 1373944_at, 1374669_at, 1376686_at, 1376747_at, 1377880_at.]

In order to further evaluate the performance of the two methods, we use cross-validation and compare the prediction mean squared error (PE). We randomly partition the data into a training set of 100 observations and a test set of 20 observations. We compute the number of probes selected using the 100 training observations and the prediction errors on the 20 test observations. This process is repeated 100 times. Table 3 gives the average values and their associated robust standard deviations over the 100 replications.

Table 3: Mean Model Size (MS) and Prediction Error (PE) over 100 repetitions and their robust standard deviations (in parentheses) for INIS ($p = 18975$), INIS ($p = 2000$) and penGAM ($p = 2000$).

Method                  MS             PE
INIS (p = 18975)        7.73(0.00)     0.47(0.13)
INIS (p = 2000)         7.68(0.75)     0.44(0.15)
penGAM (p = 2000)       26.71(14.93)   0.48(0.16)

It is clear from the table that, by applying the INIS-penGAM approach, we select far fewer genes and obtain a smaller prediction error. Therefore, in this example, INIS-penGAM provides the biological investigator with a more targeted list of probe sets, which could be very useful for further study.

6 Discussion

In this paper, we study the nonparametric independence screening (NIS) method for variable selection in additive models. B-spline basis functions are used for fitting the marginal nonparametric components. The proposed marginal projection criterion is an important extension of marginal correlation. Iterative NIS procedures are also proposed such that variable selection and coefficient estimation can be achieved simultaneously. By applying the INIS-penGAM method, we can preserve the sure screening property and substantially reduce the false selection rate. A greedy modification of the method, g-INIS-penGAM, is proposed to further reduce the false selection rate. Moreover, we can deal with the case where some variable is marginally uncorrelated with, but jointly correlated with, the response.
The proposed method can be easily generalized to the generalized additive model under appropriate conditions. Although the additive components are specifically approximated by truncated series expansions with B-spline bases in this paper, the theoretical results should hold in general, and the proposed framework can be readily adapted to other smoothing methods for additive models (Horowitz et al., 2006; Silverman, 1984), such as local polynomial regression (Fan and Jiang, 2005), wavelet approximations (Antoniadis and Fan, 2001; Sardy and Tseng, 2004) and smoothing splines (Speckman, 1985). This is an interesting topic for future research.

7 Proofs

Proof of Lemma 1. By the property of least squares, $E(Y - f_{nj})f_{nj} = 0$ and $E(Y - f_j)f_{nj} = 0$. Therefore,
$$E f_{nj}(f_j - f_{nj}) = E(Y - f_{nj})f_{nj} - E(Y - f_j)f_{nj} = 0.$$
It follows from this and the orthogonal decomposition $f_j = f_{nj} + (f_j - f_{nj})$ that
$$\|f_{nj}\|^2 = \|f_j\|^2 - \|f_j - f_{nj}\|^2.$$
The desired result follows from Condition C together with Fact 1.

The following two types of Bernstein's inequality from van der Vaart and Wellner (1996) will be needed. We reproduce them here for the sake of readability.

Lemma 2 (Bernstein's inequality, Lemma 2.2.9, van der Vaart and Wellner (1996)). For independent random variables $Y_1, \ldots, Y_n$ with bounded ranges $[-M, M]$ and zero means,
$$P(|Y_1 + \cdots + Y_n| > x) \le 2\exp\{-x^2/(2(v + Mx/3))\},$$
for $v \ge \mathrm{var}(Y_1 + \cdots + Y_n)$.

Lemma 3 (Bernstein's inequality, Lemma 2.2.11, van der Vaart and Wellner (1996)). Let $Y_1, \ldots, Y_n$ be independent random variables with zero mean such that $E|Y_i|^m \le m!\,M^{m-2}v_i/2$, for every $m \ge 2$ (and all $i$) and some constants $M$ and $v_i$. Then
$$P(|Y_1 + \cdots + Y_n| > x) \le 2\exp\{-x^2/(2(v + Mx))\},$$
for $v \ge v_1 + \cdots + v_n$.

The following two lemmas will be needed to prove Theorem 1.

Lemma 4. Under Conditions A, B and D, for any $\delta > 0$, there exist some positive constants $c_6$ and $c_7$ such that
$$P(|(P_n - E)\Psi_{jk}Y| \ge \delta n^{-1}) \le 4\exp\big(-\delta^2/\{2(c_6 n d_n^{-1} + c_7\delta)\}\big),$$
for $k = 1, \ldots, d_n$, $j = 1, \ldots, p$.

Proof of Lemma 4. Denote $T_{jki} = \Psi_{jk}(X_{ij})Y_i - E\Psi_{jk}(X_{ij})Y_i$. Since $Y_i = m(\mathbf{X}_i) + \varepsilon_i$, we can write $T_{jki} = T_{jki1} + T_{jki2}$, where
$$T_{jki1} = \Psi_{jk}(X_{ij})m(\mathbf{X}_i) - E\Psi_{jk}(X_{ij})m(\mathbf{X}_i) \quad \text{and} \quad T_{jki2} = \Psi_{jk}(X_{ij})\varepsilon_i.$$
By Conditions A, B, D and Fact 2, recalling $\|\Psi_{jk}\|_\infty \le 1$, we have
$$|T_{jki1}| \le 2B_1, \quad \mathrm{var}(T_{jki1}) \le E\{\Psi_{jk}^2(X_{ij})m(\mathbf{X}_i)^2\} \le B_1^2 C_2 d_n^{-1}. \qquad (12)$$
By Bernstein's inequality (Lemma 2), for any $\delta_1 > 0$,
$$P\Big(\Big|\sum_{i=1}^n T_{jki1}\Big| > \delta_1\Big) \le 2\exp\Big(-\frac{1}{2}\,\frac{\delta_1^2}{nB_1^2C_2d_n^{-1} + 2B_1\delta_1/3}\Big). \qquad (13)$$
Next, we bound the tails of $T_{jki2}$. For every $r \ge 2$,
$$E|T_{jki2}|^r \le E\{|\Psi_{jk}(X_{ij})|^2 E(|\varepsilon_i|^r \mid \mathbf{X}_i)\} \le r!\,B_2^{-r}E\{|\Psi_{jk}(X_{ij})|^2 E(\exp(B_2|\varepsilon_i|) \mid \mathbf{X}_i)\} \le B_3C_2d_n^{-1}\,r!\,B_2^{-r},$$
where the last inequality uses Condition E and Fact 2. By Bernstein's inequality (Lemma 3), for any $\delta_2 > 0$,
$$P\Big(\Big|\sum_{i=1}^n T_{jki2}\Big| > \delta_2\Big) \le 2\exp\Big(-\frac{1}{2}\,\frac{\delta_2^2}{2nB_2^{-2}B_3C_2d_n^{-1} + B_2^{-1}\delta_2}\Big). \qquad (14)$$
Combining (13) and (14), the desired result follows by taking $c_6 = \max(B_1^2C_2,\, 2B_2^{-2}B_3C_2)$ and $c_7 = \max(2B_1/3,\, B_2^{-1})$.
Throughout the rest of the proofs, for any matrix $\mathbf{A}$, let $\|\mathbf{A}\| = \sqrt{\lambda_{\max}(\mathbf{A}^T\mathbf{A})}$ be the operator norm and $\|\mathbf{A}\|_\infty = \max_{i,j}|A_{ij}|$ the infinity norm. The next lemma concerns the tail probability of the eigenvalues of the design matrix.

Lemma 5. Under Conditions A and B, for any $\delta > 0$,
$$P\big(|\lambda_{\min}(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T) - \lambda_{\min}(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)| \ge d_n\delta/n\big) \le 2d_n^2\exp\Big\{-\frac{1}{2}\,\frac{\delta^2}{C_2nd_n^{-1} + \delta/3}\Big\}.$$
In addition, for any given constant $c_4$, there exists some positive constant $c_8$ such that
$$P\big\{\|(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}\| - \|(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}\| \ge c_8\|(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}\|\big\} \le 2d_n^2\exp(-c_4nd_n^{-3}). \qquad (15)$$

Proof of Lemma 5. For any symmetric matrices $A$ and $B$ and any $\|x\| = 1$, where $\|\cdot\|$ is the Euclidean norm,
$$x^T(A + B)x = x^TAx + x^TBx \ge \min_{\|x\|=1}x^TAx + \min_{\|x\|=1}x^TBx.$$
Taking the minimum over $\|x\| = 1$ on the left side, we have
$$\min_{\|x\|=1}x^T(A + B)x \ge \min_{\|x\|=1}x^TAx + \min_{\|x\|=1}x^TBx,$$
which is equivalent to $\lambda_{\min}(A + B) \ge \lambda_{\min}(A) + \lambda_{\min}(B)$. Then we have
$$\lambda_{\min}(A) \ge \lambda_{\min}(B) + \lambda_{\min}(A - B),$$
which is the same as
$$\lambda_{\min}(A - B) \le \lambda_{\min}(A) - \lambda_{\min}(B).$$
By switching the roles of $A$ and $B$, we also have
$$\lambda_{\min}(B - A) \le \lambda_{\min}(B) - \lambda_{\min}(A).$$
In other words,
$$|\lambda_{\min}(A) - \lambda_{\min}(B)| \le \max\{|\lambda_{\min}(A - B)|,\, |\lambda_{\min}(B - A)|\}. \qquad (16)$$
Let $D_j = P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T - E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T$. Then it follows from (16) that
$$|\lambda_{\min}(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T) - \lambda_{\min}(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)| \le \max\{|\lambda_{\min}(D_j)|,\, |\lambda_{\min}(-D_j)|\}. \qquad (17)$$
We now bound the right-hand side of (17). Let $D_j^{(i,l)}$ be the $(i, l)$ entry of $D_j$. Then it is easy to see that, for any $\|x\| = 1$,
$$|x^TD_jx| \le \|D_j\|_\infty\Big(\sum_{i=1}^{d_n}|x_i|\Big)^2 \le d_n\|D_j\|_\infty. \qquad (18)$$
Thus,
$$\lambda_{\min}(D_j) = \min_{\|x\|=1}x^TD_jx \le d_n\|D_j\|_\infty.$$
On the other hand, by using (18) again, we have
$$\lambda_{\min}(D_j) = -\max_{\|x\|=1}(-x^TD_jx) \ge -d_n\|D_j\|_\infty.$$
We conclude that $|\lambda_{\min}(D_j)| \le d_n\|D_j\|_\infty$. The same bound on $|\lambda_{\min}(-D_j)|$ can be obtained by the same argument. Thus, by (17), we have
$$|\lambda_{\min}(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T) - \lambda_{\min}(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)| \le d_n\|D_j\|_\infty. \qquad (19)$$
We now use Bernstein's inequality to bound the right-hand side of (19). Since $\|\Psi_{jk}\|_\infty \le 1$, and by Fact 2, we have
$$\mathrm{var}(\Psi_{jk}(X_j)\Psi_{jl}(X_j)) \le E\{\Psi_{jk}^2(X_j)\Psi_{jl}^2(X_j)\} \le E\Psi_{jk}^2(X_j) \le C_2d_n^{-1}.$$
By Bernstein's inequality (Lemma 2), for any $\delta > 0$,
$$P(|(P_n - E)\Psi_{jk}(X_j)\Psi_{jl}(X_j)| > \delta/n) \le 2\exp\Big\{-\frac{\delta^2}{2(C_2nd_n^{-1} + \delta/3)}\Big\}. \qquad (20)$$
It follows from (19), (20) and the union bound of probability that
$$P(|\lambda_{\min}(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T) - \lambda_{\min}(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)| \ge d_n\delta/n) \le 2d_n^2\exp\Big\{-\frac{\delta^2}{2(C_2nd_n^{-1} + \delta/3)}\Big\}.$$
This completes the proof of the first inequality.

To prove the second inequality, take $\delta = c_9D_1nd_n^{-2}$ in (20), where $c_9 \in (0, 1)$. Recalling Fact 3, it follows that
$$P\big(|\lambda_{\min}(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T) - \lambda_{\min}(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)| \ge c_9\lambda_{\min}(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)\big) \le 2d_n^2\exp(-c_4nd_n^{-3}), \qquad (21)$$
for some positive constant $c_4$. The second part of the lemma then follows from the fact that $\lambda_{\min}(H)^{-1} = \lambda_{\max}(H^{-1})$, once we establish
$$P\Big[\{\lambda_{\min}(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)\}^{-1} - \{\lambda_{\min}(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)\}^{-1} \ge c_8\{\lambda_{\min}(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)\}^{-1}\Big] \le 2d_n^2\exp(-c_4nd_n^{-3}) \qquad (22)$$
using (21), where $c_8 = 1/(1 - c_9) - 1$. We now deduce (22) from (21). Let $A = \lambda_{\min}(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)$ and $B = \lambda_{\min}(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)$.
Then $A > 0$ and $B > 0$. We aim to show that, for $a \in (0, 1)$, $|A^{-1} - B^{-1}| \ge cB^{-1}$ implies $|A - B| \ge aB$, where $c = 1/(1 - a) - 1$. Since $|A^{-1} - B^{-1}| \ge \{1/(1 - a) - 1\}B^{-1}$, we have
$$A^{-1} - B^{-1} \le -\{1/(1 - a) - 1\}B^{-1}, \quad \text{or} \quad A^{-1} - B^{-1} \ge \{1/(1 - a) - 1\}B^{-1}.$$
Note that, for $a \in (0, 1)$, we have $1 - 1/(1 + a) < 1/(1 - a) - 1$. It then follows that
$$A^{-1} - B^{-1} \le -\{1 - 1/(1 + a)\}B^{-1}, \quad \text{or} \quad A^{-1} - B^{-1} \ge \{1/(1 - a) - 1\}B^{-1},$$
which is equivalent to $|A - B| \ge aB$. This concludes the proof of the lemma.

Proof of Theorem 1. We first show part (i). Recall that
$$\|\hat{f}_{nj}\|_n^2 = (P_n\boldsymbol{\Psi}_jY)^T(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}P_n\boldsymbol{\Psi}_jY \quad \text{and} \quad \|f_{nj}\|^2 = (E\boldsymbol{\Psi}_jY)^T(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}E\boldsymbol{\Psi}_jY.$$
Let $a_n = P_n\boldsymbol{\Psi}_jY$, $B_n = (P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}$, $a = E\boldsymbol{\Psi}_jY$ and $B = (E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}$. By some algebra,
$$a_n^TB_na_n - a^TBa = (a_n - a)^TB_n(a_n - a) + 2(a_n - a)^TB_na + a^T(B_n - B)a,$$
so we have
$$\|\hat{f}_{nj}\|_n^2 - \|f_{nj}\|^2 = S_1 + S_2 + S_3, \qquad (23)$$
where
$$S_1 = (P_n\boldsymbol{\Psi}_jY - E\boldsymbol{\Psi}_jY)^T(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}(P_n\boldsymbol{\Psi}_jY - E\boldsymbol{\Psi}_jY),$$
$$S_2 = 2(P_n\boldsymbol{\Psi}_jY - E\boldsymbol{\Psi}_jY)^T(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}E\boldsymbol{\Psi}_jY,$$
$$S_3 = (E\boldsymbol{\Psi}_jY)^T\{(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1} - (E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}\}E\boldsymbol{\Psi}_jY.$$
Note that
$$S_1 \le \|(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}\| \cdot \|P_n\boldsymbol{\Psi}_jY - E\boldsymbol{\Psi}_jY\|^2. \qquad (24)$$
By Lemma 4 and the union bound of probability,
$$P(\|P_n\boldsymbol{\Psi}_jY - E\boldsymbol{\Psi}_jY\|^2 \ge d_n\delta^2n^{-2}) \le 4d_n\exp\big(-\delta^2/\{2(c_6nd_n^{-1} + c_7\delta)\}\big). \qquad (25)$$
Recall the result in Lemma 5 that, for any given constant $c_4$, there exists a positive constant $c_8$ such that
$$P\big\{\|(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}\| - \|(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}\| \ge c_8\|(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}\|\big\} \le 2d_n^2\exp(-c_4nd_n^{-3}).$$
Since, by Fact 3, $\|(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}\| \le D_1^{-1}d_n$, it follows that
$$P\big\{\|(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}\| \ge (c_8 + 1)D_1^{-1}d_n\big\} \le 2d_n^2\exp(-c_4nd_n^{-3}). \qquad (26)$$
Combining (24)–(26) and the union bound of probability, we have
$$P(S_1 \ge (c_8 + 1)D_1^{-1}d_n^2\delta^2/n^2) \le 4d_n\exp\big(-\delta^2/\{2(c_6nd_n^{-1} + c_7\delta)\}\big) + 2d_n^2\exp(-c_4nd_n^{-3}). \qquad (27)$$
To bound $S_2$, we note that
$$|S_2| \le 2\|P_n\boldsymbol{\Psi}_jY - E\boldsymbol{\Psi}_jY\| \cdot \|(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}E\boldsymbol{\Psi}_jY\| \le 2\|P_n\boldsymbol{\Psi}_jY - E\boldsymbol{\Psi}_jY\| \cdot \|(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}\| \cdot \|E\boldsymbol{\Psi}_jY\|. \qquad (28)$$
Since, by Condition D,
$$\|E\boldsymbol{\Psi}_jY\|^2 = \sum_{k=1}^{d_n}(E\Psi_{jk}Y)^2 = \sum_{k=1}^{d_n}(E\Psi_{jk}m)^2 \le \sum_{k=1}^{d_n}B_1^2E\Psi_{jk}^2 \le B_1^2C_2, \qquad (29)$$
it follows from (25), (26), (28), (29) and the union bound of probability that
$$P\big(|S_2| \ge 2(c_8 + 1)D_1^{-1}C_2^{1/2}B_1d_n^{3/2}\delta/n\big) \le 4d_n\exp\big(-\delta^2/\{2(c_6nd_n^{-1} + c_7\delta)\}\big) + 2d_n^2\exp(-c_4nd_n^{-3}). \qquad (30)$$
Now we bound $S_3$. Note that
$$S_3 = (E\boldsymbol{\Psi}_jY)^T(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}\big\{(E - P_n)\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T\big\}(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}E\boldsymbol{\Psi}_jY. \qquad (31)$$
By the fact that $\|AB\| \le \|A\| \cdot \|B\|$, we have
$$|S_3| \le \|(P_n - E)\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T\| \cdot \|(P_n\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}\| \cdot \|(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}\| \cdot \|E\boldsymbol{\Psi}_jY\|^2. \qquad (32)$$
For any $\|x\| = 1$ and any $d_n$-dimensional square matrix $D$,
$$x^TD^TDx = \sum_i\Big(\sum_j d_{ij}x_j\Big)^2 \le \|D\|_\infty^2\,d_n\sum_{j=1}^{d_n}|x_j|^2 \le d_n^2\|D\|_\infty^2.$$
Therefore, $\|D\| \le d_n\|D\|_\infty$. We conclude that
$$\|(P_n - E)\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T\| \le d_n\|(P_n - E)\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T\|_\infty. \qquad (33)$$
By (20), (26), (29), (32), (33) and the union bound of probability, it follows that
$$P\big(|S_3| \ge (c_8 + 1)D_1^{-2}B_1^2C_2d_n^3\delta/n\big) \le 2d_n^2\exp\big(-\delta^2/\{2(c_6nd_n^{-1} + c_7\delta)\}\big) + 2d_n^2\exp(-c_4nd_n^{-3}). \qquad (34)$$
It follows from (23), (27), (30), (34) and the union bound of probability that, for some positive constants $c_{10}$, $c_{11}$ and $c_{12}$,
$$P\big(\big|\|\hat{f}_{nj}\|_n^2 - \|f_{nj}\|^2\big| \ge c_{10}d_n^2\delta^2/n^2 + c_{11}d_n^{3/2}\delta/n + c_{12}d_n^3\delta/n\big) \le (8d_n + 2d_n^2)\exp\big(-\delta^2/\{2(c_6nd_n^{-1} + c_7\delta)\}\big) + 6d_n^2\exp(-c_4nd_n^{-3}). \qquad (35)$$
In (35), set $c_{10}d_n^2\delta^2/n^2 + c_{11}d_n^{3/2}\delta/n + c_{12}d_n^3\delta/n = c_2d_nn^{-2\kappa}$ for any given $c_2 > 0$, i.e., take $\delta = n^{1-2\kappa}d_n^{-2}c_2/c_{12}$. Then there exist some positive constants $c_3$ and $c_4$ such that
$$P\big(\big|\|\hat{f}_{nj}\|_n^2 - \|f_{nj}\|^2\big| \ge c_2d_nn^{-2\kappa}\big) \le (8d_n + 2d_n^2)\exp(-c_3n^{1-4\kappa}d_n^{-3}) + 6d_n^2\exp(-c_4nd_n^{-3}).$$
The first part then follows from the union bound of probability.

To prove the second part, note that on the event
$$A_n \equiv \Big\{\max_{j\in\mathcal{M}_\star}\big|\|\hat{f}_{nj}\|_n^2 - \|f_{nj}\|^2\big| \le c_1\xi d_nn^{-2\kappa}/2\Big\},$$
by Lemma 1 we have
$$\|\hat{f}_{nj}\|_n^2 \ge c_1\xi d_nn^{-2\kappa}/2, \quad \text{for all } j \in \mathcal{M}_\star. \qquad (36)$$
Hence, by the choice of $\nu_n$, we have $\mathcal{M}_\star \subset \widehat{\mathcal{M}}_{\nu_n}$. The result now follows from a simple union bound:
$$P(A_n^c) \le s_n\big\{(8d_n + 2d_n^2)\exp(-c_3n^{1-4\kappa}d_n^{-3}) + 6d_n^2\exp(-c_4nd_n^{-3})\big\}.$$
This completes the proof.

Proof of Theorem 2. The key idea of the proof is to show that
$$\|E\boldsymbol{\Psi}Y\|^2 = O(\lambda_{\max}(\boldsymbol{\Sigma})). \qquad (37)$$
If so, by the definition and $\|\Psi_{jk}\|_\infty \le 1$, we have
$$\sum_{j=1}^{p_n}\|f_{nj}\|^2 \le \max_{1\le j\le p_n}\lambda_{\max}\{(E\boldsymbol{\Psi}_j\boldsymbol{\Psi}_j^T)^{-1}\}\,\|E\boldsymbol{\Psi}Y\|^2 = O(d_n\lambda_{\max}(\boldsymbol{\Sigma})).$$
This implies that the number of $\{j : \|f_{nj}\|^2 > \varepsilon d_nn^{-2\kappa}\}$ cannot exceed $O\{n^{2\kappa}\lambda_{\max}(\boldsymbol{\Sigma})\}$ for any $\varepsilon > 0$. Thus, on the set
$$B_n = \Big\{\max_{1\le j\le p_n}\big|\|\hat{f}_{nj}\|_n^2 - \|f_{nj}\|^2\big| \le \varepsilon d_nn^{-2\kappa}\Big\},$$
the number of $\{j : \|\hat{f}_{nj}\|_n^2 > 2\varepsilon d_nn^{-2\kappa}\}$ cannot exceed the number of $\{j : \|f_{nj}\|^2 > \varepsilon d_nn^{-2\kappa}\}$, which is bounded by $O\{n^{2\kappa}\lambda_{\max}(\boldsymbol{\Sigma})\}$. By taking $\varepsilon = c_5/2$, we have
$$P\big[|\widehat{\mathcal{M}}_{\nu_n}| \le O\{n^{2\kappa}\lambda_{\max}(\boldsymbol{\Sigma})\}\big] \ge P(B_n).$$
The conclusion follows from Theorem 1(i).

It remains to prove (37). Note that (37) relates to the joint regression rather than the marginal regression. Let
$$\boldsymbol{\alpha}_n = \mathop{\mathrm{argmin}}_{\boldsymbol{\alpha}}\, E(Y - \boldsymbol{\Psi}^T\boldsymbol{\alpha})^2,$$
which is the vector of joint regression coefficients in the population. By the score equation for $\boldsymbol{\alpha}_n$, we get $E\boldsymbol{\Psi}(Y - \boldsymbol{\Psi}^T\boldsymbol{\alpha}_n) = 0$. Hence
$$\|E\boldsymbol{\Psi}Y\|^2 = \boldsymbol{\alpha}_n^T E\boldsymbol{\Psi}\boldsymbol{\Psi}^T E\boldsymbol{\Psi}\boldsymbol{\Psi}^T\boldsymbol{\alpha}_n \le \lambda_{\max}(\boldsymbol{\Sigma})\,\boldsymbol{\alpha}_n^T E\boldsymbol{\Psi}\boldsymbol{\Psi}^T\boldsymbol{\alpha}_n.$$
Now, it follows from the orthogonal decomposition
$$\mathrm{var}(Y) = \mathrm{var}(\boldsymbol{\Psi}^T\boldsymbol{\alpha}_n) + \mathrm{var}(Y - \boldsymbol{\Psi}^T\boldsymbol{\alpha}_n)$$
and $\mathrm{var}(Y) = O(1)$ that $\mathrm{var}(\boldsymbol{\Psi}^T\boldsymbol{\alpha}_n) = O(1)$, i.e., $\boldsymbol{\alpha}_n^T E\boldsymbol{\Psi}\boldsymbol{\Psi}^T\boldsymbol{\alpha}_n = O(1)$. This completes the proof.

References

Antoniadis, A. and Fan, J. (2001). Regularization of wavelet approximations. Journal of the American Statistical Association, 96, 939–967.

Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n (with discussion). The Annals of Statistics, 35, 2313–2404.

Chiang, A. P., Beck, J. S., Yen, H.-J., Tayeh, M. K., Scheetz, T. E., Swiderski, R., Nishimura, D., Braun, T. A., Kim, K.-Y., Huang, J., Elbedour, K., Carmi, R., Slusarski, D. C., Casavant, T. L., Stone, E. M. and Sheffield, V. C. (2006). Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet–Biedl syndrome gene (BBS11). PNAS, 103, 6287–6292.

Draper, N. R. and Smith, H. (1966). Applied Regression Analysis. John Wiley & Sons, New York.

Efroymson, M. A. (1960). Multiple regression analysis.
References

Antoniadis, A. and Fan, J. (2001). Regularization of wavelet approximations. Journal of the American Statistical Association, 96, 939–967.

Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n (with discussion). The Annals of Statistics, 35, 2313–2404.

Chiang, A. P., Beck, J. S., Yen, H.-J., Tayeh, M. K., Scheetz, T. E., Swiderski, R., Nishimura, D., Braun, T. A., Kim, K.-Y., Huang, J., Elbedour, K., Carmi, R., Slusarski, D. C., Casavant, T. L., Stone, E. M. and Sheffield, V. C. (2006). Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet-Biedl syndrome gene (BBS11). PNAS, 103, 6287–6292.

Draper, N. R. and Smith, H. (1966). Applied Regression Analysis. John Wiley & Sons, New York.

Efroymson, M. A. (1960). Multiple regression analysis. In Mathematical Methods for Digital Computers. Wiley, New York, 191–203.

Fan, J. (1997). Comments on "Wavelets in statistics: A review" by A. Antoniadis. Journal of the Italian Statistical Society, 6, 131–138.

Fan, J. and Jiang, J. (2005). Nonparametric inferences for additive models. Journal of the American Statistical Association, 100, 890–907.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society, Series B, 70, 849–911.

Fan, J. and Lv, J. (2009). Non-concave penalized likelihood with NP-dimensionality. Manuscript.

Fan, J., Samworth, R. and Wu, Y. (2009). Ultrahigh dimensional variable selection via independent learning: beyond the linear model. Journal of Machine Learning Research, 10, 1829–1853.

Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics. To appear.

Hall, P. and Miller, H. (2009). Using generalised correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics. To appear.

Hall, P., Titterington, D. and Xue, J. (2009). Tilting methods for assessing the influence of components in a classifier. Journal of the Royal Statistical Society, Series B, 71, 783–803.

Horowitz, J., Klemelä, J. and Mammen, E. (2006). Optimal estimation in additive regression models. Bernoulli, 12, 271–298.

Huang, J., Horowitz, J. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, 36, 587–613.

Huang, J., Horowitz, J. and Wei, F. (2010). Variable selection in nonparametric additive models. The Annals of Statistics, 38, 2282–2313.

Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U. and Speed, T. P. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249–264.

Kim, Y., Kim, J. and Kim, Y. (2006). Blockwise sparse regression. Statistica Sinica, 16, 375–390.

Koltchinskii, V. and Yuan, M. (2008). Sparse recovery in large ensembles of kernel machines. In COLT (eds. R. A. Servedio and T. Zhang), Omnipress, 229–238.

Lin, Y. and Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34, 2272–2297.

Meier, L., van de Geer, S. and Bühlmann, P. (2009). High-dimensional additive modeling. The Annals of Statistics, 37, 3779–3821.

Ravikumar, P., Liu, H., Lafferty, J. and Wasserman, L. (2009). Sparse additive models. Journal of the Royal Statistical Society, Series B, 71, 1009–1030.

Sardy, S. and Tseng, P. (2004). AMlet, RAMlet, and GAMlet: automatic nonlinear fitting of additive models, robust and generalized, with wavelets. Journal of Computational and Graphical Statistics, 13, 283–309.

Scheetz, T. E., Kim, K.-Y. A., Swiderski, R. E., Philp, A. R., Braun, T. A., Knudtson, K. L., Dorrance, A. M., DiBona, G. F., Huang, J., Casavant, T. L.,
Sheffield, V. C. and Stone, E. M. (2006). Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103, 14429–14434.

Silverman, B. (1984). Spline smoothing: the equivalent variable kernel method. The Annals of Statistics, 12, 898–916.

Speckman, P. (1985). Spline smoothing and optimal rates of convergence in nonparametric regression models. The Annals of Statistics, 13, 970–983.

Stone, C. (1985). Additive regression and other nonparametric models. The Annals of Statistics, 13, 689–705.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.

Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104, 1512–1524.

Wei, F. and Huang, J. (2007). Consistent group selection in high-dimensional linear regression. Technical Report No. 387, University of Iowa.

Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68, 49–67.

Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38, 894–942.

Zhao, D. S. and Li, Y. (2010). Principled sure independence screening for Cox models with ultra-high-dimensional covariates. Manuscript.

Zhou, S., Shen, X. and Wolfe, D. A. (1998). Local asymptotics for regression splines and confidence regions. The Annals of Statistics, 26, 1760–1782.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301–320.

Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36, 1509–1533.

A APPENDIX: Tables for Simulation Results of Section 5.3

Table 4: Average values of the numbers of true positives (TP) and false positives (FP), prediction error (PE), and computation time (Time) for Example 6 (t = 0). Robust standard deviations are given in parentheses.
SNR  d_n  Method   TP          FP             PE           Time
0.5    2  INIS     3.96(0.00)   2.28(1.49)    7.74(0.79)    16.09(5.32)
       2  penGAM   4.00(0.00)  27.85(16.98)   8.07(0.92)   354.46(31.48)
       4  INIS     3.93(0.00)   2.29(1.68)    7.90(0.81)    21.68(8.95)
       4  penGAM   3.99(0.00)  25.61(13.62)   8.21(0.84)   421.17(35.71)
       8  INIS     3.81(0.00)   2.59(2.24)    8.16(1.08)    33.10(15.79)
       8  penGAM   3.95(0.00)  34.59(20.34)   8.49(0.82)   484.17(179.70)
      16  INIS     3.38(0.75)   2.02(1.49)    8.60(1.13)    42.69(20.13)
      16  penGAM   3.74(0.00)  33.48(23.88)   9.04(0.93)   685.97(267.43)
1.0    2  INIS     4.00(0.00)   2.16(2.24)    3.98(0.34)    16.03(5.74)
       2  penGAM   4.00(0.00)  26.51(14.18)   4.20(0.46)   284.85(20.30)
       4  INIS     4.00(0.00)   2.08(1.49)    3.97(0.45)    20.80(8.57)
       4  penGAM   4.00(0.00)  28.33(15.49)   4.24(0.47)   362.02(81.43)
       8  INIS     4.00(0.00)   2.72(2.24)    4.04(0.43)    35.79(18.38)
       8  penGAM   4.00(0.00)  36.50(21.83)   4.37(0.47)   427.60(152.53)
      16  INIS     4.00(0.00)   1.80(1.49)    4.26(0.45)    46.81(21.47)
      16  penGAM   4.00(0.00)  38.60(19.78)   4.80(0.57)   595.87(197.06)
2.0    2  INIS     4.00(0.00)   2.03(2.24)    2.12(0.17)    15.92(5.42)
       2  penGAM   4.00(0.00)  25.89(13.06)   2.25(0.24)   235.69(13.32)
       4  INIS     4.00(0.00)   2.38(2.24)    2.06(0.22)    23.54(9.08)
       4  penGAM   4.00(0.00)  30.37(17.16)   2.21(0.26)   341.13(19.44)
       8  INIS     4.00(0.00)   2.79(2.24)    2.03(0.21)    38.56(19.58)
       8  penGAM   4.00(0.00)  38.51(16.42)   2.24(0.26)   396.84(20.51)
      16  INIS     4.00(0.00)   1.77(1.49)    2.17(0.25)    48.40(24.65)
      16  penGAM   4.00(0.00)  42.58(16.60)   2.54(0.30)   540.89(165.39)
4.0    2  INIS     4.00(0.00)   2.06(2.24)    1.19(0.13)    17.74(6.42)
       2  penGAM   4.00(0.00)  28.57(14.37)   1.27(0.15)   213.43(12.09)
       4  INIS     4.00(0.00)   2.33(1.49)    1.09(0.10)    23.28(9.37)
       4  penGAM   4.00(0.00)  30.75(17.35)   1.18(0.14)   300.69(12.21)
       8  INIS     4.00(0.00)   2.88(2.24)    1.02(0.12)    39.21(19.17)
       8  penGAM   4.00(0.00)  40.51(17.54)   1.14(0.14)   340.06(11.49)
      16  INIS     4.00(0.00)   1.72(1.49)    1.10(0.12)    49.79(25.78)
      16  penGAM   4.00(0.00)  45.77(19.03)   1.33(0.16)   481.19(141.51)

Table 5: Average values of the numbers of true positives (TP) and false positives (FP), prediction error (PE), and computation time (Time) for Example 6 (t = 1). Robust standard deviations are given in parentheses.
SNR  d_n  Method   TP          FP             PE           Time
0.5    2  INIS     3.35(0.75)  33.67(8.96)    9.49(1.28)   196.87(91.48)
       2  penGAM   3.10(0.00)  17.74(15.11)   7.92(0.89)  1107.78(385.95)
       4  INIS     3.02(0.00)  20.22(2.43)    8.70(1.14)   109.51(56.11)
       4  penGAM   2.78(0.00)  15.91(10.07)   7.99(0.91)   734.08(227.55)
       8  INIS     2.51(0.75)  10.48(0.75)    8.37(0.89)    65.12(16.64)
       8  penGAM   2.59(0.75)  16.47(9.70)    8.13(0.90)   624.31(56.23)
      16  INIS     2.10(0.00)   4.47(0.75)    8.44(1.00)    46.84(15.61)
      16  penGAM   2.41(0.75)  15.56(10.63)   8.42(0.97)   786.45(244.02)
1.0    2  INIS     3.83(0.00)  32.46(9.70)    4.86(0.60)   164.97(64.14)
       2  penGAM   3.64(0.75)  24.61(21.08)   4.19(0.49)   849.23(294.03)
       4  INIS     3.56(0.75)  20.53(1.68)    4.42(0.52)   118.14(43.97)
       4  penGAM   3.46(0.75)  22.07(16.04)   4.18(0.49)   614.93(97.36)
       8  INIS     3.09(0.00)  10.67(0.75)    4.28(0.49)    71.16(32.10)
       8  penGAM   3.12(0.00)  19.92(10.63)   4.30(0.50)   548.60(33.88)
      16  INIS     2.68(0.75)   4.18(0.75)    4.45(0.52)    46.08(15.35)
      16  penGAM   2.95(0.00)  16.39(11.19)   4.57(0.55)   710.56(199.86)
2.0    2  INIS     3.99(0.00)  29.45(11.57)   2.55(0.38)   139.67(70.45)
       2  penGAM   3.97(0.00)  36.57(22.57)   2.25(0.28)   626.84(210.44)
       4  INIS     3.93(0.00)  19.12(3.73)    2.26(0.24)   111.01(21.82)
       4  penGAM   3.91(0.00)  31.31(20.52)   2.19(0.23)   481.87(52.11)
       8  INIS     3.50(0.75)  10.29(0.75)    2.21(0.23)    78.06(32.23)
       8  penGAM   3.71(0.75)  27.06(19.03)   2.28(0.29)   448.38(26.63)
      16  INIS     2.93(0.00)   4.07(0.00)    2.42(0.32)    51.69(1.10)
      16  penGAM   3.22(0.00)  19.51(12.13)   2.53(0.30)   661.93(46.27)
4.0    2  INIS     4.00(0.00)  29.47(11.38)   1.45(0.21)   144.22(72.54)
       2  penGAM   4.00(0.00)  37.27(20.71)   1.27(0.17)   533.98(69.29)
       4  INIS     3.99(0.00)  17.36(5.22)    1.17(0.12)   102.97(32.71)
       4  penGAM   4.00(0.00)  38.71(20.34)   1.16(0.11)   403.32(28.29)
       8  INIS     3.78(0.00)  10.00(0.00)    1.13(0.16)    88.79(12.02)
       8  penGAM   3.99(0.00)  41.42(15.86)   1.19(0.13)   402.92(16.94)
      16  INIS     3.02(0.00)   3.98(0.00)    1.36(0.15)    49.13(1.85)
      16  penGAM   3.72(0.75)  29.58(19.40)   1.43(0.18)   556.31(35.48)
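The robust standard deviations reported in parentheses in Tables 4 and 5 are a dispersion measure that is less sensitive to outlying replications than the sample standard deviation. A minimal helper is sketched below; the IQR/1.34 convention is an assumption here (it is the one commonly used in the sure-screening literature), as the exact definition is not restated in this appendix.

```python
# A hypothetical helper for the "robust standard deviation" in Tables 4-5.
# Assumption: RSD = interquartile range / 1.34, which approximates the SD
# of a normal sample; the paper's exact definition is not given here.
import numpy as np

def robust_sd(x):
    """Interquartile range divided by 1.34 (approximates SD under normality)."""
    q75, q25 = np.percentile(x, [75, 25])
    return (q75 - q25) / 1.34

# Example: for a standard normal sample, robust_sd should be close to 1.
rng = np.random.default_rng(2)
print(round(robust_sd(rng.normal(size=100_000)), 3))
```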