Robust Parametric Classification and Variable Selection by a Minimum Distance Criterion
Eric C. Chi and David W. Scott

March 15, 2018

Eric C. Chi (E-mail: ecchi@ucla.edu) is Postdoctoral Scholar, Department of Human Genetics, University of California, Los Angeles, CA 90095-7088. David W. Scott (E-mail: scottdw@rice.edu) is Professor, Department of Statistics, Rice University, Houston, TX 77005.

Abstract

We investigate a robust penalized logistic regression algorithm based on a minimum distance criterion. Influential outliers are often associated with the explosion of parameter vector estimates, but in the context of standard logistic regression, the bias due to outliers always causes the parameter vector to implode, that is, to shrink towards the zero vector. Thus, using LASSO-like penalties to perform variable selection in the presence of outliers can result in missed detections of relevant covariates. We show that by choosing a minimum distance criterion together with an Elastic Net penalty, we can simultaneously find a parsimonious model and avoid estimation implosion even in the presence of many outliers in the important small n, large p situation. Minimizing the penalized minimum distance criterion is a challenging problem due to its nonconvexity. To meet the challenge, we develop a simple and efficient MM algorithm that can be adapted gracefully to the small n, large p context. Performance of our algorithm is evaluated on simulated and real data sets. This article has supplementary materials online.

Keywords: Logistic regression, Robust estimation, Implosion breakdown, LASSO, Elastic Net, Majorization-Minimization

1 Introduction

Regression, classification, and variable selection problems in high dimensional data are becoming routine in fields ranging from finance to genomics. In the latter case, technologies such as expression arrays have made it possible to comprehensively query a patient's transcriptional activity at a cellular level. Patterns in these profiles can help refine subtypes of a disease according to sensitivity to treatment options or identify previously unknown genetic components of a disease's pathogenesis. The immediate statistical challenge is finding those patterns when the number of predictors far exceeds the number of samples. To that end, the Least Absolute Shrinkage and Selection Operator (LASSO) has been quite successful at addressing "the small n, big p problem" (Tibshirani, 1996; Chen et al., 1998). Indeed, ℓ1-penalized maximum likelihood model fitting has inspired many related approaches that simultaneously do model fitting and variable selection. These approaches have been extended from linear regression to generalized linear models. In particular, linear models minimizing the logistic deviance loss with an Elastic Net penalty (Zou and Hastie, 2005) have been well studied (Genkin et al., 2007; Liu et al., 2007; Wu et al., 2009; Friedman et al., 2010).

Nonetheless, while ℓ1-penalized maximum likelihood methods have proved their worth at recovering parsimonious models, less attention has been given to extending these methods to handle outliers in high dimensional data. For example, in biological data, tissue samples may be mislabeled or contaminated.
The majority of prior work centers on linear regression (Rosset and Zhu, 2007; Wang et al., 2007; Li et al., 2011; Alfons et al., 2012), although there are a few exceptions. Rosset and Zhu (2007) and Wang, Zhu, and Zou (2008) discuss using a Huberized hinge loss for regularized classification, and van de Geer (2008) studies LASSO penalization of generalized linear models. Nonetheless, with the exception of the ℓ1-penalized least trimmed squares regression procedure of Alfons et al. (2012) and the Huberized hinge loss, these approaches can provide robustness only to outliers in the response variable, not to outliers in the covariates. Moreover, neither paper on the Huberized hinge loss is primarily concerned with robustness. Rosset and Zhu (2007) present impressive general conditions that ensure piecewise linear regularization paths; the Huberized hinge loss is introduced as an illustration and applied on a small example that highlights its prediction accuracy in the presence of a single gross outlier. Despite being introduced as a loss for a robust procedure in Rosset and Zhu (2007), the primary motivation for using the Huberized hinge loss in Wang et al. (2008) is the fast algorithm introduced in Rosset and Zhu (2007) for computing the entire regularization path, not its robustness properties. We will see later that this loss can struggle under a heavy dose of outliers.

Robustness against outlying covariate values warrants further investigation. It is not surprising that outliers may bias estimation. What is less well appreciated is that outliers can strongly influence variable selection. In this paper we identify some circumstances that motivate robust variants of penalized estimation and develop a minimum distance estimator for logistic regression. To address the n ≪ p scenario when predictors are correlated, we add the Elastic Net penalty. We evaluate the performance of our approach through simulated and real data.

Robust methods of logistic regression are not new in the classic n > p case. A broad class of solutions consists of downweighting the contribution of outlying points to the estimating equations. Downweighting can be based on extreme values in covariate space (Künsch et al., 1989; Carroll and Pederson, 1993) or on extreme predicted probabilities (Copas, 1988; Carroll and Pederson, 1993; Bianco and Yohai, 1996). An alternative approach is to use minimum distance estimation (Donoho and Liu, 1988). The minimum distance estimator used in this paper can also be seen as a method that downweights the contributions of outliers (Chi, 2011). The work in Bondell (2005) is similar to ours in that he considered fitting parameters by minimizing a weighted Cramér-von Mises distance. The difference between the approach proposed here and prior work is the application of regularization to handle high dimensional data and perform variable selection in the presence of outliers. Moreover, the robust loss function we propose has a particularly simple form which, when combined with the Elastic Net penalty, can be minimized very efficiently for large problems by solving a series of penalized least squares problems with coordinate descent.

The rest of this paper is organized as follows. In Section 2 we review maximum likelihood estimation (MLE) of the logistic regression model and demonstrate the potentially deleterious effects of outliers on variable selection with the ℓ1-penalized MLE. We introduce our robust loss function in Section 3.
In Section 4 we describe algorithms for fitting our robust logistic regression model. In Sections 5 and 6 we present results on simulated and real data. Section 7 concludes with a summary of our work and future directions.

2 Standard logistic regression and implosion breakdown

Throughout this paper we adopt the following conventions. We assume that the columns of the design matrix X are centered. We overload notation so that if f is a function of a scalar, then f evaluated at a vector or matrix should be interpreted as being evaluated element-wise. For a linear model β₀1 + Xβ we will often employ the compact notations X̃ = (1, X) ∈ R^{n×(p+1)} and θ = (β₀, βᵀ)ᵀ ∈ R^{p+1}.

In binary regression, we seek to predict or explain an observed response y ∈ {0, 1}ⁿ using predictors X ∈ R^{n×p}, where n ≪ p may be expected. In typical expression microarray data we encounter n ≈ 100 and p ≈ 10⁴, while with single nucleotide polymorphism (SNP) array data both n and p may be larger by a factor of 10. Let the conditional probabilities be given by P(Yᵢ = 1 | Xᵢ = xᵢ) = F(x̃ᵢᵀθ), where F(u) = 1/(1 + exp(−u)). Then under this assumption, in standard logistic regression (McCullagh and Nelder, 1989) we minimize the negative log-likelihood of a linear summary of the predictors,

    \mathbf{1}^T \log(\mathbf{1} + \exp(\tilde{X}\theta)) - y^T \tilde{X}\theta.    (2.1)

A simple univariate example illustrates the bias that outliers can introduce into this estimation procedure. In the top panel of Figure 1 we see that the addition of 5 and 10 outliers among the controls shrinks β̂ towards zero. In fact, Croux et al. (2002) showed that with p covariates only 2p such outliers are required to make ‖β̂‖₂ < ε for any desired ε > 0. Our robust estimator, which we introduce in the next section, produces virtually the same fitted curves at all three contamination levels, as shown in the bottom panel of Figure 1.

This "implosion" breakdown phenomenon has implications for LASSO based variable selection. Consider what happens when we add 999 noise covariates, independent of the class labels, to the preceding scenario and perform ℓ1-penalized logistic regression. The top panel of Figure 2 shows the corresponding regularization paths, that is, the values of the fitted regression coefficients as a function of the penalization parameter. As outliers are added, the regularization path for the relevant covariate X₁ quickly falls into the noise. The LASSO performs continuous variable selection by shrinking to zero the regression coefficients of covariates with very low correlation with the responses. If outliers are present in relevant covariates, then the combination of implosion breakdown and soft-thresholding by the LASSO can lead to missed detection of relevant covariates. In contrast, we see in the bottom panel of Figure 2 that the corresponding regularization paths obtained using our robust estimator are insensitive to outliers, so relevant covariates still have a chance of being selected. This simple example highlights the potential importance of penalized robust estimation procedures. In the next section we describe our robust estimator.
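As a concrete illustration of implosion, consider the following minimal R sketch. The constants here are illustrative rather than those used to produce Figure 1: it contaminates a clean univariate logistic sample with mislabeled high-leverage points and compares the fitted slopes.

    set.seed(1)
    n <- 100
    x <- rnorm(n)
    y <- rbinom(n, 1, plogis(x))               # clean data from a logistic model with slope 1
    fit_clean <- glm(y ~ x, family = binomial)

    ## Ten mislabeled points at an extreme covariate value: y = 0 where x is large.
    x_out <- c(x, rep(10, 10))
    y_out <- c(y, rep(0, 10))
    fit_out <- glm(y_out ~ x_out, family = binomial)

    coef(fit_clean)[2]   # near 1
    coef(fit_out)[2]     # shrunk towards 0: implosion rather than explosion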
Figure 1: Univariate regression onto X₁. The dashed line denotes the logistic model that generated the data; the heavy solid line denotes the estimated response. The number of outliers (0, 5, 10) increases from left to right. The first row shows MLE results; the second shows L2E results.

Figure 2: Regularization paths. The heavy line denotes the path for the relevant regression coefficient β₁; J(β) is the 1-norm of β. The number of outliers (0, 5, 10) increases from left to right; 999 irrelevant covariates have been added. The first row shows MLE results; the second shows L2E results.

3 The Minimum Distance Estimator

Let P_θ be a probability mass function (PMF), specified by a parameter θ ∈ Θ ⊂ R^p, believed to be generating data Y₁, ..., Yₙ that take on values in the discrete set χ.
Let P be the unknown true PMF generating the data. If we actually knew the true distribution, an intuitively good solution would be the one that is "closest" to the true distribution. Consequently, as an alternative to using the negative log-likelihood, we consider the L2 distance between P_θ and P. Thus, we pose the following variational optimization problem: we seek θ̂ ∈ Θ that minimizes

    \sum_{y \in \chi} [P_\theta(y) - P(y)]^2.    (3.1)

Although finding such a θ is impossible since P is unknown, it is possible to find a θ that minimizes an unbiased estimate of this distance. Expanding the sum in (3.1) gives us

    \sum_{y \in \chi} P_\theta(y)^2 - 2 \sum_{y \in \chi} P_\theta(y) P(y) + \sum_{y \in \chi} P(y)^2.

The second summation is an expectation E[P_θ(Y)], where Y is a random variable drawn from P; it can be estimated from the data by the sample mean. The third summation does not depend on θ. With these observations in mind, we use the following fully data-based loss function

    L(\theta) = \sum_{y \in \chi} P_\theta(y)^2 - \frac{2}{n} \sum_{i=1}^n P_\theta(y_i)    (3.2)

and seek θ̂ such that L(θ̂) = min_{θ∈Θ} L(θ). The estimate θ̂ is called an L2 estimate, or L2E, in Scott (2001). The above minimization problem is a familiar one associated with bandwidth selection for histograms and, more generally, for kernel density estimators (Scott, 1992). Applying a criterion commonly used in nonparametric density estimation to parametric estimation has the interesting consequence of trading off efficiency with robustness in the estimation procedure. In fact, Basu et al. (1998) previously introduced a family of divergences which includes the L2E as a special case and the MLE as a limiting case. The members of this family of divergences are indexed by a parameter that explicitly trades off efficiency for robustness. The MLE is the most efficient but least robust member of this family of estimation procedures. The L2E represents a reasonable tradeoff between efficiency and robustness. Scott (2001, 2004) demonstrated that the L2E has two benefits, the aforementioned robustness properties and computational tractability. The tradeoff in asymptotic efficiency is similar to that seen in comparing the mean and the median as location estimators. Indeed, while other members of this family may possess a better tradeoff, the L2E has the advantage of admitting a simple and fast computational solution, as we will show in Section 4.

We now show that the L2E method applied to logistic regression amounts to solving a nonlinear least squares problem. We seek to minimize a surrogate measure of the L2 distance between the logistic conditional probability and the conditional probability generating the data. If the xᵢ are unique, then yᵢ ∼ B(1, pᵢ) where pᵢ = F(x̃ᵢᵀθ). The L2E loss for this one sample is pᵢ² + (1 − pᵢ)² − 2[yᵢpᵢ + (1 − yᵢ)(1 − pᵢ)]. Extending to the entire sample, a sensible approach is to minimize the average L2 distance, namely

    \frac{1}{n} \sum_{i=1}^n \left\{ p_i^2 + (1 - p_i)^2 - 2[y_i p_i + (1 - y_i)(1 - p_i)] \right\}.    (3.3)

Up to an additive constant that does not depend on θ, and after dividing by two, the criterion in (3.3) can be compactly written as

    L(y, \tilde{X}\theta) = \frac{1}{n} \| y - F(\tilde{X}\theta) \|_2^2.

Remarkably, minimizing this unassuming loss function produces robust logistic regression coefficients.
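In R, this loss is only a few lines. The following sketch (the function names are ours, for illustration) contrasts it with the negative log-likelihood (2.1):

    ## Logistic L2E loss: (1/n) || y - F(X~ theta) ||_2^2, with theta = (beta0, beta).
    l2e_loss <- function(theta, X, y) {
      u <- theta[1] + drop(X %*% theta[-1])   # linear predictor beta0 + X beta
      mean((y - plogis(u))^2)                 # plogis is the logistic CDF F
    }

    ## Negative log-likelihood of (2.1), scaled by 1/n, for comparison.
    nll_loss <- function(theta, X, y) {
      u <- theta[1] + drop(X %*% theta[-1])
      mean(log1p(exp(u)) - y * u)
    }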
A closer inspection of the estimating equations gives some intuition for the logistic L2E's robustness. A stationary point θ* of the L2E loss satisfies

    0 = \sum_{i=1}^n \gamma_i^* x_i \left[ y_i - F(\tilde{x}_i^T \theta^*) \right],

where γᵢ* = F(x̃ᵢᵀθ*)[1 − F(x̃ᵢᵀθ*)]. Thus, at a stationary point θ*, the discrepancies between observed and fitted values, namely yᵢ − F(x̃ᵢᵀθ*), are small for samples with predicted values that are far from the extreme values of one and zero, namely samples for which γᵢ* is not close to zero. The ith discrepancy is free to be large for samples with predicted values close to zero or one, namely samples for which γᵢ* is close to zero. Very large and small predicted values tend to occur at extreme values of the covariates, given the sigmoid shape of F. Thus, observations that are extreme in the covariate space contribute very little to the estimating equations at θ*. Moreover, we see that the robustness does not rely on F being the logistic link; rather, we just require that F be sigmoidal. Finally, we note that the estimating equations also show us that the L2E is affine equivariant: linear transformations of the covariates change the estimated regression coefficients accordingly, and therefore linear transformations of the covariates do not change the fitted responses. For more in-depth discussion of the theory behind minimum distance estimators like the L2E, we refer readers to the works of Basu et al. (1998) and Donoho and Liu (1988).

Before moving on to discuss our algorithm, we remark that the L2 distance has been used before for classification problems. Kim and Scott (2008, 2010) used the L2 distance to perform classification using kernel density estimates. Their application of the L2 distance, however, is more in line with its customary use in nonparametric density estimation, whereas we use it to robustly fit a parametric model.

4 Estimation with convex quadratic majorizations

We now derive an algorithm for finding the logistic L2E solution by minimizing a series of convex quadratic losses. We minimize the L2E loss with a Majorization-Minimization (MM) algorithm (Lange, Hunter, and Yang, 2000; Hunter and Lange, 2004) because it is numerically stable and easy to implement. Most importantly, our MM algorithm is also easily adapted to handle LASSO-like penalties. The strategy behind MM algorithms is to minimize a surrogate function, the majorization, instead of the original objective function. The surrogate is chosen with two goals in mind. First, an argument that decreases the surrogate should decrease the objective function. Second, the surrogate should be easier to minimize than the objective function. Formally stated, a real-valued function h majorizes a real-valued function g at v if h(u) ≥ g(u) for all u and h(v) = g(v). Given a procedure for constructing a majorization, we can define the MM algorithm for finding a minimizer of a function g as follows. Let v^{(k)} denote the kth iterate: (1) find a majorization h(v; v^{(k)}) of g at v^{(k)}; (2) set v^{(k+1)} = argmin_v h(v; v^{(k)}); and (3) repeat until convergence. This algorithm always takes non-increasing steps with respect to g. By using the MM algorithm, we can convert a hard optimization problem into a series of simpler ones, each of which is easier to minimize than the original.
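A toy example in R makes the mechanics concrete. If a smooth g has curvature bounded by η, then h(u; v) = g(v) + g'(v)(u − v) + (η/2)(u − v)² majorizes g at v, and minimizing h gives the update u ← v − g'(v)/η. The objective below is a stand-in of our choosing, picked only because its curvature bound is obvious:

    g  <- function(u) 1 - cos(u)     # g''(u) = cos(u) <= 1, so eta = 1 suffices
    dg <- function(u) sin(u)
    eta <- 1
    u <- 2.5
    for (k in 1:50) u <- u - dg(u) / eta   # each MM step can only decrease g
    c(u, g(u))                             # u near 0, g(u) near its minimum 0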
To estimate θ̂ such that L(y, X̃θ̂) = min_θ L(y, X̃θ), we rely on the following convex quadratic majorization.

Theorem 4.1. The following function majorizes L(y, X̃θ) at θ̃:

    L(\theta; \tilde{\theta}) = L(y, \tilde{X}\tilde{\theta}) + \frac{2}{n} z_{\tilde{\theta}}^T \tilde{X} (\theta - \tilde{\theta}) + \frac{\eta}{n} \| \tilde{X} (\theta - \tilde{\theta}) \|_2^2,    (4.1)

where z_θ̃ = 2G[F(X̃θ̃) − y], G is diagonal with gᵢᵢ = F(x̃ᵢᵀθ̃)[1 − F(x̃ᵢᵀθ̃)], and η > 0 is sufficiently large.

Using the majorization (4.1) in an MM algorithm results in iterative least squares. A proof of Theorem 4.1 is given in the Supplementary Materials. We are able to find a simple convex quadratic majorization because the logistic L2E loss has bounded curvature. A sharp lower bound on η is given by the maximum curvature of the logistic L2E loss over all parameter values; the bound is derived in the Supplementary Materials. The practical implication is that the parameter η⁻¹ controls the step size of our iterative solver. Consequently, in practice we set η to its lower bound to take the largest steps possible and thereby speed up convergence.

We can express the majorization L(θ; θ̃) in (4.1) as

    L(\theta; \tilde{\theta}) = \eta \left( \tilde{\beta}_0 - \beta_0 - \frac{1}{\eta} \bar{z}_{\tilde{\theta}} \right)^2 + \frac{\eta}{n} \| \zeta(\tilde{\theta}) - X\beta \|_2^2 + K(\tilde{\theta}),

where z̄_θ̃ = n⁻¹1ᵀz_θ̃, ζ(θ̃) = Xβ̃ − η⁻¹(z_θ̃ − z̄_θ̃1), and K(θ̃) is a constant that does not depend on θ. When X is full rank, as is often the case when n > p, the solution to the normal equations is unique and the parameter updates are given by

    \beta_0^{(m+1)} = \beta_0^{(m)} - \eta^{-1} \bar{z}_{\theta^{(m)}},
    \beta^{(m+1)} = \beta^{(m)} - \frac{1}{\eta} (X^T X)^{-1} X^T z_{\theta^{(m)}}.    (4.2)

The descent direction has a simple update since the Hessian approximation is computed only once for all iterations.

The majorization given in Theorem 4.1 can be adapted for regularization. It follows immediately that (1/2)L(θ; θ̃) + λJ(β) majorizes (1/2)L(y, X̃θ) + λJ(β) for a penalty function J: R^p → R₊ and a positive regularization parameter λ. Note that the intercept parameter is not penalized. Regularization is useful for stabilizing estimation procedures. For example, if X is not full rank or has a large condition number, a ridge penalty can salvage the situation. We then seek the minimizer of the following problem:

    \min_{\theta \in \mathbb{R}^{p+1}} \frac{1}{2n} \| y - F(\tilde{X}\theta) \|_2^2 + \frac{\lambda}{2} \|\beta\|_2^2,

which we can solve by minimizing the majorization L(θ; θ̃) + λ‖β‖₂². Since the intercept is not penalized, the intercept update is the same as in (4.2). The update for β becomes

    \beta^{(m+1)} = \beta^{(m)} - \frac{1}{\eta} (X^T X + \lambda I)^{-1} X^T z_{\theta^{(m)}}.    (4.3)

Under suitable regularity conditions, the MM algorithm for solving the ridge penalized logistic L2E problem is guaranteed to converge to a stationary point of L(y, X̃θ) + λ‖β‖₂². This follows from the global convergence properties of MM algorithms that involve continuously differentiable objective and majorization functions (Lange, 2010). On the other hand, the MM algorithm for the unregularized version of the problem is not guaranteed to converge based on the sufficient conditions given in Lange (2010), because the objective function is not coercive (i.e., not all of its level sets are compact) and the quadratic majorization is not strictly convex in θ unless X is full rank. Adding the ridge penalty remedies both situations, and the sufficient conditions for global convergence are met.
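The following R sketch is a direct transcription of updates (4.2) and (4.3) as stated, not a tuned implementation. It assumes a centered design matrix X (full rank when λ = 0) and sets η to the curvature bound of roughly 0.154 derived in the Supplementary Materials; the function name is ours.

    l2e_ridge_mm <- function(X, y, lambda = 0, eta = 0.154, maxit = 500, tol = 1e-8) {
      p <- ncol(X)
      beta0 <- 0; beta <- rep(0, p)
      H <- solve(crossprod(X) + lambda * diag(p))  # Hessian surrogate, factored once
      for (m in seq_len(maxit)) {
        pr <- plogis(beta0 + drop(X %*% beta))     # F(X~ theta)
        z  <- 2 * pr * (1 - pr) * (pr - y)         # z = 2 G [F(X~ theta) - y]
        beta0_new <- beta0 - mean(z) / eta         # intercept update of (4.2)
        beta_new  <- drop(beta - (H %*% crossprod(X, z)) / eta)  # update (4.3)
        done <- max(abs(c(beta0_new - beta0, beta_new - beta))) < tol
        beta0 <- beta0_new; beta <- beta_new
        if (done) break
      }
      c(beta0, beta)
    }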
Another reason to consider regularization is to perform continuous variable selection via a LASSO-like penalty. In particular, consider the penalized majorizer for the L2E loss regularized by the Elastic Net penalty, J(β) = λ(α‖β‖₁ + ((1 − α)/2)‖β‖₂²), where α ∈ [0, 1] is a mixing parameter between the ridge and LASSO penalties. Since our work is motivated by genomic data, which are known to have correlated covariates, we focus on the Elastic Net penalty because it produces sparse models but includes and excludes groups of correlated variables (Zou and Hastie, 2005). The LASSO, in contrast, tends to select one covariate among a group of correlated covariates and exclude the rest. If groupings among the covariates are known in advance, a group LASSO penalty could be used (Yuan and Lin, 2006). The Elastic Net penalty is useful in that it performs group selection without prespecification of the groups. Thus, we are interested in generating MM iterates θ^{(m)} = (β₀^{(m)}, β^{(m)}) where

    \beta_0^{(m+1)} = \beta_0^{(m)} - \eta^{-1} \bar{z}_{\theta^{(m)}},
    \beta^{(m+1)} = \arg\min_{\beta \in \mathbb{R}^p} \frac{\eta}{2n} \| \zeta(\theta^{(m)}) - X\beta \|_2^2 + \lambda \left( \alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|_2^2 \right).    (4.4)

Before discussing how to solve the surrogate minimization problem in practice, note that regardless of how (4.4) is solved, we have the following guarantee on the convergence of the MM iterates.

Theorem 4.2. Under suitable regularity conditions, for any starting point θ^{(0)}, the sequence of iterates θ^{(1)}, θ^{(2)}, ... generated by (4.4) converges to a stationary point of

    \frac{1}{2n} \| y - F(\tilde{X}\theta) \|_2^2 + \lambda \left( \alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|_2^2 \right),

where λ > 0 and α ∈ [0, 1).

A proof is given in the Supplementary Materials and relies on an extension of the global convergence properties of MM algorithms for locally Lipschitz continuous objective and majorization functions (Schifano et al., 2010). Note that Theorem 4.2 restricts α < 1; that is, algorithmic convergence of the LASSO regularized logistic L2E is not guaranteed. This condition is imposed to ensure that the majorization is strictly convex in β. In our experience, the LASSO regularized logistic L2E does not have algorithmic convergence issues in practice. As a final remark on algorithmic convergence, note that since the ridge penalty is a special case of the Elastic Net, Theorem 4.2 implies that the ridge penalized logistic L2E iteration (4.3) will also converge.

To solve (4.4) we turn to coordinate descent, which has been shown to efficiently solve penalized regression problems when selecting relatively few groups of correlated predictors (Friedman, Hastie, Höfling, and Tibshirani, 2007; Wu and Lange, 2008). Coordinate descent is a special case of block relaxation optimization where, in a round-robin fashion, we optimize the objective function with respect to one coordinate at a time while holding all other coordinates fixed. The jth coordinate update during the kth round of coordinate descent of the mth MM iteration, denoted β_j^{(m,k)}, has a simple form (Donoho and Johnstone, 1995) and is given by the subgradient equations to be

    \beta_j^{(m,k)} = \frac{ S\left( \frac{\eta}{n} x_{(j)}^T r^{(m,k,j)}, \, \lambda\alpha \right) }{ \frac{\eta}{n} \| x_{(j)} \|_2^2 + \lambda(1-\alpha) },

where x_{(j)} denotes the jth column of X and r^{(m,k,j)} is a vector of partial residuals with ith entry

    r_i^{(m,k,j)} = \zeta_i(\theta^{(m)}) - \left( \sum_{j'=1}^{j-1} x_{ij'} \beta_{j'}^{(m,k)} + \sum_{j'=j+1}^{p} x_{ij'} \beta_{j'}^{(m,k-1)} \right),

and S is the soft-threshold function S(a, λ) = sign(a) max(|a| − λ, 0).
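A sketch of this inner solve in R follows; ζ, η, and n are as in the text, and the function names are ours:

    soft <- function(a, t) sign(a) * pmax(abs(a) - t, 0)   # S(a, lambda)

    ## One full sweep of coordinate descent for the penalized majorization (4.4).
    cd_sweep <- function(X, zeta, beta, lambda, alpha, eta, n) {
      for (j in seq_len(ncol(X))) {
        r <- zeta - (drop(X %*% beta) - X[, j] * beta[j])  # partial residuals
        beta[j] <- soft((eta / n) * sum(X[, j] * r), lambda * alpha) /
                   ((eta / n) * sum(X[, j]^2) + lambda * (1 - alpha))
      }
      beta
    }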
Additional details on how coordinate descent is nested within the MM steps, and how convergence is evaluated, can be found in the Supplementary Materials.

5 Simulations

In this section we report on three simulations comparing the MLE and L2E results. The first two simulations examine the accuracy of estimation. We then follow with a simulation experiment designed to examine the variable selection properties. For the first two simulations we generated 1000 data sets, each with 200 binary outcomes associated with 4 covariates, from the logistic model specified by the likelihood in (2.1) with parameters β₀ = 0 and β = (1, 0.5, 1, 2)ᵀ. The covariates xᵢ were drawn from one of two populations. For i = 1, ..., 100, the xᵢ are i.i.d. samples from N(μ, 0.16 I_p), and for i = 101, ..., 200 they are i.i.d. samples from N(−μ, 0.16 I_p), where p = 4 and μ = (0.25, 0.25, 0.25, 0.25)ᵀ. The responses were generated independently as yᵢ ∼ B(1, F(xᵢᵀβ)).

5.1 Estimation in Low Dimensions

In the first scenario, we added a single outlier (y₂₀₁, x₂₀₁), where y₂₀₁ = 0 and x₂₀₁ = (δ, δ, δ, δ)ᵀ, and δ took on values in {−0.25, 1.5, 3, 6, 12, 24}. In words, the 201st point was moved in covariate space along the line that runs through the centroids of the two subpopulations. In the second scenario, we added a variable number N of outliers at a single location: {(yᵢ, xᵢ)} for i = 201, ..., 200 + N, where yᵢ = 0 and xᵢ = (3, 3, 3, 3)ᵀ, and the number of outliers is N = 0, 1, 5, 10, 15, 20. For each sequence of scenarios described, we performed logistic regression and L2E regression.

Figures 3 and 4 summarize the results of the first and second scenarios, respectively. The results show two features of the L2E relative to the MLE. Consider the first scenario. Figure 3 shows how ‖β̂‖₂ under each estimation procedure varies as the outlier is moved. The MLE values suffer from implosion breakdown as the 201st point is moved from −0.25 to 24; that is, ‖β̂‖₂ tends towards 0 as the leverage of the 201st point increases. In contrast, the L2E is insensitive to the placement of the 201st point. The second observation is that the L2E's unbiasedness comes at the cost of increased variance: the L2E's spread is greater than the MLE's for all locations of the outlier. Similar behavior is observed in the second scenario. Figure 4 shows that implosion breakdown ensues as outliers are added at a fixed position. Detailed numerical summaries of the fitted coefficients (sample mean, standard deviation, estimated mean squared error) for these experiments can be found in the Supplementary Materials.
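For reference, the data-generating mechanism of these two scenarios can be sketched in R as follows; the contamination shown is the second scenario with N = 10 (the variable names X_c and y_c are ours):

    set.seed(1)
    p <- 4; beta <- c(1, 0.5, 1, 2)                               # beta0 = 0
    X <- rbind(matrix(rnorm(100 * p, mean =  0.25, sd = 0.4), 100, p),
               matrix(rnorm(100 * p, mean = -0.25, sd = 0.4), 100, p))
    y <- rbinom(200, 1, plogis(drop(X %*% beta)))                 # 0.16 I_p => sd 0.4

    ## Second scenario: N outliers at a single location, x = (3,3,3,3) and y = 0.
    N <- 10
    X_c <- rbind(X, matrix(3, N, p))
    y_c <- c(y, rep(0, N))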
Figure 3: The 2-norm of the regression coefficients (intercept not included) as a function of the position of the single outlier.

Figure 4: The 2-norm of the regression coefficients (intercept not included) as a function of the number of outliers at a fixed position.

5.2 Variable Selection in High Dimensions

In the variable selection experiment we considered a high dimensional variation on the first scenario. We generated 10 data sets, each with n = 500 observations. The covariates were drawn from one of three multivariate normal populations. For i = 1, ..., 200, the xᵢ are i.i.d. samples from N(μ, 0.75 I_p). For i = 201, ..., 400, the xᵢ are i.i.d. samples from N(−μ, 0.75 I_p). For i = 401, ..., 500, the xᵢ are i.i.d. samples from N(ν, 0.25 I_p), where p = 500, μᵢ = 0.3 for i = 1, ..., 50 and μᵢ = 0 for i = 51, ..., 500, and νᵢ = 1 for i = 1, ..., 50 and νᵢ = 0 for i = 51, ..., 500. For i = 1, ..., 400, the responses were generated independently as yᵢ ∼ B(1, F(xᵢᵀβ)), where β₀ = 0 and β ∈ R⁵⁰⁰ with βᵢ = 1 for i = 1, ..., 50 and βᵢ = 0 for i = 51, ..., 500. For i = 401, ..., 500, the responses were set to yᵢ = 0. We then performed Elastic Net penalized regression (α = 0.6) with the MLE and the L2E.

Before continuing, we note that there are two practical issues to address: how to choose initial starting points, since the optimization problem is not convex, and how to choose the amount of penalization. In the Supplementary Materials we describe in detail a heuristic for choosing the initial starting point, based on the Karush-Kuhn-Tucker conditions of the optimization problem, as well as a robust cross-validation scheme for choosing the regularization parameter λ.

To perform the Elastic Net penalized logistic regression we used the glmnet package in R (Friedman et al., 2010). We also compared against the robust classifier of Wang et al. (2008), the Hybrid Huberized Support Vector Machine (HHSVM), fit with an MM algorithm. Wang et al. (2008) provide details of the implementation and code for computing the solution paths of the HHSVM. However, their algorithm calculates the paths for a varying LASSO regularization parameter with a fixed ridge regularization parameter, because those paths can be computed quickly by exploiting their piecewise linearity under that parameterization of the Elastic Net. Our HHSVM implementation calculates regularization paths using the Elastic Net parameterization used in this article. Details on our implementation can be found in the Supplementary Materials.

Tables 1 and 2 show the number of true positives and false positives, respectively, for each method. We see that in scenarios of heavy contamination the L2E demonstrates superior sensitivity and specificity compared to both the MLE and the HHSVM. It is interesting to note that the MLE tends to be more sensitive than the HHSVM, but at the cost of being drastically less specific.
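For the MLE comparator, the glmnet call is standard; a minimal sketch follows, reusing the contaminated data (X_c, y_c) from the Section 5.1 sketch purely for illustration. Note that cv.glmnet performs ordinary, non-robust cross-validation; the robust scheme we use for the L2E is described in the Supplementary Materials.

    library(glmnet)
    ## Elastic Net penalized logistic MLE with mixing parameter alpha = 0.6;
    ## glmnet's penalty lambda * (alpha*||beta||_1 + (1-alpha)/2*||beta||_2^2)
    ## matches the parameterization used in this article.
    fit <- glmnet(X_c, y_c, family = "binomial", alpha = 0.6)
    cv  <- cv.glmnet(X_c, y_c, family = "binomial", alpha = 0.6)
    coef(fit, s = cv$lambda.min)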
For a closer look comparing the three methods, the cross-validation curves and regularization paths for a replicate can be found in the Supplementary Materials.

Table 1: True positive count with n = p = 500 and 50 nonzero covariates. L2E is the most sensitive method; HHSVM is the least sensitive method.

    Replicate    1    2    3    4    5    6    7    8    9   10
    MLE         14   10    8   10    1   10    0   14   11   15
    HHSVM        1    3    2    2    1    2    1    2    4    2
    L2E         48   47   48   49   48   48   49   46   48   49

Table 2: False positive count with n = p = 500 and 50 nonzero covariates. L2E is the most specific method; MLE is the least specific method.

    Replicate    1    2    3    4    5    6    7    8    9   10
    MLE        141   95   56  148    0  141    0  128  136  170
    HHSVM        0    4    1    1    1    0    1    0    0    0
    L2E          0    0    2    0    0    0    1    1    0    1

6 Real data examples

6.1 An n > p example: Predicting abnormal and normal vertebral columns

We first consider a real data set in the n > p regime. We present results on the vertebral column data set from the UCI machine learning repository, as described by Frank and Asuncion (2010). The data set consists of 310 patients, each classified as belonging to one of three groups: Normal (100 patients), Disk Hernia (60 patients), or Spondylolisthesis (150 patients). In addition to a classification label, six predictor variables are recorded for each patient: pelvic incidence (PI), pelvic tilt (PT), lumbar lordosis angle (LLA), sacral slope (SS), pelvic radius (PR), and grade of spondylolisthesis (GS). All six predictor variables are continuous valued.

Table 3: Correlations among the six biomechanical attributes in the vertebral column data set.

          PI     PT    LLA     SS     PR     GS
    PI  1.00   0.63   0.72   0.81  -0.25   0.64
    PT     -   1.00   0.43   0.06   0.03   0.40
    LLA    -      -   1.00   0.60  -0.08   0.53
    SS     -      -      -   1.00  -0.34   0.52
    PR     -      -      -      -   1.00  -0.03
    GS     -      -      -      -      -   1.00

We consider the two-class problem of discriminating normal vertebral columns from abnormal ones (Disk Hernia and Spondylolisthesis). Figure 5 plots the values of the individual covariates for each patient. Table 3 shows the correlations between pairs of attributes. Note that the attributes for Disk Hernia and Normal patients overlap a good deal. We may expect results similar to those of the second simulation scenario described in Section 5.1, with the Disk Hernia patients playing the role of a cluster of outlying observations. Due to the correlation, however, the outlying observations are not as distinctly outlying as in the simulation examples of Section 5.1. Consequently, it might also be anticipated that there will not be differences between the MLE and L2E regularization paths. Indeed, Figure 6 shows the resulting regularization paths generated by the MLE and the logistic L2E for α = 0.2. The paths are very similar for both methods at other values of α as well and are not shown. Different initial starting points did not change the resulting logistic L2E regularization paths.

6.2 An n ≪ p example: A genome wide association study

We examine the lung cancer data of Amos et al. (2008). The purpose of this genome wide association study was to identify risk variants for lung cancer. The authors employed a two-stage study using 315,450 tagging SNPs in 1,154 current and former (ever) smokers of European ancestry and 1,137 frequency matched, ever-smoking controls from Houston, Texas in the discovery stage.
Figure 5: Dot plots of biomechanical attribute values for patients belonging to one of three classes. Patients are randomly ordered within their classes. The attributes are pelvic incidence (PI), pelvic tilt (PT), lumbar lordosis angle (LLA), sacral slope (SS), pelvic radius (PR), and grade of spondylolisthesis (GS). The three underlying classes are Disk Hernia (DH), Spondylolisthesis (SL), and Normal (NO). DH and SL are lumped into the observed class Abnormal. Patients with SL (61 to 210) occupy the plot within the lightly shaded band.

Figure 6: The regularization paths (α = 0.2) for the MLE and L2E are very similar for the six biomechanical attributes in the vertebral column data set.

The most significant SNPs found in the discovery phase were then tested in a larger replication set. Two SNPs on chromosome 15, rs1051730 and rs8034191, were found to be significantly associated with lung cancer risk in the validation set. SNP markers can have a high degree of collinearity due to recombination mechanics. SNPs that are physically close to each other tend to be highly correlated and are said to be in linkage disequilibrium. The pair rs1051730 and rs8034191, for example, are in "high" linkage disequilibrium.

In this section we reexamine the discovery data using the logistic L2E and the logistic MLE. Note that it is current practice among geneticists to do univariate inference with an adjustment for multiple testing, and this approach was taken in Amos et al. (2008). Taking a multivariate approach, as is done in this section, however, allows the analyst to take into account dependencies between the SNPs. As an initial comparison we consider a subset of the entire data set and restrict our analysis to SNPs on chromosome 15. We impute missing genotypes at a SNP using the MACH 1.0 package, a Markov chain based haplotyper (Li, Ding, and Abecasis, 2006). After missing data are imputed, and keeping only imputations with a quality score of at least 0.9, 8,701 SNPs are retained on 1,152 cases and 1,136 controls.

Figure 7 summarizes the variable selection results for the logistic L2E and MLE for α = 0.05, 0.5, and 0.95. There are three things to note. First, the regularization paths for the L2E and MLE are almost identical. Second, both methods produce regularization paths that identify rs1051730 (light thick line) and rs8034191 (dark thick line) as having the greatest partial correlation with the case/control status.
Third, the paths for rs1051730 and rs8034191 behave as would be expected with α. For small α, a more ridge-like penalty, the two paths become more similar. For large α, a more LASSO-like penalty, only one of the two correlated predictors enters the model while the other is excluded.

Figure 7: Regularization paths of regression coefficients of SNP markers on chromosome 15 for the L2E and MLE for α = 0.05, 0.5, and 0.95. The regularization paths for rs1051730 are in light thick lines; the paths for rs8034191 are in dark thick lines. The L2E and MLE paths are nearly identical. For α = 0.95, i.e., nearly LASSO regression, rs8034191 was not selected over the shown range of penalizations by either method.

7 Discussion

Outliers can introduce bias in some commonly used maximum likelihood estimation procedures. This well known fact, however, warrants attention because the bias can have material effects on the ubiquitous LASSO-based variable selection procedures. In the context of standard logistic regression, influential outliers cause implosion breakdown. In this paper we have demonstrated that the combination of implosion breakdown and the soft-thresholding mechanism of LASSO variable selection can lead to missed detection of relevant predictors.

To guard against the undue influence of outliers on estimation and variable selection for binary responses, we propose a robust method for performing sparse logistic regression. Our method is based on minimizing the estimated L2 distance between the logistic parametric model and the underlying true conditional distribution. The resulting optimization problem is a penalized nonlinear least squares problem, which we solve with an MM algorithm. Our MM algorithm in turn reduces the optimization problem to solving a series of penalized least squares problems, whose solution paths can be computed very efficiently with coordinate descent and warm starts.

Although we present our work as a method for robust binary logistic regression, it immediately extends to other related contexts. Our algorithm can be extended to handle more than two classes. The generalization to the K-class multinomial is straightforward:

    L(Y, \tilde{X}\Theta) = \sum_{k=1}^K \| y_k - F_k(\tilde{X}\Theta) \|_2^2,

where y_{ik} = 1 if the ith observation belongs to class k and 0 otherwise, and the ith element of the vector F_k(X̃Θ) is given by

    \frac{ \exp(\tilde{x}_i^T \theta_k) }{ 1 + \sum_{j=1}^K \exp(\tilde{x}_i^T \theta_j) }.

This nonlinear least squares problem also has bounded curvature and consequently can also be solved by minimizing a sequence of LASSO-penalized least squares problems.
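A sketch of this multinomial loss in R follows; Y is an n × K indicator matrix, Theta is a (p+1) × K coefficient matrix, and the names are illustrative:

    multinomial_l2e <- function(Theta, Xt, Y) {   # Xt = cbind(1, X)
      E <- exp(Xt %*% Theta)                      # exp(x~_i' theta_k), n x K
      P <- E / (1 + rowSums(E))                   # row i holds F_k(X~ Theta), k = 1..K
      sum((Y - P)^2)                              # sum_k || y_k - F_k(X~ Theta) ||_2^2
    }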
Our algorithm can also be used as a subroutine in performing robust binary principal component analysis and, more generally, robust binary tensor decompositions. A common strategy in array decompositions for multiway data, including multiway binary data, is to use block coordinate descent or alternating minimization (Collins, Dasgupta, and Schapire, 2001; Kolda and Bader, 2009; Lee, Huang, and Hu, 2010). For binary multiway data, each block minimization would perform a batch of independent robust logistic regressions.

We want to make clear that the logistic L2E is not a competitor to the MLE but rather a complement. Both methods are computationally feasible and can be run on data together. As seen in the real data examples of Section 6, sometimes the logistic L2E recovers the MLE solution. On the other hand, when discrepancies do occur, taking the MLE and L2E solutions together can provide insight into the data that would be harder to identify with the MLE solution alone.

We close with some interesting directions for future work. We have seen that LASSO-based variable selection in the presence of implosion breakdown can lead to missed detection of relevant predictors. This motivates the question of whether explosion breakdown can lead to the inclusion of irrelevant predictors. Finally, with respect to convergence issues of our algorithm, while we have established conditions under which our algorithm is guaranteed to converge to a stationary point, we do not have rigorous results on the rate at which it does so. As a complement to methods that may be sensitive to the presence of outliers, characterizing the convergence speed of our algorithm has a great deal of practical importance.

SUPPLEMENTAL MATERIALS

Algorithm details, simulation results, proofs, and derivations: The Supplementary Materials include additional details on the algorithm (e.g., choosing initial starting points, stopping criteria, and choosing regularization parameters), additional results from the estimation experiments in Section 5.1 and the variable selection experiments in Section 5.2, proofs of Theorems 4.1 and 4.2, and a derivation of our HHSVM algorithm. (Supplement.pdf)

Code: C and R code used to generate the results shown in the article, along with relevant data, have also been made available. A readme file details how to compile and run the code. The SNP data are not included for confidentiality reasons. (GNU zipped tar file)

ACKNOWLEDGMENTS

The authors thank Christopher Amos for generously allowing them to work with the lung cancer data set. All plots were made using the open source R package ggplot2 (Wickham, 2009). Eric Chi was supported by grant DE-FG02-97ER25308 from the Department of Energy. David Scott was supported in part by grant DMS-09-07491 from the National Science Foundation.

Supplementary Materials

8 Proofs

8.1 Proof of Theorem 4.1

It is immediate that L(θ̃; θ̃) = L(y, X̃θ̃). We turn our attention to proving that L(θ; θ̃) ≥ L(y, X̃θ) for all θ, θ̃ ∈ R^{p+1}. Since L(y, X̃θ) has bounded curvature, our strategy is to represent L(y, X̃θ) by its exact second order Taylor expansion about θ̃ and then find a tight uniform bound on the quadratic term in the expansion. This approach applies in general to functions with continuous second derivatives and bounded curvature (Böhning and Lindsay, 1988). The exact second order Taylor expansion of L(y, X̃θ) at θ̃ is given by

    L(y, \tilde{X}\theta) = L(y, \tilde{X}\tilde{\theta}) + (\theta - \tilde{\theta})^T \nabla L(y, \tilde{X}\tilde{\theta}) + \frac{1}{2} (\theta - \tilde{\theta})^T H_{\theta^*} (\theta - \tilde{\theta}),

where θ* = γθ̃ + (1 − γ)θ for some γ ∈ (0, 1) and

    \nabla L(y, \tilde{X}\theta) = 4n^{-1} X^T G (p - y),
    H_\theta = \frac{2}{n} X^T M_\theta X,
    G = \mathrm{diag}\{ p_1(1-p_1), \ldots, p_n(1-p_n) \},
    M_\theta = \mathrm{diag}\{ \psi_{u_1}(p_1), \ldots, \psi_{u_n}(p_n) \},
    u = 2y - 1,
    p = F(\tilde{X}\theta),
    \psi_u(p) = [2p(1-p) - (2p-1)((2p-1) - u)] \, p(1-p).

Note that (M_θ)ᵢᵢ is bounded from above, i.e., sup_{θ∈Θ} (M_θ)ᵢᵢ < ∞.
We now introduce a surrogate function:

    L(\theta; \tilde{\theta}) = L(y, \tilde{X}\tilde{\theta}) + \frac{4}{n} (\theta - \tilde{\theta})^T X^T G (F(\tilde{X}\tilde{\theta}) - y) + \frac{\eta}{n} (\theta - \tilde{\theta})^T X^T X (\theta - \tilde{\theta}),

where

    \eta \geq \max\left\{ \sup_{p \in [0,1]} \psi_{-1}(p), \; \sup_{p \in [0,1]} \psi_1(p) \right\}.

Note that for any θ ∈ R^{p+1}, (M_θ)ᵢᵢ ≤ η. Therefore,

    (\theta - \tilde{\theta})^T H_{\theta^*} (\theta - \tilde{\theta}) = \frac{2}{n} (\theta - \tilde{\theta})^T X^T M_{\theta^*} X (\theta - \tilde{\theta}) \leq \frac{2\eta}{n} (\theta - \tilde{\theta})^T X^T X (\theta - \tilde{\theta}),

and consequently L(θ; θ̃) majorizes L(y, X̃θ) at θ̃.

The following observations lead to a simpler lower bound on η. Note that

    \sup_{p \in [0,1]} \psi_{-1}(p) = \sup_{p \in [0,1]} \psi_1(p),

since ψ₋₁(p) = ψ₁(1 − p). So the lower bound on η can be expressed more simply as

    \sup_{p \in [0,1]} \psi_1(p) = \max_{p \in [0,1]} \psi_1(p) = \frac{1}{4} \max_{q \in [-1,1]} \left( \frac{3}{2} q^4 - q^3 - 2q^2 + q + \frac{1}{2} \right).    (8.1)

The first equality follows from the compactness of [0, 1] and the continuity of ψ₁(p). The second equality follows from reparameterizing ψ₁(p) in terms of q = 2p − 1. Since the derivative of the polynomial in (8.1) has a root at 1, it is straightforward to argue that the lower bound on η is attained at the second largest root, which is (−3 + √33)/12. Thus, the majorization holds so long as

    \eta \geq \frac{1}{4} \left( \frac{3}{2} q^4 - q^3 - 2q^2 + q + \frac{1}{2} \right) \bigg|_{q = (-3 + \sqrt{33})/12} \approx 0.154.
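The bound is easy to verify numerically in R; cf. (8.1):

    ## Maximize (1/4)(3/2 q^4 - q^3 - 2 q^2 + q + 1/2) over q in [-1, 1].
    poly_q <- function(q) (1.5 * q^4 - q^3 - 2 * q^2 + q + 0.5) / 4
    q_star <- (-3 + sqrt(33)) / 12
    poly_q(q_star)                                        # approximately 0.154
    optimize(poly_q, c(-1, 1), maximum = TRUE)$maximum    # agrees with q_star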
8.2 Proof of Theorem 4.2

A key condition in MM algorithm convergence proofs is coerciveness, since it is a sufficient condition for the existence of a global minimum. Recall that a continuous function f: U ⊂ Rⁿ → R is coercive if all its level sets S_t = {x ∈ U : f(x) ≤ t} are compact. We will use the MM algorithm global convergence results in Schifano et al. (2010). Let ξ(θ) denote the objective function and let ξ^{[S]}(θ, θ̃) denote a surrogate objective function that will be minimized with respect to its first argument in lieu of ξ(θ). The iteration map φ is given by

    \varphi(\tilde{\theta}) = \arg\min_{\theta} \xi^{[S]}(\theta, \tilde{\theta}).

We now state a slightly less general set of regularity conditions than those in Schifano et al. (2010) that are sufficient for our purposes. Suppose ξ, ξ^{[S]}, and φ satisfy the following conditions:

R1. The objective function ξ(θ) is locally Lipschitz continuous for θ ∈ Θ and coercive. The set of stationary points S of ξ(θ) is a finite set, where the notion of a stationary point is defined as in Clarke (1983).

R2. ξ(θ) = ξ^{[S]}(θ, θ) for all θ ∈ Θ.

R3. ξ^{[S]}(θ, θ) < ξ^{[S]}(θ, θ̃) for all θ, θ̃ ∈ Θ with θ ≠ θ̃.

R4. ξ^{[S]}(θ, θ̃) is continuous for (θ, θ̃) ∈ Θ × Θ and locally Lipschitz in Θ.

R5. φ(θ) is a singleton set consisting of one bounded vector for θ ∈ Θ.

Then {θ^{(n)}, n ≥ 0} converges to a fixed point of the iteration map φ. By Proposition A.8 in Schifano et al. (2010), the fixed points of φ coincide with S. In our case we have the following objective and surrogate functions:

    \xi(\theta) = \frac{1}{2n} \| y - F(\tilde{X}\theta) \|_2^2 + \lambda \left( \alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|_2^2 \right),
    \xi^{[S]}(\theta, \tilde{\theta}) = \frac{1}{2} L(\theta; \tilde{\theta}) + \lambda \left( \alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|_2^2 \right).

We check each regularity condition in turn.

R1. Since ‖y − F(X̃θ)‖₂² is bounded below and the penalty term is coercive, ξ(θ) is coercive. Recall that the gradient of L(y, X̃θ) is (4/n) XᵀG(F(X̃θ) − y). The norm of the gradient is bounded; specifically, it is no greater than 2σ₁², where σ₁ is the largest singular value of X. Therefore, L(y, X̃θ) is Lipschitz continuous and hence locally Lipschitz continuous. Consequently, ξ(θ) is locally Lipschitz continuous. If the set of stationary points of ξ(θ) is finite, then R1 is met.

R2 and R3. Recall that the majorization we are using is given by

    L(\theta; \tilde{\theta}) = L(y, \tilde{X}\tilde{\theta}) + (\theta - \tilde{\theta})^T \nabla L(y, \tilde{X}\tilde{\theta}) + \frac{\eta}{n} (\theta - \tilde{\theta})^T X^T X (\theta - \tilde{\theta}),

where

    \eta > \frac{1}{4} \max_{q \in [-1,1]} \left( \frac{3}{2} q^4 - q^3 - 2q^2 + q + \frac{1}{2} \right).

To ensure that the majorization is strict, we need the inequality to be strict. Then the curvature of the majorization exceeds the maximum curvature of L(y, X̃θ), and the majorization is strict. R2 and R3 are met.

R4. The penalized majorization is a sum of continuous functions in (θ, θ̃) ∈ Θ × Θ and is consequently continuous. The penalized majorization as a function of its first argument is the sum of a positive definite quadratic function and the 1-norm function, both of which are locally Lipschitz continuous, so their sum is locally Lipschitz continuous. R4 is met.

R5. If λ(1 − α) > 0, then ξ^{[S]}(θ, θ̃) is strictly convex in θ and thus has at most one global minimizer. Since ξ^{[S]}(θ, θ̃) is also coercive in θ, it has at least one global minimizer. R5 is met.

Thus, Algorithm 1 will converge to a stationary point of ξ(θ), provided that there are only finitely many stationary points and the coordinate descent minimization of the Elastic Net penalized quadratic majorization is solved exactly.

Remark 1. If ξ does not have finitely many stationary points, it can be shown that the limit points of the sequence of iterates are stationary points and that the set of limit points is connected (Schifano et al., 2010; Chi, 2011).

Remark 2. The iterate update θ^{(m+1)} = φ(θ^{(m)}) can be accomplished by any algorithmic means, so long as the global minimum of the majorization is found. Iterates of coordinate descent are guaranteed to converge to a global minimizer provided that the loss is differentiable and convex and the penalty is convex and separable (Tseng, 2001). Thus, applying coordinate descent to the Elastic Net penalized quadratic majorization will find the global minimum.

Remark 3. Our definition of stationary points has to change because the objective functions of interest are locally Lipschitz continuous and therefore differentiable almost everywhere, i.e., everywhere except on a set of Lebesgue measure zero. Clarke (1983) defines and proves properties of a generalized gradient for locally Lipschitz functions. Apart from pathological cases, when a function is convex the generalized gradient is the subdifferential; see Proposition 2.2.7 in Clarke (1983). When a function is differentiable, the generalized gradient is the gradient. Thus, as would be expected, a point x is a stationary point of a locally Lipschitz function if the function's generalized gradient at x contains 0.

9 Algorithm Details

Algorithm 1 gives pseudocode for the resulting iterative solver for a given pair of parameters α and λ. The symbol ∗ denotes the Hadamard element-wise product.

Algorithm 1: Iterative L2E solver
    θ ← initial guess
    repeat
        p ← F(X̃θ)
        G ← diag{p ∗ (1 − p)}
        z ← 2G(p − y)
        ζ ← Xβ − (1/η)(z − z̄1)
        β₀ ← β₀ − η⁻¹z̄
        repeat
            for k = 1, ..., p do
                r ← ζ − (Xβ − β_k x_k)
                β_k ← S((η/n) x_kᵀ r, λα) / ((η/n)‖x_k‖₂² + λ(1 − α))
            end for
        until convergence
    until convergence
    return θ

In practice we also use active sets to speed up computations. That is, for a given initial β, we only update the nonzero coordinates of β, the active set, until there is little change in the active set parameter estimates.
The non-active set parameter estimates are then updated once. If they remain zero, the Karush-Kuhn-Tucker (KKT) conditions have been met and a global minimum of (4.4) has been found. If not, then the active set is expanded to include the coordinates whose KKT conditions have been violated, and the process is repeated.

9.1 Choosing the penalty parameters

9.1.1 Warm Starts and Calculating Regularization Paths

We will need to compare the regression coefficients obtained at many values of the penalty parameter λ in order to perform model selection. Typically we can rapidly calculate regression coefficients for a decreasing sequence of values of λ through warm starts. Namely, the solution to the problem using λ_k as a regularization parameter is used as the initial starting value for the iterative algorithm applied to the subsequent problem using λ_{k+1} as a regularization parameter. The idea is that if λ_k and λ_{k+1} are not too far apart, the solutions to their corresponding optimization problems will be close to each other. Thus, the solution of one optimization problem will be a very good initial starting point for the succeeding optimization problem.

For λ sufficiently large, only the intercept term β₀ will enter the model. The smallest λ* such that all regression coefficients are shrunk to zero is given by

    \lambda^* = \frac{2}{n\alpha} \, \bar{y}(1 - \bar{y}) \max_{j=1,\ldots,p} | x_{(j)}^T y |,    (9.1)

where x_{(j)} denotes the jth column of the design matrix X and ȳ is the sample mean of y. We compute a grid of λ values equally spaced on a log scale between λ_max = λ* and λ_min = ελ_max, where ε < 1. In practice, we have found the choice ε = 0.05 to be useful.
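In R, (9.1) and the grid construction amount to a few lines; the sketch below assumes a centered design matrix X, and the function name is ours:

    lambda_grid <- function(X, y, alpha, nlambda = 100, eps = 0.05) {
      n <- nrow(X); ybar <- mean(y)
      ## lambda* from (9.1): all penalized coefficients are zero above this value.
      lambda_max <- (2 / (n * alpha)) * ybar * (1 - ybar) * max(abs(crossprod(X, y)))
      exp(seq(log(lambda_max), log(eps * lambda_max), length.out = nlambda))
    }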
9.1.2 The Heuristic for Choosing Starting Values

Since the logistic L2E loss is not convex, it may have multiple local minima. For the purely LASSO-penalized problem, the KKT condition at a local minimum is

$$\nu_j = \left| x_{(j)}^T G\left( y - F(\beta_0 \mathbf{1} + X\beta) \right) \right| \le \lambda.$$

Equality holds whenever β_j ≠ 0. Thus, the largest values of ν_j correspond to a set of covariates that includes the covariates with non-zero regression coefficients. The leap of faith is that the largest values of ν_j evaluated at the null model will also correspond to such a set. This idea has been used in a "swindle" rule (Wu et al., 2009) and in the STRONG rules for discarding variables (Tibshirani, Bien, Friedman, Hastie, Simon, Taylor, and Tibshirani, 2012). In those instances the goal is to solve a smaller optimization problem; in contrast, we initialize starting parameter entries to zero rather than excluding variables with low scores from the optimization problem. Specifically, we do the following: (1) calculate the scores z_j = |x_(j)ᵀ G₀ (y − p̄1)|, where p̄ = ȳ is the sample mean of y and G₀ = p̄(1 − p̄)I; (2) set β₀⁽⁰⁾ = log(ȳ/(1 − ȳ)); and (3) set β_j⁽⁰⁾ = I(j ∈ S), where I(·) denotes the indicator function and S = {j : z_j is "large"}.

9.1.3 Robust Cross-Validation

Once we have a set of models computed at different regularization parameter values, we select the model that is optimal with respect to some criterion. We use the following robust 10-fold cross-validation scheme. After partitioning the data into 10 training and test sets, for each fold i = 1, …, 10 we compute regression coefficients θ̂⁻ⁱ(λ) for a sequence of λ's between λ_max and λ_min, holding out the ith test set S_i. Next we refit the model using only the reduced variable set, the covariates with nonzero regression coefficients, this time with logistic L2E and α = 0. This refitting produces less biased estimates; we are adopting the same strategy as LARS-OLS in Efron, Hastie, Johnstone, and Tibshirani (2004). Our framework, however, could adopt a more sophisticated strategy along the lines of the Relaxed LASSO of Meinshausen (2007). Henceforth let θ̂⁻ⁱ(λ) denote the regression coefficients obtained after this second step. Let d_j⁻ⁱ(λ) denote the contribution of observation j to the L2E loss under the model θ̂⁻ⁱ(λ), i.e.,

$$d_j^{-i}(\lambda) = \left[ y_j - F\!\left( \tilde{x}_j^T \hat{\theta}^{-i}(\lambda) \right) \right]^2.$$

We use the following criterion to choose λ*:

$$\lambda^* = \arg\min_{\lambda} \ \operatorname*{median}_{i=1,\ldots,10} \ \operatorname*{median}_{j \in \mathcal{S}_i} \ d_j^{-i}(\lambda).$$

The reason for choosing λ* in this way is a feature of the robust fitting procedure: a good robust model assigns unusually large values of d_j⁻ⁱ(λ) to outliers, so the total L2E loss is an inappropriate measure of prediction error when influential outliers are present. Taking the median instead yields a measure of prediction error that is essentially unaffected by outliers. The final model selected is the one that minimizes this robust prediction error criterion.
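The median-of-medians criterion is straightforward to operationalize. Below is a minimal NumPy sketch; `fit_and_refit` is a hypothetical callback bundling the penalized fit and the α = 0 refit on the reduced variable set, returning a function that produces fitted probabilities F(x̃ᵀθ̂) for new rows.

```python
import numpy as np

def robust_cv_error(X, y, lambdas, fit_and_refit, n_folds=10, seed=0):
    """Median-of-medians cross-validation error for each lambda.

    fit_and_refit(X_train, y_train, lam) is a hypothetical callback that
    fits, refits with alpha = 0 on the selected variables, and returns a
    prediction function for held-out rows."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = np.empty(len(lambdas))
    for k, lam in enumerate(lambdas):
        fold_medians = []
        for test_idx in folds:
            train = np.setdiff1d(np.arange(len(y)), test_idx)
            predict = fit_and_refit(X[train], y[train], lam)
            d = (y[test_idx] - predict(X[test_idx])) ** 2  # per-observation L2E loss
            fold_medians.append(np.median(d))              # inner median over S_i
        errors[k] = np.median(fold_medians)                # outer median over folds
    return errors  # choose lambdas[np.argmin(errors)]
```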
10 Simulation Experiments in Low Dimensions

Tables 3 and 4 provide summary statistics for the simulations performed in Section 5.1. The experiments show the unbiasedness of the L2E compared to the MLE, at the price of increased variance. The mse column summarizes the bias-variance tradeoff between the two methods.

[Figure 8: Robust 10-fold cross-validation curves for the three methods (L2E, MLE, and HHSVM). The vertical error bars around the dots indicate ± one median absolute deviation with a scale factor of 1.4826. The dash-dotted line indicates the minimizing λ; the dashed line indicates the 1-MAD rule λ.]

11 Variable Selection Experiments in High Dimensions

We show more detailed results for a single replicate of the simulations reported in Section 5.2. Figure 8 shows the robust cross-validation curves for the three methods for this replicate, and Figure 9 shows the corresponding regularization paths. Note the large jump in the L2E curve: by choosing the starting L2E point with our heuristic, a local minimum different from the MLE solution is found. For sufficiently large λ, however, that local minimum vanishes, and the L2E regularization paths mimic the MLE regularization paths.

[Figure 9: Regularization paths for the three methods. Paths for nonzero regression coefficients in the true model are drawn in heavy solid lines.]

12 The Hybrid Huberized SVM

Consider the following classification problem. Let X ∈ ℝ^{n×p} denote a centered matrix of covariates and y ∈ {−1, 1}ⁿ denote binary class labels. We employ the compact notation X̃ = (1, X) ∈ ℝ^{n×(p+1)} and θ = (β₀, βᵀ)ᵀ ∈ ℝ^{p+1}. The Hybrid Huberized Support Vector Machine (HHSVM) (Wang et al., 2008) constructs a linear classifier X̃θ by minimizing the loss

$$\ell(y, X; \theta) = \sum_{i=1}^{n} \phi\!\left( y_i \tilde{x}_i^T \theta \right) + J(\beta),$$

where φ is a smooth hinge loss,

$$\phi(u) = \begin{cases} (1-t)^2 + 2(1-t)(t-u), & u \le t, \\ (1-u)^2, & t < u \le 1, \\ 0, & \text{otherwise}, \end{cases}$$

and J is the Elastic Net penalty (Zou and Hastie, 2005),

$$J(\beta) = \lambda \left[ \alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|_2^2 \right],$$

where α ∈ [0, 1] is a mixing parameter between the 1-norm and 2-norm regularizers. We now derive an MM algorithm for solving the entire regularization path with respect to a varying λ for a fixed α. The majorization we use leads to a simple MM algorithm. This algorithm calculates a different regularization path than the algorithm of Wang et al. (2008), which uses the parameterization of the Elastic Net

$$J(\beta) = \lambda_1 \|\beta\|_1 + \frac{\lambda_2}{2} \|\beta\|_2^2$$

for varying λ₁ and fixed λ₂. The code used in Wang et al. (2008) is available on the authors' website (http://www.stat.lsa.umich.edu/~jizhu/code/hhsvm).

12.1 An MM Algorithm for Minimizing the Smooth Hinge Loss

We begin by deriving a quadratic majorization of φ. It is straightforward to verify that the first and second derivatives of φ are given by

$$\phi'(u) = \begin{cases} -2(1-t), & u \le t, \\ -2(1-u), & t < u \le 1, \\ 0, & \text{otherwise}, \end{cases} \qquad \phi''(u) = \begin{cases} 0, & u \le t, \\ 2, & t < u \le 1, \\ 0, & \text{otherwise}. \end{cases}$$

Then we can express φ as an exact second-order Taylor expansion at a point ũ,

$$\phi(u) = \phi(\tilde{u}) + \phi'(\tilde{u})(u - \tilde{u}) + \frac{1}{2} \phi''(u^*)(u - \tilde{u})^2,$$

where u* = δu + (1 − δ)ũ for some δ ∈ (0, 1). Since φ″ ≤ 2 everywhere, it follows immediately that the function

$$g(u; \tilde{u}) = \phi(\tilde{u}) + \phi'(\tilde{u})(u - \tilde{u}) + (u - \tilde{u})^2$$

majorizes φ at ũ. The u that minimizes g(u; ũ) is

$$u = \tilde{u} - \frac{1}{2} \phi'(\tilde{u}) = \tilde{u} + (1-t)\, I(\tilde{u} \le t) + (1-\tilde{u})\, I(t < \tilde{u} \le 1) = \tilde{u} + 1 - \min(\max(\tilde{u}, t), 1).$$
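To see the majorization concretely, the following sketch implements φ, φ′, and g and numerically checks both the domination property and the closed-form minimizer. The bend value t = −1 is an arbitrary illustrative choice, not one prescribed by the paper.

```python
import numpy as np

def phi(u, t=-1.0):
    """Smooth (Huberized) hinge loss with bend parameter t < 1."""
    return np.where(u <= t, (1 - t)**2 + 2 * (1 - t) * (t - u),
           np.where(u <= 1, (1 - u)**2, 0.0))

def phi_prime(u, t=-1.0):
    return np.where(u <= t, -2 * (1 - t),
           np.where(u <= 1, -2 * (1 - u), 0.0))

def g(u, u_tilde, t=-1.0):
    """Quadratic majorization of phi at u_tilde (curvature fixed at 2)."""
    return phi(u_tilde, t) + phi_prime(u_tilde, t) * (u - u_tilde) + (u - u_tilde)**2

# Sanity check: g dominates phi everywhere and touches it at u_tilde.
u = np.linspace(-4, 4, 2001)
u_tilde = 0.3
assert np.all(g(u, u_tilde) >= phi(u) - 1e-12)
assert np.isclose(g(u_tilde, u_tilde), phi(u_tilde))
# The minimizer matches the closed form u_tilde + 1 - min(max(u_tilde, t), 1).
print(u_tilde - 0.5 * phi_prime(u_tilde))  # 1.0, since t < 0.3 <= 1
```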
12.2 An MM Algorithm for the Unregularized Classification Problem

Returning to our original problem, applying the above results along with the chain rule gives the relationship

$$\ell(y, \tilde{X}; \theta) \le \ell(y, \tilde{X}; \tilde{\theta}) + \tilde{\varphi}^T \tilde{X}(\theta - \tilde{\theta}) + \|\tilde{X}(\theta - \tilde{\theta})\|_2^2,$$

where φ̃_i = y_i φ′(y_i x̃_iᵀθ̃). Since equality occurs when θ = θ̃, the right hand side majorizes the left hand side. Furthermore, the majorization, up to an additive constant, is separable in β₀ and β. Writing φ̄ = (1/n)1ᵀφ̃ and completing the square,

$$\left\| \frac{1}{2}\tilde{\varphi} + \tilde{X}(\theta - \tilde{\theta}) \right\|_2^2 = \left\| \left( \tilde{X}\tilde{\theta} - \frac{1}{2}\tilde{\varphi} \right) - \tilde{X}\theta \right\|_2^2 = \left\| \left[ X\tilde{\beta} - \frac{1}{2}(\tilde{\varphi} - \bar{\varphi}\mathbf{1}) - X\beta \right] + \left[ \tilde{\beta}_0 - \frac{1}{2}\bar{\varphi} - \beta_0 \right] \mathbf{1} \right\|_2^2 = n\left( \tilde{\beta}_0 - \beta_0 - \frac{1}{2n}\mathbf{1}^T \tilde{\varphi} \right)^2 + \|\tilde{z} - X\beta\|_2^2,$$

where the cross term vanishes because X is centered and

$$\tilde{z} = X\tilde{\beta} - \frac{1}{2}\left( \tilde{\varphi} - \frac{1}{n}(\mathbf{1}^T \tilde{\varphi})\mathbf{1} \right).$$

We can therefore write the updates for the intercept and the regression coefficients separately. The intercept update is

$$\beta_0 = \tilde{\beta}_0 - \frac{1}{2n}\mathbf{1}^T \tilde{\varphi},$$

and if X is full rank, the update for β is

$$\beta = \tilde{\beta} - \frac{1}{2}\left( X^T X \right)^{-1} X^T \left( \tilde{\varphi} - \frac{1}{n}(\mathbf{1}^T \tilde{\varphi})\mathbf{1} \right).$$

12.3 An MM Algorithm for the HHSVM

Adding an Elastic Net penalty to the majorization gives the following function to minimize:

$$\frac{1}{2}\left( \tilde{\beta}_0 - \beta_0 - \frac{1}{2n}\mathbf{1}^T \tilde{\varphi} \right)^2 + \frac{1}{2n}\|\tilde{z} - X\beta\|_2^2 + \lambda\left[ \alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2 \right].$$

Penalized least squares problems of this variety are efficiently solved with coordinate descent. The coordinate descent updates are

$$\beta_k = \frac{S\!\left( \frac{1}{n} x_k^T r, \ \lambda\alpha \right)}{\frac{1}{n}\|x_k\|_2^2 + \lambda(1-\alpha)}, \qquad \text{where } r = \tilde{z} - \sum_{j \ne k} \beta_j x_j$$

and S is the soft-thresholding operator defined earlier.
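Putting Sections 12.2 and 12.3 together, a compact MM solver can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bend parameter t and the fixed iteration caps are arbitrary choices, and no convergence monitoring or active sets are included.

```python
import numpy as np

def soft_threshold(a, b):
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def hhsvm_mm(X, y, lam, alpha, t=-1.0, n_iter=200, n_cd=50):
    """MM solver for the HHSVM sketched from Sections 12.2-12.3.

    X is assumed centered; y takes values in {-1, +1}. The bend t and the
    iteration caps are illustrative, not prescribed by the paper."""
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)

    def phi_prime(u):
        return np.where(u <= t, -2 * (1 - t),
               np.where(u <= 1, -2 * (1 - u), 0.0))

    for _ in range(n_iter):
        u = y * (beta0 + X @ beta)
        varphi = y * phi_prime(u)            # varphi_i = y_i phi'(y_i x~_i' theta~)
        vbar = varphi.mean()
        beta0 = beta0 - 0.5 * vbar           # separable intercept update
        z = X @ beta - 0.5 * (varphi - vbar) # working response z~
        for _ in range(n_cd):                # coordinate descent on the majorization
            for k in range(p):
                r_k = z - (X @ beta - beta[k] * X[:, k])  # partial residual
                beta[k] = soft_threshold(X[:, k] @ r_k / n, lam * alpha) \
                          / (X[:, k] @ X[:, k] / n + lam * (1 - alpha))
    return beta0, beta
```

This computes the solution at a single λ; combined with a warm-started grid along the lines of the driver in Section 9.1.1, it traces out the full regularization path for a fixed α.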
References

Alfons, A., Croux, C., and Gelper, S. (2012), "Sparse Least Trimmed Squares Regression," Annals of Applied Statistics, to appear.

Amos, C. I., Wu, X., Broderick, P., Gorlov, I. P., Gu, J., Eisen, T., Dong, Q., Zhang, Q., Gu, X., Vijayakrishnan, J., Sullivan, K., Matakidou, A., Wang, Y., Mills, G., Doheny, K., Tsai, Y.-Y., Chen, W. V., Shete, S., Spitz, M. R., and Houlston, R. S. (2008), "Genome-wide Association Scan of Tag SNPs Identifies a Susceptibility Locus for Lung Cancer at 15q25.1," Nature Genetics, 40, 616–622.

Basu, A., Harris, I. R., Hjort, N. L., and Jones, M. C. (1998), "Robust and Efficient Estimation by Minimising a Density Power Divergence," Biometrika, 85, 549–559.

Bianco, A. and Yohai, V. (1996), "Robust Estimation in the Logistic Regression Model," in Robust Statistics, Data Analysis, and Computer Intensive Methods, Lecture Notes in Statistics, ed. Rieder, H., New York: Springer-Verlag, vol. 109, pp. 17–34.

Böhning, D. and Lindsay, B. G. (1988), "Monotonicity of Quadratic-Approximation Algorithms," Annals of the Institute of Statistical Mathematics, 40, 641–663.

Bondell, H. D. (2005), "Minimum Distance Estimation for the Logistic Regression Model," Biometrika, 92, 724–731.

Carroll, R. J. and Pederson, S. (1993), "On Robustness in the Logistic Regression Model," Journal of the Royal Statistical Society, Ser. B, 55, 693–706.

Chen, S. S., Donoho, D. L., and Saunders, M. A. (1998), "Atomic Decomposition by Basis Pursuit," SIAM Journal on Scientific Computing, 20, 33–61.

Chi, E. C. (2011), "Parametric Classification and Variable Selection by the Minimum Integrated Squared Error Criterion," Ph.D. thesis, Rice University.

Clarke, F. H. (1983), Optimization and Nonsmooth Analysis, Wiley-Interscience.

Collins, M., Dasgupta, S., and Schapire, R. (2001), "A Generalization of Principal Component Analysis to the Exponential Family," in Advances in Neural Information Processing Systems, vol. 14.

Copas, J. B. (1988), "Binary Regression Models for Contaminated Data," Journal of the Royal Statistical Society, Ser. B, 50, 225–265.

Croux, C., Flandre, C., and Haesbroeck, G. (2002), "The Breakdown Behavior of the Maximum Likelihood Estimator in the Logistic Regression Model," Statistics & Probability Letters, 60, 377–386.

Donoho, D. L. and Johnstone, I. M. (1995), "Adapting to Unknown Smoothness via Wavelet Shrinkage," Journal of the American Statistical Association, 90, 1200–1224.

Donoho, D. L. and Liu, R. C. (1988), "The 'Automatic' Robustness of Minimum Distance Functionals," Annals of Statistics, 16, 552–586.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," Annals of Statistics, 32, 407–499.

Frank, A. and Asuncion, A. (2010), "UCI Machine Learning Repository."

Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. (2007), "Pathwise Coordinate Optimization," Annals of Applied Statistics, 1, 302–332.

Friedman, J. H., Hastie, T., and Tibshirani, R. (2010), "Regularization Paths for Generalized Linear Models via Coordinate Descent," Journal of Statistical Software, 33, 1–22.

Genkin, A., Lewis, D. D., and Madigan, D. (2007), "Large-Scale Bayesian Logistic Regression for Text Categorization," Technometrics, 49, 291–304.

Hunter, D. and Lange, K. (2004), "A Tutorial on MM Algorithms," The American Statistician, 58, 30–38.

Kim, J. and Scott, C. (2008), "Performance Analysis for L2 Kernel Classification," in Advances in Neural Information Processing Systems, vol. 21.

— (2010), "L2 Kernel Classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 1822–1831.

Kolda, T. G. and Bader, B. W. (2009), "Tensor Decompositions and Applications," SIAM Review, 51, 455–500.

Künsch, H. R., Stefanski, L. A., and Carroll, R. J. (1989), "Conditionally Unbiased Bounded-Influence Estimation in General Regression Models, with Applications to Generalized Linear Models," Journal of the American Statistical Association, 84, 460–466.

Lange, K. (2010), Numerical Analysis for Statisticians, Springer.

Lange, K., Hunter, D. R., and Yang, I. (2000), "Optimization Transfer Using Surrogate Objective Functions," Journal of Computational and Graphical Statistics, 9, 1–20.

Lee, S., Huang, J. Z., and Hu, J. (2010), "Sparse Logistic Principal Components Analysis for Binary Data," Annals of Applied Statistics, 4, 1579–1601.

Li, G., Peng, H., and Zhu, L. (2011), "Nonconcave Penalized M-Estimation with a Diverging Number of Parameters," Statistica Sinica, 21, 391–419.

Li, Y., Ding, J., and Abecasis, G. R. (2006), "Mach 1.0: Rapid Haplotype Reconstruction and Missing Genotype Inference," American Journal of Human Genetics, S79, 2290.

Liu, Z., Jiang, F., Tian, G., Wang, S., Sato, F., Meltzer, S. J., and Tan, M. (2007), "Sparse Logistic Regression with Lp Penalty for Biomarker Identification," Statistical Applications in Genetics and Molecular Biology, 6, 2–12.

McCullagh, P. and Nelder, J. (1989), Generalized Linear Models, Boca Raton, Florida: Chapman and Hall.

Meinshausen, N. (2007), "Relaxed Lasso," Computational Statistics and Data Analysis, 52, 374–393.

Rosset, S. and Zhu, J. (2007), "Piecewise Linear Regularized Solution Paths," Annals of Statistics, 35, 1012–1030.

Schifano, E. D., Strawderman, R. L., and Wells, M. T. (2010), "Majorization-Minimization Algorithms for Nonsmoothly Penalized Objective Functions," Electronic Journal of Statistics, 4, 1258–1299.

Scott, D. W. (1992), Multivariate Density Estimation: Theory, Practice and Visualization, John Wiley & Sons, Inc.

— (2001), "Parametric Statistical Modeling by Minimum Integrated Square Error," Technometrics, 43, 274–285.

— (2004), "Partial Mixture Estimation and Outlier Detection in Data and Regression," in Theory and Applications of Recent Robust Methods, eds. Hubert, M., Pison, G., Struyf, A., and Aelst, S. V., Birkhäuser, Basel, pp. 297–306.
Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Ser. B, 58, 267–288.

Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. (2012), "Strong Rules for Discarding Predictors in Lasso-Type Problems," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74, 245–266.

Tseng, P. (2001), "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization," Journal of Optimization Theory and Applications, 109, 475–494.

van de Geer, S. A. (2008), "High-Dimensional Generalized Linear Models and the Lasso," Annals of Statistics, 36, 614–645.

Wang, H., Li, G., and Jiang, G. (2007), "Robust Regression Shrinkage and Consistent Variable Selection Through the LAD-Lasso," Journal of Business & Economic Statistics, 25, 347–355.

Wang, L., Zhu, J., and Zou, H. (2008), "Hybrid Huberized Support Vector Machines for Microarray Classification and Gene Selection," Bioinformatics, 24, 412–419.

Wickham, H. (2009), ggplot2: Elegant Graphics for Data Analysis, Springer New York.

Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E., and Lange, K. (2009), "Genomewide Association Analysis by Lasso Penalized Logistic Regression," Bioinformatics, 25, 714–721.

Wu, T. T. and Lange, K. (2008), "Coordinate Descent Algorithms for Lasso Penalized Regression," Annals of Applied Statistics, 2, 224–244.

Yuan, M. and Lin, Y. (2006), "Model Selection and Estimation in Regression With Grouped Variables," Journal of the Royal Statistical Society: Series B, 68, 49–67.

Zou, H. and Hastie, T. (2005), "Regularization and Variable Selection via the Elastic Net," Journal of the Royal Statistical Society, Ser. B, 67, 301–320.

Table 3: Effect of varying the position of a single outlier from −0.25 to 24.

                                              MLE                      L2E
Outlier Position  Coefficient  True Value   mean     std     mse     mean    std     mse
-0.25             β₀           0           -0.002   0.182   0.033   -0.005  0.192   0.037
                  β₁           1            1.032   0.434   0.189    1.063  0.480   0.234
                  β₂           0.5          0.526   0.424   0.180    0.539  0.463   0.216
                  β₃           1            1.047   0.439   0.195    1.079  0.482   0.238
                  β₄           2            2.110   0.487   0.249    2.181  0.572   0.359
1.5               β₀           0           -0.024   0.168   0.029    0.002  0.192   0.037
                  β₁           1            0.868   0.394   0.173    1.052  0.476   0.229
                  β₂           0.5          0.401   0.391   0.162    0.532  0.460   0.212
                  β₃           1            0.880   0.396   0.171    1.068  0.478   0.233
                  β₄           2            1.860   0.430   0.204    2.160  0.567   0.347
3                 β₀           0           -0.022   0.157   0.025    0.002  0.192   0.037
                  β₁           1            0.732   0.368   0.207    1.054  0.476   0.229
                  β₂           0.5          0.296   0.369   0.178    0.533  0.460   0.212
                  β₃           1            0.743   0.368   0.201    1.069  0.478   0.233
                  β₄           2            1.662   0.392   0.268    2.163  0.567   0.347
6                 β₀           0           -0.020   0.142   0.021    0.002  0.192   0.037
                  β₁           1            0.508   0.337   0.356    1.054  0.476   0.229
                  β₂           0.5          0.112   0.344   0.268    0.533  0.460   0.212
                  β₃           1            0.516   0.334   0.346    1.069  0.478   0.233
                  β₄           2            1.350   0.347   0.543    2.163  0.567   0.347
12                β₀           0           -0.018   0.128   0.017    0.002  0.192   0.037
                  β₁           1            0.153   0.325   0.823    1.054  0.476   0.229
                  β₂           0.5         -0.201   0.336   0.604    0.533  0.460   0.212
                  β₃           1            0.158   0.316   0.808    1.069  0.478   0.233
                  β₄           2            0.906   0.317   1.297    2.163  0.567   0.347
24                β₀           0           -0.011   0.124   0.016    0.002  0.192   0.037
                  β₁           1           -0.088   0.330   1.293    1.054  0.476   0.229
                  β₂           0.5         -0.431   0.331   0.975    0.533  0.460   0.212
                  β₃           1           -0.086   0.315   1.279    1.069  0.478   0.233
                  β₄           2            0.641   0.324   1.952    2.163  0.567   0.347
Table 4: Effect of varying the number of outliers at a fixed location.

                                                MLE                      L2E
Number of Outliers  Coefficient  True Value   mean     std     mse     mean    std     mse
0                   β₀           0            0.005   0.182   0.033    0.002  0.192   0.037
                    β₁           1            1.026   0.433   0.188    1.054  0.476   0.229
                    β₂           0.5          0.521   0.422   0.179    0.533  0.460   0.212
                    β₃           1            1.041   0.438   0.193    1.069  0.478   0.233
                    β₄           2            2.099   0.485   0.245    2.163  0.567   0.347
1                   β₀           0           -0.022   0.157   0.025    0.002  0.192   0.037
                    β₁           1            0.732   0.368   0.207    1.054  0.476   0.229
                    β₂           0.5          0.296   0.369   0.178    0.533  0.460   0.212
                    β₃           1            0.743   0.368   0.201    1.069  0.478   0.233
                    β₄           2            1.662   0.392   0.268    2.163  0.567   0.347
5                   β₀           0           -0.090   0.126   0.024    0.002  0.192   0.037
                    β₁           1            0.086   0.320   0.937    1.054  0.476   0.229
                    β₂           0.5         -0.263   0.327   0.689    0.533  0.460   0.212
                    β₃           1            0.090   0.308   0.922    1.069  0.478   0.233
                    β₄           2            0.830   0.312   1.466    2.163  0.567   0.347
10                  β₀           0           -0.110   0.124   0.027    0.002  0.192   0.037
                    β₁           1           -0.073   0.330   1.261    1.054  0.476   0.229
                    β₂           0.5         -0.417   0.333   0.951    0.533  0.460   0.212
                    β₃           1           -0.071   0.315   1.246    1.069  0.478   0.233
                    β₄           2            0.659   0.323   1.903    2.163  0.567   0.347
15                  β₀           0           -0.117   0.124   0.029    0.002  0.192   0.037
                    β₁           1           -0.127   0.335   1.382    1.054  0.476   0.229
                    β₂           0.5         -0.470   0.338   1.055    0.533  0.460   0.212
                    β₃           1           -0.125   0.321   1.367    1.069  0.478   0.233
                    β₄           2            0.605   0.328   2.054    2.163  0.567   0.347
20                  β₀           0           -0.122   0.124   0.030    0.002  0.192   0.037
                    β₁           1           -0.159   0.339   1.457    1.054  0.476   0.229
                    β₂           0.5         -0.502   0.342   1.120    0.533  0.460   0.212
                    β₃           1           -0.157   0.325   1.443    1.069  0.478   0.233
                    β₄           2            0.573   0.332   2.145    2.163  0.567   0.347