Linear Time Feature Selection for Regularized Least-Squares
Authors: Tapio Pahikkala, Antti Airola, Tapio Salakoski
February 28, 2022

Abstract

We propose a novel algorithm for greedy forward feature selection for regularized least-squares (RLS) regression and classification, also known as the least-squares support vector machine or ridge regression. The algorithm, which we call greedy RLS, starts from the empty feature set, and on each iteration adds the feature whose addition provides the best leave-one-out cross-validation performance. Our method is considerably faster than the previously proposed ones, since its time complexity is linear in the number of training examples, the number of features in the original data set, and the desired size of the set of selected features. Therefore, as a side effect we obtain a new training algorithm for learning sparse linear RLS predictors which can be used for large scale learning. This speed is possible due to matrix calculus based short-cuts for leave-one-out and feature addition. We experimentally demonstrate the scalability of our algorithm and its ability to find good quality feature sets.

1 Introduction

Feature selection is one of the fundamental tasks in machine learning. Simply put, given a set of features, the task is to select an informative subset. The selected set can be informative in the sense that it guarantees a good performance of the machine learning method, or that it provides new knowledge about the task in question. Such increased understanding of the data is sought after especially in the life sciences, where feature selection is often used, for example, to find genes relevant to the problem under consideration. Other benefits include compressed representations of learned predictors and faster prediction time.
This can enable the application of learned models when constrained by limited memory and real-time response demands, as is typically the case in embedded computing, for example.

Feature selection methods are divided by [1] into three classes, the so-called filter, wrapper, and embedded methods. In the filter model, feature selection is done as a preprocessing step by measuring only the intrinsic properties of the data. In the wrapper model [2], the features are selected through interaction with a machine learning method. The quality of the selected features is evaluated by measuring some estimate of the generalization performance of a prediction model constructed using them. The embedded methods minimize an objective function which is a trade-off between a goodness-of-fit term and a penalty term for a large number of features (see e.g. [3]).

On one hand, the method we propose in this work is a wrapper method, since equivalent results can be obtained by using a black-box learning algorithm together with the considered search and performance estimation strategies. On the other hand, it may also be considered an embedded method, because the feature subset selection is so deeply integrated into the learning process that our work results in a novel efficient training algorithm for the method.

As suggested by [4], we estimate the generalization performance of models learned with different subsets of features with cross-validation. More specifically, we consider the leave-one-out cross-validation (LOO) approach, where each example in turn is left out of the training set and used for testing. Since the number of possible feature combinations grows exponentially with the number of features, a search strategy over the power set of features is needed. The strategy we choose is the greedy forward selection approach.
The method starts from the empty feature set, and on each iteration adds the feature whose addition provides the best leave-one-out cross-validation performance. That is, it performs a steepest-descent hill-climbing in the space of feature subsets of size up to a given limit k ∈ N, and is greedy in the sense that it does not remove features once they have been selected.

Our method is built upon regularized least-squares (RLS), also known as the least-squares support vector machine (LS-SVM) and ridge regression, which is a state-of-the-art machine learning method suitable both for regression and classification [5-11]. An important property of the algorithm is that it has a closed-form solution, which can be fully expressed in terms of matrix operations. This allows developing efficient computational shortcuts for the method, since small changes in the training data matrix correspond to low-rank changes in the learned predictor. In particular, it makes possible the development of efficient cross-validation algorithms. An updated predictor, corresponding to one learned from a training set from which one example has been removed, can be obtained via a well-known computationally efficient short-cut (see e.g. [12-14]), which in turn enables the fast computation of LOO-based performance estimates. Analogously to removing the effect of training examples from learned RLS predictors, the effect of a feature can be added or removed by updating the learned RLS predictors via similar computational short-cuts.

Learning a linear RLS predictor with k features and m training examples requires O(min{k^2 m, km^2}) time, since the training can be performed either in primal or dual form, depending on whether m > k or vice versa.
Given that the computation of LOO performance requires m retrainings, that the forward selection goes through on the order of O(n) features available for selection in each iteration, and that the forward selection has k iterations, the overall time complexity of the forward selection with the LOO criterion becomes O(min{k^3 m^2 n, k^2 m^3 n}) in case RLS is used as a black-box method.

In the machine learning and statistics literature, there have been several studies in which the computational short-cuts for LOO, as well as for other types of cross-validation, have been used to speed up the evaluation of feature subset quality for ordinary non-regularized least-squares (see e.g. [15, 16]) and for RLS (see e.g. [17, 18]). However, the considered approaches are still computationally quite demanding, since the LOO estimate needs to be re-calculated from scratch for each considered subset of features. Recently, [19] introduced a method which uses additional computational short-cuts for efficient updating of the LOO predictions when adding new features to those already selected. The computational cost of this incremental forward selection procedure is only O(km^2 n). The speed improvement is notable especially in cases where the aim is to select a large number of features but the training set is small. However, the method is still impractical for large training sets due to its quadratic scaling.

In this paper, we improve upon the above results by showing how greedy forward selection according to the LOO criterion can be carried out in O(kmn) time. Thus, our algorithm, which we call greedy RLS or greedy LS-SVM, is linear in the number of examples, features, and selected features, while it still provides the same solution as the straightforward wrapper approach.
As a side effect, we obtain a novel training algorithm for linear RLS predictors that is suitable for large-scale learning and guarantees sparse solutions. Computational experiments demonstrate the scalability and efficiency of greedy RLS, and the predictive power of the selected feature sets.

2 Regularized Least-Squares

We start by introducing some notation. Let R^m and R^{n×m}, where n, m ∈ N, denote the sets of real-valued column vectors and n×m matrices, respectively. To denote real-valued matrices and vectors we use capital letters and bold lower case letters, respectively. Moreover, index sets are denoted with calligraphic capital letters. By M_i, M_{:,j}, and M_{i,j}, we refer to the i-th row, the j-th column, and the (i,j)-th entry of the matrix M ∈ R^{n×m}, respectively. Similarly, for index sets R ⊆ {1, ..., n} and L ⊆ {1, ..., m}, we denote the submatrices of M having their rows indexed by R, their columns by L, and their rows by R and columns by L as M_R, M_{:,L}, and M_{R,L}, respectively. We use an analogous notation also for column vectors, that is, v_i refers to the i-th entry of the vector v.

Let X ∈ R^{n×m} be a matrix containing the whole feature representation of the examples in the training set, where n is the total number of features and m is the number of training examples. The (i,j)-th entry of X contains the value of the i-th feature in the j-th training example. Moreover, let y ∈ R^m be a vector containing the labels of the training examples. In binary classification, the labels can be restricted to be either -1 or 1, for example, while they can be any real numbers in regression tasks.

In this paper, we consider linear predictors of the form

    f(x) = w^T x_S,    (1)

where w is the |S|-dimensional vector representation of the learned predictor and x_S can be considered as a mapping of the data point x into an |S|-dimensional feature space.^1
Note that the vector w only contains entries corresponding to the features indexed by S; the rest of the features of the data points are not used in the prediction phase. The computational complexity of making predictions with (1) and the space complexity of the predictor are both O(k), where k is the desired number of features to be selected, provided that the feature vector representation x_S for the data point x is given.

^1 In the literature, the formula of a linear predictor often also contains a bias term. Here, we assume that if such a bias is used, it is realized by an extra constant-valued feature in the data points.

Given the training data and a set of feature indices S, we find w by minimizing the so-called regularized least-squares (RLS) risk. This can be expressed as the following optimization problem:

    argmin_{w ∈ R^{|S|}} ((w^T X_S)^T - y)^T ((w^T X_S)^T - y) + λ w^T w.    (2)

The first term in (2), called the empirical risk, measures how well the prediction function fits the training data. The second term is called the regularizer, and it controls the trade-off between the loss on the training set and the complexity of the prediction function.

A straightforward approach to solving (2) is to set the derivative of the objective function with respect to w to zero. Solving the resulting equation for w, we get

    w = (X_S (X_S)^T + λI)^{-1} X_S y,    (3)

where I denotes the identity matrix (of size |S|×|S| in (3), and of size m×m in (4) below). We note (see e.g. [20]) that an equivalent result can be obtained from

    w = X_S ((X_S)^T X_S + λI)^{-1} y.    (4)

If the size of the set S of currently selected features is smaller than the number of training examples m, it is computationally beneficial to use the form (3), while using (4) is faster in the opposite case.
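To make the two closed forms concrete, the following is a minimal numerical check (our own illustration, not part of the paper; all variable names are our choices) that (3) and (4) produce the same weight vector:

```python
import numpy as np

# Illustrative check (ours, not from the paper): the primal form (3)
# and the dual form (4) yield the same weight vector w.
rng = np.random.default_rng(0)
n, m, lam = 30, 20, 1.0          # n features, m examples, regularizer
X = rng.standard_normal((n, m))  # columns are training examples
y = rng.standard_normal(m)

S = [0, 3, 7]                    # currently selected feature indices
XS = X[S, :]                     # |S| x m submatrix X_S

# Form (3): invert an |S| x |S| matrix -- cheaper when |S| < m.
w_primal = np.linalg.solve(XS @ XS.T + lam * np.eye(len(S)), XS @ y)

# Form (4): invert an m x m matrix -- cheaper when m < |S|.
w_dual = XS @ np.linalg.solve(XS.T @ XS + lam * np.eye(m), y)

assert np.allclose(w_primal, w_dual)
```

The choice between the two forms is purely computational; the learned predictor is identical.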
Namely, the computational complexity of the former is O(|S|^3 + |S|^2 m), while that of the latter is O(m^3 + m^2 |S|), and hence the complexity of training a predictor is O(min{|S|^2 m, m^2 |S|}).

To support the considerations in the following sections, we introduce the dual form of the prediction function and some extra notation. According to [7], the prediction function (1) can be represented in dual form as

    f(x) = a^T (X_S)^T x_S.

Here a ∈ R^m is the vector of so-called dual variables, which can be obtained from a = Gy, where

    G = (K + λI)^{-1}    (5)

and

    K = (X_S)^T X_S.    (6)

Note that if S = ∅, K is defined to be a matrix whose every entry is equal to zero.

In the context of kernel-based learning algorithms (see e.g. [21-23]), K and a are usually called the kernel matrix and the vector of dual variables, respectively. The kernel matrix contains the inner products between the feature vectors of all training data points. With kernel methods, the feature vector representation of the data points may even be infinite-dimensional, as long as there is an efficient, although possibly implicit, way to calculate the value of the inner product, and the dual variables are used to represent the prediction function. However, while the dual representation plays an important role in the considerations of this paper, we restrict our considerations to feature representations of type x_S which can be represented explicitly.

Now let us consider some well-known efficient approaches for evaluating the LOO performance of a trained RLS predictor (see e.g. [12-14, 24]; for the non-regularized case, see e.g. [15, 16]).
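As a small sanity check (ours, not from the paper), the dual-form prediction with a = Gy from (5)-(6) matches the primal prediction (1) on a new data point:

```python
import numpy as np

# Illustrative sketch: dual prediction a^T (X_S)^T x_S equals the
# primal prediction w^T x_S, with a = G y as in (5)-(6).
rng = np.random.default_rng(1)
n, m, lam = 15, 10, 0.5
X = rng.standard_normal((n, m))
y = rng.standard_normal(m)
S = [1, 4, 9]
XS = X[S, :]

K = XS.T @ XS                            # kernel matrix (6), m x m
G = np.linalg.inv(K + lam * np.eye(m))   # (5)
a = G @ y                                # dual variables

# Primal weights via (3).
w = np.linalg.solve(XS @ XS.T + lam * np.eye(len(S)), XS @ y)

x_new = rng.standard_normal(n)           # an unseen data point
assert np.isclose(w @ x_new[S], a @ (XS.T @ x_new[S]))
```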
The LOO prediction for the j-th training example can be obtained in constant time from

    (1 - q_j)^{-1} (f_j - q_j y_j),    (7)

where f = (w^T X_S)^T and

    q_j = (X_{S,j})^T (X_S (X_S)^T + λI)^{-1} X_{S,j},

provided that the vectors f and q are computed and stored in memory. The time complexity of computing f and q is O(|S|^3 + |S|^2 m), which is the same as that of training RLS via (3). Alternatively, the constant-time LOO predictions can be expressed in terms of the dual variables a and the diagonal entries of the matrix G as

    y_j - (G_{j,j})^{-1} a_j.    (8)

Here, the time complexity of computing a and the diagonal entries of the matrix G is O(m^3 + m^2 |S|), which is equal to the complexity of training RLS via (4). Thus, using either (7) or (8), the LOO performance for the whole training set can be computed in O(min{|S|^2 m, m^2 |S|}) time, which is equal to the time complexity of training RLS.

3 Feature Selection

In this section, we consider algorithmic implementations of feature selection for RLS with the leave-one-out (LOO) criterion. We start by considering a straightforward wrapper approach which uses RLS as a black-box method in Section 3.1. Next, in Section 3.2, we recall a previously proposed algorithm which provides results that are equivalent to those of the wrapper approach, but which has computational short-cuts to speed up the feature selection. In Section 3.3, we present greedy RLS, our novel linear time algorithm, which again provides the same results as the wrapper approach but which has much lower computational and memory complexity than the previously proposed algorithms.

3.1 Wrapper Approach

Here, we consider the standard wrapper approach, in which RLS is used as a black-box method that is re-trained for every feature set to be evaluated during the selection process and for every round of the LOO cross-validation. In forward selection, the set of all feature sets up to size k is searched using hill-climbing in the steepest descent direction (see e.g. [25]), starting from the empty set. The method is greedy in the sense that it adds one feature at a time to the set of selected features, but no features are removed from the set at any stage. The wrapper approach is presented in Algorithm 1. In the algorithm description, t(·, ·, ·) denotes the black-box training procedure for RLS, which takes a data matrix, a label vector, and a value of the regularization parameter as input, and returns a vector representation of the learned predictor w. The function l(·, ·) computes the loss for one training example given its true label and a predicted label obtained from the LOO procedure. Typical losses are, for example, the squared loss for regression and the zero-one error for classification.

Algorithm 1: Standard wrapper algorithm for RLS
Input: X ∈ R^{n×m}, y ∈ R^m, k, λ
Output: S, w
 1  S ← ∅
 2  while |S| < k do
 3      e ← ∞
 4      b ← 0
 5      foreach i ∈ {1, ..., n} \ S do
 6          R ← S ∪ {i}
 7          e_i ← 0
 8          foreach j ∈ {1, ..., m} do
 9              L ← {1, ..., m} \ {j}
10              w ← t(X_{R,L}, y_L, λ)
11              p ← w^T X_{R,j}
12              e_i ← e_i + l(y_j, p)
13          end
14          if e_i < e then
15              e ← e_i
16              b ← i
17          end
18      end
19      S ← S ∪ {b}
20  end
21  w ← t(X_S, y, λ)

Given that the computation of LOO performance requires m retrainings, that the forward selection goes through O(n) features in each iteration, and that k features are chosen, the overall time complexity of the forward selection with the LOO criterion is O(min{k^3 m^2 n, k^2 m^3 n}). Thus, the wrapper approach is feasible with small training sets and in cases where the aim is to select only a small number of features.
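Both LOO short-cuts from Section 2 can be verified against the wrapper's explicit retraining loop on a toy problem. The following sketch (our own, with assumed variable names) checks formulas (7) and (8) against brute-force leave-one-out:

```python
import numpy as np

# Illustrative check (ours): the constant-time LOO short-cuts (7) and
# (8) reproduce explicit leave-one-out retraining of RLS.
rng = np.random.default_rng(2)
n, m, lam = 8, 12, 1.0
X = rng.standard_normal((n, m))
y = rng.standard_normal(m)
S = [0, 2, 5]
XS = X[S, :]

# Primal quantities for (7).
P = np.linalg.inv(XS @ XS.T + lam * np.eye(len(S)))
w = P @ XS @ y
f = XS.T @ w                                # f = (w^T X_S)^T
q = np.einsum('ij,jk,ki->i', XS.T, P, XS)   # q_j = (X_{S,j})^T P X_{S,j}
loo7 = (f - q * y) / (1 - q)                # formula (7)

# Dual quantities for (8).
G = np.linalg.inv(XS.T @ XS + lam * np.eye(m))
a = G @ y
loo8 = y - a / np.diag(G)                   # formula (8)

# Brute force: retrain with example j held out, then predict on it.
for j in range(m):
    keep = [t for t in range(m) if t != j]
    Xj, yj = XS[:, keep], y[keep]
    wj = np.linalg.solve(Xj @ Xj.T + lam * np.eye(len(S)), Xj @ yj)
    assert np.isclose(loo7[j], wj @ XS[:, j])
    assert np.isclose(loo8[j], wj @ XS[:, j])
```

Both formulas are exact for RLS with a fixed λ, which is why the wrapper's m retrainings per candidate feature can be avoided entirely.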
However, the at least quadratic complexity with respect to both the size of the training set and the number of selected features makes the standard wrapper approach impractical in large scale learning settings. The space complexity of the wrapper approach is O(nm), which is equal to the cost of storing the data matrix X.

An immediate reduction of the above computational complexities can be achieved via the short-cut methods for calculating the LOO performance presented in Section 2. In the machine learning and statistics literature, there have been several studies in which the computational short-cuts for LOO, as well as for other types of cross-validation, have been used to speed up the evaluation of feature subset quality for non-regularized least-squares (see e.g. [15, 16]) and for RLS (see e.g. [17, 18]). For greedy forward selection, the short-cuts reduce the overall computational complexity to O(min{k^3 mn, k^2 m^2 n}), since LOO can then be calculated as efficiently as training the predictor itself. However, both [18] and [17] considered only the dual formulation (8) of the LOO speed-up, which is expensive for large data sets. Thus, we are not aware of any studies in which the primal formulation (7) has been implemented for greedy forward selection for RLS.

3.2 Previously Proposed Speed-Up

We now present the feature selection algorithm proposed by [19], which the authors call the low-rank updated LS-SVM. The features selected by the algorithm are equal to those selected by the standard wrapper algorithm and by the method proposed by [17], but the method has a lower time complexity with respect to k due to certain matrix calculus based computational short-cuts. The low-rank updated LS-SVM algorithm for feature selection is presented in Algorithm 2.

The low-rank updated LS-SVM is based on updating the matrix G and the vector a as new features are added into S.
In the beginning of the forward selection, the set S of selected features is empty. This means that the matrix G contains no information about the data matrix X; instead, every entry of the matrix K is equal to zero, and hence a and G are initialized to λ^{-1}y and λ^{-1}I, respectively, in lines 2 and 3 of Algorithm 2. When a new feature candidate is evaluated, the LOO performance corresponding to the updated feature set can be calculated using the temporarily updated G and a, denoted in the algorithm as G̃ and ã, respectively. Here, the improved speed is based on two short-cuts: updating the matrix G via the well-known Sherman-Morrison-Woodbury (SMW) formula, and using the fast LOO computation.

Let us first consider updating the matrix G. Formally, if the i-th feature is to be evaluated, where i ∉ S, the algorithm calculates the matrix G̃ corresponding to the feature index set S ∪ {i}. According to (5), the matrix G̃ is

    (K + vv^T + λI)^{-1},    (9)

where K is the kernel matrix defined as in (6) without having the i-th feature in S, and v = (X_i)^T contains the values of the i-th feature for the training data.

Algorithm 2: Low-rank updated LS-SVM
Input: X ∈ R^{n×m}, y ∈ R^m, k, λ
Output: S, w
 1  S ← ∅
 2  a ← λ^{-1} y
 3  G ← λ^{-1} I
 4  while |S| < k do
 5      e ← ∞
 6      b ← 0
 7      foreach i ∈ {1, ..., n} \ S do
 8          v ← (X_i)^T
 9          G̃ ← G - Gv(1 + v^T Gv)^{-1}(v^T G)
10          ã ← G̃ y
11          e_i ← 0
12          foreach j ∈ {1, ..., m} do
13              p ← y_j - (G̃_{j,j})^{-1} ã_j
14              e_i ← e_i + l(y_j, p)
15          end
16          if e_i < e then
17              e ← e_i
18              b ← i
19          end
20      end
21      v ← (X_b)^T
22      G ← G - Gv(1 + v^T Gv)^{-1}(v^T G)
23      a ← G y
24      S ← S ∪ {b}
25  end
26  w ← X_S a

The time needed to compute (9) directly is cubic in m, which provides no benefit compared to the standard wrapper approach.
However, provided that the matrix G is stored in memory, it is possible to speed up the computation of G̃ via the SMW formula. Formally, the matrix G̃ can also be obtained from

    G - Gv(1 + v^T Gv)^{-1}(v^T G),    (10)

which requires time quadratic in m, provided that the multiplications are performed in the order indicated by the parentheses. Having calculated G̃, the updated vector of dual variables ã can be obtained from

    G̃ y    (11)

also with O(m^2) floating point operations.

Now let us consider the efficient evaluation of the features with LOO performance. Provided that G̃ and ã are already stored in memory, the LOO performance of RLS trained with the feature set S ∪ {i} can be calculated via (8). After the feature whose inclusion is the most favorable with respect to the LOO performance is found, it is added to the set of selected features, and the RLS solution is updated accordingly in the last four lines of the algorithm.

The outermost loop is run k times, k denoting the number of features to be selected. For each round of the outer loop, the middle loop is run O(n) times, because each of the n - |S| remaining features is evaluated in turn before the best of them is added to the set of selected features S. Due to the efficient formula (8) for LOO computation, the calculation of the LOO predictions for the m training examples requires only O(m) floating point operations in each run of the innermost loop. The complexity of the middle loop is dominated by the O(m^2) time needed for calculating G̃ in line 9 and ã in line 10, and hence the overall time complexity of the low-rank updated LS-SVM algorithm is O(knm^2).

The complexity with respect to k is only linear, which is much better than that of the standard wrapper approach, and hence the selection of large feature sets is made possible.
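A quick numerical check (our illustration, not the authors' code) that the SMW update (10) reproduces the direct inversion (9), and that (11) then updates the dual variables without re-solving the linear system:

```python
import numpy as np

# Illustrative check (ours): the rank-one SMW update (10) equals the
# directly inverted matrix (9), using only O(m^2) work.
rng = np.random.default_rng(3)
m, lam = 10, 0.7
A = rng.standard_normal((m, m))
K = A @ A.T                      # a valid kernel matrix for some S
v = rng.standard_normal(m)       # values of the candidate feature i
y = rng.standard_normal(m)

G = np.linalg.inv(K + lam * np.eye(m))

# Direct O(m^3) computation of (9).
G_direct = np.linalg.inv(K + np.outer(v, v) + lam * np.eye(m))

# SMW update (10): only matrix-vector products, O(m^2).
Gv = G @ v
G_smw = G - np.outer(Gv, Gv) / (1.0 + v @ Gv)
assert np.allclose(G_direct, G_smw)

# Updated dual variables (11) without re-solving the system.
a = G @ y
a_new = a - Gv * (v @ a) / (1.0 + v @ Gv)   # equals G_smw @ y
assert np.allclose(a_new, G_smw @ y)
```

Note that G is symmetric, so v^T G = (Gv)^T, which is what makes the outer-product form above valid.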
However, feature selection with large training sets is still infeasible because of the quadratic complexity with respect to m. In fact, the running time of the standard wrapper approach with the LOO short-cut can be shorter than that of the low-rank updated LS-SVM method in cases where the training set is large and the number of features to be selected is small.

The space complexity of the low-rank updated LS-SVM algorithm is O(nm + m^2), because the matrices X and G require O(nm) and O(m^2) space, respectively. Due to the quadratic dependence on m, this space complexity is worse than that of the standard wrapper approach.

3.3 Linear Time Forward Selection Algorithm

Here, we present our novel algorithm for greedy forward selection for RLS with the LOO criterion. We refer to our algorithm as greedy RLS since, in addition to the feature selection point of view, it can also be considered a greedy algorithm for learning sparse RLS predictors. Our method resembles the low-rank updated LS-SVM in the sense that it also operates by iteratively updating the vector of dual variables, and hence we define it on the grounds of the considerations of Section 3.2. Pseudocode of greedy RLS is presented in Algorithm 3.

The time and space complexities of the low-rank updated LS-SVM algorithm are quadratic in m, because it involves updating the matrix G (lines 9 and 22 in Algorithm 2) and the vector a (lines 10 and 23 in Algorithm 2). In order to improve the efficiency of the algorithm by getting rid of the quadratic dependence on m in both space and time, we must avoid the explicit calculation and storage of the matrices G and G̃. Next, we present an improvement which solves these problems.

In the middle loop of the low-rank updated LS-SVM algorithm in Algorithm 2, it is assumed that the matrix G constructed according to the current set S of selected features is stored in memory.
It is temporarily updated to the matrix G̃ corresponding to the set S ∪ {i}, where i is the index of the feature to be evaluated. We observe that the LOO computations via formula (8) require the vector of dual variables ã corresponding to S ∪ {i}, but only the diagonal elements of G̃.

Let us first consider speeding up the computation of ã. Since we already have a stored in memory, we can take advantage of it when computing ã. That is, since the value of the vector a before the update is Gy, its updated value ã can, according to (10) and (11), be obtained from

    a - Gv(1 + v^T Gv)^{-1}(v^T a),    (12)

where v = (X_i)^T. This does not yet help us much, because the form (12) still contains the matrix G explicitly. However, if we assume that, in addition to a, the product Gv is calculated in advance and the result is stored in memory, calculating (12) only requires O(m) time. In fact, since the operation (12) is performed for almost every feature during each round of the greedy forward selection, we assume that we have an m × n cache matrix

    C = GX^T,    (13)

computed in advance, containing the required products for all features, each column corresponding to one product. Now, given that C is available in memory, we calculate, in O(m) time, a vector

    u = C_{:,i}(1 + v^T C_{:,i})^{-1}    (14)

and store it in memory for later use. This is done in lines 10 and 24 of Algorithm 3. Consequently, (12) can be rewritten as

    a - u(v^T a),    (15)

and hence the vector a can be updated in O(m) time.

We also have to ensure the fast computation of the LOO performance. The use of (8) for computing the LOO for the updated feature set requires the diagonal elements of the matrix G̃. Let d and d̃ denote the vectors containing the diagonal elements of G and G̃, respectively.
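The cache-based O(m) candidate evaluation can be checked numerically. This sketch (ours, with assumed names) confirms that (14)-(15) reproduce the dual variables obtained by retraining from scratch with S ∪ {i}:

```python
import numpy as np

# Illustrative check (ours): with the cache C = G X^T from (13) in
# memory, updating the dual variables for a candidate feature i via
# (14)-(15) needs only O(m) work, yet matches full retraining.
rng = np.random.default_rng(4)
n, m, lam = 6, 9, 1.0
X = rng.standard_normal((n, m))
y = rng.standard_normal(m)
S = [1, 3]
K = X[S, :].T @ X[S, :]
G = np.linalg.inv(K + lam * np.eye(m))
a = G @ y
C = G @ X.T                          # cache (13), one column per feature

i = 5                                # candidate feature, i not in S
v = X[i, :]                          # v = (X_i)^T
u = C[:, i] / (1.0 + v @ C[:, i])    # (14), O(m): note C[:, i] = G v
a_tilde = a - u * (v @ a)            # (15), O(m)

# Reference: retrain the dual variables from scratch with S + {i}.
S2 = S + [i]
G2 = np.linalg.inv(X[S2, :].T @ X[S2, :] + lam * np.eye(m))
assert np.allclose(a_tilde, G2 @ y)
```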
We observe that the SMW formula (10) for calculating G̃ can be rewritten as

    G - u(C_{:,i})^T.    (16)

Now, we make the assumption that, instead of the whole G being stored in memory, we have only stored its diagonal elements, that is, the vector d. Then, according to (16), the j-th element of the vector d̃ can be computed from

    d_j - u_j C_{j,i},    (17)

requiring only constant time, and the overall time needed for computing d̃ again becomes O(m). Thus, provided that we have all the necessary caches available, evaluating each feature requires O(m) time, and hence one pass through the whole set of n features needs O(mn) floating point operations. But we still have to ensure that the caches can be initialized and updated efficiently enough.

In the initialization phase of the greedy RLS algorithm (lines 1-4 in Algorithm 3), the set of selected features is empty, and hence the values of a, d, and C are initialized to λ^{-1}y, λ^{-1}1, and λ^{-1}X^T, respectively, where 1 ∈ R^m is a vector having every entry equal to 1. The computational complexity of the initialization phase is dominated by the O(mn) time required for initializing C. Thus, the initialization phase is no more complex than one pass through the features.

When a new feature is added to the set of selected features, the vector a is updated according to (15) and the vector d according to (17). Putting together (13), (14), and (16), the cache matrix C can be updated via

    C - u(v^T C),

which requires O(mn) time. This is equally expensive as the above introduced fast approach for trying each feature at a time with LOO as a selection criterion. Finally, if we are to select altogether k features, the overall time complexity of greedy RLS becomes O(kmn). The space complexity is dominated by the matrices X and C, which both require O(mn) space.
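Putting the pieces together, the update rules above can be assembled into a complete sketch of the greedy RLS procedure, using the squared loss as the LOO criterion. This is our own NumPy rendering under the paper's notation, not the authors' code:

```python
import numpy as np

# Sketch (ours) of greedy RLS as we read Algorithm 3, with the squared
# loss as the LOO selection criterion. X is n x m (features x examples).
def greedy_rls(X, y, k, lam):
    n, m = X.shape
    a = y / lam                      # dual variables, S empty
    d = np.full(m, 1.0 / lam)       # diagonal of G
    C = X.T / lam                    # cache C = G X^T, m x n
    S = []
    for _ in range(k):
        best, best_err = None, np.inf
        for i in range(n):
            if i in S:
                continue
            v = X[i, :]
            u = C[:, i] / (1.0 + v @ C[:, i])   # (14)
            a_t = a - u * (v @ a)               # (15)
            d_t = d - u * C[:, i]               # (17)
            loo = y - a_t / d_t                 # LOO predictions, (8)
            err = np.sum((y - loo) ** 2)        # squared LOO loss
            if err < best_err:
                best, best_err = i, err
        v = X[best, :]
        u = C[:, best] / (1.0 + v @ C[:, best])
        a -= u * (v @ a)                 # permanent updates for feature b
        d -= u * C[:, best]
        C -= np.outer(u, v @ C)          # cache update, O(mn)
        S.append(best)
    w = X[S, :] @ a                  # primal weights for features S, (4)
    return S, w
```

Each candidate evaluation costs O(m), and each permanent cache update O(mn), so selecting k features out of n costs O(kmn) overall, matching the analysis above.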
Algorithm 3: Greedy RLS algorithm proposed by us
Input: X ∈ R^{n×m}, y ∈ R^m, k, λ
Output: S, w
 1  a ← λ^{-1} y
 2  d ← λ^{-1} 1
 3  C ← λ^{-1} X^T
 4  S ← ∅
 5  while |S| < k do
 6      e ← ∞
 7      b ← 0
 8      foreach i ∈ {1, ..., n} \ S do
 9          v ← (X_i)^T
10          u ← C_{:,i}(1 + v^T C_{:,i})^{-1}
11          ã ← a - u(v^T a)
12          e_i ← 0
13          foreach j ∈ {1, ..., m} do
14              d̃_j ← d_j - u_j C_{j,i}
15              p ← y_j - (d̃_j)^{-1} ã_j
16              e_i ← e_i + l(y_j, p)
17          end
18          if e_i < e then
19              e ← e_i
20              b ← i
21          end
22      end
23      v ← (X_b)^T
24      u ← C_{:,b}(1 + v^T C_{:,b})^{-1}
25      a ← a - u(v^T a)
26      foreach j ∈ {1, ..., m} do
27          d_j ← d_j - u_j C_{j,b}
28      end
29      C ← C - u(v^T C)
30      S ← S ∪ {b}
31  end
32  w ← X_S a

4 Experimental results

In Section 4.1, we explore the computational efficiency of the introduced method and compare it with the best previously proposed implementation. In Section 4.2, we explore the quality of the feature selection process. In addition, we conduct measurements of the degree to which the LOO criterion overfits in Section 4.3.

4.1 Computational efficiency

First, we present experimental results about the scalability of our method. We use randomly generated data from two normal distributions with 1000 features, of which 50 are selected. The number of examples is varied. In the first experiment, we vary the number of training examples between 500 and 5000, and provide a comparison to the low-rank updated LS-SVM method [19]. In the second experiment, the training set size is varied between 1000 and 50000. We do not consider the baseline method in the second experiment, as it does not scale up to the considered training set sizes. The runtime experiments were run on a modern desktop computer with a 2.4 GHz Intel Core 2 Duo E6600 processor, 8 GB of main memory, and a 64-bit Ubuntu Linux 9.10 operating system.
We note that the running times of these two methods are not affected by the choice of the regularization parameter, or by the distribution of the features or the class labels. This is in contrast to the iterative optimization techniques commonly used to train, for example, support vector machines [26]. Thus we can draw general conclusions about the scalability of the methods from experiments with a fixed value of the regularization parameter and synthetic data.

Fig. 1 shows the runtime comparisons with linear scaling on the y-axis, and Fig. 2 with logarithmic scaling. The results are consistent with the algorithmic complexity analysis of the methods. The method of [19] shows quadratic scaling with respect to the number of training examples, while the proposed method scales linearly. Fig. 3 shows the running times for the proposed method for up to 50000 training examples, in which case selecting 50 features out of 1000 took a bit less than twelve minutes.

[Figure 1: Running times in CPU seconds for the proposed greedy RLS method and the low-rank updated LS-SVM of [19]. Linear scaling on the y-axis.]

[Figure 2: Running times in CPU seconds for the proposed greedy RLS method and the low-rank updated LS-SVM of [19]. Logarithmic scaling on the y-axis.]

[Figure 3: Running times in CPU seconds for the proposed greedy RLS method.]

4.2 Quality of selected features

In this section, we explore the quality of the feature selection process. We recall that our greedy RLS algorithm leads to equivalent results as the algorithms
proposed by [17] and by [19], while being computationally much more efficient. The aforementioned authors have in their work shown that the approach compares favorably to other commonly used feature selection techniques. However, due to computational constraints the earlier evaluations have been run on small data sets consisting of only tens or hundreds of examples. We extend the evaluation to large scale learning, with the largest of the considered data sets consisting of more than a hundred thousand training examples. Wrapper selection methods do not typically scale to such data set sizes, especially when using cross-validation as a selection criterion. Thus we consider as a baseline an approach which chooses k features at random. This is a good sanity check, since training RLS with this approach requires only O(min(k^2 m, k m^2)) time, which is even less than the time required by greedy RLS.

We consider the binary classification setting, where the quality of a selected feature set is measured by the classification accuracy of a model trained on these features, evaluated on independent test data. The experiments are run on a number of publicly available benchmark data sets, whose characteristics are summarized in Table 1. The mnist dataset is originally a multi-class digit recognition task; here we consider the binary classification task of recognizing the digit 5.

Table 1: Data sets

data set       #instances  #features
adult          32561       123
australian     683         14
colon-cancer   62          2000
german.numer   1000        24
ijcnn1         141691      22
mnist5         70000       780

All the presented results are average accuracies over stratified tenfold cross-validation run on the full data sets. On each of the ten cross-validation rounds, before the feature selection experiment is run, we select the value of the regularization parameter as follows.
We train the learner on the training folds using the full feature set, and perform a grid search to choose a suitable regularization parameter value based on leave-one-out performance.

Using the chosen regularization parameter value, we begin the incremental feature selection process on the training folds. One at a time, we select the feature whose addition to the model provides the highest LOO accuracy estimate on the training folds. Each time a feature has been selected, the generalization performance of the updated model is evaluated on the test fold. The process is carried on until all the features have been selected. In the presented figures we plot the number of selected features against the accuracy achieved on the test fold, averaged over the ten cross-validation rounds. The results are shown in Fig. 4 for the adult, in Fig. 5 for the australian, in Fig. 6 for the colon-cancer, in Fig. 7 for the german.numer, in Fig. 8 for the ijcnn1, and in Fig. 9 for the mnist5 data set.

On all data sets, the proposed approach clearly outperforms the random selection strategy, suggesting that the leave-one-out criterion leads to selecting informative features. In most cases, with only a small subset of features one can achieve as good performance as with all the features.

Figure 4: Performance on the adult data set.

Figure 5: Performance on the australian data set.

Figure 6: Performance on the colon-cancer data set.

Figure 7: Performance on the german.numer data set.
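Both the grid search and the selection criterion rely on the fact that leave-one-out for RLS has an exact closed form: the LOO residual of example j equals the ordinary residual divided by (1 − H_jj), where H is the hat matrix, so no model is ever retrained. The sketch below is an editor-written illustration of this well-known shortcut (in examples-by-features layout, unlike the paper's pseudocode), not the authors' implementation.

```python
import numpy as np

def rls_loo_predictions(X, y, lam):
    """Exact leave-one-out predictions for RLS / ridge regression.

    X : (m examples x p features), y : (m,).  Uses the hat-matrix
    shortcut e_loo_j = e_j / (1 - H_jj), avoiding m retrainings.
    """
    m, p = X.shape
    # hat matrix H = X (X^T X + lam I)^{-1} X^T
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y                        # ordinary residuals
    return y - resid / (1.0 - np.diag(H))    # LOO predictions
```

Evaluating the grid of regularization parameter values then amounts to one call per candidate λ, each costing a single training-sized linear solve.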
Figure 8: Performance on the ijcnn1 data set.

Figure 9: Performance on the mnist5 data set.

Figure 10: Test vs. LOO accuracy on the adult data set.

4.3 Overfitting in feature selection

One concern with using the LOO estimate for selecting features is that the estimate may overfit to the training data. If one considers a large enough set of features, it is quite likely that some features will by random chance lead to an improved LOO estimate, while not generalizing outside the training set. We explore to what extent and in which settings this is a problem by comparing the LOO and test accuracies in the previously introduced experimental setting. Again, we plot the number of selected features against the accuracy estimates.

In most cases the LOO and test accuracies are close to identical (Figures 10, 11, 14, and 15). However, on the colon-cancer and german.numer data sets (Figures 12 and 13) we see evidence of overfitting, with LOO providing an over-optimistic evaluation of performance. The effect is especially clear on the colon-cancer data set, which has 2000 features and only 62 examples. The results suggest that reliable feature selection can be problematic on small high-dimensional data sets, of the type often considered, for example, in bioinformatics. On larger data sets the LOO estimate is quite reliable.

Figure 11: Test vs. LOO accuracy on the australian data set.

Figure 12: Test vs. LOO accuracy on the colon-cancer data set.
Figure 13: Test vs. LOO accuracy on the german.numer data set.

Figure 14: Test vs. LOO accuracy on the ijcnn1 data set.

Figure 15: Test vs. LOO accuracy on the mnist5 data set.

5 Discussion and Future Directions

In this paper, we present greedy RLS, a linear time algorithm for feature selection for RLS. The algorithm uses LOO as a criterion for evaluating the goodness of the feature subsets and greedy forward selection as a search strategy in the set of possible feature sets. The proposed algorithm opens several directions for future research. In this section, we briefly discuss some of the directions which we consider to be the most promising ones.

Greedy RLS can quite straightforwardly be generalized to use different types of cross-validation criteria, such as n-fold or repeated n-fold. These are motivated by their smaller variance compared to leave-one-out, and they have also been shown to have better asymptotic convergence properties for feature subset selection for ordinary least-squares [15]. This generalization can be achieved by using the short-cut methods developed by us [27] and by [28].

In this paper, we focus on the greedy forward selection approach, but an analogous method can also be constructed for backward elimination. However, backward elimination would be computationally much more expensive than forward selection, because an RLS predictor has to be trained first with the whole feature set. In the feature selection literature, approaches which combine both forward and backward steps, such as the floating search methods (see e.g. [29, 30]), have also been proposed.
Recently, [31] considered a modification of forward selection for least-squares, which performs corrective steps instead of greedily adding a new feature in each iteration. The considered learning algorithm was ordinary least-squares and the feature subset selection criterion was empirical risk. The modified approach was shown to have approximately the same computational complexity (in number of iterations) but better performance than greedy forward selection or backward elimination. A rigorous theoretical justification was also presented for the modification. While this paper concentrates on the algorithmic implementation of feature subset selection for RLS, we aim to investigate the performance of this type of modification in our future work.

The computational short-cuts presented in this paper are possible due to the closed form solution that the RLS learning algorithm has. Such short-cuts are also available for variations of RLS, such as RankRLS, an RLS based algorithm for learning to rank proposed by us in [32, 33]. In our future work, we also plan to design and implement similar feature selection algorithms for RankRLS.

Analogously to feature selection methods, many approaches have also been developed for so-called reduced set selection, used in the context of kernel-based learning algorithms (see e.g. [10, 34] and references therein). For example, incremental approaches for selecting this set for regularized least-squares have been developed by [35]. Closely related to the reduced set methods, we can also mention the methods traditionally used for selecting centers for radial basis function networks, namely the algorithms based on the orthogonal least-squares method proposed by [36], for which efficient forward selection methods are also known.
In the future, we plan to investigate how well approaches similar to our feature selection algorithm could perform on the tasks of reduced set or center selection.

6 Conclusion

We propose greedy regularized least-squares, a novel training algorithm for sparse linear predictors. The predictors learned by the algorithm are equivalent to those obtained by performing greedy forward feature selection with the leave-one-out (LOO) criterion for regularized least-squares (RLS), also known as the least-squares support vector machine or ridge regression. That is, the algorithm works like a wrapper type of feature selection method which starts from the empty feature set, and on each iteration adds the feature whose addition provides the best LOO performance. Training a predictor with greedy RLS requires O(kmn) time, where k is the number of non-zero entries in the predictor, m is the number of training examples, and n is the original number of features in the data. This is in contrast to the computational complexity O(min{k^3 m^2 n, k^2 m^3 n}) of using the standard wrapper method with the LOO selection criterion when RLS is used as a black-box method, and the complexity O(km^2 n) of the method proposed by [19], which is the most efficient of the previously proposed speed-ups. Hereby, greedy RLS is computationally more efficient than the previously known feature selection methods for RLS.

We demonstrate experimentally the computational efficiency of greedy RLS compared to the best previously proposed implementation. In addition, we explore the quality of the feature selection process and measure the degree to which the LOO criterion overfits with different sizes of data sets.

We have made freely available a software package called RLScore out of our previously proposed RLS based machine learning algorithms.^2
An implementation of the greedy RLS algorithm will also be made available as a part of this software.

^2 Available at http://www.tucs.fi/RLScore

References

[1] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[2] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, no. 1-2, pp. 273–324, 1997.

[3] S. S. Keerthi and S. K. Shevade, "A fast tracking algorithm for generalized LARS/LASSO," IEEE Transactions on Neural Networks, vol. 18, no. 6, pp. 1826–1830, 2007.

[4] G. H. John, R. Kohavi, and K. Pfleger, "Irrelevant features and the subset selection problem," in Proceedings of the Eleventh International Conference on Machine Learning, W. W. Cohen and H. Hirsch, Eds. San Francisco, CA: Morgan Kaufmann Publishers, 1994, pp. 121–129.

[5] A. E. Hoerl and R. W. Kennard, "Ridge regression: Biased estimation for nonorthogonal problems," Technometrics, vol. 12, pp. 55–67, 1970.

[6] T. Poggio and F. Girosi, "Networks for approximation and learning," Proceedings of the IEEE, vol. 78, no. 9, 1990.

[7] C. Saunders, A. Gammerman, and V. Vovk, "Ridge regression learning algorithm in dual variables," in Proceedings of the Fifteenth International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp. 515–521.

[8] J. A. K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.

[9] J. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines. Singapore: World Scientific Pub. Co., 2002.

[10] R. Rifkin, G. Yeo, and T. Poggio, "Regularized least-squares classification," in Advances in Learning Theory: Methods, Model and Applications, ser.
NATO Science Series III: Computer and System Sciences, J. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, Eds. Amsterdam: IOS Press, 2003, vol. 190, ch. 7, pp. 131–154.

[11] T. Poggio and S. Smale, "The mathematics of learning: Dealing with data," Notices of the American Mathematical Society (AMS), vol. 50, no. 5, pp. 537–544, 2003.

[12] V. Vapnik, Estimation of Dependences Based on Empirical Data [in Russian]. Moscow, Soviet Union: Nauka, 1979, (English translation: Springer, New York, 1982).

[13] G. Wahba, Spline Models for Observational Data. Philadelphia, USA: Series in Applied Mathematics, Vol. 59, SIAM, 1990.

[14] P. Green and B. Silverman, Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. London, UK: Chapman and Hall, 1994.

[15] J. Shao, "Linear model selection by cross-validation," Journal of the American Statistical Association, vol. 88, no. 422, pp. 486–494, 1993.

[16] P. Zhang, "Model selection via multifold cross validation," The Annals of Statistics, vol. 21, no. 1, pp. 299–313, 1993.

[17] E. K. Tang, P. N. Suganthan, and X. Yao, "Gene selection algorithms for microarray data based on least squares support vector machine," BMC Bioinformatics, vol. 7, p. 95, 2006.

[18] Z. Ying and K. C. Keong, "Fast leave-one-out evaluation for dynamic gene selection," Soft Computing, vol. 10, no. 4, pp. 346–350, 2006.

[19] F. Ojeda, J. A. K. Suykens, and B. De Moor, "Low rank updated LS-SVM classifiers for fast variable selection," Neural Networks, vol. 21, no. 2-3, pp. 437–449, 2008, Advances in Neural Networks Research: IJCNN '07.

[20] S. R. Searle, Matrix Algebra Useful for Statistics. New York, NY: John Wiley and Sons, Inc., 1982.

[21] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An introduction to kernel-based learning algorithms," IEEE Transactions on Neural Networks, vol. 12, pp.
181–201, 2001.

[22] B. Schölkopf and A. J. Smola, Learning with Kernels. Cambridge, MA: MIT Press, 2002.

[23] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge: Cambridge University Press, 2004.

[24] R. Rifkin and R. Lippert, "Notes on regularized least squares," Massachusetts Institute of Technology, Tech. Rep. MIT-CSAIL-TR-2007-025, 2007.

[25] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.

[26] L. Bottou and C.-J. Lin, "Support vector machine solvers," in Large-Scale Kernel Machines, ser. Neural Information Processing, L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, Eds. Cambridge, MA, USA: MIT Press, 2007, pp. 1–28.

[27] T. Pahikkala, J. Boberg, and T. Salakoski, "Fast n-fold cross-validation for regularized least-squares," in Proceedings of the Ninth Scandinavian Conference on Artificial Intelligence (SCAI 2006), T. Honkela, T. Raiko, J. Kortela, and H. Valpola, Eds. Espoo, Finland: Otamedia, 2006, pp. 83–90.

[28] S. An, W. Liu, and S. Venkatesh, "Fast cross-validation algorithms for least squares support vector machine and kernel ridge regression," Pattern Recognition, vol. 40, no. 8, pp. 2154–2162, 2007.

[29] P. Pudil, J. Novovičová, and J. Kittler, "Floating search methods in feature selection," Pattern Recognition Letters, vol. 15, no. 11, pp. 1119–1125, 1994.

[30] J. Li, M. T. Manry, P. L. Narasimha, and C. Yu, "Feature selection using a piecewise linear network," IEEE Transactions on Neural Networks, vol. 17, no. 5, pp. 1101–1115, 2006.

[31] T. Zhang, "Adaptive forward-backward greedy algorithm for sparse learning with linear models," in Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds., 2009, pp. 1921–1928.

[32] T. Pahikkala, E. Tsivtsivadze, A. Airola, J. Boberg, and T.
Salakoski, "Learning to rank with pairwise regularized least-squares," in SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, T. Joachims, H. Li, T.-Y. Liu, and C. Zhai, Eds., 2007, pp. 27–33.

[33] T. Pahikkala, E. Tsivtsivadze, A. Airola, J. Boberg, and J. Järvinen, "An efficient algorithm for learning to rank from preference graphs," Machine Learning, vol. 75, no. 1, pp. 129–165, 2009.

[34] A. J. Smola and B. Schölkopf, "Sparse greedy matrix approximation for machine learning," in Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), P. Langley, Ed. San Francisco, CA: Morgan Kaufmann Publishers Inc., 2000, pp. 911–918.

[35] L. Jiao, L. Bo, and L. Wang, "Fast sparse approximation for least squares support vector machine," IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 685–697, 2007.

[36] S. Chen, C. F. N. Cowan, and P. M. Grant, "Orthogonal least squares learning algorithm for radial basis function networks," IEEE Transactions on Neural Networks, vol. 2, no. 2, 1991.