EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning


Authors: Yuan Qi, Feng Yan

Yuan Qi, Departments of CS and Statistics, Purdue University
Feng Yan, Department of CS, Purdue University

October 28, 2018

Abstract

It is a challenging task to select correlated variables in a high-dimensional space. To address this challenge, the elastic net has been developed and successfully applied to many applications. Despite its great success, the elastic net does not explicitly use correlation information embedded in data to select correlated variables. To overcome this limitation, we present a novel Bayesian hybrid model, the EigenNet, that uses the eigenstructures of data to guide variable selection. Specifically, it integrates a sparse conditional classification model with a generative model capturing variable correlations in a principled Bayesian framework. We reparameterize the hybrid model in the eigenspace to avoid overfitting and to increase the computational efficiency of its MCMC sampler. Furthermore, we provide an alternative view of the EigenNet from a regularization perspective: the EigenNet has an adaptive eigenspace-based composite regularizer, which naturally generalizes the l1/l2 regularizer used by the elastic net. Experiments on synthetic and real data show that the EigenNet significantly outperforms the lasso, the elastic net, and the Bayesian lasso in terms of prediction accuracy, especially when the number of training samples is smaller than the number of variables.

1 Introduction

In this paper we consider the problem of selecting correlated variables in a high-dimensional space. Among many variable selection methods, the lasso and the elastic net are two popular choices [Tibshirani, 1994, Zou & Hastie, 2005]. The lasso uses an l1 regularizer on model parameters.
This regularizer shrinks the parameters towards zero, removing irrelevant variables and yielding a sparse model [Tibshirani, 1994]. However, the l1 penalty may lead to over-sparsification: given many correlated variables, the lasso often selects only a few of them. This not only degrades its prediction accuracy but also affects the interpretability of the estimated model. For example, based on high-throughput biological data such as gene expression and RNA-seq data, it is highly desirable to select multiple correlated genes specific to a phenotype, since doing so may reveal underlying biological pathways. Due to its over-sparsification, the lasso may not be suitable for this task. To address this issue, the elastic net has been developed to encourage a grouping effect, where strongly correlated variables tend to be in or out of the model together [Zou & Hastie, 2005]. However, the grouping effect is just the result of its composite l1 and l2 regularizer; the elastic net does not explicitly incorporate correlation information among variables in its model.

In this paper, we propose a new sparse Bayesian hybrid model, called the EigenNet. Unlike previous sparse models, it uses the eigen information from the data covariance matrix to guide the selection of correlated variables. Specifically, it integrates a sparse conditional classification model with a generative model capturing variable correlations in a principled Bayesian framework [Lasserre et al., 2006]. The hybrid model enables the identification of groups of correlated variables guided by the eigenstructures. Also, it passes information from the conditional model to the generative model, selecting informative eigenvectors for the classification task. Unlike frequentist approaches, the Bayesian hybrid model can reveal correlations between classifier weights via their joint posterior distribution.
We reparameterize the model in the eigenspace of the data. When the number of predictor variables (i.e., input features), p, is bigger than the number of training samples, n, this reparameterization restricts the model to the data subspace, which not only reduces overfitting but also allows us to develop an efficient Markov chain Monte Carlo (MCMC) sampler. From the regularization perspective, the EigenNet naturally generalizes the elastic net by using a composite regularizer adaptive to the data eigenstructures. It contains an l1 sparsity regularizer and a directional regularizer that encourages selecting variables associated with eigenvectors chosen by the model. When the variables are independent of each other, the eigenvectors are parallel to the axes and this composite regularizer reduces to the l1/l2 regularizer used by the elastic net; when some of the input variables are strongly correlated, the regularizer encourages the classifier to align with the eigenvectors selected by the model. On one hand, our model is like the elastic net in retaining 'all the big fish'. On the other hand, our model differs from the elastic net by using the eigenstructure. Hence the name EigenNet. Experiments on synthetic and real data are presented in Section 6. They demonstrate that the EigenNet significantly outperforms the lasso, the elastic net, and the Bayesian lasso [Park et al., 2008, Hans, 2009] in terms of prediction accuracy, especially when the number of training samples is smaller than the number of features.

2 Background: lasso and elastic net

We denote n independent and identically distributed samples as D = {(x_1, y_1), ..., (x_n, y_n)}, where x_i is a p-dimensional vector of input features (i.e., explanatory variables) and y_i is a scalar label (i.e., response). Also, we denote [x_1, ..., x_n] by X and (y_1, ..., y_n) by y.
In this paper, we consider the binary classification problem (y_i ∈ {−1, 1}), but our analysis and the proposed models can be extended to regression and other problems. For classification, we use a logistic function as the data likelihood function:

p(y | X, w, b) = ∏_i σ(y_i (w^T x_i + b))    (1)

where σ(z) = 1 / (1 + exp(−z)), and w and b define the classifier.

To identify relevant variables for high-dimensional problems, the lasso [Tibshirani, 1994] uses an l1 penalty, effectively shrinking w and b towards zero and pruning irrelevant variables. In a probabilistic framework this penalty corresponds to a Laplace prior distribution:

p(w) = ∏_j λ exp(−λ |w_j|)    (2)

where λ is a hyperparameter that controls the sparsity of the estimated model. The larger the hyperparameter λ, the sparser the model. As described in Section 1, the lasso may over-penalize relevant variables and hurt its predictive performance, especially when there are strongly correlated variables. To address this issue, the elastic net [Zou & Hastie, 2005] combines l1 and l2 regularizers to avoid the over-penalization. The combined regularizer corresponds to the following prior distribution:

p(w) ∝ ∏_j exp(−λ_1 |w_j| − λ_2 w_j^2)    (3)

where λ_1 and λ_2 are hyperparameters. While it is well known that the elastic net tends to select strongly correlated variables together, it does not use correlation information embedded in the data. The selection of correlated variables is merely the result of a less aggressive regularizer for sparsity. Besides the elastic net, there are many variants of (and extensions to) the lasso, such as the bridge [Frank & Friedman, 1993] and the smoothly clipped absolute deviation [Fan & Li, 2001]. These variants modify the l1 penalty to choose variables, but again do not explicitly use correlation information in data.
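The likelihood (1) and the two priors (2) and (3) can be evaluated directly. The sketch below is illustrative only (the function names are ours, not the paper's), computing the log-likelihood and the unnormalized log-priors with numpy:

```python
import numpy as np

def log_likelihood(w, b, X, y):
    """Log of the logistic likelihood in Eq. (1): prod_i sigma(y_i (w^T x_i + b))."""
    z = y * (X @ w + b)                      # margins y_i (w^T x_i + b)
    return -np.sum(np.logaddexp(0.0, -z))    # log sigma(z) = -log(1 + exp(-z))

def log_laplace_prior(w, lam):
    """Log of the lasso's Laplace prior in Eq. (2)."""
    return len(w) * np.log(lam) - lam * np.sum(np.abs(w))

def log_elastic_net_prior(w, lam1, lam2):
    """Unnormalized log of the elastic-net prior in Eq. (3)."""
    return -lam1 * np.sum(np.abs(w)) - lam2 * np.sum(w ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1, -1, 1, 1, -1])
w, b = np.array([0.5, -0.2, 0.0]), 0.1
print(log_likelihood(w, b, X, y))
```

Since each σ(·) lies in (0, 1), the log-likelihood is always negative, and both log-priors are maximized at w = 0, which is what drives the shrinkage.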
3 EigenNet: eigenstructure-guided variable selection

In this section, we propose to use covariance structures in the data to guide the sparse estimation of model parameters. First, let us consider the following toy examples.

3.1 Toy examples

Figure 1(a) shows samples from two classes. Clearly, the variables x_1 and x_2 are not correlated. The lasso or the elastic net can successfully select the relevant variable x_1 to classify the data. For the samples in Figure 1(b), the variables x_1 and x_2 are strongly correlated. Despite the strong correlation, the lasso would select only x_1 and ignore x_2. The elastic net may select both x_1 and x_2 if the regularization weight λ_1 is small and λ_2 is big, so that the elastic net behaves like an l2-regularized classifier. The elastic net, however, does not exploit the fact that x_1 and x_2 are correlated.

[Figure 1: Toy examples. (a) When the variables x_1 and x_2 are independent of each other, both the lasso and the EigenNet select only x_1. (b) When the variables x_1 and x_2 are correlated, the lasso selects only one variable. By contrast, guided by the major eigenvector of the data, the EigenNet selects both variables.]

Since the eigenstructure of the data covariance matrix captures correlation information between variables, we propose to not only regularize the classifier to be sparse, but also encourage it to be aligned with certain eigenvector(s) that are helpful for the classification task. Since our new model uses the eigen information, we name it the EigenNet. For the data in Figure 1(a), since the two eigenvectors are parallel with the horizontal and vertical axes, the EigenNet essentially reduces to the elastic net and selects x_1.
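The intuition behind the toy example in Figure 1(b) can be checked numerically: when two variables are strongly correlated, the principal eigenvector of their covariance matrix has large components on both, so aligning the classifier with that eigenvector encourages joint selection. A small sketch under our own assumed data-generating choices (the 0.9/0.3 mixing weights are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Correlated pair: x2 is a noisy copy of x1, mimicking the toy example of Figure 1(b).
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)
X = np.column_stack([x1, x2])

# Eigenstructure of the data covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
v_major = eigvecs[:, np.argmax(eigvals)]   # principal eigenvector

# Both components of v_major are large, so a classifier aligned with it
# uses x1 and x2 jointly rather than dropping one of them.
print(np.abs(v_major))
```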
For the data in Figure 1(b), however, the eigenvectors (in particular, the principal eigenvector) will guide the EigenNet to select both x_1 and x_2.

We use a Bayesian framework to materialize the above ideas in the EigenNet, as shown in the following section.

3.2 Bayesian hybrid of conditional and generative models

The EigenNet is a hybrid of conditional and generative models. The conditional component allows us to learn the classifier via "discriminative" training; the generative component captures the correlations between variables; and these two models are glued together via a joint prior distribution, so that the correlation information is used to guide the estimation of the classifier and the classification task is used to choose or scale relevant eigenvectors. Our approach is based on the general Bayesian framework proposed by Lasserre et al. [2006], which allows one to combine conditional and generative models in an elegant, principled way.

Specifically, for the conditional model we have the same likelihood as (1), p(y | X, w, b) = ∏_i σ(y_i (w^T x_i + b)). To sparsify the classifier, we can use a Laplace prior on w:

p(w) = ∏_j λ_1 exp(−λ_1 |w_j|).    (4)

To encourage the classifier to align with certain eigenvectors, we use the following generative model:

p(V, s | w̃) ∝ exp(−(λ_2 / 2) ∑_j η_j ||w̃ − s_j v_j||_+^2)    (5)

where

||w̃ − s_j v_j||_+^2 ≡ ||w̃||^2 − 2 s_j |w̃^T v_j| + s_j^2 ||v_j||^2 = ||w̃||^2 − 2 s_j |w̃^T v_j| + s_j^2,    (6)

s are nonnegative continuous variables, and v_j and η_j are the j-th eigenvector and eigenvalue of the data covariance matrix, respectively (the second equality holds since ||v_j|| = 1). The reason we use the absolute value of w̃^T v_j in (6) is that we only care about the alignment of w̃ and v_j, not the sign of their product.
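The alignment penalty (6) is zero exactly when w̃ points along v_j (in either direction) with scale s_j, and it is invariant to flipping the sign of w̃, thanks to the absolute value. A quick numeric check of these two properties (the helper name is ours):

```python
import numpy as np

def alignment_penalty(w_tilde, v, s):
    """||w_tilde - s v||_+^2 from Eq. (6): ||w_tilde||^2 - 2 s |w_tilde^T v| + s^2,
    for a unit eigenvector v (so ||v||^2 = 1). The absolute value makes the
    penalty insensitive to the sign of w_tilde^T v."""
    return w_tilde @ w_tilde - 2.0 * s * abs(w_tilde @ v) + s ** 2

v = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit eigenvector
w_tilde = np.array([2.0, 2.0])            # perfectly aligned with v
s = np.sqrt(8.0)                          # scale matching ||w_tilde||

print(alignment_penalty(w_tilde, v, s))   # aligned and well scaled: ~0
print(alignment_penalty(-w_tilde, v, s))  # flipping the sign changes nothing
```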
Overall, the above model encourages the classifier to be more aligned with the major eigenvectors, which have bigger eigenvalues. The variables s, however, allow us to scale or select individual eigenvectors to remove irrelevant ones.

To integrate the conditional and generative models, we use a joint prior on w and w̃:

p(w, w̃) ∝ exp(−λ_1 |w|_1) exp(−(λ_3 / 2) ||w − w̃||^2),    (7)

i.e., we have

p(w, w̃) = λ_1 exp(−λ_1 |w|_1) N(w̃ | w, λ_3^{−1} I).    (8)

Finally, we can assign Gamma priors to all the hyperparameters, λ_1, λ_2, and λ_3. The whole model is depicted in the graphical model in Figure 2.

[Figure 2: The graphical model of the EigenNet, with observations (x_i, y_i), i = 1, ..., n, eigenvectors and scales (v_j, s_j), j = 1, ..., m, parameters w and w̃, and hyperparameters λ_1, λ_2, λ_3.]

3.3 Reparameterization and constraint in the eigenspace

In this section we reparameterize the model in the eigenspace:

w = V α,   w̃ = V β    (9)

where V ≡ [v_1, ..., v_m] (m = min{n, p}), and α and β are the projections of w and w̃ on the eigenvectors, respectively. The reparameterization restricts w to the vector space spanned by {v_1, ..., v_m}, which is equivalent to the data space C(X) spanned by the data points {x_1, ..., x_n}. When the number of features is bigger than the number of training points, i.e., p > n, it effectively reduces the number of free parameters in the model, helping avoid overfitting. Furthermore, it provides a significant computational advantage when p >> n.

Given p(w, w̃) and the relationship between (w, w̃) and (α, β), we obtain p(α, β) (please see the Appendix for details):

p(α, β) ∝ exp(−λ_1 |V α|_1) exp(−(λ_3 / 2) ||α − β||^2).    (10)

Based on the new reparameterization, the likelihood for the conditional model becomes

p(y | X, α, b) = ∏_i σ(y_i (x_i^T V α + b)).    (11)

[Figure 3: Adaptive regularization of the EigenNet. The ellipses are the contours of a likelihood function. While the lasso draws the estimates towards the l1 ball, the EigenNet's estimate is guided by an eigenvector v.]
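The reparameterization (9) can be sketched numerically: with p > n, the data admit only m = min{n, p} = n eigenvectors with nonzero eigenvalues, and predictions computed through w = Vα match those computed directly in the eigenspace via XV, as in (11). A minimal sketch, using the SVD to obtain the eigenvectors (an implementation choice of ours, not prescribed by the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 100                            # p > n: more features than samples
X = rng.normal(size=(n, p))

# Right singular vectors of X are the eigenvectors of the data covariance;
# only m = min(n, p) of them correspond to nonzero eigenvalues.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                                  # p x m, orthonormal columns
m = V.shape[1]

alpha = rng.normal(size=m)                # classifier in the eigenspace, Eq. (9)
w = V @ alpha                             # w restricted to the data subspace

# Predictions agree whether computed in the original space or the eigenspace,
# so the sampler only needs to work with the m-dimensional alpha:
margins_w = X @ w
margins_alpha = (X @ V) @ alpha
print(m, np.allclose(margins_w, margins_alpha))
```

The sampler thus works with m = 20 parameters instead of p = 100, which is the source of both the overfitting reduction and the computational savings.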
Similarly, the likelihood for the generative model becomes

p(V, s | β) ∝ exp(−(λ_2 / 2) ∑_j η_j (||V β||^2 − 2 s_j |(V β)^T v_j| + s_j^2 ||v_j||^2))
           = exp(−(λ_2 / 2) ∑_j η_j (||β||^2 − 2 s_j |β_j| + s_j^2)).    (12)

The second line holds since V is an orthonormal matrix (so ||V β|| = ||β||, (V β)^T v_j = β_j, and ||v_j|| = 1). Combining (10), (11), and (12), we obtain the complete model. We use Markov chain Monte Carlo with a random-walk proposal to estimate the model parameters s, w, and w̃.

4 Alternative view: composite regularization

In this section, we provide an alternative view of the EigenNet by considering the limiting case λ_3 → ∞, in which the Gaussian coupling in (8) becomes deterministic. In this case the prior p(α, β) becomes

p(α, β) = p(α) δ(α − β),

which forces α = β. From a regularization perspective, this prior is equivalent to a composite regularizer:

λ_1 |w|_1 + (λ_2 / 2) ∑_j η_j ||w − s_j v_j||_+^2    (13)
= λ_1 |w|_1 + (λ_2 / 2) ∑_j η_j (||w||^2 − 2 s_j |w^T v_j| + s_j^2).    (14)

Clearly, when s_j = 0 for all j, the above regularizer reduces to the l1/l2 regularizer used by the elastic net.¹ When s_j ≠ 0, the regularizer adapts based on the eigenvector v_j: First, if the elements of v_j all have reasonably large values, then all the variables in w are very likely to be selected. This effect is visualized in Figure 3(b). Second, if this eigenvector has only several large elements, the corresponding variables in w̃ and w are likely to be selected jointly. Unlike the l1/l2 regularizer, which encourages the selection of groups of variables from among all the variables, our regularizer directly targets the specific groups of variables corresponding to the sparse eigenvector.

¹ A subtle difference is that we also constrain w to the data space in our model.
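The reduction of (14) to an elastic-net-style penalty when all s_j = 0 can be verified directly: the eigen term collapses to a multiple of ||w||², leaving an l1 plus l2 penalty. A sketch (function name ours; the ∑_j η_j factor is absorbed into the l2 weight):

```python
import numpy as np

def composite_regularizer(w, V, eta, s, lam1, lam2):
    """Eq. (14): lam1 |w|_1 + (lam2/2) sum_j eta_j (||w||^2 - 2 s_j |w^T v_j| + s_j^2)."""
    l1 = lam1 * np.sum(np.abs(w))
    ww = w @ w
    eigen_term = sum(
        eta[j] * (ww - 2.0 * s[j] * abs(w @ V[:, j]) + s[j] ** 2)
        for j in range(V.shape[1])
    )
    return l1 + 0.5 * lam2 * eigen_term

w = np.array([1.0, -2.0])
V = np.eye(2)                 # independent variables: axis-aligned eigenvectors
eta = np.array([1.0, 1.0])
lam1, lam2 = 0.5, 0.3

# With all s_j = 0, the regularizer reduces to an l1 + l2 (elastic-net) form,
# with sum_j eta_j folded into the l2 weight:
r0 = composite_regularizer(w, V, eta, np.zeros(2), lam1, lam2)
elastic = lam1 * np.sum(np.abs(w)) + 0.5 * lam2 * np.sum(eta) * (w @ w)
print(r0, elastic)
```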
Third, if all the variables are independent of each other, then the eigenvectors are parallel to the axes and each of them contains only one nonzero element. In this case |w^T v_j| reduces to |w_j|, an l1 regularizer. Figure 3(a) visualizes the eigen regularizer when the variables are independent of each other.

In summary, the EigenNet can be viewed as an adaptive generalization of the elastic net that selects groups of correlated variables based on eigenvectors of the data covariance matrix.

5 Related work

The EigenNet can be viewed as an extension of the classical eigenface approaches [Turk & Pentland, 1991, Sirovich & Kirby, 1987]. The eigenface approach uses the PCA coefficients of samples to train a classifier. Naturally, the major eigenvectors are often associated with large PCA coefficients, and the classifier is constrained to the data subspace when the number of features is larger than the number of training samples. The EigenNet essentially extends the eigenface approach by combining generative and conditional models in a Bayesian framework and performs sparse learning in an adaptive eigenspace (since the model selects or scales relevant eigenvectors based on s_j).

There are Bayesian versions of the lasso and the elastic net. The Bayesian lasso [Park et al., 2008] puts a hyper-prior on the regularization coefficient and uses a Gibbs sampler to jointly sample both the regression weights and the regularization coefficient. Using a similar treatment, the Bayesian elastic net [Li & Lin, 2010] samples the two regularization coefficients simultaneously, potentially avoiding the "double shrinkage" problem described in the original elastic net paper [Zou & Hastie, 2005]. Like the EigenNet, these methods are grounded in a Bayesian framework, sharing the benefit of obtaining posterior distributions for handling estimation uncertainty.
However, the Bayesian lasso and the Bayesian elastic net were presented for regression problems (though they can certainly be generalized to classification problems) and sample in the original parameter space, without using the eigen information embedded in the data. The EigenNet, by contrast, works in the eigenspace and uses the eigen information to guide classification.

6 Experimental results

We evaluate the new sparse Bayesian model, the EigenNet, on both synthetic and real data and compare it with three representative state-of-the-art variable selection methods: the lasso, the elastic net, and the Bayesian lasso modified for classification problems. For the lasso and the elastic net we use the Glmnet software package, which applies cyclical coordinate descent in a pathwise fashion.² The original Bayesian lasso was developed for regression and uses Gibbs sampling. For the classification tasks we consider, we change its Gaussian regression likelihood to the logistic likelihood (1) while keeping its Laplace prior distributions, and we use Markov chain Monte Carlo, instead of a Gibbs sampler, to estimate the classifier for the Bayesian lasso. Bayesian approaches are capable of estimating all the hyperparameters from data. However, for easy and objective comparisons, we simply use cross-validation to tune the hyperparameters λ_i for all methods. For the Bayesian lasso and the EigenNet, we draw 300,000 MCMC samples and use the last 150,000 samples to estimate the posterior means of the classifiers, which are used for predicting the labels of test samples. We measure the prediction performance of all methods on test samples in terms of their average test error rate (e.g., an error rate of 0.2 indicates 20% errors) and report the standard error of the error rates (except for the following visualization example).
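The random-walk MCMC estimation described above can be sketched at toy scale: propose a Gaussian step, accept with the Metropolis rule under the unnormalized log posterior (logistic likelihood (1) plus Laplace prior (2)), and average the second half of the chain for the posterior mean. This is our own minimal sketch, not the paper's implementation, and it uses a far shorter chain than the 300,000 samples used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_post(w, X, y, lam):
    """Unnormalized log posterior: logistic likelihood (1) + Laplace prior (2)."""
    z = y * (X @ w)
    return -np.sum(np.logaddexp(0.0, -z)) - lam * np.sum(np.abs(w))

# Tiny synthetic problem with a clear sign pattern in the true weights.
n, p = 60, 3
X = rng.normal(size=(n, p))
w_true = np.array([2.0, -2.0, 0.0])
y = np.where(X @ w_true + 0.1 * rng.normal(size=n) > 0, 1, -1)

# Random-walk Metropolis with a Gaussian proposal.
w = np.zeros(p)
lp = log_post(w, X, y, lam=1.0)
samples = []
for _ in range(4000):
    prop = w + 0.2 * rng.normal(size=p)
    lp_prop = log_post(prop, X, y, lam=1.0)
    if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis acceptance
        w, lp = prop, lp_prop
    samples.append(w.copy())

# Posterior mean from the second half of the chain, as in the paper's setup.
w_mean = np.mean(samples[len(samples) // 2:], axis=0)
print(np.sign(w_mean[:2]))
```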
6.1 Visualization of estimated classifiers

First, we test these methods on synthetic data that contain correlated features. We sample 40-dimensional data points, each of which contains two groups of correlated variables. The correlation coefficient between variables in each group is 0.81, and there are 4 variables in each group. We set the values of the classifier weights in one group to 5 and in the other group to −5. We also generate the bias term randomly from a standard Gaussian distribution. We set the number of training points to 80.

Figure 4 shows the estimated classifiers and the true classifier. It is not surprising that the elastic net identifies more features than the lasso. What is interesting is that the EigenNet does not suppress many of the irrelevant features to be exactly 0, but it clearly identifies all the relevant ones, which dominate the irrelevant ones. To save space, we do not show the classifier estimated by the Bayesian lasso. Similar to the EigenNet, its classifier also contains many small but nonzero weights. On this dataset, the test error rates of the lasso, the elastic net, the Bayesian lasso, and the EigenNet are 0.297, 0.245, 0.251, and 0.137, respectively.

An advantage of the Bayesian treatment for feature selection over frequentist approaches is the possibility of uncovering the correlations between the classifier weights. These correlations can be revealed by the covariance matrices of the joint posterior distribution over the classifier weights. In Figure 5, we visualize the quantized covariance matrices estimated by the Bayesian lasso and the

² http://www-stat.stanford.edu/~tibs/glmnet-matlab/

[Figure 4: Visualization of the lasso, the elastic net, the EigenNet, and the true classifier weights. These classifiers are estimated on 80 training samples with 40 features.]
[Figure 4, continued: Among the 40 features, 8 of them (as well as the bias) are relevant for the classification task. On this dataset the test error rates of the lasso, the elastic net, the Bayesian lasso, and the EigenNet are 0.297, 0.245, 0.251, and 0.137, respectively.]

EigenNet. As shown in Figures 5(a) and 5(b), while the Bayesian lasso suggests some correlation structures among the features, they are fairly noisy. By contrast, the EigenNet reveals the two groups of correlated features much more clearly.

6.2 Classification of synthetic data

Now we systematically compare these methods on synthetic datasets containing correlated features and datasets containing independent features. For the first case, we use a procedure similar to that of the visualization example: we sample 40-dimensional data points, each of which contains two groups of correlated variables. The correlation coefficient between variables in each group is 0.81, and there are 4 variables in each group. However, unlike in the previous example, where the classifier weights are the same for the correlated variables, we now set the weights within the same group to have the same sign but different random values. We vary the number of training points from 10 to 80 and test all these methods. For the datasets with independent features, we follow the same procedure except that the features are independently sampled. We run the experiments 10 times.

Figure 6 shows the error rates averaged over 10 runs. We do not plot the standard errors of the test error rates, since

[Figure 5: Covariance matrices of the Bayesian lasso and the EigenNet classifiers, estimated from the MCMC samples for these two models. We use 80 training samples with 40 features per sample. The covariance matrix of the EigenNet classifier correctly suggests that the last few features are correlated.]
[Figure 5, continued: In particular, it clearly identifies a group of four correlated features.]

they have very small values: the biggest is less than 0.0183 for the results on data with correlated features, and less than 0.030 for the results on data with independent features. We report the numerical values of both the averaged error rates and the standard errors in the supplemental materials.

For the datasets with independent features, the EigenNet outperforms the alternative methods when the number of training samples is smaller than 40, the number of features (i.e., p > n). Since in this case the eigenstructures of the datasets are uninformative, we expect the improved prediction accuracy to be the result of the subspace constraint used by the EigenNet. Once the number of training samples is no smaller than the data dimension, all these methods perform quite similarly.

For the datasets with correlated features, the EigenNet significantly outperforms the alternative methods consistently, not only when the number of training samples is smaller than 40 (p > n) but also when it is not. We believe this is because the EigenNet uses the valuable eigen information revealing the feature correlations to train its classifiers. Note that although the results of the elastic net appear to overlap with those of the lasso, for the data with correlated features the elastic net actually often slightly outperforms the lasso (please see their numerical values in the supplemental materials).

6.3 Classification of real data

Besides the synthetic data, we also test all these methods on UCI benchmark datasets: two high-dimensional gene expression datasets, leukaemia and colon cancer, and a spambase dataset with relatively lower dimension but many more training samples.
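The synthetic generator above (groups of equicorrelated variables with pairwise correlation 0.81) can be sketched with a shared latent factor per group. The paper does not spell out its exact generating mechanism, only the group size and correlation, so the latent-factor construction and the group positions below are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def grouped_features(n, p=40, group_size=4, rho=0.81):
    """Sample n points with two groups of equicorrelated variables.
    Within a group, each pair of variables has correlation rho. The shared
    latent-factor construction is an assumption; the paper specifies only
    the group size (4) and the correlation (0.81)."""
    X = rng.normal(size=(n, p))
    for start in (0, group_size):                 # two groups, placed up front
        shared = rng.normal(size=(n, 1))
        X[:, start:start + group_size] = (
            np.sqrt(rho) * shared
            + np.sqrt(1.0 - rho) * rng.normal(size=(n, group_size))
        )
    return X

X = grouped_features(5000)
within = np.corrcoef(X[:, 0], X[:, 1])[0, 1]      # within-group, should be ~0.81
between = np.corrcoef(X[:, 0], X[:, 20])[0, 1]    # between groups, should be ~0
print(round(within, 2), round(between, 2))
```

Each variable x = sqrt(rho) z + sqrt(1 − rho) ε has unit variance, and two variables sharing z have covariance (hence correlation) exactly rho.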
For the leukaemia dataset, the task is to distinguish acute myeloid leukaemia (AML) from acute lymphoblastic leukaemia (ALL). The whole dataset has 47 and 25 samples of type ALL and AML, respectively, with 7129 features per sample. The dataset was randomly split 20 times into 37 training and 35 test samples. For the colon cancer dataset, the task is to discriminate tumor from normal tissues using microarray data. The dataset has 22 normal and 40 cancer samples with 2000 features per sample. We randomly split the dataset into 31 training and 31 test samples 10 times. For the spambase dataset, the task is to detect spam emails, i.e., unsolicited commercial emails. We use 57 features indicating whether a particular word or character occurs frequently in the emails. We randomly split the dataset into 1533 training and 3066 test samples 10 times.

[Figure 6: Test error rates on synthetic datasets with independent features and with correlated features. Each training sample has 40 features, 8 of which are relevant. We increase the number of training samples from 10 to 80 and use 2000 test samples each time. The results are averaged over 10 runs. For the data with independent features, the EigenNet outperforms the alternative methods when the number of training samples is smaller than 40, the number of features; with more training samples containing independent features, all the methods perform comparably. For the data with correlated features, the EigenNet outperforms the alternative methods consistently.]
Note that we do not use any kernel here; the results on this dataset are meant to examine how the performance of these methods compares when there are more samples than features. Using a nonlinear basis function, e.g., a radial basis function, is expected to boost the predictive performance of all these methods.

Figure 7 summarizes the average test error rates and the standard errors of these methods on the three datasets. Again, the EigenNet significantly outperforms the alternative methods on all three datasets. Note that for the leukaemia and colon cancer datasets the Bayesian lasso performs much worse than the other methods. The reason, we believe, is that these two high-dimensional datasets contain thousands of features, and the Bayesian lasso directly draws samples in such high-dimensional spaces, leading to very slow mixing rates. By contrast, the EigenNet draws samples efficiently in a much smaller eigenspace, not only leading to faster mixing rates but also greatly reducing the computational cost of obtaining each sample.

[Figure 7: Test error rates on the spambase, leukemia, and colon cancer datasets. The error bars represent the standard errors of the error rates. The results on the spambase and colon cancer datasets are averaged over 10 random partitions, and the results on the leukemia dataset are averaged over 20 partitions.]

7 Conclusions

In this paper, we have presented a novel sparse Bayesian hybrid model, the EigenNet. It integrates a sparse conditional classification model with a generative model capturing the feature correlations.
It also generalizes the elastic net by explicitly exploiting correlations between features. Compared with several state-of-the-art methods, the EigenNet achieves significantly improved prediction accuracy on several benchmark datasets.

We plan to extend our hybrid model by utilizing other probabilistic generative models, such as sparse principal component analysis and related projection methods [Guan & Dy, 2009, Archambeau & Bach, 2009] and independent component analysis models. Compared to the classical PCA models, these models could better guide the selection of interdependent sparse features.

Acknowledgement

Thanks to Jyotishka Datta for his help on software implementation and to Tommi Jaakkola for stimulating discussions.

References

Archambeau, Cédric and Bach, Francis. Sparse probabilistic projections. In Advances in Neural Information Processing Systems 21, 2009.

Fan, Jianqing and Li, Runze. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

Frank, Ildiko E. and Friedman, Jerome H. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993.

Guan, Yue and Dy, Jennifer. Sparse probabilistic principal component analysis. JMLR W&CP: AISTATS, 5, 2009.

Hans, Chris. Bayesian lasso regression. Biometrika, 96(4):835–845, 2009.

Lasserre, Julia A., Bishop, Christopher M., and Minka, Thomas P. Principled hybrids of generative and discriminative models. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 87–94, 2006.

Li, Qing and Lin, Nan. The Bayesian elastic net. Bayesian Analysis, 5(1):151–170, 2010.

Park, Trevor and Casella, George. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.

Petersen, Kaare Brandt and Pedersen, Michael Syskind.
The Matrix Cookbook, 2008. URL http://matrixcookbook.com.

Sirovich, L. and Kirby, M. Low-dimensional procedure for the characterization of human faces. Journal of the Optical Society of America A, 4(3):519–524, 1987.

Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.

Turk, Matthew and Pentland, Alex. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3:71–86, 1991.

Zou, Hui and Hastie, Trevor. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.

Appendix

Given the linear relationship between (α, β) and (w, w̃), the prior p(w, w̃) defined in (8) is equivalent to p(α, β) defined in (10).

First, when n ≥ p, we can easily obtain p(α, β) from p(w, w̃). In this case, the number of eigenvectors is p, and the Jacobian matrix is the p × p full-rank matrix V. Furthermore, the absolute value of the determinant of V is 1, since V is an orthonormal matrix. Therefore, with [w, w̃] = V[α, β], we have p(α, β) = p(w, w̃).

When p > n, V is a tall p × n matrix, and therefore we cannot compute its determinant to transform the prior distribution. Now p(w, w̃) is essentially a distribution on the data subspace embedded in the high-dimensional space R^p. To obtain the equivalence between these two priors, we consider the following theorem [Petersen & Pedersen, 2008]:

Theorem 1. If A is "tall", i.e., "under-determined", then

p(x) = ∫ p(s) δ(x − A s) ds = (1 / √|A^T A|) p(A⁺ x)  if x = A A⁺ x, and 0 otherwise.

Using this theorem and the fact that |V^T V| = 1, we see that, with the simple linear relationship between the variables, p(α, β) = p(w, w̃).
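The two facts the appendix relies on can be checked numerically: for a tall V with orthonormal columns, the pseudo-inverse V⁺ equals Vᵀ and |VᵀV| = 1, so the density transformation in Theorem 1 carries no Jacobian factor, and x = VV⁺x holds exactly for any x in the column space of V. A small verification sketch (our own construction of V via QR):

```python
import numpy as np

rng = np.random.default_rng(5)
p, m = 10, 4
# Tall p x m matrix with orthonormal columns, playing the role of V when p > n.
V, _ = np.linalg.qr(rng.normal(size=(p, m)))

Vplus = np.linalg.pinv(V)
# For orthonormal columns: V^+ = V^T and det(V^T V) = 1, so the factor
# 1/sqrt(|V^T V|) in Theorem 1 is just 1.
print(np.allclose(Vplus, V.T), np.linalg.det(V.T @ V))

# x = V V^+ x holds exactly for x in the column space of V (x = V alpha):
alpha = rng.normal(size=m)
x = V @ alpha
print(np.allclose(V @ Vplus @ x, x))
```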
