Bregman Distance to L1 Regularized Logistic Regression
Mithun Das Gupta
Epson Research and Development, Inc.
2580 Orchard Parkway, Suite 225
San Jose, CA 95131
mdasgupta@erd.epson.com

Thomas S. Huang
Dept. of Electrical and Computer Engg.
Beckman Inst. for Advanced Science and Tech.
University of Illinois, Urbana-Champaign
huang@ifp.uiuc.edu

Abstract

In this work we investigate the relationship between Bregman distances and the regularized logistic regression model. We present a detailed study of Bregman distance minimization, a family of generalized entropy measures associated with convex functions. We convert L1-regularized logistic regression into this more general framework and propose a primal-dual algorithm for learning the parameters. We pose L1-regularized logistic regression as Bregman distance minimization and then apply non-linear constrained optimization techniques to estimate the parameters of the logistic model.

1 Introduction

We study the problem of regularized logistic regression as proposed by [5] and [12]. L1 regularization has been studied extensively in recent years due to the sparsity of the classifiers obtained by such regularization [11]. The objective function in the L1-regularized LRP (Eqn. 4) is convex but not differentiable (specifically, when any of the weights is zero), so solving it is more of a computational challenge than solving the L2-regularized LRP. Despite this additional computational challenge, interest in the use of L1-regularized logistic regression has been growing. The main motivation is that L1-regularized LR typically yields a sparse vector λ, i.e., λ typically has relatively few nonzero coefficients. (In contrast, L2-regularized LR typically yields a λ with all coefficients nonzero.)
When λ_j = 0, the associated logistic model does not use the jth component of the feature vector, so a sparse λ corresponds to a logistic model that uses only a few of the features, i.e., components of the feature vector. Indeed, we can think of a sparse λ as a selection of the relevant or important features (i.e., those associated with nonzero λ_j), together with the choice of the intercept value and weights (for the selected features). A logistic model with sparse λ is, in a sense, simpler or more parsimonious than one with non-sparse λ. It is not surprising that L1-regularized LR can outperform L2-regularized LR, especially when the number of observations is smaller than the number of features.

Our work is based directly on the general setting of [12], in which one attempts to solve optimization problems based on general Bregman distances. They proposed the iterative scaling algorithm for minimizing such divergences through the use of auxiliary functions. Our work builds on several previous works which have compared divergence approaches to logistic regression. We closely follow the work of [5], who propose a new category of parallel and sequential algorithms for boosting and logistic regression based on Bregman distance minimization. They were among the first to connect the fields of regression and generalized divergences, but unconstrained estimation of the logistic parameters is unreliable for large problems, and hence we take up this study to tie constrained optimization to the existing work. Most of the work connecting Bregman distances and logistic regression minimizes an unconstrained auxiliary function at each step. In this work we pose the problem with box or L1 constraints, due to the favorable properties of L1 regularization for cases with large dimensions but relatively few training data points.

2 Logistic Regression

Let S = ⟨(x_1, y_1), …
, (x_m, y_m)⟩ be a set of training examples, where each instance x_i belongs to a domain or instance space χ, and each label y_i ∈ {−1, +1}. We assume that we are given a set of real-valued functions on χ, denoted h_i, where i = {1, 2, …, n}. Following convention in the maximum-entropy literature, we call these functions features; in the boosting literature, they would be called weak or base hypotheses. Note that, in the terminology of the latter literature, these features correspond to the entire space of base hypotheses rather than merely the base hypotheses previously found by the weak learner. We study the problem of approximating the y_i's using a linear combination of features. That is, we are interested in the problem of finding a vector of parameters λ ∈ ℝ^n such that f_λ(x_i) = Σ_{j=1}^n λ_j h_j(x_i) is a good approximation of y_i. For classification problems it is natural to try to match the sign of f_λ(x_i) to y_i, that is, to attempt to minimize

    Σ_{i=1}^m I[y_i f_λ(x_i) ≤ 0]    (1)

where I[c] = 1 whenever c is true. This form of loss is intractable in its most general form, and so some other non-negative loss function which closely resembles the above loss is minimized instead. In the logistic regression framework we use the estimate

    P{y = +1 | x} = 1 / (1 + exp(−f_λ(x)))    (2)

and the log-loss for this model is defined as

    ℓ(x, y) = Σ_{i=1}^m ln(1 + exp(−y_i f_λ(x_i)))    (3)

This is the loss function for the unconstrained minimization problem. But, as pointed out earlier, regularized loss functions are more effective in most practical cases, and hence we pose the optimization problem with a regularized loss function. The regularized loss function can now be written as

    ℓ(x, y) = Σ_{i=1}^m ln(1 + exp(−y_i f_λ(x_i))) + R(λ)    (4)

where R(λ) is the regularization function and can have different forms depending on the regularization method.
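As a minimal illustration of Eqn. 4, the sketch below evaluates the regularized log-loss directly. It assumes the L1 penalty R(λ) = α‖λ‖₁ used later in the paper; the feature values and weights are hypothetical.

```python
import math

def f_lambda(lam, h_x):
    # f_lambda(x) = sum_j lam_j * h_j(x); h_x holds the feature values h_j(x)
    return sum(l * h for l, h in zip(lam, h_x))

def regularized_log_loss(lam, X, y, alpha):
    # Eqn. 4 with an L1 penalty: sum_i ln(1 + exp(-y_i f_lambda(x_i))) + alpha ||lam||_1
    data_loss = sum(math.log(1.0 + math.exp(-yi * f_lambda(lam, hi)))
                    for hi, yi in zip(X, y))
    return data_loss + alpha * sum(abs(l) for l in lam)

# Toy data: two examples, two features (values are illustrative only).
X = [[1.0, 0.5], [-0.3, 1.2]]
y = [+1, -1]
lam = [0.8, -0.2]
loss = regularized_log_loss(lam, X, y, alpha=0.1)
```

At λ = 0 the data term is m ln 2 and the penalty vanishes, which gives a quick sanity check on any implementation.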
For L1 regularization the function R is defined as α‖λ‖₁.

3 Bregman Distance

Let F : ∆ → ℝ be a continuously differentiable and strictly convex function defined on a closed convex set ∆ ⊆ ℝ^r_+. The Bregman distance associated with F is defined for p, q ∈ ∆ as

    B_F(p‖q) := F(p) − F(q) − ∇F(q) · (p − q)    (5)

For instance, when F(p) = Σ_{i=1}^r p_i ln p_i, B_F is the unnormalized relative entropy D_U, defined as

    D_U(p‖q) = Σ_{i=1}^r [ p_i ln(p_i / q_i) + q_i − p_i ]

A graphical representation of the Bregman distance as a measure of convexity is shown in Fig. 1.

Figure 1: The Bregman distance B_f(p‖q) is an indication of the increase in f(p) over f(q) above linear growth with slope f′(q).

The distances B_F were introduced by Bregman [4], along with an iterative algorithm for minimizing B_F subject to linear constraints. Bregman distances have been used by numerous authors to pose problems as generalized divergences. [7] used such divergences for generalized non-negative matrix approximations. [1] used them for clustering applications. Other divergence-minimization approaches have been tried for data mining and information retrieval. The idea of posing problems of density estimation as KL-divergence minimization has long been studied. It can be shown that KL divergence is a special case of Bregman divergence, and hence the comprehensive success of such methods warrants a closer investigation of Bregman divergence itself.

To develop the rest of this work we need a few definitions. Let ∆ ⊂ ℝ^r and let F : ∆ → ℝ be a real-valued function. We assume that ∆ is a closed convex set, and that F is strictly convex and C¹ on the interior of ∆.
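The specialization of Eqn. 5 to F(p) = Σ p_i ln p_i can be checked numerically: evaluating B_F straight from the definition agrees with the closed form D_U. The vectors below are hypothetical.

```python
import math

def bregman_negentropy(p, q):
    # B_F(p||q) = F(p) - F(q) - grad F(q) . (p - q), with F(p) = sum_i p_i ln p_i
    F = lambda v: sum(vi * math.log(vi) for vi in v)
    grad = [math.log(qi) + 1.0 for qi in q]        # dF/dq_i = ln q_i + 1
    return F(p) - F(q) - sum(g * (pi - qi) for g, pi, qi in zip(grad, p, q))

def unnormalized_rel_entropy(p, q):
    # D_U(p||q) = sum_i p_i ln(p_i/q_i) + q_i - p_i
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

# Hypothetical positive vectors (need not sum to 1: the entropy is unnormalized).
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
```

Expanding the definition term by term recovers D_U exactly: Σ p ln p − Σ q ln q − Σ (ln q_i + 1)(p_i − q_i) = Σ p_i ln(p_i/q_i) + q_i − p_i.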
Definition 1. For v ∈ ℝ^r and q ∈ ∆, the Legendre transform L_F(v, q) is defined as

    L_F(v, q) = arg min_{p ∈ ∆} [ B_F(p‖q) + v · p ]

Lemma 1. The mapping v, q ↦ L_F(v, q) defines a smooth action of ℝ^r on ∆, with L_F(v, L_F(w, q)) = L_F((v + w), q).

The optimization problem which we consider is the following: let A be an n × r matrix of linear constraints on p ∈ ∆. Let q₀ ∈ ∆ be a default distribution, chosen such that ∇F(q₀) = 0. Finally, let p̃ ∈ ∆ be given; it is considered the empirical distribution, since it typically arises from a set of training samples that determine the linear constraints. We now define P(A, p̃) and Q(A, q₀) as

    P(A, p̃) = { p ∈ ∆ | A p = A p̃ }
    Q(A, q₀) = { q ∈ ∆ | q = L_F((λᵀA), q₀), λ ∈ ℝ^n }

The following well-known theorem [12] establishes the duality between the two natural projections of B_F(p‖q) with respect to the families P(A, p̃) and Q(A, q₀).

Theorem 1. Suppose B_F(p̃‖q₀) < ∞ and let Q̄(A, q₀) = cl(Q(A, q₀)). Then there exists a unique q⋆ ∈ ∆ such that

1. q⋆ ∈ P(A, p̃) ∩ Q̄(A, q₀)
2. B_F(p‖q) = B_F(p‖q⋆) + B_F(q⋆‖q) for any p ∈ P(A, p̃) and q ∈ Q̄(A, q₀)
3. q⋆ = arg min_{q ∈ Q̄} B_F(p̃‖q)
4. q⋆ = arg min_{p ∈ P} B_F(p‖q₀)

Moreover, any of these four properties determines q⋆ uniquely.

Note that since we have defined ∇F(q₀) = 0, arg min_{p ∈ P} B_F(p‖q₀) = arg min_{p ∈ P} F(p). Property 2 is called the Pythagorean property, since it resembles the Pythagorean theorem if we imagine that B_F(p‖q) is the square of the Euclidean distance and (p, q⋆, q) are the vertices of a right triangle.

4 Bregman Distance to Logistic Regression

In this section we study the unconstrained minimization problem mentioned in the previous section. By unconstrained we mean that the parameters λ ∈ ℝ^n are free.
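The Pythagorean property (2) can be illustrated in the simplest Bregman setting: for F(p) = ½‖p‖², B_F(p‖q) = ½‖p − q‖², and q⋆ is the Euclidean projection onto the constraint family. A minimal sketch with a single hypothetical linear constraint:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_affine(q, a, b):
    # Euclidean projection of q onto the affine set {p : a . p = b}
    t = (b - dot(a, q)) / dot(a, a)
    return [qi + t * ai for qi, ai in zip(q, a)]

def sq_dist(u, v):
    # B_F(u||v) for F = 0.5 ||.||^2 is half the squared Euclidean distance
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v)) / 2.0

a = [1.0, 2.0, -1.0]                          # hypothetical constraint a . p = b
b = 0.5
q = [0.3, -0.4, 1.1]                          # arbitrary point playing the role of q
p = project_affine([2.0, 0.0, 0.7], a, b)     # some feasible point in P
q_star = project_affine(q, a, b)              # the projection, i.e. q* for this F
```

Here B_F(p‖q) = B_F(p‖q⋆) + B_F(q⋆‖q) holds exactly, since p − q⋆ lies in the constraint plane while q⋆ − q is orthogonal to it.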
We pose the logistic regression problem in the Bregman distance framework developed by Collins and Schapire [5]. The key idea is to write the function F(p) as

    F(p) = Σ_{i=1}^m [ p_i ln p_i + (1 − p_i) ln(1 − p_i) ]    (6)

The resulting Bregman distance is

    D_B(p‖q) = Σ_{i=1}^m [ p_i ln(p_i / q_i) + (1 − p_i) ln((1 − p_i) / (1 − q_i)) ]    (7)

For this choice of F the Legendre transform is found to be

    L_F(v, q)_i = q_i e^{−v_i} / (1 − q_i + q_i e^{−v_i})    (8)

Now we define the constraint matrix A as A_i^j = y_i h_j(x_i), from which we get v_i = (λᵀA)_i = Σ_{j=1}^n λ_j y_i h_j(x_i). If we put q₀ = (1/2)1 into Eqn. 8, we recover the logistic probability of Eqn. 2. Also note that

    D_B(0‖q) = −Σ_{i=1}^m ln(1 − q_i)    (9)

which gives

    ℓ(x, y) = Σ_{i=1}^m ln(1 + e^{−y_i f_λ(x_i)}) = D_B(0‖L_F(λᵀA, q₀))    (10)

where f_λ(x_i) = Σ_{j=1}^n λ_j h_j(x_i). Finally, we can write the equivalent optimization problem as

    min_{q ∈ Q̄} D_B(0‖q)   s.t.  A q = 0    (11)

where, as before, Q̄ = cl(Q), with

    Q = { q ∈ ∆ : q_i = σ(Σ_{j=1}^n λ_j y_i h_j(x_i)), λ ∈ ℝ^n }

and σ(x) = (1 + e^x)^{−1} is the sigmoid function. For our choice of q₀ = (1/2)1 we have L_F(v, q₀)_i = σ(v_i), as shown in Eqn. 8. Also, since each element of q is a sigmoid output, ∆ ⊆ [0, 1]^m. The key points to note in this derivation are: (a) p̃ ≡ 0; (b) λ ∈ ℝ^n. The implication of point (a) is that the constraints are homogeneous. This is a strong assumption on the constraints. It turns out that we can relax this constraint only when we put some additional constraints on the free parameter λ. This points to a regularized scheme, where the first condition is relaxed at the cost of putting additional constraints on the second.
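The identity of Eqn. 10 between the logistic log-loss and D_B(0‖L_F(λᵀA, q₀)) can be checked numerically using Eqns. 8 and 9 with q₀ = (1/2)1. The data, features, and λ below are hypothetical.

```python
import math

def legendre_transform(v, q):
    # Eqn. 8: L_F(v, q)_i = q_i e^{-v_i} / (1 - q_i + q_i e^{-v_i})
    return [qi * math.exp(-vi) / (1.0 - qi + qi * math.exp(-vi))
            for vi, qi in zip(v, q)]

def d_b_zero(q):
    # Eqn. 9: D_B(0 || q) = -sum_i ln(1 - q_i)
    return -sum(math.log(1.0 - qi) for qi in q)

def log_loss(lam, X, y):
    # Left-hand side of Eqn. 10
    f = [sum(lj * hj for lj, hj in zip(lam, hi)) for hi in X]
    return sum(math.log(1.0 + math.exp(-yi * fi)) for fi, yi in zip(f, y))

# Hypothetical data; v_i = (lambda^T A)_i = y_i f_lambda(x_i), q0 = (1/2)1.
X = [[1.0, -0.5], [0.2, 0.9], [-1.1, 0.4]]
y = [+1, -1, +1]
lam = [0.6, -0.3]
v = [yi * sum(lj * hj for lj, hj in zip(lam, hi)) for hi, yi in zip(X, y)]
q0 = [0.5] * len(y)
q = legendre_transform(v, q0)
```

With q₀ = 1/2, Eqn. 8 gives q_i = 1/(1 + e^{v_i}), so −ln(1 − q_i) = ln(1 + e^{−v_i}) and the two sides of Eqn. 10 agree term by term.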
We redefine the set Q as

    Q = { q : q_i = σ(Σ_{j=1}^n λ_j y_i h_j(x_i)), λ ∈ ℝ^n, ‖λ‖₁ ≤ c }

We consider supervised learning in settings where there are many input features, but where a small subset of the features is sufficient to approximate the target concept well. In supervised learning settings with many input features, over-fitting is usually a potential problem unless there is ample training data. For example, it is well known that for un-regularized discriminative models fit via training-error minimization, sample complexity (i.e., the number of training examples needed to learn "well") grows linearly with the VC dimension [14]. Further, the VC dimension of most models grows about linearly in the number of parameters [13], which typically grows at least linearly in the number of input features. Thus, unless the training set size is large relative to the dimension of the input, some special mechanism, such as regularization, which encourages the fitted parameters to be small, is usually needed to prevent over-fitting.

Once we have defined our optimization problem, our aim is to find a sequence q_k = L_F(λ_kᵀA, q₀) which minimizes our cost function while remaining feasible with respect to the additional regularization constraint ‖λ‖₁ ≤ c.

5 Auxiliary Function

The idea of auxiliary functions was proposed by Della Pietra et al. [12]. The idea is analogous to the EM algorithm and bounds the change in the objective between two iterations. Since we are dealing with distances, which are defined to be positive, the quantity ‖d_{t+1} − d_t‖ = −(d_{t+1} − d_t) for strict descent, which can be minimized iteratively until convergence is achieved.

Definition 2. For a linear constraint matrix A and λ ∈ ℝ^n, a function A : ℝ^n × ∆ → ℝ is an auxiliary function for L(q) = −B_F(p̃‖q) if

1. For all q ∈ ∆ and λ ∈ ℝ^n,

    L(L_F(λᵀA, q)) ≥ L(q) + A(λ, q)

2.
A(λ, q) is continuous in q ∈ ∆ and C¹ in λ ∈ ℝ^n, with A(0, q) = 0 and

    (d/dt)|_{t=0} A(tλ, q) = (d/dt)|_{t=0} L(L_F(((tλ)ᵀA), q))

3. If λ = 0 is a minimum of A(λ, q), then qᵀA = p̃ᵀA.

Theorem 2. Suppose (q_k) is any sequence in ∆ starting from q₀, with q_{k+1} = L_F(λ_kᵀA, q_k), where λ_k ∈ ℝ^n satisfies

    A(λ_k, q_k) = sup_λ A(λ, q_k)

Then L(q_k) increases monotonically to max_{q ∈ Q̄} L(q), and q_k converges to q⋆ = arg max_{q ∈ Q̄} L(q).

The proof of this theorem is elucidated in Della Pietra et al. [12]. We mention the three lemmas on which the proof is based; once the lemmas have been proved, the theorem follows directly from them. The three lemmas are:

1. If m ∈ ∆ is a cluster point of q^(k), then A(λ, m) ≤ 0 for all λ ∈ ℝ^n.

2. If m ∈ ∆ is a cluster point of q^(k), then (d/dt)|_{t=0} L(L_F(tλᵀA, m)) = 0 for all λ ∈ ℝ^n.

3. Suppose {q^(k)} is any sequence with only one cluster point q⋆. Then q^(k) converges to q⋆.

6 Constrained Bregman Distance Minimization

Having shown the analogy between logistic regression and Bregman distances, we can proceed to find a suitable auxiliary function for our problem. One key observation is that we can write q_{k+1} as a simple function of q_k as follows:

    q_{k+1} = L_F((λ_k + δ_k)ᵀA, q₀) = L_F(δ_kᵀA, L_F(λ_kᵀA, q₀)) = L_F(δ_kᵀA, q_k)

Let us denote v = δ_kᵀA, hence we can write q_{k+1} = L_F(v, q_k). Now, from Eqn. 9, we can write

    D_B(0‖q_{k+1}) − D_B(0‖q_k) = Σ_{i=1}^m ln(1 − q_i + q_i e^{−v_i}) ≤ Σ_{i=1}^m q_i (e^{−v_i} − 1)

where the inequality uses ln(1 + x) ≤ x. Substituting (δᵀA)_i = v_i, we define our auxiliary function as

    A(δ, q) = Σ_{i=1}^m q_i (e^{−(δᵀA)_i} − 1)    (12)

It can be easily verified that this choice of auxiliary function satisfies the conditions of Def. 2.
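The bound behind Eqn. 12 can be spot-checked numerically: for any step v = δᵀA, the change in D_B(0‖·) never exceeds A(δ, q). The iterate and step below are hypothetical.

```python
import math

def legendre_transform(v, q):
    # Eqn. 8
    return [qi * math.exp(-vi) / (1.0 - qi + qi * math.exp(-vi))
            for vi, qi in zip(v, q)]

def d_b_zero(q):
    # Eqn. 9
    return -sum(math.log(1.0 - qi) for qi in q)

def aux(v, q):
    # A(delta, q) of Eqn. 12, expressed through v = delta^T A
    return sum(qi * (math.exp(-vi) - 1.0) for vi, qi in zip(v, q))

q_k = [0.5, 0.3, 0.8]      # hypothetical current iterate in (0, 1)^m
v = [0.4, -0.2, 0.7]       # hypothetical step delta^T A
q_next = legendre_transform(v, q_k)
change = d_b_zero(q_next) - d_b_zero(q_k)
```

The check mirrors the derivation exactly: the change equals Σ ln(1 − q_i + q_i e^{−v_i}), which ln(1 + x) ≤ x bounds by A(δ, q); at δ = 0 the auxiliary function vanishes, as Def. 2 requires.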
Now we need to find a sequence {δ_k} → 0 for which A(δ, q) ≤ 0 and A(δ, q) → 0 monotonically.

7 Algorithm

Assumptions: F : ∆ → ℝ such that the level sets {q ∈ ∆ : B_F(0‖q) ≤ c} are bounded for every c < ∞.

Parameters: ∆ ⊆ [0, 1]^m, F satisfying the assumptions above, and q₀ = (1/2)1.

Input: Constraint matrix A ∈ [−1, 1]^{n×m}, where A_i^j = y_i h_j(x_i) and Σ_{j=1}^n |A_i^j| ≤ 1.

Output: Denote L_F(λ_tᵀA, q₀) as L_F^{λ_t}. Generate a sequence λ₁, λ₂, … such that

    lim_{t→∞} B_F(0‖L_F^{λ_t}) = min_{λ ∈ ℝ^n} B_F(0‖L_F^λ)   subject to ‖λ‖₁ ≤ u

Let λ₁ = 0
For k = 1, 2, …
    q_k = L_F^{λ_k}
    δ_k = arg min_{δ ∈ ℝ^n} Σ_{i=1}^m q_i^k (e^{−(δᵀA)_i} − 1)   s.t.  ‖λ_k + δ_k‖₁ ≤ u
    Update λ_{k+1} = λ_k + δ_k
End For

8 A Primal-Dual Method for L1-Regularized Logistic Regression

The basic algorithm for the unconstrained case was proposed by [5], but their method finds a lower bound using the first-order characteristics of the unconstrained minimizer. In our case we want to find the constrained minimizer of the auxiliary function. Since we require A(δ, q) ≤ 0 for strict descent, the new set of conditions is

    arg min_{δ ∈ ℝ^n} Σ_{i=1}^m q_i (e^{−(δᵀA)_i} − 1)    (13)
    s.t.  ‖λ + δ‖₁ ≤ u,   A(δ, q) ≤ 0

Analyzing the cost function more closely, we find that it can be written as

    e^{−(δᵀA)_i} − 1 = e^{−Σ_{j=1}^n δ_j A_i^j} − 1 = e^{−Σ_{j=1}^n δ_j s_i^j |A_i^j|} − 1 ≤ Σ_{j=1}^n |A_i^j| (e^{−δ_j s_i^j} − 1)

where s_i^j = sign(A_i^j), and the inequality follows from the convexity of the exponential together with Σ_{j=1}^n |A_i^j| ≤ 1.
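The decoupled bound above is what makes per-coordinate updates possible. In the unconstrained setting of [5], minimizing the resulting bound W_j^+(e^{−δ_j} − 1) + W_j^−(e^{δ_j} − 1), where W_j^+(q) = Σ_{i: A_i^j > 0} q_i |A_i^j| and W_j^− is the analogous sum over negative entries, has the closed form δ_j = ½ ln(W_j^+ / W_j^−). A minimal sketch of that unconstrained parallel step, with a hypothetical constraint matrix satisfying Σ_j |A_i^j| ≤ 1:

```python
import math

def parallel_update(A, q):
    # Unconstrained per-coordinate step of [5]: delta_j = 0.5 ln(W+_j / W-_j),
    # where W+_j and W-_j are the sign-split weighted column sums of A.
    # Assumes every feature j has at least one positive and one negative entry.
    delta = []
    for row in A:                       # row j holds A_i^j = y_i h_j(x_i) over i
        w_plus = sum(qi * a for a, qi in zip(row, q) if a > 0)
        w_minus = sum(qi * -a for a, qi in zip(row, q) if a < 0)
        delta.append(0.5 * math.log(w_plus / w_minus))
    return delta

# Hypothetical A (n = 2 features, m = 3 instances), sum_j |A_i^j| <= 1 per i.
A = [[0.4, -0.3, 0.5],
     [-0.6, 0.2, -0.5]]
q = [0.5, 0.5, 0.5]
delta = parallel_update(A, q)
```

The constrained problem of Eqn. 13 does not admit this closed form, which is precisely why the paper develops the primal-dual machinery that follows.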
Absorbing this constraint into the cost function, we get

    arg min_{δ ∈ ℝ^n} Σ_{i=1}^m q_i Σ_{j=1}^n |A_i^j| (e^{−δ_j s_i^j} − 1)    (14)
    s.t.  ‖λ + δ‖₁ ≤ u,   A(δ, q) ≤ 0

Now we define the two quantities

    W_j^+(q) = Σ_{i: sign(A_i^j) = +1} q_i |A_i^j|
    W_j^−(q) = Σ_{i: sign(A_i^j) = −1} q_i |A_i^j|

so that at iteration t we have W_j^+(q_t) and W_j^−(q_t), and we can rewrite the optimization problem as

    arg min_{δ ∈ ℝ^n} Σ_{j=1}^n [ W_j^+(q_t)(e^{−δ_j} − 1) + W_j^−(q_t)(e^{δ_j} − 1) ]    (15)
    s.t.  ‖λ + δ‖₁ ≤ u,   A(δ, q) ≤ 0

Adopting from [6], we can now introduce slack variables and write the penalty function as

    arg min_{δ, r, s, t} Σ_{j=1}^n [ G(δ_j) + a eᵀ(s_j + t_j) ]    (16)
    s.t.  λ_j + δ_j + s_j − t_j = u_j
          G(δ_j) + r_j = 0
          s_j, t_j, r_j ≥ 0

where G(δ_j) = W_j^+(q_t)(e^{−δ_j} − 1) + W_j^−(q_t)(e^{δ_j} − 1) and j = {1, …, n}. Finally, introducing the log-barrier function and absorbing the two terms λ_j and u_j into one term c_j = u_j − λ_j, we get

    arg min_{δ, r, s, t} Σ_{j=1}^n [ G(δ_j) + a eᵀ(s_j + t_j) − μ φ(s_j, t_j, r_j) ]    (17)
    s.t.  δ_j + s_j − t_j = c_j
          G(δ_j) + r_j = 0

where φ(s_j, t_j, r_j) = log s_j + log t_j + log r_j and μ is the barrier parameter. As proposed in [6], we decompose the problem into a master problem and a sequence of sub-problems. We solve the following master problem for a sequence of barrier parameters {μ_k} such that lim_{k→∞} μ_k = 0+, where the + sign denotes convergence to 0 from the positive side:

    min_c Σ_{j=1}^N F_j⋆(μ, c)

The sub-problems are exactly the same as Eqn. 17, except that the value of c is held constant while solving them. The jth sub-problem can now be written as

    arg min_{δ, r, s, t ∈ ℝ} G(δ) + a(s + t) − μ φ(s, t, r)    (18)
    s.t.  δ + s − t = c
          G(δ) + r = 0

Proceeding as shown in Convex Optimization [3], Eqn. 11.53, the modified KKT conditions can be expressed as r_t(x, λ, ν) = 0 (where (λ, ν) are the multipliers, redefined again for consistency of notation), with

    r_t(x, λ, ν) = [ ∇f₀(x) + J(x)ᵀλ + Aᵀν ;  λ f(x) − μ ;  A x − b ]    (19)

where

    x = (δ, r, s, t)ᵀ
    f₀(x) = G(δ) + a(s + t) − μ φ(s, t, r)
    f(x) = G(δ) + r
    J(x) = (∇G(δ), 1, 0, 0)ᵀ
    A = (1, 0, 1, −1)
    b = c

The Newton step can now be formulated as

    [ ∇²f₀(x) + λ∇²f(x)   J(x)ᵀ   Aᵀ ]   [ Δx ]      [ r_dual ]
    [ λ J(x)              f(x)    0  ] · [ Δλ ]  = − [ r_cent ]    (20)
    [ A                   0       0  ]   [ Δν ]      [ r_pri  ]

where (r_dual, r_cent, r_pri)ᵀ = r_t(x, λ, ν).

9 Experiments and Results

In this section we report results for the experiments conducted with the model proposed in this paper. The sparsity introduced by L1 regularization is captured by conducting tests on randomly generated data. The loss-minimization curves remain similar to the unconstrained case, since the unit slave problems mentioned in Eqn. 17 are convex. But the sparsity of the feature vectors enables the dropping of redundant features and hence speeds up the iterations.

In our experiments, we generated random data and classified it using a very noisy hyperplane. We investigate only 2-class classification problems in this work. We investigate medium- to high-dimensional problems, where the dimensionality ranges from 20 to 500. We tested both scenarios: (a) when the number of training points is of the order of the feature dimension, and (b) when the number of training data points is more than an order of magnitude larger than the feature dimension. For every case the random data is first classified based on a random hyperplane, and then we add Gaussian noise to the data dimensions based on a coin flip. The noise is assumed to be ∼ N(0, σI), where σ < 1.
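The synthetic-data setup can be sketched as follows. The exact noise-injection protocol below (Gaussian noise added to a point on a fair coin flip) is our reading of the description above, and all parameter values are illustrative.

```python
import random

def make_data(m, d, sigma=0.5, seed=0):
    # Labels from a random hyperplane through the origin, then N(0, sigma^2 I)
    # noise added to each point on a fair coin flip (illustrative protocol).
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]          # random hyperplane normal
    X, y = [], []
    for _ in range(m):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        label = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
        if rng.random() < 0.5:                           # the coin flip
            x = [xi + rng.gauss(0.0, sigma) for xi in x]
        X.append(x)
        y.append(label)
    return X, y

X, y = make_data(m=200, d=20)
```

Noising the points after labeling makes the hyperplane "very noisy" in the sense above: roughly half the points are perturbed, so some land on the wrong side of the separator.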
The key point of interest is the fact that the procedure mentioned in this work decouples the features, and hence features are dropped from the optimization scheme when the change ∇δ_i falls below some threshold. One such comparative plot is shown in Fig. 2 (left). The sparsity of the features is shown in Fig. 2 (right).

Figure 2: Left: Test error, regularized (blue) and unconstrained (red), for 500D. Right: Dropped features as a percentage of the total features.

For comparison with other algorithms, we run the logistic classifier over public-domain data, namely the Wisconsin Diagnostic Breast Cancer (WDBC) data set and the Musk database (Clean 1 and 2) [10]. The WDBC data has 569 instances with 30 real-valued features. There are 357 benign (positive) instances and 212 malignant (negative) instances. The best reported result is 97.5% using decision trees constructed by linear programming [9, 2]. Our method generates 16 false negatives and 23 false positives, totaling 39 errors, for an accuracy of 93.15%. The training and testing errors are shown in Fig. 3 (left).

The Musk Clean 1 data set describes a set of 92 molecules, of which 47 are judged by human experts to be musks and the remaining 45 are judged to be non-musks. Similarly, the Musk Clean 2 database describes a set of 102 molecules, of which 39 are musks and the remaining 63 are non-musks. The 166 features that describe these molecules depend upon the exact shape, or conformation, of the molecule. Multiple conformations for each instance were created, which after pruning amount to 476 conformations for Clean 1 and 6598 for Clean 2. The many-to-one relationship between feature vectors and molecules is called the "multiple instance problem". When learning a classifier for this data, the classifier should classify a molecule as "musk" if ANY of its conformations is classified as a musk. A molecule should be classified as "non-musk" if NONE of its conformations is classified as a musk.

We report results for tests conducted on the two databases. The training and test plots for the Clean 2 data are shown in Fig. 3 (right). We compare our method, L1 Logistic Regression based on Bregman Distances (L1LRB), against published results, and our method outperforms most of them. The comparative results are shown in Table 1 and Table 2. Also note that the poor performance of the C4.5 algorithm has been attributed to the fact that it does not take the multi-instance nature of the problem into consideration during training. We did not take this into consideration while training either, and still our method ranks in the top 2 among all the reported results. The details of the other methods mentioned are discussed in [8].

Table 1: Comparative results for the Musk Clean 1 database.

Algorithm          | TP | FN | FP | TN | % Acc
L1LRB              | 45 |  2 |  2 | 43 | 95.6
Iter-discrim APR   | 42 |  5 |  2 | 43 | 92.4
GFS-Elim-kde APR   | 46 |  1 |  7 | 38 | 91.3
All-pos APR        | 36 | 11 |  7 | 38 | 80.4
Back-prop          | 45 |  2 | 21 | 24 | 75.0
C4.5 (pruned)      | 45 |  2 | 24 | 21 | 68.5

Table 2: Comparative results for the Musk Clean 2 database.

Algorithm          | TP | FN | FP | TN | % Acc
Iter-discrim APR   | 30 |  9 |  2 | 61 | 89.2
L1LRB              | 30 |  9 |  6 | 57 | 85.29
GFS-Elim-kde APR   | 32 |  7 | 13 | 50 | 80.4
GFS-El-count APR   | 31 |  8 | 17 | 46 | 75.5
All-pos APR        | 34 |  5 | 23 | 40 | 72.6
Back-prop          | 16 | 23 | 10 | 53 | 67.7
GFS-All-Pos APR    | 37 |  2 | 32 | 31 | 66.7
Most Freq Class    |  0 | 39 |  0 | 63 | 61.8
C4.5 (pruned)      | 32 |  7 | 35 | 28 | 58.8

10 Conclusion and Extensions

We posed the problem of L1-regularized logistic regression as a constrained Bregman distance minimization problem, and posed the optimization as a decoupled primal-dual problem in each dimension of the parameter vector. The optimization technique in this work draws on the strict-feasibility properties of primal-dual methods, and hence guarantees the convergence of the algorithm.
Comparative results on published data sets prove the strength of the regularized method.

Figure 3: Train error (blue) and test error (red). Left: WDBC data. Right: Musk Clean 2 data.

References

[1] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. In SIAM International Conference on Data Mining (SDM), 2004.
[2] K. P. Bennett. Decision tree construction via linear programming. In Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pages 97–101, 1992.
[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[4] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. Computational Mathematics and Mathematical Physics, 7:200–217, 1967.
[5] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3):253–285, 2002.
[6] A.-V. de Miguel. Two Decomposition Algorithms for Nonconvex Optimization Problems with Global Variables. PhD thesis, Stanford University, April 2001.
[7] I. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 283–290. MIT Press, Cambridge, MA, 2006.
[8] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, 1997.
[9] O. L. Mangasarian, W. N. Street, and W. H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4):570–577, 1995.
[10] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.
[11] A. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), pages 78–85, New York, NY, USA, 2004. ACM Press.
[12] S. D. Pietra, V. D. Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.
[13] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.
[14] V. N. Vapnik and A. Y. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974.