Metric Embedding for Nearest Neighbor Classification

Bharath K. Sriperumbudur & Gert R. G. Lanckriet
Department of Electrical and Computer Engineering
University of California, San Diego
La Jolla, CA 92093.
{bharathsv@ucsd.edu, gert@ece.ucsd.edu}

August 30, 2021

Abstract

The distance metric plays an important role in nearest neighbor (NN) classification. Usually the Euclidean distance metric is assumed, or a Mahalanobis distance metric is optimized to improve the NN performance. In this paper, we study the problem of embedding arbitrary metric spaces into a Euclidean space with the goal of improving the accuracy of the NN classifier. We propose a solution by appealing to the framework of regularization in a reproducing kernel Hilbert space and prove a representer-like theorem for NN classification. The embedding function is then determined by solving a semidefinite program, which has an interesting connection to the soft-margin linear binary support vector machine classifier. Although the main focus of this paper is to present a general, theoretical framework for metric embedding in a NN setting, we demonstrate the performance of the proposed method on some benchmark datasets and show that it performs better than the Mahalanobis metric learning algorithm in terms of leave-one-out and generalization errors.

1 Introduction

The nearest neighbor (NN) algorithm [?, ?] is one of the most popular non-parametric supervised classification methods. Because of the non-linearity of its decision boundary, the NN algorithm generally provides good classification performance. The algorithm is straightforward to implement and is easily extendible to multi-class problems, unlike other popular classification methods like support vector machines (SVM) [?]. The k-NN rule classifies each unlabelled example by the majority label among its k nearest neighbors in the training set.
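The majority-vote rule just described can be sketched in a few lines (a minimal NumPy illustration; the function name and toy data are ours, not the paper's):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify x_test by the majority label among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_test, axis=1)  # Euclidean distance by default
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.05, 0.1])))  # → 0
```

The choice of distance in the first line is exactly what the rest of the paper is about.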
Therefore, the performance of the rule depends on the distance metric used, which defines the nearest neighbors. In the absence of prior knowledge, the examples are assumed to lie in a Euclidean metric space and the Euclidean distance is used to find the nearest neighbor. However, often there are distance measures that better reflect the underlying structure of the data at hand, which, if used, would lead to better NN classification performance. Let (X, ρ) represent a metric space X (or more generally, a semimetric space) with ρ : X × X → R_+ as its metric (semimetric) and x ∈ X. For example, (a) when x is an image, X = R^d and ρ is the tangent distance between images; (b) for x lying on a manifold, X is a manifold in R^d and ρ is the geodesic distance; and (c) in a structured setting like a graph, X = {vertices} and ρ(x, y) is the shortest path distance from x to y, where x, y ∈ X. These settings are practical, with (a) and (b) more prominent in computer vision and (c) in bioinformatics. However, in such scenarios, the true underlying distance metric may not be known, or it might be difficult to estimate for use in NN classification. In such cases, as aforementioned, the Euclidean distance metric is often used instead. The goal of this paper is to extend the NN rule to arbitrary metric spaces X, wherein we propose to embed the given training data into a space whose underlying metric is known. Prior works [?, ?, ?, ?] deal with X = R^D and assume the Mahalanobis distance metric, i.e., ρ(x, y) = sqrt((x − y)^T A (x − y)) (with A ⪰ 0), which is then optimized with the goal of improving NN classification performance. These methods can be interpreted as finding a linear transformation L ∈ R^{d×D} so that the transformed data lie in a Euclidean metric space, i.e., ρ(x, y) = sqrt((x − y)^T A (x − y)) = ||Lx − Ly||_2 with A = L^T L.
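The identity ρ(x, y) = sqrt((x − y)^T A (x − y)) = ||Lx − Ly||_2 with A = L^T L is easy to verify numerically (a sketch with random data; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 2, 5
L = rng.standard_normal((d, D))  # linear transformation R^D -> R^d
A = L.T @ L                      # induced Mahalanobis matrix, positive semidefinite

x, y = rng.standard_normal(D), rng.standard_normal(D)
mahalanobis = np.sqrt((x - y) @ A @ (x - y))          # distance under A in R^D
euclidean_after_map = np.linalg.norm(L @ x - L @ y)   # Euclidean distance after mapping
assert np.isclose(mahalanobis, euclidean_after_map)
```

So learning a Mahalanobis metric and learning a linear embedding into a Euclidean space are two views of the same computation.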
All these methods learn the Mahalanobis distance metric by minimizing the distance between training data of the same class while separating the data from different classes with a large margin. Instead of assuming the Mahalanobis distance metric, which restricts X to R^D, we would like to find some general (instead of linear) transformation that embeds the training data from an arbitrary metric space X into a Euclidean space while improving the NN classification performance. In this paper, we propose to minimize a proxy to the average leave-one-out error (LOOE) of an ε-neighborhood NN classifier by embedding the data into a Euclidean metric space. To achieve this, we study two different approaches that learn the embedding function. The first approach works within the framework of regularization in a reproducing kernel Hilbert space (RKHS) [?], wherein f ∈ H_k = {f | f : X → R^d} is learned, with k : X × X → R being the reproducing kernel of H_k. We prove a representer-like theorem for NN classification and show that f admits the form f = Σ_{i=1}^n c_i k(·, x_i) with Σ_{i=1}^n c_i = 0, where {c_i}_{i=1}^n ⊂ R^d and n is the number of training points. Therefore, the problem of learning f reduces to learning {c_i}_{i=1}^n, resulting in a non-convex optimization problem which is then relaxed to yield a convex semidefinite program (SDP) [?]. We provide an interesting interpretation of this approach by showing that the obtained SDP is in fact a soft-margin linear binary SVM classifier. In the second approach, we learn a Mercer kernel map φ : X → ℓ_2^N that satisfies ⟨φ(x), φ(y)⟩ = k(x, y) for all x, y ∈ X, where k is the Mercer kernel and N ∈ N or N = ∞, depending on the number of non-zero eigenvalues of k. We show that learning φ is equivalent to learning the kernel k.
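The defining property ⟨φ(x), φ(y)⟩ = k(x, y) of a Mercer kernel map also fixes every embedded distance, since ||φ(x) − φ(y)||² = k(x, x) + k(y, y) − 2k(x, y). A sketch checking this for the homogeneous quadratic kernel on R², whose explicit Mercer map is standard (the helper names are ours):

```python
import numpy as np

# For k(x, y) = <x, y>^2 on R^2, an explicit Mercer map is
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), since <phi(x), phi(y)> = <x, y>^2.
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, y):
    return (x @ y)**2

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
assert np.isclose(phi(x) @ phi(y), k(x, y))
# Embedded distances are determined by kernel evaluations alone:
assert np.isclose(np.sum((phi(x) - phi(y))**2), k(x, x) + k(y, y) - 2 * k(x, y))
```

This is why, in § 4, learning the map φ on the training set reduces to learning the kernel matrix.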
However, the learned k is not interesting, as it does not allow for an out-of-sample extension and so can be used only in a transductive setting. Using the algorithm derived from the RKHS framework, some experiments are carried out on four benchmark datasets, wherein we show that the proposed method has better leave-one-out and generalization error performance compared to the Mahalanobis metric learning algorithm proposed in [?].

2 Problem formulation

Let {x_i, y_i}_{i=1}^n denote the training set of n labelled examples with x_i ∈ X and y_i ∈ {1, 2, ..., l}, where l is the number of classes. Unlike prior works, which learn a linear transformation L : R^D → R^d (assuming X = R^D), leading to the distance metric ρ_L(x_i, x_j) = ||Lx_i − Lx_j||_2, our goal is to learn a transformation g ∈ G = {g | g : X → Y} so that (a) the average LOOE of the ε-neighborhood NN classifier is reduced and (b) Y is Euclidean, i.e., ρ_g(x_i, x_j) = ||g(x_i) − g(x_j)||_2. Let B_g(x, ε) = {g(y) : ρ_g^2(x, y) ≤ ε} represent a Euclidean ball centered at g(x) with radius √ε. Let τ_ij = 2δ_{y_i, y_j} − 1, where δ represents the Kronecker delta (δ_ij = 1 if i = j and δ_ij = 0 if i ≠ j). Let μ_x(A) denote the Dirac measure for any measurable set A, defined by μ_x(A) = 1 if x ∈ A and μ_x(A) = 0 if x ∉ A. In the ε-neighborhood NN classification setting, a LOOE for a point g(x) occurs when the number of training points of the opposite class (to that of x) that belong to B_g(x, ε) exceeds the number of points of the same class as x that belong to B_g(x, ε). So, the average LOOE for the ε-neighborhood NN classifier can be given as

  LOOE(g, ε) = 1/2 + (1/2n) Σ_{i=1}^n sgn( Σ_{j : τ_ij = −1} μ_{g(x_j)}(B_g(x_i, ε)) − Σ_{j ≠ i : τ_ij = 1} μ_{g(x_j)}(B_g(x_i, ε)) ),   (1)

where sgn is the sign function. Minimizing Eq. (1) over g ∈ G and ε > 0 is computationally hard because of its discontinuous and non-differentiable nature. Instead, based on the observation that the LOOE for a point g(x_i) can be minimized by maximizing Σ_{j : τ_ij = 1} μ_{g(x_j)}(B_g(x_i, ε)) and minimizing Σ_{j : τ_ij = −1} μ_{g(x_j)}(B_g(x_i, ε)), we minimize the proxy to the average LOOE by solving

  min_{g ∈ G, ε > 0} (1/n) Σ_{i=1}^n [ Σ_{j : τ_ij = 1} μ_{g(x_j)}(R^d \ B_g(x_i, ε)) + Σ_{j : τ_ij = −1} μ_{g(x_j)}(B_g(x_i, ε)) ].   (2)

In addition, to avoid over-fitting to the training set, the complexity of G has to be controlled, for which a penalty functional Ω[g], with Ω : G → R, is introduced in Eq. (2), resulting in the minimization of the following regularized error functional,

  min_{g ∈ G, ε > 0} (1/n) Σ_{i=1}^n [ Σ_{j : τ_ij = 1} μ_{g(x_j)}(R^d \ B_g(x_i, ε)) + Σ_{j : τ_ij = −1} μ_{g(x_j)}(B_g(x_i, ε)) ] + λ Ω[g],   (3)

with λ > 0 being the regularization parameter. For a given x_i, the above functional minimizes (i) the number of points with τ_ij = 1 that do not belong to B_g(x_i, ε), (ii) the number of points with τ_ij = −1 that belong to B_g(x_i, ε), and (iii) the penalty functional. With a few algebraic manipulations, Eq. (3) can be reduced to

  min_{g ∈ G, ε > 0} Σ_{i,j=1}^n μ_{g(x_j)}({g(x) : τ_ij ρ_g^2(x_i, x) − τ_ij ε ≥ 0}) + λ̃ Ω[g],   (4)

where λ̃ = nλ. The first term in Eq. (4) represents the 0–1 loss function, which is hard to minimize because of its discontinuity at 0. Usually, the 0–1 loss is replaced by other loss functions like the hinge/square/logistic loss, which are convex.
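Before moving to convex surrogates, the discrete counting behind Eq. (1) can be made concrete (a NumPy sketch, assuming an already-embedded point set Z = g(x_1), ..., g(x_n); the function name and toy data are ours):

```python
import numpy as np

def average_looe(Z, y, eps):
    """Average leave-one-out error of the eps-neighborhood NN classifier, as in Eq. (1):
    point i contributes an error when opposite-class points inside the ball of squared
    radius eps around z_i outnumber same-class points (excluding i itself)."""
    n = len(Z)
    total = 0.0
    for i in range(n):
        d2 = np.sum((Z - Z[i])**2, axis=1)        # squared distances to z_i
        inside = d2 <= eps                         # Dirac-measure counts for the ball
        same = np.sum(inside & (y == y[i])) - 1    # exclude the point itself
        opp = np.sum(inside & (y != y[i]))
        total += np.sign(opp - same)               # the sgn(...) term of Eq. (1)
    return 0.5 + total / (2 * n)

Z = np.array([[0.0], [0.1], [5.0], [5.1]])  # two well-separated classes
y = np.array([0, 0, 1, 1])
print(average_looe(Z, y, eps=1.0))  # → 0.0
```

The hard `inside` indicator and the `sign` are exactly the discontinuities that make direct minimization intractable.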
Since the hinge loss is the tightest convex upper bound to the 0–1 loss, we use it as an approximation to the 0–1 loss, resulting in the following minimization problem,

  min_{g ∈ G, ε > 0} Σ_{i,j=1}^n [1 + τ_ij ||g(x_i) − g(x_j)||_2^2 − τ_ij ε]_+ + λ̃ Ω[g],   (5)

where [a]_+ = max(0, a). It is to be noted that Eq. (5) is convex in ε whereas it is non-convex in g (even when g is linear). So, the approximation of the 0–1 loss with the hinge loss does not, in this case, yield a convex program, unlike in popular machine learning algorithms like SVM. Eq. (5) has an interesting geometrical interpretation: points of the same class should lie closer to one another than to any point from other classes. This means that the training points are clustered together according to their class labels, which will improve the accuracy of the NN classifier. However, in comparison to the method in [?], such a behavior might be computationally difficult to achieve. The idea in [?] is to keep the target neighbors (which are known a priori and are determined by assuming the Euclidean distance metric) closer to one another and separate them by a large margin from the neighbors with non-matching labels. This method, therefore, does not look beyond target neighbors and optimizes the Mahalanobis distance metric locally, leading to a global metric. But the advantage of our formulation is that no side information (regarding target neighbors) is needed, unlike in [?]. If the underlying metric in X is not known, target neighbors cannot be computed, and so we do away with the target neighbor formulation and study the clustering formulation. Another reason the clustering formulation is interesting is that it neatly yields a setting to prove a representer theorem [?] for ε-neighborhood NN classification when G = H_k and Ω[g] = ||g||_G^2, which is discussed in § 3. Solving Eq. (5) is not easy unless some assumptions about G are made.
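The surrogate in Eq. (5) can be transcribed directly and evaluated for any candidate embedding (a sketch; since Ω[g] depends on the chosen function class, it is passed in as a precomputed scalar `omega`, and the names are ours):

```python
import numpy as np

def surrogate_objective(Z, y, eps, lam, omega):
    """Hinge-loss proxy to the average LOOE, Eq. (5):
    sum_{i,j} [1 + tau_ij ||z_i - z_j||^2 - tau_ij eps]_+  +  lam * omega,
    with tau_ij = +1 for same-class pairs and -1 otherwise. The sum runs over all
    (i, j) including i = j, whose terms [1 - eps]_+ vanish once eps >= 1."""
    tau = np.where(y[:, None] == y[None, :], 1.0, -1.0)
    d2 = np.sum((Z[:, None, :] - Z[None, :, :])**2, axis=-1)  # pairwise squared distances
    hinge = np.maximum(0.0, 1.0 + tau * d2 - tau * eps)
    return hinge.sum() + lam * omega

Z = np.array([[0.0], [0.0], [2.0], [2.0]])  # same-class pairs at distance 0, cross pairs at 2
y = np.array([0, 0, 1, 1])
print(surrogate_objective(Z, y, eps=1.0, lam=0.1, omega=0.0))  # → 0.0
```

With ε = 1, the loss vanishes exactly when same-class squared distances are at most ε − 1 and cross-class squared distances are at least ε + 1, which is the clustering behavior described above.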
In § 3, we assume G to be an RKHS with reproducing kernel k and solve for a function that optimizes Eq. (5). In § 4, we restrict G to the set of Mercer kernel maps and show the equivalence between learning the Mercer kernel map and learning the Mercer kernel.

3 Regularization in reproducing kernel Hilbert space

Many machine learning algorithms like SVMs, regularization networks and logistic regression can be derived within the framework of regularization in an RKHS by choosing the appropriate empirical risk functional, with the penalizer being the squared RKHS norm [?]. In Eq. (5), we have extended the regularization framework to ε-neighborhood NN classification, wherein the g ∈ G and ε > 0 that minimize the surrogate to the average LOOE have to be computed. Instead of considering an arbitrary G, we introduce a special structure on G by assuming it to be an RKHS, H_k, with reproducing kernel k. From now onwards, we change the notation from G to H_k and search for f ∈ H_k. For the time being, let us assume that H_k = {f | f : X → R}. The penalty functional in Eq. (5) for H_k is defined to be the squared RKHS norm, i.e., Ω[f] = ||f||_{H_k}^2. Eq. (5) can therefore be rewritten as

  min_{f ∈ H_k, ε > 0} Σ_{i,j=1}^n [1 + τ_ij ||f(x_i) − f(x_j)||_2^2 − τ_ij ε]_+ + λ̃ ||f||_{H_k}^2.   (6)

The following lemma provides a representation for an f : X → R that minimizes Eq. (6). Using this result, Theorem 2 provides a representation for an f : X → R^d that minimizes Eq. (6).

Lemma 1. If f is an optimal solution to Eq. (6), then f can be expressed as f = Σ_{i=1}^n c_i k(·, x_i) with {c_i}_{i=1}^n ⊂ R and Σ_{i=1}^n c_i = 0.

Proof. Since f ∈ H_k, f(x) = ⟨f, k(·, x)⟩_{H_k}. Therefore, Eq. (6) can be written as

  min_{f ∈ H_k, ε > 0} Σ_{i,j=1}^n [1 + τ_ij (⟨f, k(·, x_i) − k(·, x_j)⟩_{H_k})^2 − τ_ij ε]_+ + λ̃ ⟨f, f⟩_{H_k}.   (7)

We may decompose f ∈ H_k into a part contained in the span of the kernel functions {k(·, x_i) − k(·, x_j)}_{i,j=1}^n and a part in the orthogonal complement: f = f_∥ + f_⊥ = Σ_{i,j=1}^n α_ij (k(·, x_i) − k(·, x_j)) + f_⊥. Here α_ij ∈ R and f_⊥ ∈ H_k with ⟨f_⊥, k(·, x_i) − k(·, x_j)⟩_{H_k} = 0 for all i, j ∈ {1, 2, ..., n}. Therefore, f(x_i) − f(x_j) = ⟨f, k(·, x_i) − k(·, x_j)⟩_{H_k} = ⟨f_∥, k(·, x_i) − k(·, x_j)⟩_{H_k} = Σ_{p,m=1}^n α_pm (k(x_i, x_p) − k(x_j, x_p) − k(x_i, x_m) + k(x_j, x_m)). Now, consider the penalty functional ⟨f, f⟩_{H_k}. For all f_⊥, ⟨f, f⟩_{H_k} = ||f_∥||_{H_k}^2 + ||f_⊥||_{H_k}^2 ≥ ||Σ_{i,j=1}^n α_ij (k(·, x_i) − k(·, x_j))||_{H_k}^2. Thus, for any fixed α_ij ∈ R, Eq. (7) is minimized for f_⊥ = 0. Therefore, the minimizer of Eq. (6) has the form f = Σ_{i,j=1}^n α_ij (k(·, x_i) − k(·, x_j)), which is parameterized by the n^2 parameters {α_ij}_{i,j=1}^n. However, f can be represented by n parameters as f = Σ_{i=1}^n c_i k(·, x_i), where R ∋ c_i = Σ_{j=1}^n (α_ij − α_ji), and Σ_{i=1}^n c_i = 0.

Theorem 2 (Multi-output regularization). Let H_k = {f | f : X → R^d}. If f is an optimal solution to Eq. (6), then

  f = Σ_{i=1}^n c_i k(·, x_i)   (8)

with c_i ∈ R^d for all i ∈ {1, 2, ..., n} and Σ_{i=1}^n c_i = 0.

Proof. Let H̃_k̃ = {f̃ | f̃ : X → R} with k̃ as its reproducing kernel. Construct H_k = H̃_k̃ × H̃_k̃ × ··· × H̃_k̃ (d times) = {(f̃_1, f̃_2, ..., f̃_d) | f̃_m ∈ H̃_k̃, m = 1, ..., d}. Now, H_k is an RKHS with the reproducing kernel k = (k̃, k̃, ..., k̃) (d times).
Then, with ||f(x)||_2^2 = Σ_{m=1}^d |f̃_m(x)|^2 = Σ_{m=1}^d (⟨f̃_m, k̃(·, x)⟩_{H̃_k̃})^2 and ⟨f, f⟩_{H_k} = Σ_{m=1}^d ⟨f̃_m, f̃_m⟩_{H̃_k̃}, Eq. (6) reduces to

  min_{{f̃_m}_{m=1}^d ⊂ H̃_k̃, ε > 0} Σ_{i,j=1}^n [1 + τ_ij Σ_{m=1}^d (⟨f̃_m, k̃(·, x_i) − k̃(·, x_j)⟩_{H̃_k̃})^2 − τ_ij ε]_+ + λ̃ Σ_{m=1}^d ⟨f̃_m, f̃_m⟩_{H̃_k̃}.

Applying Lemma 1 independently to each f̃_m, m = 1, 2, ..., d, proves the result (see [?, § 4.7] for more details).

We now study the above result for linear kernels. The following corollary shows that applying a linear kernel is equivalent to assuming the underlying distance metric in X to be the Mahalanobis distance.

Corollary 3 (Linear kernel). Let X = R^D and x, y ∈ X. If k(x, y) = ⟨x, y⟩ = x^T y, then R^d ∋ f(x) = Lx and ρ_f(x, y) = sqrt((x − y)^T M (x − y)) with M = L^T L.

Proof. From Eq. (8), we have f(x) = Σ_{i=1}^n c_i ⟨x, x_i⟩ = Lx, where L = Σ_{i=1}^n c_i x_i^T. Therefore, ρ_f(x, y) = ||Lx − Ly||_2 = sqrt((x − y)^T M (x − y)) with M = L^T L.

This means that most of the prior work has explored only the linear kernel. However, it has to be mentioned that [?] derived a dual problem to Eq. (6) by assuming X = R^D, f(x) = Lx and Ω[f] = ||L^T L||_F^2, which is then kernelized by using the kernel trick. [?] studied the problem in the same framework as [?], barring the penalty functional, but in an online mode. Though our objective function in Eq. (6) is similar to the one in [?, ?], we solve it in a completely different setting, by appealing to regularization in an RKHS and without making any assumptions about X or its underlying distance metric. Recently, a different method was proposed by [?], which kernelized the optimization problem studied in [?] by assuming a particular parametric form for L to invoke the kernel trick.
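Corollary 3 can be checked numerically: under the linear kernel, the representer expansion f(x) = Σ_i c_i ⟨x, x_i⟩ collapses to a single matrix L = Σ_i c_i x_i^T, and the induced distance is Mahalanobis with M = L^T L (a sketch with random data; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, d = 6, 4, 2
Xtr = rng.standard_normal((n, D))   # training points x_i in R^D
C = rng.standard_normal((n, d))
C -= C.mean(axis=0)                 # enforce sum_i c_i = 0, as in Theorem 2

L = C.T @ Xtr                       # L = sum_i c_i x_i^T, a d x D matrix
x = rng.standard_normal(D)

f_x_kernel = sum(C[i] * (x @ Xtr[i]) for i in range(n))  # f(x) = sum_i c_i <x, x_i>
f_x_linear = L @ x
assert np.allclose(f_x_kernel, f_x_linear)

M = L.T @ L                         # induced Mahalanobis matrix
y = rng.standard_normal(D)
assert np.isclose(np.linalg.norm(L @ x - L @ y),
                  np.sqrt((x - y) @ M @ (x - y)))
```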
Our method does not make any such assumptions, except to restrict f to H_k. Substituting Eq. (8) in Eq. (6), we get

  min_{C, ε > 0} Σ_{i,j=1}^n [1 + τ_ij tr(C A_ij C^T) − τ_ij ε]_+ + λ̃ tr(C K C^T)
  s.t. C1 = 0, C ∈ R^{d×n},   (9)

where C = [c_1, c_2, ..., c_n], K is the kernel matrix with K_ij = k(x_i, x_j), and A_ij = (k_i − k_j)(k_i − k_j)^T with k_i being the i-th column of K. The objective in Eq. (9) involves terms that are quadratic and piece-wise quadratic in C, while piece-wise linear in ε. The constraints are linear in C and ε. So, a cursory look might suggest that Eq. (9) is a convex program. But it is actually non-convex because of the presence of τ_ij = −1 for some i and j. To make the objective convex, we linearize the quadratic functions by rewriting Eq. (9) in terms of C̄ = C^T C. But this results in a hard non-convex constraint, rank(C̄) = d. We relax the rank constraint to obtain the convex semidefinite program (SDP),

  min_{C̄, ε > 0} Σ_{i,j=1}^n [1 + τ_ij tr(A_ij C̄) − τ_ij ε]_+ + λ̃ tr(K C̄)
  s.t. 1^T C̄ 1 = 0, C̄ ⪰ 0.   (10)

By introducing slack variables, this SDP can be written as

  min_{C̄, {ξ_ij}_{i,j=1}^n, ε} ⟨C̄, K⟩_F + η Σ_{i,j=1}^n ξ_ij
  s.t. τ_ij (⟨C̄, −A_ij⟩_F + ε) ≥ 1 − ξ_ij, ∀ i, j
       ⟨C̄, 11^T⟩_F = 0, C̄ ⪰ 0, ε > 0
       ξ_ij ≥ 0, ∀ i, j,   (11)

where ⟨A, B⟩_F = tr(A^T B) and η = 1/λ̃. An interesting observation is that solving Eq. (11) is equivalent to computing the Mahalanobis metric C̄ in R^n using the training set {k_i, y_i}_{i=1}^n. This is because ρ_f(x_i, x_j) = ||f(x_i) − f(x_j)||_2 = sqrt((k_i − k_j)^T C̄ (k_i − k_j)) = sqrt(tr(C̄ A_ij)). Now, to classify a test point x_t, the C̄ and ε obtained by solving Eq. (11) are used to compute ||f(x_t) − f(x_i)||_2^2 = tr(C̄ A_ti) for all i ∈ {1, 2, ..., n}, where A_ti = (k_t − k_i)(k_t − k_i)^T with k_t = [k(x_t, x_1), ..., k(x_t, x_n)]^T, and the classification is done by either k-NN or ε-neighborhood NN. A careful observation of Eq. (11) shows an interesting similarity to the soft-margin formulation of the linear binary SVM classifier, wherein a hyperplane in R^{n×n} (or R^{n^2}, by vectorizing the matrices) that separates the training data {(−A_ij, τ_ij)}_{i,j=1}^n has to be computed. The hyperplane is defined by the normal C̄ ∈ S_+^n ∩ {B : ⟨B, 11^T⟩ = 0} and its offset from the origin, ε ∈ R_{++}. The objective function is a trade-off between maximizing the kernel mis-alignment (between C̄ and K) and minimizing the training error. η = ∞ results in a hard-margin binary SVM classifier. While Eq. (11) can be solved by general purpose solvers, they scale poorly in the number of constraints, which in our case is O(n^2). So, we implemented our own special-purpose solver based on the one developed in [?], which exploits the fact that most of the slack variables {ξ_ij}_{i,j=1}^n never attain positive values, resulting in very few active constraints. The solver follows a very simple two-step update rule: it first takes a step along the gradient to minimize the objective, and then projects C̄ and ε onto the feasible set.

4 Mercer kernel map

As aforementioned, our objective is to reduce the average LOOE of the NN classifier by embedding the data into a Euclidean metric space. One obvious choice of such a mapping is the Mercer kernel map, φ : X → ℓ_2^N, which satisfies ⟨φ(x), φ(y)⟩ = k(x, y), where k is the Mercer kernel. So, to solve Eq. (5), we restrict ourselves to P = {φ | φ : X → ℓ_2^N}. Replacing G by P and g by φ in Eq. (5), we have ||φ(x_i) − φ(x_j)||_2^2 = k(x_i, x_i) + k(x_j, x_j) − 2k(x_i, x_j). Defining k_ij = k(x_i, x_j), Eq. (5) reduces to the kernel learning problem,

  min_{K ⪰ 0, ε > 0} Σ_{i,j=1}^n [1 + τ_ij (k_ii + k_jj − 2k_ij) − τ_ij ε]_+ + λ̃ ||K||_F^2,   (12)

with the penalty functional chosen to be ||K||_F^2. It is to be noted that the embedding is uniquely determined by the kernel matrix K and not by φ, as there may exist φ_1 and φ_2 such that φ_1 ≠ φ_2 everywhere but k(x, y) = ⟨φ_i(x), φ_i(y)⟩ for i = 1, 2. Eq. (12) is a kernel learning problem which is convex in K and ε. It depends only on the entries of the kernel matrix and {τ_ij}_{i,j=1}^n, while not utilizing the training data. For sufficiently small λ̃, it can be shown that ε = 1 and k_ii + k_jj − 2k_ij = 1 − τ_ij. Therefore, there exists a φ that achieves zero LOOE. However, the obtained mapping (or K) is not interesting, as it does not allow for an out-of-sample extension and can be used only in a transductive setting. To extend this method to an inductive setting, the kernel matrix K can be approximated as a linear combination of kernel matrices whose kernel functions are known [?]. This reduces to minimizing Eq. (12) over the coefficients of the kernel matrices rather than over K. Let us assume that K = Σ_{r=1}^q β_r K_r with {β_r}_{r=1}^q ⊂ R. Then, Eq. (12) reduces to

  min_{{β_r}_{r=1}^q ⊂ R, ε > 0} Σ_{i,j=1}^n [1 + τ_ij Σ_{r=1}^q β_r (k_ii^r + k_jj^r − 2k_ij^r) − τ_ij ε]_+ + λ̃ Σ_{r,s=1}^q β_r β_s tr(K_r K_s)
  s.t. Σ_{r=1}^q β_r K_r ⪰ 0,   (13)

which is an SDP, with k_ij^r = [K_r]_ij. By constraining {β_r}_{r=1}^q ⊂ R_+, Eq. (13) reduces to the quadratic program (QP)

  min_{{β_r}_{r=1}^q ⊂ R_+, ε > 0} Σ_{i,j=1}^n [1 + τ_ij Σ_{r=1}^q β_r (k_ii^r + k_jj^r − 2k_ij^r) − τ_ij ε]_+ + λ̃ Σ_{r,s=1}^q β_r β_s tr(K_r K_s).   (14)

Though the QP formulation in Eq. (14) is computationally cheap compared to the SDP formulation in Eq. (13), it is not interesting for kernels of the form h(||x − y||_2^2), where x, y ∈ R^D and h is completely monotonic (see [?, § 2.4] for details on conditionally positive definite kernels and completely monotonic functions). The reason is that, to classify a test point x_t, we compute ||φ(x_t) − φ(x_i)||_2^2 = Σ_{r=1}^q β_r (k_r(x_t, x_t) + k_r(x_i, x_i) − 2k_r(x_t, x_i)) ∝ −2 Σ_{r=1}^q β_r k_r(x_t, x_i) for all i. Therefore, minimizing ||φ(x_t) − φ(x_i)||_2^2 is equivalent to maximizing Σ_{r=1}^q β_r k_r(x_t, x_i), and so the QP formulation yields the same result as that obtained when NN classification is performed in X. This result eliminates the popular Gaussian kernel from consideration. However, it is not clear how the QP formulation behaves for other kernels, e.g., the polynomial kernel of degree γ, and this deserves further study. We derived an SDP formulation in § 3 that embeds the training data into a Euclidean space while minimizing the LOOE of the ε-neighborhood NN classifier. Since the SDP formulation derived in this section is based on an approximation of the kernel matrix, we prefer the formulation in § 3 to the one derived here for the experiments in § 5. The purpose of this section is to show that metric learning for NN classification can be posed as a kernel learning problem, which has not been explored before. Presently, we feel that this framework provides only limited scope to explore because of the issues with out-of-sample extension. However, the derived SDP and QP formulations merit further study, as they can be used for heterogeneous data integration in the NN setting, similar to the one in the SVM setting [?].

5 Experiments & Results

In this section, we illustrate the effectiveness of the proposed method, which we refer to as MENN (metric embedding for nearest neighbor), in terms of leave-one-out error and generalization error on four benchmark datasets from the UCI machine learning repository (ftp://ftp.ics.uci.edu/pub/machine-learning-databases).
Since LMNN (large margin nearest neighbor), an algorithm based on optimizing the Mahalanobis distance metric and proposed by [?], has demonstrated improved performance over standard NN with the Euclidean distance metric, we include it in our performance comparison (LMNN software is available at http://www.seas.upenn.edu/~kilianw/Downloads/LMNN.html). We also compare our method to standard kernelized NN classification, i.e., embedding the data using one of the standard kernel functions, which we refer to as Kernel-NN.

Table 1: k-NN classification accuracy on UCI datasets: Balance, Ionosphere, Iris and Wine. The algorithms compared are standard k-NN with Euclidean distance (Eucl-NN), LMNN [?], Kernel-NN (see the text) and MENN (proposed method). Mean (μ) and standard deviation (σ) of the leave-one-out (LOO) and test (generalization) errors are reported.

  Dataset (n, D, l)        Error   Eucl-NN (μ ± σ)   LMNN (μ ± σ)   Kernel-NN (μ ± σ)   MENN (μ ± σ)
  Balance (625, 4, 3)      LOO     17.81 ± 1.86      11.40 ± 2.89   10.73 ± 1.32         6.87 ± 1.69
                           Test    18.18 ± 1.88      11.49 ± 2.57   17.46 ± 2.13         7.12 ± 1.93
  Ionosphere (351, 34, 2)  LOO     15.89 ± 1.43       3.50 ± 1.18    2.84 ± 0.80         2.27 ± 0.76
                           Test    15.95 ± 3.03      12.14 ± 2.92    5.81 ± 2.25         4.21 ± 1.96
  Iris (150, 4, 3)         LOO      4.30 ± 1.55       3.25 ± 1.15    3.60 ± 1.33         3.33 ± 1.02
                           Test     4.02 ± 2.22       4.11 ± 2.26    4.83 ± 2.47         3.06 ± 1.65
  Wine (178, 13, 3)        LOO      5.89 ± 1.35       0.90 ± 2.80    4.95 ± 1.35         3.06 ± 1.55
                           Test     6.22 ± 2.70       3.41 ± 2.10    7.37 ± 2.82         5.18 ± 2.42
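The Kernel-NN baseline maps each point to its empirical kernel map (k(x_1, x), ..., k(x_n, x))^T and runs ordinary k-NN on these vectors, here with the Gaussian kernel exp(−ρ||x − y||²) used in the experiments. A minimal sketch (function names and toy data are ours):

```python
import numpy as np
from collections import Counter

def gaussian_kernel(X, Y, rho=1.0):
    """k(x, y) = exp(-rho * ||x - y||^2), the kernel used in the experiments."""
    d2 = np.sum((X[:, None, :] - Y[None, :, :])**2, axis=-1)
    return np.exp(-rho * d2)

def kernel_nn_predict(X_train, y_train, X_test, k=3, rho=1.0):
    """Kernel-NN: embed every point via its empirical kernel map w.r.t. the
    training set, then classify by standard k-NN in that space."""
    K_train = gaussian_kernel(X_train, X_train, rho)  # rows: training empirical maps
    K_test = gaussian_kernel(X_test, X_train, rho)    # rows: test empirical maps
    preds = []
    for kt in K_test:
        dists = np.linalg.norm(K_train - kt, axis=1)
        nearest = np.argsort(dists)[:k]
        preds.append(Counter(y_train[nearest]).most_common(1)[0][0])
    return np.array(preds)

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.0, 3.1]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.05, 0.0], [3.0, 3.05]])
print(kernel_nn_predict(X_train, y_train, X_test))  # → [0 1]
```

MENN can be viewed as additionally learning a Mahalanobis metric on these empirical maps, rather than using the Euclidean distance between them.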
This method is also included in the comparison, as our method can be seen as learning a Mahalanobis metric in the empirical kernel map space. (Kernel-NN is computed as follows: for each training point x_j, the empirical map w.r.t. {x_i}_{i=1}^n, defined as x_j ↦ (k(x_1, x_j), ..., k(x_n, x_j))^T = k_j, is computed; {k_i}_{i=1}^n is then considered to be the training set for the NN classification of the empirical maps of the test data.) As aforementioned, [?] proposed a kernelized version of LMNN, referred to as KLMCA (kernel large margin component analysis), which seems to perform better than LMNN. However, because of the non-availability of the KLMCA code, we are not able to compare our results with it. Nevertheless, comparing the results reported in their paper with the ones in Table 1, we notice that MENN offers improvements over LMNN similar to those KLMCA offers. The wine, iris, ionosphere and balance datasets from the UCI machine learning repository were considered for experimentation. Since we are interested in testing the SDP formulation (which scales with the number of data points), we focus our experimentation on problems of not too large a size. (LMNN instead scales with D, for which preprocessing is done by principal component analysis to reduce the dimensionality.) For large datasets, local optimization of Eq. (9) is more computationally attractive than the convex optimization of Eq. (11). However, addressing optimization aspects in more detail is omitted here because of space constraints. The results shown in Table 1 were averaged over 100 runs (10 runs on Iris and Wine, 5 runs on Balance and Ionosphere for MENN, because of the complexity of the SDP) with different 70/30 splits of the data for training and testing. 15% of the training data was used for validation (required in LMNN, Kernel-NN and MENN). The Gaussian kernel, exp(−ρ||x − y||^2), was used for Kernel-NN and MENN. The parameters ρ and λ were set by cross-validation, searching over ρ ∈ {2^i}_{i=−4}^4 and λ ∈ {10^i}_{i=−3}^3. In all these methods, a k-NN classifier with k = 3 was used for classification. On all datasets except wine, for which the mapping to the high-dimensional space seems to hurt performance (noted similarly by [?]), MENN gives better classification accuracy than LMNN and the other two methods. The role of empirical kernel maps is not clear, as there is no consistent behavior between the performance accuracy achieved with standard NN (Eucl-NN in Table 1) and Kernel-NN.

6 Related work

We briefly review some relevant work and point out similarities and differences with our work. The central idea in all the following reviewed works related to distance metric learning for NN classification is that similarly labelled examples should cluster together and be far away from differently labelled examples. Three major differences between our work and these works are that (i) no assumptions are made about the underlying distance metric, so the method can be extended to arbitrary metric spaces; (ii) a suitable proxy to the LOOE is chosen as the objective to be minimized, which neatly yields a setting to prove the representer-like theorem for NN classification; and (iii) metric learning is interpreted as a kernel learning problem and as a soft-margin linear binary SVM classification problem. [?] used an SDP to learn a Mahalanobis distance metric for clustering by minimizing the sum of squared distances between similarly labelled examples while lower bounding the sum of distances between differently labelled examples. [?] proposed neighborhood component analysis (NCA), which minimizes the probability of error under stochastic neighborhood assignments, using gradient descent to learn a Mahalanobis distance metric. Inspired by [?], [?] proposed an SDP algorithm to learn a Mahalanobis distance metric by minimizing the distance between predefined target neighbors and separating them by a large margin from the examples with non-matching labels. Compared to these methods, our method results in a non-convex program which is then relaxed to yield a convex SDP. Unlike [?], our method does not require prior information about target neighbors. [?] proposed an online algorithm for learning a Mahalanobis distance metric with a cost function very similar to Eq. (5) and g being linear. [?] studies exactly the same setting as [?] but in batch mode. The kernelized version in both these methods involves convex optimization over n^2 parameters with a semidefinite constraint on an n × n matrix, which is similar to our method. To ease the computational burden, [?] neglects the semidefinite constraint and solves an SVM-type QP. [?] proposed a kernel version of the algorithm proposed in [?], assuming a particular parametric form for the linear mapping to invoke the kernel trick. Instead of solving an SDP, they use gradient descent to solve the non-convex program. Similar techniques can be used for our method, especially for large datasets. But we prefer the SDP formulation, as we do not know how to choose d, whereas the eigendecomposition of the C̄ obtained from the SDP gives an idea about d.

7 Conclusion & Discussion

We have proposed two different methods to embed arbitrary metric spaces into a Euclidean space with the goal of improving the accuracy of a NN classifier.
The first method works within the framework of regularization in an RKHS, wherein a representer-like theorem was derived for NN classifiers and parallels were drawn between the ε-neighborhood NN classifier and the soft-margin linear binary SVM classifier. Although the primary focus of this work is to introduce a general theoretical framework for metric learning for NN classification, we have illustrated our findings with some benchmark experiments, demonstrating that the SDP algorithm derived from the proposed framework performs better than a previously proposed Mahalanobis distance metric learning algorithm. In the second method, by choosing the embedding function to be a Mercer kernel map, we have shown the equivalence between Mercer kernel map learning and kernel matrix learning. Though this method is theoretically interesting, it is currently not useful for inductive learning, as it does not allow for an out-of-sample extension. In the future, we would like to apply our RKHS-based algorithm to data from structured spaces, especially focussing on applications in bioinformatics. On the Mercer kernel map front, we would like to study it in more depth and derive an embedding function that supports an out-of-sample extension. We would also like to apply this framework to heterogeneous data integration in the NN setting.