A Unified Semi-Supervised Dimensionality Reduction Framework for Manifold Learning


Authors: Ratthachat Chatpatanasiri, Boonserm Kijsirikul

Department of Computer Engineering, Chulalongkorn University, Bangkok 10330, THAILAND.
ratthachat.c@gmail.com and boonserm.k@chula.ac.th

May 10, 2009

Abstract

We present a general framework of semi-supervised dimensionality reduction for manifold learning which naturally generalizes existing supervised and unsupervised learning frameworks that apply the spectral decomposition. Algorithms derived under our framework are able to employ both labeled and unlabeled examples, and are able to handle complex problems where data form separate clusters of manifolds. Our framework offers simple views, explains relationships among existing frameworks, and provides further extensions which can improve existing algorithms. Furthermore, a new semi-supervised kernelization framework called the "KPCA trick" is proposed to handle non-linear problems.

Keywords: Semi-supervised Learning, Transductive Learning, Spectral Methods, Dimensionality Reduction, Manifold Learning, KPCA Trick.

1 Introduction

In many real-world applications, high-dimensional data indeed lie on (or near) a low-dimensional subspace. The goal of dimensionality reduction is to reduce the complexity of the input data while some desired intrinsic information of the data is preserved. The desired information can be discriminative [1, 2, 3, 4, 5, 6], geometrical [7, 8, 9, 10], or both [11]. Fisher discriminant analysis (FDA) [12] is the most popular method among all supervised dimensionality reduction algorithms. Denote $c$ as the number of classes in a given training set. Provided that the training examples of each class lie in a linear subspace and do not form several separate clusters, i.e.
do not form multi-modality, FDA is able to discover a low-dimensional linear subspace (with at most $c - 1$ dimensions) which is efficient for classification. Recently, many works have improved the FDA algorithm in several aspects [11, 1, 2, 3, 4, 5, 6]. These extended FDA algorithms are able to discover a good low-dimensional subspace even when the training examples of each class lie in separate clusters of complicated non-linear manifolds. Moreover, a subspace discovered by these algorithms is not limited to $c - 1$ dimensions.

Although the extended FDA algorithms work reasonably well, a considerable number of labeled examples is required to achieve satisfactory performance. In many real-world applications such as image classification, web page classification and protein function prediction, the labeling process is costly and time consuming; in contrast, unlabeled examples can be easily obtained. Therefore, in such situations, it can be beneficial to incorporate the information contained in unlabeled examples into the learning problem, i.e., semi-supervised learning (SSL) should be applied instead of supervised learning [13].

In this paper, we present a general semi-supervised dimensionality reduction framework which is able to employ information from both labeled and unlabeled examples. The contributions of the paper can be summarized as follows.

• As with the extended FDA algorithms, algorithms developed in our framework are able to discover a good low-dimensional subspace even when the training examples of each class form separate clusters of complicated non-linear manifolds. In fact, those previous supervised algorithms can be cast as instances of our framework. Moreover, our framework explains previously unclear relationships among existing algorithms from a simple viewpoint.
• We present a novel technique called the Hadamard power operator which improves the use of unlabeled examples in previous algorithms. Experiments show that the Hadamard power operator improves the classification performance of a semi-supervised learner derived from our framework.

• We show that the recent semi-supervised frameworks applying spectral decompositions known to us [14, 15] can be viewed as special cases of our framework. Moreover, empirical evidence shows that semi-supervised learners derived from our framework are superior to existing learners on many standard problems.

• A new non-linearization framework, namely the KPCA trick framework [16], is extended to the semi-supervised learning setting. In contrast to the standard kernel trick, the KPCA trick does not require users to derive new mathematical formulas or to re-implement the kernel version of the original learner.

2 The Framework

Let $\{x_i, y_i\}_{i=1}^{\ell}$ denote a training set of $\ell$ labeled examples, with inputs $x_i \in \mathbb{R}^{d_0}$ generated from a fixed but unknown probability distribution $P_x$, and corresponding class labels $y_i \in \{1, \ldots, c\}$ generated from $P_{y|x}$. In addition to the labeled examples, let $\{x_i\}_{i=\ell+1}^{\ell+u}$ denote a set of $u$ unlabeled examples also generated from $P_x$. Denote $X \in \mathbb{R}^{d_0 \times (\ell+u)}$ as the matrix of input examples $(x_1, \ldots, x_{\ell+u})$. We define $n = \ell + u$. The goal of semi-supervised (SSL) dimensionality reduction is

Goal. Using the information of both labeled and unlabeled examples, we want to learn a map $(x \in \mathbb{R}^{d_0}) \mapsto (z \in \mathbb{R}^{d})$ where $d < d_0$, such that in the embedded space $P_{y|z}$ can be accurately estimated (i.e., unknown labels are easy to predict) by a simple classifier.
Here, following previous works in the supervised setting [11, 1, 2], the nearest neighbor algorithm is used as the simple classifier mentioned in the goal. Note that important special cases of SSL problems are transductive problems where we only want to predict the labels $\{y_i\}_{i=\ell+1}^{\ell+u}$ of the given unlabeled examples. In order to make use of unlabeled examples in the learning process, we make the following so-called manifold assumption [13]:

Semi-Supervised Manifold Assumption. The support of $P_x$ is on a low-dimensional manifold. Furthermore, $P_{y|x}$ is smooth, as a function of $x$, with respect to the underlying structure of the manifold.

At first, to fulfill our goal, we linearly parameterize $z_i = A x_i$ where $A \in \mathbb{R}^{d \times d_0}$. Thus, $AX = (A x_1, \ldots, A x_n) \in \mathbb{R}^{d \times n}$ is the matrix of embedded points. An efficient non-linear extension is presented in Section 2.2. In our framework, we propose to cast the problem as a constrained optimization problem:

$$A^* = \arg\min_{A \in \mathcal{A}} f_\ell(AX) + \gamma f_u(AX), \quad (1)$$

where $f_\ell(\cdot)$ and $f_u(\cdot)$ are objective functions based on labeled and unlabeled examples, respectively, $\gamma$ is a parameter controlling the weights between the two objective functions, and $\mathcal{A}$ is a constraint set in $\mathbb{R}^{d \times d_0}$. The two objective functions determine "how good the embedded points are"; thus, their arguments are $AX$, the matrix of embedded points. Up to orthogonal and translational transformations, we can identify embedded points via their pairwise distances instead of their individual locations. Therefore, we can base the objective functions on pairwise distances of the embedded examples.
Here, we define the objective functions to be linear with respect to the pairwise distances:

$$f_\ell(AX) = \sum_{i,j=1}^{n} c^\ell_{ij}\,\mathrm{dist}(A x_i, A x_j) \quad \text{and} \quad f_u(AX) = \sum_{i,j=1}^{n} c^u_{ij}\,\mathrm{dist}(A x_i, A x_j),$$

where $\mathrm{dist}(\cdot,\cdot)$ is an arbitrary distance function between two embedded points, and $c^\ell_{ij}$ and $c^u_{ij}$ are costs which penalize an embedded distance between two points $i$ and $j$. The specifications of $c^\ell_{ij}$ and $c^u_{ij}$ are based on the label information and the unlabel information, respectively, as described in Section 2.1. If we restrict ourselves to the cases where (I) $\mathrm{dist}(\cdot,\cdot)$ is the squared Euclidean distance function, i.e. $\mathrm{dist}(A x_i, A x_j) = \|A x_i - A x_j\|^2$, (II) $c^\ell_{ij}$ and $c^u_{ij}$ are symmetric, and (III) $A \in \mathcal{A}$ is of the form $A B A^T = I$ where $B$ is a positive semidefinite (PSD) matrix, Eq. (1) results in a general framework which indeed generalizes previous frameworks, as shown in Section 3.

Define $c_{ij} = c^\ell_{ij} + \gamma c^u_{ij}$. We can rewrite the weighted combination of the objective functions in Eq. (1) as follows:

$$f_\ell(AX) + \gamma f_u(AX) = \sum_{i,j=1}^{n} c^\ell_{ij}\,\mathrm{dist}(A x_i, A x_j) + \gamma \sum_{i,j=1}^{n} c^u_{ij}\,\mathrm{dist}(A x_i, A x_j)$$
$$= \sum_{i,j=1}^{n} (c^\ell_{ij} + \gamma c^u_{ij})\,\mathrm{dist}(A x_i, A x_j) = \sum_{i,j=1}^{n} c_{ij}\,\mathrm{dist}(A x_i, A x_j) = \sum_{i,j=1}^{n} c_{ij} \|A x_i - A x_j\|^2$$
$$= 2\sum_{i,j=1}^{n} c_{ij}\,(x_i^T A^T A x_i - x_i^T A^T A x_j) = 2\,\mathrm{trace}\Big(A\Big(\sum_{i,j=1}^{n} x_i c_{ij} x_i^T - \sum_{i,j=1}^{n} x_i c_{ij} x_j^T\Big)A^T\Big)$$
$$= 2\,\mathrm{trace}\big(A X (D - C) X^T A^T\big),$$

where $C$ is the symmetric cost matrix with elements $c_{ij}$ and $D$ is the diagonal matrix with $D_{ii} = \sum_j c_{ij}$.¹ Thus, the optimization problem (1) can be restated as

$$A^* = \arg\min_{A B A^T = I} \mathrm{trace}\big(A X (D - C) X^T A^T\big). \quad (2)$$

Note that the constraint $A B A^T = I$ prevents trivial solutions such as every $A x_i$ being the zero vector.
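The trace identity above can be checked numerically. The following small NumPy sketch (our own code, not part of the paper) verifies that the weighted sum of squared embedded distances equals $2\,\mathrm{trace}(AX(D-C)X^TA^T)$ for a random symmetric cost matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d0, d = 6, 4, 2
X = rng.normal(size=(d0, n))               # columns are x_1, ..., x_n
A = rng.normal(size=(d, d0))               # a candidate linear map
C = rng.random((n, n)); C = (C + C.T) / 2  # symmetric costs c_ij
D = np.diag(C.sum(axis=1))                 # D_ii = sum_j c_ij

Z = A @ X                                  # embedded points A x_i
lhs = sum(C[i, j] * np.sum((Z[:, i] - Z[:, j]) ** 2)
          for i in range(n) for j in range(n))
rhs = 2 * np.trace(A @ X @ (D - C) @ X.T @ A.T)
assert np.isclose(lhs, rhs)
```

The identity holds for any $A$, which is what lets the framework reduce the pairwise-distance objective to a single trace expression.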
If $B$ is a positive definite (PD) matrix, a solution of the above problem is given by the bottom $d$ eigenvectors of the following generalized eigenvalue problem [12, 17]:

$$X (D - C) X^T \mathbf{a}^{(j)} = \lambda_j B \mathbf{a}^{(j)}, \quad j = 1, \ldots, d. \quad (3)$$

Then the optimal linear map is

$$A^* = (\mathbf{a}^{(1)}, \ldots, \mathbf{a}^{(d)})^T. \quad (4)$$

Note that, in terms of solutions of Eq. (3), it is more convenient to represent $A^*$ by its rows $\mathbf{a}^{(i)}$ than by its columns $\mathbf{a}_i$. Moreover, note that

$$\|z - z'\| = \|A^* x - A^* x'\|. \quad (5)$$

Therefore, kNN can be performed in the embedded space. Consequently, an algorithm implemented under our framework consists of the three steps shown in Figure 1.

Input:
1. training examples: $\{(x_1, y_1), \ldots, (x_\ell, y_\ell), x_{\ell+1}, \ldots, x_{\ell+u}\}$
2. a new example: $x'$
3. a positive-value parameter: $\gamma$
Algorithm:
(1) Construct cost matrices $C^\ell$, $C^u$ and $C = C^\ell + \gamma C^u$, and a constraint matrix $B$ (see Section 2.1).
(2) Obtain the optimal matrix $A^*$ by solving Eq. (3).
(3) Perform kNN classification in the obtained subspace using Eq. (5).

Figure 1: Our semi-supervised learning framework.

2.1 Specification of the Cost and Constraint Matrices

In this section, we present various reasonable approaches to specifying the two cost matrices, $C^\ell$ and $C^u$, and the constraint matrix, $B$, using the label and unlabel information. We use the two terms "unlabel information" and "neighborhood information" interchangeably in this paper.

2.1.1 The Cost Matrix $C^\ell$ and the Constraint Matrix $B$

Normally, based on the label information, classical supervised algorithms usually require an embedded space to have the following two desirable conditions:

¹To simplify our notation, in this paper whenever we define a cost matrix $C'$ having elements $c'_{ij}$, we always define its associated diagonal matrix $D'$ with elements $D'_{ii} = \sum_j c'_{ij}$.
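Step (2) of Figure 1 is a symmetric-definite generalized eigenproblem. A minimal SciPy sketch (the helper name `solve_framework` is ours; $B$ is assumed positive definite, e.g. $B = I$ as in DNE):

```python
import numpy as np
from scipy.linalg import eigh

def solve_framework(X, C, B, d):
    """Solve Eq. (3): X (D - C) X^T a = lambda B a, keeping the bottom d eigenvectors."""
    D = np.diag(C.sum(axis=1))
    L = X @ (D - C) @ X.T        # d0 x d0 symmetric left-hand matrix
    w, V = eigh(L, B)            # eigenvalues in ascending order, V is B-orthonormal
    return V[:, :d].T            # rows are a^(1), ..., a^(d), i.e. A*

# usage with random data (for illustration only)
rng = np.random.default_rng(1)
n, d0 = 8, 5
X = rng.normal(size=(d0, n))
C = rng.random((n, n)); C = (C + C.T) / 2
B = np.eye(d0)                   # e.g. the DNE constraint B = I
A_star = solve_framework(X, C, B, d=2)
```

Because `eigh` normalizes eigenvectors so that $V^T B V = I$, the returned $A^*$ automatically satisfies the constraint $A^* B A^{*T} = I$.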
(1) two examples of the same class stay close to one another, and
(2) two examples of different classes stay far apart.

The two conditions are imposed in classical works such as FDA. However, the first condition is too restrictive to capture the manifold and multi-modal structures of data which naturally arise in some applications. Thus, the first condition should be relaxed as follows:

(1*) two nearby examples of the same class stay close to one another,

where "nearby examples", defined using the neighborhood information, are examples which should stay close to each other in both the original and embedded spaces. The specification of "nearby examples" has proven successful in discovering manifold and multi-modal structure [11, 1, 2, 3, 4, 5, 6, 18, 19, 20, 21, 22]. See Figure 2 for an illustration.

Figure 2: An example where data form a multi-modal structure. An algorithm, e.g. FDA, which imposes condition (1) will try to discover a new subspace (the dashed line) which merges the two clusters A and B altogether. The obtained space is undesirable as data of the two classes are mixed together. In contrast, an algorithm which imposes condition (1*) instead of (1) will discover a subspace (the thick line) which does not merge the two clusters A and B, as there are no nearby examples (indicated by a link between a pair of examples) between the two clusters.

In some cases, it is also appropriate to relax the second condition to

(2*) two nearby examples of different classes stay far apart.

In this section, we give three examples of cost matrices which satisfy conditions (1*) and (2) (or (2*)).
These three examples were recently introduced in previous works, namely Discriminant Neighborhood Embedding (DNE) [2], Marginal Fisher Analysis (MFA) [1] and Local Fisher Discriminant Analysis (LFDA) [11], with different presentations and motivations, but they can be unified under our general framework.

Firstly, to specify nearby examples, we construct two matrices $C^I$ and $C^E$ based on any sensible distance (Euclidean distance is the simplest choice). For each $x_i$, let $Neig^I(i)$ be the set of $k$ nearest neighbors having the same label as $y_i$, and let $Neig^E(i)$ be the set of $k$ nearest neighbors having labels different from $y_i$. Define $C^I$ and $C^E$ as follows: let $c^I_{ij} = c^E_{ij} = 0$ if the points $x_i$ and/or $x_j$ are unlabeled, and

$$c^I_{ij} = \begin{cases} 1, & \text{if } j \in Neig^I(i) \vee i \in Neig^I(j), \\ 0, & \text{otherwise,} \end{cases} \quad \text{and} \quad c^E_{ij} = \begin{cases} 1, & \text{if } j \in Neig^E(i) \vee i \in Neig^E(j), \\ 0, & \text{otherwise.} \end{cases}$$

The specifications $c^I_{ij} = 1$ and $c^E_{ij} = 1$ represent nearby examples in conditions (1*) and (2*). Then, the $C^\ell$ and $B$ of the existing algorithms (see Eq. (2)) are:

Discriminant Neighborhood Embedding (DNE):
$C^\ell = C^I - C^E$, $\quad B = I$ (the identity matrix).

Marginal Fisher Analysis (MFA):
$C^\ell = -C^E$, $\quad B = X (D^I - C^I) X^T$.

Local Fisher Discriminant Analysis (LFDA): Let $n_1, \ldots, n_c$ be the numbers of examples of classes $1, \ldots, c$, respectively. Define matrices $C^{bet}$ and $C^{wit}$ as:

$$c^{bet}_{ij} = \begin{cases} c^I_{ij}\big(\frac{1}{n_k} - \frac{1}{n}\big), & \text{if } y_i = y_j = k, \\ -\frac{1}{n}, & \text{otherwise,} \end{cases} \quad \text{and} \quad c^{wit}_{ij} = \begin{cases} \frac{1}{n_k} c^I_{ij}, & \text{if } y_i = y_j = k, \\ 0, & \text{otherwise,} \end{cases}$$

$C^\ell = C^{bet}$, $\quad B = X (D^{wit} - C^{wit}) X^T$.

Within our framework, the relationships among the three previous works can be explained. The three methods exploit different ideas in specifying the matrices $C^\ell$ and $B$ to satisfy the two desirable conditions in an embedded space.
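As an illustration, $C^I$ and $C^E$ can be built directly from the definitions above. A NumPy sketch (the helper name `neighbor_cost_matrices` is ours; unlabeled points are marked with `None`):

```python
import numpy as np

def neighbor_cost_matrices(X, y, k):
    """Build symmetric 0/1 matrices C^I (same-label kNN) and C^E
    (different-label kNN) over the n columns of X; y[i] is None if unlabeled."""
    n = X.shape[1]
    CI = np.zeros((n, n)); CE = np.zeros((n, n))
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    for i in range(n):
        if y[i] is None:
            continue                      # costs involving unlabeled points stay 0
        same = [j for j in range(n) if j != i and y[j] == y[i]]
        diff = [j for j in range(n) if y[j] is not None and y[j] != y[i]]
        for j in sorted(same, key=lambda j: dist[i, j])[:k]:
            CI[i, j] = CI[j, i] = 1       # j in Neig^I(i) or i in Neig^I(j)
        for j in sorted(diff, key=lambda j: dist[i, j])[:k]:
            CE[i, j] = CE[j, i] = 1
    return CI, CE

# tiny example: 4 labeled + 1 unlabeled point in R^2
X = np.array([[0.0, 0.1, 5.0, 5.1, 2.5],
              [0.0, 0.1, 0.0, 0.1, 3.0]])
y = [0, 0, 1, 1, None]
CI, CE = neighbor_cost_matrices(X, y, k=1)
```

Setting both `[i, j]` and `[j, i]` entries realizes the "or" in the definitions, so the returned matrices are symmetric as required by condition (II).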
In DNE, $C^\ell$ is designed to penalize an embedded space which does not satisfy conditions (1*) and (2*). In MFA, the constraint matrix $B$ is designed to satisfy condition (1*), and $C^\ell$ is designed to penalize an embedded space which does not satisfy condition (2*).

Things are less obvious in the case of LFDA. In LFDA, the constraint matrix $B$ is designed to satisfy condition (1*), since the elements of $C^{wit}$ are proportional to those of $C^I$; moreover, since the weights are inversely proportional to $n_k$, elements in a small class have larger weights than elements in a bigger class, i.e. a pair in a small class is more likely to satisfy condition (1*) than a pair in a bigger class. To understand $C^\ell$, recall that

$$\mathrm{trace}\big(A X (D^\ell - C^\ell) X^T A^T\big) = \frac{1}{2}\sum_{i,j} c^\ell_{ij} \|A x_i - A x_j\|^2$$
$$= \frac{1}{2}\sum_{y_i = y_j} c^I_{ij}\Big(\frac{1}{n_k} - \frac{1}{n}\Big)\|A x_i - A x_j\|^2 - \frac{1}{2}\sum_{y_i \neq y_j} \frac{1}{n}\|A x_i - A x_j\|^2$$
$$= d - \frac{1}{2n}\Big(\sum_{y_i = y_j} c^I_{ij}\|A x_i - A x_j\|^2 + \sum_{y_i \neq y_j} \|A x_i - A x_j\|^2\Big),$$

where at the last equality we use the constraint $A B A^T = I$ with $B = X (D^{wit} - C^{wit}) X^T$, and hence

$$\mathrm{trace}\big(A X (D^{wit} - C^{wit}) X^T A^T\big) = \frac{1}{2}\sum_{y_i = y_j} \frac{c^I_{ij}}{n_k}\|A x_i - A x_j\|^2 = \mathrm{trace}(I) = d.$$

Hence, we observe that every pair of labeled examples coming from different classes has a corresponding cost of $-\frac{1}{n}$. Therefore, $C^\ell$ is designed to penalize an embedded space which does not satisfy condition (2). Surprisingly, in LFDA, nearby examples of the same class (having $c^I_{ij} = 1$) also carry a cost of $-\frac{1}{n}$.
As a cost proportional to $-\frac{1}{n}$ is meant to preserve the pairwise distance between a pair of examples (see Section 3.1), LFDA tries to preserve the local geometrical structure between each pair of nearby examples of the same class, in contrast to DNE and MFA, which try to squeeze nearby examples of the same class into a single point.

We note that other recent supervised methods for manifold learning can also be presented and interpreted in our framework with different specifications of $C^\ell$, for example, Local Discriminant Embedding of Chen et al. [5] and Supervised Nonlinear Local Embedding of Cheng et al. [6].

2.1.2 The Cost Matrix $C^u$ and the Hadamard Power Operator

One important implication of the manifold assumption is that "nearby examples are likely to belong to the same class". Hence, by the assumption, it makes sense to design $C^u$ such that it prevents any pair of nearby examples from staying far apart in the embedded space. Among methods of extracting the neighborhood information to define $C^u$, methods based on the heat kernel (or the Gaussian function) are the most popular. Besides the heat kernel, other methods of defining $C^u$ have been invented; see [13, Chap. 15] and [17] for more details. The simplest specification of nearby examples based on the heat kernel is:

$$c^u_{ij} = \exp\Big(-\frac{\|x_i - x_j\|^2}{\sigma^2}\Big). \quad (6)$$

Each pair of nearby examples is penalized with a different cost depending on their similarity, and the similarity between two points is based on the Euclidean distance between them in the input space. Incidentally, with this specification of $C^u$, the term $f_u(AX)$ in Eq. (1) can be interpreted as an approximation of the Laplace-Beltrami operator on a data manifold. A learner which employs $C = C^u$ ($C^\ell = 0$) is named Locality Preserving Projection (LPP) [9]. The parameter $\sigma$ is crucial, as it controls the scale of a cost $c^u_{ij}$.
Hence, the choice of $\sigma$ must be sensible. Moreover, an appropriate choice of $\sigma$ may vary across the support of $P_x$, so a local scale $\sigma_i$ for each point $x_i$ should be used. Let $x'_i$ be the $k$th nearest neighbor of $x_i$. The local scale is defined as $\sigma_i = \|x'_i - x_i\|$, and the weight of each edge is then defined as

$$c^u_{ij} = \exp\Big(-\frac{\|x_i - x_j\|^2}{\sigma_i \sigma_j}\Big). \quad (7)$$

This local scaling method has proven efficient in previous experiments on clustering [23]. A specification of $k$ to define the local scale of each point is usually more convenient than a specification of $\sigma$, since the space of possible choices of $k$ is considerably smaller than that of $\sigma$.

Instead of proposing yet another method to specify a cost matrix, here we present a novel method which can be used to modify any existing cost matrix. Let $Q$ and $R$ be two matrices of equal size with elements $q_{ij}$ and $r_{ij}$. Recall that the Hadamard product [24] $P = Q \odot R$ between $Q$ and $R$ has elements $p_{ij} = q_{ij} r_{ij}$. In words, the Hadamard product is a pointwise product between two matrices. Here, we define the Hadamard $\alpha$th power operator as

$$\bigodot^{\alpha} Q = \underbrace{Q \odot Q \odot \ldots \odot Q}_{\alpha \text{ times}}. \quad (8)$$

Given a cost matrix $C^u$ and a positive integer $\alpha$, we define a new cost matrix $C^u_\alpha$ as

$$C^u_\alpha = \Big(\bigodot^{\alpha} C^u\Big) \frac{\|C^u\|_F}{\|\bigodot^{\alpha} C^u\|_F}, \quad (9)$$

where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. The multiplication by $\frac{\|C^u\|_F}{\|\bigodot^{\alpha} C^u\|_F}$ makes $\|C^u_\alpha\|_F = \|C^u\|_F$. Note that if $C^u$ is symmetric and non-negative, $C^u_\alpha$ retains these properties. The intuition behind $C^u_\alpha$ will be explained through the experiments in Section 4, where we show that $C^u_\alpha$ can further improve the quality of $C^u$ so that the classification performance of a semi-supervised learner is increased.
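Eqs. (7)-(9) combine into a few lines. A NumPy sketch (the helper name `local_scale_costs` is ours) that builds the locally scaled heat-kernel costs and then applies the Hadamard $\alpha$th power with the Frobenius renormalization of Eq. (9):

```python
import numpy as np

def local_scale_costs(X, k=7, alpha=1):
    """Heat-kernel cost matrix with local scaling (Eq. 7), followed by the
    Hadamard alpha-th power with Frobenius renormalization (Eq. 9)."""
    sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)  # ||x_i - x_j||^2
    sigma = np.sqrt(np.sort(sq, axis=1)[:, k])  # distance to k-th nearest neighbor
    Cu = np.exp(-sq / np.outer(sigma, sigma))   # Eq. (7)
    Ca = Cu ** alpha                            # elementwise Hadamard power, Eq. (8)
    return Ca * (np.linalg.norm(Cu) / np.linalg.norm(Ca))  # ||.||_F preserved

# usage on random data
rng = np.random.default_rng(4)
X = rng.normal(size=(2, 8))
Cu1 = local_scale_costs(X, k=3, alpha=1)
Cu3 = local_scale_costs(X, k=3, alpha=3)
```

Raising the entries to the $\alpha$th power sharpens the contrast between near and far pairs while the renormalization keeps the overall magnitude of the cost matrix unchanged.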
Any combination of a label cost matrix $C^\ell$ from previous works such as DNE, MFA and LFDA with an unlabel cost matrix $C^u$ results in a new SSL algorithm; we call the new algorithms SS-DNE, SS-MFA and SS-LFDA.

2.2 Non-Linear Parameterization Using the KPCA Trick

With the linear parameterization, however, we can only obtain a linear subspace defined by $A$. Learning a non-linear subspace can be accomplished by the standard kernel trick [25]. However, applying the kernel trick can be inconvenient, since new mathematical formulas have to be derived and a new implementation has to be done separately from the linear implementation. Recently, Chatpatanasiri et al. [16] proposed an alternative kernelization framework called the KPCA trick, which does not require a user to derive a new mathematical formula or re-implement a kernelized algorithm. Moreover, the KPCA trick framework avoids troublesome problems such as singularity, etc.

2.2.1 The KPCA-Trick Algorithm

In this section, the KPCA trick framework is extended to cover learners implemented under our semi-supervised learning framework. Let $k(\cdot,\cdot)$ be a PSD kernel function associated with a non-linear function $\phi(\cdot): \mathbb{R}^{d_0} \to \mathcal{H}$ such that $k(x, x') = \langle \phi(x), \phi(x') \rangle$ [26], where $\mathcal{H}$ is a Hilbert space. Denote $\phi_i$ for $\phi(x_i)$, $i = 1, \ldots, \ell+u$, and $\phi'$ for $\phi(x')$. The central idea of the KPCA trick is to represent each $\phi_i$ and $\phi'$ in a new "finite"-dimensional space, with dimensionality bounded by $\ell + u$, without any loss of information. Within the framework, a new coordinate for each example is computed "explicitly", and each example in the new coordinates is then used as the input of any existing semi-supervised learner without any re-implementation. To simplify the discussion, we assume that $\{\phi_i\}$ is linearly independent and has its center at the origin, i.e. $\sum_i \phi_i = 0$.
Since we have $n = \ell + u$ total examples, the span of $\{\phi_i\}$ has dimensionality $n$ by our assumption. According to [16], each example $\phi_i$ can be represented as $\varphi_i \in \mathbb{R}^n$ with respect to a new orthonormal basis $\{\psi_i\}_{i=1}^{n}$ such that $\mathrm{span}(\{\psi_i\}_{i=1}^{n})$ is the same as $\mathrm{span}(\{\phi_i\}_{i=1}^{n})$, without loss of any information. More precisely, we define

$$\varphi_i = \big(\langle \phi_i, \psi_1 \rangle, \ldots, \langle \phi_i, \psi_n \rangle\big) = \Psi^T \phi_i, \quad (10)$$

where $\Psi = (\psi_1, \ldots, \psi_n)$. Note that although we may be unable to numerically represent each $\psi_i$, the inner product $\langle \phi_i, \psi_j \rangle$ can be conveniently computed by KPCA, where each $\psi_i$ is a principal component in the feature space. Likewise, a new test point $\phi'$ can be mapped to $\varphi' = \Psi^T \phi'$. Consequently, the mapped data $\{\varphi_i\}$ and $\varphi'$ are finite-dimensional and can be explicitly computed.

The KPCA-trick algorithm, consisting of three simple steps, is shown in Figure 3. All semi-supervised learners can be kernelized by this simple algorithm. In the algorithm, we denote by ssl a semi-supervised learner which outputs the best linear map $A^*$.

Input:
1. training examples: $\{(x_1, y_1), \ldots, (x_\ell, y_\ell), x_{\ell+1}, \ldots, x_{\ell+u}\}$
2. a new example: $x'$
3. a kernel function: $k(\cdot,\cdot)$
4. a linear semi-supervised learning algorithm: ssl (see Figure 1)
Algorithm:
(1) Apply kpca($k$, $\{x_i\}_{i=1}^{\ell+u}$, $x'$) such that $\{x_i\} \mapsto \{\varphi_i\}$ and $x' \mapsto \varphi'$.
(2) Apply ssl with the new inputs $\{(\varphi_1, y_1), \ldots, (\varphi_\ell, y_\ell), \varphi_{\ell+1}, \ldots, \varphi_{\ell+u}\}$ to obtain $A^*$.
(3) Perform kNN based on the distance $\|A^* \varphi_i - A^* \varphi'\|$.

Figure 3: The KPCA-trick algorithm for semi-supervised learning.

2.3 Remarks

1. The main optimization problem shown in Eq. (2) can be restated as follows [12]:

$$\arg\min_{A \in \mathbb{R}^{d \times d_0}} \mathrm{trace}\Big((A B A^T)^{-1} A X (D - C) X^T A^T\Big).$$
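Step (1) of Figure 3 only needs kernel evaluations. A NumPy sketch of the KPCA map (the helper name `kpca_map` is ours): it computes explicit coordinates $\varphi_i$ whose pairwise inner products reproduce the centered kernel matrix, and maps new points by projecting their centered kernel rows onto the principal components.

```python
import numpy as np

def kpca_map(K, K_new=None):
    """Explicit KPCA coordinates. K: n x n kernel matrix of the training
    inputs; K_new: m x n kernel values between new points and training inputs."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    Kc = H @ K @ H                             # centered kernel
    w, U = np.linalg.eigh(Kc)                  # ascending eigenvalues
    keep = w > 1e-10                           # drop (near-)null directions
    w, U = w[keep], U[:, keep]
    Phi = U * np.sqrt(w)                       # rows are the coordinates varphi_i
    if K_new is None:
        return Phi
    Kn = (K_new - K.mean(axis=0)) @ H          # center the new kernel rows
    return Phi, Kn @ U / np.sqrt(w)            # project new points onto components

# usage: a linear kernel, so results can be checked against K directly
rng = np.random.default_rng(5)
Xtr = rng.normal(size=(3, 6))
K = Xtr.T @ Xtr
Phi = kpca_map(K)
Phi2, phi_new = kpca_map(K, K[[0], :])         # map training point 0 as a "new" point
```

Since $\Phi \Phi^T$ equals the centered kernel matrix, all pairwise feature-space distances are preserved, which is exactly what the subsequent ssl step needs.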
Within this formulation, the corresponding optimal solution is invariant under a non-singular linear transformation; i.e., if $A^*$ is an optimal solution, then $T A^*$ is also an optimal solution for any non-singular $T \in \mathbb{R}^{d \times d}$ [12, pp. 447]. We note that four choices of $T$ which assign a weight to each new axis are natural: (1) $T = I$; (2) $T$ is a diagonal matrix with $T_{ii} = \frac{1}{\|\mathbf{a}^{(i)}\|}$, i.e. $T$ normalizes each axis to be equally important; (3) $T$ is a diagonal matrix with $T_{ii} = \sqrt{\lambda_i}$, as $\sqrt{\lambda_i}$ determines how well each axis $\mathbf{a}^{(i)}$ fits the objective function $\mathbf{a}^{(i)T} X (D - C) X^T \mathbf{a}^{(i)}$; and (4) $T$ is a diagonal matrix with $T_{ii} = \frac{\sqrt{\lambda_i}}{\|\mathbf{a}^{(i)}\|}$, i.e. a combination of (2) and (3).

2. The matrices $B$ defined in Section 2.1 for the two algorithms SS-MFA and SS-LFDA are guaranteed to be positive semidefinite (PSD) but may not be positive definite (PD), i.e., $B$ may not be full-rank. In this case, $B$ is singular and we cannot immediately apply Eq. (3) to solve the optimization problems. One common way to resolve this difficulty is to use $(B + \epsilon I)$, for some value of $\epsilon$, which is now guaranteed to be full-rank, instead of $B$ in Eq. (2). Since $\epsilon$ acts in the role of a regularizer, it makes sense to set $\epsilon = \gamma$, the regularization parameter specified in Section 2.1. Similar settings of $\epsilon$ have also been used by some existing algorithms, e.g. [27, 14].

Also, in a small-sample-size problem where $X (D - C) X^T$ is not full-rank, the obtained matrix $A^*$ (or some columns of $A^*$) may lie in the null space of $X (D - C) X^T$. Although such a matrix does optimize our optimization problem, it usually overfits the given data. One possible solution to this problem is to apply PCA to the given data in the first place [28], so that the resulting data have dimensionality less than or equal to the rank of $X (D - C) X^T$.
Note that in our KPCA trick framework this pre-process is automatically accomplished, as KPCA has to be applied before a learner, as shown in Figure 3.

3 Related Work: Connections and Improvements

As already described in Section 2.1, our framework generalizes various existing supervised and unsupervised manifold learners [11, 1, 2, 3, 4, 5, 6, 17, 9, 23]. The KPCA trick is new in the field of semi-supervised learning. There are some supervised manifold learners which cannot be represented in our framework [18, 19, 20, 21, 22], because their cost functions are not linear with respect to distances among examples. Extending these algorithms to handle semi-supervised learning problems is an interesting direction for future work.

Yang et al. [29] present another semi-supervised learning framework which solves problems entirely different from those considered in this paper. They propose to extend unsupervised algorithms such as ISOMAP [7] and Laplacian Eigenmap [13, Chapter 16] to cases where information about the exact locations of some points is available.

To the best of our knowledge, there are currently two existing semi-supervised dimensionality reduction frameworks in the literature which have a goal similar to ours; both were very recently proposed. Here, we subsequently show that these frameworks can be restated as special cases of our framework.

3.1 Sugiyama et al. [14]

Sugiyama et al. [14] extend the LFDA algorithm to handle a semi-supervised learning problem by adding the PCA objective function $f_{PCA}(A)$ to the objective function $f_\ell(A)$ of LFDA described in Section 2.1. To describe Sugiyama et al.'s algorithm, named 'SELF', without loss of generality we assume that the training data are centered at the origin, i.e. $\sum_{i=1}^{n} x_i = 0$, and then we can write $f_{PCA}(A) = -\sum_{i=1}^{n} \|A x_i\|^2$. Sugiyama et al.
propose to solve the following problem:

$$A^* = \arg\min_{A B A^T = I} \Big(\sum_{i,j=1}^{\ell} c^\ell_{ij} \|A x_i - A x_j\|^2 - \gamma \sum_{i=1}^{n} \|A x_i\|^2\Big). \quad (11)$$

Interestingly, it can be shown that this formulation can be expressed in our framework with the unlabel costs $c^u_{ij}$ being negative, and hence our framework subsumes SELF. To see this, let $c^u_{ij} = -1/2n$ for all $i, j = 1, \ldots, n$. Then the objective $f_u(A)$ is equivalent to $f_{PCA}(A)$:

$$f_u(A) = \sum_{i,j=1}^{n} -\frac{1}{2n}\|A x_i - A x_j\|^2 = -\frac{1}{2n}\sum_{i,j=1}^{n} \langle A x_i - A x_j, A x_i - A x_j \rangle$$
$$= -\frac{1}{2n}\Big(2\sum_{i,j=1}^{n} \langle A x_i, A x_i \rangle - 2\sum_{i,j=1}^{n} \langle A x_i, A x_j \rangle\Big)$$
$$= -\frac{1}{2n}\Big(2n\sum_{i=1}^{n} \|A x_i\|^2 - 2\Big\langle A\sum_{i=1}^{n} x_i,\, A\sum_{j=1}^{n} x_j\Big\rangle\Big) = f_{PCA}(A),$$

where we use the fact that $\sum_{i=1}^{n} x_i = 0$. This proves that SELF is a special case of our framework.

Note that the use of the negative unlabel costs $c^u_{ij} = -1/2n$ results in an algorithm which tries to preserve the global structure of the input data and does not convey the manifold assumption, under which only a local structure should be preserved. Therefore, when the unlabeled input data lie on a complicated manifold, it is not appropriate to apply $f_u(A) = f_{PCA}(A)$.

3.2 Song et al. [15]

Song et al. propose to extend FDA and another algorithm named maximum margin criterion (MMC) [30] to handle a semi-supervised learning problem. Their idea of the semi-supervised extension is similar to ours, as they add the term $f_u(\cdot)$ to the objectives of FDA and MMC (hence, we call the resulting algorithms SS-FDA and SS-MMC, respectively). However, SS-FDA and SS-MMC cannot handle problems where the data of each class form a manifold or several clusters, as shown in Figure 2, because SS-FDA and SS-MMC satisfy condition (1) but not (1*). In fact, SS-FDA and SS-MMC can both be restated as instances of our framework.
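The equivalence $f_u(A) = f_{PCA}(A)$ for the constant costs $c^u_{ij} = -1/2n$ on centered data can be checked numerically (a small NumPy check, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d0, d = 10, 5, 3
X = rng.normal(size=(d0, n))
X -= X.mean(axis=1, keepdims=True)      # center the data: sum_i x_i = 0
A = rng.normal(size=(d, d0))
Z = A @ X

# f_u with constant negative costs c^u_ij = -1/2n
f_u = sum(-1.0 / (2 * n) * np.sum((Z[:, i] - Z[:, j]) ** 2)
          for i in range(n) for j in range(n))
f_pca = -np.sum(Z ** 2)                 # f_PCA(A) = -sum_i ||A x_i||^2
assert np.isclose(f_u, f_pca)
```

The cross term vanishes only because the data are centered, which is why the centering assumption is stated without loss of generality in the derivation.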
To see this, we note that the optimization problem of SS-MMC can be stated as

$$A^* = \arg\min_{A A^T = I} \ \gamma'\,\mathrm{trace}(A S_w A^T) - \mathrm{trace}(A S_b A^T) + \gamma f_u(A), \quad (12)$$

where $S_b$ and $S_w$ are the standard between-class and within-class scatter matrices, respectively [12]:

$$S_w = \sum_{i=1}^{c} \sum_{j \,|\, y_j = i} (x_j - \mu_i)(x_j - \mu_i)^T \quad \text{and} \quad S_b = \sum_{i=1}^{c} n_i (\mu - \mu_i)(\mu - \mu_i)^T,$$

where $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\mu_i = \frac{1}{n_i}\sum_{j | y_j = i} x_j$, and $n_i$ is the number of examples in the $i$th class. It can be checked that $\mathrm{trace}(A S_w A^T) = \frac{1}{2}\sum_{i,j=1}^{\ell} c^w_{ij} \|A x_i - A x_j\|^2$ and $\mathrm{trace}(A S_b A^T) = \frac{1}{2}\sum_{i,j=1}^{\ell} c^b_{ij} \|A x_i - A x_j\|^2$, where

$$c^b_{ij} = \begin{cases} \frac{1}{n} - \frac{1}{n_k}, & \text{if } y_i = y_j = k, \\ \frac{1}{n}, & \text{otherwise,} \end{cases} \quad \text{and} \quad c^w_{ij} = \begin{cases} \frac{1}{n_k}, & \text{if } y_i = y_j = k, \\ 0, & \text{otherwise.} \end{cases}$$

Hence, by setting $c^\ell_{ij} = \gamma' c^w_{ij} - c^b_{ij}$, we finish our proof that SS-MMC is a special case of our framework. The proof that SS-FDA is in our framework is similar to that for SS-MMC.

Figure 4: The first toy example. The projection axes of three algorithms, namely FDA, LFDA and LPP, are presented. Big circles and big crosses denote labeled examples, while small circles and small crosses denote unlabeled examples. Their accuracies over the unlabeled examples are LFDA: 0.935, FDA: 0.820, LPP: 0.620.

3.3 Improvement over Previous Frameworks

In this section, we explain why SELF and SS-FDA, proposed by Sugiyama et al. [14] and Song et al. [15] as described above, are not enough to solve some semi-supervised learning problems, even the simple ones shown in Figure 4 and Figure 5.

In Figure 4, three dimensionality reduction algorithms, FDA, LFDA and LPP, are performed on the dataset. Because of multi-modality, FDA cannot find an appropriate projection.
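The pairwise-cost identities for the scatter matrices can be verified numerically; the sketch below (our own code) checks them for random labeled data, using the sign convention for $c^b_{ij}$ under which the identity for $S_b$ holds:

```python
import numpy as np

rng = np.random.default_rng(3)
d0, n = 4, 12
X = rng.normal(size=(d0, n))
y = rng.integers(0, 3, size=n)                     # labels from 3 classes
mu = X.mean(axis=1)
Sw = np.zeros((d0, d0)); Sb = np.zeros((d0, d0))
for k in np.unique(y):
    Xk = X[:, y == k]; nk = Xk.shape[1]
    mk = Xk.mean(axis=1)
    Sw += (Xk - mk[:, None]) @ (Xk - mk[:, None]).T      # within-class scatter
    Sb += nk * np.outer(mu - mk, mu - mk)                # between-class scatter

def pairwise(cost):
    return 0.5 * sum(cost(i, j) * np.sum((X[:, i] - X[:, j]) ** 2)
                     for i in range(n) for j in range(n))

nks = {k: int(np.sum(y == k)) for k in np.unique(y)}
cw = lambda i, j: 1.0 / nks[y[i]] if y[i] == y[j] else 0.0
cb = lambda i, j: (1.0 / n - 1.0 / nks[y[i]]) if y[i] == y[j] else 1.0 / n
assert np.isclose(np.trace(Sw), pairwise(cw))
assert np.isclose(np.trace(Sb), pairwise(cb))
```

The check is done on traces with $A = I$; since the identities hold at the matrix level, they hold for $\mathrm{trace}(A \cdot A^T)$ with any $A$.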
Since the two clusters do not contain data of the same class, LPP, which tries to preserve the structure of the two clusters, also fails. In this case, only LFDA can find a proper projection, since it can cope with multi-modality and can take the labeled examples into account. Note that since SS-FDA is a linear combination of FDA and LPP, it can only find a projection lying in between the projections discovered by FDA and LPP; in this case SS-FDA cannot find an efficient projection, unlike LFDA and, of course, SS-LFDA derived from our framework.

A similar argument warns against an uncareful use of SELF in some situations. In Figure 5, four dimensionality reduction algorithms, FDA, PCA, LFDA and LPP, are performed on the dataset. Because of multi-modality, FDA and PCA cannot find an appropriate projection. Also, since there are only a few labeled examples, LFDA fails to find a good projection as well. In this case, only LPP can find a proper projection, since it can cope with multi-modality and can take the unlabeled examples into account. Note that since SELF is a linear combination of LFDA and PCA, it can only find a projection lying in between the projections discovered by LFDA and PCA; in this case SELF cannot find a correct projection, unlike a semi-supervised learner such as SS-LFDA derived from our framework, which, as explained in Section 2.1, employs the LPP cost function as its C^u.

Figure 5: The second toy example, consisting of three clusters of two classes. 1-NN percentage accuracies: LFDA 0.644, FDA 0.385, LPP 0.963, PCA 0.667.
Since a semi-supervised manifold learner derived from our framework can be intuitively thought of as a combination of a supervised learner and an unsupervised learner, one may misunderstand that a semi-supervised learner cannot discover a good subspace unless either its supervised or its unsupervised learner is able to discover a good subspace. The above two toy examples may also mislead the reader in that way. In fact, that intuition is incorrect. Here, we give another toy example, shown in Figure 6, where only a semi-supervised learner is able to discover a good subspace, while neither its supervised nor its unsupervised counterpart is. Intuitively, a semi-supervised learner is able to exploit useful information from both labeled and unlabeled examples.

4 Experiments

In this section, classification performances of algorithms derived from our framework are demonstrated. We use an experimental setting similar to those of previous works [14, 13, Chapter 21] so that our results can be compared to them.

4.1 Experimental Setting

In all experiments, two semi-supervised learners derived from our framework, SS-LFDA and SS-DNE, are compared to relevant existing algorithms: PCA, LPP*, LFDA, DNE and SELF [14]. In contrast to the standard LPP, which does not apply the Hadamard power operator explained in Section 2.1, we denote by LPP* a variant of LPP applying the Hadamard power operator.

Non-linear semi-supervised manifold learning is also experimented with by applying the KPCA-trick algorithm illustrated in Figure 3. Since it is not our intention to apply the "best" kernel but to compare efficiency between a "semi-supervised" kernel learner and its base "supervised" (and "unsupervised") kernel learners, we simply apply the 2nd-degree polynomial kernel k(x, x') = ⟨x, x'⟩² to the kernel algorithms in all experiments.
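As a sanity check (our own illustration, not part of the paper), the 2nd-degree polynomial kernel above coincides with an inner product in an explicit feature space: for x ∈ R², the map φ(x) = (x₁², √2·x₁x₂, x₂²) satisfies ⟨φ(x), φ(x')⟩ = ⟨x, x'⟩². This is what allows the KPCA trick to operate in the feature space without ever forming φ explicitly.

```python
import numpy as np

def poly2_kernel(x, xp):
    """2nd-degree polynomial kernel k(x, x') = <x, x'>^2."""
    return np.dot(x, xp) ** 2

def phi(x):
    """Explicit degree-2 feature map for 2-dimensional inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(2)
x, xp = rng.standard_normal(2), rng.standard_normal(2)
assert np.isclose(poly2_kernel(x, xp), np.dot(phi(x), phi(xp)))
```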
Figure 6: (Left) The third toy example, where only a semi-supervised learner is able to find a good projection. 1-NN accuracies: LFDA 0.688, LPP 0.250, SS-LFDA 1.000. (Right) An undirected graph corresponding to the values of C^u used by LPP and SS-LFDA. In this figure, a pair of examples i and j has a link if and only if c^u_{ij} > 0.1. This graph explains why LPP projects the data onto the axis shown in the left figure: LPP, which does not apply the label information, tries to choose a projection axis which squeezes the two clusters as much as possible. Note that we apply a local-scaling method, Eq. (7), to specify C^u.

By using the nearest neighbor algorithm on their discovered subspaces, classification performances of the experimented learners are measured on five standard datasets shown in Table 1. The first two datasets are obtained from the UCI repository [31]; the next two datasets, mainly designed for testing semi-supervised learners, are obtained from http://www.kyb.tuebingen.mpg.de/ssl-book/benchmarks.html [13, Chapter 21]. The final dataset, extended Yale B [32], is a standard dataset for a face recognition task. The classification performance of each algorithm is measured by the average test accuracy over 25 realizations of randomly splitting each dataset into training and testing subsets.

Three parameters need to be tuned in order to apply a semi-supervised learner derived from our framework (see Section 2.1): γ, the regularizer; α, the degree of the Hadamard power operator; and k, the k-th-nearest-neighbor parameter needed to construct the cost matrices.
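A common way to build such a cost matrix with local scaling follows the self-tuning heuristic of Zelnik-Manor and Perona [23]: c^u_{ij} = exp(−‖x_i − x_j‖² / (σ_i σ_j)), with σ_i the distance from x_i to its k-th nearest neighbor. The paper's exact Eq. (7) is not reproduced in this section, so treat the sketch below as an assumption-laden illustration of that general recipe rather than the authors' definition.

```python
import numpy as np

def local_scaling_costs(X, k=3):
    """Self-tuning affinity: c_ij = exp(-||xi - xj||^2 / (sigma_i * sigma_j)),
    where sigma_i is the distance from x_i to its k-th nearest neighbor."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    sigma = np.sort(D, axis=1)[:, k]   # column 0 is the self-distance 0
    return np.exp(-D ** 2 / (sigma[:, None] * sigma[None, :]))

rng = np.random.default_rng(3)
C = local_scaling_costs(rng.standard_normal((20, 2)), k=3)
assert np.allclose(C, C.T) and np.allclose(np.diag(C), 1.0)
```

The resulting matrix is symmetric with unit diagonal; thresholding it (as in Figure 6, right) yields the nearest-neighbor link graphs discussed in the experiments.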
To make our learners satisfy condition (1*) described in Section 2.1, it is clear that k should be small compared to n_c, the number of training examples of class c. From our experience, semi-supervised learners are quite insensitive to various small values of k. Therefore, in all our experiments, we simply set k = min(3, n_c), so that only two parameters, γ and α, need to be tuned. We tune these two parameters via cross validation. Note that only α needs to be tuned for LPP*, and only γ needs to be tuned for SELF.

The 'Good Neighbors' score shown in Table 1 is due to Sugiyama et al. [14]. The score is simply defined as the training accuracy of the nearest neighbor algorithm when all available data are labeled and given to the algorithm. Note that this score is not used by any dimensionality reduction algorithm; it merely clarifies to the reader how useful the unlabeled examples of each dataset are. Intuitively, if a dataset gets a high score, unlabeled examples should be useful, since it indicates that each pair of examples having a high penalty cost c^u_{ij} should belong to the same class. Note that Table 1 shows two scores for each dataset: linear is the score on the given input space, while kernel measures the score on the feature space corresponding to the 2nd-degree polynomial kernel.

Table 1: Details of each dataset: d0, c, ℓ, u and t denote the numbers of input features, classes, labeled examples, unlabeled examples and testing examples, respectively. '*' denotes the transductive setting used in small datasets, where all examples which are not labeled are given as unlabeled examples and used as testing examples as well. d, determined by using prior knowledge, denotes the target dimensionality for each dataset. 'Good Neighbors' denotes a quantity which measures the goodness of unlabeled data for each dataset.

Name        d0    c   ℓ+u+t   ℓ        u     d    Good Neighbors (linear / kernel)
Ionosphere  34    2   351     10/100   *     2    0.866 / 0.843
Balance     4     3   625     10/100   300   1    0.780 / 0.760
BCI         117   2   400     10/100   *     2    0.575 / 0.593
Usps        241   2   1500    10/100   300   10   0.969 / 0.971
M-Eyale     504   5   320     20/100   *     10   0.878 / 0.850

4.2 Numerical Results

Numerical results are shown in Table 2 for the case of ℓ = 10 (except M-Eyale, where ℓ = 20) and in Table 3 for the case of ℓ = 100. In the experiments, the classification performances of SS-DNE and SS-LFDA are compared to those of their unsupervised and supervised counterparts: LPP* and DNE for SS-DNE, and LPP* and LFDA for SS-LFDA. SELF is also compared to SS-LFDA, as they are related semi-supervised learners originating from LFDA. Our two algorithms are highlighted whenever they are superior to their counterpart opponents.

From the results, our two algorithms, SS-LFDA and SS-DNE, outperform all their opponents in 32 out of 40 comparisons: in the first setting of small ℓ (Table 2), our algorithms outperform the opponents in 18 out of 20 comparisons, while in the second setting of large ℓ (Table 3), our algorithms outperform the opponents in 14 out of 20 comparisons. Consequently, our framework offers semi-supervised learners which consistently improve on their base supervised and unsupervised learners. Note that, as the number of labeled examples increases, the usefulness of unlabeled examples decreases. We discuss and analyze the results of each dataset in detail in the next subsections.

4.2.1 Ionosphere

Ionosphere is a real-world dataset of radar pulses passing through the ionosphere, collected by a system in Goose Bay, Labrador.
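The 'Good Neighbors' score above can be sketched in a few lines. Since a plain 1-NN training accuracy with each point as its own neighbor would trivially be 1, the sketch below uses a leave-one-out 1-NN accuracy with every example labeled; this is our interpretation of the definition, not the authors' code.

```python
import numpy as np

def good_neighbors_score(X, y):
    """Leave-one-out 1-NN accuracy when all examples are labeled
    (our reading of the 'Good Neighbors' score of Sugiyama et al.)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)   # exclude each point itself
    nn = np.argmin(D, axis=1)     # index of the nearest other example
    return float(np.mean(y[nn] == y))

# Two well-separated clusters: every nearest neighbor shares its label.
rng = np.random.default_rng(4)
X = np.vstack([np.zeros((5, 2)), 10 + np.zeros((5, 2))]) \
    + 0.01 * rng.standard_normal((10, 2))
y = np.array([0] * 5 + [1] * 5)
assert good_neighbors_score(X, y) == 1.0
```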
The targets were free electrons in the ionosphere. 'Good' radar returns are those showing evidence of some type of structure in the ionosphere; 'bad' returns are those that do not. Since we do not know the true decision boundary of Ionosphere, we simply set the target dimensionality d = c = 2. It can be observed that non-linearization does improve the classification performance of all algorithms.

Table 2: Percentage accuracies of SS-DNE and SS-LFDA derived from our framework compared to existing algorithms (ℓ = 10, except M-Eyale where ℓ = 20). SS-LFDA and SS-DNE are highlighted when they outperform their opponents (LPP* and DNE for SS-DNE, and LPP*, LFDA and SELF for SS-LFDA). Parenthesized numbers indicate %-confidence levels of the one-tailed paired t-test for differences in accuracies between our algorithms and their best opponents; entries without them have confidence levels below 80%.

Linear      PCA        LPP*       DNE        LFDA        SELF       SS-DNE          SS-LFDA
Ionosphere  71 ± 1.2   82 ± 1.3   70 ± 1.2   71 ± 1.1    70 ± 1.5   75 ± 1.0        78.1 ± .9
Balance     49 ± 1.9   61 ± 1.9   63 ± 2.2   70 ± 2.2    69 ± 2.3   71 ± 1.8 (99)   73 ± 2.3 (80)
BCI         49.8 ± .6  53.4 ± .3  51.3 ± .6  52.6 ± .5   52.1 ± .5  57.1 ± .6 (99)  55.2 ± .3 (99)
Usps        79 ± 1.2   74 ± 1.0   79.6 ± .6  80.6 ± .9   81.7 ± .8  81.8 ± .5 (99)  83.0 ± .5 (90)
M-Eyale     44.6 ± .7  67 ± 1.1   66 ± 1.2   71.6 ± 1.0  67.2 ± .8  76.9 ± .8 (99)  75.7 ± .9 (99)

Kernel      PCA        LPP*       DNE        LFDA       SELF       SS-DNE          SS-LFDA
Ionosphere  70 ± 1.8   83.2 ± .9  70 ± 1.6   71 ± 1.3   74 ± 1.5   87.2 ± .9 (99)  88 ± 1.0 (99)
Balance     41.7 ± .8  47.9 ± .9  62 ± 2.5   66 ± 2.0   60 ± 2.8   66 ± 1.8 (80)   69 ± 1.9 (80)
BCI         49.7 ± .3  53.7 ± .3  50.1 ± .4  50.3 ± .6  50.5 ± .4  53.8 ± .3       54.1 ± .3 (80)
Usps        77 ± 1.1   76 ± 1.1   79.9 ± .5  80.3 ± .8  80.9 ± .8  82.0 ± .4 (99)  83.7 ± .6 (99)
M-Eyale     42.1 ± .9  63.2 ± .7  58.0 ± .9  60.3 ± .8  58.8 ± .7  69.9 ± .7 (99)  73.2 ± .8 (99)

Table 3: Percentage accuracies of SS-DNE and SS-LFDA compared to existing algorithms (ℓ = 100).

Linear      PCA        LPP*       DNE        LFDA       SELF       SS-DNE          SS-LFDA
Ionosphere  72.8 ± .6  83.7 ± .6  77.9 ± .7  74 ± 1.0   77.8 ± .5  84.5 ± .6 (80)  84.9 ± .4 (95)
Balance     57 ± 2.2   80 ± 1.3   86.4 ± .5  87.9 ± .3  87.2 ± .4  88.2 ± .5 (99)  86.3 ± .6
BCI         49.5 ± .5  54.9 ± .5  53.1 ± .7  67.9 ± .5  67.6 ± .6  63.1 ± .5 (99)  67.5 ± .6
Usps        91.4 ± .3  75.7 ± .3  91.1 ± .3  89.3 ± .4  92.2 ± .3  92.2 ± .4 (95)  91.6 ± .3
M-Eyale     69.4 ± .4  84.1 ± .4  92.3 ± .4  95.4 ± .3  94.3 ± .2  93.5 ± .4 (95)  95.7 ± .2

Kernel      PCA        LPP*       DNE        LFDA       SELF       SS-DNE          SS-LFDA
Ionosphere  79.8 ± .4  89.7 ± .5  78.7 ± .9  81.3 ± .7  81.1 ± .5  93.6 ± .2 (99)  93.7 ± .3 (99)
Balance     42.5 ± .3  46.9 ± .5  84.0 ± .7  87.8 ± .7  79 ± 1.6   86.5 ± .7 (99)  87.7 ± .9
BCI         49.7 ± .5  54.5 ± .4  51.6 ± .6  51.0 ± .8  52.4 ± .6  57.6 ± .2 (99)  57.0 ± .4 (99)
Usps        91.1 ± .3  81.5 ± .6  91.4 ± .4  91.2 ± .4  92.7 ± .3  92.3 ± .3 (95)  91.9 ± .3
M-Eyale     66.3 ± .3  81.9 ± .5  91.2 ± .3  89.1 ± .5  85.8 ± .6  91.2 ± .3       94.3 ± .3 (99)

It can be observed that LPP* is much better than PCA on this dataset, and therefore, unlike SELF, SS-LFDA much improves on LFDA. In fact, the main reason that SS-LFDA, SS-DNE and LPP* have good classification performances is the Hadamard power operator; this is explained in Figures 7, 8 and 9. From Figures 7 and 8, defining 'nearby examples' to be a pair of examples with a link (having c^u_{ij} ≥ 0.36), we see that almost every link connects nearby examples of the same class (i.e. connects good nearby examples). This indicates that our unlabel cost matrix C^u is quite accurate, as bad nearby examples rarely have links. In fact, the ratio of good nearby examples to total nearby examples (in short, the good-nearby-examples ratio) is 394/408 ≈ 0.966.
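The good-nearby-examples ratio used above can be computed directly from C^u and the labels: among all pairs whose cost exceeds a threshold, count the fraction belonging to the same class. A minimal sketch (our own helper, applied here to a tiny hypothetical cost matrix, not to the Ionosphere data):

```python
import numpy as np

def good_nearby_ratio(C, y, threshold):
    """Fraction of 'nearby' pairs (c_ij >= threshold, i != j)
    whose two examples share the same class label."""
    i, j = np.where(C >= threshold)
    keep = i < j                  # count each unordered pair once, skip diagonal
    i, j = i[keep], j[keep]
    if len(i) == 0:
        return float("nan")
    return float(np.mean(y[i] == y[j]))

# Tiny hypothetical example: high costs occur only within classes.
C = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
y = np.array([0, 0, 1])
assert good_nearby_ratio(C, y, 0.36) == 1.0
```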
Nevertheless, if we re-define 'nearby examples' to be pairs of examples having, e.g., c^u_{ij} ≥ 0.01, the same ratio reduces to 0.75, as shown in Figure 9 (Left). This indicates that many pairs of examples having small values of c^u_{ij} are of different classes (i.e. bad nearby examples).

Figure 7: The undirected graph corresponding to C^u constructed on Ionosphere. Each link corresponds to a pair of nearby examples having c^u_{ij} ≥ 0.36; at this threshold, 394 of the 408 links are correct. The number '0.36' is chosen purely for visualizability.

Since an algorithm derived from our framework minimizes the cost-weighted average distances over every pair of examples (see Eq. (2) and its derivation), it is beneficial to further increase the cost of a pair having large c^u_{ij} (since it usually corresponds to a pair of the same class) and to decrease the cost of a pair having small c^u_{ij}. From Eq. (9), it can easily be seen that the effect of the Hadamard power operator is exactly what we need. The good-nearby-examples ratios after applying the Hadamard power operator with α = 8 are illustrated in Figure 9 (Right). Notice that, after applying the operator, even pairs with small values of c^u_{ij} are usually of the same class.

4.2.2 Balance

Balance is an artificial dataset which was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The 4 attributes, containing integer values from 1 to 5, are left weight, left distance, right weight, and right distance. The correct way to find the class is to compare (left distance × left weight) with (right distance × right weight): the greater side wins, and if the two products are equal, the scale is balanced.
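The full dataset can be regenerated from this rule. The sketch below (our own reconstruction, not the UCI generator) enumerates all 5⁴ = 625 attribute combinations and labels them; the class sizes it produces (288 left, 288 right, 49 balanced) match the published UCI balance-scale statistics.

```python
from itertools import product

def generate_balance():
    """Enumerate all 625 examples of the Balance dataset from its generating rule."""
    data = []
    for lw, ld, rw, rd in product(range(1, 6), repeat=4):
        left, right = lw * ld, rw * rd
        label = "L" if left > right else ("R" if right > left else "B")
        data.append(((lw, ld, rw, rd), label))
    return data

data = generate_balance()
counts = {c: sum(1 for _, lab in data if lab == c) for c in "LRB"}
assert len(data) == 625 and counts == {"L": 288, "R": 288, "B": 49}
```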
Therefore, there are 5⁴ = 625 total examples and 3 classes in this dataset. Moreover, the correct decision surface is a 1-dimensional manifold lying in the feature space corresponding to the ⟨·,·⟩² kernel, so we set the target dimensionality d = 1.

This dataset illustrates another flaw of using PCA in a classification task. After centering, the covariance matrix of the 625 examples is just a multiple of I, the identity matrix. Therefore, any direction is a principal component with largest variance, and PCA simply returns a random direction! Hence, we cannot expect much from the classification performance of PCA on this dataset. Thus, PCA cannot help SELF improve much on the performance of LFDA, and sometimes SELF degrades the performance of LFDA due to overfitting. In contrast, SS-LFDA often improves on the performance of LFDA. Also, SS-DNE is able to improve on the classification performance of DNE and LPP* in all settings.

Figure 8: Zoom-in on the square area of Figure 7 (394 of the 408 links are correct at threshold 0.36).

4.2.3 BCI

This dataset originates from the development of a Brain-Computer Interface, where a single person performed 400 trials, in each of which he imagined movements with either the left hand (the 1st class) or the right hand (the 2nd class). In each trial, electroencephalography (EEG) was recorded from 39 electrodes. An autoregressive model of order 3 was fitted to each of the resulting 39 time series, so each trial is represented by a total of 117 = 39 × 3 fitted parameters. The target dimensionality is set to the number of classes, d = c = 2. As with the previous datasets, SS-LFDA and SS-DNE are usually able to outperform their opponents.
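The 117-dimensional BCI representation described above can be sketched as follows: fit an order-3 autoregressive model to each of the 39 channels and concatenate the 3 coefficients per channel. The estimator below uses ordinary least squares on lagged samples, which is an assumption of ours — the benchmark's exact fitting procedure is not specified in this section.

```python
import numpy as np

def ar_coefficients(x, order=3):
    """Least-squares fit of an AR(order) model x_t ~ a_1 x_{t-1} + ... + a_p x_{t-p}."""
    x = np.asarray(x, dtype=float)
    y = x[order:]
    X = np.column_stack([x[order - k: len(x) - k] for k in range(1, order + 1)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def trial_features(trial, order=3):
    """Concatenate AR coefficients of every channel: 39 channels x order 3 = 117."""
    return np.concatenate([ar_coefficients(ch, order) for ch in trial])

rng = np.random.default_rng(5)
trial = rng.standard_normal((39, 200))  # hypothetical EEG trial: 39 channels x 200 samples
assert trial_features(trial).shape == (117,)
```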
Again, PCA is not appropriate for this real-world dataset, and hence SELF is inferior to SS-LFDA.

4.2.4 USPS

This benchmark is derived from the famous USPS dataset of handwritten digit recognition. For each digit, 150 images are randomly drawn. The digits '2' and '5' are assigned to the first class, and all others form the second class. To prevent a user from employing domain knowledge of the data, each example is rescaled, noise-added, dimension-masked and pixel-shuffled [13, Chapter 21]. Although there are only 2 classes in this dataset, the original data presumably form 10 clusters, one for each digit; therefore, the target dimensionality d is set to 10. SS-LFDA and SS-DNE often outperform their opponents. Nevertheless, note that SS-LFDA and SS-DNE do not improve much on LFDA and DNE when ℓ = 100, because 100 labeled examples are quite enough to discriminate the data, and therefore the unlabeled examples offer relatively little information to the semi-supervised learners.

Figure 9: For each number x on the x-axis, the corresponding value on the y-axis is the ratio between the number of good nearby examples (having c^u_{ij} > x and belonging to the same class) and the number of nearby examples (having c^u_{ij} > x). The ratios with respect to C^u_α are demonstrated for (Left) α = 1 (the standard LPP) and (Right) α = 8 (LPP*).

4.2.5 M-Eyale

This face recognition dataset is derived from extended Yale B [32].
There are 28 human subjects under 9 poses and 64 illumination conditions. In our M-Eyale (Modified Extended Yale B), we randomly chose ten subjects, 32 images per subject, from the original dataset and down-sampled each example to a size of 21 × 24 pixels. M-Eyale consists of 5 classes, where each class consists of the images of two randomly-chosen subjects. Hence, there should be two separated clusters for each class, and we should be able to see the advantage of algorithms employing conditions (1*) and (2*) explained in Section 2.1. In this dataset, the number of labeled examples of each class is fixed to ℓ/c so that examples of all classes are observed. Since this dataset should consist of ten clusters, the target dimensionality is set to d = 10.

It is clear that LPP* performs much better than PCA on this dataset. Recall that PCA captures maximum-variance directions; in this face recognition task, however, maximum-variance directions are not discriminant directions but directions of lighting and posing [28]. Therefore, PCA captures totally wrong directions, and hence PCA degrades the performance of SELF relative to LFDA. In contrast, LPP* much better captures local structures in the dataset and discovers much better subspaces. Thus, by cooperating LPP* with LFDA and DNE, SS-LFDA and SS-DNE are able to obtain very good performances.

5 Conclusion

We have presented a unified semi-supervised learning framework for linear and non-linear dimensionality reduction algorithms. The advantage of our framework is that it generalizes various existing supervised, unsupervised and semi-supervised learning frameworks employing spectral methods. Empirical evidence showing satisfactory performance of algorithms derived from our framework has been reported on standard datasets.
Acknowledgements. This work is supported by the Thailand Research Fund.

References

[1] Shuicheng Yan, Dong Xu, Benyu Zhang, Hong-Jiang Zhang, Qiang Yang, and S. Lin. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1):40–51, 2007.
[2] Wei Zhang, Xiangyang Xue, Zichen Sun, Yue-Fei Guo, and Hong Lu. Optimal dimensionality of metric space for classification. In International Conference on Machine Learning, 2007.
[3] Deng Cai, Xiaofei He, and Jiawei Han. Spectral regression for efficient regularized subspace learning. In Computer Vision and Pattern Recognition, 2007.
[4] Steven C. H. Hoi, Wei Liu, Michael R. Lyu, and Wei-Ying Ma. Learning distance metrics with contextual constraints for image retrieval. In Computer Vision and Pattern Recognition, 2006.
[5] Hwann-Tzong Chen, Huang-Wei Chang, and Tyng-Luh Liu. Local discriminant embedding and its variants. In Computer Vision and Pattern Recognition, volume 2, 2005.
[6] J. Cheng, Q. Liu, H. Lu, and Y. W. Chen. A supervised nonlinear local embedding for face recognition. International Conference on Image Processing, 1, 2004.
[7] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, December 2000.
[8] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding, 2000.
[9] X. He and P. Niyogi. Locality preserving projections. In Advances in Neural Information Processing Systems 16, 2004.
[10] L. K. Saul, K. Q. Weinberger, J. H. Ham, F. Sha, and D. D. Lee. Spectral methods for dimensionality reduction. In Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
[11] M. Sugiyama. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. The Journal of Machine Learning Research, 8:1027–1061, 2007.
[12] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.
[13] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. The MIT Press, 2006.
[14] M. Sugiyama, T. Ide, S. Nakajima, and J. Sese. Semi-supervised local Fisher discriminant analysis for dimensionality reduction. PAKDD, pages 333–344, 2008.
[15] Yangqiu Song, Feiping Nie, Changshui Zhang, and Shiming Xiang. A unified framework for semi-supervised dimensionality reduction. Pattern Recognition, 41:2789–2799, 2008.
[16] Ratthachat Chatpatanasiri, Teesid Korsrilabutr, Pasakorn Tangchanachaianan, and Boonserm Kijsirikul. On kernelizing Mahalanobis distance learning algorithms. Arxiv preprint, http://arxiv.org/abs/0804.1441, 2008.
[17] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[18] Jacob Goldberger, Sam Roweis, Geoffrey Hinton, and Ruslan Salakhutdinov. Neighbourhood components analysis. NIPS, pages 513–520, 2005.
[19] A. Globerson and S. Roweis. Metric learning by collapsing classes. NIPS, 18:451–458, 2006.
[20] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. NIPS, 18:1473–1480, 2006.
[21] Liu Yang, Rong Jin, Rahul Sukthankar, and Yi Liu. An efficient algorithm for local distance metric learning. In AAAI, 2006.
[22] Lorenzo Torresani and Kuang C. Lee. Large margin component analysis. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1385–1392. MIT Press, Cambridge, MA, 2007.
[23] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. Advances in Neural Information Processing Systems, 17:1601–1608, 2004.
[24] James R. Schott. Matrix Analysis for Statistics. Wiley, 2nd edition, 2005.
[25] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, June 2004.
[26] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. The MIT Press, December 2001.
[27] Jerome H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84:165–175, 1989.
[28] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, 1997.
[29] X. Yang, H. Fu, H. Zha, and J. Barlow. Semi-supervised nonlinear dimensionality reduction. In ICML, pages 1065–1072, 2006.
[30] Haifeng Li, Tao Jiang, and Keshu Zhang. Efficient and robust feature extraction by maximum margin criterion. IEEE Transactions on Neural Networks, 17(1):157–165, 2006.
[31] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.
[32] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. Pattern Analysis and Machine Intelligence, 23(6):643–660, 2001.
