Learning Transformations for Clustering and Classification
Authors: Qiang Qiu, Guillermo Sapiro
Qiang Qiu (qiang.qiu@duke.edu)
Department of Electrical and Computer Engineering
Duke University, Durham, NC 27708, USA

Guillermo Sapiro (guillermo.sapiro@duke.edu)
Department of Electrical and Computer Engineering, Department of Computer Science, Department of Biomedical Engineering
Duke University, Durham, NC 27708, USA

Abstract

A low-rank transformation learning framework for subspace clustering and classification is here proposed. Many high-dimensional data, such as face images and motion sequences, approximately lie in a union of low-dimensional subspaces. The corresponding subspace clustering problem has been extensively studied in the literature to partition such high-dimensional data into clusters corresponding to their underlying low-dimensional subspaces. However, low-dimensional intrinsic structures are often violated for real-world observations, as they can be corrupted by errors or deviate from ideal models. We propose to address this by learning a linear transformation on subspaces using the nuclear norm as the modeling and optimization criterion. The learned linear transformation restores a low-rank structure for data from the same subspace, and, at the same time, forces a maximally separated structure for data from different subspaces. In this way, we reduce variations within the subspaces, and increase separation between the subspaces for a more robust subspace clustering. This proposed learned robust subspace clustering framework significantly enhances the performance of existing subspace clustering methods. Basic theoretical results here presented help to further support the underlying framework. To exploit the low-rank structures of the transformed subspaces, we further introduce a fast subspace clustering technique, which efficiently combines robust PCA with sparse modeling. When class labels are present at the training stage, we show this low-rank transformation framework also significantly enhances classification performance. Extensive experiments using public datasets are presented, showing that the proposed approach significantly outperforms state-of-the-art methods for subspace clustering and classification. The learned low-cost transform is also applicable to other classification frameworks.

Keywords: subspace clustering, classification, low-rank transformation, nuclear norm, feature learning.

1. Introduction

High-dimensional data often have a small intrinsic dimension. For example, in the area of computer vision, face images of a subject (Basri and Jacobs (February 2003), Wright et al. (2009)), handwritten images of a digit (Hastie and Simard (1998)), and trajectories of a moving object (Tomasi and Kanade (1992)) can all be well-approximated by a low-dimensional subspace of the high-dimensional ambient space. Thus, multiple-class data often lie in a union of low-dimensional subspaces. The ubiquitous subspace clustering problem is to partition high-dimensional data into clusters corresponding to their underlying subspaces. Standard clustering methods such as k-means are in general not applicable to subspace clustering. Various methods have been recently suggested for subspace clustering, such as Sparse Subspace Clustering (SSC) (Elhamifar and Vidal (2013)) (see also its extensions and analysis in Liu et al.
(2010); Soltanolkotabi and Candes (2012); Soltanolkotabi et al. (2013); Wang and Xu (2013)), Local Subspace Affinity (LSA) (Yan and Pollefeys (2006)), Local Best-fit Flats (LBF) (Zhang et al. (2012)), Generalized Principal Component Analysis (Vidal et al. (2003)), Agglomerative Lossy Compression (Ma et al. (2007)), Locally Linear Manifold Clustering (Goh and Vidal (2007)), and Spectral Curvature Clustering (Chen and Lerman (2009)). A recent survey on subspace clustering can be found in Vidal (2011).

Low-dimensional intrinsic structures, which enable subspace clustering, are often violated for real-world data. For example, under the assumption of Lambertian reflectance, Basri and Jacobs (February 2003) show that face images of a subject obtained under a wide variety of lighting conditions can be accurately approximated with a 9-dimensional linear subspace. However, real-world face images are often captured under pose variations; in addition, faces are not perfectly Lambertian, and exhibit cast shadows and specularities (Candès et al. (2011)). Therefore, it is critical for subspace clustering to handle corrupted underlying structures of realistic data, and as such, deviations from ideal subspaces.

When data from the same low-dimensional subspace are arranged as columns of a single matrix, the matrix should be approximately low-rank. Thus, a promising way to handle corrupted data for subspace clustering is to restore such low-rank structure. Recent efforts have been invested in seeking transformations such that the transformed data can be decomposed as the sum of a low-rank matrix component and a sparse error one (Peng et al. (2010); Shen and Wu (2012); Zhang et al. (2011)). Peng et al. (2010) and Zhang et al. (2011) are proposed for image alignment (see Kuybeda et al. (2013) for the extension to multiple classes with applications in cryo-tomography), and Shen and Wu (2012) is discussed in the context of salient object detection. All these methods build on recent theoretical and computational advances in rank minimization.

In this paper, we propose to improve subspace clustering and classification by learning a linear transformation on subspaces using matrix rank, via its nuclear norm convex surrogate, as the optimization criterion. The learned linear transformation recovers a low-rank structure for data from the same subspace, and, at the same time, forces a maximally separated structure for data from different subspaces (actually a high nuclear norm, which, as discussed later, improves the separation between the subspaces). In this way, we reduce variations within the subspaces, and increase separations between the subspaces for more accurate subspace clustering and classification.

For example, as shown in Fig. 1, after faces are detected and aligned, e.g., using Zhu and Ramanan (June 2012), our approach learns linear transformations for face images to restore for the same subject a low-dimensional structure. By comparing the last row to the first row in Fig. 1, we can easily notice that faces from the same subject across different poses are more visually similar in the new transformed space, enabling better face clustering and classification across pose.
This paper makes the following main contributions:

• Subspace low-rank transformation (LRT) is introduced and analyzed in the context of subspace clustering and classification;
• A Learned Robust Subspace Clustering framework (LRSC) is proposed to enhance existing subspace clustering methods;
• A discriminative low-rank (nuclear norm) transformation approach is proposed to reduce the variation within the classes and increase separations between the classes for improved classification;
• We propose a specific fast subspace clustering technique, called Robust Sparse Subspace Clustering (R-SSC), by exploiting the low-rank structures of the learned transformed subspaces;
• We discuss online learning of subspace low-rank transformations for big data;
• We demonstrate through extensive experiments that the proposed approach significantly outperforms state-of-the-art methods for subspace clustering and classification.

The proposed approach can be considered as a way of learning data features, with such features learned in order to reduce within-class rank (nuclear norm), increase between-class separation, and encourage robust subspace clustering. As such, the framework and criteria here introduced can be incorporated into other data classification and clustering problems.

In Section 2, we formulate and analyze the low-rank transformation learning problem. In Sections 3 and 4, we discuss the low-rank transformation for subspace clustering and classification respectively. Experimental evaluations are given in Section 5 on public datasets commonly used for subspace clustering evaluation. Finally, Section 6 concludes the paper.

2. Learning Low-rank Transformations (LRT)

Let $\{\mathcal{S}_c\}_{c=1}^{C}$ be $C$ $n$-dimensional subspaces of $\mathbb{R}^d$ (not all subspaces are necessarily of the same dimension; this is only assumed here to simplify notation). We are given a data set $\mathbf{Y} = \{\mathbf{y}_i\}_{i=1}^{N} \subseteq \mathbb{R}^d$, with each data point $\mathbf{y}_i$ in one of the $C$ subspaces, and in general the data arranged as columns of $\mathbf{Y}$. $\mathbf{Y}_c$ denotes the set of points in the $c$-th subspace $\mathcal{S}_c$, with the points arranged as columns of the matrix $\mathbf{Y}_c$. As data points in $\mathbf{Y}_c$ lie in a low-dimensional subspace, the matrix $\mathbf{Y}_c$ is expected to be low-rank, and such low-rank structure is critical for accurate subspace clustering. However, as discussed above, this low-rank structure is often violated for real data.

Our proposed approach is to learn a global linear transformation on subspaces. Such a linear transformation restores a low-rank structure for data from the same subspace, and, at the same time, encourages a maximally separated structure for data from different subspaces. In this way, we reduce the variation within the subspaces and introduce separations between the subspaces for more robust subspace clustering or classification.

2.1 Preliminary Pedagogical Formulation using Rank

We first assume the data cluster labels are known beforehand for training purposes, an assumption to be removed when discussing the full clustering approach in Section 3.

[Figure 1 rows: original face images; detected and aligned faces; cropped and flipped faces; low-rank transformed faces.]

Figure 1: Learned low-rank transformation on faces across pose. In the second row, the input faces are first detected and aligned, e.g., using the method in Zhu and Ramanan (June 2012).
Pose models defined in Zhu and Ramanan (June 2012) enable an optional crop-and-flip step to retain the more informative side of a face in the third row. Our proposed approach learns linear transformations for face images to restore for the same subject a low-dimensional structure, as shown in the last row. By comparing the last row to the first row, we can easily notice that faces from the same subject across different poses are more visually similar in the new transformed space, enabling better face clustering or recognition across pose (note that the goal is clustering/recognition and not reconstruction).

We adopt matrix rank as the key learning criterion (presented here first for pedagogical reasons, to be later replaced by the nuclear norm), and compute one global linear transformation on all subspaces as

$$ \arg\min_{\mathbf{T}} \; \sum_{c=1}^{C} \mathrm{rank}(\mathbf{T}\mathbf{Y}_c) - \mathrm{rank}(\mathbf{T}\mathbf{Y}), \quad \text{s.t. } \|\mathbf{T}\|_2 = \gamma, \qquad (1) $$

where $\mathbf{T} \in \mathbb{R}^{d \times d}$ is one global linear transformation on all data points (we will later discuss the case when $\mathbf{T}$'s row dimension is less than $d$), $\|\cdot\|_2$ denotes the matrix induced 2-norm, and $\gamma$ is a positive constant. Intuitively, minimizing the first representation term $\sum_{c=1}^{C} \mathrm{rank}(\mathbf{T}\mathbf{Y}_c)$ encourages a consistent representation for the transformed data from the same subspace; and minimizing the second discrimination term $-\mathrm{rank}(\mathbf{T}\mathbf{Y})$ encourages a diverse representation for transformed data from different subspaces (we will later formally discuss that the convex surrogate nuclear norm actually has this desired effect). The normalization condition $\|\mathbf{T}\|_2 = \gamma$ prevents the trivial solution $\mathbf{T} = \mathbf{0}$.

We now explain that the pedagogical formulation in (1) using rank is however not optimal to simultaneously reduce the variation within the same-class subspaces and introduce separations between the different-class subspaces, motivating the use of the nuclear norm not only for optimization reasons but for modeling ones as well. Let $\mathbf{A}$ and $\mathbf{B}$ be matrices of the same dimensions (standing for two classes $\mathbf{Y}_1$ and $\mathbf{Y}_2$ respectively), and let $[\mathbf{A}, \mathbf{B}]$ (standing for $\mathbf{Y}$) be the concatenation of $\mathbf{A}$ and $\mathbf{B}$; we have (Marsaglia and Styan (1972))

$$ \mathrm{rank}([\mathbf{A}, \mathbf{B}]) \le \mathrm{rank}(\mathbf{A}) + \mathrm{rank}(\mathbf{B}), \qquad (2) $$

with equality if and only if $\mathbf{A}$ and $\mathbf{B}$ are disjoint, i.e., they intersect only at the origin (often the analysis of subspace clustering algorithms considers disjoint spaces, e.g., Elhamifar and Vidal (2013)). It is easy to show that (2) can be extended to the concatenation of multiple matrices,

$$ \mathrm{rank}([\mathbf{Y}_1, \mathbf{Y}_2, \mathbf{Y}_3, \cdots, \mathbf{Y}_C]) \le \mathrm{rank}(\mathbf{Y}_1) + \mathrm{rank}([\mathbf{Y}_2, \mathbf{Y}_3, \cdots, \mathbf{Y}_C]) \le \mathrm{rank}(\mathbf{Y}_1) + \mathrm{rank}(\mathbf{Y}_2) + \mathrm{rank}([\mathbf{Y}_3, \cdots, \mathbf{Y}_C]) \le \cdots \le \sum_{c=1}^{C} \mathrm{rank}(\mathbf{Y}_c), \qquad (3) $$

with equality if the matrices are independent. Thus, for (1), we have

$$ \sum_{c=1}^{C} \mathrm{rank}(\mathbf{T}\mathbf{Y}_c) - \mathrm{rank}(\mathbf{T}\mathbf{Y}) \ge 0, \qquad (4) $$

and the objective function (1) reaches the minimum 0 if the matrices are independent after applying the learned transformation $\mathbf{T}$. However, independence does not imply maximal separation, an important goal for robust clustering and classification. For example, two lines intersecting only at the origin are independent regardless of the angle between them, and they are maximally separated only when the angle becomes $\frac{\pi}{2}$. With this intuition in mind, we now proceed to describe our proposed formulation based on the nuclear norm.
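As a quick numerical illustration of the rank relations (2)-(4) and of the limitation just discussed, the following NumPy sketch (our own addition, not part of the original paper) samples two independent random subspaces and shows that the rank-based objective (1) is already zero for $\mathbf{T} = \mathbf{I}$, regardless of how small the angle between the subspaces may be.

```python
import numpy as np

# Illustrative check of (2)-(4); not code from the paper.
rng = np.random.default_rng(0)
d, n = 20, 50                      # ambient dimension, points per class

def subspace_data(dim):
    """Sample n points from a random dim-dimensional subspace of R^d."""
    basis = rng.standard_normal((d, dim))
    return basis @ rng.standard_normal((dim, n))

Y1, Y2 = subspace_data(3), subspace_data(4)   # independent subspaces (with probability one)
Y = np.hstack([Y1, Y2])

rank = np.linalg.matrix_rank
print(rank(Y1), rank(Y2), rank(Y))            # 3 4 7: rank([Y1,Y2]) = rank(Y1) + rank(Y2)

# Objective (1) with T = I is already zero for independent subspaces,
# no matter how small the angle between them is.
T = np.eye(d)
print(rank(T @ Y1) + rank(T @ Y2) - rank(T @ Y))   # 0
```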
2.2 Problem Formulation using Nuclear Norm

Let $\|\mathbf{A}\|_*$ denote the nuclear norm of the matrix $\mathbf{A}$, i.e., the sum of the singular values of $\mathbf{A}$. The nuclear norm $\|\mathbf{A}\|_*$ is the convex envelope of $\mathrm{rank}(\mathbf{A})$ over the unit ball of matrices (Fazel (2002)). As the nuclear norm can be optimized efficiently, it is often adopted as the best convex approximation of the rank function in the literature on rank optimization (see, e.g., Candès et al. (2011) and Recht et al. (2010)).

One factor that fundamentally affects the performance of subspace clustering and classification algorithms is the distance between subspaces. An important notion to quantify the distance (separation) between two subspaces $\mathcal{S}_i$ and $\mathcal{S}_j$ is the smallest principal angle $\theta_{ij}$ (Miao and Ben-Israel (1992), Elhamifar and Vidal (2013)), which is defined as

$$ \theta_{ij} = \min_{\mathbf{u} \in \mathcal{S}_i, \mathbf{v} \in \mathcal{S}_j} \arccos \frac{\mathbf{u}'\mathbf{v}}{\|\mathbf{u}\|_2 \|\mathbf{v}\|_2}. \qquad (5) $$

Note that $\theta_{ij} \in [0, \frac{\pi}{2}]$. We replace the rank function in (1) with the nuclear norm,

$$ \arg\min_{\mathbf{T}} \; \sum_{c=1}^{C} \|\mathbf{T}\mathbf{Y}_c\|_* - \|\mathbf{T}\mathbf{Y}\|_*, \quad \text{s.t. } \|\mathbf{T}\|_2 = \gamma. \qquad (6) $$

[Figure 2 panels, axis ticks omitted:
(a) Original subspaces: $\theta_{AB} = \frac{\pi}{2} = 1.57$.
(b) Transformed subspaces: $\mathbf{T} = [1.00\ 0;\ 0\ 1.00]$; $\theta_{AB} = 1.57$.
(c) Original subspaces: $\theta_{AB} = \frac{\pi}{4} = 0.79$, $|A|_* = 1$, $|B|_* = 1$, $|[A,B]|_* = 1.41$.
(d) Transformed subspaces: $\mathbf{T} = [0.50\ {-0.21};\ {-0.21}\ 0.91]$; $\theta_{AB} = 1.57$, $|A|_* = 1$, $|B|_* = 1$, $|[A,B]|_* = 1.95$.
(e) Original subspaces: $\theta_{AB} = 0.79$, $\theta_{AC} = 0.79$, $\theta_{BC} = 1.05$; $\epsilon_A = 0.0141$, $\epsilon_B = 0.0131$, $\epsilon_C = 0.0148$; $|A|_* = 4.06$, $|B|_* = 4.08$, $|C|_* = 4.16$.
(f) Transformed subspaces: $\mathbf{T} = [0.39\ {-0.16}\ {-0.16};\ {-0.13}\ 0.90\ 0.11;\ {-0.23}\ 0.11\ 0.57]$; $\theta_{AB} = 1.51$, $\theta_{AC} = 1.49$, $\theta_{BC} = 1.57$; $\epsilon_A = 0.0091$, $\epsilon_B = 0.0085$, $\epsilon_C = 0.0114$; $|A|_* = 1.93$, $|B|_* = 2.37$, $|C|_* = 1.20$.]

Figure 2: The learned transformation T using (6) with the nuclear norm as the key criterion. Three subspaces in $\mathbb{R}^3$ are denoted as A (red), B (blue), C (green). We denote the angle between subspaces A and B as $\theta_{AB}$ (and analogously for the other pairs of subspaces). Using (6), we transform A, B, C in (a), (c), (e) to (b), (d), (f) respectively (in the first row the subspace C is empty, this being basically a two-dimensional example). Data points in (e) are corrupted with random noise $\sim \mathcal{N}(0, 0.01)$. We denote the root mean square deviation of points in A from the true subspace as $\epsilon_A$ (and analogously for the other subspaces). We observe that the learned transformation T maximizes the distance between every pair of subspaces towards $\frac{\pi}{2}$, and reduces the deviation of points from the true subspace when noise is present; note how the nuclear norm of the individual subspaces is significantly reduced. Note that, in (c) and (d), we have the same rank values $\mathrm{rank}(A) = 1$, $\mathrm{rank}(B) = 1$, $\mathrm{rank}([A, B]) = 2$, but different nuclear norm values, manifesting the improved between-subspaces separation.

The normalization condition $\|\mathbf{T}\|_2 = \gamma$ prevents the trivial solution $\mathbf{T} = \mathbf{0}$.
Without loss of generality, we set $\gamma = 1$ unless otherwise specified. However, understanding the effects of adopting a different normalization here is interesting and is the subject of future research. Throughout this paper we keep this particular form of the normalization, which was already proven to lead to excellent results.

It is important to note that (6) is not simply a relaxation of (1). Not only is the replacement of the rank by the nuclear norm critical for optimization considerations in reducing the variation within same-class subspaces, but, as we show next, the learned transformation $\mathbf{T}$ using the objective function (6) also maximizes the separation between different-class subspaces (a property missing in (1)), leading to improved clustering and classification performance.

We start by presenting some basic norm relationships for matrices and their corresponding concatenations.

Theorem 1 Let A and B be matrices of the same row dimensions, and let [A, B] be the concatenation of A and B. We have $\|[\mathbf{A}, \mathbf{B}]\|_* \le \|\mathbf{A}\|_* + \|\mathbf{B}\|_*$.
Proof: See Appendix A.

Theorem 2 Let A and B be matrices of the same row dimensions, and let [A, B] be the concatenation of A and B. We have $\|[\mathbf{A}, \mathbf{B}]\|_* = \|\mathbf{A}\|_* + \|\mathbf{B}\|_*$ when the column spaces of A and B are orthogonal.
Proof: See Appendix B.

It is easy to see that Theorems 1 and 2 can be extended to the concatenation of multiple matrices. Thus, for (6), we have

$$ \sum_{c=1}^{C} \|\mathbf{T}\mathbf{Y}_c\|_* - \|\mathbf{T}\mathbf{Y}\|_* \ge 0. \qquad (7) $$

Based on (7) and Theorem 2, the proposed objective function (6) reaches the minimum 0 if the column spaces of every pair of matrices are orthogonal after applying the learned transformation $\mathbf{T}$; or, equivalently, (6) reaches the minimum 0 when the separation between every pair of subspaces is maximized after transformation, i.e., the smallest principal angle between subspaces equals $\frac{\pi}{2}$. Note that such improved separation is not obtained if the rank is used in the second term of (6), thereby further justifying the use of the nuclear norm instead.

We have then, both intuitively and theoretically, justified the selection of the criterion (6) for learning the transform $\mathbf{T}$. We now illustrate the properties of the learned transformation $\mathbf{T}$ using synthetic examples in Fig. 2 (real examples are presented in Section 5). Here we adopt a projected subgradient method described in Appendix C (though other modern nuclear norm optimization techniques could be considered, including recent real-time formulations, Sprechmann et al. (2012)) to search for the transformation matrix $\mathbf{T}$ that minimizes (6). As shown in Fig. 2, the learned transformation $\mathbf{T}$ via (6) maximizes the separation between every pair of subspaces towards $\frac{\pi}{2}$, and reduces the deviation of the data points from the true subspace when noise is present. Note that, comparing Fig. 2c to Fig. 2d, the learned transformation using (6) maximizes the angle between the subspaces, and the nuclear norm changes from $|[A, B]|_* = 1.41$ to $|[A, B]|_* = 1.95$ to make $|A|_* + |B|_* - |[A, B]|_* \approx 0$; however, in both cases, where the subspaces are independent, $\mathrm{rank}([A, B]) = 2$, and $\mathrm{rank}(A) + \mathrm{rank}(B) - \mathrm{rank}([A, B]) = 0$.
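For concreteness, the following NumPy sketch shows one way to run a projected subgradient iteration on (6). It is only an illustration: it relies on the standard fact that $\mathbf{U}\mathbf{V}'$ from the thin SVD of a matrix is a subgradient of its nuclear norm, and the step size, iteration count, and the rescaling used as the projection are our assumptions rather than the exact procedure of Appendix C.

```python
import numpy as np

def nuclear_subgrad(M):
    """A subgradient of ||M||_* : U V^T from the thin SVD of M."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def learn_transform(Ys, n_iter=100, step=0.02, gamma=1.0):
    """Minimize sum_c ||T Yc||_* - ||T Y||_* subject to ||T||_2 = gamma by
    projected subgradient descent (illustrative sketch; details of Appendix C may differ)."""
    Y = np.hstack(Ys)
    d = Y.shape[0]
    T = np.eye(d)
    for _ in range(n_iter):
        G = sum(nuclear_subgrad(T @ Yc) @ Yc.T for Yc in Ys)   # within-class term
        G -= nuclear_subgrad(T @ Y) @ Y.T                      # discrimination term
        T -= step * G
        T *= gamma / np.linalg.norm(T, 2)                      # rescale so ||T||_2 = gamma
    return T
```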
2.3 Comparisons with Other Transformations

For independent subspaces, a transformation that renders them pairwise orthogonal can be obtained in closed form as follows: we take a basis $\mathbf{U}_c$ for the column space of $\mathbf{Y}_c$ for each subspace, form the matrix $\mathbf{U} = [\mathbf{U}_1, ..., \mathbf{U}_C]$, and then obtain the orthogonalizing transformation as $\mathbf{T} = (\mathbf{U}'\mathbf{U})^{-1}\mathbf{U}'$. To further elaborate on the properties of our learned transformation, using synthetic examples, we compare with the closed-form orthogonalizing transformation in Fig. 3 and with linear discriminant analysis (LDA) in Fig. 4.

Two intersecting planes are shown in Fig. 3a. Though the subspaces here are neither independent nor disjoint, the closed-form orthogonalizing transformation still significantly increases the angle between the two planes towards $\frac{\pi}{2}$ in Fig. 3b (note that the angle for the common line here is always 0). Note also that the closed-form orthogonalizing transformation is of size $r \times d$, where $r$ is the sum of the dimensions of the subspaces, and we plot just the first 3 dimensions for visualization. Compared to the orthogonalizing transformation, our learned transformation in Fig. 3c introduces similar subspace separation, but enables significantly reduced within-subspace variations, indicated by the decreased nuclear norm values (close to 1). The same set of experiments with a different number of samples per subspace is shown in the second row of Fig. 3. Our formulation in (6) not only maximizes the separations between the different-class subspaces, but also simultaneously reduces the variations within the same-class subspaces.

[Figure 3 panels, axis ticks omitted:
(a) Original subspaces: 200 samples per plane, $\theta_{AB} = 0.31$, $|A|_* = 1.91$, $|B|_* = 1.88$.
(b) Orthogonalized subspaces: $\mathbf{T} = [-0.36\ 0\ 0.10;\ -1.4\ 0.09\ -2.98;\ -0.92\ 0\ 1.54;\ 1.06\ -1.12\ 2.63]$; $\theta_{AB} = 1.41$, $|A|_* = 1.61$, $|B|_* = 1.62$.
(c) Transformed subspaces: $\mathbf{T} = [0.14\ 0.03\ 0.33;\ -0.01\ 0.14\ -0.16;\ 0.29\ -0.16\ 0.86]$; $\theta_{AB} = 1.55$, $|A|_* = 1.06$, $|B|_* = 1.06$.
(d) Original subspaces: 75 samples per plane, $\theta_{AB} = 0.31$, $|A|_* = 1.92$, $|B|_* = 1.81$.
(e) Orthogonalized subspaces: $\mathbf{T} = [-1.62\ 0\ -1.94;\ -0.50\ 0\ -2.38;\ 0.25\ -0.50\ 1.81;\ 1.12\ -1.50\ 2.37]$; $\theta_{AB} = 1.04$, $|A|_* = 1.75$, $|B|_* = 1.71$.
(f) Transformed subspaces: $\mathbf{T} = [0.06\ -0.12\ 0.30;\ -0.01\ 0.13\ -0.15;\ 0.33\ -0.12\ 0.86]$; $\theta_{AB} = 1.55$, $|A|_* = 1.08$, $|B|_* = 1.07$.]

Figure 3: Comparisons with the closed-form orthogonalizing transformation. Two intersecting planes are shown in (a), and each plane contains 200 points. The closed-form orthogonalizing transformation significantly increases the angle between the two planes towards $\frac{\pi}{2}$ in (b). Our learned transformation in (c) introduces similar subspace separation, but simultaneously enables significantly reduced within-subspace variation, indicated by the smaller nuclear norm values (close to 1). The same set of experiments with 75 points per subspace is shown in the second row.

Our learned transformation shares a similar methodology with LDA, i.e., minimizing intra-class variation and maximizing inter-class separation. Two classes $\mathbf{Y}_+$ and $\mathbf{Y}_-$ are shown in Fig. 4a, each class consisting of two lines. Our learned transformation in Fig. 4c shows smaller intra-class variation than LDA in Fig. 4b by merging the two lines in each class, and simultaneously maximizes the angle between the two classes towards $\frac{\pi}{2}$ (such two-class clustering and classification is critical, for example, for tree-based techniques, Qiu and Sapiro (April, 2014)). Note that we usually use LDA to reduce the data dimension to the number of classes minus 1; however, to better emphasize the distinction, we learn a $(d-1) \times d$ sized transformation matrix with both methods. The closed-form orthogonalizing transformation discussed above also gives higher intra-class variations, with $|Y_+|_* = 1.45$ and $|Y_-|_* = 1.68$. Fig. 4d shows an example of two non-linearly separable classes, i.e., two intersecting planes, which cannot be improved by LDA, as shown in Fig. 4e. However, our learned transformation in Fig. 4f prepares the data to be separable using subspace clustering. As shown in Qiu and Sapiro (April, 2014), the property demonstrated above makes our learned transformation a better learner than LDA in a binary classification tree.

[Figure 4 panels, axis ticks omitted:
(a) Original subspaces: two classes $\{Y_+, Y_-\}$, $Y_+ = \{A\ (\text{blue}), B\ (\text{cyan})\}$, $Y_- = \{C\ (\text{yellow}), D\ (\text{red})\}$; $\theta_{AB} = 1.1$, $\theta_{AC} = 1.1$, $\theta_{AD} = 1.1$, $\theta_{BC} = 1.3$, $\theta_{BD} = 1.4$, $\theta_{CD} = 0.5$; $|Y_+|_* = 1.58$, $|Y_-|_* = 1.27$.
(b) LDA subspaces: $\mathbf{T} = [-3.64\ -1.95\ 5.98;\ 0.19\ 3.87\ 3.35]$; $\theta_{AB} = 0.7$, $\theta_{AC} = 0.78$, $\theta_{AD} = 0.2$, $\theta_{BC} = 1.5$, $\theta_{BD} = 0.9$, $\theta_{CD} = 0.57$; $|Y_+|_* = 1.35$, $|Y_-|_* = 1.27$.
(c) Transformed subspaces: $\mathbf{T} = [1.47\ 0.26\ -0.73;\ 0.07\ 0.06\ -1.62]$; $\theta_{AB} = 0.04$, $\theta_{AC} = 1.54$, $\theta_{AD} = 1.54$, $\theta_{BC} = 1.55$, $\theta_{BD} = 1.56$, $\theta_{CD} = 0.01$; $|Y_+|_* = 1.02$, $|Y_-|_* = 1.00$.
(d) Original subspaces: two classes $\{A\ (\text{blue}), B\ (\text{red})\}$; $\theta_{AB} = 0.31$, $|A|_* = 1.91$, $|B|_* = 1.88$.
(e) LDA subspaces: $\mathbf{T} = [-0.54\ 2.60\ -9.51;\ 0.56\ -3.21\ -1.02]$; $\theta_{AB} = 0$, $|A|_* = 1.52$, $|B|_* = 1.69$.
(f) Transformed subspaces: $\mathbf{T} = [0.49\ -0.11\ 1.27;\ -0.09\ 0.29\ -0.59]$; $\theta_{AB} = 1.57$, $|A|_* = 1.08$, $|B|_* = 1.03$.]

Figure 4: Comparisons with linear discriminant analysis (LDA). Two classes $Y_+$ and $Y_-$ are shown in (a), each class consisting of two lines. We notice that our learned transformation (c) shows smaller intra-class variation than LDA in (b) by merging the two lines in each class, and simultaneously maximizes the angle between the two classes towards $\frac{\pi}{2}$ (such two-class clustering and classification is critical, for example, for tree-based techniques, Qiu and Sapiro (April, 2014)). (d) shows an example of two non-linearly separable classes, i.e., two intersecting planes, which cannot be improved by LDA in (e). However, our learned transformation in (f) prepares the data to be separable using subspace clustering.

Lastly, we generated an interesting disjoint case: we consider three lines A, B and C on the same plane that intersect at the origin; the angles between them are $\theta_{AB} = 0.08$, $\theta_{BC} = 0.08$, and $\theta_{AC} = 0.17$. As the closed-form orthogonalizing approach is only valid for independent subspaces, it fails here, producing $\theta_{AB} = 0.005$, $\theta_{BC} = 0.005$, $\theta_{AC} = 0.01$. Our framework is not limited in that way, even if additional theoretical foundations are yet to come. After our learned transformation, we have $\theta_{AB} = 1.20$, $\theta_{BC} = 1.20$, and $\theta_{AC} = 0.75$. We can make two immediate observations: First, all angles are significantly increased within the
valid range of $[0, \frac{\pi}{2}]$. Second, $\theta_{AB} + \theta_{BC} + \theta_{AC} = \pi$ (we made the same two observations while repeating the experiments with different subspace angles). Though at this point we have no clean interpretation of how those angles are balanced when pairwise orthogonality is not possible, we strongly believe that some theory lies behind the above persistent observations, and we are currently exploring this.

2.4 Discussions about Other Matrix Norms

We now discuss the advantages of replacing the rank function in (1) with the nuclear norm over other (popular) matrix norms, e.g., the induced 2-norm and the Frobenius norm.

Proposition 3 Let A and B be matrices of the same row dimensions, and let [A, B] be the concatenation of A and B. We have $\|[\mathbf{A}, \mathbf{B}]\|_2 \le \|\mathbf{A}\|_2 + \|\mathbf{B}\|_2$, with equality if at least one of the two matrices is zero.

Proposition 4 Let A and B be matrices of the same row dimensions, and let [A, B] be the concatenation of A and B. We have $\|[\mathbf{A}, \mathbf{B}]\|_F \le \|\mathbf{A}\|_F + \|\mathbf{B}\|_F$, with equality if and only if at least one of the two matrices is zero.

We choose the nuclear norm in (6) for two major advantages that other (popular) matrix norms do not share:

• The nuclear norm is the best convex approximation of the rank function (Fazel (2002)), which helps to reduce the variation within the subspaces (first term in (6));
• The objective function (6) is optimized when the distance between every pair of subspaces is maximized after transformation, which helps to introduce separations between the subspaces. Note that (1), which is based on the rank, reaches its minimum when the subspaces are independent but not necessarily maximally distant.

Propositions 3 and 4 show that the property of the nuclear norm in Theorem 1 also holds for the induced 2-norm and the Frobenius norm. However, if we replace the rank function in (1) with the induced 2-norm or the Frobenius norm, the objective function is minimized at the trivial solution $\mathbf{T} = \mathbf{0}$, which is prevented by the normalization condition $\|\mathbf{T}\|_2 = \gamma$ ($\gamma > 0$).
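The following short NumPy check (ours, purely illustrative) verifies the concatenation inequalities of Theorem 1 and Propositions 3-4 on random matrices, and the equality case of Theorem 2 for matrices with orthogonal column spaces.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 4))
B = rng.standard_normal((5, 6))
AB = np.hstack([A, B])

# Theorem 1 and Propositions 3-4: ||[A,B]|| <= ||A|| + ||B|| for all three norms.
for name, ord_ in [("nuclear", "nuc"), ("induced 2-norm", 2), ("Frobenius", "fro")]:
    lhs = np.linalg.norm(AB, ord_)
    rhs = np.linalg.norm(A, ord_) + np.linalg.norm(B, ord_)
    print(f"{name:15s}  ||[A,B]|| = {lhs:.3f}  <=  ||A|| + ||B|| = {rhs:.3f}")

# Theorem 2: equality for the nuclear norm when the column spaces are orthogonal.
A_orth = np.vstack([rng.standard_normal((2, 4)), np.zeros((3, 4))])  # columns in span(e1, e2)
B_orth = np.vstack([np.zeros((2, 6)), rng.standard_normal((3, 6))])  # columns in span(e3, e4, e5)
lhs = np.linalg.norm(np.hstack([A_orth, B_orth]), "nuc")
rhs = np.linalg.norm(A_orth, "nuc") + np.linalg.norm(B_orth, "nuc")
print(lhs, rhs)   # equal up to numerical precision
```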
2.5 Online Learning of Low-rank Transformations

When the data Y is big, we use an online algorithm to learn the low-rank transformation T:

• We first randomly partition the data set Y into B mini-batches;
• Using mini-batch subgradient descent, a variant of stochastic subgradient descent, the subgradient in (16) in Appendix C is approximated by a sum of subgradients obtained from each mini-batch of samples,

$$ \mathbf{T}^{(t+1)} = \mathbf{T}^{(t)} - \nu \sum_{b=1}^{B} \Delta\mathbf{T}_b, \qquad (8) $$

where $\Delta\mathbf{T}_b$ is obtained from (17) in Appendix C using only the data points in the b-th mini-batch;
• Starting with the first mini-batch, we learn the subspace transformation $\mathbf{T}_b$ using data only in the b-th mini-batch, with $\mathbf{T}_{b-1}$ as a warm restart.

2.6 Subspace Transformation with Compression

Given data $\mathbf{Y} \subseteq \mathbb{R}^d$, so far we have considered a square linear transformation $\mathbf{T}$ of size $d \times d$. If we devise a "fat" linear transformation $\mathbf{T}$ of size $r \times d$, where $r < d$, we enable dimension reduction along with the transformation. This connects the proposed framework with the literature on compressed sensing, though the goal here is to learn a "sensing" matrix $\mathbf{T}$ for subspace classification and not for reconstruction (Carson et al. (2012)). The nuclear-norm minimization provides a new metric for such a compressed sensing design (or compressed feature learning) paradigm. Results with this reduced dimensionality will be presented in Section 5.

3. Subspace Clustering using Low-rank Transformations

We now move from classification, where we learned the transform from labeled training data, to clustering, where no training data is available. In particular, we address the subspace clustering problem, meaning to partition the data set Y into C clusters corresponding to their underlying subspaces. We first present a general procedure to enhance the performance of existing subspace clustering methods in the literature. Then we further propose a specific fast subspace clustering technique to fully exploit the low-rank structure of the (learned) transformed subspaces.

3.1 A Learned Robust Subspace Clustering (LRSC) Framework

In clustering tasks, the data labeling is of course not known beforehand in practice. The proposed algorithm, Algorithm 1, iterates between two stages: In the first assignment stage, we obtain clusters using any subspace clustering method, e.g., SSC (Elhamifar and Vidal (2013)), LSA (Yan and Pollefeys (2006)), or LBF (Zhang et al. (2012)); in particular, in this paper we often use the new improved technique introduced in Section 3.2. In the second update stage, based on the current clustering result, we compute the optimal subspace transformation that minimizes (6). The algorithm is repeated until the clustering assignments stop changing.

The LRSC algorithm is a general procedure to enhance the performance of any subspace clustering method (part of the beauty of the proposed model is that it can be applied to any such algorithm, and even beyond, Qiu and Sapiro (April, 2014)). For the sake of this versatility, we do not enforce an overall objective function in the present form.
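To make the two-stage iteration concrete before stating Algorithm 1, here is a minimal sketch of the LRSC loop. It is our own illustration: `cluster_fn` stands for any subspace clustering method used in the assignment stage, and `learn_transform` for the transformation update of Section 2; both names are placeholders, not functions defined in the paper.

```python
import numpy as np

def lrsc(Y, n_clusters, cluster_fn, learn_transform, max_iter=20):
    """Schematic LRSC loop: alternate between (i) clustering the transformed data
    with any subspace clustering method and (ii) re-learning T from the current
    clusters by minimizing (6).
    cluster_fn(X, k) -> integer labels; learn_transform(list_of_Yc) -> T."""
    d = Y.shape[0]
    T = np.eye(d)                                   # initialize T as the identity
    labels = None
    for _ in range(max_iter):
        new_labels = cluster_fn(T @ Y, n_clusters)              # assignment stage
        if labels is not None and np.array_equal(new_labels, labels):
            break                                               # assignments stopped changing
        labels = new_labels
        Ys = [Y[:, labels == c] for c in range(n_clusters)]
        T = learn_transform(Ys)                                 # update stage: minimize (6)
    return labels, T
```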
To study convergence, one way is to adopt, for the LRSC assignment step, a subspace clustering method that optimizes the same LRSC update criterion (6): given the cluster assignment and the transformation T at the current LRSC iteration, we take a point $\mathbf{y}_i$ out of its current cluster (keeping the rest of the assignments unchanged) and place it into the cluster $\mathbf{Y}_c$ that minimizes $\sum_{c=1}^{C} \|\mathbf{T}\mathbf{Y}_c\|_*$. We iteratively perform this for all points, and then update $\mathbf{T}$ using the current $\mathbf{T}$ as a warm restart. In this way, we decrease (or keep) the overall objective function (6) after each LRSC iteration. However, the above approach is computationally expensive and only allows one specific subspace clustering method. Thus, in the present implementation, an overall objective function of the type that the LRSC algorithm optimizes can take a form such as

$$ \arg\min_{\mathbf{T}, \{\mathcal{S}_c\}_{c=1}^{C}} \; \sum_{c=1}^{C} \sum_{\mathbf{y}_i \in \mathcal{S}_c} \|\mathbf{T}\mathbf{y}_i - P_{\mathbf{T}\mathbf{Y}_c}\mathbf{T}\mathbf{y}_i\|_2^2 + \lambda \left[ \sum_{c=1}^{C} \|\mathbf{T}\mathbf{Y}_c\|_* - \|\mathbf{T}\mathbf{Y}\|_* \right], \quad \text{s.t. } \|\mathbf{T}\|_2 = \gamma, \qquad (9) $$

where $\mathbf{Y}_c$ denotes the set of points $\mathbf{y}_i$ in the c-th subspace $\mathcal{S}_c$, and $P_{\mathbf{T}\mathbf{Y}_c}$ denotes the projection onto $\mathbf{T}\mathbf{Y}_c$. The LRSC iterative algorithm optimizes (9) through alternating minimization (with a similar form as the popular k-means, but with a different data model and with the learned transform). While formally studying its convergence is the subject of future research, the experimental validation presented already demonstrates excellent performance, with LRSC being just one of the possible applications of the proposed learned transform. In all our experiments, we observe significant clustering error reduction in the first few LRSC iterations, and the proposed LRSC iterations enable significantly cleaner subspaces for all subspace clustering benchmark data in the literature. The intuition behind the observed empirical convergence is that the update step in each LRSC iteration decreases the second term in (9) to a small value close to 0, as discussed in Section 2; at the same time, the updated transformation tends to reduce the intra-subspace variation, which further reduces the first cluster-deviation term in (9), even with assignments derived from various subspace clustering methods.

Input: A set of data points $\mathbf{Y} = \{\mathbf{y}_i\}_{i=1}^{N} \subseteq \mathbb{R}^d$ in a union of C subspaces.
Output: A partition of Y into C disjoint clusters $\{\mathbf{Y}_c\}_{c=1}^{C}$ based on the underlying subspaces.
begin
  1. Initialize the transformation matrix T as the identity matrix;
  repeat
    Assignment stage:
    2. Assign points in TY to clusters with any subspace clustering method, e.g., the proposed R-SSC;
    Update stage:
    3. Obtain the transformation T by minimizing (6) based on the current clustering result;
  until assignment convergence;
  4. Return the current clustering result $\{\mathbf{Y}_c\}_{c=1}^{C}$;
end
Algorithm 1: Learned Robust Subspace Clustering (LRSC) framework.

3.2 Robust Sparse Subspace Clustering (R-SSC)

Though Algorithm 1 can adopt any subspace clustering method, to fully exploit the low-rank structure of the learned transformed subspaces, we further propose the following specific technique for the clustering step in the LRSC framework, called Robust Sparse Subspace Clustering (R-SSC):

1. For the transformed subspaces, we first recover their low-rank representation L by performing a low-rank decomposition (10), e.g., using RPCA (Candès et al.
(2011)),[1]

$$ \arg\min_{\mathbf{L}, \mathbf{S}} \; \|\mathbf{L}\|_* + \beta\|\mathbf{S}\|_1 \quad \text{s.t. } \mathbf{T}\mathbf{Y} = \mathbf{L} + \mathbf{S}. \qquad (10) $$

2. Each transformed point $\mathbf{T}\mathbf{y}_i$ is then sparsely decomposed over L,

$$ \arg\min_{\mathbf{x}_i} \; \|\mathbf{T}\mathbf{y}_i - \mathbf{L}\mathbf{x}_i\|_2^2 \quad \text{s.t. } \|\mathbf{x}_i\|_0 \le K, \qquad (11) $$

where K is a predefined sparsity value ($K > d$). As explained in Elhamifar and Vidal (2013), a data point in a linear or affine subspace of dimension d can be written as a linear or affine combination of d or d + 1 points in the same subspace. Thus, if we represent a point as a linear or affine combination of all other points, a sparse linear or affine combination can be obtained by choosing d or d + 1 nonzero coefficients.

3. As the optimization process for (11) is computationally demanding, we further simplify (11) using Local Linear Embedding (Roweis and Saul (2000), Wang et al. (2010)). Each transformed point $\mathbf{T}\mathbf{y}_i$ is represented using its K Nearest Neighbors (NN) in L, which are denoted as $\mathbf{L}_i$,

$$ \arg\min_{\mathbf{x}_i} \; \|\mathbf{T}\mathbf{y}_i - \mathbf{L}_i\mathbf{x}_i\|_2^2 \quad \text{s.t. } \|\mathbf{x}_i\|_1 = 1. \qquad (12) $$

Let $\bar{\mathbf{L}}_i = \mathbf{L}_i - \mathbf{1}(\mathbf{T}\mathbf{y}_i)^{\top}$. $\mathbf{x}_i$ can then be efficiently obtained in closed form, $\mathbf{x}_i = \bar{\mathbf{L}}_i\bar{\mathbf{L}}_i^{\top} \backslash \mathbf{1}$, where $\mathbf{x} = \mathbf{A} \backslash \mathbf{B}$ solves the system of linear equations $\mathbf{A}\mathbf{x} = \mathbf{B}$. As suggested in Roweis and Saul (2000), if the correlation matrix $\bar{\mathbf{L}}_i\bar{\mathbf{L}}_i^{\top}$ is nearly singular, it can be conditioned by adding a small multiple of the identity matrix. From experiments, we observe that this simplification step dramatically reduces the running time without sacrificing accuracy.

4. Given the sparse representation $\mathbf{x}_i$ of each transformed data point $\mathbf{T}\mathbf{y}_i$, we denote the sparse representation matrix as $\mathbf{X} = [\mathbf{x}_1 \ldots \mathbf{x}_N]$. Note that $\mathbf{x}_i$ is written as an N-sized vector with no more than $K \ll N$ non-zero values (N being the total number of data points). The pairwise affinity matrix is now defined as $\mathbf{W} = |\mathbf{X}| + |\mathbf{X}^{\top}|$, and the subspace clustering is obtained using spectral clustering (Luxburg (2007)).

Based on the experimental results presented in Section 5, the proposed R-SSC outperforms state-of-the-art subspace clustering techniques in both accuracy and running time, e.g., it is about 500 times faster than the original SSC using the implementation provided in Elhamifar and Vidal (2013). Performance is further enhanced when R-SSC is used as an internal step of LRSC in Algorithm 1.

[1] Note that while the learned transform T encourages low rank in each subspace, outliers might still exist. Moreover, during the iterations in Algorithm 1, the intermediate learned T is not yet the desired one. This justifies the incorporation of this further low-rank decomposition.

4. Classification using Single or Multiple Low-rank Transformations

In Section 2, learning one global transformation over all classes has been discussed, and then incorporated into a clustering framework in Section 3. The availability of data labels for training enables us to consider instead learning individual class-based linear transformations. The problem of class-based linear transformation learning can be formulated as

$$ \arg\min_{\{\mathbf{T}_c\}_{c=1}^{C}} \; \sum_{c=1}^{C} \left[ \|\mathbf{T}_c\mathbf{Y}_c\|_* - \lambda\|\mathbf{T}_c\mathbf{Y}_{\neg c}\|_* \right], \qquad (13) $$

where $\mathbf{T}_c \in \mathbb{R}^{d \times d}$ denotes the transformation for the c-th class, $\mathbf{Y}_{\neg c} = \mathbf{Y} \setminus \mathbf{Y}_c$ denotes all data except the c-th class, and $\lambda$ is a positive balance parameter.

When a global transformation matrix T is learned, we can perform classification in the transformed space by simply considering the transformed data TY as the new features.
For example, when a Nearest Neighbor (NN) classifier is used, a testing sample y uses Ty as its feature and searches for nearest neighbors among TY. To fully exploit the low-rank structure of the transformed data, we propose to perform classification through the following procedure:

• For the c-th class, we first recover its low-rank representation $\mathbf{L}_c$ by performing a low-rank decomposition (14), e.g., using RPCA (Candès et al. (2011)):[2]

$$ \arg\min_{\mathbf{L}_c, \mathbf{S}_c} \; \|\mathbf{L}_c\|_* + \beta\|\mathbf{S}_c\|_1 \quad \text{s.t. } \mathbf{T}\mathbf{Y}_c = \mathbf{L}_c + \mathbf{S}_c. \qquad (14) $$

• Each testing image y is then assigned to the low-rank subspace $\mathbf{L}_c$ that gives the minimal reconstruction error through the sparse decomposition (15), e.g., using OMP (Pati et al. (Nov. 1993)),

$$ \arg\min_{\mathbf{x}} \; \|\mathbf{T}\mathbf{y} - \mathbf{L}_c\mathbf{x}\|_2^2 \quad \text{s.t. } \|\mathbf{x}\|_0 \le T, \qquad (15) $$

where $T$ is a predefined sparsity value.

When class-based transformations $\{\mathbf{T}_c\}_{c=1}^{C}$ are learned, we perform recognition in a similar way. However, now we apply all the learned transforms $\mathbf{T}_c$ to each testing data point and then pick the best one using the same criterion of minimal reconstruction error through the sparse decomposition (15).

[2] Note that this is done only once and can be considered part of the training stage. As before, this further low-rank decomposition helps to handle outliers not addressed by the learned transform.

5. Experimental Evaluation

This section first presents experimental evaluations on subspace clustering using three public datasets (standard benchmarks): the MNIST handwritten digit dataset, the Extended YaleB face dataset (Georghiades et al. (2001)), and the Hopkins 155 database of motion segmentation. The MNIST dataset consists of 8-bit grayscale handwritten digit images of "0" through "9", with 7000 examples for each class. The Extended YaleB face dataset contains 38 subjects with near-frontal pose under 64 lighting conditions. All the images are resized to 16 × 16. The classical Hopkins 155 database of motion segmentation, which is available at http://www.vision.jhu.edu/data/hopkins155, contains 155 video sequences along with extracted feature trajectories, where 120 of the videos have two motions and 35 of the videos have three motions.

The subspace clustering methods compared are SSC (Elhamifar and Vidal (2013)), LSA (Yan and Pollefeys (2006)), and LBF (Zhang et al. (2012)). Based on the studies in Elhamifar and Vidal (2013), Vidal (2011) and Zhang et al. (2012), these three methods exhibit state-of-the-art subspace clustering performance. We adopt the LSA and SSC implementations provided in Elhamifar and Vidal (2013) from http://www.vision.jhu.edu/code/, and the LBF implementation provided in Zhang et al. (2012) from http://www.ima.umn.edu/~zhang620/lbf/. We adopt similar setups as described in Zhang et al. (2012) for the experiments on subspace clustering.

This section then presents experimental evaluations on classification using two public face datasets: the CMU PIE dataset (Sim et al. (2003)) and the Extended YaleB dataset. The PIE dataset consists of 68 subjects imaged simultaneously under 13 different poses and 21 lighting conditions. All the face images are resized to 20 × 20. We adopt a NN classifier unless otherwise specified.

5.1 Subspace Clustering with Illustrative Examples

For illustration purposes, we conduct the first set of experiments on a subset of the MNIST dataset. We adopt a similar setup as described in Zhang et al.
(2012), using the same sets of 2 or 3 digits, and randomly choose 200 images for each digit. We set the sparsity value K = 6 for R-SSC, and perform 100 iterations of the subgradient updates while learning the transformation on subspaces. The subgradient update step was ν = 0.02 (see Appendix C for details on the projected subgradient optimization algorithm).

Unless otherwise stated, we do not perform dimension reduction, such as PCA or random projections, to preprocess the data, thereby further saving computations (please note that the learned transform can itself reduce dimensions if so desired, see Section 5.7). In the literature, e.g., Elhamifar and Vidal (2013), Vidal (2011) and Zhang et al. (2012), projection to a very low dimension is usually performed to enhance the clustering performance. However, it is often not obvious how to determine the correct projection dimension for real data, and many subspace clustering methods are sensitive to the choice of the projection dimension. This dimension reduction step is not needed in the framework here proposed.

Fig. 5 shows the misclassification rate (e) and running time (t) on clustering subspaces of two digits. The misclassification rate is the ratio of misclassified points to the total number of points.[3] For visualization purposes, the data are plotted with the dimension reduced to 2 using Laplacian Eigenmaps (Belkin and Niyogi (2003)). Different clusters are represented by different colors and the ground truth is plotted using the true cluster labels. The proposed R-SSC outperforms state-of-the-art methods, both in terms of clustering accuracy and running time. The clustering error of R-SSC is further reduced using the proposed LRSC framework in Algorithm 1 through the learned low-rank subspace transformation. The clustering converges after about 3 LRSC iterations. The learned transformation not only recovers a low-rank structure for data from the same subspace, but also increases the separations between the subspaces for more accurate clustering.

[3] Meaning the ratio of points that were assigned to the wrong cluster.

[Figure 5 panels (unsupervised clustering of MNIST digits; e: misclassification rate, t: running time):
(a) Original subspaces for digits {1, 2}: LBF e = 9.2417%, t = 78.00 sec; LSA e = 9.0047%, t = 11.37 sec; SSC e = 4.0284%, t = 447.66 sec; R-SSC (iter = 0) e = 3.3175%, t = 0.93 sec.
(b) Transformed subspaces for digits {1, 2}: iter = 1, e = 2.8436%; iter = 2, e = 1.8957%; iter = 3, e = 1.4218%; iter = 4, e = 1.8957%; iter = 5, e = 1.8957%.
(c) Original subspaces for digits {1, 7}: LBF e = 16.4352%, t = 74.94 sec; LSA e = 2.3148%, t = 11.31 sec; SSC e = 2.0833%, t = 458.15 sec; R-SSC (iter = 0) e = 1.1574%, t = 0.95 sec.
(d) Transformed subspaces for digits {1, 7}: iter = 1, e = 0.69444%; iter = 2 through 5, e = 0.46296%.]

Figure 5: Misclassification rate (e) and running time (t) on clustering 2 digits. Methods compared are SSC (Elhamifar and Vidal (2013)), LSA (Yan and Pollefeys (2006)), and LBF (Zhang et al. (2012)). For visualization, the data are plotted with the dimension reduced to 2 using Laplacian Eigenmaps (Belkin and Niyogi (2003)). Different clusters are represented by different colors and the ground truth is plotted with the true cluster labels. iter indicates the number of LRSC iterations in Algorithm 1. The proposed R-SSC outperforms state-of-the-art methods in terms of both clustering accuracy and running time, e.g., it is about 500 times faster than SSC. The clustering performance of R-SSC is further improved using the proposed LRSC framework. Note how the data is clearly clustered in clean subspaces in the transformed domain (best viewed zooming in on screen).
[Figure 6 panels:
(a) Digits {1, 2, 3}: original subspaces, LBF e = 30.9904%, LSA e = 30.1917%; transformed subspaces, R-LBF e = 9.5847%.
(b) Digits {2, 4, 8}: original subspaces, LBF e = 35.0937%, LSA e = 21.2947%; transformed subspaces, R-LBF e = 6.9847%.
(R-LBF: LBF adopted as the subspace clustering method for the transformed subspaces.)]

Figure 6: Misclassification rate (e) on clustering 3 digits. Methods compared are LSA (Yan and Pollefeys (2006)) and LBF (Zhang et al. (2012)). LBF is adopted in the proposed LRSC framework and denoted as R-LBF. After convergence, R-LBF significantly outperforms state-of-the-art methods.

Fig. 6 shows the misclassification rate (e) on clustering subspaces of three digits. Here we adopt LBF in our LRSC framework, denoted as Robust LBF (R-LBF), to illustrate that the performance of existing subspace clustering methods can be enhanced using the proposed LRSC algorithm. After convergence, R-LBF, which uses the proposed learned subspace transformation, significantly outperforms state-of-the-art methods.

Table 1 shows the misclassification rate on clustering different numbers of digits; [0:c] denotes the subset of c + 1 digits from digit 0 to c. We randomly pick 100 samples per digit to compare the performance when fewer data points per class are present. For all cases, the proposed LRSC method significantly outperforms state-of-the-art methods.

Table 1: Misclassification rate (e%) on clustering different numbers of digits in the MNIST dataset; [0:c] denotes the subset of c + 1 digits from digit 0 to c. We randomly pick 100 samples per digit. For all cases, the proposed LRSC method significantly outperforms state-of-the-art methods.

Subsets  [0:1]  [0:2]  [0:3]  [0:4]  [0:5]  [0:6]  [0:7]  [0:8]
C            2      3      4      5      6      7      8      9
LSA       0.47  47.57  36.73  30.90  40.46  48.13  39.87  44.03
LBF       0.47  23.62  29.19  51.37  48.99  53.01  39.87  38.79
LRSC         0   3.88   3.89   5.31  14.04  13.79  14.50  16.05

5.1.1 Online vs. Batch Learning

In this set of experiments, we use digits {1, 2} from the MNIST dataset. We select 1000 images for each digit, and randomly partition them into 5 mini-batches. We first perform one iteration of LRSC in Algorithm 1 over all the selected data with various γ values. As shown in Fig. 7a, we always observe empirical convergence for subspace transformation learning via (6). The projected subgradient method presented in Appendix C converges to
a local minimum (or a stationary point). More discussions on convergence can be found in Appendix C.

Starting with the first mini-batch, we then perform one iteration of LRSC over one mini-batch at a time, with the subspace transformation learned from the previous mini-batch as a warm restart. We adopt here 100 iterations for the subgradient descent updates. As shown in Fig. 7b, we observe similar empirical convergence for online transformation learning. To converge to the same objective function value, it takes 131.76 sec. for online learning and 700.27 sec. for batch learning.

[Figure 7 panels: (a) Batch learning with various γ values: objective function vs. number of subgradient iterations, for no normalization and for ||T||_2 = 0.2, 0.4, 0.6, 0.8, 1.0. (b) Online vs. batch learning (γ = 1).]

Figure 7: Convergence of the objective function (6) using online and batch learning for the subspace transformation. We always observe empirical convergence for both online and batch learning. In (a), we vary the value of γ in the norm constraint ("No normalization" denotes removing the norm constraint). More discussions on convergence can be found in Appendix C. In (b), to converge to the same objective function value, it takes 131.76 sec. for online learning and 700.27 sec. for batch learning.

5.2 Application to Face Clustering

In the Extended YaleB dataset, each of the 38 subjects is imaged under 64 lighting conditions, shown in Fig. 8a. Under the assumption of Lambertian reflectance, face images of each subject under different lighting conditions can be accurately approximated with a 9-dimensional linear subspace (Basri and Jacobs (February 2003)). We conduct the face clustering experiments on the first 9 subjects shown in Fig. 8b.

[Figure 8 panels: (a) example illumination conditions; (b) example subjects 1 through 9.]

Figure 8: The Extended YaleB face dataset.

[Figure 9 panels: (a) ground truth; (b) SSC, e = 71.25%, t = 714.99 sec; (c) LBF, e = 76.37%, t = 460.76 sec; (d) LSA, e = 71.96%, t = 22.57 sec; (e) R-SSC, e = 67.37%, t = 1.83 sec.]

Figure 9: Misclassification rate (e) and running time (t) on clustering 9 subjects using different subspace clustering methods. The proposed R-SSC outperforms state-of-the-art methods both in accuracy and running time. This is further improved using the learned transform: LRSC reduces the error to 4.94%, see Fig. 10.
We set the sparsity value K = 10 for R-SSC, and perform 100 iterations of the subgradient descent updates while learning the transformation.

Fig. 9 shows the error rate (e) and running time (t) on clustering subspaces of 9 subjects using different subspace clustering methods. The proposed R-SSC technique outperforms state-of-the-art methods both in accuracy and running time. As shown in Fig. 10, using the proposed LRSC algorithm (that is, learning the transform), the misclassification errors of R-SSC are further reduced significantly, for example, from 67.37% to 4.94% for the 9 subjects. Fig. 10n shows the convergence of the T updating step in the first few LRSC iterations.

[Figure 10 panels: ground truth and clustering results at successive LRSC iterations, with e = 40.39% (iter = 1), 33.51% (iter = 2), 29.98% (iter = 3), 13.40% (iter = 6), 6.17% (iter = 8), 4.94% (iter = 12); (m) misclassification rate vs. number of LRSC iterations; (n) convergence of the T updating step.]

Figure 10: Misclassification rate (e) on clustering 9 subjects using the proposed LRSC framework. We adopt the proposed R-SSC technique for the clustering step. With the proposed LRSC framework, the clustering error of R-SSC is further reduced significantly, e.g., from 67.37% to 4.94% for the 9-subject case. Note how the classes are clustered in clean subspaces in the transformed domain.

The dramatic performance improvement can be explained with Fig. 11. We observe, as expected from the theory presented before, that the learned subspace transformation increases the distance (the smallest principal angle) between subspaces and, at the same time, reduces the nuclear norms of the subspaces. More results on clustering subspaces of 2 and 3 subjects are shown in Fig. 12.

Table 2 shows the misclassification rate (e) on clustering subspaces of different numbers of subjects; [1:c] denotes the first c subjects in the Extended YaleB dataset. For all cases, the proposed LRSC method significantly outperforms state-of-the-art methods. Note that without the low-rank decomposition step in (10), we obtain a misclassification rate of 18.38%
for clustering all 38 subjects in the Extended YaleB dataset, which is higher than the 11.02% reported in Table 2. Thus, pushing the subspaces apart through our learned transformation plays a major role here, and the robustness in the low-rank decomposition enhances the performance even further.

Figure 11: The smallest and mean principal angles between pairs of the 9 subject subspaces, and the nuclear norms of the 9 subject subspaces, before and after transformation. Each entry in (a) and (b) denotes the smallest principal angle between a pair of subspaces, and each entry in (c) and (d) denotes the average cosine over all principal angles; (e) compares the subspace nuclear norms. We observe that the learned subspace transformation increases the angles between subspaces and also reduces the nuclear norms of the subspaces. Overall, the average smallest principal angle between subspaces increased from 0.09 to 0.26, and the average subspace nuclear norm decreased from 21.43 to 8.53.

In Fig. 3 and Fig. 4, using synthetic examples, we previously compared our learned transformation with the closed-form orthogonalizing transformation and LDA. In Table 3, we further compare the three transformations using real data. We perform supervised transformation learning on all 38 subjects in the Extended YaleB dataset using the three different transformation learning algorithms, and then perform subspace clustering on the transformed data. The proposed transformation learning significantly outperforms the other two methods.
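The subspace statistics behind these comparisons (Fig. 11) — the smallest principal angle between a pair of subject subspaces and the nuclear norm of each subject's data matrix — can be estimated directly from the (transformed) data. Below is a minimal sketch, assuming each subject's samples are stacked as columns of a matrix and each subspace is approximated by the top singular vectors of its data; the 9-dimensional choice and all names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def smallest_principal_angle(Y1, Y2, dim=9):
    """Smallest principal angle (radians) between the subspaces spanned by
    the columns of Y1 and Y2, each approximated by its top `dim` left
    singular vectors."""
    U1 = np.linalg.svd(Y1, full_matrices=False)[0][:, :dim]
    U2 = np.linalg.svd(Y2, full_matrices=False)[0][:, :dim]
    # Singular values of U1^T U2 are the cosines of the principal angles.
    cosines = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.arccos(np.clip(cosines.max(), -1.0, 1.0))

def nuclear_norm(Y):
    """Sum of the singular values of Y."""
    return np.linalg.svd(Y, compute_uv=False).sum()

# Toy example with random data in place of face images (D = ambient dim.).
rng = np.random.default_rng(0)
D = 256
Y1, Y2 = rng.standard_normal((D, 64)), rng.standard_normal((D, 64))
T = rng.standard_normal((D, D))  # stand-in for a learned transformation
print(smallest_principal_angle(T @ Y1, T @ Y2), nuclear_norm(T @ Y1))
```

Applying such measurements before and after a candidate transformation gives exactly the kind of before/after summary reported in Fig. 11.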
Figure 12: Misclassification rate (e) and running time (t) on clustering 2 and 3 subjects. (a, b) Subjects {1, 2}: on the original subspaces, LBF e = 49.2063% (t = 76.19 sec.), LSA e = 50% (1.22 sec.), SSC e = 47.619% (101.44 sec.), R-SSC (iter = 0) e = 42.0635% (0.32 sec.); on the transformed subspaces, R-SSC e = 0%. (c, d) Subjects {2, 3}: LBF e = 49.2063% (70.79 sec.), LSA e = 46.8254% (1.24 sec.), SSC e = 45.2381% (113.75 sec.), R-SSC (iter = 0) e = 38.8889% (0.21 sec.); transformed, R-SSC e = 2.381%. (e, f) Subjects {4, 5, 6}: LBF e = 64.0212% (121.63 sec.), LSA e = 47.619% (2.84 sec.), SSC e = 53.9683% (180.88 sec.), R-SSC (iter = 0) e = 41.2698% (0.29 sec.); transformed, R-SSC e = 4.2328%. (g, h) Subjects {7, 8, 9}: LBF e = 64.0212% (119.28 sec.), LSA e = 65.0794% (2.74 sec.), SSC e = 65.0794% (186.36 sec.), R-SSC (iter = 0) e = 65.6085% (0.25 sec.); transformed, R-SSC e = 1.5873%. The proposed R-SSC outperforms state-of-the-art methods both in accuracy and running time. With the proposed LRSC framework, the clustering error of R-SSC is further reduced significantly. Note how the classes are clustered in clean subspaces in the transformed domain (best viewed by zooming in on screen).

5.3 Application to Motion Segmentation

The Hopkins 155 dataset consists of three types of videos: checker, traffic, and articulated; 120 of the videos have two motions and 35 of the videos have three motions. The main task is to segment a video sequence of multiple rigidly moving objects into multiple spatiotemporal regions that correspond to the different motions in the scene. This motion dataset contains much cleaner subspace data than the digits and faces data evaluated above. To enable a fair comparison, we project the data into a lower-dimensional subspace using PCA, as explained in Vidal (2011); Zhang et al. (2012). Results for the other comparing methods are taken from Vidal (2011). As shown in Vidal (2011); Zhang et al. (2012), the SSC method significantly outperforms all previous state-of-the-art methods on this dataset. From Table 4, we can see that our method shows comparable results to SSC for two motions and
outperforms SSC for three motions. Note that our method is orders of magnitude faster than SSC, as discussed earlier.

Table 2: Misclassification rate (e, %) on clustering different numbers of subjects in the Extended YaleB face dataset; [1:c] denotes the first c subjects in the dataset. For all cases, the proposed LRSC method significantly outperforms state-of-the-art methods.

Subsets  [1:10]  [1:15]  [1:20]  [1:25]  [1:30]  [1:38]
C        10      15      20      25      30      38
LSA      78.25   82.11   84.92   82.98   82.32   84.79
LBF      78.88   74.92   77.14   78.09   78.73   79.53
LRSC     5.39    4.76    9.36    8.44    8.14    11.02

Table 3: Misclassification rate (e, %) on clustering 38 subjects in the Extended YaleB dataset using supervised transformation learning. The proposed transformation learning outperforms both the closed-form orthogonalizing transformation and LDA on clustering the transformed data.

Method            Misclassification (%)
Orthogonalizing   61.36
LDA               9.77
Proposed          5.47

Table 4: Misclassification rate (e, %) on two-motion and three-motion segmentation in the Hopkins 155 dataset. As shown in Vidal (2011); Zhang et al. (2012), the SSC method significantly outperforms all previous state-of-the-art methods on this dataset. The proposed LRSC shows comparable results to SSC for two motions and outperforms SSC for three motions. Note that our method is orders of magnitude faster than SSC.

                 Checker         Traffic         Articulated     All
                 Mean   Median   Mean   Median   Mean   Median   Mean   Median
2-motion  LSA    2.57   0.27     5.43   1.48     4.10   1.22     3.45   0.59
          LBF    1.59   0        0.20   0        0.80   0        1.16   0
          SSC    1.12   0        0.02   0        0.62   0        0.82   0
          LRSC   1.19   0        0.23   0        0.88   0        0.92   0
3-motion  LSA    5.80   1.77     25.07  23.79    7.25   7.25     9.73   2.33
          LBF    4.57   0.94     0.38   0        2.66   2.66     3.63   0.64
          SSC    2.97   0.27     0.58   0        1.42   0        2.45   0.2
          LRSC   1.59   0        0.32   0        1.60   1.60     1.34   0

5.4 Application to Face Recognition across Illumination

For the Extended YaleB dataset, we adopt a setup similar to that described in Jiang et al. (June 2011); Zhang and Li (June 2010). We split the dataset into two halves by randomly selecting 32 lighting conditions for training, and use the other half for testing. We learn a global low-rank transformation matrix from the training data, and report recognition accuracies in Table 5. We make the following observations. First, the recognition accuracy is increased from 91.77% to 99.10% by simply applying the learned transformation matrix to the original face images. Second, the best accuracy is obtained by first recovering the low-rank subspace for each subject, e.g., the third row in Fig. 13a. Each transformed testing face, e.g., the second row in Fig. 13b, is then sparsely decomposed over the low-rank subspace of each subject through OMP, and classified to the subject with the minimal reconstruction error; a sparsity value of 10 is used here for OMP. As shown in Fig. 13c, the low-rank representation for each subject shows reduced variations caused by illumination. Third, the global transformation performs better here than class-based transformations, which can be due to the fact that illumination in this dataset varies in a globally coordinated way across subjects. Last but not least, our method outperforms state-of-the-art sparse-representation-based face recognition methods.

Table 5: Recognition accuracies (%) under illumination variations for the Extended YaleB dataset. The recognition accuracy is increased from 91.77% to
99.10% by simply applying the learned low-rank transformation (LRT) matrix to the original face images.

Method                               Accuracy (%)
D-KSVD (Zhang and Li (June 2010))    94.10
LC-KSVD (Jiang et al. (June 2011))   96.70
SRC (Wright et al. (2009))           97.20
Original+NN                          91.77
Class LRT+NN                         97.86
Class LRT+OMP                        92.43
Global LRT+NN                        99.10
Global LRT+OMP                       99.51

Figure 13: Face recognition across illumination using the global low-rank transformation: (a) low-rank decomposition (original face images, low-rank transformed faces, low-rank components, and sparse errors) of globally transformed training samples; (b) globally transformed testing samples; (c) mean low-rank components for the subjects in the training data.

5.5 Application to Face Recognition across Pose

We adopt a setup similar to that described in Castillo and Jacobs (2009) to enable the comparison. In this experiment, we classify 68 subjects in three poses, frontal (c27), side (c05), and profile (c22), under lighting condition 12, and use the remaining poses as the training data. For this example, we learn a class-based low-rank transformation matrix per subject from the training data. It is noted that the goal is to learn a transformation matrix that helps the classification, which may not necessarily correspond to the real geometric transform. Table 6 shows the face recognition accuracies under pose variations for the CMU PIE dataset (we applied the crop-and-flip step discussed in Fig. 1). We make the following observations. First, the recognition accuracy is dramatically increased after applying the learned transformations. Second, the best accuracy is obtained by recovering the low-rank subspace for each subject, e.g., the third row in Fig. 14a and Fig. 14b. Each transformed testing face, e.g., Fig. 14c and Fig. 14d, is then sparsely decomposed over the low-rank subspace of each subject through OMP, and classified to the subject with the minimal reconstruction error (Section 4). Third, the class-based transformation performs better than the global transformation in this case; the choice between these two settings is data dependent. Last but not least, our method outperforms SMD, which, to the best of our knowledge, reported the best recognition performance in this experimental setup. SMD is an unsupervised method, however, while the proposed method requires training; this still illustrates how a simple learned transform (whose application to the data at testing time is virtually free of cost) can significantly improve performance.

Table 6: Recognition accuracies (%) under pose variations for the CMU PIE dataset.

Method                             Frontal (c27)   Side (c05)   Profile (c22)
SMD (Castillo and Jacobs (2009))   83              82           57
Original+NN                        39.85           37.65        17.06
Original (crop+flip)+NN            44.12           45.88        22.94
Class LRT+NN                       98.97           96.91        67.65
Class LRT+OMP                      100             100          67.65
Global LRT+NN                      97.06           95.58        50
Global LRT+OMP                     100             98.53        57.35
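The decision rule used in Sections 5.4 and 5.5 — apply the learned transformation, sparsely decompose the transformed test face over each subject's low-rank components via OMP, and select the subject with the minimal reconstruction error — can be sketched as follows. This is a minimal illustration assuming scikit-learn's OMP solver and per-subject low-rank matrices already computed; the names are illustrative and not from the authors' code.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def classify_lrt_omp(x_test, T, lowrank_per_subject, sparsity=10):
    """Classify a test sample: apply the learned transformation T, sparsely
    decompose the transformed sample over each subject's low-rank components
    via OMP, and return the subject with the minimal reconstruction error."""
    x = T @ x_test
    best_subject, best_err = None, np.inf
    for subject, L in lowrank_per_subject.items():
        # L: d x n_s matrix whose columns are this subject's low-rank components.
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity,
                                        fit_intercept=False)
        omp.fit(L, x)
        err = np.linalg.norm(x - L @ omp.coef_)
        if err < best_err:
            best_subject, best_err = subject, err
    return best_subject
```

The sketch shows the global variant with a single T for all subjects; for the class-based variant, T would be replaced inside the loop by the subject-specific transformation T_c.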
Figure 14: Face recognition across pose using class-based low-rank transformation. Panels (a) and (b) show the low-rank decomposition (original face images, low-rank transformed faces, low-rank components, and sparse errors) of class-based transformed training samples for two subjects; panels (c) and (d) show the corresponding class-based transformed testing samples in the profile, side, and frontal poses. Note, for example in (c) and (d), how the learned transform reduces the pose variability.

Figure 15: Face recognition accuracy (recognition rate vs. illumination condition) under combined pose and illumination variations on the CMU PIE dataset, for poses (a) c02, (b) c05, (c) c29, and (d) c14, comparing G-LRT, C-LRT, DADL, Eigenface, and SRC. The proposed methods are denoted G-LRT (red) and C-LRT (blue). The proposed methods significantly outperform the comparing methods, especially for the extreme poses c02 and c14.

5.6 Application to Face Recognition across Illumination and Pose

To enable the comparison with Qiu et al. (Oct. 2012), we adopt their setup for face recognition under combined pose and illumination variations for the CMU PIE dataset. We use 68 subjects in 5 poses (c22, c37, c27, c11, and c34) under 21 illumination conditions for training, and classify 68 subjects in 4 poses (c02, c05, c29, and c14) under 21 illumination conditions. Three face recognition methods are adopted for comparison: Eigenfaces (Turk and Pentland (June 1991)), SRC (Wright et al. (2009)), and DADL (Qiu et al. (Oct. 2012)). SRC and DADL are both state-of-the-art sparse representation methods for face recognition, and DADL adapts sparse dictionaries to the actual visual domains. As shown in Fig. 15, the proposed methods, both the global LRT (G-LRT) and the class-based LRT (C-LRT), significantly outperform the comparing methods, especially for the extreme poses c02 and c14. Some testing examples using a global transformation are shown in Fig. 16. We notice that the transformed faces for each subject exhibit reduced variations caused by pose and illumination.

Figure 16: Face recognition under combined pose and illumination variations using the global low-rank transformation: (a) and (b) show globally transformed testing samples for two subjects in poses c02, c05, c29, and c14.
5.7 Discussion on the Size of the Transformation Matrix T

In the experiments presented above, we learned a square linear transformation. For example, if images are resized to 16 × 16, the learned subspace transformation T is of size 256 × 256. If we instead learn a transformation of size r × 256 with r < 256, we enable dimension reduction while performing the subspace transformation (feature learning). Through experiments, we notice that the peak clustering accuracy is usually obtained when r is smaller than the dimension of the ambient space. For example, in Fig. 12, through an exhaustive search for the optimal r, we observe the misclassification rate reduced from 2.38% to 0% for subjects {2, 3} at r = 96, and from 4.23% to 0% for subjects {4, 5, 6} at r = 40. As discussed before, this provides a framework to sense for clustering and classification, connecting the work presented here with the extensive literature on compressed sensing, and in particular on sensing design, e.g., Carson et al. (2012). We plan to study in detail the optimal size of the learned transformation matrix for subspace clustering and classification, including its potential connection with the number of subspaces in the data, and to further investigate such connections with compressive sensing.

6. Conclusion

We introduced a subspace low-rank transformation approach for subspace clustering and classification. Using matrix rank as the optimization criterion, via its nuclear norm convex surrogate, we learn a subspace transformation that reduces variations within the subspaces and increases separations between the subspaces. We demonstrated that the proposed approach significantly outperforms state-of-the-art methods for subspace clustering and classification, and provided some theoretical support for these experimental results.

Numerous avenues of research are opened by the framework introduced here. At the theoretical level, extending the analysis to the noisy case is needed. Furthermore, understanding the virtues of the global vs. the class-dependent transform is both important and interesting, as is the study of the framework in its compressed-dimensionality form. Beyond this, considering the proposed approach as a feature extraction technique, its combination with other successful clustering and classification techniques is the subject of current research.

Appendix A. Proof of Theorem 1

Proof: We know that (Srebro et al. (2005))

$$\|A\|_* = \min_{U, V:\, A = UV^\top} \tfrac{1}{2}\left(\|U\|_F^2 + \|V\|_F^2\right).$$

We denote by $U_A$ and $V_A$ the matrices that achieve the minimum; same for $B$, $U_B$ and $V_B$; and same for the concatenation $[A, B]$, $U_{[A,B]}$ and $V_{[A,B]}$. We then have

$$\|A\|_* = \tfrac{1}{2}\left(\|U_A\|_F^2 + \|V_A\|_F^2\right), \qquad \|B\|_* = \tfrac{1}{2}\left(\|U_B\|_F^2 + \|V_B\|_F^2\right).$$

The matrices $[U_A, U_B]$ and $[V_A, V_B]$, obtained by concatenating the matrices that achieve the minimum for $A$ and $B$ when computing their nuclear norms, are not necessarily the ones that achieve the corresponding minimum in the nuclear norm computation of the concatenation matrix $[A, B]$.
Thus, together with $\|[A, B]\|_F^2 = \|A\|_F^2 + \|B\|_F^2$ (the same identity applies to the concatenations $[U_A, U_B]$ and $[V_A, V_B]$), we have

$$
\|[A, B]\|_* = \tfrac{1}{2}\left(\|U_{[A,B]}\|_F^2 + \|V_{[A,B]}\|_F^2\right)
\le \tfrac{1}{2}\left(\|[U_A, U_B]\|_F^2 + \|[V_A, V_B]\|_F^2\right)
= \tfrac{1}{2}\left(\|U_A\|_F^2 + \|U_B\|_F^2 + \|V_A\|_F^2 + \|V_B\|_F^2\right)
= \tfrac{1}{2}\left(\|U_A\|_F^2 + \|V_A\|_F^2\right) + \tfrac{1}{2}\left(\|U_B\|_F^2 + \|V_B\|_F^2\right)
= \|A\|_* + \|B\|_*.
$$

Appendix B. Proof of Theorem 2

Proof: We perform the singular value decompositions of $A$ and $B$ as

$$
A = [U_{A1}\; U_{A2}] \begin{bmatrix} \Sigma_A & 0 \\ 0 & 0 \end{bmatrix} [V_{A1}\; V_{A2}]^\top, \qquad
B = [U_{B1}\; U_{B2}] \begin{bmatrix} \Sigma_B & 0 \\ 0 & 0 \end{bmatrix} [V_{B1}\; V_{B2}]^\top,
$$

where the diagonal entries of $\Sigma_A$ and $\Sigma_B$ contain the non-zero singular values. We have

$$
AA^\top = [U_{A1}\; U_{A2}] \begin{bmatrix} \Sigma_A^2 & 0 \\ 0 & 0 \end{bmatrix} [U_{A1}\; U_{A2}]^\top, \qquad
BB^\top = [U_{B1}\; U_{B2}] \begin{bmatrix} \Sigma_B^2 & 0 \\ 0 & 0 \end{bmatrix} [U_{B1}\; U_{B2}]^\top.
$$

The column spaces of $A$ and $B$ are assumed to be orthogonal, i.e., $U_{A1}^\top U_{B1} = 0$. The above can then be written as

$$
AA^\top = [U_{A1}\; U_{B1}] \begin{bmatrix} \Sigma_A^2 & 0 \\ 0 & 0 \end{bmatrix} [U_{A1}\; U_{B1}]^\top, \qquad
BB^\top = [U_{A1}\; U_{B1}] \begin{bmatrix} 0 & 0 \\ 0 & \Sigma_B^2 \end{bmatrix} [U_{A1}\; U_{B1}]^\top.
$$

Then, we have

$$
[A, B][A, B]^\top = AA^\top + BB^\top = [U_{A1}\; U_{B1}] \begin{bmatrix} \Sigma_A^2 & 0 \\ 0 & \Sigma_B^2 \end{bmatrix} [U_{A1}\; U_{B1}]^\top.
$$

The nuclear norm $\|A\|_*$ is the sum of the square roots of the singular values of $AA^\top$. Thus, $\|[A, B]\|_* = \|A\|_* + \|B\|_*$.

Appendix C. Projected Subgradient Learning Algorithm

We use a simple projected subgradient method to search for the transformation matrix T that minimizes (6). Before describing it, we should note that the problem is non-differentiable and non-convex, and it deserves a proper study for efficient optimization, keeping in mind that the development of more advanced optimization techniques will only further improve the performance of the proposed framework. We selected a simple subgradient approach since the goal of this paper is to present the framework, and already this simple optimization leads to very fast convergence and excellent performance, as detailed in Section 5, with significant improvements over the state of the art.

To minimize (6), the proposed projected subgradient method uses the iteration

$$T^{(t+1)} = T^{(t)} - \nu\, \Delta T, \qquad (16)$$

where $T^{(t)}$ is the $t$-th iterate and $\nu > 0$ defines the step size. The subgradient step $\Delta T$ is evaluated as

$$\Delta T = \sum_{c=1}^{C} \partial\|T Y_c\|_*\, Y_c^\top - \partial\|T Y\|_*\, Y^\top, \qquad (17)$$

where $\partial\|\cdot\|_*$ is the subdifferential of the nuclear norm (given a matrix $A$, the subdifferential $\partial\|A\|_*$ can be evaluated using the simple approach shown in Algorithm 2; Watson (1992)). After each iteration, we project $T$ via $\gamma \frac{T}{\|T\|}$.

The objective function (6) is a D.C. (difference of convex functions) program (Dinh and An (1997), Yuille and Rangarajan (2003), Sriperumbudur and Lanckriet (2012)). We provide here a simple convergence analysis of the projected subgradient approach proposed above. We first analyze the minimization of (6) without the norm constraint $\|T\|_2 = \gamma$, using the following iterative D.C. procedure (Yuille and Rangarajan (2003)):

• Initialize $T^{(0)}$ with the identity matrix;
• At the $t$-th iteration, update $T^{(t+1)}$ by solving the convex minimization sub-problem (18),

$$T^{(t+1)} = \arg\min_{T} \sum_{c=1}^{C} \|T Y_c\|_* - \operatorname{tr}\!\left(\partial\|T^{(t)} Y\|_*\, Y^\top T^\top\right), \qquad (18)$$

where the first term in (18) is the convex term in (6), and the added second term is a linear term in $T$ using a subgradient of the concave term in (6) evaluated at the current iterate.
We solve this sub-objective function (18) using the subgradient method, i.e., using a constant step size $\nu$, we iteratively take a step in the negative direction of the subgradient, which is evaluated as

$$\sum_{c=1}^{C} \partial\|T Y_c\|_*\, Y_c^\top - \partial\|T^{(t)} Y\|_*\, Y^\top. \qquad (19)$$

Though each individual subgradient step does not guarantee a decrease of the cost function (Boyd et al. (2003); Recht et al. (2010)), using a constant step size the subgradient method is guaranteed to converge to within some range of the optimal value for a convex problem (Boyd et al. (2003)). (It is easy to notice that (16) is a simplified version of this D.C. procedure, performing only one iteration of the subgradient method when solving the sub-objective function (18); we discuss this simplification further below.) Given $T^{(t+1)}$ as the minimizer found for the convex problem (18) using the subgradient method, we have for (18)

$$\sum_{c=1}^{C} \|T^{(t+1)} Y_c\|_* - \operatorname{tr}\!\left(\partial\|T^{(t)} Y\|_*\, Y^\top (T^{(t+1)})^\top\right) \le \sum_{c=1}^{C} \|T^{(t)} Y_c\|_* - \operatorname{tr}\!\left(\partial\|T^{(t)} Y\|_*\, Y^\top (T^{(t)})^\top\right), \qquad (20)$$

and from the concavity of the second term in (6), we have

$$-\|T^{(t+1)} Y\|_* \le -\|T^{(t)} Y\|_* - \operatorname{tr}\!\left(\partial\|T^{(t)} Y\|_*\, Y^\top (T^{(t+1)} - T^{(t)})^\top\right). \qquad (21)$$

By summing (20) and (21), we obtain

$$\sum_{c=1}^{C} \|T^{(t+1)} Y_c\|_* - \|T^{(t+1)} Y\|_* \le \sum_{c=1}^{C} \|T^{(t)} Y_c\|_* - \|T^{(t)} Y\|_*. \qquad (22)$$

The objective (6) is bounded from below by 0 (shown in Section 2), and decreases after each iteration of the above D.C. procedure (shown in (22)). Thus, convergence to a local minimum (or a stationary point) is guaranteed. For efficiency considerations, while solving the convex sub-objective function (18), we perform only one iteration of the subgradient method, obtaining the simplified method (16), and still observe empirical convergence in all experiments; see Fig. 7 and Fig. 10n.

The norm constraint $\|T\|_2 = \gamma$ is adopted in our formulation to prevent the trivial solution T = 0. As shown above, the minimization of (6) without the norm constraint always converges to a local minimum (or a stationary point), so the initialization becomes critical when dropping the norm constraint. By initializing $T^{(0)}$ with the identity matrix, we observe no convergence to the trivial solution in any experiment, such as the no-normalization case in Fig. 7. As shown in Douglas et al. (2000), the norm constraint $\|T\|_2 = \gamma$ can be incorporated into a gradient-based algorithm using various alternatives, e.g., Lagrange multipliers, coefficient normalization, and gradients in the tangent space. We implement the coefficient normalization method, i.e., after obtaining $T^{(t+1)}$ from (18), we normalize it via $\gamma \frac{T^{(t+1)}}{\|T^{(t+1)}\|}$. In other words, we normalize the length of $T^{(t+1)}$ without changing its direction. As discussed in Douglas et al. (2000), the problem of minimizing a cost function subject to a norm constraint forms the basis for many important tasks, and gradient-based algorithms are often used along with the norm constraint. Though it is expected that a norm constraint does not change the convergence behavior of a gradient algorithm (Douglas et al. (2000); Fuhrmann and Liu (1984)), as observed in Fig. 7, to the best of our knowledge it remains an open problem to analyze how a norm constraint and the choice of $\gamma$ affect the convergence behavior of a gradient/subgradient method.
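A minimal numerical sketch of one pass of this projected subgradient scheme follows; it assumes the data are provided as a list of per-class matrices Y_c with samples as columns, takes ||T||_2 to be the spectral norm, and, for simplicity, uses only the U^(1) V^(1)T part of the nuclear-norm subdifferential from Algorithm 2 below (i.e., B = 0, which still yields a valid subgradient). The step size, the norm target γ, and all names are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def nuclear_subgradient(A, delta=1e-6):
    """A subgradient of the nuclear norm at A (Watson-style): U1 V1^T built
    from the singular vectors whose singular values exceed delta; the random
    U2 B V2^T term of Algorithm 2 is omitted here (B = 0)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > delta
    return U[:, keep] @ Vt[keep, :]

def projected_subgradient_step(T, Ys, step=1e-3, gamma=1.0):
    """One iteration of (16)-(17): subtract a subgradient of
    sum_c ||T Y_c||_* - ||T Y||_*, then renormalize T to norm gamma."""
    Y = np.concatenate(Ys, axis=1)  # all classes stacked column-wise
    grad = sum(nuclear_subgradient(T @ Yc) @ Yc.T for Yc in Ys)
    grad -= nuclear_subgradient(T @ Y) @ Y.T
    T = T - step * grad
    return gamma * T / np.linalg.norm(T, 2)  # projection onto ||T||_2 = gamma

# Toy usage: 3 classes of 20 samples in R^50, T initialized to the identity.
rng = np.random.default_rng(0)
Ys = [rng.standard_normal((50, 20)) for _ in range(3)]
T = np.eye(50)
for _ in range(10):
    T = projected_subgradient_step(T, Ys)
```

Initializing T with the identity and renormalizing after every step mirrors the coefficient-normalization choice described above.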
Algorithm 2: An approach to evaluate the subdifferential of a matrix norm.

Input: An m × n matrix A; a small threshold value δ.
Output: The subdifferential of the matrix norm, ∂||A||.
begin
  1. Perform the singular value decomposition A = UΣV^⊤;
  2. s ← the number of singular values smaller than δ;
  3. Partition U and V as U = [U^(1), U^(2)], V = [V^(1), V^(2)], where U^(1) and V^(1) have (n − s) columns;
  4. Generate a random matrix B of size (m − n + s) × s, and set B ← B / ||B||;
  5. ∂||A|| ← U^(1) V^(1)⊤ + U^(2) B V^(2)⊤;
  6. Return ∂||A||;
end

Acknowledgments

This work was partially supported by ONR, NGA, NSF, ARO, DHS, and AFOSR. We thank Dr. Pablo Sprechmann, Dr. Ehsan Elhamifar, Ching-Hui Chen, and Dr. Mariano Tepper for important feedback on this work.

References

R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Trans. on Patt. Anal. and Mach. Intell., 25(2):218–233, February 2003.

M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2003.

S. Boyd, L. Xiao, and A. Mutapcic. Subgradient method. Notes for EE392o, Stanford University, 2003.

E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):11:1–11:37, June 2011.

W. R. Carson, M. Chen, M. R. D. Rodrigues, R. Calderbank, and L. Carin. Communications-inspired projection design with application to compressive sensing. SIAM J. Imaging Sci., 5(4):1185–1212, 2012.

C. Castillo and D. Jacobs. Using stereo matching for 2-D face recognition across pose. IEEE Trans. on Patt. Anal. and Mach. Intell., 31:2298–2304, 2009.

G. Chen and G. Lerman. Spectral curvature clustering (SCC). International Journal of Computer Vision, 81(3):317–330, 2009.

T. P. Dinh and L. T. H. An. Convex analysis approach to d.c. programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289–355, 1997.

S. C. Douglas, S. Amari, and S. Y. Kung. On gradient adaptation with unit-norm constraints. IEEE Trans. on Signal Processing, 48(6):1843–1847, 2000.

E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. on Patt. Anal. and Mach. Intell., 2013. To appear.

M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.

D. R. Fuhrmann and B. Liu. An iterative algorithm for locating the minimal eigenvector of a symmetric matrix. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Dallas, TX, 1984.

A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. on Patt. Anal. and Mach. Intell., 23(6):643–660, June 2001.

A. Goh and R. Vidal. Segmenting motions of different types by unsupervised manifold clustering. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., Minneapolis, Minnesota, 2007.

T. Hastie and P. Y. Simard. Metrics and models for handwritten character recognition. Statistical Science, 13(1):54–65, 1998.

Z. Jiang, Z. Lin, and L. S. Davis. Learning a discriminative dictionary for sparse coding via label consistent K-SVD. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., Colorado Springs, CO, June 2011.

O. Kuybeda, G. A. Frank, A.
Bartesaghi, M. Borgnia, S. Subramaniam, and G. Sapiro. A collaborative framework for 3D alignment and classification of heterogeneous subvolumes in cryo-electron tomography. Journal of Structural Biology, 181:116–127, 2013.

G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In International Conference on Machine Learning, Haifa, Israel, 2010.

U. Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, December 2007.

Y. Ma, H. Derksen, W. Hong, and J. Wright. Segmentation of multivariate mixed data via lossy data coding and compression. IEEE Trans. on Patt. Anal. and Mach. Intell., 29(9):1546–1562, 2007.

G. Marsaglia and G. P. H. Styan. When does rank(A + B) = rank(A) + rank(B)? Canad. Math. Bull., 15(3), 1972.

J. Miao and A. Ben-Israel. On principal angles between subspaces in R^n. Linear Algebra and its Applications, 171:81–98, 1992.

Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proc. 27th Asilomar Conference on Signals, Systems and Computers, pages 40–44, Nov. 1993.

Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., San Francisco, USA, 2010.

Q. Qiu and G. Sapiro. Learning transformations for classification forests. In International Conference on Learning Representations, Banff, Canada, April 2014.

Q. Qiu, V. Patel, P. Turaga, and R. Chellappa. Domain adaptive dictionary learning. In Proc. European Conference on Computer Vision, Florence, Italy, Oct. 2012.

B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.

X. Shen and Y. Wu. A unified approach to salient object detection via low rank matrix recovery. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., Rhode Island, USA, 2012.

T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression (PIE) database. IEEE Trans. on Patt. Anal. and Mach. Intell., 25(12):1615–1618, Dec. 2003.

M. Soltanolkotabi and E. J. Candès. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, 40(4):2195–2238, 2012.

M. Soltanolkotabi, E. Elhamifar, and E. J. Candès. Robust subspace clustering. CoRR, abs/1301.2603, 2013.

P. Sprechmann, A. M. Bronstein, and G. Sapiro. Learning efficient sparse and low rank models. CoRR, abs/1212.3631, 2012.

N. Srebro, J. Rennie, and T. Jaakkola. Maximum margin matrix factorization. In Advances in Neural Information Processing Systems, Vancouver, Canada, 2005.

B. K. Sriperumbudur and G. R. G. Lanckriet. A proof of convergence of the concave-convex procedure using Zangwill's theory. Neural Computation, 24(6):1391–1407, 2012.

C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9:137–154, 1992.

M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In Proc.
IEEE Computer Society Conf. on Computer Vision and Patt. Recn., Maui, Hawaii, June 1991.

R. Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52–68, 2011.

R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (GPCA). In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., Madison, Wisconsin, 2003.

J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., San Francisco, USA, 2010.

Y. Wang and H. Xu. Noisy sparse subspace clustering. In International Conference on Machine Learning, Atlanta, USA, 2013.

G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra and Applications, 170:1039–1053, 1992.

J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Trans. on Patt. Anal. and Mach. Intell., 31(2):210–227, 2009.

J. Yan and M. Pollefeys. A general framework for motion segmentation: independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In Proc. European Conference on Computer Vision, Graz, Austria, 2006.

A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 4:915–936, 2003.

Q. Zhang and B. Li. Discriminative K-SVD for dictionary learning in face recognition. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., San Francisco, CA, June 2010.

T. Zhang, A. Szlam, Y. Wang, and G. Lerman. Hybrid linear modeling via local best-fit flats. International Journal of Computer Vision, 100(3):217–240, 2012.

Z. Zhang, X. Liang, A. Ganesh, and Y. Ma. TILT: transform invariant low-rank textures. In Proc. Asian Conference on Computer Vision, Queenstown, New Zealand, 2011.

X. Zhu and D. Ramanan. Face detection, pose estimation and landmark localization in the wild. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., Providence, Rhode Island, June 2012.