Sparse Compositional Metric Learning∗

Yuan Shi†‡, Aurélien Bellet†‡, Fei Sha‡

Abstract

We propose a new approach for metric learning by framing it as learning a sparse combination of locally discriminative metrics that are inexpensive to generate from the training data. This flexible framework allows us to naturally derive formulations for global, multi-task and local metric learning. The resulting algorithms have several advantages over existing methods in the literature: a much smaller number of parameters to be estimated and a principled way to generalize learned metrics to new testing data points. To analyze the approach theoretically, we derive a generalization bound that justifies the sparse combination. Empirically, we evaluate our algorithms on several datasets against state-of-the-art metric learning methods. The results are consistent with our theoretical findings and demonstrate the superiority of our approach in terms of classification performance and scalability.

1 Introduction

The need for measuring distance or similarity between data instances is ubiquitous in machine learning and many application domains. However, each problem has its own underlying semantic space for defining distances that standard metrics (e.g., the Euclidean distance) often fail to capture. This has led to a growing interest in metric learning over the past few years, as summarized in two recent surveys (Bellet et al., 2013; Kulis, 2012). Among these methods, learning a globally linear Mahalanobis distance is by far the most studied setting. Representative methods include (Xing et al., 2002; Goldberger et al., 2004; Davis et al., 2007; Jain et al., 2008; Weinberger and Saul, 2009; Shen et al., 2012; Ying and Li, 2012). This is equivalent to learning a linear projection of the data to a feature space where constraints on the training set (such as "x_i should be closer to x_j than to x_k") are better satisfied.
Although the performance of these learned metrics is typically superior to that of standard metrics in practice, a single linear metric is often unable to accurately capture the complexity of the task, for instance when the data are multimodal or the decision boundary is complex. To overcome this limitation, recent work has focused on learning multiple locally linear metrics at several locations of the feature space (Frome et al., 2007; Weinberger and Saul, 2009; Zhan et al., 2009; Hong et al., 2011; Wang et al., 2012), to the extreme of learning one metric per training instance (Noh et al., 2010). This line of research is motivated by the fact that, locally, simple linear metrics perform well (Ramanan and Baker, 2011; Hauberg et al., 2012). The main challenge is to integrate these metrics into a meaningful global one while keeping the number of learning parameters at a reasonable level in order to avoid a heavy computational burden and severe overfitting. So far, existing methods are not able to compute valid (smooth) global metrics from the local metrics they learn and do not provide a principled way of generalizing to new regions of the space at test time. Furthermore,

∗ This document is an extended version of a conference paper (Shi et al., 2014) that provides additional details and results.
† Equal contribution.
‡ Department of Computer Science, University of Southern California, {yuanshi,bellet,feisha}@usc.edu.

Figure 1: Illustration of the general framework and its applications. We extract locally discriminative basis elements from the training data and cast metric learning as learning sparse combinations of these elements. We formulate global metric learning as learning a single sparse weight vector w.
For multi-task metric learning, we learn a vector w_t for each task, where all tasks share the same basis subset. For local metric learning, we learn a function T(x) that maps any instance x to its associated sparse weight vector w_x. Shades of grey encode weight magnitudes.

they scale poorly with the dimensionality D of the data: typically, learning a Mahalanobis distance requires O(D^2) parameters, and the optimization involves projections onto the positive semidefinite cone that scale in O(D^3). This is expensive even for a single metric when D is moderately large.

In this paper, we study metric learning from a new perspective to efficiently address these key challenges. We propose to learn metrics as sparse compositions of locally discriminative metrics. These "basis metrics" are low-rank and extracted efficiently from the training data at different local regions, for instance using Fisher discriminant analysis. Learning higher-rank linear metrics is then formulated as learning the combining weights, using sparsity-inducing regularizers to select only the most useful basis elements. This provides a unified framework for metric learning, as illustrated in Figure 1, that we call SCML (for Sparse Compositional Metric Learning). In SCML, the number of parameters to learn is much smaller than in existing approaches, and projections onto the positive semidefinite cone are not needed. This gives an efficient and flexible way to learn a single global metric when D is large. The proposed framework also applies to multi-task metric learning, where one wants to learn a global metric for several related tasks while exploiting commonalities between them (Caruana, 1997; Parameswaran and Weinberger, 2010). This is done in a natural way by means of a group sparsity regularizer that makes the task-specific metrics share the same basis subset.
Our last and arguably most interesting contribution is a new formulation for local metric learning, where we learn a transformation T(x) that takes as input any instance x and outputs a sparse weight vector defining its metric. This can be seen as learning a smoothly varying metric tensor over the feature space (Ramanan and Baker, 2011; Hauberg et al., 2012). To the best of our knowledge, it is the first discriminative metric learning approach capable of computing, in a principled way, an instance-specific metric for any point in the feature space. All formulations can be solved using scalable optimization procedures based on stochastic subgradient descent with proximal operators (Duchi and Singer, 2009; Xiao, 2010).

We present both theoretical and experimental evidence supporting the proposed approach. We derive a generalization bound which provides a theoretical justification for seeking sparse combinations and suggests that the basis set B can be large without incurring overfitting. Empirically, we evaluate our algorithms against state-of-the-art global, local and multi-task metric learning methods on several datasets. The results strongly support the proposed framework.

The rest of this paper is organized as follows. Section 2 describes our general framework and illustrates how it can be used to derive efficient formulations for global, local and multi-task metric learning. Section 3 provides a theoretical analysis supporting our approach. Section 4 reviews related work. Section 5 presents an experimental evaluation of the proposed methods. We conclude in Section 6.

2 Proposed Approach

In this section, we present the main idea of sparse compositional metric learning (SCML) and show how it can be used to unify several existing metric learning paradigms and lead to efficient new formulations.
2.1 Main Idea

We assume the data lie in R^D and focus on learning (squared) Mahalanobis distances d_M(x, x') = (x - x')^T M (x - x') parameterized by a positive semidefinite (PSD) D × D matrix M. Note that M can be represented as a nonnegative weighted sum of K rank-1 PSD matrices:¹

M = \sum_{i=1}^{K} w_i b_i b_i^T,  with w ≥ 0,    (1)

where the b_i's are D-dimensional column vectors. In this paper, we use the form (1) to cast metric learning as learning a sparse combination of basis elements taken from a basis set B = {b_i}_{i=1}^{K}. The key to our framework is the fact that such a B is made readily available to the algorithm and consists of rank-one metrics that are locally discriminative. Such basis elements can be easily generated from the training data at several local regions; in the experiments, we simply use Fisher discriminant analysis (see the corresponding section for details). They can then be combined to form a single global metric, multiple global metrics (in the multi-task setting) or a metric tensor (implicitly defining an infinite number of local metrics) that varies smoothly across the feature space, as we will show in later sections. We use the notation d_w(x, x') to highlight our parameterization of the Mahalanobis distance by w.

Learning M in this form makes it PSD by design (as a nonnegative sum of PSD matrices) and involves K parameters (instead of D^2 in most metric learning methods), enabling it to more easily deal with high-dimensional problems. We also want the combination to be sparse, i.e., some w_i's are zero, so that M only depends on a small subset of B. This provides some form of regularization (as shown later in Theorem 1) as well as a way to tie metrics together when learning multiple metrics. In the rest of this section, we apply the proposed framework to several metric learning paradigms (see Figure 1).
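To make the parameterization (1) concrete, here is a minimal NumPy sketch. The basis vectors below are random stand-ins, not the Fisher-discriminant bases used in our experiments, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 5, 8  # illustrative dimensionality and basis set size

# Basis set B = {b_i}: K D-dimensional directions (random stand-ins here).
B = rng.standard_normal((K, D))
w = np.abs(rng.standard_normal(K))   # nonnegative weights
w[3:] = 0.0                          # sparse combination: most weights are zero

# Eq. (1): M = sum_i w_i b_i b_i^T, which is PSD by construction.
M = sum(w_i * np.outer(b, b) for w_i, b in zip(w, B))

def d_w(x, xp):
    """Squared Mahalanobis distance d_w(x, x') = (x - x')^T M (x - x')."""
    diff = x - xp
    return diff @ M @ diff

x, xp = rng.standard_normal(D), rng.standard_normal(D)
assert d_w(x, xp) >= 0                          # a valid (pseudo) metric
assert np.all(np.linalg.eigvalsh(M) >= -1e-9)   # PSD without any projection step
```

Note that only K weights are learned here, rather than the D^2 entries of M, and positive semidefiniteness holds automatically.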
We start with the simple case of global metric learning (Section 2.1.1) before considering more challenging settings: multi-task (Section 2.1.2) and local metric learning (Section 2.1.3). Finally, Section 2.2 discusses how these formulations can be solved in a scalable way using stochastic subgradient descent with proximal operators.

2.1.1 Global Metric Learning

In global metric learning, one seeks to learn a single metric d_w(x, x') from a set of distance constraints on the training data. Here, we use a set of triplet constraints C, where each (x_i, x_j, x_k) ∈ C indicates that the distance between x_i and x_j should be smaller than the distance between x_i and x_k. C may be constructed from label information, as in LMNN (Weinberger and Saul, 2009), or in an unsupervised manner, based for instance on implicit user feedback (such as clicks on search engine results). Our formulation for global metric learning, SCML-Global, is simply to combine the local basis elements into a higher-rank global metric that satisfies well the constraints in C:

min_w (1/|C|) \sum_{(x_i, x_j, x_k) ∈ C} L_w(x_i, x_j, x_k) + β ||w||_1,    (2)

where L_w(x_i, x_j, x_k) = [1 + d_w(x_i, x_j) - d_w(x_i, x_k)]_+ with [·]_+ = max(0, ·), and β ≥ 0 is a regularization parameter. The first term in (2) is the classic margin-based hinge loss function. The second term ||w||_1 = \sum_{i=1}^{K} w_i is the ℓ1 norm regularization which encourages sparse solutions, allowing the selection of relevant basis elements. SCML-Global is convex by the linearity of both terms and is bounded below, thus it has a global minimum.

¹ Such an expression exists for any PSD matrix M since the eigenvalue decomposition of M is of the form (1).

2.1.2 Multi-Task Metric Learning

Multi-task learning (Caruana, 1997) is a paradigm for learning several tasks simultaneously, exploiting their commonalities.
When tasks are related, this can perform better than learning each task separately. Recently, multi-task learning methods have successfully built on the assumption that the tasks should share a common low-dimensional representation (Argyriou et al., 2008; Yang et al., 2009; Gong et al., 2012). In general, it is unclear how to achieve this in metric learning. In contrast, learning metrics as sparse combinations allows a direct translation of this idea to multi-task metric learning.

Formally, we are given T different but somehow related tasks with associated constraint sets C_1, ..., C_T, and we aim at learning a metric d_{w_t}(x, x') for each task t while sharing information across tasks. In the following, the basis set B is the union of the basis sets B_1, ..., B_T extracted from each task t. Our formulation for multi-task metric learning, mt-SCML, is as follows:

min_W \sum_{t=1}^{T} (1/|C_t|) \sum_{(x_i, x_j, x_k) ∈ C_t} L_{w_t}(x_i, x_j, x_k) + β ||W||_{2,1},

where W is a T × K nonnegative matrix whose t-th row is the weight vector w_t defining the metric for task t, L_{w_t}(x_i, x_j, x_k) = [1 + d_{w_t}(x_i, x_j) - d_{w_t}(x_i, x_k)]_+, and ||W||_{2,1} is the ℓ2/ℓ1 mixed norm used in the group lasso problem (Yuan and Lin, 2006). It corresponds to the ℓ1 norm applied to the ℓ2 norms of the columns of W and is known to induce group sparsity at the column level. In other words, this regularization makes most basis elements either have zero weight or nonzero weight for all tasks. Overall, while each metric remains task-specific (d_{w_t} is only required to satisfy well the constraints in C_t), it is composed of shared features (i.e., it potentially benefits from basis elements generated from other tasks) that are regularized to be relevant across tasks (as favored by the group sparsity).
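The group-sparsity effect of the ℓ2/ℓ1 norm can be illustrated in a few lines of NumPy. The values are toy values, and the proximal operator shown is the standard group soft-thresholding for this regularizer, given here for intuition rather than as our exact solver update:

```python
import numpy as np

def mixed_norm_21(W):
    """||W||_{2,1}: the l1 norm of the l2 norms of W's columns."""
    return np.linalg.norm(W, axis=0).sum()

def prox_21(W, tau):
    """Group soft-thresholding, the proximal operator of tau * ||.||_{2,1}.
    It acts column-wise and zeroes entire columns at once."""
    norms = np.linalg.norm(W, axis=0)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return W * scale

W = np.array([[3.0, 0.3, 0.0],
              [4.0, 0.4, 0.1]])   # 2 tasks x 3 basis elements (toy values)
P = prox_21(W, tau=1.0)
# The weak second and third columns are removed for *all* tasks at once,
# while the strong first column is only shrunk: this is exactly the
# "zero weight or nonzero weight for all tasks" behavior described above.
```

This column-wise behavior is what forces the task-specific metrics to draw on a common basis subset.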
As a result, all learned metrics can be expressed as combinations of the same basis subset of B, though with different weights for each task. Since the ℓ2/ℓ1 norm is convex, mt-SCML is again convex.

2.1.3 Local Metric Learning

Local metric learning addresses the limitations of global methods in capturing complex data patterns (Frome et al., 2007; Weinberger and Saul, 2009; Zhan et al., 2009; Noh et al., 2010; Hong et al., 2011; Wang et al., 2012). For heterogeneous data, allowing the metric to vary across the feature space can capture the semantic distance much better. On the other hand, local metric learning is costly and often suffers from severe overfitting, since the number of parameters to learn can be very large. In the following, we show how our framework can be used to derive an efficient local metric learning method.

We aim at learning a metric tensor T(x), which is a smooth function that (informally) maps any instance x to its metric matrix (Ramanan and Baker, 2011; Hauberg et al., 2012). The distance between two points should then be defined as the geodesic distance on a Riemannian manifold. However, this requires solving an intractable problem, so we use the widely-adopted simplification that distances from a point x are computed based on its own metric alone (Zhan et al., 2009; Noh et al., 2010; Wang et al., 2012):

d_T(x, x') = (x - x')^T T(x) (x - x') = (x - x')^T ( \sum_{i=1}^{K} w_{x,i} b_i b_i^T ) (x - x'),

where w_x is the weight vector for instance x. We could learn a weight vector for each training point. This would result in a formulation similar to mt-SCML, where each training instance is considered as a task. However, in the context of local metric learning, this is not an appealing solution. Indeed, for a training sample of size S, we would need to learn SK parameters, which is computationally difficult and leads to heavy overfitting for large-scale problems.
Furthermore, this gives no principled way of computing the weight vector of a test instance. We instead propose a more effective solution by constraining the weight vector for an instance x to parametrically depend on some embedding of x:

T_{A,c}(x) = \sum_{i=1}^{K} (a_i^T z_x + c_i)^2 b_i b_i^T,    (3)

where z_x ∈ R^{D′} is an embedding of x,² A = [a_1 . . . a_K]^T is a D′ × K real-valued matrix and c ∈ R^K. The square makes the weights nonnegative for all x ∈ R^D, ensuring that they define a valid (pseudo) metric. Intuitively, (3) combines the locally discriminative metrics with weights that depend on the position of the instance in the feature space.

There are several advantages to this formulation. First, by learning A and c, we implicitly learn a different metric not only for the training data but for any point in the feature space. Second, if the embedding is smooth, T_{A,c}(x) is a smooth function of x, therefore similar instances are assigned similar weights. This can be seen as a form of manifold regularization. Third, the number of parameters to learn is now K(D′ + 1), thus independent of both the size of the training sample and the dimensionality of x. Our formulation for local metric learning, SCML-Local, is as follows:

min_{Ã} (1/|C|) \sum_{(x_i, x_j, x_k) ∈ C} L_{T_{A,c}}(x_i, x_j, x_k) + β ||Ã||_{2,1},

where Ã is a (D′ + 1) × K matrix denoting the concatenation of A and c, and L_{T_{A,c}}(x_i, x_j, x_k) = [1 + d_{T_{A,c}}(x_i, x_j) - d_{T_{A,c}}(x_i, x_k)]_+. The ℓ2/ℓ1 norm on Ã induces sparsity at the column level, regularizing the local metrics to use the same basis subset. Interestingly, if A is the zero matrix, we recover SCML-Global. SCML-Local is nonconvex and is thus subject to local minima.

2.2 Optimization

Our formulations use (nonsmooth) sparsity-inducing regularizers and typically involve a large number of triplet constraints.
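As a concrete illustration of how such a nonsmooth regularizer is handled, the following NumPy sketch performs one generic proximal-subgradient step with ℓ1 soft-thresholding. The subgradient, step size and weight values are illustrative stand-ins; the specific algorithms we actually use are described next:

```python
import numpy as np

def prox_l1(w, tau):
    """Soft-thresholding: the proximal operator of tau * ||.||_1."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

# One generic stochastic composite step on a sampled triplet:
#   w <- prox_{eta * beta * ||.||_1}( w - eta * subgradient(hinge term) )
w = np.array([0.50, 0.02, 0.00, 1.30])     # current (nonnegative) weights
g = np.array([0.10, 0.00, -0.05, 0.20])    # stand-in stochastic subgradient
eta, beta = 0.1, 0.5                       # illustrative step size and beta
w = np.maximum(prox_l1(w - eta * g, eta * beta), 0.0)  # prox, then keep w >= 0
# Small weights are driven exactly to zero, yielding a sparse combination.
```

Unlike plain subgradient descent on the ℓ1 term, the proximal step produces exact zeros, which is why the learned combinations are genuinely sparse.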
We can solve them efficiently using stochastic composite optimization (Duchi and Singer, 2009; Xiao, 2010), which alternates between a stochastic subgradient step on the hinge loss term and a proximal operator (for the ℓ1 or ℓ2,1 norm) that explicitly induces sparsity. We solve SCML-Global and mt-SCML using Regularized Dual Averaging (Xiao, 2010), which offers fast convergence and levels of sparsity in the solution comparable to batch algorithms. For SCML-Local, due to local minima, we ensure improvement over the optimal solution w* of SCML-Global by using a forward-backward algorithm (Duchi and Singer, 2009) which is initialized with A = 0 and c_i = sqrt(w*_i). Recall that, unlike most existing metric learning algorithms, we do not need to perform projections onto the PSD cone, which scale in O(D^3) for a D × D matrix. Our algorithms thereby have a significant computational advantage for high-dimensional problems.

² In our experiments, we use kernel PCA (Schölkopf et al., 1998) as it provides a simple way to limit the dimension and thus the number of parameters to learn. We use an RBF kernel with bandwidth set to the median Euclidean distance in the data.

3 Theoretical Analysis

In this section, we provide a theoretical analysis of our approach in the form of a generalization bound based on algorithmic robustness analysis (Xu and Mannor, 2012) and its adaptation to metric learning (Bellet and Habrard, 2012). For simplicity, we focus on SCML-Global, our global metric learning formulation in (2). Consider the supervised learning setting, where we are given a labeled training sample S = {z_i = (x_i, y_i)}_{i=1}^{n} drawn i.i.d. from some unknown distribution P over Z = X × Y. We call a triplet (z, z', z'') admissible if y = y' ≠ y''.
Let C be the set of admissible triplets built from S and let L(w, z, z', z'') = [1 + d_w(x, x') - d_w(x, x'')]_+ denote the loss function used in (2), with the convention that L returns 0 for non-admissible triplets. Let us define the empirical loss of w on S as

R^S_emp(w) = (1/|C|) \sum_{(z, z', z'') ∈ C} L(w, z, z', z''),

and its expected loss over distribution P as R(w) = E_{z, z', z'' ∼ P} L(w, z, z', z''). The following theorem bounds the deviation between the empirical loss of the learned metric and its expected loss.

Theorem 1. Let w* be the optimal solution to SCML-Global with K basis elements, β > 0 and C constructed from S = {(x_i, y_i)}_{i=1}^{n} as above. Let K* ≤ K be the number of nonzero entries in w*. Assume that the norm of any instance is bounded by some constant R and that L is uniformly upper-bounded by some constant U. Then for any δ > 0, with probability at least 1 - δ, we have:

R(w*) - R^S_emp(w*) ≤ 16γRK*/β + 3U \sqrt{ (N ln 2 + ln(1/δ)) / (0.5 n) },

where N is the size of a γ-cover of Z.

This bound has a standard O(1/√n) asymptotic convergence rate.³ Its main originality is that it provides a theoretical justification for enforcing sparsity in our formulation. Indeed, notice that K* (and not K) appears in the bound as a penalization term, which suggests that one may use a large basis set K without overfitting as long as K* remains small. This will be confirmed by our experiments (Section 5.3). A similar bound can be derived for mt-SCML, but not for SCML-Local because of its nonconvexity.

³ In robustness bounds, the cover radius γ can be made arbitrarily close to zero at the expense of increasing N. Since N appears in the second term, the right-hand side of the bound indeed goes to zero when n → ∞. This is in accordance with other similar learning bounds, for example the original robustness-based bounds in (Xu and Mannor, 2012).
The details and proofs can be found in Appendix A.

4 Related Work

In this section, we review relevant work in global, multi-task and local metric learning. The interested reader should refer to the recent surveys of Kulis (2012) and Bellet et al. (2013) for more details.

Global methods. Most global metric learning methods learn the matrix M directly: see (Xing et al., 2002; Goldberger et al., 2004; Davis et al., 2007; Jain et al., 2008; Weinberger and Saul, 2009) for representative papers. This is computationally expensive and subject to overfitting for moderate to high-dimensional problems. An exception is BoostML (Shen et al., 2012), which uses rank-one matrices as weak learners to learn a global Mahalanobis distance via a boosting procedure. However, it is not clear how BoostML can be generalized to multi-task or local metric learning.

Multi-task methods. Multi-task metric learning was proposed in (Parameswaran and Weinberger, 2010) as an extension to the popular LMNN (Weinberger and Saul, 2009). The authors define the metric for task t as d_t(x, x') = (x - x')^T (M_0 + M_t) (x - x'), where M_t is task-specific and M_0 is shared by all tasks. Note that it is straightforward to incorporate their approach in our framework by defining a shared weight vector w_0 and task-specific weights w_t. However, this assumption of a metric that is common to all tasks can be too restrictive in cases where task relatedness is complex, as illustrated by our experiments.

Local methods. MM-LMNN (Weinberger and Saul, 2009) is an extension of LMNN which learns only a small number of metrics (typically one per class) in an effort to alleviate overfitting. However, no additional regularization is used and a full-rank metric is learned for each class, which becomes intractable when the number of classes is large.
msNCA (Hong et al., 2011) learns a function that splits the space into a small number of regions and then learns a metric per region using NCA (Goldberger et al., 2004). Again, the metrics are full-rank, so msNCA does not scale well with the number of metrics. Like SCML-Local, PLML (Wang et al., 2012) is based on a combination of metrics, but there are major differences with our work: (i) the weights depend only on a manifold assumption: they are not sparse and use no discriminative information; (ii) the basis metrics are full-rank, thus expensive to learn; and (iii) a weight vector is learned explicitly for each training instance, which can result in a large number of parameters and prevents generalization to new instances (in practice, for a test point, they use the weight vector of its nearest neighbor in the training set). As observed by Ramanan and Baker (2011), the above methods make the implicit assumption that the metric tensor is locally constant (at the class, region or neighborhood level), while SCML-Local learns a smooth function that maps any instance to its specific metric. ISD (Zhan et al., 2009) is an attempt to learn the metrics for unlabeled points by propagation, but is limited to the transductive setting. Unlike the above discriminative approaches, GLML (Noh et al., 2010) learns a metric for each point independently in a generative way, by minimizing the 1-NN expected error under some assumption on the class distributions.

             Vehicle  Vowel  Segment  Letters  USPS   BBC
# samples    846      990    2,310    20,000   9,298  2,225
# classes    4        11     7        26       10     5
# features   18       10     19       16       256    9,636

Table 1: Datasets for global and local metric learning.
Dataset    Euc          Global-Frob   SCML-Global
Vehicle    29.7 ± 0.6   21.5 ± 0.8    21.3 ± 0.6
Vowel      11.1 ± 0.4   10.3 ± 0.4    10.9 ± 0.5
Segment    5.2 ± 0.2    4.1 ± 0.2     4.1 ± 0.2
Letters    14.0 ± 0.2   9.0 ± 0.2     9.0 ± 0.2
USPS       10.3 ± 0.2   5.1 ± 0.2     4.1 ± 0.1
BBC        8.8 ± 0.3    5.5 ± 0.3     3.9 ± 0.2

Table 2: Global metric learning results (best in bold).

5 Experiments

In this section, we compare our methods to state-of-the-art algorithms on global, multi-task and local metric learning.⁴ We use a 3-nearest neighbor classifier in all experiments. To generate a set of locally discriminative rank-one metrics, we first divide the data into regions via clustering. For each region center, we select J nearest neighbors from each class (for J ∈ {10, 20, 50} to account for different scales), and apply Fisher discriminant analysis followed by eigenvalue decomposition to obtain the basis elements.⁵ Section 5.1 presents results for global metric learning, Section 5.2 for multi-task and Section 5.3 for local metric learning.

5.1 Global Metric Learning

We use 6 datasets from UCI⁶ and BBC⁷ (see Table 1). The dimensionality of USPS and BBC is reduced to 100 and 200 respectively using PCA to speed up computation. We normalize the data as in (Wang et al., 2012) and split into train/validation/test sets (60%/20%/20%), except for Letters and USPS, where we use 3,000/1,000/1,000 samples. Results are averaged over 20 random splits.

5.1.1 Proof of Concept

Setup. Global metric learning is a convenient setting to study the effect of combining basis elements. To this end, we consider a formulation with the same loss function as SCML-Global but that directly learns the metric matrix, using Frobenius norm regularization to reduce overfitting. We refer to it as Global-Frob. Both algorithms use the same training triplets, generated by identifying 3 target neighbors (nearest neighbors with the same label) and 10 impostors (nearest neighbors with a different label) for each instance.
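The triplet generation described above can be sketched as follows. This is a simplified brute-force NumPy version; the function name and data are illustrative, not taken from our implementation:

```python
import numpy as np

def build_triplets(X, y, n_targets=3, n_impostors=10):
    """For each point, pair its nearest same-label neighbors (targets)
    with its nearest different-label neighbors (impostors)."""
    # Pairwise squared Euclidean distances (brute force, for illustration).
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D2, np.inf)
    triplets = []
    for i in range(len(X)):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        diff = np.where(y != y[i])[0]
        targets = same[np.argsort(D2[i, same])][:n_targets]
        impostors = diff[np.argsort(D2[i, diff])][:n_impostors]
        triplets += [(i, j, k) for j in targets for k in impostors]
    return triplets

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])
y = np.array([0] * 20 + [1] * 20)
C = build_triplets(X, y)   # 40 points x 3 targets x 10 impostors
```

Each resulting triplet (i, j, k) encodes the constraint that x_i should be closer to x_j than to x_k under the learned metric.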
We tune the regularization parameter on the validation data. For SCML-Global, we use a basis set of 400 elements for Vehicle, Vowel, Segment and BBC, and 1,000 elements for Letters and USPS.

⁴ For all compared methods, we use MATLAB code from the authors' websites. The MATLAB code for our methods is available at http://www-bcf.usc.edu/~bellet/.
⁵ We also experimented with a basis set based on local GLML metrics. Preliminary results were comparable to those obtained with the procedure above.
⁶ http://archive.ics.uci.edu/ml/
⁷ http://mlg.ucd.ie/datasets/bbc.html

Dataset    Euc          LMNN         BoostML      SCML-Global
Vehicle    29.7 ± 0.6   23.5 ± 0.7   19.9 ± 0.6   21.3 ± 0.6
Vowel      11.1 ± 0.4   10.8 ± 0.4   11.4 ± 0.4   10.9 ± 0.5
Segment    5.2 ± 0.2    4.6 ± 0.2    3.8 ± 0.2    4.1 ± 0.2
Letters    14.0 ± 0.2   11.6 ± 0.3   10.8 ± 0.2   9.0 ± 0.2
USPS       10.3 ± 0.2   4.1 ± 0.1    7.1 ± 0.2    4.1 ± 0.1
BBC        8.8 ± 0.3    4.0 ± 0.2    9.3 ± 0.3    3.9 ± 0.2
Avg. rank  3.3          2.0          2.3          1.2

Table 3: Comparison of SCML-Global against LMNN and BoostML (best in bold).

Dataset   BoostML   SCML-Global
Vehicle   334       164
Vowel     19        47
Segment   442       49
Letters   20        133
USPS      2,375     300
BBC       3,000     59

Table 4: Average number of basis elements in the solution.

Results. Table 2 shows misclassification rates with standard errors, where Euc is the Euclidean distance. The results show that SCML-Global performs similarly to Global-Frob on low-dimensional datasets but has a clear advantage when the dimensionality is high (USPS and BBC). This demonstrates that learning a sparse combination of basis elements is an effective way to reduce overfitting and improve generalization. SCML-Global is also faster to train than Global-Frob on these datasets (about 2x faster on USPS and 3x faster on BBC) because it does not require any PSD projection.
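For reference, the PSD projection that Global-Frob must perform after each update (and that SCML avoids entirely, since (1) is PSD by construction) amounts to eigenvalue clipping. A minimal NumPy sketch with an illustrative 2 × 2 matrix:

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping negative
    eigenvalues to zero. This eigendecomposition is the O(D^3) step that
    methods learning M directly must repeat throughout training."""
    eigval, eigvec = np.linalg.eigh((M + M.T) / 2)
    return (eigvec * np.maximum(eigval, 0.0)) @ eigvec.T

M = np.array([[1.0, 2.0],
              [2.0, 1.0]])        # indefinite: eigenvalues are 3 and -1
P = project_psd(M)
assert np.all(np.linalg.eigvalsh(P) >= -1e-9)
```

For D in the hundreds, repeating this eigendecomposition at every iteration dominates the training cost, which is consistent with the speedups reported above.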
5.1.2 Comparison to Other Global Algorithms

Setup. We now compare SCML-Global to two state-of-the-art global metric learning algorithms: Large Margin Nearest Neighbor (LMNN, Weinberger and Saul, 2009) and BoostML (Shen et al., 2012). The datasets, preprocessing and setting for SCML-Global are the same as in Section 5.1.1. LMNN uses 3 target neighbors and all impostors, while these are set to 3 and 10 respectively for BoostML (as in SCML-Global).

Results. Table 3 shows the average misclassification rates, along with standard errors and the average rank of each method across all datasets. SCML-Global clearly outperforms LMNN and BoostML, ranking first on 5 out of 6 datasets and achieving the overall highest rank. Furthermore, its training time is smaller than that of the competing methods, especially for high-dimensional data. For instance, on the BBC dataset, SCML-Global trained in about 90 seconds, which is about 20x faster than LMNN and 35x faster than BoostML. Note also that SCML-Global is consistently more accurate than a linear SVM, as shown in Appendix B.

Number of selected basis elements. Like SCML-Global, recall that BoostML is based on combining rank-one elements (see Section 4). The main difference with SCML-Global is that our method is given a set of locally discriminative metrics and picks the relevant ones by learning sparse weights, while BoostML generates a new basis element at each iteration and adds it to the current combination. Table 4 reports the number of basis elements used in the SCML-Global and BoostML solutions. Overall, SCML-Global uses fewer elements than BoostML (on two datasets it uses more, but this yields significantly better performance). The results on USPS and BBC also suggest that the number of basis elements selected by SCML-Global scales well with dimensionality. These nice properties come from its knowledge of the entire basis set and from the sparsity-inducing regularizer. On the contrary, the number of elements (and therefore iterations) needed by BoostML to converge seems to scale poorly with dimensionality.

5.2 Multi-task Metric Learning

Dataset. Sentiment Analysis (Blitzer et al., 2007) is a popular dataset for multi-task learning that consists of Amazon reviews on four product types: kitchen appliances, DVDs, books and electronics. Each product type is treated as a task and has 1,000 positive and 1,000 negative reviews. To reduce computational cost, we represent each review by a 200-dimensional feature vector, selecting the 200 words with the largest mutual information with the labels. We randomly split the dataset into training (800 samples), validation (400 samples) and test (400 samples) sets.

Task          st-Euc      st-LMNN     st-SCML     u-Euc       u-LMNN      u-SCML      mt-LMNN     mt-SCML
Books         33.5 ± 0.5  29.7 ± 0.4  27.0 ± 0.5  33.7 ± 0.5  29.6 ± 0.4  28.0 ± 0.4  29.1 ± 0.4  25.8 ± 0.4
DVD           33.9 ± 0.5  29.4 ± 0.5  26.8 ± 0.4  33.9 ± 0.5  29.4 ± 0.5  27.9 ± 0.5  29.5 ± 0.5  26.5 ± 0.5
Electronics   26.2 ± 0.4  23.3 ± 0.4  21.1 ± 0.5  29.1 ± 0.5  25.1 ± 0.4  22.9 ± 0.4  22.5 ± 0.4  20.2 ± 0.5
Kitchen       26.2 ± 0.6  21.2 ± 0.5  19.0 ± 0.4  27.7 ± 0.5  23.5 ± 0.3  21.9 ± 0.5  22.1 ± 0.5  19.0 ± 0.4
Avg. error    30.0 ± 0.2  25.9 ± 0.2  23.5 ± 0.2  31.1 ± 0.3  26.9 ± 0.2  25.2 ± 0.2  25.8 ± 0.2  22.9 ± 0.2
Avg. runtime  N/A         57 min      3 min       N/A         44 min      2 min       41 min      5 min

Table 5: Multi-task metric learning results.
Setup. We compare the following metrics: st-Euc (the Euclidean distance), st-LMNN and st-SCML (single-task LMNN and single-task SCML-Global, trained independently on each task), u-Euc (the Euclidean distance on the union of the training data from all tasks), u-LMNN (LMNN on the union), u-SCML (SCML-Global on the union), mt-LMNN (multi-task LMNN, Parameswaran and Weinberger, 2010) and, finally, our own multi-task method, mt-SCML. We tune the regularization parameters of mt-LMNN, st-SCML, u-SCML and mt-SCML on the validation sets. As in the previous experiment, the numbers of target neighbors and impostors for our methods are set to 3 and 10 respectively. We use a basis set of 400 elements per task for st-SCML, the union of these (1,600 elements) for mt-SCML, and 400 elements for u-SCML.

Results. Table 5 shows the results averaged over 20 random splits. First, notice that u-LMNN and u-SCML obtain significantly higher error rates than st-LMNN and st-SCML respectively, which suggests that the dataset may violate mt-LMNN's assumption that all tasks share a similar metric. Indeed, mt-LMNN does not outperform st-LMNN significantly. On the other hand, mt-SCML performs better than its single-task counterpart and than all other compared methods by a significant margin, demonstrating its ability to leverage commonalities between tasks that mt-LMNN is unable to capture. It is worth noting that the solution found by mt-SCML is based on only 273 basis elements on average (out of a total of 1,600), while st-SCML makes use of significantly more elements (347 elements per task on average). The basis elements selected by mt-SCML are evenly distributed across all tasks, which indicates that it is able to exploit meaningful information across tasks to obtain metrics that are both more accurate and more compact. Finally, note that our algorithms are about an order of magnitude faster.
| Dataset | MM-LMNN | GLML | PLML | SCML-Local |
| Vehicle | 23.1 ± 0.6 | 23.4 ± 0.6 | 22.8 ± 0.7 | 18.0 ± 0.6 |
| Vowel | 6.8 ± 0.3 | 4.1 ± 0.4 | 8.3 ± 0.4 | 6.1 ± 0.4 |
| Segment | 3.6 ± 0.2 | 3.9 ± 0.2 | 3.9 ± 0.2 | 3.6 ± 0.2 |
| Letters | 9.4 ± 0.3 | 10.3 ± 0.3 | 8.3 ± 0.2 | 8.3 ± 0.2 |
| USPS | 4.2 ± 0.7 | 7.8 ± 0.2 | 4.1 ± 0.1 | 3.6 ± 0.1 |
| BBC | 4.9 ± 0.4 | 5.7 ± 0.3 | 4.3 ± 0.2 | 4.1 ± 0.2 |
| Avg. rank | 2.0 | 2.7 | 2.0 | 1.2 |

Table 6: Local metric learning results (best in bold).

Figure 2: Illustrative experiment on digits 1, 2 and 3 of USPS in 2D; panels: (a) class membership, (b) trained metrics, (c) test metrics. Refer to the main text for details.

5.3 Local Metric Learning

Setup. We use the same datasets and preprocessing as for global metric learning. We compare SCML-Local to MM-LMNN (Weinberger and Saul, 2009), GLML (Noh et al., 2010) and PLML (Wang et al., 2012). The parameters of all methods are tuned on validation sets or set according to the authors' recommendations. MM-LMNN uses 3 target neighbors and all impostors, while these are set to 3 and 10 in PLML and SCML-Local. The number of anchor points in PLML is set to 20, as done by the authors. For SCML-Local, we use the same basis set as SCML-Global, and the embedding dimension $D_0$ is set to 40 for Vehicle, Vowel, Segment and BBC, and to 100 for Letters and USPS.

Results. Table 6 gives the error rates along with the average rank of each method across all datasets. Note that SCML-Local significantly improves upon SCML-Global on all but one dataset and achieves the best average rank. PLML does not perform well on small datasets (Vehicle and Vowel), presumably because there are not enough points to obtain a good estimate of the data manifold. GLML is fast but performs rather poorly on most datasets because its Gaussian assumption is restrictive and it learns the local metrics independently. Among discriminative methods, SCML-Local offers the best training time, especially for high-dimensional data (e.g.
on BBC, it trained in about 8 minutes, which is about 5x faster than MM-LMNN and 15x faster than PLML). Note that on this dataset, both MM-LMNN and PLML perform worse than SCML-Global due to severe overfitting, while SCML-Local avoids it by learning significantly fewer parameters. Finally, SCML-Local achieves accuracy results that are very competitive with those of a kernel SVM, as shown in Appendix B.

Visualization of the learned metrics. To provide a better understanding of why SCML-Local works well, we apply it to digits 1, 2 and 3 of USPS projected in 2D using t-SNE (van der Maaten and Hinton, 2008), as shown in Figure 2(a). We use 10 basis elements and $D_0 = 5$. Figure 2(b) shows the training points colored by their learned metric (based on the projection of the weight vectors in 1D using PCA). We see that the local metrics vary smoothly and are thereby robust to outliers. Unlike MM-LMNN, points within a class are allowed to have different metrics; in particular, this is useful for points near the decision boundary. While smooth, the variation in the weights is thus driven by discriminative information, unlike PLML where it is based only on the smoothness assumption. Finally, Figure 2(c) shows that the metrics generalize consistently to test data.

Figure 3: Effect of the number of bases on the Segment dataset. Left: number of selected basis elements vs. size of the basis set; right: test error rate vs. size of the basis set (SCML-Global, SCML-Local, Euc).

Effect of the basis set size. Figure 3 shows the number of selected basis elements and the test error rate of SCML-Global and SCML-Local as a function of the size of the basis set on Segment (results were consistent on other datasets).
The left pane shows that the number of selected elements increases sublinearly and eventually converges, while the right pane shows that the test error may be further reduced by using a larger basis set without significant overfitting, as suggested by our generalization bound (Theorem 1). Figure 3 also shows that SCML-Local generally selects more basis elements than SCML-Global, but notice that it can outperform SCML-Global even when the basis set is very small.

6 Conclusion

We proposed to learn metrics as sparse combinations of rank-one basis elements. This framework unifies several paradigms in metric learning, including global, local and multi-task learning. Of particular interest is our local metric learning algorithm, which can compute instance-specific metrics for both training and test points in a principled way. The soundness of our approach is supported theoretically by a generalization bound, and we showed in experimental studies that the proposed methods improve upon state-of-the-art algorithms in terms of accuracy and scalability.

Acknowledgements. This research is partially supported by IARPA via DoD/ARL contract #W911NF-12-C-0012 and DARPA via contract #D11AP00278. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, DARPA, or the U.S. Government.

References

A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.
A. Bellet and A. Habrard. Robustness and Generalization for Metric Learning. Technical report, arXiv:1209.1086, 2012.
A. Bellet, A. Habrard, and M. Sebban. A Survey on Metric Learning for Feature Vectors and Structured Data.
Technical report, arXiv:1306.6709, June 2013.
J. Blitzer, M. Dredze, and F. Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In ACL, 2007.
R. Caruana. Multitask Learning. Mach. Learn., 28(1):41–75, 1997.
C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011.
J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-theoretic metric learning. In ICML, 2007.
J. Duchi and Y. Singer. Efficient Online and Batch Learning Using Forward Backward Splitting. JMLR, 10:2899–2934, 2009.
A. Frome, Y. Singer, F. Sha, and J. Malik. Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification. In ICCV, 2007.
J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood Components Analysis. In NIPS, 2004.
P. Gong, J. Ye, and C. Zhang. Robust multi-task feature learning. In KDD, 2012.
S. Hauberg, O. Freifeld, and M. Black. A Geometric take on Metric Learning. In NIPS, 2012.
Y. Hong, Q. Li, J. Jiang, and Z. Tu. Learning a mixture of sparse distance metrics for classification and dimensionality reduction. In CVPR, 2011.
P. Jain, B. Kulis, I. Dhillon, and K. Grauman. Online Metric Learning and Fast Similarity Search. In NIPS, 2008.
A. Kolmogorov and V. Tikhomirov. ε-entropy and ε-capacity of sets in functional spaces. American Mathematical Society Translations, 2(17):277–364, 1961.
B. Kulis. Metric Learning: A Survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2012.
Y.-K. Noh, B.-T. Zhang, and D. Lee. Generative Local Metric Learning for Nearest Neighbor Classification. In NIPS, 2010.
S. Parameswaran and K. Weinberger. Large Margin Multi-Task Metric Learning. In NIPS, 2010.
D. Ramanan and S. Baker. Local Distance Functions: A Taxonomy, New Algorithms, and an Evaluation. TPAMI, 33(4):794–806, 2011.
B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput., 10(5):1299–1319, 1998.
C. Shen, J. Kim, L. Wang, and A. van den Hengel. Positive Semidefinite Metric Learning Using Boosting-like Algorithms. JMLR, 13:1007–1036, 2012.
Y. Shi, A. Bellet, and F. Sha. Sparse Compositional Metric Learning. In AAAI, 2014.
L. van der Maaten and G. Hinton. Visualizing Data using t-SNE. JMLR, 9:2579–2605, 2008.
J. Wang, A. Woznica, and A. Kalousis. Parametric Local Metric Learning for Nearest Neighbor Classification. In NIPS, 2012.
K. Weinberger and L. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. JMLR, 10:207–244, 2009.
L. Xiao. Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization. JMLR, 11:2543–2596, 2010.
E. Xing, A. Ng, M. Jordan, and S. Russell. Distance Metric Learning with Application to Clustering with Side-Information. In NIPS, 2002.
H. Xu and S. Mannor. Robustness and Generalization. Mach. Learn., 86(3):391–423, 2012.
X. Yang, S. Kim, and E. Xing. Heterogeneous multitask learning with joint sparsity constraints. In NIPS, 2009.
Y. Ying and P. Li. Distance Metric Learning with Eigenvalue Optimization. JMLR, 13:1–26, 2012.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. Roy. Statist. Soc. Ser. B, 68:49–67, 2006.
D.-C. Zhan, M. Li, Y.-F. Li, and Z.-H. Zhou. Learning instance specific distances using metric propagation. In ICML, 2009.

Appendix A: Detailed Analysis

In this section, we give the details of the derivation of the generalization bounds for the global and multi-task learning formulations given in Section 3.

A.1 Preliminaries

We start by introducing some notation. We are given a training sample $S = \{z_i = (\mathbf{x}_i, y_i)\}_{i=1}^n$ drawn i.i.d. from a distribution $P$ over the labeled space $Z = X \times Y$.
We assume that $\|\mathbf{x}\| \le R$ (for some convenient norm) for all $\mathbf{x} \in X$. We call a triplet $(z, z', z'')$ admissible if $y = y' \ne y''$. Let $C_S$ be the set of all admissible triplets built from instances in $S$.[8]

[8] When the training triplets consist of only a subset of all admissible triplets (which is often the case in practice), a relaxed version of the robustness property can be used to derive similar results (Bellet and Habrard, 2012). For simplicity, we focus here on the case where all admissible triplets are used.

Let $L(h, z, z', z'')$ be the loss suffered by some hypothesis $h$ on triplet $(z, z', z'')$, with the convention that $L$ returns 0 for non-admissible triplets. We assume $L$ to be uniformly upper-bounded by a constant $U$. The empirical loss $R^{C_S}_{emp}(h)$ of $h$ on $C_S$ is defined as
$$R^{C_S}_{emp}(h) = \frac{1}{|C_S|} \sum_{(z, z', z'') \in C_S} L(h, z, z', z''),$$
and its expected loss $R(h)$ over distribution $P$ as
$$R(h) = \mathbb{E}_{z, z', z'' \sim P}\, L(h, z, z', z'').$$
Our goal is to bound the deviation between $R(A_{C_S})$ and $R^{C_S}_{emp}(A_{C_S})$, where $A_{C_S}$ is the hypothesis learned by algorithm $A$ on $C_S$.

A.2 Algorithmic Robustness

To derive our generalization bounds, we use the recent framework of algorithmic robustness (Xu and Mannor, 2012), in particular its adaptation to pairwise and tripletwise loss functions used in metric learning (Bellet and Habrard, 2012). For the reader's convenience, we briefly review these main results below.

Algorithmic robustness is the ability of an algorithm to perform "similarly" on a training example and on a test example that are "close". The proximity of points is based on a partitioning of the space $Z$: two examples are close to each other if they lie in the same region. The partition is based on the notion of covering number (Kolmogorov and Tikhomirov, 1961).

Definition 1 (Covering number).
For a metric space $(\mathcal{S}, \rho)$ and $V \subset \mathcal{S}$, we say that $\hat{V} \subset V$ is a $\gamma$-cover of $V$ if for all $t \in V$ there exists $\hat{t} \in \hat{V}$ such that $\rho(t, \hat{t}) \le \gamma$. The $\gamma$-covering number of $V$ is
$$N(\gamma, V, \rho) = \min\big\{ |\hat{V}| : \hat{V} \text{ is a } \gamma\text{-cover of } V \big\}.$$

In particular, when $X$ is compact, $N(\gamma, X, \rho)$ is finite, leading to a finite cover. Then $Z$ can be partitioned into $|Y|\, N(\gamma, X, \rho)$ subsets such that if two examples $z = (\mathbf{x}, y)$ and $z' = (\mathbf{x}', y')$ belong to the same subset, then $y = y'$ and $\rho(\mathbf{x}, \mathbf{x}') \le \gamma$. The definition of robustness for tripletwise loss functions (adapted from Xu and Mannor, 2012) is as follows.

Definition 2 (Robustness for metric learning (Bellet and Habrard, 2012)). An algorithm $A$ is $(N, \epsilon(\cdot))$-robust for $N \in \mathbb{N}$ and $\epsilon(\cdot) : (Z \times Z)^n \to \mathbb{R}$ if $Z$ can be partitioned into $N$ disjoint sets, denoted $\{Q_i\}_{i=1}^N$, such that the following holds for all $S \in Z^n$: for all $(z_1, z_2, z_3) \in C_S$, all $z, z', z'' \in Z$ and all $i, j, k \in [N]$, if $z_1, z \in Q_i$, $z_2, z' \in Q_j$ and $z_3, z'' \in Q_k$, then
$$|L(A_{C_S}, z_1, z_2, z_3) - L(A_{C_S}, z, z', z'')| \le \epsilon(C_S),$$
where $A_{C_S}$ is the hypothesis learned by $A$ on $C_S$.

$N$ and $\epsilon(\cdot)$ quantify the robustness of the algorithm and depend on the training sample. Again adapting the result of Xu and Mannor (2012), Bellet and Habrard (2012) showed that a metric learning algorithm satisfying Definition 2 has the following generalization guarantees.

Theorem 2. If a learning algorithm $A$ is $(N, \epsilon(\cdot))$-robust and the training sample consists of the triplets $C_S$ obtained from a sample $S$ generated by $n$ i.i.d. draws from $P$, then for any $\delta > 0$, with probability at least $1 - \delta$ we have:
$$\big|R(A_{C_S}) - R^{C_S}_{emp}(A_{C_S})\big| \le \epsilon(C_S) + 3U \sqrt{\frac{N \ln 2 + \ln \frac{1}{\delta}}{0.5\, n}}.$$
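Definition 1 can be made concrete with a simple greedy construction (our own illustration, not part of the analysis; `greedy_cover` is a hypothetical helper). Since the covering number is the minimum size over all valid covers, the size of any cover the sketch returns is an upper bound on $N(\gamma, V, \rho)$.

```python
import math

def greedy_cover(points, gamma, dist=None):
    """Greedily build a gamma-cover of `points` (in the sense of Definition 1).

    Every point ends up within `gamma` of some center, and the centers are a
    subset of the points, so len(result) upper-bounds the covering number.
    """
    if dist is None:
        dist = math.dist  # Euclidean distance by default (Python 3.8+)
    centers = []
    for p in points:
        if all(dist(p, c) > gamma for c in centers):
            centers.append(p)  # p is not yet covered: promote it to a center
    return centers
```

For example, three collinear points at x = 0, 1, 2 are covered by two centers when gamma = 1.5, but need three centers when gamma = 0.5.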
As shown in (Bellet and Habrard, 2012), establishing the robustness of an algorithm is easier using the following theorem, which essentially says that if a metric learning algorithm has approximately the same loss on triplets that are close to each other, then it is robust.

Theorem 3. Fix $\gamma > 0$ and a metric $\rho$ of $Z$. Suppose that for all $z_1, z_2, z_3, z, z', z''$ such that $(z_1, z_2, z_3) \in C_S$, $\rho(z_1, z) \le \gamma$, $\rho(z_2, z') \le \gamma$ and $\rho(z_3, z'') \le \gamma$, $A$ satisfies
$$|L(A_{C_S}, z_1, z_2, z_3) - L(A_{C_S}, z, z', z'')| \le \epsilon(C_S),$$
and $N(\gamma/2, Z, \rho) < \infty$. Then the algorithm $A$ is $(N(\gamma/2, Z, \rho), \epsilon(C_S))$-robust.

We now have all the tools we need to prove the results of interest.

A.3 Generalization Bounds for SCML

A.3.1 Bound for SCML-Global

We first focus on SCML-Global, where the loss function is defined as follows:
$$L(\mathbf{w}, z, z', z'') = \big[\, 1 + d_{\mathbf{w}}(\mathbf{x}, \mathbf{x}') - d_{\mathbf{w}}(\mathbf{x}, \mathbf{x}'') \,\big]_+ .$$

We obtain a generalization bound by showing that SCML-Global satisfies Definition 2 using Theorem 3. To establish the result, we need a bound on the $\ell_2$ norm of the basis elements. Since they are obtained by eigenvalue decomposition, their norm is equal to (and thus bounded by) 1.

Let $\mathbf{w}^*$ be the optimal solution of SCML-Global. By optimality of $\mathbf{w}^*$ we have:
$$L(\mathbf{w}^*, z, z', z'') + \beta \|\mathbf{w}^*\|_1 \le L(\mathbf{0}, z, z', z'') + \beta \|\mathbf{0}\|_1 = 1,$$
thus we get $\|\mathbf{w}^*\|_1 \le 1/\beta$. Let $M^* = \sum_{i=1}^K w_i^* \mathbf{b}_i \mathbf{b}_i^T$ be the learned metric. Then, using Hölder's inequality and the bounds on $\mathbf{w}^*$ and the $\mathbf{b}_i$'s:
$$\|M^*\|_1 = \Big\| \sum_{i=1}^K w_i^* \mathbf{b}_i \mathbf{b}_i^T \Big\|_1 = \Big\| \sum_{i : w_i^* \ne 0} w_i^* \mathbf{b}_i \mathbf{b}_i^T \Big\|_1 \le \|\mathbf{w}^*\|_1 \sum_{i : w_i^* \ne 0} \|\mathbf{b}_i\|_\infty \|\mathbf{b}_i\|_\infty \le K^*/\beta,$$
where $K^* \le K$ is the number of nonzero entries in $\mathbf{w}^*$.

Using Definition 1, we can partition $Z$ into $|Y|\, N(\gamma, X, \rho)$ subsets such that if two examples $z = (\mathbf{x}, y)$ and $z' = (\mathbf{x}', y')$ belong to the same subset, then $y = y'$ and $\rho(\mathbf{x}, \mathbf{x}') \le \gamma$.
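Since $M^* = \sum_i w_i^* \mathbf{b}_i \mathbf{b}_i^T$, the distance $d_{\mathbf{w}}$ and the hinge loss above can be evaluated directly from the basis weights without ever forming $M^*$. The sketch below is our own minimal illustration (helper names `d_w` and `triplet_hinge_loss` are ours), assuming vectors are plain Python tuples.

```python
def d_w(w, bases, x, xp):
    """Squared distance under M = sum_i w_i * b_i b_i^T, computed as
    sum_i w_i * (b_i . (x - xp))^2 without forming M explicitly."""
    diff = [a - b for a, b in zip(x, xp)]
    return sum(
        wi * sum(bi_j * d_j for bi_j, d_j in zip(b, diff)) ** 2
        for wi, b in zip(w, bases)
    )

def triplet_hinge_loss(w, bases, x, x_sim, x_dis):
    """[1 + d_w(x, x_sim) - d_w(x, x_dis)]_+ : the SCML-Global loss on one
    admissible triplet (x and x_sim share a label, x_dis does not)."""
    return max(0.0, 1.0 + d_w(w, bases, x, x_sim) - d_w(w, bases, x, x_dis))
```

With a single basis b = (1, 0) and weight 1, the triplet ((0,0), (1,0), (3,0)) incurs zero loss because the dissimilar point is far enough beyond the unit margin, while a dissimilar point at (1.2, 0) incurs a positive loss.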
Now, for $z_1, z_2, z_3, z'_1, z'_2, z'_3 \in Z$, if $y_1 = y'_1$, $\|\mathbf{x}_1 - \mathbf{x}'_1\|_1 \le \gamma$, $y_2 = y'_2$, $\|\mathbf{x}_2 - \mathbf{x}'_2\|_1 \le \gamma$, $y_3 = y'_3$ and $\|\mathbf{x}_3 - \mathbf{x}'_3\|_1 \le \gamma$, then $(z_1, z_2, z_3)$ and $(z'_1, z'_2, z'_3)$ are either both admissible or both non-admissible triplets. In the non-admissible case, by definition their respective losses are 0, and so is the deviation between them. In the admissible case we have:
$$
\begin{aligned}
&\Big| \big[1 + d_{\mathbf{w}^*}(\mathbf{x}_1, \mathbf{x}_2) - d_{\mathbf{w}^*}(\mathbf{x}_1, \mathbf{x}_3)\big]_+ - \big[1 + d_{\mathbf{w}^*}(\mathbf{x}'_1, \mathbf{x}'_2) - d_{\mathbf{w}^*}(\mathbf{x}'_1, \mathbf{x}'_3)\big]_+ \Big| \\
\le\; &\big| (\mathbf{x}_1 - \mathbf{x}_2)^T M^* (\mathbf{x}_1 - \mathbf{x}_2) - (\mathbf{x}_1 - \mathbf{x}_3)^T M^* (\mathbf{x}_1 - \mathbf{x}_3) + (\mathbf{x}'_1 - \mathbf{x}'_3)^T M^* (\mathbf{x}'_1 - \mathbf{x}'_3) - (\mathbf{x}'_1 - \mathbf{x}'_2)^T M^* (\mathbf{x}'_1 - \mathbf{x}'_2) \big| \\
=\; &\big| (\mathbf{x}_1 - \mathbf{x}_2)^T M^* (\mathbf{x}_1 - \mathbf{x}_2) - (\mathbf{x}_1 - \mathbf{x}_2)^T M^* (\mathbf{x}'_1 - \mathbf{x}'_2) + (\mathbf{x}_1 - \mathbf{x}_2)^T M^* (\mathbf{x}'_1 - \mathbf{x}'_2) - (\mathbf{x}'_1 - \mathbf{x}'_2)^T M^* (\mathbf{x}'_1 - \mathbf{x}'_2) \\
&\;\; + (\mathbf{x}'_1 - \mathbf{x}'_3)^T M^* (\mathbf{x}'_1 - \mathbf{x}'_3) - (\mathbf{x}'_1 - \mathbf{x}'_3)^T M^* (\mathbf{x}_1 - \mathbf{x}_3) + (\mathbf{x}'_1 - \mathbf{x}'_3)^T M^* (\mathbf{x}_1 - \mathbf{x}_3) - (\mathbf{x}_1 - \mathbf{x}_3)^T M^* (\mathbf{x}_1 - \mathbf{x}_3) \big| \\
=\; &\big| (\mathbf{x}_1 - \mathbf{x}_2)^T M^* \big(\mathbf{x}_1 - \mathbf{x}_2 - (\mathbf{x}'_1 - \mathbf{x}'_2)\big) + \big(\mathbf{x}_1 - \mathbf{x}_2 - (\mathbf{x}'_1 - \mathbf{x}'_2)\big)^T M^* (\mathbf{x}'_1 - \mathbf{x}'_2) \\
&\;\; + (\mathbf{x}'_1 - \mathbf{x}'_3)^T M^* \big(\mathbf{x}'_1 - \mathbf{x}'_3 - (\mathbf{x}_1 - \mathbf{x}_3)\big) + \big(\mathbf{x}'_1 - \mathbf{x}'_3 - (\mathbf{x}_1 - \mathbf{x}_3)\big)^T M^* (\mathbf{x}_1 - \mathbf{x}_3) \big| \\
\le\; &| (\mathbf{x}_1 - \mathbf{x}_2)^T M^* (\mathbf{x}_1 - \mathbf{x}'_1) | + | (\mathbf{x}_1 - \mathbf{x}_2)^T M^* (\mathbf{x}'_2 - \mathbf{x}_2) | + | (\mathbf{x}_1 - \mathbf{x}'_1)^T M^* (\mathbf{x}'_1 - \mathbf{x}'_2) | + | (\mathbf{x}'_2 - \mathbf{x}_2)^T M^* (\mathbf{x}'_1 - \mathbf{x}'_2) | \\
&\;\; + | (\mathbf{x}'_1 - \mathbf{x}'_3)^T M^* (\mathbf{x}'_1 - \mathbf{x}_1) | + | (\mathbf{x}'_1 - \mathbf{x}'_3)^T M^* (\mathbf{x}_3 - \mathbf{x}'_3) | + | (\mathbf{x}'_1 - \mathbf{x}_1)^T M^* (\mathbf{x}_1 - \mathbf{x}_3) | + | (\mathbf{x}_3 - \mathbf{x}'_3)^T M^* (\mathbf{x}_1 - \mathbf{x}_3) | \\
\le\; &\|\mathbf{x}_1 - \mathbf{x}_2\|_\infty \|M^*\|_1 \|\mathbf{x}_1 - \mathbf{x}'_1\|_1 + \|\mathbf{x}_1 - \mathbf{x}_2\|_\infty \|M^*\|_1 \|\mathbf{x}'_2 - \mathbf{x}_2\|_1 + \|\mathbf{x}_1 - \mathbf{x}'_1\|_1 \|M^*\|_1 \|\mathbf{x}'_1 - \mathbf{x}'_2\|_\infty + \|\mathbf{x}'_2 - \mathbf{x}_2\|_1 \|M^*\|_1 \|\mathbf{x}'_1 - \mathbf{x}'_2\|_\infty \\
&\;\; + \|\mathbf{x}'_1 - \mathbf{x}'_3\|_\infty \|M^*\|_1 \|\mathbf{x}'_1 - \mathbf{x}_1\|_1 + \|\mathbf{x}'_1 - \mathbf{x}'_3\|_\infty \|M^*\|_1 \|\mathbf{x}_3 - \mathbf{x}'_3\|_1 + \|\mathbf{x}'_1 - \mathbf{x}_1\|_1 \|M^*\|_1 \|\mathbf{x}_1 - \mathbf{x}_3\|_\infty + \|\mathbf{x}_3 - \mathbf{x}'_3\|_1 \|M^*\|_1 \|\mathbf{x}_1 - \mathbf{x}_3\|_\infty \\
\le\; &\frac{16 \gamma R K^*}{\beta},
\end{aligned}
$$
by the 1-Lipschitzness of the hinge loss, Hölder's inequality and the bounds on the quantities involved. Thus SCML-Global is $\big(|Y|\, N(\gamma, X, \|\cdot\|_1),\ \frac{16 \gamma R K^*}{\beta}\big)$-robust and the generalization bound follows.

A.3.2 Bound for mt-SCML

In the multi-task setting, we are given training samples $S_t = \{z_i^t = (\mathbf{x}_i^t, y_i^t)\}_{i=1}^{n_t}$. Let $C_{S_t}$ be the set of all admissible triplets built from instances in $S_t$, and let $W^*$ be the optimal solution of mt-SCML. Using the same arguments as for SCML-Global, by optimality of $W^*$ we have $\|W^*\|_{2,1} \le 1/\beta$. Let $M_t^* = \sum_{i=1}^K W_{ti}^* \mathbf{b}_i \mathbf{b}_i^T$ be the learned metric for task $t$ and $W_t^*$ be the weight vector for task $t$, corresponding to the $t$-th row of $W^*$. Then, using the facts that $\|W_t^*\|_{2,1} \le \|W^*\|_{2,1}$ and $\|\mathbf{b}\|_{2,1} = \|\mathbf{b}\|_2$, we have:
$$\|M_t^*\|_{2,1} = \Big\| \sum_{i=1}^K W_{ti}^* \mathbf{b}_i \mathbf{b}_i^T \Big\|_{2,1} = \Big\| \sum_{i : W_{ti}^* \ne 0} W_{ti}^* \mathbf{b}_i \mathbf{b}_i^T \Big\|_{2,1} \le \|W_t^*\|_{2,1} \sum_{i : W_{ti}^* \ne 0} \|\mathbf{b}_i\|_{2,1} \|\mathbf{b}_i\|_{2,1} \le K_t^*/\beta,$$
where $K_t^* \le K$ is the number of nonzero entries in $W_t^*$. From this we can derive a generalization bound for each task using arguments similar to the global case, with a partition specific to each task defined with respect to $\|\cdot\|_2$.
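For intuition on the regularizer used here, the mixed $(2,1)$-norm of a task-by-basis weight matrix $W$ can be sketched as follows. This is our own illustration (the helper name is ours), assuming the groups are the per-basis columns of $W$, so that a column is zeroed out for all tasks at once.

```python
import math

def mixed_norm_21(W):
    """l2,1 norm of a task-by-basis weight matrix W (list of rows): the sum,
    over basis elements (columns), of the l2 norm of that column across tasks.
    Penalizing it encourages entire columns to vanish, i.e. basis elements
    discarded jointly by all tasks."""
    n_cols = len(W[0])
    return sum(
        math.sqrt(sum(row[j] ** 2 for row in W))
        for j in range(n_cols)
    )
```

For instance, a 2-task matrix whose second column is identically zero pays only for the first column, which is the group-sparsity effect exploited by mt-SCML.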
Without loss of generality, we focus on task $t$ and only write the last derivations explicitly, as the beginning is the same as above:
$$
\begin{aligned}
&\Big| \big[1 + d_{\mathbf{w}^*}(\mathbf{x}_1, \mathbf{x}_2) - d_{\mathbf{w}^*}(\mathbf{x}_1, \mathbf{x}_3)\big]_+ - \big[1 + d_{\mathbf{w}^*}(\mathbf{x}'_1, \mathbf{x}'_2) - d_{\mathbf{w}^*}(\mathbf{x}'_1, \mathbf{x}'_3)\big]_+ \Big| \\
\le\; &| (\mathbf{x}_1 - \mathbf{x}_2)^T M_t^* (\mathbf{x}_1 - \mathbf{x}'_1) | + | (\mathbf{x}_1 - \mathbf{x}_2)^T M_t^* (\mathbf{x}'_2 - \mathbf{x}_2) | + | (\mathbf{x}_1 - \mathbf{x}'_1)^T M_t^* (\mathbf{x}'_1 - \mathbf{x}'_2) | + | (\mathbf{x}'_2 - \mathbf{x}_2)^T M_t^* (\mathbf{x}'_1 - \mathbf{x}'_2) | \\
&\;\; + | (\mathbf{x}'_1 - \mathbf{x}'_3)^T M_t^* (\mathbf{x}'_1 - \mathbf{x}_1) | + | (\mathbf{x}'_1 - \mathbf{x}'_3)^T M_t^* (\mathbf{x}_3 - \mathbf{x}'_3) | + | (\mathbf{x}'_1 - \mathbf{x}_1)^T M_t^* (\mathbf{x}_1 - \mathbf{x}_3) | + | (\mathbf{x}_3 - \mathbf{x}'_3)^T M_t^* (\mathbf{x}_1 - \mathbf{x}_3) | \\
\le\; &\|\mathbf{x}_1 - \mathbf{x}_2\|_2 \|M_t^*\|_F \|\mathbf{x}_1 - \mathbf{x}'_1\|_2 + \|\mathbf{x}_1 - \mathbf{x}_2\|_2 \|M_t^*\|_F \|\mathbf{x}'_2 - \mathbf{x}_2\|_2 + \|\mathbf{x}_1 - \mathbf{x}'_1\|_2 \|M_t^*\|_F \|\mathbf{x}'_1 - \mathbf{x}'_2\|_2 + \|\mathbf{x}'_2 - \mathbf{x}_2\|_2 \|M_t^*\|_F \|\mathbf{x}'_1 - \mathbf{x}'_2\|_2 \\
&\;\; + \|\mathbf{x}'_1 - \mathbf{x}'_3\|_2 \|M_t^*\|_F \|\mathbf{x}'_1 - \mathbf{x}_1\|_2 + \|\mathbf{x}'_1 - \mathbf{x}'_3\|_2 \|M_t^*\|_F \|\mathbf{x}_3 - \mathbf{x}'_3\|_2 + \|\mathbf{x}'_1 - \mathbf{x}_1\|_2 \|M_t^*\|_F \|\mathbf{x}_1 - \mathbf{x}_3\|_2 + \|\mathbf{x}_3 - \mathbf{x}'_3\|_2 \|M_t^*\|_F \|\mathbf{x}_1 - \mathbf{x}_3\|_2 \\
\le\; &\frac{16 \gamma R K_t^*}{\beta},
\end{aligned}
$$
where we used the same arguments as above together with the inequality $\|M_t^*\|_F \le \|M_t^*\|_{2,1}$. Thus mt-SCML is $\big(|Y|\, N(\gamma, X, \|\cdot\|_2),\ \frac{16 \gamma R K_t^*}{\beta}\big)$-robust and the bound for task $t$ follows. Note that the number of training examples in the bound is only that of task $t$, i.e., $n = n_t$.

| Dataset | Linear SVM | Kernel SVM | SCML-Global | SCML-Local |
| Vehicle | 21.4 ± 0.6 | 16.6 ± 0.8 | 21.3 ± 0.6 | 18.0 ± 0.6 |
| Vowel | 24.3 ± 0.7 | 4.4 ± 0.4 | 10.9 ± 0.5 | 6.1 ± 0.4 |
| Segment | 5.1 ± 0.2 | 3.6 ± 0.2 | 4.1 ± 0.2 | 3.6 ± 0.2 |
| Letters | 19.5 ± 0.4 | 8.8 ± 0.2 | 9.0 ± 0.2 | 8.3 ± 0.2 |
| USPS | 6.5 ± 0.2 | 4.2 ± 0.1 | 4.1 ± 0.1 | 3.6 ± 0.1 |
| BBC | 4.3 ± 0.2 | 3.8 ± 0.2 | 3.9 ± 0.2 | 4.1 ± 0.2 |
| Avg. rank | 3.8 | 1.3 | 2.3 | 1.2 |

Table 7: Comparison of SCML against linear and kernel SVM (best in bold).

A.3.3 Comments on SCML-Local

It would be interesting to derive a similar bound for SCML-Local. Unfortunately, as its formulation is nonconvex, we cannot assume optimality of the solution.
If a similar formulation can be made convex, the same proof technique should apply: even though each instance has its own metric, the metric essentially depends on the instance itself (whose norm is bounded) and on learned parameters shared across metrics (which could be bounded using the optimality of the solution). Deriving such a convex formulation and the corresponding generalization bound is left for future work.

Appendix B: Experimental Comparison with Support Vector Machines

In this section, we compare SCML-Global and SCML-Local to Support Vector Machines (SVM) with a linear and an RBF kernel. We used the software LIBSVM (Chang and Lin, 2011) and tuned the parameter C, as well as the bandwidth of the RBF kernel, on the validation set. Table 7 shows misclassification rates averaged over 20 random splits, along with standard errors and the average rank of each method across all datasets. First, we can see that SCML-Global consistently performs better than linear SVM. Second, SCML-Local is competitive with kernel SVM. These results show that a simple k-nearest neighbor strategy with a good metric can be competitive with (and even outperform) SVM classifiers.