Ground Metric Learning
Marco Cuturi, David Avis
Kyoto University
{mcuturi,davis}@i.kyoto-u.ac.jp
September 8, 2021

Abstract

Transportation distances have been used for more than a decade now in machine learning to compare histograms of features. They have one parameter: the ground metric, which can be any metric between the features themselves. As is the case for all parameterized distances, transportation distances can only prove useful in practice when this parameter is carefully chosen. To date, the only option available to practitioners to set the ground metric parameter was to rely on a priori knowledge of the features, which limited considerably the scope of application of transportation distances. We propose to lift this limitation and consider instead algorithms that can learn the ground metric using only a training set of labeled histograms. We call this approach ground metric learning. We formulate the problem of learning the ground metric as the minimization of the difference of two polyhedral convex functions over a convex set of distance matrices. We follow the presentation of our algorithms with promising experimental results on binary classification tasks using GIST descriptors of images taken in the Caltech-256 set.

1 Introduction

We consider in this paper the problem of supervised metric learning on normalized histograms. Normalized histograms arise frequently in natural language processing, computer vision, bioinformatics and, more generally, areas involving complex datatypes. Objects of interest in such areas are usually simplified and each represented as a bag of smaller features. The occurrence frequencies of each of these features in the considered object can then be represented as a histogram. For instance, the representation of images as histograms of pixel colors, SIFT or GIST features (Lowe 1999, Oliva and Torralba 2001, Douze et al. 2009); texts as bags-of-words or topic allocations (Joachims 2002, Blei et al. 2003, Blei and Lafferty 2009); sequences as n-gram counts (Leslie et al. 2002); and graphs as histograms of subgraphs (Kashima et al. 2003) all follow this principle.

Various distances have been proposed in the statistics and machine learning literatures to compare two histograms (Deza and Deza 2009, §14), (Rachev 1991). Our focus in this paper is on the family of transportation distances, which is both well motivated theoretically (Villani 2003, §7), (Rachev 1991, §5) and works well empirically (Rubner et al. 1997; 2000, Pele and Werman 2009). Transportation distances are particularly popular in computer vision, where, after the influential work of Rubner et al. (1997), they were called Earth Mover's Distances (EMD).

Transportation distances in machine learning can be thought of as meta-distances that build upon a metric on the features to form a distance on histograms of features. Such a metric, which is known in the computer vision literature as the ground metric, is the unique parameter of transportation distances. (Since the terms metric and distance are interchangeable mathematically speaking, we will always use the term metric for a metric between features and the term distance for the resulting transportation distance between histograms, or more generally any other distance on histograms.) In their seminal paper, Rubner et al. (2000) argue that, "in general, the ground distance can be any distance and will be chosen according to the problem at hand". To our knowledge, the ground metric has always been considered a priori in all applications of EMD in machine learning.
To be more precise, EMD has only been applied to datasets where such a metric was available and motivated by prior knowledge. We argue that this is problematic in two senses: first, this restriction limits the application of transportation distances to problems where such knowledge exists. Second, even when such a priori knowledge is available, we argue that there cannot be a "universal" ground metric that will be suitable for all learning problems involving histograms on such features. As with all parameters in machine learning algorithms, the ground metric should be selected adaptively. Our goal in this paper is to propose ground metric learning algorithms to do so.

This paper is organized as follows. After providing some background on transportation distances in Section 2, we propose in Section 3 a criterion, a difference of two convex functions, to select a ground metric given a training set of histograms and a similarity measure between these histograms. We then show how to obtain local minima of that criterion using a subgradient descent algorithm in Section 4. We propose different starting points to initialize this descent in Section 5. We provide a review of other relevant distances and metric learning techniques in Section 6, in particular Mahalanobis metric learning techniques (Xing et al. 2003, Weinberger et al. 2006, Weinberger and Saul 2009, Davis et al. 2007), which have inspired much of this work. We provide empirical evidence in Section 7 that the distances proposed in this paper compare favorably to competing techniques. We conclude this paper in Section 8 by providing a few research avenues.

Notations

We use upper case letters A, B, ... for d × d matrices. Bold upper case letters A, B, ... are used for larger matrices; lower case letters r, c, ... are used for scalar numbers or vectors of R^d. An upper case letter M and its bold lower case m stand for the same matrix written either in d × d matrix form or in d²-dimensional vector form, obtained by stacking successively all its column vectors from the left-most on top to the right-most at the bottom. The notations \overline{m} and \underline{m} stand respectively for the strict upper and lower triangular parts of M expressed as vectors of size d². The order in which these elements are enumerated must be coherent, in the sense that the upper triangular part of M^T expressed as a vector must be equal to \underline{m}. Finally, we use the Frobenius dot-product for both matrix and vector representations, written as ⟨A, B⟩ := tr(A^T B) = a^T b.
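As a minimal illustration of this convention (our sketch, not part of the original text): with column-stacking, the Frobenius dot-product of two matrices equals the ordinary dot-product of their vectorized forms.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.random((4, 4)), rng.random((4, 4))
a = A.reshape(-1, order="F")   # column-stacking: columns laid out from top to bottom
b = B.reshape(-1, order="F")
assert np.isclose(np.trace(A.T @ B), a @ b)   # <A, B> = tr(A^T B) = a^T b
```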
2 Optimal Transportation Between Histograms

We recall in this section a few basic facts about mass transportation for two histograms. A more general and technical introduction is provided by Villani (2003, Introduction & §7); practical insights and motivation for its application in machine learning can be found in Rubner et al. (2000); a recent review of different extensions and particular cases of EMD can be found in (Pele and Werman 2009, §2).

2.1 Transportation Polytopes

For two scalar histograms r and c of sum 1 and dimension d, represented in the following as column vectors r = (r_1, ..., r_d)^T and c = (c_1, ..., c_d)^T of the canonical simplex Σ_{d−1} = { u ∈ R_+^d : ||u||_1 = 1 } of dimension d − 1, the polytope U(r, c) of transportation plans that map r to c is the set of d × d matrices with coefficients in R_+ such that their row and column marginals are equal to r and c respectively, that is, writing 1_d ∈ R^d for the column vector of ones,

    U(r, c) = { F ∈ R_+^{d×d} : F 1_d = r, F^T 1_d = c }.

U(r, c) is a polytope of dimension d² − 2d + 1 in the general case where r and c have positive coordinates. U(r, c) is also known in the operations research and statistical literatures as the set of transportation plans (Rachev and Rüschendorf 1998) and as contingency tables or two-way tables with fixed margins (Diaconis and Efron 1985).

Given two histograms r and c, we define the following function of a d × d real matrix A:

    G_rc(A) := min_{X ∈ U(r,c)} ⟨A, X⟩.   (1)

Equation (1) describes a linear program whose feasible set is defined by r and c and whose cost is parameterized by A. G_rc is a positively homogeneous function, that is G_rc(tA) = t G_rc(A) for t ≥ 0. When M is a matrix taken in the pointed, convex and polyhedral cone of metric matrices,

    M := { M ∈ R^{d×d} : ∀ 1 ≤ i, j, k ≤ d, M_ii = 0, M_ij = M_ji, M_ij ≤ M_ik + M_kj } ⊂ R_+^{d×d},

the quantity G_rc(M) is known as the Kantorovich-Rubinstein distance (Villani 2003, §7) between r and c. To highlight the fact that G_rc(M) can also be seen as the evaluation of a function of r and c parameterized by M, we will use the notation d_M(r, c) := G_rc(M).

2.2 Transportation Distances

The function d_M : Σ_{d−1} × Σ_{d−1} → R parameterized by M has the following properties: since M has a null diagonal, d_M(r, r) is always zero; by nonnegativity of M, d_M(r, c) ≥ 0; by symmetry of M, d_M is itself a symmetric function of its two arguments. More generally, d_M is a distance between histograms whenever M is itself a metric, namely whenever M ∈ M (Villani 2003, Theo. 7.3). The distance d_M bears many names and has many variations: 1-Wasserstein, Monge-Kantorovich, Mallows' (Mallows 1972, Levina and Bickel 2001), Earth Mover's (Rubner et al. 2000) in vision applications. Rubner et al. (2000) and more recently Pele and Werman (2009) have also proposed to extend the transportation distance to compare unnormalized histograms. Simply put, these extensions compute a distance between two unnormalized histograms u and v by combining any difference in the total mass of u and v with the optimal transportation plan that can carry the whole mass of u onto v if ||u||_1 ≤ ||v||_1, or of v onto u if ||v||_1 ≤ ||u||_1. We will not consider such extensions in this work; we believe however that the approaches proposed later in this paper can be extended to handle EMD distances for unnormalized histograms.
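To make the cone of metric matrices concrete, the following small sketch (ours) checks the three defining conditions for a candidate matrix:

```python
import numpy as np

def is_metric_matrix(M, tol=1e-9):
    """Check the conditions defining the cone of metric matrices: zero diagonal,
    symmetry, nonnegativity and all triangle inequalities M_ij <= M_ik + M_kj."""
    if not np.allclose(np.diag(M), 0.0, atol=tol):
        return False
    if not np.allclose(M, M.T, atol=tol):
        return False
    if (M < -tol).any():
        return False
    # triangle[i, j, k] = M[i, k] + M[k, j]; require M[i, j] <= triangle[i, j, k] for all k
    triangle = M[:, None, :] + M.T[None, :, :]
    return bool((M[:, :, None] <= triangle + tol).all())
```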
d_M can be computed as the solution of the following linear program (LP),

    d_M(r, c) = minimize   m^T x
                subject to A x = [r; c]_*,   x ≥ 0,   (P0)

where A is the (2d − 1) × d² matrix that encodes the row-sum and column-sum constraints for X to be in U(r, c),

    A = [ 1_{1×d} ⊗ I_d ; I_d ⊗ 1_{1×d} ]_* ,

⊗ is Kronecker's product, and the lower subscript ·_* applied to a matrix (resp. a vector) means that its last line (resp. element) has been removed. This modification is carried out to make sure that all constraints described by A are independent, or equivalently that A^T is not rank deficient. This LP can be solved using the network simplex (Ford and Fulkerson 1962) or through more specialized network flow algorithms (Ahuja et al. 1993, §9).
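For illustration, the sketch below (ours; the paper's implementation uses a network simplex through the CPLEX Matlab API instead) solves (P0) with scipy's generic LP routine. It uses a row-major flattening of X rather than the column-stacking convention above, builds the marginal constraints with Kronecker products, and drops the last redundant constraint; the "highs" method assumes a reasonably recent scipy.

```python
import numpy as np
from scipy.optimize import linprog

def transport_distance(M, r, c):
    """d_M(r, c) and an optimal transportation plan, via the LP (P0)."""
    d = len(r)
    # row-sum constraints (-> r) and column-sum constraints (-> c) on x = X.reshape(-1)
    A = np.vstack([np.kron(np.eye(d), np.ones((1, d))),
                   np.kron(np.ones((1, d)), np.eye(d))])[:-1]   # drop the last, redundant row
    b = np.concatenate([r, c])[:-1]
    res = linprog(M.reshape(-1), A_eq=A, b_eq=b, bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(d, d)

# tiny sanity check: the optimal plan has marginals r and c
rng = np.random.default_rng(0)
r = rng.random(5); r /= r.sum()
c = rng.random(5); c /= c.sum()
M = np.abs(np.arange(5)[:, None] - np.arange(5)[None, :]).astype(float)  # a simple ground metric
cost, X = transport_distance(M, r, c)
assert np.allclose(X.sum(axis=1), r) and np.allclose(X.sum(axis=0), c)
```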
2.3 Properties of G_rc

Because its feasible set U(r, c) is a bounded polytope and its objective is linear, Problem (P0) has an optimal solution in the finite set Ex(r, c) of extreme points of U(r, c). G_rc is thus the minimum of a finite set of linear functions and is by extension piecewise linear and concave (Boyd and Vandenberghe 2004, §3.2.3). Its gradient is equal to ∇G_rc = x* whenever an optimal solution x* of Problem (P0) is unique (Bertsimas and Tsitsiklis 1997, §5.4). More generally, and regardless of the uniqueness of x*, any optimal solution x* of Problem (P0) is in the sub-differential ∂G_rc(M) of G_rc at M (Bertsimas and Tsitsiklis 1997, Lem. 11.4). We use this property later in Section 4 when we optimize the criteria considered in the section below.

3 Criteria for Ground Metric Learning

We define in this section a family of criteria to quantify the relevance of a ground metric for a given task, using a training dataset of histograms with additional information.

3.1 Training Set: Histograms and Side Information

Suppose now that we are given a family (r_1, ..., r_n) of histograms in the canonical simplex with a corresponding similarity matrix [ω_ij]_{1≤i,j≤n} ∈ R^{n×n} which quantifies how similar r_i and r_j are: ω_ij is large and positive whenever r_i and r_j describe similar objects, and small and negative for dissimilar objects. We assume that this similarity is symmetric, ω_ij = ω_ji. The similarity of an object with itself will not be considered in the following, so we simply set ω_ii = 0 for 1 ≤ i ≤ n.

Let us give more intuition on how these weights ω_ij may be set in practice. In the simplest case, these weights may reflect a class taxonomy and be set to ω_ij > 0 whenever r_i and r_j come from the same class and ω_ij < 0 for two different classes. This is the setting we consider in our experiments later in this paper. Such weights may also be inferred from a hierarchical taxonomy: each weight ω_ij corresponding to two histograms could for instance reflect how close the respective classes of these histograms lie in the tree of classes.

Let us introduce more notation before moving on to the next section. Since ω_ij = ω_ji and G_{r_i r_j} = G_{r_j r_i}, we restrict the set of pairs of indices (i, j) to I := { (i, j) : i, j ∈ {1, ..., n}, i < j }. We also introduce two subsets of I,

    E+ := { (i, j) ∈ I : ω_ij > 0 },   E− := { (i, j) ∈ I : ω_ij < 0 },

the subsets of similar and dissimilar histograms. Finally, we write G_ij for the G_{r_i r_j} functions.

3.2 A Local Criterion to Select the Ground Metric

We propose to formulate the ground metric learning problem as finding a metric M ∈ M such that the transportation distance d_M induced by this metric agrees with the weights ω. More precisely, this criterion will favor metrics for which, for a given pair of similar histograms r_i and r_j (namely ω_ij > 0), the resulting distance G_ij(M) is small. Conversely, for a given pair of dissimilar histograms r_i and r_j (namely ω_ij < 0), the resulting distance G_ij(M) should be large. The criterion should balance these two requirements and, in particular, favor ground metrics for which these two ideas hold for pairs (i, j) such that |ω_ij| is large.

From a formal perspective, any criterion to select M should consider the family of couples { (ω_ij, G_ij(M)) }_{(i,j)∈I}. Since the ordering of the histograms should not influence the criterion, only functions that are symmetric with respect to these couples should be considered. We propose in this paper a family of simple criteria: the average value of ω_ij d_M(r_i, r_j),

    C_∞(M) := 2 Σ_{(i,j)∈I} ω_ij G_ij(M),

and a restriction of such an average to neighboring points,

    C_k(M) := Σ_{i=1}^n [ S^+_{ik}(M) + S^−_{ik}(M) ],

where, for each index i, the weighted sums of distances to its similar and dissimilar neighbors are considered respectively in

    S^+_{ik}(M) := Σ_{j∈N^+_{ik}} ω_ij G_ij(M),   S^−_{ik}(M) := Σ_{j∈N^−_{ik}} ω_ij G_ij(M).   (2)

These sums are computed using the sets N^+_{ik} and N^−_{ik}, which stand for the indices of any k nearest neighbors of r_i with respect to the distance d_M (not necessarily unique), and whose indices are taken respectively in the subsets E_i^+ := { j : (i, j) or (j, i) ∈ E+ } and E_i^− := { j : (i, j) or (j, i) ∈ E− }. We adopt the convention that N^+_{ik} = E_i^+ whenever k is larger than the cardinality of E_i^+, and follow the same convention for N^−_{ik}. This convention makes our notation C_∞ consistent with the definition of C_k, since one can indeed check that C_k = C_∞ for k large enough, and notably for k ≥ n. Since the techniques we propose below apply both to finite and infinite k, we only consider in the following an extended index k ∈ [1, ∞] and its corresponding criterion C_k.

3.3 Metrics of Interest

Because C_k is positively homogeneous, the problem of minimizing C_k over the pointed cone M is either unbounded or trivially solved by M = 0. To get around this issue, we restrict our search to the intersection of M and the unit ball of a suitable matrix norm in R^{d×d}, that is,

    M_1 = M ∩ B_1,   (3)

where B_1 = { A ∈ R^{d×d} : ||A|| ≤ 1 } is defined by an arbitrary matrix norm ||·|| such that the unit ball B_1 is convex. Using the criteria C_k, learning a ground metric from the family of points (r_1, ..., r_n) and weights (ω_ij)_{(i,j)∈I} boils down to finding an optimal solution of the problem

    min_{M∈M_1} C_k(M).   (P1)

Problem (P1) has one free variable for each strictly upper-diagonal term of M. The feasible set M_1 is closed, convex and bounded, and C_k is piecewise linear. As a consequence, Problem (P1) admits at least one optimal solution.
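For concreteness, the following sketch (ours) evaluates the criterion C_k for a given ground metric, assuming the pairwise transportation distances G_ij(M) have already been computed, for instance with the LP sketch of Section 2.2; passing k = None (or any k ≥ n) recovers C_∞.

```python
import numpy as np

def criterion_ck(G, omega, k=None):
    """Evaluate C_k from precomputed transportation distances.

    G[i, j]     : d_M(r_i, r_j) for the current ground metric M (symmetric matrix).
    omega[i, j] : similarity weights, positive for similar pairs, negative otherwise.
    k           : neighbourhood size; None (or k >= n) recovers C_infinity."""
    n = G.shape[0]
    total = 0.0
    for i in range(n):
        sim = [j for j in range(n) if j != i and omega[i, j] > 0]
        dis = [j for j in range(n) if j != i and omega[i, j] < 0]
        if k is not None:
            sim = sorted(sim, key=lambda j: G[i, j])[:k]   # k nearest similar neighbours
            dis = sorted(dis, key=lambda j: G[i, j])[:k]   # k nearest dissimilar neighbours
        total += sum(omega[i, j] * G[i, j] for j in sim + dis)
    return total
```

Summing over all ordered pairs counts each unordered pair twice, which matches the factor 2 in the definition of C_∞.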
4 C_k as a Difference of Convex Functions

Since each function G_ij is concave, C_k can be cast as a difference of convex (DC) functions (Horst and Thoai 1999):

    C_k(M) = S^−_k(M) − ( −S^+_k(M) ),

where both S^−_k(M) := Σ_{i=1}^n S^−_{ik}(M) and −S^+_k(M) := Σ_{i=1}^n ( −S^+_{ik}(M) ) are convex, by virtue of the convexity of each of the terms S^−_{ik} and −S^+_{ik} defined in Equation (2). This follows from the concavity of each function G_ij and the fact that such functions are weighted by negative factors, ω_ij for (i, j) ∈ E− and −ω_ij for (i, j) ∈ E+. We propose in this section an algorithm to obtain a local minimizer of Problem (P1) that takes advantage of this decomposition.

4.1 Subdifferentiability of C_k

The gradient of C_k computed at a given metric matrix M is

    ∇C_k(M) = ∇S^−_k(M) − ∇( −S^+_k )(M),   where
    ∇S^−_k(M) = Σ_{i=1}^n Σ_{j∈N^−_{ik}} ω_ij X*_ij,   ∇( −S^+_k )(M) = Σ_{i=1}^n Σ_{j∈N^+_{ik}} ( −ω_ij ) X*_ij,

whenever all solutions X*_ij of the linear programs G_ij considered in C_k are unique and whenever the set of k nearest neighbors of each histogram r_i is unique. More generally, by virtue of the property that any optimal solution X*_ij is in the sub-differential ∂G_ij(M) of G_ij at M, we have that

    Σ_{i=1}^n Σ_{j∈N^−_{ik}} ω_ij X*_ij ∈ ∂S^−_k(M),   Σ_{i=1}^n Σ_{j∈N^+_{ik}} ( −ω_ij ) X*_ij ∈ ∂( −S^+_k )(M),

regardless of the uniqueness of the k nearest neighbors of each histogram r_i. The details of the computation of S^−_k(M) and one of its subgradients are given in Algorithm (1). The computations for S^+_k(M) follow the same route; we use the abbreviation S^{+,−}_k(M) to cover either of these two cases in our algorithm outline.

Algorithm 1: Subgradient and objective computation for S^{+,−}_k(M)
  Input: M ∈ M_1. Optional: warm starts {X̃_ij}.
  for (i, j) ∈ E^{+,−} do
    Compute the optimal value z*_ij and an optimal solution X*_ij of Problem (P0), using the network simplex for instance, with cost vector m and constraint vector [r_i; r_j]_*. Optional: use warm starts {X̃_ij}.
  end for
  Set G = 0, z = 0.
  for i ∈ {1, ..., n} do
    Compute the neighborhood set N^{+,−}_{ik} by ranking the values z*_ij for j ∈ E_i^{+,−}.
    for j ∈ N^{+,−}_{ik} do
      G ← G + ω_ij X*_ij;  z ← z + ω_ij z*_ij.
    end for
  end for
  Output z and the subgradient ∇ = \overline{g} + \underline{g}. Optional: return the current solutions {X*_ij} as warm starts for the next iteration.
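A compact sketch of this computation (ours, not the authors' Matlab code): given any routine that solves the transportation LP, for instance the transport_distance sketch of Section 2.2, it returns the partial objective S^{+,−}_k(M) and one subgradient; the paper then reduces this matrix to its triangular vector notation, which we skip here for simplicity.

```python
import numpy as np

def s_k_and_subgradient(M, histograms, pairs, k, solve_transport):
    """Sketch of Algorithm 1 for one of the two sums S_k^{+/-}.

    pairs           : dict mapping (i, j), i < j, to omega_ij, restricted to E+ or E-.
    solve_transport : callable (M, r, c) -> (cost, plan), e.g. the LP sketch of Section 2.2.
    Returns the partial objective and one subgradient, a weighted sum of optimal plans
    restricted to each point's k nearest neighbours within the given pair set."""
    ordered = lambda i, j: (min(i, j), max(i, j))
    costs, plans = {}, {}
    for (i, j) in pairs:
        costs[(i, j)], plans[(i, j)] = solve_transport(M, histograms[i], histograms[j])

    n = len(histograms)
    objective, subgrad = 0.0, np.zeros_like(M)
    for i in range(n):
        # neighbours of i within the given pair set, ranked by transportation cost
        neighbours = sorted((j for j in range(n) if ordered(i, j) in pairs),
                            key=lambda j: costs[ordered(i, j)])[:k]
        for j in neighbours:
            objective += pairs[ordered(i, j)] * costs[ordered(i, j)]
            subgrad += pairs[ordered(i, j)] * plans[ordered(i, j)]
    return objective, subgrad
```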
4.2 Localized Linearization of the Concave Part of C_k

We describe in Algorithm (2) a simple approach to minimize C_k locally, based on a projected subgradient descent and a local linearization of the concave part of C_k. Algorithm (2) runs a subgradient descent on C_k using two nested loops. In the first loop, parameterized by the variable p, S^+_k (the concave part of C_k) and a point ∇_+ in its subdifferential are computed using the current metric M_p. Using this value and the subgradient ∇_+, the concave part S^+_k of C_k can be locally approximated by its first-order Taylor expansion,

    C_k(M) ≈ S^−_k(M) + S^+_k(M_p) + ⟨∇_+, M − M_p⟩.

This approximation is convex, larger than C_k, and can be minimized in an inner loop using a projected subgradient descent. When this convex function has been minimized up to a sufficient precision, we obtain a point

    M_{p+1} ∈ argmin_{M∈M_1}  S^−_k(M) + S^+_k(M_p) + ⟨∇_+, M − M_p⟩.

We then increment p and repeat the step described above. The algorithm terminates when sufficient progress in the outer loop has been realized, at which point either the matrix computed in the last iteration, or the one for which the objective has been smallest so far, is returned as the output of the algorithm. Algorithm (2) fits the description of simplified DC algorithms (Tao and An 1997, §4.2) to minimize a difference of convex functions g − h in the case where either g or h is a convex polyhedral function; in this paper both functions S^−_k(M) and −S^+_k(M) are convex polyhedral. The overall quality of the resulting local minimum is directly linked to the quality of the initial point M_0. Choosing a good M_0 is thus a crucial factor of our approach. We provide a few options to define M_0 in Section 5.

Algorithm 2: Projected subgradient descent to minimize C_k
  Input: M_0 ∈ M_1 (see Section 5), gradient step t_0.
  t ← 1, p ← 0, M^out_0 ← M_0.
  repeat
    Compute ∇_+ and z_+ of S^+_k using Algorithm (1) with M^out_p.
    q ← 0, M^in_0 ← M^out_p.
    repeat
      Compute ∇_− and z_− of S^−_k using Algorithm (1) with M^in_q and warm starts X̃_ij, (i, j) ∈ E−, if defined; set X̃_ij ← X*_ij for (i, j) ∈ E−.
      Set z^in_q ← z_− + z_+ + ⟨∇_+, M^in_q − M^out_p⟩. If q = 0, set z^out_p ← z^in_q.
      Set M^in_{q+1} ← P_{M_1}( M^in_q − (t_0/√t)(∇_+ + ∇_−) ).
      q ← q + 1, t ← t + 1.
    until q ≥ q_max or insufficient progress in z^in_q.
    M^out_{p+1} ← M^in_q, p ← p + 1.
  until p ≥ p_max or insufficient progress in z^out_p.
  Output: the metric M_p and the corresponding distance d_{M_p}.

5 Initial Points

Algorithm (2) converges to a local minimum of C_k over M_1. We argue that this local solution can only provide a good approximation of the global minimum if the initial point M_0 itself is already a good initial guess, not too far from the global optimum of C_k.

5.1 The l_1 Distance as a Transportation Distance

The l_1 distance between histograms can provide an educated guess to define an initial point M_0 to optimize C_k. Indeed, the l_1 distance can be interpreted as the Kantorovich-Rubinstein distance seeded with the uniform ground metric M_1 defined as M_1(i, j) = 1_{i≠j} (rigorously speaking, both distances are equal up to a factor 2, that is ½||r − c||_1 = d_{M_1}(r, c)). Because the l_1 distance is itself a popular distance to compare histograms, we consider M_1 in our experiments to initialize Algorithm (2). This starting point does not, however, exploit the information provided by the histograms {r_i, 1 ≤ i ≤ n} and the weights {ω_ij, (i, j) ∈ I}. In order to do so, we approximate C_k by a linear function of M in Section 5.2, and show that a minimizer of this approximation can provide a better way of setting M_0, as shown later in the experimental section.

5.2 Linear Approximations to C_k

We propose to form an initial point M_0 by replacing the optimization underlying the computation of each distance G_ij(M) by a dot product,

    G_ij(M) = min_{X∈U(r_i, r_j)} ⟨M, X⟩ ≈ ⟨M, Ξ_ij⟩,   (4)

where Ξ_ij is a d × d matrix. We discuss several choices to define the matrices Ξ_ij later in Section 5.3.
We use these approximations to define the criteria

    χ_∞(M) = 2 Σ_{(i,j)∈I} ω_ij ⟨M, Ξ_ij⟩ = 2 ⟨M, Σ_{(i,j)∈I} ω_ij Ξ_ij⟩   (5)

in the case where k = ∞, or

    χ_k(M) = Σ_{i=1}^n ( Σ_{j∈N^−_{ik}(M_1)} ω_ij ⟨M, Ξ_ij⟩ + Σ_{j∈N^+_{ik}(M_1)} ω_ij ⟨M, Ξ_ij⟩ )
           = ⟨M, Σ_{i=1}^n ( Σ_{j∈N^−_{ik}(M_1)} ω_ij Ξ_ij + Σ_{j∈N^+_{ik}(M_1)} ω_ij Ξ_ij )⟩   (6)

when k < ∞. Note that when k is finite, and only in that case, the k nearest neighbors of each histogram r_i need to be selected first with a metric; we use M_1 for this purpose. Although this trick may not be satisfactory, we observe that similar approaches have been used by Weinberger and Saul (2009) to seed their algorithms with near neighbors in the initial phase of their optimization. Note that such a trick is not needed when k = ∞, which, in practice, seems to yield better results, as explained later in the experimental section.

In both cases, k = ∞ and k < ∞, χ_k is a linear function of M which, for our purpose, needs to be minimized over M_1:

    min_{M∈M_1} χ_k(M) = min_{M∈M_1} ⟨M, Ξ_k⟩,   (P2)

where Ξ_k is a d × d matrix equal to the relevant sums in the right-hand sides of either Equation (5) or (6), depending on the value of k. This problem has a linear objective and a convex feasible set. If the norm defining the unit ball B_1 in Equation (3) is the l_1 norm, that is the sum of the absolute values of all coefficients of a matrix, then Problem (P2) is a linear program with O(d³) constraints. For large d, this formulation might be intractable. We propose to consider instead the l_2 norm unit ball for B_1, which yields an alternative form of Problem (P1), where the constraint M ∈ B_1 is replaced by a regularization term:

    min_{M∈M}  λ⟨M, Ξ_k⟩ + ||M||²_2,   λ > 0,   (P3)

whose minimizer coincides with that of min_{M∈M} ||M + (λ/2) Ξ_k||²_2. Brickell et al. (2008, Algorithm 3.1) have recently proposed a triangle-fixing algorithm to solve problems of the form

    min_{M∈M} ||M − H||_2,   (P4)

where H is a pseudo-distance, that is a symmetric, nonnegative matrix with zeros on the diagonal. It is however straightforward to check that each of these three conditions, although intuitive when considering the metric nearness problem as defined in (Brickell et al. 2008, §2), is not necessary for Algorithm (3.1) of (Brickell et al. 2008, §3) to converge. This algorithm is not only valid for non-symmetric matrices H, as pointed out by the authors themselves, but it is also applicable to matrices H with negative entries and non-zero diagonal entries. Problem (P3) can thus be solved by replacing H by −(λ/2) Ξ_k in Problem (P4), regardless of the sign of the entries of Ξ_k.
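Problem (P3) is a quadratic program over the metric cone. The sketch below (ours) is a small-d stand-in that hands min_{M∈M} ||M − H||²_F to a generic constrained solver; the triangle-fixing algorithm of Brickell et al. (2008) is the scalable way to do this, and H can be set to −(λ/2)Ξ_k as explained above.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def project_onto_metric_cone(H):
    """Find the metric matrix M (zero diagonal, symmetric, nonnegative, triangle
    inequalities) closest to H in Frobenius norm. Intended only for small d."""
    d = H.shape[0]
    iu = np.triu_indices(d, k=1)            # optimize over the strict upper triangle

    def to_matrix(x):
        M = np.zeros((d, d))
        M[iu] = x
        return M + M.T

    def objective(x):
        return np.sum((to_matrix(x) - H) ** 2)

    cons = []
    for i, j, k in itertools.permutations(range(d), 3):
        if i < j:                           # symmetry: i < j covers every triangle inequality once
            cons.append({'type': 'ineq',
                         'fun': lambda x, i=i, j=j, k=k:
                             to_matrix(x)[i, k] + to_matrix(x)[k, j] - to_matrix(x)[i, j]})
    bounds = [(0.0, None)] * len(iu[0])     # nonnegativity of off-diagonal entries
    x0 = np.maximum(H[iu], 0.0)
    res = minimize(objective, x0, bounds=bounds, constraints=cons, method='SLSQP')
    return to_matrix(res.x)
```

For instance, with λ = 1, project_onto_metric_cone(-0.5 * Xi) would play the role of the last step of Algorithm (3) below.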
We conclude this section by mentioning that other approaches can be considered to minimize the dot product ⟨M, Ξ⟩ using alternative norms and methods. Frangioni et al. (2005) propose for instance to handle linear programs over the intersection of the cone of distances and the set of polyhedral constraints {M_ij + M_jk + M_ki ≤ 2}, which defines what is known as the metric polytope. These approaches are however more involved computationally and we leave such extensions for future work.

5.3 Representative Tables

The techniques presented in Section 5.2 above build upon a linear approximation of each function G_ij(M) as ⟨M, Ξ_ij⟩, obtained by selecting a particular matrix Ξ_ij such that G_ij(M) ≈ ⟨M, Ξ_ij⟩. We propose in this section to obtain such an approximation by considering an arbitrary and representative transportation table in U(r, c). More precisely, we propose to use a simple proxy for the optimal transportation distance: the dot product of M with a matrix that lies at the center of U(r, c).

5.3.1 Independence Table

Many candidate tables can qualify as valid centers of general polytopes, as discussed for instance in (Boyd and Vandenberghe 2004, §8.5). There is, however, a particular table in U(r, c) which is easy to compute and which has been considered as a central point of U(r, c) in previous work: the independence table r c^T (Good 1963). The table r c^T, which is trivially in U(r, c) because r c^T 1_d = r and c r^T 1_d = c, is also the maximal entropy table in U(r, c), that is the table which maximizes

    h(X) := − Σ_{p,q=1}^d X_pq log X_pq.   (7)

Using the independence table to approximate G_ij, that is using the approximation min_{F∈U(r_i, r_j)} ⟨M, F⟩ ≈ r_i^T M r_j, yields the averaged independence tables

    Ξ_k = Σ_{(i,j)∈I} ω_ij r_i r_j^T,   if k = ∞;
    Ξ_k = Σ_{i=1}^n ( Σ_{j∈N^−_{ik}(M_1)} ω_ij r_i r_j^T + Σ_{j∈N^+_{ik}(M_1)} ω_ij r_i r_j^T ),   if k < ∞.   (8)

Note however that this approximation tends to substantially overestimate the distance between two similar histograms. Indeed, it is easy to check that r^T M r is positive whenever M is a definite distance and r has positive entropy. In the case where all coordinates of r are equal to 1/d, r^T M r is simply ||M||_1/d².

5.3.2 Typical Table

Barvinok (2010) argues more recently that most transportation tables are close to the so-called typical table T of the transportation polytope and not, as was hinted by Good (1963), to the independence table. We briefly explain the concentration result obtained in (Barvinok 2010, §1.5). Barvinok proves that, under the condition that r and c do not have too small coefficients, for any table X sampled uniformly on U(r, c), the difference between the sum of a subset of coefficients of X and the sum of the same coefficients in the typical table T is small with high probability. Writing S ⊂ {1, ..., d}² for a set of indices, and σ_S(X) = Σ_{(p,q)∈S} X_pq for the corresponding sum of coefficients, we have that, for sets S big enough,

    P{ X ∈ U(r, c) : (1 − ε) σ_S(T) ≤ σ_S(X) ≤ (1 + ε) σ_S(T) } ≥ 1 − 2d e^{−κd},

where κ and ε depend on the smoothness of r and c, that is the magnitude of their smallest coefficients. The typical table T_ij of two histograms r_i and r_j is defined (Barvinok 2010, §1.2) as the table in the polytope U(r_i, r_j) which maximizes the concave function g : R_{++}^{d×d} → R,

    g(X) = Σ_{p,q=1}^d [ (X_pq + 1) log(X_pq + 1) − X_pq log X_pq ].   (9)

Computing the typical table directly is not computationally tractable for large values of d. Barvinok (2009, p.350) provides however a different characterization of T_ij as

    T_ij = [ e^{−u_p−v_q} / (1 − e^{−u_p−v_q}) ]_{p,q ≤ d},

where the vectors u and v in R^d are the unique minimizers of the convex program

    min_{u,v>0}  r_i^T u + r_j^T v − Σ_{p,q=1}^d log( 1 − e^{−u_p−v_q} ).   (P5)

Both the gradient and the Hessian of the objective of Problem (P5) have a simple form: the Hessian can be expressed in a block form where the diagonal blocks are themselves diagonal matrices and the off-diagonal blocks are of rank 1. The vectors u and v can thus be easily computed using second-order methods.
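The following sketch (ours) computes a typical table through Problem (P5) with a generic quasi-Newton solver instead of the second-order method used by the authors; at the optimum the resulting table has row sums r and column sums c, up to solver tolerance.

```python
import numpy as np
from scipy.optimize import minimize

def typical_table(r, c):
    """Typical table of U(r, c) via Barvinok's characterization (Problem (P5))."""
    d = r.shape[0]

    def objective(z):
        u, v = z[:d], z[d:]
        E = np.exp(-u[:, None] - v[None, :])             # e^{-u_p - v_q}
        return r @ u + c @ v - np.sum(np.log1p(-E))      # r^T u + c^T v - sum log(1 - E)

    def gradient(z):
        u, v = z[:d], z[d:]
        E = np.exp(-u[:, None] - v[None, :])
        T = E / (1.0 - E)                                 # candidate table entries
        return np.concatenate([r - T.sum(axis=1), c - T.sum(axis=0)])

    # positivity of u and v, as required by (P5), enforced through simple bounds
    z0 = np.ones(2 * d)
    res = minimize(objective, z0, jac=gradient, method="L-BFGS-B",
                   bounds=[(1e-6, None)] * (2 * d))
    u, v = res.x[:d], res.x[d:]
    E = np.exp(-u[:, None] - v[None, :])
    return E / (1.0 - E)
```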
The resulting matrices Ξ_k are thus

    Ξ_k = Σ_{(i,j)∈I} ω_ij T_ij,   if k = ∞;
    Ξ_k = Σ_{i=1}^n ( Σ_{j∈N^−_{ik}(M_1)} ω_ij T_ij + Σ_{j∈N^+_{ik}(M_1)} ω_ij T_ij ),   if k < ∞.   (10)

We also note that, as for the independence table, ⟨M, T_ij⟩ will be significantly larger than 0 when r_i and r_j are similar.

Let us provide some intuition on the idea behind using the average typical table in Problem (P3). The solution of Problem (P3) will be a metric M with high coefficients M_pq for any pair of features 1 ≤ p, q ≤ d such that Ξ_pq is negative, namely a pair (p, q) such that the value X_pq of a transportation plan X sampled uniformly in each U(r_i, r_j) is typically high on average for all mismatched histogram pairs. M_pq will be, on the contrary, small for a pair of features (p, q) such that the value of X_pq is typically high in transportation plans between similar histograms.

To recapitulate the results of this section, we propose to approximate C_k by a linear function and to compute its minimum over the intersection of the unit ball and the cone of metric matrices. This linear objective can be efficiently minimized using a set of tools proposed by Brickell et al. (2008) adapted to our problem. In order to do so, the unit ball considered to define the feasible set M_1 must be the unit ball of the Frobenius norm of matrices. In order to propose such an approximation, we have used the independence and typical tables as representative points of the polytopes U(r, c). The successive steps of the computations that yield an initial point M_0 are spelled out in Algorithm (3).

Algorithm 3: Initial point M_0 to minimize C_k
  Set Ξ = 0.
  for i ∈ {1, ..., n} do
    if k = ∞ then set N^{+,−}_{ik} = E_i^{+,−};
    else compute the neighborhood sets N^+_{ik} and N^−_{ik} of histogram r_i using an arbitrary distance, e.g. the l_1 distance.
    end if
    for j ∈ N^+_{ik} ∪ N^−_{ik} do
      Compute a center Ξ_ij of U(r_i, r_j), e.g. either the typical or the independence table.
      Ξ ← Ξ + ω_ij Ξ_ij.
    end for
  end for
  Set M_0 ← argmin_{M∈M} ||M + (λ/2) Ξ||² using (Brickell et al. 2008, Algorithm 3.1).
  Output M_0. Optional: regularize M_0 by setting M_0 ← λ M_0 + (1 − λ) M_1.

6 Related Work

6.1 Metrics on the Probability Simplex

Deza and Deza (2009, §14) provide an exhaustive list of metrics for probability measures, most of which apply to probability measures on R and R^d. When narrowed down to distances for probabilities on unordered discrete sets, the dominant case in machine learning applications, Rubner et al. (2000, §2) propose to split the most commonly used distances into two families: bin-to-bin distances and cross-bin distances. Bin-to-bin distances only compare the d couples of bin counts (r_i, c_i)_{i=1..d} independently to form a distance between r and c: the Jensen divergence, χ², Hellinger and total variation distances, and more generally Csiszár f-divergences (Amari and Nagaoka 2001, §3.2), all fall into this category. Notice that each of these distances is known to usually work better for histograms than a straightforward application of the Euclidean distance, as illustrated for instance in (Chapelle et al. 1999, Table 4) or in our experimental section.
This can be explained in theory using geometric (Amari and Nagaoka 2001, §3) or statistical arguments (Aitchison and Egozcue 2005). Bin-to-bin distances are easy to compute and accurate enough to compare histograms when all d features are sufficiently distinct. When, on the contrary, some of these features are known to be similar, either because of statistical co-occurrence (e.g. the words Nadal and Federer) or through any other form of prior knowledge (e.g. color or amino-acid similarity), then a simple bin-to-bin comparison may not be accurate enough, as argued by (Rubner et al. 2000, §2.2). In particular, bin-to-bin distances are large between histograms with distinct supports, regardless of the fact that these two supports may in fact describe very similar features.

Cross-bin distances handle this issue by considering all d² possible pairs (r_i, c_j) of cross-bin counts to form a distance. The simplest cross-coordinate distance for general vectors in R^d is arguably the Mahalanobis family of distances,

    d_Ω(x, y) = sqrt( (x − y)^T Ω (x − y) ),

where Ω is a positive semidefinite d × d matrix. The Mahalanobis distance between x and y can be interpreted as the Euclidean distance between Lx and Ly, where L is a Cholesky factor of Ω or any square root of Ω. Learning such linear maps L or matrices Ω directly using labeled information has been the subject of a substantial amount of research in recent years. We briefly review this literature in the following section.

6.2 Mahalanobis Metric Learning

Xing et al. (2003), followed by Weinberger et al. (2006) and Davis et al. (2007), have proposed different algorithms to learn the parameters of a Mahalanobis distance, that is either a positive semidefinite matrix Ω or a linear map L. These techniques first define a criterion C and a feasible set of candidate matrices, and then obtain, through optimization algorithms, a relevant matrix Ω or L. The criteria we propose in Section 3 are modeled along these ideas. Weinberger et al. (2006) were the first to consider criteria that only use nearest neighbors, which inspired in this work the proposal of C_k for finite values of k in Section 3.2, as opposed to considering the average over all possible distances as in (Xing et al. 2003) for instance.

We would like to insist at this point in the paper that Mahalanobis metric learning and ground metric learning have very little in common conceptually: Mahalanobis metric learning algorithms produce a d × d positive semidefinite matrix or a linear operator L; ground metric learning produces instead a metric matrix M. These sets of techniques operate on very different mathematical objects. It is also worth mentioning that although Mahalanobis distances have been designed for general vectors in R^d, and as a consequence can be applied to histograms, there is, to our knowledge, no statistical theory which motivates their use on the probability simplex.

6.3 Metric Learning in Σ_{d−1}

Lebanon (2006) has proposed to learn a bin-to-bin distance in the probability simplex using a parametric family of distances parameterized by a histogram λ ∈ Σ_{d−1}, defined as

    d_λ(r, c) = arccos( Σ_{i=1}^d sqrt( r_i λ_i / (r^T λ) ) sqrt( c_i λ_i / (c^T λ) ) ).
This formula can be simplified by using the perturbation operator proposed by Aitchison (1986, p.46):

    ∀ r, λ ∈ Σ_{d−1},   r ⊙ λ := (1 / (r^T λ)) (r_1 λ_1, ..., r_d λ_d)^T.

Aitchison argues that the perturbation operation can be naturally interpreted as an addition operation in the simplex. Using this notation, the distance d_λ(r, c) becomes the simple Fisher metric applied to the perturbed histograms r ⊙ λ and c ⊙ λ,

    d_λ(r, c) = arccos ⟨ √(r ⊙ λ), √(c ⊙ λ) ⟩.

Using arguments related to the fact that the distance should vary according to the density of points described in a dataset, Lebanon (2006) proposes to learn this perturbation λ in a semi-supervised context, that is, making only use of observed histograms but no other side-information. Because of this key distinction we do not consider this approach in the experimental section.

7 Experiments

We provide in this section a few details on the practical implementation of Algorithms (1), (2) and (3). We follow by presenting empirical evidence that ground metric learning improves upon other state-of-the-art metric learning techniques when considered on normalized histograms.

7.1 Implementation Notes

Algorithms (1), (2) and (3) were implemented using several optimization toolboxes. Algorithm (1) requires the solution of several transportation problems. We use the CPLEX Matlab API implementation of network flows with warm starts to that effect. The computational gains we obtain by using the API, instead of using a function call to the CPLEX Matlab toolbox or to the Mosek solver, are approximately 4-fold. These benefits come from the fact that only the constraint vector in Problem (P0) needs to be updated at each iteration of the first loop of Algorithm (1). We use the metricNearness toolbox (http://people.kyb.tuebingen.mpg.de/suvrit/work/progs/metricn.html) to carry out both the projections of each inner loop iteration of Algorithm (2) and the last minimization of Algorithm (3). We compute typical tables using the fminunc Matlab function, with the gradient and the Hessian of the objective of Problem (P5) provided as auxiliary functions. Since fminunc is by definition an unconstrained solver, its solution (u*, v*) is kept if both u* and v* satisfy the positivity constraints, which is the case for a large majority of pairs of histograms. Whenever these constraints are violated we optimize this problem again using a slower constrained Newton method.

7.2 Image Classification Datasets

We sample randomly 2N classes taken in the Caltech-256 family of images and consider 70 images in each class. Each image is represented as a normalized histogram of GIST features, obtained using the LEAR implementation of GIST features (http://lear.inrialpes.fr/software; Douze et al. 2009). These features describe 8 edge directions at mid-resolution computed for each patch of a 4 × 4 grid on each image. Each feature histogram is of dimension d = 8 × 4 × 4 = 128 and subsequently normalized to sum to one. We split these classes into two sets of N classes, (a_1, ..., a_N) and (a_{N+1}, ..., a_{2N}), and study the resulting N² binary classification problems that arise from each pair of classes in {a_1, ..., a_N} × {a_{N+1}, ..., a_{2N}}.
For each of these N² binary classification tasks we split the 70 + 70 points from both classes into 30 + 30 points to form a training set and 40 + 40 points to form a test set. This amounts to having n = 60 training points, following the notations introduced in Section 3.1.

7.3 Distances Used in this Benchmark

7.3.1 Bin-to-bin distances

We consider the l_1, l_2 and Hellinger distances on GIST feature vectors,

    l_1(r, c) = ||r − c||_1,   l_2(r, c) = ||r − c||_2,   H(r, c) = ||√r − √c||_2,

where √r is the vector whose coordinates are the square roots of the coordinates of r.

7.3.2 Mahalanobis distances

We use the public implementations of LMNN (Weinberger and Saul 2009) and ITML (Davis et al. 2007) to learn two different Mahalanobis distances for each task. We run both algorithms with default settings, that is k = 3 for LMNN and k = 4 for ITML. We use these algorithms on the Hellinger representations {√r_i, i = 1, ..., n} of all histograms originally in the training set. We have considered this representation because the Euclidean distance between two histograms mapped through the Hellinger map corresponds exactly to the Hellinger distance (Amari and Nagaoka 2001, p.57). Since the Mahalanobis distance builds upon the Euclidean distance, we argue that this representation is more adequate to learn Mahalanobis metrics in the probability simplex. The significant gain in performance observed in Figure 2 that is obtained through this simple transformation confirms this intuition.

7.3.3 Ground Metric Learning

We learn ground metrics using the following settings. In each classification task, and for two images r_i and r_j, 1 ≤ i ≠ j ≤ 60, each weight ω_ij is set to 1 if both histograms come from the same class and to −1 if they come from different classes. The weights ω_ij are further normalized to ensure that Σ_{(i,j)∈E+} ω_ij = 1 and Σ_{(i,j)∈E−} ω_ij = −1; as a consequence, the total sum of weights is equal to 0 (a sketch of this weighting scheme is given below). The neighborhood parameter k is set to 3 to be directly comparable to the same parameter used for ITML and LMNN. The subgradient stepsize t_0 of Algorithm (2) is set to 0.1, guided by preliminary experiments and by the fact that, because of the normalization of the weights ω_ij introduced above, both the current iterate M_k in Algorithm (2) and the gradient steps ∇_+ or ∇_− have comparable norms as matrices. We perform a minimum of 50 × 0.8^p gradient steps in each inner loop and set p_max to 8. Each inner loop is terminated when the progress is too small, that is when the objective does not improve by more than 0.75% every 6 steps, or when q reaches q_max = 200. We compute initial points M_0 using different representative tables as described in Algorithm (3). Figure 4 illustrates the variation in performance for different choices of M_0.

There is no "natural" distance between GIST features that we could consider. We have tried a few, taking for instance distances based on the 8 directions and the 4 × 4 = 16 possible locations in the grid described by the 128 GIST features, but we could not come up with one that was competitive with any of the methods considered above. This situation illustrates our claim in the introduction that GML can select agnostically a metric for features without using any prior knowledge on the features.
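For concreteness, here is a short sketch (ours) of the weighting scheme just described, built from class labels:

```python
import numpy as np

def class_weights(labels):
    """omega_ij = +1 for same-class pairs and -1 otherwise, rescaled so that
    positive weights sum to 1 and negative weights sum to -1 over pairs i < j."""
    labels = np.asarray(labels)
    n = len(labels)
    omega = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)
    np.fill_diagonal(omega, 0.0)
    iu = np.triu_indices(n, k=1)
    vals = omega[iu]
    vals = np.where(vals > 0,
                    vals / vals[vals > 0].sum(),        # each positive weight becomes 1/|E+|
                    vals / (-vals[vals < 0].sum()))     # each negative weight becomes -1/|E-|
    omega[iu] = vals
    omega[iu[1], iu[0]] = vals                          # keep omega symmetric
    return omega
```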
7.4 Comparison with the SVM Baseline

For each distance d, we consider its exponentiated kernel exp(−d/σ) and use a Support Vector Machine (SVM) classifier (Cortes and Vapnik 1995) on the 80 test points, estimated with the 60 training points. We use the following parameter grid to train the SVMs: the regularization parameter C is selected within the range {1, 10, 100}; the bandwidth parameter σ is selected as a multiple of the median distance d computed on the training set, that is σ is selected within the range {.1, .2, .5, 1, 2, 5} × med{d(r_i, r_j)}. We use a 4-fold (testing on the left-out fold), 2-repeat cross-validation procedure on the training set to select the parameter pair that has the lowest average error. Transportation distances are not negative definite in the general case, which is why we add a sufficient amount of diagonal regularization (minus the smallest negative eigenvalue) to the resulting 60 × 60 Gram matrices to ensure that they are positive definite in the training phase.
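A sketch (ours) of this kernel construction, assuming D is the symmetric matrix of pairwise distances on the training set:

```python
import numpy as np

def kernel_from_distance(D, sigma):
    """Exponentiated-distance kernel with the diagonal regularization described
    above: if the Gram matrix has a negative eigenvalue, its magnitude is added
    to the diagonal so that the matrix becomes positive definite."""
    K = np.exp(-D / sigma)
    lam_min = np.linalg.eigvalsh(K).min()
    if lam_min < 0:
        K = K + (-lam_min) * np.eye(K.shape[0])
    return K
```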
[Figure 1: plot omitted. Two panels, "Distance Accuracy in Test Set" (average proportion of correct neighbours vs. number of retrieved neighbours) and "κ-Nearest Neighbor Classification Error on Test Fold" (error vs. κ); curves for L1, L2, Hellinger, LMNN k=3 (Hellinger), ITML k=4 (Hellinger) and GML-EMD k=3 (M_0 = Typ_∞), with SVM errors reported in the legend.]

Figure 1: (left) Accuracy of each considered distance on the test set, as measured by the average proportion, for each datapoint in the test set, of points coming from the same class within its κ nearest neighbors. These proportions were averaged over 25 × 25 = 625 classification problems using 40 + 40 test points in each experiment. The ground metric in GML and the Mahalanobis matrices in ITML and LMNN have been learned separately using a training set of 30 + 30 points. (right) κ-NN classification error using the considered metrics and the 30 + 30 points in the training fold, both to compute the metrics when needed and to compare test points. These results are also averaged over 40 + 40 test points in each of the 625 classification tasks. By M_0 = Typ_∞ we mean that the initial point was computed using typical tables and k = ∞ in Algorithm (3).

7.5 Results

The most important results of this experimental section are summarized in Figure 1, which displays, for all considered distances, their average recall accuracy on the test set and the average classification error using a κ-nearest neighbor classifier. These quantities are averaged over N² = 25² = 625 binary classifications. In this figure, GML used with EMD is shown to provide, on average, the best possible performance: the left-hand panel considers test points and shows that, for each point considered on its own, GML-EMD selects on average more same-class points as closest neighbors than any other distance. The performance gap between GML-EMD and competing distances increases significantly as the number of retrieved neighbors is itself increased. The right-hand panel displays the average error, over all 625 tasks, of a κ-nearest neighbor classification algorithm when considered with all distances for varying values of κ. In this case too, GML combined with EMD fares much better than competing distances.

The average error when using an SVM with these distances is reported in the legend of the right-hand panel. Our results agree with the general observation in metric learning that Support Vector Machines usually perform better than κ-nearest neighbor classifiers with learned metrics (Weinberger and Saul 2009, Table 1). Note however that the κ-nearest-neighbor classifier seeded with GML-EMD has an average performance that is directly comparable with that of support vector machines seeded with the l_1 or l_2 kernels. It is also worth mentioning as a side remark that the l_2 distance does not perform as well as the l_1 or Hellinger distances on these datasets, which validates our earlier statement that the Euclidean geometry is usually a poor choice to compare histograms directly. This intuition is further validated in Figure 2, where Mahalanobis learning algorithms are shown to perform significantly better when they use the Hellinger representation of histograms.

Figure 3 shows that the performance of GML can vary significantly depending on the neighborhood parameter k. We have also considered as a ground metric the initial point M_0 = Typ_∞ obtained by using typical tables and k = ∞ with Algorithm 3. The corresponding results appear as the EMD Typ_∞ curves in the figure. These curves are far below those corresponding to GML-EMD k = 3 (M_0 = Typ_∞). This gap illustrates the fact that the subgradient descent performed by Algorithm 2 does have a real impact on performance, and that choosing a good initial point in itself is not enough.

Finally, Figure 4 reports additional performance curves for different initial points M_0. These experiments tend to show that, despite their computational overhead, Barvinok's typical tables seem to provide a better initial point than independence tables in terms of average performance. Please note that N = 25 in Figures 1 and 2, and N = 10 in Figures 3 and 4.

[Figure 2: plot omitted. Same two panels as Figure 1; curves for Hellinger, LMNN k=3, LMNN k=3 (Hellinger), ITML k=4 and ITML k=4 (Hellinger), with SVM errors in the legend.]

Figure 2: The experimental setting in this figure is identical to that of Figure 1, except that only two different versions of LMNN and ITML are compared with the Hellinger distance. This figure supports our claim in Section 7.3.2 that Mahalanobis learning methods work better using the Hellinger representation of histograms, {√r_i, i = 1, ..., n}, rather than their straightforward representation in the simplex {r_i}_{i=1,...,n}.

[Figure 3: plot omitted. Same two panels; curves for Hellinger, GML-EMD k = 1, 3, 8, ∞ (M_0 = Typ_∞) and EMD Typ_∞, with SVM errors in the legend.]

Figure 3: Distance accuracy and κ-nearest neighbor classifier error for GML-EMD using different values of the neighborhood parameter k. The initialization setting M_0 = Typ_∞ is explained in the caption of Figure 1. The last curve, EMD Typ_∞, displays the results corresponding to the EMD with a ground metric set directly to the output of Algorithm (3) using typical tables and k = ∞. The difference in performance between these curves and that of GML-EMD k = 3 (M_0 = Typ_∞) illustrates the progress achieved by Algorithm 2. The performance curves also agree with the intuition that the GML metric set at a given neighborhood parameter k has a comparative advantage over other GML metrics when the κ parameter of nearest neighbor classifiers is itself close to k. The experimental setting is identical to that of Figure 1, except that experiments were averaged here over 10 × 10 = 100 classification problems, instead of 625. There is no overlap between these two sets of binary classifications.

[Figure 4: plot omitted. Same two panels; curves for Hellinger, GML-EMD k=3 with M_0 = Typ_∞, Ind_∞, Typ_3 and 1, and L1, with SVM errors in the legend.]

Figure 4: Distance accuracy and κ-nearest neighbor performance for GML-EMD set at k = 3 using the different initial points described in Section 5, and particularly Section 5.3.2. This figure shows that, despite its computational overhead, initializing Algorithm 2 using typical tables performs better on average than using independence tables. The experimental setting is identical to that of Figure 3.

8 Conclusion and Future Work

8.1 Overview

We have proposed in this paper an approach to tune adaptively the unique parameter of transportation distances, the ground metric, given a training dataset of histograms. This approach can be applied to any type of features, as long as a set of histograms along with side-information, typically labels, is provided for the algorithm to learn a good candidate for the ground metric. The algorithm performs a projected subgradient descent on a difference of convex functions, and can only find local minimizers. We propose a few initial points to compensate for this arbitrariness, and show that our approaches provide, when compared to other competing distances, a superior average performance on a large set of binary image classification tasks using GIST feature histograms.
8.2 Ground Metric and Computational Speed

We have argued in the introduction of this paper that the ground metric was never considered so far as a parameter that could be learned from data. The ground metric has however attracted a lot of attention recently for a different reason. Ling and Okada (2007), Gudmundsson et al. (2007), Pele and Werman (2009) and Ba et al. (2011) have all recently argued that the computation of the EMD can be dramatically sped up when the ground metric matrix has a certain structure. For instance, Pele and Werman (2009) have shown that the computation of each earth mover's distance can be significantly reduced whenever the larger values of any arbitrary ground metric are thresholded to a certain level. Ground metrics that follow such constraints are attractive because they result in transportation problems which are provably faster to compute. Our work in this paper suggests, on the other hand, that the content (and not the structure) of the ground metric can be learned to improve classification accuracy. We believe that the combination of these two viewpoints could result in transportation distances that are both adapted to the task at hand and fast to compute. A strategy to achieve both goals would be to enforce such structural constraints on candidate metrics M when looking for minimizers of the criteria C_k.
The most computationally intensive part of Algorithm (2) lies in the repeated calls to Algorithm (1), which itself computes O(n²) optimal transportation problems between pairs of histograms, where n is the size of the training set. We have used the network simplex algorithm with warm starts to carry out these computations, but faster alternatives along the lines of the implementations provided by Pele and Werman (2009) could provide computational improvements. The EMD is also known to admit lower bounds in particular cases. An efficient lower bound would reduce considerably the complexity of Algorithm (1) by narrowing down the computation of optimal transportation plans to a smaller subset of neighbor candidates.

8.3 Future Applications

We have proposed in this paper a set of techniques to learn ground metrics using histograms of arbitrary features. We believe these techniques will prove useful to study histograms of latent features, such as topics (Blei and Lafferty 2009) or dictionaries (Kreutz-Delgado et al. 2003, Mairal et al. 2009, Jenatton et al. 2010), for which natural metrics are not always available.

References

R.K. Ahuja, T.L. Magnanti, and J.B. Orlin. Network Flows: Theory, Algorithms and Applications. Prentice Hall, New Jersey, 1993.

J. Aitchison. The Statistical Analysis of Compositional Data. Chapman & Hall, London, UK, 1986. ISBN 0-412-28060-4.

J. Aitchison and J. Egozcue. Compositional data analysis: Where are we and where should we be heading? Mathematical Geology, 37(7):829-850, 2005.

Shun-Ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry. AMS vol. 191, 2001.

K.D. Ba, H.L. Nguyen, H.N. Nguyen, and R. Rubinfeld. Sublinear time algorithms for earth mover's distance. Theory of Computing Systems, 48(2):428-442, 2011.

A. Barvinok. Asymptotic estimates for the number of contingency tables, integer flows, and volumes of transportation polytopes. International Mathematics Research Notices, 2009(2):348, 2009.

A. Barvinok. What does a random contingency table look like? Combinatorics, Probability and Computing, 19(4):517-539, 2010.

D. Bertsimas and J.N. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, 1997.

D.M. Blei and J.D. Lafferty. Topic models. Text Mining: Classification, Clustering, and Applications, 10:71, 2009.

D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022, 2003.
References

R.K. Ahuja, T.L. Magnanti, and J.B. Orlin. Network Flows: Theory, Algorithms and Applications. Prentice Hall, New Jersey, 1993.
J. Aitchison. The Statistical Analysis of Compositional Data. Chapman & Hall, London, UK, 1986. ISBN 0-412-28060-4.
J. Aitchison and J. Egozcue. Compositional data analysis: Where are we and where should we be heading? Mathematical Geology, 37(7):829–850, 2005.
Shun-Ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry. AMS, volume 191, 2001.
K.D. Ba, H.L. Nguyen, H.N. Nguyen, and R. Rubinfeld. Sublinear time algorithms for earth mover's distance. Theory of Computing Systems, 48(2):428–442, 2011.
A. Barvinok. Asymptotic estimates for the number of contingency tables, integer flows, and volumes of transportation polytopes. International Mathematics Research Notices, 2009(2):348, 2009.
A. Barvinok. What does a random contingency table look like? Combinatorics, Probability and Computing, 19(4):517–539, 2010.
D. Bertsimas and J.N. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, 1997.
D.M. Blei and J.D. Lafferty. Topic models. Text Mining: Classification, Clustering, and Applications, 10:71, 2009.
D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
J. Brickell, I.S. Dhillon, S. Sra, and J.A. Tropp. The metric nearness problem. SIAM Journal on Matrix Analysis and Applications, 30(1):375–396, 2008.
O. Chapelle, P. Haffner, and V. Vapnik. SVMs for histogram-based image classification. IEEE Transactions on Neural Networks, 10(5):1055, September 1999.
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995. ISSN 0885-6125.
J.V. Davis, B. Kulis, P. Jain, S. Sra, and I.S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, pages 209–216. ACM, 2007.
M.M. Deza and E. Deza. Encyclopedia of Distances. Springer Verlag, 2009.
Persi Diaconis and Bradley Efron. Testing for independence in a two-way table: new interpretations of the chi-square statistic. The Annals of Statistics, 13(3):845–913, 1985.
M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, and C. Schmid. Evaluation of GIST descriptors for web-scale image search. In Proceedings of the ACM International Conference on Image and Video Retrieval, page 19. ACM, 2009.
L.R. Ford and D.R. Fulkerson. Flows in Networks. Princeton University Press, 1962.
A. Frangioni, A. Lodi, and G. Rinaldi. New approaches for optimizing over the semimetric polytope. Mathematical Programming, 104(2):375–388, 2005.
I.J. Good. Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. The Annals of Mathematical Statistics, pages 911–934, 1963.
J. Gudmundsson, O. Klein, C. Knauer, and M. Smid. Small Manhattan networks and algorithmic applications for the earth mover's distance. In Proceedings of the 23rd European Workshop on Computational Geometry, pages 174–177, 2007.
R. Horst and N.V. Thoai. DC programming: overview. Journal of Optimization Theory and Applications, 103(1):1–43, 1999.
R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the International Conference on Machine Learning (ICML). Citeseer, 2010.
Thorsten Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms. Kluwer Academic Publishers, 2002.
H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proceedings of the 20th ICML, pages 321–328, 2003.
K. Kreutz-Delgado, J.F. Murray, B.D. Rao, K. Engan, T.W. Lee, and T.J. Sejnowski. Dictionary learning algorithms for sparse representation. Neural Computation, 15(2):349–396, 2003.
Guy Lebanon. Metric learning for text documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):497–508, 2006. ISSN 0162-8828.
C. Leslie, E. Eskin, and W.S. Noble. The spectrum kernel: a string kernel for SVM protein classification. In Proceedings of PSB 2002, pages 564–575, 2002.
E. Levina and P. Bickel. The earth mover's distance is the Mallows distance: some insights from statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision, volume 2, pages 251–256. IEEE, 2001.
H. Ling and K. Okada. An efficient earth mover's distance algorithm for robust histogram comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 840–853, 2007.
D.G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 1150–1157, 1999.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 689–696. ACM, 2009.
C.L. Mallows. A note on asymptotic joint normality. The Annals of Mathematical Statistics, pages 508–515, 1972.
A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
Ofir Pele and Michael Werman. Fast and robust earth mover's distances. In ICCV, 2009.
S.T. Rachev. Probability Metrics and the Stability of Stochastic Models. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley, 1991.
S.T. Rachev and L. Rüschendorf. Mass Transportation Problems: Theory, volume 1. Springer Verlag, 1998.
Y. Rubner, L.J. Guibas, and C. Tomasi. The earth mover's distance, multi-dimensional scaling, and color-based image retrieval. In Proceedings of the ARPA Image Understanding Workshop, pages 661–668. Citeseer, 1997.
Y. Rubner, C. Tomasi, and L.J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40, 2000.
P.D. Tao and L.T.H. An. Convex analysis approach to DC programming: theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289–355, 1997.
Cédric Villani. Topics in Optimal Transportation, volume 58. AMS Graduate Studies in Mathematics, 2003.
K.Q. Weinberger and L.K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244, 2009.
K.Q. Weinberger, J. Blitzer, and L.K. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, 2006.
E.P. Xing, A.Y. Ng, M.I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, pages 521–528, 2003.