The Lovász-Bregman Divergence and connections to rank aggregation, clustering, and web ranking

We extend the recently introduced theory of Lovász-Bregman (LB) divergences (Iyer & Bilmes, 2012) in several ways. We show that they represent a distortion between a 'score' and an 'ordering', thus providing a new view of rank aggregation and order based clustering with interesting connections to web ranking.

Authors: Rishabh Iyer, Jeff Bilmes

The Lovász-Bregman Divergence and connections to rank aggregation, clustering, and web ranking*

Rishabh Iyer, Dept. of Electrical Engineering, University of Washington, Seattle, WA-98175, USA
Jeff Bilmes, Dept. of Electrical Engineering, University of Washington, Seattle, WA-98175, USA

September 19, 2018

* A shorter version of this paper appeared in Proc. Uncertainty in Artificial Intelligence (UAI), Bellevue, 2013.

Abstract

We extend the recently introduced theory of Lovász-Bregman (LB) divergences [20] in several ways. We show that they represent a distortion between a "score" and an "ordering", thus providing a new view of rank aggregation and order based clustering with interesting connections to web ranking. We show how the LB divergences have a number of properties akin to many permutation based metrics, and in fact have as special cases forms very similar to the Kendall-τ metric. We also show how the LB divergences subsume a number of commonly used ranking measures in information retrieval, like NDCG [23] and AUC [36]. Unlike the traditional permutation based metrics, however, the LB divergence naturally captures a notion of "confidence" in the orderings, thus providing a new representation for applications that aggregate scores as opposed to just orderings. We show how a number of recently used web ranking models are forms of Lovász-Bregman rank aggregation, and also observe that a natural form of Mallows model using the LB divergence has been used as a conditional ranking model for the "Learning to Rank" problem.

1 Introduction

The Bregman divergence first appeared in the context of relaxation techniques in convex programming [5], and has since found numerous applications as a general framework in clustering [3], proximal minimization [7], and elsewhere. Many of these applications are due to the nice properties of the Bregman divergence and the fact that it is parameterized by a single convex function; Bregman divergences also generalize a large class of divergences between vectors. Recently, Bregman divergences have also been defined between matrices [39], between functions [17], and between sets [20].

In this paper, we investigate a specific class of Bregman divergences, parameterized via the Lovász extension of a submodular function. Submodular functions are a special class of discrete functions with interesting properties. Let V refer to a finite ground set {1, 2, ..., |V|}. A set function f: 2^V → R is submodular if for all S, T ⊆ V, f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T). Submodular functions have attractive properties that make their exact or approximate optimization efficient and often practical [18, 22]. They also arise naturally in many problems in machine learning, computer vision, economics, operations research, etc. A link between convexity and submodularity is seen via the Lovász extension [13, 30] of the submodular function. While submodular functions are a growing phenomenon in machine learning, there has recently been an increasing set of applications for the Lovász extension itself. In particular, recent work [1, 2] has shown nice connections between the Lovász extension and structured sparsity inducing norms.

This work is concerned with yet another application of the Lovász extension, in the form of the Lovász-Bregman divergence. This was first introduced in Iyer & Bilmes [20], in the context of clustering ranked vectors.
We extend that work in several ways, mainly theoretically, by showing a number of connections to permutation based metrics, to rank aggregation, to rank based clustering, and to the "Learning to Rank" problem in web ranking.

1.1 Motivation

The problems of rank aggregation and rank based clustering are ubiquitous in machine learning, information retrieval, and social choice theory. Below is a partial list of some of these applications.

Meta Web Search: We are given a collection of search engines, each providing a ranking or a score vector, and the task is to aggregate these to generate a combined result [28].

Learning to Rank: The "Learning to rank" problem, which is a fundamental problem in machine learning, involves constructing a ranking model from training data. This problem has gained significant interest in web ranking and information retrieval [29].

Voter or Rank Clustering: This is an important problem in social choice theory, where each voter provides a ranking or assigns a score to every item. A natural problem here is to meaningfully combine these rankings [27]. Sometimes, however, the population is heterogeneous, and a mixture of distinct populations, each with its own aggregate representative, fits better.

Combining Classifiers and Boosting: There has been increased interest in combining the output of different systems in an effort to improve the performance of pattern classifiers, something often used in Machine Translation [35] and Speech Recognition [25]. One way of doing this [28] is to treat the output of every classifier as a ranking and combine the individual rankings of weak classifiers to obtain the overall classification. This is akin to standard boosting techniques [16], except that we consider rankings rather than just the valuations.

1.2 Permutation Based Distance Metrics

First a bit of notation: a permutation σ is a bijection from [n] = {1, 2, ..., n} to itself. Given a permutation σ, we denote by σ^{-1} the inverse permutation, so that σ(i) is the item assigned rank i, while σ^{-1}(j) is the rank assigned to item j, and hence σ(σ^{-1}(i)) = i. (This is opposite to the convention used in [28, 27, 24, 32], but follows the convention of [18].) We use σ_x to denote a permutation induced by the ordering of a vector x, such that x(σ_x(1)) ≥ x(σ_x(2)) ≥ ... ≥ x(σ_x(n)); without loss of generality, the permutation is defined via a decreasing order of elements. We use v(i), v[i], and v_i interchangeably to denote the i-th element of v. Given two permutations σ, π we define σπ as the combined permutation such that σπ(i) = σ(π(i)). Also, given a vector x and a permutation σ, define xσ such that xσ(i) = x(σ(i)), and define σx as σx(i) = x(σ^{-1}(i)).

Recently a number of papers [28, 27, 24, 32] have addressed the problem of combining rankings using permutation based distance metrics. Denote Σ as the set of permutations over [n]. Then d: Σ × Σ → R_+ is a permutation based distance metric if it satisfies the usual notions of a metric, viz., for all σ, π, τ ∈ Σ: d(σ, π) ≥ 0 with d(σ, π) = 0 iff σ = π; d(σ, π) = d(π, σ); and d(σ, π) ≤ d(σ, τ) + d(τ, π). In addition, to represent a distance amongst permutations, another property which is usually required is left invariance to reorderings, i.e., d(σ, π) = d(τσ, τπ). (While in the literature this is called right invariance, it is left invariance under our notation.)
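To make the notation above concrete, here is a minimal NumPy sketch (ours, not from the paper; the helper names and 0-indexing are illustrative assumptions) of the induced permutation σ_x, the inverse σ^{-1}, and composition:

```python
# A minimal NumPy sketch of the permutation notation above (names are ours).
# A permutation sigma is stored as an array: sigma[i] is the item placed at rank i
# (0-indexed here, versus 1-indexed in the text).
import numpy as np

def induced_permutation(x):
    """sigma_x: items sorted by decreasing score, so x[sigma_x[0]] >= x[sigma_x[1]] >= ..."""
    return np.argsort(-np.asarray(x), kind="stable")

def inverse_permutation(sigma):
    """sigma^{-1}: inv[j] is the rank assigned to item j."""
    inv = np.empty_like(sigma)
    inv[sigma] = np.arange(len(sigma))
    return inv

def compose(sigma, pi):
    """(sigma pi)(i) = sigma(pi(i))."""
    return sigma[pi]

x = np.array([0.2, 0.9, 0.5])
sigma_x = induced_permutation(x)        # array([1, 2, 0]): item 1 is ranked first
assert np.all(x[sigma_x][:-1] >= x[sigma_x][1:])
assert np.all(sigma_x[inverse_permutation(sigma_x)] == np.arange(3))
```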
The most standard notion of a permutation based distance metric is the Kendall τ metric [24]:

d_T(σ, π) = Σ_{i<j} I(σ^{-1}π(i) > σ^{-1}π(j)),   (1)

where I(.) is the indicator function. This distance metric counts the number of adjacent swap operations required to convert the permutation σ into π. It is not hard to see that it is a metric and that it satisfies the ordering invariance property. Other often used metrics include the Spearman footrule d_S and the rank correlation d_R [10]:

d_S(σ, π) = Σ_{i=1}^n |σ^{-1}(i) − π^{-1}(i)|,   (2)
d_R(σ, π) = Σ_{i=1}^n (σ^{-1}(i) − π^{-1}(i))^2.   (3)

A natural extension to a ranking model is the Mallows model [31], an exponential model defined over these permutation based distance metrics:

p(π | θ, σ) = (1/Z(θ)) exp(−θ d(π, σ)), with θ ≥ 0.   (4)

This model has been generalized by [14] and extended to multistage ranking by [15]. Lebanon and Lafferty [28] were amongst the first to use these models in machine learning, proposing an extended Mallows model [14] to combine rankings in a manner like AdaBoost [16]. Similarly, Meila et al. [32] use the generalized Mallows model to infer the optimal combined ranking. Another related though different problem is clustering ranked data, investigated by [33], who provide a k-means style algorithm; this was extended to a machine learning context by [6].

1.3 Score based Permutation divergences

In this paper, we motivate another class of divergences which capture the notion of distance between permutations. Unlike the permutation based distance metrics, however, these are distortion functions between a "score" and a permutation. This, as we shall see, offers a new view of rank aggregation and order based clustering problems. We shall also see a number of interesting connections to web ranking. Consider a scenario where we are given a collection of scores x_1, x_2, ..., x_n as opposed to just a collection of orderings, i.e., each x_i is an ordered vector and not just a permutation. This occurs in a number of real world applications. For example, in the application of combining classifiers [28], the classifiers often output scores (in the form of, say, normalized confidences or probability distributions). While the rankings themselves are informative, it is often more beneficial to use the additional information in the form of scores when available. This in some sense combines the approach of AdaBoost [16] and Cranking [28], since the former is concerned only with the scores while the latter takes only the orderings. The case of voting is similar, where each voter might assign scores to every candidate (which can sometimes be easier than assigning an ordering). This also applies to web search, where the individual search engines (or possibly features) often provide a confidence score for each webpage. Since these applications provide both the valuations and the rankings, we call them score based ranking applications. A score based permutation divergence is defined as follows.
Given a convex set S, we call d: S × Σ → R_+ a score based permutation divergence if for all x ∈ S, σ ∈ Σ, d(x || σ) ≥ 0 and d(x || σ) = 0 if and only if σ_x = σ. Another desirable property is left invariance, viz. d(x || σ) = d(τx || τσ) for all τ, σ ∈ Σ and x ∈ S. It is then immediately clear how a score based permutation divergence naturally models the above scenario. The problem becomes one of finding a representative ordering, i.e., finding a permutation σ that minimizes the average distortion to the set of points x_1, ..., x_n. Similarly, in a clustering application, to cluster a set of ordered scores, a score based permutation divergence fits more naturally: the representatives for each cluster are permutations, while the objects themselves are ordered vectors. Notice that in both cases a purely permutation based distance metric would completely ignore the values and consider only the induced orderings or permutations. To our knowledge, this work is the first to formally introduce the notion of a score based permutation divergence, thus providing a novel view of the rank aggregation and rank based clustering problems.

1.4 Our Contributions

In this paper, we investigate several theoretical properties of one such score based permutation divergence, the LB divergence. This work builds on our previous work [20], where we introduced the Lovász-Bregman divergence. Our focus there was mainly on the connections between the Lovász-Bregman and a discrete Bregman divergence connected with submodular functions, and we also provided a k-means framework for clustering ordered vectors. In the present paper, we make the connection to rank aggregation and clustering more precise by motivating the class of score based permutation divergences and showing relations to permutation based metrics and web ranking. We show several new theoretical properties and interesting connections, summarized below:

• We introduce a novel notion of the generalized Bregman divergence based on a "subgradient map". While this is of independent theoretical interest, it helps us characterize the Lovász-Bregman divergence.
• We show that the LB divergence is indeed a score based permutation divergence with several similarities to permutation based metrics.
• We show that a form of weighted Kendall τ, and a form related to the Spearman footrule, occur as instances of the Lovász-Bregman divergences.
• We also show how a number of loss functions used in IR and web ranking, like the Normalized Discounted Cumulative Gain (NDCG) and the Area Under the Curve (AUC), occur as instances of the Lovász-Bregman.
• We show some unique properties of the LB divergences not present in permutation-distance metrics. Notable amongst these is that they naturally capture a notion of "confidence" in an ordering and exhibit a priority for higher rankings, both of which are desirable in score based ranking applications.
• We define the Lovász-Mallows model as a conditional model over both the scores and the ranking. We also show how the LB divergence can be naturally extended to partial orderings.
• We show an application of the LB divergence as a convex regularizer in optimization problems sensitive to rankings, and also certain connections to the structured norms defined in [2].
• Finally, we connect the LB divergence to rank aggregation and rank based clustering. We show that a number of ranking models used in the past for web ranking are instances of Lovász-Bregman rank aggregation, and that a number of conditional models used in the past for learning to rank are closely related to the Lovász-Mallows model.

2 The Lovász-Bregman divergences

In this section, we briefly review the Lovász extension and define forms of the generalized Bregman and the LB divergence.

2.1 The Generalized Bregman divergences

The notation used in this section follows [34, 38]. We denote by φ a proper convex function (i.e., its domain is non-empty and it does not take the value −∞), and by reint(.) and dom(.) the relative interior and domain respectively. Recall that dom(φ) = {x ∈ R^n : φ(x) < ∞}, and the relative interior of a set S is reint(S) = {x ∈ S : ∃ ε > 0, B_ε(x) ∩ aff(S) ⊆ S}, where B_ε(x) is a ball of radius ε around x and aff(S) is the affine hull of S. A subgradient g at y ∈ dom(φ) is such that for any x, φ(x) ≥ φ(y) + ⟨g, x − y⟩; the set of all subgradients at y is the subdifferential, denoted ∂φ(y).

The Taylor series approximation of a twice differentiable convex function provides a natural way of generating a Bregman divergence [5]. Given a twice differentiable and strictly convex function φ:

d_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩.   (5)

In order to extend this notion to non-differentiable convex functions, generalized Bregman divergences have been proposed [38, 26]. While gradients no longer exist at points of non-differentiability, directional derivatives do exist in the relative interior of the domain of φ as long as the function is finite there, and a natural formulation is to replace the gradient by the directional derivative, a notion pursued in [38, 26]. In this paper, we view the generalized Bregman divergences slightly differently, in a way related to the approach of [19]. In order to ensure that subgradients exist, we only consider the relative interior of the domain. Define H_φ as a subgradient map such that for all y ∈ reint(dom(φ)), H_φ(y) ∈ ∂φ(y). Then, given x ∈ dom(φ), y ∈ reint(dom(φ)) and a subgradient map H_φ, we define the generalized Bregman divergence as:

d^{H_φ}_φ(x, y) = φ(x) − φ(y) − ⟨H_φ(y), x − y⟩.   (6)

When φ is differentiable, ∂φ(y) = {∇φ(y)} and hence H_φ(y) = ∇φ(y). This notion is related to the variant defined in [19]. Given a convex function φ, with φ* referring to its Fenchel conjugate (defined as φ*(y) = sup_x ⟨x, y⟩ − φ(x), for y ∈ dom(φ*)), and a subgradient g ∈ R^n, a variant of the generalized Bregman divergence defined in [19] is:

B_φ(x, g) = φ(x) + φ*(g) − ⟨g, x⟩.   (7)

It follows from Fenchel-Young duality ([34], Theorem 23.5) that B_φ(x, H_φ(y)) = d^{H_φ}_φ(x, y). We denote the class of divergences defined through Eqn. (6) as D_S. This class admits an interesting characterization:

Theorem 2.1. D_S is exactly the class of divergences d which satisfy the conditions that (i) for any a, d(x, a) is convex in x, and (ii) for any given vectors a, b, d(x, a) − d(x, b) is linear in x.

Proof. It is not hard to check that any divergence in D_S satisfies the convex-linear property.
We now show that any divergence satisfying the convex-linear property belongs to D_S. Given a vector a, d(x, a) is convex in x by (i); let φ_a(x) = d(x, a). Further, from (ii), d(x, b) − d(x, a) = −⟨h, x⟩ + c for some vector h ∈ R^n and some constant c ∈ R. Hence d(x, b) = φ_a(x) − ⟨h, x⟩ + c. Substituting d(b, b) = 0, we have

d(b, b) = φ_a(b) − ⟨h, b⟩ + c = 0,   (8)

and hence c = −φ_a(b) + ⟨h, b⟩. Substituting this back, we have:

d(x, b) = φ_a(x) − ⟨h, x⟩ − φ_a(b) + ⟨h, b⟩   (9)
        = φ_a(x) − φ_a(b) − ⟨h, x − b⟩.   (10)

Since d(x, b) ≥ 0, we have that h ∈ ∂φ_a(b), and hence we can define the subgradient map H_{φ_a}(b) = h. Since this holds for every b ∈ S, any convex-linear divergence belongs to D_S for an appropriate subgradient map. □

The above result gives necessary and sufficient conditions for the generalized Bregman divergences. Moreover, this class is strictly more general than the class of Bregman divergences. In order to better understand the different subgradients at play, we provide a simple example.

Example 2.1. Let φ: R^n → R, φ(x) = ||x||_1. The points of non-differentiability are where x_i = 0. Define a subgradient map such that H_φ(0) = 0, which is a valid subgradient at x = 0. We can obtain the expression in this case following simple geometry (see [38]). With the above subgradient map, we obtain d^{H_φ}_φ(x, 0) = ||x||_1, and thus recover the ℓ_1 norm. Other choices of subgradients give different valuations for d(x, 0); for example, the generalized Bregman divergence proposed in [38] gives d(x, 0) = 2||x||_1.

2.2 Properties of the Lovász Extension

We review some important theoretical properties of the Lovász extension. Given any vector y ∈ [0, 1]^n and its associated permutation σ_y, define S^{σ_y}_j = {σ_y(1), ..., σ_y(j)} for j ∈ [n]. Notice that in general σ_y need not be unique (it is unique only if y is totally ordered); let Σ_y represent the set of all permutations consistent with this ordering. Then the Lovász extension of f is defined as:

f̂(y) = Σ_{j=1}^n y[σ_y(j)] (f(S^{σ_y}_j) − f(S^{σ_y}_{j−1})).   (11)

This is also called the Choquet integral [9] of f. Though σ_y might not be unique (there may be many permutations corresponding to the ordering of y), the Lovász extension itself is unique. Furthermore, f̂ is convex if and only if f is submodular. In addition, the Lovász extension is tight on the vertices of the hypercube, in that f(X) = f̂(1_X) for all X ⊆ V (where 1_X is the characteristic vector of X, i.e., 1_X(j) = I(j ∈ X)), and hence it is a valid continuous extension.

The Lovász extension is in general a non-smooth convex function, and hence there need not exist a unique subgradient at every point. The subdifferential ∂f̂(y) corresponding to the Lovász extension, however, has a particularly interesting structure. It is instructive to consider an alternative representation of the Lovász extension. Let ∅ = Y_0 ⊂ Y_1 ⊂ Y_2 ⊂ ... ⊂ Y_k denote the unique chain corresponding to the point y, such that y = Σ_{j=1}^k λ_j 1_{Y_j}. Note that in general k ≤ n, with equality only if y is totally ordered. Then the Lovász extension can also be expressed as [18]: f̂(y) = Σ_{j=1}^k λ_j f(Y_j).
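As an illustration, here is a small Python sketch of Eqn. (11) for a set function supplied as a callable; the helper name and the choice f(X) = √|X| are our own illustrative assumptions, not constructs from the paper.

```python
import numpy as np

def lovasz_extension(f, y):
    """Lovász extension f_hat(y) = sum_j y[sigma_y(j)] * (f(S_j) - f(S_{j-1})),
    where S_j collects the j largest entries of y (Eqn. (11)); f maps a frozenset to a real."""
    y = np.asarray(y, dtype=float)
    order = np.argsort(-y)                      # sigma_y: a decreasing ordering of y
    val, prev_f, S = 0.0, f(frozenset()), set()
    for j in order:
        S.add(int(j))
        fS = f(frozenset(S))
        val += y[j] * (fS - prev_f)             # gain f(S_j) - f(S_{j-1}) weighted by y
        prev_f = fS
    return val

# Example: the (submodular) function f(X) = sqrt(|X|); on a 0/1 vector 1_X the
# extension agrees with f, as it must for a valid continuous extension.
f = lambda S: np.sqrt(len(S))
print(lovasz_extension(f, [1.0, 0.0, 1.0]))     # = f({0, 2}) = sqrt(2)
print(lovasz_extension(f, [0.8, 0.1, 0.4]))
```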
Furthermore, we define ∂f(Y) as the subdifferential of a submodular function f, which satisfies ∂f(Y) = ∂f̂(1_Y). (This is not the standard definition of ∂f(Y), but an alternate representation; see Theorem 6.16 of [18].) When defined in this form, the subdifferential corresponding to the Lovász extension has a particularly nice form:

Lemma 2.1. (Theorem 6.17, [18]) For a submodular function f and a vector y ∈ [0, 1]^n,

∂f̂(y) = ∩ {∂f(Y_i) | i = 1, 2, ..., k}.   (12)

Let Σ_[A,B] represent the set of all possible permutations of the elements in B \ A. Then we have the following fact, which is easy to verify:

Proposition 2.1. For a submodular function f and a vector y,

Σ_y = Σ_[Y_0, Y_1] × Σ_[Y_1, Y_2] × ... × Σ_[Y_{k−1}, Y_k].   (13)

Moreover, |Σ_y| = Π_{i=1}^k |Y_i \ Y_{i−1}|!.

Hence we can see that if the point y is totally ordered, k = n. The following result, due to [18, 13], characterizes the extreme points of the subdifferential polyhedron ∂f̂(y):

Lemma 2.2. [18, 13] For a submodular function f, a vector y, and a permutation σ_y ∈ Σ_y, the vector h^f_{σ_y} defined as

h^f_{σ_y}(σ_y(j)) = f(S^{σ_y}_j) − f(S^{σ_y}_{j−1}), for all j ∈ {1, 2, ..., n},

forms an extreme point of ∂f̂(y). Moreover, the number of extreme points of ∂f̂(y) is |Σ_y|.

Notice that the extreme subgradients are parameterized by the permutation σ_y, and hence we refer to them as h^f_{σ_y}. Seen in this way, the Lovász extension takes an extremely simple form: f̂(w) = ⟨h^f_{σ_w}, w⟩.

Table 1: Examples of LB divergences (I(.) is the indicator function). Each entry lists f(X), its Lovász extension f̂(x), and the divergence d_f̂(x || σ).
1) f(X) = |X| |V \ X|;  f̂(x) = Σ_{i<j} |x_i − x_j|;  d_f̂(x || σ) = Σ_{i<j} |x_{σ(i)} − x_{σ(j)}| I(σ_x^{-1}σ(i) > σ_x^{-1}σ(j)).
2) f(X) = g(|X|);  f̂(x) = Σ_{i=1}^n x(σ_x(i)) δ_g(i);  d_f̂(x || σ) = Σ_{i=1}^n [x(σ_x(i)) − x(σ(i))] δ_g(i).
3) f(X) = min{|X|, k};  f̂(x) = Σ_{i=1}^k x(σ_x(i));  d_f̂(x || σ) = Σ_{i=1}^k [x(σ_x(i)) − x(σ(i))].
4) f(X) = min{|X|, 1};  f̂(x) = max_i x_i;  d_f̂(x || σ) = max_i x_i − x(σ(1)).
5) f(X) = Σ_{i=1}^{n−1} |I(i ∈ X) − I(i+1 ∈ X)|;  f̂(x) = Σ_{i=1}^{n−1} |x_i − x_{i+1}|;  d_f̂(x || σ) = Σ_{i=1}^{n−1} |x_i − x_{i+1}| I(σ_x^{-1}σ(i) > σ_x^{-1}σ(i+1)).
6) f(X) = I(1 ≤ |X| ≤ n−1);  f̂(x) = max_i x(i) − min_i x(i);  d_f̂(x || σ) = max_i x(i) − x(σ(1)) − min_i x(i) + x(σ(n)).
7) f(X) = I(X ≠ ∅, X ≠ V);  f̂(x) = max_{i,j} |x_i − x_j|;  d_f̂(x || σ) = max_{i,j} |x_i − x_j| − |x(σ(1)) − x(σ(n))|.

We now point out an interesting property related to the extreme subgradients of f̂. Define P(σ) as the n-simplex corresponding to a permutation σ (or the chain C_σ: ∅ ⊂ S^σ_1 ⊂ ... ⊂ S^σ_n = V), i.e., P(σ) = conv(1_{S^σ_i}, i = 1, 2, ..., n). It is easy to see that P(σ) ⊆ [0, 1]^n; in particular, it carves out a particular polytope within the unit hypercube. Then we have the following result.

Lemma 2.3. (Lemma 6.19, [18]) Given a permutation σ ∈ Σ, for every vector y ∈ P(σ) the vector h^f_σ is an extreme subgradient of f̂ at y. If y belongs to the (strict) interior of P(σ), then h^f_σ is the unique subgradient of f̂ at y.

The above lemma points out a critical fact about the subgradients of the Lovász extension: they depend only on the total ordering of a vector and are independent of the vector itself. This also implies that if y is totally ordered (it belongs to the interior of P(σ_y)), then ∂f̂(y) consists of a single (unique) subgradient.
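The extreme points of Lemma 2.2 are easy to compute directly. The following sketch (our own illustration, with hypothetical helper names) builds h^f_σ by accumulating marginal gains along σ:

```python
import numpy as np

def extreme_subgradient(f, sigma):
    """h^f_sigma from Lemma 2.2: h[sigma(j)] = f(S_j) - f(S_{j-1}),
    where S_j = {sigma(1), ..., sigma(j)}; f maps a frozenset to a real."""
    n = len(sigma)
    h = np.zeros(n)
    S, prev_f = set(), f(frozenset())
    for j in sigma:
        S.add(int(j))
        fS = f(frozenset(S))
        h[j] = fS - prev_f
        prev_f = fS
    return h

# h depends only on the ordering sigma, not on the vector that induced it, and
# <h^f_{sigma_w}, w> recovers the Lovász extension at w.
f = lambda S: np.sqrt(len(S))
sigma = np.array([2, 0, 1])                     # item 2 first, then 0, then 1
print(extreme_subgradient(f, sigma))            # [sqrt(2)-1, sqrt(3)-sqrt(2), 1] at positions 0, 1, 2
```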
Hence, two entirely different but identically ordered vectors will have identical extreme subgradients. This fact is important when defining and understanding the properties of the LB divergence. Also, P(σ) carves out a unique polytope in [0, 1]^n, and the following lemma provides some insight into this:

Lemma 2.4. (Lemma 6.18, [18]) Two polytopes P(σ_1) and P(σ_2) share a face if and only if the permutations σ_1 and σ_2 are such that there exists k ≤ n with S^{σ_1}_j = S^{σ_2}_j for all j ≠ k.

That is, two permutation polytopes are adjacent to each other only if the corresponding permutations are related by a transposition of adjacent elements. E.g., if n = 5, the permutations {1, 2, 3, 5, 4} and {1, 2, 3, 4, 5} share a face, while the permutations {1, 2, 5, 4, 3} and {1, 2, 3, 4, 5} do not. These properties of the Lovász extension, as we shall see, play a key role in the properties and applications of the LB divergence.

2.3 The Lovász-Bregman divergences

We are now in a position to define the Lovász-Bregman divergence. Throughout this paper, we restrict dom(f̂) to be [0, 1]^n. For the applications we consider, we lose no generality with this assumption, since the scores can easily be scaled to lie within this volume. Consider the case when y is totally ordered, so that |Σ_y| = 1. It follows from Lemma 2.3 that there exists a unique subgradient, and H_f̂(y) = h^f_{σ_y}. Hence, for any x ∈ [0, 1]^n, we have from Eqn. (6) that [20]:

d_f̂(x, y) = f̂(x) − ⟨x, h^f_{σ_y}⟩ = ⟨x, h^f_{σ_x} − h^f_{σ_y}⟩.   (14)

Notice that this divergence depends only on x and σ_y, and is independent of y itself. In particular, the LB divergence between a vector x and any vector y ∈ P(σ) is the same for all y ∈ P(σ) (Lemma 2.3). We also invoke the following lemma from [21]:

Lemma 2.5. (Theorem 2.2, [21]) Given a submodular function whose polyhedron contains all possible extreme points, and x which is totally ordered, d_f̂(x, y) = 0 if and only if σ_x = σ_y.

At first sight it may seem that the class of submodular functions satisfying Lemma 2.5 is very specific. We point out, however, that this class is quite general, and many of the instances we consider in this paper belong to it. For example, it is easy to see that the class of submodular functions f(X) = g(|X|), where g is a concave function satisfying g(i) − g(i−1) ≠ g(j) − g(j−1) for i ≠ j, belongs to this class. Hence the Lovász-Bregman divergence is a score based permutation divergence, and we denote it as:

d_f̂(x || σ) = ⟨x, h^f_{σ_x} − h^f_σ⟩.   (15)

Observe that this form also bears a resemblance to Eqn. (7). In particular, Eqn. (7) defines a distance between a vector x and a subgradient, while Eqn. (15) is defined via a permutation. Since the extreme subgradients of the Lovász extension are in one-to-one correspondence with permutations, Eqn. (15) can be seen as an instance of Eqn. (7). As we shall see in the next section, this divergence has a number of properties akin to the standard permutation based distance metrics. Since a large class of submodular functions satisfies the above property (of having all possible extreme points), the Lovász-Bregman divergence forms a large class of divergences.
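A minimal sketch of Eqn. (15), assuming a totally ordered x and a submodular f given as a Python callable (the helper names and the choice f(X) = √|X| are our own assumptions):

```python
import numpy as np

def extreme_subgradient(f, sigma):
    """h^f_sigma(sigma(j)) = f(S_j) - f(S_{j-1}) (Lemma 2.2)."""
    h, S, prev = np.zeros(len(sigma)), set(), f(frozenset())
    for j in sigma:
        S.add(int(j)); fS = f(frozenset(S)); h[j] = fS - prev; prev = fS
    return h

def lb_divergence(f, x, sigma):
    """d_fhat(x || sigma) = <x, h^f_{sigma_x} - h^f_sigma>  (Eqn. (15))."""
    x = np.asarray(x, dtype=float)
    sigma_x = np.argsort(-x)
    return float(x @ (extreme_subgradient(f, sigma_x) - extreme_subgradient(f, sigma)))

f = lambda S: np.sqrt(len(S))                    # a monotone submodular function
x = np.array([0.9, 0.5, 0.1])
print(lb_divergence(f, x, np.array([0, 1, 2])))  # 0: sigma agrees with the ordering of x
print(lb_divergence(f, x, np.array([2, 1, 0])))  # > 0: sigma reverses the ordering of x
```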
Note that y may not always be totally ordered. In this case, we resort to the notion of a subgradient map defined in Section 2.1. A natural choice of subgradient map in such cases is:

H_f̂(y) = (Σ_{σ ∈ Σ_y} h^f_σ) / |Σ_y|.   (16)

Assuming f is a monotone non-decreasing, non-negative and normalized submodular function (we shall see that we can assume this without loss of generality), it is easy to see that 0 ∈ ∂f̂(0), and correspondingly we assume that H_f̂(0) = 0. Notice that then d_f̂(x, 0) = f̂(x).

Another related divergence is the generalized Bregman divergence obtained through φ(x) = f̂(|x|). We denote this divergence as d_|f̂|(x, y), and it is easy to see that d_|f̂|(x, y) = d_f̂(x, y) for all x, y ∈ R^n_+. In other words, this generalizes the Lovász-Bregman divergence to other orthants, and hence we call it the extended Lovász-Bregman divergence. In a manner similar to the Lovász-Bregman divergence, we can also define a valid subgradient map here; the only difference is that there are additional points of non-differentiability at the changes of orthant. This divergence captures a distance between orderings while simultaneously considering the signs of the vectors. For example, the two vectors [−1, −2] and [2, 1], though having the same ordering, are different from the perspective of the extended Lovász-Bregman divergence, whereas the LB divergence considers only the ordering of the elements in a vector.

2.4 Lovász-Bregman Divergence Examples

Below is a partial list of instances of the Lovász-Bregman divergence. We shall see that a number of these are closely related to standard permutation based metrics. Table 1 considers several other examples of LB divergences.

Cut function and symmetric submodular functions: A fundamental submodular function, which is also symmetric, is the graph cut function f(X) = Σ_{i ∈ X} Σ_{j ∈ V\X} d_ij. The Lovász extension of f is f̂(x) = Σ_{i,j} d_ij (x_i − x_j)_+ [2]. The LB divergence corresponding to f̂ then has a nice form:

d_f̂(x || σ) = Σ_{i,j} d_ij |x_i − x_j| I(x_i < x_j, σ^{-1}(i) < σ^{-1}(j))
            = Σ_{i,j} d_ij |x_i − x_j| I(σ_x^{-1}(i) > σ_x^{-1}(j), σ^{-1}(i) < σ^{-1}(j))
            = Σ_{i,j : σ^{-1}(i) < σ^{-1}(j)} d_ij |x_i − x_j| I(σ_x^{-1}(i) > σ_x^{-1}(j))
            = Σ_{i<j} d_{σ(i)σ(j)} |x_{σ(i)} − x_{σ(j)}| I(σ_x^{-1}σ(i) > σ_x^{-1}σ(j)).   (17)

We additionally assume that d is symmetric (i.e., d_ij = d_ji for all i, j ∈ V), and hence f is also symmetric. Indeed, a weighted version of the Kendall τ can be written as d^w_T(σ, π) = Σ_{i<j} w_ij I(σ^{-1}π(i) > σ^{-1}π(j)), and d_f̂(x || σ) is then exactly a form of d^w_T(σ_x, σ) with w_ij = d_{σ(i)σ(j)} |x_{σ(i)} − x_{σ(j)}|. Moreover, if d_ij = 1/|x_i − x_j|, we have d_f̂(x || σ) = d_T(σ_x, σ). Similarly, given the Kendall τ distance d_T(σ, π), we can choose an arbitrary x = x_σ such that σ_x = σ, and with an appropriate choice of d_ij we have d_T(σ, π) = d_f̂(x_σ || π); hence we recover the Kendall τ for that particular x. An interesting special case is f(X) = |X| |V \ X|, in which case we get:

d_f̂(x || σ) = Σ_{i<j} |x_{σ(i)} − x_{σ(j)}| I(σ_x^{-1}σ(i) > σ_x^{-1}σ(j)).

This is also closely related to the Kendall τ metric.
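The cut-based form (17) can be evaluated as a direct pairwise sum. The sketch below (our own, with an explicit O(n²) loop and a hypothetical weight matrix d) makes the weighted-Kendall-τ reading concrete:

```python
import numpy as np
from itertools import combinations

def cut_lb_divergence(x, sigma, d):
    """Eqn. (17): sum over pairs ordered by sigma of d_ij |x_i - x_j| for the pairs
    that sigma orders oppositely to the ordering induced by x."""
    x = np.asarray(x, dtype=float)
    rank_x = np.empty(len(x), dtype=int); rank_x[np.argsort(-x)] = np.arange(len(x))
    total = 0.0
    for a, b in combinations(range(len(sigma)), 2):        # a ranked above b by sigma
        i, j = sigma[a], sigma[b]
        if rank_x[i] > rank_x[j]:                          # but x ranks j above i: discordant pair
            total += d[i, j] * abs(x[i] - x[j])
    return total

x = np.array([0.9, 0.2, 0.5])
sigma = np.array([1, 0, 2])                                # item 1 first, then 0, then 2
d = np.ones((3, 3))                                        # f(X) = |X||V \ X|: unweighted cut
print(cut_lb_divergence(x, sigma, d))                      # weighted count of discordant pairs
# With d_ij = 1/|x_i - x_j| the same sum reduces to the plain Kendall tau distance.
```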
Cardinality based monotone submodular functions: Another class of submodular functions is f(X) = g(|X|) for a concave function g. This form induces an interesting class of Lovász-Bregman divergences. In this case h^f_{σ_x}(σ_x(i)) = g(i) − g(i−1). Define δ_g(i) = g(i) − g(i−1); then:

d_f̂(x || σ) = Σ_{i=1}^n x[σ_x(i)] δ_g(i) − Σ_{i=1}^n x[σ(i)] δ_g(i).   (18)

Notice that we can start with any δ_g such that δ_g(1) ≥ δ_g(2) ≥ ... ≥ δ_g(n) and from it obtain the corresponding function g. Consider a specific example with δ_g(i) = n − i. Then d_f̂(x || σ) = Σ_{i=1}^n i x[σ(i)] − Σ_{i=1}^n i x[σ_x(i)] = ⟨x, σ^{-1} − σ_x^{-1}⟩. This expression looks similar to the Spearman footrule (Eqn. (2)), except for being additionally weighted by x.

We can also extend this in several ways. For example, consider a restriction to the top m elements (m < n). Define f(X) = min{g(|X|), g(m)}. Then it is not hard to verify that:

d_f̂(x || σ) = Σ_{i=1}^m x[σ_x(i)] δ_g(i) − Σ_{i=1}^m x[σ(i)] δ_g(i).   (19)

A specific example is f(X) = min{|X|, m}, where

d_f̂(x || σ) = Σ_{i=1}^m (x(σ_x(i)) − x(σ(i))).   (20)

In this case, the divergence between x and σ is the difference between the sum of the largest m values of x and the sum of the first m values of x under the ordering σ. Here the ordering within the top m is not really important: if σ_x and σ have, under x, the same sum of first m values, the divergence is zero (irrespective of their internal ordering or individual element valuations). We can also define δ_g such that δ_g(1) = 1 and δ_g(i) = 0 for all i ≠ 1. Then d_f̂(x || σ) = max_j x(j) − x(σ(1)), which is equivalent to Eqn. (20) with m = 1. In this case the divergence depends only on the top value, and if σ_x and σ have the same leading element, the divergence is zero.

2.5 Lovász-Bregman as ranking measures

In this section, we show how the Lovász-Bregman subsumes, or is closely related to, several commonly used ranking loss functions in Information Retrieval.

The Normalized Discounted Cumulative Gain (NDCG): The NDCG metric [23] is one of the most widely used ranking measures in web search. Given a relevance vector r, where the entry r_i typically provides the relevance of a document i ∈ {1, 2, ..., n} to a query, and an ordering of documents σ, the NDCG loss with respect to a discount function D is defined as:

L(σ) = [Σ_{i=1}^k r(σ_r(i)) D(i) − Σ_{i=1}^k r(σ(i)) D(i)] / Σ_{i=1}^k r(σ_r(i)) D(i),   (21)

where k ≤ n is often used as a cutoff. Intuitively, the NDCG loss compares an ordering σ to the best possible ordering σ_r. The typical choice is D(i) = 1/log(1+i), though in general any decreasing function can be used. This loss is closely related to a form of the LB divergence. In particular, notice that L(σ) ∝ Σ_{i=1}^k r(σ_r(i)) D(i) − Σ_{i=1}^k r(σ(i)) D(i), since the denominator of Eqn. (21) is a constant; this is a form of Eqn. (19) with m = k and the choice g(i) = Σ_{j=1}^i D(j). It is not hard to see that when D is decreasing, g is concave.
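To make the NDCG connection concrete, the following sketch (our own; the helper names are hypothetical) checks numerically that the numerator of Eqn. (21) coincides with the truncated cardinality-based form of Eqn. (19) when δ_g(i) = D(i):

```python
import numpy as np

def ndcg_loss_numerator(r, sigma, k, D=lambda i: 1.0 / np.log1p(i)):
    """Unnormalized NDCG loss, i.e. the numerator of Eqn. (21):
    sum_{i<=k} r(sigma_r(i)) D(i) - sum_{i<=k} r(sigma(i)) D(i)."""
    r = np.asarray(r, dtype=float)
    sigma_r = np.argsort(-r)                               # the ideal ordering of documents
    disc = np.array([D(i) for i in range(1, k + 1)])
    return float(r[sigma_r[:k]] @ disc - r[sigma[:k]] @ disc)

def truncated_cardinality_lb(x, sigma, delta_g):
    """Eqn. (19): sum_{i<=m} x(sigma_x(i)) delta_g(i) - sum_{i<=m} x(sigma(i)) delta_g(i)."""
    x = np.asarray(x, dtype=float)
    m = len(delta_g)
    return float(x[np.argsort(-x)[:m]] @ delta_g - x[sigma[:m]] @ delta_g)

r = np.array([3.0, 0.0, 2.0, 1.0])                          # relevance scores
sigma = np.array([2, 0, 3, 1])                              # a proposed ranking
k = 3
delta_g = np.array([1.0 / np.log1p(i) for i in range(1, k + 1)])   # delta_g(i) = D(i), so g is concave
assert np.isclose(ndcg_loss_numerator(r, sigma, k), truncated_cardinality_lb(r, sigma, delta_g))
```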
Area Under the Curve: Another commonly used ranking measure is the Area Under the Curve [36]. Unlike NDCG, however, this relies only on a partial ordering of the documents and not a complete ordering. In particular, denote G as a set of "good" documents and B as a set of "bad" documents. Then the loss function L(σ) corresponding to an ordering of documents σ is

L(σ) = (1/(|G| |B|)) Σ_{g ∈ G, b ∈ B} I(σ(g) > σ(b)).   (22)

This can be seen as an instance of the LB divergence corresponding to the cut function by choosing d_ij = 1/(|G| |B|) for all i, j, x_g = 1 for all g ∈ G, and x_b = 0 for all b ∈ B.

3 Lovász-Bregman Properties

In this section, we analyze some interesting properties of the LB divergences. While many of these properties show strong similarities with permutation based metrics, the Lovász-Bregman divergence enjoys some unique properties, thereby providing novel insight into the problem of combining and clustering ordered vectors.

Non-negativity and convexity: The LB divergence is a divergence, in that d_f̂(x || σ) ≥ 0 for all x, σ. Additionally, if the submodular polyhedron of f has all possible extreme points, d_f̂(x || σ) = 0 iff σ_x = σ. Also, the Lovász-Bregman divergence d_f̂(x || σ) is convex in x for a given σ.

Equivalence Classes: The LB divergences of submodular functions which differ only in a modular term are equal. Hence, for a submodular function f and a modular function m, d_{f+m}̂(x || σ) = d_f̂(x || σ). Since any submodular function can be expressed as a difference between a polymatroid function and a modular function [11], it suffices to consider polymatroid (i.e., non-negative monotone submodular) functions when defining the LB divergences.

Linearity and Linear Separation: The LB divergence is a linear operator in the submodular function f. Hence, for two submodular functions f_1, f_2, d_{f_1+f_2}̂(x || σ) = d_{f̂_1}(x || σ) + d_{f̂_2}(x || σ). The LB divergence also has the property of linear separation: the set of points x equidistant from two permutations σ_1 and σ_2 (i.e., {x : d_f̂(x || σ_1) = d_f̂(x || σ_2)}) forms a hyperplane. Similarly, for any x, the set of points y such that d_f̂(x, y) is constant is P(σ_y).

Invariance over relabelings: The permutation based distance metrics have the property of being left invariant with respect to reorderings, i.e., given permutations π, σ, τ, d(π, σ) = d(τπ, τσ). While this property may not hold for the Lovász-Bregman divergences in general, the following theorem shows that it holds for a large class of them.

Theorem 3.1. Given a submodular function f such that h^f_{τσ} = τ h^f_σ for all σ, τ ∈ Σ, we have d_f̂(x || σ) = d_f̂(τx || τσ).

Proof. Note that σ_{τx} = τσ_x. Hence d_f̂(τx || τσ) = ⟨τx, h^f_{σ_{τx}} − h^f_{τσ}⟩ = ⟨τx, τh^f_{σ_x} − τh^f_σ⟩ = ⟨x, h^f_{σ_x} − h^f_σ⟩ = d_f̂(x || σ), since applying the same relabeling τ to both arguments of the inner product leaves it unchanged. □

[Figure 1: A visualization of d_f̂(x || σ) (panels a–c: LB2D, LB3D1, LB3D2) and d_T(σ_x, σ) (panels d–f: KT2D, KT3D1, KT3D2). The figures show a visualization in 2D and two views in 3D for each, with σ as {1, 2} and {1, 2, 3} and x ∈ [0, 1]^2 and [0, 1]^3 respectively.]

The condition of Theorem 3.1 may seem demanding for a submodular function, but a large class of submodular functions can be seen to have this property. In fact, it can be verified that any cardinality based submodular function has it.

Corollary 3.1.1. Given a submodular function f such that f(X) = g(|X|) for some function g, d_f̂(x || σ) = d_f̂(τx || τσ).
This follows directly from Eqn. (18) and from observing that the extreme points of the corresponding polyhedron are reorderings of each other; in other words, in these cases the submodular polyhedron forms a permutahedron. The property also holds for sums of such functions, and therefore for many of the special cases we have considered. For example, the cut function f(X) = |X| |V \ X| and the functions f(X) = g(|X|) are both cardinality based functions.

Dependence on the values and not just the orderings: We now analyze a key property of the LB divergence that is not present in other permutation based divergences. Consider the problem of combining rankings where, given a collection of scores x_1, ..., x_n, we want to produce a joint ranking. An extreme case is an x whose elements are all equal; such an x expresses no preference in the joint ranking, and indeed it is easy to verify that for such an x, d_f̂(x || σ) = 0 for every σ. Now consider an x whose elements are almost (but not exactly) equal. Even though this vector is totally ordered, it expresses very low confidence in its ordering, and we would expect d_f̂(x || σ) to be small for every σ. Indeed, we have the following result:

Theorem 3.2. Given a monotone submodular function f and any permutation σ,

d_f̂(x || σ) ≤ n ε (max_j f(j) − min_j f(j | V \ j)) ≤ n ε max_j f(j),

where ε = max_{i,j} |x_i − x_j| and f(j | A) = f(A ∪ j) − f(A).

Proof. Decompose x = (min_j x_j) 1 + r, where r_i = x_i − min_j x_j. Notice that |r_i| ≤ ε. Moreover, σ_x = σ_r, and hence d_f̂(x || σ) = d_f̂((min_j x_j) 1 || σ) + d_f̂(r || σ) = d_f̂(r || σ), since ⟨1, h^f_{σ_r} − h^f_σ⟩ = f(V) − f(V) = 0. Now, d_f̂(r || σ) = ⟨r, h^f_{σ_r} − h^f_σ⟩ ≤ ||r||_2 ||h^f_{σ_r} − h^f_σ||_2. Finally, ||r||_2 ≤ ε√n and ||h^f_{σ_r} − h^f_σ||_2 ≤ √n (max_j f(j) − min_j f(j | V \ j)); combining the two gives the first inequality. The second inequality follows from the monotonicity of f. □

The above theorem implies that if the vector x is such that all its elements are almost equal, then ε is small and the LB divergence is proportionately small. This bound can be improved in certain cases; for example, for the cut function with f(X) = |X| |V \ X|, we have d_f̂(x || σ) ≤ ε d_T(σ_x, σ) ≤ ε n(n−1)/2, where d_T is the Kendall τ.
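A small numeric illustration of this confidence behaviour (our own sketch, reusing the construction of Eqn. (15) with the assumed choice f(X) = √|X|): a near-constant score vector incurs a divergence roughly proportionally smaller, in ε, than a well-spread one against the same reference ordering, as Theorem 3.2 predicts.

```python
import numpy as np

def extreme_subgradient(f, sigma):
    h, S, prev = np.zeros(len(sigma)), set(), f(frozenset())
    for j in sigma:
        S.add(int(j)); fS = f(frozenset(S)); h[j] = fS - prev; prev = fS
    return h

def lb_divergence(f, x, sigma):
    x = np.asarray(x, dtype=float)
    return float(x @ (extreme_subgradient(f, np.argsort(-x)) - extreme_subgradient(f, sigma)))

f = lambda S: np.sqrt(len(S))
sigma = np.array([2, 1, 0])                      # a reference ordering that reverses both vectors
confident   = np.array([0.90, 0.50, 0.10])       # large spread: epsilon = 0.8
unconfident = np.array([0.51, 0.50, 0.49])       # near-constant: epsilon = 0.02
print(lb_divergence(f, confident,   sigma))      # relatively large
print(lb_divergence(f, unconfident, sigma))      # ~40x smaller, as the epsilon-scaling of Theorem 3.2 predicts
```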
Priority for higher rankings: We now show another nice property of the LB divergence: a natural priority over positions in the ranking. This property is tied intrinsically to the submodularity of the generating function. We have the following result, which follows directly from the definitions:

Lemma 3.1. Given permutations σ, π such that P(σ) and P(π) share a face (say S^σ_k ≠ S^π_k), and x ∈ P(π), then d_f̂(x || σ) = (x_k − x_{k+1}) (f(σ_x(k) | S^σ_{k−1}) − f(σ_x(k) | S^σ_k)).

Now consider the class of submodular functions f such that for all j, k ∉ X with j ≠ k, f(j | S) − f(j | S ∪ k) is monotone decreasing as a function of S. An example of such a submodular function is again f(X) = g(|X|) for a concave g. It is then clear from the above lemma that d_f̂(x || σ) will be larger for smaller k. In other words, if π and σ differ at the start of the ranking, the divergence is larger than if π and σ differ somewhere towards the end of the ranking. This kind of weighting is most prominent for the class of functions that depend only on cardinality, i.e., f(X) = g(|X|); recall that many of our special cases belong to this class. There we have d_f̂(x || σ) = Σ_{i=1}^n (x(σ_x(i)) − x(σ(i))) δ_g(i), and since δ_g(1) ≥ δ_g(2) ≥ ... ≥ δ_g(n), it follows that if σ_x and σ differ at the start of the ranking, they are penalized more.

Extensions to partial orderings and top-m lists: So far we have considered notions of distance between a score x and a complete permutation σ. Often we may not be interested in a distance to a total ordering σ, but rather a distance to, say, a top-m list [27] or a partial ordering between elements [40, 8]. The LB divergence has a nice interpretation for both of these. In the context of top-m lists, we can use Eqn. (19); this corresponds exactly to the divergence between different, possibly overlapping, sets of m objects. Moreover, if we are simply interested in the top m elements without their internal ordering, we have Eqn. (20); a special case of this is when we are interested only in the top value. Another interesting instance is partial orderings, where we do not care about the total ordering. For example, in web ranking we often care only that the relevant documents be placed above the irrelevant ones. We can then define a distance d_f̂(x || P), where P refers to a partial ordering, by using the cut based Lovász-Bregman divergence (Eqn. (17)) and defining the graph to have edges corresponding to the partial order. For example, if we are interested in the partial order 1 > 2, 3 > 2 over the elements {1, 2, 3, 4}, we can set d_{1,2} = d_{3,2} = 1 and the remaining d_ij = 0 in Eqn. (17). Defined in this way, the LB divergence measures the distortion between a vector x and the partial ordering 1 > 2, 3 > 2. In all these cases the extensions to partial rankings are natural in our framework, without needing to significantly change the expressions.

Lovász-Mallows model: We now extend the notion of the Mallows model to the LB divergence. We first define the Mallows model for the LB divergence:

p(x | θ, σ) = exp(−θ d_f̂(x || σ)) / Z(θ, σ), with θ ≥ 0.   (23)

For this to be a valid probability distribution, we assume that the domain D of x is a bounded set (say [0, 1]^n). We also assume that the domain is symmetric over permutations (i.e., for all σ ∈ Σ, if x ∈ D then xσ ∈ D). Unlike the standard Mallows model, however, this is defined over scores (or valuations) as opposed to permutations. For the class of LB divergences that are invariant over relabelings, defining the distribution over such a symmetric domain gives Z(θ, σ) = Z(θ), since

Z(θ, σ) = ∫ exp(−θ d_f̂(x, σ)) dx = ∫ exp(−θ d_f̂(xσ^{-1}, σ_0)) dx = ∫ exp(−θ d_f̂(x', σ_0)) dx' = Z(θ),

where σ_0 = {1, 2, ..., n}.

[Figure 2: A visualization of the sub-level sets of d_|f̂|(x, y) for y = [1, 2] (left, "Norm-1") and y = [2, 1] (right, "Norm-2").]
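As a sketch (ours; the normalizer Z(θ) is deliberately not computed, and f(X) = √|X| is an assumed choice), the unnormalized Lovász-Mallows density of Eqn. (23) can be evaluated directly from the LB divergence:

```python
import numpy as np

def extreme_subgradient(f, sigma):
    h, S, prev = np.zeros(len(sigma)), set(), f(frozenset())
    for j in sigma:
        S.add(int(j)); fS = f(frozenset(S)); h[j] = fS - prev; prev = fS
    return h

def lovasz_mallows_unnormalized(f, x, sigma, theta):
    """exp(-theta * d_fhat(x || sigma)), the numerator of Eqn. (23); Z(theta) is not computed here."""
    x = np.asarray(x, dtype=float)
    d = float(x @ (extreme_subgradient(f, np.argsort(-x)) - extreme_subgradient(f, sigma)))
    return np.exp(-theta * d)

f = lambda S: np.sqrt(len(S))
print(lovasz_mallows_unnormalized(f, [0.9, 0.5, 0.1], np.array([0, 1, 2]), theta=5.0))  # = 1.0
print(lovasz_mallows_unnormalized(f, [0.9, 0.5, 0.1], np.array([2, 1, 0]), theta=5.0))  # < 1.0
```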
We can also define an extended Mallows model for combining rankings, analogous to [28]. Unlike the Mallows model, however, this is a model over permutations given a collection of vectors X = {x_1, ..., x_n} and parameters Θ = {θ_1, ..., θ_n}:

p(σ | Θ, X) = exp(−Σ_{i=1}^n θ_i d_f̂(x_i || σ)) / Z(Θ, X).   (24)

This model can be used to combine rankings using the LB divergences, in a manner akin to Cranking [28]. The extended Lovász-Mallows model also admits an interesting Bayesian interpretation, thereby providing a generative view of the model:

p(σ | Θ, X) ∝ p(σ) Π_{i=1}^n p(x_i | σ, θ_i).   (25)

Again, this follows directly from the fact that, in the Lovász-Mallows model, the normalizing constants (which are independent of σ) cancel out. We shall see some very interesting connections between this conditional model and web ranking.

Lovász-Bregman as a regularizer: Consider the problem of minimizing a convex function subject to a permutation constraint. In these cases the Lovász-Bregman divergence can be seen as a regularizer, since the problem min_{x ∈ P(σ)} ψ(x) is equivalent to min_{x : d_f̂(x, σ) = 0} ψ(x). Alternatively, we may consider:

min_x ψ(x) + λ d_f̂(x || σ).   (26)

This remains a convex optimization problem, since the LB divergence is convex in its first argument. The problem is intimately related to the proximal methods investigated in [1]. Indeed, Eqn. (26) is equivalent to min_x ψ(x) − λ⟨h^f_σ, x⟩ + λ f̂(x), which is of the form min_x φ(x) + f̂(x), and hence the methods in [1] apply. The connection between the LB divergences and the submodular norms can be made more explicit as follows. Consider the extended Lovász-Bregman divergence d_|f̂|. When f is monotone, 0 is a subgradient of x ↦ f̂(|x|) at x = 0, and hence choosing an appropriate subgradient map ensures that d_|f̂|(x, 0) = f̂(|x|), which are the polyhedral norms defined in [1]. Moreover, we get other interesting norm-like structures when y ≠ 0. For example, Figure 2 shows a visualization of the sub-level sets of d_|f̂|(x, y) for two choices of y with different orderings. These sub-level sets are open and naturally show a preference for the ordering defined by the particular y.

4 Applications

Rank Aggregation: As argued above, the LB divergence is a natural model for the problem of combining scores, where both the ordering and the valuations are provided. If we ignore the values and consider only the rankings, this becomes rank aggregation, and a natural choice in such problems is the Kendall τ distance [28, 27, 32]. On the other hand, if we consider only the values without explicitly modeling the orderings, this becomes an incarnation of boosting [16]. The Lovász-Bregman divergence combines both aspects of the problem: it combines orderings using a permutation based divergence while simultaneously using the additional information of the confidence in the orderings provided by the valuations.

[Figure 3: Clustering with (a) the Lovász-Bregman in 2D, (b) a combination of Lovász-Bregman and Euclidean in 2D, (c) the Lovász-Bregman in 3D, and (d) the Lovász-Bregman with a top 1-list in 3D. All of these use f(X) = √|X|.]
We can then pose the problem as:

σ ∈ argmin_{σ' ∈ Σ} Σ_{i=1}^n d_f̂(x_i || σ').   (27)

This notion of a representative ordering (also known as the mean ordering) is very common in many applications [3] and has also been used in the context of combining rankings [32, 28, 27]. Unfortunately, for the permutation based metrics this problem was shown to be NP-hard [4]. Surprisingly, for the LB divergence the problem is easy and has a closed form: the representative permutation is exactly the ordering corresponding to the arithmetic mean of the elements in X.

Lemma 4.1. [20] Given a submodular function f, the Lovász-Bregman representative (Eqn. (27)) is σ = σ_μ, where μ = (1/n) Σ_{i=1}^n x_i.

This result builds on the known result for Bregman divergences [3]. It may seem surprising at first. Notice, however, that the arithmetic mean uses additional information about the scores and their confidence, as opposed to just the orderings; in this light it is reasonable that the representative should be closely related to the ordering of the arithmetic mean of the objects. We shall also see that this notion has in fact been used ubiquitously, though unintentionally, in the web ranking and information retrieval communities.

We illustrate the utility of Lovász-Bregman rank aggregation through the following argument. Assume that a particular vector x is uninformative about the true ordering (i.e., the values of x are almost equal). Then, with the LB divergence, d(x || π) ≈ 0 for every permutation π, and hence this vector contributes almost nothing to the mean ordering. If we instead use a permutation based metric, it ignores the values and considers only the permutation; as a result, the mean ordering is swayed by such vectors even though they are uninformative about the true ordering. As an example, consider the set of scores X = {1.9, 2}, {1.8, 2}, {1.95, 2}, {2, 1}, {2.5, 1.2}. The representative of this collection as seen by a permutation based metric would be the ordering of the first three vectors (the second item ranked first), even though those vectors have very low confidence. The arithmetic mean of these vectors, however, is {2.03, 1.64}, and the Lovász-Bregman representative is therefore the ordering that ranks the first item on top. The arithmetic mean also provides a notion of the confidence of the population: if the total variation [2] of the arithmetic mean is small, the population is not confident about its ordering, while if the variation is high, it provides a certificate of a homogeneous population.

Figure 1 provides a visualization of the Lovász-Bregman divergence using the cut function and of the Kendall τ metric, in 2 and 3 dimensions respectively. We see the similarity between the two divergences and, at the same time, the dependence on the "scores" in the Lovász-Bregman case.
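Lemma 4.1 makes LB rank aggregation a one-liner. The sketch below (our own) reproduces the worked example above: the mean is [2.03, 1.64], so the representative ranks the first item on top even though a majority of the (low-confidence) voters prefer the second.

```python
import numpy as np

def lb_representative_ordering(X):
    """Lemma 4.1: the Lovász-Bregman representative ordering of a set of score vectors
    is simply the ordering induced by their arithmetic mean."""
    mu = np.mean(np.asarray(X, dtype=float), axis=0)
    return np.argsort(-mu), mu

# The example from the text (0-indexed items): three low-confidence voters slightly
# prefer the second item, two confident voters clearly prefer the first.
X = [[1.9, 2.0], [1.8, 2.0], [1.95, 2.0], [2.0, 1.0], [2.5, 1.2]]
sigma, mu = lb_representative_ordering(X)
print(mu)      # [2.03 1.64]
print(sigma)   # [0 1]: the first item is ranked on top, unlike a majority vote over orderings
```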
Learning to Rank: We now investigate a specific instance of the rank aggregation problem, the "learning to rank" problem. A large class of algorithms has been proposed for this problem; see [29] for a survey. One specific class of algorithms focuses on maximum margin learning using ranking based loss functions (see [40, 8] and references therein). While we have seen that the ranking based losses themselves are instances of the LB divergence, the feature functions are also closely related. In particular, given a query q, denote the feature vector corresponding to document i ∈ {1, 2, ..., n} as x_i ∈ R^d, where each element of x_i denotes a quality of document i based on a particular indicator or feature, and let X = {x_1, ..., x_n}. We assume we have d feature functions (one might, for example, be a match with the title, another might be PageRank, etc.). Denote x^j_i as the score of the j-th feature for document i, and x^j ∈ R^n as the score vector of feature j over all documents, i.e., x^j = (x^j_1, x^j_2, ..., x^j_n). One possible choice of feature function is:

φ(X, σ) = Σ_{j=1}^d w_j d_f̂(x^j || σ),   (28)

for a weight vector w ∈ R^d. Given a particular weight vector w, the inference problem is then to find the permutation σ which minimizes φ(X, σ). Thanks to Lemma 4.1, this permutation is exactly the ordering of the vector Σ_{j=1}^d w_j x^j, which corresponds exactly to ordering the scores w^T x_i for i ∈ {1, 2, ..., n}. Interestingly, many of the feature functions used in [40, 8] are closely related to forms of Eqn. (28). In fact, the motivation for defining those feature functions is exactly that the inference problem for a given set of weights w be solvable by simply ordering the scores w^T x_i for every i ∈ {1, 2, ..., n} [8]. Through Eqn. (28) we thus obtain a large class of possible feature functions for this problem. We also point out a connection between the learning to rank problem and the Lovász-Mallows model. In particular, recent work [12] defined a conditional probability model over permutations as:

p(σ | w, X) = exp(w^T φ(X, σ)) / Z.   (29)

This conditional model is exactly the extended Lovász-Mallows model of Eqn. (24) when φ is defined as in Eqn. (28). The conditional models used in [12] are in fact closely related to this, and correspondingly Eqn. (29) offers a large class of conditional ranking models for the learning to rank problem.

Clustering: A natural generalization of rank aggregation is clustering. In this context, we assume a heterogeneous model, where the data are represented as a mixture of ranking models, each mixture component representing a homogeneous population. It is natural to define a clustering objective in such scenarios. Assume a set of representatives Σ = {σ_1, ..., σ_k} and a set of clusters C = {C_1, C_2, ..., C_k}. The clustering objective is then: min_{C,Σ} Σ_{j=1}^k Σ_{i : x_i ∈ C_j} d_f̂(x_i || σ_j). As shown in [20], a simple k-means style algorithm finds a local minimum of this objective. Moreover, thanks to the simplicity of obtaining the means in this case, the algorithm is extremely scalable and practical. One additional property of the Lovász-Bregman divergences is that they combine easily with other Bregman divergences: since the sum of Bregman divergences is also a Bregman divergence, the k-means algorithm goes through almost identically if we add other Bregman divergences (for example, the squared Euclidean distance).
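A minimal sketch of the k-means style alternation described above (our own illustration, not the exact algorithm of [20]; the choice f(X) = √|X|, the random initialization, and the synthetic data are assumptions): assignment uses d_f̂(x || σ_j), and the update step is just the ordering of each cluster mean, by Lemma 4.1.

```python
import numpy as np

def extreme_subgradient(f, sigma):
    h, S, prev = np.zeros(len(sigma)), set(), f(frozenset())
    for j in sigma:
        S.add(int(j)); fS = f(frozenset(S)); h[j] = fS - prev; prev = fS
    return h

def lb_div(f, x, sigma):
    return float(np.asarray(x) @ (extreme_subgradient(f, np.argsort(-np.asarray(x)))
                                  - extreme_subgradient(f, sigma)))

def lb_kmeans(X, k, f, iters=20, seed=0):
    """k-means style clustering of score vectors with permutation representatives:
    assignment uses d_fhat(x || sigma_j); the update step is the ordering of the cluster mean."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    reps = [np.argsort(-X[i]) for i in rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        assign = np.array([np.argmin([lb_div(f, x, s) for s in reps]) for x in X])
        reps = [np.argsort(-X[assign == j].mean(axis=0)) if np.any(assign == j) else reps[j]
                for j in range(k)]
    return assign, reps

f = lambda S: np.sqrt(len(S))
X = np.vstack([np.random.default_rng(1).random((20, 3)) + [1, 0, 0],
               np.random.default_rng(2).random((20, 3)) + [0, 0, 1]])
assign, reps = lb_kmeans(X, k=2, f=f)
print(reps)    # two representative orderings, roughly "item 0 first" and "item 2 first"
```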
We demonstrate the clustering patterns obtained through these divergences in Figure 3, for randomly sampled points in 2D and 3D. The second panel (from the left) shows the combined clustering using the Lovász-Bregman and squared Euclidean distances, which exhibits an additional dependence on the values through the Euclidean term. Similarly, the last panel shows a clustering using the top-m objective with m = 1 (Eqn. (19)).

5 Discussion

To our knowledge, this work is the first to introduce the notion of "score based divergences" in preference and ranking based learning. Many of the results in this paper follow from interesting properties of the Lovász extension and of Bregman divergences, and they also provide connections between web ranking and the permutation based metrics. The idea is mildly related to the work of [37], which uses the Choquet integral (of which the Lovász extension is a special case) for preference learning; unlike our paper, however, they do not focus on the divergences formed by the integral. Finally, it will be interesting to use these ideas in real world applications involving rank aggregation, clustering, and learning to rank.

Acknowledgments: We thank Matthai Phillipose, Stefanie Jegelka and the melodi-lab submodular group at UWEE for discussions, and the anonymous reviewers for very useful reviews. This material is based upon work supported by the National Science Foundation under Grant No. IIS-1162606, and is also supported by a Google, a Microsoft, and an Intel research award.

References

[1] F. Bach. Structured sparsity-inducing norms through submodular functions. NIPS, 2010.
[2] F. Bach. Learning with Submodular Functions: A Convex Optimization Perspective. Arxiv, 2011.
[3] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. JMLR, 6:1705–1749, 2005.
[4] J. Bartholdi, C. A. Tovey, and M. A. Trick. Voting schemes for which it can be difficult to tell who won the election. Social Choice and Welfare, 6(2):157–165, 1989.
[5] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math and Math Physics, 7, 1967.
[6] L. M. Busse, P. Orbanz, and J. M. Buhmann. Cluster analysis of heterogeneous rank data. In ICML, volume 227, pages 113–120, 2007.
[7] Y. Censor and S. A. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, USA, 1997.
[8] S. Chakrabarti, R. Khanna, U. Sawant, and C. Bhattacharyya. Structured learning for non-smooth ranking losses. In SIGKDD, pages 88–96. ACM, 2008.
[9] G. Choquet. Theory of capacities. In Annales de l'institut Fourier, volume 5, page 87, 1953.
[10] D. Critchlow. Metric Methods for Analyzing Partially Ranked Data. Lecture Notes in Statistics No. 34, Springer-Verlag, Berlin, 1985.
[11] W. H. Cunningham. Decomposition of submodular functions. Combinatorica, 3(1):53–68, 1983.
[12] A. Dubey, J. Machchhar, C. Bhattacharyya, and S. Chakrabarti. Conditional models for non-smooth ranking loss functions. In ICDM, pages 129–138, 2009.
[13] J. Edmonds. Submodular functions, matroids and certain polyhedra. Combinatorial Structures and their Applications, 1970.
[14] M. A. Fligner and J. S. Verducci. Distance based ranking models. Journal of the Royal Statistical Society, Series B (Methodological), pages 359–369, 1986.
[15] M. A. Fligner and J. S. Verducci.
Multistage ranking models. Journal of the American Statistical Association, 83(403):892–901, 1988.
[16] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. JMLR, 4:933–969, 2003.
[17] B. A. Frigyik, S. Srivastava, and M. R. Gupta. Functional Bregman divergence. In ISIT, pages 1681–1685. IEEE, 2008.
[18] S. Fujishige. Submodular Functions and Optimization, volume 58. Elsevier Science, 2005.
[19] G. J. Gordon. Regret bounds for prediction problems. In COLT, pages 29–40. ACM, 1999.
[20] R. Iyer and J. Bilmes. The submodular Bregman and Lovász-Bregman divergences with applications. In NIPS, 2012.
[21] R. Iyer and J. Bilmes. The submodular Bregman and Lovász-Bregman divergences with applications: Extended version. In NIPS, 2012.
[22] R. Iyer, S. Jegelka, and J. Bilmes. Fast semidifferential based submodular function optimization. In ICML, 2013.
[23] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In SIGIR, pages 41–48. ACM, 2000.
[24] M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
[25] K. Kirchhoff et al. Combining articulatory and acoustic information for speech recognition in noisy and reverberant environments. In ICSLP, volume 98, pages 891–894. Citeseer, 1998.
[26] K. C. Kiwiel. Proximal minimization methods with generalized Bregman functions. SIAM Journal on Control and Optimization, 35(4):1142–1168, 1997.
[27] A. Klementiev, D. Roth, and K. Small. Unsupervised rank aggregation with distance-based models. In ICML, 2008.
[28] G. Lebanon and J. Lafferty. Cranking: Combining rankings using conditional probability models on permutations. In ICML, 2002.
[29] T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.
[30] L. Lovász. Submodular functions and convexity. Mathematical Programming, 1983.
[31] C. L. Mallows. Non-null ranking models. I. Biometrika, 44(1/2):114–130, 1957.
[32] M. Meilă, K. Phadnis, A. Patterson, and J. Bilmes. Consensus ranking under the exponential model. In UAI, 2007.
[33] T. B. Murphy and D. Martin. Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 41(3):645–655, 2003.
[34] R. T. Rockafellar. Convex Analysis, volume 28. Princeton University Press, 1970.
[35] A.-V. I. Rosti, N. F. Ayan, B. Xiang, S. Matsoukas, R. Schwartz, and B. Dorr. Combining outputs from multiple machine translation systems. In NAACL-HLT, 2007.
[36] K. A. Spackman. Signal detection theory: Valuable tools for evaluating inductive learning. In Proceedings of the Sixth International Workshop on Machine Learning, 1989.
[37] A. Fallah Tehrani, W. Cheng, and E. Hüllermeier. Preference learning using the Choquet integral: The case of multipartite ranking. IEEE Transactions on Fuzzy Systems, 2012.
[38] M. Telgarsky and S. Dasgupta. Agglomerative Bregman clustering. In ICML, 2012.
[39] K. Tsuda, G. Rätsch, and M. K. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projection. JMLR, 6(1):995, 2006.
[40] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In SIGIR. ACM, 2007.
