A note on the triangle inequality for the Jaccard distance

Sven Kosub
Department of Computer & Information Science, University of Konstanz
Box 67, D-78457 Konstanz, Germany
Sven.Kosub@uni-konstanz.de

December 9, 2016

Abstract

Two simple proofs of the triangle inequality for the Jaccard distance in terms of nonnegative, monotone, submodular functions are given and discussed.

The Jaccard index [8] is a classical similarity measure on sets with many practical applications in information retrieval, data mining, machine learning, and many more (cf., e.g., [7]). Measuring the relative size of the overlap of two finite sets $A$ and $B$, the Jaccard index $J$ and the associated Jaccard distance $J_\delta$ are formally defined as:

$$J(A, B) =_{\mathrm{def}} \frac{|A \cap B|}{|A \cup B|}, \qquad J_\delta(A, B) =_{\mathrm{def}} 1 - J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|} = \frac{|A \triangle B|}{|A \cup B|}$$

where $J(\emptyset, \emptyset) =_{\mathrm{def}} 1$. The Jaccard distance $J_\delta$ is known to fulfill all properties of a metric, most notably the triangle inequality, a fact that has been observed many times, e.g., via metric transforms [12, 13, 4], embeddings in vector spaces (e.g., [15, 11, 4]), min-wise independent permutations [1], or sometimes cumbersome arithmetics [10, 3]. A very simple, elementary proof of the triangle inequality was given in [5] using an appropriate partitioning of sets.

Here, we give two more simple, direct proofs of the triangle inequality. One proof comes without any set difference or disjointness of sets. It is based only on the fundamental equation $|A \cup B| + |A \cap B| = |A| + |B|$. As such, the proof is generic and leads to (sub)modular versions of the Jaccard distance (as defined below). The second proof unfolds a subtle difference between the two possible versions.
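The definition of $J_\delta$ can be checked numerically on small ground sets. The following Python sketch is an illustration only: the helper names `jaccard_distance`, `all_subsets`, and `triangle_inequality_holds` are assumptions introduced here, and the exhaustive search is an empirical check on a tiny ground set, not a substitute for the proofs in this note.

```python
from fractions import Fraction
from itertools import combinations, product


def jaccard_distance(a: frozenset, b: frozenset) -> Fraction:
    """Jaccard distance |A symmetric-difference B| / |A union B| as an exact fraction."""
    union = a | b
    if not union:
        # Convention J(empty, empty) = 1, hence the distance is 1 - 1 = 0.
        return Fraction(0)
    return Fraction(len(a ^ b), len(union))


def all_subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = list(ground)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def triangle_inequality_holds(ground) -> bool:
    """Exhaustively test J(A, B) <= J(A, C) + J(C, B) over all triples of subsets."""
    subsets = list(all_subsets(ground))
    return all(
        jaccard_distance(a, b) <= jaccard_distance(a, c) + jaccard_distance(c, b)
        for a, b, c in product(subsets, repeat=3)
    )


if __name__ == "__main__":
    # A 4-element ground set has 16 subsets, i.e. 4096 triples to test.
    print(triangle_inequality_holds(range(4)))
```

Exact rational arithmetic via `Fraction` is used so that the comparison is not affected by floating-point rounding.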
Though the original motivation was to give a proof of the triangle inequality as simple as possible, the link with submodular functions is interesting in itself (as also recently suggested in [6]).

Let $X$ be a finite, non-empty ground set. A set function $f : \mathcal{P}(X) \to \mathbb{R}$ is said to be submodular on $X$ if $f(A \cup B) + f(A \cap B) \le f(A) + f(B)$ for all $A, B \subseteq X$. If all inequalities are equations then $f$ is called modular on $X$. It is known that $f$ is submodular on $X$ if and only if the following condition holds (cf., e.g., [14]):

$$f(A \cup \{x\}) - f(A) \ge f(B \cup \{x\}) - f(B) \quad \text{for all } A \subseteq B \subseteq X,\ x \notin B \tag{1}$$

A set function $f$ is monotone if $f(A) \le f(B)$ for all $A \subseteq B \subseteq X$; $f$ is nonnegative if $f(A) \ge 0$ for all $A \subseteq X$. Each nonnegative, monotone, modular function $f$ on $X$ can be written as $f(A) = \gamma + \sum_{i \in A} c_i$ where $\gamma, c_i \ge 0$ for all $i \in X$ (cf., e.g., [14]). Examples are set cardinality or degree sum in graphs. Standard examples of nonnegative, monotone, submodular set functions are matroid rank, network flow to a sink, entropy of sets of random variables, and neighborhood size in bipartite graphs.

Let $f$ be a nonnegative, monotone, submodular set function on $X$. For sets $A, B \subseteq X$, we define two candidates for submodular Jaccard distances, $J_{\delta,f}$ and $J^{\Delta}_{\delta,f}$, as follows:

$$J_{\delta,f}(A, B) =_{\mathrm{def}} 1 - \frac{f(A \cap B)}{f(A \cup B)}, \qquad J^{\Delta}_{\delta,f}(A, B) =_{\mathrm{def}} \frac{f(A \triangle B) - f(\emptyset)}{f(A \cup B)},$$

where $J_{\delta,f}(A, B) = J^{\Delta}_{\delta,f}(A, B) =_{\mathrm{def}} 0$ if $f(A \cup B) = 0$. It is clear that $0 \le J_{\delta,f}(A, B) \le J^{\Delta}_{\delta,f}(A, B)$. If $f$ is modular then $J_{\delta,f} = J^{\Delta}_{\delta,f}$. In particular, for $f(A) = |A|$ (i.e., the cardinality of the set $A \subseteq X$), we obtain the standard Jaccard distance $J_\delta = J_{\delta,f} = J^{\Delta}_{\delta,f}$.

First, we give a simple proof of the triangle inequality for $J_{\delta,f}$. Interestingly, this is only possible for modular set functions (see the third remark after Theorem 3).

Lemma 1.
Let $f$ be a nonnegative, monotone, submodular set function on $X$. Then, for all sets $A, B, C \subseteq X$, it holds that

$$f(A \cap C) \cdot f(B \cup C) + f(A \cup C) \cdot f(B \cap C) \le f(C) \cdot \bigl(f(A) + f(B)\bigr).$$

Proof. We easily obtain

$$\begin{aligned}
f(A \cap C) \cdot f(B \cup C) &\le f(A \cap C) \cdot \bigl(f(B) + f(C) - f(B \cap C)\bigr) && \text{(submodularity of } f\text{)} \\
&\le f(C) \cdot \bigl(f(B) - f(B \cap C) + f(A \cap C)\bigr) && \text{(monotonicity of } f\text{)}
\end{aligned}$$

where the second step uses $f(A \cap C) \le f(C)$ and $f(B) - f(B \cap C) \ge 0$. By swapping $A$ and $B$,

$$f(A \cup C) \cdot f(B \cap C) \le f(C) \cdot \bigl(f(A) - f(A \cap C) + f(B \cap C)\bigr).$$

Overall,

$$\begin{aligned}
f(A \cap C) \cdot f(B \cup C) + f(A \cup C) \cdot f(B \cap C) &\le f(C) \cdot \bigl(f(B) - f(B \cap C) + f(A \cap C) + f(A) - f(A \cap C) + f(B \cap C)\bigr) \\
&= f(C) \cdot \bigl(f(B) + f(A)\bigr).
\end{aligned}$$

This shows the lemma.

Corollary 2. Let $f$ be a nonnegative, monotone, submodular set function on $X$. Then, for all sets $S, T \subseteq X$, it holds that

$$f(S \cap T) \cdot f(S \cup T) \le f(S) \cdot f(T).$$

Proof. Apply Lemma 1 to sets $A =_{\mathrm{def}} S$, $B =_{\mathrm{def}} S$ and $C =_{\mathrm{def}} T$.

Theorem 3. Let $f$ be a nonnegative, monotone, modular set function on $X$. Then, for all sets $A, B, C \subseteq X$, it holds that

$$J_{\delta,f}(A, B) \le J_{\delta,f}(A, C) + J_{\delta,f}(C, B).$$

Proof. Say that a set $A$ is a null set iff $f(A) = 0$. Observe that if at least one of the sets is a null set then the inequality is satisfied. So, it is enough to show the equivalent inequality

$$\frac{f(A \cap C)}{f(A \cup C)} + \frac{f(B \cap C)}{f(B \cup C)} \le 1 + \frac{f(A \cap B)}{f(A \cup B)} = \frac{f(A) + f(B)}{f(A \cup B)} \tag{2}$$

for arbitrary non-null sets $A, B, C \subseteq X$.
This is seen as follows:

$$\begin{aligned}
\frac{f(A \cap C)}{f(A \cup C)} + \frac{f(B \cap C)}{f(B \cup C)} &= \frac{f(A \cap C) \cdot f(B \cup C) + f(A \cup C) \cdot f(B \cap C)}{f(A \cup C) \cdot f(B \cup C)} \\
&\le \frac{f(C) \cdot \bigl(f(A) + f(B)\bigr)}{f(A \cup C) \cdot f(B \cup C)} && \text{(by Lemma 1)} \\
&\le \frac{f(C) \cdot \bigl(f(A) + f(B)\bigr)}{f\bigl((A \cup C) \cap (B \cup C)\bigr) \cdot f(A \cup B \cup C)} && \text{(by Corollary 2)} \\
&\le \frac{f(C)}{f\bigl((A \cap B) \cup C\bigr)} \cdot \frac{f(A) + f(B)}{f(A \cup B)} && \text{(monotonicity of } f\text{)} \\
&\le \frac{f(A) + f(B)}{f(A \cup B)} && \text{(monotonicity of } f\text{)}
\end{aligned}$$

This proves the theorem.

Remarks: We comment on the proof of the triangle inequality for $J_{\delta,f}$:

1. It follows from Theorem 3 that the triangle inequality is valid for the standard Jaccard distance $J_\delta$, the generalized Jaccard distance given for vectors $x, y \in \mathbb{R}^n$ by

$$1 - \frac{\sum_{i=1}^{n} \min\{x_i, y_i\}}{\sum_{i=1}^{n} \max\{x_i, y_i\}}$$

(with the subcase that $x_i = \mu_A(z)$ and $y_i = \mu_B(z)$ denote multiplicities of (occurrences of) $z$ in multisets $A$ and $B$; cf. [9]), and the Steinhaus distance [12, 4] (i.e., any set measures, including probability measures). We mention that all these results can equally easily be proven by the arguments in [5]; however, for modular functions satisfying $f(\emptyset) > 0$, these arguments fail.

2. Theorem 3 is true for nonnegative, monotone, modular functions defined over distributive lattices; Lemma 1 and Corollary 2 also hold for nonnegative, monotone, submodular functions defined over distributive lattices. Notice that $J^{\Delta}_{\delta,f}$ is not defined over all distributive lattices (see also the third remark after Theorem 4).

3. In general, Theorem 3 is not true for nonnegative, monotone, submodular functions: Any set function $f$ such that $f(A) = f(B) = f(A \cup B) > f(A \cap B) \ge 0$ for non-empty, incomparable sets $A, B$ refutes $J_{\delta,f}(A, B) \le J_{\delta,f}(A, A \cup B) + J_{\delta,f}(A \cup B, B)$.
Concrete examples include linear cost functions with budget restrictions, i.e., $f(A) = \min\{B, \sum_{i \in A} c_i\}$ (here $B$ denotes the budget, not a set), or the neighborhood size in a bipartite graph $G = (U \uplus V, E)$, i.e., $f(A) = |\Gamma(A)|$ where $A \subseteq U$ and $\Gamma(A) = \bigcup_{u \in A} \{v \in V \mid \{u, v\} \in E\}$.

Next we give a simple proof of the triangle inequality for $J^{\Delta}_{\delta,f}$.

Theorem 4. Let $f$ be a nonnegative, monotone, submodular set function on $X$. Then, for all sets $A, B, C \subseteq X$, it holds that

$$J^{\Delta}_{\delta,f}(A, B) \le J^{\Delta}_{\delta,f}(A, C) + J^{\Delta}_{\delta,f}(C, B).$$

Proof. We split the set $C$ into two disjoint sets $C_0 \subseteq A \cup B$ and $C_1 \subseteq \overline{A \cup B}$, both possibly empty, such that $C = C_0 \cup C_1$. We obtain

$$\begin{aligned}
\frac{f(A \triangle C) - f(\emptyset)}{f(A \cup C)} + \frac{f(B \triangle C) - f(\emptyset)}{f(B \cup C)} &\ge \frac{f(A \triangle C) + f(B \triangle C) - 2 f(\emptyset)}{f(A \cup B \cup C_1)} && \text{(monotonicity of } f\text{)} \\
&\ge \frac{f\bigl((A \triangle C) \cup (B \triangle C)\bigr) - f(\emptyset)}{f(A \cup B \cup C_1)} && \text{(submodularity, monotonicity of } f\text{)} \\
&\ge \frac{f\bigl((A \triangle B) \cup C_1\bigr) - f(\emptyset)}{f(A \cup B \cup C_1)} && \text{(monotonicity of } f\text{)} \\
&\ge \frac{f(A \triangle B)}{f(A \cup B)} - \frac{f(\emptyset)}{f(A \cup B \cup C_1)} && \text{(submodularity of } f\text{, Cond. (1))} \\
&\ge \frac{f(A \triangle B)}{f(A \cup B)} - \frac{f(\emptyset)}{f(A \cup B)} && \text{(monotonicity of } f\text{)}
\end{aligned}$$

This shows the theorem.

Remarks: We comment on the proof of the triangle inequality for $J^{\Delta}_{\delta,f}$:

1. It follows once more from Theorem 4 that the standard Jaccard distance, the generalized Jaccard distance, and the Steinhaus distance satisfy the triangle inequality. Moreover, $J^{\Delta}_{\delta,f}$ is also a (pseudo)metric for, e.g., linear cost functions with budget restrictions and the neighborhood size in bipartite graphs.

2. Theorem 4 suggests that $J^{\Delta}_{\delta,f}$ is the right definition of a submodular Jaccard distance. As a consequence, one might say that the submodular Jaccard (similarity) index should be defined as the inverse submodular Jaccard distance, i.e.,

$$J^{\Delta}_{f}(A, B) =_{\mathrm{def}} 1 - J^{\Delta}_{\delta,f}(A, B) = 1 - \frac{f(A \triangle B) - f(\emptyset)}{f(A \cup B)}$$

Again, if $f(A) = |A|$ then we obtain the standard Jaccard index $J = J^{\Delta}_{f} = 1 - J_{\delta,f}$.

3. Though $J^{\Delta}_{\delta,f}$ might generally not be defined over a given distributive lattice, it can be seen that for each nonnegative, monotone, submodular function $f : \mathcal{F} \to \mathbb{R}$ defined on a family $\mathcal{F} \subseteq \mathcal{P}(X)$ closed under union and intersection, there is a (not necessarily unique) nonnegative, monotone, submodular extension $\bar{f} : \mathcal{P}(X) \to \mathbb{R}$ on $X$ such that $\bar{f}(A) = f(A)$ for all $A \in \mathcal{F}$ (e.g., [16]), so that $J^{\Delta}_{\delta,\bar{f}}$ can be used instead.

Acknowledgments: I am grateful to Ulrik Brandes (Konstanz) and Julian Müller (Konstanz) for helpful discussions.

References

[1] M. S. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In: Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC'2002), pp. 380–388. ACM Press, New York, NY, 2002.
[2] M. M. Deza, E. Deza. Encyclopedia of Distances. Springer, Berlin, 2009.
[3] O. Fujita. Metrics based on average distance between sets. Japan Journal of Industrial and Applied Mathematics, 30(1):1–19, 2013.
[4] A. Gardner, J. Kanno, C. A. Duncan, R. Selmic. Measuring Distance Between Unordered Sets of Different Sizes. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'2014), pp. 137–143. IEEE, New Jersey, NJ, 2014.
[5] G. Gilbert. Distance between sets. Letters to Nature, 239(5368):174, 1972.
[6] J. Gillenwater, R. Iyer, B. Lusch, R. Kidambi, J. A. Bilmes. Submodular Hamming Metrics. In: Advances in Neural Information Processing Systems 28, pp. 3141–3149. NIPS Proceedings, December 2015.
[7] J. C. Gower. Similarity, Dissimilarity and Distance, Measures of. In: S. Kotz, C. B. Read, N. Balakrishnan, B. Vidakovic (eds.), Encyclopedia of Statistical Sciences, vol. 12, pp.
7730–7738. 2nd edition, John Wiley, New York, NY, 2008.
[8] P. Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37(142):547–579, 1901.
[9] W. A. Kosters, J. F. J. Laros. Metrics for Mining Multisets. In: M. Bramer, F. Coenen, M. Petridis (eds.), Research and Development in Intelligent Systems XXIV, Proceedings of the Twenty-seventh SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence (AI'2007), pp. 293–303. Springer, Berlin, 2007.
[10] M. Levandowsky, D. Winter. Distance between sets. Letters to Nature, 234(5323):34–35, 1971.
[11] A. H. Lipkus. A proof of the triangle inequality for the Tanimoto distance. Journal of Mathematical Chemistry, 26:263–265, 1999.
[12] E. Marczewski, H. Steinhaus. On a certain distance of sets and the corresponding distance of functions. Colloquium Mathematicum, 6:319–327, 1958.
[13] D. A. Simovici, C. Djeraba. Mathematical Tools for Data Mining. Springer, London, 2008.
[14] A. Schrijver. Combinatorial Optimization, vol. B. Springer, Berlin, 2003.
[15] T. T. Tanimoto. An elementary mathematical theory of classification and prediction. IBM Report, November 1958.
[16] D. M. Topkis. Minimizing a submodular function on a lattice. Operations Research, 26(2):305–321, 1978.
