Locality-Sensitive Hashing for f-Divergences and Kreĭn Kernels: Mutual Information Loss and Beyond

Lin Chen¹,², Hossein Esfandiari², Thomas Fu², Vahab S. Mirrokni²
¹Yale University  ²Google Research
lin.chen@yale.edu, {esfandiari,thomasfu,mirrokni}@google.com

Abstract

Computing approximate nearest neighbors in high-dimensional spaces is a central problem in large-scale data mining with a wide range of applications in machine learning and data science. A popular and effective technique for computing nearest neighbors approximately is the locality-sensitive hashing (LSH) scheme. In this paper, we aim to develop LSH schemes for distance functions that measure the distance between two probability distributions, particularly for f-divergences as well as a generalization to capture mutual information loss. First, we provide a general framework to design LSH schemes for f-divergence distance functions and develop LSH schemes for the generalized Jensen-Shannon divergence and triangular discrimination in this framework. We show a two-sided approximation result for the approximation of the generalized Jensen-Shannon divergence by the Hellinger distance, which may be of independent interest. Next, we show a general method of reducing the problem of designing an LSH scheme for a Kreĭn kernel (which can be expressed as the difference of two positive definite kernels) to the problem of maximum inner product search. We exemplify this method by applying it to the mutual information loss, due to its several important applications such as model compression.

1 Introduction

A central problem in machine learning and data mining is to find the top-k similar items to each item in a dataset.
Such problems, referred to as approximate nearest neighbor problems, are especially challenging in high-dimensional spaces and are an important part of a wide range of data mining tasks such as finding near-duplicate pages in a corpus of images or web pages, or clustering items in a high-dimensional metric space. A popular technique for solving these problems is the locality-sensitive hashing (LSH) technique [19]. In this method, items in a high-dimensional metric space are first mapped into buckets (via a hashing scheme) with the property that closer items have a higher chance of being assigned to the same bucket. LSH-based nearest neighbor methods limit their scope of search to the items that fall into the same bucket in which the target item resides.¹

Locality-sensitive hashing was first introduced and studied by [19]. They provide a family of basic locality-sensitive hash functions for the Hamming distance in a d-dimensional space and for the L1 distance in a d-dimensional Euclidean space. They also show that such a family of hash functions provides a randomized (1+ε)-approximation algorithm for the nearest neighbor search problem with sublinear space and sublinear query time. Following [19], several families of locality-sensitive hash functions have been designed and implemented for different metrics, each serving a certain application. We summarize further results in this area in Section 1.1.

In several applications, data points can be represented as probability distributions. One example is the space of users' browsed web pages, read articles, or watched videos. In order to represent such data, one can represent each user by a distribution of documents they read, and the documents by topics included in those documents. Other examples are time series distributions, content of documents, or images that can be represented as histograms. In particular, analysis of similarities in time series distributions or documents can be used in the context of attack and spam detection. Analysis of user similarities can be used in recommendation systems and online advertisement. In fact, many of the aforementioned applications deal with huge datasets and require very time-efficient algorithms to find similar data points. These applications motivated us to study LSH functions for distributions, especially for distance measures with information-theoretic justifications. In fact, in addition to k-nearest neighbor search, LSH functions can be used to implement very fast distributed algorithms for traditional clusterings such as k-means [7].

Recently, Mao et al. [26] noticed the importance and lack of LSH functions for the distance of distributions, especially for information-theoretic measures. They attempted to design an LSH to capture the famous Jensen-Shannon (JS) divergence. However, instead of directly providing locality-sensitive hash functions for the Jensen-Shannon divergence, they take two steps to turn this distance function into a new distance function that is easier to hash. They first looked at a less common divergence measure, S2JSD, which is the square root of two times the JS divergence. Then they defined a related distance function S2JSD^approx_new, which was obtained by keeping only the linear terms in the Taylor expansion of the logarithm in the expression of S2JSD, and designed a locality-sensitive hash function for the new measure S2JSD^approx_new.

¹ We note that LSH is a popular data-independent technique for nearest neighbor search. Another category of nearest neighbor search algorithms, referred to as data-dependent techniques, is learning-to-hash methods [37], which learn a hash function that maps each item to a compact code. However, this line of work is out of the scope of this paper.
This is interesting work; unfortunately, however, it does not provide any bound on the actual JS divergence using the LSH that they designed for S2JSD^approx_new. Our results resolve this issue by providing LSH schemes with provable guarantees for information-theoretic distance measures including the JS divergence and its generalizations.

Mu and Yan [27] proposed an LSH algorithm for non-metric data by embedding them into a reproducing kernel Kreĭn space. However, their method is indeed data-dependent. Given a finite set of data points M, they compute the distance matrix D whose (i, j)-entry is the distance between i and j, where both i and j are data points in M. Data is embedded into a reproducing kernel Kreĭn space by performing singular value decomposition on a transform of the distance matrix D. The embedding changes if we are given another dataset.

Our Contributions. In this paper, we first study LSH schemes for f-divergences² between two probability distributions. We first provide, in Proposition 1, a simple reduction tool for designing LSH schemes for the family of f-divergence distance functions. This proposition is not hard to prove but might be of independent interest. Next, we use this tool to provide LSH schemes for two examples of f-divergence distance functions, the Jensen-Shannon divergence and triangular discrimination. Interestingly, our result holds for a generalized version of the Jensen-Shannon divergence. We apply this tool to design and analyze an LSH scheme for the generalized Jensen-Shannon (GJS) divergence through approximation by the squared Hellinger distance. We use a similar technique to provide an LSH for triangular discrimination. Our approximation is provably lower bounded by a factor 0.69 for the Jensen-Shannon divergence and by a factor 0.5 for triangular discrimination.
The approximation result for the generalized Jensen-Shannon divergence by the squared Hellinger distance requires a more involved analysis, and the lower and upper bounds depend on the weight parameter. This approximation result may be of independent interest for other machine learning tasks such as approximate information-theoretic clustering [12]. Our technique may be useful for designing LSH schemes for other f-divergences.

Next, we propose a general approach to designing an LSH for Kreĭn kernels. A Kreĭn kernel is a kernel function that can be expressed as the difference of two positive definite kernels. Our approach is built upon a reduction to the problem of maximum inner product search (MIPS) [33, 28, 41]. In contrast to our LSH schemes for f-divergence functions via approximation, our approach for Kreĭn kernels involves no approximation and is theoretically lossless. Contrary to [27], this approach is data-independent. We exemplify our approach by designing an LSH function specifically for the mutual information loss. The mutual information loss is of particular interest to us due to its several important applications such as model compression [6, 17] and compression in discrete memoryless channels [20, 30, 42].

² The formal definition of f-divergence is presented in Section 2.2.

1.1 Other Related Work

Datar et al. [16] designed an LSH for L_p distances using p-stable distributions. Broder [10] designed MinHash for the Jaccard similarity. LSH families for other distances and similarity measures were proposed later, for example, angle similarity [11], spherical LSH on a unit hypersphere [34], rank similarity [40], and non-metric LSH [27]. Li et al. [24] demonstrated that uniform quantization outperforms the standard method in [16] with a random offset. Gorisse et al. [18] proposed an LSH family for the χ² distance by relating it to the L2 distance via an algebraic transform.
Interested readers are referred to a more comprehensive survey of existing LSH methods [38]. Another related problem is the construction of feature maps of positive definite kernels. A feature map maps a data point into a (usually higher-dimensional) space such that the inner product in that space agrees with the kernel in the original space. Explicit feature maps for additive kernels are introduced in [35]. Bregman divergences are another broad class of distances that arise naturally in practical applications. The nearest neighbor search problem for Bregman divergences was studied in [3, 2, 1].

2 Preliminaries

2.1 Locality-Sensitive Hashing

Let M be the universal set of items (the database), endowed with a distance function D. Ideally, we would like to have a family of hash functions such that for any two items p and q in M that are close to each other, their hash values collide with a higher probability, and if they reside far apart, their hash values collide with a lower probability. A family of hash functions with the above property is said to be locality-sensitive. A hash value is also known as a bucket in the literature. Using this metaphor, hash functions are imagined as sorters that place items into buckets. If hash functions are locality-sensitive, it suffices to search the bucket into which an item falls if one wants to know its nearest neighbors. The (r1, r2, p1, p2)-sensitive LSH family formulates the intuition of locality sensitivity and is formally defined in Definition 1.

Definition 1 ([19]). Let H = {h : M → U} be a family of hash functions, where U is the set of possible hash values. Assume that there is a distribution h ∼ H over the family of functions.
This family H is called (r1, r2, p1, p2)-sensitive (r1 < r2 and p1 > p2) for D if for all p, q ∈ M the following statements hold: (1) if D(p, q) ≤ r1, then Pr_{h∼H}[h(p) = h(q)] ≥ p1; (2) if D(p, q) > r2, then Pr_{h∼H}[h(p) = h(q)] ≤ p2.

We would like to note that the gap between the probabilities p1 and p2 can be amplified by constructing a compound hash function that concatenates multiple functions from an LSH family. For example, one can construct g : M → U^K such that g(p) ≜ (h1(p), ..., hK(p)) for all p ∈ M, where h1, ..., hK are chosen from the LSH family H. This conjunctive construction reduces the number of items in one bucket. To improve the recall, an additional disjunction is introduced. To be precise, if g1, ..., gL are L such compound hash functions, we search all of the buckets g1(p), ..., gL(p) in order to find the nearest neighbors of p.

2.2 f-Divergence

Let P and Q be two probability measures associated with a common sample space Ω. We write P ≪ Q if P is absolutely continuous with respect to Q, which requires that for every subset A of Ω, Q(A) = 0 implies P(A) = 0. Let f : (0, ∞) → R be a convex function that satisfies f(1) = 0. If P ≪ Q, the f-divergence from P to Q [14] is defined by

    D_f(P‖Q) = ∫_Ω f(dP/dQ) dQ,    (1)

provided that the right-hand side exists, where dP/dQ is the Radon-Nikodym derivative of P with respect to Q. In general, an f-divergence is not symmetric: D_f(P‖Q) ≠ D_f(Q‖P). If f_KL(t) = t ln t + (1 − t), the f_KL-divergence yields the KL divergence D_KL(P‖Q) = ∫_Ω ln(dP/dQ) dP [13]. If hel(t) = (1/2)(√t − 1)², the hel-divergence is the squared Hellinger distance H²(P, Q) = (1/2) ∫_Ω (√dP − √dQ)² [15]. If δ(t) = (t − 1)²/(t + 1), the δ-divergence is the triangular discrimination (also known as the Vincze-Le Cam distance) [22, 36].
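On a finite sample space, (1) reduces to a weighted sum, and the three instances above follow by plugging in the corresponding convex function. The following is a minimal illustrative sketch (the helper names are ours, not the paper's) of a generic f-divergence evaluator for strictly positive finite distributions, checked against the direct formulas:

```python
import math

def f_divergence(f, P, Q):
    """D_f(P||Q) = sum_i Q(i) * f(P(i)/Q(i)) on a finite sample space.
    Assumes Q(i) > 0 for all i (so P << Q holds trivially)."""
    return sum(q * f(p / q) for p, q in zip(P, Q))

# The three convex functions from the text; f_kl's value at t -> 0 is the limit 1.
f_kl = lambda t: t * math.log(t) + (1 - t) if t > 0 else 1.0
hel = lambda t: 0.5 * (math.sqrt(t) - 1) ** 2
delta = lambda t: (t - 1) ** 2 / (t + 1)
```

For example, `f_divergence(hel, P, Q)` reproduces the squared Hellinger distance (1/2) Σ_i (√P(i) − √Q(i))², and `f_divergence(delta, P, Q)` reproduces Σ_i (P(i) − Q(i))²/(P(i) + Q(i)).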
If the sample space is finite, the triangular discrimination between P and Q is given by ∆(P‖Q) = Σ_{i∈Ω} (P(i) − Q(i))²/(P(i) + Q(i)).

The Jensen-Shannon (JS) divergence is a symmetrized version of the KL divergence. If P ≪ Q, Q ≪ P, and M = (P + Q)/2, the JS divergence is defined by

    D_JS(P‖Q) = (1/2) D_KL(P‖M) + (1/2) D_KL(Q‖M).    (2)

2.3 Mutual Information Loss and Generalized Jensen-Shannon Divergence

The mutual information loss arises naturally in many machine learning tasks, such as information-theoretic clustering [17] and categorical feature compression [6]. Suppose that two random variables X and C obey a joint distribution p(X, C). This joint distribution can model a dataset where X denotes the feature value of a data point and C denotes its label [6]. Let 𝒳 and 𝒞 denote the supports of X and C (i.e., the universal sets of all possible feature values and labels), respectively. Consider clustering two feature values into a new combined value. This operation can be represented by the following map π_{x,y} : 𝒳 → (𝒳 \ {x, y}) ∪ {z} such that

    π_{x,y}(t) = t if t ∈ 𝒳 \ {x, y}, and π_{x,y}(t) = z if t ∈ {x, y},

where x and y are the two feature values to be clustered and z ∉ 𝒳 is the new combined feature value. To make the dataset after applying the map π_{x,y} preserve as much information of the original dataset as possible, one has to select two feature values x and y such that the mutual information loss incurred by the clustering operation, mil(x, y) = I(X; C) − I(π_{x,y}(X); C), is minimized, where I(·;·) is the mutual information between two random variables [13]. Note that the mutual information loss (MIL) divergence mil : 𝒳 × 𝒳 → R is symmetric in both arguments and always non-negative due to the data processing inequality [13]. Next, we motivate the generalized Jensen-Shannon divergence.
If we let P and Q be the conditional distributions of C given X = x and X = y, respectively, such that P(c) = p(C = c | X = x) and Q(c) = p(C = c | X = y), the mutual information loss can be rewritten as

    mil(x, y) = (p(x) + p(y)) [λ D_KL(P‖M_λ) + (1 − λ) D_KL(Q‖M_λ)],    (3)

where λ = p(x)/(p(x) + p(y)) and M_λ = λP + (1 − λ)Q. Note that the bracketed expression in (3) is a generalized version of (2). Therefore, we define the generalized Jensen-Shannon (GJS) divergence between P and Q [25, 5, 17] by

    D_GJS^λ(P‖Q) = λ D_KL(P‖M_λ) + (1 − λ) D_KL(Q‖M_λ),

where λ ∈ [0, 1] and M_λ = λP + (1 − λ)Q. We immediately have D_GJS^{1/2}(P‖Q) = D_JS(P‖Q), which indicates that the JS divergence is indeed a special case of the GJS divergence with λ = 1/2. The GJS divergence has another equivalent definition,

    D_GJS^λ(P‖Q) = H(M_λ) − λ H(P) − (1 − λ) H(Q),

where H(·) denotes the Shannon entropy [13]. In contrast to the MIL divergence, the GJS divergence D_GJS^λ(·‖·) is not symmetric in general, as the weight λ ∈ [0, 1] is fixed and not necessarily equal to 1/2. We will show in Lemma 1 that the GJS divergence is an f-divergence.

2.4 Positive Definite Kernel and Kreĭn Kernel

We first review the definition of a positive definite kernel.

Definition 2 (Positive definite kernel [32]). Let X be a non-empty set. A symmetric, real-valued map k : X × X → R is a positive definite kernel on X if for every positive integer n, real numbers a1, ..., an ∈ R, and x1, ..., xn ∈ X, it holds that Σ_{i=1}^n Σ_{j=1}^n a_i a_j k(x_i, x_j) ≥ 0.

A kernel is said to be a Kreĭn kernel if it can be represented as the difference of two positive definite kernels. The formal definition is presented below.

Definition 3 (Kreĭn kernel [29]). Let X be a non-empty set.
A symmetric, real-valued map k : X × X → R is a Kreĭn kernel on X if there exist two positive definite kernels k1 and k2 on X such that k(x, y) = k1(x, y) − k2(x, y) holds for all x, y ∈ X.

3 LSH Schemes for f-Divergences

We build LSH schemes for f-divergences based on approximation via another f-divergence, provided that the latter admits an LSH family. If D_f and D_g are two divergences associated with convex functions f and g as defined by (1), the approximation ratio of D_f(P‖Q) to D_g(P‖Q) is determined by the ratio of the functions f and g, as well as the ratio of P to Q (to be precise, inf_{i∈Ω} P(i)/Q(i)) [31].

Proposition 1 (Proof in Appendix D). Let β0 ∈ (0, 1), L, U > 0, and let f and g be two convex functions (0, ∞) → R that obey f(1) = 0, g(1) = 0, and f(t), g(t) > 0 for every t ≠ 1. Let 𝒫 be a set of probability measures on a finite sample space Ω such that for every i ∈ Ω and P, Q ∈ 𝒫, 0 < β0 ≤ P(i)/Q(i) ≤ β0⁻¹. Assume that for every β ∈ (β0, 1) ∪ (1, β0⁻¹), it holds that 0 < L ≤ f(β)/g(β) ≤ U < ∞. If H forms an (r1, r2, p1, p2)-sensitive family for the g-divergence on 𝒫, then it is also an (L·r1, U·r2, p1, p2)-sensitive family for the f-divergence on 𝒫.

Proposition 1 provides a general strategy for constructing LSH families for f-divergences. The performance of such LSH families depends on the tightness of the approximation. In Sections 3.1 and 3.2, as instances of the general strategy, we derive concrete results for the generalized Jensen-Shannon divergence and triangular discrimination, respectively.

3.1 Generalized Jensen-Shannon Divergence

First, Lemma 1 shows that the GJS divergence is indeed an instance of f-divergence.

Lemma 1 (Proof in Appendix C). Define m_λ(t) = λt ln t − (λt + 1 − λ) ln(λt + 1 − λ). For any λ ∈ [0, 1], m_λ(t) is convex on (0, ∞) and m_λ(1) = 0.
Furthermore, the m_λ-divergence yields the GJS divergence with parameter λ.

We choose to approximate it via the squared Hellinger distance, which plays a central role in the construction of the hash family with the desired properties. The approximation guarantee is established in Theorem 1. We show that the ratio of D_GJS^λ(P‖Q) to H²(P, Q) is upper bounded by the function U(λ) and lower bounded by the function L(λ). Furthermore, Theorem 1 shows that U(λ) ≤ 1, which implies that the squared Hellinger distance is an upper bound on the GJS divergence.

Theorem 1 (Proof in Appendix B). We assume that the sample space Ω is finite. Let P and Q be two different distributions on Ω. For every λ ∈ (0, 1), we have

    L(λ) H²(P, Q) ≤ D_GJS^λ(P‖Q) ≤ U(λ) H²(P, Q) ≤ H²(P, Q),

where L(λ) = 2 min{η(λ), η(1 − λ)}, η(λ) = −λ ln λ, and U(λ) = (2λ(1 − λ)/(1 − 2λ)) ln((1 − λ)/λ).

We show Theorem 1 by proving a two-sided approximation result regarding m_λ and hel. This result might be of independent interest for other machine learning tasks, say, approximate information-theoretic clustering [12].

Lemma 2 (Proof in Appendix A). Define κ_λ(t) = m_λ(t)/hel(t). For every t > 0 and λ ∈ (0, 1), we have κ_λ(t) = κ_{1−λ}(1/t) and κ_λ(t) ∈ [L(λ), U(λ)].

[Figure 1: The upper and lower bound functions U(λ) and L(λ).]

We illustrate the upper and lower bound functions U(λ) and L(λ) in Fig. 1. Recall that if λ = 1/2, the generalized Jensen-Shannon divergence reduces to the usual Jensen-Shannon divergence. Theorem 1 yields the approximation guarantee 0.69 < ln 2 ≤ D_JS(P‖Q)/H²(P, Q) ≤ 1.
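The bounds of Theorem 1 are easy to probe numerically. Below is a small sketch of ours (not the paper's proof; the helper names and the random-distribution generator are our own choices) that evaluates L(λ), U(λ), and the ratio D_GJS^λ/H² on random strictly positive distributions; U(1/2) is set to its limit value 1, since the closed form is 0/0 at λ = 1/2:

```python
import math
import random

def L_bound(lam):
    """L(lambda) = 2 min{eta(lambda), eta(1 - lambda)}, eta(a) = -a ln a."""
    eta = lambda a: -a * math.log(a)
    return 2 * min(eta(lam), eta(1 - lam))

def U_bound(lam):
    """U(lambda) = 2 lam (1 - lam) / (1 - 2 lam) * ln((1 - lam)/lam)."""
    if abs(lam - 0.5) < 1e-9:
        return 1.0  # limit of the expression as lam -> 1/2
    return 2 * lam * (1 - lam) / (1 - 2 * lam) * math.log((1 - lam) / lam)

def gjs(P, Q, lam):
    """GJS divergence via the entropy identity H(M_lam) - lam H(P) - (1-lam) H(Q)."""
    ent = lambda R: -sum(r * math.log(r) for r in R if r > 0)
    M = [lam * p + (1 - lam) * q for p, q in zip(P, Q)]
    return ent(M) - lam * ent(P) - (1 - lam) * ent(Q)

def hellinger_sq(P, Q):
    return 0.5 * sum((math.sqrt(p) - math.sqrt(q)) ** 2 for p, q in zip(P, Q))

def random_distribution(dim, rng):
    w = [rng.random() + 1e-3 for _ in range(dim)]
    s = sum(w)
    return [v / s for v in w]
```

Sweeping λ over a grid and drawing many random pairs (P, Q), the sandwich L(λ)H² ≤ D_GJS^λ ≤ U(λ)H² ≤ H² holds in every trial, consistent with the theorem.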
If the common sample space Ω with which the two distributions P and Q are associated is finite, one can identify P and Q with the |Ω|-dimensional vectors [P(i)]_{i∈Ω} and [Q(i)]_{i∈Ω}, respectively. In this case, H²(P, Q) = (1/2)‖√P − √Q‖₂², which is exactly half of the squared L2 distance between the two vectors √P ≜ [√(P(i))]_{i∈Ω} and √Q ≜ [√(Q(i))]_{i∈Ω}. Therefore, the squared Hellinger distance can be endowed with the L2-LSH family [16] applied to the square root of the vector. In light of this, the locality-sensitive hash function that we propose for the generalized Jensen-Shannon divergence is

    h_{a,b}(P) = ⌈(a · √P + b)/r⌉,    (4)

where a ∼ N(0, I) is a |Ω|-dimensional standard normal random vector, · denotes the inner product, b is drawn uniformly at random from [0, r], and r is a positive real number.

Theorem 2 (Proof in Appendix E). Let c = ‖√P − √Q‖₂ and let f₂ be the probability density function of the absolute value of the standard normal distribution. The hash functions {h_{a,b}} defined in (4) form an (R, c²(U(λ)/L(λ))R, p1, p2)-sensitive family for the generalized Jensen-Shannon divergence with parameter λ, where R > 0, p1 = p(1), p2 = p(c), and p(u) = ∫₀^r (1/u) f₂(t/u)(1 − t/r) dt.

3.2 Triangular Discrimination

Recall that the triangular discrimination is the δ-divergence, where δ(t) = (t − 1)²/(t + 1). As shown in the proof of Theorem 3 (Appendix F), the function δ can be approximated by the function hel(t) that defines the squared Hellinger distance: 2 ≤ δ(t)/hel(t) ≤ 4. The squared Hellinger distance can be sketched via L2-LSH after taking the square root, as exemplified in Section 3.1. By Proposition 1, the LSH family for the squared Hellinger distance also forms an LSH family for the triangular discrimination.
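The hash family (4), which underlies both the GJS and the triangular-discrimination schemes, is short to implement directly: draw a and b once, then hash the elementwise square root of the distribution. The sketch below is our own minimal version (pure Python; the factory and helper names are ours), and the closing check illustrates locality empirically: a distribution close to P collides with P more often, across independently drawn hash functions, than a distant one.

```python
import math
import random

def gjs_hash(dim, r, seed):
    """One hash function h_{a,b}(P) = ceil((a . sqrt(P) + b) / r) from Eq. (4),
    with a ~ N(0, I) and b uniform on [0, r]."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, r)
    def h(P):
        proj = sum(ai * math.sqrt(pi) for ai, pi in zip(a, P))
        return math.ceil((proj + b) / r)
    return h

def collision_rate(P, Q, dim, r, trials):
    """Fraction of independently drawn hash functions on which P and Q collide."""
    hits = 0
    for seed in range(trials):
        h = gjs_hash(dim, r, seed)
        hits += (h(P) == h(Q))
    return hits / trials
```

A distribution always collides with itself under any fixed h, while the collision rate with another distribution decays with the L2 distance of the square-root vectors, as in the function p(u) of Theorems 2 and 3.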
Theorem 3 shows that the LSH family defined in (4) forms an (R, 2c²R, p1, p2)-sensitive family for the triangular discrimination.

Theorem 3 (Proof in Appendix F). Let c = ‖√P − √Q‖₂ and let f₂ be the probability density function of the absolute value of the standard normal distribution. The hash functions {h_{a,b}} defined in (4) form an (R, 2c²R, p1, p2)-sensitive family for the triangular discrimination, where R > 0, p1 = p(1), p2 = p(c), and p(u) = ∫₀^r (1/u) f₂(t/u)(1 − t/r) dt.

4 Kreĭn-LSH for Mutual Information Loss

In this section, we first show that the mutual information loss is a Kreĭn kernel. Then we propose Kreĭn-LSH, an asymmetric LSH method [33] for the mutual information loss. We would like to remark that this method can be easily extended to other Kreĭn kernels, provided that the associated positive definite kernels admit an explicit feature map.

4.1 Mutual Information Loss is a Kreĭn Kernel

Recall that in Section 2.3 we assume a joint distribution p(X, C) whose support is 𝒳 × 𝒞. Let x, y ∈ 𝒳 be represented by x = [p(c, x) : c ∈ 𝒞] ∈ [0, 1]^{|𝒞|} and y = [p(c, y) : c ∈ 𝒞] ∈ [0, 1]^{|𝒞|}, respectively. We consider the mutual information loss of merging x and y, which is given by I(X; C) − I(π_{x,y}(X); C).

Theorem 4 (Proof in Appendix H). The mutual information loss mil(x, y) is a Kreĭn kernel on [0, 1]^{|𝒞|}. In other words, there exist two positive definite kernels K1 and K2 on [0, 1]^{|𝒞|} such that mil(x, y) = K1(x, y) − K2(x, y). To be explicit, we set K1(x, y) = k(Σ_{c∈𝒞} p(c, x), Σ_{c∈𝒞} p(c, y)) and K2(x, y) = Σ_{c∈𝒞} k(p(c, x), p(c, y)), where k(a, b) = a ln((a + b)/a) + b ln((a + b)/b).

To prove Theorem 4 and construct explicit feature maps for K1 and K2, we need the following lemma.

Lemma 3 (Proof in Appendix G). The kernel k is a positive definite kernel on [0, 1].
Moreover, it is endowed with the following explicit feature map x ↦ Φ_w(x) such that k(x, y) = ∫_R Φ_w(x)* Φ_w(y) dw, where Φ_w(x) ≜ e^{−iw ln(x)} √(x · 2 sech(πw)/(1 + 4w²)) and Φ_w(x)* denotes the complex conjugate of Φ_w(x).

The map Φ(x) : w ↦ Φ_w(x) is called the feature map of x. The integral ∫_R Φ_w(x)* Φ_w(y) dw is also denoted by the Hermitian inner product ⟨Φ(x), Φ(y)⟩.

4.2 Kreĭn-LSH for Mutual Information Loss

Now we are ready to present an asymmetric LSH scheme [33] for the mutual information loss. This method can be easily extended to other Kreĭn kernels, provided that the associated positive definite kernels admit an explicit feature map. In fact, we reduce the problem of designing an LSH for a Kreĭn kernel to the problem of designing an LSH for maximum inner product search (MIPS) [33, 28, 41]. We call this general reduction Kreĭn-LSH.

4.2.1 Reduction to Maximum Inner Product Search

Our reduction is based on the following observation. Suppose that K is a Kreĭn kernel on X such that K = K1 − K2, where K1 and K2 are positive definite kernels on X. Assume that K1 and K2 admit feature maps Φ1 and Φ2 such that K1(x, y) = ⟨Φ1(x), Φ1(y)⟩ and K2(x, y) = ⟨Φ2(x), Φ2(y)⟩. Then the Kreĭn kernel K can also be represented as an inner product

    K(x, y) = ⟨Φ1(x) ⊕ Φ2(x), Φ1(y) ⊕ (−Φ2(y))⟩,    (5)

where ⊕ denotes the direct sum. If we define the pair of transforms T1(x) ≜ Φ1(x) ⊕ Φ2(x) and T2(x) ≜ Φ1(x) ⊕ (−Φ2(x)), then we have K(x, y) = ⟨T1(x), T2(y)⟩. We call this pair of transforms the left and right Kreĭn transforms. We exemplify this technique by applying it to the MIL divergence. For ease of exposition, we define ρ(w) ≜ 2 sech(πw)/(1 + 4w²). The proposed approach Kreĭn-LSH is presented in Algorithm 1.
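The decomposition of Theorem 4, the identity (3), and the feature map of Lemma 3 can all be checked numerically. The sketch below is our own code on a toy joint table (the function names and the table are ours, not the paper's implementation): it compares the mutual information loss computed from its definition against K1 − K2 and against the weighted-GJS form, and compares k(x, y) against a truncated, discretized version of the integral ∫ Φ_w(x)* Φ_w(y) dw, whose integrand equals √(xy) cos(w ln(x/y)) ρ(w).

```python
import math

def k_pd(a, b):
    """k(a, b) = a ln((a+b)/a) + b ln((a+b)/b); note k(a, a) = 2a ln 2 >= 0."""
    s = a + b
    out = 0.0
    if a > 0:
        out += a * math.log(s / a)
    if b > 0:
        out += b * math.log(s / b)
    return out

def mutual_information(joint):
    """I(X;C) for a joint table joint[x][c]."""
    px = [sum(row) for row in joint]
    pc = [sum(col) for col in zip(*joint)]
    return sum(v * math.log(v / (px[i] * pc[j]))
               for i, row in enumerate(joint)
               for j, v in enumerate(row) if v > 0)

def mil(joint, x, y):
    """Mutual information loss of merging feature values x and y (definition)."""
    merged = [row for i, row in enumerate(joint) if i not in (x, y)]
    merged.append([a + b for a, b in zip(joint[x], joint[y])])
    return mutual_information(joint) - mutual_information(merged)

def mil_krein(joint, x, y):
    """Theorem 4: mil(x, y) = K1(x, y) - K2(x, y)."""
    K1 = k_pd(sum(joint[x]), sum(joint[y]))
    K2 = sum(k_pd(a, b) for a, b in zip(joint[x], joint[y]))
    return K1 - K2

def gjs(P, Q, lam):
    """GJS divergence via H(M_lam) - lam H(P) - (1 - lam) H(Q)."""
    ent = lambda R: -sum(r * math.log(r) for r in R if r > 0)
    M = [lam * p + (1 - lam) * q for p, q in zip(P, Q)]
    return ent(M) - lam * ent(P) - (1 - lam) * ent(Q)

def rho(w):
    return 2.0 / (math.cosh(math.pi * w) * (1.0 + 4.0 * w * w))

def k_feature(x, y, t=20.0, n=20000):
    """Midpoint-rule approximation of int_R Phi_w(x)* Phi_w(y) dw on [-t, t]."""
    dw = 2.0 * t / n
    total = 0.0
    for i in range(n):
        w = -t + (i + 0.5) * dw
        total += math.cos(w * math.log(x / y)) * rho(w)
    return math.sqrt(x * y) * total * dw
```

On any valid joint table, `mil`, `mil_krein`, and the weighted-GJS form of (3) agree to floating-point precision, and `k_feature` converges to `k_pd` as the truncation range grows and the grid is refined, consistent with Lemmas 4 and 5.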
To make the intuition of (5) applicable in a practical implementation, we have to truncate and discretize the integral k(x, y) = ∫_R Φ_w(x)* Φ_w(y) dw. First we analyze the truncation. The analysis is similar to Lemma 10 of [4].

Lemma 4 (Truncation error bound, proof in Appendix I). If t > 0 and x, y ∈ [0, 1], the truncation error can be bounded as follows:

    |k(x, y) − ∫_{−t}^{t} Φ_w(x)* Φ_w(y) dw| ≤ 4e^{−t}.

To discretize the finite integral ∫_{−t}^{t} Φ_w(x)* Φ_w(y) dw, we divide the interval into 2J sub-intervals of length ∆.

Algorithm 1 Kreĭn-LSH
Input: discretization parameters J ∈ N and ∆ > 0.
Output: the left and right Kreĭn transforms T1 and T2 and the hash function h.
1: w_j ← (j − 1/2)∆ for j = 1, ..., J.
2: Construct the atomic transform
   τ(x, w, j) ≜ [cos(w ln(x)) √(2x ∫_{(j−1)∆}^{j∆} ρ(w′) dw′), sin(w ln(x)) √(2x ∫_{(j−1)∆}^{j∆} ρ(w′) dw′)].
3: Construct the left and right basic transforms
   η1(x) ≜ ⊕_{j=1}^{J} τ(p(x), w_j, j) ⊕ ⊕_{j=1}^{J} ⊕_{c∈𝒞} τ(p(c, x), w_j, j),
   η2(x) ≜ ⊕_{j=1}^{J} τ(p(x), w_j, j) ⊕ ⊕_{j=1}^{J} ⊕_{c∈𝒞} (−τ(p(c, x), w_j, j)).
4: Construct the left and right Kreĭn transforms
   T1(x, M) ≜ [η1(x), √(M − ‖η1(x)‖₂²), 0],
   T2(x, M) ≜ [η2(x), 0, √(M − ‖η2(x)‖₂²)],
   where M is a constant such that M ≥ ‖η1(x)‖₂² (note that ‖η1(x)‖₂ = ‖η2(x)‖₂).
5: Sample a ∼ N(0, I) and construct the hash function h(x; M) ≜ sign(aᵀ T(x, M)), where T is either the left or the right transform.

The following lemma bounds the discretization error.

Lemma 5 (Discretization error bound, proof in Appendix J).
If J is a positive integer, ∆ > 0, and w_j = (j − 1/2)∆, the discretization error is bounded as follows:

    |∫_{−∆J}^{∆J} Φ_w(x)* Φ_w(y) dw − ⟨⊕_{j=1}^{J} τ(x, w_j, j), ⊕_{j=1}^{J} τ(y, w_j, j)⟩| ≤ 2∆,

where τ(x, w, j) = [cos(w ln(x)) √(2x ∫_{(j−1)∆}^{j∆} ρ(w′) dw′), sin(w ln(x)) √(2x ∫_{(j−1)∆}^{j∆} ρ(w′) dw′)] ∈ R².

By Lemmas 4 and 5, to guarantee that the total approximation error (including both the truncation and discretization errors) is at most ε, it suffices to set ∆ = ε/(4(1 + |𝒞|)) and J ≥ (4(1 + |𝒞|)/ε) ln(8(1 + |𝒞|)/ε).

4.2.2 LSH for Maximum Inner Product Search

The second stage of our proposed method is to apply LSH to the MIPS problem. As an example, in Line 5, we use the Simple-LSH introduced by [28]. Let us have a quick review of Simple-LSH. Assume that M ⊆ R^d is a finite set of vectors and that for all x ∈ M, there is a universal bound on the squared 2-norm, i.e., ‖x‖₂² ≤ M. Neyshabur and Srebro [28] assume that M = 1 without loss of generality; we allow M to be any positive real number. For two vectors x, y ∈ M, Simple-LSH performs the following transforms:

    L1(x) ≜ [x, √(M − ‖x‖₂²), 0],
    L2(y) ≜ [y, 0, √(M − ‖y‖₂²)].

Note that the squared norm of L1(x) and L2(y) is M, and that therefore their cosine similarity equals their inner product up to the factor 1/M. In fact, Simple-LSH is a reduction from MIPS to LSH for the cosine similarity. Then a random-projection-based LSH for the cosine similarity [11, 38],

    h(x) ≜ sign(aᵀ L_i(x)), a ∼ N(0, I), i = 1, 2,

can be used for MIPS, and thereby as an LSH for the MIL divergence via our reduction.

Discussion. We have some important remarks on the practical implementation of Kreĭn-LSH. Although [28] provides a theoretical guarantee for LSH for MIPS, as noted in [41], the additional term √(M − ‖x‖₂²) may dominate the 2-norm and significantly degrade the performance of LSH.
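The Simple-LSH transforms and the sign hash can be written down directly. The sketch below is our own minimal version for a general bound M (helper names are ours); the key algebraic fact it illustrates is that the padded vectors have squared norm exactly M, so ⟨L1(x), L2(y)⟩ = ⟨x, y⟩ and the cosine similarity of the transformed pair is ⟨x, y⟩/M.

```python
import math
import random

def simple_lsh_transforms(M):
    """Simple-LSH [28] with a general norm bound M >= max ||x||_2^2."""
    def L1(x):
        return list(x) + [math.sqrt(M - sum(v * v for v in x)), 0.0]
    def L2(y):
        return list(y) + [0.0, math.sqrt(M - sum(v * v for v in y))]
    return L1, L2

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def sign_hash(dim, seed=0):
    """Random-projection hash h(v) = sign(a^T v) for the cosine similarity [11]."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda v: 1 if dot(a, v) >= 0 else -1
```

Because the cosine similarity of the transformed pair is monotone in ⟨x, y⟩, maximizing the inner product reduces to maximizing the cosine similarity, which the sign hash sketches.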
To circumvent this issue, we recommend a method that partitions the dataset according to the 2-norm, e.g., the norm-ranging method [41].

5 Experiment Results

[Figure 2: The empirical performance of the Hellinger approximation of the GJS divergence with (a) λ = 1/2, (b) λ = 1/3, (c) λ = 1/10.]

[Figure 3: Precision vs. speed-up factor for different λ's on (a) Fashion MNIST, (b) MNIST, (c) CIFAR-10.]

[Figure 4: Precision vs. speed-up factor for different sketch sizes on (a) Fashion MNIST, (b) MNIST, (c) CIFAR-10.]

Approximation Guarantee. In the first part, we verify the theoretical bounds derived in Theorem 1 on real data. We used latent Dirichlet allocation to extract the topic distributions of Reuters-21578, Distribution 1.0. The number of topics is set to 10. We sampled 100 documents uniformly at random and computed the GJS divergence and the Hellinger distance between each pair of topic distributions. Each dot in Fig. 2 represents the topic distribution of a document. The horizontal axis denotes the Hellinger distance while the vertical axis denotes the GJS divergence. We chose different parameter values (λ = 1/2, 1/3, 1/10) for the GJS divergence. From the three subfigures, we observe that both the upper and lower bounds are tight for the data.

Nearest Neighbor Search. In the second part, we apply the proposed LSH scheme for the GJS divergence to the nearest neighbor search problem on Fashion MNIST [39], MNIST [23], and CIFAR-10 [21]. Each image in the datasets is flattened into a vector and L1-normalized, thereby summing to 1. As described in Section 2.1, a concatenation of hash functions is used. We denote the number of concatenated hash functions by K and the number of compound hash functions by L. In the first set of experiments, we set K = 3 and vary L from 20 to 40. We measure the execution time of LSH-based k-nearest neighbor search and of the exact (brute-force) algorithm, where k is set to 20.
Both algorithms were run on a 2.2 GHz Intel Core i7 processor. The speed-up factor is the ratio of the execution time of the exact algorithm to that of the LSH-based method. The quality of the result returned by the LSH-based method is quantified by its precision, i.e., the fraction of correct nearest neighbors among the retrieved items. We remark that precision and recall are equal in our case since both algorithms return $k$ items. We also vary the parameter of the GJS divergence and choose $\lambda$ from $\{1/2, 1/3, 1/10\}$. The results are illustrated in Figs. 3a to 3c. We observe a trade-off between the quality of the output (precision) and computational efficiency (speed-up factor). The performance appears to be robust to the parameter of the GJS divergence. In the second set of experiments, we fix the parameter of the GJS divergence to $1/2$, i.e., the JS divergence is used. The number of concatenated hash functions $K$ ranges from 3 to 5 or from 4 to 6. The results are presented in Figs. 4a to 4c. In addition to the aforementioned quality–efficiency trade-off, we observe that a larger $K$ results in a more efficient algorithm at the same target precision.

6 Conclusion

In this paper, we propose a general strategy for designing an LSH family for $f$-divergences. We exemplify this strategy by developing LSH schemes for the generalized Jensen–Shannon divergence and the triangular discrimination in this framework; both are endowed with an LSH family via the Hellinger approximation. In particular, we show a two-sided approximation of the generalized Jensen–Shannon divergence by the Hellinger distance, which may be of independent interest. Next, we propose a general approach to designing an LSH scheme for Kreĭn kernels via a reduction to the problem of maximum inner product search. In contrast to our strategy for $f$-divergences, this approach involves no approximation and is theoretically lossless.
We exemplify this approach by applying it to the mutual information loss.

Acknowledgments

LC was supported by the Google PhD Fellowship.

References

[1] Ahmed Abdelkader, Sunil Arya, Guilherme D da Fonseca, and David M Mount. "Approximate nearest neighbor searching with non-Euclidean and weighted distances". In: SODA. SIAM. 2019, pp. 355–372.
[2] Amirali Abdullah and Suresh Venkatasubramanian. "A directed isoperimetric inequality with application to Bregman near neighbor lower bounds". In: STOC. ACM. 2015, pp. 509–518.
[3] Amirali Abdullah, John Moeller, and Suresh Venkatasubramanian. "Approximate Bregman near neighbors in sublinear time: Beyond the triangle inequality". In: SoCG. ACM. 2012, pp. 31–40.
[4] Amirali Abdullah, Ravi Kumar, Andrew McGregor, Sergei Vassilvitskii, and Suresh Venkatasubramanian. "Sketching, Embedding and Dimensionality Reduction in Information Theoretic Spaces". In: AISTATS. 2016, pp. 948–956.
[5] Syed Mumtaz Ali and Samuel D Silvey. "A general class of coefficients of divergence of one distribution from another". In: Journal of the Royal Statistical Society. Series B (Methodological) (1966), pp. 131–142.
[6] MohammadHossein Bateni, Lin Chen, Hossein Esfandiari, Thomas Fu, Vahab S Mirrokni, and Afshin Rostamizadeh. "Categorical Feature Compression via Submodular Optimization". In: ICML (2019).
[7] Aditya Bhaskara and Maheshakya Wijewardena. "Distributed Clustering via LSH Based Data Partitioning". In: ICML. 2018, pp. 569–578.
[8] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
[9] S. Bochner, M. Tenenbaum, and H. Pollard. Lectures on Fourier Integrals. Annals of Mathematics Studies. Princeton University Press, 1959.
[10] Andrei Z Broder. "On the resemblance and containment of documents". In: SEQUENCES. IEEE. 1997, pp. 21–29.
[11] Moses S Charikar. "Similarity estimation techniques from rounding algorithms". In: STOC. ACM.
2002, pp. 380–388.
[12] Kamalika Chaudhuri and Andrew McGregor. "Finding Metric Structure in Information Theoretic Clustering." In: COLT. Vol. 8. 2008, p. 10.
[13] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
[14] Imre Csiszár. "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten". In: Publ. Math. Inst. Hungar. Acad. 8 (1963), pp. 95–108.
[15] Constantinos Daskalakis and Qinxuan Pan. "Square Hellinger Subadditivity for Bayesian Networks and its Applications to Identity Testing". In: COLT. 2017, pp. 697–703.
[16] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. "Locality-sensitive hashing scheme based on p-stable distributions". In: SoCG. ACM. 2004, pp. 253–262.
[17] Inderjit S Dhillon, Subramanyam Mallela, and Rahul Kumar. "A divisive information-theoretic feature clustering algorithm for text classification". In: JMLR 3.Mar (2003), pp. 1265–1287.
[18] David Gorisse, Matthieu Cord, and Frederic Precioso. "Locality-sensitive hashing for chi2 distance". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 34.2 (2012), pp. 402–409.
[19] Piotr Indyk and Rajeev Motwani. "Approximate nearest neighbors: towards removing the curse of dimensionality". In: STOC. ACM. 1998, pp. 604–613.
[20] Assaf Kartowsky and Ido Tal. "Greedy-Merge Degrading has Optimal Power-Law". In: IEEE Transactions on Information Theory 65.2 (2018), pp. 917–934.
[21] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Tech. rep. Citeseer, 2009.
[22] Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer Science & Business Media, 2012.
[23] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-based learning applied to document recognition". In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.
[24] Ping Li, Michael Mitzenmacher, and Anshumali Shrivastava. "Coding for random projections". In: ICML. 2014, pp. 676–684.
[25] Jianhua Lin. "Divergence measures based on the Shannon entropy". In: IEEE Transactions on Information Theory 37.1 (1991), pp. 145–151.
[26] Xianling Mao, Bo-Si Feng, Yi-Jing Hao, Liqiang Nie, Heyan Huang, and Guihua Wen. "S2JSD-LSH: A Locality-Sensitive Hashing Schema for Probability Distributions." In: AAAI. 2017, pp. 3244–3251.
[27] Yadong Mu and Shuicheng Yan. "Non-Metric Locality-Sensitive Hashing." In: AAAI. 2010, pp. 539–544.
[28] Behnam Neyshabur and Nathan Srebro. "On Symmetric and Asymmetric LSHs for Inner Product Search". In: ICML. 2015, pp. 1926–1934.
[29] Cheng Soon Ong, Xavier Mary, Stéphane Canu, and Alexander J Smola. "Learning with non-positive kernels". In: ICML. ACM. 2004, p. 81.
[30] Yuta Sakai and Ken-ichi Iwata. "Suboptimal quantizer design for outputs of discrete memoryless channels with a finite-input alphabet". In: ISIT. IEEE. 2014, pp. 120–124.
[31] Igal Sason and Sergio Verdú. "f-divergence Inequalities". In: IEEE Transactions on Information Theory 62.11 (2016), pp. 5973–6006.
[32] Bernhard Schölkopf. "The kernel trick for distances". In: NeurIPS. 2001, pp. 301–307.
[33] Anshumali Shrivastava and Ping Li. "Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS)". In: NeurIPS. 2014, pp. 2321–2329.
[34] Kengo Terasawa and Yuzuru Tanaka. "Spherical LSH for approximate nearest neighbor search on unit hypersphere". In: Workshop on Algorithms and Data Structures. Springer. 2007, pp. 27–38.
[35] Andrea Vedaldi and Andrew Zisserman. "Efficient additive kernels via explicit feature maps". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 34.3 (2012), pp. 480–492.
[36] István Vincze. "On the concept and measure of information contained in an observation". In: Contributions to Probability.
Elsevier, 1981, pp. 207–214.
[37] Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. "A survey on learning to hash". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 40.4 (2018), pp. 769–790.
[38] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. "Hashing for similarity search: A survey". In: arXiv preprint arXiv:1408.2927 (2014).
[39] Han Xiao, Kashif Rasul, and Roland Vollgraf. "Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms". In: arXiv preprint arXiv:1708.07747 (2017).
[40] Jay Yagnik, Dennis Strelow, David A Ross, and Ruei-sung Lin. "The power of comparative reasoning". In: ICCV. IEEE. 2011, pp. 2431–2438.
[41] Xiao Yan, Jinfeng Li, Xinyan Dai, Hongzhi Chen, and James Cheng. "Norm-Ranging LSH for Maximum Inner Product Search". In: NeurIPS. 2018, pp. 2952–2961.
[42] Jiuyang Alan Zhang and Brian M Kurkoski. "Low-complexity quantization of discrete memoryless channels". In: ISITA. IEEE. 2016, pp. 448–452.

Appendix A  Proof of Lemma 2

The first equation, $\kappa_\lambda(t) = \kappa_{1-\lambda}(1/t)$, can be verified directly by plugging in $1 - \lambda$ and $1/t$. In the sequel, we show the second claim, $\kappa_\lambda(t) \in [L(\lambda), U(\lambda)]$, which requires a detailed and careful case analysis. The derivative of $\kappa_\lambda$, denoted by $\kappa_\lambda'(t)$, is
$$\kappa_\lambda'(t) = \frac{2\big(\lambda(\sqrt{t}-1)+1\big)\ln\big(\lambda(t-1)+1\big) - 2\lambda\sqrt{t}\ln t}{(\sqrt{t}-1)^3\sqrt{t}}.$$
We define $f_1(t) = 2\big(\lambda(\sqrt{t}-1)+1\big)\ln\big(\lambda(t-1)+1\big) - 2\lambda\sqrt{t}\ln t$. Its derivative $f_1'(t)$ is
$$f_1'(t) = -\frac{\lambda}{\sqrt{t}\,\big(\lambda(t-1)+1\big)}\Big(2(\lambda-1)(\sqrt{t}-1) + \big(\lambda(t-1)+1\big)\big(\ln t - \ln(\lambda(t-1)+1)\big)\Big).$$
Define $f_2(t) = 2(\lambda-1)(\sqrt{t}-1) + \big(\lambda(t-1)+1\big)\big(\ln t - \ln(\lambda(t-1)+1)\big)$. Its derivative $f_2'(t)$ is
$$f_2'(t) = \frac{(\lambda-1)(\sqrt{t}-1)}{t} + \lambda\big(\ln t - \ln(\lambda(t-1)+1)\big).$$
Its second derivative $f_2''(t)$ is
$$f_2''(t) = \frac{(1-\lambda)\big(2(\lambda-1) + \sqrt{t}\,(\lambda(t-1)+1)\big)}{2t^2\big(\lambda(t-1)+1\big)}.$$
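For orientation in the case analysis that follows, the chain of auxiliary functions can be summarized as follows (our restatement of the displays above):

```latex
\kappa_\lambda'(t)=\frac{f_1(t)}{(\sqrt{t}-1)^3\sqrt{t}},\qquad
f_1'(t)=-\frac{\lambda\, f_2(t)}{\sqrt{t}\,\big(\lambda(t-1)+1\big)},\qquad
f_2''(t)=\frac{(1-\lambda)\, f_3(t)}{2t^2\big(\lambda(t-1)+1\big)},
\quad\text{where } f_3(t)=2(\lambda-1)+\sqrt{t}\,\big(\lambda(t-1)+1\big).
```

Thus, for $t > 0$ with $\lambda(t-1)+1 > 0$, the sign of $\kappa_\lambda'$ reduces to the sign of $f_1$, which is traced back through $f_2$ and $f_3$.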
First, we assume $\lambda \in (0, 1/2)$. In this case, we have $\frac{1-\lambda}{\lambda} > 1$ and $\lambda(t-1)+1 > 0$. Notice that $f_3(t) = 2(\lambda-1) + \sqrt{t}\,\big(\lambda(t-1)+1\big)$ is a strictly increasing function of $t$. Therefore, if $t > \big(\frac{1-\lambda}{\lambda}\big)^2$, we obtain
$$f_3(t) > f_3\!\left(\Big(\tfrac{1-\lambda}{\lambda}\Big)^2\right) = \frac{(\lambda-1)(\lambda+1)(2\lambda-1)}{\lambda^2} > 0.$$
Therefore $f_2''(t) > 0$ if $t > \big(\frac{1-\lambda}{\lambda}\big)^2$. Thus $f_2'(t)$ is increasing in $t$ on this range, which yields
$$f_2'(t) > f_2'\!\left(\Big(\tfrac{1-\lambda}{\lambda}\Big)^2\right) = \frac{\lambda\big(2\lambda + (1-\lambda)\ln\frac{1-\lambda}{\lambda} - 1\big)}{1-\lambda}.$$
Define $g(\lambda) = 2\lambda + (1-\lambda)\ln\frac{1-\lambda}{\lambda} - 1$. Its derivative $g'(\lambda) = -\frac{1}{\lambda} - \ln\big(\frac{1}{\lambda}-1\big) + 2$ is negative if $\lambda < 1/2$ and positive if $\lambda > 1/2$. Therefore $g(\lambda) \ge g(1/2) = 0$. Thus we obtain that $f_2'(t) > 0$ if $t > \big(\frac{1-\lambda}{\lambda}\big)^2$, which implies that $f_2(t)$ is increasing on this range. Thus we have
$$f_2(t) > f_2\!\left(\Big(\tfrac{1-\lambda}{\lambda}\Big)^2\right) = \frac{(1-\lambda)\big(4\lambda + \ln(\frac{1}{\lambda}-1) - 2\big)}{\lambda}.$$
Define $g_1(\lambda) = 4\lambda + \ln\big(\frac{1}{\lambda}-1\big) - 2$. Its derivative $g_1'(\lambda) = \frac{1}{(\lambda-1)\lambda} + 4$ is non-positive, which implies that $g_1$ is decreasing in $\lambda$; since $g_1(1/2) = 0$, we have $g_1(\lambda) > 0$ for $\lambda \in (0, 1/2)$. Therefore, if $t > \big(\frac{1-\lambda}{\lambda}\big)^2$, we have $f_2(t) > \frac{1-\lambda}{\lambda}\, g_1(\lambda) > 0$. Since $\lambda(t-1)+1 > 0$, we obtain $f_1'(t) < 0$, and therefore $f_1(t)$ is decreasing for $t > \big(\frac{1-\lambda}{\lambda}\big)^2$. We have
$$f_1(t) < f_1\!\left(\Big(\tfrac{1-\lambda}{\lambda}\Big)^2\right) = 0.$$
Hence, if $t > \big(\frac{1-\lambda}{\lambda}\big)^2$, since $(\sqrt{t}-1)^3\sqrt{t} > 0$, we deduce that $\kappa_\lambda'(t) < 0$. If $t < 1$, since $f_3(t)$ is strictly increasing in $t$, we have $f_3(t) < f_3(1) = 2\lambda - 1 < 0$, which implies $f_2''(t) < 0$. Therefore $f_2'(t)$ is strictly decreasing on $(0,1)$, so $f_2'(t) > f_2'(1) = 0$, which implies that $f_2(t)$ is strictly increasing on $(0,1)$. We immediately have $f_2(t) < f_2(1) = 0$ for all $t \in (0,1)$, which yields $f_1'(t) > 0$; therefore $f_1(t)$ is strictly increasing on $(0,1)$ and, for all $t \in (0,1)$, it holds that $f_1(t) < f_1(1) = 0$.
Since $(\sqrt{t}-1)^3\sqrt{t} < 0$ on $(0,1)$, we deduce that $\kappa_\lambda'(t) > 0$ for $t \in (0,1)$. The interval that remains unexplored is $I = \big(1, \big(\frac{1-\lambda}{\lambda}\big)^2\big)$. Since $f_3(1) = 2\lambda - 1 < 0$ and $f_3\big(\big(\frac{1-\lambda}{\lambda}\big)^2\big) = \frac{(\lambda-1)(\lambda+1)(2\lambda-1)}{\lambda^2} > 0$, we know that $f_3(t)$ has a real root on this interval. Notice that $f_3(t)$ can be viewed as a cubic function of $\sqrt{t}$: define $f_4(x) = \lambda x^3 + (1-\lambda)x + 2\lambda - 2$, so that $f_3(t) = f_4(\sqrt{t})$. The cubic function $f_4$ is strictly monotone if $\lambda \in (0,1)$. Therefore the real root of $f_3$ on $I$ is unique, and we denote it by $\rho(\lambda)$. Now we divide the interval $I$ into two subintervals $I_1 = (1, \rho(\lambda))$ and $I_2 = \big(\rho(\lambda), \big(\frac{1-\lambda}{\lambda}\big)^2\big)$. Since $f_3(t) < 0$ on $I_1$ and $f_3(t) > 0$ on $I_2$, we have $f_2''(t) < 0$ on $I_1$ and $f_2''(t) > 0$ on $I_2$. Therefore $f_2'(t)$ strictly decreases on $I_1$ and strictly increases on $I_2$. Note that $f_2'(1) = 0$ and
$$f_2'\!\left(\Big(\tfrac{1-\lambda}{\lambda}\Big)^2\right) = \frac{\lambda\big(2\lambda + (1-\lambda)\ln\frac{1-\lambda}{\lambda} - 1\big)}{1-\lambda} > 0.$$
To see this, define $g_2(\lambda) = 2\lambda + (1-\lambda)\ln\frac{1-\lambda}{\lambda} - 1$. Its second derivative is $g_2''(\lambda) = \frac{1}{\lambda^2(1-\lambda)} > 0$, which implies that $g_2$ is strictly convex and $g_2'(\lambda)$ has a unique root. Observe that $\lambda = 1/2$ is a root of $g_2'(\lambda)$. We deduce that $g_2(\lambda) > g_2(1/2) = 0$ for $\lambda \in (0, 1/2)$, which immediately yields $f_2'\big(\big(\frac{1-\lambda}{\lambda}\big)^2\big) > 0$. Thus the function $f_2'(t)$ has a unique root on $I$, denoted by $\rho_1(\lambda)$. Therefore the function $f_2(t)$ strictly decreases on $I_3 = (1, \rho_1(\lambda))$ and strictly increases on $I_4 = \big(\rho_1(\lambda), \big(\frac{1-\lambda}{\lambda}\big)^2\big)$. Note that $f_2(1) = 0$ and
$$f_2\!\left(\Big(\tfrac{1-\lambda}{\lambda}\Big)^2\right) = \frac{(1-\lambda)\big(4\lambda + \ln\frac{1-\lambda}{\lambda} - 2\big)}{\lambda} > 0.$$
To see the above inequality, we define $g_3(\lambda) = 4\lambda + \ln\frac{1-\lambda}{\lambda} - 2$.
Its derivative is $g_3'(\lambda) = \frac{(1-2\lambda)^2}{(\lambda-1)\lambda} < 0$, which implies that $g_3(\lambda)$ strictly decreases and hence $g_3(\lambda) > g_3(1/2) = 0$ for $\lambda \in (0, 1/2)$. As a result, we deduce that $f_2\big(\big(\frac{1-\lambda}{\lambda}\big)^2\big) > 0$. Thus the function $f_2(t)$ has a unique root on $I$, denoted by $\rho_2(\lambda)$, and $f_1'(t)$ is positive on $I_5 = (1, \rho_2(\lambda))$ and negative on $I_6 = \big(\rho_2(\lambda), \big(\frac{1-\lambda}{\lambda}\big)^2\big)$, which implies that $f_1$ strictly increases on $I_5$ and strictly decreases on $I_6$. Note that $f_1(1) = f_1\big(\big(\frac{1-\lambda}{\lambda}\big)^2\big) = 0$. We conclude that $f_1(t) > 0$ on $I$, which implies $\kappa_\lambda'(t) > 0$ on $I$.

From the above analysis, we see that if $\lambda \in (0, 1/2)$, the function $\kappa_\lambda'(t)$ has no real root on $(0, \infty) \setminus \big\{1, \big(\frac{1-\lambda}{\lambda}\big)^2\big\}$. Since
$$\lim_{t \to 1} \kappa_\lambda(t) = 4\lambda(1-\lambda) > 0, \qquad \kappa_\lambda'\!\left(\Big(\tfrac{1-\lambda}{\lambda}\Big)^2\right) = 0,$$
we deduce that the derivative $\kappa_\lambda'(t)$ has a unique root, at $t = \big(\frac{1-\lambda}{\lambda}\big)^2$, if $\lambda \in (0, 1/2)$. By the identity $\kappa_\lambda(t) = \kappa_{1-\lambda}(1/t)$, the same holds for $\lambda \in (1/2, 1)$. Furthermore, the derivative is positive if $t < \big(\frac{1-\lambda}{\lambda}\big)^2$ and negative if $t > \big(\frac{1-\lambda}{\lambda}\big)^2$. Thus the maximum of $\kappa_\lambda$ is attained at $t = \big(\frac{1-\lambda}{\lambda}\big)^2$ and equals exactly $U(\lambda)$.

Next, we assume $\lambda = 1/2$. We have
$$\kappa_{1/2}(t) = \frac{t\ln t + (t+1)\big(\ln 2 - \ln(t+1)\big)}{(\sqrt{t}-1)^2}.$$
Its derivative is
$$\kappa_{1/2}'(t) = \frac{(\sqrt{t}+1)\ln\frac{t+1}{2} - \sqrt{t}\ln t}{(\sqrt{t}-1)^3\sqrt{t}}.$$
Define $f_5(t) = (\sqrt{t}+1)\ln\frac{t+1}{2} - \sqrt{t}\ln t$. Its derivative is
$$f_5'(t) = \frac{2(\sqrt{t}-1) + (t+1)\ln\frac{t+1}{2} - (t+1)\ln t}{2\sqrt{t}\,(t+1)}.$$
Then we define $f_6(t) = 2(\sqrt{t}-1) + (t+1)\ln\frac{t+1}{2} - (t+1)\ln t$, whose derivative is $f_6'(t) = \frac{\sqrt{t}-1}{t} - \ln(2t) + \ln(t+1)$ and whose second derivative is
$$f_6''(t) = \frac{1}{t^3 + t^2} - \frac{1}{2t^{3/2}}.$$
Setting $f_6''(t) > 0$ yields $t^{1/2} + t^{3/2} < 2$, which is equivalent to $t < 1$.
Therefore $f_6''(t)$ is positive on $(0,1)$ and negative on $(1,\infty)$, which implies that $f_6'(t) < f_6'(1) = 0$ for $t \ne 1$. We deduce that $f_6(t)$ is strictly decreasing in $t$ and thus has a unique root. Since $t = 1$ is a root of $f_6(t)$, it is the unique root, which implies that $f_6(t)$ and $f_5'(t)$ are both positive on $(0,1)$ and negative on $(1,\infty)$. As a result, we deduce that $f_5(t) < f_5(1) = 0$ for $t \ne 1$. Thus we conclude that $\kappa_{1/2}'(t)$ is positive on $(0,1)$ and negative on $(1,\infty)$; one can verify that $t = 1$, which equals $\big(\frac{1-\lambda}{\lambda}\big)^2$ when $\lambda = 1/2$, is indeed a root of $\kappa_{1/2}'(t)$.

So far we have shown, for every $\lambda \in (0,1)$, that the derivative $\kappa_\lambda'(t)$ is positive if $t < \big(\frac{1-\lambda}{\lambda}\big)^2$ and negative if $t > \big(\frac{1-\lambda}{\lambda}\big)^2$. Thus the maximum of $\kappa_\lambda$ is attained at $t = \big(\frac{1-\lambda}{\lambda}\big)^2$ and equals exactly $U(\lambda)$. The infimum is
$$\min\Big\{\lim_{t \to 0^+} \kappa_\lambda(t),\ \lim_{t \to \infty} \kappa_\lambda(t)\Big\} = \min\{-2(1-\lambda)\ln(1-\lambda),\ -2\lambda\ln\lambda\}.$$
Therefore we conclude that $\kappa_\lambda(t) \in [L(\lambda), U(\lambda)]$.

Appendix B  Proof of Theorem 1

In addition to Lemma 2, we need the following lemma.

Lemma 6 (Theorem 6 of [31]). Let $f$ and $g$ be two convex functions with $f(1) = 0$ and $g(1) = 0$, and suppose that $g(t) > 0$ for every $t \in (0,1) \cup (1,\infty)$. Let $P$ and $Q$ be two distributions on a common finite sample space $\Omega$. Define $\beta_1 = \inf_{i \in \Omega} \frac{Q(i)}{P(i)}$ and $\beta_2 = \inf_{i \in \Omega} \frac{P(i)}{Q(i)}$, and assume that $\beta_1, \beta_2 \in [0,1)$. Then we have $D_f(P \| Q) \le \kappa^* D_g(P \| Q)$, where
$$\kappa^* = \sup_{\beta \in (\beta_2, 1) \cup (1, \beta_1^{-1})} \frac{f(\beta)}{g(\beta)}.$$

By Lemmas 2 and 6, we have $L(\lambda)\, H^2(P,Q) \le D^\lambda_{\mathrm{GJS}}(P \| Q) \le U(\lambda)\, H^2(P,Q)$. Now we show that $U(\lambda) \le 1$. Its derivative $U'(\lambda)$ has a unique root at $\lambda = 1/2$ on the interval $(0,1)$, and it is positive if $\lambda < 1/2$ and negative if $\lambda > 1/2$. Therefore $U(\lambda) \le U(1/2) = 1$.

Appendix C  Proof of Lemma 1

The equation $m_\lambda(1) = 0$ can be verified by plugging in $t = 1$ directly.
We compute the second derivative of $m_\lambda$:
$$\frac{d^2 m_\lambda}{dt^2} = \frac{\lambda(1-\lambda)}{\lambda t^2 + (1-\lambda)t}.$$
If $\lambda \in [0,1]$ and $t \in (0,\infty)$, we have $\frac{d^2 m_\lambda}{dt^2} \ge 0$, which implies the convexity of $m_\lambda$. The $m_\lambda$-divergence equals
$$D_{m_\lambda}(P \| Q) = \int_\Omega \lambda \ln\frac{dP}{dQ}\, dP - \big(\lambda\, dP + (1-\lambda)\, dQ\big)\ln\Big(\lambda\frac{dP}{dQ} + 1 - \lambda\Big),$$
while the GJS divergence equals
$$D^\lambda_{\mathrm{GJS}}(P \| Q) = \int_\Omega \lambda \ln\frac{dP/dQ}{\lambda\, dP/dQ + (1-\lambda)}\, dP + (1-\lambda)\ln\frac{1}{\lambda\, dP/dQ + (1-\lambda)}\, dQ = \int_\Omega \lambda \ln\frac{dP}{dQ}\, dP - \big(\lambda\, dP + (1-\lambda)\, dQ\big)\ln\Big(\lambda\frac{dP}{dQ} + 1 - \lambda\Big).$$
Thus we conclude that the $m_\lambda$-divergence yields the GJS divergence with parameter $\lambda$.

Appendix D  Proof of Proposition 1

Let $P$ and $Q$ be two probability measures in $\mathcal{P}$. If $P$ and $Q$ are equal, then $D_f(P \| Q) = 0$; therefore, for any hash function $h$, it holds that $h(P) = h(Q)$, which implies $\Pr_{h \sim \mathcal{H}}[h(P) = h(Q)] = 1 \ge p_1$. In the sequel, we assume that $P$ and $Q$ are different. We first show, by contradiction, that there exists $i \in \Omega$ such that $P(i) < Q(i)$. Assume that $P(i) \ge Q(i)$ for all $i \in \Omega$. Since $P$ and $Q$ are different, there exists $i_0 \in \Omega$ such that $P(i_0) \ne Q(i_0)$; since $P(i) \ge Q(i)$ for all $i$, we have $P(i_0) > Q(i_0)$. Therefore $\sum_{i \in \Omega} P(i) > \sum_{i \in \Omega} Q(i)$. However, both $P$ and $Q$ sum to 1, which leads to a contradiction. Therefore there exists $i$ such that $P(i) < Q(i)$, which yields $\beta_2 \triangleq \inf_{i \in \Omega} \frac{P(i)}{Q(i)} < 1$. Similarly, we have $\beta_1 \triangleq \inf_{i \in \Omega} \frac{Q(i)}{P(i)} < 1$. Since $P(i)$ and $Q(i)$ are non-negative for all $i \in \Omega$, we have $\beta_1, \beta_2 \ge 0$. In sum, we showed that $\beta_1, \beta_2 \in [0,1)$. By the definition of $\beta_0$, we have the interval inclusion $(\beta_2, \beta_1^{-1}) \subseteq (\beta_0, \beta_0^{-1})$. Recall that
$$U = \sup_{\beta \in (\beta_0, 1) \cup (1, \beta_0^{-1})} \frac{f(\beta)}{g(\beta)}, \qquad L = \inf_{\beta \in (\beta_0, 1) \cup (1, \beta_0^{-1})} \frac{f(\beta)}{g(\beta)}.$$
By Lemma 6, we obtain the approximation guarantee
$$L \cdot D_g(P \| Q) \le D_f(P \| Q) \le U \cdot D_g(P \| Q). \tag{6}$$
There are two cases to consider. In the first case, assume that $D_f(P \| Q) \le L r_1$. By (6), we have $D_g(P \| Q) \le r_1$. Since $\mathcal{H}$ is an $(r_1, r_2, p_1, p_2)$-sensitive family for the $g$-divergence, it holds that $\Pr_{h \sim \mathcal{H}}[h(P) = h(Q)] \ge p_1$. Similarly, if $D_f(P \| Q) > U r_2$, then by (6) we have $D_g(P \| Q) > r_2$ and hence $\Pr_{h \sim \mathcal{H}}[h(P) = h(Q)] \le p_2$. Thus $\mathcal{H}$ forms an $(L r_1, U r_2, p_1, p_2)$-sensitive family for the $f$-divergence on $\mathcal{P}$.

Appendix E  Proof of Theorem 2

If $D^\lambda_{\mathrm{GJS}}(P \| Q) \le R$, then by Theorem 1 we have
$$\big\|\sqrt{P} - \sqrt{Q}\big\|_2 \le \sqrt{\frac{2R}{L(\lambda)}} \triangleq R_1.$$
If $D^\lambda_{\mathrm{GJS}}(P \| Q) \ge \frac{c^2 U(\lambda)}{L(\lambda)} R$, we have
$$\big\|\sqrt{P} - \sqrt{Q}\big\|_2 \ge c\sqrt{\frac{2R}{L(\lambda)}} = c R_1.$$
By the construction and properties of the locality-sensitive hash family for the $L_2$ distance proposed in [16, Section 3.2], we know that $h_{a,b}$ forms an $(R_1, cR_1, p_1, p_2)$-sensitive hash family for the $L_2$ distance between the two vectors $\sqrt{P}$ and $\sqrt{Q}$. Therefore, provided that $D^\lambda_{\mathrm{GJS}}(P \| Q) \le R$, which implies $\|\sqrt{P} - \sqrt{Q}\|_2 \le R_1$, we have $\Pr[h_{a,b}(P) = h_{a,b}(Q)] \ge p_1$. Similarly, if $D^\lambda_{\mathrm{GJS}}(P \| Q) \ge \frac{c^2 U(\lambda)}{L(\lambda)} R$, we have $\Pr[h_{a,b}(P) = h_{a,b}(Q)] \le p_2$.

Appendix F  Proof of Theorem 3

The derivative of the ratio function $\kappa(t) = \frac{\delta(t)}{\mathrm{hel}(t)}$ is
$$\kappa'(t) = \frac{1 - t}{\sqrt{t}\,(t+1)^2}.$$
It is positive when $t < 1$ and negative when $t > 1$. Therefore, for all $t \in (0, \infty)$, we have $\kappa(t) \le \kappa(1) = 2$ and $\kappa(t) \ge \min\{\lim_{t \to 0^+} \kappa(t),\ \lim_{t \to \infty} \kappa(t)\} = 1$. By Lemma 6, we have
$$H^2(P, Q) \le \Delta(P \| Q) \le 2 H^2(P, Q).$$
If $\Delta(P \| Q) \le R$, we have $\|\sqrt{P} - \sqrt{Q}\|_2 \le \sqrt{2R} \triangleq R_1$. If $\Delta(P \| Q) \ge 2c^2 R$, we have $\|\sqrt{P} - \sqrt{Q}\|_2 \ge \sqrt{2R}\, c = c R_1$.
By the construction and properties of the locality-sensitive hash family for the $L_2$ distance proposed in [16, Section 3.2], we know that $h_{a,b}$ forms an $(R_1, cR_1, p_1, p_2)$-sensitive hash family for the $L_2$ distance between the two vectors $\sqrt{P}$ and $\sqrt{Q}$. Therefore, provided that $\Delta(P \| Q) \le R$, which implies $\|\sqrt{P} - \sqrt{Q}\|_2 \le R_1$, we have $\Pr[h_{a,b}(P) = h_{a,b}(Q)] \ge p_1$. Similarly, if $\Delta(P \| Q) \ge 2c^2 R$, we have $\Pr[h_{a,b}(P) = h_{a,b}(Q)] \le p_2$.

Appendix G  Proof of Lemma 3

First, we note that $k$ is homogeneous, i.e., for all $c \ge 0$ it holds that $k(cx, cy) = c\, k(x, y)$. Its kernel signature [35] is
$$K(\lambda) \triangleq k\big(e^{\lambda/2}, e^{-\lambda/2}\big) = e^{-\lambda/2}\Big( \big(e^\lambda + 1\big)\ln\big(e^\lambda + 1\big) - \lambda e^\lambda \Big).$$
First, let us review the definition of a positive definite function.

Definition 4 ([9]). A complex-valued function $f : \mathbb{R} \to \mathbb{C}$ is positive definite if
1. it is continuous in every finite region and is bounded on $\mathbb{R}$;
2. it is Hermitian, i.e., $f(-x) = \overline{f(x)}$;
3. for any real numbers $x_1, \dots, x_n \in \mathbb{R}$, the matrix $A = \big(f(x_i - x_j)\big)_{i,j=1}^n$ is positive semidefinite.

Next we show that $K$ is a positive definite function by showing that it is the Fourier transform of a non-negative function. We have the following Fourier transform and inverse Fourier transform:
$$K(\lambda) = \int_{\mathbb{R}} e^{-i\lambda w}\, \frac{2\,\mathrm{sech}(\pi w)}{1 + 4w^2}\, dw, \qquad \rho(w) \triangleq \frac{1}{2\pi}\int_{\mathbb{R}} K(\lambda)\, e^{i\lambda w}\, d\lambda = \frac{2\,\mathrm{sech}(\pi w)}{1 + 4w^2}.$$
Then we need the following lemmas.

Lemma 7. If $f(x) = \int_{\mathbb{R}} e^{-ixt} g(t)\, dt$ is the Fourier transform of a non-negative function $g(t)$, then $f$ is positive definite.

Proof of Lemma 7. Let $x_1, \dots, x_n \in \mathbb{R}$ be arbitrary real numbers and $a_1, \dots, a_n$ be arbitrary complex numbers.
Let us compute the quadratic form directly:
$$\sum_{j,k=1}^{n} f(x_j - x_k)\, a_j \overline{a_k} = \int_{\mathbb{R}} \sum_{j,k=1}^{n} e^{-i(x_j - x_k)t}\, a_j \overline{a_k}\, g(t)\, dt = \int_{\mathbb{R}} \bigg| \sum_{j=1}^{n} a_j e^{-i x_j t} \bigg|^2 g(t)\, dt \ge 0.$$

Lemma 8 (Lemma 1 in [35]). A homogeneous kernel is positive definite if, and only if, its signature $K(\lambda)$ is a positive definite function.

Since $\frac{2\,\mathrm{sech}(\pi w)}{1 + 4w^2} \ge 0$ holds for all $w \in \mathbb{R}$, we deduce that $K(\lambda)$ is the Fourier transform of a non-negative function. Lemma 7 implies that $K(\lambda)$ is a positive definite function; therefore $k$ is a positive definite kernel by Lemma 8. Let us define the feature map
$$\Phi_w(x) \triangleq e^{-i w \ln x} \sqrt{x \cdot \frac{2\,\mathrm{sech}(\pi w)}{1 + 4w^2}}.$$
Since $k(x, y)$ is homogeneous, we have
$$k(x, y) = \sqrt{xy}\; k\big(\sqrt{x/y}, \sqrt{y/x}\big) = \sqrt{xy}\; K\big(\ln(y/x)\big) = \sqrt{xy} \int_{\mathbb{R}} e^{-i\ln(y/x) w}\, \frac{2\,\mathrm{sech}(\pi w)}{1 + 4w^2}\, dw = \int_{\mathbb{R}} \Phi_w(x)^* \Phi_w(y)\, dw.$$

Appendix H  Proof of Theorem 4

Let $z$ denote the merged value. If we define $\eta(u) \triangleq -u \ln u$, the mutual information loss is
$$\mathrm{mil}(x, y) = \sum_{c \in \mathcal{C}} \Big( p(c, x) \ln\frac{p(c, x)}{p(c)\, p(x)} + p(c, y) \ln\frac{p(c, y)}{p(c)\, p(y)} - p(c, z) \ln\frac{p(c, z)}{p(c)\, p(z)} \Big) = \sum_{c \in \mathcal{C}} \Big( p(c, x) \ln\frac{p(c, x)}{p(x)} + p(c, y) \ln\frac{p(c, y)}{p(y)} - p(c, z) \ln\frac{p(c, z)}{p(z)} \Big) = \eta(p(x)) + \eta(p(y)) - \eta(p(z)) - \sum_{c \in \mathcal{C}} \big[ \eta(p(c, x)) + \eta(p(c, y)) - \eta(p(c, z)) \big].$$
By the definition of $k$, we have $k(a, b) = \eta(a) + \eta(b) - \eta(a + b)$. As a result, we rewrite $\mathrm{mil}(x, y)$ as
$$\mathrm{mil}(x, y) = k\big(p(x), p(y)\big) - \sum_{c \in \mathcal{C}} k\big(p(c, x), p(c, y)\big) = K_1(x, y) - K_2(x, y).$$
Lemma 3 indicates that $k$ is a positive definite kernel. In light of the techniques for constructing new kernels presented in [8, Section 6.2], we obtain that $K_1$ and $K_2$ are positive definite kernels.

Appendix I  Proof of Lemma 4

Recall that $k(x, y) = \int_{\mathbb{R}} \Phi_w(x)^* \Phi_w(y)\, dw$.
We have
$$\Big| k(x, y) - \int_{-t}^{t} \Phi_w(x)^* \Phi_w(y)\, dw \Big| = \Big| \int_{|w| > t} \Phi_w(x)^* \Phi_w(y)\, dw \Big| \le \int_{|w| > t} \big| e^{i w \ln(x/y)} \sqrt{xy}\, \rho(w) \big|\, dw \overset{(a)}{\le} 2 \int_{t}^{\infty} \rho(w)\, dw \overset{(b)}{\le} 8 \int_{t}^{\infty} e^{-\pi w}\, dw = \frac{8}{\pi} e^{-\pi t} \le 4 e^{-t},$$
where (a) is due to $|e^{i w \ln(x/y)} \sqrt{xy}| \le 1$ and (b) is due to
$$\rho(w) = \frac{2\,\mathrm{sech}(\pi w)}{1 + 4w^2} \le 2\,\mathrm{sech}(\pi w) = \frac{4}{e^{\pi w} + e^{-\pi w}} \le 4 e^{-\pi w}.$$

Appendix J  Proof of Lemma 5

As the first step, we rewrite the integral:
$$\int_{-\Delta J}^{\Delta J} \Phi_w(x)^* \Phi_w(y)\, dw = \sum_{j = -J+1}^{J} \int_{(j-1)\Delta}^{j\Delta} e^{i w \ln(x/y)} \sqrt{xy}\, \rho(w)\, dw.$$
Then we bound the discretization error:
$$\Big| \int_{-\Delta J}^{\Delta J} \Phi_w(x)^* \Phi_w(y)\, dw - \sum_{j=-J+1}^{J} \int_{(j-1)\Delta}^{j\Delta} e^{i w_j \ln(x/y)} \sqrt{xy}\, \rho(w)\, dw \Big| \le \sum_{j=-J+1}^{J} \int_{(j-1)\Delta}^{j\Delta} \big| e^{i w \ln(x/y)} - e^{i w_j \ln(x/y)} \big| \sqrt{xy}\, \rho(w)\, dw \overset{(a)}{\le} \sum_{j=-J+1}^{J} \int_{(j-1)\Delta}^{j\Delta} |\ln(x/y)|\, \frac{\Delta}{2}\, \sqrt{xy}\, \rho(w)\, dw = \frac{\Delta}{2} \sqrt{xy}\, |\ln(x/y)| \int_{-\Delta J}^{\Delta J} \rho(w)\, dw \overset{(b)}{\le} 2\Delta,$$
where (a) is due to $|e^{i w \ln(x/y)} - e^{i w_j \ln(x/y)}| \le |\ln(x/y)|\, |w - w_j| \le \frac{\Delta}{2} |\ln(x/y)|$, and (b) is due to $\int_{-\Delta J}^{\Delta J} \rho(w)\, dw \le \int_{\mathbb{R}} \rho(w)\, dw = 2\ln 2$ and $\sqrt{xy}\, |\ln(x/y)| \le \sqrt{x}\, |\ln x| + \sqrt{y}\, |\ln y| \le \frac{4}{e}$.

Next we rewrite the partial Riemann sum by substituting the new index $k = 1 - j$:
$$\sum_{j=-J+1}^{0} \int_{(j-1)\Delta}^{j\Delta} e^{i w_j \ln(x/y)} \sqrt{xy}\, \rho(w)\, dw = \sum_{k=1}^{J} \int_{-k\Delta}^{(1-k)\Delta} e^{i(1/2 - k)\Delta \ln(x/y)} \sqrt{xy}\, \rho(w)\, dw = \sum_{k=1}^{J} \int_{(k-1)\Delta}^{k\Delta} e^{-i w_k \ln(x/y)} \sqrt{xy}\, \rho(w)\, dw,$$
where the last equality uses that $\rho$ is even. Therefore the entire Riemann sum can be rewritten as
$$\sum_{j=-J+1}^{J} \int_{(j-1)\Delta}^{j\Delta} e^{i w_j \ln(x/y)} \sqrt{xy}\, \rho(w)\, dw = \sum_{j=1}^{J} \int_{(j-1)\Delta}^{j\Delta} \big( e^{i w_j \ln(x/y)} + e^{-i w_j \ln(x/y)} \big) \sqrt{xy}\, \rho(w)\, dw = 2 \sum_{j=1}^{J} \big( \cos(w_j \ln x)\cos(w_j \ln y) + \sin(w_j \ln x)\sin(w_j \ln y) \big) \sqrt{xy} \int_{(j-1)\Delta}^{j\Delta} \rho(w)\, dw = \Big\langle \bigoplus_{j=1}^{J} \tau(x, w_j, j),\ \bigoplus_{j=1}^{J} \tau(y, w_j, j) \Big\rangle.$$
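The truncated and discretized feature map analyzed in Lemmas 4 and 5 can be checked numerically. Below is a sketch (the function names are ours; we take $\eta(u) = -u\ln u$ and $k(x,y) = \eta(x) + \eta(y) - \eta(x+y)$ as in Appendix H, and approximate the per-cell mass of $\rho$ by a midpoint quadrature rule):

```python
import numpy as np

def rho(w):
    """Spectral density of the kernel signature: 2 sech(pi w) / (1 + 4 w^2)."""
    return 2.0 / (np.cosh(np.pi * w) * (1.0 + 4.0 * w ** 2))

def tau_features(x, J, delta, subgrid=64):
    """Features from Lemma 5: for cell j, tau(x, w_j, j) is
    [cos(w_j ln x), sin(w_j ln x)] scaled by sqrt(2 x m_j), where m_j
    is the mass of rho on [(j-1)delta, j delta] (midpoint quadrature)."""
    feats = np.empty(2 * J)
    for j in range(1, J + 1):
        wj = (j - 0.5) * delta
        sub = (j - 1) * delta + (np.arange(subgrid) + 0.5) * (delta / subgrid)
        mass = rho(sub).sum() * (delta / subgrid)
        s = np.sqrt(2.0 * x * mass)
        feats[2 * j - 2] = s * np.cos(wj * np.log(x))
        feats[2 * j - 1] = s * np.sin(wj * np.log(x))
    return feats

def k_exact(x, y):
    """k(x, y) = eta(x) + eta(y) - eta(x + y) with eta(u) = -u ln u."""
    eta = lambda u: -u * np.log(u)
    return eta(x) + eta(y) - eta(x + y)

x, y, J, delta = 0.3, 0.7, 1000, 0.01    # J * delta = 10: tiny truncation error
approx = tau_features(x, J, delta) @ tau_features(y, J, delta)
# |approx - k_exact(x, y)| stays within the 2*delta discretization bound
```

The inner product of the two finite feature vectors approximates the Kreĭn-kernel component $k(x, y)$, with the truncation error controlled by Lemma 4 and the discretization error by Lemma 5.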
