Coding for Random Projections and Approximate Near Neighbor Search
Ping Li
Department of Statistics & Biostatistics, Department of Computer Science
Rutgers University, Piscataway, NJ 08854, USA
pingli@stat.rutgers.edu

Michael Mitzenmacher
School of Engineering and Applied Sciences
Harvard University, Cambridge, MA 02138, USA
michaelm@eecs.harvard.edu

Anshumali Shrivastava
Department of Computer Science
Cornell University, Ithaca, NY 14853, USA
anshu@cs.cornell.edu

Abstract

This technical note compares two coding (quantization) schemes for random projections in the context of sub-linear time approximate near neighbor search. The first scheme is based on uniform quantization [4], while the second scheme utilizes a uniform quantization plus a uniformly random offset [1] (which has been popular in practice). The prior work [4] compared the two schemes in the context of similarity estimation and training linear classifiers, with the conclusion that the step of random offset is not necessary and may hurt the performance (depending on the similarity level). The task of near neighbor search is related to similarity estimation, with important distinctions, and requires its own study. In this paper, we demonstrate that in the context of near neighbor search, the step of random offset is not needed either and may hurt the performance (sometimes significantly so, depending on the similarity and other parameters). For approximate near neighbor search, when the target similarity level is high (e.g., correlation > 0.85), our analysis suggests using a uniform quantization to build hash tables, with a bin width w = 1 ∼ 1.5. On the other hand, when the target similarity level is not that high, it is preferable to use larger w values (e.g., w ≥ 2 ∼ 3).
This is equivalent to saying that it suffices to use only a small number of bits (or even just 1 bit) to code each hashed value in the context of sublinear time near neighbor search. An extensive experimental study on two reasonably large datasets confirms the theoretical finding.

Coding for building hash tables is a different task from coding for similarity estimation. For near neighbor search, we need coding of the projected data to determine which buckets the data points should be placed in (and the coded values are not stored). For similarity estimation, the purpose of coding is to estimate the similarities accurately using small storage space. Therefore, if necessary, we can actually code the projected data twice (with different bin widths). In this paper, we do not study the important issue of "re-ranking" of retrieved data points by using estimated similarities. That step is needed when exact (all pairwise) similarities cannot be practically stored or computed on the fly. In a concurrent work [5], we demonstrate that the retrieval accuracy can be further improved by using nonlinear estimators of the similarities based on a 2-bit coding scheme.

1 Introduction

This paper focuses on the comparison of two quantization schemes for random projections in the context of sublinear time near neighbor search. The task of near neighbor search is to identify a set of data points which are "most similar" (in some measure of similarity) to a query data point. Efficient algorithms for near neighbor search have numerous applications in search, databases, machine learning, recommender systems, computer vision, etc. Developing efficient algorithms for finding near neighbors has been an active research topic since the early days of modern computing [2].
Near neighbor search with extremely high-dimensional data (e.g., texts or images) is still a challenging task and an active research problem.

Among many types of similarity measures, the (squared) Euclidean distance (denoted by d) and the correlation (denoted by ρ) are most commonly used. Without loss of generality, consider two high-dimensional data vectors u, v ∈ R^D. The squared Euclidean distance and correlation are defined as follows:

d = \sum_{i=1}^{D} |u_i - v_i|^2, \qquad \rho = \frac{\sum_{i=1}^{D} u_i v_i}{\sqrt{\sum_{i=1}^{D} u_i^2}\,\sqrt{\sum_{i=1}^{D} v_i^2}}    (1)

In practice, it appears that the correlation is more often used than the distance, partly because |ρ| is nicely normalized within 0 and 1. In this study, we will assume that the marginal l2 norms \sum_{i=1}^{D} |u_i|^2 and \sum_{i=1}^{D} |v_i|^2 are known. This is a reasonable assumption: computing the marginal l2 norms only requires scanning the data once, which is anyway needed during the data collection process. In machine learning practice, it is common to first normalize the data (to have unit l2 norm) before feeding the data to classification (e.g., SVM) or clustering (e.g., K-means) algorithms. For convenience, throughout this paper, we assume unit l2 norms, i.e.,

\rho = \frac{\sum_{i=1}^{D} u_i v_i}{\sqrt{\sum_{i=1}^{D} u_i^2}\,\sqrt{\sum_{i=1}^{D} v_i^2}} = \sum_{i=1}^{D} u_i v_i, \quad \text{where } \sum_{i=1}^{D} u_i^2 = \sum_{i=1}^{D} v_i^2 = 1    (2)

1.1 Random Projections

As an effective tool for dimensionality reduction, the idea of random projections is to multiply the data, e.g., u, v ∈ R^D, with a random normal projection matrix R ∈ R^{D×k} (where k ≪ D), to generate:

x = u \times R \in \mathbb{R}^k, \quad y = v \times R \in \mathbb{R}^k, \quad R = \{r_{ij}\},\; i = 1,\ldots,D,\; j = 1,\ldots,k, \quad r_{ij} \sim N(0,1) \text{ i.i.d.}    (3)

The method of random projections has become popular for large-scale machine learning applications such as classification, regression, matrix factorization, singular value decomposition, near neighbor search, etc.
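As an illustration, the projection step in Eq. (3) can be sketched in a few lines of Python. The dimensions D and k, the random seed, and the construction of the correlated pair u, v are all illustrative choices, not values from the paper:

```python
import math
import random

random.seed(42)

D, k = 1000, 100  # original and projected dimensions (illustrative values)

# Build two correlated vectors and normalize them to unit l2 norm, as in Eq. (2).
u = [random.gauss(0, 1) for _ in range(D)]
v = [ui + 0.5 * random.gauss(0, 1) for ui in u]
nu = math.sqrt(sum(t * t for t in u))
nv = math.sqrt(sum(t * t for t in v))
u = [t / nu for t in u]
v = [t / nv for t in v]
rho = sum(a * b for a, b in zip(u, v))  # correlation = inner product (Eq. (2))

# Random normal projection matrix R (Eq. (3)), r_ij ~ N(0, 1) i.i.d.: x = uR, y = vR.
R = [[random.gauss(0, 1) for _ in range(k)] for _ in range(D)]
x = [sum(u[i] * R[i][j] for i in range(D)) for j in range(k)]
y = [sum(v[i] * R[i][j] for i in range(D)) for j in range(k)]
```

Each coordinate pair (x_j, y_j) is then bivariate normal with correlation ρ, which is the property the coding schemes below exploit.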
The potential benefits of coding with a small number of bits arise because the (uncoded) projected data, x_j = \sum_{i=1}^{D} u_i r_{ij} and y_j = \sum_{i=1}^{D} v_i r_{ij}, being real-valued numbers, are neither convenient/economical for storage and transmission, nor well-suited for indexing. The focus of this paper is on approximate (sublinear time) near neighbor search in the framework of locality sensitive hashing [3]. In particular, we will compare two coding (quantization) schemes of random projections [1, 4] in the context of near neighbor search.

1.2 Uniform Quantization

The recent work [4] proposed an intuitive coding scheme, based on a simple uniform quantization:

h_w^{(j)}(u) = \lfloor x_j / w \rfloor, \qquad h_w^{(j)}(v) = \lfloor y_j / w \rfloor    (4)

where w > 0 is the bin width and \lfloor \cdot \rfloor is the standard floor operation. The following theorem is proved in [4] about the collision probability P_w = \Pr\left( h_w^{(j)}(u) = h_w^{(j)}(v) \right).

Theorem 1

P_w = \Pr\left( h_w^{(j)}(u) = h_w^{(j)}(v) \right) = 2 \sum_{i=0}^{\infty} \int_{iw}^{(i+1)w} \phi(z) \left[ \Phi\!\left( \frac{(i+1)w - \rho z}{\sqrt{1-\rho^2}} \right) - \Phi\!\left( \frac{iw - \rho z}{\sqrt{1-\rho^2}} \right) \right] dz    (5)

In addition, P_w is a monotonically increasing function of ρ.

The fact that P_w is a monotonically increasing function of ρ makes (4) a suitable coding scheme for approximate near neighbor search in the general framework of locality sensitive hashing (LSH).

1.3 Uniform Quantization with Random Offset

[1] proposed the following well-known coding scheme, which uses windows and a random offset:

h_{w,q}^{(j)}(u) = \left\lfloor \frac{x_j + q_j}{w} \right\rfloor, \qquad h_{w,q}^{(j)}(v) = \left\lfloor \frac{y_j + q_j}{w} \right\rfloor    (6)

where q_j \sim \text{uniform}(0, w). [1] showed that the collision probability can be written as

P_{w,q} = \Pr\left( h_{w,q}^{(j)}(u) = h_{w,q}^{(j)}(v) \right) = \int_0^w \frac{2}{\sqrt{d}}\, \phi\!\left( \frac{t}{\sqrt{d}} \right) \left( 1 - \frac{t}{w} \right) dt    (7)

where d = \|u - v\|^2 = 2(1 - \rho) is the squared Euclidean distance between u and v.
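The two hash functions, and a numerical evaluation of the collision probability P_w of Theorem 1, can be sketched as follows. This is a minimal illustration, not the authors' code: the truncation level `terms` and the midpoint-rule resolution `steps` are arbitrary accuracy knobs we introduce for the numerical integration of Eq. (5).

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def h_w(x, w):
    """Uniform quantization, Eq. (4)."""
    return math.floor(x / w)

def h_wq(x, q, w):
    """Uniform quantization with random offset q ~ uniform(0, w), Eq. (6)."""
    return math.floor((x + q) / w)

def P_w(rho, w, terms=100, steps=100):
    """Collision probability of h_w (Theorem 1, Eq. (5)), by midpoint-rule integration.

    The infinite sum is truncated at `terms` bins; phi(z) decays so fast that
    this is more than enough for moderate w.
    """
    s = math.sqrt(1 - rho * rho)
    total = 0.0
    for i in range(terms):
        a, b = i * w, (i + 1) * w
        h = (b - a) / steps
        for j in range(steps):
            z = a + (j + 0.5) * h  # midpoint of sub-interval
            total += h * phi(z) * (Phi((b - rho * z) / s) - Phi((a - rho * z) / s))
    return 2 * total
```

For example, P_w(0, w) tends to 0.5 as w grows, matching the limit discussed in Section 2, and P_w is increasing in ρ as Theorem 1 states.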
Compared with (6), the scheme (4) does not use the additional randomization with q ∼ uniform(0, w) (i.e., the offset). [4] elaborated the following advantages of (4) in the context of similarity estimation:

1. Operationally, h_w is simpler than h_{w,q}.
2. With a fixed w, h_w is always more accurate than h_{w,q}, often significantly so.
3. For each coding scheme, one can separately find the optimum bin width w. The optimized h_w is also more accurate than the optimized h_{w,q}, often significantly so.
4. h_w requires a smaller number of bits than h_{w,q}.

In this paper, we will compare h_{w,q} with h_w in the context of sublinear time near neighbor search.

1.4 Sublinear Time c-Approximate Near Neighbor Search

Consider a data vector u. Suppose there exists another vector whose Euclidean distance (√d) from u is at most √d_0 (the target distance). The goal of c-approximate √d_0-near neighbor algorithms is to return data vectors (with high probability) whose Euclidean distances from u are at most c × √d_0 with c > 1.

Recall that, in our definition, d = 2(1 − ρ) is the squared Euclidean distance. To be consistent with [1], we present the results in terms of √d. Corresponding to the target distance √d_0, the target similarity ρ_0 can be computed from d_0 = 2(1 − ρ_0), i.e., ρ_0 = 1 − d_0/2. To simplify the presentation, we focus on ρ ≥ 0 (as is common in practice), i.e., 0 ≤ d ≤ 2. Once we fix a target similarity ρ_0, c cannot exceed a certain value:

c \sqrt{2(1-\rho_0)} \le \sqrt{2} \implies c \le \sqrt{\frac{1}{1-\rho_0}}    (8)

For example, when ρ_0 = 0.5, we must have 1 ≤ c ≤ √2.
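The conversions between d and ρ, and the bound (8), are simple enough to state as code; the function names here are ours, introduced only for illustration:

```python
import math

def d_from_rho(rho):
    """Squared Euclidean distance between unit-norm vectors: d = 2(1 - rho)."""
    return 2 * (1 - rho)

def rho_from_d(d):
    """Target similarity from target squared distance: rho_0 = 1 - d_0 / 2."""
    return 1 - d / 2

def c_max(rho0):
    """Largest admissible approximation factor at target similarity rho_0, Eq. (8)."""
    return math.sqrt(1 / (1 - rho0))
```

At ρ_0 = 0.5 this gives c_max = √2, reproducing the example above.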
Under the general framework, the performance of an LSH algorithm largely depends on the difference (gap) between the two collision probabilities P^{(1)} and P^{(2)} (respectively corresponding to √d_0 and c√d_0):

P_w^{(1)} = \Pr\left( h_w(u) = h_w(v) \right) \text{ when } d = \|u - v\|_2^2 = d_0    (9)

P_w^{(2)} = \Pr\left( h_w(u) = h_w(v) \right) \text{ when } d = \|u - v\|_2^2 = c^2 d_0    (10)

Corresponding to h_{w,q}, the collision probabilities P_{w,q}^{(1)} and P_{w,q}^{(2)} are analogously defined. A larger difference between P^{(1)} and P^{(2)} implies a more efficient LSH algorithm. The following "G" values (G_w for h_w and G_{w,q} for h_{w,q}) characterize the gaps:

G_w = \frac{\log 1/P_w^{(1)}}{\log 1/P_w^{(2)}}, \qquad G_{w,q} = \frac{\log 1/P_{w,q}^{(1)}}{\log 1/P_{w,q}^{(2)}}    (11)

A smaller G (i.e., a larger difference between P^{(1)} and P^{(2)}) leads to a potentially more efficient LSH algorithm, and G < 1/c is particularly desirable [3]. The general theory says the query time for c-approximate d_0-near neighbor search is dominated by O(N^G) distance evaluations, where N is the total number of data vectors in the collection. This is better than O(N), the cost of a linear scan.

2 Comparison of the Collision Probabilities

To help understand the intuition why h_w may lead to better performance than h_{w,q}, in this section we examine their collision probabilities P_w and P_{w,q}, which can be expressed in terms of the standard normal pdf and cdf functions, \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} and \Phi(x) = \int_{-\infty}^{x} \phi(z)\, dz:

P_{w,q} = \Pr\left( h_{w,q}^{(j)}(u) = h_{w,q}^{(j)}(v) \right) = 2\Phi\!\left( \frac{w}{\sqrt{d}} \right) - 1 - \frac{2}{\sqrt{2\pi}\, w/\sqrt{d}} + \frac{2}{w/\sqrt{d}}\, \phi\!\left( \frac{w}{\sqrt{d}} \right)    (12)

P_w = \Pr\left( h_w^{(j)}(u) = h_w^{(j)}(v) \right) = 2 \sum_{i=0}^{\infty} \int_{iw}^{(i+1)w} \phi(z) \left[ \Phi\!\left( \frac{(i+1)w - \rho z}{\sqrt{1-\rho^2}} \right) - \Phi\!\left( \frac{iw - \rho z}{\sqrt{1-\rho^2}} \right) \right] dz    (13)

It is clear that P_{w,q} → 1 as w → ∞. Figure 1 plots both P_w and P_{w,q} for selected ρ values. The difference between P_w and P_{w,q} becomes apparent when w is not small.
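As a sketch, the gap G_{w,q} of Eq. (11) can be computed directly from the closed form (12). The parameter values below (ρ_0, c, w) are illustrative choices, not recommendations from the paper:

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def P_wq(d, w):
    """Closed-form collision probability of h_{w,q}, Eq. (12); d is the squared distance."""
    r = w / math.sqrt(d)
    return 2 * Phi(r) - 1 - 2 / (math.sqrt(2 * math.pi) * r) + (2 / r) * phi(r)

def gap(p1, p2):
    """G = log(1/P^(1)) / log(1/P^(2)), Eq. (11); smaller means a better LSH scheme."""
    return math.log(1 / p1) / math.log(1 / p2)

# Example: gap of h_{w,q} at target similarity rho_0 = 0.9 and c = 1.5.
rho0, c, w = 0.9, 1.5, 1.0
d0 = 2 * (1 - rho0)
G_wq = gap(P_wq(d0, w), P_wq(c * c * d0, w))
```

Since P^{(1)} > P^{(2)}, the gap always lies in (0, 1); sweeping w in this sketch reproduces the qualitative shape of the G_{w,q} curves in Figures 3 to 10.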
For example, when ρ = 0, P_w quickly approaches the limit 0.5 while P_{w,q} keeps increasing (to 1) as w increases. Intuitively, the fact that P_{w,q} → 1 when ρ = 0 is undesirable, because it means two orthogonal vectors will have the same coded value. Thus, it is not surprising that h_w will have better performance than h_{w,q}, for both similarity estimation and sublinear time near neighbor search.

Figure 1: Collision probabilities, P_w and P_{w,q}, for ρ = 0, 0.25, 0.5, 0.75, 0.9, and 0.99. The scheme h_w has smaller collision probabilities than the scheme h_{w,q} [1], especially when w > 2.

3 Theoretical Comparison of the Gaps

Figure 2 compares G_w with G_{w,q} at their "optimum" w values, as functions of c, for a wide range of target similarity ρ_0 levels. Basically, at each c and ρ_0, we choose the w to minimize G_w and the w to minimize G_{w,q}. This figure illustrates that G_w is smaller than G_{w,q}, noticeably so in the low similarity region.

Figure 3, Figure 4, Figure 5, and Figure 6 present G_w and G_{w,q} as functions of w, for ρ_0 = 0.99, ρ_0 = 0.95, ρ_0 = 0.9, and ρ_0 = 0.5, respectively. In each figure, we plot the curves for a wide range of c values. These figures illustrate where the optimum w values are obtained. Clearly, in the high similarity region, the smallest G values are obtained at low w values, especially at small c.
In the low (or moderate) similarity region, the smallest G values are usually attained at relatively large w. In practice, we normally have to pre-specify a w for all c and ρ_0 values. In other words, the "optimum" G values presented in Figure 2 are in general not attainable. Therefore, Figure 7, Figure 8, Figure 9, and Figure 10 present G_w and G_{w,q} as functions of c, for ρ_0 = 0.99, ρ_0 = 0.95, ρ_0 = 0.9, and ρ_0 = 0.5, respectively. In each figure, we plot the curves for a wide range of w values. These figures again confirm that G_w is smaller than G_{w,q}.

Figure 2: Comparison of the optimum gaps (the smaller the better) for h_w and h_{w,q}, for ρ_0 from 0.1 to 0.99. For each ρ_0 and c, we find the smallest gaps individually for h_w and h_{w,q}, over the entire range of w. We can see that for all target similarity levels ρ_0, both h_{w,q} and h_w exhibit better performance than 1/c.
h_w always has a smaller gap than h_{w,q}, although in the high similarity region both schemes perform similarly.

Figure 3: The gaps G_w and G_{w,q} as functions of w, for ρ_0 = 0.99. In each panel, we plot both G_w and G_{w,q} for a particular c value. The plots illustrate where the optimum w values are obtained.
Figure 4: The gaps G_w and G_{w,q} as functions of w, for ρ_0 = 0.95 and a range of c values.

Figure 5: The gaps G_w and G_{w,q} as functions of w, for ρ_0 = 0.9 and a range of c values.
Figure 6: The gaps G_w and G_{w,q} as functions of w, for ρ_0 = 0.5 and a range of c values.

Figure 7: The gaps G_w and G_{w,q} as functions of c, for ρ_0 = 0.99. In each panel, we plot both G_w and G_{w,q} for a particular w value.
Figure 8: The gaps G_w and G_{w,q} as functions of c, for ρ_0 = 0.95. In each panel, we plot both G_w and G_{w,q} for a particular w value.
Figure 9: The gaps G_w and G_{w,q} as functions of c, for ρ_0 = 0.9. In each panel, we plot both G_w and G_{w,q} for a particular w value.
Figure 10: The gaps G_w and G_{w,q} as functions of c, for ρ_0 = 0.5. In each panel, we plot both G_w and G_{w,q} for a particular w value.

4 Optimal Gaps

To view the optimal gaps more clearly, Figure 11 and Figure 12 plot the best gaps (left panels) and the optimal w values (right panels) at which the best gaps are attained, for selected values of c and the entire range of ρ. The results can be summarized as follows:

• At any ρ and c, the optimal gap G_{w,q} is always at least as large as the optimal gap G_w. At relatively low similarities, the optimal G_{w,q} can be substantially larger than the optimal G_w.

• When the target similarity level ρ is high (e.g., ρ > 0.85), for both schemes h_w and h_{w,q}, the optimal w values are relatively low, for example, w = 1 ∼ 1.5 when 0.85 < ρ < 0.9. In this region, both h_{w,q} and h_w behave similarly.
• When the target similarity level ρ is not so high, for h_w, it is best to use a large value of w, in particular w ≥ 2 ∼ 3. In comparison, for h_{w,q}, the optimal w values grow smoothly with decreasing ρ.

These plots again confirm the previous comparisons: (i) we should always replace h_{w,q} with h_w; (ii) if we use h_w and target very high similarity, a good choice of w might be w = 1 ∼ 1.5; (iii) if we use h_w and the target similarity is not too high, then we can safely use w = 2 ∼ 3. We should also mention that, although the optimal w values for h_w appear to exhibit a "jump" in the right panels of Figure 11 and Figure 12, the choice of w does not influence the performance much, as shown in previous plots. In Figures 3 to 6, we have seen that even when the optimal w appears to approach "∞", the actual gaps do not differ much between w = 3 and w ≫ 3. In the real-data evaluations in the next section, we will see the same phenomenon for h_w.

Note that the Gaussian density decays rapidly at the tail; for example, 1 − Φ(6) = 9.9 × 10^{−10}. If we choose w = 1.5, or 2, or 3, then we just need a small number of bits to code each hashed value.

Figure 11: Left panels: the optimal (smallest) gaps at given c values (c = 1.05, 1.1, 1.3) and the entire range of ρ. We can see that G_{w,q} is always larger than G_w, confirming that it is better to use h_w instead of h_{w,q}. Right panels: the optimal values of w at which the optimal gaps are attained.
When the target similarity ρ is very high, it is best to use a relatively small w. When the target similarity is not that high, if we use h_w, it is best to use w > 3.

Figure 12: Left panels: the optimal (smallest) gaps at given c values (c = 1.5, 1.7, 2) and the entire range of ρ. We can see that G_{w,q} is always larger than G_w, confirming that it is better to use h_w instead of h_{w,q}. Right panels: the optimal values of w at which the optimal gaps are attained. When the target similarity ρ is very high, it is best to use a relatively small w. When the target similarity is not that high, if we use h_w, it is best to use w > 3.

5 An Experimental Study

Two datasets, Peekaboom and Youtube, are used in our experiments for validating the theoretical results. Peekaboom is a standard image retrieval dataset, which is divided into two subsets, one with 1998 data points and another with 55599 data points. We use the larger subset for building hash tables and the smaller subset for query data points. The reported experimental results are averaged over all query data points. Available in the UCI repository, Youtube is a multi-view dataset. For simplicity, we only use the largest set of audio features. The original training set, with 97934 data points, is used for building hash tables. 5000 data points, randomly selected from the original test set, are used as query data points.

We use the standard (K, L)-LSH implementation [3].
We generate K × L independent hash functions h_{i,j}, i = 1 to K, j = 1 to L. For each hash table j, j = 1 to L, we concatenate K hash functions ⟨h_{1,j}, h_{2,j}, h_{3,j}, ..., h_{K,j}⟩. For each data point, we compute the K hash values and place the data point (in fact, a pointer to it) into the appropriate bucket of hash table j. In the query phase, we compute the hash values of the query data point using the same hash functions to find the bucket to which the query data point belongs, and search for near neighbors only among the data points in that bucket of hash table j. We repeat the process for each hash table, and the final set of retrieved data points is the union of the retrieved data points from all the hash tables.

Ideally, the number of retrieved data points will be substantially smaller than the total number of data points. We use the term fraction retrieved to indicate the ratio of the number of retrieved data points over the total number of data points. A smaller value of fraction retrieved is more desirable.

To thoroughly evaluate the two coding schemes, we conduct extensive experiments on the two datasets, using many combinations of K (from 3 to 40) and L (from 1 to 200). At each choice of (K, L), we vary w from 0.5 to 5. Thus, the total number of combinations is large, and the experiments are very time-consuming.

There are many ways to evaluate the performance of an LSH scheme. We could specify a threshold of similarity and only count the retrieved data points whose (exact) similarity is above the threshold as "true positives". To avoid specifying a threshold, and considering the fact that in practice people often would like to retrieve the top-T nearest neighbors, we take a simple approach by computing the recall based on the top-T neighbors.
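The (K, L) table-building and query procedure above can be sketched as follows, here using the uniform quantization h_w. This is an illustrative sketch, not the actual experimental code: the parameter values, the in-memory data layout, and the use of integer indices as "pointers" are all our choices.

```python
import math
import random
from collections import defaultdict

random.seed(0)
K, L, w, D = 4, 8, 1.5, 50  # illustrative parameter choices

# K*L independent hash functions h_{i,j}(u) = floor(<u, r_{i,j}> / w), each defined
# by its own random normal projection direction r_{i,j}.
proj = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(K)] for _ in range(L)]

def code(u, j):
    """Concatenated code <h_{1,j}, ..., h_{K,j}> used to index hash table j."""
    return tuple(math.floor(sum(a * b for a, b in zip(u, r)) / w) for r in proj[j])

tables = [defaultdict(list) for _ in range(L)]

def insert(idx, u):
    """Place a pointer (here, an index) to u into one bucket per hash table."""
    for j in range(L):
        tables[j][code(u, j)].append(idx)

def query(q):
    """Union, over all L tables, of the bucket that q falls into."""
    out = set()
    for j in range(L):
        out.update(tables[j].get(code(q, j), ()))
    return out
```

A query point always collides with itself, and near neighbors collide with probability roughly (P_w)^K per table, which is the standard (K, L) amplification.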
For example, suppose the number of retrieved data points is 120, among which 70 data points belong to the top-T. Then the recall value would be 70/T = 70% if T = 100. Ideally, we hope the recalls will be as high as possible, and at the same time we hope to keep the fraction retrieved as low as possible.

Figure 13 presents the results on Youtube for T = 100 and target recalls from 0.1 to 0.99. In every panel, we set a target recall threshold. At every bin width w, we find the smallest fraction retrieved over a wide range of LSH parameters, K and L. Note that, if the target recall is high (e.g., 0.95), we basically have to effectively lower the target threshold ρ, so that we do not have to go down the re-ranked list too far. The plots show that, for high target recalls, we need to use relatively large w (e.g., w ≥ 2 ∼ 3), and for low target recalls, we should use a relatively small w (e.g., w = 1.5).

Figures 14 to 18 present similar results on the Youtube dataset for T = 50, 20, 10, 5, 3. We only include plots with relatively high recalls, which are often more useful in practice. Figures 19 to 24 present the results on the Peekaboom dataset, which are essentially very similar to the results on the Youtube dataset.

These plots confirm the previous theoretical analysis: (i) it is essentially always better to use h_w instead of h_{w,q}, i.e., the random offset is not needed; (ii) when using h_w and the target recall is high (which essentially means the target similarity is low), it is better to use a relatively large w (e.g., w = 2 ∼ 3); (iii) when using h_w and the target recall is low, it is better to use a smaller w (e.g., w = 1.5); (iv) when using h_w, the influence of w is not large as long as it is in a reasonable range, which is important in practice.
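The two evaluation metrics can be illustrated with the paper's own numbers; the index sets below are hypothetical, chosen only to reproduce the counts of 120 retrieved points and 70 true top-T hits.

```python
T, N = 100, 97934                    # top-T and the Youtube table size
retrieved = set(range(120))          # ids of retrieved points (hypothetical)
top_T = set(range(50, 150))          # ids of the true top-T neighbors, |top_T| = T

recall = len(retrieved & top_T) / T            # 70 / 100 = 0.7
fraction_retrieved = len(retrieved) / N        # 120 points out of N in the table
```

A good (K, L, w) configuration pushes recall up while keeping fraction_retrieved small, which is exactly the trade-off plotted in Figures 13 to 24.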
[Figures 13–23 omitted: in each figure, the panels plot the optimal fraction retrieved against the bin width w at various target recall values, for both coding schemes h_w and h_{w,q}. Lower is better.]
Figure 13: Youtube Top 100 (target recalls from 0.1 to 0.99).
Figure 14: Youtube Top 50.
Figure 15: Youtube Top 20.
Figure 16: Youtube Top 10.
Figure 17: Youtube Top 5.
Figure 18: Youtube Top 3.
Figure 19: Peekaboom Top 100.
Figure 20: Peekaboom Top 50.
Figure 21: Peekaboom Top 20.
Figure 22: Peekaboom Top 10.
Figure 23: Peekaboom Top 5.
[Figure 24 omitted: fraction retrieved against w on Peekaboom, top-3, target recalls from 0.6 to 0.99.]
Figure 24: Peekaboom Top 3. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-3) with respect to w for both coding schemes h_w and h_{w,q}.

6 Conclusion

We have compared two quantization (coding) schemes for random projections in the context of sub-linear time approximate near neighbor search. The recently proposed scheme based on uniform quantization [4] is simpler than the influential prior work [1] (which used uniform quantization with a random offset). Our analysis confirms that, under the general theory of LSH, the new scheme [4] is simpler and more accurate than [1]. In other words, the step of random offset in [1] is not needed and may hurt the performance. Our analysis provides practical guidelines for using the proposed coding scheme to build hash tables. Our recommendation is to use a bin width of about w = 1.5 when the target similarity is high and about w = 3 when the target similarity is not that high.
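For concreteness, the two coding schemes compared throughout can be sketched as follows: h_w applies uniform quantization to the projected value, while h_{w,q} of [1] adds a uniformly random offset before quantizing. This is a minimal illustration with our own function names, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def h_w(x, a, w):
    """Uniform quantization of the projection <a, x> with bin width w [4]."""
    return int(np.floor(a @ x / w))

def h_wq(x, a, q, w):
    """Uniform quantization with a random offset q, drawn once per hash
    function as q ~ Uniform(0, w) [1]."""
    return int(np.floor((a @ x + q) / w))

# Usage: for each hash function, draw a ~ N(0, I) and (for h_wq) q ~ U(0, w).
d = 8
a = rng.standard_normal(d)
q = rng.uniform(0, 1.5)
```

Our recommendation above corresponds to calling h_w with w ≈ 1.5 for high target similarity and w ≈ 3 otherwise.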
In addition, when using the proposed coding scheme based on uniform quantization (without the random offset), the performance is not very sensitive to the choice of w, which makes the scheme very convenient in practical applications.

References

[1] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253–262, Brooklyn, NY, 2004.
[2] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975.
[3] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998.
[4] Ping Li, Michael Mitzenmacher, and Anshumali Shrivastava. Coding for random projections. Technical report, arXiv:1308.2218, 2013.
[5] Ping Li, Michael Mitzenmacher, and Anshumali Shrivastava. Coding for random projections and non-linear estimators. Technical report, 2014.