Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS)


Authors: Anshumali Shrivastava, Ping Li

Anshumali Shrivastava
Department of Computer Science, Computing and Information Science
Cornell University, Ithaca, NY 14853, USA
anshu@cs.cornell.edu

Ping Li
Department of Statistics and Biostatistics, Department of Computer Science
Rutgers University, Piscataway, NJ 08854, USA
pingli@stat.rutgers.edu

Abstract

Recently it was shown that the problem of Maximum Inner Product Search (MIPS) is efficient and admits provably sub-linear hashing algorithms. Asymmetric transformations before hashing were the key to solving MIPS, which was otherwise hard. In [18], the authors use asymmetric transformations which convert the problem of approximate MIPS into the problem of approximate near neighbor search, which can be efficiently solved using hashing. In this work, we provide a different transformation which converts the problem of approximate MIPS into the problem of approximate cosine similarity search, which can be efficiently solved using signed random projections. Theoretical analysis shows that the new scheme is significantly better than the original scheme for MIPS. Experimental evaluations strongly support the theoretical findings.

1 Introduction

In this paper, we revisit the problem of Maximum Inner Product Search (MIPS), which was studied in a recent technical report [18]. In that report the authors present the first provably fast algorithm for MIPS, which was previously considered hard [16, 11]. Given an input query point q ∈ R^D, the task of MIPS is to find p ∈ S, where S is a giant collection of size N, which maximizes (approximately) the inner product q^T p:

    p = arg max_{x ∈ S} q^T x    (1)

The MIPS problem is related to the problem of near neighbor search (NNS).
For example, L2-NNS:

    p = arg min_{x ∈ S} ||q − x||_2^2 = arg min_{x ∈ S} ( ||x||_2^2 − 2 q^T x )    (2)

or correlation-NNS:

    p = arg max_{x ∈ S} q^T x / (||q|| ||x||) = arg max_{x ∈ S} q^T x / ||x||    (3)

Submitted to AISTATS 2015.

These three problems are equivalent if the norm of every element x ∈ S is constant. Clearly, the value of the norm ||q||_2 has no effect on the arg max. In many scenarios, however, MIPS arises naturally in settings where the norms of the elements in S have significant variations [11]. As reviewed in [18], examples of applications of MIPS include recommender systems [12, 2, 11], large-scale object detection with DPM [6, 4, 10], structural SVM [4], and multi-class label prediction [16, 11, 19].

Asymmetric LSH (ALSH): Locality Sensitive Hashing (LSH) [9] is popular in practice for efficiently solving NNS. In the prior work [18], the concept of "asymmetric LSH" (ALSH) was proposed: one can transform the input query by Q(q) and the data in the collection by P(x) independently, where the transformations Q and P are different. [18] developed a particular set of transformations to convert MIPS into L2-NNS and then solved the problem by the standard L2-hash [3]. In this paper, we refer to the scheme of [18] as L2-ALSH. Asymmetry in hashing has become popular recently, and it has been applied to hashing higher-order similarity [17], data-dependent hashing [15], sketching [5], etc.

Our contribution: In this study, we propose another scheme for ALSH, by developing a new set of asymmetric transformations which convert MIPS into a problem of correlation-NNS, which is then solved by "sign random projections" [8, 1]. We name this new scheme Sign-ALSH. Our theoretical analysis and experimental study show that Sign-ALSH is more advantageous than L2-ALSH for MIPS.
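As a toy illustration (ours, not from the paper) of why Eqs. (1)-(3) are not interchangeable: with varying norms, the MIPS answer need not match the L2-NNS or correlation-NNS answers, while with constant norms all three coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal(5)
q /= np.linalg.norm(q)

S = rng.standard_normal((100, 5))
S *= rng.uniform(0.1, 10.0, size=(100, 1))        # significant norm variation

mips = np.argmax(S @ q)                                   # Eq. (1)
l2nn = np.argmin(np.linalg.norm(S - q, axis=1))           # Eq. (2)
corr = np.argmax((S @ q) / np.linalg.norm(S, axis=1))     # Eq. (3)
print(mips, l2nn, corr)   # typically three different items

# With every norm fixed to 1, the three objectives agree.
Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
assert np.argmax(Sn @ q) == np.argmin(np.linalg.norm(Sn - q, axis=1))
assert np.argmax(Sn @ q) == np.argmax((Sn @ q) / np.linalg.norm(Sn, axis=1))
```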
2 Review: Locality Sensitive Hashing (LSH)

The problem of efficiently finding nearest neighbors has been an active research topic since the very early days of computer science [7]. Approximate versions of the near neighbor search problem [9] were proposed to break the linear query time bottleneck. The following formulation of approximate near neighbor search is often adopted.

Definition: (c-Approximate Near Neighbor or c-NN) Given a set of points in a D-dimensional space R^D, and parameters S_0 > 0, δ > 0, construct a data structure which, given any query point q, does the following with probability 1 − δ: if there exists an S_0-near neighbor of q in S, it reports some cS_0-near neighbor of q in S.

Locality Sensitive Hashing (LSH) [9] is a family of functions with the property that more similar items have a higher collision probability. LSH trades off query time against extra (one-time) preprocessing cost and space. The existence of an LSH family translates into a provably sublinear query time algorithm for c-NN problems.

Definition: (Locality Sensitive Hashing (LSH)) A family H is called (S_0, cS_0, p_1, p_2)-sensitive if, for any two points x, y ∈ R^D, h chosen uniformly from H satisfies:

• if Sim(x, y) ≥ S_0 then Pr_H(h(x) = h(y)) ≥ p_1
• if Sim(x, y) ≤ cS_0 then Pr_H(h(x) = h(y)) ≤ p_2

For an efficient approximate nearest neighbor search, p_1 > p_2 and c < 1 are needed.

Fact 1: Given a family of (S_0, cS_0, p_1, p_2)-sensitive hash functions, one can construct a data structure for c-NN with O(n^ρ log n) query time and space O(n^{1+ρ}), where ρ = log p_1 / log p_2 < 1.

LSH is a generic framework; an implementation of LSH requires a concrete hash function.

2.1 LSH for L2 distance

[3] presented an LSH family for L2 distances. Formally, given a fixed window size r, we sample a random vector a with each component from i.i.d.
normal, i.e., a_i ∼ N(0, 1), and a scalar b generated uniformly at random from [0, r]. The hash function is defined as:

    h^{L2}_{a,b}(x) = ⌊ (a^T x + b) / r ⌋    (4)

where ⌊·⌋ is the floor operation. The collision probability under this scheme can be shown to be

    Pr( h^{L2}_{a,b}(x) = h^{L2}_{a,b}(y) ) = 1 − 2Φ(−r/d) − (2 / (sqrt(2π) (r/d))) ( 1 − e^{−(r/d)^2 / 2} )    (5)

where Φ(x) = ∫_{−∞}^{x} (1/sqrt(2π)) e^{−x^2/2} dx and d = ||x − y||_2 is the Euclidean distance between the vectors x and y.

2.2 LSH for correlation

Another popular LSH family is the so-called "sign random projections" [8, 1]. Again, we choose a random vector a with a_i ∼ N(0, 1). The hash function is defined as:

    h^{Sign}(x) = sign(a^T x)    (6)

and the collision probability is

    Pr( h^{Sign}(x) = h^{Sign}(y) ) = 1 − (1/π) cos^{−1}( x^T y / (||x|| ||y||) )    (7)

This hashing scheme is also popularly known as signed random projections (SRP).

3 Review of ALSH for MIPS and L2-ALSH

In [18], it was shown that the framework of locality sensitive hashing is restrictive for solving MIPS. The inherent assumption of the same hash function for both the data transformation and the query was unnecessary in the classical LSH framework, and it was the main hurdle in finding provable sub-linear algorithms for MIPS with LSH. For the theoretical guarantees of LSH to hold, there was no requirement of symmetry. Incorporating asymmetry in the hashing schemes was the key to solving MIPS efficiently.
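For concreteness, the two LSH families reviewed in Section 2 can be sketched as follows (our code, not the authors'; `h_l2` and `h_sign` are hypothetical helper names), together with an empirical check of the SRP collision probability of Eq. (7):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 50

def h_l2(x, a, b, r):
    """Eq. (4): floor((a^T x + b) / r), with a ~ N(0, I), b ~ Uniform[0, r]."""
    return int(np.floor((a @ x + b) / r))

def h_sign(x, a):
    """Eq. (6): sign(a^T x), with a ~ N(0, I)."""
    return 1 if a @ x >= 0 else -1

x = rng.standard_normal(D)
y = x + 0.3 * rng.standard_normal(D)   # a nearby point

r = 2.0
a0, b0 = rng.standard_normal(D), rng.uniform(0, r)
print(h_l2(x, a0, b0, r), h_sign(x, a0))

# Empirical SRP collision rate vs. the closed form 1 - theta/pi of Eq. (7).
A = rng.standard_normal((20000, D))
collisions = np.mean(np.sign(A @ x) == np.sign(A @ y))
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(collisions, 1 - theta / np.pi)   # the two numbers should be close
```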
Definition [18]: (Asymmetric Locality Sensitive Hashing (ALSH)) A family H, along with two vector functions Q : R^D ↦ R^{D′} (Query Transformation) and P : R^D ↦ R^{D′} (Preprocessing Transformation), is called (S_0, cS_0, p_1, p_2)-sensitive if, for a given c-NN instance with query q, the hash function h chosen uniformly from H satisfies the following:

• if Sim(q, x) ≥ S_0 then Pr_H( h(Q(q)) = h(P(x)) ) ≥ p_1
• if Sim(q, x) ≤ cS_0 then Pr_H( h(Q(q)) = h(P(x)) ) ≤ p_2

Here x is any point in the collection S. Note that the query transformation Q is applied only to the query, while the preprocessing transformation P is applied to x ∈ S when creating the hash tables. By letting Q(x) = P(x) = x, we recover vanilla LSH. Using different transformations (i.e., Q ≠ P), it is possible to counter the fact that self similarity is not the highest under inner products, which is the main argument for the failure of LSH. We just need the probability of the new collision event {h(Q(q)) = h(P(y))} to satisfy the conditions of the definition of ALSH for Sim(q, y) = q^T y.

Theorem 1 [18] Given a family of hash functions H and the associated query and preprocessing transformations Q and P, which is (S_0, cS_0, p_1, p_2)-sensitive, one can construct a data structure for c-NN with O(n^ρ log n) query time and space O(n^{1+ρ}), where ρ = log p_1 / log p_2.

[18] also provided an explicit construction of ALSH, which we call L2-ALSH. Without loss of generality, one can always assume

    ||x_i||_2 ≤ U < 1, ∀ x_i ∈ S    (8)

for some U < 1. If this is not the case, then we can always scale down the norms without altering the arg max. Since the norm of the query does not affect the arg max in MIPS, for simplicity it was assumed that ||q||_2 = 1. This condition can be removed easily (see Section 5 for details).
In L2-ALSH, two vector transformations P : R^D ↦ R^{D+m} and Q : R^D ↦ R^{D+m} are defined as follows:

    P(x) = [x; ||x||_2^2; ||x||_2^4; ...; ||x||_2^{2^m}]    (9)
    Q(x) = [x; 1/2; 1/2; ...; 1/2],    (10)

where [;] denotes concatenation. P(x) appends m scalars of the form ||x||_2^{2^i} at the end of the vector x, while Q(x) simply appends m "1/2"s to the end of the vector x. By observing

    ||P(x_i)||_2^2 = ||x_i||_2^2 + ||x_i||_2^4 + ... + ||x_i||_2^{2^m} + ||x_i||_2^{2^{m+1}}
    ||Q(q)||_2^2 = ||q||_2^2 + m/4 = 1 + m/4
    Q(q)^T P(x_i) = q^T x_i + (1/2)( ||x_i||_2^2 + ||x_i||_2^4 + ... + ||x_i||_2^{2^m} )

one can obtain the following key equality:

    ||Q(q) − P(x_i)||_2^2 = (1 + m/4) − 2 q^T x_i + ||x_i||_2^{2^{m+1}}    (11)

Since ||x_i||_2 ≤ U < 1, we have ||x_i||_2^{2^{m+1}} → 0 at the tower rate (exponential to exponential). Thus, as long as m is not too small (e.g., m ≥ 3 would suffice), we have

    arg max_{x ∈ S} q^T x ≃ arg min_{x ∈ S} ||Q(q) − P(x)||_2    (12)

This scheme is the first connection between solving un-normalized MIPS and approximate near neighbor search. The transformations P and Q, when norms are less than 1, provide a correction to the L2 distance ||Q(q) − P(x_i)||_2, making it rank-correlate with the (un-normalized) inner product. The general idea of ALSH was partially inspired by the work on three-way similarity search [17], where different hashing functions were applied for handling the query and the data in the repository.

3.1 Intuition for the Better Scheme

Asymmetric transformations give us enough flexibility to modify norms without changing inner products. The transformation provided in [18] used this flexibility to convert MIPS into standard near neighbor search in L2 space, for which we have standard hash functions. Signed random projections are popular hash functions widely adopted for correlation or cosine similarity.
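Before moving to the new scheme, the L2-ALSH construction above can be checked numerically; a minimal sketch (ours, with helper names `P` and `Q` following the text) verifying the key equality (11):

```python
import numpy as np

rng = np.random.default_rng(2)
D, m = 10, 3

q = rng.standard_normal(D); q /= np.linalg.norm(q)          # ||q||_2 = 1
x = rng.standard_normal(D); x *= 0.8 / np.linalg.norm(x)    # ||x||_2 = 0.8 <= U < 1

def P(v, m):
    n = np.linalg.norm(v)
    # append ||v||^2, ||v||^4, ..., ||v||^(2^m), as in Eq. (9)
    return np.concatenate([v, [n ** (2 ** (i + 1)) for i in range(m)]])

def Q(v, m):
    # append m copies of 1/2, as in Eq. (10)
    return np.concatenate([v, np.full(m, 0.5)])

lhs = np.linalg.norm(Q(q, m) - P(x, m)) ** 2
rhs = (1 + m / 4) - 2 * (q @ x) + np.linalg.norm(x) ** (2 ** (m + 1))
print(lhs, rhs)   # equal up to floating-point error
```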
We use an asymmetric transformation to convert approximate MIPS into approximate maximum correlation search. The transformations and the collision probability of the hashing functions determine the efficiency of the obtained ALSH algorithm. We show that the new transformation with SRP is better suited for ALSH than the existing L2-ALSH. Note that in the recent work on coding for random projections [13, 14], it was already shown that sign random projections (or 2-bit random projections) can outperform L2LSH.

4 The New Proposal: Sign-ALSH

4.1 From MIPS to Correlation-NNS

We assume for simplicity that ||q||_2 = 1, as the norm of the query does not change the ordering; we show in the next section how to remove this assumption. Without loss of generality, let ||x_i||_2 ≤ U < 1, ∀ x_i ∈ S, as this can always be achieved by scaling the data by a large enough number.

Figure 1: Optimal values of ρ* (lower is better) with respect to the approximation ratio c for different S_0, obtained by a grid search over the parameters U and m, given S_0 and c. The two panels show different S_0 values (S_0 = 0.5U, 0.1U and S_0 = 0.9U, 0.5U). The curves show that Sign-ALSH (solid) is noticeably better than L2-ALSH (dashed) in terms of their optimal ρ* values. The results for L2-ALSH are from the prior work [18]. For clarity, the results are shown in two panels.

We define two vector transformations P : R^D ↦ R^{D+m} and Q : R^D ↦ R^{D+m} as follows:

    P(x) = [x; 1/2 − ||x||_2^2; 1/2 − ||x||_2^4; ...; 1/2 − ||x||_2^{2^m}]    (13)
    Q(x) = [x; 0; 0; ...; 0],    (14)

Using ||Q(q)||_2^2 = ||q||_2^2 = 1, Q(q)^T P(x_i) = q^T x_i, and

    ||P(x_i)||_2^2 = ||x_i||_2^2 + 1/4 + ||x_i||_2^4 − ||x_i||_2^2 + 1/4 + ||x_i||_2^8 − ||x_i||_2^4 + ... + 1/4 + ||x_i||_2^{2^{m+1}} − ||x_i||_2^{2^m}
                   = m/4 + ||x_i||_2^{2^{m+1}}

we obtain the following key equality:

    Q(q)^T P(x_i) / ( ||Q(q)||_2 ||P(x_i)||_2 ) = q^T x_i / sqrt( m/4 + ||x_i||_2^{2^{m+1}} )    (15)

The term ||x_i||_2^{2^{m+1}} again vanishes at the tower rate. This means we have, approximately,

    arg max_{x ∈ S} q^T x ≃ arg max_{x ∈ S} Q(q)^T P(x) / ( ||Q(q)||_2 ||P(x)||_2 )    (16)

Figure 2: The solid curves are the optimal ρ values of Sign-ALSH from Figure 1. The dashed curves represent the ρ values for fixed parameters: m = 2 and U = 0.75 (left panel), and m = 3 and U = 0.85 (right panel). Even with fixed parameters, ρ does not degrade much.

This provides another solution for solving MIPS using known methods for approximate correlation-NNS.

4.2 Fast Algorithms for MIPS Using Sign Random Projections

Eq. (16) shows that MIPS reduces to the standard approximate near neighbor search problem, which can be efficiently solved by sign random projections, i.e., h^{Sign} (defined by Eq. (6)). Formally, we can state the following theorem.

Theorem 2 Given a c-approximate instance of MIPS, i.e., Sim(q, x) = q^T x, and a query q such that ||q||_2 = 1, along with a collection S having ||x||_2 ≤ U < 1, ∀ x ∈ S, let P and Q be the vector transformations defined in Eq. (13) and Eq. (14), respectively. We have the following two conditions for the hash function h^{Sign} (defined by Eq. (6)):

• if q^T x ≥ S_0 then
    Pr[ h^{Sign}(Q(q)) = h^{Sign}(P(x)) ] ≥ 1 − (1/π) cos^{−1}( S_0 / sqrt(m/4 + U^{2^{m+1}}) )

• if q^T x ≤ cS_0 then
    Pr[ h^{Sign}(Q(q)) = h^{Sign}(P(x)) ] ≤ 1 − (1/π) cos^{−1}( min{cS_0, z*} / sqrt( m/4 + (min{cS_0, z*})^{2^{m+1}} ) )

where z* = ( (m/2) / (2^{m+1} − 2) )^{2^{−m−1}}.
Proof: When q^T x ≥ S_0, we have, according to Eq. (7),

    Pr[ h^{Sign}(Q(q)) = h^{Sign}(P(x)) ] = 1 − (1/π) cos^{−1}( q^T x / sqrt(m/4 + ||x||_2^{2^{m+1}}) )
                                          ≥ 1 − (1/π) cos^{−1}( q^T x / sqrt(m/4 + U^{2^{m+1}}) )

When q^T x ≤ cS_0, by noting that q^T x ≤ ||x||_2, we have

    Pr[ h^{Sign}(Q(q)) = h^{Sign}(P(x)) ] = 1 − (1/π) cos^{−1}( q^T x / sqrt(m/4 + ||x||_2^{2^{m+1}}) )
                                          ≤ 1 − (1/π) cos^{−1}( q^T x / sqrt(m/4 + (q^T x)^{2^{m+1}}) )

For the one-dimensional function f(z) = z / sqrt(a + z^b), where z = q^T x, a = m/4 and b = 2^{m+1} ≥ 2, we have

    f′(z) = ( a − z^b (b/2 − 1) ) / ( a + z^b )^{3/2}

One can also check that f″(z) ≤ 0 for 0 < z < 1, i.e., f(z) is a concave function. The maximum of f(z) is attained at

    z* = ( 2a / (b − 2) )^{1/b} = ( (m/2) / (2^{m+1} − 2) )^{2^{−m−1}}

If z* ≥ cS_0, then we need to use f(cS_0) as the bound. □

Therefore, we have obtained, in LSH terminology,

    p_1 = 1 − (1/π) cos^{−1}( S_0 / sqrt(m/4 + U^{2^{m+1}}) )    (17)
    p_2 = 1 − (1/π) cos^{−1}( min{cS_0, z*} / sqrt( m/4 + (min{cS_0, z*})^{2^{m+1}} ) ),    (18)
    z* = ( (m/2) / (2^{m+1} − 2) )^{2^{−m−1}}    (19)

Theorem 1 allows us to construct data structures with worst-case O(n^ρ log n) query time guarantees for c-approximate MIPS, where ρ = log p_1 / log p_2. For any given c < 1, there always exist U < 1 and m such that ρ < 1. In this way, we obtain a sublinear query time algorithm for MIPS. Because ρ is a function of two parameters, the best query time chooses the U and m which minimize the value of ρ. For convenience, we define

    ρ* = min_{U,m} log( 1 − (1/π) cos^{−1}( S_0 / sqrt(m/4 + U^{2^{m+1}}) ) ) / log( 1 − (1/π) cos^{−1}( min{cS_0, z*} / sqrt( m/4 + (min{cS_0, z*})^{2^{m+1}} ) ) )    (20)

See Figure 1 for the plots of ρ*, which also compares the optimal ρ values for L2-ALSH from the prior work [18]. The results show that Sign-ALSH is noticeably better.

4.3 Parameter Selection

Figure 2 presents the ρ values for two sets of selected parameters: (m, U) = (2, 0.75) and (m, U) = (3, 0.85). We can see that even with fixed parameters, the performance does not degrade much. This essentially frees practitioners from the burden of choosing parameters.

5 Remove Dependence on Norm of Query

Changing the norm of the query does not affect arg max_{x ∈ S} q^T x, and hence, in practice, for retrieving the top-k, normalizing the query should not affect the performance. But for theoretical purposes, we want the runtime guarantee to be independent of ||q||_2. Note that both LSH and ALSH schemes solve the c-approximate instance of the problem, which requires a threshold S_0 on q^T x and an approximation ratio c. For this given c-approximate instance, we choose optimal parameters K and L. If the queries have varying norms, which is likely the case in practical scenarios, then given a c-approximate MIPS instance, normalizing the query will change the problem, because it will change the threshold S_0 and also the approximation ratio c. The optimal parameters K and L for the algorithm, which also determine the size of the data structure, change with S_0 and c. This would require re-doing the costly preprocessing with every change in the query. Thus, the query time, which depends on ρ, should be independent of the query.

The transformations P and Q were precisely meant to remove the dependency of the correlation on the norms of x, while at the same time keeping the inner products the same. Realizing that we are allowed asymmetry, we can use the same idea to get rid of the norm of q. Let M be the upper bound on all the norms, i.e., M = max_{x ∈ S} ||x||_2. In other words, M is the radius of the space. For U < 1, define the transformation T : R^D → R^D as

    T(x) = (U/M) x    (21)

and let the transformations P, Q : R^D → R^{D+m} be the same as in the Sign-ALSH scheme, as defined in Eqs. (13) and (14).
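At this point, the Sign-ALSH pieces of Section 4 can be checked numerically; a sketch (ours, not the authors' code) of the transformations (13)-(14), the cosine identity (15), and the parameters p_1, p_2, ρ of Eqs. (17)-(20):

```python
import numpy as np

def P(v, m):
    n2 = v @ v  # ||v||^2
    # append 1/2 - ||v||^2, 1/2 - ||v||^4, ..., 1/2 - ||v||^(2^m), Eq. (13)
    return np.concatenate([v, [0.5 - n2 ** (2 ** i) for i in range(m)]])

def Q(v, m):
    # append m zeros, Eq. (14)
    return np.concatenate([v, np.zeros(m)])

rng = np.random.default_rng(3)
D, m = 10, 2
q = rng.standard_normal(D); q /= np.linalg.norm(q)         # ||q|| = 1
x = rng.standard_normal(D); x *= 0.6 / np.linalg.norm(x)   # ||x|| <= U < 1

# Eq. (15): cosine of the transformed pair equals the closed form.
cos_t = (Q(q, m) @ P(x, m)) / (np.linalg.norm(Q(q, m)) * np.linalg.norm(P(x, m)))
closed = (q @ x) / np.sqrt(m / 4 + np.linalg.norm(x) ** (2 ** (m + 1)))
assert abs(cos_t - closed) < 1e-10

def rho(S0, c, U, m):
    """rho = log p1 / log p2 from Eqs. (17)-(19)."""
    b = 2 ** (m + 1)
    z_star = ((m / 2) / (b - 2)) ** (1.0 / b)
    p1 = 1 - np.arccos(S0 / np.sqrt(m / 4 + U ** b)) / np.pi
    z = min(c * S0, z_star)
    p2 = 1 - np.arccos(z / np.sqrt(m / 4 + z ** b)) / np.pi
    return np.log(p1) / np.log(p2)

# Eq. (20): grid-search U and m for the optimal rho* (here with S0 = 0.5U).
c = 0.5
rho_star = min(rho(0.5 * U, c, U, m)
               for U in np.linspace(0.5, 0.95, 46) for m in (1, 2, 3, 4))
print(rho_star)   # a sublinear exponent, below 1
```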
Given the query q and any data point x, observe that the inner product between P(Q(T(q))) and Q(P(T(x))) is

    P(Q(T(q)))^T Q(P(T(x))) = q^T x × (U^2 / M^2)    (22)

P(Q(T(q))) first appends m zero components to T(q) and then m components of the form 1/2 − ||T(q)||_2^{2^i}. Q(P(T(x))) does the same, but in the other order. We are now working in D + 2m dimensions. It is not difficult to see that the norms of P(Q(T(q))) and Q(P(T(x))) are given by

    ||P(Q(T(q)))||_2 = sqrt( m/4 + ||T(q)||_2^{2^{m+1}} )    (23)
    ||Q(P(T(x)))||_2 = sqrt( m/4 + ||T(x)||_2^{2^{m+1}} )    (24)

The transformations are very asymmetric, but we know that this is necessary. The correlation (cosine similarity) between P(Q(T(q))) and Q(P(T(x))) is therefore

    Corr = ( q^T x × (U^2 / M^2) ) / ( sqrt(m/4 + ||T(q)||_2^{2^{m+1}}) sqrt(m/4 + ||T(x)||_2^{2^{m+1}}) )    (25)

Note that ||T(q)||_2^{2^{m+1}}, ||T(x)||_2^{2^{m+1}} ≤ U < 1; therefore both terms converge to zero at a tower rate, and we get approximate monotonicity of the correlation with the inner products. We can then apply sign random projections to hash P(Q(T(q))) and Q(P(T(x))).

Using the facts 0 ≤ ||T(q)||_2^{2^{m+1}} ≤ U and 0 ≤ ||T(x)||_2^{2^{m+1}} ≤ U, it is not difficult to obtain p_1 and p_2 for Sign-ALSH without any conditions on any norms. Simplifying the expression, we get the following value of the optimal ρ_u (u for unrestricted):

    ρ*_u = min_{U,m} log( 1 − (1/π) cos^{−1}( S_0 (U^2/M^2) / (m/4 + U^{2^{m+1}}) ) ) / log( 1 − (1/π) cos^{−1}( 4 cS_0 U^2 / (M^2 m) ) )    (26)

    s.t. U^{2^{m+1}} < m(1 − c)/(4c), m ∈ N^+, and 0 < U < 1.

With this value of ρ*_u, we can state our main theorem.
Theorem 3 For the problem of c-approximate MIPS in a bounded space, one can construct a data structure having O(n^{ρ*_u} log n) query time and space O(n^{1+ρ*_u}), where ρ*_u < 1 is the solution to the constrained optimization (26).

Note that for all c < 1 we always have ρ*_u < 1, because the constraint U^{2^{m+1}} < m(1 − c)/(4c) always holds for big enough m. The only assumption we need for efficiently solving MIPS is that the space is bounded, which is always satisfied for any finite dataset. ρ*_u depends on M, the radius of the space, which is expected.

Figure 3: MovieLens. Precision-recall curves (higher is better) for retrieving top-T items, T = 1, 5, 10, varying the number of hashes K from 64 to 512. We compare L2-ALSH (using the parameters recommended in [18]) with our proposed Sign-ALSH using two sets of parameters: (m = 2, U = 0.75) and (m = 3, U = 0.85). Sign-ALSH is noticeably better.

6 Ranking Evaluations

In [18], the L2-ALSH scheme was shown to outperform LSH for L2 distance in retrieving maximum inner products. Since our proposal is an improvement over L2-ALSH, we focus on comparisons with L2-ALSH. In this section, we compare L2-ALSH with Sign-ALSH based on ranking.

6.1 Datasets

We use the two popular collaborative filtering datasets MovieLens 10M and Netflix for the task of item recommendation. These are the same datasets used in [18]. Each dataset is a sparse user-item matrix R, where R(i, j) indicates the rating of user i for movie j. To obtain latent feature vectors from the user-item matrix, we follow the methodology of [18], which uses the PureSVD procedure described in [2] to generate user and item latent vectors. This involves computing the SVD of R,

    R = W Σ V^T

where W is an n_users × f matrix and V is an n_items × f matrix, for some chosen rank f, also known as the latent dimension. After the SVD step, the rows of the matrix U = W Σ are treated as the user characteristic vectors, while the rows of the matrix V correspond to the item characteristic vectors. This simple procedure was shown in [2] to outperform other popular recommendation algorithms for the task of top item recommendation on these two datasets. We use the same choices for the latent dimension f as [18], i.e., f = 150 for MovieLens and f = 300 for Netflix.
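The PureSVD step described above can be sketched as follows (our toy illustration with random data, not the actual MovieLens/Netflix pipeline):

```python
import numpy as np

rng = np.random.default_rng(4)
n_users, n_items, f = 30, 20, 5

R = rng.random((n_users, n_items))       # toy dense rating matrix
W, s, Vt = np.linalg.svd(R, full_matrices=False)
W, s, Vt = W[:, :f], s[:f], Vt[:f]       # keep the top-f singular triplets

user_vecs = W * s                        # rows = user characteristic vectors (W Sigma)
item_vecs = Vt.T                         # rows = item characteristic vectors

# Recommending items for user i is then exactly a MIPS problem, Eq. (1).
i = 0
scores = item_vecs @ user_vecs[i]
top10 = np.argsort(-scores)[:10]
print(top10)
```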
Figure 4: Netflix. Precision-recall curves (higher is better) for retrieving top-T items, T = 1, 5, 10, varying the number of hashes K from 64 to 512. We compare L2-ALSH (using the parameters recommended in [18]) with our proposed Sign-ALSH using two sets of parameters: (m = 2, U = 0.75) and (m = 3, U = 0.85). Sign-ALSH is noticeably better.

6.2 Evaluations

In this section, we show how the rankings of the two ALSH schemes, L2-ALSH and Sign-ALSH, correlate with the top-T inner products.
Given a user i and its corresponding user vector u_i, we compute the top-T gold standard items based on the actual inner products u_i^T v_j, ∀ j. We then generate K different hash codes of the vector u_i and of all the item vectors v_j, and compute

    Matches_j = Σ_{t=1}^{K} 1( h_t(u_i) = h_t(v_j) ),    (27)

where 1 is the indicator function and the subscript t is used to distinguish independent draws of h. Based on Matches_j we rank all the items. Ideally, for a better hashing scheme, Matches_j should be higher for items having higher inner products with the given user u_i. This procedure generates a sorted list of all the items for a given user vector u_i, corresponding to each hash function under consideration.

For L2-ALSH, we used the same parameters used and recommended in [18]. For Sign-ALSH, we used the two recommended choices shown in Section 4.3, namely (U = 0.75, m = 2) and (U = 0.85, m = 3). It should be noted that Sign-ALSH does not have the parameter r.

We compute the precision and recall of the top-T items for T ∈ {1, 5, 10}, obtained from the sorted list based on Matches. To compute precision and recall, we start at the top of the ranked item list and walk down in order. Suppose we are at the k-th ranked item: we check whether this item belongs to the gold standard top-T list. If it is one of the top-T gold standard items, we increment the count of relevant seen by 1; otherwise we move on to item k + 1. By the k-th step, we have already seen k items, so the total number of items seen is k. The precision and recall at that point are then computed as:

    Precision = relevant seen / k,    Recall = relevant seen / T

Figure 5: MovieLens. Recall-FIP (Fraction of Inner Products) curves for top-1, top-5, and top-10, comparing Sign-ALSH with L2-ALSH. We used the recommended parameters for L2-ALSH [18]. For Sign-ALSH, we used m = 2 and U = 0.75.

We show performance for K ∈ {64, 128, 256, 512}. Note that it is important to balance both precision and recall. The method which obtains higher precision at a given recall is superior. Higher precision indicates higher ranking of the relevant items. We report precisions and recalls averaged over 2000 randomly chosen users. The plots for the MovieLens and Netflix datasets are shown in Figure 3 and Figure 4, respectively. We can clearly see that our proposed Sign-ALSH scheme gives significantly higher precision-recall curves than the L2-ALSH scheme, indicating better correlation of the top neighbors under inner products with Sign-ALSH compared to L2-ALSH. In addition, there is not much difference between the two combinations of the parameters U and m in Sign-ALSH. The results are very consistent across both datasets.

7 LSH Bucketing Experiments

In this section, we evaluate the actual savings in the number of inner product evaluations for recommending top-T items on the MovieLens dataset. For this, we implemented the standard (K, L) algorithms of [9], where K is the number of hashes in each hash table and L is the total number of tables. For each query point, the returned results are the union of the matches in all L tables. To find the top-T items, we need to compute the actual inner products only on the candidate items retrieved by the bucketing procedure. In this experiment, we choose T ∈ {1, 5, 10} and compute the recall value for each combination of (T, K, L) for every query.
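A minimal sketch (ours, not the authors' implementation) of the Section 6.2 ranking procedure on random toy vectors: Sign-ALSH transforms, K SRP hash codes, the Matches count of Eq. (27), and the precision/recall walk.

```python
import numpy as np

rng = np.random.default_rng(5)
D, n_items, K, T = 20, 500, 64, 10
m, U = 2, 0.75

u = rng.standard_normal(D)               # a user vector
V = rng.standard_normal((n_items, D))    # item vectors
M = np.linalg.norm(V, axis=1).max()      # radius of the item space

scale = lambda v: (U / M) * v                                        # T(x), Eq. (21)
P = lambda v: np.concatenate([v, [0.5 - (v @ v) ** (2 ** i) for i in range(m)]])
Q = lambda v: np.concatenate([v, np.zeros(m)])

gold = set(np.argsort(-(V @ u))[:T])     # top-T by true inner product

A = rng.standard_normal((K, D + m))      # K independent SRP projections
h_u = np.sign(A @ Q(scale(u)))
h_V = np.sign(np.stack([P(scale(v)) for v in V]) @ A.T)
matches = (h_V == h_u).sum(axis=1)       # Matches_j, Eq. (27)

relevant_seen = 0
for k, j in enumerate(np.argsort(-matches)[:50], start=1):
    relevant_seen += j in gold           # walk down the ranked list
    precision, recall = relevant_seen / k, relevant_seen / T
print(precision, recall)                 # precision/recall after 50 items
```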
For example, given a query q and a (K, L)-LSH scheme, if T = 10 and only 5 of the true top-10 data points are retrieved, the recall will be 50% for this (T, K, L). At the same time, we can also compute the FIP (fraction of inner products):

    FIP = ( (K × L) + Total Retrieved ) / Total Items    (28)

which is basically the total number of inner product evaluations (where K × L represents the cost of hashing), normalized by the total number of items in the repository. Thus, for each q and (T, K, L), we can compute two values: recall and FIP. We also need a way to aggregate the results over all queries.

Typically, the performance of a bucketing algorithm is very sensitive to the choice of the hashing parameters K and L. Ideally, to find the best K and L, we need to know the operating threshold S_0 and the approximation ratio c in advance. Unfortunately, the data and the queries are very diverse, and therefore for retrieving top-T near neighbors there is no common fixed threshold S_0 and approximation ratio c that works for different queries.

Our goal is to compare the hashing schemes while minimizing the effect of K and L on the evaluation. To remove the effect of K and L, we perform rigorous evaluations over a wide range of K and L, which includes the optimal choices at various thresholds. For both hashing schemes, we then select the best performing K and L and report the performance. This involves running the bucketing experiments for thousands of combinations and then choosing the best K and L to marginalize the effect of the parameters in the comparisons. This ensures that our evaluation is fair.

We choose the following scheme. For each (T, K, L), we compute the averaged recall and averaged FIP over all queries. Then, for each "target" recall level (and T), we find the (K, L) which produces the best (lowest) averaged FIP.
This way, for each T, we can compute an "FIP-recall" curve, which can be used to compare Sign-ALSH with L2-ALSH. We use K ∈ {4, 5, ..., 20} and L ∈ {1, 2, 3, ..., 200}. The results are summarized in Figure 5. We can clearly see from the plots that, to achieve the same recall for top-T, the Sign-ALSH scheme needs fewer computations than L2-ALSH.

8 Conclusion

The MIPS (maximum inner product search) problem has numerous important applications in machine learning, databases, and information retrieval. [18] developed the framework of Asymmetric LSH and provided an explicit scheme (L2-ALSH) for approximate MIPS in sublinear time. In this study, we present another asymmetric transformation scheme (Sign-ALSH) which converts the problem of maximum inner products into the problem of maximum correlation search, which is subsequently solved by sign random projections. Theoretical analysis and experimental study demonstrate that Sign-ALSH can be noticeably more advantageous than L2-ALSH.

Acknowledgement

The research is supported in part by ONR-N00014-13-1-0764, NSF-III-1360971, AFOSR-FA9550-13-1-0137, and NSF-Bigdata-1419210. The method and theoretical analysis for Sign-ALSH were conducted right after the initial submission of our first work on ALSH [18] in February 2014. The intensive experiments (especially the LSH bucketing experiments), however, were not fully completed until June 2014 due to the demand on computational resources, because we exhaustively experimented with a wide range of K (number of hashes) and L (number of tables) for implementing (K, L)-LSH schemes. Here, we also would like to thank the computing support team (LCSR) at the Rutgers CS department as well as the IT support staff at the Rutgers Statistics department for setting up the workstations, especially the server with 1.5TB memory.

References

[1] M. S. Charikar.
Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, Montreal, Quebec, Canada, 2002.

[2] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems, pages 39–46. ACM, 2010.

[3] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253–262, Brooklyn, NY, 2004.

[4] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1814–1821. IEEE, 2013.

[5] W. Dong, M. Charikar, and K. Li. Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces. In SIGIR, pages 123–130, 2008.

[6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010.

[7] J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, 23(9):881–890, 1974.

[8] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115–1145, 1995.

[9] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998.

[10] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural svms. Machine Learning, 77(1):27–59, 2009.

[11] N. Koenigstein, P. Ram, and Y. Shavitt. Efficient retrieval of recommendations in a matrix factorization framework. In CIKM, pages 535–544, 2012.

[12] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.

[13] P. Li, M. Mitzenmacher, and A. Shrivastava. Coding for random projections. In ICML, 2014.

[14] P. Li, M. Mitzenmacher, and A. Shrivastava. Coding for random projections and approximate near neighbor search. Technical report, arXiv:1403.8144, 2014.

[15] B. Neyshabur, N. Srebro, R. Salakhutdinov, Y. Makarychev, and P. Yadollahpour. The power of asymmetry in binary hashing. In NIPS, Lake Tahoe, NV, 2013.

[16] P. Ram and A. G. Gray. Maximum inner-product search using cone trees. In KDD, pages 931–939, 2012.

[17] A. Shrivastava and P. Li. Beyond pairwise: Provably fast algorithms for approximate k-way similarity search. In NIPS, Lake Tahoe, NV, 2013.

[18] A. Shrivastava and P. Li. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). Technical report, arXiv:1405.5869 (To appear in NIPS 2014. Initially submitted to KDD 2014), 2014.

[19] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, pages 194–205, 1998.
