Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS)


Authors: Anshumali Shrivastava, Ping Li

Anshumali Shrivastava
Department of Computer Science, Computing and Information Science
Cornell University, Ithaca, NY 14853, USA
anshu@cs.cornell.edu

Ping Li
Department of Statistics and Biostatistics, Department of Computer Science
Rutgers University, Piscataway, NJ 08854, USA
pingli@stat.rutgers.edu

Abstract

Recently it was shown that the problem of Maximum Inner Product Search (MIPS) is efficient and admits provably sub-linear hashing algorithms. Asymmetric transformations before hashing were the key to solving MIPS, which was otherwise hard. In [18], the authors use asymmetric transformations which convert the problem of approximate MIPS into the problem of approximate near neighbor search, which can be efficiently solved using hashing. In this work, we provide a different transformation which converts the problem of approximate MIPS into the problem of approximate cosine similarity search, which can be efficiently solved using signed random projections. Theoretical analysis shows that the new scheme is significantly better than the original scheme for MIPS. Experimental evaluations strongly support the theoretical findings.

1 Introduction

In this paper, we revisit the problem of Maximum Inner Product Search (MIPS), which was studied in a recent technical report [18]. In that report the authors present the first provably fast algorithm for MIPS, which was previously considered hard [16, 11]. Given an input query point q ∈ R^D, the task of MIPS is to find p ∈ S, where S is a giant collection of size N, which maximizes (approximately) the inner product q^T p:

    p = arg max_{x ∈ S} q^T x    (1)

The MIPS problem is related to the problem of near neighbor search (NNS).
For example, L2-NNS:

    p = arg min_{x ∈ S} ||q − x||_2^2 = arg min_{x ∈ S} ( ||x||_2^2 − 2 q^T x )    (2)

or correlation-NNS:

    p = arg max_{x ∈ S} q^T x / (||q|| ||x||) = arg max_{x ∈ S} q^T x / ||x||    (3)

Submitted to AISTATS 2015.

These three problems are equivalent if the norm of every element x ∈ S is constant. Clearly, the value of the norm ||q||_2 has no effect on the arg max. In many scenarios, however, MIPS arises naturally in settings where the norms of the elements in S have significant variations [11]. As reviewed in [18], examples of applications of MIPS include recommender systems [12, 2, 11], large-scale object detection with DPM [6, 4, 10], structural SVM [4], and multi-class label prediction [16, 11, 19].

Asymmetric LSH (ALSH): Locality Sensitive Hashing (LSH) [9] is popular in practice for efficiently solving NNS. In the prior work [18], the concept of "asymmetric LSH" (ALSH) was proposed: one can transform the input query by Q(q) and the data in the collection by P(x) independently, where the transformations Q and P are different. [18] developed a particular set of transformations to convert MIPS into L2-NNS and then solved the problem by the standard L2-hash [3]. In this paper, we refer to the scheme of [18] as L2-ALSH. Asymmetry in hashing has become popular recently, and it has been applied to hashing higher-order similarity [17], data-dependent hashing [15], sketching [5], etc.

Our contribution: In this study, we propose another scheme for ALSH, by developing a new set of asymmetric transformations which convert MIPS into a problem of correlation-NNS, which is then solved by "sign random projections" [8, 1]. We name this new scheme Sign-ALSH. Our theoretical analysis and experimental study show that Sign-ALSH is more advantageous than L2-ALSH for MIPS.
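As a toy illustration (ours, not from the paper) of why Eqs. (1)-(3) are not interchangeable: with varying norms, the MIPS answer need not match the L2-NNS or correlation-NNS answers, while with constant norms all three coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal(5)
q /= np.linalg.norm(q)

S = rng.standard_normal((100, 5))
S *= rng.uniform(0.1, 10.0, size=(100, 1))        # significant norm variation

mips = np.argmax(S @ q)                                   # Eq. (1)
l2nn = np.argmin(np.linalg.norm(S - q, axis=1))           # Eq. (2)
corr = np.argmax((S @ q) / np.linalg.norm(S, axis=1))     # Eq. (3)
print(mips, l2nn, corr)   # typically three different items

# With every norm fixed to 1, the three objectives agree.
Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
assert np.argmax(Sn @ q) == np.argmin(np.linalg.norm(Sn - q, axis=1))
assert np.argmax(Sn @ q) == np.argmax((Sn @ q) / np.linalg.norm(Sn, axis=1))
```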
2 Review: Locality Sensitive Hashing (LSH)

The problem of efficiently finding nearest neighbors has been an active research topic since the very early days of computer science [7]. Approximate versions of the near neighbor search problem [9] were proposed to break the linear query time bottleneck. The following formulation of approximate near neighbor search is often adopted.

Definition: (c-Approximate Near Neighbor or c-NN) Given a set of points in a D-dimensional space R^D, and parameters S_0 > 0, δ > 0, construct a data structure which, given any query point q, does the following with probability 1 − δ: if there exists an S_0-near neighbor of q in S, it reports some cS_0-near neighbor of q in S.

Locality Sensitive Hashing (LSH) [9] is a family of functions with the property that more similar items have a higher collision probability. LSH trades off query time against extra (one-time) preprocessing cost and space. The existence of an LSH family translates into a provably sublinear query time algorithm for c-NN problems.

Definition: (Locality Sensitive Hashing (LSH)) A family H is called (S_0, cS_0, p_1, p_2)-sensitive if, for any two points x, y ∈ R^D, h chosen uniformly from H satisfies:

• if Sim(x, y) ≥ S_0 then Pr_H(h(x) = h(y)) ≥ p_1
• if Sim(x, y) ≤ cS_0 then Pr_H(h(x) = h(y)) ≤ p_2

For an efficient approximate nearest neighbor search, p_1 > p_2 and c < 1 are needed.

Fact 1: Given a family of (S_0, cS_0, p_1, p_2)-sensitive hash functions, one can construct a data structure for c-NN with O(n^ρ log n) query time and space O(n^{1+ρ}), where ρ = log p_1 / log p_2 < 1.

LSH is a generic framework; an implementation of LSH requires a concrete hash function.

2.1 LSH for L2 distance

[3] presented an LSH family for L2 distances. Formally, given a fixed window size r, we sample a random vector a with each component from i.i.d.
normal, i.e., a_i ∼ N(0, 1), and a scalar b generated uniformly at random from [0, r]. The hash function is defined as:

    h^{L2}_{a,b}(x) = ⌊ (a^T x + b) / r ⌋    (4)

where ⌊·⌋ is the floor operation. The collision probability under this scheme can be shown to be

    Pr( h^{L2}_{a,b}(x) = h^{L2}_{a,b}(y) ) = 1 − 2Φ(−r/d) − (2 / (sqrt(2π) (r/d))) ( 1 − e^{−(r/d)^2 / 2} )    (5)

where Φ(x) = ∫_{−∞}^{x} (1/sqrt(2π)) e^{−x^2/2} dx and d = ||x − y||_2 is the Euclidean distance between the vectors x and y.

2.2 LSH for correlation

Another popular LSH family is the so-called "sign random projections" [8, 1]. Again, we choose a random vector a with a_i ∼ N(0, 1). The hash function is defined as:

    h^{Sign}(x) = sign(a^T x)    (6)

and the collision probability is

    Pr( h^{Sign}(x) = h^{Sign}(y) ) = 1 − (1/π) cos^{−1}( x^T y / (||x|| ||y||) )    (7)

This hashing scheme is also popularly known as signed random projections (SRP).

3 Review of ALSH for MIPS and L2-ALSH

In [18], it was shown that the framework of locality sensitive hashing is restrictive for solving MIPS. The inherent assumption of the same hash function for both the data transformation and the query was unnecessary in the classical LSH framework, and it was the main hurdle in finding provable sub-linear algorithms for MIPS with LSH. For the theoretical guarantees of LSH to hold, there was no requirement of symmetry. Incorporating asymmetry in the hashing schemes was the key to solving MIPS efficiently.
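For concreteness, the two LSH families reviewed in Section 2 can be sketched as follows (our code, not the authors'; `h_l2` and `h_sign` are hypothetical helper names), together with an empirical check of the SRP collision probability of Eq. (7):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 50

def h_l2(x, a, b, r):
    """Eq. (4): floor((a^T x + b) / r), with a ~ N(0, I), b ~ Uniform[0, r]."""
    return int(np.floor((a @ x + b) / r))

def h_sign(x, a):
    """Eq. (6): sign(a^T x), with a ~ N(0, I)."""
    return 1 if a @ x >= 0 else -1

x = rng.standard_normal(D)
y = x + 0.3 * rng.standard_normal(D)   # a nearby point

r = 2.0
a0, b0 = rng.standard_normal(D), rng.uniform(0, r)
print(h_l2(x, a0, b0, r), h_sign(x, a0))

# Empirical SRP collision rate vs. the closed form 1 - theta/pi of Eq. (7).
A = rng.standard_normal((20000, D))
collisions = np.mean(np.sign(A @ x) == np.sign(A @ y))
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(collisions, 1 - theta / np.pi)   # the two numbers should be close
```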
Definition [18]: (Asymmetric Locality Sensitive Hashing (ALSH)) A family H, along with two vector functions Q : R^D ↦ R^{D′} (Query Transformation) and P : R^D ↦ R^{D′} (Preprocessing Transformation), is called (S_0, cS_0, p_1, p_2)-sensitive if, for a given c-NN instance with query q, the hash function h chosen uniformly from H satisfies the following:

• if Sim(q, x) ≥ S_0 then Pr_H( h(Q(q)) = h(P(x)) ) ≥ p_1
• if Sim(q, x) ≤ cS_0 then Pr_H( h(Q(q)) = h(P(x)) ) ≤ p_2

Here x is any point in the collection S. Note that the query transformation Q is applied only to the query, while the preprocessing transformation P is applied to x ∈ S when creating the hash tables. By letting Q(x) = P(x) = x, we recover vanilla LSH. Using different transformations (i.e., Q ≠ P), it is possible to counter the fact that self similarity is not the highest under inner products, which is the main argument for the failure of LSH. We just need the probability of the new collision event {h(Q(q)) = h(P(y))} to satisfy the conditions of the definition of ALSH for Sim(q, y) = q^T y.

Theorem 1 [18] Given a family of hash functions H and the associated query and preprocessing transformations Q and P, which is (S_0, cS_0, p_1, p_2)-sensitive, one can construct a data structure for c-NN with O(n^ρ log n) query time and space O(n^{1+ρ}), where ρ = log p_1 / log p_2.

[18] also provided an explicit construction of ALSH, which we call L2-ALSH. Without loss of generality, one can always assume

    ||x_i||_2 ≤ U < 1, ∀ x_i ∈ S    (8)

for some U < 1. If this is not the case, then we can always scale down the norms without altering the arg max. Since the norm of the query does not affect the arg max in MIPS, for simplicity it was assumed that ||q||_2 = 1. This condition can be removed easily (see Section 5 for details).
In L2-ALSH, two vector transformations P : R^D ↦ R^{D+m} and Q : R^D ↦ R^{D+m} are defined as follows:

    P(x) = [x; ||x||_2^2; ||x||_2^4; ...; ||x||_2^{2^m}]    (9)
    Q(x) = [x; 1/2; 1/2; ...; 1/2],    (10)

where [;] denotes concatenation. P(x) appends m scalars of the form ||x||_2^{2^i} at the end of the vector x, while Q(x) simply appends m "1/2"s to the end of the vector x. By observing

    ||P(x_i)||_2^2 = ||x_i||_2^2 + ||x_i||_2^4 + ... + ||x_i||_2^{2^m} + ||x_i||_2^{2^{m+1}}
    ||Q(q)||_2^2 = ||q||_2^2 + m/4 = 1 + m/4
    Q(q)^T P(x_i) = q^T x_i + (1/2)( ||x_i||_2^2 + ||x_i||_2^4 + ... + ||x_i||_2^{2^m} )

one can obtain the following key equality:

    ||Q(q) − P(x_i)||_2^2 = (1 + m/4) − 2 q^T x_i + ||x_i||_2^{2^{m+1}}    (11)

Since ||x_i||_2 ≤ U < 1, we have ||x_i||_2^{2^{m+1}} → 0 at the tower rate (exponential to exponential). Thus, as long as m is not too small (e.g., m ≥ 3 would suffice), we have

    arg max_{x ∈ S} q^T x ≃ arg min_{x ∈ S} ||Q(q) − P(x)||_2    (12)

This scheme is the first connection between solving un-normalized MIPS and approximate near neighbor search. The transformations P and Q, when norms are less than 1, provide a correction to the L2 distance ||Q(q) − P(x_i)||_2, making it rank-correlate with the (un-normalized) inner product. The general idea of ALSH was partially inspired by the work on three-way similarity search [17], where different hashing functions were applied for handling the query and the data in the repository.

3.1 Intuition for the Better Scheme

Asymmetric transformations give us enough flexibility to modify norms without changing inner products. The transformation provided in [18] used this flexibility to convert MIPS into standard near neighbor search in L2 space, for which we have standard hash functions. Signed random projections are popular hash functions widely adopted for correlation or cosine similarity.
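Before moving to the new scheme, the L2-ALSH construction above can be checked numerically; a minimal sketch (ours, with helper names `P` and `Q` following the text) verifying the key equality (11):

```python
import numpy as np

rng = np.random.default_rng(2)
D, m = 10, 3

q = rng.standard_normal(D); q /= np.linalg.norm(q)          # ||q||_2 = 1
x = rng.standard_normal(D); x *= 0.8 / np.linalg.norm(x)    # ||x||_2 = 0.8 <= U < 1

def P(v, m):
    n = np.linalg.norm(v)
    # append ||v||^2, ||v||^4, ..., ||v||^(2^m), as in Eq. (9)
    return np.concatenate([v, [n ** (2 ** (i + 1)) for i in range(m)]])

def Q(v, m):
    # append m copies of 1/2, as in Eq. (10)
    return np.concatenate([v, np.full(m, 0.5)])

lhs = np.linalg.norm(Q(q, m) - P(x, m)) ** 2
rhs = (1 + m / 4) - 2 * (q @ x) + np.linalg.norm(x) ** (2 ** (m + 1))
print(lhs, rhs)   # equal up to floating-point error
```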
We use an asymmetric transformation to convert approximate MIPS into approximate maximum correlation search. The transformations and the collision probability of the hashing functions determine the efficiency of the obtained ALSH algorithm. We show that the new transformation with SRP is better suited for ALSH than the existing L2-ALSH. Note that in the recent work on coding for random projections [13, 14], it was already shown that sign random projections (or 2-bit random projections) can outperform L2LSH.

4 The New Proposal: Sign-ALSH

4.1 From MIPS to Correlation-NNS

We assume for simplicity that ||q||_2 = 1, as the norm of the query does not change the ordering; we show in the next section how to remove this assumption. Without loss of generality, let ||x_i||_2 ≤ U < 1, ∀ x_i ∈ S, as this can always be achieved by scaling the data by a large enough number.

Figure 1: Optimal values of ρ* (lower is better) with respect to the approximation ratio c for different S_0, obtained by a grid search over the parameters U and m, given S_0 and c. The two panels show different S_0 values (S_0 = 0.5U, 0.1U and S_0 = 0.9U, 0.5U). The curves show that Sign-ALSH (solid) is noticeably better than L2-ALSH (dashed) in terms of their optimal ρ* values. The results for L2-ALSH are from the prior work [18]. For clarity, the results are shown in two panels.

We define two vector transformations P : R^D ↦ R^{D+m} and Q : R^D ↦ R^{D+m} as follows:

    P(x) = [x; 1/2 − ||x||_2^2; 1/2 − ||x||_2^4; ...; 1/2 − ||x||_2^{2^m}]    (13)
    Q(x) = [x; 0; 0; ...; 0],    (14)

Using ||Q(q)||_2^2 = ||q||_2^2 = 1, Q(q)^T P(x_i) = q^T x_i, and

    ||P(x_i)||_2^2 = ||x_i||_2^2 + 1/4 + ||x_i||_2^4 − ||x_i||_2^2 + 1/4 + ||x_i||_2^8 − ||x_i||_2^4 + ... + 1/4 + ||x_i||_2^{2^{m+1}} − ||x_i||_2^{2^m}
                   = m/4 + ||x_i||_2^{2^{m+1}}

we obtain the following key equality:

    Q(q)^T P(x_i) / ( ||Q(q)||_2 ||P(x_i)||_2 ) = q^T x_i / sqrt( m/4 + ||x_i||_2^{2^{m+1}} )    (15)

The term ||x_i||_2^{2^{m+1}} again vanishes at the tower rate. This means we have, approximately,

    arg max_{x ∈ S} q^T x ≃ arg max_{x ∈ S} Q(q)^T P(x) / ( ||Q(q)||_2 ||P(x)||_2 )    (16)

Figure 2: The solid curves are the optimal ρ values of Sign-ALSH from Figure 1. The dashed curves represent the ρ values for fixed parameters: m = 2 and U = 0.75 (left panel), and m = 3 and U = 0.85 (right panel). Even with fixed parameters, ρ does not degrade much.

This provides another solution for solving MIPS using known methods for approximate correlation-NNS.

4.2 Fast Algorithms for MIPS Using Sign Random Projections

Eq. (16) shows that MIPS reduces to the standard approximate near neighbor search problem, which can be efficiently solved by sign random projections, i.e., h^{Sign} (defined by Eq. (6)). Formally, we can state the following theorem.

Theorem 2 Given a c-approximate instance of MIPS, i.e., Sim(q, x) = q^T x, and a query q such that ||q||_2 = 1, along with a collection S having ||x||_2 ≤ U < 1, ∀ x ∈ S, let P and Q be the vector transformations defined in Eq. (13) and Eq. (14), respectively. We have the following two conditions for the hash function h^{Sign} (defined by Eq. (6)):

• if q^T x ≥ S_0 then
    Pr[ h^{Sign}(Q(q)) = h^{Sign}(P(x)) ] ≥ 1 − (1/π) cos^{−1}( S_0 / sqrt(m/4 + U^{2^{m+1}}) )

• if q^T x ≤ cS_0 then
    Pr[ h^{Sign}(Q(q)) = h^{Sign}(P(x)) ] ≤ 1 − (1/π) cos^{−1}( min{cS_0, z*} / sqrt( m/4 + (min{cS_0, z*})^{2^{m+1}} ) )

where z* = ( (m/2) / (2^{m+1} − 2) )^{2^{−m−1}}.
Proof: When q^T x ≥ S_0, we have, according to Eq. (7),

    Pr[ h^{Sign}(Q(q)) = h^{Sign}(P(x)) ] = 1 − (1/π) cos^{−1}( q^T x / sqrt(m/4 + ||x||_2^{2^{m+1}}) )
                                          ≥ 1 − (1/π) cos^{−1}( q^T x / sqrt(m/4 + U^{2^{m+1}}) )

When q^T x ≤ cS_0, by noting that q^T x ≤ ||x||_2, we have

    Pr[ h^{Sign}(Q(q)) = h^{Sign}(P(x)) ] = 1 − (1/π) cos^{−1}( q^T x / sqrt(m/4 + ||x||_2^{2^{m+1}}) )
                                          ≤ 1 − (1/π) cos^{−1}( q^T x / sqrt(m/4 + (q^T x)^{2^{m+1}}) )

For the one-dimensional function f(z) = z / sqrt(a + z^b), where z = q^T x, a = m/4 and b = 2^{m+1} ≥ 2, we have

    f′(z) = ( a − z^b (b/2 − 1) ) / ( a + z^b )^{3/2}

One can also check that f″(z) ≤ 0 for 0 < z < 1, i.e., f(z) is a concave function. The maximum of f(z) is attained at

    z* = ( 2a / (b − 2) )^{1/b} = ( (m/2) / (2^{m+1} − 2) )^{2^{−m−1}}

If z* ≥ cS_0, then we need to use f(cS_0) as the bound. □

Therefore, we have obtained, in LSH terminology,

    p_1 = 1 − (1/π) cos^{−1}( S_0 / sqrt(m/4 + U^{2^{m+1}}) )    (17)
    p_2 = 1 − (1/π) cos^{−1}( min{cS_0, z*} / sqrt( m/4 + (min{cS_0, z*})^{2^{m+1}} ) ),    (18)
    z* = ( (m/2) / (2^{m+1} − 2) )^{2^{−m−1}}    (19)

Theorem 1 allows us to construct data structures with worst-case O(n^ρ log n) query time guarantees for c-approximate MIPS, where ρ = log p_1 / log p_2. For any given c < 1, there always exist U < 1 and m such that ρ < 1. In this way, we obtain a sublinear query time algorithm for MIPS. Because ρ is a function of two parameters, the best query time chooses the U and m which minimize the value of ρ. For convenience, we define

    ρ* = min_{U,m} log( 1 − (1/π) cos^{−1}( S_0 / sqrt(m/4 + U^{2^{m+1}}) ) ) / log( 1 − (1/π) cos^{−1}( min{cS_0, z*} / sqrt( m/4 + (min{cS_0, z*})^{2^{m+1}} ) ) )    (20)

See Figure 1 for the plots of ρ*, which also compares the optimal ρ values for L2-ALSH from the prior work [18]. The results show that Sign-ALSH is noticeably better.

4.3 Parameter Selection

Figure 2 presents the ρ values for two sets of selected parameters: (m, U) = (2, 0.75) and (m, U) = (3, 0.85). We can see that even with fixed parameters, the performance does not degrade much. This essentially frees practitioners from the burden of choosing parameters.

5 Remove Dependence on Norm of Query

Changing the norm of the query does not affect arg max_{x ∈ S} q^T x, and hence, in practice, for retrieving the top-k, normalizing the query should not affect the performance. But for theoretical purposes, we want the runtime guarantee to be independent of ||q||_2. Note that both LSH and ALSH schemes solve the c-approximate instance of the problem, which requires a threshold S_0 on q^T x and an approximation ratio c. For this given c-approximate instance, we choose optimal parameters K and L. If the queries have varying norms, which is likely the case in practical scenarios, then given a c-approximate MIPS instance, normalizing the query will change the problem, because it will change the threshold S_0 and also the approximation ratio c. The optimal parameters K and L for the algorithm, which also determine the size of the data structure, change with S_0 and c. This would require re-doing the costly preprocessing with every change in the query. Thus, the query time, which depends on ρ, should be independent of the query.

The transformations P and Q were precisely meant to remove the dependency of the correlation on the norms of x, while at the same time keeping the inner products the same. Realizing that we are allowed asymmetry, we can use the same idea to get rid of the norm of q. Let M be the upper bound on all the norms, i.e., M = max_{x ∈ S} ||x||_2. In other words, M is the radius of the space. For U < 1, define the transformation T : R^D → R^D as

    T(x) = (U/M) x    (21)

and let the transformations P, Q : R^D → R^{D+m} be the same as in the Sign-ALSH scheme, as defined in Eqs. (13) and (14).
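At this point, the Sign-ALSH pieces of Section 4 can be checked numerically; a sketch (ours, not the authors' code) of the transformations (13)-(14), the cosine identity (15), and the parameters p_1, p_2, ρ of Eqs. (17)-(20):

```python
import numpy as np

def P(v, m):
    n2 = v @ v  # ||v||^2
    # append 1/2 - ||v||^2, 1/2 - ||v||^4, ..., 1/2 - ||v||^(2^m), Eq. (13)
    return np.concatenate([v, [0.5 - n2 ** (2 ** i) for i in range(m)]])

def Q(v, m):
    # append m zeros, Eq. (14)
    return np.concatenate([v, np.zeros(m)])

rng = np.random.default_rng(3)
D, m = 10, 2
q = rng.standard_normal(D); q /= np.linalg.norm(q)         # ||q|| = 1
x = rng.standard_normal(D); x *= 0.6 / np.linalg.norm(x)   # ||x|| <= U < 1

# Eq. (15): cosine of the transformed pair equals the closed form.
cos_t = (Q(q, m) @ P(x, m)) / (np.linalg.norm(Q(q, m)) * np.linalg.norm(P(x, m)))
closed = (q @ x) / np.sqrt(m / 4 + np.linalg.norm(x) ** (2 ** (m + 1)))
assert abs(cos_t - closed) < 1e-10

def rho(S0, c, U, m):
    """rho = log p1 / log p2 from Eqs. (17)-(19)."""
    b = 2 ** (m + 1)
    z_star = ((m / 2) / (b - 2)) ** (1.0 / b)
    p1 = 1 - np.arccos(S0 / np.sqrt(m / 4 + U ** b)) / np.pi
    z = min(c * S0, z_star)
    p2 = 1 - np.arccos(z / np.sqrt(m / 4 + z ** b)) / np.pi
    return np.log(p1) / np.log(p2)

# Eq. (20): grid-search U and m for the optimal rho* (here with S0 = 0.5U).
c = 0.5
rho_star = min(rho(0.5 * U, c, U, m)
               for U in np.linspace(0.5, 0.95, 46) for m in (1, 2, 3, 4))
print(rho_star)   # a sublinear exponent, below 1
```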
Given the query q and any data point x, observe that the inner product between P(Q(T(q))) and Q(P(T(x))) is

    P(Q(T(q)))^T Q(P(T(x))) = q^T x × (U^2 / M^2)    (22)

P(Q(T(q))) first appends m zero components to T(q) and then m components of the form 1/2 − ||T(q)||_2^{2^i}. Q(P(T(x))) does the same, but in the other order. We are now working in D + 2m dimensions. It is not difficult to see that the norms of P(Q(T(q))) and Q(P(T(x))) are given by

    ||P(Q(T(q)))||_2 = sqrt( m/4 + ||T(q)||_2^{2^{m+1}} )    (23)
    ||Q(P(T(x)))||_2 = sqrt( m/4 + ||T(x)||_2^{2^{m+1}} )    (24)

The transformations are very asymmetric, but we know that this is necessary. The correlation (cosine similarity) between P(Q(T(q))) and Q(P(T(x))) is therefore

    Corr = ( q^T x × (U^2 / M^2) ) / ( sqrt(m/4 + ||T(q)||_2^{2^{m+1}}) sqrt(m/4 + ||T(x)||_2^{2^{m+1}}) )    (25)

Note that ||T(q)||_2^{2^{m+1}}, ||T(x)||_2^{2^{m+1}} ≤ U < 1; therefore both terms converge to zero at a tower rate, and we get approximate monotonicity of the correlation with the inner products. We can then apply sign random projections to hash P(Q(T(q))) and Q(P(T(x))).

Using the facts 0 ≤ ||T(q)||_2^{2^{m+1}} ≤ U and 0 ≤ ||T(x)||_2^{2^{m+1}} ≤ U, it is not difficult to obtain p_1 and p_2 for Sign-ALSH without any conditions on any norms. Simplifying the expression, we get the following value of the optimal ρ_u (u for unrestricted):

    ρ*_u = min_{U,m} log( 1 − (1/π) cos^{−1}( S_0 (U^2/M^2) / (m/4 + U^{2^{m+1}}) ) ) / log( 1 − (1/π) cos^{−1}( 4 cS_0 U^2 / (M^2 m) ) )    (26)

    s.t. U^{2^{m+1}} < m(1 − c)/(4c), m ∈ N^+, and 0 < U < 1.

With this value of ρ*_u, we can state our main theorem.
Theorem 3 For the problem of c-approximate MIPS in a bounded space, one can construct a data structure having O(n^{ρ*_u} log n) query time and space O(n^{1+ρ*_u}), where ρ*_u < 1 is the solution to the constrained optimization (26).

Note that for all c < 1 we always have ρ*_u < 1, because the constraint U^{2^{m+1}} < m(1 − c)/(4c) always holds for big enough m. The only assumption we need for efficiently solving MIPS is that the space is bounded, which is always satisfied for any finite dataset. ρ*_u depends on M, the radius of the space, which is expected.

Figure 3: MovieLens. Precision-recall curves (higher is better) for retrieving top-T items, T = 1, 5, 10, varying the number of hashes K from 64 to 512. We compare L2-ALSH (using the parameters recommended in [18]) with our proposed Sign-ALSH using two sets of parameters: (m = 2, U = 0.75) and (m = 3, U = 0.85). Sign-ALSH is noticeably better.

6 Ranking Evaluations

In [18], the L2-ALSH scheme was shown to outperform LSH for L2 distance in retrieving maximum inner products. Since our proposal is an improvement over L2-ALSH, we focus on comparisons with L2-ALSH. In this section, we compare L2-ALSH with Sign-ALSH based on ranking.

6.1 Datasets

We use the two popular collaborative filtering datasets MovieLens 10M and Netflix for the task of item recommendation. These are the same datasets used in [18]. Each dataset is a sparse user-item matrix R, where R(i, j) indicates the rating of user i for movie j. To obtain latent feature vectors from the user-item matrix, we follow the methodology of [18], which uses the PureSVD procedure described in [2] to generate user and item latent vectors. This involves computing the SVD of R,

    R = W Σ V^T

where W is an n_users × f matrix and V is an n_items × f matrix, for some chosen rank f, also known as the latent dimension. After the SVD step, the rows of the matrix U = W Σ are treated as the user characteristic vectors, while the rows of the matrix V correspond to the item characteristic vectors. This simple procedure was shown in [2] to outperform other popular recommendation algorithms for the task of top item recommendation on these two datasets. We use the same choices for the latent dimension f as [18], i.e., f = 150 for MovieLens and f = 300 for Netflix.
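The PureSVD step described above can be sketched as follows (our toy illustration with random data, not the actual MovieLens/Netflix pipeline):

```python
import numpy as np

rng = np.random.default_rng(4)
n_users, n_items, f = 30, 20, 5

R = rng.random((n_users, n_items))       # toy dense rating matrix
W, s, Vt = np.linalg.svd(R, full_matrices=False)
W, s, Vt = W[:, :f], s[:f], Vt[:f]       # keep the top-f singular triplets

user_vecs = W * s                        # rows = user characteristic vectors (W Sigma)
item_vecs = Vt.T                         # rows = item characteristic vectors

# Recommending items for user i is then exactly a MIPS problem, Eq. (1).
i = 0
scores = item_vecs @ user_vecs[i]
top10 = np.argsort(-scores)[:10]
print(top10)
```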
Figure 4: Netflix. Precision-recall curves (higher is better) for retrieving top-T items, T = 1, 5, 10, varying the number of hashes K from 64 to 512. We compare L2-ALSH (using the parameters recommended in [18]) with our proposed Sign-ALSH using two sets of parameters: (m = 2, U = 0.75) and (m = 3, U = 0.85). Sign-ALSH is noticeably better.

6.2 Evaluations

In this section, we show how the rankings of the two ALSH schemes, L2-ALSH and Sign-ALSH, correlate with the top-T inner products.
Given a user i and its corresponding user vector u_i, we compute the top-T gold standard items based on the actual inner products u_i^T v_j, ∀ j. We then generate K different hash codes of the vector u_i and of all the item vectors v_j, and compute

    Matches_j = Σ_{t=1}^{K} 1( h_t(u_i) = h_t(v_j) ),    (27)

where 1 is the indicator function and the subscript t is used to distinguish independent draws of h. Based on Matches_j we rank all the items. Ideally, for a better hashing scheme, Matches_j should be higher for items having higher inner products with the given user u_i. This procedure generates a sorted list of all the items for a given user vector u_i, corresponding to each hash function under consideration.

For L2-ALSH, we used the same parameters used and recommended in [18]. For Sign-ALSH, we used the two recommended choices shown in Section 4.3, namely (U = 0.75, m = 2) and (U = 0.85, m = 3). It should be noted that Sign-ALSH does not have the parameter r.

We compute the precision and recall of the top-T items for T ∈ {1, 5, 10}, obtained from the sorted list based on Matches. To compute precision and recall, we start at the top of the ranked item list and walk down in order. Suppose we are at the k-th ranked item: we check whether this item belongs to the gold standard top-T list. If it is one of the top-T gold standard items, we increment the count of relevant seen by 1; otherwise we move on to item k + 1. By the k-th step, we have already seen k items, so the total number of items seen is k. The precision and recall at that point are then computed as:

    Precision = relevant seen / k,    Recall = relevant seen / T

Figure 5: MovieLens. Recall-FIP (Fraction of Inner Products) curves for top-1, top-5, and top-10, comparing Sign-ALSH with L2-ALSH. We used the recommended parameters for L2-ALSH [18]. For Sign-ALSH, we used m = 2 and U = 0.75.

We show performance for K ∈ {64, 128, 256, 512}. Note that it is important to balance both precision and recall. The method which obtains higher precision at a given recall is superior. Higher precision indicates higher ranking of the relevant items. We report precisions and recalls averaged over 2000 randomly chosen users. The plots for the MovieLens and Netflix datasets are shown in Figure 3 and Figure 4, respectively. We can clearly see that our proposed Sign-ALSH scheme gives significantly higher precision-recall curves than the L2-ALSH scheme, indicating better correlation of the top neighbors under inner products with Sign-ALSH compared to L2-ALSH. In addition, there is not much difference between the two combinations of the parameters U and m in Sign-ALSH. The results are very consistent across both datasets.

7 LSH Bucketing Experiments

In this section, we evaluate the actual savings in the number of inner product evaluations for recommending top-T items on the MovieLens dataset. For this, we implemented the standard (K, L) algorithms of [9], where K is the number of hashes in each hash table and L is the total number of tables. For each query point, the returned results are the union of the matches in all L tables. To find the top-T items, we need to compute the actual inner products only on the candidate items retrieved by the bucketing procedure. In this experiment, we choose T ∈ {1, 5, 10} and compute the recall value for each combination of (T, K, L) for every query.
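A minimal sketch (ours, not the authors' implementation) of the Section 6.2 ranking procedure on random toy vectors: Sign-ALSH transforms, K SRP hash codes, the Matches count of Eq. (27), and the precision/recall walk.

```python
import numpy as np

rng = np.random.default_rng(5)
D, n_items, K, T = 20, 500, 64, 10
m, U = 2, 0.75

u = rng.standard_normal(D)               # a user vector
V = rng.standard_normal((n_items, D))    # item vectors
M = np.linalg.norm(V, axis=1).max()      # radius of the item space

scale = lambda v: (U / M) * v                                        # T(x), Eq. (21)
P = lambda v: np.concatenate([v, [0.5 - (v @ v) ** (2 ** i) for i in range(m)]])
Q = lambda v: np.concatenate([v, np.zeros(m)])

gold = set(np.argsort(-(V @ u))[:T])     # top-T by true inner product

A = rng.standard_normal((K, D + m))      # K independent SRP projections
h_u = np.sign(A @ Q(scale(u)))
h_V = np.sign(np.stack([P(scale(v)) for v in V]) @ A.T)
matches = (h_V == h_u).sum(axis=1)       # Matches_j, Eq. (27)

relevant_seen = 0
for k, j in enumerate(np.argsort(-matches)[:50], start=1):
    relevant_seen += j in gold           # walk down the ranked list
    precision, recall = relevant_seen / k, relevant_seen / T
print(precision, recall)                 # precision/recall after 50 items
```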
For example, given a query q and a (K, L)-LSH scheme, if T = 10 and only 5 of the true top-10 data points are retrieved, the recall will be 50% for this (T, K, L). At the same time, we can also compute the FIP (fraction of inner products):

    FIP = ( (K × L) + Total Retrieved ) / Total Items    (28)

which is basically the total number of inner product evaluations (where K × L represents the cost of hashing), normalized by the total number of items in the repository. Thus, for each q and (T, K, L), we can compute two values: recall and FIP. We also need a way to aggregate the results over all queries.

Typically, the performance of a bucketing algorithm is very sensitive to the choice of the hashing parameters K and L. Ideally, to find the best K and L, we need to know the operating threshold S_0 and the approximation ratio c in advance. Unfortunately, the data and the queries are very diverse, and therefore for retrieving top-T near neighbors there is no common fixed threshold S_0 and approximation ratio c that works for different queries.

Our goal is to compare the hashing schemes while minimizing the effect of K and L on the evaluation. To remove the effect of K and L, we perform rigorous evaluations over a wide range of K and L, which includes the optimal choices at various thresholds. For both hashing schemes, we then select the best performing K and L and report the performance. This involves running the bucketing experiments for thousands of combinations and then choosing the best K and L to marginalize the effect of the parameters in the comparisons. This ensures that our evaluation is fair.

We choose the following scheme. For each (T, K, L), we compute the averaged recall and averaged FIP over all queries. Then, for each "target" recall level (and T), we find the (K, L) which produces the best (lowest) averaged FIP.
This way, for each T, we can compute an "FIP-recall" curve, which can be used to compare Sign-ALSH with L2-ALSH. We use K ∈ {4, 5, ..., 20} and L ∈ {1, 2, 3, ..., 200}. The results are summarized in Figure 5. We can clearly see from the plots that, to achieve the same recall for top-T, the Sign-ALSH scheme needs fewer computations than L2-ALSH.

8 Conclusion

The MIPS (maximum inner product search) problem has numerous important applications in machine learning, databases, and information retrieval. [18] developed the framework of Asymmetric LSH and provided an explicit scheme (L2-ALSH) for approximate MIPS in sublinear time. In this study, we present another asymmetric transformation scheme (Sign-ALSH) which converts the problem of maximum inner products into the problem of maximum correlation search, which is subsequently solved by sign random projections. Theoretical analysis and experimental study demonstrate that Sign-ALSH can be noticeably more advantageous than L2-ALSH.

Acknowledgement

The research is supported in part by ONR-N00014-13-1-0764, NSF-III-1360971, AFOSR-FA9550-13-1-0137, and NSF-Bigdata-1419210. The method and theoretical analysis for Sign-ALSH were conducted right after the initial submission of our first work on ALSH [18] in February 2014. The intensive experiments (especially the LSH bucketing experiments), however, were not fully completed until June 2014 due to the demand on computational resources, because we exhaustively experimented with a wide range of K (number of hashes) and L (number of tables) for implementing (K, L)-LSH schemes. Here, we also would like to thank the computing support team (LCSR) at the Rutgers CS department as well as the IT support staff at the Rutgers Statistics department for setting up the workstations, especially the server with 1.5TB memory.

References

[1] M. S. Charikar.
Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, Montreal, Quebec, Canada, 2002.

[2] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems, pages 39–46. ACM, 2010.

[3] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253–262, Brooklyn, NY, 2004.

[4] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1814–1821. IEEE, 2013.

[5] W. Dong, M. Charikar, and K. Li. Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces. In SIGIR, pages 123–130, 2008.

[6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010.

[7] J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, 23(9):881–890, 1974.

[8] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115–1145, 1995.

[9] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998.

[10] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural svms. Machine Learning, 77(1):27–59, 2009.

[11] N. Koenigstein, P. Ram, and Y. Shavitt. Efficient retrieval of recommendations in a matrix factorization framework. In CIKM, pages 535–544, 2012.

[12] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.

[13] P. Li, M. Mitzenmacher, and A. Shrivastava. Coding for random projections. In ICML, 2014.

[14] P. Li, M. Mitzenmacher, and A. Shrivastava. Coding for random projections and approximate near neighbor search. Technical report, arXiv:1403.8144, 2014.

[15] B. Neyshabur, N. Srebro, R. Salakhutdinov, Y. Makarychev, and P. Yadollahpour. The power of asymmetry in binary hashing. In NIPS, Lake Tahoe, NV, 2013.

[16] P. Ram and A. G. Gray. Maximum inner-product search using cone trees. In KDD, pages 931–939, 2012.

[17] A. Shrivastava and P. Li. Beyond pairwise: Provably fast algorithms for approximate k-way similarity search. In NIPS, Lake Tahoe, NV, 2013.

[18] A. Shrivastava and P. Li. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). Technical report, arXiv:1405.5869 (To appear in NIPS 2014. Initially submitted to KDD 2014), 2014.

[19] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, pages 194–205, 1998.
