Improved Densification of One Permutation Hashing
Authors: Anshumali Shrivastava, Ping Li
Anshumali Shrivastava
Department of Computer Science, Computing and Information Science, Cornell University, Ithaca, NY 14853, USA
anshu@cs.cornell.edu

Ping Li
Department of Statistics and Biostatistics, Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA
pingli@stat.rutgers.edu

Abstract

The existing work on densification of one permutation hashing [24] reduces the query processing cost of the $(K,L)$-parameterized Locality Sensitive Hashing (LSH) algorithm with minwise hashing from $O(dKL)$ to merely $O(d + KL)$, where $d$ is the number of nonzeros of the data vector, $K$ is the number of hashes in each hash table, and $L$ is the number of hash tables. While that is a substantial improvement, our analysis reveals that the existing densification scheme in [24] is sub-optimal. In particular, there is not enough randomness in that procedure, which affects its accuracy on very sparse datasets. In this paper, we provide a new densification procedure which is provably better than the existing scheme [24]. This improvement is more significant for very sparse datasets, which are common over the web. The improved technique has the same cost of $O(d + KL)$ for query processing, thereby making it strictly preferable over the existing procedure. Experimental evaluations on public datasets, in the task of hashing based near neighbor search, support our theoretical findings.

1 Introduction

Binary representations are common for high dimensional sparse data over the web [8, 25, 26, 1], especially for text data represented by high-order $n$-grams [4, 12]. Binary vectors can also be equivalently viewed as sets over the universe of all the features, containing only the locations of the non-zero entries.
Given two sets $S_1, S_2 \subseteq \Omega = \{1, 2, ..., D\}$, a popular measure of similarity between sets (or binary vectors) is the resemblance $R$, defined as
$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a}, \quad (1)$$
where $f_1 = |S_1|$, $f_2 = |S_2|$, and $a = |S_1 \cap S_2|$.

It is well-known that minwise hashing belongs to the Locality Sensitive Hashing (LSH) family [5, 9]. The method applies a random permutation $\pi: \Omega \to \Omega$ on the given set $S$, and stores the minimum value after the permutation mapping. Formally,
$$h_\pi(S) = \min(\pi(S)). \quad (2)$$
Given sets $S_1$ and $S_2$, it can be shown by elementary probability arguments that
$$Pr\left(h_\pi(S_1) = h_\pi(S_2)\right) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R. \quad (3)$$
The probability of collision (equality of hash values) under minwise hashing is thus equal to the similarity of interest $R$. This property, also known as the LSH property [14, 9], makes minwise hash functions $h_\pi$ suitable for creating hash buckets, which leads to sublinear algorithms for similarity search. Because of this same LSH property, minwise hashing is a popular indexing technique for a variety of large-scale data processing applications, including duplicate detection [4, 13], all-pair similarity [3], fast linear learning [19], temporal correlation [10], 3-way similarity & retrieval [17, 23], graph algorithms [6, 11, 21], and more.

Querying with a standard $(K,L)$-parameterized LSH algorithm [14], for fast similarity search, requires computing $K \times L$ min-hash values per query, where $K$ is the number of hashes in each hash table and $L$ is the number of hash tables. In theory, the value of $KL$ grows with the data size [14]. In practice, this number typically ranges from a few hundred to a few thousand.
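The collision property of Eqs.(2)-(3) is easy to verify numerically. The following sketch is a minimal illustration (not the authors' implementation; the sets, universe size, and trial count are arbitrary choices of ours): it estimates $Pr(h_\pi(S_1) = h_\pi(S_2))$ over random permutations and compares it to the resemblance $R$.

```python
import random

def minwise_hash(s, perm):
    # h_pi(S) = min(pi(S)): the smallest permuted index of any element of S (Eq.(2))
    return min(perm[x] for x in s)

def resemblance(s1, s2):
    # R = |S1 ∩ S2| / |S1 ∪ S2| (Eq.(1))
    return len(s1 & s2) / len(s1 | s2)

D = 200                      # size of the universe Omega (arbitrary for this demo)
S1 = {1, 5, 7, 42, 99}
S2 = {1, 5, 8, 42, 100}
R = resemblance(S1, S2)      # a = 3, f1 = f2 = 5, so R = 3/7

# Empirical collision rate over random permutations approaches R (Eq.(3)).
rng = random.Random(0)
trials = 5000
matches = 0
for _ in range(trials):
    perm = list(range(D))
    rng.shuffle(perm)
    matches += minwise_hash(S1, perm) == minwise_hash(S2, perm)
estimate = matches / trials  # close to R = 3/7 ≈ 0.4286
```

Each trial here materializes a full permutation of $\Omega$, which is exactly the $O(d)$-per-hash cost that the densification schemes discussed below avoid.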
Thus, processing a single query for near-neighbor search requires evaluating hundreds or thousands of independent permutations $\pi$ (or cheaper universal approximations to permutations [7, 22, 20]) over the given data vector. If $d$ denotes the number of non-zeros in the query vector, then the query preprocessing cost is $O(dKL)$, which is also the bottleneck step in the LSH algorithm [14]. Query time (latency) is crucial in many user-facing applications, such as search.

Linear learning with $b$-bit minwise hashing [19] requires multiple evaluations (say $k$) of $h_\pi$ for a given data vector. Computing $k$ different min-hashes of the test data costs $O(dk)$, while after processing, classifying this data vector (with SVM or logistic regression) only requires a single inner product with the weight vector, which is $O(k)$. Again, the bottleneck step during testing (prediction) is the evaluation of the $k$ min-hashes. Testing time directly translates into the latency of on-line classification systems.

The idea of storing $k$ contiguous minimum values after one single permutation [4, 15, 16] leads to hash values which do not satisfy the LSH property, because the hashes are not properly aligned. The estimators are also not linear, and therefore they do not lead to a feature representation for linear learning with resemblance. This is a serious limitation.

Recently it was shown that a "rotation" technique [24] for densifying sparse sketches from one permutation hashing [18] solves the problem of costly processing with minwise hashing (see Sec. 2). The scheme only requires a single permutation and generates $k$ different hash values satisfying the LSH property (i.e., Eq.(3)) in linear time $O(d + k)$, thereby reducing a factor of $d$ in the processing cost compared to the original minwise hashing.
Our Contributions: In this paper, we argue that the existing densification scheme [24] is not the optimal way of densifying the sparse sketches of one permutation hashing at the given processing cost. In particular, we provide a provably better densification scheme for generating $k$ hashes with the same processing cost of $O(d + k)$. Our contributions can be summarized as follows.

• Our detailed variance analysis of the hashes obtained from the existing densification scheme [24] reveals that there is not enough randomness in that procedure, which leads to high variance on very sparse datasets.

• We provide a new densification scheme for one permutation hashing with provably smaller variance than the scheme in [24]. The improvement becomes more significant for very sparse datasets, which are common in practice. The improved scheme retains the computational complexity of $O(d + k)$ for computing $k$ different hash evaluations of a given vector.

• We provide experimental evidence on publicly available datasets demonstrating the superiority of the improved densification procedure over the existing scheme, in the task of resemblance estimation as well as the task of near neighbor retrieval with LSH.

2 Background

2.1 One Permutation Hashing

As illustrated in Figure 1, instead of conducting $k$ independent permutations, one permutation hashing [18] uses only one permutation and partitions the (permuted) feature space into $k$ bins. In other words, a single permutation $\pi$ is used to first shuffle the given binary vector, and then the shuffled vector is binned into $k$ evenly spaced bins. The $k$ minimums, computed for each bin separately, are the $k$ different hash values. Obviously, empty bins are possible.
Figure 1: One permutation hashes [18] for vectors $S_1$ and $S_2$ using a single permutation $\pi$. The permuted space $\{0, 1, ..., 23\}$ ($D = 24$) is divided into $k = 6$ bins of length 4. $\pi(S_1)$ has non-zeros at positions $\{5, 7, 14, 15, 16, 18, 21, 22\}$ and $\pi(S_2)$ at positions $\{5, 6, 7, 12, 14, 16, 17\}$, giving $OPH(S_1) = [E, 1, E, 2, 0, 1]$ and $OPH(S_2) = [E, 1, E, 0, 0, E]$. For bins not containing any non-zeros, we use the special symbol "E".

For example, in Figure 1, $\pi(S_1)$ and $\pi(S_2)$ denote the state of the binary vectors $S_1$ and $S_2$ after applying permutation $\pi$. These shuffled vectors are then divided into 6 bins of length 4 each. We start the numbering from 0. We look into each bin and store the corresponding minimum non-zero index. For bins not containing any non-zeros, we use a special symbol "E" to denote empty bins. We also denote
$$M_j(\pi(S)) = \pi(S) \cap \left[\frac{Dj}{k}, \frac{D(j+1)}{k}\right) \quad (4)$$
We assume for the rest of the paper that $D$ is divisible by $k$; otherwise we can always pad extra dummy features. We define $OPH_j$ ("OPH" for one permutation hashing) as
$$OPH_j(\pi(S)) = \begin{cases} E, & \text{if } M_j(\pi(S)) = \emptyset \\ \min\left(M_j(\pi(S))\right) \bmod \frac{D}{k}, & \text{otherwise} \end{cases} \quad (5)$$
i.e., $OPH_j(\pi(S))$ denotes the minimum value in Bin $j$ under permutation mapping $\pi$, as shown in the example in Figure 1. If the intersection is null, i.e., $\pi(S) \cap \left[\frac{Dj}{k}, \frac{D(j+1)}{k}\right) = \emptyset$, then $OPH_j(\pi(S)) = E$.

Consider the events of "simultaneously empty bin" $I^j_{emp} = 1$ and "simultaneously non-empty bin" $I^j_{emp} = 0$, between given vectors $S_1$ and $S_2$, defined as:
$$I^j_{emp} = \begin{cases} 1, & \text{if } OPH_j(\pi(S_1)) = OPH_j(\pi(S_2)) = E \\ 0, & \text{otherwise} \end{cases} \quad (6)$$
Simultaneously empty bins are only defined with respect to two sets $S_1$ and $S_2$. In Figure 1, $I^0_{emp} = 1$ and $I^2_{emp} = 1$, while $I^1_{emp} = I^3_{emp} = I^4_{emp} = I^5_{emp} = 0$. Bin 5 is empty only for $S_2$ and not for $S_1$, so $I^5_{emp} = 0$.
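The binning of Eq.(5) can be sketched in a few lines. This is a minimal illustration (the helper name `oph` is ours); it reproduces the Figure 1 example with $D = 24$ and $k = 6$, taking the permuted non-zero positions as input.

```python
def oph(nonzeros, D, k):
    # One permutation hashing: split the permuted space [0, D) into k bins of
    # width D/k; each bin stores its minimum value mod D/k, or 'E' if empty (Eq.(5)).
    w = D // k
    bins = ['E'] * k
    for v in nonzeros:        # v ranges over pi(S), the permuted non-zero positions
        j, r = divmod(v, w)   # bin index and offset within the bin
        if bins[j] == 'E' or r < bins[j]:
            bins[j] = r
    return bins

# Figure 1 example: permuted non-zero positions of S1 and S2
pi_S1 = {5, 7, 14, 15, 16, 18, 21, 22}
pi_S2 = {5, 6, 7, 12, 14, 16, 17}
print(oph(pi_S1, 24, 6))  # ['E', 1, 'E', 2, 0, 1]
print(oph(pi_S2, 24, 6))  # ['E', 1, 'E', 0, 0, 'E']
```

The single pass over the $d$ non-zeros is what gives one permutation hashing its $O(d)$ sketching cost for all $k$ bins at once.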
Given a bin number $j$, if it is not simultaneously empty ($I^j_{emp} = 0$) for both vectors $S_1$ and $S_2$, [18] showed
$$Pr\left(OPH_j(\pi(S_1)) = OPH_j(\pi(S_2)) \,\middle|\, I^j_{emp} = 0\right) = R \quad (7)$$
On the other hand, when $I^j_{emp} = 1$, no such guarantee exists: a collision then carries no information about the similarity $R$. Since the event $I^j_{emp} = 1$ can only be determined given the two vectors $S_1$ and $S_2$ and the materialization of $\pi$, one permutation hashing cannot be directly used for indexing, especially when the data are very sparse. In particular, $OPH_j(\pi(S))$ does not lead to a valid LSH hash function because of the coupled event $I^j_{emp} = 1$ in (7). The simple strategy of ignoring empty bins leads to biased estimators of resemblance and shows poor performance [24]. For the same reason, one permutation hashing cannot be directly used to extract random features for linear learning with the resemblance kernel.

2.2 Densifying One Permutation Hashing for Indexing and Linear Learning

[24] proposed a "rotation" scheme that assigns new values to all the empty bins generated from one permutation hashing, in an unbiased fashion. The rotation scheme for filling the empty bins from Figure 1 is shown in Figure 2. The idea is that every empty bin borrows the value of the closest non-empty bin in the clockwise direction (circular right hand side), added with an offset $C$.

Figure 2: Densification by "rotation" for filling empty bins generated from one permutation hashing [24]. Every empty bin is assigned the value of the closest non-empty bin towards the right (circular), with an offset $C$. For the configuration shown in Figure 1, the new assigned values of the empty bins after densification are $H(S_1) = [1+C, 1, 2+C, 2, 0, 1]$ and $H(S_2) = [1+C, 1, 0+C, 0, 0, 1+2C]$.
Given the configuration in Figure 1, for Bin 2 corresponding to $S_1$, we borrow the value 2 from Bin 3 along with an additional offset of $C$. Interesting is the case of Bin 5 for $S_2$: its circular-right neighbor is Bin 0, which was itself empty. Bin 0 borrows from Bin 1, acquiring value $1+C$; Bin 5 borrows this value with another offset $C$. The value of Bin 5 finally becomes $1+2C$.

The value $C = \frac{D}{k} + 1$ enforces proper alignment and ensures no unexpected collisions. Without this offset $C$, Bin 5, which was not simultaneously empty, would after reassignment have value 1 for both $S_1$ and $S_2$. This would be an error, as initially there was no collision (note $I^5_{emp} = 0$). Multiplication by the distance to the non-empty bin from which the value was borrowed ensures that the new values of simultaneously empty bins ($I^j_{emp} = 1$), at any location $j$ for $S_1$ and $S_2$, never match if their new values come from different bin numbers.

Formally, the hashing scheme with "rotation", denoted by $H$, is defined as:
$$H_j(S) = \begin{cases} OPH_j(\pi(S)), & \text{if } OPH_j(\pi(S)) \neq E \\ OPH_{(j+t) \bmod k}(\pi(S)) + tC, & \text{otherwise} \end{cases} \quad (8)$$
$$t = \min z, \ \text{s.t.}\ OPH_{(j+z) \bmod k}(\pi(S)) \neq E \quad (9)$$
Here $C = \frac{D}{k} + 1$ is a constant.

This densification scheme ensures that whenever $I^j_{emp} = 1$, i.e., Bin $j$ is simultaneously empty for the two sets $S_1$ and $S_2$ under consideration, the newly assigned value mimics the collision probability of the nearest simultaneously non-empty bin towards the right (circular) side, making the final collision probability equal to $R$, irrespective of whether $I^j_{emp} = 0$ or $I^j_{emp} = 1$. [24] proved this fact as a theorem.

Theorem 1 [24]
$$Pr\left(H_j(S_1) = H_j(S_2)\right) = R \quad (10)$$
Theorem 1 implies that $H$ satisfies the LSH property and hence is suitable for indexing based sublinear similarity search.
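Eqs.(8)-(9) translate directly into a scan over the bins. The sketch below is a minimal illustration (function name ours); it reproduces Figure 2 from the OPH sketches of Figure 1, where $C = D/k + 1 = 5$.

```python
def densify_rotation(bins, D, k):
    # Existing scheme [24]: each empty bin copies the nearest non-empty bin to
    # its circular right, at distance t, and adds the offset t*C (Eqs.(8)-(9)).
    C = D // k + 1
    out = list(bins)
    for j in range(k):
        if bins[j] == 'E':
            t = 1
            while bins[(j + t) % k] == 'E':
                t += 1
            out[j] = bins[(j + t) % k] + t * C
    return out

# OPH sketches from Figure 1 (D = 24, k = 6, so C = 5)
print(densify_rotation(['E', 1, 'E', 2, 0, 1], 24, 6))    # [6, 1, 7, 2, 0, 1]  = [1+C, 1, 2+C, 2, 0, 1]
print(densify_rotation(['E', 1, 'E', 0, 0, 'E'], 24, 6))  # [6, 1, 5, 0, 0, 11] = [1+C, 1, 0+C, 0, 0, 1+2C]
```

For transparency this sketch rescans for each empty bin, which is quadratic in the worst case; the $O(d + k)$ bound in the paper is achieved with a single circular pass that carries the last seen non-empty value.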
Generating $KL$ different hash values of $H$ only requires $O(d + KL)$ time, which saves a factor of $d$ in the query processing cost compared to the cost of $O(dKL)$ with traditional minwise hashing. For fast linear learning [19] with $k$ different hash values, the new scheme only needs $O(d + k)$ testing (or prediction) time, compared to standard $b$-bit minwise hashing, which requires $O(dk)$ time for testing.

3 Variance Analysis of Existing Scheme

We first provide the variance analysis of the existing scheme [24]. Theorem 1 leads to an unbiased estimator of $R$ between $S_1$ and $S_2$ defined as:
$$\hat{R} = \frac{1}{k} \sum_{j=0}^{k-1} \mathbb{1}\{H_j(S_1) = H_j(S_2)\}. \quad (11)$$
Denote the number of simultaneously empty bins by
$$N_{emp} = \sum_{j=0}^{k-1} \mathbb{1}\{I^j_{emp} = 1\}, \quad (12)$$
where $\mathbb{1}$ is the indicator function. We partition the event $(H_j(S_1) = H_j(S_2))$ into two cases depending on $I^j_{emp}$. Let $M^N_j$ (non-empty match at $j$) and $M^E_j$ (empty match at $j$) be the events defined as:
$$M^N_j = \mathbb{1}\{I^j_{emp} = 0 \text{ and } H_j(S_1) = H_j(S_2)\} \quad (13)$$
$$M^E_j = \mathbb{1}\{I^j_{emp} = 1 \text{ and } H_j(S_1) = H_j(S_2)\} \quad (14)$$
Note that $M^N_j = 1 \implies M^E_j = 0$ and $M^E_j = 1 \implies M^N_j = 0$. Combined with Theorem 1, this implies
$$E(M^N_j \mid I^j_{emp} = 0) = E(M^E_j \mid I^j_{emp} = 1) = E(M^E_j + M^N_j) = R \quad \forall j \quad (15)$$
It is not difficult to show that
$$E\left(M^N_j M^N_i \,\middle|\, i \neq j,\ I^j_{emp} = 0 \text{ and } I^i_{emp} = 0\right) = R\tilde{R}, \quad \text{where } \tilde{R} = \frac{a-1}{f_1 + f_2 - a - 1}.$$
Using these new events, we have
$$\hat{R} = \frac{1}{k} \sum_{j=0}^{k-1} \left(M^E_j + M^N_j\right) \quad (16)$$
We are interested in computing
$$Var(\hat{R}) = E\left(\frac{1}{k} \sum_{j=0}^{k-1} \left(M^E_j + M^N_j\right)\right)^2 - R^2 \quad (17)$$
For notational convenience we will use $m$ to denote the event $k - N_{emp} = m$, i.e., the expression $E(\,\cdot \mid m)$ means $E(\,\cdot \mid k - N_{emp} = m)$.
To simplify the analysis, we will first compute the conditional expectation
$$f(m) = E\left(\left(\frac{1}{k} \sum_{j=0}^{k-1} \left(M^E_j + M^N_j\right)\right)^2 \,\middle|\, m\right) \quad (18)$$
By expansion and linearity of expectation, we obtain
$$k^2 f(m) = E\left(\sum_{i \neq j} M^N_i M^N_j \,\middle|\, m\right) + E\left(\sum_{i \neq j} M^N_i M^E_j \,\middle|\, m\right) + E\left(\sum_{i \neq j} M^E_i M^E_j \,\middle|\, m\right) + E\left(\sum_{j=0}^{k-1} (M^N_j)^2 + (M^E_j)^2 \,\middle|\, m\right)$$
$M^N_j = (M^N_j)^2$ and $M^E_j = (M^E_j)^2$, as they are indicator functions and can only take values 0 and 1. Hence,
$$E\left(\sum_{j=0}^{k-1} (M^N_j)^2 + (M^E_j)^2 \,\middle|\, m\right) = kR \quad (19)$$
The values of the remaining three terms are given by the following three lemmas; see the proofs in the Appendix.

Lemma 1
$$E\left(\sum_{i \neq j} M^N_i M^N_j \,\middle|\, m\right) = m(m-1) R\tilde{R} \quad (20)$$

Lemma 2
$$E\left(\sum_{i \neq j} M^N_i M^E_j \,\middle|\, m\right) = 2m(k-m)\left[\frac{R}{m} + \frac{(m-1) R\tilde{R}}{m}\right] \quad (21)$$

Lemma 3
$$E\left(\sum_{i \neq j} M^E_i M^E_j \,\middle|\, m\right) = (k-m)(k-m-1)\left[\frac{2R}{m+1} + \frac{(m-1) R\tilde{R}}{m+1}\right] \quad (22)$$

Combining the expressions from the above three lemmas and Eq.(19), we can compute $f(m)$. Taking a further expectation over the values of $m$ to remove the conditional dependency, the variance of $\hat{R}$ is given in the next theorem.

Theorem 2
$$Var(\hat{R}) = \frac{R}{k} + \frac{A\,R}{k} + \frac{B\,R\tilde{R}}{k} - R^2 \quad (23)$$
$$A = 2E\left[\frac{N_{emp}}{k - N_{emp} + 1}\right], \qquad B = (k+1)\,E\left[\frac{k - N_{emp} - 1}{k - N_{emp} + 1}\right]$$
The theoretical values of $A$ and $B$ can be computed using the probability of the event $Pr(N_{emp} = i)$, denoted by $P_i$, which is given by Theorem 3 in [18]:
$$P_i = \sum_{s=0}^{k-i} (-1)^s \frac{k!}{i!\, s!\, (k-i-s)!} \prod_{t=0}^{f_1+f_2-a-1} \frac{D\left(1 - \frac{i+s}{k}\right) - t}{D - t}$$

4 Intuition for the Improved Scheme

Figure 3: Illustration of the existing densification scheme [24]. The 3 boxes indicate 3 simultaneously non-empty bins (labeled $a$, $b$, $c$). Any simultaneously empty bin has 4 possible positions, shown by blank spaces. An arrow indicates the simultaneously non-empty bin picked by a simultaneously empty bin occurring in the corresponding position.
A simultaneously empty bin occurring in position 3 uses the information from Bin $c$. The randomness is in the position numbers of these bins, which depend on $\pi$.

Consider the situation in Figure 3, where there are 3 simultaneously non-empty bins ($I_{emp} = 0$) for given $S_1$ and $S_2$. The actual position numbers of these simultaneously non-empty bins are random. The simultaneously empty bins ($I_{emp} = 1$) can occur in any order in the 4 blank spaces. The arrows in the figure show the simultaneously non-empty bins being picked by the simultaneously empty bins ($I_{emp} = 1$) located in the shown blank spaces. The randomness in the system is in the ordering of simultaneously empty and simultaneously non-empty bins.

Given a simultaneously non-empty Bin $t$ ($I^t_{emp} = 0$), the probability that it is picked by a given simultaneously empty Bin $i$ ($I^i_{emp} = 1$) is exactly $\frac{1}{m}$. This is because the permutation $\pi$ is perfectly random and, given $m$, any ordering of the $m$ simultaneously non-empty bins and the $k-m$ simultaneously empty bins is equally likely. Hence, we obtain the term $\left[\frac{R}{m} + \frac{(m-1)R\tilde{R}}{m}\right]$ in Lemma 2.

On the other hand, under the given scheme, the probability that two simultaneously empty bins $i$ and $j$ (i.e., $I^i_{emp} = 1$, $I^j_{emp} = 1$) both pick the same simultaneously non-empty Bin $t$ ($I^t_{emp} = 0$) is given by (see the proof of Lemma 3)
$$p = \frac{2}{m+1} \quad (24)$$
The value of $p$ is high because there is not enough randomness in the selection procedure. Since $\tilde{R} \leq 1$ and hence $R\tilde{R} \leq R$, if we can reduce this probability $p$ then we reduce the value of $[pR + (1-p)R\tilde{R}]$. This directly reduces the value of $(k-m)(k-m-1)\left[\frac{2R}{m+1} + \frac{(m-1)R\tilde{R}}{m+1}\right]$ as given by Lemma 3. The reduction scales with $N_{emp}$.

For every simultaneously empty bin, the current scheme uses the information of the closest non-empty bin to the right.
Because of the symmetry in the arguments, changing the direction from right to left also leads to a valid densification scheme with exactly the same variance. This is where we can infuse randomness without violating the alignment necessary for unbiased densification. We show that randomly switching between left and right provably improves (reduces) the variance, by making the sampling procedure of simultaneously non-empty bins more random.

5 The Improved Densification Scheme

Figure 4: Illustration of the improved densification scheme. For every simultaneously empty bin in a blank position, instead of always choosing the simultaneously non-empty bin from the right, the new scheme randomly chooses to go either left or right. A simultaneously empty bin occurring at position 2 uniformly chooses among Bin $a$ or Bin $b$.

Our proposal is explained in Figure 4. Instead of using the value of the closest non-empty bin to the right (circular), we choose to go either left or right with probability $\frac{1}{2}$. This adds more randomness to the selection procedure.

In the new scheme, we only need to store 1 random bit for each bin, which decides the direction (circular left or circular right) in which to search for the closest non-empty bin. The new assignment of the empty bins from Figure 1 is shown in Figure 5. Every bin number $i$ has an i.i.d. Bernoulli random variable $q_i$ (1 bit) associated with it. If Bin $i$ is empty, we check the value of $q_i$. If $q_i = 1$, we move circular right to find the closest non-empty bin and use its value. When $q_i = 0$, we move circular left.

Figure 5: Assigned values of the empty bins from Figure 1 using the improved densification procedure, with direction bits $q = [0, 1, 0, 0, 1, 1]$: $H^+(S_1) = [1+C, 1, 1+C, 2, 0, 1]$ and $H^+(S_2) = [0+2C, 1, 1+C, 0, 0, 1+2C]$.
Every empty Bin $i$ uses the value of the closest non-empty bin, towards circular left or circular right depending on the random direction bit $q_i$, with offset $C$.

For $S_1$, we have $q_0 = 0$ for empty Bin 0; we therefore move circular left and borrow the value from Bin 5 with offset $C$, making the final value $1+C$. Similarly, for empty Bin 2 we have $q_2 = 0$, and we use the value of Bin 1 (circular left) added with $C$. For $S_2$ and Bin 0, we have $q_0 = 0$ and the next circular-left bin is Bin 5, which is empty, so we continue and borrow the value from Bin 4, which is 0, with offset $2C$. The factor of 2 arises because we traveled 2 bins to locate the first non-empty bin. For Bin 2, again $q_2 = 0$ and the closest circular-left non-empty bin is Bin 1, at distance 1, so the new value of Bin 2 for $S_2$ is $1+C$. For Bin 5, $q_5 = 1$, so we go circular right and find the non-empty Bin 1 at distance 2. The new hash value of Bin 5 is therefore $1+2C$. Note that the non-empty bins remain unchanged.

Formally, let $q_j$, $j \in \{0, 1, 2, ..., k-1\}$, be $k$ i.i.d. Bernoulli random variables with $q_j = 1$ with probability $\frac{1}{2}$. The improved hash function $H^+$ is given by
$$H^+_j(S) = \begin{cases} OPH_{(j-t_1) \bmod k}(\pi(S)) + t_1 C, & \text{if } q_j = 0 \text{ and } OPH_j(\pi(S)) = E \\ OPH_{(j+t_2) \bmod k}(\pi(S)) + t_2 C, & \text{if } q_j = 1 \text{ and } OPH_j(\pi(S)) = E \\ OPH_j(\pi(S)), & \text{otherwise} \end{cases} \quad (25)$$
where
$$t_1 = \min z, \ \text{s.t.}\ OPH_{(j-z) \bmod k}(\pi(S)) \neq E \quad (26)$$
$$t_2 = \min z, \ \text{s.t.}\ OPH_{(j+z) \bmod k}(\pi(S)) \neq E \quad (27)$$
with the same $C = \frac{D}{k} + 1$.

Computing $k$ hash evaluations with $H^+$ requires evaluating $\pi(S)$ followed by two passes over the $k$ bins from different directions. The total complexity of computing $k$ hash evaluations is again $O(d + k)$, which is the same as that of the existing densification scheme. We need an additional storage of $k$ bits (roughly hundreds or thousands in practice), which is practically negligible.
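Eq.(25) can be sketched the same way as the rotation scheme, with one direction bit per bin. This minimal illustration (function name ours) reproduces Figure 5 with $q = [0, 1, 0, 0, 1, 1]$, $D = 24$, $k = 6$, $C = 5$.

```python
def densify_improved(bins, D, k, q):
    # Improved scheme: empty bin j walks circular right if q[j] == 1 and
    # circular left if q[j] == 0, to the nearest non-empty bin at distance t,
    # and adds the offset t*C (Eqs.(25)-(27)).
    C = D // k + 1
    out = list(bins)
    for j in range(k):
        if bins[j] == 'E':
            step = 1 if q[j] == 1 else -1
            t = 1
            while bins[(j + step * t) % k] == 'E':
                t += 1
            out[j] = bins[(j + step * t) % k] + t * C
    return out

q = [0, 1, 0, 0, 1, 1]  # direction bits from Figure 5
print(densify_improved(['E', 1, 'E', 2, 0, 1], 24, 6, q))    # [6, 1, 6, 2, 0, 1]   = [1+C, 1, 1+C, 2, 0, 1]
print(densify_improved(['E', 1, 'E', 0, 0, 'E'], 24, 6, q))  # [10, 1, 6, 0, 0, 11] = [0+2C, 1, 1+C, 0, 0, 1+2C]
```

Note that the same bits $q$ must be shared across all data vectors: they are part of the hash function, not of the data.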
It is not difficult to show that $H^+$ satisfies the LSH property for resemblance, which we state as a theorem.

Theorem 3
$$Pr\left(H^+_j(S_1) = H^+_j(S_2)\right) = R \quad (28)$$
$H^+$ leads to an unbiased estimator of resemblance, $\hat{R}^+$:
$$\hat{R}^+ = \frac{1}{k} \sum_{j=0}^{k-1} \mathbb{1}\{H^+_j(S_1) = H^+_j(S_2)\}. \quad (29)$$

6 Variance Analysis of Improved Scheme

When $m = 1$ (an event with probability roughly $\left(\frac{1}{k}\right)^{f_1+f_2-a-1} \simeq 0$), i.e., only one simultaneously non-empty bin, both schemes are exactly the same. For simplicity of expressions, we will assume that the number of simultaneously non-empty bins is strictly greater than 1, i.e., $m > 1$. The general case has an extra term for $m = 1$, which makes the expression unnecessarily complicated without changing the final conclusion.

Following the notation of Sec. 3, we denote
$$M^{N+}_j = \mathbb{1}\{I^j_{emp} = 0 \text{ and } H^+_j(S_1) = H^+_j(S_2)\} \quad (30)$$
$$M^{E+}_j = \mathbb{1}\{I^j_{emp} = 1 \text{ and } H^+_j(S_1) = H^+_j(S_2)\} \quad (31)$$
The two expectations $E\left(\sum_{i \neq j} M^{N+}_i M^{N+}_j \mid m\right)$ and $E\left(\sum_{i \neq j} M^{N+}_i M^{E+}_j \mid m\right)$ are the same as given by Lemma 1 and Lemma 2 respectively, as all the arguments used to prove them still hold for the new scheme. The only change is in the term $E\left(\sum_{i \neq j} M^{E+}_i M^{E+}_j \mid m\right)$.

Lemma 4
$$E\left(\sum_{i \neq j} M^{E+}_i M^{E+}_j \,\middle|\, m\right) = (k-m)(k-m-1)\left[\frac{3R}{2(m+1)} + \frac{(2m-1)R\tilde{R}}{2(m+1)}\right] \quad (32)$$

The theoretical variance of the new estimator $\hat{R}^+$ is given by the following theorem.

Theorem 4
$$Var(\hat{R}^+) = \frac{R}{k} + \frac{A^+ R}{k^2} + \frac{B^+ R\tilde{R}}{k^2} - R^2 \quad (33)$$
$$A^+ = E\left[\frac{N_{emp}\left(4k - N_{emp} + 1\right)}{2(k - N_{emp} + 1)}\right]$$
$$B^+ = E\left[\frac{2k^3 + N^2_{emp} - N_{emp}(2k^2 + 2k + 1) - 2k}{2(k - N_{emp} + 1)}\right]$$

The new scheme reduces the value of $p$ (see Eq.(24)) from $\frac{2}{m+1}$ to $\frac{1.5}{m+1}$. As argued in Sec. 4, this reduces the overall variance. Here we state as a theorem that $Var(\hat{R}^+) \leq Var(\hat{R})$ always.
Theorem 5
$$Var(\hat{R}^+) \leq Var(\hat{R}) \quad (34)$$
More precisely,
$$Var(\hat{R}) - Var(\hat{R}^+) = E\left[\frac{N_{emp}(N_{emp} - 1)}{2k^2(k - N_{emp} + 1)}\right]\left[R - R\tilde{R}\right] \quad (35)$$

The probability of simultaneously empty bins increases with increasing sparsity of the dataset and with the total number of bins $k$. We can see from Theorem 5 that with more simultaneously empty bins, i.e., higher $N_{emp}$, the gain of the improved scheme $H^+$ over $H$ is larger. Hence, $H^+$ should be significantly better than the existing scheme for very sparse datasets, or in scenarios where we need a large number of hash values.

7 Evaluations

Our first experiment concerns the validation of the theoretical variances of the two densification schemes. The second experiment focuses on comparing the two schemes in the context of near neighbor search with LSH.

7.1 Comparisons of Mean Square Errors

We empirically verify the theoretical variances of $\hat{R}$ and $\hat{R}^+$ and their effects in many practical scenarios. To achieve this, we extracted 12 pairs of words (which cover a wide spectrum of sparsity and similarity) from a web-crawl dataset which consists of word representations from $2^{16}$ documents. Every word is represented as a binary vector (or set) of dimension $D = 2^{16}$, with a feature value of 1 indicating the presence of that word in the corresponding document. See Table 1 for detailed information on the data. For all 12 pairs of words, we estimate the resemblance using the two estimators $\hat{R}$ and $\hat{R}^+$.
Figure 6: Mean Square Error (MSE) of the old scheme $\hat{R}$ and the improved scheme $\hat{R}^+$, along with their theoretical values, on 12 word pairs (Table 1) from a web crawl dataset. Each panel plots MSE versus $k$ (number of hashes) on log-log axes for one word pair, with curves "Old", "Old-Theo", "Imp", and "Imp-Theo".

Table 1: Information on the 12 pairs of word vectors. Each word stands for the set of documents in which the word is contained. For example, "A" corresponds to the set of document IDs which contained the word "A".
Word 1     Word 2     f1      f2      R
HONG       KONG       940     948     0.925
RIGHTS     RESERVED   12,234  11,272  0.877
A          THE        39,063  42,754  0.644
UNITED     STATES     4,079   3,981   0.591
TOGO       GREENLAND  231     200     0.528
ANTILLES   ALBANIA    184     275     0.457
CREDIT     CARD       2,999   2,697   0.285
COSTA      RICO       773     611     0.234
LOW        PAY        2,936   2,828   0.112
VIRUSES    ANTIVIRUS  212     152     0.113
REVIEW     PAPER      3,197   1,944   0.078
FUNNIEST   ADDICT     68      77      0.028

We plot the empirical Mean Square Error (MSE) of both estimators with respect to $k$, the number of hash evaluations. To validate the theoretical variances (which equal the MSE, because the estimators are unbiased), we also plot the values of the theoretical variances computed from Theorem 2 and Theorem 4. The results are summarized in Figure 6.

From the plots we can see that the theoretical and the empirical MSE values overlap in both cases, validating both Theorem 2 and Theorem 4. When $k$ is small, both schemes have similar variance, but as $k$ increases the improved scheme always shows better variance. For very sparse pairs, we start seeing a significant difference in variance even for $k$ as small as 100. For a sparse pair, e.g., "TOGO" and "GREENLAND", the difference in variance between the two schemes is larger than for the dense pair "A" and "THE". This is in agreement with Theorem 5.

7.2 Near Neighbor Retrieval with LSH

In this experiment, we evaluate the two hashing schemes $H$ and $H^+$ on the standard $(K,L)$-parameterized LSH algorithm [14, 2] for retrieving near neighbors. Two publicly available sparse text datasets are described in Table 2.

Table 2: Dataset information.

Dataset   # dim       # nonzeros   # train    # query
RCV1      47,236      73           100,000    5,000
URL       3,231,961   115          90,000     5,000

In the $(K,L)$-parameterized LSH algorithm for near neighbor search, we generate $L$ different meta-hash functions. Each of these meta-hash functions is formed by concatenating $K$ different hash values as
$$B_j(S) = [h_{j1}(S);\ h_{j2}(S);\ ...;\ h_{jK}(S)], \quad (36)$$
where $h_{ij}$, $i \in \{1, 2, ..., K\}$, $j \in \{1, 2, ..., L\}$, are $KL$ realizations of the hash function under consideration. The $(K,L)$-parameterized LSH works in two phases:

1. Preprocessing Phase: We construct $L$ hash tables from the data by storing each element $S$ of the training set at location $B_j(S)$ in hash table $j$.

2. Query Phase: Given a query $Q$, we report the union of all the points in the buckets $B_j(Q)$, $j \in \{1, 2, ..., L\}$, where the union is over the $L$ hash tables.

For every dataset, based on the similarity levels, we chose $K$ following the standard recommendation. For this $K$, we show results for a set of values of $L$ depending on the recall values. Please refer to [2] for details on the implementation of LSH. Since both $H$ and $H^+$ have the same collision probability, the choice of $K$ and $L$ is the same in both cases.

For every query point, the gold-standard top 10 near neighbors from the training set are computed based on actual resemblance. We then compute the recall of these gold-standard neighbors and the total number of points retrieved by the $(K,L)$ bucketing scheme. We report the means computed over all the points in the query set. Since the experiments involve randomization, the final results presented are averaged over 10 independent runs. The recalls and the points retrieved per query are summarized in Figure 7.
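The two phases above can be sketched as follows. This is a minimal illustration with helper names of our choosing; the hash values are taken as precomputed lists of $KL$ integers per point (e.g., as produced by $H$ or $H^+$).

```python
from collections import defaultdict

def build_tables(all_hashes, K, L):
    # Preprocessing phase: L tables; the key in table j is the meta-hash
    # B_j(S) = (h_{j1}(S), ..., h_{jK}(S)), i.e. K concatenated hash values (Eq.(36)).
    tables = [defaultdict(list) for _ in range(L)]
    for pid, hs in enumerate(all_hashes):   # hs holds the K*L hash values of point pid
        for j in range(L):
            tables[j][tuple(hs[j * K:(j + 1) * K])].append(pid)
    return tables

def query(tables, query_hashes, K, L):
    # Query phase: report the union of the buckets B_j(Q) over all L tables.
    candidates = set()
    for j in range(L):
        candidates.update(tables[j].get(tuple(query_hashes[j * K:(j + 1) * K]), []))
    return candidates

# Toy run with K = 2, L = 2: each point carries K*L = 4 precomputed hash values.
hashes = [[1, 2, 3, 4], [1, 2, 9, 9], [5, 5, 3, 4]]
tables = build_tables(hashes, 2, 2)
print(query(tables, [1, 2, 3, 4], 2, 2))  # {0, 1, 2}
```

Concatenating $K$ hashes sharpens each table (collision probability $R^K$), while taking the union over $L$ tables recovers recall; the size of the returned candidate set is exactly the "points retrieved per query" metric reported in Figure 7.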
Figure 7: Average number of points scanned per query and the mean recall values of the top 10 near neighbors, obtained from the $(K,L)$-parameterized LSH algorithm using $H$ (old) and $H^+$ (Imp), on RCV1 ($K = 3$) and URL ($K = 15$), for varying numbers of tables $L$. Both schemes achieve the same recall, but $H^+$ reports fewer points compared to $H$. Results are averaged over 10 independent runs.

It is clear from Figure 7 that the improved hashing scheme $H^+$ achieves the same recall while retrieving fewer points compared to the old scheme $H$. To achieve 90% recall on the URL dataset, the old scheme retrieves around 3300 points per query on average, while the improved scheme only needs to check around 2700 points per query. For the RCV1 dataset, with $L = 200$ the old scheme retrieves around 3000 points and achieves a recall of 80%, while the same recall is achieved by the improved scheme after retrieving only about 2350 points per query. A good hash function provides the right balance between recall and the number of points retrieved. In particular, a hash function which achieves a given recall while retrieving fewer points is desirable, because it implies better precision. The above results clearly demonstrate the superiority of the indexing scheme with the improved hash function $H^+$ over the indexing scheme with $H$.

7.3 Why does $H^+$ retrieve fewer points than $H$?

The number of points retrieved by the $(K,L)$-parameterized LSH algorithm is directly related to the collision probability of the meta-hash function $B_j(\cdot)$ (Eq.(36)).
Given S_1 and S_2 with resemblance R, the higher the probability of the event B_j(S_1) = B_j(S_2) under a hashing scheme, the more points will be retrieved per table. The analysis of the variance (second moment) of the event B_j(S_1) = B_j(S_2) under H+ and H provides some reasonable insight. Recall that since the estimators under both hashing schemes are unbiased, the analysis of the first moment does not provide information in this regard.

E[ 1{H_{j1}(S_1) = H_{j1}(S_2)} × 1{H_{j2}(S_1) = H_{j2}(S_2)} ]
= E[ M^N_{j1} M^N_{j2} + M^N_{j1} M^E_{j2} + M^E_{j1} M^N_{j2} + M^E_{j1} M^E_{j2} ]

As we know from our analysis, the first three terms inside the expectation on the RHS of the above equation behave similarly for both H+ and H. The fourth term, E[M^E_{j1} M^E_{j2}], is likely to be smaller in the case of H+ because of the smaller value of p. We therefore see that H retrieves more points than necessary compared to H+. The difference is visible when empty bins dominate and M^E_1 M^E_2 = 1 is more likely. This happens in the case of sparse datasets, which are common in practice.

8 Conclusion

Analysis of the densification scheme for one permutation hashing, which reduces the processing time of minwise hashes, reveals a sub-optimality in the existing procedure. We provide a simple improved procedure which adds more randomness to the current densification technique, leading to a provably better scheme, especially for very sparse datasets. The improvement comes without any compromise in computation and only requires O(d + k) (linear) cost for generating k hash evaluations. We hope that our improved scheme will be adopted in practice.

Acknowledgement

Anshumali Shrivastava is a Ph.D. student partially supported by NSF (DMS0808864, III1249316) and ONR (N00014-13-1-0764).
The work of Ping Li is partially supported by AFOSR (FA9550-13-1-0137), ONR (N00014-13-1-0764), and NSF (III1360971, BIGDATA1419210).

A Proofs

For the analysis, it is sufficient to consider the configurations of empty and non-empty bins arising after throwing |S_1 ∪ S_2| balls uniformly into k bins with exactly m non-empty bins and k − m empty bins. Under uniform throwing of balls, any ordering of m non-empty and k − m empty bins is equally likely. The proofs involve elementary combinatorial arguments of counting configurations.

A.1 Proof of Lemma 1

Given exactly m simultaneously non-empty bins, any two of them can be chosen in m(m − 1) ways (with ordering of i and j). Each term M^N_i M^N_j, for both i and j simultaneously non-empty, is 1 with probability R·R̃ (note, E[M^N_i M^N_j | i ≠ j, I^i_emp = 0, I^j_emp = 0] = R·R̃).

A.2 Proof of Lemma 2

The permutation is random, and any sequence of m simultaneously non-empty bins and k − m remaining empty bins is equally likely. This is because, while randomly throwing |S_1 ∪ S_2| balls into k bins with exactly m non-empty bins, every sequence of simultaneously empty and non-empty bins has equal probability. Given m, there are in total 2m(k − m) different pairs of empty and non-empty bins (including the ordering). Now, for every simultaneously empty bin j, i.e., I^j_emp = 1, M^E_j replicates M^N_t corresponding to the nearest non-empty bin t towards the circular right. There are two cases we need to consider:

Case 1: t = i, which has probability 1/m, and
E[M^N_i M^E_j | I^i_emp = 0, I^j_emp = 1] = E[M^N_i | I^i_emp = 0] = R.

Case 2: t ≠ i, which has probability (m − 1)/m, and
E[M^N_i M^E_j | I^i_emp = 0, I^j_emp = 1] = E[M^N_i M^N_t | t ≠ i, I^i_emp = 0, I^t_emp = 0] = R·R̃.

Thus, the value of E[ Σ_{i ≠ j} M^N_i M^E_j | m ] comes out to be

2m(k − m) [ R/m + (m − 1)R·R̃/m ],

which is the desired expression.
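The next two lemmas hinge on p, the probability that two simultaneously empty bins end up copying from the same non-empty bin. Before the combinatorial derivations, both claimed values — p = 2/(m + 1) for the existing scheme (each empty bin copies from its nearest non-empty bin towards the circular right) and p = 1.5/(m + 1) for the improved scheme (modeled here as each empty bin flipping an independent fair coin to pick its nearest non-empty bin on the left or right, consistent with the 1/2 and 1/4 case probabilities used below) — can be confirmed by exact enumeration over all equally likely placements. The snippet is a hypothetical sanity check, not code from the paper:

```python
from itertools import combinations
from fractions import Fraction

def nearest(pos, nonempty, k, step):
    # Walk circularly (step=+1: right, step=-1: left) to the closest non-empty bin.
    t = (pos + step) % k
    while t not in nonempty:
        t = (t + step) % k
    return t

def same_bin_prob(k, m):
    """Exact p for both densification schemes: average, over all equally
    likely placements of m non-empty bins among k and all pairs of empty
    bins, of the probability that the two empty bins copy from the same
    non-empty bin. (Sanity check only, not the paper's code.)"""
    p_old, p_new, pairs = Fraction(0), Fraction(0), 0
    for cfg in combinations(range(k), m):
        ne = set(cfg)
        empty = [b for b in range(k) if b not in ne]
        for i, j in combinations(empty, 2):
            pairs += 1
            li, ri = nearest(i, ne, k, -1), nearest(i, ne, k, +1)
            lj, rj = nearest(j, ne, k, -1), nearest(j, ne, k, +1)
            # Existing scheme: both bins always copy from their right neighbor.
            p_old += int(ri == rj)
            # Improved scheme: independent fair coins pick left/right, so each
            # of the four (choice_i, choice_j) combinations has weight 1/4.
            p_new += Fraction(int(li == lj) + int(li == rj) +
                              int(ri == lj) + int(ri == rj), 4)
    return p_old / pairs, p_new / pairs

if __name__ == "__main__":
    for k, m in [(6, 3), (7, 2)]:
        print(k, m, same_bin_prob(k, m))  # matches 2/(m+1) and 1.5/(m+1)
```

Note that p comes out independent of k, matching the lemmas, which state p purely in terms of m.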
A.3 Proof of Lemma 3

Given m, we have (k − m)(k − m − 1) different (ordered) pairs of simultaneously empty bins. There are two cases: if the closest simultaneously non-empty bins towards their circular right are identical, then for such i and j, M^E_i M^E_j = 1 with probability R; otherwise M^E_i M^E_j = 1 with probability R·R̃. Let p be the probability that two simultaneously empty bins i and j have the same closest non-empty bin on the right. Then E[ Σ_{i ≠ j} M^E_i M^E_j | m ] is given by

(k − m)(k − m − 1) [ pR + (1 − p)R·R̃ ],   (37)

because with probability (1 − p) the two bins use estimators from different simultaneously non-empty bins, and in that case M^E_i M^E_j = 1 with probability R·R̃.

Consider Figure 3, where we have 3 simultaneously non-empty bins, i.e., m = 3 (shown by colored boxes). Any two simultaneously empty bins, Bin i and Bin j (out of the total k − m), will occupy any of the m + 1 = 4 blank positions. The arrows show the non-empty bins chosen for filling the empty bins. There are (m + 1)^2 + (m + 1) = (m + 1)(m + 2) different ways of fitting the two simultaneously empty bins i and j between the m non-empty bins. Note that if both i and j go to the same blank position, they can be permuted; this adds the extra term (m + 1). If both i and j choose the same blank space, or the first and the last blank spaces, then both simultaneously empty bins, Bin i and Bin j, correspond to the same non-empty bin. The number of ways in which this happens is 2(m + 1) + 2 = 2(m + 2). So we have

p = 2(m + 2) / [(m + 1)(m + 2)] = 2/(m + 1).

Substituting p into Eq.(37) leads to the desired expression.

A.4 Proof of Lemma 4

Similar to the proof of Lemma 3, we need to compute p, which is the probability that two simultaneously empty bins, Bin i and Bin j, use information from the same bin.
As argued before, the total number of positions for any two simultaneously empty bins i and j, given m simultaneously non-empty bins, is (m + 1)(m + 2). Consider Figure 4: under the improved scheme, if both Bin i and Bin j choose the same blank position, then they choose the same simultaneously non-empty bin with probability 1/2. If Bin i and Bin j choose consecutive positions (e.g., position 2 and position 3), then they choose the same simultaneously non-empty bin (Bin b) with probability 1/4. There are several boundary cases to consider as well. Accumulating the terms leads to

p = [ 2(m + 2)/2 + (2m + 4)/4 ] / [(m + 1)(m + 2)] = 1.5/(m + 1).

Substituting p into Eq.(37) yields the desired result. Note that m = 1 (an event with almost zero probability) leads to the value p = 1. We ignore this case because it unnecessarily complicates the final expressions; m = 1 can be easily handled and does not affect the final conclusion.

References

[1] A. Agarwal, O. Chapelle, M. Dudik, and J. Langford. A reliable effective terascale linear learning system. Technical report, arXiv:1110.4198, 2011.

[2] A. Andoni and P. Indyk. E2LSH: Exact Euclidean locality sensitive hashing. Technical report, 2004.

[3] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131–140, 2007.

[4] A. Z. Broder. On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21–29, Positano, Italy, 1997.

[5] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In STOC, pages 327–336, Dallas, TX, 1998.

[6] G. Buehrer and K. Chellapilla. A scalable pattern mining approach to web graph compression with communities. In WSDM, pages 95–106, Stanford, CA, 2008.

[7] J. L. Carter and M. N. Wegman. Universal classes of hash functions. In STOC, pages 106–112, 1977.

[8] T. Chandra, E. Ie, K. Goldman, T. L.
Llinares, J. McFadden, F. Pereira, J. Redstone, T. Shaked, and Y. Singer. Sibyl: a system for large scale machine learning.

[9] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, Montreal, Quebec, Canada, 2002.

[10] S. Chien and N. Immorlica. Semantic similarity between search engine queries using temporal correlation. In WWW, pages 2–11, 2005.

[11] F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. On compressing social networks. In KDD, pages 219–228, Paris, France, 2009.

[12] D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study of the evolution of web pages. In WWW, pages 669–678, Budapest, Hungary, 2003.

[13] M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, pages 284–291, 2006.

[14] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998.

[15] P. Li and K. W. Church. Using sketches to estimate associations. In HLT/EMNLP, pages 708–715, Vancouver, BC, Canada, 2005.

[16] P. Li, K. W. Church, and T. J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873–880, Vancouver, BC, Canada, 2006.

[17] P. Li, A. C. König, and W. Gui. b-bit minwise hashing for estimating three-way similarities. In Advances in Neural Information Processing Systems, Vancouver, BC, 2010.

[18] P. Li, A. B. Owen, and C.-H. Zhang. One permutation hashing. In NIPS, Lake Tahoe, NV, 2012.

[19] P. Li, A. Shrivastava, J. Moore, and A. C. König. Hashing algorithms for large-scale learning. In NIPS, Granada, Spain, 2011.

[20] M. Mitzenmacher and S. Vadhan. Why simple hash functions work: exploiting the entropy in a data stream. In SODA, 2008.

[21] M. Najork, S. Gollapudi, and R.
Panigrahy. Less is more: sampling the neighborhood graph makes SALSA better and faster. In WSDM, pages 242–251, Barcelona, Spain, 2009.

[22] N. Nisan. Pseudorandom generators for space-bounded computations. In Proceedings of the Twenty-Second Annual ACM Symposium on Theory of Computing, STOC, pages 204–212, 1990.

[23] A. Shrivastava and P. Li. Beyond pairwise: Provably fast algorithms for approximate k-way similarity search. In NIPS, Lake Tahoe, NV, 2013.

[24] A. Shrivastava and P. Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, Beijing, China, 2014.

[25] S. Tong. Lessons learned developing a practical large scale machine learning system. http://googleresearch.blogspot.com/2010/04/lessons-learned-developing-practical.html, 2008.

[26] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In ICML, pages 1113–1120, 2009.