A Heterogeneous High Dimensional Approximate Nearest Neighbor Algorithm
Moshe Dubiner (Google)

ACKNOWLEDGMENT

I would like to thank Phil Long and David Pablo Cohn for reviewing rough drafts of this paper and suggesting many clarifications. The remaining obscurity is my fault.

M. Dubiner is with Google, e-mail: moshe@google.com. Manuscript submitted to IEEE Transactions on Information Theory on March 3, 2007.

JOURNAL OF LATEX CLASS FILES, VOL. 1, NO. 1, MARCH 2007

Abstract

We consider the problem of finding high dimensional approximate nearest neighbors. Suppose there are d independent rare features, each having its own independent statistics. A point x will have x_i = 0 denote the absence of feature i, and x_i = 1 its existence. Sparsity means that usually x_i = 0. Distance between points is a variant of the Hamming distance. Dimensional reduction converts the sparse heterogeneous problem into a lower dimensional full homogeneous problem. However we will see that the converted problem can be much harder to solve than the original problem. Instead we suggest a direct approach. It consists of T tries. In try t we rearrange the coordinates in decreasing order of

\[ (1 - r_{t,i}) \frac{p_{i,11}}{p_{i,01} + p_{i,10}} \ln \frac{1}{p_{i,1*}} \tag{1} \]

where 0 < r_{t,i} < 1 are uniform pseudo-random numbers, and the p's are the coordinate's statistical parameters. The points are lexicographically ordered, and each is compared to its neighbors in that order. We analyze a generalization of this algorithm, show that it is optimal in some class of algorithms, and estimate the necessary number of tries for success. It is governed by an information-like function, which we call bucketing forest information. Any doubts whether it is "information" are dispelled by another paper, where unrestricted bucketing information is defined.

I.
INTRODUCTION

Suppose we have two bags of points, X_0 and X_1, randomly distributed in a high-dimensional space. The points are independent of each other, with one exception: there is one unknown point x_0 in bag X_0 that is significantly closer to an unknown point x_1 in bag X_1 than would be accounted for by chance. We want an efficient algorithm for quickly finding these two 'paired' points.

The reader might wonder why we need two sets, instead of working as usual with X = X_0 ∪ X_1. We have come full circle on this issue. The practical problem that got us interested in this theory involved texts from two languages, hence two different sets. However it seemed that the asymmetry between X_0 and X_1 was not important, so we developed a one-set theory. Then we found out that keeping X_0, X_1 separate makes things clearer.

Let us start with the well known simple homogeneous marginally Bernoulli(1/2) example. Suppose X_0, X_1 ⊂ {0,1}^d of sizes n_0, n_1 respectively are randomly chosen as independent Bernoulli(1/2) variables, with one exception. Choose randomly one point x_0 ∈ X_0, xor it with a random Bernoulli(p) vector and overwrite one randomly chosen x_1 ∈ X_1. A symmetric description is to say that the i'th bits of x_0, x_1 have the joint probability

\[ P = \begin{pmatrix} p/2 & (1-p)/2 \\ (1-p)/2 & p/2 \end{pmatrix} \tag{2} \]

for some p > 1/2. We assume that we know p. In practice it will have to be estimated. Let

\[ \ln W = \ln n_0 + \ln n_1 - I(P)\, d \tag{3} \]

where

\[ I(P) = p \ln(2p) + (1-p) \ln(2(1-p)) \tag{4} \]

is the mutual information between the special pair's single coordinate values. Information theory tells us that we cannot hope to pin the special pair down into fewer than W possibilities, but can come close to it in some asymptotic sense. Assume that W is small. How can we find the closest pair? The trivial way to do it is to compare all n_0 n_1 pairs.
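As a baseline, the trivial all-pairs search can be sketched as follows (a minimal illustration with made-up data; the function and variable names are ours, not the paper's):

```python
import itertools

def closest_pair(X0, X1):
    """Trivial O(n0 * n1 * d) search: return the pair of points,
    one from each bag, with the smallest Hamming distance."""
    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))
    return min(itertools.product(X0, X1), key=lambda pair: hamming(*pair))

# Tiny example: the planted pair (1,1,0,0) / (1,1,0,1) differs in one bit,
# while every other cross-bag pair differs in at least two bits.
X0 = [(0, 1, 1, 0), (1, 1, 0, 0)]
X1 = [(0, 0, 1, 1), (1, 1, 0, 1)]
```

Everything that follows is about avoiding this n_0 n_1 cost.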
A better way has been known for a long time. The earliest references I am aware of are Karp, Waarts and Zweig [6], Broder [3], Indyk and Motwani [5]. They do not limit themselves to this simplistic problem, but their approach clearly handles it. Without restricting generality let n_0 ≤ n_1. We randomly choose

\[ k \approx \log_2 n_0 \tag{5} \]

out of the d coordinates, and compare the point pairs which agree on these coordinates (in other words, fall into the same bucket). The expected number of comparisons is

\[ n_0 n_1 2^{-k} \approx n_1 \tag{6} \]

while the probability of success of one comparison is p^k. In case of failure we try again, with other random k coordinates. At first glance it might seem that the expected number of tries until success is p^{-k}, but that is not true because the attempts are interdependent. The correct computation is done in the next section. In the unlimited data case d → ∞ indeed

\[ T \approx p^{-k} \approx n_0^{\log_2 1/p} \tag{7} \]

Is this optimal? Alon [1] has suggested the possibility of improvement by using Hamming's perfect code. We have found that in the n_0 = n_1 = n case, T ≈ n^{\log_2 1/p} can be reduced to

\[ T \approx n^{1/p - 1 + \epsilon} \tag{8} \]

for any ε > 0, see [7]. Unfortunately this seems hard to convert into a practical algorithm.

In practice, most approximate nearest neighbor problems are heterogeneous. Coordinates are not independent either, but there is a lot to learn from the independent case. For starters let the joint probability matrix be position dependent:

\[ P_i = \begin{pmatrix} p_i/2 & (1-p_i)/2 \\ (1-p_i)/2 & p_i/2 \end{pmatrix} \qquad 1 \le i \le d \tag{9} \]

This is an important example which we will refer to as the marginally Bernoulli(1/2) example. It turns out that in each try coordinate i should be chosen with probability

\[ \max\left[ \frac{p_i - p_{cut}}{1 - p_{cut}}, 0 \right] \tag{10} \]

for some cutoff probability 0 ≤ p_cut ≤ 1. An intuitive argument leading to that equation appears in Section III.
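One try of this classical bucketing scheme can be sketched as follows (our illustration of the idea under the model above; all names are assumptions of ours):

```python
import random
from collections import defaultdict

def bucketing_try(X0, X1, k, rng=random):
    """One try: choose k random coordinates, bucket the points of both
    bags by their values on those coordinates, and return only the
    cross-bag pairs that fall into the same bucket."""
    d = len(X0[0])
    coords = rng.sample(range(d), k)
    buckets = defaultdict(lambda: ([], []))
    for x in X0:
        buckets[tuple(x[i] for i in coords)][0].append(x)
    for y in X1:
        buckets[tuple(y[i] for i in coords)][1].append(y)
    return [(x, y) for side0, side1 in buckets.values()
            for x in side0 for y in side1]
```

A failed try is simply repeated with fresh random coordinates; the interdependence of those repeats is exactly what Section II computes.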
Section V presents an independent data model, and a general nearest neighbor algorithm using its parameters. Section XIII proves a lower bound for its success probability. Section XII proves an upper bound for a much larger class of algorithms. The lower and upper bounds are asymptotically similar. The number of tries T satisfies

\[ \ln T \sim \max_{\lambda \ge 0} \left[ \lambda \ln n_0 - \sum_{i=1}^{d} F(P_i, \lambda) \right] \tag{11} \]

where F(P_i, λ) is defined in (49). The similarity to the information theoretic (3) suggests that F(P_i, λ) is some sort of information function. We call it the bucketing forest information function. [7] proves a similar estimate for the performance of the "best possible" bucketing algorithm, involving a bucketing information function with a very information theoretic look. Section VIII shows that our algorithm preserves sparseness. Section IX shows that dimensional reduction is bad for sparse data.

II. THE HOMOGENEOUS MARGINALLY BERNOULLI(1/2) EXAMPLE

The well known homogeneous marginally Bernoulli(1/2) example has been presented in the introduction. It will be analyzed in detail because the main purpose of this paper is generalizing it. The analysis is non-generalizable, but the issues remain. Recall that we have a joint probability matrix

\[ P = \begin{pmatrix} p/2 & (1-p)/2 \\ (1-p)/2 & p/2 \end{pmatrix} \tag{12} \]

for some p > 1/2. Without restricting generality let n_0 ≤ n_1. We randomly choose

\[ k \approx \min[\log_2 n_0, (2p-1)d] \tag{13} \]

out of the d coordinates. The reason for k ≤ (2p−1)d (which was omitted in the introduction for simplicity) will emerge later. We compare point pairs which agree on the chosen k coordinates. This is a random algorithm solving a random problem, so we have two levels of randomness. Usually when we compute probabilities or expectations it will be with respect to these two sources together.
The expected number of comparisons is n_0 n_1 2^{-k}, while the probability of success of one comparison is p^k. (These statements are true assuming only model randomness.) In case of failure we try again, with other random k coordinates. In order to estimate the expected number of tries till success we have to enumerate how many bits are identical in the special pair x_0, x_1. Let this number be j. Then the probability of success in a single try conditioned on j is \binom{j}{k} / \binom{d}{k}. Hence the expected number of comparisons is T n_1 where

\[ T = n_0 2^{-k} \sum_{j=0}^{d} \binom{d}{j} p^j (1-p)^{d-j} \, \binom{d}{k} \Big/ \binom{j}{k} \]

For small d/k this is too pessimistic, because most of the contribution to the above sum comes from unlikely low j's. We know that with probability about 1/2, j ≥ pd. Hence we get a success probability of about 1/2 with an expected

\[ T = n_0 2^{-k} \sum_{j=pd}^{d} \binom{d}{j} p^j (1-p)^{d-j} \, \binom{d}{k} \Big/ \binom{j}{k} \approx n_0 2^{-k} \binom{d}{k} \Big/ \binom{pd}{k} = n_0 \prod_{i=0}^{k-1} \frac{1 - i/d}{2(p - i/d)} \]

Now it is clear that increasing k above (2p−1)d increases T, which is counterproductive.

III. AN INTUITIVE ARGUMENT FOR THE MARGINALLY BERNOULLI(1/2) EXAMPLE

In full generality our algorithm is not very intuitive. In this section we will present an intuitive argument for the special case of the joint probability matrices

\[ P_i = \begin{pmatrix} p_i/2 & (1-p_i)/2 \\ (1-p_i)/2 & p_i/2 \end{pmatrix} \qquad 1 \le i \le d \tag{14} \]

The impatient reader may skip this and the next section, jumping directly to the algorithm. Let us order the coordinates in decreasing order of importance

\[ p_1 \ge p_2 \ge \cdots \ge p_d \tag{15} \]

Moreover let us bunch coordinates together into g groups of d_1, d_2, ...
, d_g coordinates, where \sum_{h=1}^{g} d_h = d, and the members of group h all have the same probability q_h:

\[ p_{d_1 + \cdots + d_{h-1} + 1} = \cdots = p_{d_1 + \cdots + d_h} = q_h \tag{16} \]

Out of the d_h coordinates in group h, the special pair will agree in approximately q_h d_h 'good' coordinates. Let us make things simple by pretending that this is the exact value (never mind that it is not an integer). We want to choose

\[ k = \log_2 n_0 \tag{17} \]

coordinates and compare pairs which agree on them. The greedy approach seems to be to choose as many as possible from group 1, but conditional greed disagrees. Let us pick the first coordinate randomly from group 1. If it is bad, the whole try is lost. If it is good, group 1 is reduced to size d_1 − 1, out of which q_1 d_1 − 1 are good. Hence the probability that a remaining coordinate is good is reduced to

\[ \frac{q_1 d_1 - 1}{d_1 - 1} \tag{18} \]

After taking m coordinates out of group 1, its probability decreases to

\[ \frac{q_1 d_1 - m}{d_1 - m} \tag{19} \]

Hence after taking

\[ m = \frac{q_1 - q_2}{1 - q_2} d_1 \tag{20} \]

coordinates, group 1 merges with group 2. We will randomly choose coordinates from this merged group till its probability drops to q_3. At that point the probability of a second group coordinate to be chosen is

\[ \frac{q_2 - q_3}{1 - q_3} \tag{21} \]

while the probability of a first group coordinate being picked either before or after the union is

\[ \frac{q_1 - q_2}{1 - q_2} + \left( 1 - \frac{q_1 - q_2}{1 - q_2} \right) \frac{q_2 - q_3}{1 - q_3} = \frac{q_1 - q_3}{1 - q_3} \tag{22} \]

This goes on till at some q_l = p_cut we have k coordinates. Then the probability that coordinate i is chosen is

\[ \max\left[ \frac{p_i - p_{cut}}{1 - p_{cut}}, 0 \right] \tag{23} \]

as stated in the introduction. The cutoff probability is determined by

\[ \sum_{i=1}^{d} \max\left[ \frac{p_i - p_{cut}}{1 - p_{cut}}, 0 \right] \approx k \tag{24} \]

The previous equation can be solved iteratively. However it is better to look from a different angle.
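Solving (24) iteratively is indeed easy; a minimal bisection sketch (our illustration with made-up values, not the paper's code):

```python
def solve_pcut(p, k, tol=1e-12):
    """Find p_cut with sum_i max((p_i - p_cut)/(1 - p_cut), 0) = k.
    The left side decreases monotonically in p_cut, so bisection works.
    Assumes k < sum(p), so that a root exists in [0, max(p))."""
    def chosen(pcut):
        return sum(max((pi - pcut) / (1.0 - pcut), 0.0) for pi in p)
    lo, hi = 0.0, max(p)  # chosen(max(p)) == 0, chosen(0) == sum(p)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if chosen(mid) > k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For equal p_i = p this reproduces the closed form p_cut = (pd − k)/(d − k), which reappears as (57) in Section VII.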
For each try we will have to generate d independent uniform [0,1] random real numbers

\[ 0 < r_1, r_2, \ldots, r_d < 1 \tag{25} \]

one random number per coordinate. Then we take coordinate i iff

\[ r_i \le \frac{p_i - p_{cut}}{1 - p_{cut}} \tag{26} \]

Let us reverse direction. Generate r_i first, and then compute for which p_cut's coordinate i is taken:

\[ p_{cut} \le 2^{-\lambda_i} = \max\left[ \frac{p_i - r_i}{1 - r_i}, 0 \right] \tag{27} \]

Denoting the right hand side by 2^{-λ_i} is unnecessarily cumbersome at this stage, but will make sense later. We will call λ_i the random exponent of coordinate i (random because it is r_i dependent). Remember that p_cut > 0, so λ_i = ∞ means that for that value of r_i coordinate i cannot be used. Now which value of p_cut will get us k coordinates? There is no need to solve equations. Sort the λ_i's in nondecreasing order, and pick out the first k. Hence

\[ p_{cut} = 2^{-\lambda_{cut}} \tag{28} \]

where the cutoff exponent λ_cut is the value of the k'th ordered random exponent.

It takes some time to comprehend the effect of equation (27). The random element seems overwhelming. The probability that coordinate 1 will have a larger random exponent than coordinate 2 when p_1 > p_2 is

\[ \frac{1}{2} \, \frac{1 - p_1}{1 - p_2} \tag{29} \]

In particular the probability that a useless coordinate with p_i = 0.5 precedes a good coordinate with p_i = 0.9 is 0.1! However the chance that the useless coordinate will be ranked among the first k is very small, unless we have so little data that it is better to take k < ln n_0.

IV. AN UNLIMITED HOMOGENEOUS DATA EXAMPLE

The previous section completely avoids an important aspect of the general problem, which will be presented by the following example.
Suppose we have an unlimited amount of data, d → ∞, with all coordinates of the same type

\[ P = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix} \qquad 1 \le i \le d \tag{30} \]

where

\[ p_{00} + p_{01} + p_{10} + p_{11} = 1 \tag{31} \]

This is the joint probability of the dependent pair, and the marginal probabilities govern the distribution of the remaining points. In the set X_0 the probability that bit i is 0 is

\[ p_{0*} = p_{00} + p_{01} \tag{32} \]

and similarly in X_1

\[ p_{*0} = p_{00} + p_{10} \tag{33} \]

The * means "don't care". A reasonable pairing algorithm (very similar in this case to the general algorithm) is to pick coordinates at random, 1 ≤ i_1, i_2, ... ≤ d. After picking k coordinates, an X_0 point x_l = (x_{l1}, x_{l2}, ..., x_{ld}) is in a bucket of expected size

\[ n_0 \prod_{t=1}^{k} p_{x_{l i_t} *} \tag{34} \]

Hence it makes sense to increase k only up to the point where n_0 \prod_{t=1}^{k} p_{x_{l i_t} *} < 1, and then compare with all X_1 points in its cell. This makes k point dependent. The expected number of comparisons in a single try is at most n_1. What is the approximate success probability?

Our initial estimate was the following. The probability that the special pair will agree in a single coordinate is p_00 + p_11. The amount of information in a single X_0 coordinate is −p_{0*} ln p_{0*} − p_{1*} ln p_{1*}, so we will need about

\[ k \approx \frac{\ln n_0}{-p_{0*} \ln p_{0*} - p_{1*} \ln p_{1*}} \tag{35} \]

coordinates, and the success probability is estimated by

\[ (p_{00} + p_{11})^k \approx n_0^{-\frac{\ln(p_{00} + p_{11})}{p_{0*} \ln p_{0*} + p_{1*} \ln p_{1*}}} \tag{36} \]

This estimate turns out to be disastrously wrong. For the bad matrix

\[ \begin{pmatrix} 1 - 2\epsilon & \epsilon \\ \epsilon & 0 \end{pmatrix} \tag{37} \]

with small ε it suggests exponent −1/ln ε, while clearly it is worse than 1. The interested reader might pause to figure out what went wrong, and how this argument can be salvaged. There is an almost exact simple answer with a surprising flavor. We expect a success probability of the form n_0^{-λ}, so let us check that for consistency. Pick the first coordinate.
With probability p_00, the expectation n_0 is reduced to n_0 p_{0*}. With probability p_11 it is reduced to n_0 p_{1*}, and with probability 1 − p_00 − p_11 the try is already lost. Hence

\[ n_0^{-\lambda} \approx p_{00} (n_0 p_{0*})^{-\lambda} + p_{11} (n_0 p_{1*})^{-\lambda} \tag{38} \]

Happily n_0 drops out, leaving us with

\[ p_{00} \, p_{0*}^{-\lambda} + p_{11} \, p_{1*}^{-\lambda} = 1 \tag{39} \]

which determines the exponent λ. It is very easy to convert this informal argument into a formal theorem and proof. A harder task awaits us.

V. THE GENERAL ALGORITHM AND ITS PERFORMANCE

Definition 5.1: The independent data model is the following. We generalize from bits to b discrete values. Let the sets

\[ X_0, X_1 \subset \{0, 1, \ldots, b-1\}^d \tag{40} \]

of cardinalities

\[ \#X_0 = n_0, \quad \#X_1 = n_1 \tag{41} \]

be randomly constructed in the following way. The X_0 points are identically distributed independent Bernoulli random vectors, with p_{i,j*} denoting the probability that coordinate i has value j. There is a special pair of X_0, X_1 points, randomly chosen out of the n_0 n_1 possibilities. For that pair the probability that both their i'th coordinates equal j is p_{i,j}, with no dependency between coordinates. The rest of the X_1 points can be anything. (We abbreviate the usual notation p_{i,jj} to p_{i,j}, because we will consider only the diagonal and the marginal probabilities.) Denote

\[ p_{i,b} = 1 - \sum_{j=0}^{b-1} p_{i,j} \tag{42} \]

\[ P_i = \begin{pmatrix} p_{i,0} & p_{i,1} & \ldots & p_{i,b-1} \\ p_{i,0*} & p_{i,1*} & \ldots & p_{i,b-1\,*} \end{pmatrix} \tag{43} \]

We propose the following algorithm. It consists of several bucketing tries. For each try we generate d independent uniform [0,1] random real numbers

\[ 0 < r_1, r_2, \ldots, r_d < 1 \tag{44} \]

one random number per coordinate.
For each coordinate i we define its random exponent λ_i ≥ 0 to be the unique solution of the monotone equation

\[ \sum_{j=0}^{b-1} \frac{p_{i,j}}{(1 - r_i) p_{i,j*}^{\lambda_i} + r_i} = 1 \tag{45} \]

or +∞ when there is no solution. (p_{i,j*}^{\lambda_i} means (p_{i,j*})^{\lambda_i}.) We lexicographically sort all the n_0 + n_1 points, with lower exponent coordinates given precedence over larger exponent coordinates, and the coordinate values 0, 1, ..., b−1 arbitrarily arranged, even without consistency. Now each X_1 point is compared with the preceding a and following a X_0 points (or fewer near the ends). The comparisons are done in some one-on-one way, and the algorithm is considered successful if it asks for the correct comparison. The best a is problem and computer dependent, but is never large. Each try makes at most 2an_1 comparisons. Of course there is extra n_0 + n_1 point handling work.

A nice way to write the lexicographic ordering of the algorithm follows. Suppose that in try t the sorted random exponents are

\[ \lambda_{\pi_1} < \lambda_{\pi_2} < \cdots < \lambda_{\pi_d} \tag{46} \]

Then each point

\[ x = (x_1, x_2, \ldots, x_d) \in \{0, 1, \ldots, b-1\}^d \tag{47} \]

is projected into the interval [0,1] by

\[ R_t(x) = \sum_{i=1}^{d} p_{\pi_1, x_{\pi_1} *} \, p_{\pi_2, x_{\pi_2} *} \cdots p_{\pi_{i-1}, x_{\pi_{i-1}} *} \sum_{j=0}^{x_{\pi_i} - 1} p_{\pi_i, j *} \]

The projection order is a lexicographic order. For large dimension d, R_t(x) is approximately uniformly distributed in [0,1]. We will prove that the number of tries T needed for success satisfies

\[ \ln T \sim \max_{\lambda \ge 0} \left[ \lambda \ln n_0 - \sum_{i=1}^{d} F(P_i, \lambda) \right] \tag{48} \]

where

Definition 5.2: The bucketing forest information function F(P_i, λ) is

\[ F(P_i, \lambda) = \min_{\substack{0 \le q_{i,0}, \ldots, q_{i,b} \\ \sum_{j=0}^{b} q_{i,j} = 1 \\ \sum_{j=0}^{b-1} q_{i,j} p_{i,j*}^{-\lambda} \le 1}} \; \sum_{j=0}^{b} p_{i,j} \ln \frac{p_{i,j}}{q_{i,j}} \tag{49} \]

\[ = \max_{0 \le r_i \le 1} \sum_{j=0}^{b} p_{i,j} \ln\left( 1 - r_i + r_i (j \ne b) \, p_{i,j*}^{-\lambda} \right) \tag{50} \]

The two dual extremum points are related by

\[ q_{i,j} = \frac{p_{i,j}}{1 - r_i + r_i (j \ne b) \, p_{i,j*}^{-\lambda}} \tag{51} \]

For \sum_{j=0}^{b-1} p_{i,j} p_{i,j*}^{-\lambda} \le 1 we have r_i = 0, q_{i,j} = p_{i,j}, F(P_i, λ) = 0. Otherwise

\[ \sum_{j=0}^{b-1} q_{i,j} \, p_{i,j*}^{-\lambda} = \sum_{j=0}^{b-1} \frac{p_{i,j}}{(1 - r_i) p_{i,j*}^{\lambda} + r_i} = 1 \tag{52} \]

We will get (49) from the upper bound theorem, and (50) from the lower bound theorem. Their equivalence is a simple (though a bit surprising) application of Lagrange multipliers in a convex setting. Representation (50) implies that F(P, λ) is an increasing convex function of λ. The cutoff exponent λ_cut attains (48). It has several meanings.

1) In each try the coordinates with λ_i ≤ λ_cut define a bucket of size e^ε n_0 for some small real ε.
2) If we double n_0, the number of tries needed to achieve success probability 1/2 is approximately multiplied by 2^{λ_cut}.
3) If we delete coordinate i, then the number of tries needed to achieve success probability 1/2 is on average multiplied by e^{F(P_i, λ_cut)}.

Switching X_0 and X_1 may result in a different algorithm. Coordinate values can be changed and/or merged in possibly different ways for X_0, X_1. For each possibility we have an estimate of its effectiveness, and the best should be taken. In real applications there is dependence, and the probabilities have to be estimated. Our practical experience indicates that this is a robust algorithm. Details will be described in another paper.

VI. AN ALTERNATIVE ALGORITHM

There is an interesting alternative to the random ordering of coordinates. Suppose we have training sets X_0, X_1, both of size n, such that each X_0 point is paired with a known X_1 point. Let us estimate the probabilities P_i by their empirical averages.
For each coordinate i its exponent λ_i ≥ 0 is defined by

\[ \sum_{j=0}^{b-1} p_{i,j} \, p_{i,j*}^{-\lambda_i} = 1 \tag{53} \]

Arrange the coordinates in the greedy order of nondecreasing exponents. Perform the first try using that order, just like in the previous algorithm. Remove the pairs found from the training data, and repeat recursively on the reduced training data. Stop after the training set is reduced to 1/3 (for example) of its original size, or you run out of memory. The memory problem can be alleviated by keeping only the heads of coordinate lists, and/or running training and working tries in parallel. This simpler algorithm has a more complicated and/or less efficient implementation, and lacks theory.

VII. RETURN OF THE MARGINALLY BERNOULLI(1/2) EXAMPLE

For the marginally Bernoulli(1/2) example equation (45) is

\[ \frac{2 (p_i/2)}{(1 - r_i) 2^{-\lambda_i} + r_i} = 1 \tag{54} \]

which can be recast as the familiar

\[ 2^{-\lambda_i} = \frac{p_i - r_i}{1 - r_i} \tag{55} \]

The bucketing forest information function is

\[ F(P_i, \lambda) = \begin{cases} p_i \ln \dfrac{p_i}{2^{-\lambda}} + (1 - p_i) \ln \dfrac{1 - p_i}{1 - 2^{-\lambda}} & p_i \ge 2^{-\lambda} \\ 0 & p_i \le 2^{-\lambda} \end{cases} \]

The cutoff exponent attains (48). The extremal condition is the familiar

\[ \sum_{i=1}^{d} \max\left[ \frac{p_i - 2^{-\lambda_{cut}}}{1 - 2^{-\lambda_{cut}}}, 0 \right] = \log_2 n_0 \tag{56} \]

Let us now specialize to p_1 = p_2 = ⋯ = p_d = p. Then

\[ \lambda_{cut} = -\log_2 \frac{pd - \log_2 n_0}{d - \log_2 n_0} \tag{57} \]

Notice that log_2 n_0 > (2p−1)d is equivalent to λ_cut > 1. In general λ_cut > 1 signals that the available bucketing forest information is of such low quality that the trees are worse than random near their leafs.

VIII.
SPARSITY

Let us specialize to sparse bits: b = 2, with

\[ p_{i,1*}, \; p_{i,*1}, \; p_{i,11} \ll 1 \tag{58} \]

We will also assume that for some fixed δ > 0

\[ p_{i,11} \ge \delta (p_{i,01} + p_{i,10}) \tag{59} \]

The equation

\[ \frac{p_{i,00}}{(1 - r_i) p_{i,0*}^{\lambda_i} + r_i} + \frac{p_{i,11}}{(1 - r_i) p_{i,1*}^{\lambda_i} + r_i} = 1 \tag{60} \]

has two asymptotic regimes: one in which p_{i,0*}^{λ_i} is nearly constant and p_{i,1*}^{λ_i} changes, and vice versa. The first regime is the important one:

\[ p_{i,00} + \frac{p_{i,11}}{(1 - r_i) p_{i,1*}^{\lambda_i} + r_i} \approx 1 \tag{61} \]

\[ \lambda_i \approx \ln\left( 1 - \frac{1}{(1 - r_i)\left( 1 + \frac{p_{i,11}}{p_{i,01} + p_{i,10}} \right)} \right) \Big/ \ln p_{i,1*} \tag{62} \]

In practice the probabilities have to be estimated from the data, and sparse estimates must be unreliable, so we used the more conservative

\[ 1 / \tilde{\lambda}_i = (1 - r_i) \frac{p_{i,11}}{p_{i,01} + p_{i,10}} \ln \frac{1}{p_{i,1*}} \tag{63} \]

A very important practical point is that the general algorithm preserves sparsity. Suppose that instead of points

\[ x = (x_1, x_2, \ldots, x_d) \in \{0, 1\}^d \tag{64} \]

we have subsets of a feature set D of cardinality d:

\[ D_x \subset D \tag{65} \]

In try t we use a hash function hash_t : D → [0, 1]. For each feature i ∈ D its random exponent λ_i is computed using the pseudo-random

\[ r_i = hash_t(i) \tag{66} \]

and the random exponents of x are sorted

\[ \lambda_{\pi_1} < \lambda_{\pi_2} < \cdots < \lambda_{\pi_\nu} \tag{67} \]

Then the sequence of features

\[ (\pi_1, \pi_2, \ldots, \pi_\nu) \tag{68} \]

is a sparse representation of x whose lexicographic order is used in try t.

IX. THE DOWNSIDE OF DIMENSIONALITY REDUCTION

Another way of handling sparse approximate neighbor problems is to convert them into dense problems by a random projection. For dense problems taking some k out of the d coordinates can be an effective way to reduce dimension. For sparse problems such a sampling reduction will remain sparse, hence dense projection matrices are used instead. We will show that this can result in a much worse algorithm.
Let us consider the unlimited homogeneous data example with

\[ p_{01} = p_{10}, \quad n_0 = n_1 = n \tag{69} \]

because in general it is not clear which projections to take and how to analyze their performance. We have a d dimensional Hamming cube {0,1}^d. The Hamming distance between two random X_0, X_1 points is approximately

\[ 2 p_{0*} p_{1*} d \tag{70} \]

The Hamming distance between the two special points is approximately

\[ 2 p_{01} d \tag{71} \]

Hence when the dimension d is large, the random to special distances ratio tends to

\[ c = \frac{p_{0*} p_{1*}}{p_{01}} \tag{72} \]

The ideal dimensionality reduction would be to project {0,1}^d into a much lower dimensional {0,1}^k in such a way that the images of the X_0, X_1 points are random {0,1}^k points, and the distance between the two special images is approximately k/(2c) (k/2 is the approximate distance between two random image points). Hence after the dimensionality reduction we will have a homogeneous marginally Bernoulli(1/2) problem with

\[ p = 1 - \frac{1}{2c} \tag{73} \]

The standard nearest neighbor algorithm solves this in approximately

\[ n^{\log_2 \frac{2c}{2c-1}} \tag{74} \]

tries. Actual dimensional reductions fall short of this ideal. The Indyk and Motwani theory [5] states that

\[ n^{1/c} \tag{75} \]

tries suffice. The truth is somewhere in between. In contrast, without dimensionality reduction our algorithm takes approximately n^λ tries, where λ is determined by

\[ (1 - p_{1*} - p_{01})(1 - p_{1*})^{-\lambda} + p_{11} \, p_{1*}^{-\lambda} = 1 \tag{76} \]

In the asymptotic regime (58), (59), inserting r = 0 into (62) results in

\[ \lambda \approx \frac{\ln\left[ 1 + \frac{2 p_{01}}{p_{11}} \right]}{\ln 1/p_{1*}} \approx \frac{\ln \frac{c+1}{c-1}}{\ln 1/p_{1*}} \tag{77} \]

We encourage the interested reader to look at his favorite dimensional reduction scheme, and see that the ln 1/p_{1*} factor is really lost.

X.
LEXICOGRAPHIC AND BUCKETING FORESTS

Our general algorithm is of the following type.

Definition 10.1: A lexicographic tree algorithm is the following. The d coordinates are arranged according to some permutation. Then a complete lexicographically ordered tree is generated. It is defined recursively as a root pointing towards b subtrees, with the edges denoting the possible values of the first (after permutation) coordinate, arbitrarily ordered. The subtrees are complete lexicographic ordered trees for the remaining d − 1 coordinates. In particular the lexicographic tree has b^d ordered leafs, each denoting a point in {0, 1, ..., b−1}^d. A lexicographic tree algorithm arranges the n_0 + n_1 points of X_0 ∪ X_1 according to the tree, and then compares each X_1 point with its a neighbors right and left. This insures no more than 2an_1 comparisons per tree. A lexicographic forest is simply a forest of lexicographic trees, each having its own permutation. It succeeds iff at least one tree succeeds.

An obvious generalization is

Definition 10.2: A semi-lexicographic tree algorithm has a 'first' coordinate, and then recursively each subtree is semi-lexicographic, until all coordinates are exhausted. For example we can start with coordinate 3, and then consider coordinate 5 if the value is 0, or coordinate 2 if the value is 1, and so on.

The success probability of a lexicographic forest is very complicated, even before randomizing the algorithm. For that reason we will consider an uglier non-robust class of algorithms that are easier to understand and analyze.

Definition 10.3: A bucketing tree algorithm is predictably recursively defined. Either compare all pairs (a leaf bucket), or take one coordinate, split the data into b parts according to its value (some parts may be empty), and apply a bucketing tree algorithm on each part separately.
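Definition 10.3 can be sketched recursively as follows (our illustrative code; the fixed split order and the simple leaf rule are simplifying assumptions, not the paper's data-dependent condition):

```python
def bucketing_tree_pairs(X0, X1, coords, max_leaf=2):
    """Recursive bucketing tree: either compare all cross-bag pairs at
    a leaf bucket, or split both bags on one coordinate and recurse on
    each part. `coords` lists the coordinates still available for
    splitting; `max_leaf` caps the X0 points allowed in a leaf."""
    if not coords or len(X0) <= max_leaf:
        return [(x, y) for x in X0 for y in X1]  # leaf bucket
    c, rest = coords[0], coords[1:]
    pairs = []
    values = {x[c] for x in X0} | {y[c] for y in X1}
    for v in values:  # split into parts by the value of coordinate c
        part0 = [x for x in X0 if x[c] == v]
        part1 = [y for y in X1 if y[c] == v]
        if part0 and part1:  # empty parts contribute nothing
            pairs += bucketing_tree_pairs(part0, part1, rest, max_leaf)
    return pairs
```

A bucketing forest simply runs several such trees and succeeds if any one of them asks for the special comparison.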
In order to have no more than an_0 expected comparisons we will insist that each leaf expects no more than a points belonging to X_0. A bucketing forest is simply a forest of bucketing trees. It succeeds iff at least one tree succeeds.

The success probability of a bucketing forest is no bed of roses. Let us denote a leaf by w ∈ {0, 1, ..., b}^d, with b indicating that the corresponding coordinate is not taken. The leaf w expects

\[ n_0 \prod_{i=1}^{d} \begin{cases} p_{i,w_i *} & w_i < b \\ 1 & w_i = b \end{cases} \tag{78} \]

X_0 points, and its success probability is

\[ \prod_{i=1}^{d} \begin{cases} p_{i,w_i} & w_i < b \\ 1 & w_i = b \end{cases} \tag{79} \]

The success probability of a tree is the sum of the success probabilities of its leafs. The success probability of the whole forest is less than the tree sum. Suppose the whole forest contains L leafs w_1, w_2, ..., w_L. Let y ∈ {0, 1, ..., b}^d denote the abbreviated state of the special points:

\[ y_i = \begin{cases} x_{0,i} & x_{0,i} = x_{1,i} \\ b & x_{0,i} \ne x_{1,i} \end{cases} \tag{80} \]

The value b denotes disagreement, and its probability is p_{i,b} = 1 − \sum_{j=0}^{b-1} p_{i,j}. The success probability of the whole forest is

\[ S = \sum_{y \in \{0,1,\ldots,b\}^d} \prod_{i=1}^{d} p_{i,y_i} \cdot \left[ 1 - \prod_{l=1}^{L} \left( 1 - \prod_{i=1}^{d} (w_{l,i} == y_i \,||\, w_{l,i} == b) \right) \right] \]

Remember that (w_{l,i} == y_i || w_{l,i} == b) is 0 or 1, hence the two rightmost products are just logical ands, and 1 − (·) is a logical not. Our algorithm is almost a bucketing forest, except that the leaf condition is data dependent (for robustness). A truly variable scheme can shape the buckets in a more complicated data dependent way, see for example Gennaro, Savino and Zezula [4]. Non-tree bucketing can use several coordinates together, so that the resulting buckets are not boxes, see for example Andoni and Indyk [2] or [7].

XI. A BUCKETING FOREST UPPER BOUND

In this section we will bound the performance of bucketing forest algorithms.
It is tricky, but technically simpler and more elegant than proving a lower bound on the performance of a single algorithm.

Theorem 11.1: Assume the independent data model. The success probability P of a nonempty bucketing tree whose leafs all have probabilities at most 1/N is at most

\[ P \le N^{-\lambda} \prod_{i=1}^{d} \max\left[ 1, \sum_{j=0}^{b-1} p_{i,j} \, p_{i,j*}^{-\lambda} \right] \tag{81} \]

for any λ ≥ 0. We do not even have to assume p_{i,j} ≤ p_{i,j*}.

Proof: Use induction. Without losing generality split coordinate 1. The induction step

\[ P \le \sum_{j=0}^{b-1} p_{1,j} (N p_{1,j*})^{-\lambda} \prod_{i=2}^{d} \max\left[ 1, \sum_{j=0}^{b-1} p_{i,j} \, p_{i,j*}^{-\lambda} \right] \tag{82} \]

is valid for both proper and point-only subtrees. The maximization with 1 is necessary because coordinates can be ignored.

Theorem 11.2: Assume the independent data model. Suppose a bucketing forest contains T trees, its success probability is S, and all its leafs have probabilities at most 1/N. Then for any λ ≥ 0

\[ \ln T \ge \lambda \ln N + \ln \frac{S}{2} - \sqrt{\frac{4}{S} \sum_{i=1}^{d} V(P_i, \lambda)} - \sum_{i=1}^{d} F(P_i, \lambda) \]

where

\[ V(P_i, \lambda) = \sum_{j=0}^{b} p_{i,j} \left( \ln \frac{p_{i,j}}{q_{i,j}} - \sum_{k=0}^{b} p_{i,k} \ln \frac{p_{i,k}}{q_{i,k}} \right)^2 \tag{83} \]

and the q_{i,j}'s are the minimizing arguments from F's definition (49).

Proof: The previous theorem provides a good bound for the success probability of a single tree, but it is not tight for a forest, because of dependence: the failure of each tree increases the failure probability of the other trees. Now comes an interesting argument. Recall the formula for the success probability S of the whole forest from Section X. For any z and q_{i,j} > 0 we can bound

\[ S \le \mathrm{Prob}\{Z \ge z\} + e^z S_Q \tag{84} \]

where

\[ Z = \sum_{i=1}^{d} \ln \frac{p_{i,y_i}}{q_{i,y_i}} \tag{85} \]

\[ \mathrm{Prob}\{Z \ge z\} = \sum_{y \in \{0,1,\ldots,b\}^d} \prod_{i=1}^{d} p_{i,y_i} \cdot \left( \sum_{i=1}^{d} \ln \frac{p_{i,y_i}}{q_{i,y_i}} \ge z \right) \]

\[ S_Q = \sum_{y \in \{0,1,\ldots,b\}^d} \prod_{i=1}^{d} q_{i,y_i} \cdot \left[ 1 - \prod_{l=1}^{L} \left( 1 - \prod_{i=1}^{d} (w_{l,i} == y_i \,||\, w_{l,i} == b) \right) \right] \]

We insist upon

\[ \sum_{j=0}^{b} q_{i,j} = 1 \tag{86} \]

so that we can use the previous lemma to bound

\[ S_Q \le T P_q \le T N^{-\lambda} \prod_{i=1}^{d} \max\left[ 1, \sum_{j=0}^{b-1} q_{i,j} \, p_{i,j*}^{-\lambda} \right] \tag{87} \]

The other term is handled by the Chebyshev bound: for z > E(Z)

\[ \mathrm{Prob}\{Z \ge z\} \le \frac{\mathrm{Var}(Z)}{(z - E(Z))^2} \tag{88} \]

Together

\[ S \le \frac{\mathrm{Var}(Z)}{(z - E(Z))^2} + e^z S_Q \tag{89} \]

The reasonable choice of

\[ z = E(Z) + \sqrt{2 \mathrm{Var}(Z)/S} \tag{90} \]

results in

\[ S \le 2 e^{E(Z) + \sqrt{2 \mathrm{Var}(Z)/S}} S_Q \tag{91} \]

Notice that this proof gives no indication that the bound is tight, nor guidance towards constructing an actual bucketing forest (except for telling which coordinates to throw away). We tried to strengthen the theorem in the following way. Instead of restricting the expected number of points falling into each leaf bucket, allow larger leafs and only insist that the total number of comparisons is at most aN. Surprisingly the strengthened statement is wrong, and a 'large leafs' bucketing forest is theoretically better than our algorithm. But it is complicated and non-robust.

XII. A SEMI-LEXICOGRAPHIC FOREST UPPER BOUND

There remains the problem that we gave a lexicographic forest algorithm, but a bucketing forest upper bound. It is a technicality, which may be skipped over with little loss. Any semi-lexicographic complete tree can be converted into a bucketing tree in an obvious way: prune the complete tree from the leafs down as much as possible, preserving the property that each leaf expects at most a/2 points from X_0. The success probability of the semi-lexicographic tree is bounded by

\[ P \le P_{tree} + R \tag{92} \]
where $P_{tree}$ is the success probability of the truncated tree, for which we have a good bound, and $R$ is a remainder term associated with truncated tree vertexes expecting more than $a/2$ tree points.

Lemma 12.1: Assume the independent data model and consider a semi-lexicographic tree with the standard coordinate order (that does not restrict generality) and a totally random values order. Assume that the special points pair agree in coordinates $1, 2, \ldots, i-1$, but disagree at coordinate $i$:

$$y_1, y_2, \ldots, y_{i-1} \neq b, \qquad y_i = b \qquad (93)$$

Conditioning on that, the probability of success is at most

$$\frac{2a}{n_0\, p_{1,y_1} p_{2,y_2} \cdots p_{i-1,y_{i-1}}} \qquad (94)$$

Proof: Denote

$$p = p_{1,y_1} p_{2,y_2} \cdots p_{i-1,y_{i-1}} \qquad (95)$$

Let $m$ be the number of $X_0$ points agreeing with the special pair in their first $i-1$ coordinates. Its probability distribution is $1 + \mathrm{Bernoulli}(p, n_0 - 1)$. Let us consider these $m$ points ordered by the algorithm. The rank of the special $X_0$ point can be $1, 2, \ldots, m$ with equal probabilities. Those $m$ ordered points are broken up into up to $b$ intervals according to the value of coordinate $i$. Where does the special $X_1$ point fit in? It is in a different interval than the $X_0$ special point, but its location in that interval, and the order of intervals, is random. Hence the probability that the two special points are at most $a+1$ apart is at most $2a/m$. This has to be averaged:

$$\sum_{m=1}^{n} \binom{n-1}{m-1} p^{m-1} (1-p)^{n-m} \frac{2a}{m} \le \frac{2a}{np} \qquad (96)$$

Theorem 12.2: Assume the independent data model. Then the success probability of any semi-lexicographic tree with a totally random coordinate values order is at most

$$P \le 2 \ln\left( e^{4.5} N \right) N^{-\lambda} \prod_{i=1}^{d} \max\left( 1, \sum_{j=0}^{b-1} p_{i,j}\, p_{i,j*}^{\lambda} \right) \qquad (97)$$

for any $0 \le \lambda \le 1$, where

$$N = \max\left( 1, \frac{2 n_0}{a} \right) \qquad (98)$$

Proof: Without restricting generality assume that the coordinates have the standard order.
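The averaging step (96) at the end of Lemma 12.1's proof can be verified numerically. The sketch below uses hypothetical values of $n$, $p$ and $a$; the closed form of the sum is $\frac{2a}{np}\left(1 - (1-p)^n\right)$, which is at most $\frac{2a}{np}$:

```python
from math import comb

# Numeric check of the averaging step in Lemma 12.1: m = 1 + Binomial(n-1, p)
# points share the prefix, and the conditional success probability is at
# most 2a/m.  Parameters are illustrative only.
n, p, a = 20, 0.3, 2.0

avg = sum(comb(n - 1, m - 1) * p ** (m - 1) * (1 - p) ** (n - m) * (2 * a / m)
          for m in range(1, n + 1))

closed_form = (2 * a / (n * p)) * (1 - (1 - p) ** n)  # exact value of the sum
bound = 2 * a / (n * p)                               # the lemma's bound

assert abs(avg - closed_form) < 1e-12
assert avg <= bound
```

The bound is tight when $(1-p)^n$ is negligible, i.e. when the shared prefix is likely to be populated.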
We have established that

$$R \le \sum_{\substack{0 \le t \le d \\ 0 \le w_1, w_2, \ldots, w_t < b \\ N \prod_{i=1}^{t} p_{i,w_i*} \ge 1}} \frac{4}{N} \prod_{i=1}^{t} \frac{p_{i,w_i}}{p_{i,w_i*}} \cdot \left( 1 - \sum_{j=0}^{b-1} p_{t+1,j} \right)$$

The negative terms can be shifted to the next $t$:

$$R \le \frac{4}{N} + \sum_{\substack{1 \le t \le d \\ 0 \le w_1, w_2, \ldots, w_t < b \\ N \prod_{i=1}^{t} p_{i,w_i*} \ge 1}} \frac{4}{N} \prod_{i=1}^{t} \frac{p_{i,w_i}}{p_{i,w_i*}} \cdot (1 - p_{t,w_t*})$$

Denote

$$\tilde{R}_{w_1,\ldots,w_s} = \sum_{\substack{s \le t \le d \\ 0 \le w_{s+1}, w_{s+2}, \ldots, w_t < b \\ N \prod_{i=1}^{t} p_{i,w_i*} \ge 1}} \prod_{i=s+1}^{t} \frac{p_{i,w_i}}{p_{i,w_i*}} \cdot (1 - p_{t,w_t*})$$

We will prove by induction from the leafs down that

$$\tilde{R}_{w_1,w_2,\ldots,w_s} \le N_{w_1,\ldots,w_s}^{1-\lambda} \ln\left( e N_{w_1,\ldots,w_s} \right) \prod_{i=s+1}^{d} \max\left( 1, \sum_{j=0}^{b-1} p_{i,j}\, p_{i,j*}^{\lambda} \right) \qquad (99\text{–}100)$$

where

$$N_{w_1,\ldots,w_s} = N \prod_{i=1}^{s} p_{i,w_i*} \qquad (101)$$

The induction step boils down to

$$\ln\left( e N_{w_1,\ldots,w_{s-1}} \right) \ge (1 - p_{s,w_s*}) + \ln\left( e N_{w_1,\ldots,w_{s-1}}\, p_{s,w_s*} \right)$$

which is obviously true: it reduces to $\ln p_{s,w_s*} \le p_{s,w_s*} - 1$.

Theorem 11.2 is converted into

Theorem 12.3: Assume the independent data model. Suppose a semi-lexicographic forest with a totally random coordinate values order contains $T$ trees, its success probability is $S$, and

$$N = \max\left( 1, \frac{2 n_0}{a} \right) \qquad (102)$$

Then for any $0 \le \lambda \le 1$

$$\ln T \ge \lambda \ln N - \ln\left[ 2 \ln\left( e^{4.5} N \right) \right] + \ln \frac{S}{2} - \sqrt{\frac{4}{S} \sum_{i=1}^{d} V(P_i, \lambda)} - \sum_{i=1}^{d} F(P_i, \lambda) \qquad (103\text{–}104)$$

XIII. A LOWER BOUND

Theorem 13.1: Assume the independent data model and denote

$$N = \frac{2 n_0}{a} \qquad (105)$$

Let $\epsilon > 0$ be some small parameter, and let $\lambda, r_1, r_2, \ldots, r_d$ attain

$$\min_{\lambda \ge 0}\; \max_{0 \le r_1, \ldots, r_d \le 1} \left[ -(1+\epsilon)\lambda \ln N + \sum_{i=1}^{d} \sum_{j=0}^{b} p_{i,j} \ln\left( 1 - r_i + r_i (j \neq b)\, p_{i,j*}^{-\lambda} \right) \right] \qquad (106\text{–}107)$$

The extrema conditions are

$$\sum_{i=1}^{d} \sum_{j=0}^{b-1} p_{i,j} \frac{-r_i \ln p_{i,j*}}{(1-r_i)\, p_{i,j*}^{\lambda} + r_i} = (1+\epsilon) \ln N \qquad (108)$$

and

$$r_i = 0 \quad \text{or} \quad \sum_{j=0}^{b-1} \frac{p_{i,j}}{(1-r_i)\, p_{i,j*}^{\lambda} + r_i} = 1, \qquad 1 \le i \le d \qquad (109)$$

Suppose that for some $\delta < 1/7$

$$\sum_{i=1}^{d} \sum_{j=0}^{b} p_{i,j} \left( \ln\left[ 1 - r_i + (j \neq b)\, r_i\, p_{i,j*}^{-\lambda} \right] - \sum_{k=0}^{b} p_{i,k} \ln\left[ 1 - r_i + (k \neq b)\, r_i\, p_{i,k*}^{-\lambda} \right] \right)^2 \le \epsilon^2 \delta \lambda^2 (\ln N)^2$$

$$\sum_{i=1}^{d} \sum_{j=0}^{b-1} p_{i,j} \left( \frac{-r_i \ln p_{i,j*}}{(1-r_i)\, p_{i,j*}^{\lambda} + r_i} - \sum_{k=0}^{b-1} p_{i,k} \frac{-r_i \ln p_{i,k*}}{(1-r_i)\, p_{i,k*}^{\lambda} + r_i} \right)^2 \le \epsilon^2 \delta (\ln N)^2 / 4$$

$$\sum_{i=1}^{d} \sum_{j=0}^{b-1} p_{i,j} \frac{r_i (1-r_i) [\ln p_{i,j*}]^2}{\left[ (1-r_i)\, p_{i,j*}^{\lambda} + r_i \right]^2} \le \epsilon^2 \delta (\ln N)^2 / 8 \qquad (110)$$

Then the general algorithm with $T$ tries where

$$\ln T \ge \ln \frac{1}{\delta} + (1+3\epsilon)\lambda \ln N - \sum_{i=1}^{d} \sum_{j=0}^{b} p_{i,j} \ln\left( 1 - r_i + r_i (j \neq b)\, p_{i,j*}^{-\lambda} \right) \qquad (111\text{–}112)$$

has success probability

$$S \ge 1 - 7\delta \qquad (113)$$

Moreover there exists a bucketing forest with $T$ trees and at least $1 - 7\delta$ success probability.

The alarmingly complicated small variance conditions are asymptotically valid, because the variances grow linearly with $\ln N$. However there is no guarantee that they can always be met. Indeed the upper bound is of the Chernoff inequality large deviation type, and can be a poor estimate in pathological cases.

Definition 13.1: Let $Y, Z$ be joint random variables. We denote by $Y_Z$ the conditional type random variable $Y$ with its probability density multiplied by

$$\frac{e^{Z}}{\mathrm{E}[e^{Z}]} \qquad (114)$$

In the discrete case $Z, Y$ would have values $y_i, z_i$ with probability $p_i$. Then $Y_Z$ has values $y_i$ with probability

$$\frac{p_i e^{z_i}}{\sum_j p_j e^{z_j}} \qquad (115)$$

Lemma 13.2: For any random variable $Z$, and $\lambda \ge 0$,

$$\ln \mathrm{Prob}\left\{ Z \ge \mathrm{E}[Z_{\lambda Z}] \right\} \le \ln \mathrm{E}\left[ e^{\lambda Z} \right] - \lambda \mathrm{E}[Z_{\lambda Z}] \qquad (116)$$

$$\ln \mathrm{Prob}\left\{ Z \ge \mathrm{E}[Z_{\lambda Z}] - \sqrt{2 \mathrm{Var}[Z_{\lambda Z}]} \right\} \ge \ln \mathrm{E}\left[ e^{\lambda Z} \right] - \lambda \mathrm{E}[Z_{\lambda Z}] - \ln 2 - \lambda \sqrt{2 \mathrm{Var}[Z_{\lambda Z}]} \qquad (117\text{–}118)$$

Proof: The upper bound is the Chernoff bound.
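Definition 13.1 and the Chernoff half of Lemma 13.2 are easy to check on a small discrete example. The values below are hypothetical, chosen only to exercise the inequalities:

```python
import math

# Z takes value z[k] with probability pr[k]; Z_{lambda Z} is Z reweighted
# by exp(lambda * Z) / E[exp(lambda * Z)], as in Definition 13.1.
z   = [-1.0, 0.0, 1.5, 3.0]
pr  = [0.4, 0.3, 0.2, 0.1]
lam = 0.8

mgf = sum(p * math.exp(lam * zi) for p, zi in zip(pr, z))      # E[e^{lam Z}]
tilted = [p * math.exp(lam * zi) / mgf for p, zi in zip(pr, z)]
mean_tilted = sum(q * zi for q, zi in zip(tilted, z))          # E[Z_{lam Z}]

# Chernoff (116): ln Prob{Z >= E[Z_{lam Z}]} <= ln E[e^{lam Z}] - lam E[Z_{lam Z}]
tail = sum(p for p, zi in zip(pr, z) if zi >= mean_tilted)
assert math.log(tail) <= math.log(mgf) - lam * mean_tilted

# E[Z_{lam Z}] is the derivative of ln E[e^{lam Z}] with respect to lam
# (the identity noted after the lemma), checked by a central difference.
def log_mgf(l):
    return math.log(sum(p * math.exp(l * zi) for p, zi in zip(pr, z)))

h = 1e-6
assert abs((log_mgf(lam + h) - log_mgf(lam - h)) / (2 * h) - mean_tilted) < 1e-6
```

The tilted mean sits between $\mathrm{E}[Z]$ and $\max Z$, which is what makes it a useful threshold in both halves of the lemma.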
The lower bound combines the Chebyshev inequality

$$\mathrm{Prob}\left\{ \left| Z_{\lambda Z} - \mathrm{E}[Z_{\lambda Z}] \right| \le \sqrt{2 \mathrm{Var}[Z_{\lambda Z}]} \right\} \ge \frac{1}{2} \qquad (119)$$

with the fact that the condition in the curly brackets bounds the densities ratio:

$$\ln \frac{e^{\lambda Z_{\lambda Z}}}{\mathrm{E}\left[ e^{\lambda Z} \right]} \le -\ln \mathrm{E}\left[ e^{\lambda Z} \right] + \lambda \mathrm{E}[Z_{\lambda Z}] + \lambda \sqrt{2 \mathrm{Var}[Z_{\lambda Z}]} \qquad (120\text{–}121)$$

It is amusing, and sometimes useful, to note that

$$\mathrm{E}[Z_{\lambda Z}] = \frac{\partial \ln \mathrm{E}\left[ e^{\lambda Z} \right]}{\partial \lambda} \qquad (122)$$

$$\mathrm{Var}[Z_{\lambda Z}] = \frac{\partial^2 \ln \mathrm{E}\left[ e^{\lambda Z} \right]}{\partial \lambda^2} \qquad (123)$$

We will now prove theorem 13.1.

Proof: Let $\lambda \ge 0$ be a parameter to be optimized. Let $w \in \{0,1\}^d$ be the random Bernoulli vector

$$w_i = (\lambda_i \le \lambda) \qquad (124)$$

where $\lambda_i$ is the $i$'th random exponent. In a slight abuse of notation let $0 \le r_i \le 1$ denote not a random variable but a probability

$$r_i = \mathrm{Prob}\{w_i == 1\} = \mathrm{Prob}\{\lambda_i \le \lambda\} \qquad (125)$$

We could not resist doing that because equation (45) is still valid under this interpretation. Another point of view is to forget (45) and consider $r_i$ a parameter to be optimized. Again let $y \in \{0, 1, \ldots, b\}^d$ denote the abbreviated state of the special points $x_0, x_1$. Let us consider a single try of our algorithm, conditioned on both $y$ and $w$. The following requirements

$$\prod_{i=1}^{d} \left( 1 - w_i + w_i (y_i \neq b) \right) = 1 \qquad (126)$$

$$\prod_{i=1}^{d} \left( 1 - w_i + w_i\, p_{i,y_i*} \right) \le \frac{1}{N} = \frac{a}{2 n_0} \qquad (127)$$

state that the expected number of $X_0$ points in the bucket defined by the coordinates whose $w_i = 1$, with value $y_i$, is at most $a/2$. Then the probability that the actual number of bucket points is more than $a$ is bounded from above by $1/2$.
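The last step above — an expected bucket occupancy of at most $a/2$ forces the overflow probability below $1/2$ — follows from Markov's inequality, since $\mathrm{Prob}\{X > a\} = \mathrm{Prob}\{X \ge a+1\} \le \mathrm{E}[X]/(a+1) < 1/2$. A quick numeric illustration with a Binomial occupancy count and hypothetical parameters:

```python
from math import comb

# If a bucket expects mu <= a/2 points, Markov's inequality gives
# Prob{count > a} <= mu / (a + 1) < 1/2.  Check with a Binomial(n0, q)
# occupancy count; n0, a are illustrative.
n0, a = 1000, 6
q = (a / 2) / n0            # expected count is exactly a/2

tail = 1.0 - sum(comb(n0, k) * q ** k * (1 - q) ** (n0 - k)
                 for k in range(a + 1))

mu = n0 * q
assert tail <= mu / (a + 1) <= 0.5
```

The actual overflow probability is far smaller than the Markov bound here; only the crude $1/2$ guarantee is needed by the proof.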
A more compact way of stating (126) and (127) together is

$$Z(y, w) \ge \ln N \qquad (128)$$

$$Z(y, w) = \sum_{i=1}^{d} \ln\left[ 1 - w_i + w_i (y_i \neq b)\, p_{i,y_i*}^{-1} \right] \qquad (129)$$

Summing over $w$ gives the success probability of a single try, conditioned over $y$, to be at least

$$P(y) \ge \frac{1}{2} \sum_{w \in \{0,1\}^d} \prod_{i=1}^{d} \left[ (1-w_i)(1-r_i) + w_i r_i \right] \left( Z(y, w) \ge \ln N \right)$$

In short

$$P(y) \ge \frac{1}{2} \mathrm{Prob}\{Z(y) \ge \ln N\} \qquad (130)$$

Conditioning over $y$ makes tries independent of each other, hence the conditional success probability of $T$ tries is at least

$$S(y) \ge 1 - (1 - P(y))^T \ge \frac{T P(y)}{1 + T P(y)} \qquad (131)$$

Averaging over $y$ bounds the success probability $S$ of the algorithm by

$$S \ge \sum_{y \in \{0,1,\ldots,b\}^d} \prod_{i=1}^{d} p_{i,y_i} \cdot \frac{T P(y)}{1 + T P(y)} \qquad (132)$$

In short

$$S \ge \mathrm{E}\left[ \frac{T P(y)}{1 + T P(y)} \right] \qquad (133)$$

Now we must get our hands dirty. The reverse Chernoff inequality is

$$\ln \mathrm{Prob}\left\{ Z(y) \ge \mathrm{E}\left[ Z(y)_{\lambda Z(y)} \right] - \sqrt{2 \mathrm{Var}\left[ Z(y)_{\lambda Z(y)} \right]} \right\} \ge \ln \mathrm{E}\left[ e^{\lambda Z(y)} \right] - \lambda \mathrm{E}\left[ Z(y)_{\lambda Z(y)} \right] - \ln 2 - \lambda \sqrt{2 \mathrm{Var}\left[ Z(y)_{\lambda Z(y)} \right]}$$

Denoting

$$U(y) = \ln \mathrm{E}\left[ e^{\lambda Z(y)} \right] = \sum_{i=1}^{d} \ln\left[ 1 - r_i + (y_i \neq b)\, r_i\, p_{i,y_i*}^{-\lambda} \right]$$

$$V(y) = \frac{\partial U(y)}{\partial \lambda} = \mathrm{E}\left[ Z(y)_{\lambda Z(y)} \right] = \sum_{\substack{1 \le i \le d \\ y_i \neq b}} \frac{-r_i \ln p_{i,y_i*}}{(1-r_i)\, p_{i,y_i*}^{\lambda} + r_i} \qquad (134\text{–}135)$$

$$W(y) = \frac{\partial^2 U(y)}{\partial \lambda^2} = \mathrm{Var}\left[ Z(y)_{\lambda Z(y)} \right] = \sum_{\substack{1 \le i \le d \\ y_i \neq b}} \frac{r_i (1-r_i) [\ln p_{i,y_i*}]^2}{\left[ (1-r_i)\, p_{i,y_i*}^{\lambda} + r_i \right]^2} \qquad (136\text{–}137)$$

the reverse Chernoff inequality can be rewritten as

$$\ln \mathrm{Prob}\left\{ Z(y) \ge V(y) - \sqrt{2 W(y)} \right\} \ge U(y) - \lambda V(y) - \ln 2 - \lambda \sqrt{2 W(y)} \qquad (138\text{–}139)$$

It is time for the second inequality tier.
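The closed form for $U(y)$ above factorizes across coordinates because $Z(y,w)$ is a sum of independent terms. The brute-force sketch below, with hypothetical parameters, checks it against direct enumeration over $w \in \{0,1\}^d$:

```python
import itertools
import math

# Hypothetical small instance: w_i ~ Bernoulli(r_i) independently, and
# Z(y, w) = sum_i ln[1 - w_i + w_i (y_i != b) / p_{i, y_i *}].
b      = 2
r      = [0.3, 0.7, 0.5]
p_star = [0.2, 0.1, 0.4]    # p_{i, y_i *}
y      = [0, b, 1]          # coordinate 1 is a disagreement (y_1 == b)
lam    = 0.6

def Z(w):
    total = 0.0
    for wi, yi, ps in zip(w, y, p_star):
        v = 1 - wi + wi * (yi != b) / ps
        if v == 0.0:               # w takes a disagreeing coordinate,
            return float("-inf")   # contributing exp(lam * -inf) = 0
        total += math.log(v)
    return total

# E[e^{lam Z(y)}] by enumerating all w in {0,1}^d ...
mgf = sum(math.prod(ri if wi else 1 - ri for wi, ri in zip(w, r))
          * math.exp(lam * Z(w))
          for w in itertools.product([0, 1], repeat=len(r)))

# ... against U(y) = sum_i ln[1 - r_i + (y_i != b) r_i p_{i,y_i*}^{-lam}].
U = sum(math.log(1 - ri + (yi != b) * ri * ps ** (-lam))
        for ri, yi, ps in zip(r, y, p_star))

assert abs(math.log(mgf) - U) < 1e-12
```

The disagreeing coordinate contributes the factor $1 - r_i$ to $e^{U(y)}$: any try that takes it fails outright, which is exactly the requirement (126).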
For any $\delta < 1/3$

$$\mathrm{Prob}\left\{ \left| U(y) - \mathrm{E}[U] \right| \le \sqrt{\mathrm{Var}[U]/\delta},\;\; \left| V(y) - \mathrm{E}[V] \right| \le \sqrt{\mathrm{Var}[V]/\delta},\;\; W(y) \le \mathrm{E}[W]/\delta \right\} \ge 1 - 3\delta \qquad (140\text{–}142)$$

where

$$\mathrm{E}[U] = \sum_{i=1}^{d} \sum_{j=0}^{b} p_{i,j} \ln\left[ 1 - r_i + (j \neq b)\, r_i\, p_{i,j*}^{-\lambda} \right] \qquad (143)$$

$$\mathrm{E}[V] = \sum_{i=1}^{d} \sum_{j=0}^{b-1} p_{i,j} \frac{-r_i \ln p_{i,j*}}{(1-r_i)\, p_{i,j*}^{\lambda} + r_i} \qquad (144)$$

Hence

$$\ln \mathrm{Prob}\left\{ Z(y) \ge \mathrm{E}[V] - \sqrt{\mathrm{Var}[V]/\delta} - \sqrt{2\mathrm{E}[W]/\delta} \right\} \ge \mathrm{E}[U] - \lambda \mathrm{E}[V] - \ln 2 - \sqrt{\mathrm{Var}[U]/\delta} - \lambda \sqrt{\mathrm{Var}[V]/\delta} - \lambda \sqrt{2\mathrm{E}[W]/\delta}$$

Now we have to pull all the strings together. In order to connect with (130) we will require

$$\mathrm{E}[V] = (1+\epsilon) \ln N \qquad (145)$$

$$\sqrt{\mathrm{Var}[V]} + \sqrt{2\mathrm{E}[W]} \le \epsilon \delta^{1/2} \ln N \qquad (146)$$

for some small $\epsilon > 0$. Recalling (135), condition (145) is achieved by choosing $\lambda$ to attain

$$\min_{\lambda \ge 0} \left[ -(1+\epsilon)\lambda \ln N + \mathrm{E}[U] \right] \qquad (147)$$

If (146) holds, then

$$\ln P(y) \ge -(1+2\epsilon)\lambda \ln N + \mathrm{E}[U] - \ln 4 - \sqrt{\mathrm{Var}[U]/\delta} \qquad (148)$$

with probability at least $1 - 3\delta$. Recalling (133), the success probability is at least

$$S \ge \frac{1 - 3\delta}{1 + 4 e^{(1+2\epsilon)\lambda \ln N - \mathrm{E}[U] + \sqrt{\mathrm{Var}[U]/\delta}}/T} \qquad (149)$$

XIV. CONCLUSION

To sum up, we present three things:
1) An approximate nearest neighbor algorithm (45), and its sparse approximation (63).
2) An information style performance estimate (48).
3) A warning against dimensional reduction of sparse data, see section IX.

REFERENCES

[1] N. Alon. Private communication.
[2] A. Andoni, P. Indyk. Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. Proc. FOCS, 2006.
[3] A. Broder. Identifying and Filtering Near-Duplicate Documents. Proc. FUN, 1998.
[4] C. Gennaro, P. Savino and P. Zezula. Similarity Search in Metric Databases through Hashing. Proc. ACM Workshop on Multimedia, 2001.
[5] P. Indyk and R. Motwani. Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality. Proc. 30th Annu. ACM Sympos. Theory Comput., 1998.
[6] R.M. Karp, O. Waarts, and G. Zweig. The Bit Vector Intersection Problem. Proc. 36th Annu. IEEE Sympos. Foundations of Computer Science, pp. 621-630, 1995.
[7] Bucketing Information and the Statistical High Dimensional Nearest Neighbor Problem. To be published.