A Heterogeneous High Dimensional Approximate Nearest Neighbor Algorithm

Reading time: 1 minute

📝 Original Info

  • Title: A Heterogeneous High Dimensional Approximate Nearest Neighbor Algorithm
  • ArXiv ID: 0810.4188
  • Date: 2008-10-24
  • Authors: **Moshe Dubiner (Google)**

📝 Abstract

We consider the problem of finding high dimensional approximate nearest neighbors. Suppose there are d independent rare features, each having its own independent statistics. A point x will have x_{i}=0 denote the absence of feature i, and x_{i}=1 its existence. Sparsity means that usually x_{i}=0. Distance between points is a variant of the Hamming distance. Dimensional reduction converts the sparse heterogeneous problem into a lower dimensional full homogeneous problem. However we will see that the converted problem can be much harder to solve than the original problem. Instead we suggest a direct approach. It consists of T tries. In try t we rearrange the coordinates in decreasing order of (1-r_{t,i})\frac{p_{i,11}}{p_{i,01}+p_{i,10}} \ln\frac{1}{p_{i,1*}} where 0<r_{t,i}<1 are uniform pseudo-random numbers, and the p's are the coordinate's statistical parameters. The points are lexicographically ordered, and each is compared to its neighbors in that order. We analyze a generalization of this algorithm, show that it is optimal in some class of algorithms, and estimate the necessary number of tries to success.


📄 Full Content

arXiv:0810.4188v1 [cs.IT] 23 Oct 2008

A Heterogeneous High Dimensional Approximate Nearest Neighbor Algorithm

Moshe Dubiner

ACKNOWLEDGMENT

I would like to thank Phil Long and David Pablo Cohn for reviewing rough drafts of this paper and suggesting many clarifications. The remaining obscurity is my fault.

M. Dubiner is with Google, e-mail: moshe@google.com. Manuscript submitted to IEEE Transactions on Information Theory on March 3, 2007.

JOURNAL OF LATEX CLASS FILES, VOL. 1, NO. 1, MARCH 2007

Abstract

We consider the problem of finding high dimensional approximate nearest neighbors. Suppose there are d independent rare features, each having its own independent statistics. A point x will have x_{i} = 0 denote the absence of feature i, and x_{i} = 1 its existence. Sparsity means that usually x_{i} = 0. Distance between points is a variant of the Hamming distance. Dimensional reduction converts the sparse heterogeneous problem into a lower dimensional full homogeneous problem. However we will see that the converted problem can be much harder to solve than the original problem. Instead we suggest a direct approach. It consists of T tries. In try t we rearrange the coordinates in decreasing order of

(1 - r_{t,i}) \frac{p_{i,11}}{p_{i,01} + p_{i,10}} \ln \frac{1}{p_{i,1*}}    (1)

where 0 < r_{t,i} < 1 are uniform pseudo-random numbers, and the p's are the coordinate's statistical parameters. The points are lexicographically ordered, and each is compared to its neighbors in that order. We analyze a generalization of this algorithm, show that it is optimal in some class of algorithms, and estimate the necessary number of tries to success. It is governed by an information-like function, which we call bucketing forest information. Any doubts whether it is "information" are dispelled by another paper, where unrestricted bucketing information is defined.
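A single try built around the randomized priority (1) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: points are taken to be 0/1 tuples, the per-coordinate statistics p_{i,11}, p_{i,01}, p_{i,10}, p_{i,1*} are supplied as lists, and the function name and `window` parameter are invented for the sketch.

```python
import math
import random

def one_try(points, p11, p01, p10, p1s, rng=None, window=1):
    """One try: score each coordinate with the randomized priority of
    equation (1), sort coordinates by decreasing score, sort the points
    lexicographically under that coordinate order, and compare each
    point to its neighbors in that order. Names are illustrative."""
    rng = rng or random.Random(0)
    d = len(p11)
    # Randomized priority (1 - r_{t,i}) * p11/(p01+p10) * ln(1/p1*).
    score = [(1.0 - rng.random()) * p11[i] / (p01[i] + p10[i])
             * math.log(1.0 / p1s[i]) for i in range(d)]
    order = sorted(range(d), key=lambda i: -score[i])
    # Lexicographic sort of the points under the rearranged coordinates.
    keyed = sorted(points, key=lambda x: tuple(x[i] for i in order))
    # Compare each point to its `window` successors in that order.
    candidates = []
    for j in range(len(keyed) - 1):
        for k in range(j + 1, min(j + 1 + window, len(keyed))):
            candidates.append((keyed[j], keyed[k]))
    return candidates
```

With `window=1` each point is compared only to its immediate lexicographic successor, so one try costs a sort plus a linear scan; repeating over T tries with fresh pseudo-random r_{t,i} gives the algorithm its chances to place the special pair adjacently.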
I. INTRODUCTION

Suppose we have two bags of points, X_0 and X_1, randomly distributed in a high-dimensional space. The points are independent of each other, with one exception: there is one unknown point x_0 in bag X_0 that is significantly closer to an unknown point x_1 in bag X_1 than would be accounted for by chance. We want an efficient algorithm for quickly finding these two 'paired' points.

The reader might wonder why we need two sets, instead of working as usual with X = X_0 ∪ X_1. We have come full circle on this issue. The practical problem that got us interested in this theory involved texts from two languages, hence two different sets. However it seemed that the asymmetry between X_0 and X_1 was not important, so we developed a one-set theory. Then we found out that keeping X_0, X_1 separate makes things clearer.

Let us start with the well known simple homogeneous marginally Bernoulli(1/2) example. Suppose X_0, X_1 ⊂ {0,1}^d of sizes n_0, n_1 respectively are randomly chosen as independent Bernoulli(1/2) variables, with one exception. Choose randomly one point x_0 ∈ X_0, xor it with a random Bernoulli(p) vector and overwrite one randomly chosen x_1 ∈ X_1. A symmetric description is to say that the i'th bits of x_0, x_1 have the joint probability

P = \begin{pmatrix} p/2 & (1-p)/2 \\ (1-p)/2 & p/2 \end{pmatrix}    (2)

for some p > 1/2. We assume that we know p. In practice it will have to be estimated. Let

\ln W = \ln n_0 + \ln n_1 - I(P) d    (3)

where

I(P) = p \ln(2p) + (1-p) \ln(2(1-p))    (4)

is the mutual information between the special pair's single coordinate values. Information theory tells us that we cannot hope to pin the special pair down into fewer than W possibilities, but can come close to it in some asymptotic sense. Assume that W is small. How can we find the closest pair? The trivial way to do it is to compare all the n_0 n_1 pairs. A better way has been known for a long time.
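Equations (3) and (4) are direct to evaluate. The sketch below (function names are my own, invented for illustration) computes the mutual information I(P) of a single coordinate and the resulting bound on the number of remaining pair possibilities.

```python
import math

def mutual_information(p):
    """I(P) = p ln(2p) + (1-p) ln(2(1-p)), as in equation (4): the
    mutual information between the special pair's coordinate values.
    Zero at p = 1/2 (uncorrelated bits), ln 2 as p -> 1."""
    return p * math.log(2 * p) + (1 - p) * math.log(2 * (1 - p))

def log_candidates(n0, n1, d, p):
    """ln W = ln n0 + ln n1 - I(P) d, as in equation (3): the log of
    how many pair possibilities information theory leaves us with."""
    return math.log(n0) + math.log(n1) - mutual_information(p) * d
```

When `log_candidates` is well below zero the d coordinates carry more than enough information to single out the planted pair, which is the regime ("Assume that W is small") the paper works in.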
The earliest references I am aware of are Karp, Waarts and Zweig [6], Broder [3], Indyk and Motwani [5]. They do not limit themselves to this simplistic problem, but their approach clearly handles it. Without restricting generality let n_0 ≤ n_1. We randomly choose

k \approx \log_2 n_0    (5)

out of the d coordinates, and compare the point pairs which agree on these coordinates (in other words, fall into the same bucket). The expected number of comparisons is

n_0 n_1 2^{-k} \approx n_1    (6)

while the probability of success of one comparison is p^k. In case of failure we try again, with other random k coordinates. At first glance it might seem that the expected number of tries until success is p^{-k}, but that is not true because the attempts are interdependent. The correct computation is done in the next section. In the unlimited data case d → ∞ indeed

T \approx p^{-k} \approx n_0^{\log_2 1/p}    (7)

Is this optimal? Alon [1] has suggested the possibility of improvement by using Hamming's perfect code. We have found that in the n_0 = n_1 = n case, T \approx n^{\log_2 1/p} can be reduced to

T \approx n^{1/p - 1 + \epsilon}    (8)

for any \epsilon > 0, see [7]. Unfortunately this seems hard to convert into a practical alg
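One try of the random-coordinates bucketing scheme behind equations (5) and (6) can be sketched as follows. This is a simplified illustration; the function name and dictionary-based bucketing are my own, not taken from the cited papers, and points are assumed to be 0/1 tuples of equal length.

```python
import random

def bucket_pass(X0, X1, k, rng=None):
    """One try: pick k of the d coordinates at random and return only
    the cross pairs (x, y) in X0 x X1 that agree on all k of them,
    i.e. fall into the same bucket."""
    rng = rng or random.Random()
    d = len(X0[0])
    coords = rng.sample(range(d), k)
    key = lambda pt: tuple(pt[i] for i in coords)
    buckets = {}
    for x in X0:
        buckets.setdefault(key(x), []).append(x)
    pairs = []
    for y in X1:
        for x in buckets.get(key(y), []):
            pairs.append((x, y))
    return pairs
```

With k ≈ log_2 n_0 the expected number of surviving comparisons is about n_0 n_1 2^{-k} ≈ n_1, as in (6), while a planted pair whose bits agree with probability p per coordinate survives a single try with probability p^k; in case of failure the try is repeated with fresh random coordinates.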

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
