Approximate Nearest Neighbor Search through Comparisons
This paper addresses the problem of finding the nearest neighbor (or one of the R-nearest neighbors) of a query object q in a database of n objects. In contrast with most existing approaches, we can only access the "hidden" space in which the objects live through a similarity oracle. The oracle, given two reference objects and a query object, returns the reference object closest to the query object. The oracle attempts to model the behavior of human users, capable of making statements about similarity, but not of assigning meaningful numerical values to distances between objects.
💡 Research Summary
The paper tackles the classic nearest‑neighbor (NN) problem under a radically different information model: the algorithm cannot observe any numeric distances, but it may query a similarity oracle that, given two reference objects a and b together with a query object q, returns whichever reference is closer to q. This model is motivated by human‑centric applications where users can reliably answer comparative similarity questions (“Is this more like the query than that?”) but cannot assign meaningful scalar distances.
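The oracle interface can be made concrete with a short sketch. The paper's oracle is abstract (e.g. a human judge); here, as in the paper's simulated experiments, it is stood in for by an underlying Euclidean distance. The function names are illustrative, not from the paper.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points given as coordinate tuples.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def make_oracle(dist=euclidean):
    """Build a comparison oracle closer(a, b, q) simulated from a metric.

    The real oracle never exposes numeric distances to the caller; it only
    answers which of the two references is nearer to the query.
    """
    def closer(a, b, q):
        # Return whichever reference object (a or b) is nearer to the query q.
        return a if dist(a, q) <= dist(b, q) else b
    return closer

oracle = make_oracle()
print(oracle((0, 0), (10, 10), (1, 1)))  # -> (0, 0)
```

Note that the algorithm consuming this oracle sees only the returned reference, never the distances, which is what makes the model applicable to human judgments.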
The authors first formalize the oracle as a "consistent" triple‑comparison device and argue that it is strictly more general than access to a metric: every metric induces a consistent oracle, yet some consistent oracles cannot be embedded into any metric without distortion. Algorithms that rely solely on oracle answers therefore apply to a broader class of problems, especially those involving subjective or high‑dimensional data.
The core contribution is a two‑phase algorithmic framework. In the preprocessing phase, the database of n objects is recursively partitioned: at each node two representative objects (centroids) are chosen, and the oracle is invoked for every remaining point to decide which centroid it is closer to, yielding a binary comparison tree of depth O(log n). Unlike kd‑trees or ball‑trees, the tree is built without any distance calculations; each split costs only a constant number of oracle calls per point.
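The preprocessing phase can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the two centroids are picked uniformly at random here (the paper's selection rule may differ), and the oracle is simulated by 1‑D distance for demonstration.

```python
import random

def sim_oracle(a, b, q):
    # Stand-in oracle simulated from 1-D distance (the real oracle is abstract).
    return a if abs(a - q) <= abs(b - q) else b

def build_tree(objects, oracle, leaf_size=4, seed=0):
    """Sketch of the preprocessing phase: at each node, pick two
    representatives ('centroids'), route every remaining object with one
    oracle call per object, and recurse until leaves are small."""
    rng = random.Random(seed)

    def build(objs):
        if len(objs) <= leaf_size:
            return {"leaf": list(objs)}
        a, b = rng.sample(objs, 2)          # hypothetical centroid choice
        left, right = [a], [b]
        for x in objs:
            if x is a or x is b:
                continue
            # One oracle call decides which side x belongs to.
            (left if oracle(a, b, x) == a else right).append(x)
        return {"a": a, "b": b, "left": build(left), "right": build(right)}

    return build(objects)

tree = build_tree([float(i) for i in range(16)], sim_oracle)
```

Each object is touched once per level, so the build uses O(n log n) oracle calls in total for a balanced tree, matching the "constant number of calls per point per split" accounting above.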
During query processing, the algorithm traverses the tree from the root, asking the oracle at each internal node which of the node's two child centroids is closer to the query q. The answer determines which subtree to follow. The objects stored in the leaf that is reached form a candidate set of size roughly R (the user‑specified approximation parameter). A second, finer‑grained round of oracle comparisons among the candidates then selects one of the true R‑nearest neighbors. The total number of oracle calls per query is O(log n + R log R).
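The two query steps, descent and candidate refinement, can be sketched together. The hand‑built tree and the distance‑simulated oracle below are illustrative stand‑ins, and the refinement shown is a simple tournament among candidates rather than the paper's exact second‑round procedure.

```python
import math

def query(tree, q, oracle):
    """Descend the comparison tree: at each internal node, one oracle call
    picks the child centroid closer to q; the leaf reached is the candidate set."""
    while "leaf" not in tree:
        tree = tree["left"] if oracle(tree["a"], tree["b"], q) == tree["a"] else tree["right"]
    return tree["leaf"]

def best_of(candidates, q, oracle):
    # Tournament-style refinement: keep whichever object the oracle prefers.
    best = candidates[0]
    for c in candidates[1:]:
        best = oracle(best, c, q)
    return best

# Simulated oracle (the paper's oracle is abstract; Euclidean stands in).
def oracle(a, b, q):
    return a if math.dist(a, q) <= math.dist(b, q) else b

# Hand-built two-level tree over six 2-D points (hypothetical data).
tree = {"a": (0, 0), "b": (10, 10),
        "left": {"leaf": [(0, 0), (1, 1), (2, 2)]},
        "right": {"leaf": [(10, 10), (9, 9), (8, 8)]}}

cands = query(tree, (1.4, 1.4), oracle)
print(best_of(cands, (1.4, 1.4), oracle))  # -> (1, 1)
```

The descent costs one oracle call per level, i.e. O(log n) calls, and the refinement adds the R‑dependent term in the query bound.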
Theoretical analysis provides two guarantees. First, the query cost is logarithmic in the database size, a dramatic improvement over the naïve linear scan that would require O(n) comparisons. Second, under the consistency assumption, the algorithm returns an element that belongs to the true set of R‑nearest neighbors with probability at least 1 − ε, where ε can be made arbitrarily small by adjusting the depth of the tree and the size of the candidate set. The paper also extends the analysis to noisy oracles that answer incorrectly with bounded probability. By repeating each comparison several times and using majority voting, the error probability can be driven down while preserving the overall O(log n) query complexity.
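The repetition-and-majority-vote idea for noisy oracles is easy to demonstrate. The noisy oracle below is a synthetic stand-in that flips every third answer (a deterministic proxy for random error with rate 1/3); the names and error pattern are illustrative, not from the paper.

```python
def make_noisy_oracle():
    """Simulated noisy oracle over 1-D points: answers by true distance,
    but flips every third answer (a deterministic stand-in for random error)."""
    calls = [0]
    def noisy(a, b, q):
        calls[0] += 1
        truth = a if abs(a - q) <= abs(b - q) else b
        if calls[0] % 3 == 0:        # injected error on every third call
            return b if truth == a else a
        return truth
    return noisy

def majority_oracle(noisy, a, b, q, reps=15):
    # Repeat the comparison `reps` times and take the majority vote; if each
    # call errs independently with probability p < 1/2, the majority errs
    # with probability exponentially small in reps (Chernoff bound), at only
    # a constant-factor overhead per comparison.
    votes = sum(1 if noisy(a, b, q) == a else -1 for _ in range(reps))
    return a if votes > 0 else b

print(majority_oracle(make_noisy_oracle(), 1, 100, 5))  # -> 1 despite the noise
```

Since the overhead per comparison is a constant factor (for fixed target error), the overall O(log n) query complexity is preserved, as stated above.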
Empirical evaluation is performed on three domains: (1) image retrieval using visual descriptors, (2) text document similarity using TF‑IDF vectors, and (3) a human‑generated similarity dataset where participants answered pairwise comparison questions. In the first two settings the oracle is simulated by the true Euclidean or cosine distance, allowing a direct comparison with standard methods such as locality‑sensitive hashing (LSH), kd‑trees, and ball‑trees. The comparison‑tree approach matches or exceeds their recall at comparable query times, while using far fewer distance evaluations. In the third, truly subjective setting, the oracle‑only method dramatically outperforms any metric‑based baseline because no reasonable embedding exists. Moreover, the total number of oracle queries remains modest, demonstrating that interactive, real‑time search is feasible even when each human comparison is costly.
The discussion highlights several promising extensions. Multi‑reference oracles (comparing q against more than two references at once) could reduce depth further. Adaptive learning of the oracle’s error profile would allow the algorithm to allocate more repetitions where the oracle is uncertain. Finally, integrating the comparison‑tree structure with graph‑based or hierarchical clustering techniques could broaden applicability to non‑vector data such as social networks or time‑series.
In summary, the paper introduces a principled, provably efficient framework for approximate nearest‑neighbor search when only comparative similarity judgments are available. By replacing numeric distances with oracle‑driven splits, it bridges the gap between human‑friendly similarity queries and the algorithmic demands of large‑scale retrieval, offering both theoretical guarantees and practical performance gains.