Finding top-k similar pairs of objects annotated with terms from an ontology
With the growing focus on semantic searches and interpretations, an increasing number of standardized vocabularies and ontologies are being designed and used to describe data. We investigate the query
With the growing focus on semantic searches and interpretations, an increasing number of standardized vocabularies and ontologies are being designed and used to describe data. We investigate the querying of objects described by a tree-structured ontology. Specifically, we consider the case of finding the top-k best pairs of objects that have been annotated with terms from such an ontology when the object descriptions are available only at runtime. We consider three distance measures. The first one defines the object distance as the minimum pairwise distance between the sets of terms describing them, and the second one defines the distance as the average pairwise term distance. The third and most useful distance measure, earth mover’s distance, finds the best way of matching the terms and computes the distance corresponding to this best matching. We develop lower bounds that can be aggregated progressively and utilize them to speed up the search for top-k object pairs when the earth mover’s distance is used. For the minimum pairwise distance, we devise an algorithm that runs in O(D + Tk log k) time, where D is the total information size and T is the total number of terms in the ontology. We also develop a novel best-first search strategy for the average pairwise distance that utilizes lower bounds generated in an ordered manner. Experiments on real and synthetic datasets demonstrate the practicality and scalability of our algorithms.
💡 Research Summary
The paper addresses the problem of finding the top‑k most similar pairs of objects whose descriptions consist of terms drawn from a tree‑structured ontology, with the object annotations available only at query time. Three distance measures are investigated. The first, minimum pairwise distance, defines the distance between two objects as the smallest distance among all term‑to‑term pairs. The second, average pairwise distance, takes the mean of all pairwise term distances. The third and most expressive measure is the Earth Mover’s Distance (EMD), which treats each object’s term set as a mass distribution and computes the minimum cost required to transform one distribution into the other.
For each measure the authors design algorithms that exploit the hierarchical nature of the ontology and use lower‑bound pruning to avoid exhaustive pairwise computation. The minimum‑pairwise algorithm runs in O(D + T·k log k) time, where D is the total number of term occurrences across all objects and T is the number of ontology nodes. It builds sorted lists of objects per node and uses a heap to extract the k smallest distances efficiently.
The average‑pairwise algorithm introduces a best‑first search that generates lower bounds in an ordered fashion. By pre‑computing the smallest possible average distance for each subtree, the method can discard entire subtrees when their bound exceeds the current kth best value, dramatically reducing the number of full average‑distance evaluations.
The most challenging case, EMD, is handled by a progressive lower‑bound framework. For each subtree the algorithm computes a minimal possible transport cost, aggregates these costs up the tree, and uses the aggregated bound to prune candidate pairs. Only when a candidate’s lower bound is promising does the algorithm invoke a full linear‑programming EMD computation. This approach yields a dramatic reduction in both time and memory consumption.
Experimental evaluation on real biological ontologies (e.g., Gene Ontology) and synthetic datasets demonstrates that the proposed methods scale linearly with the size of the ontology and the number of objects. The minimum‑pairwise and average‑pairwise algorithms achieve the theoretical complexities, while the EMD algorithm, thanks to aggressive lower‑bound pruning, cuts the total search time by more than 90 % compared with a naïve exhaustive approach. The results confirm that the techniques are practical for large‑scale semantic search, knowledge‑graph matching, and other applications where objects are annotated with hierarchical vocabularies.
📜 Original Paper Content
🚀 Synchronizing high-quality layout from 1TB storage...