We show how to represent sets in a linear space data structure such that expressions involving unions and intersections of sets can be computed in a worst-case efficient way. This problem has applications in e.g. information retrieval and database systems. We mainly consider the RAM model of computation, and sets of machine words, but also state our results in the I/O model. On a RAM with word size $w$, a special case of our result is that the intersection of $m$ (preprocessed) sets, containing $n$ elements in total, can be computed in expected time $O(n (\log w)^2 / w + km)$, where $k$ is the number of elements in the intersection. If the first of the two terms dominates, this is a factor $w^{1-o(1)}$ faster than the standard solution of merging sorted lists. We show a cell probe lower bound of time $\Omega(n/(w m \log m)+ (1-\tfrac{\log k}{w}) k)$, meaning that our upper bound is nearly optimal for small $m$. Our algorithm uses a novel combination of approximate set representations and word-level parallelism.
Algorithms and data structures for sets play an important role in computer science. For example, the relational data model, which has been the dominant database paradigm for decades, is based on set representation and manipulation. Set operations also arise naturally in connection with database queries that can be expressed as a boolean combination of simpler queries. For example, search engines report documents that are present in the intersection of several sets of documents, each corresponding to a word in the query. If we fix the set of documents to be searched, it is possible to spend time on preprocessing all sets, to decrease the time for answering queries.
The search engine application has been the main motivation in several recent works on computing set intersections [4,11,12]. All these papers assume that elements are taken from an ordered set, and are accessed through comparisons. In particular, creating the canonical representation, a sorted list, is the best possible preprocessing in this context. The comparison-based model rules out some algorithms that are very efficient, both in theory and practice. For example, if the preprocessing produces a hashing-based dictionary for each set, the intersection of two sets $S_1$ and $S_2$ can be computed in expected time $O(\min(|S_1|, |S_2|))$. This is a factor $\Theta(\log(1 + \max(|S_1|,|S_2|)/\min(|S_1|,|S_2|)))$ faster than the best possible worst-case performance of comparison-based algorithms.
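The hashing-based approach mentioned above can be sketched as follows; the function name `intersect` and the concrete example are our own illustration, not code from the paper. The key point is that only the smaller set is scanned, and each membership probe into the (preprocessed) hash table of the larger set takes expected constant time.

```python
def intersect(s1, s2):
    """Intersect two sets by probing a hash table of the larger set
    with each element of the smaller one.  Given preprocessed hash
    tables, the expected time is O(min(|S1|, |S2|))."""
    small, large = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    # Each membership test `x in large` is expected O(1) with hashing.
    return {x for x in small if x in large}

# The smaller set drives the running time, regardless of |S2|.
print(intersect({1, 2, 3, 4}, set(range(1000))))  # -> {1, 2, 3, 4}
```

Note that no comparison-based algorithm can match this bound in the worst case, since sorted-list intersection requires a logarithmic factor more comparisons.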
In this paper we investigate non-comparison-based techniques for evaluating expressions involving unions and intersections of sets on a RAM. (In the search engine application this corresponds to expressions using AND and OR operators.) Specifically, we consider the situation in which each set must be represented in a linear space data structure, and propose the multi-resolution set representation, which supports efficient set operations. We show that in many cases it is possible to achieve running time that is sub-linear in the total size of the input sets and intermediate results of the expression. For example, we can compute the intersection of a number of sets in time sub-linear in the total size of the sets, plus time proportional to the number of input elements in the intersection. In contrast, all previous algorithms that we are aware of take at least linear time in the worst case over all possible input sets, even if the output is the empty set. The time complexity of our algorithm improves as the word size $w$ of the RAM grows. While the typical word size of a modern CPU is 64 bits, modern CPU designs are superscalar, meaning that several independent instructions can be executed in parallel. Thus, in most cases (with the notable exception of multiplication) it is possible to simulate operations on larger word sizes at the same (or nearly the same) speed as operations on single words. We expect that word-level parallelism may gain in importance as a way of exploiting the increasing parallelism of modern processor architectures.
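To make the idea of word-level parallelism concrete, here is a minimal sketch (our own illustration, not the paper's data structure) of intersecting two sets of small integers packed into bitsets. A single AND instruction processes $w$ positions of the universe at once; Python integers stand in for arbitrary-width machine words.

```python
def to_bitset(s):
    """Pack a set of small non-negative integers into a bitset,
    using a Python int as an arbitrary-width machine word."""
    b = 0
    for x in s:
        b |= 1 << x
    return b

def bitset_intersection(b1, b2):
    # One bitwise AND handles w universe positions per machine word --
    # the kind of word-level parallelism exploited in the paper.
    return b1 & b2

def from_bitset(b):
    """Unpack a bitset back into a Python set."""
    return {i for i in range(b.bit_length()) if (b >> i) & 1}

a = to_bitset({1, 5, 9, 12})
b = to_bitset({5, 9, 13})
result = from_bitset(bitset_intersection(a, b))
```

Of course, plain bitsets take space proportional to the universe rather than to the set, which is why the paper's linear-space representation is more involved than this sketch.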
The problem of computing intersections and unions (as well as differences) of sorted sets was recently considered in a number of papers (e.g. [4,12]) in an adaptive setting. A good adaptive algorithm uses a number of comparisons that is close (or as close as possible) to the size of the smallest set of comparisons that determine the result. In the case of two sorted sets, this is the number of interleavings when merging the sets. In the worst case this number is linear in the size of the sets, in which case the adaptive algorithm performs no better than standard merging. However, adaptive algorithms are able to exploit “easy” cases to achieve smaller running time. Mirzazadeh in his thesis [15] extended this line of work to arbitrary expressions with unions and intersections. These results are incomparable to those obtained in this paper: Our algorithm is faster for most problem instances, but the adaptive algorithms are faster in certain cases. It is instructive to consider the case of computing the intersection of two sets of size n where the size of the intersection is relatively small. In this case, an optimal adaptive algorithm is faster than our algorithm only if the number of interleavings of the sorted lists (i.e., the number of sublists needed to form the sorted list of the union of the sets) is less than around n/w.
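The adaptive algorithms discussed above typically rely on galloping (exponential) search to skip long runs between interleavings. The following is a simplified sketch of that technique for two sorted lists, under our own naming; it is not the algorithm of any particular cited paper. Its cost grows with the number of interleavings rather than the total input length.

```python
from bisect import bisect_left

def gallop_search(a, x, lo):
    """Return the first index >= lo with a[index] >= x, by doubling
    the step size and then binary-searching the bracketed range."""
    step = 1
    hi = lo + 1
    while hi < len(a) and a[hi] < x:
        lo = hi
        step *= 2
        hi = lo + step
    return bisect_left(a, x, lo, min(hi + 1, len(a)))

def adaptive_intersect(a, b):
    """Intersect two sorted lists, galloping past long runs.
    With few interleavings, far fewer than |a| + |b| comparisons
    are performed -- the hallmark of adaptive algorithms."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = gallop_search(a, b[j], i)
        else:
            j = gallop_search(b, a[i], j)
    return out
```

When the lists interleave heavily, galloping degenerates to ordinary merging, which matches the worst-case behavior noted above.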
Another idea that has been studied is, roughly speaking, to exploit asymmetry. Hwang and Lin [13] show that merging two sorted lists $S_1$ and $S_2$, where $|S_1| \le |S_2|$, requires $\Theta(|S_1| \log(1 + |S_2|/|S_1|))$ comparisons in the worst case over all input lists. This is significantly less than the $O(|S_1| + |S_2|)$ cost of standard merging when one list is much shorter than the other.
This result was generalized to the computation of general expressions involving unions and intersections of sets by Chiniforooshan et al. [11]. Given an expression, and the sizes of the input sets, their algorithm uses a number of comparisons that is asymptotically equal to the minimum number of comparisons required in the worst case over all sets of those sizes.