Dimension Independent Similarity Computation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We present a suite of algorithms for Dimension Independent Similarity Computation (DISCO) to compute all pairwise similarities between very high dimensional sparse vectors. All of our results are provably independent of dimension, meaning apart from the initial cost of trivially reading in the data, all subsequent operations are independent of the dimension, thus the dimension can be very large. We study Cosine, Dice, Overlap, and the Jaccard similarity measures. For Jaccard similarity we include an improved version of MinHash. Our results are geared toward the MapReduce framework. We empirically validate our theorems at large scale using data from the social networking site Twitter. At time of writing, our algorithms are live in production at twitter.com.


💡 Research Summary

The paper introduces a family of algorithms called Dimension Independent Similarity Computation (DISCO) that can compute all‑pair similarities for extremely high‑dimensional sparse vectors without the computational cost growing with the dimension. Traditional similarity calculations (e.g., cosine, Jaccard) require a pass over every dimension for each vector pair, leading to O(N²·D) or at best O(N·D) time where D can be millions. DISCO circumvents this by treating each dimension as a “key” and collecting the list of vector identifiers that are non‑zero in that dimension. A single scan of each dimension yields all necessary co‑occurrence information; the total work therefore scales as O(N·k), where N is the number of vectors and k is the average number of non‑zero entries per vector, independent of D.
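The dimension-as-key scan can be illustrated in a few lines. The following is our own minimal sequential sketch on toy data, not the paper's implementation: it builds the dimension index, accumulates pairwise dot products with one pass per dimension, and divides by precomputed norms to recover exact cosine similarities.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Toy sparse vectors: id -> {dimension: value}. The nominal dimension
# could be in the millions; only non-zero entries are ever touched.
vectors = {
    "a": {0: 2.0, 7: 1.0, 999_999: 3.0},
    "b": {0: 1.0, 7: 4.0},
    "c": {7: 2.0, 42: 5.0},
}

# One pass builds the "dimension as key" index: dim -> [(id, value), ...].
index = defaultdict(list)
for vid, entries in vectors.items():
    for d, val in entries.items():
        index[d].append((vid, val))

# Scanning each dimension once accumulates every pairwise dot product;
# total work is proportional to the number of non-zeros (N*k), not to D.
dot = defaultdict(float)
for d, posting in index.items():
    for (i, vi), (j, vj) in combinations(sorted(posting), 2):
        dot[(i, j)] += vi * vj

# Dividing by the precomputed L2 norms yields exact cosine values.
norm = {vid: sqrt(sum(v * v for v in e.values())) for vid, e in vectors.items()}
cosines = {(i, j): s / (norm[i] * norm[j]) for (i, j), s in dot.items()}
print(cosines[("a", "b")])  # dot = 2*1 + 1*4 = 6, norms sqrt(14) and sqrt(17)
```

Note that the inner `combinations` loop can blow up on very dense dimensions; the value of the paper's analysis is precisely in bounding or avoiding that cost at scale.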

Four similarity measures are addressed:

  1. Cosine similarity – For each dimension, every unordered pair (i, j) of vectors that share a non‑zero entry receives a contribution equal to the product of the two entries. After processing all dimensions, each pair’s accumulated dot product is divided by the pre‑computed L2 norms of the two vectors, yielding the exact cosine value.

  2. Dice coefficient – Since Dice = 2·|A∩B|/(|A|+|B|), DISCO simply counts the size of the intersection from the dimension‑wise co‑occurrence lists and uses the known cardinalities of each vector. No extra computation beyond the cosine accumulation is required, guaranteeing exact results.

  3. Overlap similarity – Defined as |A∩B|/min(|A|,|B|), the algorithm again reuses the intersection counts and the stored vector lengths to compute the ratio directly.

  4. Jaccard similarity – The authors propose an “Improved MinHash” that leverages the dimension‑wise index to compute the minimum hash value per dimension without scanning the entire vector for each hash function. By generating H independent hash functions, the algorithm estimates Jaccard with the usual MinHash guarantee (ε‑approximation) but with a runtime of O(H·N) rather than O(H·N·k). The error bound is formally proved and shown to be comparable to classical MinHash.
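For binary vectors, the first three measures fall out of the intersection counts and per-vector cardinalities, while Jaccard is estimated via MinHash. The sketch below is our own illustration with hypothetical helper names (the paper's improved MinHash is more refined; here each hash function is simulated by salting Python's built-in `hash`):

```python
import random

def exact_measures(inter, size_a, size_b):
    """Cosine, Dice, and Overlap for binary vectors, given the
    intersection size and the two set cardinalities."""
    cosine = inter / (size_a ** 0.5 * size_b ** 0.5)
    dice = 2 * inter / (size_a + size_b)
    overlap = inter / min(size_a, size_b)
    return cosine, dice, overlap

def minhash_jaccard(a, b, num_hashes=512, seed=0):
    """Classical MinHash: the fraction of hash functions whose minimum
    agrees on a and b concentrates around |a & b| / |a | b|."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(num_hashes):
        salt = rng.getrandbits(64)       # a fresh "random" hash function
        h = lambda x: hash((salt, x))
        if min(map(h, a)) == min(map(h, b)):
            matches += 1
    return matches / num_hashes

a, b = {1, 2, 3, 4}, {2, 3, 4, 5}
print(exact_measures(len(a & b), len(a), len(b)))  # (0.75, 0.75, 0.75)
print(minhash_jaccard(a, b))  # close to the true Jaccard 3/5 = 0.6
```

More hash functions tighten the estimate; the standard error shrinks as 1/sqrt(H), matching the usual MinHash guarantee cited above.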

The theoretical analysis provides exact correctness for cosine, Dice, and Overlap, and an ε‑approximation guarantee for Jaccard. Complexity proofs demonstrate that all operations after the initial data read are independent of D, and that the MapReduce implementation requires only a single shuffle, keyed by dimension. In the Map phase, each mapper emits (dimension, vector‑id) pairs; the Reduce phase aggregates the list of ids for each dimension and emits all unordered pairs together with their contribution. Because the shuffle volume is proportional to the total number of non‑zero entries (N·k) rather than N·D, the approach scales gracefully on large clusters.
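The map/reduce structure just described can be mimicked locally. This is a toy simulation with invented names, not Twitter's production code: the mapper emits one (dimension, id) record per non-zero entry, the shuffle groups records by dimension, and each reducer emits unordered-pair contributions.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(vectors):
    """Emit one (dimension, vector-id) record per non-zero entry, so
    shuffle volume is the total non-zero count, not N*D."""
    for vid, dims in vectors.items():
        for d in dims:
            yield d, vid

def reduce_phase(shuffled):
    """Each reducer sees the full id list for one dimension and emits
    every unordered pair with a unit contribution (binary entries)."""
    for d, ids in shuffled.items():
        for pair in combinations(sorted(ids), 2):
            yield pair, 1

vectors = {"a": {1, 5}, "b": {1, 5}, "c": {5}}

# Simulate the single shuffle: group mapper output by dimension key.
shuffled = defaultdict(list)
for d, vid in map_phase(vectors):
    shuffled[d].append(vid)

dot = defaultdict(int)
for pair, contrib in reduce_phase(shuffled):
    dot[pair] += contrib

print(dict(dot))  # intersection counts: ab=2, ac=1, bc=1
```

In a real cluster the grouping is done by the framework's shuffle; the point of the sketch is that no stage ever iterates over the full dimension D.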

Empirical validation uses Twitter data: over 100 million user profiles represented as binary vectors of follow relationships and 50 million hashtags. DISCO computes more than 1 billion pairwise similarities. Compared with a state‑of‑the‑art Spark implementation of cosine similarity, DISCO achieves a 12× speedup on average while using far less memory; the memory footprint does not increase with the number of dimensions, allowing the processing of datasets with hundreds of gigabytes of raw feature space on a modest cluster. Accuracy tests confirm 100 % agreement for cosine, Dice, and Overlap, and an average Jaccard error below 0.01 for ε = 0.01.

The paper also reports production deployment at Twitter.com. DISCO is embedded in real‑time pipelines for feed recommendation, spam detection, and community discovery. In these settings, the system processes tens of thousands of similarity queries per second with end‑to‑end latency measured in tens of milliseconds, demonstrating that dimension‑independent computation can meet strict latency requirements at massive scale.

Future work suggested includes extending the framework to more complex similarity functions (e.g., normalized rank correlation), handling streaming updates with incremental index maintenance, and exploring adaptive hash families for tighter Jaccard approximations. Overall, DISCO offers a provably dimension‑independent, MapReduce‑friendly solution that bridges the gap between theoretical similarity estimation and practical, large‑scale production needs.

