Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons


The Jaccard similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard similarity among pairs of large datasets. Our algorithm provides an efficient encoding of this problem into a multiplication of sparse matrices. Both the encoding and sparse matrix product are performed in a way that minimizes data movement in terms of communication and synchronization costs. We apply our algorithm to obtain similarity among all pairs of a set of large samples of genomes. This task is a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets, using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.


💡 Research Summary

The paper addresses the long‑standing challenge of computing exact Jaccard similarity (J) and Jaccard distance (d_J = 1 − J) for massive collections of high‑throughput sequencing samples. Existing sketch‑based solutions such as Mash (built on MinHash) provide fast approximations but lose accuracy for highly similar pairs and require large sketch sizes to handle dissimilar datasets. To overcome these limitations, the authors introduce SimilarityAtScale, the first communication‑efficient distributed algorithm that computes exact Jaccard measures by reformulating the problem as a sparse matrix multiplication.
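As a baseline for the exact quantity the paper targets, the set-level definitions can be sketched in a few lines of Python. The `kmers` helper and the toy sequences below are illustrative, not from the paper:

```python
def kmers(seq, k=4):
    """All distinct k-mers (length-k substrings) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity J = |A ∩ B| / |A ∪ B| of two k-mer sets."""
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 1.0

s1, s2 = kmers("ACGTACGTAC"), kmers("ACGTACGTTT")
j = jaccard(s1, s2)   # exact Jaccard similarity
d = 1.0 - j           # Jaccard distance d_J = 1 - J
```

Sketch-based tools estimate `j` from small fingerprints of the sets; the paper's contribution is computing it exactly at scale.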

The core idea is to represent each sample as a binary indicator vector over the universe of possible k‑mers. Stacking these vectors as columns yields an indicator matrix A ∈ {0,1}^{m × n}, where m is the number of distinct k‑mers and n is the number of samples. The intersection cardinalities between every pair of samples are then obtained as B = Aᵀ A, a sparse matrix‑matrix product: each entry B_{ij} equals |X_i ∩ X_j|, where X_i denotes the k‑mer set of sample i. The union cardinalities are derived from the pre‑computed column sums |X_i| and the intersection matrix: C_{ij} = |X_i| + |X_j| − B_{ij}. Finally, the Jaccard similarity matrix S is computed element‑wise as S_{ij} = B_{ij} / C_{ij}.
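This encoding can be sketched on a toy dense example with NumPy (the paper operates on huge sparse matrices via the Cyclops framework; the small matrix below is illustrative only):

```python
import numpy as np

# Toy universe of m = 5 k-mers and n = 3 samples.
# A[i, j] = 1 iff k-mer i occurs in sample j.
A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1],
              [0, 0, 1]])

B = A.T @ A                               # B[i, j] = |X_i ∩ X_j|
sizes = A.sum(axis=0)                     # |X_i| = column sums
C = sizes[:, None] + sizes[None, :] - B   # C[i, j] = |X_i ∪ X_j|
S = B / C                                 # element-wise Jaccard similarity
```

The diagonal of S is 1 by construction (each sample is identical to itself), which is a quick sanity check for the formulation.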

Implementing this formulation efficiently on a distributed‑memory system is non‑trivial because a naïve Aᵀ A computation would require all‑to‑all communication. The authors leverage the Cyclops tensor framework and adopt a 2‑D block‑cyclic distribution of the matrices. Each process owns a sub‑block of A and the corresponding sub‑blocks of Aᵀ, and only exchanges the blocks that are needed for the local multiplication. By using asynchronous point‑to‑point communication and overlapping computation with data transfer, the algorithm avoids the costly collective All‑Reduce pattern typical of MapReduce‑style approaches. The communication volume per iteration scales as O(p · log p) rather than O(p²), where p is the number of MPI ranks, dramatically reducing network traffic at large scale.
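The block decomposition can be emulated serially: partitioning A into column blocks makes each output block B_{qr} = A_qᵀ A_r depend on only two column blocks, which is the structural property a distributed implementation exploits to exchange blocks pairwise instead of gathering all of A. A minimal NumPy sketch (the block count and random matrix are invented for illustration; this is not the Cyclops implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
A = (rng.random((40, 8)) < 0.3).astype(np.int64)  # sparse-ish indicator matrix

# Emulate p = 4 "ranks", each owning a contiguous block of sample columns.
p = 4
blocks = np.split(A, p, axis=1)   # A = [A_0 | A_1 | ... | A_{p-1}]

# Each output block B_{qr} = A_q^T A_r needs only the two owning ranks' data,
# so in a distributed setting ranks exchange column blocks pairwise.
B = np.block([[blocks[q].T @ blocks[r] for r in range(p)] for q in range(p)])

assert np.array_equal(B, A.T @ A)  # blockwise product matches the full product
```

In the real distributed setting each block product runs on a different rank and the pairwise exchanges can be overlapped with computation, as the paragraph above describes.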

Complexity analysis shows that the preprocessing (k‑mer extraction and indicator matrix construction) is linear in the number of non‑zero entries nnz(A). The multiplication cost depends on the degree distribution of k‑mers: O(∑_k deg(k)²), where deg(k) is the number of samples containing k‑mer k. Because real genomic data are highly sparse (most k‑mers appear in a small fraction of samples), this cost remains manageable even for billions of k‑mers. Memory usage is minimized by storing A in compressed sparse row/column formats and by streaming intermediate results.
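Both cost terms can be estimated directly from the k-mer degree distribution. A small sketch on hypothetical toy data (the sample names and k-mer sets are invented for illustration):

```python
from collections import Counter

# Hypothetical sample-to-k-mer-set mapping.
samples = {
    "s1": {"ACGT", "CGTA", "GTAC"},
    "s2": {"ACGT", "CGTA", "TTTT"},
    "s3": {"ACGT", "GGGG"},
}

# deg(k) = number of samples containing k-mer k.
deg = Counter(k for kmer_set in samples.values() for k in kmer_set)

nnz = sum(deg.values())                   # nnz(A): preprocessing is O(nnz)
flops = sum(d * d for d in deg.values())  # multiplication ~ O(sum_k deg(k)^2)
```

Because most real k-mers have small degree, `flops` stays close to `nnz` in practice, which is why the quadratic-in-degree bound remains manageable.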

The authors validate their approach with two large‑scale experiments. In the first, they process 2,580 human RNA‑Seq experiments (≈3.3 TB of raw FASTQ) on a 1,024‑node cluster (each node with 32 cores). The full Jaccard matrix (≈1.8 TB) is computed in under 30 minutes, using ≈1.8 TB of distributed memory. In the second, they analyze nearly all publicly available bacterial and viral whole‑genome sequences (≈170 TB of cleaned assemblies), again on 1,024 nodes, completing the all‑pairs similarity in roughly two hours. Compared to Mash and MinHash baselines, SimilarityAtScale is 5–10× faster and achieves an average absolute error below 0.01, even for pairs with J > 0.9 where sketch‑based methods typically degrade.

To make the technology accessible to bioinformaticians, the authors wrap the algorithm in a user‑friendly tool called GenomeAtScale. GenomeAtScale accepts standard FASTA/FASTQ inputs, performs k‑mer extraction, builds the indicator matrix, runs the distributed multiplication, and outputs the Jaccard similarity and distance matrices in common formats. The tool integrates with existing pipelines and is open‑source (GitHub link provided).

In summary, the paper delivers a mathematically exact, communication‑optimal, and highly scalable solution for Jaccard similarity computation on distributed‑memory systems. By exploiting sparse linear algebra and careful communication avoidance, it enables unprecedented scale‑up of genomic distance calculations, opening the door to large‑scale phylogenetic analyses, metagenomic clustering, and other data‑science tasks that rely on set‑based similarity measures.

