Succinct Data Structures for Assembling Large Genomes

Motivation: Second-generation sequencing technology makes it feasible for many researchers to obtain enough sequence reads to attempt the de novo assembly of higher eukaryotes (including mammals). De novo assembly not only provides a tool for understanding wide-scale biological variation; within human biomedicine, it also offers a direct way of observing both large-scale structural variation and fine-scale sequence variation. Unfortunately, improvements in the computational feasibility of de novo assembly have not matched the improvements in the gathering of sequence data, for two reasons: the inherent computational complexity of the problem, and the in-practice memory requirements of existing tools. Results: In this paper we use entropy-compressed, or succinct, data structures to create a practical representation of the de Bruijn assembly graph, which requires at least a factor of 10 less storage than the structures used by deployed methods. In particular, we show that when stored succinctly, the de Bruijn assembly graph for Homo sapiens requires only 23 gigabytes of storage. Moreover, because our representation is entropy compressed, in the presence of sequencing errors it has better asymptotic scaling behaviour than conventional approaches.


💡 Research Summary

The paper tackles the growing gap between the massive amount of sequencing data generated by second‑generation (NGS) platforms and the prohibitive memory requirements of de Bruijn‑graph‑based de novo assemblers. Traditional implementations store each k‑mer as an explicit node and maintain adjacency lists or hash tables for edges, leading to memory consumption of O(N · k) bits, where N is the number of distinct k‑mers. For large eukaryotic genomes such as human (≈3 Gbp), this translates into hundreds of gigabytes of RAM, making many assemblies infeasible on commodity hardware.
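A back-of-envelope comparison makes the gap concrete. The figures below are illustrative assumptions (a 2-bit-packed 31-mer plus an 8-byte pointer per node for the explicit layout, and roughly one membership bit plus four edge-flag bits per k-mer for the succinct layout), not numbers taken from the paper:

```python
# Hypothetical sizing sketch: explicit k-mer storage vs. a succinct layout.
k = 31
N = 5_000_000_000                    # assumed distinct k-mers in a noisy human dataset

explicit_bytes = N * (k // 4 + 8)    # 2-bit-packed k-mer + 8-byte pointer (assumed layout)
succinct_bits = N * (1 + 4)          # ~1 membership bit + 4 edge-flag bits per k-mer

print(f"explicit : {explicit_bytes / 2**30:.0f} GiB")
print(f"succinct : {succinct_bits / 8 / 2**30:.0f} GiB")
```

Even under these rough assumptions the explicit representation is more than an order of magnitude larger, which is the regime the paper's measured numbers fall into.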

To overcome this, the authors propose a two‑layer compression strategy based on succinct data structures and entropy‑aware encoding. First, the set of nodes is represented as a bit‑vector of length 4^k, with a ‘1’ marking the presence of a particular k‑mer. Rank and select operations, supported in constant time by wavelet trees or specialized succinct indexes, allow rapid translation between a k‑mer’s lexical index and its compact identifier without storing explicit pointers. This reduces node storage to at most one bit per possible k‑mer, and because only N of the 4^k bits are actually set, sparse‑bit‑vector compression brings the space close to the information content of the node set.
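The node layer can be sketched as follows. This is a minimal pure-Python illustration (class and function names are mine, not the paper's): a real implementation would use an SDSL-style rank directory to make `rank` constant time rather than scanning the vector.

```python
# Sketch of the node layer: a bit-vector over all 4**k possible k-mers,
# with rank() mapping a present k-mer to a dense node identifier.

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_index(kmer: str) -> int:
    """Lexical index of a k-mer in [0, 4**k)."""
    i = 0
    for base in kmer:
        i = (i << 2) | ENC[base]
    return i

class SuccinctNodeSet:
    def __init__(self, k: int, kmers):
        self.k = k
        self.bits = bytearray((4 ** k + 7) // 8)   # one bit per possible k-mer
        for km in kmers:
            i = kmer_index(km)
            self.bits[i >> 3] |= 1 << (i & 7)

    def contains(self, kmer: str) -> bool:
        i = kmer_index(kmer)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

    def rank(self, kmer: str) -> int:
        """Dense node id: number of set bits strictly before this k-mer.
        Linear scan here; O(1) with a precomputed rank directory."""
        i = kmer_index(kmer)
        full, rem = divmod(i, 8)
        count = sum(bin(b).count("1") for b in self.bits[:full])
        count += bin(self.bits[full] & ((1 << rem) - 1)).count("1")
        return count

nodes = SuccinctNodeSet(3, ["ACG", "CGT", "GTA", "TAC"])
assert nodes.contains("CGT") and not nodes.contains("AAA")
```

The key property is that `rank` turns a sparse lexical index into a dense identifier, so per-node metadata (such as the edge flags below) can be stored in plain arrays with no pointers.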

Second, edges are encoded directly on the node bit‑vector using a 4‑bit flag per node, each bit indicating whether the node has an outgoing edge labeled A, C, G, or T. This eliminates the need for separate adjacency structures and limits edge storage to at most four bits per node, regardless of the degree of branching. The actual nucleotide strings of the k‑mers are further compressed with an FM‑index combined with a wavelet tree, which approaches the empirical entropy of the sequence data. Because sequencing errors introduce many unique erroneous k‑mers, the entropy of the dataset rises, but the FM‑index still compresses efficiently, yielding sub‑linear growth of memory with increasing error rates.
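The edge layer described above can be sketched as a nibble array indexed by dense node id, with bit b of a node's nibble set when the node has an outgoing edge labelled `"ACGT"[b]`. Again the names are illustrative, not from the paper:

```python
# Sketch of the edge layer: one 4-bit flag per node, packed two nodes
# per byte, each bit marking an outgoing edge labelled A, C, G, or T.

BASES = "ACGT"

class EdgeFlags:
    def __init__(self, num_nodes: int):
        self.nibbles = bytearray((num_nodes + 1) // 2)  # two nodes per byte

    def set_edge(self, node_id: int, base: str):
        bit = 1 << BASES.index(base)
        shift = 4 * (node_id & 1)           # low or high nibble of the byte
        self.nibbles[node_id >> 1] |= bit << shift

    def out_edges(self, node_id: int) -> str:
        shift = 4 * (node_id & 1)
        flags = (self.nibbles[node_id >> 1] >> shift) & 0xF
        return "".join(b for i, b in enumerate(BASES) if flags & (1 << i))

edges = EdgeFlags(num_nodes=4)
edges.set_edge(0, "C")     # node 0 has a successor via C
edges.set_edge(0, "G")     # branching node: a second outgoing edge
edges.set_edge(1, "T")
```

Note that a branching node costs no more than a non-branching one: the four flag bits bound edge storage regardless of out-degree, which is what removes the need for adjacency lists.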

Implementation relies on the SDSL‑lite library for fast rank/select and wavelet tree operations, and on memory‑mapped files (mmap) to keep the graph on disk while providing random‑access semantics. This design minimizes I/O overhead and allows the assembler to operate within a fixed memory budget even for terabase‑scale projects.
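The memory-mapping idea can be illustrated with Python's standard `mmap` module (the file name and layout here are assumptions for illustration; the actual assembler uses SDSL-lite structures over mapped files):

```python
# Sketch: keep the packed bit-vector in a file and memory-map it, so the
# OS pages it in on demand and the process keeps a bounded resident set.
import mmap
import os

path = "nodes.bv"                      # illustrative file name
with open(path, "wb") as f:
    f.write(bytes(1024))               # an 8192-bit vector, initially all zero

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)      # map the whole file
    i = 4242                           # set bit i, as during graph construction
    mm[i >> 3] |= 1 << (i & 7)
    assert mm[i >> 3] & (1 << (i & 7))
    mm.flush()                         # persist the page back to disk
    mm.close()
os.remove(path)
```

Because reads and writes go through ordinary indexing, the same rank/select code can run over an on-disk vector far larger than physical RAM.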

The authors evaluate their method on the human genome (≈3 Gbp) and compare it against widely used assemblers such as SOAPdenovo2 and SPAdes. Their succinct representation stores the entire de Bruijn graph in 23 GB of RAM, a ten‑fold reduction compared with the 200 GB typically required by conventional approaches. Graph construction time remains comparable (slightly higher in some cases) but does not suffer from the out‑of‑memory failures that plague other tools. When simulated sequencing error rates are varied from 0 % to 2 %, memory usage grows sub‑linearly; at error rates ≤1 % the succinct graph consumes 30‑50 % less memory than traditional methods, confirming the theoretical advantage of entropy‑based compression in noisy datasets.

Beyond storage, the paper demonstrates that downstream assembly steps—tip removal, bubble popping, and path extraction—can be performed directly on the succinct structure using rank/select queries to locate neighboring nodes in O(1) time. This eliminates costly hash‑lookups and pointer traversals, simplifying the assembler pipeline and making it more amenable to parallelisation.
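The traversal trick underlying this is that a node's successor under edge label b is fully determined by index arithmetic: drop the leading base, append b. A minimal sketch (illustrative names; a real implementation follows the arithmetic with a rank query to fetch the successor's dense id):

```python
# Sketch of neighbour navigation on the succinct structure: the successor
# of a k-mer under edge label `base` is computed by bit-shifting its
# lexical index, with no hash lookup or pointer chase.

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_index(kmer: str) -> int:
    i = 0
    for base in kmer:
        i = (i << 2) | ENC[base]
    return i

def successor_index(i: int, k: int, base: str) -> int:
    """Index of the successor k-mer: shift out the leading base,
    shift in the edge label."""
    mask = (1 << (2 * (k - 1))) - 1    # keep the trailing k-1 bases
    return ((i & mask) << 2) | ENC[base]

k = 3
assert successor_index(kmer_index("ACG"), k, "T") == kmer_index("CGT")
```

Combining this with the edge flags gives the O(1) neighbour queries on which tip removal and bubble popping are built.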

In conclusion, the study shows that by marrying succinct data structures (bit‑vectors with constant‑time rank/select) and entropy‑optimal string indexes, it is possible to represent de Bruijn graphs for large eukaryotic genomes with an order‑of‑magnitude reduction in memory while preserving practical performance. This advance opens the door to de novo assembly of complex genomes on modest hardware, facilitates cloud‑based analyses where memory is at a premium, and suggests a broader applicability of succinct structures to other bioinformatics graph problems.

