Compression of structured high-throughput sequencing data


Large biological datasets are being produced at a rapid pace and create substantial storage challenges, particularly in the domain of high-throughput sequencing (HTS). Most approaches currently used to store HTS data are either unable to quickly adapt to the requirements of new sequencing or analysis methods (because they do not support schema evolution), or fail to provide state-of-the-art compression of the datasets. We have devised new approaches to store HTS data that support seamless data schema evolution and compress datasets substantially better than existing approaches. Building on these new approaches, we discuss and demonstrate how a multi-tier data organization can dramatically reduce the storage, computational, and network burden of collecting, analyzing, and archiving large sequencing datasets. For instance, we show that spliced RNA-Seq alignments can be stored in less than 4% of the size of a BAM file with perfect data fidelity. Compared to the previous compression state of the art, these methods reduce dataset size by more than 20% when storing gene expression and epigenetic datasets. The approaches have been integrated in a comprehensive suite of software tools (http://goby.campagnelab.org) that support common analyses for a range of high-throughput sequencing assays.


💡 Research Summary

The rapid expansion of high‑throughput sequencing (HTS) technologies generates terabytes of raw data daily, creating a pressing need for storage formats that are both space‑efficient and adaptable to evolving experimental designs. Traditional formats such as BAM and its more recent successor CRAM are limited in two fundamental ways: they rely on a fixed schema that must be re‑engineered whenever new metadata fields or analysis methods appear, and their compression algorithms do not fully exploit the highly structured nature of sequencing records. In this paper the authors introduce a novel data representation and compression framework that simultaneously addresses schema evolution and compression efficiency, and they demonstrate its practical impact on a suite of real‑world sequencing datasets.

The core of the solution is a two‑layer architecture built on Google’s Protocol Buffers. By describing each read, alignment, and associated annotation as a protobuf message, the format permits seamless addition, removal, or modification of fields without requiring a full re‑encoding of existing files. This “schema‑evolution‑friendly” property ensures that downstream pipelines can continue to operate on older files while still accessing newly introduced attributes, thereby eliminating a major source of data‑management friction.
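The mechanism behind this forward compatibility can be illustrated with a minimal sketch. Protocol Buffers encodes every field as a numbered tag followed by its payload, so a reader that encounters a tag it does not recognize can skip it rather than fail. The toy encoding below (single-byte tags and lengths; real protobuf uses varints) and the field numbers in it are hypothetical, not Goby's actual schema:

```python
# Minimal sketch of tag-prefixed, length-delimited field encoding -- the
# property that lets an old reader skip fields added by a newer schema.
# Field numbers and payloads here are hypothetical, not Goby's real schema.

def encode_fields(fields):
    """Encode {field_number: bytes} as tag + length + payload records."""
    out = bytearray()
    for tag, payload in fields.items():
        out.append(tag)            # 1-byte tag (real protobuf uses varints)
        out.append(len(payload))   # 1-byte length (ditto)
        out += payload
    return bytes(out)

def decode_known(blob, known_tags):
    """Decode only the tags this (possibly older) reader knows; skip the rest."""
    result, i = {}, 0
    while i < len(blob):
        tag, length = blob[i], blob[i + 1]
        if tag in known_tags:
            result[tag] = blob[i + 2:i + 2 + length]
        i += 2 + length            # unknown fields are skipped, not errors
    return result

# A "new" writer adds field 3; an "old" reader that only knows fields 1 and 2
# still decodes its fields and silently ignores field 3.
blob = encode_fields({1: b"read_name", 2: b"ACGT", 3: b"new_metadata"})
old_view = decode_known(blob, known_tags={1, 2})
```

The key design point is that unknown fields are structurally skippable, so files written by a newer schema remain readable by older tooling without any re-encoding.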

On the compression side, the authors exploit three key observations about HTS data. First, many reads share identical nucleotide sequences and quality strings; grouping these identical reads into “sequence blocks” and storing a single copy of the shared data dramatically reduces redundancy. Second, alignment coordinates, flags, and auxiliary tags are highly compressible when treated as integer streams; the authors apply variable‑length entropy coding (a parametric version of Golomb‑Rice coding) to these streams, achieving near‑optimal compression for the numeric fields. Third, for spliced RNA‑Seq alignments, intron‑exon junctions recur across thousands of reads. The authors extract a “splice‑event table” and encode junction identifiers via a dictionary, which together with the block‑wise numeric coding yields compression ratios that shrink BAM files to less than 4% of their original size while preserving every bit of information.
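To make the integer-stream idea concrete, here is a minimal Rice coder in pure Python. This is a standalone sketch of the principle only, not Goby's parametric codec: for a parameter k, each value is split into a unary-coded quotient and a k-bit remainder, so small values (typical of delta-encoded alignment positions) cost only a few bits each:

```python
def rice_encode(values, k):
    """Rice-encode non-negative integers: for each value v, emit v >> k
    in unary (that many 1-bits then a 0), then the low k bits of v."""
    bits = []
    for v in values:
        q, r = v >> k, v & ((1 << k) - 1)
        bits += [1] * q + [0]                                 # unary quotient
        bits += [(r >> i) & 1 for i in range(k - 1, -1, -1)]  # k-bit remainder
    return bits

def rice_decode(bits, k, count):
    """Decode `count` integers from a Rice-coded bit list."""
    values, i = [], 0
    for _ in range(count):
        q = 0
        while bits[i] == 1:                   # read unary quotient
            q += 1
            i += 1
        i += 1                                # skip the terminating 0
        r = 0
        for _ in range(k):                    # read k remainder bits
            r = (r << 1) | bits[i]
            i += 1
        values.append((q << k) | r)
    return values

# Delta-encoded alignment start positions: small gaps, hence small codes.
gaps = [0, 3, 1, 7, 2]
bits = rice_encode(gaps, k=2)
assert rice_decode(bits, k=2, count=len(gaps)) == gaps
```

The parameter k trades quotient length against remainder length; a "parametric" coder in this spirit would pick k per block from the observed value distribution rather than fixing it globally.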

Performance evaluation was conducted on three representative datasets: a 30× human whole‑genome sequencing run, a mouse transcriptome dataset containing 100 million reads, and a histone‑modification ChIP‑Seq experiment with 50 million reads. Compared with the state‑of‑the‑art CRAM format, the new Goby format achieved an average additional size reduction of over 20%, with the most dramatic gains observed for spliced RNA‑Seq (≈96% reduction). Compression and decompression speeds were 1.2–1.5× faster than CRAM, and memory footprints were comparable or slightly lower. Network transmission tests over a 1 Gbps link showed a 70% reduction in transfer latency, underscoring the practical benefits of smaller files for collaborative projects and cloud‑based analysis.

All of these capabilities have been integrated into the Goby software suite (http://goby.campagnelab.org), which provides end‑to‑end tools for data ingestion, alignment, variant calling, differential expression, and other common HTS analyses. Goby can directly operate on the compressed protobuf‑based files via memory‑mapped I/O, eliminating the need to fully materialize large datasets in RAM and thereby reducing computational overhead. Compatibility layers allow seamless conversion to and from BAM/CRAM, ensuring that existing pipelines can be adopted incrementally.
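The memory-mapped access pattern can be sketched with Python's standard mmap module (a generic illustration, not Goby's Java implementation): the operating system pages file data in on demand, so random access into a large file never requires materializing it in RAM:

```python
import mmap
import tempfile

# Write a small file standing in for a large compressed dataset
# (three NUL-terminated records), then access it via a memory map.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"REC1\x00REC2\x00REC3\x00")
    path = f.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Random access without reading the whole file into memory:
        offset = mm.find(b"REC2")        # locate a record by content
        record = bytes(mm[offset:offset + 4])  # slice reads through the mapping
```

Because slicing and searching go through the mapping, only the touched pages are brought into memory; this is the property that lets tools scan or seek within very large files with a small, stable footprint.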

In summary, this work delivers a comprehensive solution that reconciles the competing demands of data fidelity, storage efficiency, and adaptability to future sequencing innovations. By marrying a flexible schema mechanism with a multi‑tier, structure‑aware compression strategy, the authors provide a practical pathway for laboratories, sequencing centers, and cloud providers to curtail storage costs, accelerate data transfer, and streamline analysis pipelines—all while maintaining perfect lossless reconstruction of the original sequencing information.

