CARGO: Effective format-free compressed storage of genomic information
The recent super-exponential growth in the amount of sequencing data generated worldwide has brought techniques for compressed storage into focus. Most available solutions, however, are strictly tied to specific bioinformatics formats, sometimes inheriting suboptimal design choices from them; this hinders flexible and effective data sharing. Here we present CARGO (Compressed ARchiving for GenOmics), a high-level framework to automatically generate software systems optimized for the compressed storage of arbitrary types of large genomic data collections. Straightforward applications of our approach to FASTQ and SAM archives require a few lines of code, produce solutions that match and sometimes outperform specialized format-tailored compressors, and scale well to multi-TB datasets.
💡 Research Summary
The explosive growth of next‑generation sequencing (NGS) data has turned data storage into a critical bottleneck for genomics research. Existing compressors such as DSRC, fqzcomp, and CRAM are tightly coupled to specific file formats (FASTQ, SAM, VCF) and inherit many of the inefficiencies of those formats. Consequently, they lack flexibility when new data types emerge or when researchers wish to share data across heterogeneous pipelines. In response to this limitation, the authors introduce CARGO (Compressed ARchiving for GenOmics), a high‑level, format‑free framework that automatically generates optimized compression and decompression software for arbitrary genomic data collections.
CARGO’s core innovation is a declarative schema language that lets users describe the logical structure of their data—field names, data types (integers, strings, arrays, nested structs), and per‑field compression preferences. From this schema, CARGO’s code generator produces a complete C++ implementation that performs field‑wise encoding, block‑wise compression, and indexing. The generated pipeline follows a producer‑consumer model: input I/O, field encoding, compression, and index construction run in parallel across multiple threads. This design enables streaming of terabyte‑scale datasets without requiring the entire file to reside in memory.
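The column-oriented, parallel design described above can be sketched in a few lines of Python. This is a didactic stand-in, not CARGO's generated C++: zlib substitutes for CARGO's pluggable block compressors, and the FASTQ-like record fields (`name`, `seq`, `qual`) are hypothetical illustrations rather than CARGO's actual schema syntax.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Records with three fields, mirroring a FASTQ-like schema (hypothetical).
records = [
    {"name": "read1", "seq": "ACGTACGT", "qual": "IIIIHHHH"},
    {"name": "read2", "seq": "TTGGCCAA", "qual": "IIIIIIII"},
]

def compress_field(field: str) -> tuple[str, bytes]:
    # Gather one field across all records into a column,
    # then compress that column as a single block.
    column = "\n".join(r[field] for r in records).encode()
    return field, zlib.compress(column)

# Fields are compressed in parallel by worker threads, echoing the
# producer-consumer pipeline described above.
with ThreadPoolExecutor() as pool:
    blocks = dict(pool.map(compress_field, ["name", "seq", "qual"]))

# Each field decompresses independently of the others:
seqs = zlib.decompress(blocks["seq"]).decode().split("\n")
```

Grouping values field-by-field rather than record-by-record is what lets each column be handled by an encoder matched to its statistics, and what allows the compression stages to run concurrently.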
The framework supports a variety of compression back‑ends (BSC, LZMA, ZSTD, PPMd) that can be attached as plug‑ins. For each field, CARGO selects an encoding strategy tailored to the field’s statistical properties. For example, DNA bases are converted to a 2‑bit representation, quality scores are delta‑encoded and then ZSTD‑compressed, and repetitive header strings are run‑length encoded before being fed to a block compressor. By combining these field‑level encodings with high‑performance block compressors, CARGO often achieves better compression ratios than specialized tools while maintaining or improving throughput.
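The three per-field encodings mentioned above can be illustrated with simplified Python stand-ins. These sketches only convey the idea of each transform; CARGO's actual generated code and bit-level layouts may differ.

```python
def pack_2bit(seq: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base (4 bases per byte)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    out, byte = bytearray(), 0
    for i, base in enumerate(seq):
        byte = (byte << 2) | code[base]
        if i % 4 == 3:          # byte is full: flush it
            out.append(byte)
            byte = 0
    rem = len(seq) % 4
    if rem:                     # pad the trailing partial byte with zeros
        out.append(byte << (2 * (4 - rem)))
    return bytes(out)

def delta_encode(qualities: list[int]) -> list[int]:
    """Replace each quality score with its difference from the previous one."""
    return [qualities[0]] + [b - a for a, b in zip(qualities, qualities[1:])] if qualities else []

def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse runs of repeated characters into (char, count) pairs."""
    runs: list[tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs
```

Each transform reduces entropy in a way the downstream block compressor can exploit: packed bases quarter the input size outright, deltas turn slowly varying quality curves into small near-zero values, and run lengths collapse the repetition typical of read headers.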
The authors evaluated CARGO on three representative datasets: a 100 GB FASTQ collection, a 250 GB SAM file, and a 3 TB in‑house sequencing archive. Compared with DSRC2, fqzcomp, and CRAM, CARGO achieved a 5 %–12 % improvement in compression ratio for FASTQ and matched or slightly exceeded CRAM’s ratio for SAM. In terms of speed, CARGO compressed at an average of 1.6 GB/s and decompressed at 2.1 GB/s on a 16‑core machine, representing a 30 %–45 % speedup over DSRC2. Importantly, the streaming architecture kept peak memory usage below 8 GB even for the 3 TB dataset, and the total processing time was reduced by roughly one‑half relative to the best existing solutions. The generated index files also enable random access to individual records, facilitating downstream analysis without full decompression.
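The random-access capability rests on a simple principle: record the byte offset of each compressed block, then decompress only the block containing the requested record. The following toy sketch (hypothetical record contents, zlib in place of CARGO's back-ends; the real index format is CARGO's own) shows the idea.

```python
import io
import zlib

RECORDS_PER_BLOCK = 2
records = ["read%d:ACGT" % i for i in range(6)]

# Write compressed blocks, remembering (first_record, offset, length) per block.
archive = io.BytesIO()
index = []
for start in range(0, len(records), RECORDS_PER_BLOCK):
    block = "\n".join(records[start:start + RECORDS_PER_BLOCK]).encode()
    comp = zlib.compress(block)
    index.append((start, archive.tell(), len(comp)))
    archive.write(comp)

def fetch(record_no: int) -> str:
    """Return one record, decompressing only its containing block."""
    for start, offset, length in index:
        if start <= record_no < start + RECORDS_PER_BLOCK:
            archive.seek(offset)
            lines = zlib.decompress(archive.read(length)).decode().split("\n")
            return lines[record_no - start]
    raise IndexError(record_no)
```

Because only one block is ever decompressed per lookup, access cost stays bounded by the block size regardless of how large the archive grows.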
Beyond performance, CARGO’s greatest strength lies in its extensibility. Adding a new data type or modifying an existing schema requires only a change to the declarative description; the underlying C++ code is regenerated automatically, eliminating manual re‑implementation. New compression algorithms can be incorporated as plug‑ins, allowing the framework to stay current with advances in lossless compression. However, the current prototype is C++‑centric, which may pose a barrier for users unfamiliar with compilation toolchains. The authors acknowledge this limitation and propose future work that includes Python and R bindings, a graphical user interface for schema editing, adaptive runtime selection of compression algorithms, and direct integration with cloud object stores such as Amazon S3 and Google Cloud Storage.
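The plug-in mechanism described above amounts to a registry of compressors sharing a common interface, so new algorithms drop in without touching calling code. A minimal Python sketch, using standard-library codecs in place of back-ends such as BSC or PPMd:

```python
import bz2
import lzma
import zlib

# Registry of back-ends; each entry exposes the same (compress, decompress) pair.
BACKENDS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2":  (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

def register_backend(name, compress, decompress):
    """Plug in a new compression algorithm under the given name."""
    BACKENDS[name] = (compress, decompress)

def roundtrip(name: str, data: bytes) -> bytes:
    """Compress then decompress data with the named back-end."""
    compress, decompress = BACKENDS[name]
    return decompress(compress(data))
```

Any callable pair with this signature can be registered, which is what keeps the framework current as new lossless compressors appear.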
In summary, CARGO demonstrates that a format‑agnostic, schema‑driven approach can deliver compression efficiency comparable to, and sometimes surpassing, format‑specific compressors while offering unprecedented flexibility and scalability. By automating the generation of optimized compression pipelines, CARGO reduces engineering effort, promotes reproducible data archiving, and positions itself as a potential new standard for managing the ever‑growing volumes of genomic information.