Hecate: A Modular Genomic Compressor

Kamila Szewczyk
Algorithmic Bioinformatics, Saarland University, Saarbrücken, Germany
Center for Bioinformatics, Saarland Informatics Campus, Germany

Sven Rahmann
Algorithmic Bioinformatics, Saarland University, Saarbrücken, Germany
Center for Bioinformatics, Saarland Informatics Campus, Germany

Abstract

We present hecate, a modular lossless genomic compression framework. It is designed around uncommon but practical source-coding choices. Unlike many single-method compressors, hecate treats compression as a conditional coding problem over coupled FASTA/FASTQ streams (control, headers, nucleotides, case, quality, extras). It uses per-stream codecs under a shared indexed block container. Codecs include alphabet-aware packing with an explicit side channel for out-of-alphabet residues, an auxiliary-index Burrows-Wheeler pipeline with custom arithmetic coding, and a blockwise Markov mixture coder with explicit model-competition signaling. This architecture yields high throughput, exact random-access slicing, and a referential mode through streamwise binary differencing. In a comprehensive benchmark suite, hecate provides the best compression vs. speed trade-offs against state-of-the-art established tools (MFCompress, NAF, bzip3, AGC), with notably stronger behaviour on large genomes and high-similarity referential settings. For the same compression ratio, hecate is 2 to 10 times faster. When given the same time budget as other algorithms, hecate achieves 5% to 10% better compression.
2012 ACM Subject Classification: Theory of computation → Data structures design and analysis; Applied computing → Bioinformatics

Keywords and phrases: data compression, lossless compression, genome compression, source coding, Burrows-Wheeler transform, arithmetic coding, Markov models, referential compression

1 Introduction

The year 1990 saw the launch of the International Human Genome Project, foreshadowing the start of the era of high-throughput sequencing. Three years later, the first specialised genomic compression tool, Biocompress [Grumbach and Tahi, 1993], was developed. More recently, the Sequence Read Archive (SRA) recorded the acquisition of 12 petabases of new public data in the year 2024 [Shiryev and Agarwala, 2024]. By 2026, a persistent systems gap has become apparent: Sequencing throughput has accelerated faster than practical storage and transport infrastructure.

This gap is not explained solely by missing algorithms. Many researchers and practitioners still rely on universal compression methods (pigz [Adler, 2023] and lbzip2 [Izdebski, 2015] provide parallel implementations of the popular gzip and bzip2 codecs) or even uncompressed formats. In many production pipelines, the bottleneck is primarily architectural: tools are either monolithic (one model class everywhere) or operationally awkward (weak random access, brittle indexing, limited integration).

C. G. Nevill-Manning and I. H. Witten demonstrated in their 1999 paper "Protein is incompressible" [Nevill-Manning and Witten, 1999] that increasing context (i.e., Markov model order) for predicting the next amino acid yields sharply diminishing returns. Their result does not preclude advancements in one-shot compression of single genomic sequences, as such sequences frequently contain substantial amounts of non-coding regions. The paper does however place rigid and disappointing bounds on the effectiveness of reference-free techniques.
This result shifts our focus towards referential compression. Tools such as HRCM [Yao et al., 2019] or AGC [Deorowicz et al., 2023] have demonstrated impressive compression ratios on highly similar genomic sequences, such as large collections of individual human genomes. According to Auton et al. [2015], the typical difference between an individual's genome and a reference human genome was estimated at around 20 million base pairs, or 0.6% of the total size. Consequently, Christley [2009] provides an amusing example of referential compression that, under the assumption that the recipient has access to an agreed-upon reference genome, allows for transmitting a person's genome in the size of an e-mail attachment.

We briefly position hecate against notable genomic compressors. DELIMINATE [Mohammed et al., 2012] uses two-symbol position coding followed by external compression. MFCompress [Pinho and Pratas, 2014] and related Markov tools focus on fixed-order context models. NAF [Kryukov et al., 2019] prioritizes fast decoding through fixed 4-bit packing plus zstd. JARVIS3 [Sousa et al., 2024] (linear mixing of repeat and Markov models to provide probabilities to an arithmetic coder) and GeCo3 [Silva et al., 2020] (neural network mixing of {direct, hashed, substitution-tolerant}-Markov models to compress genomic sequences) employ richer probabilistic mixtures. AGC [Deorowicz et al., 2023] is strong for assembled-genome collections with fast queries.

Our goal differs in scope: We engineer one modular genomic container framework that supports both high-performance reference-free operation and a strong referential mode. Moreover, we focus on flexibility at the systems level with stream-level method assignment, index tiers, block-local validation, and efficient slicing.
As a result, on typical FASTQ files, which mix different streams of information, such as headers, sequences and quality information, hecate obtains either better compression ratios (fewer bits per byte) or faster compression speeds than existing tools.

In more detail, our main contributions are as follows. First, as mentioned, we introduce a modular genomic container with stream-factorized coding, indexed random access, and integrity semantics aligned with block independence. Second, we provide a new highly engineered codec (hecate-bwt) based on the Burrows-Wheeler transform (BWT). It is a large-block BWT pipeline with auxiliary indices, dynamic 32/64-bit suffix array paths, and a sophisticated post-BWT encoding stage. Third, we introduce another engineered codec (markov-mix) based on Markov mixtures and arithmetic coding. It is an explicit blockwise model-competition codec with heterogeneous update asymmetry and reverse-context coupling in deep models. Fourth, we contribute a reference-based compression mode for medium to large genomes that performs streamwise binary differencing separately over semantic channels, preserving codec modularity after patch reconstruction.

2 Background

2.1 Burrows-Wheeler Transform and LF Mapping

Given a block $T[0..n-1]$, let $SA$ be the suffix array of $T$. The BWT output $L$ is defined by

$$L[i] = T[(SA[i] - 1) \bmod n],$$

which is equivalently the last column of the lexicographically sorted rotation table. Two equivalent conventions are common: either append a unique end marker (so the original row is identified by that marker), or omit an explicit marker and store a primary index $p$ such that $SA[p] = 0$. The transform is reversible and tends to create locally run-rich output, which is easier to entropy-code well.

Inversion is driven by the LF mapping.
Let $C[a]$ be the number of symbols in $L$ that are lexicographically smaller than symbol $a$, and let $\mathrm{Occ}(a, i)$ be the number of occurrences of $a$ in $L[0..i-1]$. Then

$$\mathrm{LF}(i) = C[L[i]] + \mathrm{Occ}(L[i], i).$$

Starting from the identified original row (via marker or primary index) and iterating LF reconstructs the original text in reverse order. This LF machinery is the core of the FM index [Ferragina and Manzini, 2000]. In practice, long blocks can additionally store sparse restart anchors so inversion does not need to follow LF from a single start point over long distances.

2.2 Prediction by Partial Matching (PPM)

Prediction by Partial Matching (PPM) [Cleary and Witten, 1984] models the next symbol using a hierarchy of context lengths. For each position, it first queries the longest available context (highest order). If the symbol has not been observed there, the encoder emits an escape and backs off to shorter contexts until a distribution is found (or a lowest-order base model is reached). The selected distribution is then encoded with an entropy coder, typically arithmetic or range coding. This per-symbol backoff makes PPM robust on heterogeneous data, but frequent escapes add coding overhead and can weaken very deep sparse contexts.

3 Methods: Reference-Free Compression

The hecate framework provides three main codec families for reference-free compression, chosen to cover distinct regions of the ratio/speed frontier: Lempel-Ziv (via zstandard [Facebook, 2023]), Burrows-Wheeler Transform (hecate-bwt), and blockwise Markov mixture coding (markov-mix). The rationale follows well-established empirical knowledge [Mahoney, 2005]: Lempel-Ziv methods decompress fast at the expense of ratio, BWT-based methods offer a strong middle ground, and statistical models push ratio further at higher computational cost.
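Since hecate-bwt builds directly on the transform reviewed in Section 2.1, a toy illustration of the primary-index convention and of LF-driven inversion may help fix notation before the codec details. The sketch below is not hecate's implementation: it sorts rotations naively (quadratic memory) instead of building a suffix array, expects a non-empty string, and the function names `bwt` and `inverse_bwt` are chosen here for illustration only.

```python
def bwt(text: str) -> tuple[str, int]:
    """Return (L, p): last column of the sorted rotation table and the primary index.

    Naive illustration for non-empty text; real codecs use suffix sorting instead.
    """
    n = len(text)
    # Sort rotation start positions by the rotation each one begins.
    rows = sorted(range(n), key=lambda i: text[i:] + text[:i])
    last = "".join(text[(i - 1) % n] for i in rows)
    return last, rows.index(0)  # row holding the unrotated text


def inverse_bwt(last: str, primary: int) -> str:
    """Invert via LF(i) = C[L[i]] + Occ(L[i], i), starting from the original row."""
    # C[a]: number of symbols in L that are strictly smaller than a.
    counts: dict[str, int] = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    c_table: dict[str, int] = {}
    total = 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # Occ(L[i], i): occurrences of L[i] in L[0..i-1], gathered in one pass.
    seen: dict[str, int] = {}
    lf = []
    for ch in last:
        lf.append(c_table[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # Iterating LF emits the text back to front.
    out = []
    i = primary
    for _ in range(len(last)):
        out.append(last[i])
        i = lf[i]
    return "".join(reversed(out))
```

For example, `bwt("banana")` gives the last column `"nnbaaa"` with primary index 3, and `inverse_bwt("nnbaaa", 3)` recovers `"banana"`.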
Each codec operates on semantically factored streams under a shared container, so the choice of backend can be made per-stream without changing the file format.

3.1 Container and Preprocessing

At encode time, hecate factors a FASTA/FASTQ input into semantic streams: control (CTRL), headers (HDR), nucleotide payload (NUC), case (CASE), quality (QUALITY), and an extra (non-IUPAC) channel (EXTRA). The container then allows for assignment of each stream to a codec independently: We suggest BWT or Markov models for nucleotides, zstd or raw coding for low-entropy control and header streams, and dedicated coders (for example the built-in rans-o1) for quality. This factorization reduces the modeling problem from one heterogeneous joint process into several narrower conditional processes with codec-specific inductive bias.

As a preprocessing step, the nucleotide stream is packed into a 2-bit or 4-bit representation, yielding a 75% or 50% size reduction before any entropy coding. Out-of-alphabet characters are handled exactly through the EXTRA side channel: Each non-IUPAC character emits a delta-coded position and one extra byte. For packing width $k \in \{2, 4\}$ bits, with $e$ out-of-alphabet symbols among $n$, this is beneficial whenever $e/n < (8 - k)/40$, which corresponds to a side-channel cost of 40 bits per exception against a saving of $8 - k$ bits per symbol. Concretely, the fraction must stay below 15% for 2-bit and below 10% for 4-bit packing. Real genomic data falls well below these thresholds. The packing stage runs at approximately 500 MB/s in encoding and 3–10 GB/s in decoding (single core, depending on output format and memory-mapping availability), so it never dominates end-to-end cost.

3.2 hecate-bwt

The hecate-bwt codec is a BWT-based codec designed to overcome specific limitations of existing block-sorting compressors.
The older and well-established bzip2 uses small blocks (900 KB) paired with a relatively weak post-BWT stage (imprecise symbol ranking, run-length coding, Huffman coding), which limits compression ratio on large genomes. The newer bzip3 [Szewczyk, 2022] supports larger blocks (up to 512 MB) and a better context model (partially re-used in this work, originating from Ilya Muravyev's bcm algorithm), but carries always-enabled filtering stages that hurt genomic compression, limited parallelism within BWT construction, slow serial decoding, and high memory usage. The bsc compressor [Grebnov, 2009] achieves strong ratios via a fast post-BWT coder, but is considerably more complex and typically compresses slightly worse than bzip3. Finally, the bbb compressor [Mahoney, 2006] resolves the small-block and weak-coder limitations of bzip2, but at the cost of very slow encoding and decoding.

3.2.0.1 BWT construction and auxiliary-index inversion.

hecate-bwt supports block sizes practically limited only by available memory. BWT construction and inversion use the parallel suffix sorting algorithm libsais [Grebnov, 2021], which dispatches to 32-bit or 64-bit suffix array paths from the actual block size. For small blocks ($n < 32$ KiB), a classic primary index $p$ suffices for inversion. For larger blocks, we use auxiliary-index BWT with stride

$$r = 2^{\lfloor \log_2 \max(1, \lfloor n/8 \rfloor) \rfloor},$$

and serialize a fixed 256-entry auxiliary table $A$ [Ohlebusch et al., 2014]. These entries act as deterministic restart scaffolds during inversion: rather than following the LF mapping from a single primary index $p$, inversion can resume from precomputed anchors, bounding the restart distance by $O(r)$. This gives practical inversion locality while keeping the auxiliary payload fixed at 256 words regardless of block size.
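The stride formula is easy to evaluate with integer arithmetic. The following minimal sketch computes $r$ for a given block size and, under the assumption (made here for illustration, not stated in the text) that one anchor is kept every $r$ positions, counts how many anchors that implies. The helper names are hypothetical.

```python
def aux_stride(n: int) -> int:
    """r = 2^floor(log2(max(1, floor(n / 8)))): largest power of two <= max(1, n // 8)."""
    m = max(1, n // 8)
    return 1 << (m.bit_length() - 1)


def anchor_count(n: int) -> int:
    """Anchors implied by one anchor per r positions (an assumption of this sketch)."""
    r = aux_stride(n)
    return (n + r - 1) // r  # ceil(n / r)
```

Under that assumption, any block with $n \ge 8$ needs between 8 and 16 anchors (a 1 MiB block gives $r = 2^{17}$ and 8 anchors), which would fit comfortably in a fixed 256-entry table.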
For a block of $n$ bytes at SA width $w \in \{32, 64\}$ bits, the structural metadata overhead is

$$L_{\text{meta}}(n, w) = 8 + 16 + 256w + 64 + 64 \lceil n/C \rceil \ \text{bits},$$

where $C = 2^{24}$ is the chunk size for independent arithmetic-coded chunks. The overhead rate $L_{\text{meta}}(n, w)/n$ is negligible for practical block sizes.

3.2.0.2 Bitwise probability model.

BWT-based compressors differ considerably in performance depending on how they encode the result of the BWT. We present an encoding based on the arithmetic coding algorithm tailored to genomic data. Each BWT output byte is coded MSB-first through a depth-8 binary context tree with a leading one bit (contexts $c \in \{1, \ldots, 256\}$). We perform this through the use of a predictor that estimates the probability of the next bit being 1 given the context.

Let $u^{(0)}_{c}, u^{(1)}_{p,c}, u^{(2)}_{q,c} \in [0, 65535]$ be the order-0 and two order-1 counters (with $p$ being the previous byte and $q$ the byte before that). Then, let $t$ be the current bit-wise position in the BWT, and consequently let $b_t \in \{0, 1\}$, $c_t$, $p_t$ and $q_t$ be the current bit, context, previous byte and byte before that, respectively. The base predictor blends these as

$$\hat{\Pr}_t = \frac{6 \left( u^{(0)}_{c_t} + u^{(1)}_{p_t, c_t} \right) + 4\, u^{(2)}_{q_t, c_t}}{16}.$$

This is a fixed-weight mixture with effective contribution ratios 3 : 3 : 2 for the three counter families, reflecting that direct and one-symbol-back contexts carry roughly equal predictive weight on BWT output, while the gapped context $u^{(2)}$ (which skips the directly preceding byte and shares its table with $u^{(1)}$) contributes a stabilizing third opinion.

Run state is modeled explicitly: $f_t = \mathbb{1}[\text{run length} > 2]$. A run-conditioned secondary symbol estimation (SSE) table $s_{(2c_t + f_t),\, j}$ with $j \in \{0, \ldots, 16\}$ is linearly interpolated at the quantized base prediction:

$$j_t = \left\lfloor \hat{\Pr}_t / 2^{12} \right\rfloor, \qquad \lambda_t = \left( \hat{\Pr}_t \bmod 2^{12} \right) / 2^{12}, \qquad \hat{s}_t = (1 - \lambda_t)\, s_{2c_t + f_t,\, j_t} + \lambda_t\, s_{2c_t + f_t,\, j_t + 1}.$$

The final arithmetic split probability is $q_t = (\hat{\Pr}_t + \hat{s}_t) / 2^{17}$. Counters are initialized at $2^{15}$ (unbiased midpoint); SSE entries are seeded as an ordered grid $s_{c,j} \approx j \cdot 2^{12}$.

3.2.0.3 Counter dynamics.

The three counter families use asymmetric exponential-move updates:

$$U_\tau(v, b) = \begin{cases} v - \lfloor v / 2^\tau \rfloor, & b = 0, \\ v + \lfloor (65535 - v) / 2^\tau \rfloor, & b = 1. \end{cases}$$

We set $\tau = 3$ for $u^{(0)}$, $\tau = 5$ for $u^{(1)}$, and $\tau = 7$ for SSE entries. Each family thus acts as an exponential moving estimator with half-life $t_{1/2}(\tau) \approx (\ln 2)\, 2^\tau$: roughly 6, 22, and 89 symbols respectively. This three-rate separation is deliberate. Under long BWT runs (characteristic of repetitive genomic regions), the fast lane $u^{(0)}$ tracks local statistics, the medium lane $u^{(1)}$ provides context-conditioned stability, and the slow SSE lane resists transient fluctuations. The blend therefore approximates a multiscale estimator without explicit mixture weights.

The choice of fixed-step exponential moving average is dictated by both performance and the specifics of the Burrows-Wheeler Transform output distribution. A practical alternative would be the Krichevsky-Trofimov estimator [Krichevsky and Trofimov, 1981], which has a strong theoretical pedigree as a minimax optimal adaptive estimator for Bernoulli processes. However, its implementation issues a costly idiv instruction per update, which is prohibitive at the billions-of-symbols scale of BWT output. The fixed-step update, by contrast, can be implemented with fast bit shifts and additions, with better cache behaviour (512 × u16 vs 512 × 2 × u32).
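The update rule $U_\tau$ and the quoted half-lives can be checked with a few lines of integer arithmetic. The sketch below is only an illustration of the formula as written in the text, not hecate's source; `steps_to_halve_gap` is a hypothetical helper that counts how many consecutive 1-bits are needed for a counter starting at 0 to close half of its distance to 65535.

```python
import math

MAX16 = 65535  # counters live in [0, 65535]


def update(v: int, bit: int, tau: int) -> int:
    """Fixed-step update U_tau(v, b): move v a 2^-tau fraction towards 0 or 65535."""
    if bit:
        return v + ((MAX16 - v) >> tau)
    return v - (v >> tau)


def half_life(tau: int) -> float:
    """Continuous approximation t_1/2 ~ ln(2) * 2^tau, in updates."""
    return math.log(2) * (1 << tau)


def steps_to_halve_gap(tau: int) -> int:
    """Count 1-bit updates until the distance to 65535 is at most half the initial one."""
    v, steps = 0, 0
    while 2 * (MAX16 - v) > MAX16:
        v = update(v, 1, tau)
        steps += 1
    return steps
```

With these definitions, `steps_to_halve_gap` returns 6, 22 and 89 for $\tau = 3, 5, 7$, matching the rounded values of $(\ln 2)\, 2^\tau \approx 5.5, 22.2, 88.7$ quoted above.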
Simple and well-known quantitative analysis for stationary sources shows that the effective sample size of the fixed-step estimator is $n_{\text{eff}} \sim 2^\tau$ and

$$\mathrm{Var}(p_\infty) = \frac{2^{-\tau}}{2 - 2^{-\tau}}\, \theta (1 - \theta) = \frac{\theta (1 - \theta)}{2^{\tau + 1} - 1}.$$

Notably, the variance does not depend on the total number of samples. Since the variance does not vanish, excess log-loss remains and the total regret grows linearly in data size. This contrasts with the Krichevsky-Trofimov estimator, which minimises worst-case regret, converges with $E[\hat{\theta}] \to \theta$ and $E[\hat{\theta}] - \theta = \frac{1 - 2\theta}{2(N + 1)}$ (thus bias decreasing as $O(1/N)$), and achieves per-symbol redundancy of $O(\log N / N)$.

That said, the BWT output distribution is far from stationary. Even if the symbol streams have a strong local regularity, the predictors receive a binarized view of the data that contains many strong, structured non-stationarities: the distribution of the next bit heavily depends on whether we are in a run, the prefix and the current magnitude bucket. Further, probabilities can swing quickly and dramatically, especially at run boundaries.

Table 1. Feature comparison: PPM vs. markov-mix.

| Feature | PPM | markov-mix |
|---|---|---|
| Models | n models with fixed orders 0 to n − 1 | Curated set of heterogeneous orders and storage modes |
| Model selection | Per-symbol basis using context tree and escape mechanism | Per-block basis using estimated coding cost |
| Model updates | All models updated after each symbol | Winner receives full update; others receive asymmetric updates |
| Data structure | Trie or suffix array with pointer-heavy traversal | Flat arrays with branch-light, SIMD-friendly updates |
| High-order decisivity | Escapes dampen deep-order confidence | Deep context chosen when locally optimal; no escape overhead |
This motivates the use of ensembles of fixed-step EMA (sometimes called Vitter counters in the context of statistical data compression) with different $\tau$ parameters, which can be seen as a simple and efficient way to approximate a multiscale estimator that can adapt to both fast and slow distributional changes without the overhead of a full Krichevsky-Trofimov update.

3.2.0.4 Parallel chunked coding.

The BWT output is partitioned into $\lceil n/C \rceil$ independent chunks of size $C = 2^{24}$ bytes. Arithmetic coding across chunks is embarrassingly parallel once the BWT is produced: chunk headers record per-chunk compressed sizes, enabling direct random-access decoding. Since $\lceil n/C \rceil$ is small for practical blocks, scheduling overhead stays subdominant and parallel efficiency is controlled primarily by chunk granularity and memory locality.

3.3 markov-mix

markov-mix is a statistical codec that treats each block of nucleotides as a local expert-selection problem. In contrast to per-symbol PPM backoff (Section 2), markov-mix selects the best-performing expert per block of 80 symbols, then codes the expert index explicitly. Tables 1 and 2 summarize the structural differences.

3.3.0.1 Expert family.

The codec maintains five experts $\mathcal{M} = \{m_0, \ldots, m_4\}$, each parameterized by a tuple $(k, \alpha, \rho, \rho_{rc}, c_{\max})$:

$$(3, 0, 0, 0, 65535), \quad (7, 0, 0, 0, 1023), \quad (11, 2, 0, 1, 255), \quad (15, 6, 1, 1, 15), \quad (13, 9, 1, 0, 0).$$

Here $k$ is the context order, $\alpha$ controls count scaling, $\rho$ controls whether non-selected experts perform full count updates ($\rho = 1$) or context-only updates ($\rho = 0$), $\rho_{rc}$ enables reverse-complement coupling, and $c_{\max}$ is the counter saturation threshold. Storage mode depends on context-space size: 16-bit counters for small spaces ($k = 3, 7$), 8-bit for medium/deep spaces ($k = 11, 13$), and packed 4-bit nibbles for very deep order ($k = 15$).
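The storage modes above determine the table sizes directly. As a back-of-the-envelope tally, the sketch below assumes one counter per (context, symbol) pair, i.e. $4^k$ contexts with four counters each, and ignores any auxiliary state such as the selector model; both are assumptions of this illustration rather than statements about hecate's memory layout.

```python
# (context order k, bits per counter), following the storage modes listed in the text
EXPERT_STORAGE = [(3, 16), (7, 16), (11, 8), (15, 4), (13, 8)]
SYMBOLS = 4  # A, C, G, T


def table_bytes(k: int, bits: int) -> int:
    """Bytes for 4^k contexts with one counter of `bits` bits per symbol."""
    return (4 ** k) * SYMBOLS * bits // 8


def total_state_bytes(experts=EXPERT_STORAGE) -> int:
    """Sum of all expert tables under the one-counter-per-symbol assumption."""
    return sum(table_bytes(k, bits) for k, bits in experts)
```

Under these assumptions the order-15 table alone takes $2^{31}$ bytes (2 GiB), the order-13 table 256 MiB, and the whole family about $2.43 \times 10^9$ bytes.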
This yields a total model state of approximately 2.45 GB, dominated by the order-15 table at 2 GB.

Table 2. Feature comparison: MFCompress vs. markov-mix.

| Feature | MFCompress | markov-mix |
|---|---|---|
| Estimator | $(\alpha + \beta \cdot \#(s, c)) / (|\Sigma| \cdot \#(c))$; floating-point | $2^{\gamma} \cdot \#(s, c) / (|\Sigma| \cdot \#(c))$; fixed-point |
| Model selection | Order-3 model with $\alpha = 1$, $\beta = 50$ | Order-5 model with $\gamma = 0$ |
| Coding backend | Arithmetic coding (CACM with rescaling) | Range coding |
| Context handling | Re-scanning of past data given a pointer | In-register, branchless rolling context updates |
| IUPAC/unknown path | Interleaved in main stream | Binary unknown gate + dedicated unknown-symbol model |
| Processing | Parallel (multi-file) | Serial (multi-core via block-level parallelism) |

3.3.0.2 Per-context prediction.

For expert $m$, context $h_t$, and symbol $s \in \{0, 1, 2, 3\}$, counts $c_{m,h_t,s}$ induce frequencies

$$f_{m,h_t,s} = 1 + 2^{\alpha_m} c_{m,h_t,s}, \qquad P_m(s \mid h_t) = \frac{f_{m,h_t,s}}{\sum_{a=0}^{3} f_{m,h_t,a}}.$$

The count scaling factor $2^{\alpha_m}$ amplifies the influence of observed counts in deep models, where context occupancy is sparse but observations are highly informative. Forward contexts evolve as a base-4 shift register: $h_{t+1} = ((h_t \bmod 4^{k-1}) \ll 2) + x_t$. When reverse-complement coupling is enabled ($\rho_{rc} = 1$), a paired reverse context is updated simultaneously, which helps on strand-symmetric motifs.

3.3.0.3 Expert selection.

Each expert is scored on a block $B = (x_1, \ldots, x_{80})$ using a log-table surrogate of cross-entropy:

$$\tilde{C}_m(B) = \sum_{t=1}^{80} \left( \ln F_{m,h_t} - \ln f_{m,h_t,x_t} \right),$$

where $F_{m,h_t} = \sum_{a=0}^{3} f_{m,h_t,a}$ is the total frequency in context $h_t$. The score is computed in scaled fixed-point nats ($\times 10^6$). The chosen expert is $m^* = \arg\min_{m \in \mathcal{M}} \tilde{C}_m(B)$, and $m^*$ itself is range-coded by an order-5 auxiliary model over previous expert choices. In the worst case, the selector contributes $\log_2 5 \approx 2.32$ bits per block, but empirical selector entropy is much lower due to local persistence of expert identity: genomic neighborhoods tend to stay within a single compositional regime across many consecutive blocks.

3.3.0.4 Asymmetric updates.

The selected expert receives full symbol-count updates and context advance. Non-selected experts receive either context-only updates ($\rho = 0$) or full updates ($\rho = 1$), with reverse-complement context handling controlled by $\rho_{rc}$. This asymmetry is a core design choice: deep experts maintain phase with local context drift even when not selected, while selective count updates control overhead. On counter saturation ($c_{m,h,s} = c_{\max} > 0$), all four symbol counters in that context are halved before incrementing the winning symbol; for $c_{\max} = 0$, counters are reset before increment.

3.3.0.5 4-bit unknown-path factorization.

In 4-bit mode, the codec explicitly factorizes the coding distribution for each block as

$$P(B) = P(z_B)\, P(m^*_B) \prod_{t=1}^{80} P(u_t)\, P(x_t \mid u_t = 0, m^*_B)^{\mathbb{1}[u_t = 0]} \cdot P(v_t \mid u_t = 1)^{\mathbb{1}[u_t = 1]},$$

where $z_B$ is a block-level flag indicating the presence of non-ACGT symbols, $u_t$ is a per-symbol ACGT-vs-unknown gate, and $v_t \in \{0, \ldots, 11\}$ is the unknown-symbol index. This prevents ambiguity codes from contaminating ACGT expert counts while preserving exact 4-bit reversibility.

3.4 zstandard

zstandard [Facebook, 2023] serves as the high-throughput backend for streams where statistical models offer diminishing returns: headers, control bytes, and cases where fast decompression matters more than marginal ratio gains. The codec is applied after 2-bit or 4-bit packing, which gives hecate an advantage over NAF [Kryukov et al., 2019] (which uses exclusively 4-bit packing with a fixed alphabet).
Coupled with packing, zstandard's decompression speeds (of order one nanosecond per byte) are not diminished.

4 Referential Compression

hecate supports referential compression of assembled genomes by operating over semantic streams rather than raw FASTA text. Each stream is differenced against the homologous reference stream using hdiff [Sisong, 2013], which identifies exact-match segments via suffix array lookup with a Bloom filter to skip unnecessary $O(\log n)$ searches. The output is a series of copy descriptors and unmatched literal segments. For the nucleotide stream, packed bytes are first expanded to symbol-domain vectors before differencing and re-packed after patching, to preserve exact packed semantics. The unmatched segments are then compressed by any of the codecs above, typically zstandard or hecate-bwt.

Under a substitution-dominant approximation with mismatch rate $p$ over alphabet $\Sigma$, the conditional entropy of target given reference obeys

$$H(T \mid R) \approx |T| \left( h_2(p) + p \log_2(|\Sigma| - 1) \right),$$

where $h_2$ denotes the binary entropy function. This is the information-theoretic reason referential gains increase superlinearly as similarity grows: for highly similar assemblies (e.g., individual human genomes against GRCh38, with $p \approx 0.006$ [Auton et al., 2015]), the gap between $H(T)$ and $H(T \mid R)$ is large enough that even non-trivial patch metadata is amortized. When the $e$ edits cluster into $r$ literal runs, descriptor cost scales as $O(r \log n)$ rather than $O(e \log n)$, a further structural advantage for assembled genomes where mismatches are sparse and localized.

5 Experimental Evaluation

5.1 Methodology

All benchmarks were run on a Zen 3 (Ryzen 9 5950X) CPU with 128 gigabytes of 3200 MT/s DDR4 RAM, performed entirely within memory via a RAM disk, on a quiet temperature-controlled dedicated system. More details can be found in the supplementary material. hecate was compiled with cargo build --release.
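As a rough numerical orientation for the referential results reported later, the substitution-only estimate from Section 4 can be evaluated per symbol. The sketch below simply implements $h_2(p) + p \log_2(|\Sigma| - 1)$; the function names are chosen here for illustration, and the model deliberately ignores indels and patch metadata.

```python
import math


def binary_entropy(p: float) -> float:
    """h_2(p) in bits, with the usual convention h_2(0) = h_2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def conditional_bits_per_symbol(p: float, alphabet_size: int = 4) -> float:
    """Substitution-only estimate of H(T | R) / |T| for mismatch rate p."""
    return binary_entropy(p) + p * math.log2(alphabet_size - 1)
```

For $p = 0.006$ and a four-letter alphabet this evaluates to roughly 0.06 bits per base, compared with 2 bits per base for plain 2-bit packing; at $p = 0.75$ (no information in the reference) it reaches exactly 2 bits.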
We report three metrics: compression ratio in bits per byte ($R_{\text{bpb}} = 8S/N$, with $S$ the compressed size and $N$ the original size in bytes; lower is better), encoding time in nanoseconds per byte ($t_{\text{enc}}$, lower is better), and decoding time ($t_{\text{dec}}$, lower is better), all using wall-clock time averaged over ten runs.

We use wall-clock time deliberately. Data compression is commonly assumed to be memory-bound, but as a mostly serial workload¹ that induces unavoidable branch and cache misses², it is at least equally reliant on clock speed and pipeline efficiency. Reporting wall time rewards implementations that make effective use of available hardware, rather than penalizing parallel codecs for synchronization overhead.

5.2 Reference-Free Benchmark Results

We evaluate on seven genomes spanning bacterial through mammalian to large plant scales:

- GCF_000008865.2: Escherichia coli O157:H7 str. Sakai; 5.6 Mb.
- GCA_004837865.1: Musa balbisiana isolate DH-PKW; 492.8 Mb.
- GCA_021556685.1: Rattus norvegicus s. SHRSP/BbbUtx; 2.9 Gb.
- GCF_000001405.40: GRCh38.p14; 3.3 Gb.
- GCF_000006565.2: Toxoplasma gondii TGA4; 63.7 Mb.
- GCA_000404065.3: Pinus taeda v2.0; 22.5 Gb.
- GCF_009914755.1: T2T-CHM13v2.0; 3.1 Gb.

E. coli serves as a benchmark for small genomes with limited repetitive structure. Musa balbisiana is a moderately sized plant genome with high repeat content. The three mammalian genomes stress large-block behaviour, memory locality, and high-order context stability. Finally, Pinus taeda is a large plant genome with extreme repeat content, which tests the limits of BWT-based methods and the benefits of large blocks.

We exclude codecs that do not reach 2 bpb (e.g., pigz, lbzip2). Readers are referred to the supplementary material for complete results on all codecs and samples. We were unable to include GeCo3 in our visualisations due to its slow encode/decode reciprocal throughput (on the tested samples) of 1300–2900 ns/B.
Figure 1 shows selected representative benchmark results. Unlike other comparisons, we do not normalise the input data by removing linebreaks, headers and case information: they are preserved exactly as in the original NCBI FASTA files.

The pattern across the datasets is consistent. hecate-bwt achieves excellent compression ratios (1.58–1.60 bpb on CHM13; 1.60–1.61 on R. norvegicus) at a highly competitive throughput. markov-mix reaches the best ratios in our comparison (1.56 bpb on CHM13) but at higher computational cost. Compared to MFCompress at similar ratio, markov-mix is 2–4× faster to encode and 2–3× faster to decode. hecate-zstd matches NAF's operating region while benefiting from 2-bit packing. The overall picture: hecate's codecs dominate or tighten the Pareto frontier across the ratio/speed plane, with the strongest gains on large genomes where large-block BWT and high-order Markov models have the most room to exploit long-range structure.

¹ Decoders typically possess only causal (past) context, limiting the degree of parallelism for processing incoming data.
² Predictable branches and loads in a noiseless coder would imply redundancy in the compressed stream, which is paradoxical.

5.3 Referential Benchmark Results

We consider GRCh38 as the reference genome for human data, testing against two assemblies from the HPRC:

- GCA_044167135.1_HG01167_hap1_hprc_f2
- GCA_042077855.1_HG00133_hap2_hprc_f2

The dataset totals 9.2 GB. Table 3 summarizes the results.

Table 3. Referential compression of two human assemblies against GRCh38.

| Method | Compressed size | Encode (wall) | Decode (wall) |
|---|---|---|---|
| hecate (-EMbwt:100) | 35.6 MB | 3:19 | 10.9 s |
| AGC (v3.2.0) | 828.5 MB | 1:00 | 6.2 s |

hecate achieves a 23× better compression ratio than AGC at comparable aggregate CPU time (530 s user vs. 550 s).
AGC is purpose-built for fast queries on large genome collections and is not optimized for the pairwise referential scenario tested here. The comparison is nonetheless informative: it demonstrates that stream-level differencing followed by strong per-stream coding can substantially outperform collection-oriented tools on individual assembly pairs.

6 Discussion

Two limitations are explicit in the current design. First, the strongest markov-mix configuration requires approximately 2.45 GB of model state. This is the familiar context-capacity trade-off: Larger context spaces reduce approximation error in heterogeneous regions but increase cache pressure and resident footprint. A natural next step is hierarchical state compression (e.g., sparse or shared count slabs) that preserves effective context depth while reducing memory. Second, referential mode currently materializes full stream payloads before patch/diff, rather than operating in a streaming fashion. A streaming referential pipeline with block-aligned dependency cuts could reduce memory and improve partial-decoding ergonomics, but requires careful dependency scheduling to avoid ratio regressions from over-constrained patch segmentation. Both are deliberate first-generation trade-offs: Algorithmic behaviour stays explicit, deterministic, and reproducible.

More broadly, the results support a systems-level claim: For genomic data, compression quality depends as much on architecture as on estimator sophistication. Container semantics, per-stream method assignment, and decode/index constraints affect the achievable practical frontier. The design of hecate makes these engineering choices explicit and independently tunable, which is likely why improvements persist across genome scales and similarity regimes.

7 Conclusion

The hecate framework demonstrates that practical genomic compression improves when codec theory and systems architecture are co-designed.
Auxiliary-index BWT with custom multi-scale mixing, explicit blockwise Markov expert competition, and semantic-stream referential differencing jointly produce a stronger ratio/speed frontier than monolithic pipelines. In reference-free mode, hecate is 2 to 10 times faster than prior tools at the same compression ratio. Under a fixed time budget, it achieves 5% to 10% better compression. In referential mode, streamwise differencing yields 23 times better compression than AGC on individual assembly pairs.

The central design principle is decomposition: Stream factorization trades monolithic model mismatch for bounded side channels. Auxiliary-index BWT provides large-block transform gains with controlled inversion metadata. The markov-mix codec approximates per-block oracle expert choice with bounded selector overhead. Our referential mode targets H(T | R)-regime operating points on homologous collections. This decomposition links algorithmic choices to concrete systems constraints; this is why the improvements we achieve are not attributable to any single trick, but to their coordination.

The hecate framework is open-source, released under the terms of the Affero General Public License v3.0. The source code, documentation, and issue tracker can be found at https://gitlab.com/rahmannlab/hecate .

References

M. Adler. pigz: Parallel implementation of gzip. 2023. URL https://zlib.net/pigz/ .

A. Auton, L. D. Brooks, R. M. Durbin, E. P. Garrison, H. M. Kang, J. O. Korbel, et al. A global reference for human genetic variation. Nature, 526(7571):68–74, 2015.

S. Christley. Human genomes as email attachments. Bioinformatics, 25(2):274–275, 2009.

J. Cleary and I. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396–402, 1984. doi: 10.1109/TCOM.1984.1096090.

S. Deorowicz, A. Danek, and H. Li. AGC: compact representation of assembled genomes with fast queries and updates. Bioinformatics, 39(3):btad097, 2023.

Facebook. Zstandard: Fast real-time compression algorithm. 2023. URL https://github.com/facebook/zstd .

P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 390–398, 2000. doi: 10.1109/SFCS.2000.892127.

I. Grebnov. libbsc: High performance block-sorting data compression library. 2009. URL https://github.com/IlyaGrebnov/libbsc .

I. Grebnov. Fast linear-time construction of suffix arrays. 2021. URL https://github.com/IlyaGrebnov/libsais .

S. Grumbach and F. Tahi. Compression of DNA sequences. In Proceedings DCC '93: Data Compression Conference, pages 340–350. IEEE, 1993.

M. Izdebski. lbzip2: Parallel implementation of bzip2. 2015. URL https://github.com/kjn/lbzip2 .

R. Krichevsky and V. Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199–207, 1981. doi: 10.1109/TIT.1981.1056331.

K. Kryukov, M. T. Ueda, S. Nakagawa, and T. Imanishi. Nucleotide archival format (NAF) enables efficient lossless reference-free compression of DNA sequences. Bioinformatics, 35(19):3826–3828, 2019.

M. Mahoney. Large text compression benchmark. 2005. URL http://mattmahoney.net/dc/text.html .

M. Mahoney. Big block BWT (Burrows-Wheeler transform) compressor. 2006. URL https://mattmahoney.net/dc/#bbb .

M. H. Mohammed, A. Dutta, T. Bose, S. Chadaram, and S. S. Mande. Deliminate–a fast and efficient method for loss-less compression of genomic sequences. Bioinformatics, 28(19):2527–2529, 2012.

C. G. Nevill-Manning and I. H. Witten. Protein is incompressible. In Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096), pages 257–266, 1999. doi: 10.1109/DCC.1999.755675.

E. Ohlebusch, T. Beller, and M. I. Abouelhoda. Computing the Burrows–Wheeler transform of a string and its reverse in parallel. Journal of Discrete Algorithms, 25:21–33, 2014. ISSN 1570-8667. doi: 10.1016/j.jda.2013.06.002. URL https://www.sciencedirect.com/science/article/pii/S1570866713000397 . 23rd Annual Symposium on Combinatorial Pattern Matching.

A. J. Pinho and D. Pratas. MFCompress: a compression tool for FASTA and multi-FASTA data. Bioinformatics, 30(1):117–118, 2014.

S. A. Shiryev and R. Agarwala. Indexing and searching petabase-scale nucleotide resources. Nature Methods, 21(6):994–1002, 2024.

M. Silva, D. Pratas, and A. J. Pinho. Efficient DNA sequence compression with neural networks. GigaScience, 9(11):giaa119, 2020. ISSN 2047-217X. doi: 10.1093/gigascience/giaa119.

H. Sisong. A C/C++ library and command-line tools for diff and patch between binary files. 2013. URL https://github.com/sisong/HDiffPatch .

M. J. P. Sousa, A. J. Pinho, and D. Pratas. JARVIS3: an efficient encoder for genomic data. Bioinformatics, 40(12):btac725, 2024.

K. Szewczyk. bzip3: A better and stronger spiritual successor to bzip2. 2022. URL https://github.com/iczelia/bzip3 .

H. Yao, Y. Ji, K. Li, S. Liu, J. He, and R. Wang. HRCM: An efficient hybrid referential compression method for genomic big data. BioMed Research International, 2019:3108950, 2019.

[Figure 1 panels: Encode and Decode for each of Pinus taeda; Homo sapiens, GRCh38; Musa balbisiana; Homo sapiens, T2T; Rattus norvegicus. Legend: hecate bwt, markov-mix, hecate zstd, MFCompress (t=1), MFCompress (t=16), NAF.]

Figure 1: Encode and decode times for GCA_000404065.3.fna (Pinus taeda, 22.5 Gb), GCF_000001405.40.fna (Homo sapiens (GRCh38.p14), 3.3 Gb), GCA_004837865.1.fna (Musa balbisiana, 499 Mb), GCF_009914755.1.fna (Homo sapiens (T2T-CHM13v2.0), 3.2 Gb) and GCA_021556685.1.fna (Rattus norvegicus, 2.9 Gb).
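The benchmark discussion describes codecs as dominating or tightening the Pareto frontier across the ratio/speed plane. As a minimal sketch of what that means, the code below extracts the non-dominated points from a set of (time, bpb) measurements, where lower is better on both axes. The `Point` type, the function name, and all numeric values are hypothetical placeholders for illustration; they are not measurements from this paper.

```python
from typing import List, NamedTuple

class Point(NamedTuple):
    name: str
    seconds: float  # encode or decode time; lower is better
    bpb: float      # compressed size in bpb; lower is better

def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats on both time and bpb."""
    frontier = []
    best_bpb = float("inf")
    # Scan from fastest to slowest; a slower point only survives
    # if it compresses strictly better than everything faster.
    for p in sorted(points, key=lambda q: (q.seconds, q.bpb)):
        if p.bpb < best_bpb:
            frontier.append(p)
            best_bpb = p.bpb
    return frontier

# HYPOTHETICAL placeholder values, not results from the paper.
demo = [
    Point("tool-a", 10.0, 1.90),
    Point("tool-b", 40.0, 1.60),
    Point("tool-c", 50.0, 1.75),  # slower and larger than tool-b: dominated
    Point("tool-d", 300.0, 1.55),
]

frontier_names = [p.name for p in pareto_frontier(demo)]
print(frontier_names)
```

In this toy input, tool-c is dropped because tool-b is both faster and smaller. A codec "tightens" the frontier when adding its point removes previously non-dominated points in this way.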
