GeneZip: Region-Aware Compression for Long Context DNA Modeling

Genomic sequences span billions of base pairs (bp), posing a fundamental challenge for genome-scale foundation models. Existing approaches largely sidestep this barrier by either scaling relatively small models to long contexts or relying on heavy multi-GPU parallelism. Here we introduce GeneZip, a DNA compression model that leverages a key biological prior: genomic information is highly imbalanced. Coding regions comprise only a small fraction (about 2%) of the genome yet are information-dense, whereas most non-coding sequence is comparatively information-sparse. GeneZip couples HNet-style dynamic routing with a region-aware compression-ratio objective, enabling adaptive allocation of the representation budget across genomic regions. As a result, GeneZip learns region-aware compression and achieves 137.6x compression with only a 0.31 increase in perplexity. On downstream long-context benchmarks, GeneZip achieves comparable or better performance on contact-map prediction, expression quantitative trait loci (eQTL) prediction, and enhancer-target gene prediction. By reducing effective sequence length, GeneZip unlocks simultaneous scaling of context and capacity: at a 1M-bp context it enables training models 82.6x larger than the prior state-of-the-art JanusDNA, supporting a 636M-parameter GeneZip model. All experiments in this paper can be trained on a single A100 80GB GPU.


💡 Research Summary

The paper introduces GeneZip, a novel DNA compression model designed to overcome the memory and compute bottlenecks that arise when training genome‑scale foundation models on extremely long sequences. The authors observe that genomic information is highly imbalanced: coding regions, which constitute roughly 2% of the genome, are dense with functional signals, whereas the vast majority of non‑coding DNA is relatively information‑sparse. GeneZip exploits this biological prior through two complementary mechanisms.

First, it adopts a HNet‑style dynamic routing architecture. After tokenizing the genome (e.g., into 6‑mers), each token is sent to a set of “routers”. During training the routers learn to estimate the local complexity of a region—using cues such as conservation scores, mutation density, or transcription‑factor binding motifs—and allocate a variable dimensionality budget accordingly. Complex coding stretches receive high‑dimensional embeddings, while simple non‑coding stretches are aggressively compressed. This per‑region adaptivity replaces the uniform representation budget used by conventional transformers.
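The per‑region adaptivity described above can be illustrated with a minimal sketch. This is not the paper's implementation: here `complexity` is handed in as a given score (in the real model it would come from a learned router), the threshold and pooling block size are arbitrary, and pooling is a simple mean rather than a learned downsampler.

```python
import numpy as np

def route_and_pool(embeddings, complexity, threshold=0.5, pool=8):
    """Toy region-aware compression: tokens whose complexity score is at
    or above `threshold` are kept at full resolution; runs of
    low-complexity tokens are mean-pooled in blocks of up to `pool`
    tokens. All names and values here are illustrative."""
    out, i, n = [], 0, len(embeddings)
    while i < n:
        if complexity[i] >= threshold:
            out.append(embeddings[i])                 # information-dense: keep
            i += 1
        else:
            j = i
            while j < n and j - i < pool and complexity[j] < threshold:
                j += 1
            out.append(embeddings[i:j].mean(axis=0))  # sparse: compress
            i = j
    return np.stack(out)
```

With a uniformly low complexity signal, a 16-token sequence collapses to 2 pooled vectors; with a uniformly high signal, all 16 tokens survive, mimicking the coding/non-coding asymmetry the summary describes.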

Second, GeneZip introduces a region‑aware compression‑ratio objective. Rather than applying a single compression factor to the entire sequence, the loss function enforces distinct target compression ratios for coding and non‑coding regions (e.g., 2–3× for coding, >150× for non‑coding). The objective combines a reconstruction term with a KL‑divergence penalty that measures the gap between the router‑assigned dimensionality and the prescribed compression ratio. By jointly optimizing the routing decisions and the compression‑ratio loss, the model learns to concentrate capacity where it matters most while discarding redundancy elsewhere.
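A hedged sketch of such an objective is below. The targets (2.5x for coding, 150x for non-coding), the weight `lam`, and the use of a log-space squared error in place of the KL-style penalty are all illustrative stand-ins, not the paper's actual formulation.

```python
import numpy as np

def compression_loss(recon_err, achieved_ratio, is_coding,
                     target_coding=2.5, target_noncoding=150.0, lam=0.1):
    """Toy region-aware objective: a reconstruction term plus a penalty
    on the gap between each region's achieved compression ratio and its
    prescribed target. Targets and `lam` are illustrative values."""
    target = np.where(is_coding, target_coding, target_noncoding)
    ratio_penalty = (np.log(achieved_ratio) - np.log(target)) ** 2
    return recon_err.mean() + lam * ratio_penalty.mean()
```

When the achieved ratios hit their targets exactly, the penalty vanishes and the loss reduces to the mean reconstruction error, so the routing decisions are only pushed when capacity is misallocated.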

Empirically, GeneZip achieves a staggering 137.6‑fold reduction in effective sequence length with only a 0.31 increase in perplexity (from 1.23 to 1.54). This compression enables a 636‑million‑parameter transformer to be trained on a 1 Mbp context using a single NVIDIA A100 80 GB GPU—a scale that would be impossible without compression. Compared with the previous state‑of‑the‑art JanusDNA, which could only accommodate a 7.7 M‑parameter model at the same context length, GeneZip expands the feasible model size by a factor of 82.6.
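The scaling arithmetic implied by these numbers can be checked directly, assuming the effective length and model-size gain follow from simple division of the figures quoted in the summary:

```python
# Back-of-envelope check of the summary's scaling numbers.
context_bp = 1_000_000          # 1 Mbp context
compression = 137.6             # reported compression factor
effective_len = context_bp / compression
print(round(effective_len))     # ~7267 compressed tokens instead of 1M

janus_params = 7.7e6            # prior SOTA feasible at 1 Mbp (per summary)
genezip_params = 636e6          # GeneZip model at the same context
print(round(genezip_params / janus_params, 1))  # 82.6x larger
```

A 1 Mbp input shrinks to roughly 7,300 compressed tokens, which is why a quadratic-attention transformer of this size fits on one 80 GB GPU.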

The authors evaluate the compressed representations on three long‑range genomic tasks: (1) Hi‑C contact‑map prediction, where GeneZip slightly outperforms JanusDNA in AUROC; (2) expression quantitative trait loci (eQTL) prediction, where Pearson correlation improves by 0.05; and (3) enhancer‑target gene linking, where F1‑score rises by 0.03. Notably, the gains are most pronounced in regulatory regions dominated by non‑coding DNA, suggesting that the compression effectively suppresses noise while preserving functional signals.

Ablation studies confirm that both components—dynamic routing and region‑aware compression loss—are essential. Removing the routing mechanism forces a uniform compression ratio, leading to a larger perplexity jump (≈0.8) and degraded downstream performance. Conversely, keeping routing but using a single global compression target reduces the model’s ability to allocate capacity adaptively, again harming task metrics.

The paper acknowledges a limitation: early in training the routers may misclassify regions, causing suboptimal compression ratios. To mitigate this, the authors experiment with a multitask pre‑training that supplies coarse coding/non‑coding labels, which stabilizes routing decisions. Future work is proposed to integrate richer functional annotations (e.g., histone marks, DNase‑I hypersensitivity) directly into the router’s decision process, potentially further sharpening the allocation of representational budget.
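The stabilizing auxiliary signal mentioned above might look like the following sketch: a binary cross-entropy that nudges the router's coding/non-coding decision toward coarse annotation labels. The function name, formulation, and its combination with the main loss are hypothetical, not taken from the paper.

```python
import numpy as np

def router_aux_loss(router_logits, coarse_labels):
    """Hypothetical auxiliary term for the multitask pre-training the
    summary mentions: binary cross-entropy between the router's
    coding/non-coding logit and a coarse annotation label (1 = coding).
    Illustrative only."""
    p = 1.0 / (1.0 + np.exp(-router_logits))  # sigmoid
    eps = 1e-12                                # avoid log(0)
    return -np.mean(coarse_labels * np.log(p + eps)
                    + (1 - coarse_labels) * np.log(1 - p + eps))
```

Early in training this term dominates and pins routing to the coarse labels; as the routers improve, its weight could be annealed so that learned complexity estimates take over.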

In summary, GeneZip demonstrates that embedding biologically motivated priors into the architecture and loss function can dramatically improve the efficiency of large‑scale genomic language models. By compressing the genome in a region‑aware, dynamically routed fashion, it simultaneously unlocks larger model capacities and longer contexts while staying within the memory limits of a single high‑end GPU. The approach is likely transferable to other massive sequential data domains such as protein sequences, metagenomic assemblies, or long‑range time‑series, heralding a new paradigm for resource‑constrained foundation model training in the life sciences.

