Beyond Conditional Computation: Retrieval-Augmented Genomic Foundation Models with Gengram

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Current genomic foundation models (GFMs) rely on extensive neural computation to implicitly approximate conserved biological motifs from single-nucleotide inputs. We propose Gengram, a conditional memory module that introduces an explicit and highly efficient lookup primitive for multi-base motifs via a genomic-specific hashing scheme, establishing genomic “syntax”. Integrated into the backbone of state-of-the-art GFMs, Gengram achieves substantial gains (up to 14%) across several functional genomics tasks. The module demonstrates robust architectural generalization, while further inspection of Gengram’s latent space reveals the emergence of meaningful representations that align closely with fundamental biological knowledge. By establishing structured motif memory as a modeling primitive, Gengram simultaneously boosts empirical performance and mechanistic interpretability, providing a scalable and biology-aligned pathway for the next generation of GFMs. The code is available at https://github.com/zhejianglab/Genos, and the model checkpoint is available at https://huggingface.co/ZhejiangLab/Gengram.


💡 Research Summary

The paper addresses a fundamental limitation of current genomic foundation models (GFMs): they rely on dense neural computation over single‑nucleotide tokens to implicitly learn conserved motifs, which is inefficient for tasks dominated by multi‑base regulatory elements. Inspired by the Engram memory module, the authors introduce Gengram, a lightweight conditional memory component specifically designed for genomic sequences. Gengram maintains hash‑based lookup tables for all possible k‑mers with lengths k = 1 to 6. Because the DNA alphabet (A, T, C, G, N) is tiny, a deterministic base‑|Σ| encoding provides collision‑free indexing, enabling O(1) retrieval per k‑mer.
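Because the alphabet has only five symbols, the collision-free indexing described above amounts to interpreting each k-mer as a base-5 number. The sketch below illustrates this idea; the symbol-to-digit assignment and table layout are assumptions for illustration, not the authors' code.

```python
# Hypothetical base-|Sigma| k-mer indexing, as described in the text.
# Each k-mer of length k maps to a unique integer in [0, 5**k), so a
# per-length embedding table can be read with a single O(1) array access.

ALPHABET = {"A": 0, "T": 1, "C": 2, "G": 3, "N": 4}  # digit assignment assumed

def kmer_index(kmer: str) -> int:
    """Deterministic, collision-free base-5 index for a k-mer."""
    idx = 0
    for base in kmer:
        idx = idx * len(ALPHABET) + ALPHABET[base]
    return idx

# One table per k-mer length k = 1..6; table k holds 5**k entries.
table_sizes = {k: len(ALPHABET) ** k for k in range(1, 7)}
```

Since distinct k-mers of the same length yield distinct base-5 numerals, no hashing tricks or collision handling are needed; the largest table (k = 6) has only 5^6 = 15,625 rows.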

During forward propagation, for each position t the module scans a causal window W preceding t, enumerates all contiguous k‑mers, deduplicates them, and retrieves their embeddings from the corresponding tables. Within each k‑mer length, the retrieved vectors are mean‑pooled to produce a fixed‑size summary m_t^(N); the six summaries are concatenated and linearly projected into a gate vector zₜ and a content vector uₜ. A scaled dot product between the RMS‑normalized backbone hidden state Xₜ and zₜ is passed through a sigmoid to obtain a gating scalar, which modulates the SiLU‑activated uₜ. The gated memory signal is added residually to the backbone hidden state before the attention block. This design yields three key advantages: (1) linear time complexity O(n) for a sequence of length n when W and the set of k‑mer lengths are fixed; (2) negligible parameter overhead (~60 M additional parameters, ≈0.5 % of a 10 B model); and (3) a dynamic gating mechanism that lets the model decide when motif memory should influence the representation, improving training stability and interpretability.
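The gated memory read for a single position can be sketched in a few lines of numpy. This is a minimal illustration under assumed dimensions (`d_model`, random stand-in summaries and weights), not the authors' implementation:

```python
import numpy as np

# Sketch of the gated memory read described above, for one position t.
# Per-length mean-pooled summaries are concatenated, projected to a gate
# vector z_t and a content vector u_t, gated by a sigmoid over a scaled
# dot product with the RMS-normalized hidden state, SiLU-activated, and
# added residually. All sizes and weights here are illustrative.

rng = np.random.default_rng(0)
d_model, n_lengths = 64, 6  # assumed hidden size; k-mer lengths 1..6

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x ** 2) + eps)

def silu(x):
    return x / (1.0 + np.exp(-x))

# Stand-ins for the six mean-pooled k-mer summaries m_t^(N).
summaries = [rng.standard_normal(d_model) for _ in range(n_lengths)]
m_t = np.concatenate(summaries)                        # (6 * d_model,)

# Linear projections to the gate and content vectors.
W_gate = rng.standard_normal((d_model, m_t.size)) / np.sqrt(m_t.size)
W_content = rng.standard_normal((d_model, m_t.size)) / np.sqrt(m_t.size)
z_t, u_t = W_gate @ m_t, W_content @ m_t

# Sigmoid-gated, SiLU-activated memory signal, added residually.
x_t = rng.standard_normal(d_model)                     # backbone hidden state
gate = 1.0 / (1.0 + np.exp(-(rms_norm(x_t) @ z_t) / np.sqrt(d_model)))
x_t_out = x_t + gate * silu(u_t)
```

When the gate saturates near zero, the residual path leaves the backbone state untouched, which is what makes the motif memory a conditional rather than always-on contribution.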

Extensive experiments were conducted on a 1.2 B‑parameter transformer (0.3 B activated parameters) pre‑trained on up to 200 B tokens from the Human Pangenome Reference Consortium and RefSeq. The authors performed a systematic layer‑wise insertion study, finding that shallow layers (e.g., layer 3) capture local base patterns, middle layers (layer 6) abstract motif clusters, and deep layers (layer 10) integrate long‑range context. The combination {3, 6, 10} consistently yielded the lowest validation loss. Gengram was evaluated on 18 zero‑shot benchmark datasets covering five categories: genomic structure understanding, gene regulation prediction, epigenetic profiling, variant effect prediction, and clinical impact. Across all categories, models equipped with Gengram outperformed the baseline by up to 14 % absolute AUROC, with the largest gains on promoter and 5′ UTR detection where motif information is critical.

The module proved architecture‑agnostic: it integrated seamlessly with standard multi‑head attention, GQA, and MLA variants, as well as with sparse Mixture‑of‑Experts (MoE) transformers, where it also helped balance expert loads. Ablations showed that increasing the window size beyond W = 64 offered diminishing returns, and that using up to six k‑mer lengths struck a sweet spot between expressivity and computational cost.

Beyond performance, the authors examined the learned memory space. Embeddings displayed reverse‑complement symmetry, and gating activations were markedly higher in known functional regions such as promoters, enhancers, and 5′ UTRs, indicating that Gengram captures biologically meaningful structure rather than memorizing raw frequencies. Because the key space is fixed, the memory can be extended to new species or to incorporate experimentally validated motifs without retraining the entire model, offering a practical pathway for continual learning in genomics.
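A probe for the reverse-complement symmetry mentioned above only needs the standard base-pairing map: pair each k-mer with its reverse complement and compare the two memory entries' embeddings. The pairing function is sketched below (the embedding comparison itself is omitted, since it depends on the trained tables):

```python
# Reverse complement of a k-mer over the DNA alphabet; N pairs with itself.
# One would look up the embeddings of `kmer` and reverse_complement(kmer)
# in the trained memory tables and measure their similarity.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}

def reverse_complement(kmer: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(kmer))
```

Because double-stranded DNA encodes the same motif on either strand, high similarity between such embedding pairs is exactly the biologically expected structure.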

In summary, Gengram introduces a novel modeling primitive for genomic foundation models: an explicit, efficient, and interpretable motif memory. It delivers substantial empirical gains with minimal overhead, scales linearly with sequence length, and aligns its internal representations with established biological knowledge, thereby charting a scalable and biology‑aligned direction for the next generation of genomic AI systems.

