JEPA-DNA: Grounding Genomic Foundation Models through Joint-Embedding Predictive Architectures

Reading time: 5 minutes
...

📝 Original Info

  • Title: JEPA-DNA: Grounding Genomic Foundation Models through Joint-Embedding Predictive Architectures
  • ArXiv ID: 2602.17162
  • Date: 2026-02-19
  • Authors: **Not specified in the provided material. (Refer to the original PDF or the conference page for author names and affiliations.)**

📝 Abstract

Genomic Foundation Models (GFMs) have largely relied on Masked Language Modeling (MLM) or Next Token Prediction (NTP) to learn the language of life. While these paradigms excel at capturing local genomic syntax and fine-grained motif patterns, they often fail to capture the broader functional context, resulting in representations that lack a global biological perspective. We introduce JEPA-DNA, a novel pre-training framework that integrates the Joint-Embedding Predictive Architecture (JEPA) with traditional generative objectives. JEPA-DNA introduces latent grounding by coupling token-level recovery with a predictive objective in the latent space, supervised via a [CLS] token. This forces the model to predict the high-level functional embeddings of masked genomic segments rather than focusing solely on individual nucleotides. JEPA-DNA extends both NTP and MLM paradigms and can be deployed either as a standalone from-scratch objective or as a continual pre-training enhancement for existing GFMs. Our evaluations across a diverse suite of genomic benchmarks demonstrate that JEPA-DNA consistently yields superior performance in supervised and zero-shot tasks compared to generative-only baselines. By providing a more robust and biologically grounded representation, JEPA-DNA offers a scalable path toward foundation models that understand not only the genomic alphabet, but also the underlying functional logic of the sequence.

💡 Deep Analysis

📄 Full Content

Genomic Foundation Models (GFMs) have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions [1]. These models, including DNABERT-2 [2], Nucleotide Transformer [3], HyenaDNA [4], and Evo [5], are Large Language Models (LLMs) adapted for DNA sequences, operating with context windows ranging from a few thousand base pairs to megabase-scale inputs. Typically, these architectures rely on self-supervised token prediction objectives, such as Masked Language Modeling (MLM) or autoregressive Next Token Prediction (NTP), to learn genomic representations.

While effective for identifying local motifs and sequence patterns [2,3], these methods face a fundamental limitation in capturing the broader, functional logic of the genome [1,6]. We term this limitation the “granularity trap”. In the MLM/NTP paradigms, the model is tasked with reconstructing individual masked tokens (e.g., A, C, T, G). While this encourages a high-fidelity understanding of local syntax, it does not inherently require the model to internalize the high-level biological consequences of a sequence. Consequently, these models may over-allocate capacity to high-frequency “noise,” such as non-coding repetitive elements or neutral polymorphisms, while failing to ground representations in global functional contexts, such as long-range enhancer-promoter interactions.

To bridge the gap between genomic syntax and biological semantics, we introduce JEPA-DNA, a novel framework that incorporates the Joint-Embedding Predictive Architecture (JEPA) [7] into genomic pre-training. Unlike generative objectives that operate in the raw token space, the JEPA paradigm predicts the latent representations of masked segments. While Joint-Embedding architectures have seen preliminary success in transcriptomics for modeling gene expression vectors [8], JEPA-DNA represents the first application of this paradigm to the high-resolution, multiscale domain of genomic sequences. By coupling token-level recovery with a predictive objective in the latent space, supervised via a [CLS] token, JEPA-DNA encourages the learning of abstract, functional features that are invariant to low-level sequence noise.
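This summary does not include implementation details, but the two-term objective described above can be sketched as follows. This is a minimal PyTorch sketch, assuming a stop-gradient target encoder supplies the latent target, the [CLS] embedding of the masked sequence is used as the prediction input, and the two losses are combined by a simple weight `lam`; the module names (`context_encoder`, `target_encoder`, `predictor`, `mlm_head`) and these design choices are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def jepa_dna_step(context_encoder, target_encoder, predictor, mlm_head,
                  masked_ids, full_ids, mask_positions, lam=1.0):
    """One hypothetical JEPA-DNA-style step: MLM token recovery plus
    prediction of a latent target in embedding space."""
    # Encode the masked sequence with the online (context) encoder.
    ctx_hidden = context_encoder(masked_ids)              # (B, L, d)

    # --- token-level recovery (standard MLM term) ---
    logits = mlm_head(ctx_hidden)                          # (B, L, vocab)
    mlm_loss = F.cross_entropy(
        logits[mask_positions], full_ids[mask_positions])

    # --- latent grounding term ---
    # Target: [CLS] embedding of the full, unmasked sequence from a
    # frozen / stop-gradient target encoder (an assumption here).
    with torch.no_grad():
        tgt_cls = target_encoder(full_ids)[:, 0]           # (B, d)
    pred_cls = predictor(ctx_hidden[:, 0])                 # (B, d)
    latent_loss = F.smooth_l1_loss(pred_cls, tgt_cls)

    return mlm_loss + lam * latent_loss
```

In JEPA-style setups the target branch is often an exponential moving average of the online encoder; how JEPA-DNA parameterizes the target encoder and the predictor is not specified in this summary, so the above should be read purely as an illustration of coupling token recovery with latent prediction.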

Our approach is uniquely versatile: JEPA-DNA can be deployed as a standalone pre-training objective or as a continual pre-training phase to “ground” existing GFMs, extending both the NTP and MLM paradigms across different architectures. This “latent grounding” serves as a corrective layer for pretrained models, anchoring their token-level knowledge to a more stable, semantic world model of genomic function. We evaluate JEPA-DNA across a suite of genomic benchmarks, specifically focusing on linear probing and zero-shot protocols to isolate the quality of the learned representations. Our empirical results demonstrate that latent grounding consistently elevates performance across functional tasks compared to generative-only baselines. Ultimately, JEPA-DNA demonstrates that moving beyond local nucleotide reconstruction is essential for developing foundation models that internalize the high-level regulatory mechanisms governing the genome.
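For reference, the linear-probing protocol used to isolate representation quality can be sketched as below. This is a generic illustration, assuming labeled downstream sequences and the [CLS] embedding of a frozen encoder as the sequence representation; the probe type, data loaders, and hyperparameters are assumptions rather than the paper's exact setup.

```python
import torch
from sklearn.linear_model import LogisticRegression

def linear_probe(encoder, train_loader, test_loader):
    """Fit a logistic-regression probe on frozen embeddings and return
    test accuracy; a generic protocol, not the paper's exact recipe."""
    encoder.eval()

    def embed(loader):
        feats, labels = [], []
        with torch.no_grad():
            for ids, y in loader:
                cls = encoder(ids)[:, 0]   # [CLS] embedding as the representation
                feats.append(cls.cpu())
                labels.append(y)
        return torch.cat(feats).numpy(), torch.cat(labels).numpy()

    X_tr, y_tr = embed(train_loader)
    X_te, y_te = embed(test_loader)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)
```

Because the encoder stays frozen and only a linear classifier is trained, probe accuracy directly reflects how linearly separable the learned representations are, which is the property the paper's evaluation targets.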

In summary, our contributions are as follows:

• We introduce a novel application of Joint-Embedding Predictive Architectures to the genomic domain, shifting the pre-training focus from literal token reconstruction to latent feature prediction.

• We empirically demonstrate that by operating in the embedding space, our model captures higher-order functional semantics that standard MLM and NTP objectives may ignore.

• Our proposed method can be used for training models from scratch or as a refinement phase for GFMs and is compatible across architectures, providing a consistent means to learn functional sequence features.

• Through linear probing and zero-shot experiments, we establish that JEPA-DNA learns more linearly separable and biologically relevant features than generative baselines that use standard LLM objectives.

2 Related Work

The development of GFMs has been largely inspired by the success of Large Language Models (LLMs) in Natural Language Processing. Early iterations, such as DNABERT [9], adapted the BERT architecture [10], replacing its original WordPiece subword tokenizer with k-mer tokenization to capture bidirectional context within the genome. More recently, DNABERT-2 [2] and the Nucleotide Transformer [3] expanded this scale by training on diverse multi-species datasets, demonstrating that larger context windows and higher parameter counts can improve “out-of-the-box” performance on downstream tasks like promoter prediction and variant effect prediction.

Beyond Transformer-based architectures, recent advancements have focused on overcoming the quadratic scaling of self-attention to model longer genomic dependencies. HyenaDNA [4] utilizes the Hyena operator to process sequences at single-nucleotide resolution across long contexts. Similarly, Evo2

Reference

This content is AI-processed based on open access ArXiv data.
