DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv paper.

Recent advances in self-supervised models for natural language, vision, and protein sequences have inspired the development of large genomic DNA language models (DNALMs). These models aim to learn generalizable representations of diverse DNA elements, potentially enabling various genomic prediction, interpretation and design tasks. Despite their potential, existing benchmarks do not adequately assess the capabilities of DNALMs on key downstream applications involving an important class of non-coding DNA elements critical for regulating gene activity. In this study, we introduce DART-Eval, a suite of representative benchmarks specifically focused on regulatory DNA to evaluate model performance across zero-shot, probed, and fine-tuned scenarios against contemporary ab initio models as baselines. Our benchmarks target biologically meaningful downstream tasks such as functional sequence feature discovery, predicting cell-type specific regulatory activity, and counterfactual prediction of the impacts of genetic variants. We find that current DNALMs exhibit inconsistent performance and do not offer compelling gains over alternative baseline models for most tasks, while requiring significantly more computational resources. We discuss potentially promising modeling, data curation, and evaluation strategies for the next generation of DNALMs. Our code is available at https://github.com/kundajelab/DART-Eval.


💡 Research Summary

The paper introduces DART‑Eval, a comprehensive benchmark suite designed to assess the utility of DNA language models (DNALMs) on regulatory non‑coding DNA. While large self‑supervised models have transformed natural language processing and protein sequence modeling and are now being applied to genomics, existing DNA benchmarks focus on surrogate tasks and suffer from data biases, outdated baselines, and limited biological relevance. DART‑Eval addresses these gaps with five representative downstream tasks that capture the core applications of regulatory DNA models: (1) distinguishing high‑confidence candidate cis‑regulatory elements (cCREs) from compositionally matched controls, (2) detecting transcription‑factor (TF) motif instances, (3) predicting cell‑type‑specific regulatory activity, (4) quantitative regression of chromatin accessibility, and (5) counterfactual prediction of variant effects. Each task is evaluated under three regimes: zero‑shot (using mean‑pooled embeddings or model likelihoods without any training), probing (training a shallow CNN on frozen final‑layer embeddings), and fine‑tuning (updating the full model through low‑rank LoRA adapters).
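The probing regime can be illustrated with a minimal sketch. Here a logistic-regression probe stands in for the paper's shallow CNN, and the "frozen embeddings" are synthetic Gaussian vectors; the model names, data, and probe architecture are illustrative assumptions, not the benchmark's actual code.

```python
import math
import random

def train_probe(embeddings, labels, lr=0.5, epochs=200):
    """Train a logistic-regression probe on frozen embedding vectors.
    (Illustrative stand-in for the shallow CNN probe used in DART-Eval.)"""
    dim = len(embeddings[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log-loss w.r.t. z
            for i in range(dim):
                w[i] -= lr * g * x[i]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Toy "frozen embeddings": class 1 is shifted along the first dimension,
# mimicking an embedding space where the classes are partly separable.
rng = random.Random(0)
emb = [[rng.gauss(y * 3.0, 1.0), rng.gauss(0, 1)]
       for y in (0, 1) for _ in range(50)]
lab = [0] * 50 + [1] * 50
w, b = train_probe(emb, lab)
acc = sum(predict(w, b, x) == y for x, y in zip(emb, lab)) / len(lab)
```

The point of the regime is that only the tiny probe is trained; the DNALM producing the embeddings stays frozen, so probing measures how much task-relevant information the representations already contain.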

The authors benchmark six contemporary, annotation‑agnostic DNALMs—Caduceus, DNABERT‑2, GENA‑LM, HyenaDNA, Mistral‑DNA, and Nucleotide Transformer—spanning 7 M to 1.6 B parameters, and compare them against strong ab initio baselines: a ChromBPNet‑like dilated CNN for quantitative tasks and simple CNN classifiers for the rest. Data are carefully curated: cCREs are 350 bp sequences from ENCODE, with negatives generated by dinucleotide shuffling so that sequence composition is preserved; TF motif evaluation uses 1,443 curated PWMs; and variant sets are filtered to control for linkage disequilibrium (LD) and include high‑confidence causal variants.
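The composition-matched negatives can be sketched as follows: a dinucleotide-preserving shuffle drawn as a random Eulerian walk on the graph whose edges are the adjacent base pairs of the sequence, with stuck walks rejected and retried. This is one standard way to implement such a shuffle; the paper's exact implementation may differ.

```python
import random
from collections import Counter, defaultdict

def dinuc_shuffle(seq, rng=None, max_tries=1000):
    """Shuffle seq while preserving its exact dinucleotide counts.

    Draws a random walk that consumes each adjacent base pair of seq
    exactly once (an Eulerian walk on the dinucleotide transition
    graph); walks that get stuck early are rejected and retried.
    """
    rng = rng or random.Random()
    if len(seq) < 3:
        return seq
    edges = defaultdict(list)  # edges[a] = bases that follow a in seq
    for a, b in zip(seq, seq[1:]):
        edges[a].append(b)
    for _ in range(max_tries):
        pools = {a: rng.sample(lst, len(lst)) for a, lst in edges.items()}
        walk = [seq[0]]
        cur = seq[0]
        while pools.get(cur):
            walk.append(pools[cur].pop())
            cur = walk[-1]
        if len(walk) == len(seq):  # all edges consumed: valid shuffle
            return "".join(walk)
    return seq  # extremely unlikely fallback

s = "ACGTACGGTACCGTAGGATCC"
t = dinuc_shuffle(s, random.Random(42))
# Dinucleotide (and hence mononucleotide) composition is unchanged:
same_dinucs = Counter(zip(t, t[1:])) == Counter(zip(s, s[1:]))
```

Preserving dinucleotide composition matters because a naive per-base shuffle destroys CpG frequencies and other local statistics that models can exploit as shortcuts, inflating apparent discrimination performance.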

Results reveal three major findings. First, in the zero‑shot likelihood mode, most DNALMs discriminate regulatory from background sequences well (accuracy > 0.9), but embedding‑based zero‑shot evaluation performs poorly, indicating that current models capture token‑level probabilities well yet do not produce biologically informative embeddings. Second, probing shows only marginal differences between DNALM embeddings and a baseline CNN trained from scratch; in many cases the baseline outperforms the DNALM, suggesting limited transferability of the learned representations. Third, fine‑tuning with LoRA yields modest gains for some models but rarely surpasses the ab initio CNNs; on the most challenging counterfactual variant‑effect task, DNALMs perform near chance, likely because pre‑training data lack sufficient variant context and the models struggle with long‑range regulatory interactions.
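The zero-shot likelihood mode scores each candidate sequence by its probability under the pre-trained model, with no task-specific training. A minimal sketch, using a small order-k Markov chain as a stand-in for a DNALM (the real benchmark scores sequences under transformer or state-space models, not a Markov model):

```python
import math
from collections import defaultdict

class MarkovLM:
    """Order-k Markov chain standing in for a pre-trained DNA language model."""

    def __init__(self, k=2, alphabet="ACGT"):
        self.k, self.alphabet = k, alphabet
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, seqs):
        """'Pre-train' by counting k-mer -> next-base transitions."""
        for s in seqs:
            for i in range(len(s) - self.k):
                self.counts[s[i:i + self.k]][s[i + self.k]] += 1

    def log_likelihood(self, s):
        """Add-one-smoothed log P(s), summed over next-token predictions."""
        ll = 0.0
        for i in range(len(s) - self.k):
            ctx, nxt = s[i:i + self.k], s[i + self.k]
            total = sum(self.counts[ctx].values()) + len(self.alphabet)
            ll += math.log((self.counts[ctx][nxt] + 1) / total)
        return ll

# Zero-shot discrimination: rank candidates by likelihood under the model.
lm = MarkovLM(k=2)
lm.fit(["ACGTACGTACGTACGT" * 4])   # toy "pre-training" corpus
in_dist = "ACGTACGTACGT"           # matches the training statistics
shuffled = "AAGGTTCCACGT"          # same base composition, reordered
```

Because `in_dist` and `shuffled` have identical base counts, any likelihood gap between them reflects learned sequence structure rather than composition, which is exactly the confound the benchmark's matched controls are designed to remove.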

Computationally, DNALMs demand orders of magnitude more GPU memory and training time than the lightweight CNN baselines, yet they do not deliver proportional performance benefits. The authors also critique prior benchmarks for confounding factors such as GC‑content mismatches and improper control of LD, which can inflate reported gains.

In the discussion, the paper proposes several avenues for next‑generation DNALMs: (i) regulatory‑specific tokenization schemes (e.g., k‑mers combined with TF‑binding tokens), (ii) multi‑task pre‑training on diverse functional genomics assays (ChIP‑seq, ATAC‑seq, DNase‑seq) to embed quantitative signal, (iii) data augmentation that incorporates variant information and LD structure, and (iv) parameter‑efficient fine‑tuning techniques (LoRA, AdapterFusion) to reduce resource demands. The authors emphasize that DART‑Eval itself will be continuously updated with newer datasets and baselines, providing a living platform for rigorous evaluation.
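The parameter-efficient fine-tuning idea in point (iv) reduces to a simple reparameterization: a frozen weight matrix W is augmented with a trainable low-rank update, W_eff = W + (alpha / r) * B @ A. A minimal pure-Python sketch with toy matrices (dimensions and values are illustrative):

```python
import random

def matmul(X, Y):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha):
    """Return W_eff = W + (alpha / r) * B @ A, the LoRA reparameterization.

    W: (d_out, d_in) frozen weight; B: (d_out, r); A: (r, d_in).
    Only A and B are trained, so the trainable parameter count scales
    with the rank r rather than with the full size of W.
    """
    r = len(A)
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

rng = random.Random(0)
d, r = 4, 1
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
B = [[0.0] for _ in range(d)]  # LoRA initializes B to zero...
A = [[rng.gauss(0, 0.1) for _ in range(d)]]
W_eff = lora_effective_weight(W, A, B, alpha=8)  # ...so W_eff == W at start

B[0][0] = 1.0  # after some training steps, B becomes nonzero
W_tuned = lora_effective_weight(W, A, B, alpha=8)
```

Initializing B to zero makes the adapted model exactly match the pre-trained one at the start of fine-tuning, so LoRA perturbs rather than replaces the pre-trained weights.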

Overall, DART‑Eval demonstrates that current DNALMs, despite their size, offer inconsistent and often negligible advantages over well‑engineered ab initio models on biologically meaningful regulatory tasks, while incurring substantially higher computational costs. The benchmark thus serves as a critical resource for the community to develop more effective, efficient, and biologically grounded DNA language models.

