Rethinking Genomic Modeling Through Optical Character Recognition

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present OpticalDNA, a vision-based framework that reframes genomic modeling as Optical Character Recognition (OCR)-style document understanding. OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision-language model with a visual DNA encoder and a document decoder, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, OpticalDNA defines prompt-conditioned objectives over core genomic primitives (reading, region grounding, subsequence retrieval, and masked span completion), thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly 20× fewer effective tokens, and surpasses models with up to 985× more activated parameters while tuning only 256k trainable parameters.


💡 Research Summary

The paper “Rethinking Genomic Modeling Through Optical Character Recognition” introduces OpticalDNA, a novel framework that reconceptualizes DNA sequence modeling as a visual document understanding problem rather than a one‑dimensional token stream. The authors begin by highlighting two fundamental shortcomings of current genomic foundation models that adopt large language model (LLM) architectures: (1) lack of structure‑aware reading, because genomes contain sparse functional elements embedded in long stretches of low‑information background, making exhaustive token‑by‑token processing wasteful; and (2) absence of understanding‑driven compression, as the low information density of DNA demands semantic compression that sequential models cannot provide.

To address these issues, OpticalDNA renders raw DNA sequences into multi‑page images. Each page is a fixed‑resolution canvas (640 × 640 px) on which nucleotides are drawn in a monospace font (≈14 pt) row‑by‑row, yielding roughly 1,800 bases per page. Crucially, every nucleotide is annotated with a pixel‑level bounding box linking the 1‑D genomic coordinate to a 2‑D region. This creates a “DNA document” where the genome is treated like a scanned text document, enabling the use of OCR‑style techniques.
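The coordinate mapping above can be sketched as follows. The 640 × 640 canvas, monospace grid, and ~1,800 bases per page come from the summary; the exact grid dimensions (60 columns × 30 rows) are illustrative assumptions chosen to reproduce that figure:

```python
# Sketch of the "DNA document" layout: a fixed-resolution page grid that
# links each 1-D genomic coordinate to a 2-D pixel bounding box.
# COLS/ROWS are assumptions (60 x 30 = 1,800 bases per page).

PAGE_W, PAGE_H = 640, 640      # fixed-resolution canvas (px)
COLS, ROWS = 60, 30            # assumed character grid per page
CELL_W = PAGE_W // COLS        # ~10 px per character cell
CELL_H = PAGE_H // ROWS        # ~21 px per text row

def base_bbox(i: int):
    """Map 1-D genomic coordinate i to (page index, pixel bounding box)."""
    per_page = COLS * ROWS
    page, offset = divmod(i, per_page)
    row, col = divmod(offset, COLS)
    x0, y0 = col * CELL_W, row * CELL_H
    return page, (x0, y0, x0 + CELL_W, y0 + CELL_H)

def render_pages(seq: str):
    """Group a DNA string into page-sized chunks of fixed-width text rows."""
    per_page = COLS * ROWS
    pages = []
    for p in range(0, len(seq), per_page):
        chunk = seq[p:p + per_page]
        rows = [chunk[r:r + COLS] for r in range(0, len(chunk), COLS)]
        pages.append(rows)
    return pages
```

With this layout, a 3,600-base sequence spans exactly two pages, and base 1,800 lands at the top-left cell of page 1.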

The visual encoder adopts the SAM‑Conv‑CLIP‑L backbone, partitioning each page into non‑overlapping 16 × 16 patches, extracting features, projecting them to a common dimension, and then aggregating across pages with a learned fusion module. The result is a compact sequence of visual tokens that retain enough information to reconstruct the original bases, yet reduce the effective token count by roughly a factor of 20 compared with base‑ or k‑mer tokenization.
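The token budget implied by this design can be checked with simple arithmetic. The patch geometry (640 px pages, non-overlapping 16 × 16 patches) follows the text; the fusion-stage reduction is an assumption inferred from the reported ~20× compression over per-base tokens, since the raw patch count alone (1,600 per page) is close to the per-page base count:

```python
# Back-of-envelope token accounting for the visual encoder. Patch geometry
# follows the summary; the fusion reduction factor is an inferred assumption,
# not a number stated in the paper.

PAGE_PX = 640
PATCH = 16
BASES_PER_PAGE = 1800

patches_per_page = (PAGE_PX // PATCH) ** 2       # 40 x 40 = 1,600 raw patches
target_tokens = BASES_PER_PAGE // 20             # ~20x fewer than base tokens
fusion_ratio = patches_per_page / target_tokens  # implied reduction from fusion
```

This suggests the learned fusion module, not patchification alone, carries most of the compression: roughly a 17-18× reduction from raw patches to the final visual tokens under these assumptions.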

Training proceeds via prompt‑conditioned pre‑training on six OCR‑inspired tasks that map directly onto core genomic operations:

  • T1 – free‑form DNA transcription (pure OCR).
  • T2 – transcription plus spatial grounding (outputting text‑box pairs).
  • T3 – ROI‑based transcription (given bounding boxes).
  • T4 – masked‑region completion within supplied ROIs.
  • T5 – subsequence retrieval with grounding (query string → list of matching boxes).
  • T6 – chromosome‑level classification.

Each task is framed as a two‑turn conversation (user prompt + assistant answer) containing the page images and the required output, mirroring modern vision‑language models. The decoder, used only during pre‑training, learns to map visual tokens to the structured outputs, thereby encouraging region‑aware reasoning. After pre‑training, the encoder is frozen and downstream tasks are solved with a lightweight MLP head, requiring only 256 k trainable parameters.
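The two-turn conversation format can be sketched as a simple record. The structure (a user turn carrying the page images plus the task prompt, and an assistant turn carrying the structured answer) follows the summary; the field names and prompt wording are hypothetical:

```python
# Minimal sketch of one prompt-conditioned pre-training example as a
# two-turn conversation. Field names and prompt text are hypothetical.

def make_example(task_id: str, page_images, prompt: str, answer: str):
    """Build one training example: user turn (images + prompt), assistant turn (answer)."""
    return {
        "task": task_id,
        "conversation": [
            {"role": "user", "images": page_images, "text": prompt},
            {"role": "assistant", "text": answer},
        ],
    }

# T5-style subsequence retrieval: query string -> list of matching boxes.
ex = make_example(
    "T5",
    page_images=["page_0.png"],
    prompt="Find all occurrences of GATTACA and return their bounding boxes.",
    answer="[(120, 42, 190, 63)]",
)
```

The same constructor covers T1-T6 by varying the prompt and answer format, e.g. an empty-prompt transcription for T1 or a masked-span ROI for T4.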

Extensive experiments across diverse benchmarks—including eQTL prediction, promoter detection, variant effect prediction, and chromosome classification—demonstrate that OpticalDNA consistently outperforms state‑of‑the‑art sequence‑based models. On sequences up to 450 kb, it achieves the best overall performance while using ~20× fewer effective tokens and surpasses models that have up to 985× more activated parameters. Ablation studies confirm that both the visual token compression and the region‑grounded prompt objectives contribute substantially to the gains.

In summary, OpticalDNA shows that treating the genome as a visual document and leveraging OCR‑style vision‑language training can dramatically improve computational efficiency, enable explicit spatial reasoning, and provide a new paradigm for genome‑scale representation learning. This work opens the door to more interpretable, scalable, and structure‑aware AI systems for genomics.

