DNACHUNKER: Learnable Tokenization for DNA Language Models


DNA language models are increasingly used to represent genomic sequence, yet their effectiveness depends critically on how raw nucleotides are converted into model inputs. Unlike natural language, DNA offers no canonical boundaries, making fixed tokenizations a brittle design choice under shifts, indels, and local repeats. We introduce DNAChunker, a masked DNA language model that incorporates a learnable adaptive segmentation module to produce context-dependent, variable-length units. Building on a dynamic segmentation procedure, DNAChunker learns to allocate finer granularity to functionally enriched regions while compressing repetitive or redundant sequence. We pre-train DNAChunker on the human reference genome (HG38) and evaluate it on the Nucleotide Transformer and Genomic Benchmarks, where it consistently improves over strong fixed-tokenization baselines. Further analyses and ablations indicate that the learned segmentation is structured rather than incidental: the model preferentially uses shorter units around promoters and exons, and longer units in repetitive regions, yielding representations that are both mutation-resilient and biologically informed.


💡 Research Summary

DNA Chunker introduces a learnable, context‑dependent tokenization scheme for DNA language models, addressing the brittleness of fixed‑size k‑mers or BPE when faced with insertions, deletions, and repetitive regions. The model consists of three main components: (1) a hierarchical encoder that first embeds raw nucleotides with lightweight bidirectional Mamba layers, then iteratively predicts chunk boundaries using a cosine‑similarity‑based routing network; (2) a main network composed of eight Transformer blocks that processes the compressed sequence of chunk embeddings, capturing long‑range dependencies with Rotary Position Embeddings; and (3) a hierarchical decoder that reverses the compression, expanding chunk representations back to base‑pair resolution for masked language modeling.
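The boundary-prediction idea in component (1) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the threshold parameter, and the mean-pooling step are assumptions for illustration; the paper's routing network is learned end-to-end, whereas this sketch simply thresholds the cosine similarity between adjacent nucleotide embeddings so that a drop in similarity starts a new chunk.

```python
import numpy as np

def predict_chunk_boundaries(embeddings: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Mark a boundary wherever adjacent embeddings diverge.

    embeddings: (seq_len, dim) per-nucleotide embeddings (in the paper
    these would come from the bidirectional Mamba layers).
    Returns a boolean array of length seq_len; True starts a new chunk.
    """
    a, b = embeddings[:-1], embeddings[1:]
    # Cosine similarity between each position and its predecessor.
    sim = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    )
    # Low similarity to the previous position suggests a chunk boundary;
    # position 0 always opens the first chunk.
    return np.concatenate([[True], sim < threshold])

def pool_chunks(embeddings: np.ndarray, boundaries: np.ndarray) -> np.ndarray:
    """Mean-pool each chunk into one embedding (one simple compression choice)."""
    ids = np.cumsum(boundaries) - 1                  # chunk index per position
    n_chunks = ids[-1] + 1
    pooled = np.zeros((n_chunks, embeddings.shape[1]))
    np.add.at(pooled, ids, embeddings)               # sum positions per chunk
    counts = np.bincount(ids, minlength=n_chunks)[:, None]
    return pooled / counts
```

Under this sketch, a homogeneous repetitive stretch yields high adjacent similarity and collapses into one long chunk, while a transition between dissimilar regions triggers a boundary, matching the compression behavior the summary describes.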

A key innovation is the “mask protection” mechanism used during MLM pre‑training.

