AntigenLM: Structure-Aware DNA Language Modeling for Influenza


Language models have advanced sequence analysis, yet DNA foundation models often lag behind task-specific methods for unclear reasons. We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. This structure-aware pretraining enables AntigenLM to capture evolutionary constraints and generalize across tasks. Fine-tuned on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training, outperforming phylogenetic and evolution-based models. It also achieves near-perfect subtype classification. Ablation studies show that disrupting genomic structure through fragmentation or shuffling severely degrades performance, revealing the importance of preserving functional-unit integrity in DNA language modeling. AntigenLM thus provides both a powerful framework for antigen evolution prediction and a general principle for building biologically grounded DNA foundation models.


💡 Research Summary

AntigenLM is a structure‑aware DNA language model designed to forecast influenza A antigenic evolution by leveraging the full eight‑segment viral genome. The authors first assembled a large corpus of 54,512 high‑quality influenza A genomes from GISAID, filtering out incomplete or low‑coverage sequences. Each genome is represented as a single contiguous nucleotide string formed by concatenating the eight segments (PB2, PB1, PA, HA, NP, NA, MP, NS) in a fixed order, preserving the natural segment orientation without any multiple‑sequence alignment or artificial markers.
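The full-genome input construction described above can be sketched as follows: the eight segments are joined in a fixed canonical order into one contiguous nucleotide string, with no alignment and no artificial markers. The toy sequences and function name below are illustrative, not from the paper.

```python
# Canonical influenza A segment order used for concatenation (as in the summary).
SEGMENT_ORDER = ["PB2", "PB1", "PA", "HA", "NP", "NA", "MP", "NS"]

def build_genome_string(segments: dict[str, str]) -> str:
    """Join the eight segments in canonical order into a single contiguous
    nucleotide string, with no separators, mirroring AntigenLM pretraining
    inputs. Raises if any segment is missing (incomplete genomes were
    filtered out of the corpus)."""
    missing = [s for s in SEGMENT_ORDER if s not in segments]
    if missing:
        raise ValueError(f"incomplete genome, missing segments: {missing}")
    return "".join(segments[s] for s in SEGMENT_ORDER)

# Tiny placeholder segments for illustration; real segments are ~0.9-2.3 kb each.
toy = dict(zip(SEGMENT_ORDER,
               ["ATG", "GGA", "TTC", "ACG", "CCT", "GAT", "AAC", "TGA"]))
genome = build_genome_string(toy)  # "ATGGGATTCACGCCTGATAACTGA"
```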

The backbone is a compact GPT‑2‑style decoder‑only Transformer with six layers, a hidden dimension of 384, and six attention heads. Crucially, the positional embedding range is extended to 13,000 tokens, allowing the model to ingest the entire ~13 kb influenza genome in one pass. Two task‑specific heads sit on top of the shared backbone: (i) a language‑modeling head that predicts the next nucleotide in an autoregressive fashion, and (ii) a classification head that maps the hidden state at a sentinel token to a subtype label. This multi‑task design enables the model to learn both generative representations of evolutionary dynamics and discriminative features for subtype identification.
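A back-of-the-envelope sketch of the backbone's size, using the hyperparameters stated above (6 layers, hidden dimension 384, 6 heads, 13,000 positions). The vocabulary size and the parameter-count formula below are assumptions for illustration, not figures from the paper.

```python
from dataclasses import dataclass

@dataclass
class AntigenLMConfig:
    n_layer: int = 6
    d_model: int = 384
    n_head: int = 6
    n_positions: int = 13_000  # extended context so a ~13 kb genome fits in one pass
    vocab_size: int = 8        # A, C, G, T plus special tokens (assumed)

def approx_params(cfg: AntigenLMConfig) -> int:
    """Rough weight count: 12*d^2 per transformer block (4*d^2 for the
    Q/K/V/output projections plus 8*d^2 for the MLP), plus token and
    positional embedding tables. Biases and layer norms ignored."""
    per_layer = 12 * cfg.d_model ** 2
    embeddings = (cfg.vocab_size + cfg.n_positions) * cfg.d_model
    return cfg.n_layer * per_layer + embeddings

n = approx_params(AntigenLMConfig())  # ~15.6M weights, a deliberately compact model
```

Note that with such a long context, roughly a third of the weights in this estimate sit in the positional embedding table alone.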

Pretraining is conducted on the full‑genome sequences using a standard causal language‑model loss. To assess the importance of preserving genomic structure, five ablation variants are introduced: (a) segment‑wise (each segment processed separately), (b) incomplete‑genome (random windows that mix segments), (c) antigen‑only nucleotide (only HA and NA nucleotides), (d) antigen‑only protein (HA and NA amino‑acid sequences), and (e) the full‑genome baseline.
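The structure-disrupting ablation corpora could be derived from a full genome along the following lines. The paper does not give exact windowing or sampling parameters, so the window size, seeding, and function names here are assumptions.

```python
import random

SEGMENT_ORDER = ["PB2", "PB1", "PA", "HA", "NP", "NA", "MP", "NS"]

def segment_wise(segments: dict) -> list:
    """(a) Each segment becomes its own training sequence."""
    return [segments[s] for s in SEGMENT_ORDER]

def incomplete_genome(segments: dict, window: int = 20, seed: int = 0) -> str:
    """(b) A random window over the concatenated genome that can straddle
    segment boundaries, mixing segments and breaking functional units."""
    rng = random.Random(seed)
    genome = "".join(segments[s] for s in SEGMENT_ORDER)
    start = rng.randrange(max(1, len(genome) - window))
    return genome[start:start + window]

def antigen_only(segments: dict) -> str:
    """(c) Only the HA and NA nucleotide sequences, discarding the other
    six segments' genomic context."""
    return segments["HA"] + segments["NA"]
```

Variant (d) would apply the same antigen-only restriction to translated amino-acid sequences, and (e) is simply the unmodified full-genome input.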

Fine‑tuning addresses two downstream tasks. For forecasting, the model receives three historical monthly blocks, each formatted as a separator‑delimited HA/NA block, and is trained to generate the fourth block representing the next month's HA/NA sequences. During inference, only the three historical blocks are supplied, and the model autoregressively generates the future HA and NA until a separator token appears or a length limit is reached. For subtype classification, a single HA/NA pair with a separator token is fed to the model; the final hidden state is passed through a linear layer to predict the subtype. Both tasks can be trained separately or jointly via a weighted sum of their losses.
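The forecasting prompt format and inference loop described above can be sketched as follows. The marker strings `<sep>` and `<eom>` are hypothetical names for the paper's separator tokens, and `model_step` is a stand-in for the real model's next-token call.

```python
def build_forecast_prompt(monthly_blocks: list) -> str:
    """Format three historical months, each an (HA, NA) nucleotide pair,
    as separator-delimited blocks; the model is trained to continue this
    prompt with the fourth month's block."""
    assert len(monthly_blocks) == 3, "exactly three historical months expected"
    parts = [f"{ha}<sep>{na}" for ha, na in monthly_blocks]
    return "<eom>".join(parts) + "<eom>"

def generate_until_separator(model_step, prompt: str, max_len: int = 50) -> str:
    """Autoregressive inference loop: append tokens from `model_step`
    until an end-of-month marker appears or a length cap is hit."""
    out = prompt
    while len(out) - len(prompt) < max_len:
        tok = model_step(out)
        if tok == "<eom>":
            break
        out += tok
    return out[len(prompt):]
```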

Evaluation is performed on post‑2022 sequences held out for testing, stratified by subtype, month, and geography. Models are trained on pre‑2022 data from Europe and Asia, then tested on Japan (in‑distribution) and the United States (out‑of‑distribution). AntigenLM outperforms traditional phylogenetic approaches used by WHO, Local Branching Index (LBI) methods, and the state‑of‑the‑art site‑wise evolutionary predictor beth‑1. Specifically, AntigenLM achieves a lower amino‑acid mismatch rate (improving by >15 % on average) when predicting future HA/NA sequences, and attains near‑perfect subtype classification accuracy (>99.8 %).
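The amino-acid mismatch metric implied above reduces to the fraction of aligned positions where the predicted protein differs from the observed future protein. This minimal sketch assumes the sequences are already aligned to equal length; a real evaluation would align them first.

```python
def mismatch_rate(predicted: str, observed: str) -> float:
    """Fraction of positions where the predicted amino-acid sequence
    differs from the observed one. Inputs must be pre-aligned."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must be pre-aligned to equal length")
    diffs = sum(p != o for p, o in zip(predicted, observed))
    return diffs / len(observed)

# Illustrative 10-residue fragments: 2 mismatches out of 10 positions.
rate = mismatch_rate("MKTIIALSYI", "MKTVIALSHI")  # 0.2
```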

Ablation results confirm that the full‑genome, segment‑order‑preserving input is essential: the segment‑wise and incomplete‑genome variants lose 10‑20 % of forecasting accuracy, while antigen‑only models suffer even larger drops, especially in subtype classification. The protein‑only variant performs worse than its nucleotide counterpart, underscoring the importance of nucleotide‑level signals such as synonymous mutations, codon usage, and non‑coding regulatory elements.

The authors also discuss computational efficiency. Despite handling 13 kb sequences, the model’s modest size enables pretraining on tens of thousands of genomes within a few days on standard multi‑GPU hardware, making it practical for real‑time surveillance pipelines.

In summary, AntigenLM demonstrates that (1) preserving functional‑unit integrity during pretraining, (2) employing a long‑context transformer, and (3) jointly training generative and discriminative objectives yield a powerful framework for influenza antigen evolution prediction. The work provides a blueprint for building biologically grounded foundation models applicable to other rapidly evolving RNA viruses and highlights the value of genome‑wide context beyond protein‑only representations.

