Genomic-Informed Heterogeneous Graph Learning for Spatiotemporal Avian Influenza Outbreak Forecasting
Accurate forecasting of Avian Influenza Virus (AIV) outbreaks within wild bird populations requires models that account for complex, multi-scale transmission patterns driven by diverse factors. Conventional spatiotemporal epidemic models are robust for human-centric diseases, but they rely on spatial homophily and diffusive transmission between geographic regions. This simplification is incomplete for AIV because it neglects genomic information that is critical for capturing case-level dynamics such as high-frequency reassortment and lineage turnover (e.g., genetic descent across regions), which are essential for understanding AIV spread. To address these limitations, we systematically formulate the AIV forecasting problem and propose BLUE, a Bi-Layer genomic-aware heterogeneous graph fusion pipeline. BLUE integrates genetic, spatial, and ecological data to achieve highly accurate outbreak forecasting. It 1) defines a multi-layered graph structure that spans case and location layers and incorporates information from diverse sources, 2) applies cross-relation smoothing to even out information flow across edge types, 3) performs graph fusion that preserves critical structural patterns backed by theoretical spectral guarantees, and 4) forecasts future outbreaks using an autoregressive graph sequence model to capture transmission dynamics. To support research, we release the Avian-US dataset, which provides comprehensive genetic, spatial, and ecological data on US avian influenza outbreaks. BLUE demonstrates superior performance over existing baselines, highlighting the efficacy of integrating multi-layer information for infectious disease forecasting. The code is available at: https://github.com/cruiseresearchgroup/BLUE.
💡 Research Summary
The paper addresses the challenge of forecasting Avian Influenza Virus (AIV) outbreaks in wild bird populations by integrating genomic, spatial, and ecological information into a unified heterogeneous graph framework called BLUE (Bi‑Layer genomic‑aware heterogeneous graph fusion pipeline). Recognizing that AIV spreads through both long‑range migratory movements and rapid genetic reassortment, the authors formulate the problem as predicting future infection counts from a sequence of dynamic heterogeneous graphs.
BLUE constructs a bi‑layer graph for each time step: a static location layer containing nodes for geographic sites, and a dynamic case layer containing nodes for each reported infection. Three edge types are defined: (i) spatial edges between locations weighted by a Gaussian kernel of geographic distance, (ii) genetic edges between cases weighted by pairwise similarity derived from HA sequences using the Kimura 2‑parameter model, and (iii) assignment edges linking each case to its reporting location. This structure captures both geographic proximity and hidden genetic linkages that may connect distant outbreaks.
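The two weighted edge types can be illustrated with a minimal sketch. It computes the standard Kimura 2-parameter distance between aligned sequences and a Gaussian kernel over great-circle distance. The summary does not say how the distance is turned into a similarity or which kernel bandwidth is used, so `genetic_similarity` (using `exp(-d)`), the `sigma_km` parameter, and all function names are assumptions made for illustration only.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if x == y:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    arg1, arg2 = 1 - 2 * p - q, 1 - 2 * q
    if arg1 <= 0 or arg2 <= 0:
        return math.inf  # sequences too divergent for the model
    return -0.5 * math.log(arg1) - 0.25 * math.log(arg2)


def genetic_similarity(distance):
    """Hypothetical mapping from K2P distance to an edge weight in [0, 1]."""
    return math.exp(-distance)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def gaussian_weight(dist_km, sigma_km):
    """Gaussian kernel weight; sigma_km is an assumed bandwidth."""
    return math.exp(-(dist_km ** 2) / (2 * sigma_km ** 2))
```

Under these assumptions, two cases with identical HA sequences receive a genetic edge weight of 1 regardless of how far apart they were reported, which is the kind of long-range linkage that a purely spatial kernel would miss.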
To reconcile the heterogeneity of node types and edge relations, the authors introduce a cross‑layer smoothing block. Inspired by mean‑field inference in Markov Random Fields, the block performs K rounds of relation‑specific message passing, iteratively refining node embeddings while preserving type‑specific semantics. This step encourages coherent representations among epidemiologically related groups across layers.
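As a rough illustration of K rounds of relation-specific message passing, the sketch below averages neighbour embeddings separately per edge type and then mixes the per-relation results. It is not the authors' block: the learned transformations are replaced by fixed scalar weights (`mix`, `self_weight`), and the node names are hypothetical.

```python
def smooth(features, relations, mix, rounds=2, self_weight=0.5):
    """Relation-wise neighbourhood averaging over an undirected heterogeneous graph.

    features:  {node: [float, ...]}
    relations: {relation_name: [(u, v, weight), ...]}
    mix:       {relation_name: float}, assumed to sum to 1
    """
    h = {node: list(vec) for node, vec in features.items()}
    dim = len(next(iter(h.values())))
    for _ in range(rounds):
        new_h = {}
        for node in h:
            agg = [0.0] * dim
            for name, edges in relations.items():
                total_w = 0.0
                msg = [0.0] * dim
                for u, v, w in edges:
                    if u == node:
                        nbr = v
                    elif v == node:
                        nbr = u
                    else:
                        continue
                    total_w += w
                    for i in range(dim):
                        msg[i] += w * h[nbr][i]
                for i in range(dim):
                    # With no neighbours under this relation, reuse the node's own embedding.
                    rel_mean = msg[i] / total_w if total_w > 0 else h[node][i]
                    agg[i] += mix[name] * rel_mean
            new_h[node] = [
                self_weight * h[node][i] + (1 - self_weight) * agg[i]
                for i in range(dim)
            ]
        h = new_h
    return h


# Toy bi-layer graph: two locations, two cases, three edge types.
relations = {
    "spatial": [("loc_A", "loc_B", 0.6)],
    "genetic": [("case_1", "case_2", 0.9)],
    "assignment": [("case_1", "loc_A", 1.0), ("case_2", "loc_B", 1.0)],
}
mix = {"spatial": 1 / 3, "genetic": 1 / 3, "assignment": 1 / 3}
features = {"loc_A": [0.0], "loc_B": [0.0], "case_1": [1.0], "case_2": [0.0]}
smoothed = smooth(features, relations, mix, rounds=2)
```

Because every update is a convex combination, the signal that starts on `case_1` spreads to its location and to its genetic neighbour while all values stay within the original range.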
Because the bi‑layer graph can be large and contains redundant information, BLUE employs a Locality‑Sensitive Hashing (LSH) sampler to fuse the heterogeneous structure into a single location‑level graph. Crucially, the fusion process is constrained by a spectral regularizer that minimizes the Frobenius norm between the Laplacian of the original bi‑layer graph and that of the fused graph, providing a theoretical guarantee that the diffusion geometry is preserved.
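The spectral term can be sketched as a Frobenius-norm gap between two combinatorial Laplacians L = D - A. The summary does not explain how the bi-layer graph and the fused location-level graph are brought to a common node set, so this sketch simply assumes both adjacency matrices are already expressed over the same locations. The function names are illustrative, and the LSH sampler itself is not shown.

```python
import math


def laplacian(adj):
    """Combinatorial Laplacian L = D - A of a weighted adjacency matrix."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def frobenius_gap(lap_a, lap_b):
    """Frobenius norm of the difference between two same-sized Laplacians."""
    return math.sqrt(
        sum((a - b) ** 2 for row_a, row_b in zip(lap_a, lap_b) for a, b in zip(row_a, row_b))
    )


# Assumed inputs: a reference graph and a fused graph over the same two locations.
adj_reference = [[0.0, 1.0], [1.0, 0.0]]
adj_fused = [[0.0, 0.5], [0.5, 0.0]]
gap = frobenius_gap(laplacian(adj_reference), laplacian(adj_fused))
```

A smaller gap means the fused graph's Laplacian stays closer to the reference, which is the sense in which the regularizer is described as preserving diffusion geometry.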
The fused graph sequence is then fed into an autoregressive graph sequence model. An encoder with multi‑head attention captures spatial dependencies at each time step, while a decoder predicts future location‑wise case counts for H weeks ahead, using teacher‑forcing to stabilize training.
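The difference between teacher-forced training and free-running inference can be shown with a small rollout loop. The attention-based encoder and decoder are not reproduced here; `step_fn` is a hypothetical stand-in for one decoding step, and `last_value` is a trivial placeholder predictor used only to make the sketch runnable.

```python
def rollout(step_fn, history, horizon, teacher=None):
    """Predict `horizon` steps ahead from a window of per-location count vectors.

    If `teacher` (ground-truth vectors for the next steps) is given, the true value
    is fed back into the window after each step (teacher forcing). Otherwise the
    model's own prediction is fed back (free-running inference).
    """
    window = [list(x) for x in history]
    preds = []
    for h in range(horizon):
        y_hat = list(step_fn(window))
        preds.append(y_hat)
        next_input = teacher[h] if teacher is not None else y_hat
        window = window[1:] + [list(next_input)]
    return preds


def last_value(window):
    """Placeholder one-step predictor: repeat the most recent observation."""
    return list(window[-1])


history = [[1, 2], [3, 4]]
free_running = rollout(last_value, history, horizon=3)
teacher_forced = rollout(last_value, history, horizon=3, teacher=[[5, 6], [7, 8], [9, 10]])
```

With teacher forcing, each step conditions on the true previous week, so early prediction errors do not compound during training. At inference time no ground truth is available and the loop falls back to its own outputs.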
Experiments are conducted on the newly released Avian‑US dataset, which aggregates five years of US AIV surveillance data, including HA sequences, bird abundance, and environmental covariates for 48 states. BLUE is benchmarked against seven baselines, including STGCN, DCRNN, EpiGNN, Cola‑GNN, and other human‑disease‑oriented spatio‑temporal GNNs. Across MAE, RMSE, R², and MAPE, BLUE consistently outperforms all baselines, with the most pronounced gains in regions where genetic reassortment is frequent. An ablation study shows that (1) removing cross‑layer smoothing degrades performance the most, (2) omitting the spectral regularizer leads to a 5‑7 % drop in accuracy, and (3) using the full bi‑layer graph without LSH incurs higher computational cost without accuracy benefits.
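For reference, the four reported metrics follow their standard definitions, sketched below. How the authors handle weeks with zero true cases in MAPE is not stated; skipping them here is an assumption.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error; zero targets are skipped (an assumption)."""
    terms = [abs((t - p) / t) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(terms) / len(terms)


y_true = [2, 4, 6, 8]
y_pred = [1, 5, 6, 10]
scores = {
    "MAE": mae(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "R2": r2(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
}
```

Lower is better for MAE, RMSE, and MAPE, while higher is better for R².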
The contributions are threefold: (1) a novel bi‑layer heterogeneous graph formulation that captures both spatial and genetic transmission pathways, (2) a theoretically grounded graph fusion method that preserves fine‑grained structural information, and (3) the public release of a comprehensive AIV surveillance dataset to foster further research.
Limitations include sensitivity to reporting delays and missing case data, dependence on the quality of genetic sequencing, and the current focus on state‑level locations rather than finer habitat granularity. Future work will explore online learning with real‑time reporting, extension to finer spatial resolutions, and application of the framework to other wildlife diseases.