Benchmarking Artificial Intelligence Models for Daily Coastal Hypoxia Forecasting
Coastal hypoxia, especially in the northern Gulf of Mexico, presents a persistent ecological and economic concern. Seasonal models offer coarse forecasts that miss the fine-scale variability needed for daily, responsive ecosystem management. We present a study that compares four deep learning architectures for daily hypoxia classification: Bidirectional Long Short-Term Memory (BiLSTM), Medformer (Medical Transformer), Spatio-Temporal Transformer (ST-Transformer), and Temporal Convolutional Network (TCN). We trained our models on twelve years (2009-2020) of daily hindcast data from a coupled hydrodynamic-biogeochemical model and used hindcast data from 2020 through 2024 as test data. We constructed classification models incorporating water column stratification, sediment oxygen consumption, and temperature-dependent decomposition rates. We evaluated each architecture using the same data preprocessing, input/output formulation, and validation protocols. Each model achieved high classification accuracy and strong discriminative ability, with ST-Transformer achieving the highest performance across all metrics and test periods (AUC-ROC: 0.982-0.992). We also employed McNemar’s method to identify statistically significant differences in model predictions. Our contribution is a reproducible framework for operational real-time hypoxia prediction that can support broader efforts in the environmental and ocean modeling systems community and in ecosystem resilience. The source code is available at https://github.com/rmagesh148/hypoxia-ai/
💡 Research Summary
This paper addresses the pressing need for daily forecasts of coastal hypoxia in the northern Gulf of Mexico, a phenomenon that threatens marine ecosystems and local economies. Seasonal statistical models currently in operational use lack the temporal resolution required to capture rapid changes in stratification, river plume dynamics, and sediment oxygen demand. To fill this gap, the authors benchmark four modern deep-learning sequence-modeling architectures under an identical data-processing and validation framework: Bidirectional Long Short-Term Memory (BiLSTM), Temporal Convolutional Network (TCN), Medformer (a multiscale medical transformer), and a Spatio-Temporal Transformer (ST-Transformer).
The dataset consists of daily hindcast outputs from the coupled COAWST (ROMS-NEMURO) model for the Louisiana-Texas shelf. Training data cover the summer months (May–August) from 2009 to 2020, providing 1,471 spatial snapshots at 25 km² resolution. Test data comprise the same seasonal window for 2020-2024. Three physically based predictors are used: Potential Energy Anomaly (PEA) as a proxy for water-column stratification, sediment oxygen consumption rate (SOC), and temperature-dependent organic matter decomposition rate (DCPTemp). These variables are organized into 7-day sliding windows with cyclic (sine/cosine) encodings of day-of-year, month, and hour, then min-max normalized and masked over land. Because hypoxic events are rare, the authors apply SMOTE oversampling on flattened sequences and weighted random sampling during training to mitigate class imbalance.
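The preprocessing steps described above (cyclic time encoding, min-max scaling, and 7-day windowing) can be illustrated with a minimal standard-library sketch. The function names and the sample PEA values are hypothetical and are not taken from the authors' code; the order of operations here is one plausible reading of the summary, and the sketch covers a single grid cell only.

```python
import math

def cyclic_encode(value, period):
    """Map a periodic value onto the unit circle so the end of a cycle sits next to its start."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def min_max_normalize(series):
    """Rescale a list of floats to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(x - lo) / (hi - lo) for x in series]

def sliding_windows(series, window=7):
    """Return every run of `window` consecutive entries, stepping one day at a time."""
    return [series[i:i + window] for i in range(len(series) - window + 1)]

# Hypothetical daily PEA values for one grid cell (illustrative numbers only).
pea = [12.0, 15.5, 14.0, 18.2, 21.0, 19.4, 17.3, 22.8, 25.1, 24.0]

# Ten days of data yield four overlapping 7-day windows.
windows = sliding_windows(min_max_normalize(pea), window=7)

# Day-of-year 152 (around June 1) encoded with an assumed 365.25-day period.
doy_sin, doy_cos = cyclic_encode(152, 365.25)
```

In practice the minimum and maximum would normally be computed on the training years only and reused for the test years, so that no information from the test period leaks into the scaling.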
Model architectures are described in detail. BiLSTM employs two layers of 120 hidden units per direction with 30% dropout. TCN uses three dilated causal convolutional layers (kernel size 3, dilation rates 1, 2, 4) to capture long-range dependencies while preserving causality. Medformer decomposes the input into multiscale patches and applies a two-stage attention mechanism (local then global) with four heads and two layers. ST-Transformer treats each grid cell as a spatial token and jointly applies multi-head attention across space and time (16 heads, three encoder layers). All models share the same optimizer (Adam, learning rate 1e-3), batch size (64), and early-stopping criteria, and all are evaluated with five-fold cross-validation and independent year-wise testing.
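To make the TCN description concrete, the sketch below implements a plain causal dilated convolution and computes the receptive field for kernel size 3 and dilations 1, 2, 4. It assumes one convolution per layer; implementations that place two convolutions in each residual block would see a wider window. The helper names are hypothetical and this is not the authors' model code.

```python
def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution: output[t] depends only on x[t], x[t-d], x[t-2d], ...

    Positions before the start of the series are treated as zeros (left padding).
    """
    out = []
    for t in range(len(x)):
        total = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:
                total += w * x[idx]
        out.append(total)
    return out

def receptive_field(kernel_size, dilations):
    """Number of time steps (present plus past) that can influence one output value."""
    return 1 + (kernel_size - 1) * sum(dilations)

dilations = [1, 2, 4]

# Push a unit impulse through three stacked layers to see how far it spreads.
signal = [1.0] + [0.0] * 19
for d in dilations:
    signal = causal_dilated_conv(signal, [1.0, 1.0, 1.0], d)

# Count of time steps reached by the impulse; matches receptive_field(3, dilations).
reach = sum(1 for v in signal if v != 0.0)
```

Under this assumption the receptive field is 1 + 2 × (1 + 2 + 4) = 15 time steps, which is more than enough to cover a 7-day input window.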
Performance metrics focus on operational relevance: accuracy, F1‑score (to handle imbalance), and AUC‑ROC (ranking ability). All models achieve high scores, but ST‑Transformer consistently outperforms the others, attaining AUC‑ROC values between 0.982 and 0.992 and F1‑scores around 0.91. The other models reach AUC‑ROC ≈ 0.98 and F1‑scores of 0.88–0.90. To assess whether differences are statistically meaningful, the authors conduct McNemar’s test on paired prediction outcomes; results indicate significant differences between ST‑Transformer and the remaining architectures.
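As an illustration of the paired comparison, the sketch below computes McNemar's chi-square statistic with the Edwards continuity correction from two models' predictions on the same samples. The labels and predictions are made up for the example, the function name is hypothetical, and the summary does not say whether the authors used this corrected form or an exact binomial variant.

```python
import math

def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's chi-square test (continuity-corrected) on paired predictions.

    Only discordant pairs matter:
      b = samples model A gets right and model B gets wrong
      c = samples model A gets wrong and model B gets right
    Returns (statistic, p_value) for a chi-square distribution with 1 degree of freedom.
    """
    b = sum(1 for y, pa, pb in zip(y_true, pred_a, pred_b) if pa == y and pb != y)
    c = sum(1 for y, pa, pb in zip(y_true, pred_a, pred_b) if pa != y and pb == y)
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree in correctness
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 d.o.f.: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Made-up example: 20 hypoxic samples, model A errs on 2 of them, model B on 12 others.
y_true = [1] * 20
pred_a = [0, 0] + [1] * 18
pred_b = [1, 1] + [0] * 12 + [1] * 6
stat, p_value = mcnemar_test(y_true, pred_a, pred_b)
```

When the number of discordant pairs is small, an exact binomial test on b and c is usually preferred over the chi-square approximation.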
The paper’s contributions are threefold: (1) a rigorous, reproducible benchmark of contemporary deep-learning time-series models for hypoxia prediction, (2) the first application of McNemar’s test in an environmental AI context to quantify prediction disparities, and (3) the release of a complete codebase (https://github.com/rmagesh148/hypoxia-ai) enabling real-time operational deployment.
Limitations are acknowledged. The dataset is restricted to summer months, potentially overlooking off‑season dynamics. SMOTE, while alleviating imbalance, may disrupt temporal continuity. No model‑interpretability techniques (e.g., SHAP, Grad‑CAM) are employed, limiting insight into the physical drivers behind predictions. Future work should expand the temporal coverage, explore hybrid physics‑informed neural networks, and integrate explainable‑AI methods to enhance trust for stakeholders.
In summary, the study demonstrates that deep-learning sequence models, especially those incorporating spatio-temporal attention, can reliably forecast daily hypoxic conditions on the Gulf of Mexico shelf. The provided framework and open resources set a valuable benchmark for subsequent research in marine environmental forecasting and for the broader ocean-modeling community.